Friday, 27 May 2016

Regression Modeling in Practice: WEEK 3

Multiple Regression


Assignment: Test a Multiple Linear Regression Model

Source: Data from OECD (“The Organisation for Economic Co-operation and Development”)

Variables Used

*Employment Rate - The number of employed persons aged 15 to 64 divided by the population of the same age. (source: OECD)

*Life Satisfaction - This indicator considers people's evaluation of their life as a whole (source: OECD)

*GDP - Gross Domestic Product per capita (source: OECD)

All variables used in the analysis are quantitative.

I centered the explanatory variables and checked the coding by using the means procedure.

Introduction:

My previous data analysis revealed a strong positive correlation between Life Satisfaction (response variable) and Employment Rate (explanatory variable).

This time, I added one more explanatory variable (GDP per capita) in order to run a multiple linear regression.

Hypothesis: Both explanatory variables (Employment Rate and GDP per capita) are significantly associated with the response variable (Life Satisfaction).
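As a rough illustration of what a multiple regression with two centered explanatory variables computes, here is a minimal Python sketch. It is not the program used for this assignment. The function names and the six data rows are made up for the example, and the fit uses the textbook normal equations for two predictors rather than a statistics package, so it returns coefficients only (no p-values).

```python
from statistics import mean


def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1_c + b2*x2_c with centered predictors.

    With centered predictors the intercept b0 equals the mean of y, and the
    two slopes solve a 2x2 system of normal equations (Cramer's rule below).
    """
    x1_c, x2_c = center(x1), center(x2)
    y_bar = mean(y)
    s11 = sum(a * a for a in x1_c)
    s22 = sum(b * b for b in x2_c)
    s12 = sum(a * b for a, b in zip(x1_c, x2_c))
    s1y = sum(a * (yi - y_bar) for a, yi in zip(x1_c, y))
    s2y = sum(b * (yi - y_bar) for b, yi in zip(x2_c, y))
    det = s11 * s22 - s12 ** 2  # zero only if the predictors are perfectly collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return y_bar, b1, b2


# Hypothetical values for six countries, NOT the OECD data analysed in this post.
work = [72.0, 65.5, 58.0, 74.5, 61.0, 68.0]        # Employment Rate (%)
gdp = [42000, 36000, 28000, 51000, 30000, 39000]   # GDP per capita
life = [7.1, 6.5, 5.9, 7.4, 6.0, 6.8]              # Life Satisfaction

b0, b_work, b_gdp = fit_two_predictors(work, gdp, life)
print("intercept:", round(b0, 4))
print("Employment Rate coefficient:", round(b_work, 6))
print("GDP per capita coefficient:", round(b_gdp, 8))
```

Each slope is the expected change in Life Satisfaction for a one-unit change in that variable while the other is held constant, which is why the GDP coefficient looks tiny: its unit is one currency unit per capita.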

Code:



Output:









Summary:

After adding the second explanatory variable, Life Satisfaction (response variable) remained significantly and positively associated with Employment Rate (initial explanatory variable) (b=0.056, p=0.0014). However, there is no significant relationship at the 0.05 level between a country's GDP per capita (second explanatory variable) and the Life Satisfaction of its citizens (b=0.000018, p=0.0531). Because the association with Employment Rate stays significant after controlling for GDP per capita, GDP per capita does not appear to confound that relationship.

The results did not support my hypothesis. The assumption that both explanatory variables are significantly associated with the response variable proved to be wrong. Only one of them (Employment Rate) has a significant relationship with the response variable (Life Satisfaction).

Adding the second explanatory variable slightly increases the R-squared value of the model (from 0.47 with Employment Rate alone). The R-squared value of 0.532020 indicates that the two explanatory variables together account for 53.2% of the variance in the response variable.

Q-Q Plot
The Q-Q Plot shows that the residuals generally follow a straight line but deviate somewhat at the lower and middle quantiles, i.e. the residuals are not perfectly normally distributed.

Standardized Residuals
This plot shows that roughly the same number of countries have standardized residuals greater than 0 as lower than 0. Only one residual is greater than 2 and one other is lower than -2. With so few values beyond ±2, the model fit is acceptable.
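The check described above can be sketched in a few lines of Python. This is only an illustration: the residuals are invented, the helper names are mine, and the standardization is a plain z-score (residual minus mean, divided by the sample standard deviation), which is an assumption and may differ slightly from what a statistics package reports.

```python
from statistics import mean, stdev


def standardize(residuals):
    """Convert raw residuals to z-scores (mean 0, sample standard deviation 1)."""
    m, s = mean(residuals), stdev(residuals)
    return [(r - m) / s for r in residuals]


def count_beyond(z_scores, limit=2.0):
    """Count standardized residuals whose absolute value exceeds the limit."""
    return sum(1 for z in z_scores if abs(z) > limit)


# Hypothetical raw residuals, NOT the ones from the model in this post.
residuals = [0.31, -0.12, 0.05, -0.44, 0.27, -0.08, 1.10,
             -0.19, 0.10, -0.35, 0.22, -0.25, -0.62]

z = standardize(residuals)
above = sum(1 for v in z if v > 0)
below = sum(1 for v in z if v < 0)
print("above 0:", above, "below 0:", below)
print("beyond +/-2:", count_beyond(z))
```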

Outliers and Leverage 
The Outlier and Leverage Diagnostics plot shows that the majority of the points have leverage close to zero and standardized residuals within ±2. That is, the majority of the observations have little leverage on the model. However, there are 2 observations that are outliers (red) and 2 that have high leverage (green). No point is both an outlier and a high-leverage observation.



Saturday, 21 May 2016

Regression Modeling in Practice: WEEK 2


Basics of Linear Regression


Assignment: Test a Basic Linear Regression Model 

Source: Data from OECD (“The Organisation for Economic Co-operation and Development”)

Variables Used

*Employment Rate - The number of employed persons aged 15 to 64 divided by the population of the same age. (source: OECD)

*Life Satisfaction - This indicator considers people's evaluation of their life as a whole (source: OECD)

Introduction:

This week I decided to test the association between Employment Rate (explanatory variable) and Life Satisfaction (response variable) using a basic linear regression model.

First, I centered the quantitative explanatory variable "Work" (Employment Rate) by subtracting its mean, which created a new variable "Work_c" (Centered Employment Rate). I checked the centering by running the means procedure on the new variable, and its mean was exactly zero (0).
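For readers who want to see the centering step outside the original program, here is a minimal Python sketch. The variable names mirror "Work" and "Work_c" from the text, but the numbers are made up and the helper function is only an illustration.

```python
from statistics import mean


def center(values):
    """Subtract the sample mean from every value."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical employment rates (%), NOT the OECD figures used in this post.
work = [72.0, 65.5, 58.0, 74.5, 61.0]
work_c = center(work)

# The mean of a centered variable should be zero, up to floating-point error.
print("mean of Work:", mean(work))
print("mean of Work_c:", round(mean(work_c), 10))
```

Centering does not change the slope of the regression. It only shifts the intercept so that it becomes the predicted response at the average Employment Rate.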

Next, I ran a linear regression procedure with the new explanatory variable ("Work_c" - Centered Employment Rate) and Life Satisfaction (response variable).

Program:



Output:








Summary:


After centering the quantitative explanatory variable (Employment Rate), I obtained a new variable (centered Employment Rate) with a mean equal to zero (0), and I used it in the linear regression model.

The results of the linear regression model indicate that Life Satisfaction is significantly and positively associated with the centered Employment Rate (F=28.49, p<.0001).

The parameter estimates show a coefficient value of 0.074943946 and an intercept value of 6.588235294. Therefore, the best fit line equation for the linear regression is:

Life Satisfaction = 0.074943946*Employment Rate (centered) + 6.588235294
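To make the equation concrete, the short Python sketch below plugs a few centered Employment Rate values into it. The two coefficients are the ones reported above; the function name and the example offsets are my own choices for illustration.

```python
# Coefficients as reported in the parameter estimates above.
INTERCEPT = 6.588235294
SLOPE = 0.074943946


def predict_life_satisfaction(work_c):
    """Predicted Life Satisfaction for a centered Employment Rate.

    work_c is the distance from the mean Employment Rate in percentage
    points, so 0 means "exactly at the average".
    """
    return INTERCEPT + SLOPE * work_c


for offset in (-10.0, 0.0, 10.0):
    print(offset, round(predict_life_satisfaction(offset), 3))
```

Because the explanatory variable is centered, the intercept (about 6.59) is the predicted Life Satisfaction for a country with an average Employment Rate, and a country 10 percentage points above the average is predicted to score about 0.75 points higher.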

The p-values for both the intercept and the coefficient are very small (both p < 0.0001). This indicates a statistically significant linear relationship between Life Satisfaction and Employment Rate.

The R-square value of 0.47 indicates that the proportion of variance in the response variable that can be attributed to the explanatory variable is 47%.
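The sketch below shows how R-squared follows from a simple least-squares fit: it is one minus the ratio of the residual sum of squares to the total sum of squares. This is an illustration in plain Python, with made-up data and a helper function of my own, not the procedure output discussed above.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (slope, intercept, r_squared) for a one-predictor least-squares fit."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Variation left unexplained by the line vs. total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical centered employment rates and life satisfaction scores,
# NOT the OECD data analysed in this post.
work_c = [-8.0, -4.0, 0.0, 4.0, 8.0]
life = [6.0, 6.4, 6.5, 7.0, 7.1]

slope, intercept, r_squared = simple_ols(work_c, life)
print("slope:", round(slope, 4))
print("intercept:", round(intercept, 4))
print("R-squared:", round(r_squared, 4))
```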