Sunday, 5 June 2016

Regression Modeling in Practice: WEEK 4

Logistic Regression


Assignment: Test a Logistic Regression Model

Source: Data from OECD (“The Organisation for Economic Co-operation and Development”)

Variables Used

*Employment Rate -  It is the number of employed persons aged 15 to 64 over the population of the same age. (source: OECD)

*Life Satisfaction - This indicator considers people's evaluation of their life as a whole. (source: OECD)

*Household Disposable Income - It is the maximum amount that a household can afford to consume without having to reduce its assets or to increase its liabilities. (source: OECD)


Explanatory variables were standardized for the Logistic procedures.
Response variable (Life Satisfaction) was binned into 2 categories.

Introduction:


In this logistic model I coded my response variable of Life Satisfaction as 0 if the country has a Life Satisfaction Index below or equal to 6.8 (on the scale of 0 to 10), and as 1 if it has a Life Satisfaction Index above 6.8.
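
The coding rule can be sketched in a few lines of Python. This is only an illustration, not the program used for this assignment; the function name `code_life_satisfaction` and the sample scores are hypothetical.

```python
# Threshold taken from the text: scores above 6.8 are coded as 1.
THRESHOLD = 6.8

def code_life_satisfaction(score, threshold=THRESHOLD):
    """Return 1 if the score is above the threshold, otherwise 0."""
    return 1 if score > threshold else 0

# Hypothetical scores, for illustration only (not the OECD data).
sample_scores = [5.6, 6.8, 6.9, 7.5]
coded = [code_life_satisfaction(s) for s in sample_scores]
print(coded)
```

Note that a score of exactly 6.8 falls into category 0, as in the rule above.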

I used the centered Employment Rate and centered Household Disposable Income as the explanatory variables. 

My hypothesis: There is a strong correlation between Life Satisfaction and Employment Rate.

CODE:

OUTPUT:
[STAGE 1]
[STAGE 2]
Summary:

In my analysis, I introduced two stages. First, I ran a logistic regression with the primary explanatory variable and the response variable. After obtaining significant results, I added the second explanatory variable in order to check whether it is significant or whether, on the contrary, it confounds the relationship. The results of the two stages are as follows:

[STAGE 1]
The primary explanatory variable “Employment Rate” has a significant relationship with the response variable “Life Satisfaction” (p=0.0019), so the null hypothesis may be rejected. The likelihood ratio test of the global null hypothesis gives p<.0001.

The parameter estimate for Employment Rate (0.2845, p=0.0019, odds ratio=1.329) shows that each one-unit increase in the Employment Rate multiplies the odds of a country having a Life Satisfaction Index above 6.8 (on the scale from 0 to 10) by 1.329.
The 95% confidence interval for the odds ratio is 1.111 to 1.590.
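
The link between the parameter estimate, the odds ratio and the p-value can be checked with a short calculation. This is a sketch under one assumption that is not stated in the output: that the confidence limits are Wald limits of the form exp(estimate ± 1.96 × standard error). The variable names are my own.

```python
import math

estimate = 0.2845                 # parameter estimate reported in Stage 1
ci_low, ci_high = 1.111, 1.590    # reported 95% limits for the odds ratio

# The odds ratio is the exponential of the logistic coefficient.
odds_ratio = math.exp(estimate)

# Assumption: the limits are Wald limits, exp(estimate +/- z_crit * SE),
# so the standard error can be recovered from their width on the log scale.
z_crit = 1.959964
std_error = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)

# Wald z statistic and its two-sided p-value under the normal distribution.
z = estimate / std_error
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(odds_ratio, 3), round(std_error, 4), round(p_value, 4))
```

Under this assumption the recovered p-value is close to the reported 0.0019, which suggests that the estimate, the odds ratio and the confidence limits are consistent with each other.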


[STAGE 2]
After adding the second explanatory variable “Household Disposable Income,” the association between Employment Rate and “Life Satisfaction” remains significant: p=0.0477 for Employment Rate and p=0.0421 for Household Disposable Income. Therefore, Household Disposable Income does not confound the relationship.

This time the odds ratio for Employment Rate is 1.203: after adjusting for Household Disposable Income, each one-unit increase in the Employment Rate multiplies the odds of high Life Satisfaction by 1.203. The 95% confidence interval is 1.002 to 1.446.
The odds ratio for Household Disposable Income is 1.000, with a 95% confidence interval of 1.000 to 1.001. It is this close to 1 most likely because it describes the change in odds per single unit of income.


The results support my original hypothesis of a significant and positive relationship between Life Satisfaction and the Employment Rate. It appears that people are more satisfied with their lives in countries where the employment rate is high.


There was no evidence of confounding for the association between my primary explanatory variable (Employment Rate) and the response variable (Life Satisfaction). After adding the second explanatory variable (Household Disposable Income) the relationship remained statistically significant. 


Friday, 27 May 2016

Regression Modeling in Practice: WEEK 3

Multiple Regression


Assignment: Test a Multiple Linear Regression Model

Source: Data from OECD (“The Organisation for Economic Co-operation and Development”)

Variables Used

*Employment Rate -  It is the number of employed persons aged 15 to 64 over the population of the same age. (source: OECD)

*Life Satisfaction - This indicator considers people's evaluation of their life as a whole (source: OECD)

*GDP - Gross Domestic Product per capita (source: OECD)

All variables used in the analysis are quantitative.

I centered the explanatory variables and checked the coding by using the means procedure.

Introduction:

My previous data analysis revealed a strong positive correlation between Life Satisfaction (response variable) and Employment Rate (explanatory variable).

This time, I added one more explanatory variable (GDP per capita) in order to run a multiple linear regression.

Hypothesis: Each of the two explanatory variables (Employment Rate and GDP per capita) is significantly associated with the response variable (Life Satisfaction).

Code:



Output:









Summary:

After adding the second explanatory variable, Employment Rate (initial explanatory variable) remained significantly and positively associated with Life Satisfaction (response variable) (b=0.056, p=0.0014). However, there is no significant relationship at the 0.05 level between a country's GDP per capita (second explanatory variable) and the Life Satisfaction of its citizens (b=0.000018, p=0.0531). Because the association with Employment Rate stayed significant, GDP per capita does not appear to confound it; GDP per capita simply does not make a significant contribution of its own.

The results did not support my hypothesis. The assumption that both explanatory variables are significantly associated with the response variable proved to be wrong: only Employment Rate has a significant relationship with Life Satisfaction.

Adding the second explanatory variable slightly increases the R-square value of the model. The R-square value of 0.532020 indicates that 53.2% of the variance in the response variable can be attributed to the two explanatory variables together.
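
Since R-square never decreases when a variable is added, the adjusted R-square gives a fairer comparison between the two models. The sketch below is my own addition and rests on two assumptions: that all 34 OECD countries were used, and that the single-predictor model has the R-square of 0.47 reported in the Week 2 entry.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-square: penalizes R-square for each added predictor."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

N_COUNTRIES = 34  # assumption: all 34 OECD countries were used

simple = adjusted_r_squared(0.47, N_COUNTRIES, 1)        # Employment Rate only
multiple = adjusted_r_squared(0.532020, N_COUNTRIES, 2)  # plus GDP per capita

print(round(simple, 3), round(multiple, 3))
```

Under these assumptions the adjusted value still rises when GDP per capita is added, but by less than the raw R-square does.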

Q-Q Plot
The Q-Q plot shows that the residuals generally follow a straight line but deviate somewhat at the lower and middle quantiles, i.e. the residuals do not follow a perfectly normal distribution.

Standard Residuals 
This plot shows that almost the same number of countries have standardized residuals greater than 0 as lower than 0. Only one residual is greater than 2 and one other is lower than -2, which makes this model acceptable.
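
To judge whether two residuals beyond ±2 is a lot, it helps to know how many would be expected if the errors were exactly normal. This is a rough sketch, assuming one residual for each of the 34 countries; it is not part of the original output.

```python
import math

N_COUNTRIES = 34  # assumption: one residual per OECD country

# Probability that a standard normal value falls outside [-2, 2].
tail_probability = math.erfc(2 / math.sqrt(2))

# Expected number of such residuals if the model errors were exactly normal.
expected_count = N_COUNTRIES * tail_probability
observed_count = 2  # one above 2 and one below -2, as reported above

print(round(tail_probability, 4), round(expected_count, 2), observed_count)
```

About 4.6% of values, or between one and two countries out of 34, would be expected outside ±2, so the two observed cases are in line with a normal error distribution.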

Outliers and Leverage 
The Outlier and Leverage Diagnostics plot shows that the majority of the points have close to zero leverage and are within a residual standardized value of 2. That is, the majority of the observations have no leverage on the model. However, there are 2 observations that are outliers (red) and 2 that have high leverage (green). There are no points which are both an outlier and have high leverage.



Saturday, 21 May 2016

Regression Modeling in Practice: WEEK 2


Basics of Linear Regression


Assignment: Test a Basic Linear Regression Model 

Source: Data from OECD (“The Organisation for Economic Co-operation and Development”)

Variables Used

*Employment Rate -  It is the number of employed persons aged 15 to 64 over the population of the same age. (source: OECD)

*Life Satisfaction - This indicator considers people's evaluation of their life as a whole (source: OECD)

Introduction:

This week I decided to test the association between Employment Rate (explanatory variable) and Life Satisfaction (response variable) using a basic linear regression model.

First, I centered the quantitative explanatory variable "Work" (Employment Rate) by subtracting its mean, which created a new variable "Work_c" (Centered Employment Rate). I checked the centering by running the means procedure on the new variable, and its mean was exactly zero (0).
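
The centering step and its check can be sketched as follows. This is an illustrative Python version, not the program used for the assignment; the employment rates in `work` are made-up values, and only the names "Work" and "Work_c" come from my analysis.

```python
from statistics import mean

def center(values):
    """Subtract the mean from every value so the result has mean zero."""
    average = mean(values)
    return [v - average for v in values]

# Hypothetical employment rates in percent (not the OECD data).
work = [55.0, 62.0, 68.0, 73.0, 77.0]
work_c = center(work)

# The centered mean should be zero up to floating-point rounding.
print(work_c, abs(mean(work_c)) < 1e-9)
```

Centering only shifts the variable; its spread and its relationship with the response variable stay the same, while the intercept of the regression becomes the predicted value at the mean Employment Rate.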

Next, I ran the linear regression procedure with the new explanatory variable ("Work_c" - Centered Employment Rate) and Life Satisfaction (response variable).

Program:



Output:








Summary:


After centering the explanatory quantitative variable (Employment Rate) I obtained a new variable (centered Employment Rate), with the mean equal to zero (0), and I used it in the Linear regression model.

The results of the linear regression model indicate that Life Satisfaction is significantly and positively associated with the centered Employment Rate (F=28.49, p<.0001).

The parameter estimates show a coefficient value of 0.074943946 and an intercept value of 6.588235294. Therefore, the best fit line equation for the linear regression is:

Life Satisfaction = 0.074943946*Employment Rate (centered) + 6.588235294
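
The equation can be used to predict Life Satisfaction for a given centered Employment Rate. The sketch below uses the reported coefficients; the function name and the offsets of 5 points above and below the mean are hypothetical examples.

```python
INTERCEPT = 6.588235294   # reported intercept
SLOPE = 0.074943946       # reported coefficient for centered Employment Rate

def predict_life_satisfaction(work_c):
    """Predicted Life Satisfaction for a centered Employment Rate value."""
    return INTERCEPT + SLOPE * work_c

# A country at the mean, and hypothetical countries 5 points above and below.
for offset in (-5, 0, 5):
    print(offset, round(predict_life_satisfaction(offset), 2))
```

Because the explanatory variable is centered, the intercept is the predicted Life Satisfaction for a country with an average Employment Rate.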

The p-values for both the intercept and coefficient values are very small (both p < 0.0001). This indicates there is indeed a straight-line relationship between Life Satisfaction and Employment Rate.

The R-square value of 0.47 indicates that the proportion of variance in the response variable that can be attributed to the explanatory variable is 47%. 
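
In a regression with a single explanatory variable, the F value and the R-square value are two views of the same information, so one can be recovered from the other. This is a cross-check I added, assuming all 34 countries were used; the function names are my own.

```python
def f_from_r_squared(r_squared, n_obs):
    """F statistic of a one-predictor regression, given its R-square."""
    return r_squared / (1 - r_squared) * (n_obs - 2)

def r_squared_from_f(f_value, n_obs):
    """R-square of a one-predictor regression, given its F statistic."""
    return f_value / (f_value + (n_obs - 2))

N_COUNTRIES = 34  # assumption: all 34 OECD countries were used

f_value = f_from_r_squared(0.47, N_COUNTRIES)
r_squared = r_squared_from_f(28.49, N_COUNTRIES)

print(round(f_value, 2), round(r_squared, 3))
```

The small gap between the recovered F value and the reported 28.49 comes from rounding the R-square value to two decimals.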

Thursday, 28 April 2016

Regression Modeling in Practice: WEEK 1

Introduction to Regression

Assignment: Writing About Your Data

Sample

Step 1: Describe your sample. Provide enough detail so that your reader can clearly understand the population that the study sample came from. Use meaningful labels. Do not use abbreviations (“PPM100”) or variable names.

a) Describe the study population (who or what was studied).

b) Report the level of analysis studied (individual, group, or aggregate).

c) Report the number of observations in the data set.

d) Describe your data analytic sample (the sample you are using for your analyses).

ANSWERS to Step 1:

a) The sample comes from the Organisation for Economic Co-operation and Development (OECD). The main goal of OECD is to promote policies that will improve the economic and social well-being of people around the world. 

It collects and provides important data concerning “the quality of life” in 34 OECD member countries, i.e.: Australia, Austria, Belgium, Canada, Chile, Czech Republic, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Iceland, Ireland, Israel, Italy, Japan, Korea, Luxembourg, Mexico, Netherlands, New Zealand, Norway, Poland, Portugal, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Turkey, United Kingdom, United States.

The project includes a wide variety of data, concerning both material well-being (such as income, jobs and housing) and the broader quality of people’s lives (such as their health, education, work-life balance, environment, social connections, civic engagement, subjective well-being and safety).

b) The level of the analysis is aggregate.

c) Number of observations: 34 countries and 24 corresponding variables. No data is missing.

d) The data analytic sample for this study includes 34 countries and the following 6 variables: Gross Domestic Product (GDP) per capita, Employment Rate, Personal Earnings, Household Income, Percent of People with at least High School Education and Life Satisfaction Index. Gross Domestic Product is the indicator of a country's wealth, while the other variables are indicators of people's well-being.

Procedure


Step 2: Describe the procedures that were used to collect the data.

a) Report the study design that generated that data (for example: data reporting, surveys, observation, experiment).

b) Describe the original purpose of the data collection.

c) Describe how the data were collected.

d) Report when the data were collected.

e) Report where the data were collected.

ANSWERS to Step 2:

a) The data on the most interesting variable, i.e. the Life Satisfaction Index, were obtained through an internet survey on the OECD website. The participants were asked to give a subjective opinion about their quality of life on a scale of 0 to 10 using the Cantril Ladder (also known as the "Self-Anchoring Striving Scale"). Additional questions concerned the participants' economic, educational and employment background, in order to make sure that the sample used for the statistical assessment of Life Satisfaction is representative. The Life Satisfaction Index is updated every year on the basis of new surveys.
The data on the other variables were calculated from existing data in national and international statistical databases by applying specific formulas.
The data on GDP per capita are based on GDP data from the OECD Annual National Accounts. GDP is expenditure on final goods and services minus imports.
The data on Employment Rate come from the OECD Labour Force Statistics Database. It is the number of employed people of working age, i.e. 15 to 64, over the population of the same age.
Personal Earnings are calculated by combining data from the OECD Earnings distribution database and the OECD average annual earnings per full-time and full-year equivalent dependent employee database. It is the total wage bill divided by the average number of employees, which is then multiplied by the ratio of usual weekly hours per full-time employee to average usual weekly hours for all employees.
Household Disposable Income is calculated by the OECD on the basis of OECD National Accounts at a Glance and Statistics New Zealand. It is obtained by adding to people's gross income the social transfers in kind that households receive from governments, and then subtracting the taxes on income and wealth, the social security contributions paid by households and the depreciation of capital goods consumed by households.
The Education variable comes from the OECD Education at a Glance database. It is the number of adults aged 25 to 64 holding at least an upper secondary degree over the population of the same age.
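
The definitions above translate into simple formulas. The following sketch only restates them in Python; the function names, the choice to express the Employment Rate in percent, and all input figures are hypothetical and do not come from the OECD.

```python
def employment_rate(employed, population):
    """Employed people aged 15 to 64 as a percentage of the same age group."""
    return 100 * employed / population

def personal_earnings(wage_bill, employees, hours_full_time, hours_all):
    """Total wage bill per employee, scaled to a full-time equivalent."""
    return wage_bill / employees * (hours_full_time / hours_all)

def household_disposable_income(gross_income, transfers_in_kind, taxes,
                                social_contributions, depreciation):
    """Gross income plus transfers in kind, minus taxes and other deductions."""
    return (gross_income + transfers_in_kind - taxes
            - social_contributions - depreciation)

# Hypothetical figures for one imaginary country (not OECD data).
rate = employment_rate(6_500_000, 10_000_000)
earnings = personal_earnings(300e9, 7_500_000, 40.0, 36.0)
income = household_disposable_income(30_000, 4_000, 6_000, 3_000, 1_000)
print(rate, round(earnings, 2), income)
```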

b) The purpose of the original data collection was to compare the quality of life around the world.

c) One variable, i.e. the "Life Satisfaction Index", was collected through the online survey on the OECD website. The other variables were collected and calculated using existing data from the databases of national and international statistical institutions.

d) The data were collected by trained OECD statisticians during 2012, 2013 and 2014.

e) The data were collected in the 34 member countries of the Organisation for Economic Co-operation and Development (OECD).
The names of these countries are mentioned at the beginning of this blog entry.

No further detail concerning the procedure is provided by OECD.


Measures

Step 3: Describe your variables.

a) Describe what your explanatory and response variables measured.

b) Describe the response scales for your explanatory and response variables.

c) Describe how you managed your explanatory and response variables.

ANSWERS to Step 3:

a) The variables from the data analytic sample measure the following:

1 - GDP - Gross Domestic Product per capita. It is used as an indicator of a country's wealth.

2 - Employment Rate - is the number of employed people of working age, i.e. 15 to 64, over the population of the same age. Employed people are those who report that they worked for at least one hour in the previous week.

3 - Personal Earnings - refer to the average annual wages per full-time equivalent dependent employee.

4 - Household Disposable Income - is the maximum amount that a household can afford to consume without having to reduce its assets or to increase its liabilities.

5 - Percent of People with at least High School Education - considers the number of adults aged 25 to 64 holding at least an upper secondary degree over the population of the same age, as defined by the OECD-ISCED classification.

6 - Life Satisfaction - considers people's evaluation of their life as a whole. It is a weighted sum of different response categories based on people's ratings of their current life relative to the best and worst possible lives for them on a scale from 0 to 10, using the Cantril Ladder (also known as the "Self-Anchoring Striving Response Scale").


b) The Self-Anchoring Striving Response Scale (used for measuring the "Life Satisfaction Index") was developed by the social researcher Dr. Hadley Cantril. It is an example of a well-being assessment. It uses the following steps:
  • Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top.
  • The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you.
  • On which step of the ladder would you say you personally feel you stand at this time? (ladder-present)
  • On which step do you think you will stand about five years from now? (ladder-future)

No other response scales were used.


c) In my research, I have been analyzing the relationship between different pairs or groups of three variables in order to check the strength of their association.

Because of the broad range of the data, I categorized the explanatory and response variables and created new variables with 4 to 6 levels.

Initially, I tested how a country's wealth (GDP), as the explanatory variable, affects the indicators of people's well-being, as response variables, i.e. Personal Earnings, Household Income, Employment, Education and Life Satisfaction Index.


The most important variable in my research is the "Life Satisfaction Index" as it refers to how happy people are with their life. Therefore, I have also done a number of tests analyzing the relationship between Employment Rate, Education, Personal Earnings, Household Income (as explanatory variables) and Life Satisfaction (as response variable).


The main purpose of my study is to find out what matters most for people to be happy: money, education, work or other aspects.

So far, the well-being analysis has brought me a lot of interesting results. The findings can be found in the following entries:

http://mygapminder.blogspot.pt/2016/03/assignment-week-4.html

http://mygapminder.blogspot.pt/2016/04/data-analysis-tools-week-1.html

http://mygapminder.blogspot.pt/2016/04/data-analysis-tools-week-4.html