Regression Analysis (explain a phenomenon or its history)

Explain the unknown using the known

Estimate coefficients

If past history is a reliable guide, you can predict the future

Use independent variables to get information about the dependent variable

Correlation is the relationship between two variables

Process:

    1. Identify Variables (independent and dependent)
    2. Theory, correlation, scatter diagrams for each variable

      Look for relationship (linear or nonlinear)

    3. Collect and organize data (from multiple sources)
    4. Estimate coefficients (using the least squares method)
    5. R2 = % of variance in the dependent variable explained by the model

    6. Test model (for overall and individual significance)
    7. Overall significance (Y = B0 + B1X1 + B2X2 + … + BpXp + E)

      Y = the dependent variable (the variance to be explained)

      B0 = intercept; B1, B2, … = slopes

      (B0 + B1X1 + B2X2 + …) is the explained part, or predicted Y

      E = unexplained error

      Null hypothesis (H0) – none of the independent variables are significant

      H0: all Bs = 0; H1: at least one B ≠ 0 (i.e., at least one is a predictor of Y)

      Test statistic

      F is the ratio of explained to unexplained variance (MSR/MSE)

      P value = the probability of a result this extreme if H0 is true (if less than 5%, you can reject H0)

      Compute the p value and either reject or fail to reject H0

      Type I error: concluding the model is significant when it is not (using an incorrect model)

      Type II error: failing to reject H0 when the model could have been used (missing an opportunity)

      The significance level (α) is the probability of making a Type I error

      Individual Significance

      Low p value, low probability of making an error

      High confidence that the coefficient is not zero

      Compute p value and decide on H0

      H1: B < 0 or B > 0 (select only one, based on the expected relationship)

      Stop when all remaining variables are significant

    8. Validate and implement
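The estimation steps above (least squares fit, then R2) can be sketched in Python; the data here are hypothetical, made up purely for illustration:

```python
import numpy as np

# Hypothetical data: one independent variable X, one dependent variable Y.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least squares estimates of B0 (intercept) and B1 (slope):
# stack a column of ones so the first coefficient is the intercept.
A = np.column_stack([np.ones_like(X), X])
(b0, b1), *_ = np.linalg.lstsq(A, Y, rcond=None)

# Predicted values and R2 = SSR/SST, the % of variance explained.
y_hat = b0 + b1 * X
sst = np.sum((Y - Y.mean()) ** 2)     # total variation
ssr = np.sum((y_hat - Y.mean()) ** 2) # explained variation
r2 = ssr / sst
print(round(b0, 3), round(b1, 3), round(r2, 3))
```

With more independent variables, extra columns would be added to `A` and the same least squares call would return one coefficient per column.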

 

 

The F statistic is equal to the regression mean square (MSR) divided by the error mean square (MSE), where P = the number of explanatory variables in the regression model.

F = test statistic from an F distribution with P and n - P - 1 degrees of freedom.

The decision rule is to reject H0 at the α level of significance if F > Fα(P, n - P - 1); otherwise do not reject H0.

 

Test for overall significance Excel output:

ANOVA

              df    SS             MS           F           Significance F
Regression     6    2228586.427    371431.07    1.0736532   0.42042919
Residual      15    5189260.346    345950.69
Total         21    7417846.773

df Regression = P, the number of explanatory variables

df Total = n of observations - 1

F = MSR/MSE

Significance F = the p value. The p value is the probability of obtaining a test statistic equal to or more extreme than the result obtained from the sample data. It is often referred to as the observed level of significance: the smallest level at which H0 can be rejected for a given data set.

R2 = SSR/SST measures the proportion of variation in Y that is explained by the independent variable X in the regression model.
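As a sanity check on the ANOVA output above, F, Significance F, and R2 can all be recomputed from the sums of squares (assuming scipy for the F distribution):

```python
from scipy.stats import f

# Values taken from the ANOVA table above.
P, n = 6, 22                 # df Regression = P; df Total = n - 1 = 21
ssr = 2228586.427            # regression sum of squares
sse = 5189260.346            # residual (error) sum of squares
sst = ssr + sse              # total sum of squares (7417846.773)

msr = ssr / P                # regression mean square
mse = sse / (n - P - 1)      # error mean square
F_stat = msr / mse
p_value = f.sf(F_stat, P, n - P - 1)   # upper-tail probability of F
r2 = ssr / sst

print(round(F_stat, 4), round(p_value, 4), round(r2, 4))
```

Since the p value (about 0.42) is far above 0.05, this particular model fails the overall significance test.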

 

Regression analysis is used primarily for prediction. The goal in regression analysis is the development of a statistical model that can be used to predict values of a dependent (response) variable based on the values of at least one explanatory or independent variable.

Correlation analysis, in contrast to regression, is used to measure the strength of the association between two numerical variables.
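A minimal sketch of measuring that association, using NumPy's `corrcoef` on hypothetical data:

```python
import numpy as np

# Hypothetical data with a perfectly linear relationship (y = 2x).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Pearson correlation coefficient: +1 perfect positive, -1 perfect
# negative, 0 no linear association.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```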

There are four major assumptions of regression:

    1. Normality: requires that the values of Y be normally distributed at each value of X.
    2. Homoscedasticity: requires that the variation around the line of regression be constant for all values of X. This means that Y varies the same amount when X is a low value as when X is a high value.
    3. Independence of errors: requires that the error (residual difference between observed and predicted values of Y) should be independent for each value of X.
    4. Linearity: states that the relationship among variables is linear.
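A rough residual check for some of these assumptions, on hypothetical data (real diagnostics would rely on residual plots, not single numbers):

```python
import numpy as np

# Hypothetical data for a simple linear fit.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])

A = np.column_stack([np.ones_like(X), X])
(b0, b1), *_ = np.linalg.lstsq(A, Y, rcond=None)
residuals = Y - (b0 + b1 * X)

# With an intercept in the model, least squares forces the residuals
# to sum to (numerically) zero.
print(round(residuals.sum(), 10))

# Homoscedasticity (rough check): residual spread in the low-X half
# versus the high-X half should be of similar magnitude.
low, high = residuals[:3], residuals[3:]
print(round(low.std(), 3), round(high.std(), 3))
```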