Regression models

Strength of a linear relationship by |r| (absolute value of the correlation coefficient):
0 – 0.3 = weak linear relationship
0.3 – 0.7 = moderate linear relationship
0.7 – 1 = strong linear relationship
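The cut-offs above can be applied mechanically. A minimal sketch, assuming the thresholds refer to the absolute value of the correlation coefficient r and that boundary values (exactly 0.3 or 0.7) fall into the higher band; the function name is hypothetical:

```python
def correlation_strength(r: float) -> str:
    """Classify |r| with the 0.3 / 0.7 rule-of-thumb cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # the sign gives direction, not strength
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


print(correlation_strength(-0.5))  # moderate
```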
a)      Regression conditions and assumptions
Multiple linear regression needs at least 3 variables of metric (ratio or interval) scale: one dependent variable and at least two independent variables. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable; in the simplest case of just two independent variables, that means n ≥ 40. G*Power can also be used to calculate a more exact, appropriate sample size.
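The rule of thumb is simple arithmetic. A small sketch (the function name is hypothetical, and 20 cases per predictor is only the rule of thumb quoted above, not a power calculation):

```python
def minimum_cases(num_predictors: int, cases_per_predictor: int = 20) -> int:
    """Smallest sample size suggested by the cases-per-predictor rule of thumb."""
    if num_predictors < 1:
        raise ValueError("need at least one independent variable")
    return num_predictors * cases_per_predictor


print(minimum_cases(2))  # 40
print(minimum_cases(3))  # 60
```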

Firstly, multiple linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since multiple linear regression is sensitive to outlier effects. The linearity assumption is best tested with scatter plots; the following two examples depict cases where no linearity and little linearity are present.
·           Linearity
This assumption says that the relationship between x and y must be linear.
If this assumption is not met, the estimated parameters will be biased.
·           Independence
·           Randomization
·           Equal variance Assumption
·           Equal spread assumption
·           Normality assumption
·           Nearly normal assumption
b)      R² of Regression

 What Is R-squared?

R²/R-Squared: Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression equation to quantify model performance. The value of R-squared ranges from 0 to 1 (0 to 100 percent). If your model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you have, no doubt, made an error; perhaps you've used a form of y to predict y). More likely, you will see R-squared values like 0.49, which you can interpret by saying: this model explains 49% of the variation in the dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed Y values sorted by the estimated values, and notice how much overlap there is. This graphic provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value because it reflects model complexity (the number of variables) as it relates to the data.
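How Adjusted R-Squared penalises model complexity can be sketched with the standard adjustment, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of cases and k the number of independent variables. The sample size n = 50 below is a made-up value for illustration only; R² = 0.49 is the example value from the paragraph above.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjust R-squared for the number of predictors k and the sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Same R-squared, increasing number of predictors: the adjusted value drops.
for k in (1, 3, 10):
    print(k, round(adjusted_r_squared(0.49, n=50, k=k), 3))
```

With R² held fixed, adding predictors lowers the adjusted value, which is why it is the safer figure for comparing models of different sizes.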

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation, written as
R-squared = regression sum of squares (SSR) / total sum of squares (SSTotal)
R squared or r squared adjusted  
R-squared is always between 0 and 100%:
·         0% indicates that the model explains none of the variability of the response data around its mean.
·         100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, this guideline comes with important caveats.
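The ratio SSR / SSTotal can be computed by hand. The sketch below is an illustration under stated assumptions: it fits a least-squares line with a single predictor (the notes deal with multiple regression, but the ratio is built the same way), and the five data points and function names are made up.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """R-squared = regression sum of squares / total sum of squares."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    fitted = [intercept + slope * xi for xi in x]
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return ss_regression / ss_total


x = [1, 2, 3, 4, 5]          # hypothetical predictor values
y = [2, 4, 5, 4, 5]          # hypothetical responses
print(round(r_squared(x, y), 3))  # 0.6
```

Here SSR = 3.6 and SSTotal = 6, so the fitted line explains 60% of the variation in y.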
Graphical Representation of R-squared
Plotting fitted values by observed values graphically illustrates different R-squared values for regression models.
[Figure: regression plots of fitted versus observed responses, illustrating R-squared]
The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model the closer the data points will fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.
R-squared is the coefficient of determination. It tells us how well our regression model fits the data.
When it is 0, 0% of the variability in y (the dependent variable) is explained by variability in x (the independent variable).
When it is 1, 100% of the variability in y (the dependent variable) is explained by variability in x (the independent variable); in that case there are no residuals, so it's a perfect model.

R-squared = regression sum of squares (SSR) / total sum of squares (SSTotal)
= 484.789/
R-squared is 99%
This means that 99.9% of the variability in the dependent variable (Receipts) is explained by variability in the independent variables (Paid Attendance, No. of Shows and Avg Ticket Price); the small remainder is left in the residuals.

a)      What is the regression model?
Receipts = β0 + β1 (Paid Attendance) + β2 (No. of Shows) + β3 (Avg Ticket Price) + Ɛ
Receipts = -18.320 + 0.076 (Paid Attendance) + 0.0070 (No. of Shows) + 0.24 (Avg Ticket Price) + Ɛ
b)    What does the coefficient of Paid attendance mean in this regression? Does this make sense?
Each independent variable is associated with a regression coefficient. This coefficient describes the strength and the sign of that variable's relationship to the dependent variable.
Regression coefficients (β): coefficients are computed by the regression tool. They are values, one for each explanatory variable, that represent the strength and type of relationship the explanatory variable has to the dependent variable.
The coefficient of Paid Attendance is +0.076, indicating a positive relationship between PA and R. (When the relationship is positive, the sign of the associated coefficient is also positive.)
β0 is the regression intercept. It represents the expected value for the dependent variable if all of the independent variables are zero.

Answer: βPA = 0.076
This means that for a one-unit increase in PA, receipts (R) are expected to increase by 0.076 units, assuming the other independent variables remain constant.

Construct hypotheses.
H0: β1 = 0  (meaning Paid Attendance (PA) does not affect Receipts)
Ha: β1 ≠ 0  (meaning Paid Attendance (PA) does affect Receipts)

H0: The model is not good (all slope coefficients are zero).
Ha: The model is good (at least one slope coefficient is not zero).
When Fcalc > Ftab, reject the null hypothesis.

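We assume α = 0.05
Ftab = F0.05, (3, 74)

F0.05, (3, 60) = 2.76
F0.05, (3, 75) = 2.73

(2.76 – 2.73) / (75 – 60) = 0.03 / 15 = 0.002 per degree of freedom

(The F test is upper-tailed, so the full α = 0.05 is used for the table lookup, not 0.025.)
The critical value falls as the denominator degrees of freedom rise, so moving 14 degrees of freedom from 60 to 74:
0.002 x 14 = 0.028

2.76 – 0.028 = 2.732

F0.05, (3, 74) ≈ 2.732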
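Linear interpolation between two tabulated critical values can be sketched as follows. The helper name is hypothetical, and the two table entries are the ones quoted above for 60 and 75 denominator degrees of freedom. Because the critical value falls as the degrees of freedom rise, the result for 74 lies just above the entry for 75.

```python
def interpolate(x0: float, y0: float, x1: float, y1: float, x: float) -> float:
    """Linear interpolation of y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Tabulated values: F(0.05; 3, 60) = 2.76 and F(0.05; 3, 75) = 2.73
f_crit = interpolate(60, 2.76, 75, 2.73, 74)
print(round(f_crit, 3))  # 2.732
```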

c)     In a week in which
Paid Attendance was 200,000 customers,
No. of shows was 30,
Avg ticket price was $70,
what would you estimate the receipts would be?

Receipts = -18.320 + 0.076(200,000) + 0.0070(30) + 0.24(70)
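Plugging values into the fitted equation can be sketched as below. The coefficients are the ones from the fitted model in these notes; the function name is hypothetical. The notes do not state the units of each variable, so the inputs must be in whatever units the model was fitted with. That is an assumption to check against the data: for instance, if attendance was recorded in thousands, 200,000 customers would be entered as 200.

```python
def predict_receipts(paid_attendance: float, num_shows: float,
                     avg_ticket_price: float) -> float:
    """Point prediction from the fitted model (error term set to its mean, 0)."""
    return (-18.320
            + 0.076 * paid_attendance
            + 0.0070 * num_shows
            + 0.24 * avg_ticket_price)


# Attendance entered as a raw count, exactly as written in the question.
print(predict_receipts(200_000, 30, 70))

# Attendance entered in thousands (an assumption about the data's units).
print(predict_receipts(200, 30, 70))

# Coefficient interpretation: one more unit of attendance, other inputs fixed.
print(predict_receipts(201, 30, 70) - predict_receipts(200, 30, 70))
```

The last line shows the meaning of the Paid Attendance coefficient: the prediction moves by about 0.076 per unit of attendance when the other two inputs are held constant.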
d)      Is this likely to be a good prediction? Why do you think that?
a)      How was the t ratio of 126.7 found for paid attendance?

tcalc = (β – βH0) / SE(β)

= (0.076 – 0) / SE(β)

= 126.7

(This implies SE(β) = 0.076 / 126.7 ≈ 0.0006.)

The t-statistic for the significance of the slope is essentially a test to determine whether the regression model (equation) is usable. If the slope is significantly different from zero, then we can use the regression model to predict the dependent variable from values of the independent variable.
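The t ratio is the estimate minus its hypothesised value, divided by the estimate's standard error. The notes do not quote the standard error for Paid Attendance, so the sketch below backs it out from the reported coefficient and t ratio; the function name is hypothetical.

```python
def t_ratio(estimate: float, standard_error: float, null_value: float = 0.0) -> float:
    """t = (estimate - hypothesised value) / standard error."""
    return (estimate - null_value) / standard_error


# Standard error implied by the reported coefficient (0.076) and t ratio (126.7).
implied_se = 0.076 / 126.7
print(round(implied_se, 4))                   # 0.0006
print(round(t_ratio(0.076, implied_se), 1))   # 126.7
```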

b)      How many weeks are included in this regression? How can you tell?

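Answer: 78 weeks

The F test above uses 74 residual degrees of freedom, and residual df = n – k – 1 with k = 3 independent variables, so n = 74 + 3 + 1 = 78.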

c)      The t-ratio for the intercept is negative. What does that mean?

tint = (βint – βH0) / SE(βint)

= (–18.320 – 0) / SE(βint)

= –58.6

The t-ratio is negative because the estimated intercept is negative. If all independent variables were zero, predicted receipts (R) would be -18.320, indicating a loss (fixed costs).

(Draw loss graph)

A negative value for the constant/intercept should not be a cause for concern. It simply means that the expected value of the dependent variable is less than 0 (i.e. negative) when all independent/predictor variables are set to 0.
The intercept is usually called the constant, and the slope is referred to as the coefficient.

a)    State the standard null and alternative hypotheses for the true coefficient of No. of Shows.
Estimated coefficient: βNS = 0.0070

Our hypotheses:
H0: βNS = 0  (meaning NS does not affect Receipts)
Ha: βNS ≠ 0  (meaning NS does affect Receipts)
H0: The slope of the regression line is equal to zero.
Ha: The slope of the regression line is not equal to zero.
If the relationship between No. of Shows and Receipts is significant, the slope will not equal zero.

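b)    Test the null hypothesis (at α = 0.05) and state your conclusion.
tNS = 1.59

P-value (NS) = 0.166 > α = 0.05

Regression output reports a two-sided P-value, so it is compared with α directly rather than with α/2 = 0.025.

As the P-value is greater than α, we do not reject the null hypothesis. This means No. of Shows does not have a significant effect on Receipts once the other variables are in the model.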
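The decision rule used here can be written down in a few lines. A minimal sketch, assuming the P-value is already two-sided; the function name is hypothetical:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Reject H0 only when the two-sided P-value is below alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie in [0, 1]")
    return "reject H0" if p_value < alpha else "do not reject H0"


print(decide(0.166))  # do not reject H0
```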

c)    A Broadway investor challenges our analysis. He points to a scatterplot of Receipts vs. No. of Shows. Explain to him why our answer in b) isn't a contradiction.



Q22) Which regression conditions can be checked in the three plots?

Plot 1
The linearity assumption can't be confirmed from this plot because there is no obvious pattern.

Plot 2
The normality assumption appears to be met, as the plot looks like a normal distribution.

Plot 3

The plot shows increases towards the end and beginning of the year (around winter) and a decrease mid-year (summer), so there is a pattern in the plot. Around winter the residuals go up.

Impact on receipts?

If Ɛ (i.e. the residuals) increases, our receipts increase, assuming Ɛ is normally distributed.

The scatterplot shows some dependence between successive residuals (i.e. between receipts over time), so the independence condition isn't met.
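One way to put a number on a time pattern like this is the lag-1 autocorrelation of the residuals. This is an added illustration, not something the notes compute: the residual series below is synthetic (a smooth yearly cycle over 52 weeks, high in winter and low in summer), and the function name is hypothetical. Values near 0 are consistent with independence; values near +1 mean neighbouring residuals move together.

```python
import math


def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    denominator = sum((e - mean) ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are constant")
    numerator = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
                    for t in range(1, n))
    return numerator / denominator


# Synthetic weekly residuals with a seasonal cycle (not the real data).
seasonal = [math.cos(2 * math.pi * week / 52) for week in range(52)]
print(round(lag1_autocorrelation(seasonal), 2))
```

A value this close to 1 for the synthetic series reflects the same kind of seasonal dependence described for Plot 3.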