−1 ≤ correlation ≤ 1. Rule of thumb for the strength of the linear relationship (based on |r|):
· 0 – 0.3: weak linear relationship
· 0.3 – 0.7: moderate linear relationship
· 0.7 – 1: strong linear relationship
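As a quick illustration, here is a minimal Python sketch (with made-up data, not any dataset from the course) that computes Pearson's r with numpy and classifies it by the bands above:

```python
import numpy as np

def correlation_strength(x, y):
    # Pearson correlation coefficient between the two variables
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) < 0.3:
        label = "weak"
    elif abs(r) < 0.7:
        label = "moderate"
    else:
        label = "strong"
    return r, label

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(correlation_strength(x, y))   # r close to 1 -> "strong"
```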
Q12
a)
Regression conditions and assumptions
Multiple linear regression needs at least 3 variables of metric (ratio or interval) scale. A rule of thumb for the sample size is that regression analysis requires at least 20 cases per independent variable; in the simplest case of just two independent variables, that requires n > 40. G*Power can also be used to calculate a more exact, appropriate sample size.
Firstly, multiple linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers, since multiple linear regression is sensitive to outlier effects. The linearity assumption is best tested with scatter plots; the source's two example plots (not reproduced here) depict cases with no linearity and with little linearity.
See more at:
http://www.statisticssolutions.com/academic-solutions/resources/directory-of-statistical-analyses/assumptions-of-multiple-linear-regression/#sthash.5AnM14GS.dpuf
The assumptions, each paired with the condition used to check it (a small diagnostics sketch follows this list):
· Linearity assumption: the relationship between x and y must be linear. If this assumption is not met, the parameter estimates will be biased. Checked with scatterplots of y against each x and of the residuals against the fitted values.
· Independence assumption: checked with the randomization condition.
· Equal variance assumption: checked with the equal spread condition in the residual plot.
· Normality assumption: checked with the nearly normal condition (histogram or normal probability plot of the residuals).
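A minimal Python diagnostics sketch, assuming statsmodels and entirely made-up data rather than any dataset from the course, showing how the residual plots used for these conditions can be produced:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Made-up data standing in for a real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])   # add the intercept column
model = sm.OLS(df["y"], X).fit()

# Residuals vs. fitted values: checks linearity and equal spread
# (we want to see no obvious pattern)
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="grey")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Histogram of residuals: checks the nearly normal condition
plt.hist(model.resid, bins=15)
plt.xlabel("Residuals")
plt.show()
```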
b)
R² of Regression
What is R-squared?
R²/R-squared: Multiple R-squared and adjusted R-squared are both statistics derived from the regression equation to quantify model performance. The value of R-squared ranges from 0 to 100 percent. If your model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you, no doubt, have made an error; perhaps you've used a form of y to predict y). More likely, you will see R-squared values like 0.49, which you can interpret by saying: this model explains 49% of the variation in the dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed y values, sorted by the estimated values, and notice how much overlap there is. This graphic provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The adjusted R-squared value is always a bit lower than the multiple R-squared value because it reflects model complexity (the number of variables) as it relates to the data.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
The definition of R-squared is fairly straightforward: it is the percentage of the response variable variation that is explained by a linear model. That is:
R-squared = explained variation / total variation = sum of squares of the regression (SSR) / total sum of squares (SSTotal)
R-squared is always between 0 and 100%:
· 0% indicates that the model explains none of the variability of the response data around its mean.
· 100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important caveats to this guideline (see the conditions discussed under Q12).
Graphical Representation of R-squared
Plotting fitted values by observed values graphically illustrates different R-squared values for regression models. In the source's example plots (not reproduced here), the regression model on the left accounts for 38.0% of the variance, while the one on the right accounts for 87.4%. The more variance that is accounted for by the regression model, the closer the data points fall to the fitted regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the observed values and, therefore, all the data points would fall on the fitted regression line.
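A small matplotlib sketch of this fitted-vs-observed picture, using made-up values rather than the source's data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed and fitted values, just to illustrate the picture
rng = np.random.default_rng(0)
observed = rng.normal(10, 2, size=50)
fitted = observed + rng.normal(0, 1, size=50)   # fitted values with some error

plt.scatter(fitted, observed)
lims = [min(observed.min(), fitted.min()), max(observed.max(), fitted.max())]
plt.plot(lims, lims, color="grey")   # 45-degree line = perfect fit
plt.xlabel("Fitted values")
plt.ylabel("Observed values")
plt.show()
```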
R-squared is the coefficient of determination. It tells us whether our regression model is a good one or not.
0 ≤ R² ≤ 1
When R² is 0, 0% of the variability in y (the dependent variable) is explained by variability in x (the independent variables).
When R² is 1, 100% of the variability in y is explained by variability in x; in that case there will be no residuals, so it is a perfect model.
R-squared = sum of squares of the regression (SSR) / total sum of squares (SSTotal)
= 484.789 / …
R-squared ≈ 99.9%
This means that 99.9% of the variability in the dependent variable (Receipts) is explained by variability in the independent variables (Paid Attendance, No. of Shows and Avg Ticket Price), and we still have some residuals.
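A minimal Python sketch of the SSR/SSTotal computation, with made-up numbers rather than the Broadway data (the identity holds for an OLS fit with an intercept):

```python
import numpy as np

def r_squared(y, y_hat):
    # R^2 = SSR / SSTotal (for an OLS fit with an intercept)
    ss_total = np.sum((y - y.mean()) ** 2)     # total sum of squares
    ss_reg = np.sum((y_hat - y.mean()) ** 2)   # regression (explained) sum of squares
    return ss_reg / ss_total

# Made-up observed and fitted values
y = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([10.2, 11.8, 14.1, 15.9])
print(r_squared(y, y_hat))   # close to 1 -> near-perfect fit
```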
Q14)
a)
What is the regression model?
Receipts = β0 + β1(Paid Attendance) + β2(No. of Shows) + β3(Avg Ticket Price) + Ɛ
Receipts = −18.320 + 0.076(Paid Attendance) + 0.0070(No. of Shows) + 0.24(Avg Ticket Price) + Ɛ
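The fitted equation can be written as a small Python function (the coefficients are copied from the regression output above; this is just a convenience sketch for the predictions below):

```python
def predicted_receipts(paid_attendance, n_shows, avg_ticket_price):
    # Point prediction from the fitted equation quoted above
    return (-18.320
            + 0.076 * paid_attendance
            + 0.0070 * n_shows
            + 0.24 * avg_ticket_price)
```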
b)
What does the coefficient of Paid attendance mean in this
regression? Does this make sense?
Each independent variable is associated with a regression coefficient. This coefficient describes the strength and the sign of that variable's relationship to the dependent variable.
Regression coefficients (β): coefficients are computed by the regression tool. They are values, one for each explanatory variable, that represent the strength and type of relationship the explanatory variable has to the dependent variable.
The coefficient of Paid Attendance is +0.076 (note: 0.0070 is the coefficient of No. of Shows), indicating a positive relationship between Paid Attendance and Receipts. (When the relationship is positive, the sign of the associated coefficient is also positive.)
β0 is the regression intercept. It represents the expected value of the dependent variable if all of the independent variables are zero.
Answer: βPA = 0.076. This means that for a single-unit change in Paid Attendance, Receipts would change by 0.076 units, assuming the other independent variables remain constant.
DOES IT MAKE SENSE?
Construct the hypotheses.
H0: β1 = 0 (Paid Attendance does not affect Receipts)
Ha: β1 ≠ 0 (Paid Attendance does affect Receipts)
For the overall model (F-test):
H0: The model is not useful (all slopes are zero).
Ha: The model is useful (at least one slope differs from zero).
When Fcalc > Ftab, reject the null hypothesis.
We assume α = 0.05. (The F-test is one-tailed, so we use 0.05 here, not 0.025.)
Ftab = F0.05, (3, 74). Interpolating between the tabled values:
F0.05, (3, 60) = 2.76
F0.05, (3, 75) = 2.73
Step per degree of freedom: (2.76 − 2.73) / 15 = 0.002
So F0.05, (3, 74) ≈ 2.73 + 1 × 0.002 = 2.732
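Instead of interpolating in the table, the exact critical value can be pulled from scipy (a sketch; df1 = 3 predictors, df2 = 74 residual degrees of freedom):

```python
from scipy import stats

# Exact 5% critical value for F with (3, 74) degrees of freedom
f_crit = stats.f.ppf(1 - 0.05, dfn=3, dfd=74)
print(f_crit)   # about 2.73, in line with the interpolated table value
```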
c)
In a week in which Paid Attendance was 200,000, the number of shows was 30, and the average ticket price was $70, what would you estimate the receipts to be?
Answer
Receipts = −18.320 + 0.076(200,000) + 0.0070(30) + 0.24(70)
= 15,198.69
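Checking the arithmetic in Python (or equivalently with the predicted_receipts sketch under Q14 a)):

```python
# Point prediction with the coefficients quoted from the regression output
receipts = -18.320 + 0.076 * 200_000 + 0.0070 * 30 + 0.24 * 70
print(receipts)   # -> about 15198.69
```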
d)
Is this likely to be a good prediction? Why do you think that?
Probably yes: R² is about 99.9%, so the model explains almost all the variability in Receipts, and these predictor values appear to lie within the range of the observed data, so the prediction should be quite accurate.
Q16)
a)
How was the t-ratio of 126.7 found for Paid Attendance?
tcalc = (β − βH0) / SE(β) = (0.076 − 0) / 0.0006 = 126.7
The t-statistic for the significance of the slope is essentially a test to determine whether the regression model (equation) is usable. If the slope is significantly different from zero, then we can use the regression model to predict the dependent variable for any value of the independent variable.
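The same arithmetic in Python, with the numbers taken from the printout quoted above:

```python
beta_hat = 0.076   # estimated slope for Paid Attendance (from the printout)
se = 0.0006        # its standard error (from the printout)
t_calc = (beta_hat - 0) / se   # hypothesized value under H0 is 0
print(t_calc)      # -> about 126.7
```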
b)
How many weeks are included in this regression? How can you tell?
Answer: 78 weeks. The F- and t-statistics above were read with 74 residual degrees of freedom, and residual df = n − (k + 1). With k = 3 predictors, n = 74 + 3 + 1 = 78 weeks.
c)
The t-ratio for the intercept is negative. What does that mean?
tint = (βint − βH0) / SE = (−18.320 − 0) / 0.3127 = −58.6
If all independent variables were zero, receipts (R) would be −18.320, indicating a loss of receipts (fixed costs).
(Draw a loss graph.)
A negative value for the constant/intercept should not be a cause for concern. It simply means that the expected value of the dependent variable will be less than 0 (i.e. negative) when all independent/predictor variables are set to 0.
The intercept is usually called the constant, and the slope is referred to as the coefficient.
Q18)
a)
State the standard null and alternative hypotheses for the true coefficient of No. of Shows.
βNS = 0.0070
Our hypotheses:
H0: βNS = 0 (No. of Shows does not affect Receipts)
Ha: βNS ≠ 0 (No. of Shows does affect Receipts)
Equivalently:
H0: The slope for No. of Shows is equal to zero.
Ha: The slope for No. of Shows is not equal to zero.
If the relationship between No. of Shows and Receipts is significant, the slope will not equal zero.
b)
Test the null hypothesis (at α = 0.05) and state your conclusion.
tNS = 1.59
P-value(NS) = 0.166. The reported p-value is already two-sided, so it is compared directly with α: 0.166 > 0.05.
As the p-value is greater than α, we do not reject the null hypothesis.
This means No. of Shows does not have a significant effect on Receipts.
c)
A Broadway investor challenges our decision/analysis. He points out the scatterplot of Receipts vs. No. of Shows. Explain to him why our answer in (b) isn't a contradiction.
Multicollinearity: the scatterplot shows the simple (marginal) relationship between Receipts and No. of Shows, while the multiple regression coefficient measures the effect of No. of Shows after the other predictors are accounted for. Because No. of Shows is correlated with the other predictors (e.g. Paid Attendance), it can look strongly related to Receipts on its own yet add little once the other variables are in the model, so the two pictures do not contradict each other.
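A sketch of how multicollinearity could be checked numerically with variance inflation factors (VIFs) in statsmodels; the data below are made up to mimic two strongly correlated predictors, and the column names are only illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors where two are strongly correlated
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = pd.DataFrame({
    "PaidAttendance": x1,
    "NumShows": x1 + rng.normal(scale=0.1, size=100),   # nearly collinear with x1
    "AvgTicketPrice": rng.normal(size=100),
})
Xc = sm.add_constant(X)
# VIFs well above ~5-10 signal a predictor largely explained by the others
for i, name in enumerate(Xc.columns):
    if name != "const":
        print(name, variance_inflation_factor(Xc.values, i))
```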
Q20)
Q22) Which regression conditions can be checked in the three plots?
Plot 1
The linearity and equal spread conditions appear to be met, because there is no obvious pattern in the residuals.
Plot 2
The normality assumption is met, as the plot looks like a normal distribution.
Plot 3
The residuals increase towards the year's end and beginning (around winter time) and then go down in mid-year (summer time), so we see a seasonal pattern in the plot: around winter time the residuals go up.
Impact on receipts?
If Ɛ, i.e. the residuals, increase, our receipts increase, assuming Ɛ follows a normal distribution.
The plot shows there is some dependence among the residuals, i.e. among the receipts, so the independence condition isn't met.
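Independence can also be checked numerically with the Durbin-Watson statistic on the residuals (values near 2 suggest no autocorrelation). A minimal sketch with made-up seasonal residuals, not the actual data:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Made-up residuals with a seasonal (winter-high) pattern, like Plot 3
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 52)                 # one year of weekly data
resid = np.cos(t) + rng.normal(0, 0.2, size=52)   # high at year ends, low mid-year
print(durbin_watson(resid))   # well below 2 -> positive autocorrelation
```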