## Statistical Techniques Form PDF Details

Statistical techniques form the backbone of any data-driven field. To make informed decisions, researchers and analysts rely on these techniques to understand the patterns in their data. Statistics allows for more accurate predictions and a better understanding of what is happening in our world. In this blog post, we will discuss some of the most common statistical techniques used by researchers and analysts, with examples of how they can be applied to real-world situations.

Here you will find some particulars about the statistical techniques form PDF: its length, the typical time needed to complete it, the fields you should fill in, and so forth.

| Field | Value |
|---|---|
| Form Name | Statistical Techniques Form |
| Form Length | 58 pages |
| Fillable? | No |
| Fillable fields | 0 |
| Avg. time to fill out | 14 min 30 sec |
| Other names | statistical techniques in business and economics solution pdf, lind marshal 15th edition statistic, statistical techniques in business and economics 14th edition solutions manual pdf, statistical techniques in business and economics 13th edition solution manual pdf |

## Form Preview Example

Lind, Marchal, and Wathen, *Statistical Techniques in Business and Economics*, 13th Edition. ©The McGraw-Hill Companies, 2008.

Chapter 14: Multiple Regression and Correlation Analysis

A mortgage department of a large bank is studying its recent loans. A random sample of 25 recent loans is obtained to see how factors such as the value of the home, education level of the borrower, age, monthly mortgage payment, and gender relate to family income. Are these variables effective predictors of the income of the household? (See Exercise 26 and Goal 1.)

Goals

When you have completed this chapter you will be able to:

1. Describe the relationship between several independent variables and a dependent variable using multiple regression analysis.

2. Set up, interpret, and apply an ANOVA table.

3. Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination.

4. Conduct a test of hypothesis to determine whether regression coefficients differ from zero.

5. Conduct a test of hypothesis on each of the regression coefficients.

6. Use residual analysis to evaluate the assumptions of multiple regression analysis.

7. Evaluate the effects of correlated independent variables.

8. Use and understand qualitative independent variables.

9. Understand and interpret the stepwise regression method.

10. Understand and interpret possible interaction among independent variables.

Introduction


In Chapter 13 we described the relationship between a pair of interval- or ratio-scaled variables. We began the chapter by studying the coefficient of correlation, which measures the strength of the relationship. A coefficient near plus or minus 1.00 (−.88 or .78, for example) indicates a very strong linear relationship, whereas a value near 0 (−.12 or .18, for example) means that the relationship is weak. Next we developed a procedure to determine a linear equation to express the relationship between the two variables. We referred to this as a regression line. This line describes the relationship between the variables. It relates the overall pattern of a dependent variable (Y) to a single independent or explanatory variable (X).

In multiple linear correlation and regression we use additional independent variables (denoted X1, X2, . . . , and so on) that help us better explain or predict the dependent variable (Y). Almost all of the ideas we saw in simple linear correlation and regression extend to this more general situation. However, the additional independent variables do lead to some new considerations. Multiple regression analysis can be used either as a descriptive or as an inferential technique.

Multiple Regression Analysis

The general descriptive form of a multiple linear equation is shown in formula (14–1). We use k to represent the number of independent variables. So k can be any positive integer.

GENERAL MULTIPLE REGRESSION EQUATION    Ŷ = a + b1X1 + b2X2 + b3X3 + . . . + bkXk   [14–1]

where

a is the intercept, the value of Y when all the X’s are zero.

bj is the amount by which Y changes when that particular Xj increases by one unit, with the values of all other independent variables held constant. The subscript j is simply a label that helps to identify each independent variable; it is not used in any calculations. Usually the subscript is an integer value between 1 and k, which is the number of independent variables. However, the subscript can also be a short or abbreviated label. For example, age could be used as a subscript.

In Chapter 13, the regression analysis described and tested the relationship between a dependent variable, Ŷ, and a single independent variable, X. The relationship between Ŷ and X was graphically portrayed by a line. When there are two independent variables, the regression equation is

Ŷ = a + b1X1 + b2X2

Because there are two independent variables, this relationship is graphically portrayed as a plane and is shown in Chart 14–1. The chart shows the residuals as the difference between the actual Y and the fitted Ŷ on the plane. If a multiple regression analysis includes more than two independent variables, we cannot use a graph to illustrate the analysis, since graphs are limited to three dimensions.

To illustrate the interpretation of the intercept and the two regression coefficients, suppose a vehicle's mileage per gallon of gasoline is directly related to the octane rating of the gasoline being used (X1) and inversely related to the weight of the automobile (X2). Assume that the regression equation, calculated using statistical software, is:

Ŷ = 6.3 + 0.2X1 − 0.001X2

CHART 14–1 Regression Plane with Ten Sample Points (the chart plots the observed points Y, the estimated points Ŷ, and the plane Ŷ = a + b1X1 + b2X2 formed through the sample points against the axes X1 and X2)

The intercept value of 6.3 indicates that the regression equation intersects the Y-axis at 6.3 when both X1 and X2 are zero. Of course, it does not make physical sense for an automobile to have no (zero) weight or to use gasoline with zero octane. It is important to keep in mind that a regression equation is not generally used outside the range of the sample values.

The b1 of 0.2 indicates that for each increase of 1 in the octane rating of the gasoline, the automobile would travel 2/10 of a mile more per gallon, regardless of the weight of the vehicle. The b2 value of −0.001 reveals that for each increase of one pound in the vehicle's weight, the number of miles traveled per gallon decreases by 0.001, regardless of the octane of the gasoline being used.

As an example, an automobile with 92-octane gasoline in the tank and weighing 2,000 pounds would travel an average of 22.7 miles per gallon, found by:

Ŷ = a + b1X1 + b2X2 = 6.3 + 0.2(92) − 0.001(2,000) = 22.7
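The arithmetic can be checked with a couple of lines of Python (the function name `mileage` is ours, not the book's):

```python
# Evaluate the illustrative mileage equation Y-hat = 6.3 + 0.2*X1 - 0.001*X2,
# where X1 is the octane rating and X2 is the vehicle weight in pounds.
def mileage(octane, weight_lb):
    return 6.3 + 0.2 * octane - 0.001 * weight_lb

# 92-octane gasoline in a 2,000-pound automobile:
print(round(mileage(92, 2000), 1))  # 22.7 miles per gallon
```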

The values for the coefficients in the multiple linear equation are found by using the method of least squares. Recall from the previous chapter that the least squares method makes the sum of the squared differences between the fitted and actual values of Y as small as possible. The calculations are very tedious, so they are usually performed by a statistical software package, such as Excel or MINITAB.

In the following example, we show a multiple regression analysis with three independent variables, using Excel and MINITAB. Both packages report a standard set of statistics and reports. However, MINITAB also provides advanced regression analysis techniques that we will use later in the chapter.

Example

Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace. To investigate, Salsberry's research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well as the January outside temperature in the region, the number of inches of insulation in the attic, and the age of the furnace. The sample information is reported in Table 14–1.

Statistics in Action

Many studies indicate a woman will earn about 70 percent of what a man would for the same work. Researchers at the University of Michigan Institute for Social Research found that about one-third of the difference can be explained by such social factors as differences in education, seniority, and work interruptions. The remaining two-thirds is not explained by these social factors.

TABLE 14–1 Factors in January Heating Cost for a Sample of 20 Homes

| Home | Heating Cost (\$) | Mean Outside Temperature (°F) | Attic Insulation (inches) | Age of Furnace (years) |
|---|---|---|---|---|
| 1 | 250 | 35 | 3 | 6 |
| 2 | 360 | 29 | 4 | 10 |
| 3 | 165 | 36 | 7 | 3 |
| 4 | 43 | 60 | 6 | 9 |
| 5 | 92 | 65 | 5 | 6 |
| 6 | 200 | 30 | 5 | 5 |
| 7 | 355 | 10 | 6 | 7 |
| 8 | 290 | 7 | 10 | 10 |
| 9 | 230 | 21 | 9 | 11 |
| 10 | 120 | 55 | 2 | 5 |
| 11 | 73 | 54 | 12 | 4 |
| 12 | 205 | 48 | 5 | 1 |
| 13 | 400 | 20 | 5 | 15 |
| 14 | 320 | 39 | 4 | 7 |
| 15 | 72 | 60 | 8 | 6 |
| 16 | 272 | 20 | 5 | 8 |
| 17 | 94 | 58 | 7 | 3 |
| 18 | 190 | 40 | 8 | 11 |
| 19 | 235 | 27 | 9 | 8 |
| 20 | 139 | 30 | 7 | 5 |

The data in Table 14–1 is available in both Excel and MINITAB formats on the student CD. The basic instructions for using Excel and MINITAB for this data are in the Software Commands section at the end of this chapter.

Determine the multiple regression equation. Which variables are the independent variables? Which variable is the dependent variable? Discuss the regression coefficients. What does it indicate if some coefficients are positive and some coefficients are negative? What is the intercept value? What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?

Solution

We begin the analysis by defining the dependent and independent variables. The dependent variable is the January heating cost. It is represented by Y. There are three independent variables:

The mean outside temperature in January, represented by X1.

The number of inches of insulation in the attic, represented by X2.

The age in years of the furnace, represented by X3.

Given these definitions, the general form of the multiple regression equation follows.

Ŷ = a + b1X1 + b2X2 + b3X3

The value Ŷ is used to estimate the value of Y.

Now that we have defined the regression equation, we are ready to use either Excel or MINITAB to compute all the statistics needed for the analysis. The outputs from the two software systems are shown below.
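The fit that the software performs can also be reproduced from scratch in plain Python. The sketch below is our own illustration (all names are ours, and only the standard library is used); it solves the least squares normal equations for the 20 homes in Table 14–1 and should reproduce, to rounding, the coefficients reported by Excel and MINITAB:

```python
# Fit Y-hat = a + b1*X1 + b2*X2 + b3*X3 to the 20 homes in Table 14-1
# by least squares: solve the normal equations (X'X) b = X'y with
# Gaussian elimination. Data columns: cost, temperature, insulation, age.
homes = [
    (250, 35, 3, 6), (360, 29, 4, 10), (165, 36, 7, 3), (43, 60, 6, 9),
    (92, 65, 5, 6), (200, 30, 5, 5), (355, 10, 6, 7), (290, 7, 10, 10),
    (230, 21, 9, 11), (120, 55, 2, 5), (73, 54, 12, 4), (205, 48, 5, 1),
    (400, 20, 5, 15), (320, 39, 4, 7), (72, 60, 8, 6), (272, 20, 5, 8),
    (94, 58, 7, 3), (190, 40, 8, 11), (235, 27, 9, 8), (139, 30, 7, 5),
]
y = [h[0] for h in homes]
X = [[1.0, h[1], h[2], h[3]] for h in homes]  # leading 1 for the intercept

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

k = 4  # intercept plus three independent variables
XtX = [[sum(X[i][r] * X[i][c] for i in range(len(X))) for c in range(k)]
       for r in range(k)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(k)]
a, b1, b2, b3 = solve(XtX, Xty)

print(round(a, 3), round(b1, 3), round(b2, 3), round(b3, 3))
# Expected to agree with the software output: 427.194, -4.583, -14.831, 6.101

# Residual sum of squares, used later for the standard error of estimate:
sse = sum((y[i] - sum(c * v for c, v in zip((a, b1, b2, b3), X[i]))) ** 2
          for i in range(len(X)))
print(round(sse, 2))  # about 41,695, as reported later in the chapter
```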

To use the regression equation to predict the January heating cost, we need to know the values of the regression coefficients, bj. These are highlighted in

the software reports. Note that the software used the variable names or labels associated with each independent variable. The regression equation intercept, a, is labeled as “constant” in the MINITAB output and “intercept” in the Excel output.

In this case the estimated regression equation is:

Ŷ = 427.194 − 4.583X1 − 14.831X2 + 6.101X3

We can now estimate or predict the January heating cost for a home if we know the mean outside temperature, the inches of insulation, and the age of the furnace. For an example home, the mean outside temperature for the month is 30 degrees

(X1), there are 5 inches of insulation in the attic (X2), and the furnace is 10 years old (X3). By substituting the values for the independent variables:

Ŷ = 427.194 − 4.583(30) − 14.831(5) + 6.101(10) = 276.56

The estimated January heating cost is \$276.56.
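The substitution is easy to check in Python; `heating_cost` is our label for the fitted equation, not the book's:

```python
# Estimated January heating cost from the fitted equation
# Y-hat = 427.194 - 4.583*X1 - 14.831*X2 + 6.101*X3.
def heating_cost(temp_f, insulation_in, furnace_age):
    return 427.194 - 4.583 * temp_f - 14.831 * insulation_in + 6.101 * furnace_age

# 30 degrees outside, 5 inches of insulation, 10-year-old furnace:
print(round(heating_cost(30, 5, 10), 2))  # 276.56
```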

The regression coefficients, and their algebraic signs, also provide information about their individual relationships with the January heating cost. The regression coefficient for mean outside temperature is −4.583. The coefficient is negative and shows an inverse relationship between heating cost and temperature. This is not surprising. As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information. If we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of \$4.583 in monthly heating cost. So if the mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all other things being the same (insulation and age of furnace), we expect the heating cost would be \$45.83 less in Philadelphia.

The attic insulation variable also shows an inverse relationship: the more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline \$14.83 per month, holding the outside temperature and the age of the furnace constant.

The age of the furnace variable shows a direct relationship. With an older furnace, the cost to heat the home increases. Specifically, for each additional year older the furnace is, we expect the cost to increase \$6.10 per month.

Self-Review 14–1 There are many restaurants in northeastern South Carolina. They serve beach vacationers in the summer, golfers in the fall and spring, and snowbirds in the winter. Bill and Joyce Tuneall manage several restaurants in the North Jersey area and are considering moving to Myrtle Beach, SC to open a new restaurant. Before making a final decision they wish to investigate existing restaurants and what variables seem to be related to profitability. They gather sample information where profit (reported in \$000) is the dependent variable and the independent variables are:

X1 = the number of parking spaces near the restaurant.

X2 = the number of hours the restaurant is open per week.

X3 = the distance from the Pavilion (a landmark in the central area) in Myrtle Beach.

X4 = the number of servers employed.

X5 = the number of years the current owner has owned the restaurant.

The following is part of the output obtained using statistical software.

| Predictor | Coef | SE Coef | T |
|---|---|---|---|
| Constant | 2.50 | 1.50 | 1.667 |
| X1 | 3.00 | 1.500 | 2.000 |
| X2 | 4.00 | 3.000 | 1.333 |
| X3 | 3.00 | 0.20 | 15.00 |
| X4 | 0.20 | .05 | 4.00 |
| X5 | 1.00 | 1.50 | 0.667 |

(a) What is the amount of profit for a restaurant with 40 parking spaces that is open 72 hours per week, is 10 miles from the Pavilion, has 20 servers, and has been open 5 years?

(b) Interpret the values of b2 and b3 in the multiple regression equation.

Exercises

1. The director of marketing at Reeves Wholesale Products is studying monthly sales. Three independent variables were selected as estimators of sales: regional population, per

capita income, and regional unemployment rate. The regression equation was computed to be (in dollars):

Ŷ = 64,100 + 0.394X1 + 9.6X2 − 11,600X3

a. What is the full name of the equation?

b. Interpret the number 64,100.

c. What are the estimated monthly sales for a particular region with a population of 796,000, per capita income of \$6,940, and an unemployment rate of 6.0 percent?

2. Thompson Photo Works purchased several new, highly sophisticated processing machines. The production department needed some guidance with respect to qualifications needed by an operator. Is age a factor? Is the length of service as an operator (in years) important? In order to explore further the factors needed to estimate performance on the new processing machines, four variables were listed:

X1 = length of time an employee was in the industry.
X2 = mechanical aptitude test score.
X3 = prior on-the-job rating.
X4 = age.

Performance on the new machine is designated Y.

Thirty employees were selected at random. Data were collected for each, and their performances on the new machines were recorded. A few results are:

| Name | Performance on New Machine, Y | Length of Time in Industry, X1 | Mechanical Aptitude Score, X2 | Prior On-the-Job Performance, X3 | Age, X4 |
|---|---|---|---|---|---|
| Mike Miraglia | 112 | 12 | 312 | 121 | 52 |
| Sue Trythall | 113 | 2 | 380 | 123 | 27 |

The equation is:

Ŷ = 11.6 + 0.4X1 + 0.286X2 + 0.112X3 + 0.002X4

a. What is this equation called?

b. How many dependent variables are there? Independent variables?

c. What is the number 0.286 called?

d. As age increases by one year, how much does estimated performance on the new machine increase?

e. Carl Knox applied for a job at Photo Works. He has been in the business for six years, and scored 280 on the mechanical aptitude test. Carl's prior on-the-job performance rating is 97, and he is 35 years old. Estimate Carl's performance on the new machine.

3. A sample of General Mills employees was studied to determine their degree of satisfaction with their present life. A special index, called the index of satisfaction, was used to measure satisfaction. Six factors were studied, namely, age at the time of first marriage (X1), annual income (X2), number of children living (X3), value of all assets (X4), status of health in the form of an index (X5), and the average number of social activities per week, such as bowling and dancing (X6). Suppose the multiple regression equation is:

Ŷ = 16.24 + 0.017X1 + 0.0028X2 + 42X3 + 0.0012X4 + 0.19X5 + 26.8X6

a. What is the estimated index of satisfaction for a person who first married at 18, has an annual income of \$26,500, has three children living, has assets of \$156,000, has an index of health status of 141, and has 2.5 social activities a week on the average?

b. Which would add more to satisfaction, an additional income of \$10,000 a year or two more social activities a week?

4. Cellulon, a manufacturer of home insulation, wants to develop guidelines for builders and consumers on how the thickness of the insulation in the attic of a home and the outdoor

temperature affect natural gas consumption. In the laboratory it varied the insulation thickness and temperature. A few of the findings are:

| Monthly Natural Gas Consumption (cubic feet), Y | Thickness of Insulation (inches), X1 | Outdoor Temperature (°F), X2 |
|---|---|---|
| 30.3 | 6 | 40 |
| 26.9 | 12 | 40 |
| 22.1 | 8 | 49 |

On the basis of the sample results, the regression equation is:

Ŷ = 62.65 − 1.86X1 − 0.52X2

a. How much natural gas can homeowners expect to use per month if they install 6 inches of insulation and the outdoor temperature is 40 degrees F?

b. What effect would installing 7 inches of insulation instead of 6 have on the monthly natural gas consumption (assuming the outdoor temperature remains at 40 degrees F)?

c. Why are the regression coefficients b1 and b2 negative? Is this logical?

How Well Does the Equation Fit the Data?

Once you have the multiple regression equation, it is natural to ask "how well does the equation fit the data?" In linear regression, discussed in the previous chapter, you used summary statistics such as the standard error of estimate and the coefficient of determination to describe how effectively a single independent variable explained the variation of the dependent variable. The same procedures, broadened to additional independent variables, are used in multiple regression.

Multiple Standard Error of Estimate

We begin with the multiple standard error of estimate. Recall that the standard error of estimate is comparable to the standard deviation. The standard deviation uses squared deviations from the mean, (Y − Ȳ)², whereas the standard error of estimate utilizes squared deviations from the regression line, (Y − Ŷ)². To explain

the details of the standard error of estimate, refer to the first sampled home in Table 14–1 in the previous example on page 514. The actual heating cost for the first observation, Y, is \$250, the outside temperature, X1, is 35 degrees, the depth of insulation, X2, is 3 inches, and the age of the furnace, X3, is 6 years. Using the regression equation developed in the previous section, the estimated heating cost for this home is:

Ŷ = 427.194 − 4.583X1 − 14.831X2 + 6.101X3
= 427.194 − 4.583(35) − 14.831(3) + 6.101(6)
= 258.90

So we would estimate that a home with a mean January outside temperature of 35 degrees, 3 inches of insulation, and a 6-year-old furnace would cost \$258.90 to heat. The actual heating cost was \$250, so the residual, which is the difference between the actual value and the estimated value, is Y − Ŷ = 250 − 258.90 = −8.90. This difference of −\$8.90 is the random or unexplained error for the first item sampled. Our next step is to square this difference, that is, find (Y − Ŷ)² = (250 − 258.90)² = (−8.90)² = 79.21. We repeat these operations for the other 19 observations and total these squared values. This value is the numerator

of the multiple standard error of estimate. The denominator is the degrees of freedom, that is, n − (k + 1). The formula for the standard error is:

MULTIPLE STANDARD ERROR OF ESTIMATE    sY.123...k = √[Σ(Y − Ŷ)² / (n − (k + 1))]   [14–2]

where

Y is the actual observation.

Ŷ is the estimated value computed from the regression equation.

n is the number of observations in the sample.

k is the number of independent variables.

In this example n = 20 and k = 3 (three independent variables), and we use the Excel software system to find the term Σ(Y − Ŷ)². Note: there are small discrepancies due to rounding.
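The first term of that sum is easy to verify in Python (`heating_cost` is our label for the fitted equation, not the book's):

```python
# Residual for the first sampled home in Table 14-1:
# actual cost $250, X1 = 35 degrees, X2 = 3 inches, X3 = 6 years.
def heating_cost(temp_f, insulation_in, furnace_age):
    return 427.194 - 4.583 * temp_f - 14.831 * insulation_in + 6.101 * furnace_age

fitted = heating_cost(35, 3, 6)   # about 258.90
residual = 250 - fitted           # about -8.90
print(round(fitted, 2), round(residual, 2), round(residual ** 2, 2))
# Squaring the unrounded residual gives about 79.25; the text's 79.21
# squares the rounded value -8.90.
```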

Since we have 3 independent variables, we identify the multiple standard error as sY.123. The subscripts indicate that three independent variables are being used to estimate Y.

sY.123 = √[Σ(Y − Ŷ)² / (n − (k + 1))] = √[41,695.28 / (20 − (3 + 1))] = 51.05

How do we interpret the standard error of estimate of 51.05? It is the typical "error" when we use this equation to predict the cost. First, the units are the same as the dependent variable, so the standard error is in dollars, \$51.05. Second, we expect the residuals to be approximately normally distributed, so about 68 percent of the residuals will be within ±\$51.05 and about 95 percent within ±2(51.05) = ±\$102.10. Refer to column F of the Excel output, headed Y − Ŷ. Of the 20 values in this column, 14 (or 70 percent) are less than \$51.05 and all are within \$102.10, which is very close to the guidelines of 68 percent and 95 percent.

The ANOVA Table

As we said before, the multiple regression computations are long. Luckily, many statistical software systems do the calculations. Most of them report the results in a standard format. The outputs from Excel and MINITAB shown earlier are typical. In particular, they include an analysis of variance (ANOVA) table. The output from MINITAB is repeated here.

Focus on the analysis of variance table. It is similar to the ANOVA table used in Chapter 12. In that chapter the variation was divided into two components: variation due to the treatments and variation due to random error. Here total variation is also separated into two components:

Variation in the dependent variable explained by the regression model (the independent variables).

The residual or error variation. This is the random error due to sampling.

Incidentally, the term residual error will sometimes be called random error or just error. There are three categories identified in the first or Source column in the ANOVA table; namely, the regression or explained variation, the residual or unexplained variation, and the total variation.

The second column in the ANOVA table is labeled df. It is the degrees of freedom. The degrees of freedom in the "Regression" row is the number of independent variables. We let k represent the number of independent variables, so k = 3. The degrees of freedom in the "Error" row is n − (k + 1) = 20 − (3 + 1) = 16. In this example, there are 20 observations, so n = 20. The total degrees of freedom is n − 1 = 20 − 1 = 19.

The heading SS in the third column of the ANOVA table is the sum of squares or the variation.

Total variation = SS total = Σ(Y − Ȳ)² = 212,916

Residual error or error variance = SSE = Σ(Y − Ŷ)² = 41,695

Regression variation = SSR = Σ(Ŷ − Ȳ)² = SS total − SSE = 212,916 − 41,695 = 171,220

(There is a small “round off” difference of one unit, which will have no effect on later calculations.)

The fourth column heading, MS or mean square, is obtained by dividing the SS quantity by the matching df. Thus MSR, the mean square regression, is equal to SSR/k. Similarly, MSE, the mean square error, is SSE/(n − (k + 1)).

The following ANOVA table summarizes the process.

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | k | SSR | MSR = SSR/k | MSR/MSE |
| Residual or error | n − (k + 1) | SSE | MSE = SSE/(n − (k + 1)) | |
| Total | n − 1 | SS total | | |

Each value in the ANOVA table plays an important role in the evaluation and inter- pretation of a multiple regression equation. Notice, for example, that the standard error of estimate can very easily be computed from the ANOVA table.

sY.123 = √MSE = √2,606 = 51.05
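Both routes to the standard error, from formula (14–2) with SSE and from the ANOVA table with MSE, can be checked in a few lines of Python:

```python
import math

# Multiple standard error of estimate for the heating-cost example, two ways.
n, k = 20, 3
sse = 41_695.28                  # residual sum of squares
mse = 2606                       # mean square error from the ANOVA table

s_from_formula = math.sqrt(sse / (n - (k + 1)))  # formula (14-2)
s_from_anova = math.sqrt(mse)                    # square root of MSE

print(round(s_from_formula, 2), round(s_from_anova, 2))  # 51.05 51.05
```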

Coefficient of Multiple Determination

Next, let’s look at the coefficient of multiple determination. Recall from the previous chapter the coefficient of determination is defined as the percent of variation in the dependent variable explained, or accounted for, by the independent variable. In the multiple regression case we extend this definition as follows.

COEFFICIENT OF MULTIPLE DETERMINATION The percent of variation in the dependent variable, Y, explained by the set of independent variables, X1, X2, X3, . . . Xk.

The characteristics of the coefficient of multiple determination are:

1. It is symbolized by a capital R squared. In other words, it is written as R² because it behaves like the square of a correlation coefficient.

2. It can range from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable. A value near 1 means a strong association.

3. It cannot assume negative values. Any number that is squared or raised to the second power cannot be negative.

4. It is easy to interpret. Because R² is a value between 0 and 1, it is easy to interpret, compare, and understand.

We can calculate the coefficient of determination from the information found in the ANOVA table. We look in the sum of squares column, which is labeled SS in the MINITAB output, and use the regression sum of squares, SSR, then divide by the total sum of squares, SS total.

COEFFICIENT OF MULTIPLE DETERMINATION    R² = SSR / SS total   [14–3]

The ANOVA portion of the MINITAB output in the heating cost example is repeated below.

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 3 | 171220 | 57073 | 21.90 | 0.000 |
| Residual Error | 16 | 41695 | 2606 | | |
| Total | 19 | 212916 | | | |

Use formula (14–3) to calculate the coefficient of multiple determination.

R² = SSR / SS total = 171,220 / 212,916 = .804

How do we interpret this value? We say the independent variables (outside temperature, amount of insulation, and age of furnace) explain, or account for,

80.4 percent of the variation in heating cost. To put it another way, 19.6 percent of the variation is due to other sources, such as random error or variables not included in the analysis. Using the ANOVA table, 19.6 percent is the error sum of squares divided by the total sum of squares. Knowing that SSR + SSE = SS total, the following relationship is true.

1 − R² = 1 − SSR / SS total = SSE / SS total = 41,695 / 212,916 = .196

Adjusted Coefficient of Determination

The number of independent variables in a multiple regression equation makes the coefficient of determination larger. Each new independent variable causes the predictions to be more accurate. That, in turn, makes SSE smaller and SSR larger. Hence, R² increases only because of the total number of independent variables and not because the added independent variable is a good predictor of the dependent variable. In fact, if the number of variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0. In practice, this situation is rare and would also be ethically questionable. To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages use an adjusted coefficient of multiple determination.

ADJUSTED COEFFICIENT OF DETERMINATION    R²adj = 1 − [SSE / (n − (k + 1))] / [SS total / (n − 1)]   [14–4]

The error and total sum of squares are divided by their degrees of freedom. Notice especially the degrees of freedom for the error sum of squares includes k, the number of independent variables. For the cost of heating example, the adjusted coefficient of determination is:

R²adj = 1 − [41,695 / (20 − (3 + 1))] / [212,916 / (20 − 1)] = 1 − 2,606 / 11,206.0 = 1 − .23 = .77

If we compare the R² (0.80) to the adjusted R² (0.77), the difference in this case is small.
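All three summary measures follow directly from the ANOVA table, as a short Python check shows (variable names are ours):

```python
# R-squared, its complement, and the adjusted R-squared for the
# heating-cost example, computed from the ANOVA table entries.
ssr, sse, ss_total = 171_220, 41_695, 212_916
n, k = 20, 3

r2 = ssr / ss_total                                        # formula (14-3)
unexplained = sse / ss_total                               # 1 - R-squared
r2_adj = 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))  # formula (14-4)

print(round(r2, 3), round(unexplained, 3), round(r2_adj, 2))  # 0.804 0.196 0.77
```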

Self-Review 14–2 Refer to Self-Review 14–1 on the subject of restaurants in Myrtle Beach. The ANOVA portion of the regression output is presented below.

Analysis of Variance

| Source | DF | SS | MS |
|---|---|---|---|
| Regression | 5 | 100 | 20 |
| Residual Error | 20 | 40 | 2 |
| Total | 25 | 140 |

(a) How large was the sample?

(b) How many independent variables are there?

(c) How many dependent variables are there?

(d) Compute the standard error of estimate. About 95 percent of the residuals will be between what two values?

(e) Determine the coefficient of multiple determination. Interpret this value.

(f) Find the coefficient of multiple determination, adjusted for the degrees of freedom.

Exercises

5. Consider the ANOVA table that follows.

Analysis of Variance

| Source | DF | SS | MS | F | P |
|---|---|---|---|---|---|
| Regression | 2 | 77.907 | 38.954 | 4.14 | 0.021 |
| Residual Error | 62 | 583.693 | 9.414 | | |
| Total | 64 | 661.600 | | | |

a. Determine the standard error of estimate. About 95 percent of the residuals will be between what two values?

b. Determine the coefficient of multiple determination. Interpret this value.

c. Determine the coefficient of multiple determination, adjusted for the degrees of freedom.

6. Consider the ANOVA table that follows.

Analysis of Variance

Source          DF        SS       MS       F
Regression       5   3710.00   742.00   12.89
Residual Error  46   2647.38    57.55
Total           51   6357.38

a. Determine the standard error of estimate. About 95 percent of the residuals will be between what two values?

b. Determine the coefficient of multiple determination. Interpret this value.

c. Determine the coefficient of multiple determination, adjusted for the degrees of freedom.

Inferences in Multiple Linear Regression

Thus far, multiple regression analysis has been viewed only as a way to describe the relationship between a dependent variable and several independent variables. However, the least squares method also has the ability to draw inferences or generalizations about the relationship for an entire population. Recall that when you create confidence intervals or perform hypothesis tests as a part of inferential statistics, you view the data as a random sample taken from some population.


In the multiple regression setting, we assume there is an unknown population regression equation that relates the dependent variable to the k independent variables. This is sometimes called a model of the relationship. In symbols we write:

Ŷ = α + β1X1 + β2X2 + . . . + βkXk

This equation is analogous to formula (14–1) except that the coefficients are now reported as Greek letters. We use the Greek letters to denote population parameters. Then, under a certain set of assumptions, which will be discussed shortly, the computed values of a and bj are sample statistics. These sample statistics are point estimates of the corresponding population parameters α and βj. For example, the sample regression coefficient b2 is a point estimate of the population parameter β2. The sampling distribution of these point estimates follows the normal probability distribution. These sampling distributions are each centered at their respective parameter values. To put it another way, the means of the sampling distributions are equal to the parameter values to be estimated. Thus, by using the properties of the sampling distributions of these statistics, inferences about the population parameters are possible.

Global Test: Testing the Multiple Regression Model

We can test the ability of the independent variables X1, X2, . . . , Xk to explain the behavior of the dependent variable Y. To put this in question form: Can the dependent variable be estimated without relying on the independent variables? The test used is referred to as the global test. Basically, it investigates whether it is possible that all the independent variables have zero regression coefficients.

To relate this question to the heating cost example, we will test whether the independent variables (amount of insulation in the attic, mean daily outside temperature, and age of furnace) effectively estimate home heating costs.

In testing a hypothesis, we first state the null hypothesis and the alternate hypothesis. In the heating cost example, there are three independent variables. Recall that b1, b2, and b3 are sample regression coefficients. The corresponding coefficients in the population are given the symbols β1, β2, and β3. We now test whether the net regression coefficients in the population are all zero. The null hypothesis is:

H0: β1 = β2 = β3 = 0

The alternate hypothesis is:

H1: Not all the βi's are 0.

If the null hypothesis is true, it implies the regression coefficients are all zero and, logically, are of no use in estimating the dependent variable (heating cost). Should that be the case, we would have to search for some other independent variables— or take a different approach—to predict home heating costs.

To test the null hypothesis that the multiple regression coefficients are all zero, we employ the F distribution introduced in Chapter 12. We will use the .05 level of significance. Recall these characteristics of the F distribution:

1. There is a family of F distributions. Each time the degrees of freedom in either the numerator or the denominator changes, a new F distribution is created.

2. The F distribution cannot be negative. The smallest possible value is 0.

3. It is a continuous distribution. The distribution can assume an infinite number of values between 0 and positive infinity.

4. It is positively skewed. The long tail of the distribution is to the right-hand side. As the number of degrees of freedom increases in both the numerator and the denominator, the distribution approaches the normal probability distribution. That is, the distribution will move toward a symmetric distribution.

5. It is asymptotic. As the values of X increase, the F curve will approach the horizontal axis, but will never touch it.


The degrees of freedom for the numerator and the denominator may be found in the Excel ANOVA table that follows. The ANOVA output is highlighted in light green. The top number in the column marked "df" is 3, indicating there are 3 degrees of freedom in the numerator. This value corresponds to the number of independent variables. The middle number in the "df" column (16) indicates there are 16 degrees of freedom in the denominator. The number 16 is found by n - (k + 1) = 20 - (3 + 1) = 16.

The critical value of F is found in Appendix B.4. Using the table for the .05 significance level, move horizontally to 3 degrees of freedom in the numerator, then down to 16 degrees of freedom in the denominator, and read the critical value. It is 3.24. The region where H0 is not rejected and the region where H0 is rejected are shown in the following diagram.

[Diagram: F distribution with df = (3, 16). The region where H0 is not rejected lies to the left of 3.24; the region of rejection (.05 level) lies to the right of 3.24 on the scale of F.]

Continuing with the global test, the decision rule is: Do not reject the null hypothesis that all the regression coefficients are 0 if the computed value of F is less than or equal to 3.24. If the computed F is greater than 3.24, reject H0 and accept the alternate hypothesis, H1.

The value of F is found from the following equation.

GLOBAL TEST   F = (SSR/k) / (SSE/[n - (k + 1)])   [14–5]

SSR is the sum of squares regression, SSE the sum of squares error, n the number of observations, and k the number of independent variables. Inserting the heating cost example values in formula (14–5) gives:

F = (SSR/k) / (SSE/[n - (k + 1)]) = (171,220/3) / (41,695/[20 - (3 + 1)]) = 21.90

The computed value of F is 21.90, which is in the rejection region. The null hypothesis that all the multiple regression coefficients are zero is therefore rejected. This means that some of the independent variables (amount of insulation, etc.) do have the ability to explain the variation in the dependent variable (heating cost). We expected this decision. Logically, the outside temperature, the amount of insulation, and age of the furnace have a great bearing on heating costs. The global test assures us that they do.
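As a quick check on the arithmetic, the F ratio in formula (14–5) can be recomputed directly from the sums of squares quoted above (a sketch; the variable names are ours):

```python
# Global F test, formula (14-5), heating cost example.
ssr, sse = 171_220, 41_695
n, k = 20, 3

f_stat = (ssr / k) / (sse / (n - (k + 1)))

print(f"{f_stat:.2f}")   # 21.90
print(f_stat > 3.24)     # True -> reject H0
```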

Evaluating Individual Regression Coefficients

So far we have shown that at least one, but not necessarily all, of the regression coefficients is not equal to zero and thus useful for predictions. The next step is to test the independent variables individually to determine which regression coefficients may be 0 and which are not.

Why is it important to know if any of the βi's equal 0? If a βi could equal 0, it implies that this particular independent variable is of no value in explaining any variation in the dependent variable. If there are coefficients for which H0 cannot be rejected, we may want to eliminate them from the regression equation.

We will now conduct three separate tests of hypothesis—for temperature, for insulation, and for the age of the furnace.

For temperature:   H0: β1 = 0   H1: β1 ≠ 0
For insulation:    H0: β2 = 0   H1: β2 ≠ 0
For furnace age:   H0: β3 = 0   H1: β3 ≠ 0

We will test the hypotheses at the .05 level. The way the alternate hypothesis is stated indicates that the test is two-tailed.

The test statistic follows Student's t distribution with n - (k + 1) degrees of freedom. The number of sample observations is n. There are 20 homes in the study, so n = 20. The number of independent variables is k, which is 3. Thus, there are n - (k + 1) = 20 - (3 + 1) = 16 degrees of freedom.

The critical value for t is in Appendix B.2. For a two-tailed test with 16 degrees of freedom using the .05 significance level, H0 is rejected if t is less than -2.120 or greater than 2.120.

Refer to the Excel output in the previous section. (See page 525.) The column highlighted in yellow, headed Coefficients, shows the values for the multiple regression equation:

Ŷ = 427.194 - 4.583X1 - 14.831X2 + 6.101X3

Interpreting the term -4.583X1 in the equation: for each degree the temperature increases, it is expected that the heating cost will decrease about $4.58, holding the two other variables constant.

The column on the Excel output labeled Standard Error indicates the standard error of the sample regression coefficient. Recall that Salsberry Realty selected a sample of 20 homes along the east coast of the United States. If it was to select a second random sample and compute the regression coefficients of that sample, the values would not be exactly the same. If it repeated the sampling process many times, however, we could construct a sampling distribution of these regression coefficients. The column labeled Standard Error estimates the variability of these regression coefficients. The sampling distribution of Coefficients/Standard Error follows the t distribution with n - (k + 1) degrees of freedom. Hence, we are able to test the independent variables individually to determine whether the net regression coefficients differ from zero. The computed t ratio is -5.934 for temperature and -3.119 for insulation. Both of these t values are in the rejection region to the left of -2.120. Thus, we conclude that the regression coefficients for the temperature and insulation variables are not zero. The computed t for the age of the furnace is 1.521, so we conclude that β3 could equal 0. The independent variable age of the furnace is not a significant predictor of heating cost. It can be dropped from the analysis. We can test individual regression coefficients using the t distribution. The formula is:

TESTING INDIVIDUAL REGRESSION COEFFICIENTS   t = (bi - 0) / sbi   [14–6]

The bi refers to any one of the regression coefficients, and sbi refers to the standard deviation of the sampling distribution of that regression coefficient. We include 0 in the equation because the null hypothesis is βi = 0.

To illustrate this formula, refer to the test of the regression coefficient for the independent variable temperature. We let b1 refer to the regression coefficient. From the computer output on page 525 it is -4.583. sb1 is the standard deviation of the sampling distribution of the regression coefficient for the independent variable temperature. Again, from the computer output on page 525, it is 0.772. Inserting these values in formula (14–6):

t = (b1 - 0)/sb1 = (-4.583 - 0)/0.772 = -5.936

This is the value found in the t Stat column of the Excel output. [There is a slight difference due to rounding.]
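The same arithmetic in a short sketch, using the temperature coefficient and its standard error from the output (-4.583 and 0.772); the variable names are ours:

```python
# t test of an individual regression coefficient, formula (14-6).
b1, sb1 = -4.583, 0.772     # coefficient and its standard error
t = (b1 - 0) / sb1

print(f"{t:.2f}")           # -5.94
print(abs(t) > 2.120)       # True -> reject H0: beta1 = 0
```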

At this point we need to develop a strategy for deleting independent variables. In the Salsberry Realty case there were three independent variables, and one (the age of the furnace) had a regression coefficient that did not differ from 0. It is clear that we should drop that variable and rerun the regression equation. Below is the MINITAB output where heating cost is the dependent variable and outside temperature and amount of insulation are the independent variables.

Summarizing the results from this new MINITAB output:

1. The new regression equation is:

   Ŷ = 490.29 - 5.1499X1 - 14.718X2


Notice that the regression coefficients for outside temperature (X1) and amount of insulation (X2) are similar to but not exactly the same as when we included the independent variable age of the furnace. Compare the above equation to that in the Excel output on page 525. Both of the regression coefficients are negative as in the earlier equation.

2. The details of the global test are as follows:

H0: β1 = β2 = 0

H1: Not all of the βi's are 0.

The F distribution is the test statistic and there are k = 2 degrees of freedom in the numerator and n - (k + 1) = 20 - (2 + 1) = 17 degrees of freedom in the denominator. Using the .05 significance level and Appendix B.4, the decision rule is to reject H0 if F is greater than 3.59. We compute the value of F as follows:

F = (SSR/k) / (SSE/[n - (k + 1)]) = (165,195/2) / (47,721/[20 - (2 + 1)]) = 29.42

Because the computed value of F (29.42) is greater than the critical value (3.59), the null hypothesis is rejected and the alternate accepted. We conclude that at least one of the regression coefficients is different from 0.

3. The next step is to conduct a test of the regression coefficients individually. We want to find out if one or both of the regression coefficients are different from 0. The null and alternate hypotheses for each of the independent variables are:

Outside temperature:   H0: β1 = 0   H1: β1 ≠ 0
Insulation:            H0: β2 = 0   H1: β2 ≠ 0

The test statistic is the t distribution with n - (k + 1) = 20 - (2 + 1) = 17 degrees of freedom. Using the .05 significance level and Appendix B.2, the decision rule is to reject H0 if the computed value of t is less than -2.110 or greater than 2.110.

Outside temperature:  t = (b1 - 0)/sb1 = (-5.1499 - 0)/0.7019 = -7.34
Insulation:           t = (b2 - 0)/sb2 = (-14.718 - 0)/4.934 = -2.98

In both tests we reject H0 and accept H1. We conclude that each of the regression coefficients is different from 0. Both outside temperature and amount of insulation are useful variables in explaining the variation in heating costs.

In the heating cost example, it was clear which independent variable to delete. However, in some instances which variable to delete may not be as clear-cut. To explain, suppose we develop a multiple regression equation based on five independent variables. We conduct the global test and find that some of the regression coefficients are different from zero. Next, we test the regression coefficients individually and find that three are significant and two are not. The preferred procedure is to drop the single independent variable with the smallest absolute t value or largest p-value and rerun the regression equation with the four remaining variables; then, on the new regression equation with four independent variables, conduct the individual tests. If there are still regression coefficients that are not significant, again drop the variable with the smallest absolute t value. To describe the process in another way, we should delete only one variable at a time. Each time we delete a variable, we need to rerun the regression equation and check the remaining variables.
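The one-variable-at-a-time rule can be sketched as a loop. The p-values below are invented for illustration, and a real run would refit the model after each deletion rather than reuse the old p-values:

```python
# Backward elimination sketch: repeatedly drop the least significant
# variable until every remaining p-value is at or below alpha.
pvalues = {"x1": 0.001, "x2": 0.030, "x3": 0.250, "x4": 0.600, "x5": 0.040}
alpha = 0.05

while pvalues:
    worst = max(pvalues, key=pvalues.get)   # largest p-value
    if pvalues[worst] <= alpha:
        break                               # all survivors are significant
    del pvalues[worst]                      # in practice: refit the model here

print(sorted(pvalues))   # ['x1', 'x2', 'x5']
```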


This process of selecting variables to include in a regression model can be automated, using Excel, MINITAB, MegaStat, or other statistical software. Most of the software systems include methods to sequentially remove and/or add independent variables and at the same time provide estimates of the percentage of variation explained (the R-square term). Two of the common methods are stepwise regression and best subset regression. It may take a long time, but in the extreme we could compute every regression between the dependent variable and any possible subset of the independent variables.

Unfortunately, on occasion, the software may work "too hard" to find an equation that fits all the quirks of your particular data set. The suggested equation may not represent the relationship in the population. A judgment is needed to choose among the equations presented. Consider whether the results are logical. They should have a simple interpretation and be consistent with your knowledge of the application under study.

Self-Review 14–3 The regression output about eating places in Myrtle Beach is repeated below (see earlier self-reviews).

Predictor   Coef   SE Coef       T
Constant    2.50      1.50   1.667
X1          3.00     1.500   2.000
X2          4.00     3.000   1.333
X3          3.00      0.20   15.00
X4          0.20       .05    4.00
X5          1.00      1.50   0.667

Analysis of Variance

Source          DF    SS    MS
Regression       5   100    20
Residual Error  20    40     2
Total           25   140

(a) Perform a global test of hypothesis to check if any of the regression coefficients are different from 0. What do you decide? Use the .05 significance level.

(b) Do an individual test of each independent variable. Which variables would you consider eliminating? Use the .05 significance level.

(c) Outline a plan for possibly removing independent variables.

Exercises

7. Given the following regression output,

Predictor    Coef   SE Coef       T       P
Constant   84.998     1.863   45.61   0.000
X1          2.391     1.200    1.99   0.051
X2         0.4086    0.1717    2.38   0.020

Analysis of Variance

Source          DF        SS       MS     F       P
Regression       2    77.907   38.954   4.14   0.021
Residual Error  62   583.693    9.414
Total           64   661.600

answer the following questions:

a. Write the regression equation.

b. If X1 is 4 and X2 is 11, what is the value of the dependent variable?

c. How large is the sample? How many independent variables are there?

d. Conduct a global test of hypothesis to see if any of the set of regression coefficients could be different from 0. Use the .05 significance level. What is your conclusion?

e. Conduct a test of hypothesis for each independent variable. Use the .05 significance level. Which variable would you consider eliminating?

f. Outline a strategy for deleting independent variables in this case.


8. The following regression output was obtained from a study of architectural firms. The dependent variable is the total amount of fees in millions of dollars.

Predictor     Coef   SE Coef      T
Constant     7.987     2.967   2.69
X1         0.12242   0.03121   3.92
X2         0.12166   0.05353   2.27
X3         0.06281   0.03901   1.61
X4          0.5235    0.1420   3.69
X5         0.06472   0.03999   1.62

Analysis of Variance

Source          DF        SS       MS       F
Regression       5   3710.00   742.00   12.89
Residual Error  46   2647.38    57.55
Total           51   6357.38

X1 is the number of architects employed by the company.

X2 is the number of engineers employed by the company.

X3 is the number of years involved with health care projects.

X4 is the number of states in which the firm operates.

X5 is the percent of the firm’s work that is health care–related.

a. Write out the regression equation.

b. How large is the sample? How many independent variables are there?

c. Conduct a global test of hypothesis to see if any of the set of regression coefficients could be different from 0. Use the .05 significance level. What is your conclusion?

d. Conduct a test of hypothesis for each independent variable. Use the .05 significance level. Which variable would you consider eliminating?

e. Outline a strategy for deleting independent variables in this case.

Evaluating the Assumptions of Multiple Regression

In the previous section, we described the methods to statistically evaluate the multiple regression equation. The results of the test let us know if at least one of the coefficients was not equal to zero, and we described a procedure for evaluating each regression coefficient. We also discussed the decision-making process for including and excluding independent variables in the multiple regression equation.

It is important to know that the validity of the global and individual statistical tests relies on several assumptions. That is, if the assumptions are not true, the results might be biased or misleading. It should be mentioned, however, that in practice strict adherence to the following assumptions is not always possible. Fortunately, the statistical techniques discussed in this chapter appear to work well even when one or more of the assumptions are violated. Even if the values in the multiple regression equation are "off" slightly, our estimates using a multiple regression equation will be closer than any that could be made otherwise. Usually the statistical procedures are robust enough to overcome violations of some assumptions.

In Chapter 13 we listed the necessary assumptions for regression when we considered only a single independent variable. (See page 480.) The assumptions for multiple regression are similar.

1. There is a linear relationship. That is, there is a straight-line relationship between the dependent variable and the set of independent variables.


2. The variation in the residuals is the same for both large and small values of Ŷ. To put it another way, (Y - Ŷ) is unrelated to whether Ŷ is large or small.

3. The residuals follow the normal probability distribution. Recall that the residual is the difference between the actual value of Y and the estimated value Ŷ. So the term (Y - Ŷ) is computed for every observation in the data set. These residuals should approximately follow a normal probability distribution. In addition, the mean of the residuals should be 0.

4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated.

5. The residuals are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.

In this section we present a brief discussion of each of these assumptions. In addition, we provide methods to validate these assumptions and indicate the consequences if these assumptions cannot be met. For those interested in additional discussion, Kutner, Nachtsheim, and Neter, Applied Linear Regression Models, 4th ed. (McGraw-Hill, 2004), is an excellent reference.

Linear Relationship

Let’s begin with the linearity assumption. The idea is that the relationship between the set of independent variables and the dependent variable is linear. If we are considering two independent variables, we can visualize this assumption. The two independent variables and the dependent variable would form a three-dimensional space. The regression equation would then form a plane as shown on page 513. We can evaluate this assumption with scatter diagrams and residual plots.

Using Scatter Diagrams  The evaluation of a multiple regression equation should always include a scatter diagram that plots the dependent variable against each independent variable. These graphs help us to visualize the relationships and provide some initial information about the direction (positive or negative), linearity, and strength of the relationship. For example, the scatter diagrams for the home heating example follow. The plots suggest a fairly strong negative, linear relationship between heating cost and temperature, and a negative relationship between heating cost and insulation.

Using Residual Plots  Recall that a residual (Y - Ŷ) can be computed using the multiple regression equation for each observation in a data set. In Chapter 13, we discussed the idea that the best regression line passed through the center of the data in a scatter plot. In this case, you would find a good number of the observations above the regression line (these residuals would have a positive sign), and a good number of the observations below the line (these residuals would have a negative sign). Further, the observations would be scattered above and below the line over the entire range of the independent variable.

The same concept is true for multiple regression, but we cannot graphically portray the multiple regression. However, plots of the residuals can help us evaluate the linearity of the multiple regression equation. To investigate, the residuals are plotted on the vertical axis against the fitted values, Ŷ. The graph on the left below shows the residual plots for the home heating cost example. Notice the following:

The residuals are plotted on the vertical axis and are centered around zero. There are both positive and negative residuals.

The residual plots show a random distribution of positive and negative values across the entire range of the variable plotted on the horizontal axis.

The points are scattered and there is no obvious pattern, so there is no reason to doubt the linearity assumption.

This plot supports the assumption of linearity.

[Graph: Residuals Versus the Fitted Values (response in Units). Residuals from -50 to 50 on the vertical axis; fitted values from -50 to 200 on the horizontal axis.]

If there is a pattern to the points in the scatter plot, further investigation is necessary. The points in the graph on the right above show nonrandom residuals. See that the residual plot does not show a random distribution of positive and negative values across the entire range of the variable plotted on the horizontal axis. In fact, the graph shows a curvature to the residual plots. This indicates the relationship may not be linear. In this case perhaps the equation is quadratic, indicating that the square of one of the variables is needed. We discussed this possibility in Chapter 13.
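A crude numeric version of the visual check: with made-up fitted values and residuals, linearity is plausible when the residuals average near zero and mix positive and negative signs across the fitted range. The data below are invented for illustration.

```python
# Hypothetical residual-plot data for a rough linearity check.
fitted = [30, 55, 80, 100, 130, 160, 185, 210]
residuals = [5, -8, 3, -4, 7, -6, 2, -1]

mean_resid = sum(residuals) / len(residuals)
signs_mixed = any(r > 0 for r in residuals) and any(r < 0 for r in residuals)

print(f"{mean_resid:.2f}")   # -0.25
print(signs_mixed)           # True
```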

Variation in Residuals Same for Large and Small Ŷ Values

This requirement indicates that the variation about the predicted values is constant, regardless of whether the predicted values are large or small. To cite a specific example, which may violate the assumption, suppose we use the single independent variable age to explain the variation in income. We suspect that as age increases so does salary, but it also seems reasonable that as age increases there may be more variation around the regression line. That is, there will likely be more variation in income for a 50-year-old person than for a 35-year-old person. The requirement for constant variation around the regression line is called homoscedasticity.

HOMOSCEDASTICITY The variation around the regression equation is the same for all of the values of the independent variables.

To check for homoscedasticity the residuals are plotted against the fitted values of Y. This is the same graph that we used to evaluate the assumption of linearity. (See page 532.) Based on the scatter diagram in that software output, it is reasonable to conclude that this assumption has not been violated.

Distribution of Residuals

To be sure that the inferences that we make in the global and individual hypothesis tests are valid, we evaluate the distribution of residuals. Ideally, the residuals should follow a normal probability distribution.

To evaluate this assumption, we can organize the residuals into a frequency distribution. The MINITAB histogram of the residuals is shown following on the left for the home heating cost example. Although it is difficult to show that the residuals follow a normal distribution with only 20 observations, it does appear the normality assumption is reasonable.

Both MINITAB and Excel offer another graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram. Without detailing the calculations, the normal probability plot supports the assumption of normally distributed residuals if the plotted points are fairly close to a straight line drawn from the lower left to the upper right of the graph.

In this case, both graphs support the assumption that the residuals follow the normal probability distribution. Therefore, the inferences that we made based on the global and individual hypothesis tests are supported by the results of this evaluation.
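The idea behind the normal probability plot can be sketched with the standard library: pair each sorted residual with a standard-normal quantile and look for a roughly straight line. The residuals below are invented for illustration.

```python
from statistics import NormalDist

# Normal probability plot coordinates: sorted residuals versus
# standard-normal quantiles at plotting positions (i - 0.5) / n.
residuals = sorted([5, -8, 3, -4, 7, -6, 2, -1, 0, 4])
n = len(residuals)
quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

for q, r in zip(quantiles, residuals):
    print(f"{q:6.2f}  {r:3d}")   # plot these pairs; near-linear -> normality OK
```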

Multicollinearity

Multicollinearity exists when independent variables are correlated. Correlated independent variables make it difficult to make inferences about the individual regression coefficients and their individual effects on the dependent variable. In practice, it is nearly impossible to select variables that are completely unrelated. To put it another way, it is nearly impossible to create a set of independent variables that are not correlated to some degree. However, a general understanding of the issue of multicollinearity is important.

First, we should point out that multicollinearity does not affect a multiple regression equation’s ability to predict the dependent variable. However, when we are interested in evaluating the relationship between each independent variable and the dependent variable, multicollinearity may show unexpected results.

For example, if we use two highly correlated independent variables, high school GPA and high school class rank, to predict the GPA of incoming college freshmen (the dependent variable), we would expect that both independent variables would be positively related to the dependent variable. However, because the independent variables are highly correlated, one of the independent variables may have an unexpected and inexplicable negative sign. In essence, these two independent variables are redundant in that they explain the same variation in the dependent variable.

A second reason for avoiding correlated independent variables is they may lead to erroneous results in the hypothesis tests for the individual independent variables. This is due to the instability of the standard error of estimate. Several clues that indicate problems with multicollinearity include the following:

1. An independent variable known to be an important predictor ends up having a regression coefficient that is not significant.

2. A regression coefficient that should have a positive sign turns out to be negative, or vice versa.

3. When an independent variable is added or removed, there is a drastic change in the values of the remaining regression coefficients.

In our evaluation of a multiple regression equation, an approach to reducing the effects of multicollinearity is to carefully select the independent variables that are included in the regression equation. A general rule is that if the correlation between two independent variables is between -0.70 and 0.70, there likely is not a problem using both of the independent variables. A more precise test is to use the variance inflation factor, usually written VIF. The value of VIF is found as follows:

VARIANCE INFLATION FACTOR   VIF = 1 / (1 - R²j)   [14–7]

The term R²j refers to the coefficient of determination, where the selected independent variable is used as a dependent variable and the remaining independent variables are used as independent variables. A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis. The following example will explain the details of finding the VIF.

Example

Refer to the data in Table 14–1, which relates the heating cost to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity? Find and interpret the variance inflation factor for each of the independent variables.


Solution


We begin by using the MINITAB system to find the correlation matrix for the dependent variable and the three independent variables. A portion of that output follows:

          Cost     Temp    Insul
Temp    -0.812
Insul   -0.257   -0.103
Age      0.537   -0.486    0.064

Cell Contents: Pearson correlation

None of the correlations among the independent variables exceed 0.70 or −0.70, so we do not suspect problems with multicollinearity. The largest correlation among the independent variables is −0.486, between age and temperature.

To confirm this conclusion we compute the VIF for each of the three independent variables. We will consider the independent variable temperature first. We use MINITAB to find the multiple coefficient of determination with temperature as the dependent variable and amount of insulation and age of the furnace as independent variables. The relevant MINITAB output follows.

Regression Analysis: Temp versus Insul, Age

The regression equation is
Temp = 58.0 + 0.51 Insul - 2.51 Age

Predictor     Coef    SE Coef      T      P    VIF
Constant     57.99      12.35   4.70  0.000
Insul         0.509      1.488   0.34  0.737  1.0
Age          -2.509      1.103  -2.27  0.036  1.0

S = 16.0311   R-Sq = 24.1%   R-Sq(adj) = 15.2%

Analysis of Variance
Source          DF      SS     MS     F      P
Regression       2  1390.3  695.1  2.70  0.096
Residual Error  17  4368.9  257.0
Total           19  5759.2

The coefficient of determination is .241, so inserting this value into the VIF formula:

VIF = 1/(1 − R1²) = 1/(1 − .241) = 1.32

The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.

Again, to find the VIF for insulation we would develop a regression equation with insulation as the dependent variable and temperature and age of furnace as independent variables. For this equation we would determine the coefficient of determination. This would be the value for R2². We would substitute this value in formula (14–7) and solve for VIF.

Fortunately, MINITAB will generate the VIF values for each of the independent variables. These values are reported in the right-hand column under the heading VIF of the MINITAB output. Both of these values are 1.0, hence we conclude there is not a problem with multicollinearity in this example.
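The VIF computation described above can be sketched in Python. This is an illustrative helper (the function name is ours, not from the text): regress each independent variable on the remaining ones, take that R², and apply formula (14–7).

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of the predictor matrix X.

    Regress column j on the remaining columns (plus an intercept),
    compute the R-squared of that fit, and return 1 / (1 - R^2),
    which is formula (14-7).
    """
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_sq = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r_sq)

# Orthogonal predictors: R^2 of one on the other is 0, so VIF is exactly 1.
X = np.column_stack([[1.0, -1, 1, -1], [1.0, 1, -1, -1]])
print(round(vif(X, 0), 3))  # 1.0
```

With nearly collinear columns the same function returns a very large VIF, which is the signal that one of the variables is redundant.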

Independent Observations

The fifth assumption about regression and correlation analysis is that successive residuals should be independent. This means that there is not a pattern to the residuals, the residuals are not highly correlated, and there are not long runs of positive or negative residuals. When successive residuals are correlated we refer to this condition as autocorrelation.

Statistics in Action

In recent years, multiple regression has been used in a variety of legal proceedings. It is particularly useful in cases alleging discrimination by gender or race. As an example, suppose that a woman alleges that Company X’s wage rates are unfair to women. To support the claim, the plaintiff produces data showing that, on the average, women earn less than men. In response, Company X argues that its wage rates are based on experience, training, and skill and that its female employees, on the average, are younger and less experienced than the male employees. In fact, the company might further argue that the current situation is actually due to its recent successful efforts to hire more women.

Autocorrelation frequently occurs when the data are collected over a period of time. For example, we wish to predict yearly sales of Ages Software, Inc., based on the time and the amount spent on advertising. The dependent variable is yearly sales and the independent variables are time and amount spent on advertising. It is likely that for a period of time the actual points will be above the regression plane (remember there are two independent variables) and then for a period of time the points will be below the regression plane. The graph

below shows the residuals plotted on the vertical axis and the fitted values Ŷ on the horizontal axis. Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.

[Figure: residuals (Y − Ŷ) plotted against the fitted values Ŷ, showing a run of positive residuals followed by a run of negative residuals]

There is a test for autocorrelation, called the Durbin-Watson test. We present the details of this test in Chapter 16.
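As a preview, the Durbin-Watson statistic can be computed directly from the residuals. The sketch below uses the standard formula, ahead of the fuller treatment in Chapter 16:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: the sum of squared differences between
    successive residuals divided by the sum of squared residuals.
    Values near 2 suggest independent residuals; values near 0 suggest
    positive autocorrelation (long runs of same-sign residuals);
    values near 4 suggest negative autocorrelation.
    """
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# A run of positive residuals followed by a run of negative residuals,
# as in the plot described above, drives the statistic well below 2.
print(durbin_watson([3, 2, 1, -1, -2, -3]))
```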

Qualitative Independent Variables

In the previous example regarding heating cost, the two independent variables outside temperature and insulation were quantitative; that is, numerical in nature. Frequently we wish to use nominal-scale variables—such as gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team—in our analysis. These are called qualitative variables because they describe a particular quality, such as male or female. To use a qualitative variable in regression analysis, we use a scheme of dummy variables in which one of the two possible conditions is coded 0 and the other 1.

DUMMY VARIABLE A variable in which there are only two possible outcomes. For analysis, one of the outcomes is coded a 1 and the other a 0.

For example, we might be interested in estimating an executive’s salary on the basis of years of job experience and whether he or she graduated from college. “Graduation from college” can take on only one of two conditions: yes or no. Thus, it is considered a qualitative variable.

Suppose in the Salsberry Realty example that the independent variable “garage” is added. For those homes without an attached garage, 0 is used; for homes with an attached garage, a 1 is used. We will refer to the “garage” variable as X4. The data from Table 14–2 are entered into the MINITAB system.


TABLE 14–2 Home Heating Costs, Temperature, Insulation, and Presence of a Garage for a Sample of 20 Homes

Cost,    Temperature,  Insulation,  Garage,
Y        X1            X2           X4
\$250     35             3           0
 360     29             4           1
 165     36             7           0
  43     60             6           0
  92     65             5           0
 200     30             5           0
 355     10             6           1
 290      7            10           1
 230     21             9           0
 120     55             2           0
  73     54            12           0
 205     48             5           1
 400     20             5           1
 320     39             4           1
  72     60             8           0
 272     20             5           1
  94     58             7           0
 190     40             8           1
 235     27             9           0
 139     30             7           0

The output from MINITAB is:

What is the effect of the garage variable? Should it be included in the analysis? To show the effect of the variable, suppose we have two houses exactly alike next to each other in Buffalo, New York; one has an attached garage, and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Buffalo is 20 degrees. For the house without an attached garage, a 0 is substituted for X4 in the regression equation. The estimated heating cost is \$280.90, found by:

Ŷ = 394 − 3.96X1 − 11.3X2 + 77.4X4
  = 394 − 3.96(20) − 11.3(3) + 77.4(0) = 280.90

For the house with an attached garage, a 1 is substituted for X4 in the regression equation. The estimated heating cost is \$358.30, found by:

Ŷ = 394 − 3.96X1 − 11.3X2 + 77.4X4
  = 394 − 3.96(20) − 11.3(3) + 77.4(1) = 358.30

The difference between the estimated heating costs is \$77.40 (\$358.30 − \$280.90). Hence, we can expect the cost to heat a house with an attached garage to be \$77.40 more than the cost for an equivalent house without a garage.
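The two predictions can be verified with a few lines of Python; the coefficients are those of the fitted equation reported above, with the 0/1 garage dummy as the last argument:

```python
def heating_cost(temp, insul, garage):
    """Estimated heating cost from the fitted equation
    Yhat = 394 - 3.96*X1 - 11.3*X2 + 77.4*X4,
    where garage (X4) is the 0/1 dummy variable."""
    return 394 - 3.96 * temp - 11.3 * insul + 77.4 * garage

no_garage = heating_cost(20, 3, 0)
with_garage = heating_cost(20, 3, 1)
print(round(no_garage, 2), round(with_garage, 2))  # 280.9 358.3
print(round(with_garage - no_garage, 2))           # 77.4
```

Because the dummy enters the equation linearly, its coefficient (77.4) is exactly the estimated cost difference between otherwise identical homes.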

We have shown the difference between the two types of homes to be \$77.40, but is the difference significant? We conduct the following test of hypothesis.

H0: β4 = 0
H1: β4 ≠ 0

The information necessary to answer this question is on the MINITAB output above. The net regression coefficient for the independent variable garage is 77.43, and the standard deviation of the sampling distribution is 22.78. We identify this as the fourth independent variable, so we use a subscript of 4. Finally, we insert these values in formula (14–6).

t = (b4 − 0)/sb4 = (77.43 − 0)/22.78 = 3.40

There are three independent variables in the analysis, so there are n − (k + 1) = 20 − (3 + 1) = 16 degrees of freedom. The critical value from Appendix B.2 is 2.120. The decision rule, using a two-tailed test and the .05 significance level, is to reject H0 if the computed t is to the left of −2.120 or to the right of 2.120. Since the computed value of 3.40 is to the right of 2.120, the null hypothesis is rejected. It is concluded that the regression coefficient is not zero. The independent variable garage should be included in the analysis.

Is it possible to use a qualitative variable with more than two possible outcomes? Yes, but the coding scheme becomes more complex and will require a series of dummy variables. To explain, suppose a company is studying its sales as they relate to advertising expense by quarter for the last 5 years. Let sales be the dependent variable and advertising expense be the first independent variable, X1. To include the qualitative information regarding the quarter, we use three additional independent variables. For the variable X2, the five observations referring to the first quarter of each of the 5 years are coded 1 and the other quarters 0. Similarly, for X3 the five observations referring to the second quarter are coded 1 and the other quarters 0. For X4 the five observations referring to the third quarter are coded 1 and the other quarters 0. An observation that does not refer to any of the first three quarters must refer to the fourth quarter, so a distinct independent variable referring to this quarter is not necessary.
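The quarterly coding scheme can be written as a small helper (a hypothetical function, for illustration only): three dummy variables cover quarters 1 through 3, and the fourth quarter is the all-zeros baseline.

```python
def quarter_dummies(quarter):
    """Code quarter 1, 2, or 3 as a 1 in the matching dummy variable
    (X2, X3, X4); quarter 4 is the baseline, coded all zeros."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    return [1 if quarter == q else 0 for q in (1, 2, 3)]

print(quarter_dummies(1))  # [1, 0, 0]
print(quarter_dummies(4))  # [0, 0, 0]
```

Using all four dummies at once would make the columns sum to the intercept column, which is exactly the multicollinearity problem discussed earlier; hence one category is always left out.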

Stepwise Regression

In our heating cost example (see sample information in Table 14–1 and Table 14–2) we considered four independent variables: the mean outside temperature, the amount of insulation in the home, the age of the furnace, and whether or not there was an attached garage. To obtain the equation, we first ran a global or “all at once” test to determine if any of the regression coefficients were significant. When we found at least one to be significant, we tested the regression coefficients individually to determine which were important. We left out the independent variables that did not have significant regression coefficients and kept the others. By retaining the independent variables with significant coefficients, we found the regression equation that used the fewest independent variables. This made the regression equation easy to interpret and explained as much variation in the dependent variable as possible.

We are now going to describe a technique called stepwise regression, which is more efficient in building the regression equation.

STEPWISE REGRESSION A step-by-step method to determine a regression equation that begins with a single independent variable and adds or deletes independent variables one by one. Only independent variables with nonzero regression coefficients are included in the regression equation.

In the stepwise method, we develop a sequence of equations. The first equation contains only one independent variable. However, this independent variable is the one from the set of proposed independent variables that explains the most variation in the dependent variable. Stated differently, if we compute all the simple correlations between each independent variable and the dependent variable, the stepwise method first selects the independent variable with the strongest correlation with the dependent variable.

Next, the stepwise method looks at the remaining independent variables and then selects the one that will explain the largest percentage of the variation yet unexplained. We continue this process until all the independent variables with significant regression coefficients are included in the regression equation. The advantages to the stepwise method are:

1.Only independent variables with significant regression coefficients are entered into the equation.

2.The steps involved in building the regression equation are clear.

3.It is efficient in finding the regression equation with only significant regression coefficients.

4.The changes in the multiple standard error of estimate and the coefficient of determination are shown.
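The selection loop described above can be sketched in Python. This is a simplified illustration with hypothetical function names: it adds the variable giving the largest R² gain and stops when the gain is small, whereas the chapter's stepwise procedure tests the significance of each coefficient at every step.

```python
import numpy as np

def r_squared(X_cols, y):
    """R^2 of an ordinary least squares fit of y on the given columns
    plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + X_cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection: repeatedly add the column that raises
    R^2 the most, stopping when the gain drops below min_gain.
    (A fixed R^2 threshold stands in for the significance tests used
    by the stepwise procedure in the text.)"""
    chosen, best = [], 0.0
    while True:
        gains = {j: r_squared([X[:, k] for k in chosen + [j]], y) - best
                 for j in range(X.shape[1]) if j not in chosen}
        if not gains:
            break
        j = max(gains, key=gains.get)
        if gains[j] < min_gain:
            break
        chosen.append(j)
        best += gains[j]
        # a full stepwise routine would also re-test variables already
        # entered and drop any that become insignificant
    return chosen

# Column 0 determines y exactly; column 1 is noise and never enters.
X = np.column_stack([[1.0, 2, 3, 4, 5, 6], [1.0, -1, 1, -1, 1, -1]])
y = 2.0 * X[:, 0]
print(forward_select(X, y))  # [0]
```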

The stepwise MINITAB output for the heating cost problem follows. Note that the final equation, which is reported in the column labeled 3, includes the independent variables temperature, garage, and insulation. These are the same independent variables that were included in our equation using the global test and the test for individual independent variables. (See page 537.) The independent variable age, for age of the furnace, is not included because it is not a significant predictor of cost.


Reviewing the steps and interpreting output:

1.The stepwise procedure selects the independent variable temperature first. This variable explains more of the variation in heating cost than any of the other three proposed independent variables. Temperature explains 65.85 percent of the variation in heating cost. The regression equation is:

Ŷ = 388.8 − 4.93X1

There is an inverse relationship between heating cost and temperature. For each degree the temperature increases, heating cost is reduced by \$4.93.

2.The next independent variable to enter the regression equation is garage. When this variable is added to the regression equation, the coefficient of determination is increased from 65.85 percent to 80.46 percent. That is, by adding garage as an independent variable, we increase the coefficient of determination by 14.61 percent. The regression equation after step 2 is:

Ŷ = 300.3 − 3.56X1 + 93.0X2

Usually the regression coefficients will change from one step to the next. In this case the coefficient for temperature retained its negative sign, but it changed from −4.93 to −3.56. This change reflects the added influence of the independent variable garage. Why did the stepwise method select the independent variable garage instead of either insulation or age? The increase in R², the coefficient of determination, is larger if garage is included rather than either of the other two variables.

3.At this point there are two unused variables remaining, insulation and age. Notice on the third step the procedure selects insulation and then stops. This indicates the variable insulation explains more of the remaining variation in heating cost than the age variable does. After the third step, the regression equation is:

Ŷ = 393.7 − 3.96X1 + 77.0X2 − 11.3X3

At this point 86.98 percent of the variation in heating cost is explained by the three independent variables temperature, garage, and insulation. This is the same R² value and regression equation we found on page 537 except for rounding differences.

4.At this point the stepwise procedure stops. This means the independent variable age does not add significantly to the coefficient of determination.

The stepwise method developed the same regression equation, selected the same independent variables, and found the same coefficient of determination as the global and individual tests described earlier in the chapter. The advantage of the stepwise method is that it is more direct than using a combination of the global and individual procedures.

Other methods of variable selection are available. The stepwise method is also called the forward selection method because we begin with no independent variables and add one independent variable to the regression equation at each iteration. There is also the backward elimination method, which begins with the entire set of variables and eliminates one independent variable at each iteration.

The methods described so far look at one variable at a time and decide whether to include or eliminate that variable. Another approach is the best-subset regression. With this method we look at the best model using one independent variable, the best model using two independent variables, the best model with three, and so on. The criterion is to find the model with the largest R² value, regardless of the number of independent variables. Also, each independent variable does not necessarily have a nonzero regression coefficient. Since each independent variable could either be included or not included, there are 2^k − 1 possible models, where k refers to the number of independent variables. In our heating cost example we considered four independent variables, so there are 15 possible regression models, found by 2^4 − 1 = 16 − 1 = 15. We would examine all regression models using one independent variable, all combinations using two variables, all combinations using three independent variables, and the possibility of using all four independent variables. The advantage of the best-subset method is that it may examine combinations of independent variables not considered in the stepwise method. The process is available in MINITAB and MegaStat.
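The model count can be checked by enumerating the subsets directly; this small illustration is not part of the text:

```python
from itertools import combinations

predictors = ["temperature", "insulation", "age", "garage"]

# Every nonempty subset of the k = 4 predictors is a candidate model:
# 2**4 - 1 = 15 of them, matching the count in the text.
models = [c for r in range(1, len(predictors) + 1)
          for c in combinations(predictors, r)]
print(len(models))  # 15
```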

Regression Models with Interaction

In Chapter 12 we discussed interaction among independent variables. To explain, suppose we are studying weight loss and assume, as the current literature suggests, that diet and exercise are related. So the dependent variable is amount of change in weight and the independent variables are: diet (yes or no) and exercise (none, moderate, significant). We are interested in whether there is interaction among the independent variables. That is, if those studied maintain their diet and exercise significantly, will that increase the mean amount of weight lost? Is total weight loss more than the sum of the loss due to the diet effect and the loss due to the exercise effect?

We can expand on this idea. Instead of having two nominal-scale variables, diet and exercise, we can examine the effect (interaction) of several ratio-scale variables. For example, suppose we want to study the effect of room temperature (68, 72, 76, or 80 degrees Fahrenheit) and noise level (60, 70, or 80 decibels) on the number of units produced. To put it another way, does the combination of noise level in the room and the temperature of the room have an effect on the productivity of the workers? Would the workers in a quiet, cool room produce more units than those in a hot, noisy room?

In regression analysis, interaction can be examined as a separate independent variable. An interaction prediction variable can be developed by multiplying the data values in one independent variable by the values in another independent variable, thereby creating a new independent variable. A two-variable model that includes an interaction term is:

Ŷ = a + b1X1 + b2X2 + b3X1X2

The term X1X2 is the interaction term. We create this variable by multiplying the values of X1 and X2 to create the third independent variable. We then develop a regression equation using the three independent variables and test the significance of the third independent variable using the individual test for independent variables, described earlier in the chapter. An example will illustrate the details.
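Creating the interaction variable is a single elementwise product. Below, the temperature and insulation values for the first six sampled homes are taken from the heating cost data (the first home has 35 degrees and 3 inches, matching the worked example that follows):

```python
import numpy as np

# Temperature (X1) and insulation (X2) for the first six sampled homes.
x1 = np.array([35.0, 29, 36, 60, 65, 30])
x2 = np.array([3.0, 4, 7, 6, 5, 5])

# The interaction variable is simply X1 * X2, home by home.
x1x2 = x1 * x2
print(x1x2[:2])  # first home: 35 * 3 = 105
```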

Example

Refer to the heating cost example and the data in Table 14–1. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of savings from warmer temperature and the savings from increased insulation separately?

Solution

The information from Table 14–1 for the independent variables temperature and insulation is repeated below. We create the interaction variable by multiplying the temperature variable by the insulation variable. For the first sampled home the value of temperature is 35 degrees and insulation is 3 inches, so the value of the interaction variable is 35 × 3 = 105. The values of the other interaction products are found in a similar fashion.

We find the multiple regression using temperature, insulation, and the interaction of temperature and insulation as independent variables. The regression equation is reported below.

Ŷ = 598.070 − 7.811X1 − 30.161X2 + 0.385X1X2

The question we wish to answer is whether the interaction variable is significant. We will use the .05 significance level. In terms of a hypothesis:

H0: β3 = 0
H1: β3 ≠ 0

There are n − (k + 1) = 20 − (3 + 1) = 16 degrees of freedom. Using the .05 significance level and a two-tailed test, the critical values of t are −2.120 and 2.120. We reject the null hypothesis if t is less than −2.120 or t is greater than 2.120. From the output, b3 = 0.385 and sb3 = 0.291. To find the value of t we use formula (14–6).

t = (b3 − 0)/sb3 = (0.385 − 0)/0.291 = 1.324

Because the computed value of 1.324 is less than the critical value of 2.120, we do not reject the null hypothesis. We conclude that there is not a significant interaction between temperature and insulation.
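The computation can be verified directly with formula (14–6) and the reported coefficient and standard error:

```python
def t_statistic(b, s_b):
    """Individual coefficient test, formula (14-6): t = (b - 0) / s_b."""
    return (b - 0) / s_b

t = t_statistic(0.385, 0.291)
print(round(t, 3))  # 1.323
# The statistic falls inside the critical values of -2.120 and 2.120
# for 16 degrees of freedom, so H0 is not rejected.
```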

There are other situations that can occur when studying interaction among independent variables.

1.It is possible to have a three-way interaction among the independent variables. In our heating example, we might have considered the three-way interaction between temperature, insulation, and age of the furnace.

2.It is possible to have an interaction where one of the independent variables is nominal scale. In our heating cost example, we could have studied the interaction between temperature and garage.

Studying all possible interactions can become very complex. However, careful consideration of possible interactions among independent variables can often provide useful insight into the regression models.


Self-Review 14–4 A study by the American Realtors Association investigated the relationship between the commissions earned by sales associates last year and the number of months since the associates earned their real estate licenses. Also of interest in the study is the gender of the sales associate. Below is a portion of the regression output. The dependent variable is commissions, which is reported in \$000, and the independent variables are months since the license was earned and gender (female = 1 and male = 0).

Regression Analysis

R²            0.642        n          20
Adjusted R²   0.600        k           2
R             0.801        Dep. Var.  Commissions
Std. Error    3.219

ANOVA table
Source       SS          df    MS          F        p-value
Regression   315.9291     2    157.9645    15.25    .0002
Residual     176.1284    17     10.3605
Total        492.0575    19

Regression output
Variables    coefficients   std. error   t (df = 17)   p-value   95% lower   95% upper
Intercept    15.7625        3.0782       5.121         .0001      9.2680     22.2570
Months        0.4415        0.0839       5.263         .0001      0.2645      0.6186
Gender        3.8598        1.4724       2.621         .0179      0.7533      6.9663

(a)Write out the regression equation. How much commission would you expect a female agent who earned her license 30 months ago to make?

(b)Do the female agents on the average make more or less than the male agents? How much more?

(c)Conduct a test of hypothesis to determine if the independent variable gender should be included in the analysis. Use the 0.05 significance level. What is your conclusion?

Exercises

9.The production manager of High Point Sofa and Chair, a large furniture manufacturer located in North Carolina, is studying the job performance ratings of a sample of 15 electrical repairmen employed by the company. An aptitude test is required by the human resources department to become an electrical repairman. The production manager was able to get the score for each repairman in the sample. In addition, he determined which of the repairmen were union members (code 1) and which were not (code 0). The sample information is reported below.

Worker        Job Performance Score   Aptitude Test Score   Union Membership
Abbott        58                       5                    0
Anderson      53                       4                    0
Bender        33                      10                    0
Bush          97                      10                    0
Center        36                       2                    0
Coombs        83                       7                    0
Eckstine      67                       6                    0
Gloss         84                       9                    0
Herd          98                       9                    1
Householder   45                       2                    1
Iori          97                       8                    1
Lindstrom     90                       6                    1
Mason         96                       7                    1
Pierse        66                       3                    1
Rohde         82                       6                    1

a.Use a statistical software package to develop a multiple regression equation using the job performance score as the dependent variable and aptitude test score and union membership as independent variables.

b.Comment on the regression equation. Be sure to include the coefficient of determination and the effect of union membership. Are these two variables effective in explaining the variation in job performance?

c.Conduct a test of hypothesis to determine if union membership should be included as an independent variable.

d.Repeat the analysis considering possible interaction terms.

10.Cincinnati Paint Company sells quality brands of paints through hardware stores throughout the United States. The company maintains a large sales force whose job it is to call on existing customers as well as look for new business. The national sales manager is investigating the relationship between the number of sales calls made and the miles driven by the sales representative. Also, do the sales representatives who drive the most miles and make the most calls necessarily earn the most in sales commissions? To investigate, the vice president of sales selected a sample of 25 sales representatives and determined:

The amount earned in commissions last month (Y ).

The number of miles driven last month (X1)

The number of sales calls made last month (X2)

The information is reported below.

Commissions (\$000)  Calls  Driven      Commissions (\$000)  Calls  Driven
22                   139    2,371       38                   146    3,290
13                   132    2,226       44                   144    3,103
33                   144    2,731       29                   147    2,122
38                   142    3,351       38                   144    2,791
23                   142    2,289       37                   149    3,209
47                   142    3,449       14                   131    2,287
29                   138    3,114       34                   144    2,848
38                   139    3,342       25                   132    2,690
41                   144    2,842       27                   132    2,933
32                   134    2,625       25                   127    2,671
20                   135    2,121       43                   154    2,988
13                   137    2,219       34                   147    2,829
47                   146    3,463

Develop a regression equation including an interaction term. Is there a significant inter- action between the number of sales calls and the miles driven?

11.An art collector is studying the relationship between the selling price of a painting and two independent variables. The two independent variables are the number of bidders at the particular auction and the age of the painting, in years. A sample of 25 paintings revealed the following sample information.

Painting  Auction Price  Bidders  Age      Painting  Auction Price  Bidders  Age
 1        3,470          10        67      14        4,020           6        79
 2        3,500           8        56      15        4,190           4        83
 3        3,700           7        73      16        4,130           3        71
 4        3,860           4        71      17        4,130           9        89
 5        3,920          12        99      18        4,370           5       103
 6        3,900          10        87      19        4,450           3       106
 7        3,830          11        78      20        4,390           8        93
 8        3,940           8        83      21        4,380           8        88
 9        3,880          13        90      22        4,540           4        96
10        3,940          13        98      23        4,660           5        94
11        4,200           0        91      24        4,710           3        88
12        4,060           7        93      25        4,880           1        84
13        4,200           2        97


a. Develop a multiple regression equation using the independent variables number of bidders and age of painting to estimate the dependent variable auction price. Discuss the equation. Does it surprise you that there is an inverse relationship between the number of bidders and the price of the painting?

b. Create an interaction variable and include it in the regression equation. Explain the meaning of the interaction. Is this variable significant?

c.Use the stepwise method and the independent variables for the number of bidders, the age of the painting, and the interaction between the number of bidders and the age of the painting. Which variables would you select?

12.A real estate developer wishes to study the relationship between the size of home a client will purchase (in square feet) and other variables. Possible independent variables include the family income, family size, whether there is a senior adult parent living with the family (1 for yes, 0 for no), and the total years of education beyond high school for the husband and wife. The sample information is reported below.

      Square   Income    Family   Senior
      Feet     (000s)    Size     Parent   Education
 1    2,240     60.8     2        0         4
 2    2,380     68.4     2        1         6
 3    3,640    104.5     3        0         7
 4    3,360     89.3     4        1         0
 5    3,080     72.2     4        0         2
 6    2,940    114       3        1        10
 7    4,480    125.4     6        0         6
 8    2,520     83.6     3        0         8
 9    4,200    133       5        0         2
10    2,800     95       3        0         6

Develop an appropriate multiple regression equation. Which independent variables would you include in the final regression equation? Use the stepwise method.

Chapter Summary

I.The general form of a multiple regression equation is:

Ŷ = a + b1X1 + b2X2 + . . . + bkXk    [14–1]

where a is the Y-intercept when all X’s are zero, bi refers to the sample regression coefficients, and Xi refers to the value of the various independent variables.

A. There can be any number of independent variables.

B. The least squares criterion is used to develop the regression equation.

C. A statistical software package is needed to perform the calculations.

II.There are two measures of the effectiveness of the regression equation.

A. The multiple standard error of estimate is similar to the standard deviation.

1.It is measured in the same units as the dependent variable.

2.It is based on squared deviations from the regression equation.

3.It ranges from 0 to plus infinity.

4.It is calculated from the following equation.

sY·123...k = √[ Σ(Y − Ŷ)² / (n − (k + 1)) ]    [14–2]

B. The coefficient of multiple determination reports the percent of the variation in the dependent variable explained by the set of independent variables.

1.It may range from 0 to 1.

2.It is also based on squared deviations from the regression equation.


3.It is found by the following equation.

R² = SSR / SS total    [14–3]

4.When the number of independent variables is large, we adjust the coefficient of determination for the degrees of freedom as follows.

R²adj = 1 − [SSE / (n − (k + 1))] / [SS total / (n − 1)]    [14–4]
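As a numerical check, formulas (14–3) and (14–4) reproduce the R² and adjusted R² reported in the Self-Review 14–4 output (SSR = 315.9291, SSE = 176.1284, SS total = 492.0575, n = 20, k = 2):

```python
def r2_and_adjusted(ssr, sse, ss_total, n, k):
    """Coefficient of determination (14-3) and its adjusted form (14-4)."""
    r2 = ssr / ss_total
    r2_adj = 1 - (sse / (n - (k + 1))) / (ss_total / (n - 1))
    return r2, r2_adj

# Values from the Self-Review 14-4 regression output.
r2, r2_adj = r2_and_adjusted(315.9291, 176.1284, 492.0575, 20, 2)
print(round(r2, 3), round(r2_adj, 3))  # 0.642 0.6
```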

III.An ANOVA table summarizes the multiple regression analysis.

A.It reports the total amount of the variation in the dependent variable and divides this variation into that explained by the set of independent variables and that not explained.

B.It reports the degrees of freedom associated with the independent variables, the error variation, and the total variation.

IV. A correlation matrix shows all possible simple correlation coefficients between pairs of variables.

A. It shows the correlation between each independent variable and the dependent variable.

B. It shows the correlation between each pair of independent variables.
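A correlation matrix like the one described here can be sketched with NumPy (hypothetical data; an editorial illustration):

```python
import numpy as np

# Hypothetical sample: dependent variable y and two independent variables
y  = [10.0, 12.0, 15.0, 11.0, 18.0]
x1 = [ 1.0,  2.0,  4.0,  2.0,  6.0]
x2 = [ 5.0,  4.0,  3.0,  6.0,  2.0]

# Every pairwise simple correlation coefficient in one symmetric matrix
r = np.corrcoef([y, x1, x2])
```

Row 0 holds the correlations of each independent variable with y; the remaining off-diagonal entries are the correlations between pairs of independent variables.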

V.A global test is used to investigate whether any of the independent variables have significant regression coefficients.

A.The null hypothesis is: All the regression coefficients are zero.

B.The alternate hypothesis is: At least one regression coefficient is not zero.

C.The test statistic is the F distribution with k (the number of independent variables) degrees of freedom in the numerator and n − (k + 1) degrees of freedom in the denominator, where n is the sample size.

D.The formula to calculate the value of the test statistic for the global test is:

 F = (SSR/k) / (SSE/[n − (k + 1)]) [14–5]
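The global test can be sketched with SciPy (hypothetical ANOVA figures; not from the text):

```python
from scipy import stats

# Hypothetical ANOVA results: k independent variables, sample size n
ssr, sse, n, k = 750.0, 500.0, 40, 4

f_stat = (ssr / k) / (sse / (n - (k + 1)))   # formula (14-5)
f_crit = stats.f.ppf(0.95, k, n - (k + 1))   # critical value at the .05 level
reject = f_stat > f_crit                     # reject H0 that all coefficients are zero?
```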

VI. The test for individual variables determines which independent variables have nonzero regression coefficients.

A. The variables that have zero regression coefficients are usually dropped from the analysis.

B. The test statistic is the t distribution with n − (k + 1) degrees of freedom.

C. The formula to calculate the value of the test statistic for the individual test is:

 t = (bi − 0) / sbi [14–6]
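Formula (14–6) in Python, with a hypothetical coefficient and standard error:

```python
from scipy import stats

# Hypothetical coefficient and standard error, with n = 20 and k = 3
b1, sb1, n, k = 2.5, 0.8, 20, 3

t_stat = (b1 - 0) / sb1                    # formula (14-6)
t_crit = stats.t.ppf(0.975, n - (k + 1))   # two-tailed critical value, .05 level
keep = abs(t_stat) > t_crit                # coefficient differs from zero?
```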

VII. Dummy variables are used to represent qualitative variables and can assume only one of two possible outcomes.

VIII. There are five assumptions underlying multiple regression analysis.

A. The relationship between the dependent variable and the set of independent variables must be linear.

1.To verify this assumption develop a scatter diagram and plot the residuals on the vertical axis and the fitted values on the horizontal axis.

2.If the plots appear random, we conclude the relationship is linear.

B. The variation is the same for both large and small values of Ŷ.

1.Homoscedasticity means the variation is the same for all fitted values of the dependent variable.

2.This condition is checked by developing a scatter diagram with the residuals on the vertical axis and the fitted values on the horizontal axis.

3.If there is no pattern to the plots—that is, they appear random—the residuals meet the homoscedasticity requirement.

C. The residuals follow the normal probability distribution.

1.This condition is checked by developing a histogram of the residuals to see if they follow a normal distribution.

2.The mean of the distribution of the residuals is 0.

D.The independent variables are not correlated.

1.A correlation matrix will show all possible correlations among independent variables. Signs of trouble are correlations larger than 0.70 or less than −0.70.

2.Signs of correlated independent variables include: an important predictor variable is found insignificant; an obvious reversal occurs in the sign of one or more of the independent variables; when a variable is removed from the solution, there is a large change in the regression coefficients.

3.The variance inflation factor is used to identify correlated independent variables.

 VIF = 1 / (1 − Rj²) [14–7]
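Formula (14–7) can be computed by regressing each independent variable on the others. A NumPy sketch with hypothetical, deliberately collinear data:

```python
import numpy as np

# Hypothetical predictors: x3 is close to x1 + x2, so trouble is expected
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
x3 = np.array([3.1, 2.9, 7.0, 7.2, 10.9, 11.1])

def vif(target, others):
    # Regress one independent variable on the rest, then apply formula (14-7)
    X = np.column_stack([np.ones(len(target))] + list(others))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    dev = target - target.mean()
    r2_j = 1 - (resid @ resid) / (dev @ dev)
    return 1.0 / (1.0 - r2_j)

vif3 = vif(x3, [x1, x2])   # far above the usual cutoff of 10
```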

E.Each residual is independent of other residuals.

1.Autocorrelation occurs when successive residuals are correlated.

2.When autocorrelation exists, the value of the standard error will be biased and will return poor results for tests of hypothesis regarding the regression coefficients.

IX. Several techniques help build a regression model.

A. A dummy or qualitative independent variable can assume one of two possible outcomes.

1.A value of 1 is assigned to one outcome and 0 the other.

2.Use formula (14–6) to determine if the dummy variable should remain in the equation.
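Coding a dummy variable, sketched in Python (the equation and its coefficients are hypothetical):

```python
# Hypothetical qualitative variable coded as a dummy: 1 for one outcome, 0 for the other
degree = ["yes", "no", "yes", "yes", "no"]
x3 = [1 if d == "yes" else 0 for d in degree]

# In a hypothetical fitted equation, the dummy shifts the estimate by its coefficient
def predict(years, masters):
    return 20.0 + 1.5 * years + 3.0 * masters   # masters is 0 or 1

gap = predict(10, 1) - predict(10, 0)   # difference attributable to the dummy
```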

B.Stepwise regression is a step-by-step process to find the regression equation.

1.Only independent variables with nonzero regression coefficients enter the equation.

2.Independent variables are added one at a time to the regression equation.

C.Interaction is the case in which one independent variable (such as X2) affects the relationship with another independent variable (X1) and the dependent variable (Y ).
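Interaction is handled by adding the product X1·X2 as its own independent variable. A NumPy sketch with data generated so the true interaction coefficient is 0.5:

```python
import numpy as np

# Hypothetical data generated so the slope of X1 depends on X2 (interaction)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = 2.0 + 3.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2

# Including the product term X1*X2 lets least squares recover the interaction
X = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
```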

Pronunciation Key

SYMBOL      MEANING                                                     PRONUNCIATION
b1          Regression coefficient for the first independent variable   b sub 1
bk          Regression coefficient for any independent variable         b sub k
sy.12...k   Multiple standard error of estimate                         s sub y dot 1, 2 . . . k

Chapter Exercises

13. A multiple regression equation yields the following partial results.

Source        Sum of Squares    df
Regression         750           4
Error              500          35

a. What is the total sample size?

b. How many independent variables are being considered?

c. Compute the coefficient of determination.

d. Compute the standard error of estimate.

e. Test the hypothesis that none of the regression coefficients is equal to zero. Let α = .05.
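One way to check the arithmetic behind Exercise 13 (an editorial Python sketch; the figures come from the partial ANOVA table above):

```python
import math
from scipy import stats

ssr, df_reg = 750.0, 4    # regression sum of squares and its df
sse, df_err = 500.0, 35   # error sum of squares and its df

n = df_reg + df_err + 1               # a. total sample size
k = df_reg                            # b. number of independent variables
r2 = ssr / (ssr + sse)                # c. coefficient of determination
s_est = math.sqrt(sse / df_err)       # d. multiple standard error of estimate
f_stat = (ssr / k) / (sse / df_err)   # e. global test statistic
reject = f_stat > stats.f.ppf(0.95, k, df_err)
```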

14.In a multiple regression equation two independent variables are considered, and the sample size is 25. The regression coefficients and the standard errors are as follows.

b1 = 2.676    sb1 = 0.56
b2 = 0.88     sb2 = 0.71

Conduct a test of hypothesis to determine whether either independent variable has a coefficient equal to zero. Would you consider deleting either variable from the regression equation? Use the .05 significance level.

15.The following output was obtained.

Analysis of Variance
SOURCE       DF    SS    MS
Regression    5   100    20
Error        20    40     2
Total        25   140

Predictor    Coef   StDev   t-ratio
Constant     3.00    1.50     2.00
X1           4.00    3.00     1.33
X2           3.00    0.20    15.00
X3           0.20    0.05     4.00
X4           2.50    1.00     2.50
X5           3.00    4.00     0.75

a.What is the sample size?

b.Compute the value of R2.

c.Compute the multiple standard error of estimate.

d.Conduct a global test of hypothesis to determine whether any of the regression coefficients are significant. Use the .05 significance level.

e.Test the regression coefficients individually. Would you consider omitting any variable(s)? If so, which one(s)? Use the .05 significance level.
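A Python sketch of the computations for Exercise 15, reading the quantities (signs as printed) from the output above:

```python
import math
from scipy import stats

ssr, sse, ss_total = 100.0, 40.0, 140.0
df_reg, df_err = 5, 20

n = df_reg + df_err + 1                    # a. sample size
r2 = ssr / ss_total                        # b. coefficient of determination
s_est = math.sqrt(sse / df_err)            # c. multiple standard error of estimate
f_stat = (ssr / df_reg) / (sse / df_err)   # d. global F statistic
global_reject = f_stat > stats.f.ppf(0.95, df_reg, df_err)

# e. compare each printed t-ratio (in absolute value) with the critical t
t_crit = stats.t.ppf(0.975, df_err)
t_ratios = {"X1": 1.33, "X2": 15.00, "X3": 4.00, "X4": 2.50, "X5": 0.75}
drop = [v for v, t in t_ratios.items() if abs(t) < t_crit]
```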

16.In a multiple regression equation k = 5 and n = 20, the MSE value is 5.10, and SS total is 519.68. At the .05 significance level, can we conclude that any of the regression coefficients are not equal to 0?
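The quantities needed for Exercise 16 can be recovered from MSE and SS total (an editorial sketch):

```python
from scipy import stats

k, n = 5, 20
mse, ss_total = 5.10, 519.68

df_err = n - (k + 1)          # 14 degrees of freedom for error
sse = mse * df_err            # error sum of squares
ssr = ss_total - sse          # regression sum of squares
f_stat = (ssr / k) / mse      # global F statistic
reject = f_stat > stats.f.ppf(0.95, k, df_err)
```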

17.The district manager of Jasons, a large discount electronics chain, is investigating why certain stores in her region are performing better than others. She believes that three factors are related to total sales: the number of competitors in the region, the population in the surrounding area, and the amount spent on advertising. From her district, consisting of several hundred stores, she selects a random sample of 30 stores. For each store she gathered the following information.

Y = total sales last year (in $ thousands).

X1 = number of competitors in the region.

X2 = population of the region (in millions).

X3 = advertising expense (in $ thousands).

The sample data were run on MINITAB, with the following results.

Analysis of Variance
SOURCE        DF      SS        MS
Regression     3   3050.00   1016.67
Error         26   2200.00     84.62
Total         29   5250.00

Predictor    Coef   StDev   t-ratio
Constant    14.00    7.00      2.00
X1           1.00    0.70      1.43
X2          30.00    5.20      5.77
X3           0.20    0.08      2.50

a. What are the estimated sales for the Bryne store, which has four competitors, a regional population of 0.4 (400,000), and advertising expense of 30 ($30,000)?

b.Compute the R2 value.

c.Compute the multiple standard error of estimate.

d.Conduct a global test of hypothesis to determine whether any of the regression coefficients are not equal to zero. Use the .05 level of significance.

e.Conduct tests of hypotheses to determine which of the independent variables have significant regression coefficients. Which variables would you consider eliminating? Use the .05 significance level.
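Parts (b) through (d) of Exercise 17 can be checked from the ANOVA portion of the output (an editorial Python sketch; the printed coefficient signs may have been lost in extraction, so part (a) is not computed here):

```python
import math
from scipy import stats

# From the ANOVA portion of the MINITAB output above
ssr, sse, ss_total = 3050.0, 2200.0, 5250.0
df_reg, df_err = 3, 26

r2 = ssr / ss_total                        # b. about .581
s_est = math.sqrt(sse / df_err)            # c. about 9.2
f_stat = (ssr / df_reg) / (sse / df_err)   # d. about 12.0
reject = f_stat > stats.f.ppf(0.95, df_reg, df_err)
```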

18.Suppose that the sales manager of a large automotive parts distributor wants to estimate as early as April the total annual sales of a region. On the basis of regional sales, the total sales for the company can also be estimated. If, based on past experience, it is found that the April estimates of annual sales are reasonably accurate, then in future years the April forecast could be used to revise production schedules and maintain the correct inventory at the retail outlets.

Several factors appear to be related to sales, including the number of retail outlets in the region stocking the company's parts, the number of automobiles in the region registered as of April 1, and the total personal income for the first quarter of the year. Five independent variables were finally selected as being the most important (according to the sales manager). Then the data were gathered for a recent year. The total annual sales for that year for each region were also recorded. Note in the following table that for region 1 there were 1,739 retail outlets stocking the company's automotive parts, there were 9,270,000 registered automobiles in the region as of April 1, and so on. The sales for that year were $37,702,000.

Annual Sales    Number of        Automobiles             Personal Income   Age of Automobiles   Number of
($ millions),   Retail Outlets,  Registered (millions),  ($ billions),     (years),             Supervisors,
Y               X1               X2                      X3                X4                   X5
37.702          1,739             9.27                   85.4              3.5                   9.0
24.196          1,221             5.86                   60.7              5.0                   5.0
32.055          1,846             8.81                   68.1              4.4                   7.0
 3.611            120             3.81                   20.2              4.0                   5.0
17.625          1,096            10.31                   33.8              3.5                   7.0
45.919          2,290            11.62                   95.1              4.1                  13.0
29.600          1,687             8.96                   69.3              4.1                  15.0
 8.114            241             6.28                   16.3              5.9                  11.0
20.116            649             7.77                   34.9              5.5                  16.0
12.994          1,427            10.92                   15.1              4.1                  10.0

a.Consider the following correlation matrix. Which single variable has the strongest correlation with the dependent variable? The correlations between the independent variables outlets and income and between cars and outlets are fairly strong. Could this be a problem? What is this condition called?

          sales    outlets   cars     income   age
outlets   0.899
cars      0.605    0.775
income    0.964    0.825    0.409
age       0.323    0.489    0.447    0.349
bosses    0.286    0.183    0.395    0.155    0.291

b.The output for all five variables follows. What percent of the variation is explained by the regression equation?

The regression equation is
sales 19.7 0.00063 outlets 1.74 cars 0.410 income 2.04 age 0.034 bosses

Predictor      Coef       StDev     t-ratio
Constant     19.672      5.422        3.63
outlets       0.000629   0.002638     0.24
cars          1.7399     0.5530       3.15
income        0.40994    0.04385      9.35
age           2.0357     0.8779       2.32
bosses        0.0344     0.1880       0.18

Analysis of Variance
SOURCE       DF      SS        MS
Regression    5   1593.81   318.76
Error         4      9.08     2.27
Total         9   1602.89

c.Conduct a global test of hypothesis to determine whether any of the regression coefficients are not zero. Use the .05 significance level.

d.Conduct a test of hypothesis on each of the independent variables. Would you consider eliminating "outlets" and "bosses"? Use the .05 significance level.

e.The regression has been rerun below with “outlets” and “bosses” eliminated. Compute the coefficient of determination. How much has R2 changed from the previous analysis?

The regression equation is
sales 18.9 1.61 cars 0.400 income 1.96 age

Predictor     Coef      StDev    t-ratio
Constant    18.924     3.636       5.20
cars         1.6129    0.1979      8.15
income       0.40031   0.01569    25.52
age          1.9637    0.5846      3.36

Analysis of Variance
SOURCE       DF      SS        MS
Regression    3   1593.66   531.22
Error         6      9.23     1.54
Total         9   1602.89
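For part (e), the two coefficients of determination follow directly from the sums of squares in the two runs (an editorial sketch):

```python
# Sums of squares from the two MINITAB runs above (SS total is unchanged)
ss_total = 1602.89
r2_full = 1593.81 / ss_total      # all five independent variables
r2_reduced = 1593.66 / ss_total   # "outlets" and "bosses" removed
change = r2_full - r2_reduced     # R-squared barely moves
```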

f. Following is a histogram and a stem-and-leaf chart of the residuals. Does the normality assumption appear reasonable?

[MINITAB histogram (N = 10) and stem-and-leaf display (N = 10, leaf unit = 0.10) of the residuals.]

g. Following is a plot of the fitted values of Y (i.e., Ŷ) and the residuals. Do you see any violations of the assumptions?

[Scatter plot of the residuals (Y − Ŷ) against the fitted values Ŷ; residual axis from −1.2 to 1.2, fitted-value axis from 8 to 40.]

19.The administrator of a new paralegal program at Seagate Technical College wants to estimate the grade point average in the new program. He thought that high school GPA, the verbal score on the Scholastic Aptitude Test (SAT), and the mathematics score on the SAT would be good predictors of paralegal GPA. The data on nine students are:

          High School   SAT      SAT    Paralegal
Student       GPA      Verbal   Math      GPA
   1         3.25       480      410     3.21
   2         1.80       290      270     1.68
   3         2.89       420      410     3.58
   4         3.81       500      600     3.92
   5         3.13       500      490     3.00
   6         2.81       430      460     2.82
   7         2.20       320      490     1.65
   8         2.14       530      480     2.30
   9         2.63       469      440     2.33

a.Consider the following correlation matrix. Which variable has the strongest correlation with the dependent variable? Some of the correlations among the independent variables are strong. Does this appear to be a problem?

          legal    gpa      verbal
gpa       0.911
verbal    0.616    0.609
math      0.487    0.636    0.599

b.Consider the following output. Compute the coefficient of multiple determination.

The regression equation is

legal 0.411 1.20 gpa 0.00163 verbal 0.00194 math

Predictor      Coef       StDev      t-ratio
Constant     0.4111     0.7823        0.53
gpa          1.2014     0.2955        4.07
verbal       0.001629   0.002147      0.76
math         0.001939   0.002074      0.94

Analysis of Variance
SOURCE       DF     SS       MS
Regression    3   4.3595   1.4532
Error         5   0.7036   0.1407
Total         8   5.0631

c.Conduct a global test of hypothesis from the preceding output. Does it appear that any of the regression coefficients are not equal to zero?

d.Conduct a test of hypothesis on each independent variable. Would you consider eliminating the variables "verbal" and "math"? Let α = .05.

e.The analysis has been rerun without "verbal" and "math." See the following output. Compute the coefficient of determination. How much has R2 changed from the previous analysis?

The regression equation is
legal = −0.454 + 1.16 gpa

Predictor      Coef     StDev   t-ratio
Constant    −0.4542    0.5542    −0.82
gpa          1.1589    0.1977     5.86

Analysis of Variance
SOURCE       DF     SS       MS
Regression    1   4.2061   4.2061
Error         7   0.8570   0.1224
Total         8   5.0631
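For parts (b) and (e), the two coefficients of determination follow from the sums of squares in the two runs (an editorial sketch):

```python
# Sums of squares from the two runs above (SS total is unchanged)
ss_total = 5.0631
r2_full = 4.3595 / ss_total      # gpa, verbal, and math in the equation
r2_reduced = 4.2061 / ss_total   # gpa alone
change = r2_full - r2_reduced    # about .03 of explained variation is lost
```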

f.Following are a histogram and a stem-and-leaf diagram of the residuals. Does the normality assumption for the residuals seem reasonable?

[MINITAB histogram (N = 9) and stem-and-leaf display (N = 9, leaf unit = 0.10) of the residuals.]

g. Following is a plot of the residuals and the Ŷ values. Do you see any violation of the assumptions?

[Scatter plot of the residuals (Y − Ŷ) against the fitted values Ŷ; residual axis from −0.35 to 0.70, fitted-value axis from 1.50 to 4.00.]

20.Mike Wilde is president of the teachers' union for Otsego School District. In preparing for upcoming negotiations, he would like to investigate the salary structure of classroom teachers in the district. He believes there are three factors that affect a teacher's salary: years of experience, a rating of teaching effectiveness given by the principal, and whether the teacher has a master's degree. A random sample of 20 teachers resulted in the following data.

Salary           Years of      Principal's   Master's
($ thousands),   Experience,   Rating,       Degree,*
Y                X1            X2            X3
31.1              8            35            0
33.6              5            43            0
29.3              2            51            1
43.0             15            60            1
38.6             11            73            0
45.0             14            80            1
42.0              9            76            0
36.8              7            54            1
48.6             22            55            1
31.7              3            90            1
25.7              1            30            0
30.6              5            44            0
51.8             23            84            1
46.7             17            76            0
38.4             12            68            1
33.6             14            25            0
41.8              8            90            1
30.7              4            62            0
32.8              2            80            1
42.8              8            72            0

*1 = yes, 0 = no.

a.Develop a correlation matrix. Which independent variable has the strongest correlation with the dependent variable? Does it appear there will be any problems with multicollinearity?

b.Determine the regression equation. What salary would you estimate for a teacher with five years’ experience, a rating by the principal of 60, and no master’s degree?

c.Conduct a global test of hypothesis to determine whether any of the regression coefficients differ from zero. Use the .05 significance level.

d.Conduct a test of hypothesis for the individual regression coefficients. Would you consider deleting any of the independent variables? Use the .05 significance level.

e.If your conclusion in part (d) was to delete one or more independent variables, run the analysis again without those variables.

f.Determine the residuals for the equation of part (e). Use a stem-and-leaf chart or a histogram to verify that the distribution of the residuals is approximately normal.

g.Plot the residuals computed in part (f) in a scatter diagram with the residuals on the Y-axis and the Ŷ values on the X-axis. Does the plot reveal any violations of the assumptions of regression?

21.The district sales manager for a major automobile manufacturer is studying car sales. Specifically, he would like to determine what factors affect the number of cars sold at a dealership. To investigate, he randomly selects 12 dealers. From these dealers he obtains the number of cars sold last month, the minutes of radio advertising purchased last month, the number of full-time salespeople employed in the dealership, and whether the dealer is located in the city. The information is as follows:

Cars Sold Last Month,   Advertising,   Sales Force,   City,
Y                       X1             X2             X3
127                     18             10             Yes
138                     15             15             No
159                     22             14             Yes
144                     23             12             Yes
139                     17             12             No
128                     16             12             Yes
161                     25             14             Yes
180                     26             17             Yes
102                     15              7             No
163                     24             16             Yes
106                     18             10             No
149                     25             11             Yes

a.Develop a correlation matrix. Which independent variable has the strongest correlation with the dependent variable? Does it appear there will be any problems with multicollinearity?

b.Determine the regression equation. How many cars would you expect to be sold by a dealership employing 20 salespeople, purchasing 15 minutes of advertising, and located in a city?

c.Conduct a global test of hypothesis to determine whether any of the regression coefficients differ from zero. Let α = .05.

d.Conduct a test of hypothesis for the individual regression coefficients. Would you consider deleting any of the independent variables? Let α = .05.

e.If your conclusion in part (d) was to delete one or more independent variables, run the analysis again without those variables.

f.Determine the residuals for the equation of part (e). Use a stem-and-leaf chart or a histogram to verify that the distribution of the residuals is approximately normal.

g.Plot the residuals computed in part (f) in a scatter diagram with the residuals on the Y-axis and the Ŷ values on the X-axis. Does the plot reveal any violations of the assumptions of regression?

22.Fran’s Convenience Marts are located throughout metropolitan Erie, Pennsylvania. Fran, the owner, would like to expand into other communities in northwestern Pennsylvania and southwestern New York, such as Jamestown, Corry, Meadville, and Warren. To prepare her presentation to the local bank, she would like to better understand the factors that make a particular outlet profitable. She must do all the work herself, so she will not be able to study all her outlets. She selects a random sample of 15 marts and records the average daily sales (Y ), the floor space (area), the number of parking spaces, and the median income of families in that ZIP code region for each. The sample information is reported below.

Sampled   Daily     Store   Parking   Income
Mart      Sales     Area    Spaces    ($ thousands)
 1        $1,840    532     6         44
 2         1,746    478     4         51
 3         1,812    530     7         45
 4         1,806    508     7         46
 5         1,792    514     5         44
 6         1,825    556     6         46
 7         1,811    541     4         49
 8         1,803    513     6         52
 9         1,830    532     5         46
10         1,827    537     5         46
11         1,764    499     3         48
12         1,825    510     8         47
13         1,763    490     4         48
14         1,846    516     8         45
15         1,815    482     7         43

a.Determine the regression equation.

b.What is the value of R2? Comment on the value.

c.Conduct a global hypothesis test to determine if any of the independent variables are different from zero.

d.Conduct individual hypothesis tests to determine if any of the independent variables can be dropped.

e.If variables are dropped, recompute the regression equation and R2.

23.Great Plains Roofing and Siding Company, Inc., sells roofing and siding products to home repair retailers, such as Lowe's and Home Depot, and commercial contractors. The owner is interested in studying the effects of several variables on the value of shingles sold ($000). The marketing manager is arguing that the company should spend more money on advertising, while a market researcher suggests it should focus more on making its brand and product more distinct from its competitors.

The company has divided the United States into 26 marketing districts. In each district it collected information on the following variables: volume of sales (in thousands of dollars), advertising dollars (in thousands), number of active accounts, number of competing brands, and a rating of district potential.

Sales    Advertising      Number of   Number of     Market
(000s)   Dollars (000s)   Accounts    Competitors   Potential
 79.3    5.5              31          10             8
200.1    2.5              55           8             6
163.2    8.0              67          12             9
200.1    3.0              50           7            16
146.0    3.0              38           8            15
177.7    2.9              71          12            17
 30.9    8.0              30          12             8
291.9    9.0              56           5            10
160.0    4.0              42           8             4
339.4    6.5              73           5            16
159.6    5.5              60          11             7
 86.3    5.0              44          12            12
237.5    6.0              50           6             6
107.2    5.0              39          10             4
155.0    3.5              55          10             4
291.4    8.0              70           6            14
100.2    6.0              40          11             6
135.8    4.0              50          11             8
223.3    7.5              62           9            13
195.0    7.0              59           9            11
 73.4    6.7              53          13             5
 47.7    6.1              38          13            10
140.7    3.6              43           9            17
 93.5    4.2              26           8             3
259.0    4.5              75           8            19
331.2    5.6              71           4             9

Conduct a multiple regression analysis to find the best predictors of sales.

a.Draw a scatter diagram comparing sales volume with each of the independent variables. Comment on the results.

b.Develop a correlation matrix. Do you see any problems? Does it appear there are any redundant independent variables?

c.Develop a regression equation. Conduct the global test. Can we conclude that some of the independent variables are useful in explaining the variation in the dependent variable?

d.Conduct a test of each of the independent variables. Are there any that should be dropped?

e.Refine the regression equation so the remaining variables are all significant.

f.Develop a histogram of the residuals and a normal probability plot. Are there any problems?

g.Determine the variance inflation factor for each of the independent variables. Are there any problems?

24.The Times-Observer is a daily newspaper in Metro City. Like many city newspapers, the Times-Observer is suffering through difficult financial times. The circulation manager is studying other papers in similar cities in the United States and Canada. She is particularly interested in what variables relate to the number of subscriptions to the paper. She is able to obtain the following sample information on 25 newspapers in similar cities. The following notation is used:

Sub = Number of subscriptions (in thousands).

Popul = The metropolitan population (in thousands).

Adv = The advertising budget of the paper (in $ hundreds).

Income = The median family income in the metropolitan area (in $ thousands).

Paper    Sub     Popul    Adv    Income
  1     37.95    588.9    13.2    35.1
  2     37.66    585.3    13.2    34.7
  3     37.55    566.3    19.8    34.8
  4     38.78    642.9    17.6    35.1
  5     37.67    624.2    17.6    34.6
  6     38.23    603.9    15.4    34.8
  7     36.90    571.9    11.0    34.7
  8     38.28    584.3    28.6    35.3
  9     38.95    605.0    28.6    35.1
 10     39.27    676.3    17.6    35.6
 11     38.30    587.4    17.6    34.9
 12     38.84    576.4    22.0    35.4
 13     38.14    570.8    17.6    35.0
 14     38.39    586.5    15.4    35.5
 15     37.29    544.0    11.0    34.9
 16     39.15    611.1    24.2    35.0
 17     38.29    643.3    17.6    35.3
 18     38.09    635.6    19.8    34.8
 19     37.83    598.9    15.4    35.1
 20     39.37    657.0    22.0    35.3
 21     37.81    595.2    15.4    35.1
 22     37.42    520.0    19.8    35.1
 23     38.83    629.6    22.0    35.3
 24     38.33    680.0    24.2    34.7
 25     40.24    651.2    33.0    35.8

a.Determine the regression equation.

b.Conduct a global test of hypothesis to determine whether any of the regression coefficients are not equal to zero.

c.Conduct a test for the individual coefficients. Would you consider deleting any coefficients?

d.Determine the residuals and plot them against the fitted values. Do you see any problems?

e.Develop a histogram of the residuals. Do you see any problems with the normality assumption?

25.How important is GPA in determining the starting salary of recent business school graduates? Does graduating from a business school increase the starting salary? The director of undergraduate studies at a major university wanted to study these questions. She gathered the following sample information on 15 graduates last spring to investigate these questions.

Student   Salary   GPA     Business
  1       $31.5    3.245   0
  2        33.0    3.278   0
  3        34.1    3.520   1
  4        35.4    3.740   1
  5        34.2    3.520   1
  6        34.0    3.421   1
  7        34.5    3.410   1
  8        35.0    3.630   1
  9        34.7    3.355   1
 10        32.5    3.080   0
 11        31.5    3.025   0
 12        32.2    3.146   0
 13        34.0    3.465   1
 14        32.8    3.245   0
 15        31.8    3.025   0

The salary is reported in $000, GPA on the traditional 4-point scale. A 1 indicates the student graduated from a school of business; a 0 indicates that the student graduated from one of the other schools.

a.Develop a correlation matrix. Do you see any problems with multicollinearity?

b.Determine the regression equation. Discuss the regression equation. How much does graduating from a college of business add to a starting salary? What starting salary would you estimate for a student with a GPA of 3.00 who graduated from a college of business?

c.What is the value of R2? Can we conclude that this value is greater than 0?

d.Would you consider deleting either of the independent variables?

e.Plot the residuals in a histogram. Is there any problem with the normality assumption?

f.Plot the fitted values against the residuals. Does this plot indicate any problems with homoscedasticity?

26.A mortgage department of a large bank is studying its recent loans. Of particular interest is how such factors as the value of the home (in thousands of dollars), education level of the head of the household, age of the head of the household, current monthly mortgage payment (in dollars), and gender of the head of the household (male = 1, female = 0) relate to the family income. Are these variables effective predictors of the income of the household? A random sample of 25 recent loans is obtained.

Income          Value           Years of                Mortgage
($ thousands)   ($ thousands)   Education   Age   Payment     Gender
$40.3           $190            14          53    $230        1
 39.6            121            15          49     370        1
 40.8            161            14          44     397        1
 40.3            161            14          39     181        1
 40.0            179            14          53     378        0
 38.1             99            14          46     304        0
 40.4            114            15          42     285        1
 40.7            202            14          49     551        0
 40.8            184            13          37     370        0
 37.1             90            14          43     135        0
 39.9            181            14          48     332        1
 40.4            143            15          54     217        1
 38.0            132            14          44     490        0
 39.0            127            14          37     220        0
 39.5            153            14          50     270        1
 40.6            145            14          50     279        1
 40.3            174            15          52     329        1
 40.1            177            15          47     274        0
 41.7            188            15          49     433        1
 40.1            153            15          53     333        1
 40.6            150            16          58     148        0
 40.4            173            13          42     390        1
 40.9            163            14          46     142        1
 40.1            150            15          50     343        0
 38.5            139            14          45     373        0

a.Determine the regression equation.

b.What is the value of R2? Comment on the value.

c.Conduct a global hypothesis test to determine whether any of the independent variables are different from zero.

d.Conduct individual hypothesis tests to determine whether any of the independent variables can be dropped.

e.If variables are dropped, recompute the regression equation and R2.

27.Fred G. Hire is the manager of human resources at Crescent Tool and Die, Inc. As part of his yearly report to the CEO, he is required to present an analysis of the salaried employees. Because there are over 1,000 employees, he does not have the staff to gather information on each salaried employee, so he selects a random sample of 30. For each employee, he records monthly salary; service at Crescent, in months; age; gender (1 = male, 0 = female); and whether the employee has a technical or clerical job. Those working technical jobs are coded 1, and those who are clerical 0.

Sampled    Monthly   Length of
Employee   Salary    Service     Age   Gender   Job
  1        $1,769     93         42    1        0
  2         1,740    104         33    1        0
  3         1,941    104         42    1        1
  4         2,367    126         57    1        1
  5         2,467     98         30    1        1
  6         1,640     99         49    1        1
  7         1,756     94         35    1        0
  8         1,706     96         46    0        1
  9         1,767    124         56    0        0
 10         1,200     73         23    0        1
 11         1,706    110         67    0        1
 12         1,985     90         36    0        1
 13         1,555    104         53    0        0
 14         1,749     81         29    0        0
 15         2,056    106         45    1        0
 16         1,729    113         55    0        1
 17         2,186    129         46    1        1
 18         1,858     97         39    0        1
 19         1,819    101         43    1        1
 20         1,350     91         35    1        1
 21         2,030    100         40    1        0
 22         2,550    123         59    1        0
 23         1,544     88         30    0        0
 24         1,766    117         60    1        1
 25         1,937    107         45    1        1
 26         1,691    105         32    0        1
 27         1,623     86         33    0        0
 28         1,791    131         56    0        1
 29         2,001     95         30    1        1
 30         1,874     98         47    1        0

a. Determine the regression equation, using salary as the dependent variable and the other four variables as independent variables.

b.What is the value of R2? Comment on this value.

c.Conduct a global test of hypothesis to determine whether any of the independent vari- ables are different from 0.

d.Conduct an individual test to determine whether any of the independent variables can be dropped.

e.Rerun the regression equation, using only the independent variables that are signifi- cant. How much more does a man earn per month than a woman? Does it make a difference whether the employee has a technical or a clerical job?

28.Many regions along the coast in North and South Carolina and Georgia have experienced rapid population growth over the last 10 years. It is expected that the growth will continue over the next 10 years. This has motivated many of the large grocery store chains to build new stores in the region. The Kelley's Super Grocery Stores, Inc., chain is no exception. The director of planning for Kelley's Super Grocery Stores wants to study adding more stores in this region. He believes there are two main factors that indicate the amount families spend on groceries. The first is their income and the other is the number of people in the family. The director gathered the following sample information.

Family   Food    Income    Size
  1      $5.04   $ 73.98   4
  2       4.08     54.90   2
  3       5.76     94.14   4
  4       3.48     52.02   1
  5       4.20     65.70   2
  6       4.80     53.64   4
  7       4.32     79.74   3
  8       5.04     68.58   4
  9       6.12    165.60   5
 10       3.24     64.80   1
 11       4.80    138.42   3
 12       3.24    125.82   1
 13       6.60     77.58   7
 14       4.92    171.36   2
 15       6.60     82.08   9
 16       5.40    141.30   3
 17       6.00     36.90   5
 18       5.40     56.88   4
 19       3.36     71.82   1
 20       4.68     69.48   3
 21       4.32     54.36   2
 22       5.52     87.66   5
 23       4.56     38.16   3
 24       5.40     43.74   7
 25       4.80     48.42   5

Food and income are reported in thousands of dollars per year, and the variable size refers to the number of people in the household.

a. Develop a correlation matrix. Do you see any problems with multicollinearity?

b. Determine the regression equation. Discuss the regression equation. How much does an additional family member add to the amount spent on food?

c. What is the value of R²? Can we conclude that this value is greater than 0?

d. Would you consider deleting either of the independent variables?

e. Plot the residuals in a histogram. Is there any problem with the normality assumption?

f. Plot the fitted values against the residuals. Does this plot indicate any problems with homoscedasticity?
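Part (a)'s correlation matrix is a one-liner in most packages. A sketch with hypothetical figures (not the full table above), applying the chapter's ±0.70 rule of thumb for multicollinearity among the independent variables:

```python
import numpy as np

# Hypothetical family-food figures (not the text's full table): columns are
# food, income, and family size for six families.
data = np.array([
    [5.0, 74.0, 4],
    [4.1, 55.0, 2],
    [5.8, 94.0, 4],
    [3.5, 52.0, 1],
    [6.1, 166.0, 5],
    [4.8, 138.0, 3],
])
corr = np.corrcoef(data, rowvar=False)  # 3x3 matrix of pairwise r values

# Rule of thumb: correlations between the *independent* variables below
# -0.70 or above 0.70 hint at multicollinearity.
r_income_size = corr[1, 2]
print(corr.round(2))
print("possible multicollinearity:", abs(r_income_size) > 0.70)
```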

29. An investment advisor is studying the relationship between a common stock's price to earnings (P/E) ratio and factors that she thinks would influence it. She has the following data on the earnings per share (EPS) and the dividend percentage (Yield) for a sample of 20 stocks.

| Stock | P/E | EPS | Yield | Stock | P/E | EPS | Yield |
|---|---|---|---|---|---|---|---|
| 1 | 20.79 | \$2.46 | 1.42 | 11 | 1.35 | \$2.93 | 2.59 |
| 2 | 3.03 | 2.69 | 4.05 | 12 | 25.43 | 2.07 | 1.04 |
| 3 | 44.46 | 0.28 | 4.16 | 13 | 22.14 | 2.19 | 3.52 |
| 4 | 41.72 | 0.45 | 1.27 | 14 | 24.21 | 0.83 | 1.56 |
| 5 | 18.96 | 1.60 | 3.39 | 15 | 30.91 | 2.29 | 2.23 |
| 6 | 18.42 | 2.32 | 3.86 | 16 | 35.79 | 1.64 | 3.36 |
| 7 | 34.82 | 0.81 | 4.56 | 17 | 18.99 | 3.07 | 1.98 |
| 8 | 30.43 | 2.13 | 1.62 | 18 | 30.21 | 1.71 | 3.07 |
| 9 | 29.97 | 2.22 | 5.10 | 19 | 32.88 | 0.35 | 2.21 |
| 10 | 10.86 | 1.44 | 1.17 | 20 | 15.19 | 5.02 | 3.50 |

a. Develop a multiple linear regression with P/E as the dependent variable.

b. Are either of the two independent variables an effective predictor of P/E?

c. Interpret the regression coefficients.

d. Do any of these stocks look particularly undervalued?

e. Plot the residuals and check the normality assumption. Plot the fitted values against the residuals.

f. Does there appear to be any problem with homoscedasticity?

g. Develop a correlation matrix. Do any of the correlations indicate multicollinearity?

30.The Conch Café, located in Gulf Shores, Alabama, features casual lunches with a great view of the Gulf of Mexico. To accommodate the increase in business during the summer vacation season, Fuzzy Conch, the owner, hires a large number of servers as seasonal help. When he interviews a prospective server he would like to provide data on the amount a server can earn in tips. He believes that the amount of the bill and the number of diners are both related to the amount of the tip. He gathered the following sample information.

| Customer | Amount of Tip | Amount of Bill | Number of Diners | Customer | Amount of Tip | Amount of Bill | Number of Diners |
|---|---|---|---|---|---|---|---|
| 1 | \$7.00 | \$48.97 | 5 | 16 | \$3.30 | \$23.59 | 2 |
| 2 | 4.50 | 28.23 | 4 | 17 | 3.50 | 22.30 | 2 |
| 3 | 1.00 | 10.65 | 1 | 18 | 3.25 | 32.00 | 2 |
| 4 | 2.40 | 19.82 | 3 | 19 | 5.40 | 50.02 | 4 |
| 5 | 5.00 | 28.62 | 3 | 20 | 2.25 | 17.60 | 3 |
| 6 | 4.25 | 24.83 | 2 | 21 | 5.50 | 44.47 | 4 |
| 7 | 0.50 | 6.24 | 1 | 22 | 3.00 | 20.27 | 2 |
| 8 | 6.00 | 49.20 | 4 | 23 | 1.25 | 19.53 | 2 |
| 9 | 5.00 | 43.26 | 3 | 24 | 3.25 | 27.03 | 3 |
| 10 | 4.75 | 31.36 | 4 | 25 | 3.00 | 21.28 | 2 |
| 11 | 5.25 | 32.87 | 4 | 26 | 6.25 | 43.38 | 4 |
| 12 | 6.00 | 34.99 | 3 | 27 | 5.60 | 28.12 | 4 |
| 13 | 4.00 | 33.91 | 4 | 28 | 2.50 | 26.25 | 2 |
| 14 | 3.35 | 23.06 | 2 | 29 | 9.25 | 56.81 | 5 |
| 15 | 0.75 | 4.65 | 1 | 30 | 8.25 | 50.65 | 5 |


a. Develop a multiple regression equation with the amount of tips as the dependent variable and the amount of the bill and the number of diners as independent variables. Write out the regression equation. How much does another diner add to the amount of the tips?

b. Conduct a global test of hypothesis to determine if at least one of the independent variables is significant. What is your conclusion?

c. Conduct an individual test on each of the variables. Should one or the other be deleted?

d. Use the equation developed in part (c) to determine the coefficient of determination. Interpret the value.

e. Plot the residuals. Is it reasonable to assume they follow the normal distribution?

f. Plot the residuals against the fitted values. Is it reasonable to conclude they are random?

31. The president of Blitz Sales Enterprises sells kitchen products through television commercials, often called infomercials. He gathered data from the last 15 weeks of sales to determine the relationship between sales and the number of infomercials.

| Infomercials | Sales (\$000s) | Infomercials | Sales (\$000s) |
|---|---|---|---|
| 20 | 3.2 | 22 | 2.5 |
| 15 | 2.6 | 15 | 2.4 |
| 25 | 3.4 | 25 | 3.0 |
| 10 | 1.8 | 16 | 2.7 |
| 18 | 2.2 | 12 | 2.0 |
| 18 | 2.4 | 20 | 2.6 |
| 15 | 2.4 | 25 | 2.8 |
| 12 | 1.5 |  |  |

a. Determine the regression equation. Are the sales predictable from the number of commercials?

b. Determine the residuals and plot a histogram. Does the normality assumption seem reasonable?

32. The director of special events for Sun City believed that the amount of money spent on fireworks displays on the 4th of July was predictive of attendance at the Fall Festival held in October. She gathered the following data to test her suspicion.

| 4th of July (\$000) | Fall Festival (000) | 4th of July (\$000) | Fall Festival (000) |
|---|---|---|---|
| 10.6 | 8.8 | 9.0 | 9.5 |
| 8.5 | 6.4 | 10.0 | 9.8 |
| 12.5 | 10.8 | 7.5 | 6.6 |
| 9.0 | 10.2 | 10.0 | 10.1 |
| 5.5 | 6.0 | 6.0 | 6.1 |
| 12.0 | 11.1 | 12.0 | 11.3 |
| 8.0 | 7.5 | 10.5 | 8.8 |
| 7.5 | 8.4 |  |  |

Determine the regression equation. Is the amount spent on fireworks related to attendance at the Fall Festival? Conduct a hypothesis test to determine if there is a problem with autocorrelation.
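A common test for autocorrelation (the text may intend a different one) is the Durbin-Watson statistic, whose arithmetic is simple enough to sketch with made-up residuals:

```python
# Durbin-Watson statistic, a standard check for autocorrelation:
#   d = sum((e_t - e_{t-1})^2) / sum(e_t^2)
# d near 2 suggests no autocorrelation; d near 0 suggests positive
# autocorrelation. The residuals below are made up for illustration.
residuals = [1.0, -1.0, 1.0, -1.0]  # perfectly alternating signs

num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
den = sum(e * e for e in residuals)
d = num / den
print(d)  # -> 3.0, toward the upper limit of 4 (negative autocorrelation)
```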

33. You are a new hire at Laurel Woods Real Estate, which specializes in selling foreclosed homes via public auction. Your boss has asked you to use the following data (mortgage balance, monthly payments, payments made before default, and final auction price) on a random sample of recent sales in order to estimate what the actual auction price will be.


| Loan | Monthly Payments | Payments Made | Auction Price | Loan | Monthly Payments | Payments Made | Auction Price |
|---|---|---|---|---|---|---|---|
| \$85,600 | \$985.87 | 1 | \$16,900 | \$105,200 | \$915.24 | 34 | \$52,600 |
| 115,300 | 902.56 | 33 | 75,800 | 105,900 | 905.67 | 38 | 51,900 |
| 103,100 | 736.28 | 6 | 43,900 | 94,700 | 810.70 | 25 | 43,200 |
| 84,600 | 945.45 | 9 | 16,600 | 105,600 | 891.33 | 20 | 52,600 |
| 97,600 | 821.07 | 24 | 40,700 | 104,100 | 864.38 | 7 | 42,700 |
| 104,400 | 983.27 | 26 | 63,100 | 85,700 | 1074.73 | 30 | 22,200 |
| 113,800 | 1075.54 | 19 | 72,600 | 113,600 | 871.61 | 24 | 77,000 |
| 116,400 | 1087.16 | 35 | 72,300 | 119,400 | 1021.23 | 58 | 69,000 |
| 100,000 | 900.01 | 33 | 58,100 | 90,600 | 836.46 | 3 | 35,600 |
| 92,800 | 683.11 | 36 | 37,100 | 104,500 | 1056.37 | 22 | 63,000 |

a. Carry out a global test of hypothesis to verify whether any of the regression coefficients differ from zero.

b. Do an individual test of the independent variables. Would you remove any of the variables?

c. If it seems one or more of the independent variables is not needed, remove it and work out the revised regression equation.

34. Refer to the data in the previous exercise. Add a new variable that describes the potential interaction between the loan amount and the number of payments made. Then conduct a test of hypothesis to check whether the interaction is significant.
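The interaction variable asked for in exercise 34 is just the product of the two predictors. A sketch on hypothetical data shows it entering the design matrix as one more column:

```python
import numpy as np

# Hypothetical data built so that y = 2 + 1*x1 + 3*x2 + 0.5*(x1*x2) exactly.
x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = np.array([2.0, 1, 4, 3, 6, 5])
y = 2 + 1 * x1 + 3 * x2 + 0.5 * (x1 * x2)

# The interaction term is the elementwise product of the two predictors,
# appended to the design matrix as an extra column.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef.round(3))  # ≈ [2, 1, 3, 0.5]; the interaction coefficient is last
```

The significance test then proceeds like any individual t test: divide the fitted interaction coefficient by its standard error.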

exercises.com

35. The National Institute of Standards and Technology provides several data sets to allow any user to test the accuracy of its statistical software. Go to the website http://www.itl.nist.gov/div898/strd. Select the Dataset Archives section and, within that, the Linear Regression section. You will find the names of 11 small data sets stored in ASCII format on this page. Select one and run the data through your statistical software. Compare your results with the "official" results of the federal government.

36. As described in the exercises in Chapters 12 and 13, many real estate companies and rental agencies now publish their listings on the Web. One example is Dunes Realty Company, located in Garden City and Surfside Beaches in South Carolina. Go to the website http://www.dunes.com, select Vacation Rentals, then Beach Home Search. Indicate 5 bedrooms, accommodations for 14 people, oceanfront, and no pool or floating dock; select a period in July and August; indicate that you are willing to spend \$10,000 per week; and then click on Search the Beach Homes. The output should include details on the cottages that meet your criteria. Develop a multiple linear regression equation using the rental price per week as the dependent variable and number of bedrooms, number of bathrooms, and how many people the cottage will accommodate as independent variables. Analyze the regression equation. Would you consider deleting any independent variables? What is the coefficient of determination? If you delete any of the variables, rerun the regression equation and discuss the new equation.

Data Set Exercises

37. Refer to the Real Estate data, which report information on homes sold in the Denver, Colorado, area during the last year. Use the selling price of the home as the dependent variable and determine the regression equation with number of bedrooms, size of the house, whether there is a pool, whether there is an attached garage, distance from the center of the city, and number of bathrooms as independent variables.

a. Write out the regression equation. Discuss each of the variables. For example, are you surprised that the regression coefficient for distance from the center of the city is negative? How much does a garage or a swimming pool add to the selling price of a home?


b. Determine the value of R². Interpret.

c. Develop a correlation matrix. Which independent variables have strong or weak correlations with the dependent variable? Do you see any problems with multicollinearity?

d. Conduct the global test on the set of independent variables. Interpret.

e. Conduct a test of hypothesis on each of the independent variables. Would you consider deleting any of the variables? If so, which ones?

f. Rerun the analysis until only significant regression coefficients remain in the analysis. Identify these variables.

g. Develop a histogram or a stem-and-leaf display of the residuals from the final regression equation developed in part (f). Is it reasonable to conclude that the normality assumption has been met?

h. Plot the residuals against the fitted values from the final regression equation developed in part (f). Plot the residuals on the vertical axis and the fitted values on the horizontal axis.

38. Refer to the Baseball 2005 data, which report information on the 30 Major League Baseball teams for the 2005 season. Let the number of games won be the dependent variable and the following variables be independent variables: team batting average, number of stolen bases, number of errors committed, team ERA, number of home runs, and whether the team's home field is natural grass or artificial turf.

a. Write out the regression equation. Discuss each of the variables. For example, are you surprised that the regression coefficient for ERA is negative? How many wins does playing on natural grass for a home field add to or subtract from the total wins for the season?

b. Determine the value of R². Interpret.

c. Develop a correlation matrix. Which independent variables have strong or weak correlations with the dependent variable? Do you see any problems with multicollinearity?

d. Conduct a global test on the set of independent variables. Interpret.

e. Conduct a test of hypothesis on each of the independent variables. Would you consider deleting any of the variables? If so, which ones?

f. Rerun the analysis until only significant net regression coefficients remain in the analysis. Identify these variables.

g. Develop a histogram or a stem-and-leaf display of the residuals from the final regression equation developed in part (f). Is it reasonable to conclude that the normality assumption has been met?

h. Plot the residuals against the fitted values from the final regression equation developed in part (f). Plot the residuals on the vertical axis and the fitted values on the horizontal axis.

39. Refer to the Wage data, which report information on annual wages for a sample of 100 workers. Also included are variables relating to industry, years of education, and gender for each worker. Determine the regression equation using annual wage as the dependent variable and years of education, gender, years of work experience, age in years, and whether or not the worker is a union member as independent variables.

a. Write out the regression equation. Discuss each variable.

b. Determine and interpret the R² value.

c. Develop a correlation matrix. Which independent variables have strong or weak correlations with the dependent variable? Do you see any problems with multicollinearity?

d. Conduct a global test of hypothesis on the set of independent variables. Interpret your findings. Is it reasonable to continue with the analysis or should you stop here?

e. Conduct a test of hypothesis on each of the independent variables. Would you consider deleting any of these variables? If so, which ones?

f. Rerun the analysis deleting any of the independent variables that are not significant. Delete the variables one at a time.

g. Develop a histogram or a stem-and-leaf chart of the residuals from the final regression equation. Is it reasonable to conclude that the normality assumption has been met?

h. Plot the residuals against the fitted values from the final regression equation. Plot the residuals on the vertical axis and the fitted values on the horizontal axis.

40. Refer to the CIA data, which report demographic and economic information on 46 countries. Let unemployment be the dependent variable and percent of the population over 65, life expectancy, and literacy be the independent variables.

a. Determine the regression equation using a software package. Write out the regression equation.

b. What is the value of the coefficient of determination?



c. Check the independent variables for multicollinearity.

d. Conduct a global test on the set of independent variables.

e. Test each of the independent variables to determine whether they differ from zero.

f. Would you delete any of the independent variables? If so, rerun the regression analysis and report the new equation.

g. Make a histogram of the residuals from your final regression equation. Is it reasonable to conclude that the residuals follow a normal distribution?

h. Plot the residuals versus the fitted values. Are there any problems?

Software Commands

Note: We do not show steps for all the statistical software used in this chapter. Below are the first two, which show the basic steps.

1. The MINITAB commands for the multiple regression output on page 515 are:

a. Import the data from the CD. The file name is Tbl14–1.

b. Select Stat, Regression, and then click on Regression.

c. Select Cost as the Response variable, and Temp, Insul, and Age as the Predictors, then click on OK.

2. The Excel commands to produce the multiple regression output on page 515 are:

a. Import the data from the CD. The file name is Tbl14.

b. Select Tools, then Data Analysis, highlight Regression, and click OK.

c. Make the Input Y Range A1:A21, the Input X Range B1:D21, check the Labels box, set the Output Range to F1, then click OK.



Chapter 14 Answers to Self-Review

14–1 a. \$389,500, or 389.5 (in \$000); found by

Ŷ = 2.5 + 3(40) + 4(72) − 3(10) + 0.2(20) + 1(5) = 389.5

b. The b2 of 4 shows profit will go up \$4,000 for each extra hour the restaurant is open (if none of the other variables change). The b3 of −3 implies profit will fall \$3,000 for each added mile away from the central area (if none of the other variables change).

14–2 a. The total degrees of freedom (n − 1) is 25. So the sample size is 26.

b. There are 5 independent variables.

c. There is only 1 dependent variable (profit).

d. S(Y.12345) = 1.414, found by √2. Ninety-five percent of the residuals will be between −2.828 and 2.828, found by ±2(1.414).

e. R² = .714, found by 100/140. 71.4 percent of the variation in profit is accounted for by these five variables.

f. R²adj = .643, found by

1 − [40/(26 − (5 + 1))] / [140/(26 − 1)]

14–3 a. H0: β1 = β2 = β3 = β4 = β5 = 0; H1: Not all of the β's are 0.

The decision rule is to reject H0 if F > 2.71. The computed value of F is 10, found by 20/2. So you reject H0, which indicates at least one of the regression coefficients is different from zero.

b. For variable 1: H0: β1 = 0 and H1: β1 ≠ 0

The decision rule is: Reject H0 if t < −2.086 or t > 2.086. Since 2.000 does not go beyond either of those limits, we fail to reject the null hypothesis. This regression coefficient could be zero, so we can consider dropping this variable. By parallel logic the null hypothesis is rejected for variables 3 and 4.

c. We should consider dropping variables 1, 2, and 5. Variable 5 has the smallest absolute value of t, so delete it first and refigure the regression analysis.
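The arithmetic behind Self-Reviews 14–2 and 14–3 can be checked in a few lines, using the figures the answers imply (SSR = 100, SSE = 40, n = 26, k = 5):

```python
from math import sqrt

# Figures implied by Self-Reviews 14-2 and 14-3: SSR = 100, SSE = 40,
# so SS total = 140, with n = 26 observations and k = 5 predictors.
ssr, sse, n, k = 100.0, 40.0, 26, 5
sst = ssr + sse

r2 = ssr / sst                                        # coefficient of determination
se = sqrt(sse / (n - (k + 1)))                        # multiple standard error of estimate
r2_adj = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))  # adjusted R^2
f = (ssr / k) / (sse / (n - (k + 1)))                 # global F statistic

print(round(r2, 3), round(se, 3), round(r2_adj, 3), round(f, 1))
# -> 0.714 1.414 0.643 10.0
```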

14–4 a. Ŷ = 15.7625 + 0.4415X1 + 3.8598X2

Ŷ = 15.7625 + 0.4415(30) + 3.8598(1) = 32.87

b. Female agents make \$3,860 more than male agents.

c. H0: β3 = 0

H1: β3 ≠ 0

df = 17; reject H0 if t < −2.110 or t > 2.110

t = (3.8598 − 0)/1.4724 = 2.621

Reject H0. Gender should be included in the regression equation.



A Review of Chapters 13 and 14

Simple regression and correlation examine the relationship between two variables.

Multiple regression and correlation are concerned with the relationship between two or more independent variables and the dependent variable.

Computers are invaluable in multiple regression and correlation.

This section is a review of the major concepts and terms introduced in Chapters 13 and 14. Chapter 13 noted that the strength of the relationship between the independent variable and the dependent variable can be measured by the coefficient of correlation. The coefficient of correlation is designated by the letter r. It can assume any value between −1.00 and 1.00 inclusive. Coefficients of −1.00 and 1.00 indicate a perfect relationship, and 0 indicates no relationship. A value near 0, such as −.14 or .14, indicates a weak relationship. A value near −1 or 1, such as −.90 or .90, indicates a strong relationship. Squaring r gives the coefficient of determination, also called r². It indicates the proportion of the total variation in the dependent variable explained by the independent variable.

Likewise, the strength of the relationship between several independent variables and a dependent variable is measured by the coefficient of multiple determination, R2. It measures the proportion of the variation in Y explained by two or more independent variables.

The linear relationship in the simple case involving one independent variable and one dependent variable is described by the equation Ŷ = a + bX. For three independent variables, X1, X2, and X3, the multiple regression equation is

Ŷ = a + b1X1 + b2X2 + b3X3

Solving for b1, b2, b3, . . . , bk would involve tedious calculations. Fortunately, this type of problem can be quickly solved using one of the many statistical software packages and spreadsheet packages. Various measures, such as the coefficient of determination, the multiple standard error of estimate, the results of the global test, and the test of the individual variables, are reported in the output of most computer software programs.
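The "tedious calculations" referred to above are the normal equations, b = (X'X)⁻¹X'y. A sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data generated from y = 4 + 2*x1 - 1*x2 (no noise).
x1 = np.array([1.0, 2, 3, 4, 5])
x2 = np.array([5.0, 3, 4, 1, 2])
y = 4 + 2 * x1 - 1 * x2

# Normal equations: b = (X'X)^-1 X'y, with a leading column of 1s
# for the intercept a.
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b.round(6))  # ≈ [4, 2, -1]
```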

Glossary

Chapter 13

Coefficient of correlation A measure of the strength of association between two variables.

Coefficient of determination The proportion of the total variation in the dependent variable that is explained by the independent variable. It can assume any value between 0 and 1.00 inclusive. A coefficient of .82 indicates that 82 percent of the variation in Y is accounted for by X. This coefficient is computed by squaring the coefficient of correlation, r.

Correlation analysis A group of statistical techniques used to measure the strength of the relationship between two variables.

Covariance The variance of X and Y together.

Dependent variable The variable that is being predicted or estimated.

Independent variable A variable that provides the basis for estimation.

Least squares method A technique used to arrive at the regression equation by minimizing the sum of the squares of the vertical distances between the actual Y values and the predicted Y values.

Linear regression equation A mathematical equation that defines the relationship between two variables. It has the form Ŷ = a + bX. It is used to predict Ŷ based on a selected X value. Y is the dependent variable and X the independent variable.

Scatter diagram A chart that visually depicts the relationship between two variables.

Standard error of estimate Measures the dispersion of the actual Y values about the regression line. It is reported in the same units as the dependent variable.

t test of significance of r A formula to answer the question: Is the correlation in the population from which the sample was selected zero? The test statistic is t, and the number of degrees of freedom is n − 2.

t = r√(n − 2) / √(1 − r²)   [13–2]

Chapter 14

Autocorrelation Correlation of successive residuals. This condition frequently occurs when time is involved in the analysis.

Correlation matrix A listing of all possible simple coefficients of correlation. A correlation matrix includes the correlations between each of the independent variables and the dependent variable, as well as those among all the independent variables.

Dummy variable A qualitative variable. It can assume only one of two possible outcomes.

Global test A test used to determine if any of the set of independent variables have regression coefficients different from zero.

Homoscedasticity The standard error of estimate is the same for all fitted values of the dependent variable.

Individual test A test to determine if a particular independent variable has a regression coefficient different from zero.

Interaction The case in which one independent variable (such as X2) affects the relationship between another independent variable (X1) and the dependent variable (Y).

Multicollinearity A condition that occurs in multiple regression analysis if the independent variables are themselves correlated.

Multiple regression equation The relationship in the form of a mathematical equation between several independent variables and a dependent variable. The general form is Ŷ = a + b1X1 + b2X2 + b3X3 + . . . + bkXk. It is used to estimate Y given k independent variables, Xi.



Qualitative variable A nominal-scale variable that is coded to assume only one of two possible outcomes. For example, a person is considered either employed or unemployed.

Residual The difference between the actual value of the dependent variable and the estimated value of the dependent variable, that is, Y − Ŷ.

Stepwise regression A step-by-step process for finding the regression equation. Only independent variables with nonzero regression coefficients enter the regression equation. Independent variables are added one at a time to the regression equation.

Variance inflation factor A test used to detect correlation among independent variables.
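A variance inflation factor can be computed directly from its definition: regress one independent variable on the others and take 1/(1 − R²). A sketch with hypothetical columns, where x2 nearly duplicates x1:

```python
import numpy as np

def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2) from regressing
    column j on the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    r2 = 1 - resid.dot(resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2)

# Hypothetical predictors: x2 is nearly a copy of x1, so its VIF is large.
x1 = np.array([1.0, 2, 3, 4, 5, 6])
x2 = x1 + np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
x3 = np.array([3.0, 1, 4, 1, 5, 9])
X = np.column_stack([x1, x2, x3])
print(vif(X, 0))  # large, flagging multicollinearity between x1 and x2
```

A common rule of thumb treats a VIF above 10 as a sign of troublesome multicollinearity.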

Exercises

Part I—Multiple Choice

1. The strength of the association between a set of independent variables X and a dependent variable Y is measured by the

a. Coefficient of correlation.

b. Coefficient of determination.

c. Standard error of estimate.

d. All of the above.

2. The percent of total variation of the dependent variable Y explained by the set of independent variables X is measured by the

a. Coefficient of correlation.

b. Coefficient of determination.

c. Standard error of estimate.

d. Multicollinearity.

3. A coefficient of correlation is computed to be 0.90. This result means:

a. The relationship between two variables is weak.

b. The relationship between two variables is strong and positive.

c. The relationship between two variables is strong and negative.

d. The relationship between four variables is strong.

4. The coefficient of determination was computed to be .38 in a problem involving one independent variable and one dependent variable. This result means

a. The relationship between the two variables is negative.

b. The correlation coefficient is also .38.

c. 38 percent of the total variation is explained by the independent variable.

d. 38 percent of the total variation is explained by the dependent variable.

5. What is the relationship between the coefficient of correlation and the coefficient of determination?

a. They are unrelated.

b. The coefficient of determination is the coefficient of correlation squared.

c. The coefficient of determination is the square root of the coefficient of correlation.

d. They are equal.

6. Multicollinearity exists when

a. Independent variables are correlated less than −0.70 or more than 0.70.

b. An independent variable is strongly associated with a dependent variable.

c. There is only one independent variable.

d. The relationship between the dependent and independent variables is nonlinear.

7. If "time" is used as the independent variable in a simple linear regression analysis, which of the following assumptions could be violated?

a. There is a linear relationship between the independent and dependent variables.

b. The residual variation is the same for all fitted values of Y.

c. The residuals are normally distributed.

d. Successive observations of the dependent variable are uncorrelated.

8. In multiple regression, when the global test of significance is rejected, we can conclude:

a. All of the net sample regression coefficients are equal to zero.

b. All of the sample regression coefficients are not equal to zero.

c. At least one sample regression coefficient is not equal to zero.

d. The regression equation intersects the Y-axis at zero.

9. A residual is defined as:

a. Y − Ŷ.

b. Error sum of squares.

c. Regression sum of squares.

d. Type I error.

10. What test statistic is used for a global test of significance?

a. z statistic.

b. t statistic.



c. Chi-square statistic.

d. F statistic.

Part II—Problems

11. The accounting department at Crate and Barrel wishes to estimate the profit for each of the chain's many stores on the basis of the number of employees in the store, overhead costs, average markup, and theft loss. A few statistics from the stores are:

| Store | Net Profit (\$ thousands) | Number of Employees | Overhead Cost (\$ thousands) | Average Markup (percent) | Theft Loss (\$ thousands) |
|---|---|---|---|---|---|
| 1 | \$846 | 143 | \$79 | 69% | \$52 |
| 2 | 513 | 110 | 64 | 50 | 45 |

a. The dependent variable is _____.

b. The general equation for this problem is ______.

c. The multiple regression equation was computed to be Ŷ = 67 + 8X1 − 10X2 + 0.004X3 − 3X4.

What is the predicted net profit for a store with 112 employees, an overhead cost of \$65,000, a markup rate of 50 percent, and a loss from theft of \$50,000?

d. Suppose R² was computed to be .86. Explain what this value means.

e. Suppose that the multiple standard error of estimate was 3 (in \$ thousands). Explain what this means in this problem.

12. Quick-print firms in a large downtown business area spend most of their advertising dollars on displays on bus benches. A research project involves predicting monthly sales based on the annual amount spent on placing ads on bus benches. A sample of quick-print firms revealed these advertising expenses and sales:

| Firm | Annual Bus Bench Advertising (\$ thousands) | Monthly Sales (\$ thousands) |
|---|---|---|
| A | 2 | 10 |
| B | 4 | 40 |
| C | 5 | 30 |
| D | 7 | 50 |
| E | 3 | 20 |

a. Draw a scatter diagram.

b. Determine the coefficient of correlation.

c. What is the coefficient of determination?

d. Compute the regression equation.

e. Estimate the monthly sales of a quick-print firm that spends \$4,500 on bus bench advertisements.
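With only five firms, parts (b) through (e) can be verified directly from the table above (numpy, advertising in \$ thousands):

```python
import numpy as np

# The five quick-print firms from the table above ($ thousands).
ads = np.array([2.0, 4, 5, 7, 3])        # annual bus bench advertising
sales = np.array([10.0, 40, 30, 50, 20])  # monthly sales

r = np.corrcoef(ads, sales)[0, 1]            # coefficient of correlation
b = r * sales.std(ddof=1) / ads.std(ddof=1)  # slope of the regression line
a = sales.mean() - b * ads.mean()            # intercept
estimate = a + b * 4.5                       # sales at $4,500 of advertising

print(round(r, 3), round(b, 2), round(a, 2), round(estimate, 2))
# -> 0.904 7.43 -1.22 32.23
```

So the coefficient of determination is r² ≈ 0.82, and a firm spending \$4,500 would be estimated at about \$32,230 in monthly sales.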

13. The following ANOVA output is given.

| SOURCE | Sum of Squares | DF | MS |
|---|---|---|---|
| Regression | 1050.8 | 4 | 262.70 |
| Error | 83.8 | 20 | 4.19 |
| Total | 1134.6 | 24 |  |

| Predictor | Coef | St.Dev. | t-ratio |
|---|---|---|---|
| Constant | 70.06 | 2.13 | 32.89 |
| X1 | 0.42 | 0.17 | 2.47 |
| X2 | 0.27 | 0.21 | 1.29 |
| X3 | 0.75 | 0.30 | 2.50 |
| X4 | 0.42 | 0.07 | 6.00 |

a. Compute the coefficient of determination.

b. Compute the multiple standard error of estimate.

c. Conduct a test of hypothesis to determine whether any of the regression coefficients are different from zero.

d. Conduct a test of hypothesis on the individual regression coefficients. Can any of the variables be deleted?
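Parts (a) through (c) follow directly from the ANOVA table: R² = SSR/SS total, the standard error is √MSE, and the global F statistic is MSR/MSE:

```python
from math import sqrt

# From the ANOVA table for problem 13.
ss_regression, df_regression = 1050.8, 4
ss_error, df_error = 83.8, 20
ss_total = 1134.6

r2 = ss_regression / ss_total   # part (a): coefficient of determination
se = sqrt(ss_error / df_error)  # part (b): multiple standard error of estimate
f = (ss_regression / df_regression) / (ss_error / df_error)  # part (c): global F

print(round(r2, 3), round(se, 2), round(f, 1))
# -> 0.926 2.05 62.7
```

For part (d), each t-ratio in the second table is compared against the critical t with 20 degrees of freedom; X2, with the smallest t-ratio, is the natural candidate for deletion.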



Cases

A. The Century National Bank

Refer to the Century National Bank data. Using checking account balance as the dependent variable and using as independent variables the number of ATM transactions, the number of other services used, whether the individual has a debit card, and whether interest is paid on the particular account, write a report indicating which of the variables seem related to the account balance and how well they explain the variation in account balances. Should all of the independent variables proposed be used in the analysis or can some be dropped?

B. Terry and Associates: The Time to Deliver Medical Kits

Terry and Associates is a specialized medical testing center in Denver, Colorado. One of the firm's major sources of revenue is a kit used to test for elevated amounts of lead in the blood. Workers in auto body shops, those in the lawn care industry, and commercial house painters are exposed to large amounts of lead and thus must be randomly tested. It is expensive to conduct the test, so the kits are delivered on demand to a variety of locations throughout the Denver area.

Kathleen Terry, the owner, is concerned about setting appropriate costs for each delivery. To investigate, Ms. Terry gathered information on a random sample of 50 recent deliveries. Factors thought to be related to the cost of delivering a kit were:

Prep The time in minutes between when the customized order is phoned into the company and when it is ready for delivery.

Delivery The actual travel time in minutes from Terry’s plant to the customer.

Mileage The distance in miles from Terry’s plant to the customer.

| Sample Number | Cost | Prep | Delivery | Mileage | Sample Number | Cost | Prep | Delivery | Mileage |
|---|---|---|---|---|---|---|---|---|---|
| 1 | \$32.60 | 10 | 51 | 20 | 26 | 20.36 | 9 | 19 | 7 |
| 2 | 23.37 | 11 | 33 | 12 | 27 | 21.16 | 3 | 19 | 8 |
| 3 | 31.49 | 6 | 47 | 19 | 28 | 25.95 | 10 | 45 | 14 |
| 4 | 19.31 | 9 | 18 | 8 | 29 | 18.76 | 12 | 12 | 5 |
| 5 | 28.35 | 8 | 88 | 17 | 30 | 18.76 | 8 | 16 | 5 |
| 6 | 22.63 | 9 | 20 | 11 | 31 | 24.29 | 7 | 35 | 13 |
| 7 | 22.63 | 9 | 39 | 11 | 32 | 19.56 | 2 | 12 | 6 |
| 8 | 21.53 | 10 | 23 | 10 | 33 | 22.63 | 8 | 30 | 11 |
| 9 | 21.16 | 13 | 20 | 8 | 34 | 21.16 | 5 | 13 | 8 |
| 10 | 21.53 | 10 | 32 | 10 | 35 | 21.16 | 11 | 20 | 8 |
| 11 | 28.17 | 5 | 35 | 16 | 36 | 19.68 | 5 | 19 | 8 |
| 12 | 20.42 | 7 | 23 | 9 | 37 | 18.76 | 5 | 14 | 7 |
| 13 | 21.53 | 9 | 21 | 10 | 38 | 17.96 | 5 | 11 | 4 |
| 14 | 27.55 | 7 | 37 | 16 | 39 | 23.37 | 10 | 25 | 12 |
| 15 | 23.37 | 9 | 25 | 12 | 40 | 25.22 | 6 | 32 | 14 |
| 16 | 17.10 | 15 | 15 | 6 | 41 | 27.06 | 8 | 44 | 16 |
| 17 | 27.06 | 13 | 34 | 15 | 42 | 21.96 | 9 | 28 | 9 |
| 18 | 15.99 | 8 | 13 | 4 | 43 | 22.63 | 8 | 31 | 11 |
| 19 | 17.96 | 12 | 12 | 4 | 44 | 19.68 | 7 | 19 | 8 |
| 20 | 25.22 | 6 | 41 | 14 | 45 | 22.76 | 8 | 28 | 10 |
| 21 | 24.29 | 3 | 28 | 13 | 46 | 21.96 | 13 | 18 | 9 |
| 22 | 22.76 | 4 | 26 | 10 | 47 | 25.95 | 10 | 32 | 14 |
| 23 | 28.17 | 9 | 54 | 16 | 48 | 26.14 | 8 | 44 | 15 |
| 24 | 19.68 | 7 | 18 | 8 | 49 | 24.29 | 8 | 34 | 13 |
| 25 | 25.15 | 6 | 50 | 13 | 50 | 24.35 | 3 | 33 | 12 |

1. Develop a multiple linear regression equation that describes the relationship between the cost of delivery and the other variables. Do these three variables explain a reasonable amount of the variation in the dependent variable? Estimate the delivery cost for a kit that takes 10 minutes for preparation, takes 30 minutes to deliver, and must cover a distance of 14 miles.

2. Test to determine that at least one net regression coefficient differs from zero. Also test to see whether any of the variables can be dropped from the analysis. If some of the variables can be dropped, rerun the regression equation until only significant variables are included. Write a brief report interpreting the final regression equation.
