Page 323 - DMTH404_STATISTICS
Unit 23: Regression Analysis
Introduction
If the coefficient of correlation calculated for bivariate data (X_i, Y_i), i = 1, 2, ..., n, is reasonably
high and a cause-and-effect relation is also believed to exist between the variables, the next
logical step is to obtain a functional relation between them. This functional relation is
known as the regression equation in statistics. Since the coefficient of correlation is a measure of the
degree of linear association of the variables, we shall discuss only linear regression equations.
This does not, however, imply that non-linear regression equations do not exist.
Regression equations are useful for predicting the value of the dependent variable for a given
value of the independent variable. As pointed out earlier, the nature of a regression equation is
different from that of a mathematical equation. For example, if Y = 10 + 2X is a mathematical
equation, then Y is exactly equal to 20 when X = 5. However, if Y = 10 + 2X is a
regression equation, then Y = 20 is only the average value of Y when X = 5.
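The two readings of Y = 10 + 2X can be illustrated with a minimal sketch. The five observed values at X = 5 are invented purely for illustration and do not come from the text.

```python
# Hypothetical illustration of Y = 10 + 2X read in two ways.

def line(x):
    """Value of 10 + 2X at a given x."""
    return 10 + 2 * x

# Read as a mathematical equation: X = 5 gives exactly one value of Y.
exact_y = line(5)

# Read as a regression equation: individual observations at X = 5 scatter
# around the line, which gives only their average.
# These five values are invented for illustration.
observed_y_at_5 = [17.0, 19.0, 20.0, 21.0, 23.0]
average_y_at_5 = sum(observed_y_at_5) / len(observed_y_at_5)

print(exact_y)         # 20
print(average_y_at_5)  # 20.0
```

No single observation needs to equal 20; only their mean does.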
The term regression was first introduced by Sir Francis Galton in 1877. In his study of the
relationship between heights of fathers and sons, he found that tall fathers were likely to have
tall sons and vice-versa. However, the mean height of sons of tall fathers was lower than the
mean height of their fathers and the mean height of sons of short fathers was higher than the
mean height of their fathers. In this way, a tendency of the human race to regress or to return to
a normal height was observed. Sir Francis Galton referred to this tendency of returning to the
mean height of all men as regression in his research paper, "Regression towards mediocrity in
hereditary stature". The term 'regression', which originated in this particular context, is now used in
various fields of study, even where no regressive tendency exists.
23.1 Two Lines of Regression
For bivariate data (X_i, Y_i), i = 1, 2, ..., n, we can take either X or Y as the independent variable. If X
is the independent variable, then we can estimate the average values of Y for a given value of X. The
relation used for such estimation is called the regression of Y on X. If, on the other hand, Y is used for
estimating the average values of X, the relation is called the regression of X on Y. For
bivariate data, there will therefore always be two lines of regression. It will be shown later that these two
lines are, in general, different, i.e., one cannot be derived from the other by a mere transfer of terms, because
the derivation of each line depends on a different set of assumptions.
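The claim that the two lines differ can be checked numerically. The sketch below assumes the usual least-squares estimates (slope = S_xy / S_xx for Y on X, and S_xy / S_yy for X on Y), which this section has not yet derived. The data and all variable names are hypothetical.

```python
# Hypothetical data; least-squares formulas are assumed, not derived here.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sums of squares and cross-products about the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)

# Regression of Y on X: Y = a_yx + b_yx * X
b_yx = s_xy / s_xx
a_yx = mean_y - b_yx * mean_x

# Regression of X on Y: X = a_xy + b_xy * Y
b_xy = s_xy / s_yy
a_xy = mean_x - b_xy * mean_y

# Slope of X in terms of Y obtained by merely solving the Y-on-X line for X
slope_by_transfer = 1 / b_yx

print(f"Y on X: Y = {a_yx:.3f} + {b_yx:.3f} X")
print(f"X on Y: X = {a_xy:.3f} + {b_xy:.3f} Y")
print(f"Slope from transfer of terms: {slope_by_transfer:.3f}")
```

With these numbers the regression of X on Y has slope 1.0, whereas rearranging the Y-on-X line gives a slope of about 1.667, so the second line is not a rearrangement of the first. The two would coincide only if the points lay exactly on a straight line.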
23.1.1 Line of Regression of Y on X
The general form of the line of regression of Y on X is Y_Ci = a + bX_i, where Y_Ci denotes the
average (or predicted, or calculated) value of Y for a given value of X = X_i. This line has two
constants, a and b. The constant a is the average value of Y when X = 0. Geometrically,
it is the intercept of the line on the Y-axis. The constant b, which gives the average rate of change
of Y per unit change in X, is known as the regression coefficient.
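The roles of the two constants can be seen in a short sketch. It reuses the illustrative values a = 10 and b = 2 from the introduction, and the function name y_calc is an assumption made for this example.

```python
# Illustrative constants taken from the example Y = 10 + 2X.
a = 10  # intercept: average value of Y when X = 0
b = 2   # regression coefficient: average change in Y per unit change in X

def y_calc(x):
    """Calculated (average) value of Y for a given X on the line Y = a + bX."""
    return a + b * x

intercept = y_calc(0)                     # equals a
change_per_unit = y_calc(6) - y_calc(5)   # equals b, for any unit step in X

print(intercept)        # 10
print(change_per_unit)  # 2
```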
The above line is fully determined once the values of a and b are known. These values are estimated from the
observed data (X_i, Y_i), i = 1, 2, ..., n.
Note: It is important to distinguish between Y_Ci and Y_i. Whereas Y_i is the observed
value, Y_Ci is a value calculated from the regression equation.
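To make the distinction between observed and calculated values concrete, the following sketch lists both side by side. The constants a and b are simply taken as given (they are not estimated from these data), and the observed values are hypothetical.

```python
# Constants of the line are assumed known here; they are not estimated.
a = 10
b = 2

# Hypothetical observations (X_i, Y_i)
x_values = [1, 2, 3, 4, 5]
y_observed = [13, 13, 17, 17, 20]

# Calculated values Y_Ci = a + b * X_i, one for each X_i
y_calculated = [a + b * x for x in x_values]

for x, y_obs, y_cal in zip(x_values, y_observed, y_calculated):
    print(f"X = {x}: observed Y = {y_obs}, calculated Y = {y_cal}")
```

The calculated values all lie on the line, while the observed values generally do not.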
Using the regression Y_Ci = a + bX_i, we can obtain Y_C1, Y_C2, ..., Y_Cn corresponding to the X values
X_1, X_2, ..., X_n respectively. The difference between the observed and calculated value for a
LOVELY PROFESSIONAL UNIVERSITY 315