Page 323 - DMTH404_STATISTICS

Unit 23: Regression Analysis



            Introduction


            If the coefficient of correlation calculated for bivariate data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, is
            reasonably high and a cause-and-effect relation is also believed to exist between the variables, the
            next logical step is to obtain a functional relation between them. This functional relation is
            known as a regression equation in statistics. Since the coefficient of correlation is a measure of the
            degree of linear association between the variables, we shall discuss only linear regression equations.
            This does not, however, imply that non-linear regression equations do not exist.

            The regression equations are useful for predicting the value of the dependent variable for a given
            value of the independent variable. As pointed out earlier, the nature of a regression equation is
            different from that of a mathematical equation. For example, if Y = 10 + 2X is a mathematical
            equation, it implies that Y is exactly equal to 20 when X = 5. However, if Y = 10 + 2X is a
            regression equation, then Y = 20 is the average value of Y when X = 5.
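
The distinction can be illustrated with a small simulation. This is only a sketch: the normally distributed scatter, its standard deviation `noise_sd`, the seed, and the helper name `simulate_y` are assumptions made for illustration and are not part of the text. Individual values of Y at X = 5 differ from 20, but their average is close to 20.

```python
import random

def simulate_y(x, rng, a=10.0, b=2.0, noise_sd=3.0):
    """Return one simulated observation of Y at the given x.

    The systematic part is the regression value a + b*x; the random
    scatter around it (normal, sd = noise_sd) is an assumption.
    """
    return a + b * x + rng.gauss(0.0, noise_sd)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Many observations of Y, all taken at X = 5.
ys_at_5 = [simulate_y(5.0, rng) for _ in range(10_000)]
mean_y = sum(ys_at_5) / len(ys_at_5)

print("first three observed Y values:", [round(y, 2) for y in ys_at_5[:3]])
print("average of Y at X = 5:", round(mean_y, 2))
```
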
            The term regression was first introduced by Sir Francis Galton in 1877. In his study of the
            relationship between the heights of fathers and sons, he found that tall fathers were likely to have
            tall sons and short fathers were likely to have short sons. However, the mean height of the sons of
            tall fathers was lower than the mean height of their fathers, and the mean height of the sons of
            short fathers was higher than the mean height of their fathers. In this way, a tendency of the human
            race to regress, or return, to a normal height was observed. Galton referred to this tendency of
            returning to the mean height of all men as regression in his research paper, "Regression towards
            mediocrity in hereditary stature". The term 'regression', which originated in this particular
            context, is now used in various fields of study, even where no regressive tendency exists.

            23.1 Two Lines of Regression

            For bivariate data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, we can take either X or Y as the independent
            variable. If X is the independent variable, we can estimate the average value of Y for a given value
            of X. The relation used for such estimation is called the regression of Y on X. If, on the other
            hand, Y is used for estimating the average value of X, the relation is called the regression of X on
            Y. For bivariate data, there will therefore always be two lines of regression. It will be shown later
            that these two lines are different, i.e., one cannot be derived from the other by a mere transfer of
            terms, because the derivation of each line depends on a different set of assumptions.
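
A numerical sketch can make the claim concrete. It assumes that each line is fitted by ordinary least squares, a method this section has not yet derived, and it uses hypothetical data; the helper name `least_squares_line` is also an illustration only. The slope obtained by rearranging the fitted line of Y on X is compared with the slope of the separately fitted line of X on Y.

```python
def least_squares_line(u, v):
    """Return (intercept, slope) of the least-squares line predicting v from u."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    s_uv = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    s_uu = sum((ui - mean_u) ** 2 for ui in u)
    slope = s_uv / s_uu
    return mean_v - slope * mean_u, slope

# Hypothetical bivariate data, not taken from the text.
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 7]

a_yx, b_yx = least_squares_line(x, y)  # regression of Y on X: Y = a_yx + b_yx * X
a_xy, b_xy = least_squares_line(y, x)  # regression of X on Y: X = a_xy + b_xy * Y

# Transferring terms in Y = a_yx + b_yx * X would give X a slope of 1 / b_yx on Y.
rearranged_slope = 1.0 / b_yx

print("Y on X: intercept", round(a_yx, 4), "slope", round(b_yx, 4))
print("X on Y: intercept", round(a_xy, 4), "slope", round(b_xy, 4))
print("slope from rearranging Y on X:", round(rearranged_slope, 4))
```

With these data the fitted slope of X on Y differs from the rearranged slope, so the two lines do not coincide.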

            23.1.1 Line of Regression of Y on X

            The general form of the line of regression of Y on X is $Y_{Ci} = a + bX_i$, where $Y_{Ci}$ denotes the
            average (or predicted, or calculated) value of Y for a given value of $X = X_i$. This line has two
            constants, a and b. The constant a is defined as the average value of Y when X = 0. Geometrically,
            it is the intercept of the line on the Y-axis. The constant b, which gives the average rate of
            change of Y per unit change in X, is known as the regression coefficient.
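
The roles of the two constants can be checked with a few lines of code. The sketch below reuses the values a = 10 and b = 2 from the example in the Introduction; the function name `calculated_y` is an assumption made for illustration.

```python
def calculated_y(a, b, x):
    """Calculated (average) value of Y on the line of regression of Y on X."""
    return a + b * x

a, b = 10.0, 2.0  # constants from the example Y = 10 + 2X

# The intercept a is the calculated value of Y at X = 0.
at_zero = calculated_y(a, b, 0.0)

# The regression coefficient b is the change in the calculated value
# of Y when X increases by one unit (here from 5 to 6).
change_per_unit = calculated_y(a, b, 6.0) - calculated_y(a, b, 5.0)

print("calculated Y at X = 0:", at_zero)
print("change in calculated Y per unit change in X:", change_per_unit)
```
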

            The above line is known if the values of a and b are known. These values are estimated from the
            observed data $(X_i, Y_i)$, $i = 1, 2, \ldots, n$.
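
One common way of estimating a and b from observed data is the method of least squares. The text has not yet stated which method it uses, so the sketch below should be read as an assumption, and the data and the function name `fit_y_on_x` are hypothetical. It also lists each observed value of Y next to the value calculated from the fitted line.

```python
def fit_y_on_x(xs, ys):
    """Estimate (a, b) for the line Y_C = a + b*X by least squares (assumed method)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical observed data, not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [12, 15, 15, 19, 19]

a, b = fit_y_on_x(xs, ys)
calculated = [a + b * x for x in xs]  # calculated values of Y, one per observed X

print("a =", round(a, 4), " b =", round(b, 4))
for x, y_obs, y_calc in zip(xs, ys, calculated):
    print("X =", x, " observed Y =", y_obs, " calculated Y =", round(y_calc, 2))
```
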



               Note    It is important to distinguish between $Y_{Ci}$ and $Y_i$. Whereas $Y_i$ is the observed
              value, $Y_{Ci}$ is a value calculated from the regression equation.
            Using the regression $Y_{Ci} = a + bX_i$, we can obtain $Y_{C1}, Y_{C2}, \ldots, Y_{Cn}$ corresponding to the
            X values $X_1, X_2, \ldots, X_n$ respectively. The difference between the observed and calculated value for a


                                             LOVELY PROFESSIONAL UNIVERSITY                                  315