Chapter 6 Correlation
In previous chapters, you have learned how to summarize large datasets and describe changes in variables. This chapter introduces the concept of **correlation**, which allows us to examine the relationship or association between two different variables. Understanding correlation helps us determine if the value of one variable tends to change when the value of another variable changes, whether they move in the same or opposite directions, and the strength of this relationship.
Introduction
Having covered data summarization and describing changes in single variables, this chapter focuses on exploring the connection between two distinct variables. The goal is to understand how two variables might influence or move together.
Types Of Relationship
Relationships between variables can take various forms. Some might suggest a cause-and-effect link (causation), like the relationship between the price of a commodity and the quantity demanded. Lower prices often lead to higher demand, while higher prices lead to lower demand. Similarly, low rainfall can be related to low agricultural productivity.
However, correlation does not necessarily imply causation. Some relationships might be coincidental, like the arrival of migratory birds and birth rates in a locality. The relationship between shoe size and the amount of money in your pocket is another such example of a lack of logical connection.
In other cases, a third variable might be influencing the relationship between two variables. For instance, a high number of ice cream sales might coincide with a higher number of deaths due to drowning. This isn't because eating ice cream causes drowning. Instead, a third factor, like rising temperature, leads both to increased ice cream sales and more people going swimming, potentially increasing drowning incidents. Thus, temperature is the underlying cause of the observed relationship between ice cream sales and drowning deaths.
What Does Correlation Measure?
Correlation is a statistical measure that studies and quantifies the **direction** and **intensity** of the relationship between variables. It specifically measures **covariation**, meaning how two variables tend to move together, but not necessarily a cause-and-effect relationship (causation).
If a correlation exists between two variables, say X and Y, it means that when the value of X changes in a certain direction, the value of Y tends to change in a predictable way – either in the same direction (positive correlation) or the opposite direction (negative correlation).
For simplicity, correlation analysis often focuses on **linear relationships**, where the movement between the two variables can be approximated by a straight line when plotted on a graph.
Types Of Correlation
Correlation is commonly classified into two main types:
- Positive Correlation: Occurs when two variables move together in the same direction. If one variable increases, the other also tends to increase; if one decreases, the other also tends to decrease. Examples: When income rises, consumption tends to rise; higher temperatures are related to increased sales of ice cream.
- Negative Correlation: Occurs when two variables move in opposite directions. If one variable increases, the other tends to decrease, and vice versa. Examples: When the price of a good falls, its demand tends to increase; spending more time studying is related to a decrease in the chances of failing.
Techniques For Measuring Correlation
Several tools are used to study and measure correlation:
Scatter Diagram
A scatter diagram is a simple but useful graphical technique for visualizing the form of the relationship between two variables. Pairs of values for the two variables are plotted as points on a graph. The pattern and closeness of these points give a visual impression of the nature and strength of the relationship.
- If the points cluster around an upward-sloping line (Fig. 6.1), it indicates a **positive correlation** (variables move in the same direction).
- If the points cluster around a downward-sloping line (Fig. 6.2), it indicates a **negative correlation** (variables move in opposite directions).
- If the points are widely dispersed with no clear pattern (Fig. 6.3), it suggests **no correlation** (or no linear correlation).
- If the points lie exactly on an upward-sloping straight line (Fig. 6.4), it shows **perfect positive correlation**.
- If the points lie exactly on a downward-sloping straight line (Fig. 6.5), it shows **perfect negative correlation**.
The degree of closeness of the points to a line indicates the strength of the correlation: closer points suggest stronger correlation, dispersed points suggest weaker correlation. If the points follow a straight line, the relationship is linear. If they follow a curved pattern (Fig. 6.6, Fig. 6.7), the relationship is non-linear.
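To see how such a diagram is produced in practice, here is a minimal Python sketch using matplotlib; the data values are invented purely for illustration and are not taken from the chapter.

```python
# Minimal sketch of a scatter diagram in matplotlib.
# The data below are invented for illustration only.
import matplotlib.pyplot as plt

x = [10, 20, 30, 40, 50, 60, 70]   # e.g. rainfall (cm)
y = [15, 18, 27, 31, 38, 40, 49]   # e.g. agricultural output

plt.scatter(x, y)                  # one point per (X, Y) pair
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter diagram")
plt.show()
```

If the plotted points drift upward together, the diagram suggests positive correlation; a downward drift suggests negative correlation.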
Karl Pearson’s Coefficient Of Correlation
This is a numerical measure that provides a precise value for the degree of **linear** relationship between two quantitative variables, X and Y. It is also known as the product moment correlation coefficient and is denoted by $r$.
The formula for Karl Pearson's coefficient of correlation is:
$r = \frac{\text{Cov}(X,Y)}{\sigma_x \sigma_y}$
Where Cov(X,Y) is the covariance between X and Y, and $\sigma_x$ and $\sigma_y$ are the standard deviations of X and Y, respectively.
Covariance is given by $\text{Cov}(X,Y) = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N}$, where $\bar{X}$ and $\bar{Y}$ are the means, and N is the number of observations.
Alternative formulas for $r$ based on raw data or deviations:
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2} \sqrt{\sum (Y - \bar{Y})^2}}$
$r = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{N \sum X^2 - (\sum X)^2} \sqrt{N \sum Y^2 - (\sum Y)^2}}$
It is essential to use Pearson's $r$ only when a scatter diagram suggests a linear relationship. Calculating $r$ for a non-linear relationship can be misleading.
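For readers who want to verify hand calculations, below is a minimal Python sketch of this definitional formula (the function name `pearson_r` is only illustrative); it follows the covariance and standard deviation steps directly rather than calling a statistics library.

```python
# Minimal sketch of Karl Pearson's r computed from its definition,
# r = Cov(X, Y) / (sigma_x * sigma_y). Function name is illustrative.
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (sigma_x * sigma_y)
```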
Properties Of Correlation Coefficient
The correlation coefficient $r$ has several important properties:
- It is a pure number and has no unit of measurement.
- A negative value of $r$ indicates an inverse or negative relationship (variables move in opposite directions).
- A positive value of $r$ indicates a positive relationship (variables move in the same direction).
- The value of $r$ always lies between -1 and +1 ($-1 \le r \le 1$). Values outside this range indicate a calculation error.
- The magnitude of $r$ is unaffected by changes in origin or scale of the variables. This property simplifies calculations using methods like step deviation.
- If $r = 0$, the variables are **uncorrelated**, meaning there is no linear relationship between them. However, a non-linear relationship might still exist.
- If $r = 1$ or $r = -1$, there is **perfect correlation**, indicating an exact linear relationship.
- A value of $r$ close to +1 or -1 indicates a strong linear relationship.
- A value of $r$ close to 0 indicates a weak linear relationship.
Correlation measures covariation, not causation. A positive correlation between deaths and doctors during an epidemic might occur if doctors are sent to severely affected areas (influenced by a third variable like severity), not because doctors cause deaths.
Example 1. Calculate the correlation coefficient between the number of years of schooling of farmers (X) and the annual yield per acre in ’000 Rs (Y).
X: 0, 2, 4, 6, 8, 10, 12
Y: 4, 4, 6, 10, 10, 8, 7
Answer:
Years of Education (X) | $(X - \bar{X})$ | $(X - \bar{X})^2$ | Annual yield (Y) | $(Y - \bar{Y})$ | $(Y - \bar{Y})^2$ | $(X - \bar{X})(Y - \bar{Y})$ |
---|---|---|---|---|---|---|
0 | –6 | 36 | 4 | –3 | 9 | 18 |
2 | –4 | 16 | 4 | –3 | 9 | 12 |
4 | –2 | 4 | 6 | –1 | 1 | 2 |
6 | 0 | 0 | 10 | 3 | 9 | 0 |
8 | 2 | 4 | 10 | 3 | 9 | 6 |
10 | 4 | 16 | 8 | 1 | 1 | 4 |
12 | 6 | 36 | 7 | 0 | 0 | 0 |
$\sum X=42$ | $\sum (X– \bar{X})=0$ | $\sum (X– \bar{X})^2=112$ | $\sum Y=49$ | $\sum (Y– \bar{Y})=0$ | $\sum (Y– \bar{Y})^2=38$ | $\sum (X– \bar{X})(Y– \bar{Y})=42$ |
N = 7.
$\bar{X} = \frac{42}{7} = 6$, $\bar{Y} = \frac{49}{7} = 7$.
$\sigma_x = \sqrt{\frac{\sum (X– \bar{X})^2}{N}} = \sqrt{\frac{112}{7}} = \sqrt{16} = 4$.
$\sigma_y = \sqrt{\frac{\sum (Y– \bar{Y})^2}{N}} = \sqrt{\frac{38}{7}} \approx \sqrt{5.428} \approx 2.33$
Cov(X,Y) = $\frac{\sum (X– \bar{X})(Y– \bar{Y})}{N} = \frac{42}{7} = 6$.
$r = \frac{\text{Cov}(X,Y)}{\sigma_x \sigma_y} = \frac{6}{4 \times 2.33} = \frac{6}{9.32} \approx 0.644$
There is a positive correlation of approximately 0.644 between years of schooling and yield per acre.
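As a quick cross-check of Example 1 (assuming NumPy is available), NumPy's built-in correlation matrix gives the same value.

```python
# Cross-check of Example 1 with NumPy's correlation matrix.
import numpy as np

x = [0, 2, 4, 6, 8, 10, 12]   # years of schooling
y = [4, 4, 6, 10, 10, 8, 7]   # annual yield per acre ('000 Rs)

r = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 matrix
print(round(r, 3))            # ≈ 0.644
```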
Step Deviation Method To Calculate Correlation Coefficient
This method simplifies calculations for large values by transforming the variables X and Y using assumed means and common factors.
Let $U = \frac{X - A}{h}$ and $V = \frac{Y - B}{k}$, where A and B are assumed means, and h and k are common factors. The correlation coefficient between U and V ($r_{UV}$) is equal to the correlation coefficient between X and Y ($r_{XY}$).
$r_{XY} = r_{UV} = \frac{N \sum UV - (\sum U)(\sum V)}{\sqrt{N \sum U^2 - (\sum U)^2} \sqrt{N \sum V^2 - (\sum V)^2}}$
Example 2. Calculate the correlation coefficient between Price index (X) and Money supply in Rs crores (Y).
X: 120, 150, 190, 220, 230
Y: 1800, 2000, 2500, 2700, 3000
Answer:
Let A = 100, h = 10. Let B = 1700, k = 100.
$U = \frac{X - 100}{10}$, $V = \frac{Y - 1700}{100}$.
X | Y | U | V | $U^2$ | $V^2$ | UV |
---|---|---|---|---|---|---|
120 | 1800 | 2 | 1 | 4 | 1 | 2 |
150 | 2000 | 5 | 3 | 25 | 9 | 15 |
190 | 2500 | 9 | 8 | 81 | 64 | 72 |
220 | 2700 | 12 | 10 | 144 | 100 | 120 |
230 | 3000 | 13 | 13 | 169 | 169 | 169 |
 | | $\sum U=41$ | $\sum V=35$ | $\sum U^2=423$ | $\sum V^2=343$ | $\sum UV=378$ |
N = 5.
$r = \frac{N \sum UV - (\sum U)(\sum V)}{\sqrt{N \sum U^2 - (\sum U)^2} \sqrt{N \sum V^2 - (\sum V)^2}} = \frac{5 \times 378 - (41)(35)}{\sqrt{5 \times 423 - (41)^2} \sqrt{5 \times 343 - (35)^2}}$
$r = \frac{1890 - 1435}{\sqrt{2115 - 1681} \sqrt{1715 - 1225}} = \frac{455}{\sqrt{434} \sqrt{490}} = \frac{455}{\sqrt{212660}} \approx \frac{455}{461.2} \approx 0.987$
There is a strong positive correlation (approximately 0.987) between the price index and money supply.
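The result of Example 2, and the fact that the step deviation transform leaves $r$ unchanged, can be checked with a short NumPy sketch (assuming NumPy is available):

```python
# Check of Example 2: r on the raw (X, Y) values equals r on the
# step-deviated (U, V) values, since r is unaffected by origin and scale.
import numpy as np

X = np.array([120, 150, 190, 220, 230])          # price index
Y = np.array([1800, 2000, 2500, 2700, 3000])     # money supply (Rs crores)

U = (X - 100) / 10       # A = 100, h = 10
V = (Y - 1700) / 100     # B = 1700, k = 100

print(round(np.corrcoef(X, Y)[0, 1], 3))   # ≈ 0.987
print(round(np.corrcoef(U, V)[0, 1], 3))   # same value
```

Because $r_{UV} = r_{XY}$, the two printed values agree, which is exactly why the step deviation method is safe to use.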
Spearman’s Rank Correlation
Developed by C.E. Spearman, this method measures the linear association between the **ranks** assigned to individual items or observations, rather than their actual values. It is particularly useful in situations where precise numerical measurement is difficult or impossible, such as measuring subjective attributes like honesty, beauty, or intelligence.
It can also be used when data has extreme values (as it's not affected by them) or when the relationship is non-linear but its direction is clear. The formula uses ranks (R):
$r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$
Where $n$ is the number of observations/pairs and $D$ is the difference between the ranks assigned to the same item for the two variables ($D = R_x - R_y$).
The properties of Pearson's $r$ generally apply to Spearman's $r_s$ as well; it ranges from -1 to +1 and has no unit. However, $r_s$ is generally less accurate than Pearson's $r$ when precise quantitative data is available because it uses only rank information, not the actual magnitude of differences between values.
Calculation Of Rank Correlation Coefficient
Calculation of rank correlation depends on whether ranks are already provided, need to be assigned, or if there are ties (repeated ranks).
Case 1: When The Ranks Are Given
If ranks for both variables are given, calculate the difference in ranks (D) for each pair, square the differences ($D^2$), sum the squares ($\sum D^2$), and apply the formula $r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$.
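A minimal Python sketch of this case is given below; the function name `rank_correlation` is illustrative and assumes the given ranks contain no ties.

```python
# Minimal sketch of Spearman's formula when ranks are given and untied;
# the function name is illustrative.
def rank_correlation(rank_x, rank_y):
    n = len(rank_x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```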
Example 3. Five persons are assessed by three judges (A, B, C) in a beauty contest. Ranks given by each judge are provided. Find which pair of judges has the nearest approach to common perception of beauty.
Ranks by Judge A: 1, 2, 3, 4, 5
Ranks by Judge B: 2, 4, 1, 5, 3
Ranks by Judge C: 1, 3, 5, 2, 4
Answer:
Calculate rank correlation for each pair of judges (A vs B, A vs C, B vs C).
Judge A vs Judge B:
Competitor | Rank A (RA) | Rank B (RB) | D = RA – RB | $D^2$ |
---|---|---|---|---|
1 | 1 | 2 | –1 | 1 |
2 | 2 | 4 | –2 | 4 |
3 | 3 | 1 | 2 | 4 |
4 | 4 | 5 | –1 | 1 |
5 | 5 | 3 | 2 | 4 |
 | | | $\sum D=0$ | $\sum D^2=14$ |
$n=5$. $r_{AB} = 1 - \frac{6 \times 14}{5(5^2 - 1)} = 1 - \frac{84}{5 \times 24} = 1 - \frac{84}{120} = 1 - 0.7 = 0.3$.
Judge A vs Judge C:
Competitor | Rank A (RA) | Rank C (RC) | D = RA – RC | $D^2$ |
---|---|---|---|---|
1 | 1 | 1 | 0 | 0 |
2 | 2 | 3 | –1 | 1 |
3 | 3 | 5 | –2 | 4 |
4 | 4 | 2 | 2 | 4 |
5 | 5 | 4 | 1 | 1 |
 | | | $\sum D=0$ | $\sum D^2=10$ |
$n=5$. $r_{AC} = 1 - \frac{6 \times 10}{5(5^2 - 1)} = 1 - \frac{60}{120} = 1 - 0.5 = 0.5$.
Judge B vs Judge C:
Competitor | Rank B (RB) | Rank C (RC) | D = RB – RC | $D^2$ |
---|---|---|---|---|
1 | 2 | 1 | 1 | 1 |
2 | 4 | 3 | 1 | 1 |
3 | 1 | 5 | –4 | 16 |
4 | 5 | 2 | 3 | 9 |
5 | 3 | 4 | –1 | 1 |
 | | | $\sum D=0$ | $\sum D^2=28$ |
$n=5$. $r_{BC} = 1 - \frac{6 \times 28}{5(5^2 - 1)} = 1 - \frac{168}{120} = 1 - 1.4 = -0.4$.
Comparing the three coefficients: $r_{AB}=0.3$, $r_{AC}=0.5$ and $r_{BC}=-0.4$. The pair with the highest positive rank correlation ranks the competitors most similarly. Since $r_{AC}=0.5$ is the largest positive value, Judges A and C have the nearest approach to a common perception of beauty, while the negative $r_{BC}$ shows that Judges B and C rank the competitors in broadly opposite ways.
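As a cross-check of Example 3 (assuming SciPy is available), `scipy.stats.spearmanr` gives the same three coefficients, since there are no tied ranks here:

```python
# Cross-check of Example 3 with SciPy (no ties, so SciPy's Spearman
# coefficient matches the textbook formula exactly).
from scipy.stats import spearmanr

A = [1, 2, 3, 4, 5]   # ranks by Judge A
B = [2, 4, 1, 5, 3]   # ranks by Judge B
C = [1, 3, 5, 2, 4]   # ranks by Judge C

print(round(spearmanr(A, B)[0], 2))   # 0.3
print(round(spearmanr(A, C)[0], 2))   # 0.5
print(round(spearmanr(B, C)[0], 2))   # -0.4
```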
Case 2: When The Ranks Are Not Given
Assign ranks to the values of each variable independently. The highest value gets rank 1, the next highest rank 2, and so on. If two values are equal, they are given the average of the ranks they would have occupied. Once ranks are assigned (Rx and Ry), calculate D = Rx - Ry, $D^2$, $\sum D^2$, and use the formula $r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$.
Example 4. Calculate the rank correlation coefficient between the marks secured by 5 students in Statistics (X) and Economics (Y): A(85, 60), B(60, 48), C(55, 49), D(65, 50), E(75, 55), where each pair is (marks in X, marks in Y).
Answer:
Assign ranks:
Student | Statistics (X) | Economics (Y) | Rank X (RX) | Rank Y (RY) | D = RX – RY | $D^2$ |
---|---|---|---|---|---|---|
A | 85 | 60 | 1 | 1 | 0 | 0 |
B | 60 | 48 | 4 | 5 | –1 | 1 |
C | 55 | 49 | 5 | 4 | 1 | 1 |
D | 65 | 50 | 3 | 3 | 0 | 0 |
E | 75 | 55 | 2 | 2 | 0 | 0 |
 | | | | | $\sum D=0$ | $\sum D^2=2$ |
$n=5$. $r_s = 1 - \frac{6 \sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 2}{5(5^2 - 1)} = 1 - \frac{12}{120} = 1 - 0.1 = 0.9$.
There is a strong positive rank correlation (0.9).
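The ranking step of Example 4 can be reproduced with SciPy's `rankdata` (assuming SciPy is available); negating the marks makes the highest mark receive rank 1, matching the convention used above.

```python
# Check of Example 4: assign average ranks with SciPy's rankdata
# (negating the marks so the highest mark gets rank 1), then apply
# the rank-correlation formula.
from scipy.stats import rankdata

stats_marks = [85, 60, 55, 65, 75]   # X
econ_marks = [60, 48, 49, 50, 55]    # Y

rx = rankdata([-m for m in stats_marks])   # [1, 4, 5, 3, 2]
ry = rankdata([-m for m in econ_marks])    # [1, 5, 4, 3, 2]

n = len(rx)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
print(1 - 6 * d_squared / (n * (n ** 2 - 1)))   # 0.9
```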
Case 3: When The Ranks Are Repeated
When ranks are repeated (ties), assign each tied value the average of the ranks it would otherwise have occupied. A correction factor is then added to $\sum D^2$ in the formula: for each group of $m$ tied ranks, add $\frac{m(m^2 - 1)}{12}$ to $\sum D^2$.
$r_s = 1 - \frac{6 \left( \sum D^2 + \sum \frac{m(m^2-1)}{12} \right)}{n(n^2 - 1)}$
Example 5. Calculate the rank correlation coefficient between X and Y from the following data: X: 1200, 1150, 1000, 990, 800, 780, 760, 750, 730, 700, 620, 600; Y: 75, 65, 50, 100, 90, 85, 90, 40, 50, 60, 50, 75.
Answer:
Assign ranks. Tied values get the average rank. For X, no ties. For Y, 50 appears 3 times (ranks 9, 10, 11, avg = 10), 75 appears 2 times (ranks 5, 6, avg = 5.5), 90 appears 2 times (ranks 2, 3, avg = 2.5).
X | Y | Rank X (RX) | Rank Y (RY) | D = RX – RY | $D^2$ |
---|---|---|---|---|---|
1200 | 75 | 1 | 5.5 | –4.5 | 20.25 |
1150 | 65 | 2 | 7 | –5 | 25.00 |
1000 | 50 | 3 | 10 | –7 | 49.00 |
990 | 100 | 4 | 1 | 3 | 9.00 |
800 | 90 | 5 | 2.5 | 2.5 | 6.25 |
780 | 85 | 6 | 4 | 2 | 4.00 |
760 | 90 | 7 | 2.5 | 4.5 | 20.25 |
750 | 40 | 8 | 12 | –4 | 16.00 |
730 | 50 | 9 | 10 | –1 | 1.00 |
700 | 60 | 10 | 8 | 2 | 4.00 |
620 | 50 | 11 | 10 | 1 | 1.00 |
600 | 75 | 12 | 5.5 | 6.5 | 42.25 |
 | | | | | $\sum D^2=198.00$ |
$n=12$. Ties in Y: Value 50 (m=3), Value 75 (m=2), Value 90 (m=2).
Correction Factor = $\frac{3(3^2-1)}{12} + \frac{2(2^2-1)}{12} + \frac{2(2^2-1)}{12} = \frac{3(8)}{12} + \frac{2(3)}{12} + \frac{2(3)}{12} = \frac{24}{12} + \frac{6}{12} + \frac{6}{12} = 2 + 0.5 + 0.5 = 3$
$r_s = 1 - \frac{6 (\sum D^2 + \text{Correction Factor})}{n(n^2 - 1)} = 1 - \frac{6 (198 + 3)}{12(12^2 - 1)} = 1 - \frac{6 \times 201}{12 \times 143} = 1 - \frac{1206}{1716} \approx 1 - 0.703 \approx 0.297$
There is a positive rank correlation of approximately 0.297 between X and Y.
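Example 5 can be verified with a short Python sketch of the tie-corrected textbook formula (assuming SciPy is available for the ranking step); note that `scipy.stats.spearmanr` itself computes Pearson's $r$ on the average ranks, which can differ slightly from this correction-factor formula when ties occur.

```python
# Check of Example 5 using the textbook formula with the correction
# factor for tied ranks.
from collections import Counter
from scipy.stats import rankdata

X = [1200, 1150, 1000, 990, 800, 780, 760, 750, 730, 700, 620, 600]
Y = [75, 65, 50, 100, 90, 85, 90, 40, 50, 60, 50, 75]

rx = rankdata([-v for v in X])   # descending ranks; ties get the average rank
ry = rankdata([-v for v in Y])

n = len(X)
d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))   # 198.0

# Correction factor: m(m^2 - 1)/12 for every group of m tied values.
correction = sum(m * (m ** 2 - 1) / 12
                 for counts in (Counter(X), Counter(Y))
                 for m in counts.values() if m > 1)     # 3.0 (all from Y)

r_s = 1 - 6 * (d_squared + correction) / (n * (n ** 2 - 1))
print(round(r_s, 3))   # ≈ 0.297
```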
Conclusion
Correlation analysis provides techniques to study the relationship between two variables, particularly linear relationships. Scatter diagrams offer a visual representation. Karl Pearson's coefficient ($r$) and Spearman's rank correlation coefficient ($r_s$) provide numerical measures of linear association. Pearson's $r$ is used for precisely measured quantitative data, while Spearman's $r_s$ is suitable for ranked data, subjective attributes, or data with extreme values. It is important to remember that correlation indicates covariation, not causation. The knowledge of correlation helps understand the direction and intensity of how variables change together, but does not explain why they are related.
Recap:
- Correlation analysis studies the relationship between two variables.
- Scatter diagrams provide a visual representation of the relationship.
- Karl Pearson’s coefficient of correlation ($r$) numerically measures only linear relationships and ranges from –1 to 1.
- Spearman’s rank correlation ($r_s$) numerically measures linear relationships between ranks assigned to variables, useful when precise measurement is difficult.
- Correction factors are needed for repeated ranks in rank correlation.
- Correlation indicates covariation (variables moving together), not causation (one variable causing the other).