Correlation
Correlation: Definition and Types (Positive, Negative, Zero)
Definition
Correlation is a statistical concept that measures the **strength** and **direction** of a **linear relationship** between two quantitative variables. When we examine two variables, say $X$ and $Y$, correlation tells us how consistently changes in one variable are associated with changes in the other, specifically in a straight-line pattern.
- If two variables are correlated, it means they tend to vary together.
- A positive correlation means they tend to increase or decrease together.
- A negative correlation means one tends to increase as the other decreases.
The strength of the correlation indicates how closely the relationship follows a perfect linear pattern. A strong correlation means the points lie very close to a straight line, while a weak correlation means they are scattered more broadly around a line.
Important Considerations:
- Linear Relationship: Correlation specifically measures the strength of a *linear* association. Two variables might have a strong non-linear relationship (e.g., quadratic or exponential), but their linear correlation might be weak or even zero.
- Correlation is NOT Causation: A strong correlation between two variables does not imply that one variable causes the other. There might be a confounding variable influencing both, or the relationship might be coincidental. Establishing causation requires controlled experiments or other advanced statistical methods.
Types of Correlation based on Direction
Based on the direction of the linear relationship, correlation can be classified into three main types:
- **Positive Correlation:** Positive correlation exists when two variables tend to move in the same direction. As one variable increases, the other variable also tends to increase, and vice versa.
- If plotted on a scatter diagram, the points will generally cluster around a line that slopes upwards from left to right.
- Example: As the number of hours studied increases, the test scores tend to increase. As temperature rises, ice cream sales tend to increase. Height and weight often show a positive correlation.
- **Negative Correlation:** Negative correlation exists when two variables tend to move in opposite directions. As one variable increases, the other variable tends to decrease, and vice versa.
- If plotted on a scatter diagram, the points will generally cluster around a line that slopes downwards from left to right.
- Example: As altitude increases, air pressure tends to decrease. As the price of a product increases, the quantity demanded by consumers tends to decrease. The number of hours spent watching TV and the number of hours spent exercising might show a negative correlation.
- **Zero Correlation (or No Linear Correlation):** Zero correlation (or negligible linear correlation) exists when there is no discernible linear relationship between the two variables. Changes in one variable are not consistently associated with either an increase or a decrease in the other variable in a linear pattern.
- If plotted on a scatter diagram, the points will appear randomly scattered, forming no clear linear pattern (neither upwards nor downwards).
- Example: The relationship between a person's shoe size and their IQ. The relationship between hair colour and marks in a statistics exam.
- It is important to remember that zero linear correlation does not rule out the possibility of a strong non-linear relationship.
The **strength** of the correlation is measured by a correlation coefficient (like Karl Pearson's coefficient, discussed later). The coefficient ranges from -1 to +1. A value of +1 indicates a perfect positive linear correlation, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation. Values closer to +1 or -1 indicate stronger linear relationships, while values closer to 0 indicate weaker linear relationships.
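To see why zero linear correlation does not rule out a strong non-linear relationship, consider a small hypothetical data set of $n = 5$ points lying exactly on the parabola $y = x^2$: $(-2, 4)$, $(-1, 1)$, $(0, 0)$, $(1, 1)$, $(2, 4)$. Here $\sum x_i = 0$, $\sum y_i = 10$ and $\sum x_i y_i = (-2)(4) + (-1)(1) + (0)(0) + (1)(1) + (2)(4) = 0$, so the numerator of the product-moment formula introduced later in this chapter is
$$n\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right) = 5(0) - (0)(10) = 0$$
and hence $r = 0$, even though $Y$ is completely determined by $X$.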
Scatter Diagram
Definition
A Scatter Diagram, also known as a scatter plot, is a basic and essential graphical tool used to visualize the relationship between two quantitative variables. It is constructed by plotting pairs of observations from two variables, say $X$ and $Y$, as points on a two-dimensional Cartesian coordinate system.
In a scatter diagram:
- The horizontal axis (x-axis) represents the values of one variable (commonly the independent or predictor variable).
- The vertical axis (y-axis) represents the values of the other variable (commonly the dependent or response variable).
- Each point on the graph corresponds to a single observation (e.g., one student, one day, one product), showing its values for both variables simultaneously.
The pattern formed by the collection of plotted points provides a visual representation of the relationship between the two variables.
Construction
To construct a scatter diagram for a set of paired observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$:
- **Draw and Label Axes:** Draw two perpendicular axes, the horizontal axis (x-axis) and the vertical axis (y-axis), intersecting at the origin. Label the x-axis with the name of the first variable and the y-axis with the name of the second variable. Include units if applicable.
- **Determine Scales:** Choose appropriate scales for both axes based on the range of values for each variable in your dataset. The scales should be chosen such that all data points fit comfortably on the graph and the scatter is clearly visible. The axes do not necessarily have to start at zero, especially if the data values are far from zero; use a break in the axis if needed.
- **Plot Points:** For each pair of observations $(x_i, y_i)$, locate the value $x_i$ on the x-axis and the value $y_i$ on the y-axis. Plot a single point at the intersection of the vertical line through $x_i$ and the horizontal line through $y_i$. Repeat this for all $n$ pairs of observations.
- **Add Title:** Give the scatter diagram a clear, concise title that describes the relationship being visualized (e.g., "Relationship between Maths Score and Physics Score").
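These construction steps translate directly into code. The following is a minimal sketch using Python with the matplotlib library; the variable names and data values are hypothetical, chosen only to illustrate the procedure.

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations (x_i, y_i) -- illustrative values only
hours_studied = [2, 4, 5, 7, 8, 10]
test_scores = [52, 58, 61, 68, 73, 80]

# Draw and label axes; matplotlib picks scales that fit all the points,
# though they can be overridden if a different range is wanted.
plt.scatter(hours_studied, test_scores)
plt.xlabel("Hours Studied")
plt.ylabel("Test Score")

# Add a clear, concise title describing the relationship.
plt.title("Scatter Diagram of Hours Studied vs. Test Score")

plt.show()
```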
Interpretation
Interpreting the pattern of points in a scatter diagram is a crucial first step in analyzing the relationship between two variables. Look for the following characteristics:
- **Form of the Relationship:** Does the cloud of points appear to follow a straight line pattern (suggesting a linear relationship)? Or does it follow a curve (suggesting a non-linear relationship)? Or is there no obvious pattern at all?
- **Direction of the Relationship:** If the relationship appears to be roughly linear, does the cloud of points tend to rise or fall as you move from left to right along the x-axis?
- If the points generally trend **upwards** from left to right, it suggests a **positive linear correlation** (as X increases, Y tends to increase).
- If the points generally trend **downwards** from left to right, it suggests a **negative linear correlation** (as X increases, Y tends to decrease).
- If the points appear randomly scattered with **no clear upward or downward trend**, it suggests **zero or negligible linear correlation**.
- **Strength of the Relationship:** How tightly clustered are the points around the apparent linear or non-linear pattern? The closer the points are to forming a perfect line (or curve), the stronger the relationship.
- If the points are tightly clustered and form a narrow band, it suggests a **strong correlation**.
- If the points are widely scattered, forming a broad cloud, it suggests a **weak correlation**.
- If there is no discernible pattern, the correlation is effectively zero (for linear relationships).
- **Outliers:** Look for any points that lie far away from the main cluster of points. These are potential outliers. Outliers can significantly influence the calculation of correlation coefficients and might warrant further investigation.
A scatter diagram provides a quick visual summary and guides the choice of appropriate quantitative correlation methods.
Example
Example 1. The scores obtained by 6 students in Maths (x) and Physics (y) in a test are given as pairs of (Maths Score, Physics Score): (80, 75), (60, 65), (90, 85), (50, 55), (70, 70), (95, 90). Draw a scatter diagram for this data and interpret it.
Answer:
Given: Paired scores of 6 students in Maths and Physics.
To Draw: A scatter diagram.
To Interpret: The scatter diagram.
Solution:
- Choose Axes: Let the Maths Score be on the x-axis and the Physics Score be on the y-axis.
- Determine Scales: The Maths scores range from 50 to 95, and Physics scores range from 55 to 90. We can choose a scale for both axes starting from, say, 40 and going up to 100, with increments of 10.
- Plot Points: Plot the 6 given pairs of points on the graph paper: (80, 75), (60, 65), (90, 85), (50, 55), (70, 70), (95, 90).
- Label Axes and Title: Label the x-axis "Maths Score" and the y-axis "Physics Score". Add the title "Scatter Diagram of Maths Score vs. Physics Score".
[Figure: Scatter Diagram of Maths Score vs. Physics Score]
Interpretation:
Observing the pattern of the points in the scatter diagram, we can see that:
- The points appear to cluster roughly along a straight line.
- This line slopes upwards from the lower left to the upper right corner of the graph.
This pattern suggests a **positive linear relationship** between Maths scores and Physics scores for this group of students. Students who scored higher in Maths generally tended to score higher in Physics as well, and students who scored lower in Maths tended to score lower in Physics.
The points are reasonably close to forming a straight line, indicating a moderately strong positive linear correlation.
Methods of Measuring Correlation: Karl Pearson's Coefficient
Quantifying Linear Relationship
While a scatter diagram provides a visual assessment of the relationship between two quantitative variables, it does not give a precise numerical measure of the strength and direction of the linear association. To quantify this linear relationship, we use a statistical measure called a correlation coefficient.
The most widely used method for measuring the strength and direction of a **linear relationship** between two quantitative variables is **Karl Pearson's Product-Moment Correlation Coefficient**.
Karl Pearson's Product-Moment Correlation Coefficient ($r$)
Definition:
Karl Pearson's coefficient of correlation, commonly denoted by $r$, is a statistical measure that quantifies the degree and direction of the **linear association** between two quantitative variables, $X$ and $Y$. It is a standardized measure that ranges from -1 to +1.
Range of $r$:
The value of Pearson's $r$ always falls within the range of -1 to +1, inclusive:
$-1 \le r \le +1$
... (1)
Interpretation of $r$:
The value of $r$ indicates both the direction and the strength of the linear relationship:
- $r = +1$: Represents a **perfect positive linear correlation**. All data points lie exactly on a straight line with a positive slope. As $X$ increases, $Y$ increases perfectly linearly.
- $r = -1$: Represents a **perfect negative linear correlation**. All data points lie exactly on a straight line with a negative slope. As $X$ increases, $Y$ decreases perfectly linearly.
- $r = 0$: Indicates **no linear correlation**. There is no tendency for $Y$ to consistently increase or decrease in a linear fashion as $X$ increases. (Note: A value of 0 does not mean there is no relationship at all, only no *linear* relationship).
- $r > 0$ (between 0 and +1): Indicates a **positive linear correlation**. As $X$ increases, $Y$ tends to increase. The closer $r$ is to +1, the stronger the positive linear relationship (points are closer to a straight line).
- $r < 0$ (between -1 and 0): Indicates a **negative linear correlation**. As $X$ increases, $Y$ tends to decrease. The closer $r$ is to -1, the stronger the negative linear relationship (points are closer to a straight line).
Guidelines for interpreting the strength of $r$ (these are general guidelines and can vary by field):
- $|r|$ between 0.0 and 0.2: Very weak or negligible linear relationship.
- $|r|$ between 0.2 and 0.4: Weak linear relationship.
- $|r|$ between 0.4 and 0.6: Moderate linear relationship.
- $|r|$ between 0.6 and 0.8: Strong linear relationship.
- $|r|$ between 0.8 and 1.0: Very strong linear relationship.
Formula for Calculation:
Pearson's $r$ can be calculated using several equivalent formulas. For paired data $(x_i, y_i)$ with $n$ pairs, a widely used computational formula is:
$$r = \frac{n(\sum\limits_{i=1}^{n} x_i y_i) - (\sum\limits_{i=1}^{n} x_i)(\sum\limits_{i=1}^{n} y_i)}{\sqrt{[n(\sum\limits_{i=1}^{n} x_i^2) - (\sum\limits_{i=1}^{n} x_i)^2] [n(\sum\limits_{i=1}^{n} y_i^2) - (\sum\limits_{i=1}^{n} y_i)^2]}}$$
... (2)
Where:
- $n$ = number of pairs of observations.
- $\sum x_i$ = sum of all $x$ values.
- $\sum y_i$ = sum of all $y$ values.
- $\sum x_i y_i$ = sum of the products of each corresponding $x$ and $y$ value.
- $\sum x_i^2$ = sum of the squares of each $x$ value.
- $\sum y_i^2$ = sum of the squares of each $y$ value.
Another formula expresses $r$ in terms of covariance and standard deviations:
$r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
... (3)
Where $\text{Cov}(X, Y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n}$ is the population covariance, $\sigma_X = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n}}$ is the population standard deviation of $X$, and $\sigma_Y = \sqrt{\frac{\sum(y_i-\bar{y})^2}{n}}$ is the population standard deviation of $Y$. Note that $n\sum x_i^2 - (\sum x_i)^2 = n \sum (x_i - \bar{x})^2$ (and similarly for $y$), which is what links the computational Formula (2) to these covariance and standard deviation components.
Calculation Procedure:
To calculate $r$, you typically create a table with columns for $x$, $y$, $xy$, $x^2$, and $y^2$. Calculate the sum for each column. Then substitute these sums and $n$ into Formula (2).
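As a sketch of this procedure in Python, the column sums can be accumulated directly and substituted into Formula (2). The function name pearson_r is our own illustrative choice, not a standard library routine.

```python
import math

def pearson_r(xs, ys):
    """Karl Pearson's product-moment correlation coefficient, Formula (2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Maths and Physics scores from Example 1
maths = [80, 60, 90, 50, 70, 95]
physics = [75, 65, 85, 55, 70, 90]
print(round(pearson_r(maths, physics), 3))  # 0.992
```

For the data of Example 1, the sums are $\sum x = 445$, $\sum y = 440$, $\sum xy = 33750$, $\sum x^2 = 34525$ and $\sum y^2 = 33100$, giving $r = \frac{6700}{\sqrt{9125 \times 5000}} \approx 0.99$, which is consistent with the strong positive linear pattern seen in the scatter diagram.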
Pearson's $r$ is suitable for quantitative data measured on interval or ratio scales when the relationship is believed to be approximately linear and the data does not contain extreme outliers that could heavily influence the mean and standard deviation.
Rank Correlation: Spearman's Rank Correlation Coefficient
Measuring Monotonic Relationship
While Pearson's correlation coefficient measures the strength of a **linear** relationship, sometimes we are interested in measuring the strength of a **monotonic** relationship. A monotonic relationship is one in which, as one variable increases, the other consistently moves in a single direction (always increasing or always decreasing), but not necessarily at a constant rate (i.e., not necessarily in a straight line).
Spearman's Rank Correlation Coefficient is a non-parametric measure that assesses the strength and direction of the monotonic association between two variables. It is particularly useful in the following situations:
- When the data is naturally in the form of ranks (ordinal data).
- When dealing with quantitative data where the relationship is suspected to be non-linear but monotonic.
- When the data contains significant outliers, as ranking the data first reduces the impact of extreme values.
Spearman's correlation is essentially Pearson's correlation calculated on the ranks of the data values rather than the raw values themselves.
Spearman's Rank Correlation Coefficient ($r_s$)
Definition:
Spearman's Rank Correlation Coefficient, usually denoted by $r_s$ or $\rho$ (rho), measures the strength and direction of the **monotonic relationship** between two variables. It evaluates how well the relationship between the two variables can be described using a monotonic function.
Procedure for Calculation:
To calculate $r_s$ for $n$ paired observations $(x_i, y_i)$:
- Rank the X values: Assign ranks to the values of the first variable ($X$) from 1 to $n$. The smallest value gets rank 1, the next smallest gets rank 2, and so on. If there are ties (two or more values are the same), assign the average of the ranks they would have occupied to each tied value. Let these ranks be $R_{xi}$.
- Rank the Y values: Similarly, assign ranks to the values of the second variable ($Y$) from 1 to $n$. Use the same rule for ties. Let these ranks be $R_{yi}$.
- Calculate Differences in Ranks: For each pair of observations, calculate the difference ($d_i$) between the ranks of $X$ and $Y$: $d_i = R_{xi} - R_{yi}$.
- Square the Differences: Square each difference: $d_i^2$.
- Sum the Squared Differences: Calculate the sum of the squared differences: $\sum_{i=1}^{n} d_i^2$.
- Apply the Formula:
If there are no ties in the ranks (or very few ties), the following simplified formula can be used:
$$r_s = 1 - \frac{6 \sum\limits_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
... (1)
If there are many ties, it is more accurate (though more laborious) to calculate Pearson's correlation coefficient directly on the rank values $R_x$ and $R_y$, using the standard Pearson's formula (Formula (2) in the Karl Pearson section above) and treating the ranks as the data points.
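The whole procedure can be sketched in Python as shown below; the helper average_ranks and the function spearman_rs are illustrative names, not library routines. Ties are handled by assigning average ranks, after which the simplified Formula (1) is applied.

```python
def average_ranks(values):
    """Assign ranks 1..n, giving tied values the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j to cover a block of tied (equal) values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman's rank correlation via the simplified Formula (1).
    Exact when there are no ties; with many ties, compute Pearson's r
    on the rank values instead, as noted above."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Example 1 scores: the two rankings agree exactly, so every d_i = 0 and r_s = 1.
maths = [80, 60, 90, 50, 70, 95]
physics = [75, 65, 85, 55, 70, 90]
print(spearman_rs(maths, physics))  # 1.0
```

In practice, library routines such as scipy.stats.spearmanr implement a tie-corrected version of this calculation.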
Range of $r_s$:
Like Pearson's $r$, Spearman's $r_s$ also ranges from -1 to +1:
$-1 \le r_s \le +1$
... (2)
Interpretation of $r_s$:
The interpretation is similar to Pearson's $r$, but it relates to the monotonic relationship of the ranks:
- $r_s = +1$: Perfect positive monotonic relationship. As $X$ increases, $Y$ consistently increases (their ranks are in the same order).
- $r_s = -1$: Perfect negative monotonic relationship. As $X$ increases, $Y$ consistently decreases (their ranks are in perfect reverse order).
- $r_s = 0$: No monotonic relationship.
- Values closer to +1 or -1 indicate stronger monotonic relationships.
The strength guidelines (very weak, weak, moderate, strong, very strong) used for Pearson's $r$ can also be applied to Spearman's $r_s$, but the interpretation refers to the strength of the monotonic trend, not necessarily a linear one.
Spearman's rank correlation is a valuable non-parametric alternative when assumptions for Pearson's $r$ (linearity, normally distributed variables) are not met, or when dealing with ordinal data or potential outliers.