


Chapter 15 Statistics (Concepts)

Welcome to this advanced chapter on Statistics, where we significantly deepen our analysis of data distributions, moving beyond the central point summaries learned in Class 10. While measures of central tendency – the Mean, Median, and Mode – provide valuable information about the 'typical' value or the center of a dataset, they paint an incomplete picture. Imagine two different batsmen in cricket having the exact same average score; one might consistently score around the average, while the other might score very high in some innings and very low in others. Central tendency alone doesn't capture this difference in consistency or variability. This chapter introduces the crucial concept of Measures of Dispersion, which quantify the extent to which data points are spread out, scattered, or vary around a central value. Understanding dispersion is essential for assessing consistency, reliability, and the overall nature of a distribution.

We begin our exploration with the simplest measure of spread: the Range. Calculated merely as the difference between the maximum and minimum values observed in the dataset (Range = Maximum Value - Minimum Value), it provides a quick, though often crude, indication of the total spread. While easy to compute, the range is highly susceptible to the influence of extreme values (outliers) and ignores the distribution of data points between the extremes, making it a limited measure in many contexts.

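To get a more representative measure of variability that considers every data point, we introduce the concept of Mean Deviation. This measures the average distance of the observations from a central point, typically either the mean ($\bar{x}$) or the median ($M$). Crucially, we use the absolute values of the deviations to ensure that positive and negative deviations don't cancel each other out, reflecting the total magnitude of variation. The formulas for ungrouped data are: $$ \text{MD}_{\bar{x}} = \frac{\sum\limits |x_i - \bar{x}|}{n}, \qquad \text{MD}_{M} = \frac{\sum\limits |x_i - M|}{n} $$ For grouped data, each absolute deviation is weighted by its frequency $f_i$ and $n$ is replaced by the total frequency $N = \sum f_i$.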

While intuitive, the absolute value function can be mathematically inconvenient for further analysis.

This leads us to the most important and widely used measures of dispersion: Variance and Standard Deviation. Instead of using absolute values, variance overcomes the issue of deviation signs by squaring them. The Variance, denoted by $\sigma^2$ (sigma squared), is defined as the average of the squared deviations of each observation from the arithmetic mean ($\bar{x}$). Squaring not only eliminates negative signs but also gives greater weight to observations that are further away from the mean. The formulas are: $$ \sigma^2 = \frac{\sum\limits (x_i - \bar{x})^2}{n} \ \text{(ungrouped data)}, \qquad \sigma^2 = \frac{\sum\limits f_i (x_i - \bar{x})^2}{N} \ \text{(grouped data, } N = \sum f_i \text{)} $$

While variance provides a good measure of spread, its units are the square of the original data units (e.g., $cm^2$ if data is in $cm$), making direct interpretation difficult.

To address the unit issue, we define the Standard Deviation, universally denoted by $\sigma$ (sigma), simply as the positive square root of the variance: $$ \mathbf{\sigma = \sqrt{\text{Variance}}} = \sqrt{\frac{\sum\limits_{i} f_i (x_i - \bar{x})^2}{N}} $$ (here $N = \sum f_i$ is the total frequency; for ungrouped data every $f_i = 1$ and $N$ is simply the number of observations $n$). The standard deviation is expressed in the same units as the original data, making it much more interpretable as a typical deviation from the mean. It is the most widely used measure of dispersion. Shortcut formulas, often involving $\sum\limits x_i^2$ or $\sum\limits f_i x_i^2$, are frequently derived and used to simplify the computation of variance and standard deviation, especially for large datasets.

We will practice calculating these measures – Range, Mean Deviation, Variance, and Standard Deviation – for both ungrouped (raw list of observations) and grouped frequency distributions. For grouped data, remember calculations typically involve using the class marks ($x_i$) as representative values for each interval, weighted by their frequencies ($f_i$) (potentially determined using tally marks like $||||$ or $\bcancel{||||}$ during data organization).

Finally, to compare the variability or consistency of two or more datasets that might have different units or vastly different means, we introduce a relative measure of dispersion called the Coefficient of Variation (CV). It expresses the standard deviation as a percentage of the mean: $$ \mathbf{CV = \left( \frac{\sigma}{\bar{x}} \right) \times 100} $$ Since the CV is a unitless ratio, it allows for meaningful comparison of dispersion across different datasets. A dataset with a lower CV is considered more consistent or stable (less variable relative to its mean) than a dataset with a higher CV. This chapter equips you with a comprehensive toolkit to not only describe the center of your data but also to quantify its spread, leading to a much more complete understanding of statistical distributions.



Describing the Dispersion

In statistics, when we analyze a set of data, our first step is often to find a "typical" or "central" value that represents the entire dataset. This is done using measures of central tendency, such as the mean, median, and mode. However, a measure of center alone provides an incomplete and sometimes misleading picture of the data.

To fully understand a dataset, we also need to know how the data values are spread out. Dispersion is the statistical term for the degree to which data points in a distribution are scattered or spread out. A measure of dispersion is a number that quantifies this spread, telling us whether the data is tightly clustered together or widely scattered.


Why is Measuring Dispersion So Important?

Imagine a doctor is comparing two different treatments for blood pressure. Both treatments result in an average (mean) blood pressure of 120 mmHg. Based on the mean alone, the treatments seem equally effective. But what if the individual results were:

While both have a mean of 120, Treatment A is clearly more reliable and consistent. Its results are tightly clustered around the mean. Treatment B is highly unpredictable, with some patients experiencing dangerously low pressure and others dangerously high pressure. Measuring dispersion allows us to see this crucial difference.

Key reasons to measure dispersion include:

  1. To Judge Reliability: A small dispersion indicates that the central value is a good and reliable representative of the data. A large dispersion suggests that the central value is less representative.
  2. To Compare Variability: We can compare the consistency of two or more datasets. For example, which of two cricket batsmen is more consistent? The one whose scores have a lower dispersion. Which of two stocks is riskier? The one whose returns have a higher dispersion.
  3. To Control Quality: In manufacturing, the goal is to produce items that are as identical as possible. Measuring the dispersion of a product's dimensions helps to monitor and control the quality of the production process.
  4. To Facilitate Further Analysis: Many advanced statistical techniques rely on measures of dispersion to work correctly.

An Illustrative Example

Let's formalize the concept with a simple example. Consider the performance of two students, Anjali and Bimal, in five math tests. Their scores are:

Anjali's Scores: 84, 85, 85, 86, 85

Bimal's Scores: 60, 100, 85, 70, 90

First, let's calculate the mean score for each student.

Mean for Anjali = $\frac{84 + 85 + 85 + 86 + 85}{5} = \frac{425}{5} = 85$

Mean for Bimal = $\frac{60 + 100 + 85 + 70 + 90}{5} = \frac{425}{5} = 85$

Figure: Two number lines, both with a mean of 85. The top line shows Anjali's scores clustered tightly around 85. The bottom line shows Bimal's scores spread out widely from 60 to 100.

Observation:

Both students have the exact same average score of 85. If we only looked at the mean, we would think their performance is identical. However, by looking at the raw scores, it is obvious that Anjali is an extremely consistent student, with all her scores clustered tightly around the mean. Bimal, on the other hand, is highly inconsistent; his scores are spread out over a wide range.

This difference in consistency or variability is what measures of dispersion are designed to capture numerically. A simple measure like the range shows this clearly: Anjali's range is $86 - 84 = 2$, while Bimal's range is $100 - 60 = 40$. The larger range for Bimal immediately tells us that his scores are more dispersed.

In the following sections, we will explore more robust and widely used measures for quantifying dispersion, such as Mean Deviation and Standard Deviation.



Different Methods of Measuring Dispersion

To numerically capture the spread or variability within a dataset, statisticians use several different methods. Each method provides a single number that summarizes the dispersion, but they do so in different ways and are useful in different contexts. These methods can be grouped into two main categories: absolute measures and relative measures.


Absolute Measures of Dispersion

An absolute measure of dispersion describes the variability of a dataset using the same units as the original data. For example, if we are measuring the heights of students in centimeters (cm), an absolute measure of dispersion like the standard deviation will also be expressed in cm. This makes them easy to interpret in the context of the original data.

However, absolute measures are not suitable for comparing the variability of two datasets with different units (e.g., comparing the variability of students' heights in cm with their weights in kg) or with vastly different average values (e.g., comparing the variability in salaries at a small startup vs. a large corporation).

Common Absolute Measures:

  1. Range: The simplest and quickest measure of dispersion. It is the difference between the maximum and minimum values in the dataset.

    Formula: Range = Maximum Value – Minimum Value

    Usefulness: Provides a quick, rough estimate of the total spread. It is highly sensitive to outliers (extreme values).

  2. Quartile Deviation: A measure that focuses on the spread of the middle 50% of the data, making it resistant to outliers. It is half of the interquartile range (IQR).

    Formula: Quartile Deviation = $\frac{Q_3 - Q_1}{2}$, where $Q_3$ is the third quartile and $Q_1$ is the first quartile.

    Usefulness: Good for skewed distributions or data with extreme values.

  3. Mean Deviation: This measure calculates the average distance of each data point from a central value (usually the mean or median). It considers every value in the dataset.

    Formula: Mean Deviation = $\frac{\sum\limits |x_i - \text{Mean}|}{n}$

    Usefulness: More comprehensive than the range as it uses all data points.

  4. Variance ($\sigma^2$): One of the most important measures of dispersion. It is the average of the squared distances of each data point from the mean. Squaring the differences ensures that all values are positive and gives greater weight to points that are further from the mean.

    Formula: Variance ($\sigma^2$) = $\frac{\sum\limits (x_i - \text{Mean})^2}{n}$

    Usefulness: Crucial for many advanced statistical theories and tests. Its units are the square of the original data's units (e.g., cm²), making it hard to interpret directly.

  5. Standard Deviation ($\sigma$): This is the most widely used and important measure of dispersion. It is simply the positive square root of the variance.

    Formula: Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}} = \sqrt{\frac{\sum\limits (x_i - \text{Mean})^2}{n}}$

    Usefulness: By taking the square root, the standard deviation is expressed in the same units as the original data, making it much more interpretable than the variance.


Relative Measures of Dispersion

A relative measure of dispersion is a unit-free number, often expressed as a ratio or a percentage. It is designed to compare the variability of two or more datasets, especially when their units or average values are different.

For example, is a variation of 5 cm in the height of men more or less significant than a variation of 5 kg in their weight? A relative measure helps answer such questions by standardizing the dispersion.

Common Relative Measures:

  1. Coefficient of Range: The range expressed as a fraction of the sum of the maximum and minimum values.

    Formula: Coefficient of Range = $\frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$

  2. Coefficient of Quartile Deviation: The quartile deviation expressed as a fraction of the average of the quartiles.

    Formula: Coefficient of QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$

  3. Coefficient of Variation (CV): This is the most common and important relative measure of dispersion. It expresses the standard deviation as a percentage of the mean.

    CV = $\frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$

    A dataset with a higher CV is considered to have greater relative variability or be less consistent than a dataset with a lower CV. It is the standard tool for comparing consistency across different groups.

In the following sections, we will focus on the calculation and application of the most important measures: Range, Mean Deviation, Variance, and Standard Deviation.



Range

The Range is the most straightforward and intuitive measure of dispersion. It provides a quick snapshot of the total spread of a dataset by focusing only on its most extreme values.


Definition of Range

The Range is simply the difference between the highest value (maximum) and the lowest value (minimum) in a dataset.

If $X_{\text{max}}$ is the maximum value and $X_{\text{min}}$ is the minimum value, then the formula is:

Range $= X_{\text{max}} - X_{\text{min}}$

... (i)

The range is an absolute measure of dispersion, meaning its units are the same as the units of the data itself (e.g., if the data is in kilograms, the range is in kilograms).


Calculation of Range


Coefficient of Range

To compare the spread of two datasets with very different scales (e.g., salaries in thousands vs. pocket money in hundreds), we use the Coefficient of Range. This is a relative measure that expresses the range as a fraction of the sum of the extreme values, making it a unit-free number.

Coefficient of Range $= \frac{X_{\text{max}} - X_{\text{min}}}{X_{\text{max}} + X_{\text{min}}}$

... (ii)


Advantages and Disadvantages of Range

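Advantages (Merits):

  1. It is the simplest measure of dispersion to understand and the quickest to compute.
  2. It gives a rough estimate of the total spread of the data at a glance.

Disadvantages (Demerits):

  1. It depends only on the two extreme values, so a single outlier can change it drastically.
  2. It ignores how the remaining observations are distributed between the extremes.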

Due to these significant limitations, the range is typically used for a quick preliminary look at the data or in specific applications like statistical quality control, but it is not considered a robust measure of dispersion.


Example 1. Find the range and the coefficient of range for the following dataset of daily temperatures (°C): 15, 25, 18, 32, 40, 28, 12, 35.

Answer:

Given:

The data is: 15, 25, 18, 32, 40, 28, 12, 35.

Solution:

Step 1: Identify the maximum and minimum values.

By inspecting the data, we find:

Maximum Value ($X_{\text{max}}$) = 40 °C

Minimum Value ($X_{\text{min}}$) = 12 °C

Step 2: Calculate the Range.

Using the formula Range $= X_{\text{max}} - X_{\text{min}}$:

Range = 40 – 12 = 28 °C

Step 3: Calculate the Coefficient of Range.

Using the formula Coefficient of Range $= \frac{X_{\text{max}} - X_{\text{min}}}{X_{\text{max}} + X_{\text{min}}}$:

Coefficient of Range = $\frac{40 - 12}{40 + 12} = \frac{28}{52}$

Simplifying the fraction by dividing the numerator and denominator by 4:

Coefficient of Range = $\frac{7}{13}$

(As a decimal, this is approximately 0.538).

The final answer is: Range = 28 °C, Coefficient of Range = $\frac{7}{13}$.
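The same calculation can be sketched in Python. The variable names are illustrative only, and `Fraction` from the standard library is used so that the coefficient stays in exact form, as in the worked solution.

```python
from fractions import Fraction

# Daily temperatures (in degrees Celsius) from Example 1.
temperatures = [15, 25, 18, 32, 40, 28, 12, 35]

x_max = max(temperatures)
x_min = min(temperatures)

# Range = X_max - X_min
data_range = x_max - x_min

# Coefficient of Range = (X_max - X_min) / (X_max + X_min), kept as an exact fraction.
coefficient = Fraction(x_max - x_min, x_max + x_min)

print("Range =", data_range)
print("Coefficient of Range =", coefficient, "which is about", round(float(coefficient), 3))
```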



Mean Deviation

In statistics, after understanding the central tendency (like mean, median, mode) of a dataset, the next important aspect is to understand its variability or dispersion. Dispersion measures the extent to which the values in a distribution are spread out or scattered from the average. A simple measure is the Range, which is the difference between the maximum and minimum values. However, the range is a crude measure as it only depends on two extreme values and ignores the distribution of the rest of the observations.

To overcome this limitation, we use measures that involve all the data points. One such measure is the Mean Deviation. It provides a more robust understanding of the spread by calculating the average distance of each observation from a central value.


Definition and Concept of Mean Deviation

The Mean Deviation (MD) is defined as the arithmetic mean of the absolute deviations of the observations from a suitable measure of central tendency. This central value can be the mean, median, or mode, but it is most commonly calculated with respect to the mean or the median.

Why Absolute Deviations?

A deviation is the difference between an observation and the central value (e.g., $x_i - \overline{x}$). Some of these deviations will be positive (for values greater than the mean), and some will be negative (for values less than the mean). A key property of the arithmetic mean is that the sum of these deviations is always zero, i.e., $\sum (x_i - \overline{x}) = 0$.

For example, for data {2, 4, 9}, the mean is $\overline{x} = 5$. The deviations are $(2-5) = -3$, $(4-5) = -1$, and $(9-5) = 4$. The sum is $-3 - 1 + 4 = 0$.

Because the sum is always zero, the average deviation would also be zero, which is not a useful measure of spread. To solve this, we take the absolute value of each deviation, i.e., $|x_i - \overline{x}|$. This makes all deviations positive, and their average gives a meaningful value representing the average distance of the data points from the center.
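A short Python check of this argument, using the same data {2, 4, 9}, is sketched below. The variable names are chosen for this illustration.

```python
data = [2, 4, 9]
mean = sum(data) / len(data)  # 5.0

# Signed deviations cancel out, so their sum carries no information about spread.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

# Absolute deviations do not cancel; their average is the mean deviation.
absolute_deviations = [abs(d) for d in deviations]
mean_absolute_deviation = sum(absolute_deviations) / len(data)

print("Deviations:", deviations)
print("Sum of deviations:", sum_of_deviations)
print("Mean of absolute deviations:", mean_absolute_deviation)
```

The signed deviations sum to zero, whereas the absolute deviations sum to 8, giving an average distance of $\frac{8}{3} \approx 2.67$ from the mean.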


Mean Deviation for Ungrouped Data

Ungrouped data refers to data that is given as individual data points.

1. Mean Deviation about the Mean ($\text{MD}_{\overline{x}}$)

This measures the average absolute distance of each data point from the arithmetic mean of the dataset.

Formula and Derivation

Let the given data consist of $n$ observations $x_1, x_2, ..., x_n$ (repeated values are allowed).

Step 1: Calculate the mean of the data.

$\overline{x} = \frac{\sum\limits_{i=1}^{n} x_i}{n}$

Step 2: Find the deviation of each observation $x_i$ from the mean $\overline{x}$, which is $(x_i - \overline{x})$.

Step 3: Find the absolute value of these deviations, which is $|x_i - \overline{x}|$.

Step 4: Find the arithmetic mean of these absolute deviations. This is the Mean Deviation about the Mean.

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{n} |x_i - \overline{x}|}{n}$

... (i)

Example 1. Find the mean deviation about the mean for the data: 6, 7, 10, 12, 13, 4, 8, 12.

Answer:

Given:

Data observations: $x_i$ = 6, 7, 10, 12, 13, 4, 8, 12.

Number of observations, $n=8$.

To Find:

Mean Deviation about the Mean ($\text{MD}_{\overline{x}}$).

Solution:

Step 1: Calculate the mean ($\overline{x}$).

Sum of observations = $6 + 7 + 10 + 12 + 13 + 4 + 8 + 12 = 72$.

Mean, $\overline{x} = \frac{\sum x_i}{n} = \frac{72}{8} = 9$.

Step 2: Calculate the absolute deviations from the mean, $|x_i - 9|$.

We create a table for clarity:

$x_i$    $|x_i - \overline{x}| = |x_i - 9|$
4        $|4 - 9| = 5$
6        $|6 - 9| = 3$
7        $|7 - 9| = 2$
8        $|8 - 9| = 1$
10       $|10 - 9| = 1$
12       $|12 - 9| = 3$
12       $|12 - 9| = 3$
13       $|13 - 9| = 4$
Total    $\sum\limits_{i=1}^{8} |x_i - \overline{x}| = 22$

Step 3: Calculate the mean deviation about the mean.

Using the formula (i):

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{8} |x_i - \overline{x}|}{n} = \frac{22}{8} = 2.75$.

Thus, the mean deviation about the mean is 2.75.
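The four steps of the procedure can be sketched as a small Python function. The name `mean_deviation_about_mean` is an assumption made for this illustration.

```python
def mean_deviation_about_mean(values):
    """Average absolute distance of the observations from their arithmetic mean."""
    n = len(values)
    mean = sum(values) / n                                # Step 1: the mean
    absolute_deviations = [abs(x - mean) for x in values]  # Steps 2 and 3
    return sum(absolute_deviations) / n                   # Step 4: their average


data = [6, 7, 10, 12, 13, 4, 8, 12]  # data from Example 1
md_mean = mean_deviation_about_mean(data)
print("Mean deviation about the mean =", md_mean)
```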

2. Mean Deviation about the Median ($\text{MD}_M$)

This measures the average absolute distance of each data point from the median of the dataset. An important property is that the mean deviation is minimum when calculated from the median.

Formula and Derivation

Let the given data consist of $n$ observations $x_1, x_2, ..., x_n$ (repeated values are allowed).

Step 1: Arrange the data in ascending order.

Step 2: Calculate the median ($M$) of the data.

$M = \begin{cases} \left(\frac{n+1}{2}\right)^{th} \text{observation} & , & \text{if } n \text{ is odd} \\ \frac{\left(\frac{n}{2}\right)^{th} \text{obs} + \left(\frac{n}{2}+1\right)^{th} \text{obs}}{2} & , & \text{if } n \text{ is even} \end{cases}$

Step 3: Find the absolute value of the deviations from the median, which is $|x_i - M|$.

Step 4: Find the arithmetic mean of these absolute deviations.

$\text{MD}_M = \frac{\sum\limits_{i=1}^{n} |x_i - M|}{n}$

... (ii)

Example 2. Find the mean deviation about the median for the data: 3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21.

Answer:

Given:

Data observations: $x_i$ = 3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21.

Number of observations, $n=11$.

To Find:

Mean Deviation about the Median ($\text{MD}_M$).

Solution:

Step 1: Arrange the data in ascending order.

3, 3, 4, 5, 7, 9, 10, 12, 18, 19, 21.

Step 2: Calculate the median ($M$).

Since $n = 11$ (odd), the median is the $\left(\frac{11+1}{2}\right)^{th}$ term, which is the 6th term.

Median, $M = 9$.

Step 3: Calculate the absolute deviations from the median, $|x_i - 9|$.

$x_i$    $|x_i - M| = |x_i - 9|$
3        $|3 - 9| = 6$
3        $|3 - 9| = 6$
4        $|4 - 9| = 5$
5        $|5 - 9| = 4$
7        $|7 - 9| = 2$
9        $|9 - 9| = 0$
10       $|10 - 9| = 1$
12       $|12 - 9| = 3$
18       $|18 - 9| = 9$
19       $|19 - 9| = 10$
21       $|21 - 9| = 12$
Total    $\sum\limits |x_i - M| = 58$

Step 4: Calculate the mean deviation about the median.

Using the formula (ii):

$\text{MD}_{M} = \frac{\sum\limits_{i=1}^{11} |x_i - M|}{n} = \frac{58}{11} \approx 5.27$.

Thus, the mean deviation about the median is approximately 5.27.
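A minimal Python sketch of the same procedure follows. It relies on `statistics.median` from the standard library, which sorts the data internally and averages the two middle values when $n$ is even. The function name `mean_deviation_about_median` is chosen for this illustration.

```python
from statistics import median


def mean_deviation_about_median(values):
    """Average absolute distance of the observations from their median."""
    m = median(values)
    return sum(abs(x - m) for x in values) / len(values)


data = [3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21]  # data from Example 2
m = median(data)
md_median = mean_deviation_about_median(data)

print("Median =", m)
print("Mean deviation about the median =", round(md_median, 2))
```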


Mean Deviation for Grouped Data

Grouped data is data that has been organized into a frequency distribution.

1. Discrete Frequency Distribution

In this format, each observation $x_i$ has a corresponding frequency $f_i$.

(a) Mean Deviation about the Mean

The formula is an extension of the ungrouped data formula, where each absolute deviation is weighted by its frequency.

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{k} f_i |x_i - \overline{x}|}{\sum\limits_{i=1}^{k} f_i} = \frac{\sum f_i |x_i - \overline{x}|}{N}$

... (iii)

where $k$ is the number of distinct observations, $N = \sum f_i$ is the total frequency, and the mean is $\overline{x} = \frac{\sum f_i x_i}{N}$.

Example 3. Find the mean deviation about the mean for the following data:

$x_i$    2    5    6     8    10    12
$f_i$    2    8    10    7    8     5

Answer:

Solution:

We first need to calculate the mean $\overline{x}$. We can do this in a tabular format, which also helps in calculating the mean deviation.

$x_i$    $f_i$     $f_i x_i$               $|x_i - \overline{x}| = |x_i - 7.5|$    $f_i |x_i - 7.5|$
2        2         4                       $|2-7.5|=5.5$                           $2 \times 5.5 = 11.0$
5        8         40                      $|5-7.5|=2.5$                           $8 \times 2.5 = 20.0$
6        10        60                      $|6-7.5|=1.5$                           $10 \times 1.5 = 15.0$
8        7         56                      $|8-7.5|=0.5$                           $7 \times 0.5 = 3.5$
10       8         80                      $|10-7.5|=2.5$                          $8 \times 2.5 = 20.0$
12       5         60                      $|12-7.5|=4.5$                          $5 \times 4.5 = 22.5$
Total    $N=40$    $\sum f_i x_i = 300$                                            $\sum f_i|x_i - \overline{x}|=92.0$

Step 1: Calculate the mean.

$\overline{x} = \frac{\sum f_i x_i}{N} = \frac{300}{40} = 7.5$.

Step 2: Calculate $\sum f_i|x_i - \overline{x}|$.

From the table, this sum is 92.0.

Step 3: Calculate the mean deviation.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N} = \frac{92.0}{40} = 2.3$.

The mean deviation about the mean is 2.3.
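For a frequency table, the same idea can be sketched in Python by weighting each value with its frequency. The function name `grouped_mean_deviation_about_mean` and its two-list interface are assumptions made for this example.

```python
def grouped_mean_deviation_about_mean(xs, fs):
    """Mean deviation about the mean for values xs with frequencies fs."""
    total = sum(fs)                                      # N = sum of f_i
    mean = sum(f * x for x, f in zip(xs, fs)) / total    # x-bar = sum(f_i x_i) / N
    weighted = sum(f * abs(x - mean) for x, f in zip(xs, fs))
    return weighted / total


# Data from Example 3.
xs = [2, 5, 6, 8, 10, 12]
fs = [2, 8, 10, 7, 8, 5]

md_grouped = grouped_mean_deviation_about_mean(xs, fs)
print("Mean deviation about the mean =", round(md_grouped, 2))
```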

(b) Mean Deviation about the Median

The formula is similar, using the median as the central value.

$\text{MD}_M = \frac{\sum\limits_{i=1}^{k} f_i |x_i - M|}{N}$

... (iv)

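To find the median for discrete data, we first find the cumulative frequency (c.f.). The $k^{th}$ observation is the value $x_i$ whose cumulative frequency is just greater than or equal to $k$. When $N$ is odd, the median is the $\left(\frac{N+1}{2}\right)^{th}$ observation. When $N$ is even, the median is the average of the $\left(\frac{N}{2}\right)^{th}$ and $\left(\frac{N}{2}+1\right)^{th}$ observations.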

Example 4. Find the mean deviation about the median for the data in Example 3.

Answer:

Solution:

We first need to find the median. For this, we calculate the cumulative frequency (c.f.).

$x_i$    $f_i$     c.f.
2        2         2
5        8         10
6        10        20
8        7         27
10       8         35
12       5         40
Total    $N=40$

Step 1: Find the median.

Here, $N=40$, so $\frac{N}{2} = \frac{40}{2} = 20$. Since $N$ is even, the median is the average of the 20th and 21st observations.

The cumulative frequency reaches 20 at $x_i = 6$, so the 20th observation is 6. The cumulative frequency for the next value (8) is 27, which covers the 21st to 27th observations, so the 21st observation is 8.

Median, $M = \frac{6+8}{2} = 7$.

We now tabulate the absolute deviations from $M = 7$.

$x_i$    $f_i$     $|x_i - M| = |x_i - 7|$    $f_i |x_i - 7|$
2        2         5                          10
5        8         2                          16
6        10        1                          10
8        7         1                          7
10       8         3                          24
12       5         5                          25
Total    $N=40$                               $\sum f_i|x_i - M|=92$

Step 2: Calculate $\sum f_i|x_i - M|$.

From the table, the sum is 92.

Step 3: Calculate the mean deviation.

$\text{MD}_{M} = \frac{\sum f_i |x_i - M|}{N} = \frac{92}{40} = 2.3$.

In this particular case, the mean deviation about the mean and median are the same. This is not always true.
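Locating the median through cumulative frequencies can be sketched in Python as follows. The helper names `kth_observation` and `discrete_median` are assumptions made for this illustration, and the values in `xs` are assumed to be listed in ascending order.

```python
from itertools import accumulate


def kth_observation(xs, cum_freqs, k):
    """Return the k-th smallest observation (1-indexed) of a frequency table."""
    for x, cf in zip(xs, cum_freqs):
        if cf >= k:  # first cumulative frequency that reaches k
            return x
    raise ValueError("k exceeds the total frequency")


def discrete_median(xs, fs):
    """Median of a discrete frequency distribution (xs must be sorted ascending)."""
    cum_freqs = list(accumulate(fs))
    n = cum_freqs[-1]
    if n % 2 == 1:
        return kth_observation(xs, cum_freqs, (n + 1) // 2)
    lower_middle = kth_observation(xs, cum_freqs, n // 2)
    upper_middle = kth_observation(xs, cum_freqs, n // 2 + 1)
    return (lower_middle + upper_middle) / 2


# Data from Example 3.
xs = [2, 5, 6, 8, 10, 12]
fs = [2, 8, 10, 7, 8, 5]

m = discrete_median(xs, fs)
md_about_median = sum(f * abs(x - m) for x, f in zip(xs, fs)) / sum(fs)

print("Median =", m)
print("Mean deviation about the median =", round(md_about_median, 2))
```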

2. Continuous Frequency Distribution

In this format, data is given in class intervals. We use the mid-point (or class mark) of each interval as the representative value $x_i$ for that class.

Mid-point $x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}$.

(a) Mean Deviation about the Mean

The formula is the same as for the discrete distribution, but $x_i$ are now the mid-points of the classes.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N}$

... (v)

where $\overline{x} = \frac{\sum f_i x_i}{N}$.

Example 5. Calculate the mean deviation about the mean for the following data:

Marks obtained    Number of students
10 - 20           2
20 - 30           3
30 - 40           8
40 - 50           14
50 - 60           8
60 - 70           3
70 - 80           2

Answer:

Solution:

We construct a table to calculate the necessary values.

Marks    $f_i$     Mid-point ($x_i$)    $f_i x_i$                $|x_i - \overline{x}| = |x_i - 45|$    $f_i |x_i - 45|$
10-20    2         15                   30                       30                                     60
20-30    3         25                   75                       20                                     60
30-40    8         35                   280                      10                                     80
40-50    14        45                   630                      0                                      0
50-60    8         55                   440                      10                                     80
60-70    3         65                   195                      20                                     60
70-80    2         75                   150                      30                                     60
Total    $N=40$                         $\sum f_i x_i = 1800$                                           $\sum f_i|x_i - \overline{x}|=400$

Step 1: Calculate the mean.

$\overline{x} = \frac{\sum f_i x_i}{N} = \frac{1800}{40} = 45$.

Step 2: Calculate $\sum f_i|x_i - \overline{x}|$.

From the table, this sum is 400.

Step 3: Calculate the mean deviation.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N} = \frac{400}{40} = 10$.

The mean deviation about the mean is 10.
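For class intervals, the only extra step is replacing each class by its mid-point. The sketch below follows that approach with the data of Example 5; the variable names are illustrative.

```python
# Class intervals (lower limit, upper limit) and frequencies from Example 5.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [2, 3, 8, 14, 8, 3, 2]

# Mid-point (class mark) of each interval represents the whole class.
mid_points = [(lower + upper) / 2 for lower, upper in classes]

total = sum(freqs)                                               # N
mean = sum(f * x for x, f in zip(mid_points, freqs)) / total     # x-bar
md_mean = sum(f * abs(x - mean) for x, f in zip(mid_points, freqs)) / total

print("Mid-points:", mid_points)
print("Mean =", mean)
print("Mean deviation about the mean =", md_mean)
```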

(b) Mean Deviation about the Median

For a continuous distribution, we first find the median class and then calculate the median using a formula.

Step 1: Find the Median Class. It is the class interval whose cumulative frequency is just greater than or equal to $\frac{N}{2}$.

Step 2: Calculate Median ($M$).

$M = l + \frac{\frac{N}{2} - C}{f} \times h$

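where $l$ is the lower limit of the median class, $N$ is the total frequency, $C$ is the cumulative frequency of the class just before the median class, $f$ is the frequency of the median class, and $h$ is the width of the median class.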

Step 3: Calculate Mean Deviation about the Median.

$\text{MD}_M = \frac{\sum f_i |x_i - M|}{N}$

... (vi)

Example 6. Calculate the mean deviation about the median for the data in Example 5.

Answer:

Solution:

First, we find the median by constructing a table with cumulative frequencies.

Marks    $f_i$     c.f.    Mid-point ($x_i$)    $|x_i - M| = |x_i - 45|$    $f_i |x_i - 45|$
10-20    2         2       15                   30                          60
20-30    3         5       25                   20                          60
30-40    8         13      35                   10                          80
40-50    14        27      45                   0                           0
50-60    8         35      55                   10                          80
60-70    3         38      65                   20                          60
70-80    2         40      75                   30                          60
Total    $N=40$                                                             $\sum f_i|x_i - M|=400$

Step 1: Find the median class.

$N=40$, so $\frac{N}{2} = 20$.

The cumulative frequency just greater than 20 is 27. The corresponding class is 40-50. So, the Median Class is 40-50.

Step 2: Calculate the median.

$l = 40$, $N=40$, $C = 13$, $f = 14$, $h = 10$.

$M = 40 + \frac{20 - 13}{14} \times 10 = 40 + \frac{7}{14} \times 10 = 40 + \frac{1}{2} \times 10 = 40 + 5 = 45$.

Median $M = 45$.

Step 3: Calculate the mean deviation about the median.

Since the median (45) is the same as the mean in this case, the calculations for $|x_i - M|$ and $f_i |x_i - M|$ will be identical to the mean deviation calculation.

From the table, $\sum f_i|x_i - M|=400$.

$\text{MD}_{M} = \frac{\sum f_i |x_i - M|}{N} = \frac{400}{40} = 10$.

The mean deviation about the median is 10.
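The median formula $M = l + \frac{\frac{N}{2} - C}{f} \times h$ can be sketched in Python as below. The function name `grouped_median` is an assumption made for this example, and the classes are assumed to be contiguous and listed in ascending order.

```python
def grouped_median(classes, freqs):
    """Median of a continuous distribution via M = l + ((N/2 - C) / f) * h."""
    total = sum(freqs)       # N
    half = total / 2         # N / 2
    cum_before = 0           # C: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if cum_before + f >= half:  # first class whose c.f. reaches N/2
            width = upper - lower   # h
            return lower + (half - cum_before) / f * width
        cum_before += f
    raise ValueError("frequency distribution is empty")


# Data from Example 5.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [2, 3, 8, 14, 8, 3, 2]

m = grouped_median(classes, freqs)
mid_points = [(lower + upper) / 2 for lower, upper in classes]
md_median = sum(f * abs(x - m) for x, f in zip(mid_points, freqs)) / sum(freqs)

print("Median =", m)
print("Mean deviation about the median =", md_median)
```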


Merits and Demerits of Mean Deviation

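Merits (Advantages):

  1. It is based on every observation in the dataset, so it reflects the overall spread better than the range.
  2. It has a direct interpretation as the average distance of the observations from the chosen central value.

Demerits (Disadvantages):

  1. It relies on absolute values, which are awkward to handle algebraically, so it is inconvenient for further statistical analysis.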



Variance and Standard Deviation

In the study of dispersion, Mean Deviation provides a good intuitive measure of spread by averaging the absolute distances from a central point. However, the use of the absolute value function ($|...|$) makes it mathematically inconvenient for more advanced statistical analysis and inference. To overcome this algebraic limitation, statisticians developed Variance and Standard Deviation. These are the most crucial and widely used measures of dispersion in statistics because they are based on squared deviations, which have more desirable mathematical properties.


Variance ($\sigma^2$)

The Variance is defined as the arithmetic mean of the squared deviations of the observations from their arithmetic mean. It quantifies the degree of spread in a set of data points.

The process involves:

  1. Calculating the mean ($\overline{x}$) of the data.
  2. Finding the deviation of each observation from the mean ($x_i - \overline{x}$).
  3. Squaring each deviation ($(x_i - \overline{x})^2$).
  4. Finding the average of these squared deviations.

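Squaring the deviations accomplishes two important things:

  1. It makes every term non-negative, so positive and negative deviations cannot cancel each other out.
  2. It gives greater weight to observations that lie further from the mean.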

Variance is denoted by the Greek letter sigma squared ($\sigma^2$). A significant drawback of variance is that its units are the square of the original data units (e.g., if the data is in centimetres, the variance is in square centimetres). This makes it difficult to interpret in the context of the original data.


Standard Deviation ($\sigma$)

The Standard Deviation is the measure of dispersion that resolves the unit-interpretation problem of variance. It is defined as the positive square root of the variance.

By taking the square root, the units of the standard deviation become the same as the units of the original data, making it directly comparable and interpretable. The Standard Deviation is the most common and important measure of dispersion.

It represents a "typical" or "standard" amount of deviation (distance) of a data point from the mean. It is denoted by $\sigma$ (sigma).


Variance and Standard Deviation for Ungrouped Data

1. Definitional Formulas

For a set of $n$ observations $x_1, x_2, ..., x_n$ with mean $\overline{x}$:

The Variance is given by:

$\sigma^2 = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}$

... (i)

The Standard Deviation is given by:

$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}}$

... (ii)

2. Shortcut (Computational) Formula and its Derivation

Calculating $(x_i - \overline{x})$ for every data point can be tedious, especially if the mean $\overline{x}$ is a decimal. A computationally simpler formula exists.

Derivation:

We start with the definitional formula for variance:

$\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^{n} (x_i - \overline{x})^2$

Expanding the squared term:

$\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^{n} (x_i^2 - 2x_i\overline{x} + \overline{x}^2)$

Distributing the summation:

$\sigma^2 = \frac{1}{n} \left[ \sum\limits_{i=1}^{n} x_i^2 - \sum\limits_{i=1}^{n} 2x_i\overline{x} + \sum\limits_{i=1}^{n} \overline{x}^2 \right]$

Since $2\overline{x}$ and $\overline{x}^2$ are constants with respect to the summation:

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2\overline{x}\sum x_i + n\overline{x}^2 \right]$

We know that the mean $\overline{x} = \frac{\sum x_i}{n}$, which implies $\sum x_i = n\overline{x}$. Substituting this:

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2\overline{x}(n\overline{x}) + n\overline{x}^2 \right]$

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2n\overline{x}^2 + n\overline{x}^2 \right]$

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - n\overline{x}^2 \right]$

Distributing the $\frac{1}{n}$ term gives the shortcut formula:

$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2 = \frac{\sum x_i^2}{n} - (\overline{x})^2$

... (iii)

This formula, "the mean of the squares minus the square of the mean," is often much faster for calculations.
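As a quick numerical check of formula (iii), the sketch below computes the variance both ways. The function names and the sample data are illustrative assumptions and do not come from the chapter.

```python
def variance_definitional(data):
    """Population variance via formula (i): mean of the squared deviations."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def variance_shortcut(data):
    """Population variance via formula (iii): mean of squares minus square of mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(x * x for x in data) / n - mean ** 2


# Hypothetical sample, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
v_def = variance_definitional(sample)
v_short = variance_shortcut(sample)
print(v_def, v_short)  # the two values should agree
```

For hand calculation the shortcut is usually quicker. In floating-point arithmetic, however, subtracting two large, nearly equal numbers can lose precision, so the definitional form is often the safer choice on a computer.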

Example 1. Find the variance and standard deviation for the data: 6, 8, 10, 12, 14.

Answer:

Method 1: Using the Definitional Formula

Step 1: Calculate the mean ($\overline{x}$).

$\sum x_i = 6 + 8 + 10 + 12 + 14 = 50$

$\overline{x} = \frac{50}{5} = 10$

Step 2: Calculate the sum of squared deviations.

| $x_i$ | $(x_i - \overline{x})$ | $(x_i - \overline{x})^2$ |
|---|---|---|
| 6 | 6 - 10 = -4 | 16 |
| 8 | 8 - 10 = -2 | 4 |
| 10 | 10 - 10 = 0 | 0 |
| 12 | 12 - 10 = 2 | 4 |
| 14 | 14 - 10 = 4 | 16 |
| Total | | $\sum (x_i - \overline{x})^2 = 40$ |

Step 3: Calculate the Variance ($\sigma^2$).

$\sigma^2 = \frac{\sum (x_i - \overline{x})^2}{n} = \frac{40}{5} = 8$

Method 2: Using the Shortcut Formula

Step 1: Calculate $\sum x_i$ and $\sum x_i^2$.

$\sum x_i = 50$, so $\overline{x} = 10$ and $(\overline{x})^2 = 100$.

$\sum x_i^2 = 6^2 + 8^2 + 10^2 + 12^2 + 14^2 = 36 + 64 + 100 + 144 + 196 = 540$.

Step 2: Apply the shortcut formula.

$\sigma^2 = \frac{\sum x_i^2}{n} - (\overline{x})^2 = \frac{540}{5} - (10)^2 = 108 - 100 = 8$.

Conclusion

Both methods give the same variance: $\sigma^2 = 8$.

Final Step: Calculate the Standard Deviation ($\sigma$).

$\sigma = \sqrt{\text{Variance}} = \sqrt{8} = 2\sqrt{2} \approx 2.828$.

The final answer is: Variance = 8, Standard Deviation $\approx 2.828$.
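The result of Example 1 can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are illustrative.

```python
import statistics

data = [6, 8, 10, 12, 14]

# pvariance and pstdev divide by n (population formulas), matching (i) and (ii).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

print(variance, round(std_dev, 3))
```

Note that `statistics.variance` and `statistics.stdev` divide by $n - 1$ (the sample formulas) and would therefore give a different answer from the one in this chapter.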


Variance and Standard Deviation for Grouped Data

1. Discrete Frequency Distribution

For data with observations $x_i$ having corresponding frequencies $f_i$:

Variance: $\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{k} f_i(x_i - \overline{x})^2$, where $N=\sum f_i$ and $\overline{x}=\frac{\sum f_i x_i}{N}$.

Standard Deviation: $\sigma = \sqrt{\frac{1}{N} \sum\limits_{i=1}^{k} f_i(x_i - \overline{x})^2}$.

Shortcut Formula: $\sigma^2 = \frac{\sum f_i x_i^2}{N} - \left(\frac{\sum f_i x_i}{N}\right)^2 = \frac{\sum f_i x_i^2}{N} - (\overline{x})^2$.

Example 2. Find the variance and standard deviation for the following data:

| $x_i$ | 4 | 8 | 11 | 17 | 20 | 24 | 32 |
|---|---|---|---|---|---|---|---|
| $f_i$ | 3 | 5 | 9 | 5 | 4 | 3 | 1 |

Answer:

We will use a table to organize the calculations for the shortcut formula.

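| $x_i$ | $f_i$ | $f_i x_i$ | $x_i^2$ | $f_i x_i^2$ |
|---|---|---|---|---|
| 4 | 3 | 12 | 16 | 48 |
| 8 | 5 | 40 | 64 | 320 |
| 11 | 9 | 99 | 121 | 1089 |
| 17 | 5 | 85 | 289 | 1445 |
| 20 | 4 | 80 | 400 | 1600 |
| 24 | 3 | 72 | 576 | 1728 |
| 32 | 1 | 32 | 1024 | 1024 |
| Total | $N=30$ | $\sum f_i x_i=420$ | | $\sum f_i x_i^2=7254$ |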

Step 1: Calculate the mean ($\overline{x}$).

$\overline{x} = \frac{\sum f_i x_i}{N} = \frac{420}{30} = 14$.

Step 2: Calculate the Variance ($\sigma^2$) using the shortcut formula.

$\sigma^2 = \frac{\sum f_i x_i^2}{N} - (\overline{x})^2$

$\sigma^2 = \frac{7254}{30} - (14)^2 = 241.8 - 196 = 45.8$.

Step 3: Calculate the Standard Deviation ($\sigma$).

$\sigma = \sqrt{45.8} \approx 6.77$.

The final answer is: Variance $= 45.8$, Standard Deviation $\approx 6.77$.
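The table work in Example 2 can be reproduced with a few lines of Python. This is a minimal sketch of the shortcut formula for a discrete frequency distribution; the function name `grouped_variance` is an assumption made for illustration.

```python
def grouped_variance(values, freqs):
    """Shortcut formula for a discrete frequency distribution.

    Returns N, sum(f*x), sum(f*x^2) and the variance.
    """
    n_total = sum(freqs)                                      # N
    sum_fx = sum(f * x for x, f in zip(values, freqs))        # sum of f_i x_i
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))   # sum of f_i x_i^2
    mean = sum_fx / n_total
    return n_total, sum_fx, sum_fx2, sum_fx2 / n_total - mean ** 2


# Data from Example 2.
x = [4, 8, 11, 17, 20, 24, 32]
f = [3, 5, 9, 5, 4, 3, 1]

N, sum_fx, sum_fx2, var = grouped_variance(x, f)
print(N, sum_fx, sum_fx2, round(var, 2), round(var ** 0.5, 2))
```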

2. Continuous Frequency Distribution and Shortcut Methods

For continuous distributions, we use the mid-point of each class interval as $x_i$. The formulas are the same as for discrete distributions. However, when the mid-points ($x_i$) or frequencies ($f_i$) are large, calculations become tedious. We use the Deviation Method or Step-Deviation Method to simplify them.

Derivation of Variance Formulas for Grouped Data

(a) Deviation Method Formula Derivation

This method aims to simplify calculations by shifting the origin of the data to an 'assumed mean' ($a$). The relationship between the actual deviations from the mean ($x_i - \overline{x}$) and the new deviations from the assumed mean ($d_i = x_i - a$) is used to derive the formula.

Derivation:

We begin with the fundamental formula for variance of a discrete frequency distribution:

$\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{k} f_i(x_i - \overline{x})^2$

... (A)

We know the relationship between the true mean ($\overline{x}$) and the assumed mean ($a$) is given by $\overline{x} = a + \overline{d}$, where $\overline{d} = \frac{\sum f_i d_i}{N}$ and $d_i = x_i - a$.

Let's substitute this into the term $(x_i - \overline{x})$:

$x_i - \overline{x} = x_i - (a + \overline{d})$

Since $d_i = x_i - a$, we can rewrite the above as:

$x_i - \overline{x} = (x_i - a) - \overline{d} = d_i - \overline{d}$

Now, substitute this back into the variance formula (A):

$\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{k} f_i(d_i - \overline{d})^2$

This formula is identical in structure to the original definition of variance, just with $d_i$ replacing $x_i$ and $\overline{d}$ replacing $\overline{x}$. We can now apply the same logic used to derive the shortcut formula.

Expand the squared term:

$\sigma^2 = \frac{1}{N} \sum\limits_{i=1}^{k} f_i(d_i^2 - 2d_i\overline{d} + \overline{d}^2)$

Distribute the summation and the frequency $f_i$:

$\sigma^2 = \frac{1}{N} \left[ \sum f_i d_i^2 - \sum 2f_i d_i\overline{d} + \sum f_i \overline{d}^2 \right]$

Since $\overline{d}$ is a constant, it can be taken out of the summation:

$\sigma^2 = \frac{1}{N} \left[ \sum f_i d_i^2 - 2\overline{d} \sum f_i d_i + \overline{d}^2 \sum f_i \right]$

We know that $\sum f_i = N$ and by definition, $\overline{d} = \frac{\sum f_i d_i}{N}$. Substitute these in:

$\sigma^2 = \frac{1}{N} \left[ \sum f_i d_i^2 - 2\overline{d} (N\overline{d}) + \overline{d}^2 (N) \right]$

$\sigma^2 = \frac{1}{N} \left[ \sum f_i d_i^2 - 2N\overline{d}^2 + N\overline{d}^2 \right]$

$\sigma^2 = \frac{1}{N} \left[ \sum f_i d_i^2 - N\overline{d}^2 \right]$

Distribute the $\frac{1}{N}$ term:

$\sigma^2 = \frac{\sum f_i d_i^2}{N} - \overline{d}^2$

Finally, substitute back the expression for $\overline{d}$:

$\sigma^2 = \frac{\sum f_i d_i^2}{N} - \left(\frac{\sum f_i d_i}{N}\right)^2$

... (iv)

This is the required formula for variance using the deviation method.
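Formula (iv) implies that the variance does not depend on which assumed mean $a$ is chosen. The sketch below illustrates this on the data of Example 2; the function name and the particular values of $a$ are assumptions chosen for illustration.

```python
def variance_deviation_method(values, freqs, a):
    """Formula (iv): variance from deviations d_i = x_i - a about an assumed mean a."""
    n_total = sum(freqs)
    d = [x - a for x in values]
    sum_fd = sum(fi * di for fi, di in zip(freqs, d))
    sum_fd2 = sum(fi * di * di for fi, di in zip(freqs, d))
    return sum_fd2 / n_total - (sum_fd / n_total) ** 2


# Data from Example 2.
x = [4, 8, 11, 17, 20, 24, 32]
f = [3, 5, 9, 5, 4, 3, 1]

# Three different assumed means; a = 0 reduces to the ordinary shortcut formula.
results = [variance_deviation_method(x, f, a) for a in (0, 14, 17)]
print([round(r, 2) for r in results])
```

In hand calculation, a good choice of $a$ keeps the deviations $d_i$ small. The result is the same either way.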


(b) Step-Deviation Method Formula Derivation

This method further simplifies the deviation method by scaling down the deviations ($d_i$) by a common factor, $h$ (usually the class size). This results in smaller numbers ($u_i$) that are easier to work with.

Derivation:

We start with the derived formula for the deviation method (iv):

$\sigma^2 = \frac{\sum f_i d_i^2}{N} - \left(\frac{\sum f_i d_i}{N}\right)^2$

... (B)

The step-deviation, $u_i$, is defined as:

$u_i = \frac{d_i}{h} = \frac{x_i - a}{h}$

From this definition, we can express the deviation $d_i$ in terms of the step-deviation $u_i$:

$d_i = h u_i$

Now, we substitute $d_i = h u_i$ into the variance formula (B):

$\sigma^2 = \frac{\sum f_i (h u_i)^2}{N} - \left(\frac{\sum f_i (h u_i)}{N}\right)^2$

Since $h$ is a constant for all classes, $h^2$ is also a constant. We can factor these constants out of the summation signs:

$\sigma^2 = \frac{h^2 \sum f_i u_i^2}{N} - \left(\frac{h \sum f_i u_i}{N}\right)^2$

Apply the square to the second term:

$\sigma^2 = \frac{h^2 \sum f_i u_i^2}{N} - \frac{h^2 \left(\sum f_i u_i\right)^2}{N^2}$

Now, we can factor out the common term $h^2$ from the entire expression:

$\sigma^2 = h^2 \left[ \frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2 \right]$

... (v)

This is the required formula for variance using the step-deviation method.

The Standard Deviation is simply the positive square root of this variance:

$\sigma = \sqrt{h^2 \left[ \frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2 \right]}$

Since $h > 0$, we have $\sqrt{h^2} = h$, so $h$ can be taken out of the square root. This gives the final formula for standard deviation:

$\sigma = h \sqrt{ \frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2 }$

... (vi)

Example 3. Calculate the standard deviation for the following data using the step-deviation method.

| Class | Frequency ($f_i$) |
|---|---|
| 30 - 40 | 3 |
| 40 - 50 | 7 |
| 50 - 60 | 12 |
| 60 - 70 | 15 |
| 70 - 80 | 8 |
| 80 - 90 | 3 |
| 90 - 100 | 2 |

Answer:

Let's use the step-deviation method. We choose the assumed mean $a=65$ (mid-point of the class with highest frequency) and the common factor $h=10$ (the class size).

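| Class | $f_i$ | Mid-point ($x_i$) | $u_i = \frac{x_i - 65}{10}$ | $f_i u_i$ | $u_i^2$ | $f_i u_i^2$ |
|---|---|---|---|---|---|---|
| 30-40 | 3 | 35 | -3 | -9 | 9 | 27 |
| 40-50 | 7 | 45 | -2 | -14 | 4 | 28 |
| 50-60 | 12 | 55 | -1 | -12 | 1 | 12 |
| 60-70 | 15 | 65 (a) | 0 | 0 | 0 | 0 |
| 70-80 | 8 | 75 | 1 | 8 | 1 | 8 |
| 80-90 | 3 | 85 | 2 | 6 | 4 | 12 |
| 90-100 | 2 | 95 | 3 | 6 | 9 | 18 |
| Total | $N=50$ | | | $\sum f_i u_i = -15$ | | $\sum f_i u_i^2=105$ |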

Step 1: Identify the values from the table.

$N=50$, $\sum f_i u_i = -15$, $\sum f_i u_i^2=105$, $h=10$.

Step 2: Calculate the Variance ($\sigma^2$) using formula (v).

$\sigma^2 = h^2 \left[ \frac{\sum f_i u_i^2}{N} - \left(\frac{\sum f_i u_i}{N}\right)^2 \right]$

$\sigma^2 = 10^2 \left[ \frac{105}{50} - \left(\frac{-15}{50}\right)^2 \right]$

$\sigma^2 = 100 \left[ 2.1 - (-0.3)^2 \right]$

$\sigma^2 = 100 \left[ 2.1 - 0.09 \right]$

$\sigma^2 = 100 [2.01] = 201$.

Step 3: Calculate the Standard Deviation ($\sigma$).

$\sigma = \sqrt{201} \approx 14.18$.

The standard deviation for the given data is approximately 14.18.
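The step-deviation calculation of Example 3 can be sketched in Python as follows. The variable names are illustrative; the values of $a$ and $h$ are the ones chosen in the example.

```python
# Class intervals and frequencies from Example 3.
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [3, 7, 12, 15, 8, 3, 2]
a, h = 65, 10  # assumed mean and class size, as in Example 3

midpoints = [(lo + hi) / 2 for lo, hi in classes]   # x_i
u = [(x - a) / h for x in midpoints]                # step-deviations u_i

N = sum(freqs)
sum_fu = sum(f * ui for f, ui in zip(freqs, u))
sum_fu2 = sum(f * ui * ui for f, ui in zip(freqs, u))

variance = h ** 2 * (sum_fu2 / N - (sum_fu / N) ** 2)   # formula (v)
std_dev = variance ** 0.5                                # formula (vi)
print(N, sum_fu, sum_fu2, round(variance, 2), round(std_dev, 2))
```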



Coefficient of Variation

Measures of dispersion like Standard Deviation and Variance give us an understanding of the absolute spread or variability within a single dataset. For instance, a standard deviation of 10 cm tells us how much the heights in a group typically vary. However, what if we want to compare the variability of two different groups? The standard deviation alone can be misleading in such cases, especially if:

  1. The groups have different units of measurement (e.g., comparing height in cm to weight in kg).
  2. The groups have the same units but their average values (means) are significantly different.

To perform a meaningful comparison, we need a relative measure of dispersion. The most important and widely used relative measure is the Coefficient of Variation (CV).


Definition and Formula of Coefficient of Variation

The Coefficient of Variation (CV) is a standardized, relative measure of dispersion. It expresses the standard deviation as a percentage of the arithmetic mean. In simple terms, it measures the "scatter per unit of the mean," allowing for a fair comparison of variability between different datasets.

The formula is given by:

CV = $\frac{\sigma}{\overline{x}} \times 100\%$

... (i)

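Where:

$\sigma$ is the standard deviation of the data.

$\overline{x}$ is the arithmetic mean of the data (the CV is defined only when $\overline{x} \neq 0$).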

A key feature of the CV is that it is a pure number without any units. Because both the standard deviation ($\sigma$) and the mean ($\overline{x}$) have the same units, these units cancel out when we divide them. Multiplying by 100 simply presents this ratio as an easy-to-interpret percentage.


Interpretation and Application of CV

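The Coefficient of Variation is the primary tool for comparing the consistency, stability, or uniformity of two or more groups. The interpretation is straightforward:

A higher CV indicates greater relative variability, that is, less consistency.

A lower CV indicates less relative variability, that is, more consistency.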

Use Case 1: Comparing Data with Different Units

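Imagine a health study where we want to compare the variability of patients' heights (in cm) with the variability of their weights (in kg). Suppose we find that the standard deviation of the heights is 10 cm and the standard deviation of the weights is 8 kg.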

We cannot conclude that heights are more variable just because 10 is greater than 8. The units are different, so a direct comparison is meaningless. The CV solves this. If the mean height is 170 cm and the mean weight is 70 kg:

CV$_{\text{height}} = \frac{10}{170} \times 100\% \approx 5.9\%$

CV$_{\text{weight}} = \frac{8}{70} \times 100\% \approx 11.4\%$

Now we can make a fair comparison: The variability in weight (11.4%) is relatively much greater than the variability in height (5.9%) for this group.

Use Case 2: Comparing Data with Widely Different Means

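Consider two cricket batsmen, Virat and Rohit, with the following statistics for a season:

Virat: mean score = 75 runs, standard deviation = 15 runs.

Rohit: mean score = 40 runs, standard deviation = 12 runs.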

Looking only at the standard deviation, Virat (15) appears more variable than Rohit (12). However, this is misleading because their average scores are very different. A deviation of 15 runs from an average of 75 is less significant than a deviation of 12 runs from an average of 40. The CV provides the correct perspective:

CV for Virat = $\frac{15}{75} \times 100\% = 20\%$

CV for Rohit = $\frac{12}{40} \times 100\% = 30\%$

The CV reveals that Rohit's scoring is relatively more variable (30%) than Virat's (20%). Therefore, Virat is the more consistent batsman.
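The two use cases above can be reproduced with a small helper. This is a minimal sketch; the function name `coefficient_of_variation` is an assumption made for illustration, and the numbers are the ones quoted in the text.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage: (sigma / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean * 100


# Use Case 1: different units (cm vs kg).
cv_height = coefficient_of_variation(10, 170)
cv_weight = coefficient_of_variation(8, 70)

# Use Case 2: same units, widely different means.
cv_virat = coefficient_of_variation(15, 75)
cv_rohit = coefficient_of_variation(12, 40)

print(round(cv_height, 1), round(cv_weight, 1))
print(round(cv_virat, 1), round(cv_rohit, 1))
```

The group with the smaller CV is the more consistent one in each comparison.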


Example 1. The mean and standard deviation of the salaries of two firms, A and B, are given below:

| Firm | Mean Salary | Standard Deviation |
|---|---|---|
| A | 25,000 | 3,000 |
| B | 28,000 | 3,500 |

Which firm has greater variability in individual salaries? Which firm has more consistent salaries?

Answer:

Given:

For Firm A: Mean $\overline{x}_A = 25000$, Standard Deviation $\sigma_A = 3000$.

For Firm B: Mean $\overline{x}_B = 28000$, Standard Deviation $\sigma_B = 3500$.

Solution:

To compare the variability of salaries, we must calculate the Coefficient of Variation (CV) for each firm, as their mean salaries are different.

For Firm A:

CV$_A = \frac{\sigma_A}{\overline{x}_A} \times 100\% = \frac{3000}{25000} \times 100\% = 0.12 \times 100\% = 12\%$

For Firm B:

CV$_B = \frac{\sigma_B}{\overline{x}_B} \times 100\% = \frac{3500}{28000} \times 100\% = \frac{1}{8} \times 100\% = 0.125 \times 100\% = 12.5\%$

Conclusion:

1. Variability: Since CV$_B$ (12.5%) > CV$_A$ (12%), Firm B shows greater relative variability in its individual salaries.

2. Consistency: Since Firm A has a lower CV, it means its salaries are more tightly clustered around the mean. Therefore, Firm A has more consistent salaries.


Example 2. An investor is considering two stocks, Stock X and Stock Y. The mean annual return and standard deviation for the past five years are given below.

| Stock | Mean Annual Return | Standard Deviation of Return |
|---|---|---|
| X | 18% | 6% |
| Y | 12% | 5% |

In finance, standard deviation is a measure of risk. Which stock is considered more risky or volatile?

Answer:

Given:

For Stock X: Mean $\overline{x}_X = 18$, Standard Deviation $\sigma_X = 6$.

For Stock Y: Mean $\overline{x}_Y = 12$, Standard Deviation $\sigma_Y = 5$.

Solution:

To compare the risk relative to the average return, we calculate the CV for each stock. A higher CV implies higher risk for every unit of return.

For Stock X:

CV$_X = \frac{\sigma_X}{\overline{x}_X} \times 100\% = \frac{6}{18} \times 100\% = \frac{1}{3} \times 100\% \approx 33.33\%$

For Stock Y:

CV$_Y = \frac{\sigma_Y}{\overline{x}_Y} \times 100\% = \frac{5}{12} \times 100\% \approx 0.4167 \times 100\% \approx 41.67\%$

Conclusion:

Since the CV of Stock Y (41.67%) is greater than the CV of Stock X (33.33%), Stock Y is considered more risky or volatile relative to its average return.