Chapter 15 Statistics (Concepts)

Welcome to Chapter 15: Statistics! While measures of central tendency like mean, median, and mode provide the 'center' of data, they fail to describe its variability or consistency. This chapter introduces the essential concept of Measures of Dispersion, which quantify how data points are scattered around a central value. Understanding spread is vital for assessing the reliability of any statistical distribution.

We begin with the Range, the simplest measure of spread. However, for a more precise analysis, we explore Mean Deviation about the mean or median. The most significant tools in this chapter are Variance ($\sigma^2$) and Standard Deviation ($\sigma$). The standard deviation is defined as the positive square root of variance: $$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$$ Because it is expressed in the same units as the original data, it is the most widely used measure of dispersion in scientific research.

To compare the consistency of different datasets, we utilize the Coefficient of Variation (CV): $$CV = \left( \frac{\sigma}{\bar{x}} \right) \times 100$$ A lower CV indicates greater stability. To enhance your understanding, this page includes visualizations, flowcharts, mindmaps, and practical examples. This page is prepared by learningspot.co to ensure a structured and comprehensive learning experience for every student.

Content On This Page
Describing the Dispersion	Different Methods of Measuring Dispersion	Range
Mean Deviation	Variance and Standard Deviation	Coefficient of Variation

Describing the Dispersion

While measures of central tendency like the mean or median provide a single representative value for the entire data set, Dispersion (also known as variability, scatter, or spread) describes how these individual observations are distributed around that central value. Two data sets can have the exact same mean, yet their nature can be entirely different due to their spread.

Significance of Dispersion

The study of dispersion is essential for several reasons:

1. To determine the reliability of an average: If the dispersion is small, the average is a highly representative value of the data. If the dispersion is large, the average is less reliable.

2. To compare the variability of two or more series: A higher degree of dispersion implies greater inconsistency. For instance, in an Indian industrial context, if the daily wages of workers in Factory A have less dispersion than in Factory B, it means wages in Factory A are more uniform.

3. To facilitate the use of other statistical measures: Dispersion is a prerequisite for calculating higher statistical tools like correlation, regression, and hypothesis testing.

Case Study

You have already studied measures of location, which give a central value around which the values of the variable are located. These measures of central tendency give a rough idea of where the data points are centred. However, to interpret the data better, we should also know how far the values spread around that central value.

Consider an example of two students, Arjun and Vijay, and their performance in 9 weekly mathematics class tests (Maximum Marks: 100).

Test Score Data

Let us assume that Arjun scored: $70, 72, 70, 68, 70, 72, 70, 68, 70$

Let us assume that Vijay scored: $100, 40, 90, 50, 100, 40, 80, 60, 70$

To analyze their performance, we first calculate the Mean ($\bar{x}$) and Median ($M$).

Statistical Summary

The Mean is calculated using the formula:

$\bar{x} = \frac{\sum x_i}{n}$

Student	Mean Score	Median Score
Arjun (Student A)	$70$	$70$
Vijay (Student B)	$70$	$70$

This tells us that the average performance of both Arjun and Vijay is exactly the same. Based strictly on the mean, one might conclude that both students are performing at the same level of proficiency. But this is only part of the story.
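As a quick check, the summary table can be reproduced with Python's standard `statistics` module. This is only a minimal sketch; the variable names `arjun` and `vijay` are chosen here for illustration.

```python
from statistics import mean, median

# Scores from the nine weekly tests, as listed above.
arjun = [70, 72, 70, 68, 70, 72, 70, 68, 70]
vijay = [100, 40, 90, 50, 100, 40, 80, 60, 70]

for name, scores in (("Arjun", arjun), ("Vijay", vijay)):
    # Mean and median agree for both students; only the spread differs.
    gap = max(scores) - min(scores)
    print(name, "mean:", mean(scores), "median:", median(scores), "highest - lowest:", gap)
```

Both students show a mean and median of 70, yet the gap between the highest and lowest score is 4 for Arjun and 60 for Vijay.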

Visual Interpretation of Spread

Now let us plot these scores as dots on a number line to see the "scatter" of their marks.

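For Arjun: (dot plot of Arjun's nine scores, all lying between 68 and 72)

For Vijay: (dot plot of Vijay's nine scores, scattered from 40 to 100)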

Analysis and Reliability

These diagrams show that while the average score of Arjun and Vijay is the same, the score distribution of Arjun is more 'compact' than that of Vijay, whose score distribution is comparatively dispersed (scattered) widely.

When Arjun sits for a test, his teacher is quite sure that he will score around $70$ marks. He is a highly reliable and consistent student. However, when Vijay sits for a test, the teacher keeps their fingers crossed. There is no certainty; he might top the class with $100$ marks or perform very poorly with $40$ marks, even though his overall average is also $70$.

As far as the teacher is concerned, Arjun's performance is predictable, whereas Vijay's performance shows high variability. This leads us to the second set of descriptives — measures of variability or spread of distribution.

Definition of Dispersion

The term Dispersion refers to the extent of scatter or variation of the data points around a central value. While the mean or median tells us about the location of the center, dispersion tells us how widely the observations are spread around that center and, therefore, how reliable the center is as a representative value.

Formal Definition and Characteristics

Dispersion is the measure of the variation of the items. Essentially, it quantifies how much the individual observations ($x_1, x_2, \dots, x_n$) differ from their arithmetic mean ($\bar{x}$) or median ($M$).

A good measure of dispersion should possess the following properties:

1. It should be easy to understand and simple to calculate.

2. It should be based on all the observations of the data set.

3. It should be rigidly defined.

4. It should be capable of further mathematical treatment (like calculating the dispersion of a combined series).

Classification of Dispersion Measures

The measures of dispersion can be broadly classified into two categories based on their application and units:

1. Absolute Measures of Dispersion

These measures are expressed in the same statistical units as the original data (e.g., $\textsf{₹}$, kilograms, meters). They are used to measure the spread within a single data set.

Examples: Range, Quartile Deviation, Mean Deviation, and Standard Deviation.

2. Relative Measures of Dispersion

These measures are expressed as ratios or percentages and are independent of the units of measurement. They are primarily used to compare the variability of two or more distributions that may have different units.

Examples: Coefficient of Variation, Coefficient of Range, Coefficient of Quartile Deviation.

Different Methods of Measuring Dispersion

To numerically capture the spread or variability within a dataset, statisticians use several different methods. Each method provides a single number that summarizes the dispersion, but they do so in different ways and are useful in different contexts. These methods can be grouped into two main categories: absolute measures and relative measures.

Absolute Measures of Dispersion

An absolute measure of dispersion describes the variability of a dataset using the same units as the original data. For example, if we are measuring the heights of students in centimeters (cm), an absolute measure of dispersion like the standard deviation will also be expressed in cm. This makes them easy to interpret in the context of the original data.

However, absolute measures are not suitable for comparing the variability of two datasets with different units (e.g., comparing the variability of students' heights in cm with their weights in kg) or with vastly different average values (e.g., comparing the variability in salaries at a small startup vs. a large corporation).

Common Absolute Measures:

Range: The simplest and quickest measure of dispersion. It is the difference between the maximum and minimum values in the dataset.
Formula: Range = Maximum Value – Minimum Value
Usefulness: Provides a quick, rough estimate of the total spread. It is highly sensitive to outliers (extreme values).

Quartile Deviation: A measure that focuses on the spread of the middle 50% of the data, making it resistant to outliers. It is half of the interquartile range (IQR).
Formula: Quartile Deviation = $\frac{Q_3 - Q_1}{2}$, where $Q_3$ is the third quartile and $Q_1$ is the first quartile.
Usefulness: Good for skewed distributions or data with extreme values.

Mean Deviation: This measure calculates the average distance of each data point from a central value (usually the mean or median). It considers every value in the dataset.
Formula: Mean Deviation = $\frac{\sum\limits |x_i - \text{Mean}|}{n}$
Usefulness: More comprehensive than the range as it uses all data points.

Variance ($\sigma^2$): One of the most important measures of dispersion. It is the average of the squared distances of each data point from the mean. Squaring the differences ensures that no term is negative and gives greater weight to points that are further from the mean.
Formula: Variance ($\sigma^2$) = $\frac{\sum\limits (x_i - \text{Mean})^2}{n}$
Usefulness: Crucial for many advanced statistical theories and tests. Its units are the square of the original data's units (e.g., cm²), making it hard to interpret directly.

Standard Deviation ($\sigma$): This is the most widely used and important measure of dispersion. It is simply the positive square root of the variance.
Formula: Standard Deviation ($\sigma$) = $\sqrt{\text{Variance}} = \sqrt{\frac{\sum\limits (x_i - \text{Mean})^2}{n}}$
Usefulness: By taking the square root, the standard deviation is expressed in the same units as the original data, making it much more interpretable than the variance.

In this chapter, we shall study the Range, Mean Deviation about the Mean, Mean Deviation about the Median, Variance, and Standard Deviation.
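The measures studied in this chapter can be tried out on a small dataset. The sketch below uses Vijay's test scores from the case study and Python's standard `statistics` module. Note that it uses `pvariance` and `pstdev`, which divide by $n$ as in the formulas on this page; the similarly named `variance` and `stdev` functions divide by $n - 1$ and would give slightly larger values.

```python
from statistics import mean, pstdev, pvariance

scores = [100, 40, 90, 50, 100, 40, 80, 60, 70]  # Vijay's test scores

data_range = max(scores) - min(scores)
x_bar = mean(scores)
# Mean deviation about the mean: average absolute distance from x_bar.
mean_deviation = sum(abs(x - x_bar) for x in scores) / len(scores)
variance = pvariance(scores)   # divides by n, matching the formula above
std_dev = pstdev(scores)       # positive square root of the variance

print(data_range, mean_deviation)             # 60 20.0
print(round(variance, 2), round(std_dev, 2))  # 511.11 22.61
```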

Relative Measures of Dispersion

A relative measure of dispersion is a unit-free number, often expressed as a ratio or a percentage. It is designed to compare the variability of two or more datasets, especially when their units or average values are different.

For example, is a variation of 5 cm in the height of men more or less significant than a variation of 5 kg in their weight? A relative measure helps answer such questions by standardizing the dispersion.

Common Relative Measures:

Coefficient of Range: The range expressed as a fraction of the sum of the maximum and minimum values.
Formula: Coefficient of Range = $\frac{\text{Max} - \text{Min}}{\text{Max} + \text{Min}}$

Coefficient of Quartile Deviation: The quartile deviation expressed as a fraction of the average of the quartiles.
Formula: Coefficient of QD = $\frac{Q_3 - Q_1}{Q_3 + Q_1}$

Coefficient of Variation (CV): This is the most common and important relative measure of dispersion. It expresses the standard deviation as a percentage of the mean.
Formula: CV = $\frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$

A dataset with a higher CV is considered to have greater relative variability or be less consistent than a dataset with a lower CV. It is the standard tool for comparing consistency across different groups.
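To illustrate the comparison, here is a minimal sketch that computes the CV for the two students from the case study. The helper name `coefficient_of_variation` is an illustrative choice, and the population standard deviation (`pstdev`, dividing by $n$) is assumed, in line with the formulas on this page.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """Return the CV as a percentage: (population SD / mean) * 100."""
    return pstdev(data) / mean(data) * 100

arjun = [70, 72, 70, 68, 70, 72, 70, 68, 70]
vijay = [100, 40, 90, 50, 100, 40, 80, 60, 70]

cv_arjun = coefficient_of_variation(arjun)
cv_vijay = coefficient_of_variation(vijay)
print(round(cv_arjun, 2), round(cv_vijay, 2))
```

Arjun's CV comes out at roughly 1.9%, against roughly 32% for Vijay, which matches the earlier observation that Arjun is the more consistent student.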

In the following sections, we will focus on the calculation and application of the most important measures: Range, Mean Deviation, Variance, and Standard Deviation.

Range

The Range is the most straightforward and intuitive measure of dispersion. It provides a quick snapshot of the total spread of a dataset by focusing only on its most extreme values.

Definition of Range

The Range is simply the difference between the highest value (maximum) and the lowest value (minimum) in a dataset.

If $X_{\text{max}}$ is the maximum value and $X_{\text{min}}$ is the minimum value, then the formula is:

Range $= X_{\text{max}} - X_{\text{min}}$

The range is an absolute measure of dispersion, meaning its units are the same as the units of the data itself (e.g., if the data is in kilograms, the range is in kilograms).

Calculation of Range

For Ungrouped Data: The process is simple. First, scan the data to find the largest and smallest numbers. Then, subtract the smallest from the largest.
For Grouped Data: For data presented in a frequency distribution with class intervals, the range is calculated as the difference between the upper boundary of the highest class and the lower boundary of the lowest class.

Coefficient of Range

To compare the spread of two datasets with very different scales (e.g., salaries in thousands vs. pocket money in hundreds), we use the Coefficient of Range. This is a relative measure that expresses the range as a fraction of the sum of the extreme values, making it a unit-free number.

Coefficient of Range $= \frac{X_{\text{max}} - X_{\text{min}}}{X_{\text{max}} + X_{\text{min}}}$

Advantages and Disadvantages of Range

Advantages (Merits):

Easy to Calculate: It is the simplest measure of dispersion to compute.
Easy to Understand: Its meaning is very clear and intuitive.
Quick: It provides a very fast, though rough, idea of the data's spread.

Disadvantages (Demerits):

Affected by Outliers: The range's biggest weakness is its extreme sensitivity to outliers. A single unusually high or low value can dramatically inflate the range, giving a misleading impression of the overall variability.
Ignores Most Data: It is calculated using only two data points (the maximum and minimum) and completely ignores the distribution and clustering of all the data points in between.
Not Suitable for Further Analysis: Because it is not based on all observations, it is generally not used in more advanced statistical calculations.
Cannot be used for Open-Ended Classes: If a dataset has an open-ended class (e.g., "over 100"), the maximum value is unknown, and the range cannot be calculated.

Due to these significant limitations, the range is typically used for a quick preliminary look at the data or in specific applications like statistical quality control, but it is not considered a robust measure of dispersion.

Example 1. Find the range and the coefficient of range for the following dataset of daily temperatures (°C): 15, 25, 18, 32, 40, 28, 12, 35.

Answer:

Given:

The data is: 15, 25, 18, 32, 40, 28, 12, 35.

Solution:

Step 1: Identify the maximum and minimum values.

By inspecting the data, we find:

Maximum Value ($X_{\text{max}}$) = 40 °C

Minimum Value ($X_{\text{min}}$) = 12 °C

Step 2: Calculate the Range.

Using the formula Range $= X_{\text{max}} - X_{\text{min}}$:

Range = 40 – 12 = 28 °C

Step 3: Calculate the Coefficient of Range.

Using the formula Coefficient of Range $= \frac{X_{\text{max}} - X_{\text{min}}}{X_{\text{max}} + X_{\text{min}}}$:

Coefficient of Range = $\frac{40 - 12}{40 + 12} = \frac{28}{52}$

Simplifying the fraction by dividing the numerator and denominator by 4:

Coefficient of Range = $\frac{7}{13}$

(As a decimal, this is approximately 0.538).

The final answer is: Range = 28 °C, Coefficient of Range = $\frac{7}{13}$.
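The same calculation can be sketched in Python. The variable names are illustrative, and `fractions.Fraction` is used only so that the coefficient is kept as an exact, automatically reduced fraction.

```python
from fractions import Fraction

temperatures = [15, 25, 18, 32, 40, 28, 12, 35]  # daily temperatures in °C

x_max, x_min = max(temperatures), min(temperatures)
data_range = x_max - x_min
# Fraction reduces 28/52 to lowest terms.
coefficient_of_range = Fraction(x_max - x_min, x_max + x_min)

print(data_range)                             # 28
print(coefficient_of_range)                   # 7/13
print(round(float(coefficient_of_range), 3))  # 0.538
```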

Mean Deviation

Dispersion measures the extent to which the values in a distribution are spread out or scattered from the average. A simple measure is the Range, which is the difference between the maximum and minimum values. However, the range is a crude measure as it only depends on two extreme values and ignores the distribution of the rest of the observations.

To overcome this limitation, we use measures that involve all the data points. One such measure is the Mean Deviation. It provides a more robust understanding of the spread by calculating the average distance of each observation from a central value.

Definition and Concept of Mean Deviation

The Mean Deviation (MD) is defined as the arithmetic mean of the absolute deviations of the observations from a suitable measure of central tendency. This central value can be the mean, median, or mode, but it is most commonly calculated with respect to the mean or the median.

Why Absolute Deviations?

A deviation is the difference between an observation and the central value (e.g., $x_i - \overline{x}$). Some of these deviations will be positive (for values greater than the mean), and some will be negative (for values less than the mean). A key property of the arithmetic mean is that the sum of these deviations is always zero, i.e., $\sum (x_i - \overline{x}) = 0$.

For example, for data {2, 4, 9}, the mean is $\overline{x} = 5$. The deviations are $(2-5) = -3$, $(4-5) = -1$, and $(9-5) = 4$. The sum is $-3 - 1 + 4 = 0$.

Because the sum is always zero, the average deviation would also be zero, which is not a useful measure of spread. To solve this, we take the absolute value of each deviation, i.e., $|x_i - \overline{x}|$. This makes all deviations positive, and their average gives a meaningful value representing the average distance of the data points from the center.
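A short numerical check of this argument, using the data {2, 4, 9} from the example above (a sketch with illustrative variable names):

```python
data = [2, 4, 9]
x_bar = sum(data) / len(data)            # 5.0

deviations = [x - x_bar for x in data]   # [-3.0, -1.0, 4.0]
signed_sum = sum(deviations)             # positives and negatives cancel
mean_abs_deviation = sum(abs(d) for d in deviations) / len(data)

print(signed_sum)           # 0.0
print(mean_abs_deviation)   # (3 + 1 + 4) / 3, about 2.67
```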

Mean Deviation for Ungrouped Data

Ungrouped data refers to data that is given as individual data points.

1. Mean Deviation about the Mean ($\text{MD}_{\overline{x}}$)

This measures the average absolute distance of each data point from the arithmetic mean of the dataset.

Formula and Derivation

Let the given data consist of $n$ observations $x_1, x_2, ..., x_n$ (values may repeat).

Step 1: Calculate the mean of the data.

$\overline{x} = \frac{\sum\limits_{i=1}^{n} x_i}{n}$

Step 2: Find the deviation of each observation $x_i$ from the mean $\overline{x}$, which is $(x_i - \overline{x})$.

Step 3: Find the absolute value of these deviations, which is $|x_i - \overline{x}|$.

Step 4: Find the arithmetic mean of these absolute deviations. This is the Mean Deviation about the Mean.

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{n} |x_i - \overline{x}|}{n}$

Example 1. Find the mean deviation about the mean for the data: 6, 7, 10, 12, 13, 4, 8, 12.

Answer:

Given:

Data observations: $x_i$ = 6, 7, 10, 12, 13, 4, 8, 12.

Number of observations, $n=8$.

To Find:

Mean Deviation about the Mean ($\text{MD}_{\overline{x}}$).

Solution:

Step 1: Calculate the mean ($\overline{x}$).

Sum of observations = $6 + 7 + 10 + 12 + 13 + 4 + 8 + 12 = 72$.

Mean, $\overline{x} = \frac{\sum x_i}{n} = \frac{72}{8} = 9$.

Step 2: Calculate the absolute deviations from the mean, $|x_i - 9|$.

We create a table for clarity:

$x_i$	$|x_i - \overline{x}| = |x_i - 9|$
4	$|4 - 9| = 5$
6	$|6 - 9| = 3$
7	$|7 - 9| = 2$
8	$|8 - 9| = 1$
10	$|10 - 9| = 1$
12	$|12 - 9| = 3$
12	$|12 - 9| = 3$
13	$|13 - 9| = 4$
Total	$\sum\limits_{i=1}^{8} |x_i - \overline{x}| = 22$

Step 3: Calculate the mean deviation about the mean.

Using the formula for the mean deviation about the mean:

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{8} |x_i - \overline{x}|}{n} = \frac{22}{8} = 2.75$.

Thus, the mean deviation about the mean is 2.75.
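The four steps can be sketched as a small Python function. The name `mean_deviation_about_mean` is an illustrative choice, not a library function.

```python
def mean_deviation_about_mean(data):
    """Average absolute distance of the observations from their mean."""
    n = len(data)
    x_bar = sum(data) / n                         # Step 1: the mean
    abs_devs = [abs(x - x_bar) for x in data]     # Steps 2 and 3
    return sum(abs_devs) / n                      # Step 4

data = [6, 7, 10, 12, 13, 4, 8, 12]
md_mean = mean_deviation_about_mean(data)
print(md_mean)  # 2.75
```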

2. Mean Deviation about the Median ($\text{MD}_M$)

This measures the average absolute distance of each data point from the median of the dataset. An important property is that the mean deviation is minimum when calculated from the median.

Formula and Derivation

Let the given data consist of $n$ observations $x_1, x_2, ..., x_n$ (values may repeat).

Step 1: Arrange the data in ascending order.

Step 2: Calculate the median ($M$) of the data.

$M = \begin{cases} \left(\frac{n+1}{2}\right)^{th} \text{observation} & , & \text{if } n \text{ is odd} \\ \frac{\left(\frac{n}{2}\right)^{th} \text{obs} + \left(\frac{n}{2}+1\right)^{th} \text{obs}}{2} & , & \text{if } n \text{ is even} \end{cases}$

Step 3: Find the absolute value of the deviations from the median, which is $|x_i - M|$.

Step 4: Find the arithmetic mean of these absolute deviations.

$\text{MD}_M = \frac{\sum\limits_{i=1}^{n} |x_i - M|}{n}$

Example 2. Find the mean deviation about the median for the data: 3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21.

Answer:

Given:

Data observations: $x_i$ = 3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21.

Number of observations, $n=11$.

To Find:

Mean Deviation about the Median ($\text{MD}_M$).

Solution:

Step 1: Arrange the data in ascending order.

3, 3, 4, 5, 7, 9, 10, 12, 18, 19, 21.

Step 2: Calculate the median ($M$).

Since $n = 11$ (odd), the median is the $\left(\frac{11+1}{2}\right)^{th}$ term, which is the 6th term.

Median, $M = 9$.

Step 3: Calculate the absolute deviations from the median, $|x_i - 9|$.

$x_i$	$|x_i - M| = |x_i - 9|$
3	$|3 - 9| = 6$
3	$|3 - 9| = 6$
4	$|4 - 9| = 5$
5	$|5 - 9| = 4$
7	$|7 - 9| = 2$
9	$|9 - 9| = 0$
10	$|10 - 9| = 1$
12	$|12 - 9| = 3$
18	$|18 - 9| = 9$
19	$|19 - 9| = 10$
21	$|21 - 9| = 12$
Total	$\sum\limits |x_i - M| = 58$

Step 4: Calculate the mean deviation about the median.

Using the formula for the mean deviation about the median:

$\text{MD}_{M} = \frac{\sum\limits_{i=1}^{11} |x_i - M|}{n} = \frac{58}{11} \approx 5.27$.

Thus, the mean deviation about the median is approximately 5.27.
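A minimal Python sketch of the same procedure follows. It relies on `statistics.median`, which sorts the data and applies the odd/even rule given above; the function name `mean_deviation_about_median` is illustrative.

```python
from statistics import median

def mean_deviation_about_median(data):
    """Average absolute distance of the observations from their median."""
    m = median(data)   # handles sorting and the odd/even cases
    return sum(abs(x - m) for x in data) / len(data)

data = [3, 9, 5, 3, 12, 10, 18, 4, 7, 19, 21]
md_median = mean_deviation_about_median(data)
print(median(data), round(md_median, 2))  # 9 5.27
```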

Mean Deviation for Grouped Data

Grouped data is data that has been organized into a frequency distribution.

1. Discrete Frequency Distribution

In this format, each observation $x_i$ has a corresponding frequency $f_i$.

(a) Mean Deviation about the Mean

The formula is an extension of the ungrouped data formula, where each absolute deviation is weighted by its frequency.

$\text{MD}_{\overline{x}} = \frac{\sum\limits_{i=1}^{k} f_i |x_i - \overline{x}|}{\sum\limits_{i=1}^{k} f_i} = \frac{\sum f_i |x_i - \overline{x}|}{N}$

where $k$ is the number of distinct observations, $N = \sum f_i$ is the total frequency, and the mean is $\overline{x} = \frac{\sum f_i x_i}{N}$.

Example 3. Find the mean deviation about the mean for the following data:

$x_i$	2	5	6	8	10	12
$f_i$	2	8	10	7	8	5

Answer:

Solution:

We first need to calculate the mean $\overline{x}$. We can do this in a tabular format, which also helps in calculating the mean deviation.

$x_i$	$f_i$	$f_i x_i$	$|x_i - \overline{x}| = |x_i - 7.5|$	$f_i |x_i - 7.5|$
2	2	4	$|2-7.5|=5.5$	$2 \times 5.5 = 11.0$
5	8	40	$|5-7.5|=2.5$	$8 \times 2.5 = 20.0$
6	10	60	$|6-7.5|=1.5$	$10 \times 1.5 = 15.0$
8	7	56	$|8-7.5|=0.5$	$7 \times 0.5 = 3.5$
10	8	80	$|10-7.5|=2.5$	$8 \times 2.5 = 20.0$
12	5	60	$|12-7.5|=4.5$	$5 \times 4.5 = 22.5$
Total	$N=40$	$\sum f_i x_i = 300$		$\sum f_i|x_i - \overline{x}|=92.0$

Step 1: Calculate the mean.

$\overline{x} = \frac{\sum f_i x_i}{N} = \frac{300}{40} = 7.5$.

Step 2: Calculate $\sum f_i|x_i - \overline{x}|$.

From the table, this sum is 92.0.

Step 3: Calculate the mean deviation.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N} = \frac{92.0}{40} = 2.3$.

The mean deviation about the mean is 2.3.
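For a discrete frequency distribution, the weighted sums translate directly into code. This is a sketch with illustrative names (`values`, `freqs`), using the data of Example 3.

```python
values = [2, 5, 6, 8, 10, 12]
freqs = [2, 8, 10, 7, 8, 5]

N = sum(freqs)                                           # total frequency
x_bar = sum(f * x for x, f in zip(values, freqs)) / N    # weighted mean
# Each absolute deviation is weighted by its frequency.
weighted_abs_dev = sum(f * abs(x - x_bar) for x, f in zip(values, freqs))
md_mean = weighted_abs_dev / N

print(N, x_bar, weighted_abs_dev, md_mean)  # 40 7.5 92.0 2.3
```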

(b) Mean Deviation about the Median

The formula is similar, using the median as the central value.

$\text{MD}_M = \frac{\sum\limits_{i=1}^{k} f_i |x_i - M|}{N}$

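To find the median for discrete data, we first find the cumulative frequency (c.f.) and use it to locate the middle observation(s). If $N$ is odd, the median is the $\left(\frac{N+1}{2}\right)^{th}$ observation. If $N$ is even, it is the average of the $\left(\frac{N}{2}\right)^{th}$ and $\left(\frac{N}{2}+1\right)^{th}$ observations. In each case, the required observation is the value whose cumulative frequency is the first to reach or exceed its position.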

Example 4. Find the mean deviation about the median for the data in Example 3.

Answer:

Solution:

We first need to find the median. For this, we calculate the cumulative frequency (c.f.).

$x_i$	$f_i$	c.f.
2	2	2
5	8	10
6	10	20
8	7	27
10	8	35
12	5	40
Total	$N=40$	

Step 1: Find the median.

Here, $N=40$, so $\frac{N}{2} = \frac{40}{2} = 20$.

The cumulative frequency reaches exactly 20 at $x_i = 6$, so the 20th observation is 6. The next cumulative frequency is 27, which means the 21st to 27th observations are all equal to 8. Since $N$ is even, the median is the average of the 20th and 21st observations.

The 20th observation is 6.

The 21st observation is 8.

Median, $M = \frac{6+8}{2} = 7$.

Step 2: Calculate $\sum f_i|x_i - M|$ using $M = 7$.

$x_i$	$f_i$	$|x_i - M| = |x_i - 7|$	$f_i |x_i - 7|$
2	2	5	10
5	8	2	16
6	10	1	10
8	7	1	7
10	8	3	24
12	5	5	25
Total	$N=40$		$\sum f_i|x_i - M|=92$

From the table, the sum is 92.

Step 3: Calculate the mean deviation.

$\text{MD}_{M} = \frac{\sum f_i |x_i - M|}{N} = \frac{92}{40} = 2.3$.

In this particular case, the mean deviation about the mean and median are the same. This is not always true.
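The median step can be checked by expanding the frequency table into its 40 individual observations. The sketch below (illustrative names, standard library only) also shows why the total of 92 does not depend on which value between the two middle observations is taken as the center.

```python
from statistics import median

values = [2, 5, 6, 8, 10, 12]
freqs = [2, 8, 10, 7, 8, 5]

# Expand the table into the 40 individual observations (already sorted).
observations = [x for x, f in zip(values, freqs) for _ in range(f)]
M = median(observations)   # average of the 20th and 21st observations

def total_abs_deviation(center):
    """Frequency-weighted sum of absolute deviations from `center`."""
    return sum(f * abs(x - center) for x, f in zip(values, freqs))

md_median = total_abs_deviation(M) / len(observations)
print(M, md_median)                                   # 7.0 2.3
# Any center between the two middle observations gives the same total.
print([total_abs_deviation(c) for c in (6, 7, 8)])    # [92, 92, 92]
```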

2. Continuous Frequency Distribution

In this format, data is given in class intervals. We use the mid-point (or class mark) of each interval as the representative value $x_i$ for that class.

Mid-point $x_i = \frac{\text{Lower limit} + \text{Upper limit}}{2}$.

(a) Mean Deviation about the Mean

The formula is the same as for the discrete distribution, but $x_i$ are now the mid-points of the classes.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N}$

where $\overline{x} = \frac{\sum f_i x_i}{N}$.

Example 5. Calculate the mean deviation about the mean for the following data:

Marks obtained	Number of students
10 - 20	2
20 - 30	3
30 - 40	8
40 - 50	14
50 - 60	8
60 - 70	3
70 - 80	2

Answer:

Solution:

We construct a table to calculate the necessary values.

Marks	$f_i$	Mid-point ($x_i$)	$f_i x_i$	$|x_i - \overline{x}| = |x_i - 45|$	$f_i |x_i - 45|$
10-20	2	15	30	30	60
20-30	3	25	75	20	60
30-40	8	35	280	10	80
40-50	14	45	630	0	0
50-60	8	55	440	10	80
60-70	3	65	195	20	60
70-80	2	75	150	30	60
Total	$N=40$		$\sum f_i x_i = 1800$		$\sum f_i|x_i - \overline{x}| = 400$

Step 1: Calculate the mean.

$\overline{x} = \frac{\sum f_i x_i}{N} = \frac{1800}{40} = 45$.

Step 2: Calculate $\sum f_i|x_i - \overline{x}|$.

From the table, this sum is 400.

Step 3: Calculate the mean deviation.

$\text{MD}_{\overline{x}} = \frac{\sum f_i |x_i - \overline{x}|}{N} = \frac{400}{40} = 10$.

The mean deviation about the mean is 10.
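For class intervals, the only extra step is replacing each class by its mid-point. A minimal sketch for the data of Example 5, with classes stored as hypothetical `(lower, upper)` pairs:

```python
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [2, 3, 8, 14, 8, 3, 2]

# The mid-point (class mark) represents every observation in its class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
N = sum(freqs)
x_bar = sum(f * x for x, f in zip(midpoints, freqs)) / N
md_mean = sum(f * abs(x - x_bar) for x, f in zip(midpoints, freqs)) / N

print(x_bar, md_mean)  # 45.0 10.0
```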

(b) Mean Deviation about the Median

For a continuous distribution, we first find the median class and then calculate the median using a formula.

Step 1: Find the Median Class. It is the class interval whose cumulative frequency is just greater than or equal to $\frac{N}{2}$.

Step 2: Calculate Median ($M$).

$M = l + \frac{\frac{N}{2} - C}{f} \times h$

where,

$l$ = lower limit of the median class.
$N$ = sum of frequencies.
$C$ = cumulative frequency of the class preceding the median class.
$f$ = frequency of the median class.
$h$ = class size.

Step 3: Calculate Mean Deviation about the Median.

$\text{MD}_M = \frac{\sum f_i |x_i - M|}{N}$

Example 6. Calculate the mean deviation about the median for the data in Example 5.

Answer:

Solution:

First, we find the median by constructing a table with cumulative frequencies.

Marks	$f_i$	c.f.	Mid-point ($x_i$)	$|x_i - M| = |x_i - 45|$	$f_i |x_i - 45|$
10-20	2	2	15	30	60
20-30	3	5	25	20	60
30-40	8	13	35	10	80
40-50	14	27	45	0	0
50-60	8	35	55	10	80
60-70	3	38	65	20	60
70-80	2	40	75	30	60
Total	$N=40$				$\sum f_i|x_i - M| = 400$

Step 1: Find the median class.

$N=40$, so $\frac{N}{2} = 20$.

The cumulative frequency just greater than 20 is 27. The corresponding class is 40-50. So, the Median Class is 40-50.

Step 2: Calculate the median.

$l = 40$, $N=40$, $C = 13$, $f = 14$, $h = 10$.

$M = 40 + \frac{20 - 13}{14} \times 10 = 40 + \frac{7}{14} \times 10 = 40 + \frac{1}{2} \times 10 = 40 + 5 = 45$.

Median $M = 45$.

Step 3: Calculate the mean deviation about the median.

Since the median (45) is the same as the mean in this case, the calculations for $|x_i - M|$ and $f_i |x_i - M|$ will be identical to the mean deviation calculation.

From the table, $\sum f_i|x_i - M|=400$.

$\text{MD}_{M} = \frac{\sum f_i |x_i - M|}{N} = \frac{400}{40} = 10$.

The mean deviation about the median is 10.
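The median formula $M = l + \frac{\frac{N}{2} - C}{f} \times h$ can be sketched as a small function. The name `grouped_median` and the `(lower, upper)` representation of classes are assumptions made for this illustration; the classes are taken to be continuous and listed in ascending order.

```python
def grouped_median(classes, freqs):
    """Median of a continuous frequency distribution via l + ((N/2 - C) / f) * h."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # C: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= total / 2:
            # This is the median class: l = lower, h = upper - lower.
            return lower + (total / 2 - cumulative) / f * (upper - lower)
        cumulative += f

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [2, 3, 8, 14, 8, 3, 2]

M = grouped_median(classes, freqs)
midpoints = [(lower + upper) / 2 for lower, upper in classes]
md_median = sum(f * abs(x - M) for x, f in zip(midpoints, freqs)) / sum(freqs)
print(M, md_median)  # 45.0 10.0
```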

Merits and Demerits of Mean Deviation

Merits (Advantages):

Based on All Observations: Unlike range, it takes into account every single data point. This makes it a much more comprehensive and representative measure of dispersion.
Simple and Intuitive: The concept of an "average distance from the center" is straightforward to understand and explain, even to a non-technical audience.
Less Affected by Extreme Values: Compared to standard deviation (which squares the deviations), the mean deviation gives less weight to extreme observations (outliers), making it a more robust measure in their presence.

Demerits (Disadvantages):

Ignores Algebraic Signs: The use of absolute values ($|...|$) to make deviations positive is a mathematical inconvenience. The absolute value function is difficult to handle algebraically in further statistical theory (e.g., in inference or regression).
Not Mathematically Tractable: Because of the absolute value issue, it's not used in more advanced statistical analysis. The Standard Deviation, which overcomes this by squaring deviations, is mathematically more manageable and has better properties, making it the preferred measure of spread in higher statistics.
Value can change depending on the central tendency used: The value of mean deviation about the mean can be different from the mean deviation about the median.

Variance and Standard Deviation

In the study of dispersion, Mean Deviation provides a good intuitive measure of spread by averaging the absolute distances from a central point. However, the use of the absolute value function ($|...|$) makes it mathematically inconvenient for more advanced statistical analysis and inference. To overcome this algebraic limitation, statisticians developed Variance and Standard Deviation. These are the most crucial and widely used measures of dispersion in statistics because they are based on squared deviations, which have more desirable mathematical properties.

Variance ($\sigma^2$)

The Variance is defined as the arithmetic mean of the squared deviations of the observations from their arithmetic mean. It quantifies the degree of spread in a set of data points.

The process involves:

Calculating the mean ($\overline{x}$) of the data.
Finding the deviation of each observation from the mean ($x_i - \overline{x}$).
Squaring each deviation ($(x_i - \overline{x})^2$).
Finding the average of these squared deviations.

Squaring the deviations accomplishes two important things:

Eliminates Negatives: It ensures that all the terms to be averaged are positive, so they don't cancel each other out (solving the same problem that absolute values solved for mean deviation).
Emphasizes Larger Deviations: By squaring, it gives more weight to values that are further away from the mean. A point that is 4 units away contributes $4^2=16$ to the sum, while a point 2 units away only contributes $2^2=4$. This makes variance highly sensitive to outliers.

Variance is denoted by the Greek letter sigma squared ($\sigma^2$). A significant drawback of variance is that its units are the square of the original data units (e.g., if the data is in centimetres, the variance is in square centimetres). This makes it difficult to interpret in the context of the original data.

Standard Deviation ($\sigma$)

The Standard Deviation is the measure of dispersion that resolves the unit-interpretation problem of variance. It is defined as the positive square root of the variance.

By taking the square root, the units of the standard deviation become the same as the units of the original data, making it directly comparable and interpretable. The Standard Deviation is the most common and important measure of dispersion.

It represents a "typical" or "standard" amount of deviation (distance) of a data point from the mean. It is denoted by $\sigma$ (sigma).

A small standard deviation indicates that the data points tend to be very close to the mean. The dataset shows low variability and high consistency.
A large standard deviation indicates that the data points are spread out over a wide range of values. The dataset shows high variability and low consistency.

Variance and Standard Deviation for Ungrouped Data

1. Definitional Formulas

For a set of $n$ observations $x_1, x_2, ..., x_n$ with mean $\overline{x}$:

The Variance is given by:

$\sigma^2 = \frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}$

The Standard Deviation is given by:

$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{n} (x_i - \overline{x})^2}{n}}$
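
As a minimal sketch, the two definitional formulas translate almost word for word into Python. The function names and the sample data are assumptions chosen for illustration.

```python
import math

def population_variance(values):
    """sigma^2 = sum((x_i - mean)^2) / n, dividing by n as in the formula above."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def population_std(values):
    """sigma = positive square root of the variance."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations, mean 5
print(population_variance(data))  # 4.0
print(population_std(data))       # 2.0
```

Here the squared deviations are 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32, so the variance is $32/8 = 4$ and the standard deviation is 2.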

2. Shortcut (Computational) Formula and its Derivation

Calculating $(x_i - \overline{x})$ for every data point can be tedious, especially if the mean $\overline{x}$ is a decimal. A computationally simpler formula exists.

Derivation:

We start with the definitional formula for variance:

$\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^{n} (x_i - \overline{x})^2$

Expanding the squared term:

$\sigma^2 = \frac{1}{n}\sum\limits_{i=1}^{n} (x_i^2 - 2x_i\overline{x} + \overline{x}^2)$

Distributing the summation:

$\sigma^2 = \frac{1}{n} \left[ \sum\limits_{i=1}^{n} x_i^2 - \sum\limits_{i=1}^{n} 2x_i\overline{x} + \sum\limits_{i=1}^{n} \overline{x}^2 \right]$

Since $2\overline{x}$ and $\overline{x}^2$ are constants with respect to the summation:

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2\overline{x}\sum x_i + n\overline{x}^2 \right]$

We know that the mean $\overline{x} = \frac{\sum x_i}{n}$, which implies $\sum x_i = n\overline{x}$. Substituting this:

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2\overline{x}(n\overline{x}) + n\overline{x}^2 \right]$

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - 2n\overline{x}^2 + n\overline{x}^2 \right]$

$\sigma^2 = \frac{1}{n} \left[ \sum x_i^2 - n\overline{x}^2 \right]$

Distributing the $\frac{1}{n}$ term gives the shortcut formula:

$\sigma^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2 = \frac{\sum x_i^2}{n} - (\overline{x})^2$

This formula, "the mean of the squares minus the square of the mean," is often much faster for calculations.

Since Standard Deviation is the positive square root of Variance, we can derive its shortcut formula directly from the above equation:

$\sigma = \sqrt{\frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2}$

Alternatively, it can be written as:

$\sigma = \frac{1}{n}\sqrt{n\sum x_i^2 - (\sum x_i)^2}$
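
The identity derived above can be checked numerically. The sketch below uses Python's `fractions.Fraction` so that both formulas are evaluated exactly; the function names and the data (chosen so that the mean is a decimal) are illustrative assumptions.

```python
from fractions import Fraction

def variance_definitional(values):
    """Average of squared deviations from the mean, computed exactly."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / n

def variance_shortcut(values):
    """Mean of the squares minus the square of the mean, computed exactly."""
    n = len(values)
    mean_of_squares = Fraction(sum(x * x for x in values), n)
    square_of_mean = Fraction(sum(values), n) ** 2
    return mean_of_squares - square_of_mean

data = [3, 7, 8, 13]  # hypothetical integers with mean 7.75
print(variance_definitional(data), variance_shortcut(data))
print(float(variance_shortcut(data)))
```

Both functions return $\frac{203}{16} = 12.6875$ for this data. Exact fractions are used only to make the comparison clean; with ordinary floating-point numbers the shortcut form can lose some precision when the mean is very large compared with the spread.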

Example 1. Find the variance and standard deviation for the data: 6, 8, 10, 12, 14.

Answer:

Method 1: Using the Definitional Formula

Step 1: Calculate the mean ($\overline{x}$).

$\sum x_i = 6 + 8 + 10 + 12 + 14 = 50$

$\overline{x} = \frac{50}{5} = 10$

Step 2: Calculate the sum of squared deviations.

$x_i$	$(x_i - \overline{x})$	$(x_i - \overline{x})^2$
6	6 - 10 = -4	16
8	8 - 10 = -2	4
10	10 - 10 = 0	0
12	12 - 10 = 2	4
14	14 - 10 = 4	16
Total		$\sum (x_i - \overline{x})^2 = 40$

Step 3: Calculate the Variance ($\sigma^2$).

$\sigma^2 = \frac{\sum (x_i - \overline{x})^2}{n} = \frac{40}{5} = 8$

Method 2: Using the Shortcut Formula

Step 1: Calculate $\sum x_i$ and $\sum x_i^2$.

$\sum x_i = 50$, so $\overline{x} = 10$ and $(\overline{x})^2 = 100$.

$\sum x_i^2 = 6^2 + 8^2 + 10^2 + 12^2 + 14^2 = 36 + 64 + 100 + 144 + 196 = 540$.

Step 2: Apply the shortcut formula.

$\sigma^2 = \frac{\sum x_i^2}{n} - (\overline{x})^2 = \frac{540}{5} - (10)^2 = 108 - 100 = 8$.

Conclusion

Both methods give the same variance: $\sigma^2 = 8$.

Final Step: Calculate the Standard Deviation ($\sigma$).

$\sigma = \sqrt{\text{Variance}} = \sqrt{8} = 2\sqrt{2} \approx 2.828$.

The final answer is: Variance = 8, Standard Deviation $\approx 2.828$.
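
Example 1 can be cross-checked with Python's standard `statistics` module. This is only a verification sketch: `pvariance` and `pstdev` divide by $n$, which matches the formulas used in this chapter, whereas `variance` divides by $n - 1$ (a different convention that is not used here).

```python
import statistics

data = [6, 8, 10, 12, 14]

pop_var = statistics.pvariance(data)     # divides by n, as in this chapter
pop_sd = statistics.pstdev(data)         # square root of the population variance
sample_var = statistics.variance(data)   # divides by n - 1, shown only for contrast

print(pop_var, round(pop_sd, 3), sample_var)
```

The population variance agrees with the hand calculation (8), while the $n - 1$ version gives $40/4 = 10$, so it matters which function is chosen when checking textbook answers.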

Variance and Standard Deviation for Grouped Data

1. Discrete Frequency Distribution

When the data is presented as a discrete frequency distribution, where observations $x_1, x_2, \dots, x_k$ occur with corresponding frequencies $f_1, f_2, \dots, f_k$, the measures of dispersion must account for the weight of each frequency.

The Mean ($\overline{x}$) for such a distribution is calculated as:

$\overline{x} = \frac{\sum\limits_{i=1}^{k} f_i x_i}{\sum\limits_{i=1}^{k} f_i}$

Variance ($\sigma^2$)

Variance is the average of the squared deviations from the mean, weighted by their respective frequencies. It is given by the formula:

$\sigma^2 = \frac{\sum\limits_{i=1}^{k} f_i(x_i - \overline{x})^2}{\sum\limits_{i=1}^{k} f_i}$

Standard Deviation ($\sigma$)

Standard Deviation is the positive square root of the variance. It is expressed in the same units as the observations:

$\sigma = \sqrt{\frac{\sum\limits_{i=1}^{k} f_i(x_i - \overline{x})^2}{\sum\limits_{i=1}^{k} f_i}}$

Shortcut (Computational) Formula

To avoid lengthy calculations involving decimals in the mean, we use the simplified computational formula:

$\sigma^2 = \frac{\sum\limits_{i=1}^{k} f_i x_i^2}{\sum\limits_{i=1}^{k} f_i} - \left(\frac{\sum\limits_{i=1}^{k} f_i x_i}{\sum\limits_{i=1}^{k} f_i}\right)^2$

Which can also be written as:

$\sigma^2 = \frac{\sum\limits_{i=1}^{k} f_i x_i^2}{\sum\limits_{i=1}^{k} f_i} - (\overline{x})^2$

The shortcut formula for Standard Deviation ($\sigma$) is obtained by taking the square root of the computational variance. It is expressed as:

$\sigma = \sqrt{\frac{\sum f_i x_i^2}{\sum f_i} - \left(\frac{\sum f_i x_i}{\sum f_i}\right)^2}$

Alternatively, by taking the common denominator out of the square root, the formula can be written as:

$\sigma = \frac{1}{\sum f_i} \sqrt{\left(\sum f_i\right) \left(\sum f_i x_i^2\right) - \left(\sum f_i x_i\right)^2}$
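
A minimal Python sketch of the shortcut formulas for a discrete frequency distribution follows. The function names and the tiny frequency table are assumptions made for illustration.

```python
import math

def grouped_variance(values, freqs):
    """Variance of a discrete frequency distribution, shortcut form."""
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    total = sum(freqs)                                        # sum of f_i
    sum_fx = sum(f * x for x, f in zip(values, freqs))        # sum of f_i * x_i
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))   # sum of f_i * x_i^2
    return sum_fx2 / total - (sum_fx / total) ** 2

def grouped_std_alt(values, freqs):
    """Standard deviation using the form with 1 / sum(f_i) outside the root."""
    total = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return math.sqrt(total * sum_fx2 - sum_fx ** 2) / total

values = [1, 2, 3]   # hypothetical observations
freqs = [1, 2, 1]    # hypothetical frequencies

print(grouped_variance(values, freqs))   # 18/4 - (8/4)^2 = 0.5
print(grouped_std_alt(values, freqs))    # sqrt(4*18 - 64) / 4 = sqrt(8) / 4
```

Both forms describe the same quantity: the square of the second result equals the first, up to floating-point rounding.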

Example 2. Find the variance and standard deviation for the following data:

$x_i$	4	8	11	17	20	24	32
$f_i$	3	5	9	5	4	3	1

Answer:

We will use a table to organize the calculations for the shortcut formula.

$x_i$	$f_i$	$f_i x_i$	$x_i^2$	$f_i x_i^2$
4	3	12	16	48
8	5	40	64	320
11	9	99	121	1089
17	5	85	289	1445
20	4	80	400	1600
24	3	72	576	1728
32	1	32	1024	1024
Total	$\sum f_i=30$	$\sum f_i x_i=420$		$\sum f_i x_i^2=7254$

Step 1: Calculate the mean ($\overline{x}$).

$\overline{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{420}{30} = 14$.

Step 2: Calculate the Variance ($\sigma^2$) using the shortcut formula.

$\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - (\overline{x})^2$

$\sigma^2 = \frac{7254}{30} - (14)^2 = 241.8 - 196 = 45.8$.

Step 3: Calculate the Standard Deviation ($\sigma$).

$\sigma = \sqrt{45.8} \approx 6.77$.

The final answer is: Variance $= 45.8$, Standard Deviation $\approx 6.77$.
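
One way to sanity-check Example 2 is to expand the frequency table back into its 30 raw observations and apply the ungrouped formulas. The sketch below does this with Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

x = [4, 8, 11, 17, 20, 24, 32]
f = [3, 5, 9, 5, 4, 3, 1]

# Repeat each value x_i exactly f_i times to recover the raw observations.
observations = [value for value, count in zip(x, f) for _ in range(count)]

n = len(observations)
mean = statistics.mean(observations)
variance = statistics.pvariance(observations)   # divides by n
std_dev = statistics.pstdev(observations)

print(n, mean, variance, round(std_dev, 2))
```

This reproduces $n = 30$, $\overline{x} = 14$, $\sigma^2 = 45.8$ and $\sigma \approx 6.77$, which illustrates that the frequency-weighted formulas are just a compact way of writing the ungrouped ones.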

2. Continuous Frequency Distribution and Shortcut Methods

For continuous distributions, we use the mid-point of each class interval as $x_i$. The formulas are the same as for discrete distributions. However, when the mid-points ($x_i$) or frequencies ($f_i$) are large, calculations become tedious. We use the Deviation Method or Step-Deviation Method to simplify them.

Derivation of Variance Formulas for Grouped Data

(a) Derivation of Variance Formula using Deviation Method

The Deviation Method involves choosing an 'assumed mean' ($a$) and calculating the deviations ($d_i$) of mid-points from this value. This method effectively shifts the origin of the data set.

The Setup

Let mid-points of class intervals be $x_1, x_2, \dots, x_k$ with corresponding frequencies $f_1, f_2, \dots, f_k$.

Let $a$ be the assumed mean.

Define the deviation $d_i$ as: $d_i = x_i - a$

The actual mean ($\overline{x}$) is related to the assumed mean by the formula:

$\overline{x} = a + \overline{d}$

[Where $\overline{d} = \frac{\sum f_i d_i}{\sum f_i}$]

The Algebraic Derivation

We begin with the fundamental definition of variance for grouped data:

$\sigma^2 = \frac{\sum f_i(x_i - \overline{x})^2}{\sum f_i}$

... (i)

Now, let us express the term $(x_i - \overline{x})$ in terms of $d_i$ and $\overline{d}$:

$x_i - \overline{x} = x_i - (a + \overline{d})$

$x_i - \overline{x} = (x_i - a) - \overline{d}$

(Rearranging terms)

$x_i - \overline{x} = d_i - \overline{d}$

[Since $d_i = x_i - a$] ... (ii)

Substituting Equation (ii) into the variance formula (i):

$\sigma^2 = \frac{\sum f_i(d_i - \overline{d})^2}{\sum f_i}$

Expanding the squared term $(d_i - \overline{d})^2$:

$\sigma^2 = \frac{\sum f_i(d_i^2 - 2d_i\overline{d} + \overline{d}^2)}{\sum f_i}$

Distributing the summation and the frequency ($f_i$):

$\sigma^2 = \frac{\sum f_i d_i^2 - \sum 2f_i d_i \overline{d} + \sum f_i \overline{d}^2}{\sum f_i}$

Since $\overline{d}$ is a constant value for the entire distribution, we can take it out of the summation:

$\sigma^2 = \frac{\sum f_i d_i^2 - 2\overline{d} \sum f_i d_i + \overline{d}^2 \sum f_i}{\sum f_i}$

Now, divide each term in the numerator by the denominator $\sum f_i$:

$\sigma^2 = \frac{\sum f_i d_i^2}{\sum f_i} - 2\overline{d} \left( \frac{\sum f_i d_i}{\sum f_i} \right) + \overline{d}^2 \left( \frac{\sum f_i}{\sum f_i} \right)$

Since $\frac{\sum f_i d_i}{\sum f_i} = \overline{d}$ and $\frac{\sum f_i}{\sum f_i} = 1$, we get:

$\sigma^2 = \frac{\sum f_i d_i^2}{\sum f_i} - 2\overline{d}(\overline{d}) + \overline{d}^2$

$\sigma^2 = \frac{\sum f_i d_i^2}{\sum f_i} - 2\overline{d}^2 + \overline{d}^2$

Combining the last two terms gives $-\overline{d}^2$. Writing $\overline{d}$ in full, this simplifies to the Shortcut Formula for Variance:

$\sigma^2 = \frac{\sum f_i d_i^2}{\sum f_i} - \left( \frac{\sum f_i d_i}{\sum f_i} \right)^2$

By taking the square root of the variance derived above, we obtain the standard deviation using the deviation method:

$\sigma = \sqrt{\frac{\sum f_i d_i^2}{\sum f_i} - \left( \frac{\sum f_i d_i}{\sum f_i} \right)^2}$

Where $d_i = x_i - a$ and $x_i$ is the mid-point of the class interval.
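
Because the deviation method only shifts the origin, the variance it produces should not depend on which assumed mean is chosen. The Python sketch below checks this for a made-up grouped distribution; the function name, mid-points, and frequencies are assumptions for illustration.

```python
def variance_deviation_method(midpoints, freqs, assumed_mean):
    """Variance via d_i = x_i - a, using the shortcut formula derived above."""
    total = sum(freqs)
    d = [x - assumed_mean for x in midpoints]                # d_i = x_i - a
    sum_fd = sum(f * di for f, di in zip(freqs, d))          # sum of f_i * d_i
    sum_fd2 = sum(f * di * di for f, di in zip(freqs, d))    # sum of f_i * d_i^2
    return sum_fd2 / total - (sum_fd / total) ** 2

midpoints = [5, 15, 25, 35]   # hypothetical class mid-points
freqs = [2, 3, 4, 1]          # hypothetical frequencies

results = {a: variance_deviation_method(midpoints, freqs, a) for a in (0, 15, 25)}
print(results)
```

All three choices of $a$ give a variance of 84 for this data (up to floating-point rounding). Choosing $a$ near the middle of the distribution only keeps the intermediate numbers small.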

(b) Derivation of Variance Formula using Step-Deviation Method

The Step-Deviation Method extends the deviation method. It simplifies the calculation of variance and standard deviation further when the class intervals are of equal size $h$, so that the mid-points ($x_i$) form an arithmetic progression. Dividing each deviation $d_i$ by the common factor $h$ (usually the class size) reduces the values to small integers ($u_i$), which makes the computation faster and less prone to manual errors.

The Derivation

We start with the previously derived formula for the deviation method:

$\sigma^2 = \frac{\sum f_i d_i^2}{\sum f_i} - \left(\frac{\sum f_i d_i}{\sum f_i}\right)^2$

... (i)

The step-deviation, $u_i$, is defined as the deviation divided by the class width $h$:

$u_i = \frac{d_i}{h} = \frac{x_i - a}{h}$

[Where $a$ is assumed mean]

From this definition, we can express the deviation $d_i$ in terms of the step-deviation $u_i$:

$d_i = h u_i$

Now, we substitute $d_i = h u_i$ into the variance formula (i):

$\sigma^2 = \frac{\sum f_i (h u_i)^2}{\sum f_i} - \left(\frac{\sum f_i (h u_i)}{\sum f_i}\right)^2$

Since $h$ is a constant for all classes, $h^2$ is also a constant. We can factor these constants out of the summation signs:

$\sigma^2 = \frac{h^2 \sum f_i u_i^2}{\sum f_i} - \left(\frac{h \sum f_i u_i}{\sum f_i}\right)^2$

Applying the square to the term in the parentheses:

$\sigma^2 = \frac{h^2 \sum f_i u_i^2}{\sum f_i} - \frac{h^2 \left(\sum f_i u_i\right)^2}{\left(\sum f_i\right)^2}$

Now, we can factor out the common term $h^2$ from the entire expression to get the Variance formula:

$\sigma^2 = h^2 \left[ \frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2 \right]$

The Standard Deviation ($\sigma$) is the positive square root of this variance:

$\sigma = \sqrt{h^2 \left[ \frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2 \right]}$

Since $h > 0$, we have $\sqrt{h^2} = h$, so taking $h^2$ out of the square root gives the final computational formula for Standard Deviation:

$\sigma = h \sqrt{ \frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2 }$
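
The final formula can be sketched in Python as follows. The function name and the small grouped distribution are assumptions for illustration; the result is compared with the standard deviation computed directly from the mid-points.

```python
import math

def std_step_deviation(midpoints, freqs, assumed_mean, class_width):
    """sigma = h * sqrt(mean of u^2 - (mean of u)^2), with u_i = (x_i - a) / h."""
    total = sum(freqs)
    u = [(x - assumed_mean) / class_width for x in midpoints]
    mean_u = sum(f * ui for f, ui in zip(freqs, u)) / total
    mean_u2 = sum(f * ui * ui for f, ui in zip(freqs, u)) / total
    return class_width * math.sqrt(mean_u2 - mean_u ** 2)

def std_direct(midpoints, freqs):
    """Standard deviation computed from the mid-points without any shift or scaling."""
    total = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / total
    return math.sqrt(sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / total)

midpoints = [5, 15, 25, 35]   # hypothetical mid-points, class width 10
freqs = [2, 3, 4, 1]          # hypothetical frequencies

print(std_step_deviation(midpoints, freqs, assumed_mean=15, class_width=10))
print(std_direct(midpoints, freqs))
```

With $a = 15$ and $h = 10$ the step-deviations are $-1, 0, 1, 2$, giving $\sigma = 10\sqrt{1 - 0.16} = 10\sqrt{0.84} \approx 9.17$, the same value as the direct calculation.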

Example 3. Calculate the standard deviation for the following continuous frequency distribution using:

(a) The Deviation Method (Assumed Mean Method)

(b) The Step-Deviation Method

Class	Frequency ($f_i$)
30 - 40	3
40 - 50	7
50 - 60	12
60 - 70	15
70 - 80	8
80 - 90	3
90 - 100	2

Answer:

To calculate the standard deviation, we first determine the mid-points ($x_i$) for each class interval and choose an assumed mean ($a$). Let $a = 65$ and the class width $h = 10$.

Step 1: Calculation Table

Class	$f_i$	$x_i$	$d_i = x_i - 65$	$f_i d_i$	$f_i d_i^2$	$u_i = \frac{d_i}{10}$	$f_i u_i$	$f_i u_i^2$
30 - 40	3	35	-30	-90	2700	-3	-9	27
40 - 50	7	45	-20	-140	2800	-2	-14	28
50 - 60	12	55	-10	-120	1200	-1	-12	12
60 - 70	15	65	0	0	0	0	0	0
70 - 80	8	75	10	80	800	1	8	8
80 - 90	3	85	20	60	1200	2	6	12
90 - 100	2	95	30	60	1800	3	6	18
Total	$\sum f_i = 50$			$\sum f_i d_i = -150$	$\sum f_i d_i^2 = 10500$		$\sum f_i u_i = -15$	$\sum f_i u_i^2 = 105$

(a) Using Deviation Method

The formula for standard deviation using the deviation method is:

$\sigma = \sqrt{\frac{\sum f_i d_i^2}{\sum f_i} - \left(\frac{\sum f_i d_i}{\sum f_i}\right)^2}$

Substituting the values from the table:

$\sigma = \sqrt{\frac{10500}{50} - \left(\frac{-150}{50}\right)^2}$

$\sigma = \sqrt{210 - (-3)^2}$

$\sigma = \sqrt{210 - 9} = \sqrt{201}$

Therefore, $\sigma = \sqrt{201} \approx 14.177 \approx 14.18$

(b) Using Step-Deviation Method

The formula for standard deviation using the step-deviation method is:

$\sigma = h \times \sqrt{\frac{\sum f_i u_i^2}{\sum f_i} - \left(\frac{\sum f_i u_i}{\sum f_i}\right)^2}$

Substituting the values: $h = 10$, $\sum f_i u_i^2 = 105$, $\sum f_i u_i = -15$, and $\sum f_i = 50$.

$\sigma = 10 \times \sqrt{\frac{105}{50} - \left(\frac{-15}{50}\right)^2}$

$\sigma = 10 \times \sqrt{2.1 - (-0.3)^2}$

$\sigma = 10 \times \sqrt{2.1 - 0.09}$

$\sigma = 10 \times \sqrt{2.01}$

$\sigma \approx 10 \times 1.4177$

[Since $\sqrt{2.01} \approx 1.4177$]

$\sigma \approx 14.177 \approx 14.18$

Conclusion: Both methods yield the same result, but the Step-Deviation Method involved working with much smaller numbers (like 105 instead of 10500), reducing the complexity of the calculation.
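
As an independent check on Example 3, the sketch below applies the plain shortcut formula to the class mid-points, with no assumed mean and no scaling. The variable names are illustrative; the class limits and frequencies are those of the example.

```python
import math

classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [3, 7, 12, 15, 8, 3, 2]

# Mid-point of each class interval.
midpoints = [(low + high) / 2 for low, high in classes]

total = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / total
variance = sum(f * x * x for f, x in zip(freqs, midpoints)) / total - mean ** 2
sigma = math.sqrt(variance)

print(mean, variance, round(sigma, 2))
```

This gives $\overline{x} = 62$, $\sigma^2 = 4045 - 3844 = 201$ and $\sigma \approx 14.18$. The mean also agrees with $\overline{x} = a + \overline{d} = 65 + (-3) = 62$ from the deviation method.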

Coefficient of Variation

Measures of dispersion like Standard Deviation and Variance give us an understanding of the absolute spread or variability within a single dataset. For instance, a standard deviation of $10$ cm tells us how much the heights in a group typically vary. However, what if we want to compare the variability of two different groups? The standard deviation alone can be misleading in such cases, especially if:

1. Different Units of Measurement: Comparing variability between height in cm and weight in kg is impossible using absolute measures because the units are not the same.

2. Different Mean Values: Even if the units are the same (e.g., income in $\textsf{₹}$), comparing a group with a very high mean to a group with a very low mean using Standard Deviation can be unfair. A variation of $\textsf{₹} \ 1,000$ is huge for someone earning $\textsf{₹} \ 5,000$, but negligible for someone earning $\textsf{₹} \ 5,00,000$.

To perform a meaningful comparison, we need a relative measure of dispersion. The most important and widely used relative measure is the Coefficient of Variation (CV).

Definition and Formula of Coefficient of Variation

The Coefficient of Variation (CV) is a standardized, relative measure of dispersion. It expresses the standard deviation as a percentage of the arithmetic mean. In simple terms, it measures the "scatter per unit of the mean," allowing for a fair comparison of variability between different datasets.

The formula is given by:

$\text{CV} = \frac{\sigma}{\overline{x}} \times 100$

Where:

$\sigma$ is the standard deviation of the dataset.

$\overline{x}$ is the arithmetic mean of the dataset (note: $\overline{x} \neq 0$).

A key feature of the CV is that it is a pure number without any units. Because both the standard deviation ($\sigma$) and the mean ($\overline{x}$) have the same units, these units cancel out when we divide them. Multiplying by $100$ simply presents this ratio as an easy-to-interpret percentage.
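
The unit-free nature of the CV can be illustrated with a short Python sketch. The function name and the height data are assumptions for illustration; the same three heights are written once in centimetres and once in metres.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (population standard deviation / mean) * 100, as a percentage."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean * 100

heights_cm = [150, 160, 170]   # hypothetical heights in centimetres
heights_m = [1.5, 1.6, 1.7]    # the same heights in metres

print(round(coefficient_of_variation(heights_cm), 2))
print(round(coefficient_of_variation(heights_m), 2))
```

The standard deviations differ by a factor of 100 (about 8.16 cm versus about 0.0816 m), but the CV is about 5.1% in both cases, because the unit cancels in the ratio $\sigma / \overline{x}$.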

Analysis of Consistency

In statistics, the Coefficient of Variation is the primary tool used to determine Consistency or Stability.

1. Higher CV: The data is more variable, less stable, and less consistent.

2. Lower CV: The data is less variable, more stable, and more consistent.

Illustration

Suppose an investor is comparing the monthly returns of two mutual funds in India over a period of one year. The data is as follows:

Particulars	Mutual Fund A	Mutual Fund B
Average Monthly Return ($\overline{x}$)	$15\%$	$25\%$
Standard Deviation ($\sigma$)	$3\%$	$4\%$

Which fund is more consistent in its performance?

To determine consistency, we must calculate the Coefficient of Variation (CV) for both funds.

Step 1: Calculate CV for Mutual Fund A

$\text{CV}_A = \frac{\sigma_A}{\overline{x}_A} \times 100$

$\text{CV}_A = \frac{3}{15} \times 100 = \frac{1}{5} \times 100 = 20\%$

Step 2: Calculate CV for Mutual Fund B

$\text{CV}_B = \frac{\sigma_B}{\overline{x}_B} \times 100$

$\text{CV}_B = \frac{4}{25} \times 100 = 0.16 \times 100 = 16\%$

Step 3: Interpretation

Although Mutual Fund B has a higher absolute standard deviation ($4\%$ vs $3\%$), its Coefficient of Variation ($16\%$) is lower than that of Mutual Fund A ($20\%$).

Therefore, Mutual Fund B is more consistent and stable relative to its mean return compared to Mutual Fund A.

Interpretation and Application of CV

The Coefficient of Variation (CV) is the primary tool for comparing the consistency, stability, or uniformity of two or more groups. Unlike absolute measures of dispersion, the CV allows us to compare datasets that are otherwise incomparable due to differences in their scale or units.

The interpretation of CV follows a simple inverse logic regarding consistency:

1. Lower CV: A lower value indicates that the data points are more tightly clustered around the mean. This implies greater consistency, higher stability, or less relative variability.

2. Higher CV: A higher value indicates that the data points are more spread out relative to their mean. This implies less consistency, lower stability, or greater relative variability.

Use Case 1: Comparing Data with Different Units

In many scientific and economic studies, researchers need to compare variability across different physical quantities. For example, in a nutritional survey of school children in Delhi, we might want to see if their heights are more uniform than their weights.

Illustration: Height vs Weight

Suppose a health study of 100 students provides the following data:

$\bullet$ For Heights (in cm): Standard Deviation ($\sigma_1$) = $10$ cm, Mean ($\overline{x}_1$) = $160$ cm.

$\bullet$ For Weights (in kg): Standard Deviation ($\sigma_2$) = $8$ kg, Mean ($\overline{x}_2$) = $50$ kg.

We cannot conclude that heights are more variable just because $10 > 8$. The units are different, making a direct comparison of standard deviations mathematically invalid. We must calculate the CV:

$\text{CV}_{\text{height}} = \frac{10}{160} \times 100 = 6.25\%$

$\text{CV}_{\text{weight}} = \frac{8}{50} \times 100 = 16.00\%$

Conclusion: Even though the absolute standard deviation for height is higher, the relative variability in weight ($16\%$) is much greater than that of height ($6.25\%$). Thus, the children are more consistent in their heights than in their weights.

Use Case 2: Comparing Data with Widely Different Means

This is frequently applied in the context of Indian Cricket to evaluate player performance. When two players have significantly different scoring averages, their standard deviations cannot be compared directly.

Illustration: Consistency of Batsmen

Consider two Indian batsmen, Player A and Player B, with the following seasonal statistics:

Player	Mean Score ($\overline{x}$)	Standard Deviation ($\sigma$)
Player A	$80$ runs	$16$ runs
Player B	$40$ runs	$12$ runs

At first glance, Player B might seem more consistent because his standard deviation ($12$) is lower than Player A's ($16$). However, Player A maintains a much higher average. Let's find the CV to determine who is truly more reliable:

For Player A:

$\text{CV}_A = \frac{16}{80} \times 100 = 20\%$

For Player B:

$\text{CV}_B = \frac{12}{40} \times 100 = 30\%$

Result: Player A has a lower CV ($20\%$) compared to Player B ($30\%$). This means Player A is relatively more consistent. Player B’s scoring is more volatile relative to his average.
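
When only summary statistics are available, as in this illustration, the comparison reduces to a few lines. The sketch below is one possible way to write it in Python; the function name and the dictionary layout are assumptions, while the means and standard deviations are those given in the table above.

```python
def cv_from_summary(mean, std_dev):
    """CV as a percentage, computed from a mean and a standard deviation."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev * 100 / mean

# (mean score, standard deviation) for each player, from the table above.
players = {"Player A": (80, 16), "Player B": (40, 12)}

cvs = {name: cv_from_summary(mean, sd) for name, (mean, sd) in players.items()}
most_consistent = min(cvs, key=cvs.get)   # lowest CV means most consistent

print(cvs)
print(most_consistent)
```

The same helper applies to any of the comparisons in this section, for example `cv_from_summary(160, 10)` for the heights and `cv_from_summary(50, 8)` for the weights.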

Summary Comparison Table

Scenario	CV Value	Inference
Investment Returns	Low CV	Low risk, steady growth.
Industrial Production	High CV	Poor quality control, high wastage.
Agricultural Yield	Low CV	Drought-resistant, reliable crop.
Salary Distribution	High CV	High income inequality within the firm.

