Organisation of Data
Introduction
In the previous chapter, we learned about the methods and sources for collecting data. Once data is collected, it is usually in a raw, disorganised form. This raw data, in its original state, is often a jumble of numbers and facts that is difficult to comprehend, interpret, or use for drawing meaningful conclusions. Imagine having a list of the marks of 1,000 students; simply looking at the list would reveal very little about the overall performance of the group.
To extract meaningful information, this raw data must be organised and presented in a systematic manner. Organisation of data refers to the process of arranging collected data into groups or classes based on their common characteristics. This crucial step simplifies the complexity of the data, facilitates comparison, and prepares it for further statistical analysis and presentation. This chapter deals with the fundamental techniques of classifying and organising raw data into a structured and comprehensible format, primarily through the construction of frequency distributions.
Raw Data
Raw data, also known as ungrouped data, is data that has been collected but not yet arranged or organised in any systematic way. It is the data in its original form, as it was recorded during the survey or experiment. Raw data is essentially a list of observations that, on its own, does not provide a clear picture of the underlying patterns or characteristics of the group being studied.
Example 1. Consider the marks (out of 100) obtained by 50 students in a statistics examination. The data collected from their answer sheets is as follows:
41, 55, 62, 79, 35, 41, 88, 92, 62, 48, 55, 70, 79, 88, 35, 48, 62, 70, 41, 55, 92, 62, 88, 79, 41, 35, 55, 70, 48, 62, 88, 92, 55, 79, 48, 41, 62, 70, 55, 88, 48, 35, 62, 79, 55, 88, 70, 41, 62, 79
Analysis:
The data presented above is raw data. If you were asked to find the highest or lowest marks, or the marks obtained by the majority of students, you would have to search through the entire list, which is a tedious and inefficient process. The data in this form is confusing and provides very little insight at a glance. To make sense of it, we need to classify and organise it.
Arranging this data in ascending or descending order (an array) is a first step, but for a large dataset, a more compact form like a frequency distribution is required.
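The first step described above can be sketched in a few lines of Python, using the 50 marks from Example 1. Sorting produces the array, from which the lowest and highest values can be read off directly:

```python
# Raw marks of the 50 students from Example 1
marks = [41, 55, 62, 79, 35, 41, 88, 92, 62, 48,
         55, 70, 79, 88, 35, 48, 62, 70, 41, 55,
         92, 62, 88, 79, 41, 35, 55, 70, 48, 62,
         88, 92, 55, 79, 48, 41, 62, 70, 55, 88,
         48, 35, 62, 79, 55, 88, 70, 41, 62, 79]

array = sorted(marks)            # ascending array of the raw data
print(array[0], array[-1])       # lowest and highest marks: 35 92
```

Even this simple step answers questions (highest mark, lowest mark) that were tedious to answer from the raw list, but for a large dataset a frequency distribution remains more compact.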
Classification Of Data
Classification is the process of arranging data into sequences and groups according to their common characteristics or separating them into different but related parts. It is the process of sorting 'like' things together. The primary objective of classification is to condense the mass of data in such a way that its main features can be easily understood and compared.
Objectives of Classification
- To simplify complex data: It reduces a large volume of raw data into a more manageable and understandable form.
- To facilitate comparison: By grouping data, we can easily compare different sets of observations. For example, we can compare the performance of students in two different schools.
- To reveal patterns: It helps in identifying patterns, trends, and relationships that may be hidden in the raw data.
- To prepare data for tabulation and analysis: Classification is the basis for tabulation, presentation, and further statistical analysis.
Types of Classification
Data can be classified based on different criteria:
- Chronological Classification: Data is classified based on time, such as years, months, or days. For example, the population of India from 1951 to 2011.
- Geographical Classification: Data is classified based on location or geographical area, such as country, state, or district. For example, production of wheat in different states of India.
- Qualitative Classification: Data is classified based on attributes or qualities that cannot be measured numerically, such as sex (male/female), religion, or literacy (literate/illiterate). This can be a simple classification (based on one attribute) or a manifold classification (based on more than one attribute).
- Quantitative Classification: Data is classified based on characteristics that can be measured numerically, such as height, weight, income, or marks. This is the most common type of classification in statistics and forms the basis for frequency distributions.
Variables: Continuous And Discrete
A key concept in the organisation of data is the idea of a variable. A variable is a characteristic or quantity that can be measured and whose value changes or varies from one individual or object to another. For example, the height of students is a variable because it varies from student to student. Variables can be either discrete or continuous.
Discrete Variable
A discrete variable is a variable that can only assume specific, distinct values and cannot take on any value in between. These values are often, but not always, whole numbers (integers). The variable moves in countable "jumps" from one possible value to the next.
Examples:
- The number of children in a family (can be 0, 1, 2, 3, ... but not 2.5).
- The number of cars passing a toll booth in an hour.
- The number of defective items in a batch of products.
- The result of a dice roll (can be 1, 2, 3, 4, 5, 6, but not 1.7).
The data generated from a discrete variable is called discrete data.
Continuous Variable
A continuous variable is a variable that can assume any numerical value within a given range. The values are not restricted to whole numbers and can include fractions and decimals. The possible values of a continuous variable are uncountable, and their accuracy is limited only by the precision of the measuring instrument.
Examples:
- The height of a person (can be 165 cm, 165.1 cm, 165.11 cm, etc.).
- The weight of a student.
- The temperature of a room.
- The time taken to complete a race.
The data generated from a continuous variable is called continuous data. When organising continuous data, we must group it into class intervals.
What Is A Frequency Distribution?
A frequency distribution is a comprehensive way to classify raw data of a quantitative variable. It is a statistical table that displays the number of observations (frequency) falling within each of a series of non-overlapping, defined classes or intervals. It condenses the raw data into a more compact and manageable form, revealing the distribution of data points across the range of values.
How To Prepare A Frequency Distribution?
Let's use the marks of 50 students from Example 1. The key terms involved are:
- Class: A range of values into which data is grouped (e.g., 30-40).
- Class Limits: The two ends of a class. The lowest value is the lower class limit and the highest value is the upper class limit.
- Class Interval (or Class Width): The difference between the true upper and lower limits of a class.
- Class Mark (or Mid-point): The central value of a class.
$ \text{Class Mark} = \frac{\text{Upper Class Limit} + \text{Lower Class Limit}}{2} $
- Range: The difference between the highest and lowest values in the raw data.
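As a quick numerical check of the class-mark formula above, here is a minimal Python sketch applied to the classes used later in this chapter:

```python
# Class marks (mid-points) for the seven classes used in this chapter
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

# Class Mark = (Upper Class Limit + Lower Class Limit) / 2
class_marks = [(lower + upper) / 2 for lower, upper in classes]
print(class_marks)   # [35.0, 45.0, 55.0, 65.0, 75.0, 85.0, 95.0]
```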
Steps to Prepare:
- Calculate the Range: In our example data, Highest Value = 92, Lowest Value = 35. So, Range = 92 - 35 = 57.
- Decide the Number of Classes: There is no strict rule. This depends on the data and the researcher's judgment. Too few classes will hide details, while too many will be confusing. A common rule of thumb is to have between 5 and 15 classes. Let's decide to have 7 classes.
- Determine the Class Interval: $ \text{Class Interval} \approx \frac{\text{Range}}{\text{Number of Classes}} = \frac{57}{7} \approx 8.14 $. We can choose a convenient number like 10 for the class interval.
- Decide the Class Limits: We need to define the starting point. Since the lowest value is 35, we can start the first class at 30. The classes would then be 30-40, 40-50, 50-60, and so on. This is the exclusive method, where the upper limit of one class is the lower limit of the next. An observation equal to the upper limit (e.g., 40) is included in the next class (40-50), not the current one (30-40).
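The steps above (range, number of classes, class interval, class limits) can be sketched in Python for the marks of Example 1. The choice of 7 classes and a width of 10 follows the worked example; these are judgment calls, not fixed rules:

```python
# Raw marks of the 50 students from Example 1
marks = [41, 55, 62, 79, 35, 41, 88, 92, 62, 48,
         55, 70, 79, 88, 35, 48, 62, 70, 41, 55,
         92, 62, 88, 79, 41, 35, 55, 70, 48, 62,
         88, 92, 55, 79, 48, 41, 62, 70, 55, 88,
         48, 35, 62, 79, 55, 88, 70, 41, 62, 79]

# Step 1: Range
data_range = max(marks) - min(marks)      # 92 - 35 = 57

# Step 2: Number of classes (a judgment call; 7 here)
num_classes = 7

# Step 3: Class interval, rounded to a convenient width
approx_width = data_range / num_classes   # 57 / 7 ≈ 8.14
width = 10                                # chosen for convenience

# Step 4: Class limits, starting at 30 (just below the lowest value, 35)
start = 30
classes = [(start + i * width, start + (i + 1) * width)
           for i in range(num_classes)]
print(classes)   # [(30, 40), (40, 50), ..., (90, 100)]
```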
Adjustment In Class Interval
Sometimes, data is presented in the inclusive method, e.g., 30-39, 40-49, etc. Here, both limits are included in the class. To perform certain statistical calculations, we need to convert this to an exclusive form.
Derivation of Adjustment Factor:
The adjustment is done to remove the gap between the upper limit of one class and the lower limit of the next.
$ \text{Adjustment Factor} = \frac{(\text{Lower limit of next class}) - (\text{Upper limit of current class})}{2} $
For classes 30-39 and 40-49, the factor is $ \frac{40 - 39}{2} = 0.5 $. We subtract this factor from all lower limits and add it to all upper limits. So, 30-39 becomes 29.5-39.5, and 40-49 becomes 39.5-49.5, thus making the series continuous.
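The adjustment just described is mechanical enough to automate. A short Python sketch, using the inclusive classes from the text:

```python
# Convert inclusive classes (30-39, 40-49, ...) into exclusive (continuous) form
inclusive = [(30, 39), (40, 49), (50, 59), (60, 69)]

# Adjustment Factor = (lower limit of next class - upper limit of current class) / 2
adjustment = (inclusive[1][0] - inclusive[0][1]) / 2   # (40 - 39) / 2 = 0.5

exclusive = [(lower - adjustment, upper + adjustment)
             for lower, upper in inclusive]
print(exclusive)   # [(29.5, 39.5), (39.5, 49.5), (49.5, 59.5), (59.5, 69.5)]
```

Note that after the adjustment the upper limit of each class coincides with the lower limit of the next, which is exactly what makes the series continuous.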
Finding Class Frequency By Tally Marking
Now, we go through the raw data and assign a tally mark ( | ) for each observation in its respective class. Every fifth tally mark is a diagonal line across the previous four (||||). This makes counting easier.
| Class Interval (Marks) | Tally Marks | Frequency (No. of Students) |
|---|---|---|
| 30 - 40 | |||| | 4 |
| 40 - 50 | |||| |||| | | 11 |
| 50 - 60 | |||| || | 7 |
| 60 - 70 | |||| ||| | 8 |
| 70 - 80 | |||| |||| | | 11 |
| 80 - 90 | |||| | | 6 |
| 90 - 100 | ||| | 3 |
| Total | | 50 |
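Tally marking can be mimicked in Python to verify the table: each observation is tested against the exclusive class limits and counted exactly once.

```python
# Raw marks of the 50 students from Example 1
marks = [41, 55, 62, 79, 35, 41, 88, 92, 62, 48,
         55, 70, 79, 88, 35, 48, 62, 70, 41, 55,
         92, 62, 88, 79, 41, 35, 55, 70, 48, 62,
         88, 92, 55, 79, 48, 41, 62, 70, 55, 88,
         48, 35, 62, 79, 55, 88, 70, 41, 62, 79]

classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

frequency = {}
for lower, upper in classes:
    # Exclusive method: a mark equal to the upper limit belongs to the NEXT class
    frequency[(lower, upper)] = sum(1 for m in marks if lower <= m < upper)

print(list(frequency.values()))   # [4, 11, 7, 8, 11, 6, 3]
```

The frequencies sum to 50, confirming that every observation falls into exactly one class.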
Loss Of Information
It is important to recognise that forming a frequency distribution involves a loss of information. Once the data is grouped, we know the number of observations in each class, but we lose the exact value of each individual observation. For instance, in the table above, we know 11 students scored between 40 and 50, but we don't know their exact scores from the table alone. This is a trade-off made for the sake of comprehension and simplicity.
Frequency Array
A frequency array is a specific type of frequency distribution used for discrete variables. Instead of grouping the data into class intervals, each distinct value of the variable is listed along with the number of times it appears (its frequency). This method preserves all the information, as no grouping is done.
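A frequency array is easy to build with Python's standard library. The data below (number of children per family) is hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

# Hypothetical discrete data: number of children in 12 surveyed families
children = [2, 1, 0, 3, 2, 1, 2, 0, 1, 2, 4, 1]

# Frequency array: each distinct value with the number of times it appears
freq_array = Counter(children)

for value in sorted(freq_array):
    print(value, freq_array[value])
# 0 2
# 1 4
# 2 4
# 3 1
# 4 1
```

Because every distinct value keeps its own frequency and no grouping is done, the original data could be fully reconstructed from the array, so no information is lost.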
Bivariate Frequency Distribution
So far, we have discussed the organisation of data for a single variable (univariate data). However, we often collect data on two variables simultaneously to study their relationship. This is called bivariate data. A bivariate frequency distribution, also known as a contingency table or two-way table, is a method of organising and presenting the frequencies of two variables together.
The table shows the frequency of observations for each combination of the categories of the two variables. The categories of one variable are listed in the rows, and the categories of the other are listed in the columns.
Example 2. A survey of 200 households was conducted to study the relationship between their monthly income level and their monthly expenditure on entertainment. The data is presented in a bivariate frequency distribution below.
Answer:
| Monthly Expenditure on Entertainment | Low Income (< ₹30,000) | Medium Income (₹30,000 - 70,000) | High Income (> ₹70,000) | Total Households |
|---|---|---|---|---|
| Low (< ₹2,000) | 50 | 15 | 5 | 70 |
| Medium (₹2,000 - 5,000) | 20 | 45 | 10 | 75 |
| High (> ₹5,000) | 5 | 20 | 30 | 55 |
| Total Households | 75 | 80 | 45 | 200 |
Interpretation: This table shows the joint frequency of income and expenditure. For example, the value '45' in the middle of the table indicates that there are 45 households that have a medium income level AND a medium expenditure level. The totals in the margins (e.g., 70, 75, 55 on the right) are called marginal frequencies, which are the univariate frequency distributions for each variable.
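How the marginal frequencies fall out of a contingency table can be shown in a short Python sketch. The joint frequencies below follow Example 2 (row totals 70, 75, 55 and column totals 75, 80, 45, summing to 200 households):

```python
incomes = ["Low", "Medium", "High"]
expenditures = ["Low", "Medium", "High"]

# Joint frequencies: rows are expenditure levels, columns are income levels
table = {
    "Low":    {"Low": 50, "Medium": 15, "High": 5},
    "Medium": {"Low": 20, "Medium": 45, "High": 10},
    "High":   {"Low": 5,  "Medium": 20, "High": 30},
}

# Marginal frequencies: sum across each row and each column
row_totals = {e: sum(table[e].values()) for e in expenditures}
col_totals = {i: sum(table[e][i] for e in expenditures) for i in incomes}
grand_total = sum(row_totals.values())

print(row_totals)    # marginal distribution of expenditure
print(col_totals)    # marginal distribution of income
print(grand_total)   # 200
```

Each marginal total is itself a univariate frequency distribution, which is why summing either set of margins recovers the same grand total.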
Conclusion
The organisation of data is an indispensable step in the process of statistical inquiry. It transforms a chaotic mass of raw data into a structured and intelligible form. By classifying data and constructing frequency distributions, we can condense large datasets, highlight their essential characteristics, and lay a solid foundation for subsequent analysis and visual presentation.
Techniques like creating frequency distributions for univariate data and bivariate tables for studying relationships between two variables are fundamental tools for any researcher. While this process involves a trade-off, where some detail is lost for the sake of simplicity, the clarity and insight gained are invaluable. A well-organised dataset is the first major step towards uncovering the stories hidden within the numbers.