Understanding Measures of Dispersion in an easy manner !

Introduction

In statistics we work with both population data and sample data. When you have the whole population, the measures you calculate are exact. When you use sample data and compute a statistic, that sample statistic is only an approximation of the population parameter. If you draw 10 different samples, you will generally get 10 different values for the same measure.

Measures of dispersion

The mean, median and mode are usually not sufficient by themselves to reveal the shape of the distribution of a data set. We also need a measure that provides some information about the variation among the values in the data set.

The measures that help us understand the spread of a data set are called “Measures of dispersion”. Measures of central tendency and measures of dispersion taken together give a better picture of the data set.

Measures of dispersion are also called measures of variability. Variability, also called dispersion or spread, refers to how spread out the data is. It helps us compare one data set with others and judge how consistent the data is. Once we know the variation in the data, we can investigate and control the causes behind it.

Some measures of dispersion are:

  1. Range
  2. Variance
  3. Standard deviation
  4. Interquartile Range (IQR)

Note: In this blog we won’t be discussing IQR, as it has some other applications which we will cover in detail separately.

Range

The difference between the largest and smallest observations in a sample is called the “Range”. In simple words, the range is the difference between the two extreme values in the data set.

If X(max) and X(min) are the two extreme values, then the range is

Range = X(max) – X(min)

Example: The minimum and maximum BP readings are 113 and 170. Find the range.

Range = X(max) – X(min)

= 170 – 113

= 57

So, range is 57.

Variance

Now let’s consider two different distributions, A and B, with the following data sets:

A = {2, 2, 4, 4} and B = {1, 1, 5, 5}

If we compute the mean for both distributions,

Mean of A = (2 + 2 + 4 + 4) / 4 = 3

Mean of B = (1 + 1 + 5 + 5) / 4 = 3

We get a mean of 3 for both distributions, yet the data points themselves differ. In distribution A the data points are close to each other, so there is not a large difference between them. In distribution B the data points are far from each other, so the difference is large. The greater the distance, the greater the spread, and this spread is what “Variance” measures.

Variance measures the dispersion of a set of data points around their mean. In other words, it is a measure of how far the values in the data set lie from the mean.

The formula for variance differs for population and sample data.

Why squaring?

Dispersion cannot be negative, because it is nothing but a distance. If we don’t square the deviations from the mean, we get both negative and positive values, which cancel each other out when added. Squaring makes every deviation non-negative and also amplifies the effect of large distances.
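A small sketch of this point, using the data set B = {1, 1, 5, 5} from above. The variable names are only illustrative. It compares the sum of the raw deviations from the mean with the sum of the squared deviations.

```python
# Deviations from the mean for B = {1, 1, 5, 5}
data = [1, 1, 5, 5]
mean = sum(data) / len(data)  # 3.0

# Raw deviations carry a sign; squared deviations do not
deviations = [x - mean for x in data]
squared_deviations = [d ** 2 for d in deviations]

sum_of_deviations = sum(deviations)      # negatives and positives cancel
sum_of_squares = sum(squared_deviations)  # nothing cancels

print(f"Deviations from the mean:  {deviations}")
print(f"Sum of raw deviations:     {sum_of_deviations}")
print(f"Sum of squared deviations: {sum_of_squares}")
```

The raw deviations add up to zero even though the data is clearly spread out, while the squared deviations add up to 16.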

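Let us first consider the variance of a population. It is given by the formula

σ² = Σ (xᵢ – μ)² / N

where μ is the population mean and N is the number of data points.

For distribution A: σ² = ((2 – 3)² + (2 – 3)² + (4 – 3)² + (4 – 3)²) / 4 = 4 / 4 = 1

For distribution B: σ² = ((1 – 3)² + (1 – 3)² + (5 – 3)² + (5 – 3)²) / 4 = 16 / 4 = 4

The means were the same, but the variances are different. The variance of distribution A is 1 and that of distribution B is 4.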

The variance is small or large depending on how far the data points lie from the mean.

When the data points are far from the mean, the dispersion or spread is larger and we get a higher variance. When the data points are close to the mean, the dispersion or spread is smaller and we get a lower variance.
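As a quick check, the comparison of A = {2, 2, 4, 4} and B = {1, 1, 5, 5} can be sketched with Python's standard `statistics` module. `pvariance` is the population variance, which divides by the number of data points. The variable names are illustrative.

```python
import statistics as st

A = [2, 2, 4, 4]
B = [1, 1, 5, 5]

# Both distributions share the same mean
mean_a, mean_b = st.mean(A), st.mean(B)

# Population variance: average squared distance from the mean
var_a, var_b = st.pvariance(A), st.pvariance(B)

print(f"Mean of A = {mean_a}, mean of B = {mean_b}")
print(f"Population variance of A = {var_a}")
print(f"Population variance of B = {var_b}")
```

Both means are 3, while the variance of B (4) is four times that of A (1), because the points of B lie twice as far from the mean.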

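For sample variance, there is a small change in the formula: we divide by n – 1 instead of n.

s² = Σ (xᵢ – x̄)² / (n – 1)

where x̄ is the sample mean and n is the sample size.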

Why n-1 ?

As we know, we take a sample from the population data, and we use that sample to make inferences about the population.

Now let us consider a population of ages plotted on a graph, with age increasing along the x-axis and the population mean in the middle.

If the sample we select happens to come from around the middle of the population, the sample mean and the population mean are almost equal.

If the sample happens to come from one end of the population, the sample mean lies far from the population mean. The deviations are measured from the sample mean, which always sits in the middle of the sample, so they come out smaller than the deviations from the true population mean. As a result, a sample variance computed with n tends to be smaller than the population variance. Here we are underestimating the true population variance.

Hence we divide by n – 1 when calculating variance from sample data. Dividing by n – 1 gives a slightly larger value than dividing by n, which compensates for the underestimate. This use of ‘n – 1’ is called Bessel’s correction.

In later topics we will also come across the term degrees of freedom, which here equals n – 1.
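The effect of dividing by n versus n – 1 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population (the integers 1 to 100), the sample size, the number of trials and the helper `variance` are all made up for illustration. It draws many small samples and averages the two versions of the sample variance.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..100
population = list(range(1, 101))
mu = sum(population) / len(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - ddof)

n, trials = 5, 5000
biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)    # draw with replacement
    biased_total += variance(sample, ddof=0)    # divide by n
    unbiased_total += variance(sample, ddof=1)  # divide by n - 1

avg_biased = biased_total / trials
avg_unbiased = unbiased_total / trials

print(f"True population variance:     {true_variance:.2f}")
print(f"Average variance using n:     {avg_biased:.2f}")
print(f"Average variance using n - 1: {avg_unbiased:.2f}")
```

On average, the version that divides by n comes out noticeably below the true population variance, while the version that divides by n – 1 lands close to it.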

Importance of Variance

  1. Variance tells us how similar the points in a data set are, and so how well the mean represents a typical member.
  2. If the variance is high, it implies that there are very large dissimilarities among the data points in the data set.
  3. If the variance is zero, it implies that every member of the data set is the same.

Standard deviation

Variance is a measure of dispersion, but the figure obtained is sometimes large and hard to interpret because its unit of measurement is the square of the original unit.

Standard deviation (SD) is a very common measure of dispersion. It is the square root of the variance, so it is expressed in the same unit as the data. SD also measures how spread out the values in a data set are around the mean.

Roughly speaking, it is a measure of the typical distance between the data values and the mean.

  1. If the data values are similar, the SD will be low (close to zero).
  2. If the data values are highly variable, the SD will be high (far from zero).

  • If SD is small, the data has little spread (i.e. the majority of points fall near the mean).
  • If SD = 0, there is no spread. This only happens when all data items have the same value.
  • The SD is significantly affected by outliers and skewed distributions.
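The points above can be sketched with the standard `statistics` module. The data sets A and B are the ones used in the variance section, and `constant` is a made-up data set in which every value is the same. `pstdev` is the population standard deviation.

```python
import math
import statistics as st

A = [2, 2, 4, 4]
B = [1, 1, 5, 5]
constant = [7, 7, 7, 7]  # hypothetical data set with no spread

sd_a = st.pstdev(A)
sd_b = st.pstdev(B)
sd_const = st.pstdev(constant)

# SD is the square root of the variance
sd_b_from_variance = math.sqrt(st.pvariance(B))

print(f"SD of A = {sd_a}")
print(f"SD of B = {sd_b} (sqrt of variance = {sd_b_from_variance})")
print(f"SD of a constant data set = {sd_const}")
```

B has the larger SD because its points lie further from the mean, and the constant data set has an SD of zero.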

Coefficient of variation

Standard deviation is the most common measure of variability for a single data set, whereas the coefficient of variation (CV) is used to compare the variability of two or more data sets. It is the standard deviation divided by the mean: CV = SD / mean.

  • Variance gives its answer in squared units, whereas SD gives it in the original units, so SD is preferred and easier to interpret.
  • The coefficient of variation does not have a unit of measurement. It is universal across data sets and well suited for comparisons.
  • If the coefficient of variation is the same, we can say that the two data sets have the same relative variability.
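A minimal sketch of the coefficient of variation, computed as SD divided by the mean. The helper `coefficient_of_variation` and all three data sets are hypothetical and only serve the illustration. The same heights are given in centimetres and in metres, and the weights are chosen to have the same SD as the heights in centimetres.

```python
import statistics as st

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (has no unit)."""
    return st.stdev(values) / st.mean(values)

heights_cm = [150, 160, 170, 180, 190]      # hypothetical heights in cm
heights_m = [h / 100 for h in heights_cm]   # the same heights in metres
weights_kg = [50, 60, 70, 80, 90]           # hypothetical weights in kg

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
cv_kg = coefficient_of_variation(weights_kg)

print(f"SD of heights (cm): {st.stdev(heights_cm):.3f}, CV: {cv_cm:.4f}")
print(f"SD of heights (m):  {st.stdev(heights_m):.3f}, CV: {cv_m:.4f}")
print(f"SD of weights (kg): {st.stdev(weights_kg):.3f}, CV: {cv_kg:.4f}")
```

Changing the unit changes the SD but not the CV. The heights in centimetres and the weights have the same SD, yet the weights have the larger CV because their mean is smaller.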

Python Implementation 

Python code for finding range

import numpy as np

# Range = largest value - smallest value
data = np.array([4, 6, 9, 3, 7])
print(f"The range of the dataset is {data.max() - data.min()}")

The output gives the value of the range, i.e. 6.

Python code for finding variance

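import statistics as st

# A plain list is used here: statistics.variance can truncate the result
# to an integer when it is given a NumPy integer array.
data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]
var = st.variance(data)  # sample variance (divides by n - 1)

print(f"The variance of the data is {var}")

The output gives the sample variance, which is approximately 8.18 (73.6 / 9).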

Python code for finding Standard deviation

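import statistics as st

# A plain list is used here: the statistics module can truncate results
# to an integer when it is given a NumPy integer array.
data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]
sd = st.stdev(data)  # sample standard deviation (divides by n - 1)

print(f"The standard deviation of data points is {sd}")

The output gives the sample SD, which is approximately 2.86 (the square root of 73.6 / 9).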
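If you prefer to stay with NumPy arrays, the same quantities can be sketched with `np.var` and `np.std`. Note that NumPy divides by n by default (population formula); passing `ddof=1` switches to the sample formula with n – 1. The variable names are illustrative.

```python
import numpy as np

data = np.array([3, 8, 6, 10, 12, 9, 11, 10, 12, 7])

pop_var = np.var(data)             # default ddof=0: divides by n
sample_var = np.var(data, ddof=1)  # ddof=1: divides by n - 1
sample_sd = np.std(data, ddof=1)   # square root of the sample variance

print(f"Population variance: {pop_var:.4f}")
print(f"Sample variance:     {sample_var:.4f}")
print(f"Sample SD:           {sample_sd:.4f}")
```

The sample variance is slightly larger than the population variance because of the n – 1 in the denominator.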

Conclusion

So here we have understood measures of variability. Measures of central tendency and measures of variability together are called univariate measures of analysis.

Measures which deal with only one variable are called univariate measures.

In the next section, we are going to discuss more interesting topics such as five-number summary statistics and skewness.

Happy Learning !!

Is Statistics important for Data Science?

Introduction

Statistics is the science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data. It is nothing but learning from data.

The field of statistics mainly deals with collecting information, interpreting that information from a data set and drawing conclusions from it. It can be used in various fields.

For example, when we watch a cricket match, various terms are used such as batting average, bowling economy, strike rate, etc. We also see many graphs and data visualizations. These things are part of statistics. The information is analyzed and the results are shown accordingly.

We can talk about statistics all the time but do we know the science behind it?

Using various statistical methods, large cricket organizations compare players and teams and rank them accordingly. So if we learn the science behind it, we can create our own rankings, compare different things and debate with hard facts.

Statistics is very important in the fields of analytics, data science, artificial intelligence (AI), machine learning and deep learning (deep neural networks). It is used to work through complex real-world problems so that data professionals such as data analysts and data scientists can analyze data and retrieve meaningful insights from it.

In simple words, stats can be used to derive meaningful insights from data by performing mathematical computations on it.

The field of statistics is divided into two parts: descriptive statistics and inferential statistics. Data comes in two types, quantitative and qualitative, and it can be either labelled or unlabelled.

Some important terms used

Population: In statistics, a population is the entire pool from which a statistical sample is drawn. For example, consider all students in a college: together they form the population. A population can be contrasted with a sample.

Samples: A sample is a subset of the population. It is drawn from the population and should be representative of it. It refers to a set of observations drawn from the population.

It is necessary to use samples for research because it is often impractical to study the whole population. For example, suppose we want to know the average height of boys in a college.

We can’t easily use the whole population, as there can be a lot of boys and measuring every one of them is not practical. In such cases a sample is taken, because a sample is representative of the population. A certain number of boys are selected as a sample and their average height is computed.
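The idea of estimating a population average from a sample can be sketched as follows. Everything here is made up for illustration: the population is 2,000 simulated heights in centimetres, and the sample is 50 of them picked at random.

```python
import random
import statistics as st

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: simulated heights (cm) of 2,000 students
population = [random.gauss(170, 8) for _ in range(2000)]

# Simple random sample of 50 students, drawn without replacement
sample = random.sample(population, 50)

population_mean = st.mean(population)  # population parameter
sample_mean = st.mean(sample)          # sample statistic

print(f"Population mean height: {population_mean:.2f} cm")
print(f"Sample mean height:     {sample_mean:.2f} cm")
```

The sample mean is close to the population mean, though not identical, even though only 50 of the 2,000 heights were used.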

Variable: A characteristic of each element of a population or a sample is called a variable.

Some of the important topics which we will be discussing in further articles are:

Basic Statistics

  • Terms related to statistics.
  • Random variables
  • Population and sample concept.
  • Measures of central tendency
  • Measures of variability
  • Sampling Techniques
  • Measures of Dispersion
  • Gaussian / Normal Distribution

Intermediate Statistics

  • Standard Normal Distribution
  • z-score
  • Probability Density function (pdf)
  • Cumulative distribution function (cdf)
  • Hypothesis testing
  • Plotting graphs
  • Kernel Density Estimation
  • Central limit theorem
  • Skewness of data
  • Covariance
  • Pearson correlation coefficient
  • Spearman Rank Correlation

Advanced Statistics

  • Q-Q Plot
  • Chebyshev’s inequality
  • Discrete and continuous distribution
  • Bernoulli and Binomial distribution
  • Log Normal Distribution
  • Power Law distribution
  • Box-Cox transform
  • Poisson Distribution
  • z-stats
  • t-stats
  • Type 1 and Type 2 error
  • chi-square test
  • ANOVA testing
  • F-stats
  • A/B testing

Looking at the topics, they may seem tough, but that depends on your level of understanding and determination to learn. It’s not rocket science and can be learned easily.

It’s important that you know statistics because it is a prerequisite for your further data science journey. So let’s kickstart our journey of statistics here.

The best way to learn anything is to understand it properly and reinforce it by implementing it. We learn from our mistakes, so it’s better to keep learning until you understand it properly.

Before jumping deep into data science, I would like to repeat that learning “Statistics” is a must.

Let’s go 🚀🚀