Descriptive statistics is a branch of statistics that deals with collecting, analyzing, interpreting, and presenting data in an organized and effective manner. Its main objective is to provide simple and understandable summaries about the main characteristics of a dataset, without making inferences or predictions about a broader population.
Measures of central tendency are numerical values that describe how data in a set are centralized or clustered. They are essential in statistics and data analysis because they provide a summary of information, allowing us to quickly understand the general characteristics of a data distribution.
To illustrate these concepts, we will use a Pandas DataFrame.
```python
import pandas as pd
import numpy as np

# We create an example DataFrame
data = {'values': [10, 20, -15, 0, 50, 10, 5, 100]}
df = pd.DataFrame(data)

print(df)
```
Mean
It is the arithmetic average of a set of numerical data: the sum of the values divided by how many there are.
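As a minimal sketch, assuming the same example data as the DataFrame above, the mean can be obtained with the Series `mean()` method. The variable name `mean_value` is only illustrative.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# Sum of the values (180) divided by their count (8)
mean_value = df['values'].mean()
print(f"Mean: {mean_value}")  # Mean: 22.5
```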
Median
It is the middle value when the data are ordered. With an even number of values, it is the average of the two middle ones.
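A short sketch of the same idea, again assuming the example data from above. The sorted values are printed first so the middle of the data is visible. The names `sorted_values` and `median_value` are illustrative.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# Ordered data: -15, 0, 5, 10, 10, 20, 50, 100
sorted_values = df['values'].sort_values().tolist()
print(sorted_values)

# Eight values, so the median is the average of the 4th and 5th (10 and 10)
median_value = df['values'].median()
print(f"Median: {median_value}")  # Median: 10.0
```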
Mode
It is the value that occurs most frequently. A data set can have more than one mode.
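The following sketch, assuming the example data from above, uses the Series `mode()` method. Because several values can tie for the highest frequency, `mode()` returns a Series rather than a single number. The variable name `modes` is illustrative.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# mode() returns a Series because there may be ties
modes = df['values'].mode()
print(f"Mode(s): {modes.tolist()}")  # Mode(s): [10]

# value_counts() shows how often each value occurs
print(df['values'].value_counts())
```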
These measures are fundamental for describing and analyzing data distributions.
Measures of dispersion are numerical values that describe how varied the data are in a set. While measures of central tendency tell us where the data are "centered", measures of dispersion show us how much those data "spread out" or "vary" around that center.
Range
The difference between the maximum value and the minimum value of a data set.
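A minimal sketch of the range, assuming the example data from above. The variable name `value_range` is illustrative.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# Maximum (100) minus minimum (-15)
value_range = df['values'].max() - df['values'].min()
print(f"Range: {value_range}")  # Range: 115
```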
Variance and standard deviation
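Variance is the average of the squared deviations from the mean, and standard deviation is its square root, so both describe how far the values typically lie from the mean. Standard deviation is more interpretable because it is in the same units as the original data. Pandas calculates both easily.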
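The sketch below, assuming the example data from above, computes both metrics. One detail worth knowing: the Pandas `var()` and `std()` methods default to `ddof=1` (the sample version, dividing by n - 1), and passing `ddof=0` gives the population version. The variable names are illustrative.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# Sample variance and standard deviation (default ddof=1, divides by n - 1)
sample_var = df['values'].var()
sample_std = df['values'].std()
print(f"Sample variance: {sample_var:.2f}")  # Sample variance: 1328.57
print(f"Sample std dev:  {sample_std:.2f}")  # Sample std dev:  36.45

# Population variance and standard deviation (ddof=0, divides by n)
pop_var = df['values'].var(ddof=0)
pop_std = df['values'].std(ddof=0)
print(f"Population variance: {pop_var:.2f}")  # Population variance: 1162.50
print(f"Population std dev:  {pop_std:.2f}")  # Population std dev:  34.10
```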
The shape measures describe how the values in a data set are distributed in relation to the measures of central tendency. Specifically, they tell us the nature of the distribution, whether it is symmetric, skewed, or has heavy tails, among others.
Skewness
Skewness measures the lack of symmetry in the data distribution. A positive skewness indicates a longer right tail: most of the data are on the left and there are a few very high values on the right. A negative skewness indicates a longer left tail, with a few unusually low values. If it is close to zero, it suggests that the data are roughly symmetrical.
```python
skewness = df['values'].skew()
print(f"Skewness: {skewness}")
```
Kurtosis
Kurtosis measures the "heaviness of the tails" of a distribution and is often also described in terms of its "peakedness". In practical terms, it indicates how prone the distribution is to producing atypical values (outliers). This is useful, for example, in financial risk modeling, where a high kurtosis means a higher risk of extreme events. Measured relative to the normal distribution (excess kurtosis), a positive value indicates heavier tails and typically a sharper peak. A negative value indicates lighter tails and typically a flatter peak. A value close to zero suggests a shape similar to that of the normal distribution.
The df.kurt() method in Pandas calculates the excess kurtosis, which facilitates comparison with the normal distribution.
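As a small sketch, assuming the example data from above, the excess kurtosis of a single column can be read from the Series `kurt()` method. The variable name `kurtosis` is illustrative. With only eight values the estimate is very noisy, so it should be read as an illustration rather than a reliable measurement.

```python
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

# Excess kurtosis: 0 corresponds to the normal distribution
kurtosis = df['values'].kurt()
print(f"Excess kurtosis: {kurtosis}")

# The single large value (100) produces a heavy right tail,
# so the result is positive for this data
print("Heavier tails than normal" if kurtosis > 0 else "Lighter tails than normal")
```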
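The three main types of kurtosis are leptokurtic (positive excess kurtosis, heavier tails than the normal distribution), mesokurtic (excess kurtosis close to zero, like the normal distribution), and platykurtic (negative excess kurtosis, lighter tails).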
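To make the three kurtosis types concrete, here is a hedged sketch that draws synthetic samples with NumPy and compares their excess kurtosis. The choice of distributions (Laplace, normal, uniform), the sample size, the seed, and the column names are all assumptions made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is reproducible
n = 100_000  # large sample so the estimates are stable

# One synthetic sample per kurtosis type (illustrative choices)
samples = pd.DataFrame({
    'laplace (leptokurtic)': rng.laplace(size=n),         # heavy tails
    'normal (mesokurtic)': rng.normal(size=n),            # reference shape
    'uniform (platykurtic)': rng.uniform(-1, 1, size=n),  # light tails
})

# DataFrame.kurt() returns the excess kurtosis of each column
excess_kurtosis = samples.kurt()
print(excess_kurtosis.round(2))
```

In theory the excess kurtosis is 3 for the Laplace distribution, 0 for the normal distribution, and -1.2 for the uniform distribution, so the printed estimates should land near those values.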
Visualizing data is fundamental. Histograms, bar charts, and scatter plots are often used, depending on the data type.
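As one possible illustration, the sketch below draws a histogram of the example data with Matplotlib and marks the mean and median on it. The use of Matplotlib, the number of bins, the colors, and the output file name `histogram.png` are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Same example data as the DataFrame above
df = pd.DataFrame({'values': [10, 20, -15, 0, 50, 10, 5, 100]})

fig, ax = plt.subplots(figsize=(6, 4))

# Histogram of the numeric column, split into 5 bins
ax.hist(df['values'], bins=5, edgecolor='black')

# Vertical lines for the mean and the median, for reference
ax.axvline(df['values'].mean(), color='red', linestyle='--', label='Mean')
ax.axvline(df['values'].median(), color='green', linestyle=':', label='Median')

ax.set_xlabel('values')
ax.set_ylabel('Frequency')
ax.set_title('Histogram of values')
ax.legend()

fig.savefig('histogram.png')  # hypothetical output file name
```

A histogram suits a single numeric column like this one. Bar charts are the usual choice for categorical counts, and scatter plots for the relationship between two numeric variables.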