Whenever we are exploring a new dataset, the very first thing to do is calculate some basic statistics: number of observations, mean or average, minimum, maximum, median and standard deviation.
This helps us get an overview of our data quickly.
We will illustrate this with a dataset consisting of the heights (in cm) of a sample of 18-year-old male students. In this case, the measured height of each student is one value, or observation.
Our first statistic, the number of observations or sample size, is easy to get but important: a common rule of thumb is to have at least 30 observations before deciding which statistical test should be applied later.
The mean or average is the sum of all observations divided by the number of observations. In many contexts it is a good summary of what a typical value in our data looks like.
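As a minimal sketch of the sample size and the mean, the snippet below uses a small list of heights. The values are hypothetical, made up purely for illustration (and far fewer than 30), and the variable names are our own.

```python
import statistics

# Hypothetical heights in cm, invented for illustration only
heights = [172.0, 168.5, 180.0, 175.5, 169.0]

n = len(heights)                  # number of observations (sample size)
mean_height = sum(heights) / n    # sum of observations divided by their count

print(n)                          # 5
print(mean_height)                # 173.0
print(statistics.mean(heights))   # same result using the standard library
```

Computing the mean by hand mirrors the definition, while `statistics.mean` is the more convenient choice in practice.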
The minimum and maximum are useful for determining the range of the data, that is, the interval within which every value in our dataset falls. They can be calculated with the min() and max() functions.
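A short sketch of the minimum, maximum and range, assuming the heights are stored in a plain Python list. The values are hypothetical and chosen only for illustration.

```python
# Hypothetical heights in cm, invented for illustration only
heights = [172.0, 168.5, 180.0, 175.5, 169.0]

shortest = min(heights)           # smallest observation
tallest = max(heights)            # largest observation
range_width = tallest - shortest  # width of the interval covered by the data

print(shortest, tallest)          # 168.5 180.0
print(range_width)                # 11.5
```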
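The median is the value that is right in the middle. That is, if we order the observations from smallest to largest, the median is the value such that about 50% of the observations are above it and about 50% are below. With an even number of observations, the median is usually taken to be the average of the two middle values.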
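The ordering idea can be sketched directly in code. The helper `median_by_hand` below is a hypothetical name of our own, and the heights are made-up values. When the number of observations is even, the sketch follows the usual convention of averaging the two middle values.

```python
import statistics

def median_by_hand(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)      # order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: the single middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical heights in cm, invented for illustration only
heights = [172.0, 168.5, 180.0, 175.5, 169.0]

print(median_by_hand(heights))             # 172.0
print(median_by_hand(heights + [177.0]))   # 173.75
print(statistics.median(heights))          # standard library equivalent
```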
Sometimes, we will find that the median and the mean are very similar. This can be an indication that there is some symmetry in our data: for every large observation there is also a small observation, in similar proportions. Whenever the median and the mean are noticeably different, this suggests that there is a certain skew in our data, perhaps caused by outliers or unusual observations.
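To see how an unusual observation pulls the mean away from the median, the sketch below adds one implausible value to a small sample. Both the heights and the outlier are hypothetical: we imagine a height of 175.0 cm mistyped as 1750.0.

```python
import statistics

# Hypothetical heights in cm, invented for illustration only
heights = [172.0, 168.5, 180.0, 175.5, 169.0]

# Without the outlier, mean and median are close to each other
print(statistics.mean(heights), statistics.median(heights))   # 173.0 172.0

# Add a single mistyped value (an assumed data-entry error)
with_outlier = heights + [1750.0]

mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

# The mean jumps well above any realistic height; the median barely moves
print(mean_out, median_out)
```

In this example the median moves from 172.0 to 173.75, while the mean rises above 400 cm, which is why a large gap between the two is worth investigating.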
The standard deviation measures the spread of the data. That is, roughly, how far the values in our dataset are from the mean on average. The larger the standard deviation, the bigger the spread. A small standard deviation suggests that all the observations are similar to each other and to the average.
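The following sketch computes the standard deviation step by step. The text above does not say whether the sample version (dividing by n - 1) or the population version (dividing by n) is meant, so both are shown. Treating the data as a sample is our assumption, and the heights are hypothetical values.

```python
import math
import statistics

# Hypothetical heights in cm, invented for illustration only
heights = [172.0, 168.5, 180.0, 175.5, 169.0]

n = len(heights)
mean_height = sum(heights) / n

# Squared distance of each observation from the mean
squared_devs = [(h - mean_height) ** 2 for h in heights]

# Sample standard deviation: divide by n - 1 (assumed here, data is a sample)
sample_sd = math.sqrt(sum(squared_devs) / (n - 1))

# Population standard deviation: divide by n
population_sd = math.sqrt(sum(squared_devs) / n)

# The standard library gives the same results
print(sample_sd, statistics.stdev(heights))
print(population_sd, statistics.pstdev(heights))
```

For these values the sample standard deviation is about 4.8 cm, meaning a typical height sits within a few centimetres of the 173.0 cm mean.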