L10 – Descriptive Analytics

A basic step in every Data Science project is to analyse the data in order to understand it and to gain some early insights. This is called Descriptive Analytics. You can build upon these early insights to gain progressively deeper ones before you move on to some of the typical machine learning algorithms.

Several useful packages are available to help with Descriptive Analytics. Typically the first step is to profile the data, which means gathering some basic statistical information about it. Two packages that help with this are ‘numpy’ and ‘pandas’.

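A slightly different set of statistics is used for numeric and for alphanumeric variables/features. For numeric variables these typically include the minimum, maximum, mean, median and standard deviation. For alphanumeric variables they are typically the count, the number of distinct values and the most frequent value.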

These statistics can be gathered for each attribute/feature and then reported in a Data Description report.
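As a minimal sketch of such a per-attribute report, the following uses pandas to pick a different set of statistics depending on whether a column is numeric. The sample data, the column names and the `describe_column` helper are all invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58],
    "city": ["Dublin", "Cork", "Dublin", "Galway", "Dublin"],
})

def describe_column(col):
    """Return a dict of basic statistics suited to the column's type."""
    stats = {"count": int(col.count()), "unique": int(col.nunique())}
    if pd.api.types.is_numeric_dtype(col):
        # Numeric attribute: report location and spread
        stats.update({"min": col.min(), "max": col.max(),
                      "mean": col.mean(), "median": col.median()})
    else:
        # Alphanumeric attribute: report the most frequent value
        counts = col.value_counts()
        stats.update({"most_frequent": counts.index[0],
                      "frequency": int(counts.iloc[0])})
    return stats

# Build the report: one entry per attribute/feature
report = {name: describe_column(df[name]) for name in df.columns}
for name, stats in report.items():
    print(name, stats)
```

Keeping the report as a plain dictionary makes it easy to write out later as a table or a file.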

The following example illustrates using the ‘numpy’ package, together with ‘scipy’ and ‘matplotlib’, to gather basic statistical information for a numeric input and to plot a histogram of it.

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

def descriptive_stats(distribution):
    """
    Compute and present simple descriptive stats for a distribution.

    distribution: list
        Distribution as a Python list
    """
    # Convert the distribution to a numpy ndarray
    dist = np.array(distribution)

    print('Descriptive statistics for distribution:')
    print(dist)
    print('Number of scores:', len(dist))
    print('Number of unique scores:', len(np.unique(dist)))
    print('Sum:', sum(dist))
    print('Min:', min(dist))
    print('Max:', max(dist))
    print('Range:', max(dist) - min(dist))
    print('Mean:', np.mean(dist, axis=0))
    print('Median:', np.median(dist, axis=0))
    # Older SciPy returns the mode as an array, newer SciPy as a scalar;
    # np.atleast_1d handles both cases
    print('Mode:', np.atleast_1d(scipy.stats.mode(dist).mode)[0])
    # np.var and np.std use the population formulas (divide by n) by default
    print('Variance:', np.var(dist, axis=0))
    print('Standard deviation:', np.std(dist, axis=0))
    print('1st quartile:', np.percentile(dist, 25))
    print('3rd quartile:', np.percentile(dist, 75))
    print('Distribution skew:', scipy.stats.skew(dist))

    # Plot a histogram with as many bins as there are scores
    plt.hist(dist, bins=len(dist))
    plt.yticks(np.arange(0, 6, 1.0))
    plt.title('Histogram of distribution scores')
    plt.show()

descriptive_stats([ 1, 4, 5, 6, 8, 8, 9, 10, 10, 11, 11, 13, 13, 13, 14, 14, 15, 15, 15, 15 ])


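Running the function on this sample prints the following (the number of decimal places shown may vary with the Python and NumPy versions):

Descriptive statistics for distribution:
[ 1  4  5  6  8  8  9 10 10 11 11 13 13 13 14 14 15 15 15 15]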
Number of scores: 20
Number of unique scores: 11
Sum: 210
Min: 1
Max: 15
Range: 14
Mean: 10.5
Median: 11.0
Mode: 15
Variance: 16.15
Standard deviation: 4.01870625948
1st quartile: 8.0
3rd quartile: 14.0
Distribution skew: -0.714152479663
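
The variance and standard deviation shown above are the population versions, because np.var and np.std divide by n unless told otherwise. If the scores are a sample drawn from a larger population, the sample versions (dividing by n - 1) are usually preferred. A small sketch of the difference, using the same scores and the ddof parameter; the variable names are illustrative:

```python
import numpy as np

dist = np.array([1, 4, 5, 6, 8, 8, 9, 10, 10, 11,
                 11, 13, 13, 13, 14, 14, 15, 15, 15, 15])

# ddof=0 (the default): divide the sum of squared deviations by n
pop_var = np.var(dist)
pop_std = np.std(dist)

# ddof=1: divide by n - 1 instead (sample variance / standard deviation)
sample_var = np.var(dist, ddof=1)
sample_std = np.std(dist, ddof=1)

print('Population variance:', pop_var)
print('Sample variance:', sample_var)
print('Population standard deviation:', pop_std)
print('Sample standard deviation:', sample_std)
```

With only 20 scores the two versions differ noticeably; the gap shrinks as the number of observations grows.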

A similar approach can be followed using Pandas.
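
As a rough sketch of what the pandas version might look like, the same scores can be loaded into a Series and profiled with describe(). The variable names are illustrative. Note that pandas uses the sample formula (n - 1) for std and var by default, so ddof=0 is passed explicitly where the population figures are wanted.

```python
import pandas as pd

scores = pd.Series([1, 4, 5, 6, 8, 8, 9, 10, 10, 11,
                    11, 13, 13, 13, 14, 14, 15, 15, 15, 15])

# describe() returns count, mean, std, min, quartiles and max in one call
summary = scores.describe()
print(summary)

# Statistics that describe() does not report for numeric data
print('Number of unique scores:', scores.nunique())
print('Sum:', scores.sum())
print('Range:', scores.max() - scores.min())
print('Median:', scores.median())
print('Mode:', scores.mode().iloc[0])   # mode() returns a Series of all modes
print('Population variance:', scores.var(ddof=0))
print('Population standard deviation:', scores.std(ddof=0))
```

For a whole table, calling describe() on a DataFrame produces the same summary for every numeric column at once.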