Exploratory Data Analysis

January 13, 2018

Exploratory data analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods.

Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.

EDA differs from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed.

The purpose of exploratory data analysis is to:

Check for missing data and other mistakes.
Gain maximum insight into the data set and its underlying structure.
Uncover a parsimonious model, one which explains the data with a minimum number of predictor variables.
Check assumptions associated with any model fitting or hypothesis test.
Create a list of outliers or other anomalies.
Find parameter estimates and their associated confidence intervals or margins of error.
Identify the most influential variables.

Types of Exploratory Data Analysis

EDA falls into four main areas:

Univariate non-graphical — looking at one variable of interest, like age, height, income level etc.
Univariate graphical.
Multivariate non-graphical — analysis of multiple variables at the same time.
Multivariate graphical.
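
As a rough sketch, the four areas map onto base R calls like the ones below, using the built-in mtcars dataset. The choice of mpg and wt as example variables is an assumption made for illustration.

```r
# Univariate non-graphical: numeric summaries of a single variable
mpg_mean   <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)

# Univariate graphical: distribution of a single variable
mpg_hist <- hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Multivariate non-graphical: relationship between two variables as a number
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)

# Multivariate graphical: relationship between two variables as a picture
plot(mtcars$wt, mtcars$mpg, xlab = "wt (1000 lbs)", ylab = "mpg")
```

The correlation comes out negative: heavier cars in this dataset tend to have lower mpg.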

Steps involved

Let's work through an example using "mtcars", a dataset built into R.

Step 1
Summarise the data: check its structure, whether it has missing values, and basic stats such as the mean, median and mode.

Code: str(mtcars)

Output:

'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
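
str() shows the structure but does not report missing values or the mean, median and mode directly. A minimal sketch of those checks follows. stat_mode is a hypothetical helper written for this example, since base R's mode() returns the storage type of an object rather than the most frequent value.

```r
# Count missing values per column, and check the whole data frame at once
na_per_column <- colSums(is.na(mtcars))
has_missing   <- anyNA(mtcars)

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

mpg_mean   <- mean(mtcars$mpg)      # arithmetic mean
mpg_median <- median(mtcars$mpg)    # middle value
cyl_mode   <- stat_mode(mtcars$cyl) # most common cylinder count

print(na_per_column)
print(c(mean = mpg_mean, median = mpg_median, cyl_mode = cyl_mode))
```

The helper returns every tied value, so it can return more than one number for a variable such as mpg where several values share the top count.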

Look at the first rows to see the variables involved

Code: head(mtcars)

Output:

Step 2

Basic stats (five-number summary)

Code: fivenum(mtcars$mpg)

Output:

10.40 15.35 19.20 22.80 33.90
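
The five values returned by fivenum() are the minimum, lower hinge, median, upper hinge and maximum. As a small sketch, comparing them with quantile() shows that Tukey's hinges are close to, but not always the same as, the quartiles R computes by default.

```r
five <- fivenum(mtcars$mpg)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

# Default quantile() quartiles (type 7 interpolation)
quart <- quantile(mtcars$mpg)

print(five)
print(quart)

# The lower hinge (15.35) differs slightly from the first quartile (15.425)
hinge_gap <- unname(five["lower_hinge"] - quart["25%"])
print(hinge_gap)
```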

Step 3

Interquartile range

Code: IQR(mtcars$mpg)

Output:

7.375
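
The interquartile range is the distance between the first and third quartiles. A quick check, assuming R's default quantile type, reproduces the value above by hand.

```r
q <- quantile(mtcars$mpg, probs = c(0.25, 0.75))

# IQR is the third quartile minus the first quartile
iqr_by_hand <- unname(q["75%"] - q["25%"])

print(iqr_by_hand)
print(IQR(mtcars$mpg))
```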

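Step 4

Boxplot

# boxplot(mtcars) draws all 11 variables on one shared axis, so the large-valued columns (disp, hp) dominate and the plot gives little insight.

Code: boxplot(mtcars$mpg)

A point is drawn as an outlier if it lies more than 3/2 (1.5) times the interquartile range beyond the lower or upper quartile.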
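
A minimal sketch of the outlier rule, using boxplot.stats(), the base R function that computes the numbers behind a boxplot. It measures the box with Tukey's hinges (as returned by fivenum()) and flags points more than 1.5 box lengths beyond either hinge. hp is used as a second example variable here purely for illustration.

```r
# Fences for mpg, computed by hand from the hinges
hinges      <- fivenum(mtcars$mpg)[c(2, 4)]
box_length  <- diff(hinges)
lower_fence <- hinges[1] - 1.5 * box_length
upper_fence <- hinges[2] + 1.5 * box_length

# Points outside the fences are the ones boxplot() draws individually
mpg_outliers <- mtcars$mpg[mtcars$mpg < lower_fence | mtcars$mpg > upper_fence]

# boxplot.stats() applies the same rule and returns the outliers in $out
mpg_out <- boxplot.stats(mtcars$mpg)$out
hp_out  <- boxplot.stats(mtcars$hp)$out

print(mpg_out)
print(hp_out)
```

For mpg no point falls outside the fences, while for hp the single value 335 does.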

Step 5

Summary

Code: summary(mtcars)

* For more insight than str, use the describe() function from the "Hmisc" package.

Code: library(Hmisc)

Code: describe(mtcars)

Code: describe(mtcars$mpg)

Output:
