Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to analysing data sets in order to summarise their main characteristics, often with visual methods.
Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments.


EDA differs from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, handling missing values, and transforming variables as needed.

The purpose of exploratory data analysis is to:

Types of Exploratory Data Analysis

EDA falls into four main areas:
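  • Univariate non-graphical: summary statistics for a single variable of interest, such as age, height or income level.
  • Univariate graphical: plots of a single variable, such as a histogram or boxplot.
  • Multivariate non-graphical: analysis of multiple variables at the same time, such as correlations.
  • Multivariate graphical: plots of the relationship between two or more variables, such as a scatter plot.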
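
As a rough sketch of these four areas, the snippet below applies one base R call per category to the built-in mtcars data. The choice of mpg and wt as the variables is an assumption made purely for illustration.

```r
# One illustrative call per EDA category, using the built-in mtcars data.
# The variables mpg and wt are picked only for illustration.
data(mtcars)

# Univariate non-graphical: numeric summary of a single variable
uni_stats <- summary(mtcars$mpg)
print(uni_stats)

# Univariate graphical: distribution of a single variable
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Multivariate non-graphical: correlation between two variables
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# Multivariate graphical: scatter plot of two variables
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```
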
Steps involved

Let's work through an example using "mtcars", a dataset that ships with R.

Step1
Get an overview of the data: its structure, whether it has missing values, and basic statistics such as the mean, median and mode.

Code: str(mtcars)

output:

'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
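
str() shows the type of each variable but does not report missing values or averages directly. A minimal sketch of those checks follows. Base R has no built-in function for the statistical mode, so stat_mode() below is a hypothetical helper written only for this example.

```r
# Count missing values per column (str() does not report these directly)
missing_per_column <- colSums(is.na(mtcars))
print(missing_per_column)

# Mean and median of one variable
mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
print(c(mean = mpg_mean, median = mpg_median))

# Hypothetical helper: the most frequent value(s); returns every tied value
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Mode of the number of cylinders
cyl_mode <- stat_mode(mtcars$cyl)
print(cyl_mode)
```

The mode is more informative for variables with few distinct values, such as cyl, than for a continuous variable such as mpg, where several values may tie.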

Preview the first few rows to see the variables involved

Code: head(mtcars)

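Output:

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1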

Step2

Basic stats: Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum)

Code: fivenum(mtcars$mpg)

Output

10.40 15.35 19.20 22.80 33.90
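
fivenum() returns an unnamed vector, which is easy to misread. A small sketch that labels the five values; the labels are added here for readability and are not part of what fivenum() returns.

```r
# Label the five values so each position is unambiguous
five <- fivenum(mtcars$mpg)
names(five) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
print(five)

# Sanity checks: the middle value is the median, the ends are the range
stopifnot(five[["median"]] == median(mtcars$mpg))
stopifnot(five[["min"]] == min(mtcars$mpg),
          five[["max"]] == max(mtcars$mpg))
```
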

Step3

Interquartile Range

Code:  IQR(mtcars$mpg)

Output

7.375
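
IQR() is the difference between the third and first quartiles as computed by quantile(). This need not equal the distance between the hinges reported by fivenum(), because the two functions define the quartiles slightly differently. A short check, with variable names chosen for this example:

```r
x <- mtcars$mpg

# IQR() is the distance between the 75th and 25th percentiles from quantile()
quartiles <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr_by_hand <- quartiles[2] - quartiles[1]

# The spread between fivenum()'s hinges can differ slightly
hinge_spread <- diff(fivenum(x)[c(2, 4)])

print(c(IQR = IQR(x), by_hand = iqr_by_hand, hinge_spread = hinge_spread))
```

For mpg the first quartile from quantile() is 15.425 while the lower hinge is 15.35, which is why IQR() gives 7.375 rather than 22.80 - 15.35 = 7.45.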

Step4

Boxplot

# boxplot(mtcars) draws all 11 variables on one shared axis, so the output gives little insight; plot one variable at a time instead.

Code:  boxplot(mtcars$mpg)


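A point is drawn as an outlier when it lies more than 1.5 (3/2) times the interquartile range below the lower quartile or above the upper quartile.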
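
The outlier rule used by boxplot() can be checked by hand. The sketch below assumes boxplot()'s default range of 1.5 and uses the hinges from fivenum() as the box edges, which is what boxplot.stats() does internally; the variable names are chosen for this example.

```r
x <- mtcars$mpg

# Box edges (hinges) and the height of the box
hinges <- fivenum(x)[c(2, 4)]
box_height <- diff(hinges)

# Fences: 1.5 box heights beyond each edge of the box
lower_fence <- hinges[1] - 1.5 * box_height
upper_fence <- hinges[2] + 1.5 * box_height

# Points outside the fences are the ones boxplot() draws individually
outliers <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() reports the same points in its $out component
stats <- boxplot.stats(x)
print(c(lower_fence = lower_fence, upper_fence = upper_fence))
print(outliers)
print(stats$out)
```
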

Step5

Summary 

Code: summary(mtcars)
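
summary() on a whole data frame prints a formatted table. For further computation it can be easier to call it on a single column, or to compute a few chosen statistics for every column. A brief sketch, where the choice of mean, median and standard deviation is only an example:

```r
# summary() on one column returns a named vector that can be indexed
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)
print(mpg_summary[["Median"]])

# A compact table of selected statistics, one column per variable
col_stats <- sapply(mtcars, function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col))
})
print(round(col_stats, 2))
```
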




* For more insight than str() gives, use the "Hmisc" package (install it once with install.packages("Hmisc")).

Code: library(Hmisc)

Code: describe(mtcars)

Code: describe(mtcars$mpg)
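
Hmisc is not part of base R, so describe() fails if the package is missing. A defensive sketch that uses Hmisc::describe() when it is available and otherwise falls back to summary(); the fallback is an assumption made for this example, not an equivalent of describe().

```r
# Use Hmisc::describe() only if the package is installed;
# otherwise fall back to base R's summary()
if (requireNamespace("Hmisc", quietly = TRUE)) {
  mpg_description <- Hmisc::describe(mtcars$mpg)
} else {
  message('Package "Hmisc" is not installed; ',
          'run install.packages("Hmisc") to use describe().')
  mpg_description <- summary(mtcars$mpg)
}
print(mpg_description)
```
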



