Exploratory Data Analysis
Exploratory data analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods.
Exploratory data analysis
was promoted by John Tukey to encourage statisticians to explore the data, and
possibly formulate hypotheses that could lead to new data collection and
experiments.
EDA differs from initial data analysis (IDA), which focuses more
narrowly on checking the assumptions required for model fitting and hypothesis
testing, handling missing values, and transforming variables as
needed.
The purpose of exploratory data analysis is to:
- Check for missing data and other mistakes.
- Gain maximum insight into the data set and its underlying structure.
- Uncover a parsimonious model, one which explains the data with a minimum number of predictor variables.
- Check assumptions associated with any model fitting or hypothesis test.
- Create a list of outliers or other anomalies.
- Find parameter estimates and their associated confidence intervals or margins of error.
- Identify the most influential variables.
Types of Exploratory Data Analysis
EDA falls into four main areas:
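- Univariate non-graphical: summarising one variable of interest, such as age, height or income level.
- Univariate graphical: plotting one variable, for example as a histogram or boxplot.
- Multivariate non-graphical: summarising several variables at the same time, for example with correlations.
- Multivariate graphical: plotting several variables together, for example as a scatter plot.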
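As a rough sketch of the four areas, here is one base R call for each, using the built-in mtcars data that the steps below also use. The choice of variables (mpg, wt, hp) is only an assumption made for illustration.

```r
# Univariate non-graphical: numeric summaries of a single variable
uni_stats <- c(mean = mean(mtcars$mpg), sd = sd(mtcars$mpg))
print(uni_stats)

# Univariate graphical: distribution of a single variable
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Multivariate non-graphical: correlations between several variables
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_mat, 2))

# Multivariate graphical: scatter plot of two variables
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
```

In the correlation matrix, mpg is negatively correlated with both wt and hp, which the scatter plot of wt against mpg also shows.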
Steps involved
Let's walk through an example using "mtcars", a dataset that is built into R.
Step1
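Start with the structure of the data: the number of observations, the variables and the type of each variable. Note that str() does not report missing values or basic stats like the mean and median; summary() in a later step covers those.
Code: str(mtcars)
Output: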
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
Look at the first few rows to see the variables involved:
Code: head(mtcars)
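Neither str() nor head() reports missing values directly. A minimal sketch of one way to check for them in base R follows; mtcars happens to have no missing values, so every count is zero here.

```r
# Count missing values (NA) in each column of mtcars
na_per_column <- colSums(is.na(mtcars))
print(na_per_column)

# TRUE if any value in the whole data frame is missing
has_missing <- anyNA(mtcars)
print(has_missing)

# Number of rows that have no missing value in any column
n_complete <- sum(complete.cases(mtcars))
print(n_complete)
```

On a data set that does contain NA values, na_per_column shows which variables are affected and n_complete shows how many rows would survive dropping incomplete cases.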
Step2
Basic stats: Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum)
Code: fivenum(mtcars$mpg)
Output:
10.40 15.35 19.20 22.80 33.90
Step3
Interquartile range: the difference between the third and the first quartile
Code: IQR(mtcars$mpg)
Output:
7.375
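The IQR of 7.375 is not the same as the distance between the two hinges from fivenum() (22.80 - 15.35 = 7.45). This is because IQR() is based on quantile(), which interpolates differently from Tukey's hinges. A small sketch that shows both calculations side by side:

```r
mpg <- mtcars$mpg

# Quartiles as computed by quantile() with its default method (type 7)
q <- quantile(mpg, probs = c(0.25, 0.75))

# IQR() is the distance between those two quartiles
iqr_manual <- unname(q[2] - q[1])
print(iqr_manual)

# Tukey's hinges from fivenum() can differ slightly from the quartiles
five <- fivenum(mpg)
hinge_spread <- five[4] - five[2]
print(hinge_spread)
```

For most purposes the two measures of spread are interchangeable; the difference only matters when comparing numbers produced by different functions.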
Step4
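Boxplot
Code: boxplot(mtcars$mpg)
Calling boxplot(mtcars) on the whole data frame gives little insight, because the variables are on very different scales, so plot one variable at a time.
A point is drawn as an outlier when it lies more than 1.5 times the interquartile range below the lower quartile or above the upper quartile.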
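To make the 1.5 rule behind the boxplot concrete, here is a minimal sketch that computes the outliers by hand. The helper name find_outliers is hypothetical (not part of R); it uses the hinges from fivenum(), which is what boxplot() itself works with.

```r
# Hypothetical helper: values further than coef * (hinge spread)
# below the lower hinge or above the upper hinge
find_outliers <- function(x, coef = 1.5) {
  five <- fivenum(x)
  spread <- five[4] - five[2]
  lower <- five[2] - coef * spread
  upper <- five[4] + coef * spread
  x[x < lower | x > upper]
}

mpg_out <- find_outliers(mtcars$mpg)
hp_out <- find_outliers(mtcars$hp)
print(mpg_out)
print(hp_out)

# boxplot.stats() applies the same rule and returns the flagged points in $out
print(boxplot.stats(mtcars$hp)$out)
```

For mpg no point falls outside the fences, so the boxplot shows no outliers; for hp a single car is flagged.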
Step5
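Summary
Code: summary(mtcars)
This gives the minimum, quartiles, median, mean and maximum of every variable, plus a count of missing values for any variable that has them.
* For more insight than str or summary, use describe() from the "Hmisc" library (it must be installed and loaded first).
Code: library(Hmisc)
Code: describe(mtcars)
Code: describe(mtcars$mpg)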
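summary() reports the mean and median but not the mode, and base R has no built-in function for the statistical mode (mode() returns the storage type of an object instead). One possible sketch, where stat_mode is a hypothetical helper written for this example:

```r
# Hypothetical helper: most frequent value of a numeric vector
# (if several values tie, the smallest of them is returned)
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

mpg_mean <- mean(mtcars$mpg)
mpg_median <- median(mtcars$mpg)
cyl_mode <- stat_mode(mtcars$cyl)

print(mpg_mean)
print(mpg_median)
print(cyl_mode)
```

The mode is most useful for variables with few distinct values, such as cyl; for a continuous variable like mpg many values tie, so the result is less meaningful.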