Data Preprocessing

February 04, 2018

The example we practice on has 4 variables, two of which (Age and Salary) contain missing values.

Country	Age	Salary	Purchased
France	44	72000	No
Spain	27	48000	Yes
Germany	30	54000	No
Spain	38	61000	No
Germany	40		Yes
France	35	58000	Yes
Spain		52000	No
France	48	79000	Yes
Germany	50	83000	No
France	37	67000	Yes

#data preprocessing

#working directory

Set the working directory graphically (through the IDE's menus) rather than in code.

#importing data

Step 1

datasets = read.csv('data.csv')

#Missing Values

The Age and Salary columns have missing values.
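To confirm which columns have gaps, the table can be typed in directly and the NA cells counted per column. This is only a sketch: it assumes the blank cells in the table above are read as NA, which is what read.csv does for empty numeric fields.

```r
# Recreate the example table in code; NA marks the two blank cells.
datasets = data.frame(
  Country   = c('France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA,
                58000, 52000, 79000, 83000, 67000),
  Purchased = c('No', 'Yes', 'No', 'No', 'Yes',
                'Yes', 'No', 'Yes', 'No', 'Yes'),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
missing_per_column = colSums(is.na(datasets))
print(missing_per_column)  # Age and Salary each show 1, the others 0
```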

Replacing missing values in Age column

datasets$Age = ifelse(is.na(datasets$Age),
                      ave(datasets$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      datasets$Age)

Replacing missing values in Salary column

datasets$Salary = ifelse(is.na(datasets$Salary),
                         ave(datasets$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         datasets$Salary)
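The same idea, replacing each NA with the mean of the observed values, can be seen on the Age column alone. The helper name impute_mean is made up for this illustration and is not part of the code above.

```r
# Hypothetical helper: replace NA entries of a numeric vector
# with the mean of the non-missing entries.
impute_mean = function(x) {
  x[is.na(x)] = mean(x, na.rm = TRUE)
  x
}

# The Age column from the example table (row 7 is missing).
age = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
age_filled = impute_mean(age)

# The filled-in value is the mean of the nine observed ages, 349 / 9.
print(age_filled[7])
```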

#Encoding Categorical Data

The Country and Purchased columns hold categorical data.

Encoding Country

*c() creates a vector; here it holds the 3 country names and the 3 labels that replace them

datasets$Country = factor(datasets$Country,
                          levels = c('France', 'Spain', 'Germany'),
                          labels = c(1, 2, 3))

Encoding Purchased

datasets$Purchased = factor(datasets$Purchased,
                            levels = c('No', 'Yes'),
                            labels = c(0, 1))
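It helps to look at what factor() actually returns. The small check below uses a made-up four-value vector and shows that the labels are stored as text, so the encoded column is a factor, not a numeric column.

```r
# A short stand-in for the Country column.
country = c('France', 'Spain', 'Germany', 'Spain')

encoded = factor(country,
                 levels = c('France', 'Spain', 'Germany'),
                 labels = c(1, 2, 3))

print(encoded)              # values 1 2 3 2
print(levels(encoded))      # the labels are kept as the strings "1" "2" "3"
print(is.numeric(encoded))  # FALSE: a factor is not a numeric vector

# If real numbers are needed later, convert through character first.
as_numbers = as.numeric(as.character(encoded))
print(as_numbers)
```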

#Splitting Data into Training Set & Test Set

Step 1

set.seed(123)

Step 2

*Why Purchased? The split should be made on the dependent variable.

# sample.split() comes from the caTools package
library(caTools)

split = sample.split(datasets$Purchased, SplitRatio = 0.8)

training_set = subset(datasets, split == TRUE)

test_set = subset(datasets, split == FALSE)
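For readers without caTools installed, here is a rough base R sketch of a split that keeps the Yes/No proportions the same in both sets. It is a hand-written illustration, not the caTools implementation, and the names rows_by_class, train_idx and in_train are made up for it.

```r
set.seed(123)

# The Purchased column from the example table.
purchased = factor(c('No', 'Yes', 'No', 'No', 'Yes',
                     'Yes', 'No', 'Yes', 'No', 'Yes'))

# Row numbers grouped by class: one vector for 'No', one for 'Yes'.
rows_by_class = split(seq_along(purchased), purchased)

# Keep 80% of the rows of each class for training.
train_idx = unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

# TRUE for training rows, FALSE for test rows.
in_train = seq_along(purchased) %in% train_idx
print(table(purchased, in_train))
```

With 5 rows per class, each class contributes 4 rows to the training set and 1 to the test set.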

#scaling the data

1. Standardisation method

2. Normalisation method
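The two methods can be written out by hand on a small vector. This sketch assumes the usual definitions: standardisation subtracts the mean and divides by the standard deviation, and normalisation rescales to the range 0 to 1 using the minimum and maximum. The sample ages are picked only for illustration.

```r
x = c(27, 30, 38, 44, 50)

# Standardisation: (x - mean) / standard deviation
standardised = (x - mean(x)) / sd(x)

# Normalisation: (x - min) / (max - min)
normalised = (x - min(x)) / (max(x) - min(x))

print(standardised)
print(normalised)

# scale() with its default arguments performs standardisation.
same_as_scale = isTRUE(all.equal(as.numeric(scale(x)), standardised))
print(same_as_scale)
```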

training_set[, 2:3] = scale(training_set[, 2:3])

test_set[, 2:3] = scale(test_set[, 2:3])

Why only columns 2 and 3?

Because columns 1 and 4 (Country and Purchased) look numeric but are factors holding encoded categorical values, and scale() only works on numeric columns.
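A quick way to see this is to try scaling a data frame that still contains a factor column. The tiny data frame below is made up for the demonstration.

```r
# One encoded (factor) column and one numeric column.
df = data.frame(Country = factor(c(1, 2, 3)), Age = c(44, 27, 30))

# Scaling the whole data frame fails because Country is a factor.
whole = tryCatch(scale(df), error = function(e) e)
print(inherits(whole, 'error'))

# Scaling only the numeric column works.
scaled_age = scale(df[, 'Age', drop = FALSE])
print(is.numeric(scaled_age))
```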
