Data Preprocessing

The example dataset we practice on has 4 variables (Country, Age, Salary and Purchased), and some of its values are missing.


Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   40    NA       Yes
France    35    58000    Yes
Spain     NA    52000    No
France    48    79000    Yes
Germany   50    83000    No
France    37    67000    Yes


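#Data Preprocessing

#Working Directory

Set the working directory graphically (for example, in RStudio: Session > Set Working Directory > Choose Directory) so that it points to the folder containing data.csv.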
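
The working directory can also be set from code instead of through the menus. The lines below are a minimal sketch: project_dir is a hypothetical stand-in (here a temporary folder) for the folder that actually holds data.csv.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# Hypothetical project folder; replace tempdir() with the real path
# to the folder that contains data.csv.
project_dir <- tempdir()

# Switch to the project folder and confirm the change.
setwd(project_dir)
print(getwd())

# Switch back to where we started.
setwd(old_dir)
```

After setwd(), relative file names such as 'data.csv' are resolved against the new folder.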

#Importing Data

Step 1
datasets = read.csv('data.csv')


#Missing Values

The Age and Salary columns have missing values.
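
One way to confirm which columns contain missing values is to count the NA entries per column. The sketch below rebuilds the example table by hand (instead of reading data.csv) so that it runs on its own; the variable names na_counts and incomplete_rows are only illustrative.

```r
# Rebuild the example table by hand; NA marks the two missing cells.
datasets <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA,
                58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes",
                "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(datasets))
print(na_counts)

# Row numbers that contain at least one missing value.
incomplete_rows <- which(!complete.cases(datasets))
print(incomplete_rows)
```

For this table the counts are 1 for Age, 1 for Salary and 0 for the other two columns, and the incomplete rows are 5 and 7.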

Replacing missing values in the Age column with the column mean


datasets$Age = ifelse(is.na(datasets$Age),
                      ave(datasets$Age, FUN = function(x) mean(x,na.rm=TRUE)),
                      datasets$Age)

Replacing missing values in the Salary column with the column mean


datasets$Salary = ifelse(is.na(datasets$Salary),
                         ave(datasets$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         datasets$Salary)
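
To see what the ifelse() and ave() combination does, it may help to run it on the Age values alone. This is a small sketch with the Age column typed in by hand; age_mean, age_filled and same_result are illustrative names.

```r
# Age column from the example table; the 7th value is missing.
age <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)

# Mean of the 9 known ages: 349 / 9, roughly 38.78.
age_mean <- mean(age, na.rm = TRUE)

# Keep known values, put the mean where the value is missing.
age_filled <- ifelse(is.na(age), age_mean, age)

# The pattern used in the post: without a grouping variable, ave()
# returns the mean repeated for every position, so the result is the same.
same_result <- ifelse(is.na(age),
                      ave(age, FUN = function(x) mean(x, na.rm = TRUE)),
                      age)

print(age_filled)
print(all.equal(age_filled, same_result))
```

Only the missing entry changes; the nine known ages are left as they were.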


#Encoding Categorical Data

The Country and Purchased columns hold categorical data.

Encoding Country

*c() creates a vector; here it holds the 3 country names (the levels) and the 3 labels assigned to them

datasets$Country = factor(datasets$Country,
                          levels =c('France', 'Spain', 'Germany'),
                          labels = c(1,2,3))

Encoding Purchased

datasets$Purchased = factor(datasets$Purchased,
                          levels =c('No', 'Yes'),
                          labels = c(0,1))
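
The following sketch shows what factor() returns for a few country names. It is only an illustration on a hand-typed vector; the names country, encoded and codes are assumptions made for this example.

```r
# A few country names, as they would appear in the Country column.
country <- c("France", "Spain", "Germany", "Spain")

# Same encoding as in the post: France -> 1, Spain -> 2, Germany -> 3.
encoded <- factor(country,
                  levels = c("France", "Spain", "Germany"),
                  labels = c(1, 2, 3))
print(encoded)

# The labels are stored as text, so the result is a factor, not a number.
print(levels(encoded))      # "1" "2" "3"
print(is.numeric(encoded))  # FALSE

# If plain numbers are needed, convert through character first.
codes <- as.numeric(as.character(encoded))
print(codes)                # 1 2 3 2
```

The encoded column looks numeric when printed, but R still treats it as categorical.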


#Splitting Data into Training Set & Test Set

Step 1

set.seed(123)

Step 2

*Why Purchased? The split should be based on the dependent variable, so that both sets keep a similar share of Yes and No.

# sample.split() comes from the caTools package
library(caTools)
split = sample.split(datasets$Purchased, SplitRatio = 0.8)

training_set = subset(datasets, split == TRUE)
test_set = subset(datasets, split == FALSE)
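
To illustrate the idea of splitting on the dependent variable without installing any package, here is a rough base R sketch. It is not the caTools implementation, only an approximation of the same idea: within each Purchased class, about 80% of the rows are drawn for training. The id column and the in_training vector are hypothetical helpers for this example.

```r
set.seed(123)

# Purchased column from the example table, plus a row id for reference.
datasets <- data.frame(
  id        = 1:10,
  Purchased = c("No", "Yes", "No", "No", "Yes",
                "Yes", "No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Mark roughly 80% of the rows of each class as training rows.
in_training <- rep(FALSE, nrow(datasets))
for (cls in unique(datasets$Purchased)) {
  idx     <- which(datasets$Purchased == cls)
  n_train <- round(0.8 * length(idx))
  chosen  <- idx[sample.int(length(idx), n_train)]
  in_training[chosen] <- TRUE
}

training_set <- subset(datasets, in_training)
test_set     <- subset(datasets, !in_training)

print(table(training_set$Purchased))
print(table(test_set$Purchased))
```

With 5 No rows and 5 Yes rows, this leaves 4 of each class in the training set and 1 of each in the test set.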




#Scaling the Data

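1. Standardisation method: (x - mean(x)) / sd(x)
2. Normalisation method: (x - min(x)) / (max(x) - min(x))

The scale() function used here applies standardisation by default.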
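
To make the two methods concrete, here is a small sketch that applies both by hand to a few salary values and compares the standardised result with scale(). The vector salary and the other variable names are illustrative only.

```r
# A few salary values taken from the example table.
salary <- c(72000, 48000, 54000, 61000, 58000)

# Standardisation: subtract the mean, divide by the standard deviation.
standardised <- (salary - mean(salary)) / sd(salary)

# Normalisation: rescale so the smallest value is 0 and the largest is 1.
normalised <- (salary - min(salary)) / (max(salary) - min(salary))

# scale() with its default arguments performs standardisation.
via_scale <- as.numeric(scale(salary))

print(standardised)
print(normalised)
print(all.equal(standardised, via_scale))
```

Standardised values are centred on 0, while normalised values always fall between 0 and 1.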






training_set[, 2:3] = scale(training_set[, 2:3])

test_set[, 2:3] = scale(test_set[, 2:3])

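Why only columns 2 and 3?

Because columns 1 and 4 (Country and Purchased) only look numeric. They are factors whose labels encode categorical values, so they are not numeric data and scale() cannot be applied to them.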
