Data Preprocessing
The example dataset we practice on has 4 variables, and two of them (Age and Salary) contain missing values.
Country | Age | Salary | Purchased
France | 44 | 72000 | No
Spain | 27 | 48000 | Yes
Germany | 30 | 54000 | No
Spain | 38 | 61000 | No
Germany | 40 | | Yes
France | 35 | 58000 | Yes
Spain | | 52000 | No
France | 48 | 79000 | Yes
Germany | 50 | 83000 | No
France | 37 | 67000 | Yes
#data preprocessing
#working directory
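Set the working directory graphically (for example, from the IDE's menu) to the folder that contains data.csv.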
#importing data
Step 1
datasets = read.csv('data.csv')
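As a quick check after importing, the missing values can be counted per column. The sketch below is only an illustration: it reads an inline copy of the table above instead of data.csv (the variable names csv_text and na_counts are made up for this example, and it assumes the file has the same header and rows).

```r
# Inline copy of the example table, so the snippet runs without data.csv.
# Assumption: the real file has the same header and rows.
csv_text <- "Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes"

# Blank numeric fields are read as NA
datasets <- read.csv(text = csv_text)

# Number of missing values in each column
na_counts <- colSums(is.na(datasets))
print(na_counts)
```

Age and Salary should each report one missing value, which is what the next step replaces.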
#Missing Values
The Age and Salary columns have missing values.
Replacing missing values in the Age column with the column mean
datasets$Age = ifelse(is.na(datasets$Age),
                      ave(datasets$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      datasets$Age)
Replacing missing values in the Salary column with the column mean
datasets$Salary = ifelse(is.na(datasets$Salary),
                         ave(datasets$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         datasets$Salary)
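To see what the replacement does, here is a small sketch on a standalone vector holding the Age values from the table. The names age, age_mean and age_filled are only for illustration. It also shows that ave() without a grouping variable simply returns the overall mean repeated for every position.

```r
# Age column from the example table, with the one missing value as NA
age <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)

# Mean of the known values only
age_mean <- mean(age, na.rm = TRUE)

# Keep known values, put the mean where the value is missing
age_filled <- ifelse(is.na(age), age_mean, age)

# ave() with no grouping returns the mean repeated for each element
age_ave <- ave(age, FUN = function(x) mean(x, na.rm = TRUE))

print(age_filled)
```

Only the seventh value changes; the other nine stay as they were.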
#Encoding Categorical Data
The Country and Purchased columns hold categorical data.
Encoding Country
*c() creates a vector; here it holds the 3 country names and the 3 labels assigned to them
datasets$Country = factor(datasets$Country,
                          levels = c('France', 'Spain', 'Germany'),
                          labels = c(1, 2, 3))
Encoding Purchased
datasets$Purchased = factor(datasets$Purchased,
                            levels = c('No', 'Yes'),
                            labels = c(0, 1))
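The following sketch shows what factor() returns for a few country values. The vector names are illustrative. Note that the labels are stored as text levels, so the result is a factor and not a numeric column, even though it prints as 1, 2, 3.

```r
# A few values from the Country column
country <- c("France", "Spain", "Germany", "Spain")

# Map each level to a label, in the same order
encoded <- factor(country,
                  levels = c("France", "Spain", "Germany"),
                  labels = c(1, 2, 3))

print(encoded)
print(levels(encoded))      # labels are stored as character levels
print(is.numeric(encoded))  # a factor is not numeric

# To get plain integers back, convert through character first
codes <- as.integer(as.character(encoded))
print(codes)
```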
#Splitting Data into Training Set & Test Set
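Step 1
library(caTools)  # provides sample.split
set.seed(123)
Step 2
*Why Purchased: the split should be made on the dependent variable, so that both sets keep a similar proportion of Yes and No.
split = sample.split(datasets$Purchased, SplitRatio = 0.8)
training_set = subset(datasets, split == TRUE)
test_set = subset(datasets, split == FALSE)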
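To illustrate the idea of splitting on the dependent variable, here is a rough base R sketch that draws 80% of the rows from each class separately. This is not how sample.split is implemented; it is a hand-written alternative that needs no package, and names such as purchased, train_idx and in_train are made up for the example.

```r
set.seed(123)  # make the random draw repeatable

# Purchased column from the example table (5 No, 5 Yes)
purchased <- factor(c("No", "Yes", "No", "No", "Yes",
                      "Yes", "No", "Yes", "No", "Yes"))

# Row numbers grouped by class
rows_by_class <- split(seq_along(purchased), purchased)

# From each class, pick 80% of its rows for the training set
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  n_train <- round(0.8 * length(idx))
  idx[sample.int(length(idx), size = n_train)]
}))

# TRUE for training rows, FALSE for test rows
in_train <- seq_along(purchased) %in% train_idx

print(table(purchased[in_train]))   # class counts in the training set
print(table(purchased[!in_train]))  # class counts in the test set
```

With 5 rows per class, each class contributes 4 rows to the training set and 1 to the test set, so the Yes/No proportion is the same in both.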
#scaling the data
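There are two common methods:
1. Standardisation method
2. Normalisation method
scale() applies standardisation by default: it subtracts the column mean and divides by the column standard deviation.
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])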
Why only columns 2 & 3?
Because columns 1 & 4 look numeric but are factors holding encoded categorical values, and scale() only works on numeric columns.
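The two scaling methods can be compared on a small numeric vector. This is a minimal sketch using four Salary values from the table; the variable names are illustrative. The first formula is what scale() computes with its default arguments, the second is the min-max form of normalisation.

```r
# Four Salary values from the example table
salary <- c(72000, 48000, 54000, 61000)

# Standardisation: (x - mean) / standard deviation
standardised <- (salary - mean(salary)) / sd(salary)

# scale() with default arguments gives the same numbers
from_scale <- as.numeric(scale(salary))
print(all.equal(standardised, from_scale))

# Normalisation (min-max): (x - min) / (max - min), result lies in [0, 1]
normalised <- (salary - min(salary)) / (max(salary) - min(salary))
print(normalised)
```

After standardisation the values have mean 0 and standard deviation 1; after normalisation the smallest value becomes 0 and the largest becomes 1.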