Simple Linear Regression
The dataset is named data2.
YearsExperience | Salary
1.1 | 39343
1.3 | 46205
1.5 | 37731
2 | 43525
2.2 | 39891
2.9 | 56642
3 | 60150
3.2 | 54445
3.2 | 64445
3.7 | 57189
3.9 | 63218
4 | 55794
4 | 56957
4.1 | 57081
4.5 | 61111
4.9 | 67938
5.1 | 66029
5.3 | 83088
5.9 | 81363
6 | 93940
6.8 | 91738
7.1 | 98273
7.9 | 101302
8.2 | 113812
8.7 | 109431
9 | 105582
9.5 | 116969
9.6 | 112635
10.3 | 122391
10.5 | 121872
#Import data2
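The import step itself is not shown in the post. One way to follow along without the original file is to type the table above into a data frame by hand. This is only a sketch: building data2 with data.frame() is an assumption, not the author's import method.

```r
# Rebuild data2 from the table above.
# Assumption: the original post imported it from a file that is not shown.
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Quick sanity check: 30 observations of 2 numeric variables
str(data2)
```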
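# Splitting the data into a training set and a test set
# sample.split() comes from the caTools package
library(caTools)
Step 1
# Fix the random seed so the split is reproducible
set.seed(123)
Step 2
split = sample.split(data2$Salary, SplitRatio = 2/3)
training_set = subset(data2, split == TRUE)
test_set = subset(data2, split == FALSE)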
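sample.split() returns a logical vector marking which rows go to the training set. If caTools is not installed, a similar random 2/3 split can be sketched in base R. The names idx, train_base and test_base are hypothetical, and sample() will pick different rows than sample.split(), so the summary output further down will not be reproduced exactly.

```r
# data2 rebuilt from the table at the top of the post
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

set.seed(123)                              # reproducible random draw
n   <- nrow(data2)
idx <- sample(n, size = round(2/3 * n))    # row numbers for the training set

train_base <- data2[idx, ]                 # 2/3 of the rows
test_base  <- data2[-idx, ]                # the remaining 1/3

nrow(train_base)   # 20
nrow(test_base)    # 10
```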
# Feature scaling is not needed here; lm() takes care of it internally
# Fitting simple linear regression to the training set
# regressor stores the fitted model object returned by lm()
Step 3
regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
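With a single predictor, the least-squares line that lm() fits has a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). The sketch below checks this against lm(). It fits all 30 rows of data2 rather than the training set, so its coefficients will differ from the ones in the post; the names slope, intercept and fit are hypothetical.

```r
# data2 rebuilt from the table at the top of the post
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

x <- data2$YearsExperience
y <- data2$Salary

# Closed-form least-squares estimates for one predictor
slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same line fitted by lm()
fit <- lm(Salary ~ YearsExperience, data = data2)

# Both routes give the same coefficients (up to floating-point tolerance)
all.equal(unname(coef(fit)), c(intercept, slope))   # TRUE
```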
# Check the summary of regressor to see how the variables behave
summary(regressor)
OUTPUT:
> summary(regressor)
Call:
lm(formula = Salary ~ YearsExperience, data = training_set)
Residuals:
    Min      1Q  Median      3Q     Max
-7325.1 -3814.4   427.7  3559.7  8884.6
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)        25592       2646   9.672 1.49e-08 ***
YearsExperience     9365        421  22.245 1.52e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5391 on 18 degrees of freedom
Multiple R-squared: 0.9649, Adjusted R-squared: 0.963
F-statistic: 494.8 on 1 and 18 DF, p-value: 1.524e-14
#How to understand the output
*** means the p-value is below 0.001: the independent variable (YearsExperience) has a highly statistically significant effect on the dependent variable (Salary).
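The other columns of the coefficient table can be read the same way. As a small worked check, the sketch below recomputes the t values from the printed estimates and standard errors, and uses the fitted line Salary = 25592 + 9365 * YearsExperience to predict a salary. The 5-year employee is a hypothetical example, and because the inputs are the rounded numbers printed above, the results agree with the table only up to rounding.

```r
# Values copied from the coefficient table in the summary output
est <- c(Intercept = 25592, YearsExperience = 9365)   # Estimate column
se  <- c(Intercept = 2646,  YearsExperience = 421)    # Std. Error column

# t value = Estimate / Std. Error
t_value <- est / se
round(t_value, 3)

# With a single predictor, the F-statistic is the squared t value of the slope
f_stat <- t_value[["YearsExperience"]]^2
round(f_stat, 1)

# Predicted salary for a hypothetical employee with 5 years of experience
salary_5 <- est[["Intercept"]] + est[["YearsExperience"]] * 5
salary_5   # 72417
```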
# Predicting the test set results
Step 4
# y_pred is a vector that stores the predicted salaries
y_pred = predict(regressor, newdata = test_set)
# Print the predicted values
y_pred
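For a model like this one, predict() evaluates intercept + slope * YearsExperience for every row of newdata. The sketch below shows the equivalence. To stay self-contained it uses a base-R split instead of sample.split(), so the rows (and therefore the coefficients) differ from the post; train, test, fit and by_hand are hypothetical names.

```r
# data2 rebuilt from the table at the top of the post
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Base-R stand-in for sample.split(): 20 training rows, 10 test rows
set.seed(123)
idx   <- sample(nrow(data2), size = 20)
train <- data2[idx, ]
test  <- data2[-idx, ]

fit <- lm(Salary ~ YearsExperience, data = train)
b   <- coef(fit)

# Prediction by hand: intercept + slope * x for each test row
by_hand <- b[["(Intercept)"]] + b[["YearsExperience"]] * test$YearsExperience

# Same numbers as predict()
all.equal(unname(predict(fit, newdata = test)), by_hand)   # TRUE
```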
Step 5
# Visualising the training set results
# Load the ggplot2 library
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             color = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (training set)') +
  xlab("Years of experience") +
  ylab("Salary")
Plot interpretation
The red points are the real salaries of the employees.
The blue line is the fitted linear regression model.
# Visualising the test set results
# Load the ggplot2 library if it is not already loaded
library(ggplot2)
ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             color = 'red') +
  # The line is still drawn from the training set: it is the same fitted regression line
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (test set)') +
  xlab("Years of experience") +
  ylab("Salary")
Plot interpretation
A few predictions are very close to the real values, but not every point falls on the line.
There is a clear but not perfectly linear dependence between the two variables.
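How close the predictions are can also be put into numbers. A minimal sketch, assuming a base-R split in place of sample.split() (so the values will not match the post's own split): it lists the error for each test row and computes the root mean squared error (RMSE). The names train, test, fit, pred, errors and rmse are hypothetical.

```r
# data2 rebuilt from the table at the top of the post
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Base-R stand-in for sample.split(): 20 training rows, 10 test rows
set.seed(123)
idx   <- sample(nrow(data2), size = 20)
train <- data2[idx, ]
test  <- data2[-idx, ]

fit  <- lm(Salary ~ YearsExperience, data = train)
pred <- predict(fit, newdata = test)

# Error for each test row: real salary minus predicted salary
errors <- test$Salary - pred
data.frame(YearsExperience = test$YearsExperience,
           actual    = test$Salary,
           predicted = round(pred),
           error     = round(errors))

# Typical size of a prediction error, in the same units as Salary
rmse <- sqrt(mean(errors^2))
rmse
```

A small RMSE relative to the salaries themselves means the blue line tracks the red test points closely.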