Simple Linear Regression

The dataset is named data2. It has 30 rows and two columns: YearsExperience and Salary.

YearsExperience    Salary
1.1                39343
1.3                46205
1.5                37731
2                  43525
2.2                39891
2.9                56642
3                  60150
3.2                54445
3.2                64445
3.7                57189
3.9                63218
4                  55794
4                  56957
4.1                57081
4.5                61111
4.9                67938
5.1                66029
5.3                83088
5.9                81363
6                  93940
6.8                91738
7.1                98273
7.9                101302
8.2                113812
8.7                109431
9                  105582
9.5                116969
9.6                112635
10.3               122391
10.5               121872

#Import data2
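
The import step has no code in this post. Here is a minimal sketch, assuming the table above is typed in by hand rather than read from a file. If the data is saved as a CSV, read.csv() with your own file name would do the same job.

```r
# Build data2 directly from the table above (30 rows, 2 columns)
data2 = data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

# Quick look at the structure to confirm both columns are numeric
str(data2)
```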

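#Splitting data into training set & test set

Step 1

#sample.split() comes from the caTools package
library(caTools)
set.seed(123)

Step 2

#2/3 of the rows go to the training set, the rest to the test set
split = sample.split(data2$Salary, SplitRatio = 2/3)
training_set = subset(data2, split == TRUE)
test_set = subset(data2, split == FALSE)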


#feature scaling is taken care of internally by lm(), so no manual scaling is needed

#fitting simple linear regression to the training set

#regressor is the fitted model object returned by lm()

Step 3

regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)

#check the summary of regressor for more information on how the variables are behaving

summary(regressor)

OUTPUT:

summary(regressor)

Call:
lm(formula = Salary ~ YearsExperience, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max
-7325.1 -3814.4   427.7  3559.7  8884.6

Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)        25592       2646   9.672 1.49e-08 ***
YearsExperience     9365        421  22.245 1.52e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5391 on 18 degrees of freedom
Multiple R-squared:  0.9649,   Adjusted R-squared:  0.963
F-statistic: 494.8 on 1 and 18 DF,  p-value: 1.524e-14


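#How to read the output

*** next to a coefficient means its p-value is below 0.001, so the independent variable (YearsExperience) has a highly statistically significant effect on the dependent variable (Salary).

The YearsExperience estimate of 9365 means each extra year of experience adds about 9365 to the predicted salary.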
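
As a worked check of the summary output, the fitted line is roughly Salary = 25592 + 9365 * YearsExperience. The sketch below plugs in the rounded coefficients printed in the output. The 5-year input is a made-up example, not a row from the data. It also recomputes the t values as Estimate / Std. Error, which should land close to the 9.672 and 22.245 shown in the table (small differences come from rounding).

```r
# Rounded coefficients copied from the summary output
intercept = 25592
slope = 9365

# Hypothetical input: an employee with 5 years of experience
years = 5
predicted_salary = intercept + slope * years
predicted_salary   # 72417

# t value = Estimate / Std. Error, using the rounded figures from the table
t_intercept = 25592 / 2646
t_slope = 9365 / 421
round(c(t_intercept, t_slope), 3)
```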

#predicting test set results

Step 4

#y_pred is a vector that stores the predicted salaries for the test set

y_pred = predict(regressor ,newdata =test_set)

#to see predicted values

y_pred
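
To judge how good the predictions are, they can be put next to the actual test salaries. This is a self-contained sketch with a few assumptions: data2 is rebuilt from the table, the split uses base R sample() as a stand-in for caTools' sample.split() (so the rows chosen will differ from the ones in this post), and RMSE is picked here only as an illustrative error measure.

```r
# Rebuild data2 from the table in this post
data2 = data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

set.seed(123)
# Base R stand-in for sample.split(): pick 20 of the 30 rows for training
train_idx = sample(nrow(data2), size = 20)
training_set = data2[train_idx, ]
test_set = data2[-train_idx, ]

# Fit on the training rows, predict on the held-out rows
regressor = lm(Salary ~ YearsExperience, data = training_set)
y_pred = predict(regressor, newdata = test_set)

# Side-by-side view of actual and predicted salaries
comparison = data.frame(YearsExperience = test_set$YearsExperience,
                        Actual = test_set$Salary,
                        Predicted = round(y_pred))
comparison

# Root mean squared error on the test set (lower is better)
rmse = sqrt(mean((test_set$Salary - y_pred)^2))
rmse
```

The rmse value is in the same units as Salary, so it reads as a typical size of the prediction error.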

Step 5

#visualising the training set results

#load ggplot2 library
library(ggplot2)

ggplot()+
  geom_point(aes(x=training_set$YearsExperience, y=training_set$Salary),
             color = 'red')+
  geom_line(aes(x=training_set$YearsExperience, y=predict(regressor ,newdata = training_set)),
            color = 'blue')+
  ggtitle('salary vs experience(training_set)')+
  xlab("years of experience")+
  ylab("salary")


Plot interpretation
Red points are the real salaries of employees.
The blue line is the salary predicted by the linear regression model.





#visualising the test set results

#load ggplot2 library
library(ggplot2)

#the blue line is still drawn from training_set on purpose: the regressor was fitted on it, so the line is the same model
ggplot()+
  geom_point(aes(x=test_set$YearsExperience, y=test_set$Salary),
             color = 'red')+
  geom_line(aes(x=training_set$YearsExperience, y=predict(regressor ,newdata = training_set)),
            color = 'blue')+
  ggtitle('salary vs experience(test_set)')+
  xlab("years of experience")+
  ylab("salary")



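Plot interpretation

A few predictions are very close to the real values, while others fall some distance from the line.
The dependence between the two variables is not perfectly linear; salary rises with experience only approximately along a straight line.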




