Simple Linear Regression

February 06, 2018

The data set is named data2.

YearsExperience	Salary
1.1	39343
1.3	46205
1.5	37731
2	43525
2.2	39891
2.9	56642
3	60150
3.2	54445
3.2	64445
3.7	57189
3.9	63218
4	55794
4	56957
4.1	57081
4.5	61111
4.9	67938
5.1	66029
5.3	83088
5.9	81363
6	93940
6.8	91738
7.1	98273
7.9	101302
8.2	113812
8.7	109431
9	105582
9.5	116969
9.6	112635
10.3	122391
10.5	121872

#Import data2
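The post does not show the import step itself. A minimal sketch, assuming the table above is simply typed in as a data frame (reading it from a saved file with read.csv() would work just as well), could look like this:

```r
# Build data2 directly from the table above (30 rows, 2 columns).
data2 <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.2, 3.7,
                      3.9, 4.0, 4.0, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6.0,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9.0, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Quick sanity checks on the imported data
str(data2)
nrow(data2)
```

Checking the structure and row count right after import helps catch a mistyped or misread column before the split.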

#Splitting Data into Training Set & Test Set

Step 1

set.seed(123)

Step 2

#sample.split() comes from the caTools package

library(caTools)

split = sample.split(data2$Salary, SplitRatio = 2/3)

training_set = subset(data2, split == TRUE)

test_set = subset(data2, split == FALSE)
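To illustrate what the split does, here is a rough base-R sketch of a random 2/3 split. It is not the same algorithm as sample.split() and will not pick the same rows as the post; the helper name split_rows and the use of R's built-in cars data set are assumptions made only for this example.

```r
# Hypothetical helper: randomly assign rows to a training and a test set.
split_rows <- function(df, ratio = 2/3, seed = 123) {
  set.seed(seed)                      # make the random split reproducible
  n_train <- round(nrow(df) * ratio)  # number of rows kept for training
  train_idx <- sample(seq_len(nrow(df)), size = n_train)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Demonstrated on R's built-in cars data set (50 rows)
parts <- split_rows(cars)
nrow(parts$train)
nrow(parts$test)
```

Every row ends up in exactly one of the two sets, which is the property the training/test split relies on.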

#no manual feature scaling is needed for lm()

#fitting simple linear regression to the training set

#regressor stores the fitted model object returned by lm()

Step 3

regressor = lm(formula = Salary ~ YearsExperience,

data = training_set)

#check the summary of regressor for more information on how the variables behave

summary(regressor)

OUTPUT:

summary(regressor)

Call:
lm(formula = Salary ~ YearsExperience, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max
-7325.1 -3814.4   427.7  3559.7  8884.6

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)        25592       2646   9.672 1.49e-08 ***
YearsExperience     9365        421  22.245 1.52e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5391 on 18 degrees of freedom
Multiple R-squared: 0.9649, Adjusted R-squared: 0.963
F-statistic: 494.8 on 1 and 18 DF, p-value: 1.524e-14

#How to read the output

*** means the p-value is below 0.001: the independent variable (YearsExperience) has a highly statistically significant effect on the dependent variable (Salary)
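The two estimates in the Coefficients table define the fitted line: Salary = 25592 + 9365 * YearsExperience. A small sketch of that arithmetic, using the rounded values as printed in the output (the helper name predict_salary is made up for this example):

```r
# Coefficients copied from the summary output (rounded as printed)
intercept <- 25592
slope <- 9365

# Hypothetical helper: predicted salary for a given number of years
predict_salary <- function(years) intercept + slope * years

predict_salary(5)   # 25592 + 9365 * 5 = 72417
predict_salary(0)   # 25592, the intercept
```

In other words, each extra year of experience adds roughly 9365 to the predicted salary. Because the printed coefficients are rounded, predict() on the real model may differ slightly.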

#predicting test set results

Step 4

#y_pred is a vector that stores the predicted salaries for the test set

y_pred = predict(regressor, newdata = test_set)

#to see predicted values

y_pred
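Printing the predictions does not say how far off they are. One common way to summarise that is the root mean squared error (RMSE). The post does not compute it; the sketch below uses a hypothetical helper named rmse and made-up illustrative numbers rather than the actual test set.

```r
# Hypothetical helper: root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sqrt(mean((actual - predicted)^2))
}

# Illustrative numbers only (not taken from the salary data)
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 300)
rmse(actual, predicted)
```

With the objects from this post, the call would be rmse(test_set$Salary, y_pred), which gives the typical prediction error in the same units as Salary.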

Step 5

#visualising the training set results

#load ggplot2 library

library(ggplot2)

ggplot()+

geom_point(aes(x=training_set$YearsExperience, y=training_set$Salary),

color = 'red')+

geom_line(aes(x=training_set$YearsExperience, y=predict(regressor ,newdata = training_set)),

color = 'blue')+

ggtitle('salary vs experience(training_set)')+

xlab("years of experience")+

ylab("salary")

Plot interpretation

The red points are the actual salaries of the employees in the training set.

The blue line shows the salaries predicted by the linear regression model.

#visualising the test set results

#load ggplot2 library (if not already loaded)

library(ggplot2)

ggplot()+

geom_point(aes(x=test_set$YearsExperience, y=test_set$Salary),

color = 'red')+

#the line is the same model fitted on the training set, so it is still drawn from training_set

geom_line(aes(x=training_set$YearsExperience, y=predict(regressor ,newdata = training_set)),

color = 'blue')+

ggtitle('salary vs experience(test_set)')+

xlab("years of experience")+

ylab("salary")

Plot interpretation

A few predictions are very close to the real values.

The dependence between the two variables is not perfectly linear, only approximately so.
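How close the relationship is to a straight line is what R-squared measures: 1 minus the ratio of the residual sum of squares to the total sum of squares. As a rough sketch, the helper r_squared below (a made-up name) computes it by hand and compares it with the value lm() reports. Only the first five rows of the table are used to keep the example short, so the result is not the 0.9649 from the summary above.

```r
# First five rows of the data table (a small subset, for illustration only)
years  <- c(1.1, 1.3, 1.5, 2.0, 2.2)
salary <- c(39343, 46205, 37731, 43525, 39891)

fit <- lm(salary ~ years)

# Hypothetical helper: R-squared = 1 - SS_residual / SS_total
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # unexplained variation
  ss_tot <- sum((actual - mean(actual))^2)   # total variation
  1 - ss_res / ss_tot
}

r2_manual <- r_squared(salary, fitted(fit))

# The manual value should agree with the one reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```

A value of 1 would mean every point lies exactly on the line; anything below 1 reflects the scatter around it.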
