Multiple Linear Regression in R

The data set is 50_Startups.csv:


R&D Spend, Administration, Marketing Spend, State, Profit
165349.2, 136897.8, 471784.1, New York, 192261.8
162597.7, 151377.6, 443898.5, California, 191792.1
153441.5, 101145.6, 407934.5, Florida, 191050.4
144372.4, 118671.9, 383199.6, New York, 182902
142107.3, 91391.77, 366168.4, Florida, 166187.9
131876.9, 99814.71, 362861.4, New York, 156991.1
134615.5, 147198.9, 127716.8, California, 156122.5
130298.1, 145530.1, 323876.7, Florida, 155752.6
120542.5, 148719, 311613.3, New York, 152211.8
123334.9, 108679.2, 304981.6, California, 149760
101913.1, 110594.1, 229161, Florida, 146122
100672, 91790.61, 249744.6, California, 144259.4
93863.75, 127320.4, 249839.4, Florida, 141585.5
91992.39, 135495.1, 252664.9, California, 134307.4
119943.2, 156547.4, 256512.9, Florida, 132602.7
114523.6, 122616.8, 261776.2, New York, 129917
78013.11, 121597.6, 264346.1, California, 126992.9
94657.16, 145077.6, 282574.3, New York, 125370.4
91749.16, 114175.8, 294919.6, Florida, 124266.9
86419.7, 153514.1, 0, New York, 122776.9
76253.86, 113867.3, 298664.5, California, 118474
78389.47, 153773.4, 299737.3, New York, 111313
73994.56, 122782.8, 303319.3, Florida, 110352.3
67532.53, 105751, 304768.7, Florida, 108734
77044.01, 99281.34, 140574.8, New York, 108552
64664.71, 139553.2, 137962.6, California, 107404.3
75328.87, 144136, 134050.1, Florida, 105733.5
72107.6, 127864.6, 353183.8, New York, 105008.3
66051.52, 182645.6, 118148.2, Florida, 103282.4
65605.48, 153032.1, 107138.4, New York, 101004.6
61994.48, 115641.3, 91131.24, Florida, 99937.59
61136.38, 152701.9, 88218.23, New York, 97483.56
63408.86, 129219.6, 46085.25, California, 97427.84
55493.95, 103057.5, 214634.8, Florida, 96778.92
46426.07, 157693.9, 210797.7, California, 96712.8
46014.02, 85047.44, 205517.6, New York, 96479.51
28663.76, 127056.2, 201126.8, Florida, 90708.19
44069.95, 51283.14, 197029.4, California, 89949.14
20229.59, 65947.93, 185265.1, New York, 81229.06
38558.51, 82982.09, 174999.3, California, 81005.76
28754.33, 118546.1, 172795.7, California, 78239.91
27892.92, 84710.77, 164470.7, Florida, 77798.83
23640.93, 96189.63, 148001.1, California, 71498.49
15505.73, 127382.3, 35534.17, New York, 69758.98
22177.74, 154806.1, 28334.72, California, 65200.33
1000.23, 124153, 1903.93, New York, 64926.08
1315.46, 115816.2, 297114.5, Florida, 49490.75
0, 135426.9, 0, California, 42559.73
542.05, 51743.15, 0, New York, 35673.41
0, 116983.8, 45173.06, California, 14681.4


Step 1

# Importing the dataset

dataset1 = read.csv('50_Startups.csv')

Step 2

# Encoding categorical data

dataset1$State = factor(dataset1$State,
                        levels = c('New York', 'California', 'Florida'),
                        labels = c(1, 2, 3))
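Steps 3 and 8 below use training_set and test_set, which are never created in this walkthrough; they would come from a train/test split of dataset1. A minimal sketch of an 80/20 split, assuming the caTools package is installed:

```r
# Splitting the dataset into the Training set and Test set
# (assumes install.packages('caTools') has been run and
#  dataset1 was loaded and encoded as in Steps 1 and 2)
library(caTools)
set.seed(123)   # for a reproducible split
split = sample.split(dataset1$Profit, SplitRatio = 0.8)
training_set = subset(dataset1, split == TRUE)    # 40 rows
test_set = subset(dataset1, split == FALSE)       # 10 rows
```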
Step 3

# Fitting Multiple Linear Regression to the Training set

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = training_set)

or, equivalently, using . to include all predictors:

regressor = lm(formula = Profit ~ .,
               data = training_set)

summary(regressor)

Call:
lm(formula = Profit ~ ., data = training_set)

Residuals:
   Min     1Q Median     3Q    Max
-33128  -4865      5   6098  18065

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
Administration  -2.942e-02  5.828e-02  -0.505    0.617   
Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134   
State2           1.213e+02  3.751e+03   0.032    0.974   
State3           2.376e+02  4.127e+03   0.058    0.954   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9908 on 34 degrees of freedom
Multiple R-squared:  0.9499,   Adjusted R-squared:  0.9425
F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16

# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)


Step 4

# Building the optimal model using Backward Elimination

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset1)

summary(regressor)

Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
    State, data = dataset1)

Residuals:
   Min     1Q Median     3Q    Max
-33504  -4736     90   6672  17338

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      5.008e+04  6.953e+03   7.204 5.76e-09 ***
R.D.Spend        8.060e-01  4.641e-02  17.369  < 2e-16 ***
Administration  -2.700e-02  5.223e-02  -0.517    0.608   
Marketing.Spend  2.698e-02  1.714e-02   1.574    0.123   
State2           4.189e+01  3.256e+03   0.013    0.990   
State3           2.407e+02  3.339e+03   0.072    0.943   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9439 on 44 degrees of freedom
Multiple R-squared:  0.9508,   Adjusted R-squared:  0.9452
F-statistic: 169.9 on 5 and 44 DF,  p-value: < 2.2e-16

State is not significant (both dummy p-values are above 0.9), so we remove it.
Step 5

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = dataset1)

summary(regressor)

Call:
lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
    data = dataset1)

Residuals:
   Min     1Q Median     3Q    Max
-33534  -4795     63   6606  17275

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
(Intercept)      5.012e+04  6.572e+03   7.626 1.06e-09 ***
R.D.Spend        8.057e-01  4.515e-02  17.846  < 2e-16 ***
Administration  -2.682e-02  5.103e-02  -0.526    0.602   
Marketing.Spend  2.723e-02  1.645e-02   1.655    0.105   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9232 on 46 degrees of freedom
Multiple R-squared:  0.9507,   Adjusted R-squared:  0.9475
F-statistic:   296 on 3 and 46 DF,  p-value: < 2.2e-16


Administration is not significant (p = 0.602), so we remove it.

Step 6

regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = dataset1)

summary(regressor)

Call:
lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = dataset1)

Residuals:
   Min     1Q Median     3Q    Max
-33645  -4632   -414   6484  17097

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)   
(Intercept)     4.698e+04  2.690e+03  17.464   <2e-16 ***
R.D.Spend       7.966e-01  4.135e-02  19.266   <2e-16 ***
Marketing.Spend 2.991e-02  1.552e-02   1.927     0.06 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9161 on 47 degrees of freedom
Multiple R-squared:  0.9505,   Adjusted R-squared:  0.9483
F-statistic: 450.8 on 2 and 47 DF,  p-value: < 2.2e-16

Marketing.Spend is not significant at the 5% level (p = 0.06), so we remove it as well.

Step 7

regressor = lm(formula = Profit ~ R.D.Spend,
               data = dataset1)

summary(regressor)

Call:
lm(formula = Profit ~ R.D.Spend, data = dataset1)

Residuals:
   Min     1Q Median     3Q    Max
-34351  -4626   -375   6249  17188

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept) 4.903e+04  2.538e+03   19.32   <2e-16 ***
R.D.Spend   8.543e-01  2.931e-02   29.15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9416 on 48 degrees of freedom
Multiple R-squared:  0.9465,   Adjusted R-squared:  0.9454
F-statistic: 849.8 on 1 and 48 DF,  p-value: < 2.2e-16
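The manual elimination in Steps 4 to 7 can also be automated with a short p-value loop. Below is a sketch of such a hypothetical helper (backward_eliminate is not part of the original tutorial), demonstrated on R's built-in mtcars data so that it runs stand-alone. Note it assumes numeric predictors, whose coefficient names match their column names; a factor like State produces dummy coefficients (State2, State3) and would need extra handling.

```r
# Backward elimination: repeatedly drop the predictor with the
# highest p-value until all remaining p-values are at most sl
backward_eliminate = function(data, response, sl = 0.05) {
  predictors = setdiff(names(data), response)
  repeat {
    f = as.formula(paste(response, '~', paste(predictors, collapse = ' + ')))
    model = lm(f, data = data)
    p = summary(model)$coefficients[-1, 4]   # p-values without the intercept
    if (max(p) <= sl || length(predictors) == 1) return(model)
    predictors = setdiff(predictors, names(which.max(p)))
  }
}

# Illustrative run on the built-in mtcars data (all-numeric predictors)
model = backward_eliminate(mtcars[, c('mpg', 'disp', 'hp', 'wt', 'qsec')], 'mpg')
summary(model)
```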

Step 8

y_pred = predict(regressor, newdata = test_set)
y_pred

Step 9

# Visualising the Training set results
# load the ggplot2 library first: library(ggplot2)

ggplot() +
  geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
             color = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Profit vs R.D.Spend (Training set)') +
  xlab('R.D.Spend') +
  ylab('Profit')



# Plot interpretation
The red points are the actual profits in the training set; the blue line is our linear regression model.


# Visualising the Test set results
# load the ggplot2 library first: library(ggplot2)

ggplot() +
  geom_point(aes(x = test_set$R.D.Spend, y = test_set$Profit),
             color = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Profit vs R.D.Spend (Test set)') +
  xlab('R.D.Spend') +
  ylab('Profit')


# Plot interpretation

The red points are the actual profits in the test set; the blue line is the same regression model fitted on the training set. The test points lie close to the line, so the predictions are close to the real values.

