Regression R/RStudio Example

Data: WOW_data

1. The researcher wants to know whether student-rated teacher-student conflict in grade 2 (CO_S_2) predicts math achievement in grade 3 (math_3).

Read the CSV file into R and name the resulting data frame reg.

reg <- read.csv("D:/104/ML_R/WOW_data.csv",header=TRUE,sep=",")

Use lm() to fit the regression and store the result as M_reg.

M_reg <- lm(math_3 ~ CO_S_2, data = reg)
summary(M_reg)

## 
## Call:
## lm(formula = math_3 ~ CO_S_2, data = reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.1894  -5.6611   0.4521   6.9238  21.0371 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  509.793      2.135 238.801   <2e-16 ***
## CO_S_2        -2.830      1.163  -2.433   0.0159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.13 on 191 degrees of freedom
## Multiple R-squared:  0.03006,    Adjusted R-squared:  0.02498 
## F-statistic: 5.919 on 1 and 191 DF,  p-value: 0.0159
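
To read the coefficient table, it can help to pull the estimates out of the fitted object and compute a prediction by hand. The sketch below runs on simulated data: the variable names mirror the example, but `sim` and its values are made up for illustration and are not WOW_data, so the numbers will not match the output above.

```r
# Simulated stand-in for the real data; all values are hypothetical.
set.seed(1)
sim <- data.frame(CO_S_2 = runif(100, min = 1, max = 5))
sim$math_3 <- 510 - 3 * sim$CO_S_2 + rnorm(100, sd = 10)

fit <- lm(math_3 ~ CO_S_2, data = sim)

b  <- coef(fit)     # named vector: "(Intercept)" and "CO_S_2"
ci <- confint(fit)  # 95% confidence intervals for both coefficients

# Predicted score for CO_S_2 = 2, computed by hand and with predict()
by_hand  <- unname(b["(Intercept)"] + b["CO_S_2"] * 2)
by_model <- unname(predict(fit, newdata = data.frame(CO_S_2 = 2)))

print(b)
print(ci)
print(c(by_hand = by_hand, by_model = by_model))
```

Read the same way, the output above gives the fitted line math_3 = 509.793 - 2.830 × CO_S_2: each one-point increase in rated conflict is associated with a predicted score about 2.83 points lower.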

Load the packages and compute the standardized regression coefficients.

library(lm.beta)
library(ggplot2)

lm.beta(M_reg)

## 
## Call:
## lm(formula = math_3 ~ CO_S_2, data = reg)
## 
## Standardized Coefficients::
## (Intercept)      CO_S_2 
##   0.0000000  -0.1733755
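
A standardized coefficient is the raw slope rescaled by the ratio of the predictor's standard deviation to the outcome's. With a single predictor it equals the Pearson correlation. The following is a minimal base-R sketch on simulated data (the vectors `x` and `y` are hypothetical and unrelated to WOW_data) that checks this without lm.beta.

```r
# Hypothetical data for illustration only.
set.seed(2)
x <- rnorm(150)
y <- 5 - 0.4 * x + rnorm(150)

fit <- lm(y ~ x)
b1  <- unname(coef(fit)["x"])  # unstandardized slope

# Standardized slope: b * sd(x) / sd(y)
beta_by_hand <- b1 * sd(x) / sd(y)

# In simple regression this is the correlation between x and y
r_xy <- cor(x, y)

print(c(beta_by_hand = beta_by_hand, r_xy = r_xy))
```

This matches the output above: squaring -0.1733755 gives about 0.03006, the Multiple R-squared reported by summary(M_reg).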

Plot the data with the fitted regression line, first with base graphics and then with ggplot2.

plot(reg$CO_S_2,reg$math_3)
abline(lm(reg$math_3 ~ reg$CO_S_2))

ggplot(reg, aes(x = CO_S_2, y = math_3)) + geom_point(size=3)  + stat_smooth(method="lm")

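2. Continuing the analysis above, the researcher adds student-rated teacher-student warmth in grade 2 (WA_S_2) as a second predictor, which makes this a multiple regression.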

Again use lm() to fit the regression, and store the result as Model_2.

Model_2 <- lm(math_3 ~ CO_S_2 + WA_S_2, data = reg)
summary(Model_2)

## 
## Call:
## lm(formula = math_3 ~ CO_S_2 + WA_S_2, data = reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.2074  -5.4557   0.6687   6.9193  22.2977 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 514.4646     3.7506 137.168   <2e-16 ***
## CO_S_2       -2.9986     1.1647  -2.575   0.0108 *  
## WA_S_2       -1.2530     0.8285  -1.512   0.1321    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.1 on 190 degrees of freedom
## Multiple R-squared:  0.0416, Adjusted R-squared:  0.03151 
## F-statistic: 4.123 on 2 and 190 DF,  p-value: 0.01766

Standardized regression coefficients

lm.beta(Model_2)

## 
## Call:
## lm(formula = math_3 ~ CO_S_2 + WA_S_2, data = reg)
## 
## Standardized Coefficients::
## (Intercept)      CO_S_2      WA_S_2 
##   0.0000000  -0.1836963  -0.1079128

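Because the model has more than one predictor, check for multicollinearity (VIF < 10 is a commonly used rule of thumb for acceptable collinearity).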

library(car)

vif(Model_2)

##   CO_S_2   WA_S_2 
## 1.009231 1.009231
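
The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors. The sketch below reproduces that calculation in base R on simulated predictors (`x1` and `x2` are hypothetical, not the WOW_data variables) rather than calling car::vif().

```r
# Hypothetical, mildly correlated predictors.
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)

# R-squared from regressing each predictor on the other one
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
r2_x2 <- summary(lm(x2 ~ x1))$r.squared

# Variance inflation factors
vif_x1 <- 1 / (1 - r2_x1)
vif_x2 <- 1 / (1 - r2_x2)

print(c(x1 = vif_x1, x2 = vif_x2))
```

With exactly two predictors, both auxiliary regressions have the same R² (the squared correlation between the two predictors), which is why vif(Model_2) reports the same value, 1.009231, for CO_S_2 and WA_S_2. A value that close to 1 implies a correlation of only about 0.1 in absolute value between them.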

3. Run a dummy-variable analysis using the three ethnic groups (ethnic) and math achievement in grade 3 (math_3).

Select the variables needed (columns 3 and 13 of reg, which the code below refers to as ethnic and math_3).

dum<-reg[c(3,13)]

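Convert ethnic to numeric codes. as.numeric() returns level codes only for a factor, and since R 4.0.0 read.csv() reads text columns as character by default, so convert the column to a factor first.

dum$ethnic <- as.factor(dum$ethnic)
dum$ethnic_N <- as.numeric(dum$ethnic)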

Load dplyr and use mutate() to build the dummy codes.

library(dplyr)

dum <- mutate(dum, E1 = ifelse(ethnic_N == 1, 1, 0))
dum <- mutate(dum, E2 = ifelse(ethnic_N == 2, 1, 0))
dum <- mutate(dum, E3 = ifelse(ethnic_N == 3, 1, 0))

Use contrasts() to inspect the current coding; the reference group is currently Human.

contrasts(dum$ethnic)

##         Orc  Undead 
## Human      0       0
## Orc        1       0
## Undead     0       1

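In fact, R automatically assigns a set of dummy codes to any categorical variable (factor), and it picks the reference group as the first level, which by default is the first in alphabetical order.
In this example, ethnic is a categorical variable with the three levels "Human", "Orc", and "Undead", so the default reference group is "Human".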
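
One way to see the dummy columns that lm() builds from a factor is to ask for the design matrix directly. This is a small base-R sketch using a made-up five-element factor (the vector `ethnic` here is hypothetical and only borrows the group names from the example).

```r
# Hypothetical factor with the same three group names as the example.
ethnic <- factor(c("Orc", "Human", "Undead", "Human", "Orc"))

# Levels are sorted alphabetically; the first one is the reference group
print(levels(ethnic))

# Design matrix lm() would use: an intercept plus one 0/1 column
# for every level except the reference
X <- model.matrix(~ ethnic)
print(X)
```

Rows belonging to the reference group have 0 in every dummy column, so their predicted value is the intercept alone.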

Fit the model with Human as the reference group.

dum$ethnic<-as.factor(dum$ethnic)
dum_lm <- lm(math_3 ~ ethnic , data=dum)
summary(dum_lm)

## 
## Call:
## lm(formula = math_3 ~ ethnic, data = dum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1905  -6.9062   0.8506   6.8095  21.0938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    509.149      1.015 501.605  < 2e-16 ***
## ethnicOrc       -9.959      1.779  -5.598 7.49e-08 ***
## ethnicUndead    -6.243      1.559  -4.004 8.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.468 on 190 degrees of freedom
## Multiple R-squared:  0.1579, Adjusted R-squared:  0.1491 
## F-statistic: 17.82 on 2 and 190 DF,  p-value: 8.078e-08

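Change the reference group to Orc. The label passed to relevel() must match the level stored in the data exactly (here it is written with a trailing space, 'Orc ').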

dum$ethnic <- relevel(dum$ethnic,'Orc ')
contrasts(dum$ethnic)

##         Human  Undead 
## Orc          0       0
## Human        1       0
## Undead       0       1

dum_lm_2 <- lm(math_3 ~ ethnic , data=dum)
summary(dum_lm_2)

## 
## Call:
## lm(formula = math_3 ~ ethnic, data = dum)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.1905  -6.9062   0.8506   6.8095  21.0938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    499.190      1.461 341.702  < 2e-16 ***
## ethnicHuman      9.959      1.779   5.598 7.49e-08 ***
## ethnicUndead     3.716      1.880   1.976   0.0496 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.468 on 190 degrees of freedom
## Multiple R-squared:  0.1579, Adjusted R-squared:  0.1491 
## F-statistic: 17.82 on 2 and 190 DF,  p-value: 8.078e-08
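
With a single dummy-coded factor, the intercept is the mean of the reference group and every other coefficient is a difference from that mean, so changing the reference group re-expresses the same fit. The sketch below checks this on simulated data (`sim`, `true_means`, and the group sizes are hypothetical choices for illustration, not values from WOW_data).

```r
# Hypothetical group means and sizes, for illustration only.
set.seed(4)
true_means <- c(Human = 509, Orc = 499, Undead = 503)
grp   <- factor(rep(names(true_means), times = c(40, 25, 35)))
score <- unname(true_means[as.character(grp)]) + rnorm(length(grp), sd = 9)
sim   <- data.frame(grp = grp, score = score)

# Observed mean score per group (named numeric vector)
means <- sapply(split(sim$score, sim$grp), mean)

# Model with the default reference group (Human)
fit_human <- lm(score ~ grp, data = sim)

# Same model after switching the reference group to Orc
sim$grp_orc <- relevel(sim$grp, ref = "Orc")
fit_orc <- lm(score ~ grp_orc, data = sim)

print(means)
print(coef(fit_human))  # intercept = Human mean; slopes = differences from Human
print(coef(fit_orc))    # intercept = Orc mean; slopes = differences from Orc
```

The two summaries above show the same pattern: 509.149 - 9.959 = 499.190, the intercept of the Orc-referenced model; the Human versus Orc coefficient only changes sign; and the residuals, R-squared, and F-statistic are identical.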

date: “January 23, 2016, first edition”

author: “邱浩恩”