R语言之广义加性模型（GAM）

背景

当我们想了解输入变量和输出之间的关系时，线性回归和Logistic回归都是好的方法，但是这两种模型有缺点：他们会假定输入和输出之间的关系是单调的。但是实际关系如果是非单调的呢？这个时候就是GAM的用武之地了。

名词解释

广义加性模型（GAM）是一种在线性或Logistic回归模型（或任何其他广义线性模型）的框架内，构造非单调的响应模型的方法。

实现方法就是用一组函数来构成平滑的拟合曲线。

人工示例

这里用人工的方法先生成数据来做一个实验。

构建一个函数：y是输入变量x的一个带噪声的非线性函数。

set.seed(602957)

x <- rnorm(1000)
noise <- rnorm(1000, sd = 1.5)
y <- 3*sin(2*x) + cos(0.75*x) - 1.5*(x^2) + noise

构建数据框，划分训练集和测试集：

select <- runif(1000)
frame <- data.frame(y = y, x = x)

train <- frame[select > 0.1,]
test <- frame[select <= 0.1,]

先用线性回归来拟合：

> lin.model <- lm(y ~ x, data = train)
> summary(lin.model)

Call:
lm(formula = y ~ x, data = train)

Residuals:
    Min       1Q   Median       3Q      Max 
-19.2750  -1.7457   0.2953   2.3922   7.1232 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.7241     0.1153  -6.281 5.25e-10 ***
x             0.8079     0.1197   6.751 2.63e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.455 on 896 degrees of freedom
Multiple R-squared:  0.04841,    Adjusted R-squared:  0.04735 
F-statistic: 45.58 on 1 and 896 DF,  p-value: 2.631e-11

> resid.lin <- train$y - predict(lin.model)
> sqrt(mean(resid.lin^2))
[1] 3.450926

上述结果看出来了，非常差，R平方约为0.04，也应该是这样，因为本来我们的函数y就是非线性的，下面看看GAM的效果：

library(mgcv)
glin.model <- gam(y~s(x), data = train)

> glin.model$converged  # 判断是否收敛
[1] TRUE
> summary(glin.model)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.73023    0.04896  -14.91   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
    edf Ref.df     F p-value    
s(x) 8.404  8.906 484.6  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.828   Deviance explained =   83%
GCV = 2.1756  Scale est. = 2.1529    n = 898
> resid.glin <- train$y - predict(glin.model)
> sqrt(mean(resid.glin^2))
[1] 1.45956

可以看到模型解释了超过80%的方差。接下来就可以看看在测试集上的效果了：

> actual <- test$y
> pred.lin <- predict(lin.model, newdata = test)
> pred.glin <- predict(glin.model, newdata = test)
> resid.lin <- actual - pred.lin
> resid.glin <- actual - pred.glin
> sqrt(mean(resid.lin^2))
[1] 3.832652
> sqrt(mean(resid.glin^2))
[1] 1.43423
> cor(actual, pred.lin)^2
[1] 0.005517822
> cor(actual, pred.glin)^2
[1] 0.8565234

比较训练集和测试集的均方根误差和R平方，发现差距不大，应该是没有过拟合。

下面我们可以把GAM中的函数画出来，看看长什么样子：

# 从GAM中提取学过的样条
> sx <- predict(glin.model, type = "terms")
> summary(sx)
    s(x)         
Min.   :-16.4890  
1st Qu.: -2.5411  
Median :  0.1049  
Mean   :  0.0000  
3rd Qu.:  2.8355  
Max.   :  3.9613  
> xframe <- cbind(train, sx = sx[,1])

library(ggplot2)
ggplot(xframe, aes(x = x)) + 
    geom_point(aes(y = y), alpha = 0.4) +
    geom_line(aes(y = sx))

实际示例

下面就看看GAM在真实数据集中的表现，这里采用的是2010年CDC出生率数据集中的额数据，来预测新生儿的体重（DBWT）。这里考虑的变量包括母亲的体重（PWGT）、母亲怀孕期间体重的增加量（WTGAIN）、母亲的年龄（MAGER）和产前检查的次数（UPREVIS）。数据在这里

library(mgcv)
load("D:/zmPDSwR-master/CDC/NatalBirthData.rData")
train <- sdata[sdata$ORIGRANDGROUP <= 5,]
test <- sdata[sdata$ORIGRANDGROUP > 5,]
form.lin <- as.formula("DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS")
linmodel <- lm(form.lin, data = train)
> summary(linmodel)

Call:
lm(formula = form.lin, data = train)

Residuals:
    Min       1Q   Median       3Q      Max 
-3155.43  -272.09    45.04   349.81  2870.55 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2419.7090    31.9291  75.784  < 2e-16 ***
PWGT           2.1713     0.1241  17.494  < 2e-16 ***
WTGAIN         7.5773     0.3178  23.840  < 2e-16 ***
MAGER          5.3213     0.7787   6.834  8.6e-12 ***
UPREVIS       12.8753     1.1786  10.924  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 562.7 on 14381 degrees of freedom
Multiple R-squared:  0.06596,    Adjusted R-squared:  0.0657 
F-statistic: 253.9 on 4 and 14381 DF,  p-value: < 2.2e-16

同样的数据，用在GAM上：

> form.glin <- as.formula("DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)")
> glinmodel <- gam(form.glin, data = train)
> glinmodel$converged
[1] TRUE
> summary(glinmodel)

Family: gaussian 
Link function: identity 

Formula:
DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3276.948      4.623   708.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df       F  p-value    
s(PWGT)    5.374  6.443  69.010  < 2e-16 ***
s(WTGAIN)  4.719  5.743 102.313  < 2e-16 ***
s(MAGER)   7.742  8.428   7.145 1.37e-09 ***
s(UPREVIS) 5.491  6.425  48.423  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0927   Deviance explained = 9.42%
GCV = 3.0804e+05  Scale est. = 3.0752e+05  n = 14386

绘图来看：

terms <- predict(glinmodel, type = "terms")
tframe <- cbind(DBWT = train$DBWT, as.data.frame(terms))
colnames(tframe) <- gsub('[()]', '', colnames(tframe))
pframe <- cbind(tframe, train[,c("PWGT", "WTGAIN", "MAGER", "UPREVIS")])
p1 <- ggplot(pframe, aes(x = PWGT)) +
    geom_point(aes(y = scale(sPWGT, scale = F))) +
    geom_smooth(aes(y = scale(DBWT, scale = F))) 
p1

p2 <- ggplot(pframe, aes(x=WTGAIN)) +
    geom_point(aes(y=scale(sWTGAIN, scale=F))) +
    geom_smooth(aes(y=scale(DBWT, scale=F)))
p2

p3 <- ggplot(pframe, aes(x=MAGER)) +
    geom_point(aes(y=scale(sMAGER, scale=F))) +
    geom_smooth(aes(y=scale(DBWT, scale=F)))
p3

p4 <- ggplot(pframe, aes(x=UPREVIS)) +
    geom_point(aes(y=scale(sUPREVIS, scale=F))) +
    geom_smooth(aes(y=scale(DBWT, scale=F)))
p4

效果不错，但是还是要做交叉验证：

> pred.lin <- predict(linmodel, newdata = test)
> pred.glin <- predict(glinmodel, newdata = test)
> cor(pred.lin, test$DBWT)^2
[1] 0.0616812
> cor(pred.glin, test$DBWT)^2
[1] 0.08857426

与训练集相似，没有过拟合。

使用GAM实现Logistic回归

这里也检查的在Logistic回归上看看效果：

form <- as.formula("DBWT < 2000 ~ PWGT + WTGAIN + MAGER +UPREVIS")
logmod <- glm(form, data = train, family = binomial(link = "logit"))

> form2 <- as.formula("DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)")
> glogmod <- gam(form2, data = train, family = binomial(link = "logit"))
> glogmod$converged
[1] TRUE
> summary(glogmod)

Family: binomial 
Link function: logit 

Formula:
DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.94085    0.06794     -58   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
            edf Ref.df  Chi.sq  p-value    
s(PWGT)    1.905  2.420   2.463  0.39023    
s(WTGAIN)  3.674  4.543  64.211 1.81e-12 ***
s(MAGER)   1.003  1.005   8.347  0.00393 ** 
s(UPREVIS) 6.802  7.216 217.631  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.0331   Deviance explained = 9.14%
UBRE = -0.76987  Scale est. = 1         n = 14386

效果不是很好，这里只是简单的试验下，不过还是发现，母亲的体重（PWGT）对结果没有显著影响。

总结

GAM将变量和结果之间的非线性、非单调性关系在一个线性或Logistic回归框架中表现出来。
在mgcv包中，可以把在GAM模型中使用type="terms"参数调用predict()时所发现的关系提取出来。
可以使用评估标准线性或Logistic回归时所使用的度量准则来评价GAM，如：残差、偏差、R-平方和伪R-平方。gam()概要还能给出指示，表明哪些变量会对模型产生显著影响。
因为相对于标准线性或Logistic回归模型而言，GAM的复杂性增加了，所以GAM过拟合的风险更高。

参考文献

[1][数据科学理论、方法与R语言实践]