Bagging and Random Forests in R

Background

In an earlier post we used the decision tree algorithm. Decision trees have some drawbacks; one of them is that a single tree has high variance (the model is unstable with respect to the training data). Here we use the bagging technique to improve on the decision tree model, and then move on to random forests.
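
To see this instability concretely, here is a small illustration (my own example for illustration only, not from the original text): fit rpart trees on two bootstrap samples of the same data and compare their predictions; the two trees can disagree on a noticeable fraction of rows.

library(rpart)
data(kyphosis)   # small example dataset that ships with rpart
set.seed(1)
s1 <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
s2 <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
t1 <- rpart(Kyphosis ~ Age + Number + Start, data = s1)
t2 <- rpart(Kyphosis ~ Age + Number + Start, data = s2)
# fraction of rows on which the two trees agree
mean(predict(t1, kyphosis, type = 'class') == predict(t2, kyphosis, type = 'class'))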

Bagging

Introduction to bagging

In bagging, bootstrap samples (random samples drawn with replacement) are taken from the data, and a decision tree is built on each sample. The final model averages the results of the individual trees. Bagged decision trees give a more stable final model by reducing variance; this improves accuracy and makes overfitting less likely.
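
Why does averaging reduce variance? A minimal numeric sketch (my own illustration, not from the original text): the average of B independent noisy estimates has roughly 1/B times the variance of a single estimate. Correlation between the estimates weakens this effect, which is what random forests will address later.

set.seed(2)
B <- 100
single <- rnorm(1000)                        # 1000 single estimates, variance ~ 1
averaged <- replicate(1000, mean(rnorm(B)))  # 1000 averages of B estimates each
var(single)    # ~ 1
var(averaged)  # ~ 1/B = 0.01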

Procedure

The data is spamD.tsv, from the Spambase folder of the zmPDSwR repository (see the path in the code below).

First, load the data and split it into training and test sets:

spamD <- read.table('D:/zmPDSwR-master/Spambase/spamD.tsv', header = TRUE, sep = '\t')
# rgroup is a random group label included in the data; splitting on it
# sends roughly 90% of the rows to training and 10% to test
spamTrain <- subset(spamD, spamD$rgroup >= 10)
spamTest <- subset(spamD, spamD$rgroup < 10)

Next, build a single decision tree and look at its classification scores:

# all columns except the group label and the outcome are predictor variables
spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam'))
spamFormula <- as.formula(paste('spam == "spam"',
                                paste(spamVars, collapse = ' + '), sep = ' ~ '))

# log likelihood of the predictions, with probabilities smoothed away
# from exactly 0 or 1 so that log() stays finite
loglikelihood <- function(y, py) {
    pysmooth <- ifelse(py == 0, 1e-12, ifelse(py == 1, 1 - 1e-12, py))
    sum(y * log(pysmooth) + (1 - y) * log(1 - pysmooth))
}

accuracyMeasures <- function(pred, truth, name = "model") {
    # normalized deviance: -2 * log likelihood, divided by the number of rows
    dev.norm <- -2 * loglikelihood(as.numeric(truth), pred) / length(pred)
    ctable <- table(truth = truth, pred = (pred > 0.5))
    accuracy <- sum(diag(ctable)) / sum(ctable)
    precision <- ctable[2, 2] / sum(ctable[, 2])
    recall <- ctable[2, 2] / sum(ctable[2, ])
    # note: this "f1" is the product precision * recall, not the standard
    # harmonic-mean F1; it is kept as-is so the printed results below match
    f1 <- precision * recall
    data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm = dev.norm)
}
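
To make the helper concrete, here is a toy call (hypothetical values, not from the original): four predictions against known truth. Thresholding at 0.5 gives a confusion table with accuracy 0.5 and precision = recall = 0.5, so the f1 column (the product) comes out to 0.25.

# toy example with hypothetical values
accuracyMeasures(pred = c(0.9, 0.4, 0.6, 0.1),
                 truth = c(TRUE, TRUE, FALSE, FALSE),
                 name = 'toy')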

library(rpart)
treemodel <- rpart(spamFormula, spamTrain)
accuracyMeasures(predict(treemodel, newdata = spamTrain), spamTrain$spam == "spam", name = "tree, training")

           model  accuracy        f1  dev.norm
1 tree, training 0.9104514 0.7809002 0.5618654

accuracyMeasures(predict(treemodel, newdata = spamTest), spamTest$spam == "spam", name = "tree, test")

       model  accuracy        f1  dev.norm
1 tree, test 0.8799127 0.7091151 0.6702857

Now let's try bagging the decision trees and see how much it helps:

# bag the decision trees
ntrain <- dim(spamTrain)[1]
n <- ntrain      # each bootstrap sample is the same size as the training set
ntree <- 100

# draw ntree bootstrap samples (row indices sampled with replacement)
samples <- sapply(1:ntree, FUN = function(iter) { sample(1:ntrain, size = n, replace = TRUE) })

# fit one tree per bootstrap sample
treelist <- lapply(1:ntree, FUN = function(iter) {
    samp <- samples[, iter]
    rpart(spamFormula, spamTrain[samp, ])
})

# the bagged prediction is the average of the individual trees' predictions
predict.bag <- function(treelist, newdata) {
    preds <- sapply(1:length(treelist), FUN = function(iter) {
        predict(treelist[[iter]], newdata = newdata)
    })
    predsums <- rowSums(preds)
    predsums / length(treelist)
}

accuracyMeasures(predict.bag(treelist, newdata = spamTrain), spamTrain$spam == "spam", name = "bagging, training")

              model  accuracy        f1  dev.norm
1 bagging, training 0.9201062 0.8025071 0.4672325

accuracyMeasures(predict.bag(treelist, newdata = spamTest), spamTest$spam == "spam", name = "bagging, test")

          model  accuracy        f1  dev.norm
1 bagging, test 0.9061135 0.7646497 0.5280876

The results look good: bagging improved both the accuracy and the F1 score (and lowered the normalized deviance).

Using random forests

Introduction to random forests

Briefly, the random forest method tries to de-correlate the trees by randomizing, at each split, the set of variables the tree is allowed to use.
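
In the randomForest package, this per-split randomization is controlled by the mtry argument, which for classification defaults to roughly the square root of the number of variables. A minimal sketch writing the default out explicitly (the explicit mtry here is for illustration only; the code below relies on the default):

library(randomForest)
# mtry = number of variables randomly tried at each split;
# floor(sqrt(p)) is the classification default
p <- length(spamVars)
rfDemo <- randomForest(x = spamTrain[, spamVars],
                       y = spamTrain$spam,
                       ntree = 100,
                       mtry = floor(sqrt(p)))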

Random forest implementation

library(randomForest)
set.seed(5123512)  # fix the random seed for reproducibility
fmodel <- randomForest(x = spamTrain[, spamVars],
                    y = spamTrain$spam,
                    ntree = 100,
                    nodesize = 7,       # minimum size of terminal nodes
                    importance = TRUE)  # also compute variable importance

accuracyMeasures(predict(fmodel,
                        newdata = spamTrain[, spamVars],
                        type = 'prob')[, 'spam'],
                spamTrain$spam == "spam",
                name = "random forest, train")

                 model  accuracy        f1  dev.norm
1 random forest, train 0.9884142 0.9706611 0.1428786

accuracyMeasures(predict(fmodel,
                        newdata = spamTest[, spamVars],
                        type = 'prob')[, 'spam'],
                spamTest$spam == "spam",
                name = "random forest, test")

                model  accuracy        f1  dev.norm
1 random forest, test 0.9541485 0.8845029 0.3972416

With a random forest we should also examine variable importance, to understand which variables matter most to the model:

varImp <- importance(fmodel)
varImp[1:10, ]

                    non-spam       spam MeanDecreaseAccuracy MeanDecreaseGini
word.freq.make      2.096811  3.7304353             4.334207         5.877954
word.freq.address   3.603167  3.9967031             4.977452        10.081640
word.freq.all       2.799456  4.9527834             4.924958        23.524720
word.freq.3d        3.000273  0.4125932             2.917972         1.550635
word.freq.our       9.037946  7.9421391            10.731509        52.569163
word.freq.over      5.879377  4.2402613             5.751371        11.820391
word.freq.remove   16.637390 13.9331691            17.753122       174.126926
word.freq.internet  7.301055  4.4458342             7.947515        22.578106
word.freq.order     3.937897  4.3587883             4.866540        11.809265
word.freq.mail      5.022432  3.4701224             6.103929        11.127200

varImpPlot(fmodel, type = 1)  # type = 1: plot MeanDecreaseAccuracy

[Figure: variable importance plot (MeanDecreaseAccuracy)]
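
MeanDecreaseAccuracy is a permutation-based measure. The idea, sketched below under simplifying assumptions (randomForest actually computes it per tree on out-of-bag data; this illustration just permutes one column of the test set, and word.freq.remove is used only as an example column): shuffle a variable's values and see how much the model's accuracy drops. The bigger the drop, the more the model relied on that variable.

# simplified illustration of permutation importance
permTest <- spamTest
permTest$word.freq.remove <- sample(permTest$word.freq.remove)
baseAcc <- mean(predict(fmodel, newdata = spamTest[, spamVars]) == spamTest$spam)
permAcc <- mean(predict(fmodel, newdata = permTest[, spamVars]) == spamTest$spam)
baseAcc - permAcc  # larger drops mean the variable mattered more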

Now refit the model using only the most important variables:

# keep the 25 variables with the highest importance scores
selVars <- names(sort(varImp[, 1], decreasing = TRUE))[1:25]
fsel <- randomForest(x = spamTrain[, selVars],
                    y = spamTrain$spam,
                    ntree = 100,
                    nodesize = 7,
                    importance = TRUE)
accuracyMeasures(predict(fsel,
                        newdata = spamTrain[, selVars],
                        type = 'prob')[, 'spam'],
                spamTrain$spam == "spam",
                name = "RF small, train")

            model  accuracy        f1  dev.norm
1 RF small, train 0.9864832 0.9658047 0.1379438

accuracyMeasures(predict(fsel,
                        newdata = spamTest[, selVars],
                        type = 'prob')[, 'spam'],
                spamTest$spam == "spam",
                name = "RF small, test")

           model  accuracy        f1  dev.norm
1 RF small, test 0.9497817 0.8742775 0.3985712

Summary

  • Bagging stabilizes decision trees and improves accuracy by reducing variance.
  • Bagging reduces generalization error.
  • Random forests further improve decision tree performance by de-correlating the individual trees in the bagged ensemble.
  • Random forest variable importance measures help determine which variables contribute the most to the model.
  • Because the trees in a random forest ensemble are unpruned and often very deep, there is still a risk of overfitting; be sure to evaluate the model on held-out data, using simple cross-validation, to get an honest estimate of its performance.

References

[1] 数据科学:理论、方法与R语言实践 (Practical Data Science with R)