Using “random forest” for classification and regression (original Chinese title: 随机森林模型在分类与回归分析中的应用)
李欣海
Keywords: random forest, classification tree, discriminant analysis, regression, machine learning
Abstract: Random forest is a classification-tree-based algorithm developed by Breiman and Cutler in 2001. It constructs many decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees. Aggregating many trees improves predictive accuracy over a single decision tree, and the method is presented as a new model that can replace traditional machine learning techniques such as artificial neural networks. It is computationally fast and performs well on large datasets. Random forest can handle up to thousands of explanatory variables, does not require concern about the multicollinearity problem faced by ordinary regression analysis, and does not require variable selection. The R package “randomForest” reports the importance of every variable. The method is suitable for demonstrating nonlinear effects of variables, can model complex interactions among variables, and is robust to outliers. In this paper, three examples introduce the use of random forest for a discrimination problem (identifying insect species; the dependent variable has multiple categories), for presence-absence data (in place of logistic regression; the dependent variable has two categories), and for regression (the dependent variable is continuous). The data formats and R code of the examples can serve as a reference for applying random forest to classification and regression.
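The voting scheme described in the abstract (many trees, each trained on a resampled dataset, with the final class taken as the mode of the individual predictions) can be illustrated with a minimal sketch. This is not the paper's R code: it is written in Python using only the standard library, it substitutes one-split trees ("stumps") for full classification trees to stay short, and every function name and the toy data are hypothetical, invented for illustration.

```python
import random
from collections import Counter


def majority_vote(labels):
    """Return the most common label (the mode) in a list of labels."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X, y, features):
    """Fit a one-split tree (a stump) using only the given feature indices."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            left_label, right_label = majority_vote(left), majority_vote(right)
            errors = sum(l != left_label for l in left) + sum(
                r != right_label for r in right
            )
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    if best is None:  # no split possible: always predict the overall mode
        mode = majority_vote(y)
        return (features[0], float("inf"), mode, mode)
    return best[1:]


def predict_stump(stump, row):
    """Predict the label of one row with a fitted stump."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, n_trees=25, mtry=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        features = rng.sample(range(p), mtry)  # random subset of mtry features
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict_forest(forest, row):
    """Predict by taking the mode of the individual trees' predictions."""
    return majority_vote([predict_stump(s, row) for s in forest])


# Toy data (made up): two well-separated groups with two features each.
X = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.1), (1.2, 1.9), (1.8, 1.5),
     (6.0, 6.2), (6.5, 5.8), (7.0, 6.1), (6.2, 6.9), (6.8, 6.5)]
y = ["A"] * 5 + ["B"] * 5

forest = fit_forest(X, y)
print(predict_forest(forest, (0.5, 0.5)), predict_forest(forest, (8.0, 8.0)))
```

The sketch leaves out what a production package adds on top of this idea, such as fully grown trees, out-of-bag error estimates, and variable importance. For regression, the vote would be replaced by the average of the trees' numeric predictions.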