Using “random forest” for classification and regression
Author of the article:
Author's Workplace: Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
Key Words: random forest, classification tree, discriminant analysis, regression, machine learning
Abstract: “Random forest” is an algorithm developed by Breiman and Cutler in 2001. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. It improves on the performance of a single decision tree, and it is much more efficient than traditional machine learning techniques such as artificial neural networks, especially when the dataset is large. Random forest can handle up to thousands of explanatory variables, and it can be used to rank the importance of variables when implemented with the R package “randomForest”. It is well suited to demonstrating the nonlinear effects of variables and can model complex interactions among them. Random forest is also robust to outliers. In this paper, three examples are used to show how random forest can be applied to a discrimination problem (the dependent variable has multiple categories), to presence/absence data (the dependent variable has two categories), and to regression (the dependent variable is continuous).
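The voting scheme described in the abstract, in which each tree predicts a class and the forest returns the modal class, can be sketched briefly. The paper itself uses the R package randomForest; the snippet below is an assumed stand-in using Python's scikit-learn and its built-in iris dataset, not the authors' own examples.

```python
# Minimal sketch of random-forest classification by majority vote,
# using scikit-learn as an analogue of the R package randomForest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees are grown on bootstrap samples; each tree votes for a class,
# and the forest outputs the mode of those votes.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)       # held-out classification accuracy
importances = clf.feature_importances_     # variable-importance ranking
print(accuracy)
print(importances)
```

The `feature_importances_` attribute illustrates the variable-ranking capability mentioned in the abstract: each explanatory variable receives a score reflecting how much it contributes to reducing impurity across the trees.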