Random forest is a specific algorithm, not an omnipotent tool for all datasets
Author of the article: LI Xin-Hai1,2**
Author's Workplace: 1. Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China
Key Words: random forest; partial effect; interaction; multicollinearity; R
Abstract:
Random forest has gained extensive attention since its publication in 2001. It can handle both regression and classification with minimal assumptions (no need for normality, homogeneity of variance, or independence between explanatory variables), so its applications have increased dramatically. Some even use it as an omnipotent tool for all analyses. In fact, random forest is a specific algorithm with clear characteristics. It is an ensemble method that constructs a number of decision trees and relies on local optimization to fit the data. When the data have strong partial effects, random forest often does not fit well. I compared the performance of random forest with multiple regression models, generalized additive models, and an artificial neural network using occurrence data of Cicadidae species. The results showed that, although the predictions of random forest looked fragmented, it outperformed the other three models. Random forest also performed better than linear discriminant analysis for classification. Random forest has its strengths and weaknesses. I suggest using multiple models for data analysis rather than a single “powerful” model.
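The "ensemble of decision trees" idea mentioned above can be sketched in a few lines. The code below is a minimal illustration (not the author's method, and not the full Breiman 2001 algorithm, which also samples a random subset of features at each split): it bootstraps the data, fits one depth-1 decision tree (a stump) per resample, and aggregates predictions by majority vote. The data are hypothetical toy values.

```python
import random
from collections import Counter

def _majority(labels):
    """Most common class label in a list (0 if empty)."""
    return Counter(labels).most_common(1)[0][0] if labels else 0

def fit_stump(xs, ys):
    """Fit a depth-1 decision tree: find the threshold that minimises
    misclassifications when each side predicts its majority class."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = _majority(left), _majority(right)
        errors = sum(y != lm for y in left) + sum(y != rm for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Bagging: draw bootstrap resamples and fit one stump per resample."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict(trees, x):
    """Aggregate the ensemble by majority vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Hypothetical 1-D classification data: class 1 above x = 5, class 0 below.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(xs, ys)
print(predict(forest, 2), predict(forest, 8))
```

Because each tree sees a different bootstrap sample, the individual split thresholds vary, which is one source of the "fragmented" predictions the abstract describes; averaging many such trees is what stabilises the ensemble.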