XGBoost and random forest machine learning models have a dizzying array of parameters for data science practitioners to tune to produce the best possible model. Join Rory Mitchell, NVIDIA engineer and primary author of XGBoost’s GPU gradient boosting algorithms, for a clear discussion about how these parameters impact model performance.
This developer blog post serves as a brief refresher on bias and variance to help explain how various hyperparameters impact ensemble tree methods like gradient boosting and random forest. Gradient boosting models like XGBoost combat both bias and variance by boosting for many rounds at a low learning rate. Random forest models, by contrast, combat both bias and variance via tree depth and the number of trees. However, random forest trees may need to be much deeper than their gradient boosting counterparts. Ultimately, more data reduces both bias and variance.
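As a reminder of what these two terms measure, the standard decomposition of expected squared error is written out below. The notation is an assumption introduced here, not taken from the post: the target is modeled as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training set.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

Hyperparameters and training set size can affect only the first two terms; the noise term is a property of the data.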
The experiments in the blog post use the two-dimensional Rosenbrock function as the primary data source, visualized in the figure below. The team chose this function because it is easy to visualize and because its varied surface poses some challenge to the learning algorithm.
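For readers unfamiliar with the function, here is a minimal sketch of how such a dataset might be generated. It assumes the common constants a = 1 and b = 100 and a uniform sampling range of [-1, 1]; the post does not state which constants or range the experiments used, and the helper `sample_rosenbrock` is hypothetical.

```python
import random


def rosenbrock(x, y, a=1.0, b=100.0):
    """Two-dimensional Rosenbrock function.

    a=1 and b=100 are the commonly used constants (an assumption here).
    With these values the global minimum is 0 at (x, y) = (a, a**2).
    """
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def sample_rosenbrock(n, seed=0, low=-1.0, high=1.0):
    """Draw n points uniformly from [low, high]^2 and evaluate the function.

    Returns (features, targets): a list of (x, y) tuples and a list of
    function values. The sampling range is an illustrative assumption.
    """
    rng = random.Random(seed)
    features, targets = [], []
    for _ in range(n):
        x = rng.uniform(low, high)
        y = rng.uniform(low, high)
        features.append((x, y))
        targets.append(rosenbrock(x, y))
    return features, targets
```

The two coordinates act as the input features and the function value as the regression target, so a model's predictions can be compared against the true surface directly.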
The experiments described in the post all use the XGBoost library as a back-end for building both gradient boosting and random forest models. Code for all experiments can be found in this GitHub repo.
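To illustrate how one library can serve both roles, the sketch below shows two parameter sets for XGBoost's native training API. The specific values (learning rate, depths, round and tree counts, sampling fractions) are illustrative assumptions, not the settings used in the experiments. The random forest configuration follows the approach in XGBoost's documentation: a single boosting round, a learning rate of 1, many parallel trees, and row and column subsampling.

```python
# Gradient boosting: many rounds of shallow trees at a low learning rate.
GBM_PARAMS = {
    "objective": "reg:squarederror",
    "learning_rate": 0.05,  # illustrative value
    "max_depth": 3,         # illustrative value
}
GBM_ROUNDS = 500            # illustrative value

# Random forest: one round that grows many deeper trees in parallel.
RF_PARAMS = {
    "objective": "reg:squarederror",
    "learning_rate": 1.0,      # no shrinkage; trees are averaged, not boosted
    "num_parallel_tree": 100,  # forest size (illustrative value)
    "subsample": 0.8,          # row sampling per tree
    "colsample_bynode": 0.8,   # feature sampling per split
    "max_depth": 10,           # deeper than the boosted trees
}
RF_ROUNDS = 1


def train_both(X, y):
    """Train one model of each kind on the same data.

    X is assumed to be a NumPy array of shape (n_samples, n_features)
    and y an array of n_samples targets. Requires the xgboost package.
    """
    import xgboost as xgb  # imported lazily so the configs load without it

    dtrain = xgb.DMatrix(X, label=y)
    gbm = xgb.train(GBM_PARAMS, dtrain, num_boost_round=GBM_ROUNDS)
    forest = xgb.train(RF_PARAMS, dtrain, num_boost_round=RF_ROUNDS)
    return gbm, forest
```

Keeping the two configurations side by side makes the contrast concrete: the boosted model trades depth for many small corrective steps, while the forest relies on deep, decorrelated trees that are averaged together.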
For more information on the work NVIDIA is doing to accelerate XGBoost on GPUs, visit the new RAPIDS.ai XGBoost project webpage and get started.