Ensemble Learning: Bagging and Boosting

Soya Kim
3 min read · May 2, 2024


In the ever-evolving landscape of machine learning, ensemble methods stand out as powerful tools for improving model performance and robustness. Among these, Bagging and Boosting shine as exemplary techniques reshaping the way we approach predictive modeling.

Bagging

Bagging (Bootstrap Aggregating) operates on the premise of harnessing the collective wisdom of multiple models to enhance predictive accuracy. It begins by repeatedly sampling subsets from the input dataset with replacement, creating diverse bootstrap datasets D1 to DT. Models M1 to MT are then trained on these subsets, and their predictions are aggregated, often through majority voting or averaging, to classify new data points.
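To make this concrete, here is a minimal sketch using scikit-learn's BaggingClassifier. This is not from the lecture material: the dataset is synthetic and the hyperparameters are illustrative. By default, the base learner is a decision tree, and each of the T estimators is fit on a bootstrap sample of the training data.

```python
# Minimal bagging sketch: T models on bootstrap samples, majority-vote prediction.
# Synthetic data and hyperparameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base estimator is a decision tree; bootstrap=True draws
# each training subset D_t with replacement.
bagging = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print("Bagging accuracy:", bagging.score(X_test, y_test))
```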

A Closer Look at Random Forest:

Random Forest, a classic bagging algorithm, epitomizes the essence of ensemble learning. Here, decision trees serve as the underlying models. Bootstrap samples D1 to DT are drawn from the data, and on each one a decision tree is built recursively, choosing splits from a random subset of features at each step until the candidate features (or another stopping criterion) are exhausted. This randomization intentionally fosters model diversity and reduces the cost of growing each tree, yielding a robust set of models.
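A short usage sketch with scikit-learn's RandomForestClassifier (again illustrative, not the course's implementation): each tree is grown on a bootstrap sample, and max_features controls how many features are considered at each split.

```python
# Random Forest: bagged decision trees with random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # T trees, each trained on a bootstrap sample
    max_features="sqrt",  # random subset of features considered per split
    random_state=0,
)
forest.fit(X_train, y_train)
print("Random Forest accuracy:", forest.score(X_test, y_test))
```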

Why Bagging Works:

Bagging’s efficacy stems from its ability to reduce variance without inflating bias. By aggregating predictions from multiple models, variance is mitigated while bias remains essentially unchanged. This reduction mirrors the statistical effect of averaging IID (independent and identically distributed) samples, and it carries over here because bootstrap samples behave approximately like IID draws from the underlying data distribution.
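A rough sketch of the argument, in my own notation rather than the lecture's: if the T models' predictions were exactly IID with variance σ², averaging would divide the variance by T. Bootstrap models are correlated, and the standard decomposition below shows the benefit shrinks as the pairwise correlation ρ grows, which is why fostering diversity among the models matters.

```latex
% Variance of the averaged prediction of T models \hat{f}_1,\dots,\hat{f}_T,
% each with variance \sigma^2. Exactly i.i.d. case:
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{f}_t(x)\right) = \frac{\sigma^2}{T}

% With pairwise correlation \rho between models (bootstrap samples are only
% approximately independent):
\operatorname{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T}\hat{f}_t(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{T}\,\sigma^2
```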

Boosting

In contrast to Bagging’s parallel model construction, Boosting adopts a sequential approach, incrementally refining the ensemble based on previous iterations’ misclassifications. This iterative process begins with the construction of a base model M1 on the initial dataset D1. Each subsequent model, such as M2, is then trained with extra weight on the examples that the models before it misclassified. The final ensemble combines M1 through MT as a weighted vote (or weighted average), embodying a refined understanding of the data.
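Below is a compact sketch of this sequential loop in the style of AdaBoost. The details (decision stumps, the specific weight formulas) follow the standard AdaBoost recipe rather than the lecture's exact notation, and labels are assumed to be in {-1, +1}.

```python
# Sketch of AdaBoost-style sequential boosting for labels in {-1, +1}.
# Each round fits a weak learner on re-weighted data, then up-weights
# the points the current model still gets wrong.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)          # start with uniform sample weights
    models, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this model
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified points
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Final prediction: sign of the weighted vote over all models.
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)
```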

Boosting’s propensity for enhanced accuracy comes at a cost: increased susceptibility to overfitting, particularly on noisy data. AdaBoost (Adaptive Boosting), a prominent Boosting algorithm, exemplifies this trade-off, emphasizing model refinement at the risk of overfitting. Despite this caveat, Boosting remains a potent tool in the data scientist’s arsenal, often delivering excellent predictive performance.
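One common way to keep that risk in check, shown here with scikit-learn's AdaBoostClassifier (the values are illustrative, not a recommendation from the course), is to keep the base learners weak and tune the number of rounds and the learning rate against held-out data.

```python
# AdaBoost with weak base learners (decision stumps by default); limiting
# n_estimators and shrinking learning_rate are common ways to curb overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```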

In conclusion, while Bagging and Boosting each offer distinct approaches to ensemble learning, both hold immense promise in elevating model performance and resilience.

(Resource: CSE 6250 BigData for Healthcare Class material)
#GT #CSE6250 #BigDataforHealthcare #LectureSummary #DataScience #MachineLearning #EnsembleLearning #Bagging #Boosting #RandomForest #AdaBoost
