When should I use each algorithm available in Sparkflows, how does each one work, and what are the most important settings to tune?

The platform offers a range of supervised and unsupervised machine learning algorithms, each optimized for different data types, business goals, and modeling constraints. The comparison below helps you understand when to use each algorithm, how it works at a high level, and which parameters matter most when tuning models in practice.

→ Supervised Learning: Boosting & Forests

| Algorithm | When to use it | Internal logic (the "why") | Top tuning parameters |
| --- | --- | --- | --- |
| XGBoost | Tabular data where accuracy is the #1 priority. | Builds trees sequentially via gradient boosting: each new tree fits the residual errors of the ensemble so far. | `learnRate` (eta): lower values converge more reliably but need more trees. `maxDepth`: keep low (3-6) to avoid overfitting. |
| GBM | When you need specific response distributions (e.g., Tweedie for insurance losses). | Sequential gradient boosting like XGBoost; its main draw is the wider set of loss distributions it supports. | `learnRateAnnealing`: automatically shrinks the step size as training approaches the optimum. |
| DRF | A robust baseline, or when you want strong results with little tuning. | Bagging: grows deep trees in parallel on bootstrap samples and averages them, reducing variance. | `mtries`: the crucial knob. Higher = stronger individual trees; lower = a more diverse forest. |
| GLM | Regulated industries, or when you need a fast linear baseline. | Fits a linear model through a link function, with elastic-net (L1/L2) regularization. | `lambdaSearch`: automates the search for the best penalty strength. `standardize`: essential so the penalty treats all features on the same scale. |
| Deep Learning | High-dimensional data with complex, non-linear interactions. | Multi-layer feed-forward network trained with stochastic gradient descent and backpropagation. | `hidden`: number of layers/neurons per layer. `activation`: RectifierWithDropout gives built-in regularization. |
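To make the "each tree corrects the residuals of the previous ones" idea concrete, here is a minimal pure-Python sketch of gradient boosting for regression, using depth-1 stumps and squared-error loss. It illustrates the general technique only, not Sparkflows or XGBoost internals; all names (`fit_stump`, `boost`, `learn_rate`) are invented for the example.

```python
# Toy gradient boosting for regression: depth-1 stumps fitted to
# residuals, each contribution shrunk by a learning rate (eta).
# Illustrative sketch only -- not Sparkflows/XGBoost code.

def fit_stump(xs, residuals):
    """Find the single threshold on x that best reduces squared error."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue  # degenerate split, skip
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Sequentially fit stumps to the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # shrink each tree's contribution: this is the eta/learnRate knob
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)

# A noisy step function: values near 1 for x <= 4, near 5 for x > 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 0.9, 1.0, 1.2, 4.8, 5.1, 5.0, 4.9]
model = boost(xs, ys)
```

Because every stump's contribution is multiplied by `learn_rate`, lowering it shrinks each correction step, which is exactly why a lower eta converges more smoothly but needs more trees to close the same gap.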

→ Unsupervised Learning: Anomalies & Structure

| Algorithm | When to use it | Internal logic (the "why") | Top tuning parameters |
| --- | --- | --- | --- |
| Isolation Forest | Rare-event detection (fraud, network intrusion). | Isolates points by recursive random partitioning; anomalies are isolated in fewer splits than normal points. | `sampleSize`: keep small (the original paper's default is 256); large samples can hide anomalies through swamping and masking. |
| K-Means | Customer segmentation or general data grouping. | Iteratively minimizes the within-cluster sum of squares (WCSS) around moving centroids. | `standardize`: must be on; Euclidean distances are meaningless when features sit on different scales. |
| GLRM | Missing-value imputation or dimensionality reduction on mixed-type data. | Low-rank matrix factorization: roughly PCA generalized to heterogeneous (numeric, categorical, missing) data. | `imputeOriginal`: use it to fill gaps in the raw data with the model's reconstructed estimates. |
| PCA | Compressing numeric features for faster downstream modeling. | Projects data onto orthogonal axes ordered by explained variance. | `k`: number of components; choose enough to capture 90%+ of total variance. |
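The `standardize` warning for K-Means is easy to demonstrate. Below is a minimal pure-Python sketch of Lloyd's algorithm (again illustrative only, not Sparkflows code; all names are invented) run on two groups that are cleanly separated in a small-scale feature but swamped by a large-scale noisy feature. Without scaling, the big feature dominates the distance and the clustering splits on noise.

```python
# Why `standardize` matters for k-means: a toy Lloyd's-algorithm sketch.
# Illustrative only -- not Sparkflows code.

def standardize(points):
    """Scale each feature to zero mean and unit variance."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    stds = [(sum((p[d] - means[d]) ** 2 for p in points) / n) ** 0.5 or 1.0
            for d in range(dims)]
    return [tuple((p[d] - means[d]) / stds[d] for d in range(dims))
            for p in points]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels

# Feature 1: large-scale noise (think raw salary). Feature 2: the real
# cluster signal, near 1.0 for group A and near 5.0 for group B.
points = [(900, 1.0), (850, 5.0), (100, 1.2), (150, 5.2),
          (500, 0.8), (450, 4.9), (300, 1.1), (250, 5.1)]

labels_raw = kmeans(points, k=2)               # distances ruled by feature 1
labels_std = kmeans(standardize(points), k=2)  # feature 2 can now separate
```

On the standardized data the even-index (group A) and odd-index (group B) points land in different clusters; on the raw data the clustering splits on the noisy large-scale feature instead.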

There is no single “best” algorithm—the right choice depends on your data, accuracy requirements, interpretability needs, and operational constraints. Start with simpler or more robust models as baselines, then move to more complex algorithms as needed, focusing first on the key tuning parameters highlighted above for the biggest performance gains.