When should I use each algorithm available in Sparkflows, how does each one work, and what are the most important settings to tune?

The platform offers a range of supervised and unsupervised machine learning algorithms, each optimized for different data types, business goals, and modeling constraints. The comparison below helps you understand when to use each algorithm, how it works at a high level, and which parameters matter most when tuning models in practice.

→ Supervised Learning: Boosting & Forests

| Algorithm | When to use it | Internal logic (the "why") | Top tuning parameters |
| --- | --- | --- | --- |
| XGBoost | Tabular data where accuracy is the #1 priority. | Builds trees sequentially via gradient boosting: each new tree fits the residual errors of the ensemble so far. | `learnRate` (eta): lower values converge more reliably but need more trees. `maxDepth`: keep low (3-6) to avoid overfitting. |
| GBM | When you need specific response distributions (e.g., Tweedie for insurance losses). | Sequential gradient boosting like XGBoost; its main draw is the wider set of loss distributions it supports. | `learnRateAnnealing`: automatically shrinks the step size as training approaches the optimum. |
| DRF | A robust baseline, or when you want strong results with little tuning. | Bagging: grows deep trees in parallel on bootstrap samples and averages them, reducing variance. | `mtries`: the crucial knob. Higher = stronger individual trees; lower = a more diverse forest. |
| GLM | Regulated industries, or when you need a fast linear baseline. | Fits a linear model through a link function, with elastic-net (L1/L2) regularization. | `lambdaSearch`: automates the search for the best penalty strength. `standardize`: essential so the penalty treats all features on the same scale. |
| Deep Learning | High-dimensional data with complex, non-linear interactions. | Multi-layer feed-forward network trained with stochastic gradient descent and backpropagation. | `hidden`: number of layers/neurons per layer. `activation`: RectifierWithDropout gives built-in regularization. |
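To make the "each tree corrects the residuals of the previous ones" idea concrete, here is a minimal pure-Python sketch of gradient boosting for regression, using depth-1 stumps and squared-error loss. It illustrates the general technique only, not Sparkflows or XGBoost internals; all names (`fit_stump`, `boost`, `learn_rate`) are invented for the example.

```python
# Toy gradient boosting for regression: depth-1 stumps fitted to
# residuals, each contribution shrunk by a learning rate (eta).
# Illustrative sketch only -- not Sparkflows/XGBoost code.

def fit_stump(xs, residuals):
    """Find the single threshold on x that best reduces squared error."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        if not left or not right:
            continue  # degenerate split, skip
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Sequentially fit stumps to the current residuals."""
    base = sum(ys) / len(ys)          # initial prediction: the mean
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # shrink each tree's contribution: this is the eta/learnRate knob
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)

# A noisy step function: values near 1 for x <= 4, near 5 for x > 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.1, 0.9, 1.0, 1.2, 4.8, 5.1, 5.0, 4.9]
model = boost(xs, ys)
```

Because every stump's contribution is multiplied by `learn_rate`, lowering it shrinks each correction step, which is exactly why a lower eta converges more smoothly but needs more trees to close the same gap.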

→ Unsupervised Learning: Anomalies & Structure

| Algorithm | When to use it | Internal logic (the "why") | Top tuning parameters |
| --- | --- | --- | --- |
| Isolation Forest | Rare-event detection (fraud, network intrusion). | Isolates points by recursive random partitioning; anomalies are isolated in fewer splits than normal points. | `sampleSize`: keep small (the original paper's default is 256); large samples can hide anomalies through swamping and masking. |
| K-Means | Customer segmentation or general data grouping. | Iteratively minimizes the within-cluster sum of squares (WCSS) around moving centroids. | `standardize`: must be on; Euclidean distances are meaningless when features sit on different scales. |
| GLRM | Missing-value imputation or dimensionality reduction on mixed-type data. | Low-rank matrix factorization: roughly PCA generalized to heterogeneous (numeric, categorical, missing) data. | `imputeOriginal`: use it to fill gaps in the raw data with the model's reconstructed estimates. |
| PCA | Compressing numeric features for faster downstream modeling. | Projects data onto orthogonal axes ordered by explained variance. | `k`: number of components; choose enough to capture 90%+ of total variance. |
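The `standardize` warning for K-Means is easy to demonstrate. Below is a minimal pure-Python sketch of Lloyd's algorithm (again illustrative only, not Sparkflows code; all names are invented) run on two groups that are cleanly separated in a small-scale feature but swamped by a large-scale noisy feature. Without scaling, the big feature dominates the distance and the clustering splits on noise.

```python
# Why `standardize` matters for k-means: a toy Lloyd's-algorithm sketch.
# Illustrative only -- not Sparkflows code.

def standardize(points):
    """Scale each feature to zero mean and unit variance."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    stds = [(sum((p[d] - means[d]) ** 2 for p in points) / n) ** 0.5 or 1.0
            for d in range(dims)]
    return [tuple((p[d] - means[d]) / stds[d] for d in range(dims))
            for p in points]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[c])))
                  for p in points]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels

# Feature 1: large-scale noise (think raw salary). Feature 2: the real
# cluster signal, near 1.0 for group A and near 5.0 for group B.
points = [(900, 1.0), (850, 5.0), (100, 1.2), (150, 5.2),
          (500, 0.8), (450, 4.9), (300, 1.1), (250, 5.1)]

labels_raw = kmeans(points, k=2)               # distances ruled by feature 1
labels_std = kmeans(standardize(points), k=2)  # feature 2 can now separate
```

On the standardized data the even-index (group A) and odd-index (group B) points land in different clusters; on the raw data the clustering splits on the noisy large-scale feature instead.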

There is no single “best” algorithm—the right choice depends on your data, accuracy requirements, interpretability needs, and operational constraints. Start with simpler or more robust models as baselines, then move to more complex algorithms as needed, focusing first on the key tuning parameters highlighted above for the biggest performance gains.