Sometimes it isn’t enough to know which features are important overall; you must be able to justify individual predictions. The H2O Scoring Node in Sparkflows enables this by calculating Shapley (SHAP) values at the row level.
1. Visualizing Global Impact (The Graphs)
When you enable withContribution, the node automatically processes the model’s output to create two high-level visual guides:
- SHAP Feature Importance: This chart ranks features by their mean absolute impact. It tells you, on average, which variables (e.g., Credit Score or Annual Income) are the most powerful drivers of the model’s decisions.
- SHAP Summary Plot: This visualization focuses on the direction of the influence. It shows whether a feature generally pushes a prediction “Up” (positive influence) or “Down” (negative influence).
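The two charts above are simple aggregations of the per-row SHAP values. As a rough sketch (the `SHAP_*` column names here are hypothetical; actual names depend on your model's features), the global views can be derived like this:

```python
import pandas as pd

# Toy stand-in for the per-row contributions the node attaches to scored rows.
shap = pd.DataFrame({
    "SHAP_CreditScore":  [0.20, 0.35, -0.10],
    "SHAP_AnnualIncome": [-0.05, 0.02, -0.08],
})

# SHAP Feature Importance: rank features by mean absolute contribution.
importance = shap.abs().mean().sort_values(ascending=False)

# SHAP Summary direction: the mean signed value shows whether a feature
# tends to push predictions up (positive) or down (negative).
direction = shap.mean()

print(importance)
print(direction)
```

The importance chart deliberately uses the absolute value, so a feature that strongly pushes some rows up and others down still ranks as influential; the summary plot recovers the sign that the absolute value discards.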
2. Saving Row-Level Data (The Table)
To allow end-users to see the exact contribution for every single row, the node uses the Save Shapley values property.
- How it works: If you provide a storage path (e.g., an S3 bucket or HDFS directory), the node saves the entire processed DataFrame, including the raw contributions struct, to that location.
- Data Structure: Inside the saved table (often in Parquet format), each feature has its own numeric SHAP value for every row, alongside a Bias term.
- Auditability: This “per-row” table is what allows an analyst to say: “For Customer X, their ‘High Debt’ increased the probability of default by 0.15, while their ‘Stable Employment’ decreased it by 0.05.”
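That kind of audit sentence is straightforward to generate from a saved row. A minimal sketch, assuming hypothetical `BiasTerm` and `SHAP_*` column names (in practice you would load the table the node wrote, e.g. with `spark.read.parquet(path)`, rather than build a row by hand):

```python
import pandas as pd

# One illustrative row from the saved contributions table.
row = pd.Series({
    "customer_id": "X",
    "BiasTerm": 0.30,
    "SHAP_HighDebt": 0.15,
    "SHAP_StableEmployment": -0.05,
})

def explain(row):
    """Describe each feature's contribution, largest magnitude first."""
    contribs = row.filter(like="SHAP_").astype(float)
    ordered = contribs.reindex(contribs.abs().sort_values(ascending=False).index)
    parts = []
    for name, value in ordered.items():
        feature = name.removeprefix("SHAP_")
        verb = "increased" if value > 0 else "decreased"
        parts.append(f"{feature} {verb} the prediction by {abs(value):.2f}")
    return f"Customer {row['customer_id']}: " + "; ".join(parts)

print(explain(row))
# e.g. "Customer X: HighDebt increased the prediction by 0.15; ..."
```

Because the SHAP values for a row sum (together with the bias term) to the model's raw output for that row, an explanation built this way accounts for the full prediction rather than cherry-picking features.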
The goal of machine learning in Sparkflows isn’t just to generate a prediction, but to provide actionable insights. Using features like Shapley values through the withContribution setting allows you to move beyond “black-box” AI, offering your stakeholders clear, row-level explanations for every decision the model makes.