The core challenge
Excel files often drift:
- Extra columns
- Missing columns
- Slight type differences
Sparkflows’ strategy
Sparkflows establishes a reference schema from the first successfully read dataset.
Then:
- Subsequent datasets are aligned against it
- Union strategy depends on schema mode
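As a rough illustration of "first successfully read dataset", the sketch below walks a list of readers and keeps the schema of the first one that does not fail. It is a plain-Python model, not Sparkflows code: the `Schema` alias, `first_readable_schema`, and the three sample reader functions are hypothetical names, and skipping unreadable files is an assumption drawn from the wording above.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: ordered (column name, type name) pairs.
Schema = List[Tuple[str, str]]


def first_readable_schema(readers: List[Callable[[], Schema]]) -> Optional[Schema]:
    """Return the schema of the first reader that succeeds, or None if all fail."""
    for read in readers:
        try:
            return read()
        except Exception:
            # Assumption: an unreadable file is skipped, not treated as fatal.
            continue
    return None


def corrupt_file() -> Schema:
    # Stands in for a workbook that cannot be parsed.
    raise IOError("cannot parse workbook")


def good_file() -> Schema:
    return [("id", "int"), ("name", "string")]


def later_file() -> Schema:
    # Has drifted: carries an extra "region" column.
    return [("id", "int"), ("name", "string"), ("region", "string")]


reference = first_readable_schema([corrupt_file, good_file, later_file])
print(reference)
```

In this model the corrupt file is passed over, so the reference comes from `good_file`, and `later_file` would then be aligned against it.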
Behavior by mode
Without Enforce Schema
- Uses unionByName(allowMissingColumns = true)
- Missing columns are filled with nulls
- Extra columns are tolerated
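The effect of a by-name union that allows missing columns can be modelled in a few lines of plain Python. This is a sketch of the semantics only, with rows as dictionaries and `None` standing in for null. The function `union_by_name_allow_missing` is a hypothetical stand-in and does not call Spark.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def union_by_name_allow_missing(left: List[Row], right: List[Row]) -> List[Row]:
    """Union two row lists by column name, filling absent columns with None."""
    # Column order: names in first-seen order, left rows before right rows.
    columns: List[str] = []
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)
    # Rebuild every row over the full column list; missing values become None.
    return [{name: row.get(name) for name in columns} for row in left + right]


reference_rows = [{"id": 1, "name": "a"}]
drifted_rows = [{"id": 2, "region": "EU"}]  # missing "name", extra "region"

merged = union_by_name_allow_missing(reference_rows, drifted_rows)
for row in merged:
    print(row)
```

Here the drifted row gets `None` for the missing `name`, and the reference row gets `None` for the extra `region`, so both rows end up with the same three columns.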
With Enforce Schema
- Uses strict union
- Schema mismatches are rejected
- Guarantees that every dataset in the result matches the reference schema exactly
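Spark's plain `union` resolves columns by position rather than by name, so one plausible way to get the rejection described above is to compare each schema with the reference before the union runs. The sketch below models that check in plain Python. Whether Sparkflows does it this way is an assumption, and `SchemaMismatchError` and `check_against_reference` are hypothetical names.

```python
from typing import List, Tuple

# Hypothetical representation: ordered (column name, type name) pairs.
Schema = List[Tuple[str, str]]


class SchemaMismatchError(ValueError):
    """Hypothetical error raised when a dataset does not match the reference."""


def check_against_reference(reference: Schema, candidate: Schema) -> None:
    """Raise unless candidate has the same columns, types, and order as reference."""
    if candidate != reference:
        missing = [col for col in reference if col not in candidate]
        extra = [col for col in candidate if col not in reference]
        # Both lists are empty when only the column order differs.
        raise SchemaMismatchError(f"missing={missing}, extra={extra}")


reference = [("id", "int"), ("name", "string")]

# An identical schema passes silently.
check_against_reference(reference, [("id", "int"), ("name", "string")])

# A drifted schema is rejected.
try:
    check_against_reference(reference, [("id", "int"), ("region", "string")])
    rejected = False
except SchemaMismatchError as err:
    rejected = True
    print(f"rejected: {err}")
```

Comparing the ordered pairs directly means a reordered or retyped column is rejected as well, which is the strictest reading of "schema mismatch".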
Result
Predictable schemas, even with imperfect Excel inputs.