How does Sparkflows guarantee deterministic schema across many Excel files?

The core challenge

Excel files often drift:

  • Extra columns

  • Missing columns

  • Slight type differences (for example, a column that reads as numeric in one file and as text in another)


Sparkflows’ strategy

Sparkflows establishes a reference schema from the first successfully read dataset.

Then:

  • Subsequent datasets are aligned against it

  • The union strategy depends on whether Enforce Schema is enabled
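
The reference-schema idea can be sketched in plain Python. This is a minimal illustration, not Sparkflows' actual code: the names pick_reference_schema, diff_against_reference and fake_read_schema are hypothetical, schemas are modeled as simple name-to-type dictionaries, and it assumes that a file which fails to read is skipped rather than aborting the run.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical representation: column name -> type name, in column order.
Schema = Dict[str, str]


def pick_reference_schema(
    paths: Iterable[str],
    read_schema: Callable[[str], Schema],
) -> Tuple[Optional[str], Optional[Schema], List[str]]:
    """Return the path and schema of the first file that reads cleanly,
    plus the list of paths that failed before it."""
    skipped: List[str] = []
    for path in paths:
        try:
            schema = read_schema(path)
        except Exception:
            # Assumption: an unreadable file is skipped, not fatal.
            skipped.append(path)
            continue
        return path, schema, skipped
    return None, None, skipped


def diff_against_reference(reference: Schema, candidate: Schema) -> Dict[str, List[str]]:
    """Report how a later dataset drifts from the reference schema."""
    return {
        "missing": [c for c in reference if c not in candidate],
        "extra": [c for c in candidate if c not in reference],
        "type_mismatch": [
            c for c in reference if c in candidate and candidate[c] != reference[c]
        ],
    }


# Made-up stand-in for real Excel parsing; None simulates an unreadable file.
FAKE_FILES = {
    "jan.xlsx": None,
    "feb.xlsx": {"id": "int", "name": "string", "amount": "double"},
    "mar.xlsx": {"id": "string", "amount": "double", "region": "string"},
}


def fake_read_schema(path: str) -> Schema:
    schema = FAKE_FILES[path]
    if schema is None:
        raise IOError(f"cannot read {path}")
    return schema


ref_path, reference, skipped = pick_reference_schema(list(FAKE_FILES), fake_read_schema)
drift = diff_against_reference(reference, fake_read_schema("mar.xlsx"))
```

Here the reference comes from feb.xlsx because jan.xlsx cannot be read, and the drift report for mar.xlsx covers all three kinds of drift listed earlier: a missing column, an extra column, and a type difference.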


Behavior by mode

Without Enforce Schema

  • Uses Spark's unionByName(allowMissingColumns = true), which matches columns by name rather than by position

  • Missing columns are filled with nulls

  • Extra columns are tolerated
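
To make the permissive behavior concrete, here is a small plain-Python model of a by-name union that allows missing columns. It is a sketch only: rows are modeled as dictionaries, None stands in for null, and the function name union_by_name_allow_missing is hypothetical. It mimics the way Spark's unionByName with allowMissingColumns enabled keeps the left side's columns first and appends columns found only on the right; it is not Sparkflows' implementation.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def union_by_name_allow_missing(
    left_cols: List[str],
    left_rows: List[Row],
    right_cols: List[str],
    right_rows: List[Row],
) -> Tuple[List[str], List[Row]]:
    # Keep the left column order, then append columns only the right side has.
    out_cols = left_cols + [c for c in right_cols if c not in left_cols]
    # dict.get returns None for an absent key, which plays the role of null.
    out_rows = [{c: row.get(c) for c in out_cols} for row in left_rows + right_rows]
    return out_cols, out_rows


cols, rows = union_by_name_allow_missing(
    ["id", "name", "amount"],
    [{"id": 1, "name": "a", "amount": 10.0}],
    ["id", "amount", "region"],
    [{"id": 2, "amount": 5.5, "region": "EU"}],
)
```

In this model the first row gets a null for region and the second row gets a null for name, so both inputs survive the union and the combined column list is the union of the two.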

With Enforce Schema

  • Uses strict union

  • Schema mismatches are rejected

  • Guarantees that the output schema is identical to the reference schema
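
The strict mode can be modeled the same way. The sketch below rests on two assumptions that the text above does not spell out: that "strict" means column names, order and types must all equal the reference, and that a rejection surfaces as an error. The names strict_union and SchemaMismatchError are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Schema = List[Tuple[str, str]]  # ordered (column name, type name) pairs
Row = Dict[str, Any]


class SchemaMismatchError(ValueError):
    """Raised when a dataset does not match the reference schema."""


def strict_union(reference: Schema, datasets: List[Tuple[Schema, List[Row]]]) -> List[Row]:
    combined: List[Row] = []
    for index, (schema, rows) in enumerate(datasets):
        # Assumption: any difference in names, order or types is a mismatch.
        if schema != reference:
            raise SchemaMismatchError(
                f"dataset {index} has schema {schema}, expected {reference}"
            )
        combined.extend(rows)
    return combined


reference = [("id", "int"), ("amount", "double")]
good = (reference, [{"id": 1, "amount": 10.0}])
drifted = (
    [("id", "int"), ("amount", "double"), ("region", "string")],
    [{"id": 2, "amount": 5.5, "region": "EU"}],
)

merged = strict_union(reference, [good, good])

try:
    strict_union(reference, [good, drifted])
    rejected = False
except SchemaMismatchError:
    rejected = True
```

The trade-off against the permissive mode is visible here: the drifted dataset with an extra region column is refused outright instead of being padded with nulls.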


Result

Predictable schemas, even with imperfect Excel inputs.