Why does recursive directory search slow down large pipelines?

What users observe

Pipelines that read from large directory trees run noticeably slower when recursive directory search is enabled.


Why this happens

Recursive search:

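  • Traverses every subdirectory under the starting path

  • Evaluates every file it finds against the search criteria

  • Performs filesystem metadata calls (directory listings, file status lookups) along the way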

On distributed storage (S3, HDFS, ADLS):

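  • These operations are expensive: each one is a network request to the storage service (or, on HDFS, to the NameNode) rather than a local disk lookup

  • Latency accumulates quickly, because the number of requests grows with every directory and file visited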
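
To make the traversal cost concrete, the sketch below counts the operations a recursive walk performs on a small temporary tree. It uses the local filesystem and Python's os.walk as a stand-in for whatever traversal a given pipeline tool performs; the folder and file names (part=0, file_0.csv) and the helper functions are hypothetical.

```python
import os
import tempfile

def count_traversal_ops(root):
    """Walk root recursively and count the operations involved."""
    list_calls = 0   # one directory listing per directory visited
    files_seen = 0   # one evaluation per file found
    for dirpath, dirnames, filenames in os.walk(root):
        list_calls += 1
        files_seen += len(filenames)
    return list_calls, files_seen

def build_tree(root, n_dirs, files_per_dir):
    """Create n_dirs subdirectories, each holding files_per_dir empty files."""
    for d in range(n_dirs):
        sub = os.path.join(root, f"part={d}")
        os.makedirs(sub)
        for f in range(files_per_dir):
            # Files are empty on purpose: size plays no role in the count.
            open(os.path.join(sub, f"file_{f}.csv"), "w").close()

with tempfile.TemporaryDirectory() as root:
    build_tree(root, n_dirs=20, files_per_dir=5)
    list_calls, files_seen = count_traversal_ops(root)
    print(list_calls, files_seen)  # 21 100 (root plus 20 subdirectories)
```

On a local disk these 21 listings are nearly free. On remote storage each one would typically be a separate request, which is where the accumulated latency comes from.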


Best practices

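  • Avoid recursion unless required

  • Use controlled, shallow folder structures (for example, one folder per date or partition) so the files you need sit at a known depth

  • Prefer explicit paths or glob patterns that name the directories to read, rather than scanning an entire tree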
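
The difference between a recursive search and a targeted glob can be sketched with Python's pathlib on a local temporary directory. The layout (dated folders plus a deep archive) and the helper names are hypothetical, and the recursion or pattern options of a specific pipeline tool may be spelled differently.

```python
import tempfile
from pathlib import Path

def make_layout(root: Path) -> None:
    """Build a small hypothetical layout: two dated folders plus a deep archive."""
    for rel in ["2024-01-01/data.csv",
                "2024-01-02/data.csv",
                "archive/2019/old/a/b/legacy.csv"]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

def recursive_search(root: Path) -> list[str]:
    """Recursive search: descends into every subdirectory under root."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))

def targeted_search(root: Path, pattern: str) -> list[str]:
    """Fixed-depth glob: only directories matching the pattern are listed."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    make_layout(root)
    # Finds all three files, including the one buried in the archive.
    print(recursive_search(root))
    # Finds only the two January files and never enters archive/.
    print(targeted_search(root, "2024-01-*/*.csv"))
```

The targeted pattern works only because the layout is predictable, which is the practical reason for keeping folder structures controlled.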


Design principle

Filesystem traversal cost scales with the number of files and directories visited, not with file size.
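
A back-of-the-envelope model of this principle is sketched below. It makes simplifying assumptions that are not measurements: one listing call per directory, one metadata call per file (a worst case, since some storage APIs return file metadata with the listing), calls made one at a time, and a hypothetical latency of 20 ms per call.

```python
def metadata_calls(n_dirs: int, n_files: int) -> int:
    """Assumed worst case: one listing per directory plus one lookup per file."""
    return n_dirs + n_files

def estimate_traversal_seconds(n_dirs: int, n_files: int, latency_ms: int) -> float:
    """Sequential traversal time under the assumptions above.

    File size does not appear anywhere in this formula.
    """
    return metadata_calls(n_dirs, n_files) * latency_ms / 1000

# Hypothetical tree: 1,000 directories, 100,000 files, 20 ms per call.
print(metadata_calls(1_000, 100_000))                  # 101000
print(estimate_traversal_seconds(1_000, 100_000, 20))  # 2020.0
```

Under these assumptions the traversal alone takes over half an hour before any data is read, and the estimate is the same whether each file holds 1 KB or 1 GB.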