To use a spaCy model package in a PySpark node, follow the steps below:
- You must load the model object (e.g., nlp = spacy.load(…)) before you start iterating over your rows.
- Standard Method: Use the model name: spacy.load("en_core_web_sm").
- Custom Path: If using a local directory (for restricted environments), point directly to the folder: spacy.load("/path/to/model/directory").
- Best Practice: Initialize nlp inside your preprocessing function or immediately before the transformation loop within myFn. This ensures the model is loaded in the worker process where the data processing actually happens, rather than being serialized and shipped from the driver.
- You can enable or disable specific spaCy pipeline components. If you only need tokenization and lemmatization, disable the parser and named entity recognizer (NER); this significantly speeds up each nlp() call.
Example: nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
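The steps above can be sketched as a single per-partition function. This is a minimal illustration, not a definitive implementation: the function name `process_partition` and the `loader` parameter are hypothetical (the `loader` hook exists only so the pattern can be exercised without spaCy installed), and the model name follows the example above.

```python
# Pattern described above: load the spaCy model once per Spark partition
# (not once per row, and not on the driver), with the parser and NER
# disabled so each nlp() call is faster.

def process_partition(rows, model_name="en_core_web_sm", loader=None):
    """Yield (text, lemmas) pairs for every row in one partition.

    The model is loaded exactly once per partition, inside the worker
    process, so it is never pickled and shipped from the driver.
    `loader` is an illustrative injection point for testing; by default
    the function imports spaCy and loads the named model.
    """
    if loader is None:
        import spacy  # imported on the worker, not the driver
        nlp = spacy.load(model_name, disable=["parser", "ner"])
    else:
        nlp = loader(model_name)
    for text in rows:
        doc = nlp(text)
        yield text, [tok.lemma_ for tok in doc]

# Typical use from a PySpark RDD of strings (not executed here):
#   lemmas = text_rdd.mapPartitions(process_partition)
```

Passing the function to mapPartitions (rather than map) is what gives you one model load per partition instead of one per row.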