The following nodes in Sparkflows help with Data Profiling:
- Correlation - It displays the relationship between dependent and independent features. The relationships are plotted as a heatmap.
- Summary - It calculates and prints summary statistics for each feature, such as Count, Mean, Min and Max.
- ML Model nodes - Various ML Model nodes also give insight into the importance of each feature.
- Flag Outlier - It flags outliers in the dataset.
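The profiling steps above can be sketched in plain Python. This is only an illustration of the underlying ideas, not how the Sparkflows nodes are implemented: the function names `summarize`, `pearson` and `flag_outliers` are hypothetical, and the interquartile-range rule used for outliers is an assumption, since the source does not say which rule the Flag Outlier node applies.

```python
import math
import statistics


def summarize(values):
    """Return basic summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical sample data for illustration only.
area = [50, 60, 70, 80, 90]
price = [100, 120, 140, 160, 180]
readings = [10, 11, 12, 13, 12, 11, 10, 12, 95]

stats = summarize(area)
corr = pearson(area, price)
flags = flag_outliers(readings)
```

A correlation heatmap is simply this coefficient computed for every pair of columns and drawn as a colored grid.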
The following nodes in Sparkflows help with Data Cleansing:
- Imputing - There are various imputing nodes available to handle missing values. With these nodes, missing values can be replaced with either a Constant or the Mean, Median or Mode of the column.
- Dedup - It resolves duplicate records that refer to the same entity.
- Drop Duplicate Rows - It removes duplicate rows from the dataset.
- Null Value handling - There are various nodes to handle null values in the dataset.
- Find And Replace - There are various nodes to remove unwanted characters, replace one string pattern with another, and so on.
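As a rough illustration of what these cleansing steps do, here is a minimal plain-Python sketch. It does not reflect the internals of the Sparkflows nodes: the helper names `impute`, `drop_duplicate_rows` and `find_and_replace` are hypothetical, and missing values are assumed to be represented as `None`.

```python
import re
import statistics


def impute(values, strategy="mean", constant=0):
    """Replace None entries with a constant or the mean/median/mode of the rest."""
    present = [v for v in values if v is not None]
    if strategy == "constant":
        fill = constant
    else:
        aggregators = {
            "mean": statistics.mean,
            "median": statistics.median,
            "mode": statistics.mode,
        }
        fill = aggregators[strategy](present)
    return [fill if v is None else v for v in values]


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def find_and_replace(text, pattern, replacement):
    """Replace every match of a regular-expression pattern in a string."""
    return re.sub(pattern, replacement, text)


# Hypothetical sample data for illustration only.
imputed = impute([1, None, 3])
filled = impute([1, None, 3], strategy="constant", constant=-1)
rows = drop_duplicate_rows([("a", 1), ("b", 2), ("a", 1)])
cleaned = find_and_replace("ID#001!", r"[^A-Za-z0-9]", "")
```

Note that dropping exact duplicate rows is simpler than entity deduplication, which usually needs fuzzy matching to decide that two differently written records describe the same entity.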
The following nodes in Sparkflows help with Feature Engineering:
- String Indexer - It encodes String categorical data to numeric values.
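- Min Max Scaler And Standard Scaler - Min Max Scaler rescales each feature to a fixed range (typically 0 to 1). Standard Scaler standardizes each feature to unit standard deviation and, optionally, zero mean.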
- Feature Extraction nodes
- Feature Transformation nodes
- Feature Selection nodes
- Splitting Dataset nodes
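The indexing and scaling transformations listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the Sparkflows implementation: the function names are hypothetical, labels are assumed to be indexed by descending frequency with ties broken alphabetically, and the standard scaler here always centers on the mean and uses the sample standard deviation.

```python
import statistics
from collections import Counter


def string_index(labels):
    """Map each category to a numeric index, most frequent category first."""
    counts = Counter(labels)
    # Assumed ordering: descending frequency, ties broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[label] for label in labels]


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly into the range [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has no spread; map it to the middle of the range.
        return [(low + high) / 2 for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def standard_scale(values):
    """Center values on the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical sample data for illustration only.
indexed = string_index(["red", "blue", "red", "green"])
scaled = min_max_scale([10, 20, 30])
standardized = standard_scale([1, 2, 3])
```

Min-max scaling preserves the shape of the distribution inside a bounded range, while standardization is usually preferred when a model assumes features on comparable, roughly centered scales.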