Feature Engineering
Common feature engineering operations
- Handling Missing Values:
  - (MNAR) Missing Not At Random: missing because of the value itself
  - (MAR) Missing At Random: missing not because of the value itself but because of another observed variable
  - (MCAR) Missing Completely At Random: no pattern in when the value is missing
- Feature Scaling
- Discretization
- Encoding Categorical Features
- Hashing Trick
- Feature Crossing
- Discrete and continuous positional embeddings
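A few of the operations above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`min_max_scale`, `bucketize`, `hash_bucket`, `cross`) are hypothetical, MD5 is used merely as a convenient stable hash, and a real pipeline would more likely rely on a library.

```python
import bisect
import hashlib

def min_max_scale(values):
    """Feature scaling: map values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def bucketize(value, boundaries):
    """Discretization: return the index of the bucket that contains value.

    boundaries must be sorted; bucket i covers [boundaries[i-1], boundaries[i]).
    """
    return bisect.bisect_right(boundaries, value)

def hash_bucket(category, num_buckets):
    """Hashing trick: map any category string to one of num_buckets indices.

    Unseen categories still get a valid index, at the cost of collisions.
    """
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def cross(*categories):
    """Feature crossing: combine several categorical values into one feature."""
    return "_x_".join(categories)

scaled = min_max_scale([10, 20, 30])           # [0.0, 0.5, 1.0]
age_bucket = bucketize(25, [18, 35, 65])       # 1
crossed = cross("US", "mobile")                # "US_x_mobile"
crossed_index = hash_bucket(crossed, 1000)     # some index in [0, 1000)
```

Hashing the crossed feature, as in the last line, keeps the feature space bounded even when the cross produces many distinct combinations.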
Data Leakage
Common causes:
- splitting time-correlated data randomly and not by time
- scaling before splitting
- filling in missing data with stats from the test split
- poor handling of data duplication before splitting
- group leakage
- leakage from data generation process
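As a rough sketch of how the first three causes can be avoided, the snippet below splits by time first and only then derives imputation and scaling statistics from the train split. The row layout (dicts with a `ts` timestamp and a feature `x`) and the helper names `time_split`, `fit_stats`, and `apply_stats` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def time_split(rows, train_frac=0.8):
    """Split by timestamp: the oldest rows train, the newest rows test."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

def fit_stats(train_rows, key):
    """Compute imputation and scaling statistics from the train split only."""
    observed = [r[key] for r in train_rows if r[key] is not None]
    mu = mean(observed)
    sigma = pstdev(observed) or 1.0  # avoid dividing by zero on constant columns
    return mu, sigma

def apply_stats(rows, key, mu, sigma):
    """Fill missing values with the train mean, then standardize."""
    out = []
    for r in rows:
        value = mu if r[key] is None else r[key]
        out.append((value - mu) / sigma)
    return out

# Toy data: ten rows in reverse time order, one missing value.
rows = [{"ts": t, "x": float(t)} for t in range(10)]
rows[3]["x"] = None
rows.reverse()

train, test = time_split(rows)
mu, sigma = fit_stats(train, "x")            # the test split is never inspected here
train_x = apply_stats(train, "x", mu, sigma)
test_x = apply_stats(test, "x", mu, sigma)   # reuses the train statistics
```

The point of the design is the order of operations: split, fit on train, then apply the same statistics to both splits.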
Detecting data leakage:
- correlation with label (alone or grouped)
- ablation studies
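Both checks can be sketched in a few lines. This is a minimal illustration, not a full method: it covers only single-feature correlation (not grouped features), the 0.95 threshold is an arbitrary assumption, and `toy_score` is a hypothetical stand-in for a function that trains a model on the given features and returns its validation score.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)

def suspicious_features(columns, label, threshold=0.95):
    """Flag features whose correlation with the label is unusually high."""
    return [name for name, values in columns.items()
            if abs(pearson(values, label)) >= threshold]

def ablation_drops(feature_names, train_and_score):
    """Ablation study: score drop when each feature is removed in turn."""
    base = train_and_score(list(feature_names))
    drops = {}
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        drops[name] = base - train_and_score(kept)
    return drops

label = [0, 1, 0, 1, 1, 0]
columns = {
    "leaky": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],  # copies the label
    "noise": [3.0, 1.0, 2.0, 2.0, 3.0, 1.0],
}
flagged = suspicious_features(columns, label)

def toy_score(features):
    # Hypothetical stand-in for "train a model, return validation accuracy".
    return 0.6 + (0.35 if "leaky" in features else 0.0) \
               + (0.01 if "noise" in features else 0.0)

drops = ablation_drops(["leaky", "noise"], toy_score)
```

A feature that is both flagged by the correlation check and responsible for a very large score drop when removed is a candidate for closer inspection.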
Feature Importance
Example for traditional ML: XGBoost's built-in feature importance. For model-agnostic methods, look into SHAP (InterpretML).
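To make the model-agnostic idea concrete without depending on either library, here is a sketch of permutation importance. Note that this is a simpler technique than SHAP and is not what XGBoost computes internally: it shuffles one feature at a time and measures how much accuracy drops. The names `accuracy`, `permutation_importance`, and `predict` are hypothetical, and `predict` stands in for any fitted model.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the model predicts the right label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column independently."""
    rng = random.Random(seed)
    base = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled_rows = []
        for r, v in zip(rows, column):
            row = list(r)
            row[j] = v  # replace only feature j with a shuffled value
            shuffled_rows.append(row)
        importances.append(base - accuracy(predict, shuffled_rows, labels))
    return importances

# Toy setup: the label depends on feature 0 only; feature 1 is ignored.
rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 8.0], [0.8, 1.0],
        [0.3, 2.0], [0.7, 9.0], [0.4, 4.0], [0.6, 6.0]]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

def predict(row):
    # Hypothetical stand-in for a trained model.
    return int(row[0] > 0.5)

importances = permutation_importance(predict, rows, labels, n_features=2)
```

Because `predict` never reads feature 1, shuffling it cannot change any prediction, so its importance is exactly zero; feature 0 can only match or lower the baseline accuracy when shuffled.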