Training Data
Sampling
Non-Probability Sampling
- convenience sampling : based on availability of the data
- snowball sampling : take one sample, then take the ones linked to it, and so on
- judgment sampling : expert decision
- quota sampling : fill predefined quotas for certain slices of the data, without randomization
[!caution] riddled with biases
Simple Random Sampling
can cause issues with rare or extreme occurrences (they may not end up in the sample at all)
Stratified Sampling
divide the population into subgroups (strata) that matter, then randomly sample from each subgroup.
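A minimal sketch with scikit-learn, where the `stratify` argument keeps the class proportions of a toy imbalanced label array in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)      # toy features
y = np.array([0] * 90 + [1] * 10)      # toy imbalanced labels (90/10)

# stratify=y preserves the 90/10 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```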
Weighted Sampling
each sample is given a weight that determines its probability of being selected (useful to encode domain knowledge, e.g. give more recent data more weight)
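A minimal sketch with the standard library, using made-up weights that favor more recent samples:

```python
import random

data    = ["2023", "2023", "2024", "2025", "2025"]
weights = [1, 1, 2, 3, 3]   # hypothetical weights favoring recent data

# random.choices draws with replacement, with probability proportional to the weights
sample = random.choices(data, weights=weights, k=3)
print(sample)
```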
Reservoir Sampling
useful for streaming data : keeps a sample of $k$ elements from a stream of unknown (possibly unbounded) length
- every element has an equal probability of being selected
- can stop the algorithm at any time and elements will be sampled with the correct probability
[!example]
1 - put the first $k$ elements in the reservoir
2 - when the $n^{th}$ element arrives, generate a random number $i$ such that $1 \leq i \leq n$
3 - if $1 \leq i \leq k$, replace the $i^{th}$ element of the reservoir with the new one
every element has probability $\frac{k}{n}$ of being selected.
[!recursive proof] init : after the first $k$ elements, each is in the reservoir with probability $\frac{k}{k} = 1$. Step : assume each of the first $n-1$ elements is in the reservoir with probability $\frac{k}{n-1}$. The $n^{th}$ element enters with probability $\frac{k}{n}$; an element already in the reservoir is evicted with probability $\frac{k}{n}\cdot\frac{1}{k} = \frac{1}{n}$, so it stays with probability $\frac{k}{n-1}\cdot\frac{n-1}{n} = \frac{k}{n}$.
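A direct Python translation of the steps above (a sketch; here the stream is just a `range`):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # 1 - fill the reservoir with the first k elements
        else:
            i = random.randint(1, n)      # 2 - generate i with 1 <= i <= n
            if i <= k:
                reservoir[i - 1] = item   # 3 - replace the i-th element with the new one
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```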
Importance Sampling
estimate an expectation under a distribution $P$ while only being able to sample from another distribution $Q$, by reweighting each sample with $\frac{P(x)}{Q(x)}$ : $E_P[x] = \sum_x x P(x) = \sum_x Q(x)\frac{P(x)}{Q(x)}x = E_Q\left[\frac{P(x)}{Q(x)}x\right]$
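A minimal sketch with NumPy/SciPy, assuming a Gaussian target $P$ and a wider Gaussian proposal $Q$ chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

p = norm(loc=1.0, scale=1.0)   # target distribution P (pretend we cannot sample from it)
q = norm(loc=0.0, scale=2.0)   # proposal distribution Q (the one we can sample from)

x = q.rvs(size=100_000, random_state=0)
w = p.pdf(x) / q.pdf(x)        # importance weights P(x)/Q(x)

print(np.mean(w * x))          # estimate of E_P[x], should be close to 1.0 (the mean of P)
```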
Labeling
Hand Labels
issues with label multiplicity : different annotators can disagree on the same sample
==data lineage== : keep track of the origin of the data & labels
Natural Labels
feedback loop length is an important criterion
Handling lack of labels
Weak supervision
Snorkel : weak labels generated from heuristics encoded as labeling functions, e.g. :
- keyword heuristics
- regular expression
- database lookup
- output of other models
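A plain-Python sketch of what such labeling functions look like (not the actual Snorkel API; the keywords, label values, and blocklist are made up for illustration):

```python
import re

SPAM, ABSTAIN = 1, -1                      # hypothetical label values
BLOCKLIST = {"spammer@example.com"}        # hypothetical database lookup

def lf_keyword(text):                      # keyword heuristic
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_regex(text):                        # regular expression
    return SPAM if re.search(r"click here", text, re.I) else ABSTAIN

def lf_blocklist(sender):                  # database lookup
    return SPAM if sender in BLOCKLIST else ABSTAIN

def weak_label(text, sender):
    """Combine labeling functions by majority vote (Snorkel instead learns a label model)."""
    votes = [v for v in (lf_keyword(text), lf_regex(text), lf_blocklist(sender)) if v != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("Click here for FREE money!", "spammer@example.com"))   # -> 1
```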
Semi Supervision
- self training
- similarity clustering
- perturbation-based : add small perturbations to labeled samples and give the perturbed copies the same label
semi-supervision is most useful when the number of training labels is limited
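A minimal self-training sketch with scikit-learn's `SelfTrainingClassifier`, where unlabeled points are marked with `-1` (the 10% labeled fraction and the confidence threshold are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# pretend only ~10% of the labels are known; the rest are marked as unlabeled (-1)
rng = np.random.default_rng(0)
y_semi = np.where(rng.random(len(y)) < 0.10, y, -1)

# self-training: iteratively pseudo-label the unlabeled points the model is confident about
clf = SelfTrainingClassifier(SVC(probability=True), threshold=0.9)
clf.fit(X, y_semi)
print((clf.predict(X) == y).mean())
```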
Transfer Learning
- feature extraction
- fine tuning
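A PyTorch/torchvision sketch of the two options (assuming a recent torchvision; older versions use `pretrained=True` instead of the `weights` argument, and the 10-class head is hypothetical):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # model pretrained on ImageNet

# feature extraction : freeze the pretrained layers, train only the new head
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)     # new head for a hypothetical 10-class task

# fine tuning : instead keep (some or all of) the pretrained layers trainable as well
# for param in model.parameters():
#     param.requires_grad = True
```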
Active Learning
label the samples that are most helpful to your model, according to some metric or heuristic (e.g. uncertainty : the samples the model is least confident about)
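A sketch of one common heuristic, uncertainty sampling, assuming a scikit-learn-style classifier with `predict_proba`:

```python
import numpy as np

def select_samples_to_label(model, X_unlabeled, n=10):
    """Uncertainty sampling: send the points the model is least confident about to annotators."""
    proba = model.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)     # low top-class probability = high uncertainty
    return np.argsort(uncertainty)[-n:]       # indices of the n most uncertain samples
```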
Class Imbalance
Accuracy is not a “holy” metric : it can be misleading under class imbalance (e.g. if 99% of samples belong to the majority class, always predicting that class already gives 99% accuracy).

ROC Curve
plots the true positive rate (recall) against the false positive rate at different decision thresholds; the area under the curve (ROC AUC) summarizes performance across all thresholds

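A minimal sketch with scikit-learn, using made-up scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.65, 0.7, 0.8, 0.9]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per decision threshold
print(roc_auc_score(y_true, y_score))               # area under the ROC curve
```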
Resampling
good for low-dimensional data :
- undersampling (e.g. Tomek links)
- upsampling (e.g. SMOTE, see the sketch after the caution below)
Many techniques (e.g. NearMiss, one-sided selection) are too computationally expensive for high-dimensional feature spaces.
[!caution] Never evaluate your model on resampled data, only train on it
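A sketch using the imbalanced-learn library (assuming it is installed), on a toy 95/5 dataset:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

X_up, y_up = SMOTE(random_state=0).fit_resample(X, y)    # upsample the minority class with synthetic points
X_down, y_down = TomekLinks().fit_resample(X, y)         # undersample by removing Tomek links
print(Counter(y_up), Counter(y_down))
```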
Sophisticated Sampling Techniques
- Two-phase sampling : 1 - train on resampled data, 2 - fine-tune on the original data
- Dynamic sampling : oversample low-performing classes and undersample high-performing ones
Algorithm level methods
- cost sensitive methods
- class balanced loss
- focal loss
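As an illustration of the last point, a simplified PyTorch sketch of the binary focal loss (constant `alpha` instead of per-class weighting):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Down-weight easy examples so training focuses on hard / minority-class ones."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                            # probability assigned to the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()

logits  = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))
```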
Data Augmentation
- Simple label-preserving transformation
- Perturbation (adding small amounts of noise) : adversarial perturbations can trick a model into making wrong predictions (e.g. the “one-pixel attack”); adding perturbed samples to the training data makes the model more robust
- Data synthesis
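One data synthesis technique is mixup; a minimal sketch (labels must be one-hot / soft for the mix to make sense):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Create a synthetic sample as a convex combination of two real ones."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_new, y_new = mixup(np.array([1.0, 2.0]), np.array([1.0, 0.0]),
                     np.array([3.0, 4.0]), np.array([0.0, 1.0]))
print(x_new, y_new)
```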