Training Data


Sampling

Nonprobability Sampling

Simple Random Sampling

can cause issues with rare or extreme occurrences

Stratified Sampling

divide the population into important subgroups (strata) and then randomly sample from each subgroup.
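A minimal sketch using scikit-learn's `train_test_split` with `stratify`, which keeps each class's proportion the same in both splits (the toy `X`/`y` data is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 1000 samples, imbalanced labels (90% class 0, 10% class 1)
X = np.random.randn(1000, 5)
y = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

# stratify=y preserves the 90/10 class ratio in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```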

Weighted Sampling

Reservoir Sampling

useful for streaming data: lets you keep a uniform sample of k elements from a stream of unknown length.

every element has probability $\frac{k}{n}$ of being selected.

[!recursive proof] Base case: the first $k$ elements are all kept. Induction step: when element $n$ arrives it is selected with probability $\frac{k}{n}$; each element already in the reservoir is evicted with probability $\frac{k}{n}\cdot\frac{1}{k} = \frac{1}{n}$, so it survives with probability $\frac{n-1}{n}$ and its overall probability is $\frac{k}{n-1}\cdot\frac{n-1}{n} = \frac{k}{n}$.
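A small sketch of the classic reservoir-sampling routine (Algorithm R); the `reservoir_sample` helper and the toy stream are only illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)          # fill the reservoir with the first k items
        else:
            j = random.randint(1, n)        # pick a slot in 1..n
            if j <= k:
                reservoir[j - 1] = item     # replace an existing item with probability k/n
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```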

Importance Sampling

Estimate an expectation under one distribution $P$ when we can only sample from another distribution $Q$: $E_P[x] = \sum_x x P(x) = \sum_x Q(x)\frac{P(x)}{Q(x)}x = E_Q\!\left[\frac{P(x)}{Q(x)}x\right]$
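A hedged numeric sketch of this identity: we estimate $E_P[x]$ for a standard normal $P$ while only drawing samples from a shifted proposal $Q$ (both densities are written out by hand purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P: standard normal. Proposal Q: normal with mean 2 (the only one we can sample from).
def p_pdf(x): return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-(x - 2)**2 / 2) / np.sqrt(2 * np.pi)

x = rng.normal(loc=2.0, scale=1.0, size=100_000)   # samples from Q
weights = p_pdf(x) / q_pdf(x)                      # importance weights P(x)/Q(x)

estimate = np.mean(weights * x)                    # estimates E_P[x], which is 0 here
print(estimate)
```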

Labeling

Hand Labels

issues with label multiplicity. ==Data lineage==: keep track of the origin of the data & labels

Natural Labels

feedback loop length is an important criterion

Handling lack of labels

Weak Supervision

Snorkel: programmatic labeling with heuristic labeling functions (keywords, regular expressions, database lookups) whose noisy votes are then combined; rough sketch below.
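A library-free sketch of the labeling-function idea (the function names, keywords, and the naive majority-vote combiner are illustrative; Snorkel itself fits a label model over the labeling-function outputs instead of a simple vote):

```python
SPAM, HAM, ABSTAIN = 1, 0, -1   # ABSTAIN lets a heuristic stay silent

# each labeling function encodes one noisy heuristic
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_has_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

def weak_label(text, lfs=(lf_contains_free, lf_has_link, lf_short_message)):
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    # naive majority vote over non-abstaining heuristics
    return max(set(votes), key=votes.count) if votes else ABSTAIN

print(weak_label("Click here for a FREE prize: http://example.com"))  # -> 1 (SPAM)
```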

Semi-Supervision

Transfer Learning

Active Learning

label the samples that are most helpful to your model, selected according to some metric or heuristic (e.g., uncertainty)
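One common heuristic is uncertainty sampling. A minimal sketch, assuming a scikit-learn-style classifier that exposes `predict_proba`; the `most_uncertain` helper and the synthetic data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, X_pool, n=10):
    """Indices of the n pool samples whose predicted probabilities have the highest entropy."""
    probs = model.predict_proba(X_pool)                        # shape (n_pool, n_classes)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # per-sample uncertainty
    return np.argsort(entropy)[-n:]

rng = np.random.default_rng(0)
X_labeled, y_labeled = rng.normal(size=(50, 4)), rng.integers(0, 2, size=50)
X_pool = rng.normal(size=(500, 4))                             # unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)
to_label = most_uncertain(model, X_pool, n=10)                 # send these for annotation
```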

Class Imbalance

Accuracy is not the “holy” metric; it can be misleading under class imbalance (e.g., with 95% negatives, always predicting negative already yields 95% accuracy).

ROC Curve
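A short sketch of computing the ROC curve and AUC with scikit-learn; the synthetic data and model are only for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]            # probability of the positive class

fpr, tpr, thresholds = roc_curve(y, scores)      # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y, scores))
```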

Resampling

good for low-dimensional data: many resampling techniques (NearMiss, one-sided selection) are too computationally expensive for high-dimensional feature spaces.

[!caution] Never evaluate a model on resampled data; only train on it
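A hedged example of oversampling only the training split, assuming the `imbalanced-learn` package is available for SMOTE (the toy data is made up):

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5)
y = np.random.choice([0, 1], size=1000, p=[0.95, 0.05])

# resample only the training split; the test split keeps the true class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

print(Counter(y_train), "->", Counter(y_res))
```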

Sophisticated Sampling Techniques

  • Two-phase sampling: (1) train on the resampled data, (2) fine-tune on the original data
  • Dynamic sampling: oversample low-performing classes and undersample high-performing ones (rough sketch below)
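A rough sketch of dynamic sampling, assuming per-class validation accuracy is already known; `class_sampling_weights` is a hypothetical helper:

```python
import numpy as np

def class_sampling_weights(y_train, per_class_accuracy):
    """Give low-performing classes a higher chance of being drawn in the next epoch."""
    # weight each class by (1 - accuracy), so weak classes are sampled more often
    class_weight = {c: 1.0 - acc for c, acc in per_class_accuracy.items()}
    w = np.array([class_weight[c] for c in y_train], dtype=float)
    return w / w.sum()

y_train = np.array([0, 0, 0, 0, 1, 1])
weights = class_sampling_weights(y_train, {0: 0.95, 1: 0.60})
batch_idx = np.random.choice(len(y_train), size=4, p=weights)  # next training batch
```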

Algorithm-Level Methods

Data Augmentation