Continual Learning & Test in Production
Continual Learning
champion vs. challenger model
- stateless retraining: retraining the model from scratch each time
- stateful retraining: continue training the existing model on new data only (see the sketch below)
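A minimal sketch of the difference, assuming a PyTorch model and data loaders (the model, dimensions, and checkpoint path are placeholders, not from the source):

```python
import torch
import torch.nn as nn

def train(model: nn.Module, data_loader, epochs: int = 1):
    """Placeholder training loop (details omitted)."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for features, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
    return model

# Stateless retraining: start from a fresh, randomly initialized model
# and train it on the full (old + new) dataset each time.
def stateless_retrain(all_data_loader):
    model = nn.Linear(10, 1)          # fresh weights every run (hypothetical model)
    return train(model, all_data_loader)

# Stateful retraining: load the current champion's weights and continue
# training only on the data that arrived since the last update.
def stateful_retrain(checkpoint_path: str, new_data_loader):
    model = nn.Linear(10, 1)
    model.load_state_dict(torch.load(checkpoint_path))  # resume from champion
    return train(model, new_data_loader)
```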
Continual learning is about setting up the infrastructure so that models can be updated whenever needed and deployed quickly.
- model iteration: new features or different model architecture
- data iteration : new data, same model
Continual learning helps combat data shift. It also helps overcome the ==continuous cold start problem==.
Continual Learning challenges
- fresh data access challenge (even with streaming data and natural labels, label computation can become the bottleneck)
- evaluation challenge
	- the biggest challenge: the more frequent the updates, the more opportunities there are for the model to fail
	- more susceptible to coordinated manipulation & adversarial attacks, e.g. [[Tay - le Chatbot Raciste]]
	- evaluation takes time
- algorithm challenge
	- only affects matrix-based & tree-based models that need to be updated very fast
	- it's much easier to adapt neural networks to the continual learning paradigm than matrix-based or tree-based algorithms
How often to update your models
- value of data freshness: train the same model on data from different time windows and evaluate on recent data to see how much performance decays with staleness (see the sketch after this list)
- model iteration vs. data iteration: decide which one is worth the compute for your case
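A sketch of measuring the value of data freshness, assuming a pandas DataFrame with hypothetical `timestamp` and `label` columns and a simple scikit-learn model (a steep drop-off in the scores suggests fresh data matters and retraining should be frequent):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def freshness_value(df: pd.DataFrame, feature_cols, label_col="label"):
    """Train on progressively older weekly windows, evaluate on the most
    recent window, and see how performance decays as training data ages."""
    df = df.sort_values("timestamp")
    windows = [g for _, g in df.groupby(pd.Grouper(key="timestamp", freq="W")) if len(g)]
    test = windows[-1]                     # most recent week = evaluation set
    results = {}
    for age, train_window in enumerate(reversed(windows[:-1]), start=1):
        model = LogisticRegression(max_iter=1000)
        model.fit(train_window[feature_cols], train_window[label_col])
        preds = model.predict_proba(test[feature_cols])[:, 1]
        results[f"{age} week(s) old"] = roc_auc_score(test[label_col], preds)
    return results
```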
Test in Production
- need a mixture of offline evaluation & online evaluation (test in production)
- why is offline evaluation not enough?
	- 2 major test types for offline eval: test splits & backtests
- static test split to benchmark and compare different model performances
	- backtest: testing the model on labelled data from a specific period in the past (e.g. the most recent data). Not sufficient on its own, since the new data itself can have issues; always keep a static test split as a sanity check (sketch after this list)
	- data shifts
	- a model that does well on data from the last hour may not do well later; you have to deploy it to find out
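A sketch of running both offline checks before any test in production, assuming scikit-learn-style models and DataFrames with a hypothetical `label` column:

```python
from sklearn.metrics import accuracy_score

def offline_eval(model, static_test, backtest_window, feature_cols, label_col="label"):
    """Evaluate a candidate on both offline sets:
    - static test split: fixed benchmark shared by all model versions
    - backtest: the most recent labelled window the model did not train on"""
    scores = {
        "static_split": accuracy_score(
            static_test[label_col], model.predict(static_test[feature_cols])),
        "backtest": accuracy_score(
            backtest_window[label_col], model.predict(backtest_window[feature_cols])),
    }
    # A good backtest score with a degraded static score (or vice versa)
    # is a signal to investigate data issues before deploying.
    return scores
```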
Shadow Deployment
- deploy candidate model in parallel with the existing one
- for each incoming request, route it to both models to make predictions, but only serve the existing model's predictions
- log the candidate's predictions for later analysis (sketch below)
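A minimal serving sketch, assuming champion/challenger objects with a `predict` method (names are placeholders):

```python
import logging

logger = logging.getLogger("shadow")

def serve(request, champion, challenger):
    """Shadow deployment: both models score every request, but only the
    champion's prediction is ever returned to the user."""
    champion_pred = champion.predict(request)
    try:
        challenger_pred = challenger.predict(request)   # never shown to users
        logger.info("request=%s champion=%s challenger=%s",
                    request, champion_pred, challenger_pred)
    except Exception:
        # A failing challenger must never break the live path.
        logger.exception("challenger failed on request=%s", request)
    return champion_pred
```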
A/B Testing
- deploy candidate model in parallel with the existing one
- route each request to one model or the other (the routing has to be random so there is no selection bias, and you need enough samples to be confident about the outcome)
- analyse the logs (a two-sample test can determine whether the difference between the models is statistically significant); sketch below
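A sketch of the random routing plus a Welch two-sample t-test on the logged outcomes (model objects and outcome lists are placeholders):

```python
import random
from scipy import stats

def route(request, model_a, model_b):
    """Randomly assign each request to arm A or B, avoiding selection bias."""
    if random.random() < 0.5:
        return "A", model_a.predict(request)
    return "B", model_b.predict(request)

def compare(outcomes_a, outcomes_b, alpha=0.05):
    """After collecting enough outcomes per arm (e.g. click-through),
    test whether the observed difference is statistically significant."""
    _, p_value = stats.ttest_ind(outcomes_a, outcomes_b, equal_var=False)
    return {"p_value": p_value, "significant": p_value < alpha}
```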
Canary Release
the candidate model (the canary) is first deployed to only a small subgroup of users (sketch below)
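A sketch of deterministic user bucketing for a canary, with a hypothetical 5% traffic fraction:

```python
import hashlib

CANARY_FRACTION = 0.05   # hypothetical: 5% of users see the candidate

def pick_model(user_id: str, champion, canary):
    """Bucket users deterministically so the same user always hits the
    same model, while the canary only gets a small slice of traffic."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < CANARY_FRACTION * 100 else champion
```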
Interleaving Experiments
expose each user to recommendations from both models at once; needs a smaller population than A/B testing to reach a conclusion
- recommendations from the two models should be equally likely to appear at any position, so there is no position bias in the exposure -> that's why we use ==team-draft interleaving== (sketch below)
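A sketch of team-draft interleaving over two ranked lists of item IDs (function and argument names are my own):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Each round, a coin flip decides which model drafts first; each model
    then adds its highest-ranked item not already in the interleaved list.
    Clicks on an item are credited to the model that drafted it."""
    interleaved, drafted_by = [], {}
    idx_a = idx_b = 0
    while len(interleaved) < k and (idx_a < len(ranking_a) or idx_b < len(ranking_b)):
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        for team in order:
            ranking = ranking_a if team == "A" else ranking_b
            idx = idx_a if team == "A" else idx_b
            # skip items the other model already drafted
            while idx < len(ranking) and ranking[idx] in drafted_by:
                idx += 1
            if idx < len(ranking):
                item = ranking[idx]
                interleaved.append(item)
                drafted_by[item] = team
                idx += 1
            if team == "A":
                idx_a = idx
            else:
                idx_b = idx
            if len(interleaved) >= k:
                break
    return interleaved, drafted_by   # count clicks per team to pick the winner
```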
Bandits
the name comes from gambling (slot machines); bandits balance exploration & exploitation and are more data-efficient than A/B testing (sketch below)
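A sketch of using an epsilon-greedy multi-armed bandit to pick which candidate model serves each request (one arm per model; the reward could be whether the prediction was correct):

```python
import random

class EpsilonGreedyBandit:
    """Mostly exploit the model with the best observed payout, occasionally
    explore the others; traffic shifts to the better model far faster than
    a fixed 50/50 A/B split."""
    def __init__(self, n_models: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_models
        self.values = [0.0] * n_models   # running mean payout per model

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))                        # explore
        return max(range(len(self.counts)), key=lambda i: self.values[i])    # exploit

    def update(self, model_idx: int, reward: float):
        self.counts[model_idx] += 1
        n = self.counts[model_idx]
        self.values[model_idx] += (reward - self.values[model_idx]) / n
```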
Contextual Bandits
also called a "one-shot reinforcement learning" problem. Bandits for model evaluation -> determine the payout (prediction accuracy) of each model; contextual bandits -> determine the payout of each action, given the context (sketch below)
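A sketch of one common contextual-bandit algorithm, LinUCB (not named in these notes): one linear payout model per action (e.g. per item to recommend), with the context vector standing in for user features.

```python
import numpy as np

class LinUCB:
    """Pick the action with the highest upper confidence bound on payout,
    given the current context; update that action's model with the reward."""
    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_actions)]     # per-action covariance
        self.b = [np.zeros(dim) for _ in range(n_actions)]

    def select(self, context: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # expected payout + exploration bonus
            scores.append(context @ theta + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def update(self, action: int, context: np.ndarray, reward: float):
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context
```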
The whole model evaluation process should be clearly defined by the team: which tests to run, in which order, on which data. Better yet, it should be automated and kicked off whenever there is a new model update, similar to continuous integration/continuous delivery (CI/CD) for code (sketch below).
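A minimal sketch of such an evaluation pipeline, run in the team's agreed order and stopping at the first failure; all check functions and thresholds are hypothetical:

```python
from typing import Callable

def run_evaluation_pipeline(checks: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run each named check in order (offline first, then tests in
    production); reject the candidate at the first failing stage."""
    for name, check in checks:
        passed = check()
        print(f"[eval] {name}: {'PASS' if passed else 'FAIL'}")
        if not passed:
            return False
    return True

# Example wiring (check functions are placeholders to implement per project):
# run_evaluation_pipeline([
#     ("static test split",  lambda: eval_on_static_split() > 0.80),
#     ("backtest",           lambda: eval_on_backtest() > 0.80),
#     ("shadow deployment",  lambda: shadow_metrics_ok()),
#     ("canary release",     lambda: canary_metrics_ok()),
# ])
```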