Continual Learning & Test in Production
Continual Learning
champion vs. challenger model
- stateless retraining: retraining the model from scratch each time
- stateful retraining: continue training the existing model on new data only (see the sketch below)
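A minimal sketch of the difference, assuming a PyTorch model and data loaders (the model, dimensions, and checkpoint path are placeholders, not from the source):

```python
import torch
import torch.nn as nn

def train(model: nn.Module, data_loader, epochs: int = 1):
    """Placeholder training loop (details omitted)."""
    optimizer = torch.optim.Adam(model.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for features, labels in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
    return model

# Stateless retraining: start from a fresh, randomly initialized model
# and train it on the full (old + new) dataset each time.
def stateless_retrain(all_data_loader):
    model = nn.Linear(10, 1)          # fresh weights every run (hypothetical model)
    return train(model, all_data_loader)

# Stateful retraining: load the current champion's weights and continue
# training only on the data that arrived since the last update.
def stateful_retrain(checkpoint_path: str, new_data_loader):
    model = nn.Linear(10, 1)
    model.load_state_dict(torch.load(checkpoint_path))  # resume from champion
    return train(model, new_data_loader)
```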
Continual learning is about setting up the infrastructure so that models can be updated whenever needed and deployed quickly.
- model iteration: new features or different model architecture
- data iteration : new data, same model
Continual learning helps combat data shift. It also helps overcome the ==continuous cold start problem==.
Continual Learning challenges
- fresh data access challenge (even with streaming data and natural labels, label computation can become the bottleneck)
- evaluation challenge
	- the biggest challenge: the more frequent the updates, the more opportunities there are for the model to fail
	- more susceptible to coordinated manipulation & adversarial attacks, e.g. [[Tay - le Chatbot Raciste]]
	- evaluation takes time
- algorithm challenge
	- only affects matrix-based & tree-based models that need to be updated very fast
	- it's much easier to adapt neural networks to the continual learning paradigm than matrix-based or tree-based algorithms
How often to update your models
- value of data freshness: train the same model on data from different time windows and evaluate on recent data to see how much performance decays with staleness (see the sketch after this list)
- model iteration vs. data iteration: decide which one is worth the compute for your case
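A sketch of measuring the value of data freshness, assuming a pandas DataFrame with hypothetical `timestamp` and `label` columns and a simple scikit-learn model (a steep drop-off in the scores suggests fresh data matters and retraining should be frequent):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def freshness_value(df: pd.DataFrame, feature_cols, label_col="label"):
    """Train on progressively older weekly windows, evaluate on the most
    recent window, and see how performance decays as training data ages."""
    df = df.sort_values("timestamp")
    windows = [g for _, g in df.groupby(pd.Grouper(key="timestamp", freq="W")) if len(g)]
    test = windows[-1]                     # most recent week = evaluation set
    results = {}
    for age, train_window in enumerate(reversed(windows[:-1]), start=1):
        model = LogisticRegression(max_iter=1000)
        model.fit(train_window[feature_cols], train_window[label_col])
        preds = model.predict_proba(test[feature_cols])[:, 1]
        results[f"{age} week(s) old"] = roc_auc_score(test[label_col], preds)
    return results
```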
Test in Production
- need a mixture of offline evaluation & online evaluation (test in production)
- why is offline evaluation not enough?
	- 2 major test types for offline eval: test splits & backtests
- static test split to benchmark and compare different model performances
	- backtest: testing the model on labelled data from a specific period in the past (e.g. the most recent data). Not sufficient on its own, since the new data itself can have issues; always keep a static test split as a sanity check (sketch after this list)
	- data shifts
	- a model that does well on data from the last hour may not do well later; you have to deploy it to find out
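A sketch of running both offline checks before any test in production, assuming scikit-learn-style models and DataFrames with a hypothetical `label` column:

```python
from sklearn.metrics import accuracy_score

def offline_eval(model, static_test, backtest_window, feature_cols, label_col="label"):
    """Evaluate a candidate on both offline sets:
    - static test split: fixed benchmark shared by all model versions
    - backtest: the most recent labelled window the model did not train on"""
    scores = {
        "static_split": accuracy_score(
            static_test[label_col], model.predict(static_test[feature_cols])),
        "backtest": accuracy_score(
            backtest_window[label_col], model.predict(backtest_window[feature_cols])),
    }
    # A good backtest score with a degraded static score (or vice versa)
    # is a signal to investigate data issues before deploying.
    return scores
```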
Shadow Deployment
- deploy candidate model in parallel with the existing one
- for each incoming request, route it to both models to make predictions, but only serve the existing model's predictions
- log the candidate's predictions for later analysis (sketch below)
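A minimal serving sketch, assuming champion/challenger objects with a `predict` method (names are placeholders):

```python
import logging

logger = logging.getLogger("shadow")

def serve(request, champion, challenger):
    """Shadow deployment: both models score every request, but only the
    champion's prediction is ever returned to the user."""
    champion_pred = champion.predict(request)
    try:
        challenger_pred = challenger.predict(request)   # never shown to users
        logger.info("request=%s champion=%s challenger=%s",
                    request, champion_pred, challenger_pred)
    except Exception:
        # A failing challenger must never break the live path.
        logger.exception("challenger failed on request=%s", request)
    return champion_pred
```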
A/B Testing
- deploy candidate model in parallel with the existing one
- route each request to one model or the other (the routing has to be random so there is no selection bias, and you need enough samples to be confident about the outcome)
- analyse the logs (a two-sample test can determine whether the difference between the models is statistically significant); sketch below
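A sketch of the random routing plus a Welch two-sample t-test on the logged outcomes (model objects and outcome lists are placeholders):

```python
import random
from scipy import stats

def route(request, model_a, model_b):
    """Randomly assign each request to arm A or B, avoiding selection bias."""
    if random.random() < 0.5:
        return "A", model_a.predict(request)
    return "B", model_b.predict(request)

def compare(outcomes_a, outcomes_b, alpha=0.05):
    """After collecting enough outcomes per arm (e.g. click-through),
    test whether the observed difference is statistically significant."""
    _, p_value = stats.ttest_ind(outcomes_a, outcomes_b, equal_var=False)
    return {"p_value": p_value, "significant": p_value < alpha}
```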
Canary Release
the candidate model (the canary) is first deployed to only a small subgroup of users (sketch below)
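A sketch of deterministic user bucketing for a canary, with a hypothetical 5% traffic fraction:

```python
import hashlib

CANARY_FRACTION = 0.05   # hypothetical: 5% of users see the candidate

def pick_model(user_id: str, champion, canary):
    """Bucket users deterministically so the same user always hits the
    same model, while the canary only gets a small slice of traffic."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < CANARY_FRACTION * 100 else champion
```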
Interleaving Experiments
expose each user to recommendations from both models at once; needs a smaller population than A/B testing to reach a conclusion
- recommendations from the two models should be equally likely to appear at any position, so there is no position bias in the exposure -> that's why we use ==team-draft interleaving== (sketch below)
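A sketch of team-draft interleaving over two ranked lists of item IDs (function and argument names are my own):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Each round, a coin flip decides which model drafts first; each model
    then adds its highest-ranked item not already in the interleaved list.
    Clicks on an item are credited to the model that drafted it."""
    interleaved, drafted_by = [], {}
    idx_a = idx_b = 0
    while len(interleaved) < k and (idx_a < len(ranking_a) or idx_b < len(ranking_b)):
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        for team in order:
            ranking = ranking_a if team == "A" else ranking_b
            idx = idx_a if team == "A" else idx_b
            # skip items the other model already drafted
            while idx < len(ranking) and ranking[idx] in drafted_by:
                idx += 1
            if idx < len(ranking):
                item = ranking[idx]
                interleaved.append(item)
                drafted_by[item] = team
                idx += 1
            if team == "A":
                idx_a = idx
            else:
                idx_b = idx
            if len(interleaved) >= k:
                break
    return interleaved, drafted_by   # count clicks per team to pick the winner
```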
Bandits
the name comes from gambling (slot machines); bandits balance exploration & exploitation and are more data-efficient than A/B testing (sketch below)
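A sketch of using an epsilon-greedy multi-armed bandit to pick which candidate model serves each request (one arm per model; the reward could be whether the prediction was correct):

```python
import random

class EpsilonGreedyBandit:
    """Mostly exploit the model with the best observed payout, occasionally
    explore the others; traffic shifts to the better model far faster than
    a fixed 50/50 A/B split."""
    def __init__(self, n_models: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_models
        self.values = [0.0] * n_models   # running mean payout per model

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))                        # explore
        return max(range(len(self.counts)), key=lambda i: self.values[i])    # exploit

    def update(self, model_idx: int, reward: float):
        self.counts[model_idx] += 1
        n = self.counts[model_idx]
        self.values[model_idx] += (reward - self.values[model_idx]) / n
```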
Contextual Bandits
also called a "one-shot reinforcement learning" problem. Bandits for model evaluation -> determine the payout (prediction accuracy) of each model; contextual bandits -> determine the payout of each action, given the context (sketch below)
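A sketch of one common contextual-bandit algorithm, LinUCB (not named in these notes): one linear payout model per action (e.g. per item to recommend), with the context vector standing in for user features.

```python
import numpy as np

class LinUCB:
    """Pick the action with the highest upper confidence bound on payout,
    given the current context; update that action's model with the reward."""
    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_actions)]     # per-action covariance
        self.b = [np.zeros(dim) for _ in range(n_actions)]

    def select(self, context: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # expected payout + exploration bonus
            scores.append(context @ theta + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def update(self, action: int, context: np.ndarray, reward: float):
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context
```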
The whole model evaluation process should be clearly defined by the team: which tests to run, in which order, on which data. Better yet, it should be automated and kicked off whenever there is a new model update, similar to continuous integration/continuous delivery (CI/CD) for code (sketch below).
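A minimal sketch of such an evaluation pipeline, run in the team's agreed order and stopping at the first failure; all check functions and thresholds are hypothetical:

```python
from typing import Callable

def run_evaluation_pipeline(checks: list[tuple[str, Callable[[], bool]]]) -> bool:
    """Run each named check in order (offline first, then tests in
    production); reject the candidate at the first failing stage."""
    for name, check in checks:
        passed = check()
        print(f"[eval] {name}: {'PASS' if passed else 'FAIL'}")
        if not passed:
            return False
    return True

# Example wiring (check functions are placeholders to implement per project):
# run_evaluation_pipeline([
#     ("static test split",  lambda: eval_on_static_split() > 0.80),
#     ("backtest",           lambda: eval_on_backtest() > 0.80),
#     ("shadow deployment",  lambda: shadow_metrics_ok()),
#     ("canary release",     lambda: canary_metrics_ok()),
# ])
```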