Model Deployment
Batch Prediction vs. Online Prediction
ML Deployment Myths
- you only deploy one or two models at a time
- if we don’t do anything, model performance remains the same
- you won’t need to update your model as much
- most ML Engineers don’t need to worry about scale
Batch Prediction
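Predictions are computed periodically or when triggered (e.g., recommendations regenerated for all users every few hours), stored, and only retrieved at request time. High throughput and no inference latency at request time, but predictions can go stale between runs.

A minimal sketch of a scheduled batch scoring job, assuming a hypothetical scikit-learn-style `model` and Parquet files standing in for the feature store and prediction cache:

```python
import pandas as pd

def run_batch_job(model, feature_path: str, output_path: str) -> None:
    """Score every user in one pass (run nightly/hourly by a scheduler)."""
    df = pd.read_parquet(feature_path)          # precomputed features, one row per user
    X = df.drop(columns=["user_id"])
    df["prediction"] = model.predict(X)         # bulk inference for high throughput
    # Persist keyed predictions so the serving layer can look them up instantly.
    df[["user_id", "prediction"]].to_parquet(output_path)
```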

Online Prediction Using Batch Features
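Predictions are computed on demand at request time, but from features that were precomputed in batch (e.g., item embeddings) and fetched from a key-value feature store.

A sketch of a request handler under that assumption; `feature_store` is a plain dict standing in for a real key-value store, and `model` is hypothetical:

```python
from typing import Dict, List

def handle_request(model, feature_store: Dict[str, List[float]], user_id: str) -> float:
    """The prediction is computed now; the features were computed earlier in batch."""
    features = feature_store[user_id]            # lookup, not feature computation
    return float(model.predict([features])[0])   # on-demand inference
```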

Online Prediction (Streaming)
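Predictions are computed on demand from streaming features extracted out of real-time events (e.g., a user’s activity over the last few minutes), typically via a stream transport and processor such as Kafka and Flink.

A toy sketch of the idea; an in-memory deque stands in for the stream processor, and the feature choices and `model` are illustrative:

```python
from collections import defaultdict, deque

WINDOW = 10
recent_prices = defaultdict(lambda: deque(maxlen=WINDOW))  # per-user rolling window

def on_event(model, user_id: str, price: float) -> float:
    """Update streaming features from a live event, then predict immediately."""
    window = recent_prices[user_id]
    window.append(price)                                    # streaming feature update
    features = [sum(window) / len(window), float(len(window))]
    return float(model.predict([features])[0])
```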

Model Compression
- low-rank factorization: replace over-parametrized tensors with compact ones, e.g., compact convolutional filters
- knowledge distillation: train a small student model to mimic a larger teacher model
- pruning: zero out the least significant parameters; shrinks the model but can introduce bias
- quantization: use fewer bits per parameter, e.g., 16-bit floats or 8-bit integers instead of 32-bit floats (see the sketch after this list)
- Roblox BERT case study: served 1B+ daily requests on CPUs by combining distillation (DistilBERT) with dynamic quantization
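A minimal post-training dynamic quantization sketch in PyTorch, in the spirit of the Roblox case study; the tiny model here is illustrative, not theirs:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))

# Swap fp32 Linear weights for int8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x))  # same call signature, smaller weights, faster on CPU
```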
ML on the Cloud vs. On Device
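- cloud: easier to scale and update models, but adds network latency, per-request cost, and privacy concerns
- on device: works offline, keeps data local, no network latency, but constrained by compute, memory, and battery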
ML Optimization on the Edge
- vectorization: process many elements per instruction with SIMD units instead of scalar loops (see the sketch after this list)
- parallelization: split the work across cores or threads
- loop tiling: reorder loop traversal to reuse data while it is still in cache
- operator fusion: merge consecutive operators into one kernel to avoid redundant memory round-trips
- graph optimization: rewrite the computation graph as a whole, e.g., removing redundant nodes and folding constants
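A toy illustration of vectorization, the first item above; NumPy's `dot` dispatches to SIMD/BLAS code, replacing the element-by-element Python loop:

```python
import numpy as np

x = np.random.rand(1_000_000).astype(np.float32)
w = np.random.rand(1_000_000).astype(np.float32)

# Scalar version: one multiply-add per Python-level iteration.
def dot_loop(a: np.ndarray, b: np.ndarray) -> float:
    total = 0.0
    for i in range(len(a)):
        total += float(a[i]) * float(b[i])
    return total

# Vectorized version: a single call executed many elements at a time.
dot_vec = float(np.dot(x, w))
```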