I spent a week setting up my ML pipeline on a new server.
Install Python. Install packages. Fix version conflicts. Install system dependencies. Configure environment variables. Debug path issues. Rinse and repeat.
Then I learned Docker.
Now? I ship my entire ML workflow in a container. One command, works anywhere. No more "it works on my machine." No more dependency hell. No more environment setup nightmares.
Docker isn't just for DevOps. It's the difference between "my ML pipeline only runs on my laptop" and "my ML pipeline runs everywhere."
Let me show you how to containerize your ML workflows properly, avoid the mistakes I made, and never worry about environment issues again.
Why Data Engineers Need Docker
"Can't I just use virtual environments?"
You could. But Docker solves problems virtualenv can't.
The Problems Docker Solves
Problem #1: System Dependencies
Your pipeline needs more than pip can give you:
- Python 3.10
- PostgreSQL client libraries
- GDAL (for geospatial data)
- Java (for PySpark)
- System libraries for scikit-learn
With Docker:
```dockerfile
FROM python:3.10-slim
RUN apt-get update && apt-get install -y \
    libpq-dev gdal-bin openjdk-11-jre
# Done. Works everywhere.
```
**Problem #2: "Works on My Machine"**
```
You: "The pipeline works perfectly!"
Coworker: "It crashes on my machine"
You: "What Python version?"
Coworker: "3.9"
You: "I'm on 3.10"
Coworker: "What about package X?"
You: "What version do you have?"
Coworker: *frustration intensifies*
```
**With Docker:**
```
You: "docker run myimage"
Coworker: "docker run myimage"
Both: *it works* ✅
```
**Problem #3: Environment Conflicts**
Project A and project B pin conflicting dependency versions; with Docker, each ships its own isolated environment:
```
docker run project-a
docker run project-b
```
Problem #4: Deployment Complexity
1. SSH into server
2. Install dependencies
3. Configure environment
4. Copy code
5. Restart service
6. Debug why it doesn't work
7. Repeat steps 2-6 multiple times
With Docker:
1. docker push image
2. docker pull image
3. docker run image
Done.
Docker Basics for Data Engineers
Before we containerize ML workflows, let's cover the essentials.
Core Concepts
- **Image**: Blueprint for a container (like a class)
- **Container**: Running instance of an image (like an object)
- **Dockerfile**: Recipe for building an image
- **Volume**: Persistent storage (survives container restarts)
- **Network**: How containers talk to each other
Your First Dockerfile
```dockerfile
# Start from a base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Run the application
CMD ["python", "train.py"]
```
Build and run:
```bash
docker build -t ml-training:v1 .
docker run ml-training:v1
```
**That's it. Basic Docker in 5 minutes.**
---
## **Containerizing a Training Pipeline**
Let's containerize a real training pipeline. Project structure:
```
ml-pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── data/
│   └── train.csv
├── models/
│   └── (empty - models saved here)
├── src/
│   ├── train.py
│   ├── preprocess.py
│   └── evaluate.py
└── config/
    └── config.yaml
```
requirements.txt
```
pandas==2.1.3
scikit-learn==1.3.2
joblib==1.3.2
pyyaml==6.0.1
mlflow==2.8.1
numpy==1
```
Dockerfile for Training
```dockerfile
# Use Python slim image (smaller)
FROM python:3.10-slim as builder

# Install system dependencies for ML libraries
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*
```
Key features:
✅ Multi-stage build (smaller final image)
✅ Virtual environment
✅ Non-root user (security)
✅ Clean layer structure
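The snippet above shows only the start of the builder stage. A complete Dockerfile hitting all four points might look like this sketch (the venv path, user name, and entrypoint are illustrative assumptions, not the article's exact file):

```dockerfile
# Stage 1: builder. Compilers live here, not in the final image.
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y \
    gcc g++ \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies into a virtual environment we can copy out
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime. Only the venv and the code are carried over.
FROM python:3.10-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app

# Non-root user for security
RUN useradd -m -u 1000 mluser
COPY --chown=mluser:mluser . .
USER mluser

CMD ["python", "src/train.py"]
```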
src/train.py
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib
import yaml
import mlflow
from pathlib import Path
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def load_config():
    """Load configuration"""
    with open('config/config.yaml', 'r') as f:
        return yaml.safe_load(f)

def train_model():
    """Train ML model"""
    config = load_config()
    logger.info("Configuration loaded")

    data_path = Path(config['data']['path'])
    df = pd.read_csv(data_path)
    logger.info(f"Loaded {len(df)} samples from {data_path}")

    X = df.drop(config['data']['target_column'], axis=1)
    y = df[config['data']['target_column']]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=config['training']['test_size'],
        random_state=config['training']['random_state']
    )

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(
            n_estimators=config['model']['n_estimators'],
            max_depth=config['model']['max_depth'],
            random_state=config['training']['random_state']
        ))
    ])

    logger.info("Training model...")
    with mlflow.start_run():
        pipeline.fit(X_train, y_train)
        train_score = pipeline.score(X_train, y_train)
        test_score = pipeline.score(X_test, y_test)
        logger.info(f"Train accuracy: {train_score:.4f}")
        logger.info(f"Test accuracy: {test_score:.4f}")

        mlflow.log_params(config['model'])
        mlflow.log_metric("train_accuracy", train_score)
        mlflow.log_metric("test_accuracy", test_score)

        model_path = Path('models/model.pkl')
        joblib.dump(pipeline, model_path)
        logger.info(f"Model saved to {model_path}")
        mlflow.sklearn.log_model(pipeline, "model")

    logger.info("Training complete!")

if __name__ == "__main__":
    train_model()
```
config/config.yaml
```yaml
data:
  path: "data/train.csv"
  target_column: "target"
training:
  test_size: 0.2
  random_state: 42
model:
  n_estimators: 100
  max_depth: 10
```
Build and Run
```bash
docker build -t ml-training:v1 .
docker run \
  -v "$(pwd)/data":/app/data \
  -v "$(pwd)/models":/app/models \
  -v "$(pwd)/logs":/app/logs \
  ml-training:v1
```
The magic: Model is saved to your local models/ directory even though training happened in a container!
Docker Compose for Complex Workflows
Real ML workflows have multiple services. Docker Compose orchestrates them.
The Problem
Your ML pipeline needs:
Training service
API service
PostgreSQL database
MLflow tracking server
Monitoring (Prometheus/Grafana)
Managing 5 containers manually? Nightmare.
The Solution: Docker Compose
docker-compose.yml
```yaml
version: '3.8'

services:
  postgres:
    image: postgres:15
    container_name: ml-postgres
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpassword
      POSTGRES_DB: mldb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mluser"]
      interval: 10s
      timeout: 5s
      retries: 5

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.1
    container_name: ml-mlflow
    command: >
      mlflow server
      --backend-store-uri postgresql://mluser:mlpassword@postgres/mldb
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
      --port 5000
    ports:
      - "5000:5000"
    volumes:
      - mlflow_artifacts:/mlflow/artifacts
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    container_name: ml-training
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - DATABASE_URL=postgresql://mluser:mlpassword@postgres/mldb
    depends_on:
      postgres:
        condition: service_healthy
      mlflow:
        condition: service_started
    command: python src/train.py

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    container_name: ml-api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    environment:
      - MODEL_PATH=/app/models/model.pkl
      - DATABASE_URL=postgresql://mluser:mlpassword@postgres/mldb
    depends_on:
      - postgres
      - training
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: ml-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    depends_on:
      - api

  grafana:
    image: grafana/grafana:latest
    container_name: ml-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana:/etc/grafana/provisioning
    depends_on:
      - prometheus

volumes:
  postgres_data:
  mlflow_artifacts:
  prometheus_data:
  grafana_data:
```
Start Everything
```bash
docker-compose up -d              # start every service in the background
docker-compose ps                 # check what's running
docker-compose logs -f training   # follow the training logs
docker-compose down               # stop everything
docker-compose down -v            # stop everything and delete volumes
```
What you get: one command, and everything runs.
Optimizing Docker Images for ML
ML images can get huge. Let's optimize.
Problem: Bloated Images
```dockerfile
# Bad: 3.2 GB image!
FROM python:3.10
RUN pip install \
    pandas numpy scikit-learn \
    tensorflow torch torchvision \
    jupyter notebook matplotlib seaborn
COPY . .
```
Why it's huge:
Full Python image (not slim)
Unnecessary packages
Poor layer caching
Build artifacts included
Solution: Multi-Stage Builds
```dockerfile
# Stage 1: Builder (includes build tools)
FROM python:3.10-slim as builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    make \
    && rm -rf /var/lib/apt/lists/*
```
2. Layer Caching
```dockerfile
# Bad: Changes to code rebuild everything
COPY . .
RUN pip install -r requirements.txt

# Good: Requirements cached separately
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .  # Code changes don't rebuild dependencies
```
3. Minimize Layers
```dockerfile
# Bad: 3 layers
RUN apt-get update
RUN apt-get install -y gcc
RUN apt-get clean

# Good: 1 layer
RUN apt-get update && \
    apt-get install -y gcc && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```
4. Use Specific Versions
```dockerfile
# Bad: Version can change
FROM python:3.10

# Good: Exact version
FROM python:3.10.13-slim
```
Volumes: Persisting Data
Containers are ephemeral. Volumes persist data.
The Problem
Stop or rebuild a container and every file written inside it is gone, including the model you just spent hours training.
Solution: Volumes
Named Volumes (Docker-managed)
```bash
docker volume create ml-models
docker run \
  -v ml-models:/app/models \
  ml-training
docker volume ls
```
Bind Mounts (Host directory)
```bash
docker run \
  -v /models:/app/models \
  -v /data:/app/data \
  ml-training
```
When to Use What
Named Volumes:
✅ Production databases
✅ Shared data between containers
✅ Don't need direct access from host
Bind Mounts:
✅ Development (edit code, see changes)
✅ Configuration files
✅ Need direct access from host
Docker Compose Volumes
```yaml
services:
  training:
    image: ml-training
    volumes:
      - ./src:/app/src          # bind mount (live code edits)
      - model_data:/app/models  # named volume (persisted models)
      - ./data:/app/data:ro     # read-only bind mount
volumes:
  model_data:
```
Networking: Connecting Services
Your ML pipeline has multiple services that need to talk.
Default: Bridge Network
```bash
# Start database
docker run -d --name postgres postgres:15
# Start API
docker run --name api ml-api
```
Problem: Containers are isolated by default.
Solution: Custom Network
```bash
# Create a shared network
docker network create ml-network

# Attach both containers to it
docker run -d \
  --name postgres \
  --network ml-network \
  postgres:15

docker run -d \
  --name api \
  --network ml-network \
  -e DATABASE_HOST=postgres \
  ml-api
```
Docker Compose (Automatic Network)
```yaml
services:
  postgres:
    image: postgres:15
  api:
    image: ml-api
    environment:
      - DATABASE_URL=postgresql://user:pass@postgres/db
    depends_on:
      - postgres
```
Docker Compose creates a network automatically. Services can reach each other by name.
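That name-based discovery is why the API can use `DATABASE_URL=postgresql://user:pass@postgres/db`: `postgres` is both the service name and the hostname. A sketch of how application code might unpack that URL (the helper and its default value are illustrative, not from the article):

```python
import os
from urllib.parse import urlparse

def db_settings(default="postgresql://user:pass@postgres/db"):
    """Split a Compose-style DATABASE_URL into connection settings."""
    url = urlparse(os.environ.get("DATABASE_URL", default))
    return {
        "host": url.hostname,      # the Compose service name is the DNS name
        "port": url.port or 5432,  # fall back to the default Postgres port
        "user": url.username,
        "dbname": url.path.lstrip("/"),
    }

print(db_settings())
```

Any Postgres client can then be handed these settings; nothing in the application needs to know whether it is talking to a container, a VM, or a managed database.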
Real-World Example: Complete ML Pipeline
Let's put it all together.
Dockerfile.training
```dockerfile
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y gcc g++ && \
    rm -rf /var/lib/apt/lists/*
```
Dockerfile.api
```dockerfile
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y gcc && \
    rm -rf /var/lib/apt/lists/*
```
(Only the builder stages are shown; both files follow the multi-stage pattern from the optimization section.)
Complete docker-compose.yml
```yaml
version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpass
      POSTGRES_DB: mldb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mluser"]
      interval: 10s
      timeout: 5s
      retries: 5

  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    volumes:
      - ./data:/app/data:ro
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - DATABASE_URL=postgresql://mluser:mlpass@postgres/mldb
      - PYTHONUNBUFFERED=1
    depends_on:
      postgres:
        condition: service_healthy

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models:ro
    environment:
      - MODEL_PATH=/app/models/model.pkl
      - DATABASE_URL=postgresql://mluser:mlpass@postgres/mldb
    depends_on:
      training:
        condition: service_completed_successfully
    restart: unless-stopped

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    depends_on:
      - api

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
```
Running the Complete Pipeline
```bash
docker-compose build                  # build all images
docker-compose up -d                  # start the whole stack
docker-compose logs -f training       # watch training progress

curl http://localhost:8000/health     # check the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1, 2, 3, 4, 5]}'

open http://localhost:9090            # Prometheus
open http://localhost:3000            # Grafana
```
Common Docker Mistakes
Mistake #1: Running as Root
❌ Don't:
```dockerfile
FROM python:3.10
# Runs as root (security risk!)
CMD ["python", "app.py"]
```
✅ Do:
```dockerfile
FROM python:3.10
RUN useradd -m -u 1000 appuser
USER appuser
CMD ["python", "app.py"]
```
Mistake #2: Hardcoded Secrets
❌ Don't:
```dockerfile
ENV DATABASE_PASSWORD=secretpassword123
# Password in image! Anyone can see it
```
✅ Do:
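Pass secrets in at run time instead, e.g. `docker run -e DATABASE_PASSWORD="$DATABASE_PASSWORD" myimage` or `docker run --env-file .env myimage` (names illustrative). Inside the container the secret is an ordinary environment variable, which a quick shell sketch can show without Docker:

```shell
# Simulate what the container sees: the secret arrives via the environment,
# not via a layer baked into the image. "demo-secret" is a stand-in value.
DATABASE_PASSWORD="demo-secret" sh -c 'echo "password length: ${#DATABASE_PASSWORD}"'
```

Nothing about the secret survives in `docker history` or the image layers this way.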
Mistake #3: Not Using .dockerignore
❌ Don't: run `COPY . .` with no `.dockerignore`. The build context drags in data, git history, and caches, bloating every image.
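✅ Do: add a `.dockerignore`. A reasonable starting point (entries are illustrative; adjust to your repo):

```
.git
__pycache__/
*.pyc
.venv/
data/
models/
logs/
.env
```

Excluding `data/` and `models/` here pairs with the volume-mount patterns above: those directories belong on the host, not in the image.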
Mistake #4: Copying Data into Image
❌ Don't:
```dockerfile
COPY data/large_dataset.csv /app/data/
# Image is now 5GB
```
✅ Do:
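Mount the dataset at run time instead; it never enters the image (a sketch using the bind-mount pattern from earlier, with an illustrative image name):

```bash
docker run -v "$(pwd)/data":/app/data:ro ml-training
```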
Mistake #5: Not Cleaning Up
❌ Don't: let stopped containers, dangling images, and orphaned volumes pile up until the disk fills.
✅ Do:
```bash
docker system prune -a   # remove stopped containers, unused images, networks
docker volume prune      # remove unused volumes
```
Production Deployment with Docker
CI/CD Pipeline
```yaml
name: Build and Push Docker Image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: |
            myorg/ml-api:latest
            myorg/ml-api:${{ github.sha }}
```
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-api
  template:
    metadata:
      labels:
        app: ml-api
    spec:
      containers:
        - name: api
          image: myorg/ml-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_PATH
              value: /models/model.pkl
          volumeMounts:
            - name: models
              mountPath: /models
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName:
```
Debugging Docker Containers
Common Commands
```bash
docker ps                                   # list running containers
docker logs <container_id>                  # view logs
docker logs -f <container_id>               # follow logs live
docker exec -it <container_id> bash         # open a shell inside
docker inspect <container_id>               # full container metadata
docker stats                                # live resource usage
docker top <container_id>                   # processes in a container
docker cp <container_id>:/app/model.pkl ./  # copy a file out
docker stop <container_id>                  # stop a container
docker rm <container_id>                    # remove a stopped container
```
Debugging Tips
```bash
docker run -it --entrypoint bash myimage  # start a shell instead of the app
docker exec <container> env               # check environment variables
docker exec <container> ping postgres     # test network reachability
docker exec <container> ls -la /app       # verify files made it into the image
```
**DON'T:**
1. **Don't run as root**
2. **Don't hardcode secrets**
3. **Don't copy large datasets into the image**
4. **Don't use `latest` tag in production**
5. **Don't use bloated base images**
6. **Don't skip .dockerignore**
7. **Don't forget to clean up**
---
Conclusion
Docker isn't optional for modern ML workflows. It's essential.
What Docker gives you:
✅ Reproducibility - Works everywhere
✅ Isolation - No conflicts
✅ Portability - Easy deployment
✅ Consistency - Same environment everywhere
✅ Scalability - Easy to scale
The complete workflow:
Containerize training - Reproducible experiments
Containerize API - Reliable serving
Use Docker Compose - Orchestrate services
Optimize images - Fast builds, small sizes
Use volumes - Persist data
Add monitoring - Prometheus + Grafana
Start simple. Build complexity as needed.
Want to see complete Dockerized ML systems?
Check out my projects with full Docker setups:
GitHub: github.com/Shodexco
Questions? Let's connect:
Portfolio: jonathansodeke.framer.website
GitHub: github.com/Shodexco
LinkedIn: [Your LinkedIn]
Now go containerize your ML workflows. Your future self will thank you.
About the Author
Jonathan Sodeke is a Data Engineer and ML Engineer who learned Docker the hard way (by debugging containers at 3am). He now containerizes everything and deploys ML systems that work reliably across environments.
When he's not optimizing Dockerfiles at 2am, he's building production ML systems and teaching others to escape dependency hell.