I spent a week setting up my ML pipeline on a new server.
Install Python. Install packages. Fix version conflicts. Install system dependencies. Configure environment variables. Debug path issues. Rinse and repeat.
Then I learned Docker.
Now? I ship my entire ML workflow in a container. One command, works anywhere. No more "it works on my machine." No more dependency hell. No more environment setup nightmares.
Docker isn't just for DevOps. It's the difference between "my ML pipeline only runs on my laptop" and "my ML pipeline runs everywhere."
Let me show you how to containerize your ML workflows properly, avoid the mistakes I made, and never worry about environment issues again.
Why Data Engineers Need Docker
"Can't I just use virtual environments?"
You could. But Docker solves problems virtualenv can't.
The Problems Docker Solves
Problem #1: System Dependencies
Your pipeline needs more than pip can give you:
- Python 3.10
- PostgreSQL client libraries
- GDAL (for geospatial data)
- Java (for PySpark)
- System libraries for scikit-learn
With Docker:
```dockerfile
FROM python:3.10-slim
RUN apt-get update && apt-get install -y \
    libpq-dev gdal-bin openjdk-11-jre
# Done. Works everywhere.
```
**Problem #2: "Works on My Machine"**
```
You: "The pipeline works perfectly!"
Coworker: "It crashes on my machine"
You: "What Python version?"
Coworker: "3.9"
You: "I'm on 3.10"
Coworker: "What about package X?"
You: "What version do you have?"
Coworker: *frustration intensifies*
```
**With Docker:**
```
You: "docker run myimage"
Coworker: "docker run myimage"
Both: *it works* ✅
```
**Problem #3: Environment Conflicts**
Project A and project B pin conflicting dependency versions; with Docker, each ships its own isolated environment:
```
docker run project-a
docker run project-b
```
Problem #4: Deployment Complexity
1. SSH into server
2. Install dependencies
3. Configure environment
4. Copy code
5. Restart service
6. Debug why it doesn't work
7. Repeat steps 2-6 multiple times
With Docker:
1. docker push image
2. docker pull image
3. docker run image
Done.
Docker Basics for Data Engineers
Before we containerize ML workflows, let's cover the essentials.
Core Concepts
- **Image**: Blueprint for a container (like a class)
- **Container**: Running instance of an image (like an object)
- **Dockerfile**: Recipe for building an image
- **Volume**: Persistent storage (survives container restarts)
- **Network**: How containers talk to each other
Your First Dockerfile
```dockerfile
# Start from a base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Copy requirements
COPY requirements.txt .

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Run the application
CMD ["python", "train.py"]
```
Build and run:
```bash
docker build -t ml-training:v1 .
docker run ml-training:v1
```
**That's it. Basic Docker in 5 minutes.**
---
## **Containerizing a Training Pipeline**
Let's containerize a real training pipeline. Project structure:
```
ml-pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── data/
│   └── train.csv
├── models/
│   └── (empty - models saved here)
├── src/
│   ├── train.py
│   ├── preprocess.py
│   └── evaluate.py
└── config/
    └── config.yaml
```
requirements.txt
```
pandas==2.1.3
scikit-learn==1.3.2
joblib==1.3.2
pyyaml==6.0.1
mlflow==2.8.1
numpy==1
```
Dockerfile for Training
```dockerfile
# Use Python slim image (smaller)
FROM python:3.10-slim as builder

# Install system dependencies for ML libraries
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*
```
Key features:
✅ Multi-stage build (smaller final image)
✅ Virtual environment
✅ Non-root user (security)
✅ Clean layer structure
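The snippet above shows only the start of the builder stage. A complete Dockerfile hitting all four points might look like this sketch (the venv path, user name, and entrypoint are illustrative assumptions, not the article's exact file):

```dockerfile
# Stage 1: builder. Compilers live here, not in the final image.
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y \
    gcc g++ \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies into a virtual environment we can copy out
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: runtime. Only the venv and the code are carried over.
FROM python:3.10-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app

# Non-root user for security
RUN useradd -m -u 1000 mluser
COPY --chown=mluser:mluser . .
USER mluser

CMD ["python", "src/train.py"]
```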
src/train.py
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib
import yaml
import mlflow
from pathlib import Path
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def load_config():
    """Load configuration"""
    with open('config/config.yaml', 'r') as f:
        return yaml.safe_load(f)

def train_model():
    """Train ML model"""
    config = load_config()
    logger.info("Configuration loaded")

    data_path = Path(config['data']['path'])
    df = pd.read_csv(data_path)
    logger.info(f"Loaded {len(df)} samples from {data_path}")

    X = df.drop(config['data']['target_column'], axis=1)
    y = df[config['data']['target_column']]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=config['training']['test_size'],
        random_state=config['training']['random_state']
    )

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(
            n_estimators=config['model']['n_estimators'],
            max_depth=config['model']['max_depth'],
            random_state=config['training']['random_state']
        ))
    ])

    logger.info("Training model...")
    with mlflow.start_run():
        pipeline.fit(X_train, y_train)
        train_score = pipeline.score(X_train, y_train)
        test_score = pipeline.score(X_test, y_test)
        logger.info(f"Train accuracy: {train_score:.4f}")
        logger.info(f"Test accuracy: {test_score:.4f}")

        mlflow.log_params(config['model'])
        mlflow.log_metric("train_accuracy", train_score)
        mlflow.log_metric("test_accuracy", test_score)

        model_path = Path('models/model.pkl')
        joblib.dump(pipeline, model_path)
        logger.info(f"Model saved to {model_path}")
        mlflow.sklearn.log_model(pipeline, "model")

    logger.info("Training complete!")

if __name__ == "__main__":
    train_model()
```
config/config.yaml
```yaml
data:
  path: "data/train.csv"
  target_column: "target"
training:
  test_size: 0.2
  random_state: 42
model:
  n_estimators: 100
  max_depth: 10
```
Build and Run
```bash
docker build -t ml-training:v1 .
docker run \
  -v "$(pwd)/data":/app/data \
  -v "$(pwd)/models":/app/models \
  -v "$(pwd)/logs":/app/logs \
  ml-training:v1
```
The magic: Model is saved to your local models/ directory even though training happened in a container!
Docker Compose for Complex Workflows
Real ML workflows have multiple services. Docker Compose orchestrates them.
The Problem
Your ML pipeline needs:
Training service
API service
PostgreSQL database
MLflow tracking server
Monitoring (Prometheus/Grafana)
Managing 5 containers manually? Nightmare.
The Solution: Docker Compose
docker-compose.yml
```yaml
version: '3.8'

services:
  postgres:
    image: postgres:15
    container_name: ml-postgres
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpassword
      POSTGRES_DB: mldb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mluser"]
      interval: 10s
      timeout: 5s
      retries: 5

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.8.1
    container_name: ml-mlflow
    command: >
      mlflow server
      --backend-store-uri postgresql://mluser:mlpassword@postgres/mldb
      --default-artifact-root /mlflow/artifacts
      --host 0.0.0.0
      --port 5000
    ports:
      - "5000:5000"
    volumes:
      - mlflow_artifacts:/mlflow/artifacts
    depends_on:
      postgres:
        condition: service_healthy
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000

  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    container_name: ml-training
    volumes:
      - ./data:/app/data
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - MLFLOW_TRACKING_URI=http://mlflow:5000
      - DATABASE_URL=postgresql://mluser:mlpassword@postgres/mldb
    depends_on:
      postgres:
        condition: service_healthy
      mlflow:
        condition: service_started
    command: python src/train.py

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    container_name: ml-api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models
    environment:
      - MODEL_PATH=/app/models/model.pkl
      - DATABASE_URL=postgresql://mluser:mlpassword@postgres/mldb
    depends_on:
      - postgres
      - training
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    container_name: ml-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    depends_on:
      - api

  grafana:
    image: grafana/grafana:latest
    container_name: ml-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana:/etc/grafana/provisioning
    depends_on:
      - prometheus

volumes:
  postgres_data:
  mlflow_artifacts:
  prometheus_data:
  grafana_data:
```
Start Everything
```bash
docker-compose up -d              # start every service in the background
docker-compose ps                 # check what's running
docker-compose logs -f training   # follow the training logs
docker-compose down               # stop everything
docker-compose down -v            # stop everything and delete volumes
```
What you get: one command, and everything runs.
Optimizing Docker Images for ML
ML images can get huge. Let's optimize.
Problem: Bloated Images
```dockerfile
# Bad: 3.2 GB image!
FROM python:3.10
RUN pip install \
    pandas numpy scikit-learn \
    tensorflow torch torchvision \
    jupyter notebook matplotlib seaborn
COPY . .
```
Why it's huge:
Full Python image (not slim)
Unnecessary packages
Poor layer caching
Build artifacts included
Solution: Multi-Stage Builds
```dockerfile
# Stage 1: Builder (includes build tools)
FROM python:3.10-slim as builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    make \
    && rm -rf /var/lib/apt/lists/*
```
2. Layer Caching
```dockerfile
# Bad: Changes to code rebuild everything
COPY . .
RUN pip install -r requirements.txt

# Good: Requirements cached separately
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .  # Code changes don't rebuild dependencies
```
3. Minimize Layers
```dockerfile
# Bad: 3 layers
RUN apt-get update
RUN apt-get install -y gcc
RUN apt-get clean

# Good: 1 layer
RUN apt-get update && \
    apt-get install -y gcc && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```
4. Use Specific Versions
```dockerfile
# Bad: Version can change
FROM python:3.10

# Good: Exact version
FROM python:3.10.13-slim
```
Volumes: Persisting Data
Containers are ephemeral. Volumes persist data.
The Problem
Stop or rebuild a container and every file written inside it is gone, including the model you just spent hours training.
Solution: Volumes
Named Volumes (Docker-managed)
```bash
docker volume create ml-models
docker run \
  -v ml-models:/app/models \
  ml-training
docker volume ls
```
Bind Mounts (Host directory)
```bash
docker run \
  -v /models:/app/models \
  -v /data:/app/data \
  ml-training
```
When to Use What
Named Volumes:
✅ Production databases
✅ Shared data between containers
✅ Don't need direct access from host
Bind Mounts:
✅ Development (edit code, see changes)
✅ Configuration files
✅ Need direct access from host
Docker Compose Volumes
```yaml
services:
  training:
    image: ml-training
    volumes:
      - ./src:/app/src          # bind mount (live code edits)
      - model_data:/app/models  # named volume (persisted models)
      - ./data:/app/data:ro     # read-only bind mount
volumes:
  model_data:
```
Networking: Connecting Services
Your ML pipeline has multiple services that need to talk.
Default: Bridge Network
```bash
# Start database
docker run -d --name postgres postgres:15
# Start API
docker run --name api ml-api
```
Problem: Containers are isolated by default.
Solution: Custom Network
```bash
# Create a shared network
docker network create ml-network

# Attach both containers to it
docker run -d \
  --name postgres \
  --network ml-network \
  postgres:15

docker run -d \
  --name api \
  --network ml-network \
  -e DATABASE_HOST=postgres \
  ml-api
```
Docker Compose (Automatic Network)
```yaml
services:
  postgres:
    image: postgres:15
  api:
    image: ml-api
    environment:
      - DATABASE_URL=postgresql://user:pass@postgres/db
    depends_on:
      - postgres
```
Docker Compose creates a network automatically. Services can reach each other by name.
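That name-based discovery is why the API can use `DATABASE_URL=postgresql://user:pass@postgres/db`: `postgres` is both the service name and the hostname. A sketch of how application code might unpack that URL (the helper and its default value are illustrative, not from the article):

```python
import os
from urllib.parse import urlparse

def db_settings(default="postgresql://user:pass@postgres/db"):
    """Split a Compose-style DATABASE_URL into connection settings."""
    url = urlparse(os.environ.get("DATABASE_URL", default))
    return {
        "host": url.hostname,      # the Compose service name is the DNS name
        "port": url.port or 5432,  # fall back to the default Postgres port
        "user": url.username,
        "dbname": url.path.lstrip("/"),
    }

print(db_settings())
```

Any Postgres client can then be handed these settings; nothing in the application needs to know whether it is talking to a container, a VM, or a managed database.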
Real-World Example: Complete ML Pipeline
Let's put it all together.
Dockerfile.training
```dockerfile
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y gcc g++ && \
    rm -rf /var/lib/apt/lists/*
```
Dockerfile.api
```dockerfile
FROM python:3.10-slim as builder
RUN apt-get update && apt-get install -y gcc && \
    rm -rf /var/lib/apt/lists/*
```
(Only the builder stages are shown; both files follow the multi-stage pattern from the optimization section.)
Complete docker-compose.yml
```yaml
version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: mluser
      POSTGRES_PASSWORD: mlpass
      POSTGRES_DB: mldb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U mluser"]
      interval: 10s
      timeout: 5s
      retries: 5

  training:
    build:
      context: .
      dockerfile: Dockerfile.training
    volumes:
      - ./data:/app/data:ro
      - ./models:/app/models
      - ./logs:/app/logs
    environment:
      - DATABASE_URL=postgresql://mluser:mlpass@postgres/mldb
      - PYTHONUNBUFFERED=1
    depends_on:
      postgres:
        condition: service_healthy

  api:
    build:
      context: .
      dockerfile: Dockerfile.api
    ports:
      - "8000:8000"
    volumes:
      - ./models:/app/models:ro
    environment:
      - MODEL_PATH=/app/models/model.pkl
      - DATABASE_URL=postgresql://mluser:mlpass@postgres/mldb
    depends_on:
      training:
        condition: service_completed_successfully
    restart: unless-stopped

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    depends_on:
      - api

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:
```
Running the Complete Pipeline
```bash
docker-compose build                  # build all images
docker-compose up -d                  # start the whole stack
docker-compose logs -f training       # watch training progress

curl http://localhost:8000/health     # check the API
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [1, 2, 3, 4, 5]}'

open http://localhost:9090            # Prometheus
open http://localhost:3000            # Grafana
```
Common Docker Mistakes
Mistake #1: Running as Root
❌ Don't:
```dockerfile
FROM python:3.10
# Runs as root (security risk!)
CMD ["python", "app.py"]
```
✅ Do:
```dockerfile
FROM python:3.10
RUN useradd -m -u 1000 appuser
USER appuser
CMD ["python", "app.py"]
```
Mistake #2: Hardcoded Secrets
❌ Don't:
```dockerfile
ENV DATABASE_PASSWORD=secretpassword123
# Password in image! Anyone can see it
```
✅ Do:
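Pass secrets in at run time instead, e.g. `docker run -e DATABASE_PASSWORD="$DATABASE_PASSWORD" myimage` or `docker run --env-file .env myimage` (names illustrative). Inside the container the secret is an ordinary environment variable, which a quick shell sketch can show without Docker:

```shell
# Simulate what the container sees: the secret arrives via the environment,
# not via a layer baked into the image. "demo-secret" is a stand-in value.
DATABASE_PASSWORD="demo-secret" sh -c 'echo "password length: ${#DATABASE_PASSWORD}"'
```

Nothing about the secret survives in `docker history` or the image layers this way.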
Mistake #3: Not Using .dockerignore
❌ Don't: run `COPY . .` with no `.dockerignore`. The build context drags in data, git history, and caches, bloating every image.
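✅ Do: add a `.dockerignore`. A reasonable starting point (entries are illustrative; adjust to your repo):

```
.git
__pycache__/
*.pyc
.venv/
data/
models/
logs/
.env
```

Excluding `data/` and `models/` here pairs with the volume-mount patterns above: those directories belong on the host, not in the image.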
Mistake #4: Copying Data into Image
❌ Don't:
```dockerfile
COPY data/large_dataset.csv /app/data/
# Image is now 5GB
```
✅ Do:
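Mount the dataset at run time instead; it never enters the image (a sketch using the bind-mount pattern from earlier, with an illustrative image name):

```bash
docker run -v "$(pwd)/data":/app/data:ro ml-training
```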
Mistake #5: Not Cleaning Up
❌ Don't: let stopped containers, dangling images, and orphaned volumes pile up until the disk fills.
✅ Do:
```bash
docker system prune -a   # remove stopped containers, unused images, networks
docker volume prune      # remove unused volumes
```
Production Deployment with Docker
CI/CD Pipeline
```yaml
name: Build and Push Docker Image
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: |
            myorg/ml-api:latest
            myorg/ml-api:${{ github.sha }}
```
Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-api
  template:
    metadata:
      labels:
        app: ml-api
    spec:
      containers:
        - name: api
          image: myorg/ml-api:latest
          ports:
            - containerPort: 8000
          env:
            - name: MODEL_PATH
              value: /models/model.pkl
          volumeMounts:
            - name: models
              mountPath: /models
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName:
```
Debugging Docker Containers
Common Commands
```bash
docker ps                                   # list running containers
docker logs <container_id>                  # view logs
docker logs -f <container_id>               # follow logs live
docker exec -it <container_id> bash         # open a shell inside
docker inspect <container_id>               # full container metadata
docker stats                                # live resource usage
docker top <container_id>                   # processes in a container
docker cp <container_id>:/app/model.pkl ./  # copy a file out
docker stop <container_id>                  # stop a container
docker rm <container_id>                    # remove a stopped container
```
Debugging Tips
```bash
docker run -it --entrypoint bash myimage  # start a shell instead of the app
docker exec <container> env               # check environment variables
docker exec <container> ping postgres     # test network reachability
docker exec <container> ls -la /app       # verify files made it into the image
```
**DON'T:**
1. **Don't run as root**
2. **Don't hardcode secrets**
3. **Don't copy large datasets into the image**
4. **Don't use `latest` tag in production**
5. **Don't use bloated base images**
6. **Don't skip .dockerignore**
7. **Don't forget to clean up**
---
Conclusion
Docker isn't optional for modern ML workflows. It's essential.
What Docker gives you:
✅ Reproducibility - Works everywhere
✅ Isolation - No conflicts
✅ Portability - Easy deployment
✅ Consistency - Same environment everywhere
✅ Scalability - Easy to scale
The complete workflow:
Containerize training - Reproducible experiments
Containerize API - Reliable serving
Use Docker Compose - Orchestrate services
Optimize images - Fast builds, small sizes
Use volumes - Persist data
Add monitoring - Prometheus + Grafana
Start simple. Build complexity as needed.
Want to see complete Dockerized ML systems?
Check out my projects with full Docker setups:
GitHub: github.com/Shodexco
Questions? Let's connect:
Portfolio: jonathansodeke.framer.website
GitHub: github.com/Shodexco
LinkedIn: [Your LinkedIn]
Now go containerize your ML workflows. Your future self will thank you.
About the Author
Jonathan Sodeke is a Data Engineer and ML Engineer who learned Docker the hard way (by debugging containers at 3am). He now containerizes everything and deploys ML systems that work reliably across environments.
When he's not optimizing Dockerfiles at 2am, he's building production ML systems and teaching others to escape dependency hell.