5 Things I Wish I Knew Before Building My First MLOps Pipeline

May 5, 2024

My fraud detection model worked flawlessly in my Jupyter notebook. 95% accuracy. Beautiful confusion matrix. The kind of results that make you think, "I'm basically a machine learning genius."

Then I tried to Dockerize it.

Everything went to hell.

If you've ever built a machine learning model, you know that getting it to work in a notebook is the easy part. The hard part? Getting it to work everywhere else. In production. On someone else's machine. At 3am when you're not there to babysit it.

This is the story of building FraudShield AI, my production fraud detection system, and the five brutal lessons I learned along the way. These aren't the polished takeaways you see in Medium posts written by people who definitely didn't build the thing themselves. These are the messy, unglamorous truths about MLOps that I wish someone had told me before I started.

If you're about to build your first MLOps pipeline, buckle up. Here's what actually matters.

#1: Training Accuracy is a Beautiful Lie

Let me be blunt: your 95% accuracy is basically astrology.

I know, I know. You spent weeks tuning hyperparameters. You did cross-validation. You have a confusion matrix that looks like art. But here's the thing—your training accuracy is measured against the most optimistic version of reality possible.

Your test set is:

  • Perfectly clean (no typos, no weird edge cases)

  • From the same distribution as your training data (obviously)

  • Small enough that you never see the truly bizarre inputs

  • Static (it never changes)

Real production data is none of these things.

What actually happens in production:

Your model meets data it's never seen before. Users fat-finger inputs. Systems upstream change their output format without telling you. Three months pass and user behavior shifts. Suddenly your beautiful 95% accuracy becomes 87%, then 82%, then you're getting angry Slack messages.

Here's what I learned: the metrics that matter in production aren't the ones you see in your notebook.

What to track instead:

  • Latency: Is your model actually fast enough for users to tolerate?

  • Error rate: How often does it completely fail?

  • Prediction distribution: Are you suddenly predicting "fraud" 10x more than usual? (Congrats, something broke)

  • Real-world accuracy: If you can measure it, measure it. User feedback, manual reviews, whatever you've got.

The brutal truth is that 95% notebook accuracy → 87% production accuracy is completely normal. Your model is meeting reality for the first time, and reality is messier than your test set.
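That "predicting fraud 10x more than usual" check is cheap to implement. Here's a minimal sketch of a prediction-distribution drift check; the baseline rate and tolerance are made-up numbers, not values from my pipeline:

```python
# A minimal drift check on the prediction distribution. Baseline rate and
# tolerance below are assumptions -- calibrate them against your own traffic.
def fraud_rate_drifted(recent_predictions, baseline_rate=0.02, tolerance=3.0):
    """Return True if the recent fraud rate exceeds tolerance x baseline.

    recent_predictions: iterable of 0/1 labels from the last monitoring window.
    """
    predictions = list(recent_predictions)
    if not predictions:
        return False  # no traffic yet; nothing to compare
    recent_rate = sum(predictions) / len(predictions)
    return recent_rate > baseline_rate * tolerance
```

Run it on a sliding window of recent predictions and page yourself when it returns True. It won't tell you *why* things broke, but it tells you *that* they broke, days before a human would notice.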

Here's what basic production monitoring looks like:

python

from prometheus_client import Counter, Histogram
import time

# Define metrics
prediction_counter = Counter('ml_predictions_total', 'Total predictions made')
prediction_latency = Histogram('ml_prediction_latency_seconds', 'Prediction latency')
prediction_errors = Counter('ml_prediction_errors_total', 'Total prediction errors')

def predict_with_monitoring(model, input_data):
    """Make a prediction with production monitoring"""
    start_time = time.time()
    
    try:
        # Make prediction
        prediction = model.predict(input_data)
        
        # Record successful prediction
        prediction_counter.inc()
        latency = time.time() - start_time
        prediction_latency.observe(latency)
        
        return prediction
        
    except Exception:
        # Record the error, then re-raise so the caller still sees the failure
        prediction_errors.inc()
        raise

This isn't sexy code. But it's the code that tells you when your model is dying a slow death in production.

The lesson: Build for reality, not for Kaggle. Your notebook accuracy is a starting point, not a promise. Monitor everything, assume nothing, and accept that production will humble you.

#2: You Will Forget How Your Model Works

Three months after deploying your model, someone will ask you: "Hey, which version is running in production right now?"

And you'll have no idea.

I promise you this will happen. You'll be confident at first. "Oh, I know exactly what's deployed." Then you'll check, and you'll realize you have three different model files, five training scripts, and absolutely no idea which combination is currently serving traffic.

This is the "oh shit" moment that makes you understand why version control isn't just for code.

What you need to version:

  1. Model weights - The actual .pkl or .h5 files. Not just the latest one. All of them.

  2. Training data - Which dataset did you use? What dates? What filters?

  3. Hyperparameters - Learning rate, number of epochs, batch size, regularization...

  4. Dependencies - scikit-learn 1.0 and scikit-learn 1.3 are not the same

  5. Training code - The script that generated the model

  6. Evaluation metrics - So you can compare apples to apples

If you're not versioning all of this, you're going to have a bad time.

Enter MLflow, the thing that saves your sanity:

MLflow is an open-source platform that tracks every experiment you run. Think of it as Git, but for machine learning experiments.

python

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Start MLflow run
with mlflow.start_run(run_name="fraud_detection_v1"):
    
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("min_samples_split", 5)
    
    # Train model
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("precision", precision_score(y_test, y_pred))
    mlflow.log_metric("recall", recall_score(y_test, y_pred))
    
    # Log the model itself
    mlflow.sklearn.log_model(model, "model")
    
    # Log training data info
    mlflow.log_param("training_data_size", len(X_train))
    mlflow.log_param("training_data_date", "2025-01-15")

Now when your boss asks "which model is in production," you can pull up the MLflow UI and actually show them instead of nervously sweating.

Why this matters:

Without versioning, you can't:

  • Reproduce your results

  • Debug production issues

  • Compare model versions objectively

  • Roll back when the new model sucks

  • Sleep peacefully at night

The lesson: Your future self will thank you for logging everything. Document obsessively or suffer later. MLflow is free, setup takes 30 minutes, and it will save you hours of "wait, what did I do?" panic.

#3: Manual Retraining is Emotional Damage

Here's how retraining your model manually goes:

Week 1: "This is fine. I'll just retrain every Monday morning."

Week 3: "Okay I forgot last Monday but I'll definitely do it this week."

Week 6: "Why is the model performing terribly? Oh right, I haven't retrained in a month."

Week 8: Existential dread

Manual retraining doesn't scale. It doesn't work when you're on vacation. It doesn't work when you're busy. It doesn't work when you forget. And you will forget.

This is why orchestration tools exist. They're your robot assistant that does boring tasks while you do literally anything else.

Enter Airflow: Your New Best Friend

Apache Airflow is a workflow orchestration tool. Think of it as a fancy scheduler that can handle dependencies, retry failures, and send you passive-aggressive emails when things break.

Here's what my Airflow DAG does while I sleep:

  1. Check Performance - Is the model getting worse? (Compare recent accuracy to baseline)

  2. Prepare Data - Pull fresh training data from the database

  3. Train Model - Train a new model with updated data

  4. Validate - Test the new model. Is it better than what's in production?

  5. Deploy & Notify - If yes, swap models. Either way, email me the results.

Here's what a simple Airflow DAG looks like:

python

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import mlflow

# Helpers like get_production_accuracy(), preprocess_data(), deploy_model(),
# get_latest_model_accuracy() and database_connection are defined elsewhere
# in the project.

# Define default args
default_args = {
    'owner': 'jonathan',
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
    'email': ['your-email@example.com']
}

# Create DAG
dag = DAG(
    'fraud_model_retraining',
    default_args=default_args,
    description='Automated ML model retraining pipeline',
    schedule_interval='@weekly',  # Run every week
    start_date=datetime(2025, 1, 1),
    catchup=False
)

def check_model_performance():
    """Check if current model performance is degrading.

    Note: with a plain PythonOperator this return value is informational
    only; to actually skip retraining, use BranchPythonOperator and return
    the task_id of the branch to follow.
    """
    current_accuracy = get_production_accuracy()
    baseline_accuracy = 0.92
    
    if current_accuracy < baseline_accuracy:
        print(f"Model performance degraded: {current_accuracy} < {baseline_accuracy}")
        return "retrain_needed"
    else:
        print(f"Model performance okay: {current_accuracy}")
        return "no_retrain"

def prepare_training_data():
    """Pull and prepare fresh training data"""
    query = """
        SELECT * FROM transactions 
        WHERE date > NOW() - INTERVAL '90 days'
    """
    df = pd.read_sql(query, database_connection)
    df_clean = preprocess_data(df)
    df_clean.to_parquet('/tmp/training_data.parquet')
    print(f"Prepared {len(df_clean)} training samples")

def train_new_model():
    """Train a new model with fresh data"""
    df = pd.read_parquet('/tmp/training_data.parquet')
    X = df.drop('is_fraud', axis=1)
    y = df['is_fraud']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=100, max_depth=10)
        model.fit(X_train, y_train)
        
        accuracy = model.score(X_test, y_test)
        mlflow.log_metric("accuracy", accuracy)
        mlflow.sklearn.log_model(model, "model")
        
        print(f"New model accuracy: {accuracy}")
        return accuracy

def deploy_if_better():
    """Deploy new model if it's better than current production model"""
    new_accuracy = get_latest_model_accuracy()
    prod_accuracy = get_production_accuracy()
    
    if new_accuracy > prod_accuracy:
        print(f"Deploying new model: {new_accuracy} > {prod_accuracy}")
        deploy_model()
        return "deployed"
    else:
        print(f"Keeping old model: {new_accuracy} <= {prod_accuracy}")
        return "not_deployed"

# Define tasks
check_task = PythonOperator(
    task_id='check_performance',
    python_callable=check_model_performance,
    dag=dag
)

prepare_task = PythonOperator(
    task_id='prepare_data',
    python_callable=prepare_training_data,
    dag=dag
)

train_task = PythonOperator(
    task_id='train_model',
    python_callable=train_new_model,
    dag=dag
)

deploy_task = PythonOperator(
    task_id='deploy_model',
    python_callable=deploy_if_better,
    dag=dag
)

# Set dependencies
check_task >> prepare_task >> train_task >> deploy_task

What this gives you:

  • Automated retraining - Runs every week without you lifting a finger

  • Dependency management - Tasks run in order, data flows between them

  • Failure recovery - If something breaks, it retries automatically

  • Monitoring - Beautiful UI shows you what's running, what failed, what succeeded

  • Alerting - Emails you when things break (or succeed)

The honest take: Setting up Airflow took me about 2 days. Debugging the DAG took another day. But now it saves me 2 hours every single week, and I never have to remember to retrain manually.

The lesson: Automate the boring stuff. Your brain is for solving problems, not for remembering to run scripts every Monday. Let Airflow be your robot assistant.

#4: Docker Will Ruin Your Week (Then Save Your Career)

Let me tell you about my relationship with Docker.

Day 1: "This is amazing! Everything works!"

Day 2: "Why doesn't it work anymore?"

Day 3: "I hate Docker. I hate computers. I hate everything."

Day 7: "Okay it works now. Docker is actually amazing."

Docker is one of those technologies that makes you want to throw your laptop out the window... until suddenly everything clicks and you realize it's actually magical.

My Docker nightmare:

Everything worked fine locally. My fraud detection API ran perfectly. Then I tried to Dockerize it, and:

  • The API couldn't connect to the database (network issues)

  • Environment variables weren't loading (they actually were; I was just checking them the wrong way)

  • The model file path was broken (volume mounts are confusing)

  • The container was 4GB for some reason (inefficient Dockerfile)

  • Port 8000 was "already in use" (phantom container haunting me)

I spent 5+ hours debugging. Here's what I learned the hard way:

Docker Debugging Survival Guide:

bash

# 1. CHECK LOGS FIRST (always)
docker logs container_name

# 2. Get inside the container when confused
docker exec -it container_name bash
# Then poke around, check if files exist, test connections

# 3. Check what's actually running
docker ps -a  # Shows all containers, even stopped ones

# 4. Nuclear option: kill everything and start fresh
docker-compose down -v  # Removes containers AND volumes
docker system prune -a  # Cleans up everything

# 5. Check port conflicts
lsof -i :8000  # See what's using port 8000

Here's a production-ready Dockerfile for an ML API:

dockerfile
  
# Multi-stage build for smaller image size
FROM python:3.10-slim as builder

# Install build dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy and install requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Final stage - much smaller
FROM python:3.10-slim

# Copy virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy application code
COPY . .

# Create non-root user for security
RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
USER appuser

# Expose port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run application
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Key things this Dockerfile does right:

  • Multi-stage build - Smaller final image (200MB vs 1GB+)

  • Virtual environment - Isolated dependencies

  • Non-root user - Security best practice

  • Health check - Docker can tell if app is actually working

  • --no-cache-dir - pip skips its download cache, so no dead weight in the image
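One more image-size lever the Dockerfile doesn't show: a .dockerignore keeps the build context small, so COPY . . doesn't drag notebooks, git history, and secrets into the image. A sketch, assuming a typical project layout:

```text
# .dockerignore (hypothetical layout -- adjust to your repo)
.git
__pycache__/
*.ipynb
# secrets stay out of the image
.env
# models are volume-mounted at runtime, not baked into the image
models/
mlruns/
```

Patterns go one per line; comments must be on their own lines starting with #.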

Docker Compose for the full stack:

yaml

version: '3.8'

services:
  # ML API
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/frauddb
      - MODEL_PATH=/models/fraud_model.pkl
    volumes:
      - ./models:/models
    depends_on:
      - db
    restart: unless-stopped
  
  # PostgreSQL Database
  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=frauddb
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
  
  # Prometheus (monitoring)
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
  
  # Grafana (dashboards)
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:

Why Docker is worth the pain:

Once you get it working:

  • "Works on my machine" is solved - If it works in Docker, it works everywhere

  • No more dependency hell - Everything is isolated

  • Easy deployment - Push image, run container, done

  • Scalability - Need 5 API instances? Easy.

  • Reproducibility - Same environment every time

The lesson: Docker will hurt you before it helps you. That's normal. Stick with it. Debug methodically. Once it clicks, you'll never want to deploy without it again.

#5: Monitoring or You're Flying Blind

Here's a fun scenario: Your model starts failing. Slowly. Silently.

Users notice first. They complain. By the time you investigate, it's been broken for days.

This is what happens when you don't monitor.

What you need to track in production:

  1. Prediction latency - How long does each prediction take?

  2. Error rate - How often does the API completely fail?

  3. Prediction distribution - Are predictions suddenly very different than usual?

  4. Model accuracy - Is the model getting dumber over time? (if you can measure this)

  5. System health - CPU, memory, disk space

My monitoring stack: Prometheus + Grafana

Prometheus collects metrics from your application.
Grafana turns those metrics into beautiful dashboards.

Both are free, open-source, and battle-tested by companies much bigger than yours.

Adding Prometheus metrics to a FastAPI app:

python

from fastapi import FastAPI, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from prometheus_fastapi_instrumentator import Instrumentator
import time

app = FastAPI()

# Define custom metrics
predictions_total = Counter(
    'ml_predictions_total',
    'Total number of predictions made'
)

prediction_latency = Histogram(
    'ml_prediction_latency_seconds',
    'Prediction latency in seconds',
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0]
)

prediction_errors = Counter(
    'ml_prediction_errors_total',
    'Total prediction errors'
)

fraud_predictions = Gauge(
    'ml_fraud_predictions_current',
    'Current number of fraud predictions in last hour'
)

# Add default FastAPI metrics (instrument only; we expose /metrics ourselves below)
Instrumentator().instrument(app)

@app.post("/predict")
async def predict(transaction: dict):
    """Make fraud prediction with monitoring"""
    start_time = time.time()
    
    try:
        # `model` is loaded once at startup; turn the incoming dict into
        # the feature vector the model expects before predicting
        features = [list(transaction.values())]
        prediction = model.predict(features)[0]
        
        # Record metrics
        predictions_total.inc()
        latency = time.time() - start_time
        prediction_latency.observe(latency)
        
        if prediction == 1:  # Fraud detected
            fraud_predictions.inc()
        
        return {"prediction": int(prediction), "latency": latency}
        
    except Exception:
        prediction_errors.inc()
        raise

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint"""
    return Response(content=generate_latest(), media_type="text/plain")

What your Grafana dashboard should show:

Panel 1: Predictions Per Minute

  • Line graph showing request volume

  • Helps you see traffic patterns

  • Spikes might indicate issues or attacks

Panel 2: P95 Latency

  • 95th percentile response time

  • If this spikes, users are having a bad time

Panel 3: Error Rate

  • Percentage of failed predictions

  • Should be near 0%, always

Panel 4: Prediction Distribution

  • Fraud vs non-fraud ratio over time

  • Sudden changes = something is wrong

Panel 5: Model Accuracy (if measurable)

  • Track accuracy over time

  • Catch degradation early

Panel 6: System Resources

  • CPU, memory, disk usage

  • Prevents surprise outages
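These panels map directly onto the metrics from the FastAPI snippet above. The PromQL behind the first three might look like this (the query windows are a matter of taste):

```promql
# Panel 1: predictions per minute
rate(ml_predictions_total[5m]) * 60

# Panel 2: P95 latency (uses the Histogram's _bucket series)
histogram_quantile(0.95, sum(rate(ml_prediction_latency_seconds_bucket[5m])) by (le))

# Panel 3: error rate as a fraction of all predictions
rate(ml_prediction_errors_total[5m]) / rate(ml_predictions_total[5m])
```

Paste each one into a Grafana panel backed by your Prometheus data source and you have the core of the dashboard.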

Prometheus configuration (prometheus.yml):

yaml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'fraud-detection-api'
    static_configs:
      - targets: ['api:8000']
    metrics_path: '/metrics'

The brutal truth: You can't fix what you can't see.

Without monitoring:

  • You find out about issues from angry users

  • You have no idea when things started breaking

  • You can't prove your model works

  • You can't debug production issues

  • You look unprofessional

With monitoring:

  • You see issues before users do

  • You have data to debug with

  • You can confidently say "yes, the model works"

  • You catch degradation early

  • You sleep better at night

The lesson: Set up monitoring on day one, not after things break. Prometheus and Grafana take maybe 2 hours to set up. That investment will save you countless hours of "what the hell is happening?" panic.

Bonus: Nobody's First MLOps Pipeline is Good

Let me tell you a secret: your first MLOps pipeline will suck.

Mine did. Everyone's does. That's completely fine.

The evolution of my fraud detection system:

V1: The "It Works!" Stage

  • One Python script

  • Trained model manually

  • Saved as .pkl file

  • Ran locally

  • No monitoring, no versioning, no orchestration

  • But it worked!

V2: The "Let's Add Some Logging" Stage

  • Added print statements everywhere

  • Could finally debug issues

  • Still running locally

  • Still manual retraining

V3: The "Docker Nightmare" Stage

  • Dockerized everything

  • Spent a week debugging

  • Finally got it working

  • Felt like a genius

V4: The "Automation Era" Stage

  • Added Airflow

  • Automated retraining

  • Never touched it manually again

  • Started feeling professional

V5: The "Full MLOps" Stage

  • Added Prometheus monitoring

  • Added Grafana dashboards

  • Added MLflow tracking

  • 7 microservices in Docker Compose

  • Fully automated, fully monitored

  • Actually production-ready

Total time from V1 to V5: About 3 months

The point: You don't build V5 on day one. You start with V1 (a script that works) and add complexity as you feel the pain.

My advice for beginners:

  1. Start with working, not perfect - Get something deployed first

  2. Add monitoring next - You'll want this immediately

  3. Automate when manual becomes painful - Usually week 2-3

  4. Containerize when ready - Don't fight Docker until you have to

  5. Document everything - Future you will be grateful

Don't over-engineer V1. You don't need Kubernetes. You don't need distributed training. You don't need 47 microservices. You need a model that works in production.

Add complexity when the pain of not having it becomes unbearable.

The lesson: MLOps is iterative, just like everything in software. Ship fast, learn fast, iterate fast. Your V1 will be embarrassing. That's how you know you're learning.

Conclusion

So there you have it. Five things I wish I knew before building my first MLOps pipeline:

  1. Training accuracy is a beautiful lie - Your 95% will become 87% in production. Plan for it.

  2. You will forget how your model works - Version everything with MLflow or suffer amnesia.

  3. Manual retraining is emotional damage - Automate with Airflow or hate your life.

  4. Docker will hurt you (then save you) - It's painful until it's magical. Stick with it.

  5. Monitor or be embarrassed - Prometheus + Grafana. Set it up day one.

My FraudShield AI fraud detection system now runs at 95%+ accuracy with <120ms latency, fully automated retraining, comprehensive monitoring, and zero manual intervention. But it took 3 months of iteration, countless bugs, and one very frustrating week with Docker to get there.

The unglamorous truth about MLOps: it's messy, it's iterative, and your first version will be embarrassing. But that's fine. Ship it anyway. Learn by breaking things. Add complexity as pain demands it.

MLOps isn't about perfection. It's about building systems that work in the real world, survive contact with actual users, and don't require you to babysit them at 3am.

Want to see the full system in action?

Check out my deep-dive article on building the complete fraud detection platform: [Coming soon - link to your project article]

Ready to build your own MLOps pipeline?

I'm still learning, still breaking things, and still improving the system. That's the fun part.

About the Author

Jonathan Sodeke is a Data Engineer and ML Engineer who has broken enough ML pipelines to write about it with authority. He specializes in MLOps, geospatial data processing, and containerized deployments. Currently building production AI systems and occasionally fighting with Docker at 2am.

When he's not debugging containers, he's building multi-agent AI systems, teaching prompt engineering, and helping others navigate the messy reality of production machine learning.

Portfolio: jonathansodeke.framer.website
GitHub: github.com/Shodexco
LinkedIn: www.linkedin.com/in/jonathan-sodeke


Sign Up To My Newsletter

Get notified when a new article is posted.


© Jonathan Sodeke 2025

