Machine Learning with Python

Python dominates machine learning thanks to libraries like NumPy, pandas, scikit-learn, PyTorch, and TensorFlow. Backend developers often integrate ML models into APIs, batch jobs, and data pipelines — this page covers practical ML workflows without requiring a PhD.

ML Workflow Overview

  Data → Explore → Feature Engineering → Train → Evaluate → Deploy → Monitor

Most production effort goes into data quality and feature engineering, not model selection.

Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install numpy pandas scikit-learn matplotlib jupyter

For deep learning:

pip install torch torchvision  # or tensorflow

Pin versions in requirements.txt — ML stacks are sensitive to library versions.
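
One way to produce those pins is to read the versions already installed in the active environment. The sketch below uses only the standard library. The helper name `pinned_requirements` is made up for illustration, and packages that are not installed are skipped rather than guessed.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for each installed package in `packages`."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment: leave it out rather than guess.
            continue
    return lines


if __name__ == "__main__":
    stack = ["numpy", "pandas", "scikit-learn", "matplotlib"]
    print("\n".join(pinned_requirements(stack)))
```

Running `pip freeze > requirements.txt` pins the whole environment instead. A helper like this is mainly useful when only the direct dependencies should be listed.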

Loading and Exploring Data

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("customers.csv")
print(df.shape)           # (rows, columns)
print(df.dtypes)          # column types
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column

# Visualize distributions
df["age"].hist(bins=30)
plt.savefig("age_distribution.png")

Key questions before modeling:

Missing values? Outliers? Class imbalance?
Leakage — does any feature contain future information?
Target variable distribution?
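
Several of these questions can be answered with a few lines of pandas. The sketch below runs on a tiny made-up frame standing in for customers.csv. The 1.5 × IQR rule is one common outlier heuristic, chosen here as an assumption. Leakage cannot be detected mechanically and still needs a manual review of how each column is produced.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for customers.csv (values are made up).
df = pd.DataFrame({
    "age": [25, 31, 47, 52, np.nan, 38, 29, 120],
    "income": [40_000, 52_000, 61_000, np.nan, 58_000, 45_000, 50_000, 47_000],
    "churned": [0, 0, 0, 1, 0, 0, 0, 0],
})

# Missing values: share of nulls per column.
missing_share = df.isnull().mean()

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
age_outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Class imbalance / target distribution: share of each label.
class_share = df["churned"].value_counts(normalize=True)

print(missing_share)
print(age_outliers)
print(class_share)
```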

Feature Engineering

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_features = ["age", "income", "tenure_months"]
categorical_features = ["plan_type", "region"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

Good features beat complex models. Domain knowledge drives feature design:

Ratios (spend_per_month = total_spend / months_active)
Time windows (purchases_last_30_days)
Aggregations (avg_order_value)
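
The three feature types above can be sketched with pandas as follows. The `orders` table, its column names, the `months_active` values, and the `as_of` cutoff date are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical order history; column names and values are illustrative.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-20", "2024-03-10", "2024-01-15", "2024-03-01"]
    ),
    "amount": [30.0, 50.0, 40.0, 100.0, 60.0],
})
months_active = pd.Series({1: 3, 2: 4}, name="months_active")
as_of = pd.Timestamp("2024-03-15")  # feature cutoff: ignore anything later

# Aggregations: total spend and average order value per customer.
features = orders.groupby("customer_id")["amount"].agg(
    total_spend="sum", avg_order_value="mean"
)

# Ratio: spend per active month.
features["spend_per_month"] = features["total_spend"] / months_active

# Time window: purchases in the 30 days up to the cutoff.
in_window = (orders["order_date"] > as_of - pd.Timedelta(days=30)) & (
    orders["order_date"] <= as_of
)
features["purchases_last_30_days"] = (
    orders[in_window]
    .groupby("customer_id")
    .size()
    .reindex(features.index, fill_value=0)
)
print(features)
```

Computing window features relative to an explicit cutoff keeps information from after the prediction date out of the features.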

Training with scikit-learn

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

X = df.drop("churned", axis=1)
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("prep", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print(f"AUC: {roc_auc_score(y_test, y_prob):.3f}")

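Always split before fitting any preprocessing step, so that statistics such as scaler means and encoder categories are learned from the training set only. Wrapping the steps in a Pipeline applies the same fitted transforms at training and prediction time.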
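
To make the split-first rule concrete, the sketch below fits a StandardScaler on the training rows of a synthetic feature and reuses it for the test rows. The data is randomly generated and stands in for a real column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))  # synthetic feature

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned mean comes from the training split, not from the full dataset.
print(scaler.mean_[0], X_train.mean(), X.mean())
```

A Pipeline does this bookkeeping automatically: `fit` learns the transform from the data it is given, and `predict` only applies it.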

Cross-Validation and Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_depth": [5, 10, None],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)
print(f"Best CV AUC: {search.best_score_:.3f}")

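Cross-validation gives more reliable estimates than a single train/test split, because every training sample is used for validation exactly once and the score is averaged across folds. Run the search on the training set only and keep the test set for one final check.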
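
The fold-to-fold spread is easiest to see by scoring one fixed model with `cross_val_score`. This is a minimal sketch on synthetic data from `make_classification`. The dataset size and the model settings are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification problem (not real customer data).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("per-fold AUC:", scores.round(3))
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much a single split could have misled.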

Model Evaluation Metrics

Task	Metrics
Binary classification	Precision, recall, F1, AUC-ROC
Multi-class	Macro/micro F1, confusion matrix
Regression	MAE, RMSE, R²
Ranking	NDCG, MAP

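For imbalanced classes (fraud, churn), accuracy is misleading: a model that always predicts the majority class scores high while catching no positives. Evaluate with precision, recall, or the area under the precision-recall curve, and consider class_weight='balanced' when training.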
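
A small worked example shows why accuracy hides the problem. The labels below are made up: 95 negatives, 5 positives, and a "model" that always predicts the negative class.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 95 negatives and 5 positives (made-up labels), e.g. a 5% fraud rate.
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 avoids a warning when nothing is predicted positive.
precision = precision_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Accuracy comes out at 0.95 even though recall is 0, meaning not a single positive case is found.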

Feature Importance

# Pull the fitted steps out of the pipeline
model = pipeline.named_steps["model"]
importances = model.feature_importances_
feature_names = pipeline.named_steps["prep"].get_feature_names_out()

# Print the ten most important features, largest first
for name, imp in sorted(zip(feature_names, importances), key=lambda x: -x[1])[:10]:
    print(f"{name}: {imp:.4f}")

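Impurity-based importances from tree ensembles can overstate high-cardinality features, so treat them as a first look. Use SHAP for per-prediction explanations when debugging models in production.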
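
SHAP is a separate package and is not shown here. As a stand-in that needs only scikit-learn, the sketch below uses permutation importance, a different model-agnostic technique: each column of held-out data is shuffled in turn and the drop in score is recorded. The synthetic dataset and the `feature_N` labels are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the held-out data and record the drop in AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

Because it is computed on held-out data, permutation importance reflects what the model relies on for generalization rather than what it used to fit the training set.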

Saving and Loading Models

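import joblib

# Persist the whole pipeline so preprocessing and model stay together
joblib.dump(pipeline, "churn_model_v1.joblib")

loaded = joblib.load("churn_model_v1.joblib")
# new_customer_data: a DataFrame with the same columns used for training
prediction = loaded.predict_proba(new_customer_data)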

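Version models (v1, v2) and store metadata (training date, metrics, feature list) alongside each file. joblib files are pickles, so load them only from trusted sources and with the same library versions used for training.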
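
One lightweight way to keep that metadata is a JSON file written next to the model. This is a sketch using only the standard library. The helper name `write_model_metadata` and the field names are hypothetical, and the metric value in the demo is a placeholder, not a real result.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_model_metadata(model_path, version, metrics, features):
    """Write a JSON sidecar next to the model file and return its path."""
    meta = {
        "model_file": Path(model_path).name,
        "version": version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
        "features": list(features),
    }
    meta_path = Path(model_path).with_suffix(".json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


# Demo in a throwaway directory; the metric is a placeholder, not a real score.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = write_model_metadata(
        Path(tmp) / "churn_model_v1.joblib",
        version="v1",
        metrics={"roc_auc": None},
        features=["age", "income", "tenure_months", "plan_type", "region"],
    )
    saved = json.loads(meta_path.read_text())
print(saved["version"], saved["features"])
```

A model registry such as MLflow covers the same need with more structure. The sidecar file is a reasonable starting point before one is in place.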

Serving Models in APIs

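from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load("churn_model_v1.joblib")  # load once at startup

class Customer(BaseModel):
    # Fields mirror the training features; malformed input is rejected with a 422
    age: float
    income: float
    tenure_months: float
    plan_type: str
    region: str

@app.post("/predict")
def predict(customer: Customer):
    # Plain def: FastAPI runs it in a threadpool, so inference does not block the event loop
    df = pd.DataFrame([customer.model_dump()])  # Pydantic v2; use .dict() on v1
    prob = model.predict_proba(df)[0][1]
    return {"churn_probability": float(prob)}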

For high throughput, use ONNX Runtime, TorchServe, or dedicated ML platforms (SageMaker, Vertex AI).

Deep Learning Basics (PyTorch)

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_size, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, x):
        return self.layers(x)

model = SimpleNet(input_size=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Use PyTorch/TensorFlow for images, text, and sequences. Start with scikit-learn for tabular data.
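
The snippet above defines the model, loss, and optimizer but stops before training. A minimal full-batch training loop might look like the sketch below. To stay self-contained it rebuilds the same layers with `nn.Sequential` and trains on random tensors whose label depends on the first feature. The data, learning rate, and epoch count are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic tabular batch: 256 rows, 10 features, label depends on feature 0.
X = torch.randn(256, 10)
y = (X[:, 0] > 0).long()

# Same architecture as SimpleNet, written inline to keep the example standalone.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(50):
    optimizer.zero_grad()        # clear gradients from the previous step
    logits = model(X)            # forward pass
    loss = criterion(logits, y)  # CrossEntropyLoss expects raw logits and int labels
    loss.backward()              # backpropagate
    optimizer.step()             # update weights
    losses.append(loss.item())

model.eval()
with torch.no_grad():
    accuracy = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"first loss={losses[0]:.3f} last loss={losses[-1]:.3f} train acc={accuracy:.2f}")
```

Real training would iterate over mini-batches from a DataLoader and track loss on a validation set. The accuracy printed here is on the training data only.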

MLOps Essentials

Experiment tracking — MLflow, Weights & Biases
Data versioning — DVC
Model registry — MLflow Model Registry
Monitoring — detect data drift and prediction distribution shifts
Retraining pipeline — scheduled jobs when performance degrades
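
As an illustration of the monitoring item, one common drift score is the Population Stability Index (PSI), which compares how a feature is distributed at training time and in production. The sketch below is one possible implementation with NumPy. The function name, the bin count, and the synthetic "age" samples are assumptions, and alert thresholds are application-specific.

```python
import numpy as np


def population_stability_index(reference, current, bins=10):
    """PSI between two 1-D samples, using quantile bins from `reference`."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from the reference distribution's quantiles.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=bins)
    # Clip shares to avoid log(0) when a bin is empty.
    ref_share = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_share = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_share - ref_share) * np.log(cur_share / ref_share)))


rng = np.random.default_rng(0)
train_ages = rng.normal(40, 10, size=5_000)    # synthetic training feature
same_ages = rng.normal(40, 10, size=5_000)     # same distribution
shifted_ages = rng.normal(48, 10, size=5_000)  # drifted distribution

psi_same = population_stability_index(train_ages, same_ages)
psi_shifted = population_stability_index(train_ages, shifted_ages)
print(f"PSI same={psi_same:.4f} shifted={psi_shifted:.4f}")
```

The same function can be applied to the model's predicted probabilities to watch for shifts in the prediction distribution.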

Common Pitfalls

Training on test data (leakage)
Ignoring class imbalance
Overfitting — use regularization and simpler models first
No baseline — always compare against a simple rule (e.g., predict majority class)
Deploying without input validation
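
The baseline pitfall is cheap to avoid with scikit-learn's `DummyClassifier`. The sketch below compares a majority-class baseline with a logistic regression on synthetic, imbalanced data. The dataset parameters and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% negatives and 20% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_clusters_per_class=1,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"baseline accuracy={baseline.score(X_test, y_test):.3f} AUC={baseline_auc:.3f}")
print(f"model    accuracy={model.score(X_test, y_test):.3f} AUC={model_auc:.3f}")
```

The baseline reaches high accuracy simply from the class ratio, but its AUC is 0.5 because it gives every row the same score. A real model has to beat both numbers to justify its complexity.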

Machine learning in Python is iterative: start simple, measure rigorously, and only add complexity when metrics justify it.
