The Python Data Stack

Library               Purpose
NumPy                 Numerical arrays and math
Pandas                DataFrames, data cleaning, analysis
Matplotlib / Seaborn  Visualization
Jupyter               Interactive notebooks
Scikit-learn          Machine learning

Installation

pip install numpy pandas matplotlib seaborn jupyter scikit-learn
jupyter notebook
  

NumPy Basics

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
print(arr.mean())       # 3.0
print(arr * 2)          # [ 2  4  6  8 10]

matrix = np.array([[1, 2], [3, 4]])
print(matrix.shape)     # (2, 2)
print(matrix.T)         # transpose
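
To make the array operations above more concrete, here is a minimal sketch of axis-wise aggregation and boolean masking on the same kind of arrays. The variable names (col_sums, row_sums, above_mean) are illustrative and not part of the original snippet.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# axis=0 aggregates down the rows (one result per column);
# axis=1 aggregates across the columns (one result per row).
col_sums = matrix.sum(axis=0)   # array([4, 6])
row_sums = matrix.sum(axis=1)   # array([3, 7])

# A comparison yields a boolean array, which can index the original array.
arr = np.array([1, 2, 3, 4, 5])
above_mean = arr[arr > arr.mean()]   # array([4, 5])

print(col_sums, row_sums, above_mean)
```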
  

Pandas DataFrames

import pandas as pd

df = pd.read_csv('sales.csv')
print(df.head())
print(df.describe())

# Filtering
high_sales = df[df['amount'] > 1000]

# Grouping
by_region = df.groupby('region')['amount'].sum()

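# Missing data: drop rows without an email first, then fill the remaining gaps
# (filling everything with 0 first would leave no missing emails to drop)
df = df.dropna(subset=['email'])
df = df.fillna(0)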
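
The operations above need a sales.csv file to run. As a rough, self-contained sketch, the same filtering, grouping, and missing-data steps can be tried on a small in-memory DataFrame. The rows below are made-up placeholder data; only the column names follow the snippet above. The email rows are dropped before the fill, since filling every gap first would leave nothing for dropna to remove.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales.csv (invented values, for illustration only).
df = pd.DataFrame({
    "region": ["North", "South", "North", "West"],
    "amount": [1500.0, 800.0, np.nan, 2200.0],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
})

# Drop rows without an email, then fill missing amounts with 0.
df = df.dropna(subset=["email"])
df = df.fillna({"amount": 0})

# Filtering and grouping on the cleaned data.
high_sales = df[df["amount"] > 1000]
by_region = df.groupby("region")["amount"].sum()

print(high_sales)
print(by_region)
```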
  

Visualization

import matplotlib.pyplot as plt

df.groupby('month')['revenue'].sum().plot(kind='bar')
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.tight_layout()
plt.savefig('revenue.png')
plt.show()
  

Scikit-learn Quick Start

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X (feature matrix) and y (target vector) are assumed to be defined already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, predictions):.2f}")
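
The quick start above leaves X and y undefined. The following sketch fills them with synthetic data so the whole flow runs end to end. The data-generating rule (y = 3x + 2 plus noise) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): y = 3x + 2 plus small Gaussian noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"MSE: {mean_squared_error(y_test, predictions):.2f}")
print(f"slope: {model.coef_[0]:.2f}, intercept: {model.intercept_:.2f}")
```

The fitted slope and intercept should land close to the 3 and 2 used to generate the data, which is a quick sanity check that the pipeline is wired correctly.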
  

Workflow Tips

  1. Explore data with df.info(), df.describe(), visualizations
  2. Clean missing values, outliers, type conversions
  3. Feature engineer new columns from existing data
  4. Model with scikit-learn; evaluate with cross-validation
  5. Document in Jupyter notebooks for reproducibility
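
Steps 1 to 4 above can be sketched in one short script. This is only an outline under stated assumptions: the dataset is randomly generated, and the column names (ad_spend, visits, revenue, spend_per_visit) are placeholders rather than anything prescribed by the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical dataset, generated for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ad_spend": rng.uniform(100, 1000, size=100),
    "visits": rng.integers(50, 500, size=100).astype(float),
})
df["revenue"] = 2.0 * df["ad_spend"] + 5.0 * df["visits"] + rng.normal(0, 50, size=100)
df.loc[::10, "visits"] = np.nan  # inject missing values into every 10th row

# 1. Explore
df.info()
print(df.describe())

# 2. Clean: fill missing visits with the column median
df["visits"] = df["visits"].fillna(df["visits"].median())

# 3. Feature engineer: derive a new column from existing ones
df["spend_per_visit"] = df["ad_spend"] / df["visits"]

# 4. Model and evaluate with 5-fold cross-validation (R^2 suits regression)
X = df[["ad_spend", "visits", "spend_per_visit"]]
y = df["revenue"]
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```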

Python’s data ecosystem makes it a leading choice for analytics, machine learning, and scientific computing.

Data Cleaning with Pandas

import pandas as pd

df = pd.read_csv('sales.csv')
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df = df.drop_duplicates(subset=['order_id'])
df = df[df['amount'] > 0]
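
To see what each cleaning step above actually removes, here is a small sketch on made-up rows standing in for sales.csv. The values are invented; the column names match the snippet. One detail worth noticing is that the final filter also drops rows whose amount could not be parsed, because a NaN never compares greater than 0.

```python
import pandas as pd

# Hypothetical raw rows (invented values, for illustration only).
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07", "2024-01-08"],
    "amount": ["100.50", "n/a", "n/a", "-20", "75"],
})

df = raw.copy()
df["date"] = pd.to_datetime(df["date"])                      # strings -> datetime64
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # "n/a" becomes NaN
df = df.drop_duplicates(subset=["order_id"])                 # keeps the first order 2
df = df[df["amount"] > 0]                                    # drops negatives and NaN

print(df)
```

If unparseable amounts should be kept for review rather than silently discarded, one option is to inspect df[df["amount"].isna()] before applying the final filter.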
  

Merging DataFrames

orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')

merged = orders.merge(customers, on='customer_id', how='left')
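
A left merge keeps every order even when no matching customer exists, which can hide data problems. The sketch below uses two small made-up tables (the order_id and name columns are placeholders) and two optional merge arguments, validate and indicator, to surface unmatched rows.

```python
import pandas as pd

# Hypothetical tables; only customer_id comes from the snippet above.
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 9],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Grace", "Linus"],
})

# how="left" keeps every order. validate raises an error if customer_id is
# duplicated in customers; indicator=True adds a _merge column recording
# whether a matching customer row was found.
merged = orders.merge(
    customers, on="customer_id", how="left",
    validate="many_to_one", indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched)
```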
  

Seaborn for Statistical Plots

import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(data=df, x='ad_spend', y='revenue', hue='region')
plt.title('Ad Spend vs Revenue by Region')
plt.savefig('scatter.png', dpi=150)
  

Train/Test Split Best Practices

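from sklearn.model_selection import cross_val_score

# 'accuracy' applies only to classifiers; for the LinearRegression model
# fitted earlier, use a regression metric such as R^2
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")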
  

Never evaluate on the training data; use a holdout set or cross-validation instead.
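
A short sketch of why this matters: a flexible model can score almost perfectly on data it has already seen while doing much worse on held-out rows. The example swaps in a DecisionTreeRegressor (not used elsewhere in this guide) because it memorizes training data easily, and the noisy sine-wave data is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize the training set, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # score on data the model has seen
test_r2 = model.score(X_test, y_test)     # score on held-out data
print(f"train R^2: {train_r2:.3f}, holdout R^2: {test_r2:.3f}")
```

The gap between the two scores is the part of the apparent performance that comes from memorization rather than generalization.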

Reproducible Notebooks

import numpy as np
np.random.seed(42)  # seeds NumPy's global random generator only
  

Pin library versions in requirements.txt, and after changing code restart the kernel and rerun all cells so that results do not depend on stale notebook state.
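
One possible way to produce the pinned lines is to read the installed versions from the current environment with the standard library. The helper name pinned_requirements is hypothetical, and the package list simply mirrors the libraries named in this guide.

```python
from importlib.metadata import PackageNotFoundError, version

# Libraries named in this guide (PyPI distribution names).
packages = ["numpy", "pandas", "matplotlib", "seaborn", "jupyter", "scikit-learn"]

def pinned_requirements(names):
    """Return 'name==version' lines for installed packages.

    Packages that are not installed yield a comment line instead.
    """
    lines = []
    for name in names:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} is not installed")
    return lines

print("\n".join(pinned_requirements(packages)))
```

The printed lines can be pasted into requirements.txt. Running pip freeze > requirements.txt is the more common shortcut, though it captures every package in the environment rather than only the ones listed.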