Data Science Basics
The Python Data Stack
| Library | Purpose |
|---|---|
| NumPy | Numerical arrays and math |
| Pandas | DataFrames, data cleaning, analysis |
| Matplotlib / Seaborn | Visualization |
| Jupyter | Interactive notebooks |
| Scikit-learn | Machine learning |
Installation
pip install numpy pandas matplotlib seaborn jupyter scikit-learn
jupyter notebook
NumPy Basics
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print(arr.mean()) # 3.0
print(arr * 2) # [ 2  4  6  8 10]
matrix = np.array([[1, 2], [3, 4]])
print(matrix.shape) # (2, 2)
print(matrix.T) # transpose
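The element-wise operations above extend to whole rows and columns through broadcasting. The following is a minimal sketch that reuses the same 2x2 matrix; the variable names col_means, centered, and product are illustrative only.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])

# axis=0 aggregates down the rows, giving one mean per column.
col_means = matrix.mean(axis=0)

# Broadcasting: the 1-D array of column means is subtracted from every row.
centered = matrix - col_means

# The @ operator is matrix multiplication (not element-wise).
product = matrix @ matrix.T

print(col_means)
print(centered)
print(product)
```

Writing the centering step as a single array expression avoids an explicit Python loop over rows, which is usually both shorter and faster.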
Pandas DataFrames
import pandas as pd
df = pd.read_csv('sales.csv')
print(df.head())
print(df.describe())
# Filtering
high_sales = df[df['amount'] > 1000]
# Grouping
by_region = df.groupby('region')['amount'].sum()
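# Missing data: drop rows without an email first, then fill the remaining gaps.
# (Filling first would replace every NaN, leaving dropna with nothing to drop.)
df = df.dropna(subset=['email'])
df = df.fillna(0)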
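Since sales.csv is not included here, the same operations can be tried on a small in-memory DataFrame. This is a sketch under assumptions: the rows and the column names region, amount, and email are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales.csv; values and column names are assumptions.
df = pd.DataFrame({
    "region": ["north", "south", "north", "west"],
    "amount": [1500.0, 800.0, np.nan, 2200.0],
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
})

# Filtering: a comparison against NaN is False, so the NaN row is excluded.
high_sales = df[df["amount"] > 1000]

# Grouping: sum() skips NaN values by default.
by_region = df.groupby("region")["amount"].sum()

# Missing data: drop rows without an email, then fill gaps in amount only.
cleaned = df.dropna(subset=["email"]).fillna({"amount": 0})

print(high_sales)
print(by_region)
print(cleaned)
```

Passing a dict to fillna limits the fill to the named columns, which avoids writing 0 into text columns by accident.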
Visualization
import matplotlib.pyplot as plt
df.groupby('month')['revenue'].sum().plot(kind='bar')
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.tight_layout()
plt.savefig('revenue.png')
plt.show()
Scikit-learn Quick Start
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
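# Assumed feature and target columns; adjust the names to match your data
X = df[['ad_spend']]
y = df['revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)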
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, predictions):.2f}")
Workflow Tips
- Explore data with df.info(), df.describe(), and visualizations
- Clean missing values, outliers, and type conversions
- Feature engineer new columns from existing data
- Model with scikit-learn; evaluate with cross-validation
- Document in Jupyter notebooks for reproducibility
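The steps above can be sketched end to end on synthetic data. Everything in this example is an assumption made for illustration: the columns ad_spend, units, and revenue, the coefficients used to generate them, and the derived spend_per_unit feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a real dataset (all values are made up).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "ad_spend": rng.uniform(100, 1000, size=n),
    "units": rng.integers(1, 50, size=n),
})
df["revenue"] = 2.5 * df["ad_spend"] + 40 * df["units"] + rng.normal(0, 50, size=n)
df.loc[df.index % 10 == 0, "ad_spend"] = np.nan  # inject some missing values

# 1. Explore
df.info()
print(df.describe())

# 2. Clean: drop rows with a missing feature value
df = df.dropna(subset=["ad_spend"])

# 3. Feature engineering: derive a new column from existing ones
df["spend_per_unit"] = df["ad_spend"] / df["units"]

# 4. Model and evaluate with 5-fold cross-validation (R^2 is a regression metric)
X = df[["ad_spend", "units", "spend_per_unit"]]
y = df["revenue"]
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Rows with missing values are dropped here rather than imputed, so no statistic computed on the full dataset leaks into the cross-validation folds.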
Together, these libraries make Python one of the most widely used languages for analytics, machine learning, and scientific computing.
Data Cleaning with Pandas
import pandas as pd
df = pd.read_csv('sales.csv')
df['date'] = pd.to_datetime(df['date'])
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df = df.drop_duplicates(subset=['order_id'])
df = df[df['amount'] > 0]
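To see what each cleaning step does, it helps to run it on a few rows with known problems. The sketch below uses made-up rows in place of sales.csv; the data and the helper variable n_unparseable are assumptions.

```python
import pandas as pd

# Hypothetical raw rows: one duplicated order, unparseable and negative amounts.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07", "2024-01-08"],
    "amount": ["100.50", "n/a", "n/a", "-20", "75"],
})

raw["date"] = pd.to_datetime(raw["date"])

# errors="coerce" turns values that cannot be parsed into NaN instead of raising.
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
n_unparseable = int(raw["amount"].isna().sum())

# Keep the first row for each order_id.
clean = raw.drop_duplicates(subset=["order_id"])

# NaN > 0 is False, so this removes unparseable amounts as well as non-positive ones.
clean = clean[clean["amount"] > 0]

print(f"Unparseable amounts: {n_unparseable}")
print(clean)
```

Counting the coerced values before filtering is a cheap check that the cleaning step is not silently discarding a large share of the data.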
Merging DataFrames
orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')
merged = orders.merge(customers, on='customer_id', how='left')
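A left merge keeps every order, including those whose customer_id has no match in the customers table. The sketch below, using made-up rows, relies on two optional merge parameters to make that visible: indicator=True adds a _merge column, and validate="many_to_one" raises an error if customer_id is duplicated on the customers side.

```python
import pandas as pd

# Hypothetical tables standing in for orders.csv and customers.csv.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 99],
    "amount": [50.0, 20.0, 35.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11],
    "name": ["Ada", "Grace"],
})

merged = orders.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",  # each customer_id may appear only once in customers
    indicator=True,          # adds a _merge column: both / left_only / right_only
)

# Orders whose customer_id was not found in the customers table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Checking the row count and the unmatched rows after a merge catches two common problems: rows multiplied by duplicate keys and rows that silently gain NaN columns.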
Seaborn for Statistical Plots
import seaborn as sns
import matplotlib.pyplot as plt
sns.scatterplot(data=df, x='ad_spend', y='revenue', hue='region')
plt.title('Ad Spend vs Revenue by Region')
plt.savefig('scatter.png', dpi=150)
Train/Test Split Best Practices
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='r2')  # 'accuracy' applies only to classifiers
print(f"CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})")
Never evaluate a model on the data it was trained on; use a holdout set or cross-validation.
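The gap between training and holdout scores is easiest to see with a model that can memorize its inputs. The sketch below swaps in an unpruned DecisionTreeRegressor in place of the linear model used earlier and fits it to synthetic noisy data; both choices are assumptions made to show the effect clearly.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear signal plus substantial noise (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unpruned tree can fit every training point, noise included.
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)  # R^2 on data the model has seen
test_r2 = model.score(X_test, y_test)     # R^2 on held-out data

print(f"Train R^2: {train_r2:.3f}")
print(f"Test R^2:  {test_r2:.3f}")
```

The training score looks close to perfect because the tree has memorized the noise; only the held-out score says anything about how the model behaves on new data.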
Reproducible Notebooks
import numpy as np
np.random.seed(42)
Pin library versions in requirements.txt. After changing code, restart the kernel and rerun all cells from the top so results do not depend on stale state.
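np.random.seed sets NumPy's global random state. An alternative is to pass seeds explicitly, using np.random.default_rng for NumPy draws and the random_state parameter for scikit-learn functions. The sketch below checks that two runs with the same seed give identical splits; the helper draw_and_split is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def draw_and_split(seed):
    """Generate random data and split it, with all randomness tied to `seed`."""
    rng = np.random.default_rng(seed)  # local generator, no global state
    X = rng.normal(size=(20, 2))
    y = rng.normal(size=20)
    return train_test_split(X, y, test_size=0.25, random_state=seed)


first = draw_and_split(42)
second = draw_and_split(42)

# Each call returns (X_train, X_test, y_train, y_test); compare them pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(f"Identical splits: {same}")
```

Explicit seeds keep a result reproducible even when cells are run out of order, because no cell depends on how much of the global random stream an earlier cell consumed.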