Azure Monitor is the unified observability platform for Azure. It collects metrics and logs from resources, applications, and infrastructure, and uses them to drive alerts, dashboards, root-cause analysis, and automated remediation. Effective monitoring turns raw telemetry into actionable operational intelligence.

Data Platform Overview

Data Type     | Source                   | Storage                             | Query Tool / Language
Metrics       | Azure resources, custom  | Azure Monitor Metrics (time-series) | Metrics Explorer
Logs          | Resources, agents, apps  | Log Analytics workspace             | Kusto (KQL)
Traces        | Application Insights     | Log Analytics                       | KQL
Activity Logs | Control plane operations | Log Analytics / Storage             | KQL
Alerts        | Metrics, logs, activity  | Alert rules                         | -

All diagnostic data should flow into a Log Analytics workspace for centralized querying and correlation.

Log Analytics Workspace Setup

# Create workspace
az monitor log-analytics workspace create \
  --resource-group rg-webapp-prod \
  --workspace-name law-webapp-prod \
  --location eastus \
  --retention-time 90

# Enable diagnostic settings on a VM (send platform metrics to the workspace).
# VM diagnostic settings expose no log categories; guest logs such as Syslog
# are collected by the Azure Monitor Agent through a data collection rule.
az monitor diagnostic-settings create \
  --name vm-diagnostics \
  --resource /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Compute/virtualMachines/vm-web-01 \
  --workspace /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.OperationalInsights/workspaces/law-webapp-prod \
  --metrics '[{"category":"AllMetrics","enabled":true}]'
  

Application Insights

Application Insights provides APM for web apps, APIs, and functions — tracking requests, dependencies, exceptions, and custom events:

az monitor app-insights component create \
  --app ai-webapp-prod \
  --location eastus \
  --resource-group rg-webapp-prod \
  --application-type web \
  --kind web \
  --workspace law-webapp-prod
  

Enable in Node.js:

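// Load and start the SDK as early as possible so that auto-collection
// can instrument the modules loaded after it.
const appInsights = require('applicationinsights');
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoDependencyCorrelation(true)  // correlate requests with their dependencies
  .setAutoCollectRequests(true)        // incoming HTTP requests
  .setAutoCollectExceptions(true)      // unhandled exceptions
  .setAutoCollectDependencies(true)    // outgoing calls (HTTP, databases)
  .setSendLiveMetrics(true)            // Live Metrics stream
  .start();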
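
Auto-collection covers requests, dependencies, and exceptions, but custom events have to be sent explicitly. The sketch below shows one possible way to do that, assuming the `trackEvent` and `trackMetric` methods exposed by the SDK's `defaultClient`. The function name `recordCheckout`, the event and metric names, and the `order` fields are hypothetical. The client is passed in as a parameter so the snippet also runs with a stand-in object when the SDK is not installed.

```javascript
// Record one business event and one numeric measurement for a completed order.
// `client` is expected to expose trackEvent({name, properties}) and
// trackMetric({name, value}), as appInsights.defaultClient does.
function recordCheckout(client, order) {
  client.trackEvent({
    name: 'CheckoutCompleted', // hypothetical event name
    properties: { orderId: order.id, plan: order.plan },
  });
  client.trackMetric({
    name: 'CheckoutValue', // hypothetical metric name
    value: order.total,
  });
}

// Stand-in client that only records what it is sent (for local runs).
function createRecordingClient() {
  const sent = [];
  return {
    sent,
    trackEvent: (telemetry) => sent.push({ kind: 'event', ...telemetry }),
    trackMetric: (telemetry) => sent.push({ kind: 'metric', ...telemetry }),
  };
}

const demoClient = createRecordingClient();
recordCheckout(demoClient, { id: 'order-1', plan: 'pro', total: 49 });
console.log(demoClient.sent.length); // number of telemetry items recorded
```

In an application that has already called `.start()`, the same function would presumably be invoked as `recordCheckout(appInsights.defaultClient, order)`.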
  

Enable on App Service via CLI:

az webapp config appsettings set \
  --name my-webapp-prod \
  --resource-group rg-webapp-prod \
  --settings APPLICATIONINSIGHTS_CONNECTION_STRING="<connection-string>"
  

KQL Query Examples

Kusto Query Language powers log analysis across all Azure Monitor data:

// Failed requests in the last hour with error details
requests
| where timestamp > ago(1h)
| where success == false
| summarize count(), avg(duration) by resultCode, name, operation_Name
| order by count_ desc

// Average CPU across VMs (5-minute bins)
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart

// Top 10 slowest API endpoints (P95 latency)
requests
| where timestamp > ago(7d)
| summarize percentiles(duration, 50, 95, 99) by name
| top 10 by percentile_duration_95 desc

// Correlated exceptions with requests
exceptions
| where timestamp > ago(1h)
| join kind=inner (
    requests | where timestamp > ago(1h)
) on operation_Id
| project timestamp, name, outerMessage, url, resultCode
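
To make the P95 figure in the latency query concrete, the following sketch computes nearest-rank percentiles over a small in-memory sample. It illustrates what a percentile means and is not how Kusto evaluates `percentiles()`, which returns an estimate over the stored data. The function name `nearestRankPercentile` and the sample durations are made up for the example.

```javascript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the samples are less than or equal to it.
function nearestRankPercentile(values, p) {
  if (values.length === 0) throw new Error('Need at least one value');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical request durations in milliseconds, with one slow outlier.
const durationsMs = [120, 80, 300, 95, 2100, 110, 130, 90, 105, 100];

console.log(nearestRankPercentile(durationsMs, 50)); // 105
console.log(nearestRankPercentile(durationsMs, 95)); // 2100
```

The gap between the median and the P95 value is the reason the query ranks endpoints by P95: an average would hide the slow outlier.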
  

Alerts and Action Groups

# Create action group with an email receiver
# (repeat --action with type "webhook" to add a webhook receiver)
az monitor action-group create \
  --name ag-platform-oncall \
  --resource-group rg-webapp-prod \
  --short-name platform \
  --action email oncall <oncall-email-address>

# Metric alert: high CPU on VM
az monitor metrics alert create \
  --name alert-vm-high-cpu \
  --resource-group rg-webapp-prod \
  --scopes /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Compute/virtualMachines/vm-web-01 \
  --condition "avg Percentage CPU > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ag-platform-oncall \
  --severity 2 \
  --description "VM CPU exceeded 80% for 5 minutes"

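# Log query alert: error rate spike (requires the scheduled-query CLI extension).
# The scope is a Log Analytics workspace, so the query uses the workspace
# table AppRequests (columns Success, TimeGenerated) rather than requests.
# The alert fires when any 5-minute bin contains more than 50 failed requests.
az monitor scheduled-query create \
  --name alert-high-error-rate \
  --resource-group rg-webapp-prod \
  --scopes /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.OperationalInsights/workspaces/law-webapp-prod \
  --condition "count 'ErrorSpikes' > 0" \
  --condition-query ErrorSpikes="AppRequests | where Success == false | summarize FailedCount = count() by bin(TimeGenerated, 5m) | where FailedCount > 50" \
  --evaluation-frequency 5m \
  --window-size 15m \
  --action-groups /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Insights/actionGroups/ag-platform-oncall \
  --severity 1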
  

Action Groups

Channel         | Use Case
Email/SMS/Voice | On-call notifications
Webhook         | PagerDuty, Slack, Teams integration
Azure Function  | Custom auto-remediation (restart app, scale out)
Logic App       | Complex notification and ticketing workflows
ITSM            | ServiceNow, System Center integration
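
A webhook or Azure Function receiver has to interpret the alert payload before it can notify or remediate. The sketch below is one way to do that, assuming the action group is configured to send the common alert schema; it reads only a handful of fields from `data.essentials`. The function name `summarizeAlert`, the `shouldRemediate` flag, and the sample payload are illustrative, and a real payload carries many more fields.

```javascript
// Reduce an Azure Monitor alert payload (common alert schema) to the fields
// an on-call notification or remediation function typically needs.
function summarizeAlert(payload) {
  if (!payload || payload.schemaId !== 'azureMonitorCommonAlertSchema') {
    throw new Error('Unsupported payload: expected the common alert schema');
  }
  const essentials = payload.data.essentials;
  return {
    rule: essentials.alertRule,
    severity: essentials.severity,        // e.g. "Sev2"
    state: essentials.monitorCondition,   // "Fired" or "Resolved"
    targets: essentials.alertTargetIDs || [],
    // Hypothetical policy: act only when the alert fires, not when it resolves.
    shouldRemediate: essentials.monitorCondition === 'Fired',
  };
}

// Hand-written sample payload containing only the fields used above.
const samplePayload = {
  schemaId: 'azureMonitorCommonAlertSchema',
  data: {
    essentials: {
      alertRule: 'alert-vm-high-cpu',
      severity: 'Sev2',
      monitorCondition: 'Fired',
      alertTargetIDs: [
        '/subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Compute/virtualMachines/vm-web-01',
      ],
    },
  },
};

console.log(summarizeAlert(samplePayload));
```

Rejecting unknown schemas up front keeps a remediation function from acting on a payload it does not understand.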

Dashboards and Workbooks

  • Dashboards: Pin charts and metrics for at-a-glance monitoring — share across teams
  • Workbooks: Interactive reports combining metrics, logs, and parameters — ideal for incident triage
  • VM insights (formerly Azure Monitor for VMs): Infrastructure health, performance counters, dependency maps
  • Container Insights: AKS pod metrics, node health, controller logs

Real-World Scenario: Production SaaS Monitoring

Layer          | Monitoring
Application    | App Insights: request rate, P95 latency, failure rate SLI
Infrastructure | VM/AKS metrics: CPU, memory, disk IOPS
Database       | Azure SQL DMVs via diagnostic logs: DTU%, deadlocks
Network        | NSG flow logs, Front Door health probes
Alerts         | P1: error rate > 5%; P2: P95 > 2s; P3: disk > 85%
Dashboard      | Single pane: availability, latency, error budget burn rate
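
The alert tiers in the scenario can be read as a small decision rule. The sketch below encodes the three thresholds from the Alerts row and returns the highest-priority tier that a set of measurements triggers. The function name `classifyAlert` and the shape of its input object are assumptions made for the example; in practice each tier would be a separate alert rule in Azure Monitor.

```javascript
// Thresholds copied from the Alerts row of the scenario.
const SCENARIO_THRESHOLDS = { errorRate: 0.05, p95LatencyMs: 2000, diskUsage: 0.85 };

// Return the highest-priority tier that the measurements trigger,
// or null when every value is within its threshold.
function classifyAlert({ errorRate, p95LatencyMs, diskUsage }) {
  if (errorRate > SCENARIO_THRESHOLDS.errorRate) return 'P1';
  if (p95LatencyMs > SCENARIO_THRESHOLDS.p95LatencyMs) return 'P2';
  if (diskUsage > SCENARIO_THRESHOLDS.diskUsage) return 'P3';
  return null;
}

console.log(classifyAlert({ errorRate: 0.08, p95LatencyMs: 900, diskUsage: 0.4 }));  // P1
console.log(classifyAlert({ errorRate: 0.01, p95LatencyMs: 2500, diskUsage: 0.9 })); // P2
console.log(classifyAlert({ errorRate: 0.01, p95LatencyMs: 900, diskUsage: 0.4 }));  // null
```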

Monitoring Tools Comparison

Tool                 | Scope                 | Best For
Azure Monitor        | Full platform         | Metrics, logs, alerts
Application Insights | Application APM       | Request tracing, dependencies
Log Analytics        | Log storage + KQL     | Cross-resource correlation
Azure Monitor Agent  | VM/on-prem collection | Replacing the legacy Log Analytics agent
Defender for Cloud   | Security posture      | Vulnerability and threat detection

Common Mistakes

  1. No diagnostic settings enabled — resources emit metrics but not detailed logs
  2. Alert fatigue — too many low-severity alerts; teams ignore all of them
  3. Missing correlation — App Insights not linked to Log Analytics workspace
  4. Short retention — 30-day default may be insufficient for trend analysis
  5. No baseline before alerting — thresholds set arbitrarily without historical data
  6. Ignoring Activity Log — security incidents missed without control-plane monitoring
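
For the baseline problem in item 5, one simple approach is to derive a threshold from historical samples instead of picking a number. The sketch below uses the mean plus `k` standard deviations, which is a common heuristic and only one of several options. The function name `baselineThreshold`, the default `k`, and the sample values are assumptions for illustration.

```javascript
// Suggest an alert threshold as mean + k population standard deviations
// of previously observed samples.
function baselineThreshold(samples, k = 3) {
  if (samples.length === 0) throw new Error('Need at least one sample');
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / samples.length;
  return mean + k * Math.sqrt(variance);
}

// Hypothetical hourly CPU averages (percent) exported from past metrics.
const cpuHistory = [20, 40, 40, 40, 50, 50, 70, 90];

console.log(baselineThreshold(cpuHistory, 2)); // 90 (mean 50, standard deviation 20)
```

A larger `k` trades sensitivity for fewer false alarms, so it is worth checking the result against a few known incidents before adopting it.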

Troubleshooting

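Issue                       | Diagnosis                                      | Fix
No data in workspace        | Diagnostic settings not configured             | Enable diagnostics on each resource
App Insights missing traces | SDK not initialized or wrong connection string | Verify APPLICATIONINSIGHTS_CONNECTION_STRING
Alert not firing            | Wrong scope or threshold                       | Test KQL query manually; check evaluation frequency
High ingestion costs        | Verbose logging, no filtering                  | Use transformation rules; filter at source
KQL query timeout           | Time range too broad or no early filtering     | Add where timestamp > ago(24h); use summarize early

# Check workspace ingestion volume (MB per data type, last 24 hours).
# Requires the log-analytics CLI extension; --workspace takes the workspace GUID.
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-webapp-prod \
  --workspace-name law-webapp-prod \
  --query customerId -o tsv)
az monitor log-analytics query \
  --workspace "$WORKSPACE_ID" \
  --analytics-query "Usage | where TimeGenerated > ago(24h) | summarize IngestedMB = sum(Quantity) by DataType | order by IngestedMB desc" \
  -o table

# List active alert rules
az monitor metrics alert list --resource-group rg-webapp-prod -o table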
  

Best Practices

  1. Define SLIs (latency, error rate, availability) and SLOs for each service
  2. Collect metrics at resource level and traces at application level
  3. Set alerts on actionable thresholds — every alert should require human or automated action
  4. Route all logs to a central Log Analytics workspace with RBAC
  5. Use workbooks for incident response runbooks with pre-built KQL queries
  6. Enable diagnostic settings on every production resource at deployment time
  7. Review dashboards weekly and tune thresholds based on baselines
  8. Implement alert enrichment — include resource links and runbook URLs in notifications
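
Once an SLO exists, the error budget burn rate shown on the dashboard follows from simple arithmetic. The sketch below assumes the usual definition: the observed error rate divided by the error rate the SLO allows. The function name `burnRate` and the request counts are hypothetical.

```javascript
// Burn rate = observed error rate / error rate allowed by the SLO.
// A value of 1 spends the error budget exactly over the SLO period;
// higher values exhaust it proportionally sooner.
function burnRate(failedRequests, totalRequests, sloTarget) {
  const allowedErrorRate = 1 - sloTarget;
  if (allowedErrorRate <= 0) throw new Error('SLO target must be below 1');
  if (totalRequests === 0) return 0; // no traffic, no budget spent
  return (failedRequests / totalRequests) / allowedErrorRate;
}

// 50 failures in 10,000 requests against a 99.9% availability SLO.
console.log(burnRate(50, 10000, 0.999).toFixed(1)); // "5.0"
```

A burn rate of 5 means the budget would be gone in a fifth of the SLO period, which is the kind of signal that justifies paging rather than a ticket.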
