Azure Monitor
Azure Monitor is the unified observability platform for Azure. It collects metrics and logs from resources, applications, and infrastructure — enabling alerts, dashboards, root-cause analysis, and automated remediation. Effective monitoring turns raw telemetry into actionable operational intelligence.
Data Platform Overview
| Data Type | Source | Storage | Query Language |
|---|---|---|---|
| Metrics | Azure resources, custom | Azure Monitor Metrics (time-series) | None (charted in Metrics Explorer) |
| Logs | Resources, agents, apps | Log Analytics workspace | Kusto (KQL) |
| Traces | Application Insights | Log Analytics | KQL |
| Activity Logs | Control plane operations | Log Analytics / Storage | KQL |
| Alerts | Metrics, logs, activity | Alert rules | — |
All diagnostic data should flow into a Log Analytics workspace for centralized querying and correlation.
Log Analytics Workspace Setup
# Create workspace
az monitor log-analytics workspace create \
--resource-group rg-webapp-prod \
--workspace-name law-webapp-prod \
--location eastus \
--retention-time 90
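# Enable diagnostic settings on a VM (send platform metrics to the workspace).
# Log categories differ per resource type; list the valid ones with
# `az monitor diagnostic-settings categories list --resource <resource-id>`.
# Guest OS logs such as Syslog are collected by the Azure Monitor Agent through a data collection rule, not by diagnostic settings.
az monitor diagnostic-settings create \
--name vm-diagnostics \
--resource /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Compute/virtualMachines/vm-web-01 \
--workspace /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.OperationalInsights/workspaces/law-webapp-prod \
--metrics '[{"category":"AllMetrics","enabled":true}]'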
Application Insights
Application Insights provides APM for web apps, APIs, and functions — tracking requests, dependencies, exceptions, and custom events:
az monitor app-insights component create \
--app ai-webapp-prod \
--location eastus \
--resource-group rg-webapp-prod \
--application-type web \
--kind web \
--workspace law-webapp-prod
Enable in Node.js:
const appInsights = require('applicationinsights');
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
.setAutoDependencyCorrelation(true)
.setAutoCollectRequests(true)
.setAutoCollectExceptions(true)
.setAutoCollectDependencies(true)
.setSendLiveMetrics(true)
.start();
Enable on App Service via CLI:
az webapp config appsettings set \
--name my-webapp-prod \
--resource-group rg-webapp-prod \
--settings APPLICATIONINSIGHTS_CONNECTION_STRING="<connection-string>"
KQL Query Examples
Kusto Query Language powers log analysis across all Azure Monitor data:
// Failed requests in the last hour with error details
requests
| where timestamp > ago(1h)
| where success == false
| summarize count(), avg(duration) by resultCode, name, operation_Name
| order by count_ desc
// Average CPU across VMs (5-minute bins)
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart
// Top 10 slowest API endpoints (P95 latency)
requests
| where timestamp > ago(7d)
| summarize percentiles(duration, 50, 95, 99) by name
| top 10 by percentile_duration_95 desc
// Correlated exceptions with requests
exceptions
| where timestamp > ago(1h)
| join kind=inner (
requests | where timestamp > ago(1h)
) on operation_Id
| project timestamp, name, outerMessage, url, resultCode
Alerts and Action Groups
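# Create action group with an email receiver ("<oncall-email-address>" is a placeholder).
# Add a webhook receiver with a second: --action webhook NAME URI
az monitor action-group create \
--name ag-platform-oncall \
--resource-group rg-webapp-prod \
--short-name platform \
--action email oncall "<oncall-email-address>"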
# Metric alert: high CPU on VM
az monitor metrics alert create \
--name alert-vm-high-cpu \
--resource-group rg-webapp-prod \
--scopes /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.Compute/virtualMachines/vm-web-01 \
--condition "avg Percentage CPU > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action ag-platform-oncall \
--severity 2 \
--description "VM CPU exceeded 80% for 5 minutes"
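# Log query alert: error rate spike.
# The rule is scoped to the workspace, so the query uses the workspace table AppRequests (TimeGenerated, Success).
# FailedBins is a placeholder name that binds the query to the condition; the alert fires when any 5-minute bin exceeds 50 failures.
az monitor scheduled-query create \
--name alert-high-error-rate \
--resource-group rg-webapp-prod \
--scopes /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/Microsoft.OperationalInsights/workspaces/law-webapp-prod \
--condition "count 'FailedBins' > 0" \
--condition-query FailedBins="AppRequests | where Success == false | summarize count() by bin(TimeGenerated, 5m) | where count_ > 50" \
--evaluation-frequency 5m \
--window-size 15m \
--action-groups /subscriptions/SUB_ID/resourceGroups/rg-webapp-prod/providers/microsoft.insights/actionGroups/ag-platform-oncall \
--severity 1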
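The log query alert above counts failed requests per 5-minute bin and keeps only bins with more than 50 failures; the rule fires when at least one such bin is returned. The sketch below mimics that logic on an in-memory array so the thresholds can be reasoned about offline. The record shape (`timestamp` in epoch milliseconds, `success` as a boolean) and both function names are assumptions made for illustration, not part of any Azure SDK.

```javascript
// Sketch only: reproduces "count failures per 5-minute bin, keep bins over 50"
// on plain objects. Record shape and function names are hypothetical.
const BIN_MS = 5 * 60 * 1000;

function failedBinsOverThreshold(requests, threshold = 50, binMs = BIN_MS) {
  const counts = new Map();
  for (const r of requests) {
    if (r.success) continue; // keep failed requests only
    const binStart = Math.floor(r.timestamp / binMs) * binMs; // bin(..., 5m)
    counts.set(binStart, (counts.get(binStart) || 0) + 1); // summarize count()
  }
  return [...counts.entries()]
    .filter(([, count]) => count > threshold) // where count_ > 50
    .map(([binStart, count]) => ({ binStart, count }))
    .sort((a, b) => a.binStart - b.binStart);
}

// The alert condition: fire when the query returns at least one row.
function shouldFire(requests) {
  return failedBinsOverThreshold(requests).length > 0;
}
```

Replaying a sample of exported request records through `shouldFire` is one way to check whether a threshold of 50 failures per bin would have been noisy or silent during a known incident.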
Action Groups
| Channel | Use Case |
|---|---|
| Email/SMS/Voice | On-call notifications |
| Webhook | PagerDuty, Slack, Teams integration |
| Azure Function | Custom auto-remediation (restart app, scale out) |
| Logic App | Complex notification and ticketing workflows |
| ITSM | ServiceNow, System Center integration |
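For the Azure Function channel, the function typically receives the alert as a JSON payload and has to decide what to do with it. The sketch below assumes the action group has the common alert schema enabled, so the payload carries `data.essentials` with fields such as `alertRule`, `monitorCondition`, `severity`, and `alertTargetIDs`. The routing table and the action names (`scale-out`, `restart-app`) are hypothetical; the code only produces a plan and leaves the actual restart or scale call to the caller.

```javascript
// Sketch only: map a common-alert-schema payload to a remediation plan.
// REMEDIATIONS and the action names are hypothetical, chosen for illustration.
const REMEDIATIONS = {
  'alert-vm-high-cpu': 'scale-out',
  'alert-high-error-rate': 'restart-app',
};

function planRemediation(payload, remediations = REMEDIATIONS) {
  const essentials = payload && payload.data && payload.data.essentials;
  if (!essentials) {
    return { action: 'ignore', reason: 'payload has no data.essentials' };
  }
  // Resolved notifications should not trigger another remediation.
  if (essentials.monitorCondition !== 'Fired') {
    return { action: 'ignore', reason: `condition is ${essentials.monitorCondition}` };
  }
  const targets = essentials.alertTargetIDs || [];
  const action = remediations[essentials.alertRule];
  if (!action) {
    // Unknown rule: leave the decision to a human.
    return { action: 'notify-only', severity: essentials.severity, targets };
  }
  return { action, severity: essentials.severity, targets };
}
```

Keeping the decision logic in a pure function like this makes it testable without deploying the function or firing a real alert.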
Dashboards and Workbooks
- Dashboards: Pin charts and metrics for at-a-glance monitoring — share across teams
- Workbooks: Interactive reports combining metrics, logs, and parameters — ideal for incident triage
- VM insights (formerly Azure Monitor for VMs): Infrastructure health, performance counters, dependency maps
- Container Insights: AKS pod metrics, node health, controller logs
Real-World Scenario: Production SaaS Monitoring
| Layer | Monitoring |
|---|---|
| Application | App Insights — request rate, P95 latency, failure rate SLI |
| Infrastructure | VM/AKS metrics — CPU, memory, disk IOPS |
| Database | Azure SQL DMVs via diagnostic logs — DTU%, deadlocks |
| Network | NSG flow logs, Front Door health probes |
| Alerts | P1: error rate > 5%; P2: P95 > 2s; P3: disk > 85% |
| Dashboard | Single pane: availability, latency, error budget burn rate |
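The failure rate SLI and the error budget burn rate in the table can be made concrete with a little arithmetic. The sketch below assumes a request-based availability SLI (successful requests divided by total requests) and defines burn rate as the observed failure fraction divided by the failure fraction the SLO allows. The function names and the 99.9% SLO target in the usage note are illustrative assumptions, not values from the scenario.

```javascript
// Sketch only: request-based availability SLI and error budget burn rate.
// Function names are hypothetical.
function availabilitySli(totalRequests, failedRequests) {
  if (totalRequests === 0) return 1; // no traffic: treat as fully available
  return (totalRequests - failedRequests) / totalRequests;
}

function burnRate(sli, sloTarget) {
  const errorBudget = 1 - sloTarget; // allowed failure fraction
  if (errorBudget <= 0) {
    throw new RangeError('SLO target must be below 1');
  }
  // 1 means the budget is consumed exactly at the sustainable pace;
  // values above 1 mean the budget runs out before the SLO window ends.
  return (1 - sli) / errorBudget;
}
```

For example, 50 failures out of 10,000 requests gives an SLI of 0.995; against an assumed 99.9% SLO that is a burn rate of roughly 5, meaning the error budget is being spent about five times faster than the SLO can sustain.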
Monitoring Tools Comparison
| Tool | Scope | Best For |
|---|---|---|
| Azure Monitor | Full platform | Metrics, logs, alerts |
| Application Insights | Application APM | Request tracing, dependencies |
| Log Analytics | Log storage + KQL | Cross-resource correlation |
| Azure Monitor Agent | VM/on-prem collection | Replacing the legacy Log Analytics agent |
| Defender for Cloud | Security posture | Vulnerability and threat detection |
Common Mistakes
- No diagnostic settings enabled — resources emit metrics but not detailed logs
- Alert fatigue — too many low-severity alerts; teams ignore all of them
- Missing correlation — App Insights not linked to Log Analytics workspace
- Short retention — 30-day default may be insufficient for trend analysis
- No baseline before alerting — thresholds set arbitrarily without historical data
- Ignoring Activity Log — security incidents missed without control-plane monitoring
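To avoid the arbitrary-threshold mistake listed above, a threshold can be derived from historical samples instead of picked by feel. The sketch below uses one simple convention, mean plus `k` sample standard deviations; this is an assumption chosen for illustration, and a percentile-based baseline would be an equally reasonable choice. The function name and the sample values in the usage note are hypothetical.

```javascript
// Sketch only: derive an alert threshold from historical metric samples
// as mean + k * sample standard deviation. Names are hypothetical.
function baselineThreshold(samples, k = 3) {
  if (samples.length < 2) {
    throw new Error('need at least two samples to estimate spread');
  }
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (samples.length - 1);
  const stddev = Math.sqrt(variance);
  return { mean, stddev, threshold: mean + k * stddev };
}
```

With CPU samples of 40, 42, 38, 41, and 39 percent, the mean is 40 and the threshold lands a little under 45, which is far tighter than a generic 80% rule and may be too sensitive; the point is to start from observed behavior and then tune.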
Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| No data in workspace | Diagnostic settings not configured | Enable diagnostics on each resource |
| App Insights missing traces | SDK not initialized or wrong connection string | Verify APPLICATIONINSIGHTS_CONNECTION_STRING |
| Alert not firing | Wrong scope or threshold | Test KQL query manually; check evaluation frequency |
| High ingestion costs | Verbose logging, no filtering | Use transformation rules; filter at source |
| KQL query timeout | Time range too broad or filters applied late | Filter on time first, e.g. where timestamp > ago(24h); filter and project before summarize or join |
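# Check workspace ingestion volume per table for the last day (Usage.Quantity is reported in MB).
# The query command expects the workspace GUID (customerId), not the workspace name.
az monitor log-analytics query \
--workspace "$(az monitor log-analytics workspace show --resource-group rg-webapp-prod --workspace-name law-webapp-prod --query customerId -o tsv)" \
--analytics-query "Usage | where TimeGenerated > ago(1d) | summarize IngestedMB = sum(Quantity) by DataType | order by IngestedMB desc" \
-o table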
# List active alert rules
az monitor metrics alert list --resource-group rg-webapp-prod -o table
Best Practices
- Define SLIs (latency, error rate, availability) and SLOs for each service
- Collect metrics at resource level and traces at application level
- Set alerts on actionable thresholds — every alert should require human or automated action
- Route all logs to a central Log Analytics workspace with RBAC
- Use workbooks for incident response runbooks with pre-built KQL queries
- Enable diagnostic settings on every production resource at deployment time
- Review dashboards weekly and tune thresholds based on baselines
- Implement alert enrichment — include resource links and runbook URLs in notifications
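Alert enrichment, the last practice above, can be as small as a function that decorates an alert with links before it is sent to a chat or paging tool. The sketch below is one possible shape: the input fields (`rule`, `severity`, `description`, `resourceIds`), the runbook map, and the `runbooks.example.com` URL are all hypothetical, and the portal deep-link format `https://portal.azure.com/#resource<resource-id>` is an assumption that should be checked against your tenant.

```javascript
// Sketch only: attach portal links and a runbook URL to an alert notification.
// RUNBOOKS, the input shape, and the portal link format are assumptions.
const RUNBOOKS = {
  'alert-vm-high-cpu': 'https://runbooks.example.com/vm-high-cpu',
};

function enrichAlert(alert, runbooks = RUNBOOKS) {
  return {
    title: `[Sev${alert.severity}] ${alert.rule}`,
    description: alert.description,
    // One portal link per affected resource ID.
    resourceLinks: alert.resourceIds.map(
      (id) => `https://portal.azure.com/#resource${id}`
    ),
    // null signals that no runbook is registered for this rule yet.
    runbookUrl: runbooks[alert.rule] || null,
  };
}
```

A `null` runbook URL is useful on its own: it flags alert rules that were created without a documented response, which ties back to the rule that every alert should be actionable.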