Monitoring Setup
tenement exports Prometheus metrics at /metrics. This guide covers setting up monitoring and alerting.
Quick Start
View Raw Metrics

```shell
curl http://localhost:8080/metrics
```

Key Metrics

| Metric | Type | Description |
|---|---|---|
| `instance_count` | gauge | Total running instances |
| `instance_status{process,instance}` | gauge | Instance health (1=healthy) |
| `instance_uptime_seconds{process,instance}` | gauge | Instance uptime |
| `instance_restarts{process,instance}` | counter | Restart count |
| `instance_memory_bytes{process,instance}` | gauge | Memory usage |
| `instance_storage_bytes{process,instance}` | gauge | Disk usage |
| `instance_storage_quota_bytes{process,instance}` | gauge | Storage limit |
| `http_requests_total{method,path,status}` | counter | Request count |
| `http_request_duration_seconds{method,path}` | histogram | Request latency |
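
To pull a single value out of the `/metrics` output for scripting, a small shell helper can filter the exposition text. This is a minimal sketch: `metric_value` is a hypothetical helper name (not part of tenement), and it assumes the standard Prometheus text format with one sample per line and no spaces inside label values.

```shell
# metric_value NAME: read Prometheus text-format metrics on stdin and print
# the value of every sample whose metric name is exactly NAME.
# Hypothetical helper, not part of tenement.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)         # drop the {label="..."} part, if any
      if (metric == name) print $2     # second field is the sample value
    }
  '
}

# Usage (assumes the server is listening on localhost:8080 as in this guide):
#   curl -s http://localhost:8080/metrics | metric_value instance_count
```

For labelled metrics such as `instance_status`, the helper prints one value per instance, in the order they appear in the output.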
Prometheus Setup
Installation

```shell
# Ubuntu/Debian
apt install prometheus

# Or Docker
docker run -d -p 9090:9090 -v /etc/prometheus:/etc/prometheus prom/prometheus
```

Configuration

Add to /etc/prometheus/prometheus.yml:

```yaml
scrape_configs:
  - job_name: 'tenement'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 15s
```

Verify

```shell
# Restart Prometheus
systemctl restart prometheus

# Check targets
curl http://localhost:9090/api/v1/targets
```

Grafana Setup
Installation

```shell
# Ubuntu/Debian
apt install grafana

# Or Docker
docker run -d -p 3000:3000 grafana/grafana
```

Add Prometheus Data Source

- Open Grafana (http://localhost:3000)
- Configuration → Data Sources → Add
- Select Prometheus
- URL: http://localhost:9090
- Save & Test
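
The same data source can likely be created without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below is an assumption-laden example, not a tenement feature: `grafana_datasource_payload` is a hypothetical helper, the data source name `Prometheus` is arbitrary, and the usage line assumes a fresh Grafana install that still has the default `admin:admin` login.

```shell
# grafana_datasource_payload URL: print the JSON body for Grafana's
# POST /api/datasources call, pointing a Prometheus data source at URL.
# Hypothetical helper; the name "Prometheus" is an arbitrary choice.
grafana_datasource_payload() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s"}\n' "$1"
}

# Usage (assumes the default admin:admin credentials are still in place):
#   grafana_datasource_payload http://localhost:9090 |
#     curl -s -u admin:admin -H 'Content-Type: application/json' \
#       -X POST --data @- http://localhost:3000/api/datasources
```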
Import Dashboard
Create a dashboard with these panels:

Instance Count

```promql
instance_count
```

Instance Health

```promql
instance_status
```

Memory Usage by Instance

```promql
instance_memory_bytes
```

Storage Usage

```promql
instance_storage_bytes / instance_storage_quota_bytes * 100
```

Request Rate

```promql
rate(http_requests_total[5m])
```

Request Latency (p99)

```promql
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```

Restart Rate

```promql
increase(instance_restarts[1h])
```

Alerting
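
An alert expression can be tried against Prometheus before it goes into a rule file. The sketch below wraps the Prometheus HTTP API instant-query endpoint (`/api/v1/query`); `prom_query` and the `PROM_URL` variable are hypothetical names, and the default URL assumes Prometheus is listening on localhost:9090 as set up in this guide.

```shell
# Base URL of the Prometheus server (assumed default from this guide).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# prom_query EXPR: run an instant query and print the raw JSON response.
# -G makes curl send the URL-encoded expression as a GET query string.
prom_query() {
  curl -sfG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage: an empty "result" array means nothing matches the condition right now.
#   prom_query 'instance_status == 0'
```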
Prometheus Alerting Rules

Create /etc/prometheus/alerts/tenement.yml:

```yaml
groups:
  - name: tenement
    rules:
      # Instance down
      - alert: InstanceDown
        expr: instance_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.process }}:{{ $labels.instance }} is down"
          description: "Instance has been unhealthy for more than 1 minute"

      # High restart rate
      - alert: HighRestartRate
        expr: increase(instance_restarts[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.process }}:{{ $labels.instance }} restarting frequently"
          description: "Instance has restarted {{ $value }} times in the last hour"

      # Storage near limit
      - alert: StorageNearLimit
        expr: instance_storage_bytes / instance_storage_quota_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.process }}:{{ $labels.instance }} storage > 90%"
          description: "Storage at {{ $value | humanizePercentage }}"

      # No instances running
      - alert: NoInstancesRunning
        expr: instance_count == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No tenement instances running"
          description: "All instances have stopped"

      # High latency
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "p99 latency is {{ $value }}s"
```

Enable Alerts

Add to prometheus.yml:

```yaml
rule_files:
  - /etc/prometheus/alerts/*.yml
```

Alertmanager (Optional)
For Slack/PagerDuty/email alerts:

```yaml
route:
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'
        channel: '#alerts'
```

Quick Health Checks
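
For an ad hoc look at storage headroom without Prometheus, the two storage gauges from `/metrics` can be joined per instance. This is a minimal sketch: `storage_usage` is a hypothetical helper (not part of tenement), and it assumes both gauges carry identical label sets and that label values contain no spaces.

```shell
# storage_usage: read Prometheus text-format metrics on stdin and print
# "<labels> <percent used>" for each instance that reports a non-zero quota.
# Hypothetical helper, not part of tenement.
storage_usage() {
  awk '
    # Key each sample by its label set, e.g. {process="api",instance="prod"}
    /^instance_storage_bytes[{]/       { key = $1; sub(/^[^{]*/, "", key); used[key] = $2 }
    /^instance_storage_quota_bytes[{]/ { key = $1; sub(/^[^{]*/, "", key); quota[key] = $2 }
    END {
      for (key in used)
        if (quota[key] > 0)
          printf "%s %.1f\n", key, 100 * used[key] / quota[key]
    }
  '
}

# Usage (assumes the server is listening on localhost:8080 as in this guide):
#   curl -s http://localhost:8080/metrics | storage_usage
```

A value above 90 corresponds to the 0.9 ratio used by the `StorageNearLimit` alerting rule.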
CLI Check

```shell
# Instance health
ten ps

# Specific instance
ten health api:prod
```

HTTP Check

```shell
# Server health
curl http://localhost:8080/health

# All instances via API
curl -H "Authorization: Bearer $TOKEN" http://localhost:8080/api/instances
```

Monitoring Script

```shell
#!/bin/bash
# Check server is up
if ! curl -sf http://localhost:8080/health > /dev/null; then
  echo "CRITICAL: tenement server down"
  exit 2
fi

# Check instance count (treat a missing metric as 0 so the test cannot error out)
COUNT=$(curl -s http://localhost:8080/metrics | grep "^instance_count" | awk '{print $2}')
COUNT=${COUNT:-0}
if [ "$COUNT" -eq 0 ]; then
  echo "WARNING: no instances running"
  exit 1
fi

echo "OK: $COUNT instances running"
exit 0
```

Log Aggregation
tenement doesn’t persist logs. Ship them to an external service:

Using Vector

```toml
[sources.tenement_api]
type = "http_client"
endpoint = "http://localhost:8080/api/logs/stream"
headers.Authorization = "Bearer ${TENEMENT_TOKEN}"

[sinks.loki]
type = "loki"
inputs = ["tenement_api"]
endpoint = "http://loki:3100"
```

Using Promtail

```yaml
scrape_configs:
  - job_name: tenement
    static_configs:
      - targets:
          - localhost
        labels:
          job: tenement
          __path__: /var/log/tenement/*.log
```

Next Steps
- Upgrading - Zero-downtime upgrades
- Backup and Restore - Data preservation
- Troubleshooting - Debugging issues