Observability and Monitoring
Purpose
This page documents the observability and monitoring approach for Maqsafy services.
The purpose is to detect failures early, reduce incident response time, support troubleshooting, and provide operational visibility across applications, infrastructure, queues, databases, and integrations.
Observability Scope
Observability should cover:
- Application health
- API availability
- Error rates
- Response times
- Queue status
- Redis health
- Database performance
- Nginx errors
- Payment and SMS integration failures
- Backup job status
- Security and audit events
Observability Signals
| Signal | Purpose |
|---|---|
| Logs | Record application, infrastructure, and security events |
| Metrics | Track numerical system behavior over time |
| Traces | Track request flow across services |
| Alerts | Notify the responsible team when thresholds are breached |
| Health checks | Confirm whether services are reachable and functioning |
OpenTelemetry is a vendor-neutral framework for generating, collecting, and exporting telemetry such as traces, metrics, and logs. (opentelemetry.io)
Health Checks
Purpose
Health checks confirm that key services are running and reachable.
Laravel includes a built-in health check route (served at /up by default in Laravel 11 and later) that uptime monitors, load balancers, or orchestration systems can call in production. (laravel.com)
Required Health Checks
| Service | Check | Expected Result |
|---|---|---|
| Web application | HTTP health endpoint | 200 OK |
| Backend API | API health endpoint | 200 OK |
| Database | Connection check | Successful connection |
| Redis | Ping or queue test | Successful response |
| Queue worker | Job processing check | Test job processed |
| Nginx | Config and status check | Valid configuration |
| Object storage | Upload/download test | Successful operation |
| Payment gateway | Test/status endpoint where available | Successful response |
| SMS provider | Provider status or controlled test | Successful response |
Example Commands
curl -I https://example.com
curl https://example.com/health
sudo nginx -t
php artisan queue:failed
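The HTTP checks above can be wrapped in a small probe for cron or an uptime script. The following is a minimal sketch, not the confirmed Maqsafy setup: the `HEALTH_URL` default and the function names `probe_status` and `classify_status` are illustrative assumptions, and the real endpoint still needs technical verification.

```shell
#!/bin/sh
# Hypothetical default; replace with the verified health endpoint.
HEALTH_URL="${HEALTH_URL:-https://example.com/health}"

# Print only the HTTP status code; curl prints 000 if the request fails.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Map a status code to a verdict. Only 200 counts as healthy,
# matching the "Expected Result" column in the table above.
classify_status() {
  case "$1" in
    200) echo "healthy" ;;
    *)   echo "unhealthy" ;;
  esac
}

# Usage: classify_status "$(probe_status "$HEALTH_URL")"
```

Separating the network call from the verdict keeps the decision logic easy to test without a live endpoint.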
Logging
Application Logs
Laravel application logs are commonly stored under:
storage/logs/laravel.log
Useful commands:
tail -f storage/logs/laravel.log
grep -i "error" storage/logs/laravel.log
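A count of severe entries is often more useful than a raw grep, because a case-insensitive search for "error" also matches the word inside ordinary messages. The sketch below assumes Laravel's default line format (`[timestamp] environment.LEVEL: message`); the function name `count_severe` is illustrative.

```shell
# Count log entries at ERROR level or above, reading log lines from stdin.
# Assumes the default Laravel line format, for example:
#   [2024-05-01 10:00:00] production.ERROR: message
# Field 1 starts with "[" and field 3 is "<environment>.<LEVEL>:".
count_severe() {
  awk '$1 ~ /^\[/ && $3 ~ /\.(ERROR|CRITICAL|ALERT|EMERGENCY):$/ { n++ }
       END { print n + 0 }'
}

# Usage: count_severe < storage/logs/laravel.log
```

Stack trace lines and INFO entries that merely mention "error" are ignored, since only the level field is inspected.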
Infrastructure Logs
| Component | Log Source |
|---|---|
| Nginx access log | /var/log/nginx/access.log |
| Nginx error log | /var/log/nginx/error.log |
| Docker container logs | docker logs <container-name> |
| Queue worker logs | Supervisor / Docker / application logs |
| Redis logs | Redis container or service logs |
| Database logs | MySQL slow query log / database monitoring tool |
Logging Rules
- Do not log passwords.
- Do not log access tokens.
- Do not log OTP values.
- Do not log private keys.
- Do not log payment secrets.
- Do not log unnecessary student personal data.
- Use reference IDs for tracing sensitive workflows.
- Mask sensitive identifiers where possible.
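One way to back up the rules above is to redact known sensitive fields before a line is written or shipped. The sketch below is illustrative only: the field names (`password`, `access_token`, `otp`, `secret`) and the `key=value` layout are assumptions, and redaction inside the application's logging layer remains preferable to post-processing.

```shell
# Replace the values of known sensitive key=value pairs with a marker.
# Field names are assumptions; extend the list to match real log fields.
# A value ends at the first space, ampersand, or double quote.
redact() {
  sed -E 's/(password|access_token|otp|secret)=[^ &"]+/\1=[REDACTED]/g'
}

# Usage: some_command 2>&1 | redact >> app.log
```

Reference IDs and other non-sensitive fields pass through unchanged, so the line stays useful for tracing.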
Metrics
Required Metrics
| Area | Metric | Purpose |
|---|---|---|
| API | Request count | Measure traffic |
| API | Error rate | Detect failures |
| API | Response time | Detect latency |
| Authentication | Failed login attempts | Detect brute-force behavior |
| Queue | Pending jobs | Detect backlog |
| Queue | Failed jobs | Detect processing issues |
| Queue | Job duration | Detect slow jobs |
| Redis | Memory usage | Detect capacity issues |
| Redis | Connection failures | Detect cache/queue issues |
| Database | CPU / I/O usage | Detect load |
| Database | Slow queries | Detect performance issues |
| Nginx | 4xx / 5xx responses | Detect routing or application errors |
| Storage | Upload/download failures | Detect file service issues |
| Payment | Pending / failed payments | Detect payment issues |
| SMS | Delivery failures | Detect notification issues |
| Backups | Backup success/failure | Detect recovery risk |
Prometheus is an open-source monitoring and alerting toolkit with a time-series data model and the PromQL query language for metrics and alerts. (prometheus.io)
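Until a metrics backend is confirmed, some of these figures can be approximated from existing logs. For example, the API error rate is the share of responses with a 5xx status. The sketch below assumes Nginx's default `combined` log format, where the status code is the ninth whitespace-separated field; `error_rate_5xx` is an illustrative name.

```shell
# Print the percentage of requests answered with a 5xx status.
# Reads access log lines from stdin; assumes the default "combined" format.
error_rate_5xx() {
  awk '{ total++; if ($9 ~ /^5[0-9][0-9]$/) errors++ }
       END {
         if (total == 0) print "0.00"
         else printf "%.2f\n", 100 * errors / total
       }'
}

# Usage: error_rate_5xx < /var/log/nginx/access.log
```

A custom `log_format` would move the status field, so the field number should be checked against the actual Nginx configuration.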
Dashboards
Recommended Dashboards
| Dashboard | Purpose |
|---|---|
| Application Overview | API traffic, error rate, latency, uptime |
| Infrastructure Overview | CPU, memory, disk, network |
| Queue Dashboard | Pending jobs, failed jobs, job duration |
| Database Dashboard | Connections, slow queries, CPU, I/O |
| Redis Dashboard | Memory, commands, connections, queue usage |
| Nginx Dashboard | Request rate, status codes, upstream errors |
| Payment Dashboard | Payment success, failures, pending transactions |
| SMS Dashboard | OTP attempts, delivery success/failure |
| Backup Dashboard | Last successful backup, failures, restore test status |
Grafana provides dashboarding and visualization capabilities, and its documentation covers dashboard management, variables, annotations, sharing, and reporting. (grafana.com)
Alerting
Alerting Principles
- Alert on user-impacting symptoms.
- Avoid noisy alerts.
- Every alert must have an owner.
- Every alert must have a clear action.
- Critical alerts must be routed to the responsible team.
- Alert thresholds must be reviewed periodically.
Prometheus splits alerting into two parts: alerting rules, which the Prometheus server evaluates, and Alertmanager, which receives the resulting alerts and handles silencing, inhibition, grouping, and notification routing. (prometheus.io)
Required Alerts
| Alert | Severity | Condition | Owner |
|---|---|---|---|
| Application down | Critical | Health endpoint unavailable | CTO |
| High API error rate | Critical | 5xx rate above threshold | Needs Technical Verification |
| High response time | High | API latency above threshold | Needs Technical Verification |
| Queue backlog | High | Pending jobs above threshold | Needs Technical Verification |
| Failed jobs increasing | High | Failed job count increasing | Needs Technical Verification |
| Redis unavailable | Critical | Redis ping fails | Needs Technical Verification |
| Database unavailable | Critical | DB connection fails | Needs Technical Verification |
| High database load | High | CPU / I/O above threshold | Needs Technical Verification |
| Nginx upstream errors | High | 502/504 increasing | Needs Technical Verification |
| Payment failures | Critical | Payment failure rate above threshold | Needs Technical Verification |
| SMS failures | Medium | OTP delivery failure rate above threshold | Needs Technical Verification |
| Backup failed | Critical | Scheduled backup fails | Needs Technical Verification |
| Disk usage high | High | Disk usage above threshold | Needs Technical Verification |
| TLS certificate near expiry | High | Certificate expiry within defined window | Needs Technical Verification |
Grafana Alerting supports alert rules across multiple data sources and notification routing. (grafana.com)
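Most conditions in the table reduce to comparing a measured value with a threshold. The sketch below shows that shape for the disk usage alert. The 85 percent threshold in the usage line is a placeholder, not an agreed value, and `disk_usage_percent` and `evaluate_threshold` are illustrative names.

```shell
# Print the used percentage (digits only) of the filesystem holding a path.
# df -P gives the POSIX layout, where capacity is the fifth column.
disk_usage_percent() {
  df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Compare an integer value with an integer threshold.
# Prints FIRING when the value exceeds the threshold, otherwise OK.
evaluate_threshold() {
  if [ "$1" -gt "$2" ]; then echo "FIRING"; else echo "OK"; fi
}

# Usage: evaluate_threshold "$(disk_usage_percent /)" 85
```

The same comparison applies to queue depth or failed job counts once those values can be read reliably.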
Queue Monitoring
Queue Indicators
| Indicator | Risk |
|---|---|
| Jobs not processed | Notifications, exports, and background tasks may fail |
| Failed jobs increasing | Code, integration, or data issue |
| Long-running jobs | Worker timeout or inefficient processing |
| Queue backlog | Insufficient workers or heavy workload |
Commands
php artisan queue:failed
php artisan queue:work
php artisan queue:restart
The first command lists failed jobs, the second starts a worker in the foreground, and the third signals running workers to exit after their current job so that a process manager such as Supervisor can start fresh ones.
Required Checks
- Confirm queue workers are running.
- Confirm Redis or queue backend is reachable.
- Confirm failed jobs are reviewed.
- Confirm financial jobs are idempotent.
- Confirm retries do not duplicate wallet or payment effects.
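The last two checks matter because a queue job can run more than once after a retry or a worker restart. A common pattern is to record an idempotency key when the effect is applied and to skip any job whose key is already recorded. The sketch below illustrates the idea with a plain file as the ledger. The key format and the `apply_once` helper are hypothetical, and a real implementation would typically rely on a unique database constraint inside a transaction, because this file-based version is not safe under concurrent workers.

```shell
# Apply an effect at most once per idempotency key.
# $1 = idempotency key (hypothetical format), $2 = ledger file path.
apply_once() {
  key="$1"
  ledger="$2"
  if [ -f "$ledger" ] && grep -qxF -- "$key" "$ledger"; then
    echo "skipped"
  else
    printf '%s\n' "$key" >> "$ledger"
    echo "applied"   # the wallet or payment effect would run here
  fi
}

# Usage: apply_once "wallet-topup-1001" /tmp/processed-keys.txt
```

Running the same key twice reports "applied" and then "skipped", which is the behavior a retried financial job needs.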
Database Monitoring
Required Checks
| Check | Purpose |
|---|---|
| Slow queries | Identify inefficient queries |
| Full table scans | Identify missing indexes |
| Connection count | Detect connection exhaustion |
| CPU and I/O usage | Detect high load |
| Disk usage | Prevent outage |
| Replication status, if applicable | Detect data replication issues |
| Backup status | Confirm recoverability |
Query Review Example
EXPLAIN SELECT * FROM orders WHERE student_id = 1;
Assuming MySQL, a type of ALL in the output means the whole table is scanned, and a key of NULL means no index is used. An index on student_id would normally resolve both.
Integration Monitoring
Required Integration Checks
| Integration | Monitor |
|---|---|
| SMS Provider | Delivery failures, provider errors, OTP send latency |
| Payment Gateway | Payment failures, pending payments, refund failures, callback errors |
| Email Provider | Delivery failures, SMTP/API errors |
| Object Storage | Upload/download failures |
| Webhooks | Invalid signatures, retries, duplicate events |
| POS / Cashier | Transaction sync failures, duplicate transaction attempts |
Backup Monitoring
Required Backup Checks
| Check | Expected Result |
|---|---|
| Backup job completed | Success |
| Backup file exists | Yes |
| Backup file size valid | Yes |
| Backup encrypted where applicable | Yes |
| Backup access restricted | Yes |
| Restore test performed | Scheduled and documented |
A failed backup must be treated as an operational risk.
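The file existence and size checks can be automated right after the backup job finishes. A minimal sketch, assuming each backup is written as a single file; the path and minimum size in the usage line are placeholders, and `check_backup` is an illustrative name.

```shell
# Verify that a backup file exists and is at least min_bytes long.
# Prints a one-word verdict and returns non-zero on failure.
check_backup() {
  file="$1"
  min_bytes="$2"
  if [ ! -f "$file" ]; then
    echo "missing"
    return 1
  fi
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    echo "too_small"
    return 1
  fi
  echo "ok"
}

# Usage: check_backup /path/to/backup.sql.gz 1048576 || echo "backup check failed"
```

A size check catches truncated or empty dumps but does not prove the backup is restorable, so the scheduled restore test is still required.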
Security Monitoring
Security Events to Monitor
| Event | Reason |
|---|---|
| Repeated failed logins | Brute-force detection |
| Admin login from unusual location | Account risk |
| Role or permission changes | Privilege escalation risk |
| Sensitive export generated | Data exposure risk |
| Credential deactivation or replacement | Student credential lifecycle risk |
| Payment or refund anomaly | Financial risk |
| Webhook signature failure | Integration attack or misconfiguration |
| Repeated 403 responses | Possible access probing |
| Rate-limit triggers | Abuse or automation risk |
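Some of these events can be surfaced from access logs before dedicated tooling exists. As an illustration, the sketch below lists client IPs with repeated 403 responses. It assumes Nginx's default `combined` log format (client IP in field 1, status in field 9); the threshold is a placeholder and `probing_ips` is an illustrative name.

```shell
# List client IPs with at least $1 responses of status 403.
# Reads access log lines from stdin; prints "ip count", sorted by IP text.
probing_ips() {
  awk -v min="$1" '$9 == 403 { hits[$1]++ }
       END { for (ip in hits) if (hits[ip] >= min) print ip, hits[ip] }' | sort
}

# Usage: probing_ips 20 < /var/log/nginx/access.log
```

Behind a load balancer or CDN, field 1 may hold the proxy address rather than the client, so the log format should be confirmed first.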
Incident Metrics
Track the following operational metrics:
| Metric | Description |
|---|---|
| MTTD | Mean Time To Detect |
| MTTA | Mean Time To Acknowledge |
| MTTR | Mean Time To Resolve |
| Incident count | Number of incidents by period |
| Recurring incidents | Repeated issue types |
| Post-incident actions | Open corrective actions |
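These averages are simple to compute once incident timestamps are recorded consistently. The sketch below assumes one incident per line with four Unix timestamps (start, detected, acknowledged, resolved). It measures MTTD from start to detection, MTTA from detection to acknowledgement, and MTTR from start to resolution. Teams define these intervals differently, so the choice here is an assumption, and `incident_means` is an illustrative name.

```shell
# Print mean detection, acknowledgement, and resolution times in minutes.
# Input (stdin): "start detected acknowledged resolved" as Unix seconds.
incident_means() {
  awk '{ d += $2 - $1; a += $3 - $2; r += $4 - $1; n++ }
       END {
         if (n) printf "MTTD=%.1f MTTA=%.1f MTTR=%.1f\n", d / n / 60, a / n / 60, r / n / 60
       }'
}

# Usage: incident_means < incidents.txt
```

With no input the function prints nothing, which avoids reporting a misleading zero for a period without incidents.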
Runbook Template
Use this template to write a runbook for each monitoring alert.
## Alert: Alert Name
| Field | Details |
|---|---|
| Severity | Critical / High / Medium / Low |
| Service | Affected service |
| Trigger | Alert condition |
| Impact | User or system impact |
| First Check | First diagnostic command or dashboard |
| Escalation | Responsible owner |
| Resolution | Standard fix steps |
| Verification | How to confirm recovery |
Confirmed Monitoring Stack
| Component | Tool / Source | Status |
|---|---|---|
| Application logs | Laravel logs (storage/logs/laravel.log) | Confirmed |
| Error tracking and alerting | Sentry | Confirmed |
| Additional monitoring tools | Prometheus / Grafana / Laravel Nightwatch / other | Needs Technical Verification |
Note: The confirmed production monitoring stack consists of Laravel logs and Sentry. Additional observability tooling may be adopted later. The other sections on this page document recommended practices and tooling references for future adoption.
Open Items
| Item | Status | Notes |
|---|---|---|
| Confirm monitoring stack | Confirmed — Laravel logs and Sentry | Additional tools need technical verification |
| Confirm alert owners | Partially confirmed — Application down alert owner is CTO | All other alert owners need confirmation |
| Confirm health check endpoints | Needs Technical Verification | Add actual URLs without secrets |
| Confirm log retention | Needs Technical Verification | Define retention period |
| Confirm metrics retention | Needs Technical Verification | Define retention period |
| Confirm backup monitoring | Needs Technical Verification | Required |
| Confirm incident escalation process | Confirmed — See Incident Response page | Slack channel, CTO / Product Manager / Manager Support / CEO |
| Confirm dashboard list | Needs Technical Verification | Define actual dashboards |
Rules
- Do not include real secrets in monitoring screenshots.
- Do not expose private dashboard links publicly.
- Do not log sensitive payloads.
- Do not ignore failed backups.
- Every critical alert must have an owner.
- Every alert must have an action.