Observability and Monitoring

Purpose

This page documents the observability and monitoring approach for Maqsafy services.

The purpose is to detect failures early, reduce incident response time, support troubleshooting, and provide operational visibility across applications, infrastructure, queues, databases, and integrations.

Observability Scope

Observability should cover:

Application health
API availability
Error rates
Response times
Queue status
Redis health
Database performance
Nginx errors
Payment and SMS integration failures
Backup job status
Security and audit events

Observability Signals

Signal	Purpose
Logs	Record application, infrastructure, and security events
Metrics	Track numerical system behavior over time
Traces	Track request flow across services
Alerts	Notify the responsible team when thresholds are breached
Health checks	Confirm whether services are reachable and functioning

OpenTelemetry defines observability telemetry as data such as traces, metrics, and logs, and provides a vendor-neutral framework for generating, collecting, and exporting that telemetry. (opentelemetry.io)

Health Checks

Purpose

Health checks are used to confirm that key services are running and reachable.

Laravel includes a built-in health check route (served at `/up` by default in Laravel 11 and later) that uptime monitors, load balancers, or orchestration systems can poll in production. (laravel.com)

Required Health Checks

Service	Check	Expected Result
Web application	HTTP health endpoint	200 OK
Backend API	API health endpoint	200 OK
Database	Connection check	Successful connection
Redis	Ping or queue test	Successful response
Queue worker	Job processing check	Test job processed
Nginx	Config and status check	Valid configuration
Object storage	Upload/download test	Successful operation
Payment gateway	Test/status endpoint where available	Successful response
SMS provider	Provider status or controlled test	Successful response

Example Commands

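curl -I https://example.com          # fetch response headers only; check the status line
curl https://example.com/health      # call the health endpoint (placeholder URL)
sudo nginx -t                        # validate Nginx configuration syntax without reloading
php artisan queue:failed             # list jobs recorded as failed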
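
The HTTP checks above could also be scripted so that an uptime job gets a clear pass or fail result. The following is a minimal sketch, not the production monitor: the function name `check_http`, the `opener` parameter, and the `/health` URL are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_http(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (healthy, detail) for one HTTP health endpoint.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"
    # Only 200 counts as healthy, matching the expected result in the table.
    return status == 200, f"HTTP {status}"
```

A scheduled job might call `check_http("https://example.com/health")` and raise an alert when the first element of the result is `False`.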

Logging

Application Logs

Laravel application logs are commonly stored under:

storage/logs/laravel.log

Useful commands:

tail -f storage/logs/laravel.log             # follow new log entries as they are written
grep -i "error" storage/logs/laravel.log     # case-insensitive search for "error"

Infrastructure Logs

Component	Log Source
Nginx access log	`/var/log/nginx/access.log`
Nginx error log	`/var/log/nginx/error.log`
Docker container logs	`docker logs <container-name>`
Queue worker logs	Supervisor / Docker / application logs
Redis logs	Redis container or service logs
Database logs	MySQL slow query log / database monitoring tool

Logging Rules

Do not log passwords.
Do not log access tokens.
Do not log OTP values.
Do not log private keys.
Do not log payment secrets.
Do not log unnecessary student personal data.
Use reference IDs for tracing sensitive workflows.
Mask sensitive identifiers where possible.
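
One way to apply these rules is to pass log context through a redaction step before it is written. The sketch below is illustrative only: the key names in `SENSITIVE_KEYS` and `masked_keys` are hypothetical and would need to match the fields the application actually logs.

```python
# Hypothetical field names; adjust to the real log context keys.
SENSITIVE_KEYS = {"password", "access_token", "otp", "private_key", "payment_secret"}


def mask_identifier(value, visible=4):
    """Keep only the last `visible` characters of an identifier."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def redact(payload, masked_keys=("phone", "national_id")):
    """Return a copy of `payload` that is safe to log."""
    clean = {}
    for key, value in payload.items():
        lowered = key.lower()
        if lowered in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"  # never log the value at all
        elif lowered in masked_keys:
            clean[key] = mask_identifier(value)  # partial value for tracing
        elif isinstance(value, dict):
            clean[key] = redact(value, masked_keys)  # handle nested context
        else:
            clean[key] = value
    return clean
```

Reference IDs pass through unchanged, so a sensitive workflow can still be traced end to end without exposing the underlying values.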

Metrics

Required Metrics

Area	Metric	Purpose
API	Request count	Measure traffic
API	Error rate	Detect failures
API	Response time	Detect latency
Authentication	Failed login attempts	Detect brute-force behavior
Queue	Pending jobs	Detect backlog
Queue	Failed jobs	Detect processing issues
Queue	Job duration	Detect slow jobs
Redis	Memory usage	Detect capacity issues
Redis	Connection failures	Detect cache/queue issues
Database	CPU / I/O usage	Detect load
Database	Slow queries	Detect performance issues
Nginx	4xx / 5xx responses	Detect routing or application errors
Storage	Upload/download failures	Detect file service issues
Payment	Pending / failed payments	Detect payment issues
SMS	Delivery failures	Detect notification issues
Backups	Backup success/failure	Detect recovery risk
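
To make the API rows concrete, the sketch below derives an error rate and a latency percentile from a window of samples. It assumes the samples have already been collected (for example, parsed from access logs); the function names are illustrative, and a metrics backend would normally compute these values itself.

```python
import math


def error_rate(status_codes):
    """Fraction of responses with a 5xx status; 0.0 for an empty window."""
    if not status_codes:
        return 0.0
    failures = sum(1 for code in status_codes if 500 <= code <= 599)
    return failures / len(status_codes)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sample, with pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Percentiles such as p95 are usually more informative than the mean for response time, because a small number of very slow requests can hide behind a healthy average.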

Prometheus is an open-source monitoring and alerting toolkit with a time-series data model and the PromQL query language for querying metrics and defining alerts. (prometheus.io)

Dashboards

Recommended Dashboards

Dashboard	Purpose
Application Overview	API traffic, error rate, latency, uptime
Infrastructure Overview	CPU, memory, disk, network
Queue Dashboard	Pending jobs, failed jobs, job duration
Database Dashboard	Connections, slow queries, CPU, I/O
Redis Dashboard	Memory, commands, connections, queue usage
Nginx Dashboard	Request rate, status codes, upstream errors
Payment Dashboard	Payment success, failures, pending transactions
SMS Dashboard	OTP attempts, delivery success/failure
Backup Dashboard	Last successful backup, failures, restore test status

Grafana provides dashboarding and visualization capabilities, and its documentation covers dashboard management, variables, annotations, sharing, and reporting. (grafana.com)

Alerting

Alerting Principles

Alert on user-impacting symptoms.
Avoid noisy alerts.
Every alert must have an owner.
Every alert must have a clear action.
Critical alerts must be routed to the responsible team.
Alert thresholds must be reviewed periodically.

Prometheus splits alerting into two parts: alerting rules, which the Prometheus server evaluates, and Alertmanager, which receives the resulting alerts and handles silencing, inhibition, aggregation, and notification routing. (prometheus.io)

Required Alerts

Alert	Severity	Condition	Owner
Application down	Critical	Health endpoint unavailable	CTO
High API error rate	Critical	5xx rate above threshold	Needs Technical Verification
High response time	High	API latency above threshold	Needs Technical Verification
Queue backlog	High	Pending jobs above threshold	Needs Technical Verification
Failed jobs increasing	High	Failed job count increasing	Needs Technical Verification
Redis unavailable	Critical	Redis ping fails	Needs Technical Verification
Database unavailable	Critical	DB connection fails	Needs Technical Verification
High database load	High	CPU / I/O above threshold	Needs Technical Verification
Nginx upstream errors	High	502/504 increasing	Needs Technical Verification
Payment failures	Critical	Payment failure rate above threshold	Needs Technical Verification
SMS failures	Medium	OTP delivery failure rate above threshold	Needs Technical Verification
Backup failed	Critical	Scheduled backup fails	Needs Technical Verification
Disk usage high	High	Disk usage above threshold	Needs Technical Verification
TLS certificate near expiry	High	Certificate expiry within defined window	Needs Technical Verification
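
Most conditions in this table are of the form "value above threshold". To avoid noisy alerts, a rule can require the breach to persist across several consecutive samples, similar in spirit to the `for` clause in Prometheus alerting rules. The sketch below shows that idea; the function name, the threshold, and the sample count are placeholders, since the actual thresholds are still to be defined.

```python
def should_fire(samples, threshold, sustained=3):
    """Return True when the last `sustained` samples all exceed `threshold`.

    `samples` is ordered oldest to newest, one value per evaluation interval.
    """
    if sustained < 1 or len(samples) < sustained:
        return False  # not enough history to decide
    return all(value > threshold for value in samples[-sustained:])
```

With a hypothetical 5% error-rate threshold, `should_fire([0.01, 0.06, 0.07, 0.08], 0.05)` fires, while a single spike followed by recovery does not.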

Grafana Alerting supports alert rules across multiple data sources and notification routing. (grafana.com)

Queue Monitoring

Queue Indicators

Indicator	Risk
Jobs not processed	Notifications, exports, and background tasks may fail
Failed jobs increasing	Code, integration, or data issue
Long-running jobs	Worker timeout or inefficient processing
Queue backlog	Insufficient workers or heavy workload

Commands

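php artisan queue:failed     # list jobs recorded as failed
php artisan queue:work       # start a worker in the foreground (normally run by Supervisor or Docker)
php artisan queue:restart    # tell workers to exit gracefully after their current job so the process manager restarts them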

Required Checks

Confirm queue workers are running.
Confirm Redis or queue backend is reachable.
Confirm failed jobs are reviewed.
Confirm financial jobs are idempotent.
Confirm retries do not duplicate wallet or payment effects.
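
The last two checks can be illustrated with an idempotency key: each financial job carries a unique key, and the effect is applied only the first time that key is seen. This is a sketch of the idea only; `WalletLedger`, `credit`, and the in-memory storage are hypothetical stand-ins, not the Maqsafy implementation.

```python
class WalletLedger:
    """In-memory stand-in for a wallet table plus a processed-keys table."""

    def __init__(self):
        self.balances = {}
        self.processed_keys = set()

    def credit(self, idempotency_key, student_id, amount):
        """Apply a credit once per key; return True only when it was applied."""
        if idempotency_key in self.processed_keys:
            return False  # a retry of a job that already took effect
        self.balances[student_id] = self.balances.get(student_id, 0) + amount
        self.processed_keys.add(idempotency_key)
        return True
```

In a real system, the key check and the balance update would need to happen in a single database transaction, with a unique constraint on the key, so that two concurrent retries cannot both pass the check.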

Database Monitoring

Required Checks

Check	Purpose
Slow queries	Identify inefficient queries
Full table scans	Identify missing indexes
Connection count	Detect connection exhaustion
CPU and I/O usage	Detect high load
Disk usage	Prevent outage
Replication status, if applicable	Detect data replication issues
Backup status	Confirm recoverability

Query Review Example

EXPLAIN SELECT * FROM orders WHERE student_id = 1;
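
The effect of a missing index on a query plan can be demonstrated without touching a production database. The sketch below uses SQLite from the Python standard library as a stand-in, so the plan text differs from MySQL's `EXPLAIN` output; the table layout and the index name `idx_orders_student_id` are assumptions made for the example.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


def demo_plans():
    """Compare the plan for the same query before and after adding an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, student_id INTEGER, total INTEGER)"
    )
    sql = "SELECT * FROM orders WHERE student_id = ?"
    before = query_plan(conn, sql, (1,))  # expected to report a full table scan
    conn.execute("CREATE INDEX idx_orders_student_id ON orders (student_id)")
    after = query_plan(conn, sql, (1,))   # expected to report an index search
    conn.close()
    return before, after
```

The same before-and-after comparison applies when reviewing MySQL queries: a plan that scans the whole table for a selective filter usually points to a missing index.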

Integration Monitoring

Required Integration Checks

Integration	Monitor
SMS Provider	Delivery failures, provider errors, OTP send latency
Payment Gateway	Payment failures, pending payments, refund failures, callback errors
Email Provider	Delivery failures, SMTP/API errors
Object Storage	Upload/download failures
Webhooks	Invalid signatures, retries, duplicate events
POS / Cashier	Transaction sync failures, duplicate transaction attempts
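
For the Webhooks row, the sketch below shows how invalid signatures and duplicate events might be detected so they can be counted and alerted on. It assumes an HMAC-SHA256 signature over the raw request body, encoded as hex, and a provider-supplied event ID; real providers define their own signing scheme, so their documentation takes precedence.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of a raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare signatures in constant time to avoid timing side channels."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())


class WebhookDeduplicator:
    """Remember event IDs so that retried deliveries are processed once."""

    def __init__(self):
        self.seen_event_ids = set()

    def is_duplicate(self, event_id: str) -> bool:
        if event_id in self.seen_event_ids:
            return True
        self.seen_event_ids.add(event_id)
        return False
```

Each rejected signature and each detected duplicate is a candidate for a counter metric, which feeds the integration and security monitoring described on this page.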

Backup Monitoring

Required Backup Checks

Check	Expected Result
Backup job completed	Success
Backup file exists	Yes
Backup file size valid	Yes
Backup encrypted where applicable	Yes
Backup access restricted	Yes
Restore test performed	Scheduled and documented

A failed backup must be treated as an operational risk.
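
The first three checks in the table (job completed, file exists, file size valid) can be approximated by inspecting the newest backup file. The sketch below assumes backups are written to a local path; the function name and the size and age limits are placeholders. It does not cover encryption, access restrictions, or restore testing, which need separate verification.

```python
import os
import time


def check_backup(path, min_bytes, max_age_seconds, now=None):
    """Return a list of problems found; an empty list means the checks passed."""
    if not os.path.isfile(path):
        return ["backup file missing"]
    now = time.time() if now is None else now
    info = os.stat(path)
    problems = []
    if info.st_size < min_bytes:
        # A very small file often means the dump failed part-way through.
        problems.append(f"backup too small: {info.st_size} bytes")
    if now - info.st_mtime > max_age_seconds:
        # A stale file suggests the scheduled job did not run or did not finish.
        problems.append("backup older than allowed age")
    return problems
```

A non-empty result would map to the "Backup failed" alert in the Required Alerts table.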

Security Monitoring

Security Events to Monitor

Event	Reason
Repeated failed logins	Brute-force detection
Admin login from unusual location	Account risk
Role or permission changes	Privilege escalation risk
Sensitive export generated	Data exposure risk
Credential deactivation or replacement	Student credential lifecycle risk
Payment or refund anomaly	Financial risk
Webhook signature failure	Integration attack or misconfiguration
Repeated 403 responses	Possible access probing
Rate-limit triggers	Abuse or automation risk
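
As an example of the first row, repeated failed logins can be detected with a sliding time window per account. This is a minimal in-memory sketch; the class name, the default limit of 5 failures, and the 300-second window are assumptions, and a production version would keep its state in a shared store such as Redis.

```python
from collections import defaultdict, deque


class FailedLoginMonitor:
    """Flag accounts with too many failed logins inside a time window."""

    def __init__(self, max_failures=5, window_seconds=300):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)

    def record_failure(self, account_ref, timestamp):
        """Record one failure; return True when the limit is reached."""
        events = self._failures[account_ref]
        events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.max_failures
```

Using an account reference ID rather than a username or phone number keeps the monitor consistent with the logging rules on this page.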

Incident Metrics

Track the following operational metrics:

Metric	Description
MTTD	Mean Time To Detect
MTTA	Mean Time To Acknowledge
MTTR	Mean Time To Resolve
Incident count	Number of incidents by period
Recurring incidents	Repeated issue types
Post-incident actions	Open corrective actions
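
The three time-based metrics can be derived from four timestamps per incident. The sketch below assumes MTTD runs from start to detection, MTTA from detection to acknowledgement, and MTTR from start to resolution; definitions vary between teams, so the field names and intervals here are illustrative.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average interval in minutes between two timestamp fields."""
    durations = [
        (incident[end_field] - incident[start_field]).total_seconds() / 60
        for incident in incidents
    ]
    return mean(durations) if durations else None


def incident_metrics(incidents):
    """Compute MTTD, MTTA, and MTTR in minutes for a list of incidents."""
    return {
        "mttd": mean_minutes(incidents, "started", "detected"),
        "mtta": mean_minutes(incidents, "detected", "acknowledged"),
        "mttr": mean_minutes(incidents, "started", "resolved"),
    }
```

Each incident is a dictionary with `started`, `detected`, `acknowledged`, and `resolved` keys holding `datetime` values, for example `datetime(2024, 1, 1, 10, 0)`.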

Runbook Template

Use this template to document a runbook for each monitoring alert.

## Alert: Alert Name

| Field | Details |
|---|---|
| Severity | Critical / High / Medium / Low |
| Service | Affected service |
| Trigger | Alert condition |
| Impact | User or system impact |
| First Check | First diagnostic command or dashboard |
| Escalation | Responsible owner |
| Resolution | Standard fix steps |
| Verification | How to confirm recovery |

Confirmed Monitoring Stack

Component	Tool / Source	Status
Application logs	Laravel logs (`storage/logs/laravel.log`)	Confirmed
Error tracking and alerting	Sentry	Confirmed
Additional monitoring tools	Prometheus / Grafana / Laravel Nightwatch / other	Needs Technical Verification

Note: The confirmed production monitoring stack consists of Laravel logs and Sentry. Additional observability tooling may be adopted later. The other sections on this page document recommended practices and tooling references for future adoption.

Open Items

Item	Status	Notes
Confirm monitoring stack	Confirmed — Laravel logs and Sentry	Additional tools need technical verification
Confirm alert owners	Partially confirmed — Application down alert owner is CTO	All other alert owners need confirmation
Confirm health check endpoints	Needs Technical Verification	Add actual URLs without secrets
Confirm log retention	Needs Technical Verification	Define retention period
Confirm metrics retention	Needs Technical Verification	Define retention period
Confirm backup monitoring	Needs Technical Verification	Required
Confirm incident escalation process	Confirmed — See Incident Response page	Slack channel, CTO / Product Manager / Manager Support / CEO
Confirm dashboard list	Needs Technical Verification	Define actual dashboards

Rules

Do not include real secrets in monitoring screenshots.
Do not expose private dashboard links publicly.
Do not log sensitive payloads.
Do not ignore failed backups.
Every critical alert must have an owner.
Every alert must have an action.
