Observability and Monitoring

Purpose

This page documents the observability and monitoring approach for Maqsafy services.

The purpose is to detect failures early, reduce incident response time, support troubleshooting, and provide operational visibility across applications, infrastructure, queues, databases, and integrations.

Observability Scope

Observability should cover:

  • Application health
  • API availability
  • Error rates
  • Response times
  • Queue status
  • Redis health
  • Database performance
  • Nginx errors
  • Payment and SMS integration failures
  • Backup job status
  • Security and audit events

Observability Signals

| Signal | Purpose |
|---|---|
| Logs | Record application, infrastructure, and security events |
| Metrics | Track numerical system behavior over time |
| Traces | Track request flow across services |
| Alerts | Notify responsible team when thresholds are breached |
| Health checks | Confirm whether services are reachable and functioning |

OpenTelemetry defines observability telemetry as data such as traces, metrics, and logs, and provides a vendor-neutral framework for generating, collecting, and exporting that telemetry. (opentelemetry.io)


Health Checks

Purpose

Health checks are used to confirm that key services are running and reachable.

Laravel (version 11 and later) includes a built-in health check route, served at /up by default, that may be used in production by uptime monitors, load balancers, or orchestration systems. (laravel.com)

Required Health Checks

| Service | Check | Expected Result |
|---|---|---|
| Web application | HTTP health endpoint | 200 OK |
| Backend API | API health endpoint | 200 OK |
| Database | Connection check | Successful connection |
| Redis | Ping or queue test | Successful response |
| Queue worker | Job processing check | Test job processed |
| Nginx | Config and status check | Valid configuration |
| Object storage | Upload/download test | Successful operation |
| Payment gateway | Test/status endpoint where available | Successful response |
| SMS provider | Provider status or controlled test | Successful response |

Example Commands

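# Send a HEAD request and print only the response headers
curl -I https://example.com
# Request the health endpoint (placeholder URL; replace with the real one)
curl https://example.com/health
# Test the Nginx configuration for syntax errors without reloading
sudo nginx -t
# List jobs that have failed
php artisan queue:failed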
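
For scheduled checks, the same HTTP probe can be scripted. The following is a minimal sketch, not the team's actual monitor: the function name `check_health` and its `opener` parameter are hypothetical, and the rule "only HTTP 200 counts as healthy" is taken from the Expected Result column above.

```python
import urllib.error
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True only if the endpoint answers with HTTP 200 within the timeout.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without a network connection.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses;
        # refused connections, DNS failures, and timeouts surface as URLError
        # or OSError. All of them are treated as unhealthy here.
        return False
```

A call such as `check_health("https://example.com/health")` uses the placeholder URL from the commands above; the real endpoint is still listed under Open Items.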

Logging

Application Logs

Laravel application logs are commonly stored under:

storage/logs/laravel.log

Useful commands:

tail -f storage/logs/laravel.log
grep -i "error" storage/logs/laravel.log

Infrastructure Logs

| Component | Log Source |
|---|---|
| Nginx access log | /var/log/nginx/access.log |
| Nginx error log | /var/log/nginx/error.log |
| Docker container logs | docker logs <container-name> |
| Queue worker logs | Supervisor / Docker / application logs |
| Redis logs | Redis container or service logs |
| Database logs | MySQL slow query log / database monitoring tool |

Logging Rules

  • Do not log passwords.
  • Do not log access tokens.
  • Do not log OTP values.
  • Do not log private keys.
  • Do not log payment secrets.
  • Do not log unnecessary student personal data.
  • Use reference IDs for tracing sensitive workflows.
  • Mask sensitive identifiers where possible.
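
One way to apply these rules is to pass log context through a redaction step before it is written. This is an illustrative sketch only: the key names in `SENSITIVE_KEYS` and the helper names `mask_identifier` and `redact` are assumptions, not fields or functions from the Maqsafy codebase.

```python
# Hypothetical key names; align these with the fields the application really logs.
SENSITIVE_KEYS = {"password", "access_token", "otp", "private_key", "payment_secret"}


def mask_identifier(value, visible=4):
    """Keep only the last `visible` characters of an identifier such as a phone number."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def redact(context):
    """Return a copy of a log context dict with sensitive values replaced."""
    cleaned = {}
    for key, value in context.items():
        if key.lower() in SENSITIVE_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            # Recurse so nested payloads are covered as well.
            cleaned[key] = redact(value)
        else:
            cleaned[key] = value
    return cleaned
```

Redacting by key name only catches known fields; free-text messages that embed secrets still need review at the call site.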

Metrics

Required Metrics

| Area | Metric | Purpose |
|---|---|---|
| API | Request count | Measure traffic |
| API | Error rate | Detect failures |
| API | Response time | Detect latency |
| Authentication | Failed login attempts | Detect brute-force behavior |
| Queue | Pending jobs | Detect backlog |
| Queue | Failed jobs | Detect processing issues |
| Queue | Job duration | Detect slow jobs |
| Redis | Memory usage | Detect capacity issues |
| Redis | Connection failures | Detect cache/queue issues |
| Database | CPU / I/O usage | Detect load |
| Database | Slow queries | Detect performance issues |
| Nginx | 4xx / 5xx responses | Detect routing or application errors |
| Storage | Upload/download failures | Detect file service issues |
| Payment | Pending / failed payments | Detect payment issues |
| SMS | Delivery failures | Detect notification issues |
| Backups | Backup success/failure | Detect recovery risk |

Prometheus is an open-source monitoring and alerting toolkit with a time-series data model and the PromQL query language for metrics and alerts. (prometheus.io)
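
Until a metrics stack is confirmed, the 4xx / 5xx and error-rate metrics can be approximated from the Nginx access log. The sketch below assumes the log uses Nginx's default `combined` format; the function names are hypothetical, and a customized `log_format` would need a different pattern.

```python
import re
from collections import Counter

# Matches the start of a line in Nginx's default "combined" format:
# remote_addr - remote_user [time_local] "request" status ...
LINE_RE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')


def status_class_counts(lines):
    """Count responses per status class (2xx, 4xx, 5xx, ...), skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("status")[0] + "xx"] += 1
    return counts


def error_rate(counts):
    """Share of parsed requests that ended in a 5xx response."""
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0
```

Passing an open file object, for example `status_class_counts(open("/var/log/nginx/access.log"))`, works because the function only iterates over lines.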


Dashboards

| Dashboard | Purpose |
|---|---|
| Application Overview | API traffic, error rate, latency, uptime |
| Infrastructure Overview | CPU, memory, disk, network |
| Queue Dashboard | Pending jobs, failed jobs, job duration |
| Database Dashboard | Connections, slow queries, CPU, I/O |
| Redis Dashboard | Memory, commands, connections, queue usage |
| Nginx Dashboard | Request rate, status codes, upstream errors |
| Payment Dashboard | Payment success, failures, pending transactions |
| SMS Dashboard | OTP attempts, delivery success/failure |
| Backup Dashboard | Last successful backup, failures, restore test status |

Grafana provides dashboarding and visualization capabilities, and its documentation covers dashboard management, variables, annotations, sharing, and reporting. (grafana.com)


Alerting

Alerting Principles

  • Alert on user-impacting symptoms.
  • Avoid noisy alerts.
  • Every alert must have an owner.
  • Every alert must have a clear action.
  • Critical alerts must be routed to the responsible team.
  • Alert thresholds must be reviewed periodically.

Prometheus splits alerting into two parts: alerting rules, which the Prometheus server evaluates, and Alertmanager, which receives the resulting alerts and handles silencing, inhibition, aggregation, and notification routing. (prometheus.io)
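
A common way to honor the "avoid noisy alerts" principle is to fire only when a condition has held for several consecutive evaluations; Prometheus alerting rules offer a `for` clause for the same purpose. The sketch below shows the idea in isolation. The function name and parameters are hypothetical, and the threshold values are placeholders, since actual thresholds are not yet defined on this page.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True if the most recent `min_consecutive` samples all exceed the threshold.

    `samples` is an ordered sequence of metric readings, oldest first.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    recent = samples[-min_consecutive:]
    # Too few samples means the condition cannot yet be called sustained.
    return len(recent) == min_consecutive and all(value > threshold for value in recent)
```

With a placeholder threshold of 0.05, the readings `[0.01, 0.2, 0.01]` do not fire for `min_consecutive=2`, because the single spike has already recovered.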

Required Alerts

| Alert | Severity | Condition | Owner |
|---|---|---|---|
| Application down | Critical | Health endpoint unavailable | CTO |
| High API error rate | Critical | 5xx rate above threshold | Needs Technical Verification |
| High response time | High | API latency above threshold | Needs Technical Verification |
| Queue backlog | High | Pending jobs above threshold | Needs Technical Verification |
| Failed jobs increasing | High | Failed job count increasing | Needs Technical Verification |
| Redis unavailable | Critical | Redis ping fails | Needs Technical Verification |
| Database unavailable | Critical | DB connection fails | Needs Technical Verification |
| High database load | High | CPU / I/O above threshold | Needs Technical Verification |
| Nginx upstream errors | High | 502/504 increasing | Needs Technical Verification |
| Payment failures | Critical | Payment failure rate above threshold | Needs Technical Verification |
| SMS failures | Medium | OTP delivery failure rate above threshold | Needs Technical Verification |
| Backup failed | Critical | Scheduled backup fails | Needs Technical Verification |
| Disk usage high | High | Disk usage above threshold | Needs Technical Verification |
| TLS certificate near expiry | High | Certificate expiry within defined window | Needs Technical Verification |

Grafana Alerting supports alert rules across multiple data sources and notification routing. (grafana.com)
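
The "TLS certificate near expiry" condition can be checked with the Python standard library alone. This is a sketch under assumptions: the function names are hypothetical, and the alert window is left to the caller because this page does not yet define it.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> int:
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT". Negative means already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return int((expires_at - now) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host and return the notAfter field of its certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Combining the two, `days_until_expiry(fetch_not_after("example.com"))` gives the number to compare against the chosen window; `example.com` is the placeholder host used elsewhere on this page.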


Queue Monitoring

Queue Indicators

| Indicator | Risk |
|---|---|
| Jobs not processed | Notifications, exports, and background tasks may fail |
| Failed jobs increasing | Code, integration, or data issue |
| Long-running jobs | Worker timeout or inefficient processing |
| Queue backlog | Insufficient workers or heavy workload |

Commands

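# List jobs that have failed
php artisan queue:failed
# Start a worker in the foreground (in production, workers are normally kept running by Supervisor or the container runtime)
php artisan queue:work
# Tell running workers to exit gracefully after their current job so the process manager can restart them
php artisan queue:restart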

Required Checks

  • Confirm queue workers are running.
  • Confirm Redis or queue backend is reachable.
  • Confirm failed jobs are reviewed.
  • Confirm financial jobs are idempotent.
  • Confirm retries do not duplicate wallet or payment effects.
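
The last two checks come down to making each financial job safe to run more than once. The sketch below illustrates the usual idempotency-key pattern with an in-memory stand-in; `WalletLedger`, its fields, and the key format are hypothetical and do not describe the actual Maqsafy schema.

```python
class WalletLedger:
    """In-memory stand-in for a wallet table plus a table of processed job keys."""

    def __init__(self):
        self.balances = {}          # wallet_id -> balance in minor units (integers)
        self.processed_keys = set()  # idempotency keys that have already been applied

    def credit(self, idempotency_key, wallet_id, amount):
        """Apply the credit once; a retried job with the same key changes nothing.

        Returns True if the credit was applied, False if it was a duplicate.
        """
        if idempotency_key in self.processed_keys:
            return False
        self.balances[wallet_id] = self.balances.get(wallet_id, 0) + amount
        self.processed_keys.add(idempotency_key)
        return True
```

In a real database the key insert and the balance update would need to share one transaction, with a unique constraint on the key, so that concurrent retries cannot both pass the check.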

Database Monitoring

Required Checks

| Check | Purpose |
|---|---|
| Slow queries | Identify inefficient queries |
| Full table scans | Identify missing indexes |
| Connection count | Detect connection exhaustion |
| CPU and I/O usage | Detect high load |
| Disk usage | Prevent outage |
| Replication status, if applicable | Detect data replication issues |
| Backup status | Confirm recoverability |

Query Review Example

EXPLAIN SELECT * FROM orders WHERE student_id = 1;

Integration Monitoring

Required Integration Checks

| Integration | Monitor |
|---|---|
| SMS Provider | Delivery failures, provider errors, OTP send latency |
| Payment Gateway | Payment failures, pending payments, refund failures, callback errors |
| Email Provider | Delivery failures, SMTP/API errors |
| Object Storage | Upload/download failures |
| Webhooks | Invalid signatures, retries, duplicate events |
| POS / Cashier | Transaction sync failures, duplicate transaction attempts |
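
For the Webhooks row, "invalid signatures" usually means an HMAC check on the raw request body failed. The sketch below shows a generic HMAC-SHA256 verification; it is an assumption about the scheme, since each provider defines its own header name, encoding, and any timestamp component, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Compare the expected and received signatures in constant time."""
    expected = sign(secret, payload)
    # Encoding to bytes lets compare_digest accept any received string.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Each `False` result is a candidate for the "webhook signature failure" security event, and should be counted without logging the payload or the secret.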

Backup Monitoring

Required Backup Checks

| Check | Expected Result |
|---|---|
| Backup job completed | Success |
| Backup file exists | Yes |
| Backup file size valid | Yes |
| Backup encrypted where applicable | Yes |
| Backup access restricted | Yes |
| Restore test performed | Scheduled and documented |

A failed backup must be treated as an operational risk.
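
The "file exists" and "file size valid" checks, plus a freshness check, can be automated along the following lines. This is a sketch only: the function name, the minimum size, and the maximum age are hypothetical parameters to be set once backup monitoring is confirmed, and it does not cover encryption, access restrictions, or restore testing.

```python
import os
import time


def check_backup(path, min_size_bytes, max_age_hours, now=None):
    """Return a list of problems found; an empty list means these checks passed."""
    if not os.path.isfile(path):
        return [f"backup file missing: {path}"]

    problems = []
    info = os.stat(path)
    if info.st_size < min_size_bytes:
        problems.append(f"backup too small: {info.st_size} bytes")

    now = time.time() if now is None else now
    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"backup too old: {age_hours:.1f} hours")
    return problems
```

A non-empty result is what would feed the "Backup failed" alert.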


Security Monitoring

Security Events to Monitor

| Event | Reason |
|---|---|
| Repeated failed logins | Brute-force detection |
| Admin login from unusual location | Account risk |
| Role or permission changes | Privilege escalation risk |
| Sensitive export generated | Data exposure risk |
| Credential deactivation or replacement | Student credential lifecycle risk |
| Payment or refund anomaly | Financial risk |
| Webhook signature failure | Integration attack or misconfiguration |
| Repeated 403 responses | Possible access probing |
| Rate-limit triggers | Abuse or automation risk |
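
"Repeated failed logins" needs a concrete definition before it can be monitored. One option is a sliding window per account, sketched below; the class name and the limits (5 failures in 300 seconds by default) are hypothetical placeholders, not agreed thresholds.

```python
from collections import defaultdict, deque


class FailedLoginMonitor:
    """Track failed logins per account inside a sliding time window."""

    def __init__(self, max_failures=5, window_seconds=300):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # account -> timestamps of recent failures

    def record_failure(self, account, timestamp):
        """Record one failure; return True once the account reaches the limit."""
        events = self._events[account]
        events.append(timestamp)
        # Drop failures that have fallen out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) >= self.max_failures
```

Keying by account catches attacks on one user; keying the same structure by source IP would catch one client trying many accounts.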

Incident Metrics

Track the following operational metrics:

| Metric | Description |
|---|---|
| MTTD | Mean Time To Detect |
| MTTA | Mean Time To Acknowledge |
| MTTR | Mean Time To Resolve |
| Incident count | Number of incidents by period |
| Recurring incidents | Repeated issue types |
| Post-incident actions | Open corrective actions |
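
The three time-based metrics are averages over per-incident intervals. The sketch below assumes each incident record carries four timestamps under the hypothetical field names `started`, `detected`, `acknowledged`, and `resolved`. Which pair of timestamps defines each metric is itself an assumption, noted in the comments; MTTR in particular is measured from detection by some teams and from the start of the incident by others.

```python
from statistics import mean


def mean_minutes(incidents, start_field, end_field):
    """Average interval in minutes between two timestamp fields across incidents."""
    durations = [
        (incident[end_field] - incident[start_field]).total_seconds() / 60
        for incident in incidents
    ]
    return mean(durations) if durations else 0.0


def incident_metrics(incidents):
    """MTTD, MTTA, and MTTR in minutes for a list of incident dicts."""
    return {
        # Assumed definitions; adjust to match the Incident Response page.
        "MTTD": mean_minutes(incidents, "started", "detected"),
        "MTTA": mean_minutes(incidents, "detected", "acknowledged"),
        "MTTR": mean_minutes(incidents, "detected", "resolved"),
    }
```

Each incident is a dict of `datetime` values, so the same function works on data exported from whichever incident tracker is in use.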

Runbook Template

Use this template for monitoring alerts.

## Alert: Alert Name

| Field | Details |
|---|---|
| Severity | Critical / High / Medium / Low |
| Service | Affected service |
| Trigger | Alert condition |
| Impact | User or system impact |
| First Check | First diagnostic command or dashboard |
| Escalation | Responsible owner |
| Resolution | Standard fix steps |
| Verification | How to confirm recovery |

Confirmed Monitoring Stack

| Component | Tool / Source | Status |
|---|---|---|
| Application logs | Laravel logs (storage/logs/laravel.log) | Confirmed |
| Error tracking and alerting | Sentry | Confirmed |
| Additional monitoring tools | Prometheus / Grafana / Laravel Nightwatch / other | Needs Technical Verification |

Note: The confirmed production monitoring stack consists of Laravel logs and Sentry. Additional observability tooling may be adopted later. The other sections of this page document recommended practices and tooling references for future adoption.


Open Items

| Item | Status | Notes |
|---|---|---|
| Confirm monitoring stack | Confirmed: Laravel logs and Sentry | Additional tools need technical verification |
| Confirm alert owners | Partially confirmed: Application down alert owner is CTO | All other alert owners need confirmation |
| Confirm health check endpoints | Needs Technical Verification | Add actual URLs without secrets |
| Confirm log retention | Needs Technical Verification | Define retention period |
| Confirm metrics retention | Needs Technical Verification | Define retention period |
| Confirm backup monitoring | Needs Technical Verification | Required |
| Confirm incident escalation process | Confirmed: See Incident Response page | Slack channel, CTO / Product Manager / Manager Support / CEO |
| Confirm dashboard list | Needs Technical Verification | Define actual dashboards |

Rules

  • Do not include real secrets in monitoring screenshots.
  • Do not expose private dashboard links publicly.
  • Do not log sensitive payloads.
  • Do not ignore failed backups.
  • Every critical alert must have an owner.
  • Every alert must have an action.