Incident Response

Purpose

This page documents the incident response process for Maqsafy services.

The purpose is to provide a controlled process for detecting, analyzing, containing, resolving, and documenting operational and security incidents.

Incident Response Scope

This process applies to:

  • Application outages
  • API failures
  • Database issues
  • Queue failures
  • Redis failures
  • Nginx or reverse proxy issues
  • Payment or wallet incidents
  • Credential-related incidents
  • Data access or RBAC incidents
  • Security incidents
  • Backup or restore failures
  • Integration failures

Incident Response References

NIST SP 800-61 Rev. 3 is the current NIST guidance for incident response recommendations and considerations within cybersecurity risk management. NIST’s incident response project states that the goal is to help organizations prepare for incident response, reduce incident impact, and improve detection, response, and recovery activities.

CISA’s Incident Response Plan Basics states that an incident response plan should clarify roles and responsibilities and guide key activities during incidents.


Incident Severity Levels

| Severity | Description | Example |
|---|---|---|
| SEV-1 Critical | Major production outage or high-impact security/financial incident | Platform down, payment/wallet integrity issue, data exposure |
| SEV-2 High | Major feature or service degraded with business impact | Login outage, API failure, queue backlog affecting users |
| SEV-3 Medium | Partial issue with workaround available | Report failure, delayed notifications, non-critical integration issue |
| SEV-4 Low | Minor issue with limited impact | UI bug, single non-critical job failure |

Severity Assignment Rules

  • If financial integrity may be affected, classify as SEV-1 or SEV-2 until confirmed.
  • If student data may be exposed, classify as SEV-1 until confirmed.
  • If tenant isolation may be broken, classify as SEV-1 until confirmed.
  • If credential deactivation/replacement is incorrect, classify as SEV-1 or SEV-2 based on impact.
  • If production is fully unavailable, classify as SEV-1.
  • If the issue is limited to staging or local environment, classify based on release impact.
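The rules above can be expressed as a simple decision order that always classifies at the most severe matching rule. This is an illustrative sketch only; the flag names are hypothetical and should map to whatever the triage form actually captures.

```python
# Illustrative sketch of the severity-assignment rules above. Flag names are
# hypothetical; map them to your actual triage inputs.

def assign_severity(*, production_down=False, student_data_exposure=False,
                    tenant_isolation_broken=False, financial_integrity_risk=False,
                    credential_error=False, credential_impact_high=False):
    """Classify at the most severe matching rule; downgrade only after confirmation."""
    if production_down or student_data_exposure or tenant_isolation_broken:
        return "SEV-1"
    if financial_integrity_risk:
        return "SEV-1"  # may be downgraded to SEV-2 once scope is confirmed
    if credential_error:
        return "SEV-1" if credential_impact_high else "SEV-2"
    return "SEV-3"  # default pending further triage

print(assign_severity(production_down=True))   # SEV-1
print(assign_severity(credential_error=True))  # SEV-2
```

Note that uncertain financial or data-exposure cases start high and are downgraded only after confirmation, matching the "until confirmed" wording of the rules.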

Incident Lifecycle

1. Detection

Incident detection may come from:

  • Monitoring alerts
  • User reports
  • Support tickets
  • Application logs
  • Payment gateway alerts
  • Failed backup alerts
  • Security events
  • Team observation

2. Triage

During triage, confirm:

| Check | Required |
|---|---|
| Environment identified | Yes |
| Affected service identified | Yes |
| User impact estimated | Yes |
| Financial impact estimated | If applicable |
| Data exposure risk assessed | If applicable |
| Tenant isolation risk assessed | If applicable |
| Incident severity assigned | Yes |
| Incident owner assigned | Yes |
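The always-required rows of the checklist can gate the transition out of triage. A minimal sketch, with illustrative shorthand names for the check rows:

```python
# Sketch of gating on the triage checklist above: an incident should not
# leave triage until every always-required check is complete.

REQUIRED_CHECKS = [
    "environment_identified",
    "affected_service_identified",
    "user_impact_estimated",
    "severity_assigned",
    "owner_assigned",
]

def missing_triage_checks(completed):
    """completed: set of finished check names. Returns what still blocks triage."""
    return [c for c in REQUIRED_CHECKS if c not in completed]

print(missing_triage_checks({"environment_identified", "severity_assigned"}))
# ['affected_service_identified', 'user_impact_estimated', 'owner_assigned']
```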

3. Containment

Containment actions may include:

  • Disable affected feature temporarily.
  • Block suspicious access.
  • Stop affected queue job.
  • Isolate compromised account.
  • Disable affected integration.
  • Put application in maintenance mode if required.
  • Prevent duplicate financial processing.
  • Freeze high-risk withdrawal or refund workflow if needed.

4. Eradication

Eradication removes the confirmed root cause.

Examples:

  • Fix code defect.
  • Correct configuration.
  • Patch vulnerable package.
  • Rotate exposed credentials.
  • Fix RBAC or tenant isolation logic.
  • Correct failed migration.
  • Fix Nginx upstream configuration.
  • Repair queue worker configuration.

5. Recovery

Recovery restores normal operation.

Recovery checks:

| Check | Expected Result |
|---|---|
| Application health endpoint | Healthy |
| Login | Working |
| API endpoints | Working |
| Queue workers | Processing jobs |
| Redis | Reachable |
| Database | Reachable |
| Nginx | No critical errors |
| Payment flow | Working, if impacted |
| Wallet ledger | Reconciled, if impacted |
| Credential flow | Correct, if impacted |
| Logs | No new critical exceptions |
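The recovery checks above can be run as one loop over injected probes, so a single command reports everything still failing. This is a sketch; the lambda stubs stand in for real HTTP, Redis, and database checks.

```python
# Illustrative recovery-verification sketch. Each probe is injected as a
# callable so the loop stays environment-agnostic.

def failed_recovery_checks(probes):
    """probes: dict mapping check name -> zero-arg callable returning bool.
    Returns the names of checks that did not pass (empty means recovered)."""
    return [name for name, probe in probes.items() if not probe()]

result = failed_recovery_checks({
    "health_endpoint": lambda: True,
    "login": lambda: True,
    "queue_workers": lambda: True,
    "redis": lambda: False,  # simulate Redis still unreachable
})
print(result)  # ['redis']
```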

6. Post-Incident Review

Post-incident review must document:

  • Timeline
  • Root cause
  • Impact
  • Detection method
  • Resolution
  • Preventive actions
  • Owner of corrective actions
  • Evidence collected
  • Follow-up deadline

Incident Communication Channel

| Channel | Purpose | Status |
|---|---|---|
| Slack | Primary incident communication channel for all severity levels | Confirmed |

Incident Roles

| Role | Responsibility |
|---|---|
| Incident Commander | Owns incident coordination and decisions |
| Technical Lead | Leads technical investigation and fix |
| Operations Owner | Handles infrastructure, deployment, monitoring, and recovery |
| Product Owner | Confirms business impact and user-facing priority |
| Support Owner | Handles user/support communication |
| Security Owner | Leads security assessment and evidence handling |
| Finance Owner | Reviews wallet, payment, refund, and settlement impact |
| Communications Owner | Prepares internal or external communication if required |

Confirmed Incident Leadership

| Name / Role | Incident Leadership Role | Status |
|---|---|---|
| CTO | Incident Commander / Technical Leadership | Confirmed |
| Product Manager | Product Impact and Priority | Confirmed |
| Manager Support | Support and User Communication | Confirmed |
| CEO | Executive Escalation | Confirmed |

Role Assignment

| Incident Type | Required Owners |
|---|---|
| Application outage | Incident Commander, Technical Lead, Operations Owner |
| Payment or wallet incident | Incident Commander, Technical Lead, Finance Owner |
| Credential incident | Incident Commander, Technical Lead, Product Owner |
| Data exposure incident | Incident Commander, Security Owner, Technical Lead |
| Integration failure | Technical Lead, Operations Owner, Integration Owner |
| Backup failure | Operations Owner, Technical Lead |

Communication Rules

Internal Communication

Internal updates should include:

| Field | Description |
|---|---|
| Incident ID | Unique incident reference |
| Severity | SEV-1 / SEV-2 / SEV-3 / SEV-4 |
| Status | Investigating / Identified / Contained / Recovering / Resolved |
| Impact | Affected users, services, or operations |
| Current action | What is being done now |
| Owner | Responsible person/team |
| Next update | Expected update time |
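Keeping every internal update in the same field order makes the Slack channel easy to scan. A minimal sketch of a message formatter for these fields; the incident ID shown is a placeholder, since the real ID format is still an open item below.

```python
# Hypothetical helper that renders the internal-update fields above into a
# single message body for the Slack incident channel.

def format_internal_update(incident_id, severity, status, impact,
                           current_action, owner, next_update):
    return "\n".join([
        f"Incident: {incident_id}",
        f"Severity: {severity}",
        f"Status: {status}",
        f"Impact: {impact}",
        f"Current action: {current_action}",
        f"Owner: {owner}",
        f"Next update: {next_update}",
    ])

update = format_internal_update(
    "INC-EXAMPLE-001", "SEV-2", "Investigating",
    "Login failing for a subset of users",
    "Checking OTP provider status", "Technical Lead", "15:30 UTC")
print(update)
```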

External Communication

External communication is required only when approved by management or required by contract, regulation, or business impact.

External messages must not include:

  • Internal IPs
  • Secrets
  • Private logs
  • Detailed exploit information
  • Customer personal data
  • Unconfirmed assumptions

Incident Types and Runbooks

Application Down

Symptoms

  • Health endpoint fails.
  • Users cannot access dashboard or app.
  • Nginx returns 502, 503, or 504.

First Checks

curl -I https://example.com
sudo nginx -t
sudo tail -f /var/log/nginx/error.log
docker ps
docker logs <container-name>

Containment

  • Confirm affected service.
  • Restart failed service if safe.
  • Roll back recent deployment if confirmed as cause.
  • Notify stakeholders if production impact is high.

Login Failure

Symptoms

  • Users cannot log in.
  • API returns unauthorized or validation errors.
  • OTP delivery fails.

First Checks

tail -f storage/logs/laravel.log
php artisan route:list
php artisan config:clear

Containment

  • Confirm if issue affects all users or a specific role.
  • Check authentication provider or OTP provider.
  • Check rate limits and failed login logs.
  • Avoid disabling security controls without approval.

Queue Backlog

Symptoms

  • Notifications delayed.
  • Reports delayed.
  • Exports not generated.
  • Background jobs not processing.

First Checks

php artisan queue:failed
php artisan queue:work
php artisan queue:restart
docker ps
docker logs <queue-worker-container>

Containment

  • Restart queue workers.
  • Pause non-critical heavy jobs.
  • Prioritize financial or user-impacting jobs.
  • Review failed jobs before retrying sensitive jobs.

Redis Failure

Symptoms

  • Cache, sessions, or queues fail.
  • Logs show Redis connection errors.
  • Queue jobs stop processing.

First Checks

docker ps
docker logs <redis-container>
docker network ls
docker network inspect <network-name>
redis-cli -h <redis-host> -p 6379 ping

Containment

  • Confirm Redis is running.
  • Confirm service name and network.
  • Restart dependent services if required.
  • Avoid clearing production cache without understanding impact.

Database Incident

Symptoms

  • Login fails.
  • Dashboard pages fail to load.
  • API returns database errors.
  • Reports or financial workflows fail.

First Checks

php artisan migrate:status
mysql -h <db-host> -u <db-user> -p
tail -f storage/logs/laravel.log

Containment

  • Stop destructive jobs or migrations.
  • Preserve logs.
  • Confirm backup availability.
  • Avoid manual production data edits without approval and audit record.

Payment or Wallet Incident

Symptoms

  • Wallet balance mismatch.
  • Payment succeeded but wallet not credited.
  • Refund duplicated or failed.
  • Settlement mismatch.

First Checks

  • Check payment gateway reference.
  • Check transaction table.
  • Check wallet ledger.
  • Check failed jobs.
  • Check webhook logs.
  • Check reconciliation reports.

Containment

  • Stop duplicate processing.
  • Freeze affected transaction workflow if needed.
  • Do not manually adjust balances without approval.
  • Use adjustment or reversal records instead of silent edits.
  • Preserve all references and logs.

Recovery

  • Reconcile gateway status with internal ledger.
  • Correct through controlled financial records.
  • Confirm no duplicate wallet effect.
  • Document final financial impact.
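Reconciling gateway status with the internal ledger, as described above, amounts to matching records by transaction reference and flagging anything missing or differing in amount. A minimal sketch, with hypothetical field names:

```python
# Illustrative reconciliation sketch: compare gateway records against the
# internal wallet ledger by transaction reference. Amounts in minor units.

def reconcile(gateway_records, ledger_records):
    """Each record: {"ref": str, "amount": int}.
    Returns (ref, reason) pairs needing a controlled correction record."""
    ledger = {r["ref"]: r["amount"] for r in ledger_records}
    mismatches = []
    for g in gateway_records:
        if g["ref"] not in ledger:
            mismatches.append((g["ref"], "missing_in_ledger"))
        elif ledger[g["ref"]] != g["amount"]:
            mismatches.append((g["ref"], "amount_mismatch"))
    return mismatches

gateway = [{"ref": "PAY-1", "amount": 5000}, {"ref": "PAY-2", "amount": 1200}]
ledger = [{"ref": "PAY-1", "amount": 5000}]
print(reconcile(gateway, ledger))  # [('PAY-2', 'missing_in_ledger')]
```

Each flagged reference is then corrected through an adjustment or reversal record, never a silent balance edit, consistent with the containment rules above.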

Credential Incident

Symptoms

  • Credential assigned to wrong student.
  • Credential delivery status incorrect.
  • Credential deactivated incorrectly.
  • Credential duplication risk.

First Checks

  • Check credential inventory record.
  • Check student and wallet linkage.
  • Check lifecycle events.
  • Check audit logs.
  • Check actor and timestamp.

Containment

  • Disable affected credential if risk is confirmed and approved.
  • Prevent further use if credential integrity is uncertain.
  • Do not delete credential history.
  • Record all corrective actions.

Recovery

  • Correct assignment through approved workflow.
  • Record reason.
  • Confirm lifecycle history.
  • Notify responsible school user if needed.
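Because credential history must never be deleted, a correction is recorded as a new lifecycle event rather than an edit to existing records. A sketch of that append-only pattern, with hypothetical field names:

```python
# Sketch of the "record all corrective actions" rule: history is append-only,
# so reassignment adds an event instead of modifying the original record.

def record_reassignment(history, credential_id, new_student_id, reason, actor):
    history.append({
        "credential_id": credential_id,
        "event": "reassigned",
        "student_id": new_student_id,
        "reason": reason,
        "actor": actor,
    })
    return history

history = [{"credential_id": "CRD-1", "event": "assigned", "student_id": "S-100"}]
record_reassignment(history, "CRD-1", "S-200",
                    "assigned to wrong student", "school-admin")
print(len(history))  # 2: original assignment preserved, correction appended
```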

RBAC or Tenant Isolation Incident

Symptoms

  • User sees data outside assigned scope.
  • School Manager sees another school’s data.
  • Supplier sees another supplier’s order.
  • Operator sees another cafeteria’s records.

First Checks

  • Identify user role and scope.
  • Identify affected endpoint.
  • Review query filters.
  • Review authorization middleware.
  • Check access logs and audit logs.

Containment

  • Disable affected endpoint or feature if needed.
  • Remove excessive permission.
  • Patch backend authorization.
  • Review similar endpoints.
  • Force logout affected sessions if required.

Recovery

  • Deploy fix.
  • Test positive and negative access cases.
  • Confirm no cross-tenant data remains visible.
  • Document data exposure assessment.
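The positive and negative access cases above can be tested directly against the patched, tenant-filtered query. This is a sketch; `scoped_records` is a hypothetical stand-in for the fixed endpoint logic.

```python
# Sketch of post-fix tenant-isolation tests: own data stays visible
# (positive case) and no cross-tenant rows leak (negative case).

def scoped_records(records, caller_school_id):
    """Return only rows belonging to the caller's school."""
    return [r for r in records if r["school_id"] == caller_school_id]

records = [{"id": 1, "school_id": "A"}, {"id": 2, "school_id": "B"}]
visible = scoped_records(records, "A")

assert all(r["school_id"] == "A" for r in visible)      # positive: own data visible
assert not any(r["school_id"] == "B" for r in visible)  # negative: no cross-tenant rows
print([r["id"] for r in visible])  # [1]
```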

Evidence Handling

Preserve evidence for security, financial, and data incidents.

Evidence may include:

  • Application logs
  • Nginx logs
  • Database audit records
  • Payment gateway references
  • Queue failed job records
  • User and role records
  • Credential lifecycle events
  • Screenshots with sensitive data masked
  • Timeline of actions

Evidence Rules

  • Do not edit original logs.
  • Do not share raw logs externally without review.
  • Mask personal data before sharing screenshots.
  • Preserve timestamps and reference IDs.
  • Keep evidence access restricted.
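Masking personal data before sharing can be partly automated. A minimal sketch using stdlib `re`; the two patterns below are examples only and must be extended to cover whatever personal data the logs actually contain.

```python
# Illustrative masking sketch for evidence sharing. Patterns are examples,
# not a complete PII filter.
import re

def mask(text):
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)  # email addresses
    text = re.sub(r"\+?\d[\d\s-]{7,}\d", "[PHONE]", text)       # phone-like numbers
    return text

print(mask("Contact student parent at parent@example.com or +966 50 123 4567"))
# Contact student parent at [EMAIL] or [PHONE]
```

Automated masking supplements, but does not replace, a human review before anything leaves the restricted evidence store.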

Incident Timeline Template

Use this format during the incident.

## Incident Timeline

| Time | Event | Owner | Notes |
|---|---|---|---|
| YYYY-MM-DD HH:mm | Incident detected | TBD | TBD |
| YYYY-MM-DD HH:mm | Severity assigned | TBD | TBD |
| YYYY-MM-DD HH:mm | Containment started | TBD | TBD |
| YYYY-MM-DD HH:mm | Root cause identified | TBD | TBD |
| YYYY-MM-DD HH:mm | Fix deployed | TBD | TBD |
| YYYY-MM-DD HH:mm | Service recovered | TBD | TBD |
| YYYY-MM-DD HH:mm | Incident closed | TBD | TBD |

Incident Report Template

Use this template after resolution.

## Incident Report: INC-YYYYMMDD-001

| Field | Details |
|---|---|
| Incident ID | INC-YYYYMMDD-001 |
| Severity | SEV-1 / SEV-2 / SEV-3 / SEV-4 |
| Environment | Production / Staging / Local |
| Start Time | YYYY-MM-DD HH:mm |
| End Time | YYYY-MM-DD HH:mm |
| Duration | TBD |
| Affected Services | TBD |
| Affected Users | TBD |
| Business Impact | TBD |
| Financial Impact | TBD |
| Data Exposure | Yes / No / Under Review |
| Root Cause | Confirmed cause only |
| Resolution | Fix applied |
| Verification | How recovery was confirmed |
| Owner | Responsible person/team |

Corrective Actions

| Action | Owner | Due Date | Status |
|---|---|---|---|
| TBD | TBD | TBD | Open |

Escalation Criteria

Escalate to management when:

  • Production is down.
  • Payment or wallet integrity may be affected.
  • Student data may be exposed.
  • Tenant isolation may be broken.
  • Credentials may be misassigned or compromised.
  • Backup or restore capability is impaired.
  • Regulatory, contractual, or school communication may be required.

Closure Criteria

An incident can be closed only when:

| Check | Required |
|---|---|
| Service restored | Yes |
| Root cause confirmed | Yes |
| Impact assessed | Yes |
| Logs preserved where needed | Yes |
| Financial reconciliation completed, if applicable | Yes |
| Data exposure assessment completed, if applicable | Yes |
| Corrective actions documented | Yes |
| Owner assigned for follow-up actions | Yes |

Open Items

| Item | Status | Notes |
|---|---|---|
| Confirm incident commander | Confirmed — CTO | See Confirmed Incident Leadership above |
| Confirm escalation contacts | Confirmed — CTO, Product Manager, Manager Support, CEO | See Confirmed Incident Leadership above |
| Confirm communication channels | Confirmed — Slack | Primary incident channel |
| Confirm incident ID format | Needs Technical Verification | Required |
| Confirm evidence storage location | Needs Technical Verification | Required |
| Confirm customer notification policy | Needs Technical Verification | Required |
| Confirm security incident legal review process | Needs Technical Verification | Required |
| Confirm financial reconciliation owner | Needs Technical Verification | Required |

Rules

  • Do not make unconfirmed claims during an incident.
  • Do not delete logs.
  • Do not manually change financial records without approval.
  • Do not expose customer data in incident messages.
  • Do not share secrets or internal infrastructure details externally.
  • Document root cause only after confirmation.
  • Use Under Review when impact is not yet confirmed.