Incident Response
Purpose
This page documents the incident response process for Maqsafy services.
The purpose is to provide a controlled process for detecting, analyzing, containing, resolving, and documenting operational and security incidents.
Incident Response Scope
This process applies to:
- Application outages
- API failures
- Database issues
- Queue failures
- Redis failures
- Nginx or reverse proxy issues
- Payment or wallet incidents
- Credential-related incidents
- Data access or RBAC incidents
- Security incidents
- Backup or restore failures
- Integration failures
Incident Response References
NIST SP 800-61 Rev. 3 is the current NIST guidance on incident response recommendations and considerations within cybersecurity risk management. NIST's incident response project states that its goal is to help organizations prepare for incident response, reduce incident impact, and improve detection, response, and recovery activities.
CISA's guidance on incident response plan basics states that an incident response plan should clarify roles and responsibilities and guide key activities during incidents.
Incident Severity Levels
| Severity | Description | Example |
|---|---|---|
| SEV-1 Critical | Major production outage or high-impact security/financial incident | Platform down, payment/wallet integrity issue, data exposure |
| SEV-2 High | Major feature or service degraded with business impact | Login outage, API failure, queue backlog affecting users |
| SEV-3 Medium | Partial issue with workaround available | Report failure, delayed notifications, non-critical integration issue |
| SEV-4 Low | Minor issue with limited impact | UI bug, single non-critical job failure |
Severity Assignment Rules
- If financial integrity may be affected, classify as SEV-1 or SEV-2 until confirmed.
- If student data may be exposed, classify as SEV-1 until confirmed.
- If tenant isolation may be broken, classify as SEV-1 until confirmed.
- If credential deactivation/replacement is incorrect, classify as SEV-1 or SEV-2 based on impact.
- If production is fully unavailable, classify as SEV-1.
- If the issue is limited to a staging or local environment, classify based on release impact.
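The rules above can be sketched as a small classifier. The flag names are illustrative assumptions, not part of any existing Maqsafy API; a person still confirms the final level.

```python
# Illustrative sketch of the severity-assignment rules above.
# All flag names are hypothetical; this is not an existing API.
def assign_severity(*, production_down=False, data_exposure_risk=False,
                    tenant_isolation_risk=False, financial_risk=False,
                    credential_risk=False):
    """Return the highest matching severity; unconfirmed risks stay high."""
    if production_down or data_exposure_risk or tenant_isolation_risk:
        return "SEV-1"
    if financial_risk or credential_risk:
        # Hold at SEV-2 minimum until impact is confirmed;
        # raise to SEV-1 if evidence supports it.
        return "SEV-2"
    return "SEV-3"
```

The sketch only encodes the "until confirmed" floor; downgrading below it requires explicit confirmation.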
Incident Lifecycle
1. Detection
Incident detection may come from:
- Monitoring alerts
- User reports
- Support tickets
- Application logs
- Payment gateway alerts
- Failed backup alerts
- Security events
- Team observation
2. Triage
During triage, confirm:
| Check | Required |
|---|---|
| Environment identified | Yes |
| Affected service identified | Yes |
| User impact estimated | Yes |
| Financial impact estimated | If applicable |
| Data exposure risk assessed | If applicable |
| Tenant isolation risk assessed | If applicable |
| Incident severity assigned | Yes |
| Incident owner assigned | Yes |
3. Containment
Containment actions may include:
- Disable affected feature temporarily.
- Block suspicious access.
- Stop affected queue job.
- Isolate compromised account.
- Disable affected integration.
- Put application in maintenance mode if required.
- Prevent duplicate financial processing.
- Freeze high-risk withdrawal or refund workflow if needed.
4. Eradication
Eradication removes the confirmed root cause.
Examples:
- Fix code defect.
- Correct configuration.
- Patch vulnerable package.
- Rotate exposed credentials.
- Fix RBAC or tenant isolation logic.
- Correct failed migration.
- Fix Nginx upstream configuration.
- Repair queue worker configuration.
5. Recovery
Recovery restores normal operation.
Recovery checks:
| Check | Expected Result |
|---|---|
| Application health endpoint | Healthy |
| Login | Working |
| API endpoints | Working |
| Queue workers | Processing jobs |
| Redis | Reachable |
| Database | Reachable |
| Nginx | No critical errors |
| Payment flow | Working, if impacted |
| Wallet ledger | Reconciled, if impacted |
| Credential flow | Correct, if impacted |
| Logs | No new critical exceptions |
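One way to gate recovery on the checklist above is a simple completeness test. The check names are illustrative, and the "if impacted" rows become required only when passed in for the specific incident.

```python
# Always-required recovery checks from the table above (names assumed).
REQUIRED_CHECKS = {"health_endpoint", "login", "api_endpoints",
                   "queue_workers", "redis", "database", "nginx", "logs"}

def recovery_complete(results, conditional=()):
    """results: dict of check name -> bool.
    `conditional` lists the 'if impacted' checks (e.g. payment_flow,
    wallet_ledger, credential_flow) that apply to this incident."""
    required = REQUIRED_CHECKS | set(conditional)
    # Every required check must be present AND passing.
    return required <= results.keys() and all(results[c] for c in required)
```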
6. Post-Incident Review
Post-incident review must document:
- Timeline
- Root cause
- Impact
- Detection method
- Resolution
- Preventive actions
- Owner of corrective actions
- Evidence collected
- Follow-up deadline
Incident Communication Channel
| Channel | Purpose | Status |
|---|---|---|
| Slack | Primary incident communication channel for all severity levels | Confirmed |
Incident Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Owns incident coordination and decisions |
| Technical Lead | Leads technical investigation and fix |
| Operations Owner | Handles infrastructure, deployment, monitoring, and recovery |
| Product Owner | Confirms business impact and user-facing priority |
| Support Owner | Handles user/support communication |
| Security Owner | Leads security assessment and evidence handling |
| Finance Owner | Reviews wallet, payment, refund, and settlement impact |
| Communications Owner | Prepares internal or external communication if required |
Confirmed Incident Leadership
| Name / Role | Incident Leadership Role | Status |
|---|---|---|
| CTO | Incident Commander / Technical Leadership | Confirmed |
| Product Manager | Product Impact and Priority | Confirmed |
| Manager Support | Support and User Communication | Confirmed |
| CEO | Executive Escalation | Confirmed |
Role Assignment
| Incident Type | Required Owners |
|---|---|
| Application outage | Incident Commander, Technical Lead, Operations Owner |
| Payment or wallet incident | Incident Commander, Technical Lead, Finance Owner |
| Credential incident | Incident Commander, Technical Lead, Product Owner |
| Data exposure incident | Incident Commander, Security Owner, Technical Lead |
| Integration failure | Technical Lead, Operations Owner, Integration Owner |
| Backup failure | Operations Owner, Technical Lead |
Communication Rules
Internal Communication
Internal updates should include:
| Field | Description |
|---|---|
| Incident ID | Unique incident reference |
| Severity | SEV-1 / SEV-2 / SEV-3 / SEV-4 |
| Status | Investigating / Identified / Contained / Recovering / Resolved |
| Impact | Affected users, services, or operations |
| Current action | What is being done now |
| Owner | Responsible person/team |
| Next update | Expected update time |
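The fields above can be rendered into a consistent Slack update. The layout below is a suggestion for illustration, not a mandated format.

```python
def format_update(incident_id, severity, status, impact,
                  current_action, owner, next_update):
    """Render an internal incident update containing every required field."""
    return "\n".join([
        f"[{incident_id}] {severity} | {status}",
        f"Impact: {impact}",
        f"Current action: {current_action}",
        f"Owner: {owner}",
        f"Next update: {next_update}",
    ])
```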
External Communication
External communication is issued only when approved by management or required by contract, regulation, or business impact.
External messages must not include:
- Internal IPs
- Secrets
- Private logs
- Detailed exploit information
- Customer personal data
- Unconfirmed assumptions
Incident Types and Runbooks
Application Down
Symptoms
- Health endpoint fails.
- Users cannot access dashboard or app.
- Nginx returns 502, 503, or 504.
First Checks
curl -I https://example.com            # confirm the HTTP status from outside
sudo nginx -t                          # validate Nginx configuration syntax
sudo tail -f /var/log/nginx/error.log  # watch for upstream/connection errors
docker ps                              # confirm application containers are running
docker logs <container-name>           # inspect the failing container
Containment
- Confirm affected service.
- Restart failed service if safe.
- Roll back recent deployment if confirmed as cause.
- Notify stakeholders if production impact is high.
Login Failure
Symptoms
- Users cannot log in.
- API returns unauthorized or validation errors.
- OTP delivery fails.
First Checks
tail -f storage/logs/laravel.log  # watch for authentication exceptions
php artisan route:list            # confirm auth routes are registered
php artisan config:clear          # rule out stale cached config; avoid re-caching until the cause is known
Containment
- Confirm if issue affects all users or a specific role.
- Check authentication provider or OTP provider.
- Check rate limits and failed login logs.
- Avoid disabling security controls without approval.
Queue Backlog
Symptoms
- Notifications delayed.
- Reports delayed.
- Exports not generated.
- Background jobs not processing.
First Checks
php artisan queue:failed              # list failed jobs
php artisan queue:work --once         # process one job in the foreground to confirm workers can run
php artisan queue:restart             # signal workers to restart after their current job
docker ps                             # confirm worker containers are up
docker logs <queue-worker-container>  # inspect worker output
Containment
- Restart queue workers.
- Pause non-critical heavy jobs.
- Prioritize financial or user-impacting jobs.
- Review failed jobs before retrying sensitive jobs.
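One way to "prioritize financial or user-impacting jobs" is to drain named queues in a fixed priority order; the queue names below are assumptions for illustration. In Laravel the equivalent effect comes from the worker's queue option, e.g. `php artisan queue:work --queue=payments,notifications,default`.

```python
# Hypothetical queue names; adjust to the actual worker configuration.
QUEUE_PRIORITY = ["payments", "notifications", "reports", "default"]

def drain_order(jobs):
    """jobs: list of (job_id, queue_name); financial queues drain first.
    Unknown queues sort after all named ones."""
    rank = {name: i for i, name in enumerate(QUEUE_PRIORITY)}
    return sorted(jobs, key=lambda job: rank.get(job[1], len(QUEUE_PRIORITY)))
```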
Redis Failure
Symptoms
- Cache, sessions, or queues fail.
- Logs show Redis connection errors.
- Queue jobs stop processing.
First Checks
docker ps                               # confirm the Redis container is running
docker logs <redis-container>           # check for startup or memory errors
docker network ls                       # list Docker networks
docker network inspect <network-name>   # confirm app and Redis share a network
redis-cli -h <redis-host> -p 6379 ping  # expect PONG
Containment
- Confirm Redis is running.
- Confirm service name and network.
- Restart dependent services if required.
- Avoid clearing production cache without understanding impact.
Database Incident
Symptoms
- Login fails.
- Dashboard pages fail to load.
- API returns database errors.
- Reports or financial workflows fail.
First Checks
php artisan migrate:status          # check for pending or failed migrations
mysql -h <db-host> -u <db-user> -p  # confirm direct connectivity to the database
tail -f storage/logs/laravel.log    # watch for query or connection exceptions
Containment
- Stop destructive jobs or migrations.
- Preserve logs.
- Confirm backup availability.
- Avoid manual production data edits without approval and audit record.
Payment or Wallet Incident
Symptoms
- Wallet balance mismatch.
- Payment succeeded but wallet not credited.
- Refund duplicated or failed.
- Settlement mismatch.
First Checks
- Check payment gateway reference.
- Check transaction table.
- Check wallet ledger.
- Check failed jobs.
- Check webhook logs.
- Check reconciliation reports.
Containment
- Stop duplicate processing.
- Freeze affected transaction workflow if needed.
- Do not manually adjust balances without approval.
- Use adjustment or reversal records instead of silent edits.
- Preserve all references and logs.
Recovery
- Reconcile gateway status with internal ledger.
- Correct through controlled financial records.
- Confirm no duplicate wallet effect.
- Document final financial impact.
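Reconciling the gateway against the internal ledger reduces to a set comparison on transaction references. The data shape assumed here (reference mapped to amount in minor units) is for illustration only.

```python
def reconcile(gateway, ledger):
    """gateway/ledger: dicts of transaction reference -> amount (minor units).
    Returns references missing from the ledger, missing from the gateway,
    and present in both but with different amounts."""
    missing_in_ledger = sorted(gateway.keys() - ledger.keys())
    missing_in_gateway = sorted(ledger.keys() - gateway.keys())
    mismatched = sorted(ref for ref in gateway.keys() & ledger.keys()
                        if gateway[ref] != ledger[ref])
    return missing_in_ledger, missing_in_gateway, mismatched
```

Every reference returned by the sketch becomes a line item for the controlled correction records, never a silent balance edit.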
Credential Incident
Symptoms
- Credential assigned to wrong student.
- Credential delivery status incorrect.
- Credential deactivated incorrectly.
- Credential duplication risk.
First Checks
- Check credential inventory record.
- Check student and wallet linkage.
- Check lifecycle events.
- Check audit logs.
- Check actor and timestamp.
Containment
- Disable affected credential if risk is confirmed and approved.
- Prevent further use if credential integrity is uncertain.
- Do not delete credential history.
- Record all corrective actions.
Recovery
- Correct assignment through approved workflow.
- Record reason.
- Confirm lifecycle history.
- Notify responsible school user if needed.
RBAC or Tenant Isolation Incident
Symptoms
- User sees data outside assigned scope.
- School Manager sees another school’s data.
- Supplier sees another supplier’s order.
- Operator sees another cafeteria’s records.
First Checks
- Identify user role and scope.
- Identify affected endpoint.
- Review query filters.
- Review authorization middleware.
- Check access logs and audit logs.
Containment
- Disable affected endpoint or feature if needed.
- Remove excessive permission.
- Patch backend authorization.
- Review similar endpoints.
- Force logout affected sessions if required.
Recovery
- Deploy fix.
- Test positive and negative access cases.
- Confirm no cross-tenant data remains visible.
- Document data exposure assessment.
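The "negative access cases" above can be checked mechanically: every record returned for a tenant must carry that tenant's identifier. The record shape and the `school_id` key are assumptions for illustration.

```python
def cross_tenant_leaks(records, tenant_id, key="school_id"):
    """Return records that do NOT belong to `tenant_id`.
    After the fix is deployed, this list must be empty for every
    tested endpoint and role."""
    return [r for r in records if r.get(key) != tenant_id]
```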
Evidence Handling
Preserve evidence for security, financial, and data incidents.
Evidence may include:
- Application logs
- Nginx logs
- Database audit records
- Payment gateway references
- Queue failed job records
- User and role records
- Credential lifecycle events
- Screenshots with sensitive data masked
- Timeline of actions
Evidence Rules
- Do not edit original logs.
- Do not share raw logs externally without review.
- Mask personal data before sharing screenshots.
- Preserve timestamps and reference IDs.
- Keep evidence access restricted.
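"Mask personal data before sharing" applies to log excerpts as well as screenshots. A minimal redaction pass might look like the following; the two patterns are examples, not a complete PII catalogue.

```python
import re

# Example redaction patterns; extend for names, IDs, card data, etc.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\+?\d[\d -]{7,}\d"), "<phone>"),
]

def mask(line: str) -> str:
    """Replace matched personal data with placeholders before sharing."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Redaction is for sharing copies only; the original logs stay untouched per the rules above.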
Incident Timeline Template
Use this format during the incident.
## Incident Timeline
| Time | Event | Owner | Notes |
|---|---|---|---|
| YYYY-MM-DD HH:mm | Incident detected | TBD | TBD |
| YYYY-MM-DD HH:mm | Severity assigned | TBD | TBD |
| YYYY-MM-DD HH:mm | Containment started | TBD | TBD |
| YYYY-MM-DD HH:mm | Root cause identified | TBD | TBD |
| YYYY-MM-DD HH:mm | Fix deployed | TBD | TBD |
| YYYY-MM-DD HH:mm | Service recovered | TBD | TBD |
| YYYY-MM-DD HH:mm | Incident closed | TBD | TBD |
Incident Report Template
Use this template after resolution.
## Incident Report: INC-YYYYMMDD-001
| Field | Details |
|---|---|
| Incident ID | INC-YYYYMMDD-001 |
| Severity | SEV-1 / SEV-2 / SEV-3 / SEV-4 |
| Environment | Production / Staging / Local |
| Start Time | YYYY-MM-DD HH:mm |
| End Time | YYYY-MM-DD HH:mm |
| Duration | TBD |
| Affected Services | TBD |
| Affected Users | TBD |
| Business Impact | TBD |
| Financial Impact | TBD |
| Data Exposure | Yes / No / Under Review |
| Root Cause | Confirmed cause only |
| Resolution | Fix applied |
| Verification | How recovery was confirmed |
| Owner | Responsible person/team |
Corrective Actions
| Action | Owner | Due Date | Status |
|---|---|---|---|
| TBD | TBD | TBD | Open |
Escalation Criteria
Escalate to management when:
- Production is down.
- Payment or wallet integrity may be affected.
- Student data may be exposed.
- Tenant isolation may be broken.
- Credentials may be misassigned or compromised.
- Backup or restore capability is impaired.
- Regulatory, contractual, or school communication may be required.
Closure Criteria
An incident can be closed only when:
| Check | Required |
|---|---|
| Service restored | Yes |
| Root cause confirmed | Yes |
| Impact assessed | Yes |
| Logs preserved where needed | Yes |
| Financial reconciliation completed, if applicable | Yes |
| Data exposure assessment completed, if applicable | Yes |
| Corrective actions documented | Yes |
| Owner assigned for follow-up actions | Yes |
Open Items
| Item | Status | Notes |
|---|---|---|
| Confirm incident commander | Confirmed — CTO | See Confirmed Incident Leadership above |
| Confirm escalation contacts | Confirmed — CTO, Product Manager, Manager Support, CEO | See Confirmed Incident Leadership above |
| Confirm communication channels | Confirmed — Slack | Primary incident channel |
| Confirm incident ID format | Needs Technical Verification | Required |
| Confirm evidence storage location | Needs Technical Verification | Required |
| Confirm customer notification policy | Needs Technical Verification | Required |
| Confirm security incident legal review process | Needs Technical Verification | Required |
| Confirm financial reconciliation owner | Needs Technical Verification | Required |
Rules
- Do not make unconfirmed claims during an incident.
- Do not delete logs.
- Do not manually change financial records without approval.
- Do not expose customer data in incident messages.
- Do not share secrets or internal infrastructure details externally.
- Document root cause only after confirmation.
- Use “Under Review” when impact is not yet confirmed.