
Cloud Incident Response: A Step-by-Step Guide for AWS, Azure, and GCP

Learn how to respond to cloud security incidents effectively. This guide covers preparation, detection, containment, and recovery.

By InventiveHQ Team

The average time to identify a security breach is 186 days. In cloud environments, where attackers can pivot quickly between services and accounts, delayed response dramatically increases damage.

This guide provides a practical incident response framework for AWS, Azure, and GCP—from preparation through recovery.


Why Cloud Incident Response Is Different

Cloud incident response shares core principles with traditional IR, but the environment introduces unique challenges that security teams must understand.

The shared responsibility model fundamentally changes who handles what. Your cloud provider manages infrastructure-level incidents, but configuration errors and data breaches remain your responsibility. This division creates gaps where incidents can fall through the cracks if roles aren't clearly defined.

Unlike on-premises environments, you have no physical access to systems. Forensic investigation relies entirely on logs and API data—you can't pull a hard drive or capture memory directly. This makes comprehensive logging not just helpful but essential.

Cloud resources are also inherently dynamic. Containers spin up and terminate in seconds. Serverless functions execute and disappear. Auto-scaling groups replace instances constantly. If you don't capture evidence quickly, the resources you need to investigate may simply cease to exist.

Modern cloud architectures add multi-account complexity. A single incident can span multiple accounts, cross regional boundaries, or even involve multiple cloud providers. Attackers know this and deliberately move laterally to complicate investigation and response.

Finally, cloud attacks are predominantly API-driven. Rather than deploying traditional malware, attackers who compromise credentials typically abuse cloud APIs—creating resources, exfiltrating data, or establishing persistence through IAM manipulation.

Despite these differences, the fundamentals remain consistent. The NIST incident response framework (Preparation, Detection, Containment, Eradication, Recovery, Lessons Learned) still applies. Thorough documentation remains critical for legal and compliance purposes. Clear communication with stakeholders is essential. And post-incident review continues to drive improvement.


Phase 1: Preparation

Effective incident response requires preparation before incidents occur. The time to build your capabilities is not during an active breach.

Build Your Team

Define clear roles and responsibilities before an incident forces you to improvise:

| Role | Responsibilities |
|------|------------------|
| Incident Commander | Overall coordination, decision-making |
| Security Analyst | Investigation, forensics, technical analysis |
| Cloud Engineer | Environment changes, containment actions |
| Communications | Internal/external communications, legal coordination |
| Management | Escalation decisions, resource allocation |

Document Runbooks

Create detailed procedures for the incident types you're most likely to encounter: compromised credentials, data exposure from public buckets or leaked secrets, cryptomining from unauthorized compute, account takeover, and insider threats. Each runbook should include specific commands, decision trees, and escalation criteria.

Enable Logging

You cannot investigate what you didn't log. This fundamental truth makes logging configuration your most important preparation activity.

For AWS, enable CloudTrail in all regions with both management events and data events for S3. Configure VPC Flow Logs to capture network traffic patterns. Activate GuardDuty for threat detection and enable S3 server access logging for detailed bucket activity.
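
As a minimal sketch of the AWS side, assuming boto3 is installed, credentials have the necessary CloudTrail and GuardDuty permissions, and the destination bucket already exists with the required bucket policy (the trail and bucket names are placeholders):

```python
import boto3

cloudtrail = boto3.client("cloudtrail")
guardduty = boto3.client("guardduty")

# Create a multi-region trail that captures management events.
cloudtrail.create_trail(
    Name="org-incident-response-trail",       # placeholder trail name
    S3BucketName="example-cloudtrail-logs",   # placeholder bucket
    IsMultiRegionTrail=True,
    EnableLogFileValidation=True,
)
cloudtrail.start_logging(Name="org-incident-response-trail")

# Add S3 data events so object-level reads and writes are recorded.
cloudtrail.put_event_selectors(
    TrailName="org-incident-response-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{"Type": "AWS::S3::Object", "Values": ["arn:aws:s3"]}],
    }],
)

# Enable GuardDuty threat detection in the current region.
guardduty.create_detector(Enable=True)
```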

For Azure, configure Activity Logs and Diagnostic Logs comprehensively. Set up Azure Monitor for centralized collection and enable Microsoft Defender for Cloud for threat detection and security recommendations.

For GCP, enable Cloud Audit Logs for both Admin Activity and Data Access. Configure VPC Flow Logs and activate Security Command Center for centralized visibility.

Regardless of provider, maintain logs for at least 90 days to support incident investigation, and consider 1+ years for compliance requirements.
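
For AWS, one way to enforce that retention window is an S3 lifecycle rule on the log bucket. A minimal sketch, assuming boto3 and the same hypothetical bucket name as above:

```python
import boto3

s3 = boto3.client("s3")

# Assumption: "example-cloudtrail-logs" is the bucket receiving log exports.
# Keep logs immediately available for 90 days, archive to Glacier, and
# expire after roughly 13 months.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cloudtrail-logs",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "log-retention",
            "Filter": {"Prefix": ""},   # apply to every object in the bucket
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 400},
        }]
    },
)
```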

Configure Alerting

Your detection capability depends on alerting configuration. Focus on high-priority events: root or admin account usage, console logins from unusual locations, security group modifications, IAM policy changes, and high-severity findings from GuardDuty, Defender, or Security Command Center. These alerts should reach your security team within minutes, not hours.
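
As one concrete example, here is a hedged sketch of an EventBridge rule that forwards root console sign-ins to an SNS topic. It assumes the topic already exists, your security team subscribes to it, and the topic policy allows EventBridge to publish; the ARN is a placeholder:

```python
import json
import boto3

events = boto3.client("events")

# Match console sign-in events made with the root identity.
root_signin_pattern = {
    "detail-type": ["AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}},
}

events.put_rule(
    Name="alert-root-console-signin",
    EventPattern=json.dumps(root_signin_pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="alert-root-console-signin",
    Targets=[{
        "Id": "security-sns",
        "Arn": "arn:aws:sns:us-east-1:123456789012:security-alerts",  # placeholder
    }],
)
```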

Practice

Conduct tabletop exercises quarterly to build muscle memory before real incidents strike. Walk through realistic scenarios with your full team, identify gaps in procedures and tooling, and update runbooks based on what you learn. Teams that practice respond faster and make fewer mistakes under pressure.


Phase 2: Detection and Analysis

When an incident is suspected, move quickly to confirm the situation and assess its scope.

Initial Triage

Your first minutes should focus on answering four critical questions. First, what happened—what triggered the alert or report? Second, what's affected—which accounts, services, and data are involved? Third, is it ongoing—is the attacker still active in your environment? Fourth, what's the impact—what are the business, data, and compliance implications?

Collect Evidence

Preserve evidence immediately. Cloud resources can be terminated by attackers covering their tracks, or logs can roll over before you extract them.

For AWS, snapshot the EBS volumes of affected instances before isolating or terminating them, export the relevant CloudTrail events, and record instance metadata such as attached IAM role, security groups, and launch time. Preserve VPC Flow Logs and GuardDuty findings for the incident window.

For Azure, snapshot the managed disks of affected virtual machines, export the Activity Log and relevant Diagnostic Logs, and capture network security group flow logs. Export Defender for Cloud alerts before they age out.

For GCP, snapshot affected persistent disks, export Cloud Audit Logs for the relevant projects, and preserve VPC Flow Logs and Security Command Center findings.
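
On the AWS side, for example, a minimal sketch of snapshotting a suspect instance's volumes for later forensics, assuming boto3 and a hypothetical instance ID and incident number:

```python
import boto3

ec2 = boto3.client("ec2")

# Assumption: this is the affected instance under investigation.
instance_id = "i-0123456789abcdef0"

# Find the volumes attached to the instance and snapshot each one.
reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        for mapping in instance.get("BlockDeviceMappings", []):
            volume_id = mapping["Ebs"]["VolumeId"]
            snapshot = ec2.create_snapshot(
                VolumeId=volume_id,
                Description=f"IR evidence: {instance_id} {volume_id}",
            )
            # Tag the snapshot so it is easy to find and protect from cleanup.
            ec2.create_tags(
                Resources=[snapshot["SnapshotId"]],
                Tags=[{"Key": "incident", "Value": "IR-0000"}],  # placeholder ID
            )
```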

Investigate

Each incident type requires different investigative focus.

For credential compromise, determine which credentials were compromised and when they were last rotated. Analyze CloudTrail or equivalent logs to understand what API calls the attacker made and what resources they accessed or created. Look for patterns indicating lateral movement or data exfiltration.
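
A hedged sketch of pulling that API activity from CloudTrail with boto3, assuming you already know the compromised access key ID (note that lookup_events only covers roughly the last 90 days of management events):

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Assumption: "AKIAEXAMPLEKEY" is the access key believed to be compromised.
pages = cloudtrail.get_paginator("lookup_events").paginate(
    LookupAttributes=[
        {"AttributeKey": "AccessKeyId", "AttributeValue": "AKIAEXAMPLEKEY"}
    ]
)

# Print a rough timeline of what the key was used for.
for page in pages:
    for event in page["Events"]:
        print(event["EventTime"], event["EventName"], event.get("Username", "-"))
```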

For data exposure, identify exactly what data was exposed and for how long. Determine who accessed the exposed resources during the exposure window by analyzing access logs. Assess whether notification is required under GDPR, HIPAA, or other applicable regulations.

For unauthorized resources like cryptomining, catalog what resources were created and in which regions. Calculate the billing impact and verify whether the resources are still running. Check for associated IAM entities or network configurations the attacker may have created.
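
A sketch of sweeping every region for instances launched during the incident window, assuming boto3 and a placeholder start date:

```python
from datetime import datetime, timezone

import boto3

# Assumption: the incident is believed to have started on this date.
incident_start = datetime(2025, 1, 15, tzinfo=timezone.utc)

regions = [r["RegionName"] for r in boto3.client("ec2").describe_regions()["Regions"]]

for region in regions:
    ec2 = boto3.client("ec2", region_name=region)
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            if instance["LaunchTime"] >= incident_start:
                print(region, instance["InstanceId"], instance["InstanceType"],
                      instance["LaunchTime"])
```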


Phase 3: Containment

Stop the bleeding before conducting a full investigation. Speed matters more than perfection at this stage.

Short-Term Containment

For credential compromise, deactivate or delete the affected access keys, force a password reset, revoke active sessions, and attach an explicit deny policy to the affected identity until the investigation completes.

For data exposure, remove public access from the exposed bucket or storage account, tighten the resource policy, and rotate any secrets or keys that were stored in the exposed data.

For unauthorized resources, stop or isolate the rogue instances rather than deleting them outright (you may need them as evidence), and remove any IAM entities or network paths the attacker used to create them.
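
A minimal containment sketch for AWS with boto3, assuming a hypothetical user name, key ID, instance, and a pre-created quarantine security group with no inbound or outbound rules:

```python
import boto3

iam = boto3.client("iam")
ec2 = boto3.client("ec2")

# 1. Disable the compromised access key (placeholder user and key ID).
iam.update_access_key(
    UserName="compromised-user",
    AccessKeyId="AKIAEXAMPLEKEY",
    Status="Inactive",
)

# 2. Move the suspect instance into an isolated "quarantine" security group
#    so it can no longer reach anything while evidence is preserved.
ec2.modify_instance_attribute(
    InstanceId="i-0123456789abcdef0",
    Groups=["sg-0quarantine000000"],   # placeholder quarantine security group
)
```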

Long-Term Containment

After immediate threats are neutralized, rotate all potentially compromised credentials—not just the ones you know were compromised. Implement additional monitoring on affected accounts to catch any attacker activity you may have missed. Apply emergency security group restrictions to limit blast radius, and enable additional logging if your investigation revealed gaps.


Phase 4: Eradication

Remove the attacker's access and any persistence mechanisms they established.

Check for Persistence

Sophisticated attackers establish persistence to survive credential rotation and system rebuilds. You must hunt for these mechanisms systematically.

In IAM, look for new users or access keys that weren't part of your original configuration. Check for modified trust policies on roles that could allow external access. Examine SAML providers, identity providers, and cross-account role trust relationships—all favorite persistence mechanisms for cloud-savvy attackers.
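
A hedged sketch of the IAM portion of that hunt, assuming boto3 and a placeholder incident start date; it flags users and access keys created after the incident began and dumps role trust policies for manual review:

```python
from datetime import datetime, timezone

import boto3

iam = boto3.client("iam")
incident_start = datetime(2025, 1, 15, tzinfo=timezone.utc)  # placeholder

# Flag IAM users and access keys created since the incident began.
for page in iam.get_paginator("list_users").paginate():
    for user in page["Users"]:
        if user["CreateDate"] >= incident_start:
            print("New user:", user["UserName"], user["CreateDate"])
        for key in iam.list_access_keys(UserName=user["UserName"])["AccessKeyMetadata"]:
            if key["CreateDate"] >= incident_start:
                print("New access key:", user["UserName"], key["AccessKeyId"])

# Review every role's trust policy for unexpected external principals.
for page in iam.get_paginator("list_roles").paginate():
    for role in page["Roles"]:
        print(role["RoleName"], role["AssumeRolePolicyDocument"]["Statement"])
```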

For compute resources, search for backdoor AMIs or container images that would reinfect rebuilt systems. Check for modified user data scripts, scheduled tasks, cron jobs, and SSH authorized keys on any instances that weren't terminated.

In network configuration, review security group rule additions that might provide backdoor access. Look for unexpected VPC peering connections or VPN gateway modifications that could allow the attacker to return.

Remove Attacker Access

Work through eradication systematically (a sketch of the IAM cleanup follows the list):

  1. Rotate credentials for all affected users and roles
  2. Delete unauthorized IAM entities (users, roles, policies)
  3. Terminate compromised compute resources
  4. Remove unauthorized network configurations
  5. Invalidate compromised secrets in Secrets Manager or Key Vault
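
For steps 1 and 2 on AWS, a hedged boto3 sketch that removes a user identified as attacker-created; IAM requires detaching policies and deleting keys before the user itself can be removed, and a real cleanup would also cover group memberships, MFA devices, and login profiles:

```python
import boto3

iam = boto3.client("iam")
rogue_user = "attacker-created-user"  # placeholder identified during investigation

# Delete the user's access keys first.
for key in iam.list_access_keys(UserName=rogue_user)["AccessKeyMetadata"]:
    iam.delete_access_key(UserName=rogue_user, AccessKeyId=key["AccessKeyId"])

# Detach managed policies attached to the user.
for policy in iam.list_attached_user_policies(UserName=rogue_user)["AttachedPolicies"]:
    iam.detach_user_policy(UserName=rogue_user, PolicyArn=policy["PolicyArn"])

# Remove inline policies, then the user itself.
for name in iam.list_user_policies(UserName=rogue_user)["PolicyNames"]:
    iam.delete_user_policy(UserName=rogue_user, PolicyName=name)
iam.delete_user(UserName=rogue_user)
```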

Phase 5: Recovery

Restore normal operations while implementing improved security controls.

Restore Services

Recovery should proceed carefully:

  1. Restore from known-good backups if data was compromised
  2. Rebuild affected compute resources from clean, verified images
  3. Validate configuration against security baselines before reconnecting
  4. Gradually restore connectivity while monitoring for anomalies

Verify Security

Before declaring recovery complete, run vulnerability scans on all restored resources. Verify that logging is functioning correctly and will capture any future suspicious activity. Confirm security groups are properly configured and validate that IAM policies match your intended baseline. Check one final time for any remaining unauthorized resources the attacker may have created.
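
As one concrete verification check, a boto3 sketch that flags security group rules allowing inbound traffic from anywhere (0.0.0.0/0):

```python
import boto3

ec2 = boto3.client("ec2")

# Flag any security group rule open to the entire internet.
for page in ec2.get_paginator("describe_security_groups").paginate():
    for group in page["SecurityGroups"]:
        for rule in group["IpPermissions"]:
            for ip_range in rule.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    print("Open to the world:", group["GroupId"], group["GroupName"],
                          rule.get("FromPort"), rule.get("ToPort"))
```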


Phase 6: Lessons Learned

The post-incident review is where incidents become improvements.

Conduct Post-Incident Review

Within one to two weeks of incident closure, gather your response team for a blameless retrospective. Walk through what happened and when, examining the timeline in detail. Analyze how the incident was detected—was it your monitoring, a user report, or an external notification? Discuss what worked well in the response and what created friction or delays. Identify specific changes needed to prevent similar incidents or respond more effectively.

Document and Share

Write a formal incident report for stakeholders that includes timeline, impact, response actions, and improvements. Update your runbooks based on lessons learned—the procedures that seemed clear in theory may have failed in practice. Consider sharing sanitized learnings with the broader security community, and update your detection rules based on attack patterns you observed.

Implement Improvements

Track improvements with clear owners and deadlines. This typically includes detection gaps to close, logging improvements to implement, process changes to codify, tooling investments to make, and training requirements to address. An incident without follow-through on improvements is a missed opportunity.


Cloud-Specific Considerations

Each major cloud provider offers resources to support incident response.

AWS provides dedicated support channels for security incidents through AWS Support. AWS Artifact offers compliance documentation that may be needed for regulatory reporting. For complex investigations, consider engaging AWS's incident response services. GuardDuty findings include remediation guidance that can accelerate containment.

Azure users can leverage Microsoft's dedicated incident response services for critical situations. Defender for Cloud provides incident correlation that helps connect related alerts. Microsoft Sentinel (formerly Azure Sentinel) can automate response workflows through playbooks, and the Microsoft Incident Response team is available for critical incidents affecting enterprise customers.

GCP offers support SLAs specifically for security incidents. Chronicle provides advanced threat detection and investigation capabilities that go beyond basic logging. Security Command Center includes attack path analysis to understand how incidents unfolded, and Mandiant (now Google-owned) offers advanced IR services for complex investigations.


Frequently Asked Questions

How quickly should I respond to a cloud security incident?

Response time depends on severity. For critical incidents involving an active attacker or data breach, begin containment within minutes. For high-severity incidents, respond within hours. Have on-call procedures for after-hours incidents.

Should I contact my cloud provider during an incident?

Contact your provider if the incident involves their infrastructure, you need forensic data they control, you suspect compromise of provider-managed services, or you need emergency support beyond normal channels.

What if I don't have logs for the incident timeframe?

Without logs, investigation is severely limited. Document the gap, implement improved logging immediately, and note the limitation in your incident report. This is why preparation—specifically enabling comprehensive logging—is critical.

Do I need to notify anyone about cloud security incidents?

Notification requirements depend on what data was exposed (PII, PHI, PCI), your industry regulations (HIPAA, GDPR, PCI DSS), contractual obligations, and whether law enforcement involvement is needed. GDPR requires notification within 72 hours of becoming aware of a breach involving personal data.

How should I handle evidence during an incident?

Use write-once storage for log exports to prevent tampering. Document chain of custody for all evidence. Timestamp all evidence collection activities. Avoid modifying original resources when possible, and involve legal counsel early for significant incidents.


Take Action

  1. Enable comprehensive logging across all cloud accounts
  2. Create incident runbooks for common scenario types
  3. Define your IR team with clear roles and contact information
  4. Conduct tabletop exercises at least quarterly
  5. Review and improve after every incident

Use our Cloud Security Self-Assessment to evaluate your incident response readiness and identify preparation gaps.

For more cloud security guidance, see our comprehensive guide: 30 Cloud Security Tips for 2026.

Let's turn this knowledge into action

Get a free 30-minute consultation with our experts. We'll help you apply these insights to your specific situation.