IT resilience for organizations: business continuity, cyber recovery, NIS2 obligations, and resilient backup architectures. Guide for IT managers and CISOs.
IT resilience describes an organization's ability to handle IT disruptions — from hardware failures through cyberattacks to natural disasters — in a way that keeps business operations running or restores them within an acceptable timeframe.
The term sounds abstract. The reality is not. According to the Allianz Risk Barometer 2025, cyber incidents are the biggest business risk for organizations worldwide — for the fourth consecutive year. BSI (German Federal Office for Information Security) documents a "deeply concerning" threat level in its situation report, while Sophos 2024 found that 65% of ransomware victims needed more than a week for full recovery.
The central question is no longer: Will we be attacked? It is: How quickly can we get back to work afterwards?
This guide shows IT managers, CISOs, and executives how to build IT resilience in practice — from architecture through processes to compliance.
Reading time: approx. 22 minutes | Last updated: April 2026
IT resilience is the ability of an IT infrastructure to remain operational under adverse conditions, or to restore operability within a defined timeframe. The concept goes beyond classical high availability: availability protects against individual component failures. Resilience protects against scenarios where entire systems, locations, or infrastructure layers fail simultaneously.
The critical insight: Availability and security can fail. Resilience must not fail — it is the last safety net when all other layers have been breached.
Three developments make IT resilience an executive responsibility:
Ransomware as an existential threat: A successful attack can cause weeks to months of operational downtime. Organizations that cannot recover do not survive.
Regulatory pressure: NIS2, the KRITIS umbrella law, and sector-specific regulation (BAIT, DORA) make resilience a legal obligation, with personal liability for management.
Supply chain dependencies: A failure at a critical supplier or cloud provider can interrupt entire value chains. Resilience must be considered beyond the organization’s own boundaries.
Ransomware is malware that encrypts data on infected systems and demands a ransom for decryption — with the goal of forcing organizations and public bodies to pay by paralyzing their operations.
DORA (Digital Operational Resilience Act, EU 2022/2554) is an EU regulation that has applied to all regulated financial market participants since January 2025, setting concrete requirements for ICT risk management, backup systems (Art. 11 and 12), third-party provider management (Art. 28–30) and incident reporting.
Reality check: Prevention reduces risk but does not eliminate it. Attackers often remain undetected in networks for weeks, sometimes months — many attacks are only discovered after the damage has already been done.
The most critical pillar: restoring systems and data within an acceptable timeframe.
Multi-tier backup architecture with air-gap layer
Documented Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO)
Recovery runbooks for all critical systems
Regular recovery tests (quarterly)
Prioritized recovery sequence
Why recovery is the decisive pillar: Prevention, detection, and response can fail. Recovery is the point where it is decided whether an organization survives or not. And recovery only works when the data from which you are recovering has not also been compromised.
Learning from incidents and continuously improving resilience.
Post-incident reviews (lessons learned)
Adapting architecture to new threats
Tabletop exercises and simulations
Annual architecture reviews
Exchange in sector CERTs and ISACs
Disaster Recovery
Disaster recovery refers to the structured processes and technical measures that ensure IT systems can be restored after a severe failure (ransomware attack, hardware failure or data center outage) within a defined timeframe (RTO) and with no more than a defined amount of data loss (RPO).
RTO (Recovery Time Objective) is the maximum acceptable downtime after an IT failure; RPO (Recovery Point Objective) is the maximum acceptable data loss. Both must be demonstrably achievable with the actual backup architecture, not merely defined as aspirational targets.
3. Cyber resilience: When prevention is not enough
Cyber resilience is the specialization of IT resilience for cyberattacks. It addresses a specific problem: cyberattacks — in particular ransomware — are designed not only to disrupt individual systems but to destroy the entire recovery capability.
Modern ransomware specifically targets backup infrastructure. This means: the classical disaster recovery plan, which assumes that backups are intact, no longer works.
The scenario that cyber resilience must solve:
Production systems: encrypted ✗
Active Directory: compromised ✗
Online backup: deleted ✗
Cloud backup: deleted via compromised IAM credentials ✗
Air-gap backup: intact ✓ — was physically unreachable
Cyber resilience means: even in the absolute worst case — where an attacker had domain administrator rights and went undetected for weeks — at least one recovery path remains intact.
Principle 1: Assume breach. Assume your network will be compromised, and build your recovery architecture to work even then.
Principle 2: Isolated recovery capability. At least one recovery path must be physically separated from the production network: not just logically, not just through software policies, but physically unreachable.
Principle 3: Verified recoverability. A backup that has never been tested is not a recovery plan; it is an assumption. Quarterly recovery tests are the minimum.
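Principle 3 can be made concrete by comparing every test restore against checksums recorded at backup time. The Python sketch below is one possible approach, assuming file-level backups that are restored into a separate directory; the function names (`build_manifest`, `verify_restore`) are hypothetical and not part of any backup product.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file below root (as a relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(manifest: dict[str, str], restored_root: Path) -> list[str]:
    """Compare a test restore with the manifest recorded at backup time."""
    restored = build_manifest(restored_root)
    problems = [f"missing: {name}" for name in manifest if name not in restored]
    problems += [
        f"altered: {name}"
        for name, digest in manifest.items()
        if name in restored and restored[name] != digest
    ]
    problems += [f"unexpected: {name}" for name in restored if name not in manifest]
    return problems
```

The manifest should be written at backup time and stored with the isolated copy: a manifest computed after an undetected compromise would faithfully describe data that is already encrypted.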
Zone 1: Production zone
├── Servers, VMs, databases, applications
├── Network-connected systems
└── Attack surface: HIGH
Zone 2: Backup zone (network-connected)
├── Primary backup repository
├── Snapshot immutability (supplementary)
└── Attack surface: MEDIUM (credentials-reachable)
Zone 3: Isolated recovery zone (air gap)
├── Hardware air gap system
├── Only reachable during backup windows
├── No network interface when offline
└── Attack surface: MINIMAL
Zone 3 is the cyber resilience insurance: Even if Zone 1 and Zone 2 are fully compromised, data in Zone 3 remains intact.
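The zone model can be expressed as a small reachability check. This Python sketch is a deliberate simplification under stated assumptions: an attacker with network access can destroy data in any zone that is online, provided they also hold the credentials that zone requires, and the destructive phase of the attack happens outside the Zone 3 backup window. The `Zone` class and `surviving_zones` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    online: bool       # has an addressable network interface at the time of the attack
    needs_admin: bool  # attacker needs admin/IAM credentials to delete data here


def surviving_zones(zones: list[Zone], attacker_is_admin: bool) -> list[str]:
    """Return the zones whose data an attacker with network access cannot destroy."""
    survivors = []
    for zone in zones:
        # An offline zone is out of reach regardless of the attacker's privileges.
        reachable = zone.online and (attacker_is_admin or not zone.needs_admin)
        if not reachable:
            survivors.append(zone.name)
    return survivors


ZONES = [
    Zone("Zone 1: Production", online=True, needs_admin=False),
    Zone("Zone 2: Backup (network-connected)", online=True, needs_admin=True),
    # Assumption: the attack hits outside the backup window, so Zone 3 is offline.
    Zone("Zone 3: Isolated recovery (air gap)", online=False, needs_admin=True),
]

if __name__ == "__main__":
    print(surviving_zones(ZONES, attacker_is_admin=False))
    print(surviving_zones(ZONES, attacker_is_admin=True))
```

With ordinary user rights, Zones 2 and 3 survive; once the attacker holds domain administrator rights, only Zone 3 remains. The model ignores supplementary snapshot immutability in Zone 2, which may or may not withstand an administrative attacker depending on the product.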
IT resilience is the ability of an IT infrastructure to remain functional under adverse conditions — from cyber attacks through hardware failures to natural disasters — or to restore functionality within a defined timeframe so that critical business processes are maintained.
An air gap is the complete physical interruption of all network connections between a backup system and the rest of the IT infrastructure, so that the system has no addressable network interface in its offline state and is therefore unreachable by ransomware and attackers.
Business Continuity Management is the organizational framework within which IT resilience operates. BCM defines:
Critical business processes: Which processes must be restored first?
Maximum Tolerable Downtime (MTD): How long can a process be unavailable before the organization suffers existentially threatening damage?
Business Impact Analysis (BIA): What financial, operational, and reputational damage occurs per hour of downtime?
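The BCM figures feed directly into recovery targets: the RTO assigned to a process must not exceed its MTD, and the BIA damage rate shows what each hour of downtime costs. The following Python sketch orders processes by MTD (least tolerance first) and flags violations; the process names and figures are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    mtd_hours: float            # Maximum Tolerable Downtime (from BCM)
    damage_per_hour_eur: float  # from the Business Impact Analysis
    rto_hours: float            # RTO currently assigned to the supporting IT systems


def review(processes: list[Process]) -> list[tuple[str, bool, float]]:
    """Return (name, RTO within MTD?, damage if recovery takes exactly the RTO),
    ordered so that processes with the least tolerance for downtime come first."""
    return [
        (p.name, p.rto_hours <= p.mtd_hours, p.rto_hours * p.damage_per_hour_eur)
        for p in sorted(processes, key=lambda p: p.mtd_hours)
    ]


# Hypothetical figures for illustration only.
PROCESSES = [
    Process("Payroll", mtd_hours=72, damage_per_hour_eur=2_000, rto_hours=24),
    Process("Order processing", mtd_hours=8, damage_per_hour_eur=50_000, rto_hours=4),
    Process("Customer portal", mtd_hours=24, damage_per_hour_eur=10_000, rto_hours=48),
]

if __name__ == "__main__":
    for name, ok, damage in review(PROCESSES):
        print(f"{name}: RTO within MTD = {ok}, damage at RTO = EUR {damage:,.0f}")
```

Here the customer portal is flagged: its RTO of 48 hours exceeds the 24 hours the business can tolerate, so either the recovery architecture or the target has to change.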
RTO and RPO: The two metrics that determine everything

| Metric | Meaning | Example | Determined by |
|---|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | 4 hours: ERP system | Business requirement |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | 1 hour: transaction data | Backup frequency |
The most common mistake: RTO and RPO are defined but never tested against the actual backup architecture. An RTO of 4 hours is worthless if an actual restore takes 48 hours.
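Checking targets against reality can be automated once restore tests are timed. This Python sketch assumes that the worst-case data loss equals one full backup interval (a failure just before the next backup) and ignores backup duration; `SystemTarget` and `gaps` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemTarget:
    name: str
    rto: timedelta               # documented Recovery Time Objective
    rpo: timedelta               # documented Recovery Point Objective
    backup_interval: timedelta   # time between two successful backups
    measured_restore: timedelta  # wall-clock duration of the last restore test


def gaps(s: SystemTarget) -> list[str]:
    """List every way in which the tested reality misses the documented targets."""
    findings = []
    if s.measured_restore > s.rto:
        findings.append(f"{s.name}: restore took {s.measured_restore}, RTO is {s.rto}")
    # Simplification: worst-case data loss is one full backup interval.
    if s.backup_interval > s.rpo:
        findings.append(f"{s.name}: backup every {s.backup_interval}, RPO is {s.rpo}")
    return findings


erp = SystemTarget(
    "ERP",
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    backup_interval=timedelta(hours=24),   # hypothetical nightly backup
    measured_restore=timedelta(hours=48),  # the 48-hour restore from the example
)

if __name__ == "__main__":
    for finding in gaps(erp):
        print(finding)
```

The ERP example fails both checks: a nightly backup cannot meet a one-hour RPO, and a 48-hour restore cannot meet a four-hour RTO.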
A disaster recovery plan documents exactly how systems are restored after a total failure. It must contain the following elements:
Trigger criteria: When is the DR plan activated?
Roles and responsibilities: Who decides, who acts?
Recovery sequence: Which systems first?
Technical recovery steps: Step-by-step instructions per system
Communication plan: Who is informed when and how?
Success criteria: How do we know recovery is complete?
Critical: The DR plan must be available offline — printed, in a safe. If your IT infrastructure is compromised, your SharePoint folder with the DR plan may not be accessible either.
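The recovery sequence in a DR plan largely follows technical dependencies: a system can only be restored once the services it relies on are running. A minimal sketch, assuming the dependencies are known and acyclic, uses the topological sorter from Python's standard library (`graphlib`, Python 3.9+); the system names and dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: system -> systems that must already run before it.
DEPENDENCIES = {
    "Network core": set(),
    "Active Directory": {"Network core"},
    "DNS": {"Network core"},
    "Database cluster": {"Active Directory", "DNS"},
    "ERP": {"Database cluster"},
    "Email": {"Active Directory", "DNS"},
}


def recovery_sequence(dependencies: dict[str, set[str]]) -> list[str]:
    """Order systems so that each one is restored after all of its prerequisites.
    graphlib raises CycleError if the dependencies are circular."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, system in enumerate(recovery_sequence(DEPENDENCIES), start=1):
        print(f"{step}. {system}")
```

A `CycleError` is itself a useful finding: a circular dependency (for example, a virtualization host that authenticates against a domain controller running on that same host) means the documented sequence cannot work in a total failure.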
*Online backup: RTO only achievable if backup was not compromised — no guarantee in a ransomware attack.
Business Continuity Management
Business Continuity Management (BCM) is the organizational framework that ensures critical business processes can be maintained or restored within defined timeframes even during severe IT failures, cyber attacks or other crises.
The backup architecture is the technical foundation of every resilience strategy. Without intact backups, there is no recovery — and without recovery, there is no resilience.
In a ransomware situation, Tier 1 (online backup) and Tier 4 (cloud replication) are potentially compromised — both are network-reachable. Tier 3 (WORM) protects archive data but not necessarily current backup generations.
Tier 2 — the air-gap layer — is the resilience insurance: It contains current backup data that was physically unattackable.
WORM (Write Once, Read Many) refers to a storage principle in which data is written once and can technically no longer be altered or deleted — in hardware WORM, this immutability is a physical property of the storage controller, independent of software, operating system or user privileges.
6. NIS2 and critical infrastructure: Resilience as a legal obligation
NIS2 Directive: Resilience is no longer a recommendation
The NIS2 Directive and the NIS2 transposition law make IT resilience a legal obligation for thousands of organizations. §30 BSIG-new specifically requires:
| NIS2 requirement (§30 BSIG-new) | Resilience measure |
|---|---|
| Backup management and recovery | Multi-tier backup architecture with defined RTO/RPO |
| Crisis management | DR plan with roles, escalation, communication |
| Supply chain security | Assessment of backup hardware and software vendors |
NIS2 tightens liability: managing directors and board members are personally liable for ensuring that appropriate risk management measures are implemented. “We did not know” is not a defense — the NIS2 transposition law requires management to inform themselves regularly about the cybersecurity situation and to approve measures.
KRITIS umbrella law: Physical and IT resilience converge
The KRITIS umbrella law extends the resilience concept to physical security. For critical infrastructure operators, this means: IT resilience and physical resilience must be planned together. A data center requires not only ransomware protection but also protection against power failure, flooding, and physical access.
Affected entities must expect audits. Typical checkpoints in the area of resilience:
[ ] Is a documented data backup concept in place? (BSI CON.3)
[ ] Are RTO/RPO documented per system and verified through tests?
[ ] Is a physically separated (air-gapped) backup in place?
[ ] Are recovery tests conducted regularly and documented?
[ ] Is a DR plan with defined roles and communication paths in place?
[ ] Is the DR plan available offline (printed, in a safe)?
[ ] Are backup systems managed with separate administrator accounts?
[ ] Is management informed about the resilience measures?
KRITIS (Critical Infrastructure)
KRITIS refers to organizations and facilities whose failure or impairment would cause significant supply shortages or threats to public safety — KRITIS operators are subject to heightened IT security requirements under §8a of the German BSI Act and must demonstrate compliance to the BSI every two years.
Based on our experience from over 2,500 installations, most German organizations are at Level 2 or 3 — they have backups and basic processes, but no demonstrated recovery capability when a ransomware attack also hits the backup infrastructure.
The jump from Level 3 to Level 4 — introducing an air-gap layer and regular recovery tests — is the single most impactful step to increase IT resilience.
What is the difference between IT resilience and IT security?
IT security aims to prevent attacks. IT resilience ensures that the organization can become operational again after a successful attack. Security is a subset of resilience — resilience additionally encompasses recovery, business continuity, and adaptability.
Does every organization need an air-gap layer?
Every organization with real ransomware risk benefits from an air-gap layer. For NIS2-affected organizations and critical infrastructure operators, physically isolated backup is in practice a regulatory obligation.
The NIS2 Directive (EU 2022/2555) is an EU directive that obliges essential and important entities to implement specific cybersecurity measures, including demonstrable backup management, crisis management and reporting obligations, with personal liability for management bodies in case of non-compliance.
A resilient architecture costs a fraction of what an uncontrolled outage costs. According to Bitkom 2024, a ransomware attack causes an average of EUR 5.3 million in damage (an estimate based on aggregated total damage figures). Depending on capacity, an air-gap backup solution costs considerably less and can reduce downtime from weeks to hours.
How often should recovery tests be conducted?
Quarterly recovery tests of critical systems are the minimum — recommended by both BSI (CON.3.A11) and NIS2 requirements. Additionally, a complete recovery test of all critical systems with timing against RTO targets should be conducted annually.
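Tracking this cadence is simple enough to automate. The following Python sketch flags critical systems without a restore test in the last quarter and a missing annual full recovery test; approximating "quarterly" as 92 days and "annually" as 366 days is an assumption, and the function name is hypothetical.

```python
from datetime import date, timedelta

# Assumptions: "quarterly" is approximated as 92 days, "annually" as 366 days.
QUARTERLY = timedelta(days=92)
ANNUAL = timedelta(days=366)


def overdue_tests(last_restore_test: dict[str, date], last_full_test: date, today: date) -> list[str]:
    """Flag systems whose last restore test is older than a quarter, and a
    full recovery test older than a year."""
    findings = [
        f"{name}: last restore test {tested.isoformat()}"
        for name, tested in sorted(last_restore_test.items())
        if today - tested > QUARTERLY
    ]
    if today - last_full_test > ANNUAL:
        findings.append(f"Full recovery test: last run {last_full_test.isoformat()}")
    return findings


if __name__ == "__main__":
    # Hypothetical test history.
    history = {"ERP": date(2026, 2, 1), "Active Directory": date(2025, 10, 1)}
    for finding in overdue_tests(history, last_full_test=date(2025, 3, 1), today=date(2026, 4, 15)):
        print(finding)
```

In this example the ERP test (73 days old) is current, while the Active Directory restore test and the full recovery test are both overdue.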
Can IT resilience be measured?
Yes. The most important measurable metrics are: (1) RTO, measured in the recovery test, not estimated; (2) RPO, the actual data loss in the recovery test; (3) backup success rate, the share of successful backup jobs; (4) recovery success rate, the share of successful restore tests; (5) Time to Detect (TTD) and Time to Respond (TTR) for incidents.
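Metrics (3) to (5) can be computed from job logs and incident records. This Python sketch assumes TTD is measured from the estimated initial compromise to detection and TTR from detection to containment; organizations define these intervals differently, and the names used here are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    compromised: datetime  # best estimate of the initial compromise
    detected: datetime
    contained: datetime    # assumption: TTR is measured up to containment


def success_rate(successes: int, total: int) -> float:
    """Share of successful backup jobs or restore tests; 0.0 if nothing ran."""
    return successes / total if total else 0.0


def mean_ttd_ttr_hours(incidents: list[Incident]) -> tuple[float, float]:
    """Mean Time to Detect and mean Time to Respond, in hours."""
    ttd = mean((i.detected - i.compromised).total_seconds() / 3600 for i in incidents)
    ttr = mean((i.contained - i.detected).total_seconds() / 3600 for i in incidents)
    return ttd, ttr


if __name__ == "__main__":
    # Hypothetical figures.
    print(f"Backup success rate: {success_rate(1188, 1200):.1%}")
    incidents = [
        Incident(datetime(2026, 1, 5, 8), datetime(2026, 1, 12, 8), datetime(2026, 1, 12, 20)),
        Incident(datetime(2026, 2, 1, 0), datetime(2026, 2, 2, 0), datetime(2026, 2, 2, 6)),
    ]
    print("Mean TTD / TTR (hours):", mean_ttd_ttr_hours(incidents))
```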
What does NIS2 specifically require for IT resilience?
§30 BSIG-new requires: backup management and recovery, crisis management, business continuity, incident handling, supply chain security, and vulnerability management. Management is personally liable for implementation. Fines for essential entities: up to EUR 10m or 2% of global annual turnover, whichever is higher (important entities: up to EUR 7m or 1.4%).
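Under Art. 34 of the NIS2 Directive, the fine ceiling is whichever of two amounts is higher: a fixed sum or a share of worldwide annual turnover. A minimal sketch of that rule follows; it models the Directive's ceilings only (EUR 10m or 2% for essential entities, EUR 7m or 1.4% for important entities), and the German transposition may add nuances, for example in how turnover is determined. The function name is hypothetical.

```python
def nis2_fine_ceiling_eur(annual_turnover_eur: float, essential: bool = True) -> float:
    """Maximum administrative fine under NIS2 Art. 34: the fixed amount or the
    turnover share, whichever is higher."""
    fixed, share = (10_000_000, 0.02) if essential else (7_000_000, 0.014)
    return max(fixed, share * annual_turnover_eur)


if __name__ == "__main__":
    for turnover in (200_000_000, 1_000_000_000):
        print(f"Turnover EUR {turnover:,}: ceiling EUR {nis2_fine_ceiling_eur(turnover):,.0f}")
```

The fixed amount dominates until turnover reaches EUR 500 million (essential entities) or EUR 500 million (important entities, where 1.4% equals EUR 7 million); above that, the turnover share sets the ceiling.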
Disclaimer
This article was written by our editorial team and edited using AI. It provides a general overview and does not constitute legal advice – we recommend seeking professional advice for your specific situation.