---
title: "Recovery Runbook: What Goes in It and Who Maintains It"
date: 2026-05-27T16:10:00+02:00
author: FAST LTA
canonical_url: "https://www.fast-lta.de//en/blog/recovery-runbook-was-hineingehört-und-wer-es-pflegt"
section: "Entries: Articles"
---
### What Belongs in a Recovery Runbook

#### For Each Critical System

A runbook should be structured per critical system. For example: “Microsoft Active Directory Recovery Runbook” or “SAP ERP Recovery Runbook.”

**1. System Overview (1 page)**

- Name, purpose, business owner
- Dependencies (what does this system need to run? e.g. AD, file server, network)
- Dependents (which systems depend on this system? e.g. all clients need AD)
- Critical: restoration sequence. Active Directory must, for example, be restored BEFORE Exchange servers.
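
The restoration sequence can be derived from the documented dependencies rather than written down by hand. The following is a minimal sketch, assuming a hypothetical dependency map (the system names are examples, not a recommendation); it uses Python's standard `graphlib` module to produce an order in which every system comes after the systems it needs.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems it needs to run.
DEPENDENCIES = {
    "network": [],
    "active-directory": ["network"],
    "file-server": ["active-directory"],
    "exchange": ["active-directory"],
}


def restoration_order(dependencies):
    """Return systems in an order where every dependency comes first."""
    # TopologicalSorter expects {node: predecessors}, which matches the map.
    # It raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(dependencies).static_order())


order = restoration_order(DEPENDENCIES)
print(order)
```

In this sketch, Active Directory always appears before Exchange, matching the example above. A circular dependency raises an error, which is a useful signal that the documented dependencies need review.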

**2. RTO/RPO Definitions**

- RTO (Recovery Time Objective): How long can the outage last? e.g. “AD RTO: 2 hours”
- RPO (Recovery Point Objective): How much data loss is acceptable? e.g. “AD RPO: 1 hour” (meaning: we can tolerate losing 1 hour of password changes)
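
To make the two definitions concrete, here is a small sketch that checks a measured timeline against the AD example values (RTO 2 hours, RPO 1 hour). The function names and the timestamps are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def meets_rto(outage_start, service_restored, rto):
    """True if the measured outage duration is within the RTO."""
    return service_restored - outage_start <= rto


def meets_rpo(last_good_backup, outage_start, rpo):
    """True if the data written since the last usable backup fits the RPO."""
    return outage_start - last_good_backup <= rpo


# Hypothetical timeline for the AD example (RTO 2 h, RPO 1 h).
outage = datetime(2026, 5, 27, 9, 0)
backup = datetime(2026, 5, 27, 8, 15)     # last usable backup
restored = datetime(2026, 5, 27, 11, 30)  # service back online

rto_ok = meets_rto(outage, restored, timedelta(hours=2))
rpo_ok = meets_rpo(backup, outage, timedelta(hours=1))
print(rto_ok)  # False: the outage lasted 2.5 hours
print(rpo_ok)  # True: 45 minutes of changes were lost
```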

**3. Backup Locations and Access**

- Where are the backups for this system stored? (Tier 1 online, Tier 2 air gap, Tier 3 WORM archive)
- How do I access the backups? (IP address, port, access method)
- Who has admin access to the backup system? (name, phone number, also available on paper)
- What is the backup management tool? (e.g. Veeam, Bacula, Commvault)

**4. Restoration Steps (Numbered, Detailed)**

Example structure for a Windows server recovery:

1. Boot the recovery server from the recovery medium (backup software ISO)
2. Select the backup from the documented date and time
3. Verify the backup is not compromised (run the backup integrity tool)
4. Start recovery to the designated recovery server in the isolated network zone (no production connection)
5. Wait until recovery is 100% complete (document the expected duration for this system)
6. Boot the recovery server and log in as local admin (credentials: see safe, never in the runbook itself)
7. Open a command prompt, verify disk space, RAM, and network interface are functional
8. Run integrity checks (disk verification, `ipconfig /all`, `systeminfo`)
9. When all checks pass, start services
10. Test functionality: log in as a domain user, create a test file, run a test transaction
11. When everything is OK, migrate the system to the production cluster (describe how)
12. Update DNS/DHCP if necessary
13. Document the completion time in the recovery log
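
The recovery log from the last step can be as simple as a list of timestamped entries. The sketch below is one possible shape, not a prescribed format: the `RecoveryLog` class, its fields, and the 4-hour RTO are hypothetical. It records when each step finished and compares the total elapsed time with the RTO.

```python
from datetime import datetime, timedelta


class RecoveryLog:
    """Collects timestamped entries for each completed runbook step."""

    def __init__(self, system, rto):
        self.system = system
        self.rto = rto
        self.entries = []  # list of (timestamp, step number, note)

    def record(self, step, note, when=None):
        # 'when' can be passed explicitly, e.g. when transcribing paper notes.
        self.entries.append((when or datetime.now(), step, note))

    def elapsed(self):
        """Time between the earliest and the latest recorded step."""
        if not self.entries:
            return timedelta(0)
        times = [timestamp for timestamp, _, _ in self.entries]
        return max(times) - min(times)

    def within_rto(self):
        return self.elapsed() <= self.rto


start = datetime(2026, 5, 27, 9, 0)
log = RecoveryLog("file-server", rto=timedelta(hours=4))
log.record(1, "Booted recovery server from ISO", when=start)
log.record(5, "Restore 100% complete", when=start + timedelta(hours=2, minutes=40))
log.record(13, "Recovery complete", when=start + timedelta(hours=3, minutes=15))
print(log.elapsed())     # 3:15:00
print(log.within_rto())  # True
```

The measured duration from such a log is also what feeds the “expected duration” documented in step 5 for the next test.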

**5. Test Procedure**

- How do we test this runbook regularly?
- Example: “Quarterly recovery test: perform a partial recovery on test hardware without affecting production”
- Success criteria: system boots, login works, services start, data integrity confirmed

**6. Admin Credentials (STORED SEPARATELY)**

- Passwords must NOT appear in the runbook itself. That is a security risk.
- Instead: “Local admin credential: in the safe at \[location\], last updated \[date\]”
- Or: store credentials in an encrypted vault, accessible only to the incident commander
- Alternative: password manager with access restricted to the incident commander
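
A simple automated check can help catch passwords that slip into a runbook draft. The following is a heuristic sketch only: the pattern and the `find_possible_secrets` function are assumptions for illustration and do not replace a dedicated secret scanner or a manual review.

```python
import re

# Heuristic pattern: a credential keyword followed by ':' or '=' and a value.
# This is an assumption for illustration, not a complete secret scanner.
SECRET_PATTERN = re.compile(
    r"(?i)\b(password|passwd|pwd|secret|api[_-]?key)\b\s*[:=]\s*(\S+)"
)


def find_possible_secrets(text):
    """Return (line number, line) pairs that look like inline credentials."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        # Skip references that point to a safe or vault instead of a value.
        if match and not re.search(r"(?i)\b(safe|vault)\b", line):
            hits.append((number, line.strip()))
    return hits


sample = """Local admin credential: in the safe at [location]
password: Summer2026!
Password: see safe, shelf 2"""

hits = find_possible_secrets(sample)
print(hits)  # [(2, 'password: Summer2026!')]
```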

**7. Troubleshooting Guide**

- “If recovery aborts with error XYZ: do ABC”
- “If the Tier 2 backup is unreadable: restore from the Tier 3 WORM archive instead”
- “If recovery takes longer than the RTO: restore the most critical data and services first, the rest later”
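
The tier fallback rule above can be expressed as a small loop. This is a sketch under assumptions: the tier names are hypothetical, and `restore` stands in for whatever call your backup tool actually provides.

```python
def restore_with_fallback(tiers, restore):
    """Try each backup tier in order.

    Returns (tier that worked, errors from the tiers that failed).
    """
    errors = {}
    for tier in tiers:
        try:
            restore(tier)
            return tier, errors
        except Exception as exc:  # record the failure and try the next tier
            errors[tier] = str(exc)
    raise RuntimeError(f"all tiers failed: {errors}")


def fake_restore(tier):
    # Stand-in for a real restore call; pretends Tier 2 is unreadable.
    if tier == "tier2-airgap":
        raise IOError("backup unreadable")


used, errors = restore_with_fallback(["tier2-airgap", "tier3-worm"], fake_restore)
print(used)    # tier3-worm
print(errors)  # {'tier2-airgap': 'backup unreadable'}
```

Keeping the collected errors matters for the recovery log: they document why a tier was skipped.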

**8. Post-Recovery Validation**

- How do we verify that the system is truly functional?
- Example for a file server: “Run an integrity check on all shares, confirm no corruption is present”
- Example for a database: “Run DBCC CHECKDB, verify database size, execute a test query”
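
For the file server example, one way to confirm that no corruption is present is to compare restored files against checksums recorded before the incident. The sketch below assumes such a manifest exists (a mapping of relative paths to SHA-256 digests); the function names and the demo files are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Compare restored files under 'root' with the expected hashes.

    Returns a list of (relative path, problem) tuples; empty means all good.
    """
    problems = []
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file():
            problems.append((relative_path, "missing"))
        elif sha256_of_file(target) != expected:
            problems.append((relative_path, "checksum mismatch"))
    return problems


# Demo with a temporary directory standing in for a restored share.
with tempfile.TemporaryDirectory() as share:
    (Path(share) / "report.txt").write_bytes(b"quarterly figures")
    manifest = {
        "report.txt": hashlib.sha256(b"quarterly figures").hexdigest(),
        "missing.txt": hashlib.sha256(b"anything").hexdigest(),
    }
    problems = verify_manifest(share, manifest)
    print(problems)  # [('missing.txt', 'missing')]
```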

### The 7 Core Questions for Every Runbook

A good recovery runbook answers these 7 questions:

1. What is being restored? (System name, version, operating system)
2. From where is it being restored? (Backup location, access, credentials reference)
3. Where is it being restored to? (Recovery hardware, network zone, IP addresses)
4. In what order? (Dependency diagram)
5. How long will it take? (RTO estimate, as measured in the last test)
6. How do we verify success? (Validation steps)
7. What do we do if something goes wrong? (Troubleshooting)

If your runbook does not clearly answer these 7 questions, it is incomplete.
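
If runbooks are kept in a structured form, this completeness check can be automated. The sketch below is an assumption about how that might look: the field names are hypothetical and simply mirror the seven questions.

```python
# Field names are hypothetical; they mirror the seven questions above.
REQUIRED_FIELDS = {
    "system": "What is being restored?",
    "backup_source": "From where is it being restored?",
    "restore_target": "Where is it being restored to?",
    "restore_order": "In what order?",
    "measured_duration": "How long will it take?",
    "validation_steps": "How do we verify success?",
    "troubleshooting": "What do we do if something goes wrong?",
}


def unanswered_questions(runbook):
    """Return the questions whose field is missing or empty."""
    return [
        question
        for field, question in REQUIRED_FIELDS.items()
        if not runbook.get(field)
    ]


draft = {
    "system": "SAP ERP",
    "backup_source": "Tier 2 air gap",
    "restore_order": ["Active Directory", "SAP ERP"],
    "troubleshooting": "",  # present but empty counts as unanswered
}

missing = unanswered_questions(draft)
for question in missing:
    print("Unanswered:", question)
```

For this draft, four questions remain open: the restore target, the measured duration, the validation steps, and troubleshooting.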

### Who Writes and Maintains the Runbook

**Writing:** The team members who normally maintain the system. They know the pitfalls.

**Review:** The IT manager and at least one other person. Four-eyes principle.

**Testing:** At least once per quarter, the runbook should be exercised as an actual recovery test. Not just theoretically, but in practice: with a real backup, real hardware (or a VM), and a real restore operation. For financial entities, DORA (Regulation (EU) 2022/2554) makes this kind of resilience testing a regulatory expectation; under NIS2, test records are your evidence of working disaster recovery.

**Maintenance:** After every major system change (update, configuration change, upgrade). At minimum, conduct an annual refresh to ensure password references, IP addresses, and contacts are still current.
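
The quarterly test and annual refresh intervals lend themselves to a simple reminder check. This is a minimal sketch, assuming each runbook records when it was last tested and last reviewed; the 92-day approximation of a quarter and the function name are assumptions.

```python
from datetime import date, timedelta

TEST_INTERVAL = timedelta(days=92)     # roughly one quarter (assumption)
REVIEW_INTERVAL = timedelta(days=365)  # annual refresh


def overdue_items(last_tested, last_reviewed, today):
    """Return which maintenance activities are overdue for a runbook."""
    overdue = []
    if today - last_tested > TEST_INTERVAL:
        overdue.append("recovery test")
    if today - last_reviewed > REVIEW_INTERVAL:
        overdue.append("annual review")
    return overdue


# Hypothetical dates for one runbook.
status = overdue_items(
    last_tested=date(2026, 1, 15),
    last_reviewed=date(2025, 9, 1),
    today=date(2026, 5, 27),
)
print(status)  # ['recovery test']
```

A check like this does not cover the change-driven updates; those still have to be triggered by the change process itself.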

### Why It Must Be Available Offline

This is not optional: the recovery runbook must be available as a printed document in a safe or secure storage location.

Why? Because when a ransomware attack brings down your corporate network, you cannot access the wiki or SharePoint. You need a printed handbook that tells you:

- The IP address and location of the air gap backup system
- The recovery server where systems are restored
- The restoration sequence
- Where the admin credentials are stored

An organisation that only stores its recovery runbook digitally does not have a recovery runbook. It has a well-intentioned document that becomes available too late.

### Common Mistakes

**Mistake 1: Too generic.** “SAP can be restored from backup” is not enough. A good runbook has 20 to 50 detailed steps.

**Mistake 2: Not tested.** A runbook that has not been regularly exercised (at minimum quarterly) is an assumption. It becomes a capability only once you test it.

**Mistake 3: Outdated.** The system was upgraded, the IP address of the backup system changed, the admin account changed, but the runbook was not updated. The runbook becomes an obstacle.

**Mistake 4: Dependent on one person.** Only one person knows the runbook. If that person is on holiday or is themselves affected by the incident, the plan is useless.

**Mistake 5: Too many credentials in the document.** The runbook must NOT contain passwords. That is a security risk. Instead: a reference to the safe where credentials are stored.

### Frequently Asked Questions

**How detailed must a runbook be?** Detailed enough that an IT technician without specialist knowledge could restore the system. This means: step-by-step instructions where needed.

**How often must we test the runbook?** At least once per quarter (four times per year), as an actual recovery test rather than a paper review.

**What is the difference between a recovery runbook and a Disaster Recovery Plan?** The DR plan is strategic and organisational (how do we respond to a disaster?). The runbook is tactical and specific (how do we bring system X back up?).

---

### Further Resources

- IT Resilience Guide (/en/blog/it-resilienz-leitfaden/)
- Disaster Recovery Test (/en/blog/disaster-recovery-test/)
- Defining RTO and RPO Correctly (/en/blog/rto-rpo-definieren/)

### DORA

DORA (Digital Operational Resilience Act, EU 2022/2554) is an EU regulation that has applied to all regulated financial market participants since January 2025, setting concrete requirements for ICT risk management, backup systems (Art. 11 and 12), third-party provider management (Art. 28–30) and incident reporting.

[Learn more →](https://www.fast-lta.de//en/glossary/dora)

### Disaster Recovery

Disaster recovery refers to the structured processes and technical measures that ensure IT systems can be restored within defined timeframes (RTO) and with a defined maximum data loss (RPO) after a severe failure such as a ransomware attack, hardware failure or data center outage.

[Learn more →](https://www.fast-lta.de//en/glossary/disaster-recovery)

### RTO / RPO

RTO (Recovery Time Objective) is the maximum acceptable downtime after an IT failure; RPO (Recovery Point Objective) is the maximum acceptable data loss. Both are metrics that the backup architecture must demonstrably meet, not merely aspirational targets.

[Learn more →](https://www.fast-lta.de//en/glossary/rto-rpo)

### WORM

WORM (Write Once, Read Many) refers to a storage principle in which data is written once and can technically no longer be altered or deleted — in hardware WORM, this immutability is a physical property of the storage controller, independent of software, operating system or user privileges.

[Learn more →](https://www.fast-lta.de//en/glossary/worm)

