Last reviewed: April 9, 2025
System Resiliency
System Resiliency refers to a system's ability to withstand failures and continue operating, whether those failures are within your control or external (e.g., service outages, hardware failures, or human error).
Resiliency isn't just about uptime; it's about graceful degradation, quick recovery, and minimizing impact.
What ArchTechLytics Evaluates
Resiliency is evaluated through TechLytics under the Reliability pillar, combining architecture review with real-world incident awareness.
Key areas include:
Backups & Recovery
- Are VMs backed up and are backups healthy?
- Are backups geo- or zone-redundant?
- Is backup frequency aligned with RPO?
- Is Site Recovery enabled for cross-region failover?
High Availability
- Are App Services zone-redundant?
- Are databases replicated?
- Is traffic distributed across Availability Zones or Regions?
- Is load balancing or scale set configuration in place?
Monitoring & Alerts
- Are system health alerts configured?
- Are logs retained long enough to support post-mortem analysis?
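To make the checklist above concrete, here is a minimal Python sketch that evaluates one resource against a few of these questions. The `ResourceConfig` fields, the `review` function, and the 30-day log retention threshold are all hypothetical and chosen for illustration; they are not an ArchTechLytics schema or rule.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical threshold: days of logs assumed necessary for post-mortem analysis.
MIN_LOG_RETENTION_DAYS = 30


@dataclass
class ResourceConfig:
    """Illustrative resource snapshot; these fields are NOT an ArchTechLytics schema."""
    name: str
    backup_enabled: bool = False
    backup_healthy: bool = False
    backup_redundancy: str = "local"  # assumed values: "local", "zone", or "geo"
    zone_redundant: bool = False
    health_alerts: bool = False
    log_retention_days: int = 0


def review(resource: ResourceConfig) -> list[str]:
    """Return one finding for each checklist question the resource fails."""
    findings = []
    # Backups & Recovery
    if not (resource.backup_enabled and resource.backup_healthy):
        findings.append("Backups missing or unhealthy")
    if resource.backup_redundancy not in ("zone", "geo"):
        findings.append("Backups are not zone- or geo-redundant")
    # High Availability
    if not resource.zone_redundant:
        findings.append("Not deployed across Availability Zones")
    # Monitoring & Alerts
    if not resource.health_alerts:
        findings.append("No system health alerts configured")
    if resource.log_retention_days < MIN_LOG_RETENTION_DAYS:
        findings.append("Log retention too short for post-mortem analysis")
    return findings


vm = ResourceConfig("app-vm-01", backup_enabled=True, backup_healthy=True,
                    backup_redundancy="geo", log_retention_days=7)
for finding in review(vm):
    print(finding)
```

For the hypothetical `app-vm-01`, the sketch reports three findings: no zone redundancy, no health alerts, and 7-day log retention falling below the assumed 30-day threshold.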
RPO Awareness Built In
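RPO (Recovery Point Objective) defines how much data loss is acceptable, measured in time: an RPO of 4 hours means losing up to the most recent 4 hours of data is tolerable.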
ArchTechLytics compares your declared RPO against your actual backup and replication configurations, flagging gaps where your recovery posture falls short.
Example: If your RPO is 4 hours but backups run only every 6 hours, a failure just before the next backup could lose up to 6 hours of data. ArchTechLytics will flag this misalignment and apply a scoring penalty.
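The comparison in this example can be sketched in a few lines of Python, assuming worst-case data loss equals the backup interval (a failure just before the next backup would have run) and ignoring backup duration and replication. `rpo_gap` and `is_misaligned` are hypothetical helpers, not ArchTechLytics functions, and the sketch does not model how the scoring penalty is calculated.

```python
from datetime import timedelta


def rpo_gap(rpo: timedelta, backup_interval: timedelta) -> timedelta:
    """How far worst-case data loss exceeds the RPO (zero when aligned).

    Assumes worst-case loss equals the backup interval; backup duration
    and replication lag are ignored in this sketch.
    """
    return max(backup_interval - rpo, timedelta(0))


def is_misaligned(rpo: timedelta, backup_interval: timedelta) -> bool:
    """True when backups are too infrequent to meet the declared RPO."""
    return rpo_gap(rpo, backup_interval) > timedelta(0)


# The scenario from the text: 4-hour RPO, backups every 6 hours.
rpo = timedelta(hours=4)
interval = timedelta(hours=6)
print(is_misaligned(rpo, interval))  # True
print(rpo_gap(rpo, interval))        # 2:00:00
```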
Tied to Criticality
A lack of resiliency may be tolerable in a Non-Critical system.
But for a Mission-Critical or Life-Critical system, downtime can mean customer impact, financial loss, or worse.
ArchTechLytics considers System Criticality when evaluating resiliency, ensuring architectural expectations align with real-world needs.
Explore System Criticality
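One way to picture criticality-aware evaluation is to scale the penalty for the same resiliency gap by the system's criticality tier. The Python sketch below models only the tiers named in this article, and the multipliers in `CRITICALITY_WEIGHT` are invented for illustration; they are not ArchTechLytics' actual scoring weights.

```python
from enum import Enum


class Criticality(Enum):
    NON_CRITICAL = "Non-Critical"
    MISSION_CRITICAL = "Mission-Critical"
    LIFE_CRITICAL = "Life-Critical"


# Hypothetical multipliers chosen for illustration only; they are NOT
# ArchTechLytics' actual scoring weights.
CRITICALITY_WEIGHT = {
    Criticality.NON_CRITICAL: 0.5,
    Criticality.MISSION_CRITICAL: 2.0,
    Criticality.LIFE_CRITICAL: 3.0,
}


def weighted_penalty(base_penalty: float, criticality: Criticality) -> float:
    """Scale a resiliency finding's penalty by the system's criticality."""
    return base_penalty * CRITICALITY_WEIGHT[criticality]


# The same gap (e.g., no zone redundancy) weighs more on a critical system.
print(weighted_penalty(10, Criticality.NON_CRITICAL))   # 5.0
print(weighted_penalty(10, Criticality.LIFE_CRITICAL))  # 30.0
```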
Failure Isn't If, It's When
Systems should be designed to:
- Detect failure
- Contain impact
- Recover quickly
ArchTechLytics flags misalignments early, before downtime affects your customers or compliance.
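The three design goals above (detect failure, contain impact, recover quickly) are often implemented together through a circuit breaker around calls to a dependency. The Python sketch below is a generic illustration of that pattern, not something ArchTechLytics provides or checks for; the class, failure threshold, and cooldown are hypothetical, and the clock is injectable so the behavior can be demonstrated deterministically.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Generic circuit breaker illustrating detect / contain / recover."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None and self.clock() - self.opened_at < self.cooldown_s:
            # Contain: fail fast instead of piling load onto a failing dependency.
            raise RuntimeError("circuit open")
        # Closed, or cooldown elapsed (half-open): let the call through.
        try:
            result = fn()
        except Exception:
            # Detect: count the failure; open (or re-open) the circuit when the
            # threshold is reached or a half-open trial call fails.
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Recover: a successful call closes the circuit and resets the count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock so the example is deterministic
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10, clock=lambda: now[0])


def dependency_down() -> str:
    raise ConnectionError("dependency unavailable")


for _ in range(2):
    try:
        breaker.call(dependency_down)
    except ConnectionError:
        pass
# Two failures reached the threshold, so calls now fail fast until the cooldown ends.
now[0] = 11.0  # after the cooldown, a trial call is allowed and succeeds
print(breaker.call(lambda: "ok"))  # ok
```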
Outage History (Feature)
Beyond preventive architecture, ArchTechLytics lets you document real outages and track patterns over time via the Outage History tab.
Each recorded outage includes:
- Date and duration
- Faulted component area
- Impact statement
- Incident ID
- Root cause and resolution
- Affected resources and systems
- Causing resource, if identifiable
This creates a feedback loop between design and reality, helping teams prioritize fixes, improve architectural guardrails, and respond better next time.
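The outage fields listed above map naturally onto a simple record type, and tracking patterns can start with grouping outages by faulted component area. The Python sketch below is a minimal illustration: `OutageRecord`, `outage_patterns`, and the sample incidents are hypothetical and do not reflect how the Outage History tab stores or analyzes data.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional


@dataclass
class OutageRecord:
    """Illustrative record; field names are assumptions, not the product's data model."""
    incident_id: str
    outage_date: date
    duration: timedelta
    component_area: str          # faulted component area
    impact: str                  # impact statement
    root_cause: str
    resolution: str
    affected_resources: list[str] = field(default_factory=list)
    causing_resource: Optional[str] = None  # only if identifiable


def outage_patterns(records: list[OutageRecord]) -> dict[str, tuple[int, timedelta]]:
    """Group outages by faulted component area: (count, total downtime)."""
    counts = Counter(r.component_area for r in records)
    downtime: dict[str, timedelta] = {}
    for r in records:
        downtime[r.component_area] = downtime.get(r.component_area, timedelta(0)) + r.duration
    return {area: (counts[area], downtime[area]) for area in counts}


# Hypothetical sample history, invented for illustration.
history = [
    OutageRecord("INC-001", date(2025, 1, 14), timedelta(hours=2), "Database",
                 "Checkout unavailable", "No failover replica", "Manual restore"),
    OutageRecord("INC-002", date(2025, 2, 3), timedelta(minutes=30), "Networking",
                 "Intermittent API errors", "DNS misconfiguration", "Corrected DNS record"),
    OutageRecord("INC-003", date(2025, 3, 9), timedelta(hours=1), "Database",
                 "Reports delayed", "Storage full", "Expanded storage"),
]
for area, (count, total) in sorted(outage_patterns(history).items()):
    print(f"{area}: {count} outage(s), {total} total downtime")
# Database: 2 outage(s), 3:00:00 total downtime
# Networking: 1 outage(s), 0:30:00 total downtime
```

In this invented history the Database area stands out with two outages and three hours of total downtime, the kind of recurring pattern that would point a team toward its next architectural fix.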
Resiliency isn't binary; it's a spectrum.
ArchTechLytics helps you see where you are, and how to move closer to where you need to be.