24/7 Availability: Observability and Resilience in the Core

WAU Marketing
Apr 28

Your bank's customer doesn't sleep. They make a transfer at 2 a.m., pay at 11 p.m., check their balance on a Sunday. Many cores, by contrast, were designed for a world that closed at five in the afternoon. That gap is paid in outages.

Banking changed its hours without telling its technology. Instant payments made it clear: in Brazil, PIX processed more than 63 billion transactions in 2024, running every day of the year without pause, according to Central Bank of Brazil data. The customer's expectation no longer respects the overnight maintenance window the legacy core takes for granted. And when a system built for overnight batch processing has to stay up around the clock, the cracks appear.

The cost of going down, in numbers

An outage isn't an inconvenience: it's a bill. ITIC's annual cost-of-downtime survey places banking among the sectors where an hour of downtime costs, on average, more than five million dollars. In the UK, a study by the consultancy GFT put the average investment-bank outage at more than £600,000 per hour, as IBS Intelligence reported. And in its 2024 Annual Outage Analysis, the Uptime Institute found that more than half of operators said their last serious outage cost over $100,000, and one in five said it cost over a million.

The most telling case is public. Data submitted to the House of Commons Treasury Committee showed that nine of the UK's largest banks racked up more than 803 hours of unplanned outages (over 33 days) in just two years, across 158 incidents. Among the causes the banks themselves cited most often were third-party supplier failures and legacy systems: the risk of concentrating everything in old, opaque infrastructure.

What "always available" means

It's worth putting a number on the promise. Availability is measured in "nines":

99.9% sounds fine until you translate it: nearly 9 hours of downtime a year.
99.99% drops to about 52 minutes a year. It's the standard expected of payments and authentication.
99.999% is just over 5 minutes a year, the territory of the most critical systems, per TechTarget's calculation.
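
The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python, assuming a non-leap year of 365 days; the function name `downtime_minutes_per_year` is ours, chosen for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; a non-leap year is assumed


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Print the downtime budget for three, four, and five nines.
for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Three nines works out to roughly 526 minutes (about 8.8 hours), four nines to about 52.6 minutes, and five nines to about 5.3 minutes. Each extra nine divides the budget by ten.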

Each additional nine cuts the downtime budget tenfold, and the engineering effort to get there rises steeply. It isn't reached by asking a monolithic core to hold on. It's reached with architecture.

The two capabilities that make it possible: observability and resilience

Here's the technical heart. A core running 24/7 needs two things the legacy core doesn't bring out of the box.

The first is observability: the ability to know what's happening inside the system at all times. It rests on three pillars—logs, metrics, and distributed traces. Metrics warn you something broke, traces show you where, and logs give you the context to fix it. Without observability, a 3 a.m. outage is a blind panic call; with it, it's an incident you diagnose and contain.
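
To make the three pillars concrete, here is a small sketch using only the Python standard library. It is an illustration, not a production setup: the `transfer` handler, the metric names, and the in-memory stores are all hypothetical, and a real system would ship this data to dedicated tooling rather than keep it in lists.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics pillar: simple in-memory counters
spans = []           # traces pillar: finished spans, kept in memory here
log_lines = []       # logs pillar: structured JSON lines


def log(trace_id: str, message: str, **context) -> None:
    """Record a structured log line that carries the trace id."""
    line = json.dumps({"trace_id": trace_id, "message": message, **context})
    log_lines.append(line)
    print(line)


@contextmanager
def span(trace_id: str, name: str):
    """Time one step of a request and record it under the trace id."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        spans.append({"trace_id": trace_id, "name": name,
                      "status": status, "ms": elapsed_ms})


def transfer(amount: float, ledger_up: bool = True) -> bool:
    """Hypothetical transfer handler instrumented with all three pillars."""
    trace_id = uuid.uuid4().hex
    metrics["transfer.requests"] += 1
    try:
        with span(trace_id, "validate"):
            if amount <= 0:
                raise ValueError("amount must be positive")
        with span(trace_id, "ledger.post"):
            if not ledger_up:
                raise ConnectionError("ledger unavailable")
    except Exception as exc:
        metrics["transfer.errors"] += 1
        log(trace_id, "transfer failed", error=str(exc), amount=amount)
        return False
    return True


transfer(100.0)                  # healthy request
transfer(50.0, ledger_up=False)  # simulated ledger outage
```

After the second call, the error counter says something broke, the failed `ledger.post` span says where, and the log line that shares its trace id carries the context.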

The second is resilience: the ability—in the Basel Committee's definition—to deliver critical operations through disruption, as set out in the BCBS Principles for Operational Resilience. It's not about never failing; that's impossible. It's about the system staying up when a component fails. Netflix took it to the extreme with its famous "Chaos Monkey," which shuts down servers at random in production on purpose, to force the system to be fault-tolerant by design, as Netflix's own Simian Army project documents. The underlying idea applies to a core: one piece can go down without dragging the rest with it.
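
One common way to keep a failing piece from dragging the rest down is a circuit breaker. The sketch below is a simplified illustration of that pattern, not a description of any particular product; the class, its parameters, and the `fetch_balance` and `cached_balance` functions are hypothetical.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and answer with a fallback."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            # Open: skip the dependency until the cool-down has elapsed.
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


calls = {"n": 0}


def fetch_balance():
    """Hypothetical dependency that is currently down."""
    calls["n"] += 1
    raise ConnectionError("balance service down")


def cached_balance():
    """Hypothetical degraded answer served while the dependency is down."""
    return "showing last known balance"


breaker = CircuitBreaker(max_failures=3, reset_after_s=30.0)
answers = [breaker.call(fetch_balance, cached_balance) for _ in range(5)]
```

In this run the failing service is called only three times. The last two requests are answered from the fallback without touching it, so the customer still gets a response and the broken dependency gets room to recover.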

Let's be honest about a point that's often mis-sold: splitting the monolith into microservices doesn't make you resilient by magic. In fact, done badly, it adds complexity and new ways to fail. Real resilience comes from the full package: fault-tolerant design, native observability, and the discipline of testing failures before they happen. It's not just slicing; it's adopting an operating model.
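
Testing failures before they happen can start small. Here is a toy fault-injection sketch: it takes one replica down at random and checks that a request still gets an answer. The `Replica` class, the `send` failover loop, and the replica names are hypothetical stand-ins; real chaos experiments run against real infrastructure with far more safeguards.

```python
import random


class Replica:
    """Hypothetical stand-in for one instance of a service."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def send(replicas, request: str) -> str:
    """Try each replica in turn and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no replica available")


def chaos_round(replicas, rng: random.Random) -> str:
    """Take one random replica down, then check a request still succeeds."""
    victim = rng.choice(replicas)
    victim.alive = False
    try:
        return send(replicas, "balance-check")
    finally:
        victim.alive = True  # restore so the next round starts healthy


rng = random.Random(42)  # seeded so a failing round can be reproduced
pool = [Replica("core-a"), Replica("core-b"), Replica("core-c")]
results = [chaos_round(pool, rng) for _ in range(20)]
```

If any round raised `RuntimeError`, the design would have a single point of failure, and it is far better to learn that in a test than at 3 a.m.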

The regulator requires it too

It's not just best practice. The Bank of Mexico operates SPEI under a principle of cyber-resilience and requires participants to keep their technological infrastructure available and operating, under Circular 14/2017. The CNBV mandates a Business Continuity Plan updated at least yearly, covering everything from natural disasters to cyberattacks, per Annex 67 of the Circular Única de Bancos. And internationally, the Basel Committee's Principles for Operational Resilience set the standard. 24/7 availability stopped being a marketing aspiration; it's a regulatory expectation.

How we see it at WAU

At WAU we build cores to operate without pause: native observability—logs, metrics, and traces from day one—fault-tolerant architecture where one piece can go down without toppling the system, and the resilience discipline that turns "let's hope it doesn't go down" into "it's designed not to." Your customer operates 24/7; your core should too.

If your core still thinks in branch hours and every outage costs you sleep and money, let's talk. We'll show you what it takes to always be up. 👉 Book a conversation with our team.
