[THK] F.6.6 Operational Resilience and Failure Strategy

Production-ready integrations must be designed to remain correct and stable under failure conditions, ensuring that transaction integrity is preserved even when components fail, responses are delayed, or outcomes remain uncertain.

In SIBS Payment Gateway (SPG) integrations, failures are not exceptional; they are an expected aspect of distributed systems. A resilient system must therefore handle failures deterministically, without introducing inconsistent states, duplicate operations, or loss of transaction information.

Failure as a Normal Operating Condition

Production systems must assume that failures will occur at any stage of the transaction lifecycle.

These may include:

Network interruptions or timeouts
Partial execution of API requests
Delayed or missing webhook notifications
Temporary unavailability of external systems

A correct implementation must treat these scenarios as normal operating conditions, not edge cases.

Systems that assume reliable, synchronous execution will fail under real-world conditions.

Classification of Failure Scenarios

Failure handling must distinguish between different types of outcomes:

Technical failures (e.g., network errors, invalid requests, timeouts)
Business outcomes (e.g., Declined transactions, user rejection)
Unknown or indeterminate states (e.g., timeout without confirmed outcome)

Each category requires a different handling strategy.

The system must not:

Retry business outcomes such as Declined payments
Assume success or failure in the absence of confirmation

See F.3 Success and Error Scenarios for outcome interpretation.
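
The three categories can be made explicit in code so that each one is routed to its own handling path. The following is a minimal sketch, not SPG's implementation: the outcome labels (`declined`, `user_rejected`, `connection_error`, `invalid_request`) and the step names returned by `next_step` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    TECHNICAL = "technical"
    BUSINESS = "business"
    UNKNOWN = "unknown"


# Hypothetical outcome labels for illustration; not SPG field values.
BUSINESS_OUTCOMES = {"declined", "user_rejected"}
TECHNICAL_ERRORS = {"connection_error", "invalid_request"}


def classify(outcome: Optional[str]) -> FailureCategory:
    """Map a raw outcome label to one of the three handling categories."""
    if outcome in BUSINESS_OUTCOMES:
        return FailureCategory.BUSINESS
    if outcome in TECHNICAL_ERRORS:
        return FailureCategory.TECHNICAL
    # None, a timeout, or any unrecognised label: the result is unconfirmed,
    # so neither success nor failure is assumed.
    return FailureCategory.UNKNOWN


def next_step(category: FailureCategory) -> str:
    """Return an assumed handling step for each category."""
    if category is FailureCategory.BUSINESS:
        return "record_outcome"   # a final decision; never retried
    if category is FailureCategory.UNKNOWN:
        return "check_status"     # confirm the state before deciding
    return "report_error"         # surface the technical problem
```

Defaulting unrecognised labels to `UNKNOWN` is a deliberate choice in this sketch: an unexpected value is handled as unconfirmed rather than being guessed as success or failure.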

Handling Unknown and Indeterminate States

One of the most critical failure scenarios is when the outcome of a transaction is unknown.

This may occur when:

A request times out without receiving a response
A webhook is delayed or not received
Network failures interrupt the execution flow

In such cases, the system must:

Avoid making assumptions about the final outcome
Use the Status API to determine the authoritative transaction state
Defer business decisions until the final state is confirmed

Failure to handle unknown states correctly may result in:

Duplicate charges
Incorrect order fulfillment
Reconciliation discrepancies
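
Resolving an unknown outcome can be sketched as a bounded status check that returns a result only once a final state is confirmed. This is an illustration under stated assumptions: `fetch_status` is a hypothetical callable standing in for a Status API request, and the state labels in `FINAL_STATES` are assumed rather than taken from the SPG specification.

```python
import time
from typing import Callable, Optional

# Assumed final-state labels; check them against the SPG documentation.
FINAL_STATES = {"Success", "Declined"}


def resolve_outcome(
    transaction_id: str,
    fetch_status: Callable[[str], Optional[str]],
    max_checks: int = 3,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the confirmed final state, or None if it is still unresolved."""
    for check in range(max_checks):
        status = fetch_status(transaction_id)
        if status in FINAL_STATES:
            return status
        if check < max_checks - 1:
            sleep(delay_seconds)  # wait before asking again
    # Still unconfirmed: the caller must defer any business decision.
    return None
```

When `resolve_outcome` returns `None`, the order would stay in a pending state and be picked up again later instead of being fulfilled or cancelled.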

Controlled Retry Strategy

Retries must be applied in a controlled and state-aware manner.

A correct retry strategy must:

Retry only when the outcome is unknown or uncertain
Avoid retrying operations that have reached a final state
Use transaction identifiers and current state to determine retry eligibility
Align retry decisions with the authoritative transaction state

Retries must not:

Generate duplicate transactions
Override or regress a valid final state

See F.6.2 Transaction Idempotency and Duplicate Protection for idempotent execution requirements.
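
A retry eligibility check along these lines might look as follows. It is a sketch only: the state labels, the meaning of `None` (no confirmed record of the transaction), and the `max_attempts` limit are assumptions, and it presumes that resubmission is itself idempotent as required by F.6.2.

```python
from typing import Optional

# Assumed final-state labels; check them against the SPG documentation.
FINAL_STATES = {"Success", "Declined"}


def retry_eligible(
    authoritative_state: Optional[str],
    attempts: int,
    max_attempts: int = 3,
) -> bool:
    """Decide whether an operation may be retried.

    authoritative_state is the state confirmed by the gateway, or None
    when no outcome could be confirmed for the transaction.
    """
    if authoritative_state in FINAL_STATES:
        # A final state must never be retried or overridden.
        return False
    if authoritative_state is not None:
        # The transaction is known and still in progress (e.g. "Pending"):
        # wait for its outcome instead of submitting it again.
        return False
    # Outcome unknown: retry, but only within a bounded number of attempts.
    return attempts < max_attempts
```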

Resilience in Asynchronous Processing

Failure handling must extend across asynchronous flows.

This includes:

Delayed or duplicated webhook notifications
Out-of-order event delivery
Temporary inconsistencies between systems

A resilient system must:

Maintain consistency despite delayed or repeated events
Ensure that the state eventually converges to the correct final outcome
Prevent inconsistent intermediate states from affecting business logic

See F.6.3 Asynchronous Flow Readiness and F.6.4 Webhook Reliability and Processing Guarantees.
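
One way to tolerate duplicated and out-of-order notifications is to deduplicate by event identifier and ignore events older than the last one applied. The sketch below assumes each notification carries a unique `event_id` and an `occurred_at` timestamp; both field names are hypothetical and may not match the actual webhook payload.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TransactionView:
    """Local view of one transaction, updated from incoming events."""
    state: str = "Pending"
    last_event_time: float = 0.0
    seen_event_ids: Set[str] = field(default_factory=set)


def apply_event(view: TransactionView, event_id: str, state: str,
                occurred_at: float) -> bool:
    """Apply an event once; return False if it was ignored."""
    if event_id in view.seen_event_ids:
        return False  # duplicate delivery
    view.seen_event_ids.add(event_id)
    if occurred_at <= view.last_event_time:
        return False  # stale event that arrived out of order
    view.state = state
    view.last_event_time = occurred_at
    return True
```

With this approach, replaying the same events in any order leaves the view in the state of the most recent event, which is what lets the local state converge.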

State Consistency Under Failure

Transaction state must remain consistent even when failures occur.

This requires:

Validation of all state transitions
Prevention of invalid or regressive updates
Alignment between the internal state and the authoritative SPG state

Systems must guarantee that:

Final states are preserved
No transaction is left in an unresolved or inconsistent state
State can always be reconciled using authoritative sources
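
Transition validation can be expressed as an explicit table of allowed moves, so that regressive updates are rejected rather than silently applied. The state names and the transitions below are illustrative assumptions, not the SPG state model.

```python
# Assumed state model for illustration: final states allow no further moves.
ALLOWED_TRANSITIONS = {
    "Created": {"Pending", "Success", "Declined"},
    "Pending": {"Success", "Declined"},
    "Success": set(),
    "Declined": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state if the move is valid, otherwise raise ValueError."""
    if new == current:
        return current  # repeating the same state is a harmless no-op
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {new}")
    return new
```

For example, `transition("Pending", "Success")` returns `"Success"`, while `transition("Success", "Pending")` raises `ValueError`, which preserves the final state.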

Fallback and Recovery Mechanisms

Resilient systems must implement fallback mechanisms to recover from failure scenarios.

This includes:

Status API validation when webhook delivery fails or is delayed
Reconciliation processes to detect and resolve inconsistencies
Recovery flows for transactions in unknown states

All transactions must be eventually resolved to a final, authoritative state, even if initial processing fails.
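
A reconciliation pass over locally unresolved transactions could be sketched as below. The function walks every non-final local record, asks an authoritative source for its state, and reports what is still open. `fetch_status` is a hypothetical stand-in for a Status API request, and the state labels are assumed.

```python
from typing import Callable, Dict, List, Optional

# Assumed final-state labels; check them against the SPG documentation.
FINAL_STATES = {"Success", "Declined"}


def reconcile(
    local_states: Dict[str, str],
    fetch_status: Callable[[str], Optional[str]],
) -> List[str]:
    """Update unresolved local records in place; return the ids still open."""
    unresolved = []
    for tx_id, state in list(local_states.items()):
        if state in FINAL_STATES:
            continue  # final states are preserved as they are
        authoritative = fetch_status(tx_id)
        if authoritative in FINAL_STATES:
            local_states[tx_id] = authoritative
        else:
            unresolved.append(tx_id)  # retry on the next reconciliation run
    return unresolved
```

Running such a pass on a schedule gives transactions whose webhook never arrived a path to a final state. The identifiers it returns are candidates for alerting or manual review.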

Isolation and Fault Containment

Failures must be contained to prevent cascading impact across the system.

A production-ready system should:

Isolate transaction processing units
Prevent a single failure from affecting unrelated transactions
Ensure that retry or recovery mechanisms do not overload the system

This reduces the risk of:

System-wide instability
Amplification of failure conditions
Loss of control over transaction processing
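
Two common containment techniques are sketched below: processing each transaction inside its own error boundary, and spacing retries with capped exponential backoff and jitter so that recovery does not overload the system. The function names and default values are illustrative assumptions, not SPG recommendations.

```python
import random
from typing import Callable, Dict, Iterable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter, in seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def process_batch(
    transaction_ids: Iterable[str],
    handler: Callable[[str], None],
) -> Dict[str, str]:
    """Run handler per transaction so one failure cannot stop the others."""
    results = {}
    for tx_id in transaction_ids:
        try:
            handler(tx_id)
            results[tx_id] = "ok"
        except Exception as exc:
            # Contain the failure to this transaction and carry on.
            results[tx_id] = f"failed: {exc}"
    return results
```

The jitter spreads retries from many transactions over time, which avoids synchronized retry bursts after an outage.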

Operational Observability of Failures

Failures must be observable, traceable, and diagnosable.

The system must allow:

Identification of failed or inconsistent transactions
Correlation of failures across API calls, events, and internal processing
Monitoring of retry and recovery behavior

Without proper observability, failure scenarios cannot be effectively managed or resolved.

See F.5 Logging, Monitoring, and Observability.
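
Correlation becomes practical when every log entry carries the same transaction identifier in a structured form. The following sketch uses Python's standard `logging` and `json` modules; the logger name and the field names (`transaction_id`, `stage`, `outcome`, `attempt`) are assumptions chosen for illustration.

```python
import json
import logging
from typing import Any, Dict

logger = logging.getLogger("payments")


def log_transaction_event(
    transaction_id: str,
    stage: str,
    outcome: str,
    attempt: int = 1,
) -> Dict[str, Any]:
    """Emit one structured log entry and return it."""
    entry = {
        "transaction_id": transaction_id,  # correlation key across stages
        "stage": stage,                    # e.g. "api_call", "webhook"
        "outcome": outcome,
        "attempt": attempt,
    }
    logger.info(json.dumps(entry, sort_keys=True))
    return entry
```

Filtering logs by `transaction_id` then reconstructs the full sequence of API calls, events, and retries for a single transaction.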

Final Consideration

Operational resilience is not achieved by preventing failures, but by ensuring correct behavior despite failures.

A production-ready integration guarantees that:

Failures do not compromise transaction integrity
All transactions reach a consistent and final state
Retry and recovery mechanisms are safe and controlled
System behavior remains deterministic under uncertainty

Systems that do not explicitly handle failure conditions will produce inconsistent outcomes, even if they function correctly under ideal conditions.
