PAYMENT GATEWAY

[THK] F.6.6 Operational Resilience and Failure Strategy

Production-ready integrations must be designed to remain correct and stable under failure conditions, ensuring that transaction integrity is preserved even when components fail, responses are delayed, or outcomes remain uncertain.

In SIBS Payment Gateway (SPG) integrations, failures are not exceptional – they are an expected aspect of distributed systems. A resilient system must therefore handle failures deterministically, without introducing inconsistent states, duplicate operations, or loss of transaction information.

Failure as a Normal Operating Condition

Production systems must assume that failures can occur at any stage of the transaction lifecycle.

These may include:

  • Network interruptions or timeouts
  • Partial execution of API requests
  • Delayed or missing webhook notifications
  • Temporary unavailability of external systems

A correct implementation must treat these scenarios as normal operating conditions, not edge cases.

Systems that assume reliable, synchronous execution will fail under real-world conditions.

Classification of Failure Scenarios

Failure handling must distinguish between different types of outcomes:

  • Technical failures
    (e.g., network errors, invalid requests, timeouts)
  • Business outcomes
    (e.g., Declined transactions, user rejection)
  • Unknown or indeterminate states
    (e.g., a request that times out after being sent, so its outcome is not confirmed)

Each category requires a different handling strategy.

The system must not:

  • Retry business outcomes such as Declined payments
  • Assume success or failure in the absence of confirmation

See F.3 Success and Error Scenarios for outcome interpretation.
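
The three categories above can be sketched as a small classification step. This is a minimal illustration, not SPG's API: the status labels, the OutcomeKind enum, the classify function, and the may_have_been_processed flag are all assumptions introduced here. The caller is assumed to set the flag to True for timeouts or dropped connections after the request was sent, and to False for validation errors or failures before sending.

```python
from enum import Enum
from typing import Optional


class OutcomeKind(Enum):
    TECHNICAL_FAILURE = "technical_failure"  # fix the cause, then resubmit
    BUSINESS_OUTCOME = "business_outcome"    # confirmed result: never retry
    UNKNOWN = "unknown"                      # confirm the state before acting


# Illustrative status labels (assumption); use the values your integration receives.
CONFIRMED_STATUSES = {"Success", "Declined"}


def classify(payment_status: Optional[str], may_have_been_processed: bool) -> OutcomeKind:
    """Map the result of one API call to a handling category."""
    if payment_status in CONFIRMED_STATUSES:
        return OutcomeKind.BUSINESS_OUTCOME
    if may_have_been_processed:
        # No confirmed result, but the gateway may have acted on the request.
        return OutcomeKind.UNKNOWN
    # The request was rejected or never delivered, so nothing was processed.
    return OutcomeKind.TECHNICAL_FAILURE
```

Keeping the classification in one place means the retry and recovery logic can branch on a single value instead of re-deriving it from raw errors.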

Handling Unknown and Indeterminate States

One of the most critical failure scenarios is when the outcome of a transaction is unknown.

This may occur when:

  • A request times out without receiving a response
  • A webhook is delayed or not received
  • Network failures interrupt the execution flow

In such cases, the system must:

  • Avoid making assumptions about the final outcome
  • Use the Status API to determine the authoritative transaction state
  • Defer business decisions until the final state is confirmed

Failure to handle unknown states correctly may result in:

  • Duplicate charges
  • Incorrect order fulfillment
  • Reconciliation discrepancies
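
One way to apply these rules is to poll for the authoritative state with a bounded backoff and to make no business decision until a final state is returned. The sketch below assumes a fetch_status callable that wraps the Status API call and returns a status label or None; that callable, the status labels, and the polling parameters are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional

# Illustrative final-state labels (assumption).
FINAL_STATES = {"Success", "Declined"}


def await_final_state(
    transaction_id: str,
    fetch_status: Callable[[str], Optional[str]],  # hypothetical Status API wrapper
    max_polls: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the confirmed final state, or None if it is still unknown."""
    for attempt in range(max_polls):
        try:
            status = fetch_status(transaction_id)
        except OSError:
            status = None  # the lookup itself failed; the outcome is still unknown
        if status in FINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff between polls
    return None  # caller keeps the transaction pending and decides nothing yet
```

A None result is deliberately not treated as a failure: the transaction stays unresolved and is left for a later recovery pass.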

Controlled Retry Strategy

Retries must be applied in a controlled and state-aware manner.

A correct retry strategy must:

  • Retry only when the outcome is unknown or uncertain
  • Avoid retrying operations that have reached a final state
  • Use transaction identifiers and current state to determine retry eligibility
  • Ensure that retry decisions are aligned with the authoritative transaction state

Retries must not:

  • Generate duplicate transactions
  • Override or regress a valid final state

See F.6.2 Transaction Idempotency and Duplicate Protection for idempotent execution requirements.
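
A retry eligibility check along these lines might look as follows. The TransactionRecord type, the state labels (including "Unknown"), and the attempt limit are assumptions made for this sketch; a retry that passes the check is assumed to reuse the same transaction identifier so that it cannot create a second transaction.

```python
from dataclasses import dataclass

# Illustrative labels and limit (assumptions).
FINAL_STATES = {"Success", "Declined"}
MAX_ATTEMPTS = 3


@dataclass
class TransactionRecord:
    transaction_id: str
    state: str     # last state confirmed by the authoritative source, or "Unknown"
    attempts: int  # how many times the operation has been submitted so far


def may_retry(record: TransactionRecord) -> bool:
    """Allow a retry only for an unknown outcome that is still within budget."""
    if record.state in FINAL_STATES:
        return False  # never retry or override a final state
    if record.state != "Unknown":
        return False  # e.g. still pending: wait for confirmation instead
    return record.attempts < MAX_ATTEMPTS
```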

Resilience in Asynchronous Processing

Failure handling must extend across asynchronous flows.

This includes:

  • Delayed or duplicated webhook notifications
  • Out-of-order event delivery
  • Temporary inconsistencies between systems

A resilient system must:

  • Maintain consistency despite delayed or repeated events
  • Ensure that the state eventually converges to the correct final outcome
  • Prevent inconsistent intermediate states from affecting business logic

See F.6.3 Asynchronous Flow Readiness and F.6.4 Webhook Reliability and Processing Guarantees.
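
As an illustration of these requirements, a webhook handler can ignore repeated deliveries and refuse to let a late event overwrite a final state. The event fields (event_id, transaction_id, status), the WebhookProcessor class, and the in-memory storage are hypothetical; a real integration would persist this data and use whatever identifiers the notification actually carries.

```python
from typing import Dict, Set

# Illustrative final-state labels (assumption).
FINAL_STATES = {"Success", "Declined"}


class WebhookProcessor:
    """In-memory sketch; production code would use durable storage."""

    def __init__(self) -> None:
        self.seen_event_ids: Set[str] = set()
        self.state: Dict[str, str] = {}

    def handle(self, event_id: str, transaction_id: str, status: str) -> bool:
        """Apply an event once; return True only if it changed the stored state."""
        if event_id in self.seen_event_ids:
            return False  # duplicate delivery: already processed
        self.seen_event_ids.add(event_id)
        if self.state.get(transaction_id) in FINAL_STATES:
            return False  # late or out-of-order event: keep the final state
        self.state[transaction_id] = status
        return True
```

Because duplicates and late events return False, business logic that runs only on a True result sees each state change at most once.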

State Consistency Under Failure

Transaction state must remain consistent even when failures occur.

This requires:

  • Validation of all state transitions
  • Prevention of invalid or regressive updates
  • Alignment between internal state and SPG authoritative state

Systems must guarantee that:

  • Final states are preserved
  • No transaction is left in an unresolved or inconsistent state
  • State can always be reconciled using authoritative sources
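
Transition validation can be expressed as an explicit table of allowed moves. The state names and the table below are assumptions for illustration and do not describe SPG's actual state model; the point is that final states have no outgoing transitions, so a regressive update is rejected rather than applied.

```python
from typing import Dict, Set

# Hypothetical state model: final states allow no further transitions.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "Created": {"Pending", "Success", "Declined"},
    "Pending": {"Success", "Declined"},
    "Success": set(),
    "Declined": set(),
}


def apply_transition(current: str, new: str) -> str:
    """Return the new state if the move is valid; raise ValueError otherwise."""
    if new == current:
        return current  # repeated update with the same state is a no-op
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {new}")
    return new
```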

Fallback and Recovery Mechanisms

Resilient systems must implement fallback mechanisms to recover from failure scenarios.

This includes:

  • Status API validation when webhook delivery fails or is delayed
  • Reconciliation processes to detect and resolve inconsistencies
  • Recovery flows for transactions in unknown states

All transactions must eventually be resolved to a final, authoritative state, even if initial processing fails.
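
A reconciliation pass of this kind could be sketched as a periodic sweep over locally stored transactions that are not yet final. The reconcile function, the fetch_status callable (standing in for a Status API lookup), and the status labels are hypothetical names used only for this example.

```python
from typing import Callable, Dict, List, Optional

# Illustrative final-state labels (assumption).
FINAL_STATES = {"Success", "Declined"}


def reconcile(
    local_states: Dict[str, str],
    fetch_status: Callable[[str], Optional[str]],  # hypothetical Status API wrapper
) -> List[str]:
    """Update non-final local records in place; return the IDs still unresolved."""
    unresolved: List[str] = []
    for tx_id, state in list(local_states.items()):
        if state in FINAL_STATES:
            continue  # final states are preserved as they are
        try:
            remote = fetch_status(tx_id)
        except OSError:
            remote = None  # lookup failed: try again on the next sweep
        if remote in FINAL_STATES:
            local_states[tx_id] = remote
        else:
            unresolved.append(tx_id)
    return unresolved
```

The returned list gives the next sweep, or an operator, an explicit record of which transactions still need attention.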

Isolation and Fault Containment

Failures must be contained to prevent cascading impact across the system.

A production-ready system should:

  • Isolate transaction processing units
  • Prevent a single failure from affecting unrelated transactions
  • Ensure that retry or recovery mechanisms do not overload the system

This reduces the risk of:

  • System-wide instability
  • Amplification of failure conditions
  • Loss of control over transaction processing
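
A minimal sketch of fault containment, assuming a hypothetical per-transaction handler: each transaction is processed in its own try block so one failure does not stop the others, and the number of failures queued for retry is capped so recovery work cannot grow without bound. The process_batch function and the cap value are illustrative assumptions.

```python
from typing import Callable, Iterable, List, Tuple


def process_batch(
    tx_ids: Iterable[str],
    handler: Callable[[str], None],  # hypothetical per-transaction processing step
    max_retries_queued: int = 10,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (done, retry_queue, deferred) without letting one failure stop the batch."""
    done: List[str] = []
    retry_queue: List[str] = []
    deferred: List[str] = []
    for tx_id in tx_ids:
        try:
            handler(tx_id)
            done.append(tx_id)
        except Exception:
            # Contain the failure to this transaction only.
            if len(retry_queue) < max_retries_queued:
                retry_queue.append(tx_id)
            else:
                deferred.append(tx_id)  # over budget: leave for reconciliation
    return done, retry_queue, deferred
```

Deferred transactions are not discarded; they are recorded so that a later reconciliation process can resolve them.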

Operational Observability of Failures

Failures must be observable, traceable, and diagnosable.

The system must allow:

  • Identification of failed or inconsistent transactions
  • Correlation of failures across API calls, events, and internal processing
  • Monitoring of retry and recovery behavior

Without proper observability, failure scenarios cannot be effectively managed or resolved.

See F.5 Logging, Monitoring, and Observability.
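
Correlation across API calls, events, and internal processing is easier when every log line carries the same transaction identifier. The sketch below emits one JSON line per event using Python's standard logging and json modules; the field names, the logger name, and the log_event helper are assumptions, not a prescribed format.

```python
import json
import logging

logger = logging.getLogger("payments")  # logger name is an arbitrary choice


def log_event(transaction_id: str, stage: str, outcome: str, **details: str) -> str:
    """Emit one structured log line keyed by transaction ID and return it."""
    record = {
        "transaction_id": transaction_id,  # correlation key shared by all stages
        "stage": stage,                    # e.g. "api_call", "webhook", "retry"
        "outcome": outcome,
        **details,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

For example, log_event("tx-1", "webhook", "duplicate_ignored", event_id="e9") produces a line that can be filtered by transaction_id alongside the API-call and retry entries for the same transaction.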

Final Consideration

Operational resilience is not achieved by preventing failures, but by ensuring correct behavior despite failures.

A production-ready integration guarantees that:

  • Failures do not compromise transaction integrity
  • All transactions reach a consistent and final state
  • Retry and recovery mechanisms are safe and controlled
  • System behavior remains deterministic under uncertainty

Systems that do not explicitly handle failure conditions will produce inconsistent outcomes, even if they function correctly under ideal conditions.
