Production-ready integrations must be designed to remain correct and stable under failure conditions, ensuring that transaction integrity is preserved even when components fail, responses are delayed, or outcomes remain uncertain.
In SIBS Payment Gateway (SPG) integrations, failures are not exceptional – they are an expected aspect of distributed systems. A resilient system must therefore handle failures deterministically, without introducing inconsistent states, duplicate operations, or loss of transaction information.
Failure as a Normal Operating Condition
Production systems must assume that failures will occur at any stage of the transaction lifecycle.
These may include:
- Network interruptions or timeouts
- Partial execution of API requests
- Delayed or missing webhook notifications
- Temporary unavailability of external systems
A correct implementation must treat these scenarios as normal operating conditions, not edge cases.
Systems that assume reliable, synchronous execution will fail under real-world conditions.
Classification of Failure Scenarios
Failure handling must distinguish between different types of outcomes:
- Technical failures (e.g., network errors, invalid requests, timeouts)
- Business outcomes (e.g., Declined transactions, user rejection)
- Unknown or indeterminate states (e.g., timeout without confirmed outcome)
Each category requires a different handling strategy.
The system must not:
- Retry business outcomes such as Declined payments
- Assume success or failure in the absence of confirmation
See F.3 Success and Error Scenarios for outcome interpretation.
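The three categories can be sketched in Python as follows. This is an illustration under stated assumptions, not SPG's API: the `Outcome` enum, the `classify` function, and the status labels are hypothetical, and the mapping from HTTP status codes to categories should be checked against F.3. A timeout is treated as unknown because the request may already have been processed.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    TECHNICAL_FAILURE = "technical_failure"
    BUSINESS_OUTCOME = "business_outcome"
    UNKNOWN = "unknown"


# Hypothetical status labels; replace with the values your integration receives.
BUSINESS_STATUSES = {"Success", "Declined"}


def classify(
    http_status: Optional[int],
    payment_status: Optional[str],
    timed_out: bool,
) -> Outcome:
    """Map a raw call result to one of the three handling categories."""
    if timed_out or http_status is None:
        # No confirmed response: the request may or may not have been processed.
        return Outcome.UNKNOWN
    if payment_status in BUSINESS_STATUSES:
        # A confirmed decision about the payment; never retried.
        return Outcome.BUSINESS_OUTCOME
    if 400 <= http_status < 500:
        # Assumption: a 4xx response means the request was rejected as invalid.
        return Outcome.TECHNICAL_FAILURE
    # Anything else (e.g. 5xx) is unconfirmed and must be verified before acting.
    return Outcome.UNKNOWN
```

Keeping the category explicit lets later code branch on it instead of re-inspecting raw responses at every call site.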
Handling Unknown and Indeterminate States
One of the most critical failure scenarios is when the outcome of a transaction is unknown.
This may occur when:
- A request times out without receiving a response
- A webhook is delayed or not received
- Network failures interrupt the execution flow
In such cases, the system must:
- Avoid making assumptions about the final outcome
- Use the Status API to determine the authoritative transaction state
- Defer business decisions until the final state is confirmed
Failure to handle unknown states correctly may result in:
- Duplicate charges
- Incorrect order fulfillment
- Reconciliation discrepancies
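A minimal sketch of deferring the decision until the state is confirmed. The `fetch_status` callable stands in for a Status API lookup; its signature, the state labels, and the polling parameters are assumptions for illustration only.

```python
import time
from typing import Callable, Optional

# Hypothetical final-state labels.
FINAL_STATES = {"Success", "Declined"}


def resolve_unknown(
    transaction_id: str,
    fetch_status: Callable[[str], Optional[str]],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll the status lookup until a final state is confirmed.

    Returns the final state, or None if it is still unconfirmed, in which
    case the caller keeps the transaction pending and defers any decision.
    """
    for attempt in range(attempts):
        try:
            status = fetch_status(transaction_id)
        except (TimeoutError, ConnectionError):
            status = None  # the lookup itself failed; the outcome is still unknown
        if status in FINAL_STATES:
            return status
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

A `None` result is not a failure: it means the transaction stays unresolved and is picked up again later, for example by a reconciliation job.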
Controlled Retry Strategy
Retries must be applied in a controlled and state-aware manner.
A correct retry strategy must:
- Retry only when the outcome is unknown or uncertain
- Avoid retrying operations that have reached a final state
- Use transaction identifiers and current state to determine retry eligibility
- Align retry decisions with the authoritative transaction state
Retries must not:
- Generate duplicate transactions
- Override or regress a valid final state
See F.6.2 Transaction Idempotency and Duplicate Protection for idempotent execution requirements.
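The eligibility rules above could be expressed as a single check, sketched below. `TransactionRecord`, `may_retry`, the state labels, and the attempt limit are hypothetical; the sketch assumes the stored state was refreshed from the authoritative source before the check runs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical final-state labels.
FINAL_STATES = {"Success", "Declined"}


@dataclass
class TransactionRecord:
    transaction_id: str
    state: Optional[str]  # last authoritative state; None if never confirmed
    attempts: int = 0


def may_retry(record: TransactionRecord, max_attempts: int = 3) -> bool:
    """Return True only when a retry cannot duplicate or regress a transaction."""
    if record.state in FINAL_STATES:
        return False  # final states are never retried
    if record.state is not None:
        # A known, non-final state: wait for confirmation instead of resending.
        return False
    return record.attempts < max_attempts  # outcome unknown: bounded retry
```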
Resilience in Asynchronous Processing
Failure handling must extend across asynchronous flows.
This includes:
- Delayed or duplicated webhook notifications
- Out-of-order event delivery
- Temporary inconsistencies between systems
A resilient system must:
- Maintain consistency despite delayed or repeated events
- Ensure that state eventually converges to the correct final outcome
- Prevent inconsistent intermediate states from affecting business logic
See F.6.3 Asynchronous Flow Readiness and F.6.4 Webhook Reliability and Processing Guarantees.
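A sketch of event handling that tolerates duplicated and out-of-order notifications. The event fields (`event_id`, `transaction_id`, `status`) and the state labels are assumptions, not the SPG webhook format, and the in-memory storage stands in for a durable store.

```python
from typing import Dict, Set

# Hypothetical final-state labels.
FINAL_STATES = {"Success", "Declined"}


class WebhookProcessor:
    """Applies webhook events so that repeats and late arrivals are harmless."""

    def __init__(self) -> None:
        self.seen_event_ids: Set[str] = set()
        self.states: Dict[str, str] = {}

    def handle(self, event_id: str, transaction_id: str, status: str) -> bool:
        """Apply an event; return True only if it changed the stored state."""
        if event_id in self.seen_event_ids:
            return False  # duplicate delivery
        self.seen_event_ids.add(event_id)

        current = self.states.get(transaction_id)
        if current in FINAL_STATES:
            return False  # late or out-of-order event; the final state is preserved
        if current == status:
            return False  # no change
        self.states[transaction_id] = status
        return True
```

Business logic would act only when `handle` returns `True` for a final state. An event that contradicts an already final state is ignored here; in practice it would also be flagged for reconciliation.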
State Consistency Under Failure
Transaction state must remain consistent even when failures occur.
This requires:
- Validation of all state transitions
- Prevention of invalid or regressive updates
- Alignment between internal state and SPG authoritative state
Systems must guarantee that:
- Final states are preserved
- No transaction is left in an unresolved or inconsistent state
- State can always be reconciled using authoritative sources
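One way to enforce these guarantees is an explicit transition table, sketched below. The state names and the allowed transitions are hypothetical and would need to be aligned with the states SPG actually reports.

```python
# Hypothetical state machine: final states have no outgoing transitions.
ALLOWED_TRANSITIONS = {
    "Created": {"Pending", "Success", "Declined"},
    "Pending": {"Success", "Declined"},
    "Success": set(),
    "Declined": set(),
}


class InvalidTransition(Exception):
    """Raised when an update would be invalid or would regress the state."""


def apply_transition(current: str, new: str) -> str:
    """Return the new state if the transition is valid, otherwise raise."""
    if new == current:
        return current  # repeating the same state is harmless
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new} is not allowed")
    return new
```

Routing every state update through one function like this gives a single place where regressive updates are rejected and can be logged.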
Fallback and Recovery Mechanisms
Resilient systems must implement fallback mechanisms to recover from failure scenarios.
This includes:
- Status API validation when webhook delivery fails or is delayed
- Reconciliation processes to detect and resolve inconsistencies
- Recovery flows for transactions in unknown states
All transactions must be eventually resolved to a final, authoritative state, even if initial processing fails.
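A reconciliation pass could look like the sketch below. It assumes local records are held in a dictionary and that `fetch_status` wraps a Status API lookup; both are placeholders for whatever storage and client the integration actually uses.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical final-state labels.
FINAL_STATES = {"Success", "Declined"}


def reconcile(
    local_states: Dict[str, Optional[str]],
    fetch_status: Callable[[str], Optional[str]],
) -> List[str]:
    """Align non-final local records with the authoritative state.

    Updates local_states in place and returns the ids that are still unresolved.
    """
    unresolved: List[str] = []
    for tx_id, state in local_states.items():
        if state in FINAL_STATES:
            continue  # already final; nothing to recover
        try:
            authoritative = fetch_status(tx_id)
        except (TimeoutError, ConnectionError):
            authoritative = None  # lookup failed; try again on the next pass
        if authoritative in FINAL_STATES:
            local_states[tx_id] = authoritative
        else:
            unresolved.append(tx_id)
    return unresolved
```

Running such a pass on a schedule, and alerting when the unresolved list does not shrink, is one way to ensure no transaction stays in an unknown state indefinitely.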
Isolation and Fault Containment
Failures must be contained to prevent cascading impact across the system.
A production-ready system should:
- Isolate transaction processing units
- Prevent a single failure from affecting unrelated transactions
- Ensure that retry or recovery mechanisms do not overload the system
This reduces the risk of:
- System-wide instability
- Amplification of failure conditions
- Loss of control over transaction processing
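Two common containment techniques, sketched under assumptions: exponential backoff with jitter to spread retries out, and a concurrency cap on recovery work. The names, limits, and delay values are illustrative and not taken from SPG guidance.

```python
import random
import threading
from typing import Callable


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full-jitter backoff: a random delay below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


class RecoveryLimiter:
    """Caps how many recovery jobs run at once so a burst cannot overload the system."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_run(self, job: Callable[[], None]) -> bool:
        """Run the job if a slot is free; return False when at capacity."""
        if not self._slots.acquire(blocking=False):
            return False  # at capacity: the caller reschedules instead of piling up
        try:
            job()
        except Exception:
            # Contain the failure to this job; a real system would log it here.
            pass
        finally:
            self._slots.release()
        return True
```

Rejecting work at capacity, rather than queueing it without bound, keeps a recovery storm from amplifying the original failure.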
Operational Observability of Failures
Failures must be observable, traceable, and diagnosable.
The system must allow:
- Identification of failed or inconsistent transactions
- Correlation of failures across API calls, events, and internal processing
- Monitoring of retry and recovery behavior
Without proper observability, failure scenarios cannot be effectively managed or resolved.
See F.5 Logging, Monitoring, and Observability.
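Correlation is easier when every log line carries the same transaction identifier. A minimal sketch using Python's standard `logging` and `json` modules follows; the logger name and field names are hypothetical.

```python
import json
import logging

logger = logging.getLogger("spg.integration")  # hypothetical logger name


def log_event(
    transaction_id: str,
    stage: str,
    outcome: str,
    attempt: int = 0,
    **fields: object,
) -> str:
    """Emit one structured log line keyed by transaction id and return it."""
    record = {
        "transaction_id": transaction_id,
        "stage": stage,      # e.g. "api_call", "webhook", "status_lookup"
        "outcome": outcome,  # e.g. "unknown", "technical_failure"
        "attempt": attempt,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

For example, `log_event("tx-1", "status_lookup", "unknown", attempt=2, http_status=504)` produces a line that can be filtered by `transaction_id` across API calls, events, and internal processing.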
Final Consideration
Operational resilience is not achieved by preventing failures, but by ensuring correct behavior despite failures.
A production-ready integration guarantees that:
- Failures do not compromise transaction integrity
- All transactions reach a consistent and final state
- Retry and recovery mechanisms are safe and controlled
- System behavior remains deterministic under uncertainty
Systems that do not explicitly handle failure conditions will produce inconsistent outcomes, even if they function correctly under ideal conditions.