Reliable services/processes


In this post I look at how these concepts can be applied, and how I have applied them in my engineering projects. I would recommend reading this blog first.

There are a few key dimensions to look at when building reliable platforms:

  • Scalability and performance: the ability to scale different processes, or the need for single processes

  • Reliability of processing: this is a wide topic that covers areas like:

    • Transactionality (atomicity, consistency, isolation)

    • Guaranteed processing and persistence (idempotent processing if applicable)

  • Stateful/durable vs. stateless services: how to decide which is which?

  • Reliability in inter-process communication, covering:

    • Cross-process transactionality (or compensating transactions) and recoverability of an inter-process "business transaction". A "business transaction" is a transaction visible to the user (like placing an order to buy a stock) which gets translated into many individual technical steps, including 1) checking instrument eligibility, 2) checking credit/margin limits, 3) price validations and 4) placing the order (limit, market, stop-loss, etc.). Inter-process communication will be discussed in a separate blog.

While focusing on how to achieve these dimensions, we need to look at a few architectural considerations.

What is a (reliable) process?

  • The update must be transactionally safe (ACID). I am not going to elaborate on these as they are very well explained here

    • i.e. must be atomic, consistent, durable

  • Should be thread-safe to avoid any inconsistent updates to state when invoked by concurrent processes.

  • Processing should be guaranteed

  • Should handle exceptions gracefully (with full commit or rollback)

How do we design reliable (simple) processes?

A process can be any "self-sufficient" piece of software that performs a set of functions, spanning from a simple OS-level *nix process like "ls"… to a high-end process like "file explorer" in Windows. A process can run as a user-initiated process or a daemon (background) process. I am a fan of Unix-style processes that can be chained (">", "|") with each other to create a "workflow" of processes. The robust IPC mechanisms in Linux/Unix, comprising signals, pipes and sockets, provide a simple, elegant yet "composable" process architecture. This approach can be extended to the design of "business" processes as well.
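As a small illustration of that composability from the JVM world, here is a minimal Java 9+ sketch (my example, not from the original post) that chains two OS processes the way a shell pipeline does, using the standard ProcessBuilder.startPipeline API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;

public class PipelineDemo {
    public static void main(String[] args) throws Exception {
        // Equivalent of the shell pipeline: ls | grep .java
        List<Process> pipeline = ProcessBuilder.startPipeline(List.of(
                new ProcessBuilder("ls"),
                new ProcessBuilder("grep", ".java")));

        // Read the output of the last process in the chain.
        Process last = pipeline.get(pipeline.size() - 1);
        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(last.getInputStream()))) {
            out.lines().forEach(System.out::println);
        }
        last.waitFor();
    }
}
```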

I have burned my hands a few times in the past (yes, we learn from mistakes, sometimes the hard way): adding too much complexity exponentially hurts the supportability, maintainability and also the reliability of a system. Hence I always try to keep it KISS!

There should be a set of simple answers for every complex problem!

Below are a few aspects I consider when designing an app/service:

  • Service decomposition / services architecture

    I intend to publish a later blog post on how I have effectively used domain-driven architecture in modelling complex business problems, spanning:

    • Trading and risk management

    • Settlement / clearing and margin management

    • as well as non-financial-services domains

    The right domain-driven service decomposition covers:

    • What functionality each service performs, keeping in mind the single-responsibility principle at the architecture level (of course the same principle applies at class level and across the wider service architecture as well), where a single service only does one thing or handles one set of domains. For example:

      • A credit check service only checks the credit limits of a user/entity

      • A trade booking service only deals with capturing a trade

      • A pricing service carries out the pricing of a trade/position, etc.

    • How these services interact with each other, ensuring:

      • The safety and correctness of the overarching "business process" (like placing a buy order) from the end user's perspective, and how we manage transactionality across the many underlying services, which may include:

        • validating the customer limits

        • verifying the eligibility of the order and trade compliance (jurisdiction, black lists)

        • placing the order and taking payment (e.g. buying at the limit price), etc.

      • And how failure at each point should result in:

        • a replay or a compensating transaction

        • or recognising a point of no return (i.e. taking payment as the last step)

      • There are different architecture patterns, Orchestration (like in an orchestra) and Choreography (like in the filming of a movie), that we can use to design a robust architecture

    • How can the unavailability of dependent services be managed?

      • Given the distributed nature of today’s applications, thinking in line with the CAP theorem is critical! There are many questions to ask, and design patterns to use, when there is a failure in processing or in the service itself:

        • Is the service idempotent (i.e. does calling it multiple times with the same input give the same result)?

        • Can the service handle replay? i.e. if the service goes down and comes back up, can it process the pending items correctly and completely?

        • Is the processing guaranteed once (and only once)? Guaranteed at most once? Can it handle lossless processing?

        • Can (or can't) it handle duplicate requests correctly?

        • Is the operation irreversible?

      • All of these can form part of the service guarantees offered by a service (beyond its functional syntax, semantics and handshake). In a future blog I will talk about "making APIs resilient".
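To make the idempotency and duplicate-handling questions concrete, here is a minimal, illustrative sketch (the class and method names are mine; a real service would persist the dedupe map in a durable store) of a service that deduplicates retried requests using a caller-supplied request ID:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentOrderService {
    // In-memory stand-in for a durable idempotency store.
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    // The caller supplies the same requestId on every retry/replay.
    public String placeOrder(String requestId, String orderPayload) {
        // computeIfAbsent runs the processing at most once per requestId;
        // duplicate calls simply get the originally computed result back.
        return processed.computeIfAbsent(requestId, id -> doPlaceOrder(orderPayload));
    }

    private String doPlaceOrder(String orderPayload) {
        // ... validate limits, check eligibility, book the order ...
        return "ACK-" + orderPayload.hashCode();
    }
}
```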

  • Thread safety and concurrency safety of each service

One of the many pitfalls I have seen developers fall into is trying to blindly improve throughput by adding threads to a process without checking the points below (a sketch of the failure mode follows):

  • Is there any shared state that can be overwritten by concurrent execution (i.e. global and static variables)?

  • Is there any shared state that needs synchronised access, which will actually limit the parallelism (like semaphore/mutex-guarded access to a resource)?
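A tiny demonstration of the first point: four threads incrementing a shared static counter will almost always lose updates, because counter++ is a non-atomic read-modify-write:

```java
public class LostUpdateDemo {
    static int counter = 0; // shared mutable state: the hazard

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter++; // read-modify-write; concurrent updates get lost
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        // Expected 400000, but almost always prints less.
        System.out.println("counter = " + counter);
    }
}
```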

There are usually two fundamental approaches to handling concurrency (both are sketched in code after the list):

  • Shared-nothing designs

    • Local variables

      • Avoid global/static variables; instead use thread-local variables (e.g. ThreadLocal in Java). This is in line with "reentrant" design patterns, where only variables that live on the local stack / in function scope are used.

    • Immutable objects

      • "The state of an object cannot be changed after construction. This implies both that only read-only data is shared and that inherent thread safety is attained". Most functional-leaning languages and frameworks, Kotlin (JVM), Python, React, emphasise the need for immutability (create new objects instead of modifying existing ones)

  • Shared-state designs

    • Here state is shared, but with synchronised access. There are a few ways of implementing this:

      • Atomic operations (that cannot be interrupted by other threads/invocations), usually implemented via low-level operating system/CPU primitives (e.g. compare-and-swap on Linux)

      • Exclusive access to data by synchronising/serialising access using application-level semaphores/mutexes
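As a minimal, illustrative Java sketch of both styles (all names are mine):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrencyStyles {

    // Shared-nothing: each thread owns its own counter via ThreadLocal,
    // so no synchronisation is needed at all.
    static final ThreadLocal<long[]> perThreadCount =
            ThreadLocal.withInitial(() -> new long[1]);

    static void recordLocally() {
        perThreadCount.get()[0]++; // touches only this thread's state
    }

    // Shared-nothing: an immutable value whose state cannot change after
    // construction, so instances can be shared freely across threads.
    record Trade(String id, long quantity, double price) { }

    // Shared-state via an atomic operation: one counter shared by all
    // threads, updated with an uninterruptible hardware-backed increment.
    static final AtomicLong sharedCount = new AtomicLong();

    static void recordShared() {
        sharedCount.incrementAndGet();
    }

    // Shared-state via exclusive access: a mutex serialises the critical section.
    static long balance = 0;
    static final Object lock = new Object();

    static void deposit(long amount) {
        synchronized (lock) { // one thread at a time in here
            balance += amount;
        }
    }
}
```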

Shared-state designs are more complex and require more thought, as improper implementation can result in deadlocks and resource starvation (being locked out); hence most modern development tends to use "shared nothing" design patterns. I myself have done a few shared-state designs in trading, execution and order management using high-performance non-blocking concurrent designs like LMAX, which are worth doing for very low-latency systems (compared to a shared-nothing setup, which brings its own overhead in object management).

But if your application doesn't have strenuous performance requirements, I would recommend adopting shared-nothing / immutable state/object designs.

  • Determining whether a process is Stateful vs. Stateless

In the early days of web application development we used to hold information (like the shopping cart) in the HTTP session and use sticky load balancing to route the same HTTP session to the same node/JVM handling that user session. Termination/crash of the browser or the back-end node would result in losing that transient data. Alternatively, using a "backing" store like a database or a cache would address the transient nature of the data. This is even more important when a "user state" needs to straddle multiple devices (i.e. mobile and desktop devices; I am a Spotify fan, and the platform nicely manages the state of what I am listening to across my mobile, laptop and smart TV Spotify apps so I can continue/swap between devices). Of course, given the diversity of end-user devices, the cross-device state needs to be managed in a backing service.
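As a hedged sketch of that idea: the service below keeps no session state in memory; the cart lives in a backing store keyed by user ID (an in-memory map stands in here for a real database or cache), so any instance behind the load balancer, and any device, can serve the same user:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class CartService {
    // Stand-in for a shared backing store (database/cache).
    private static final Map<String, List<String>> backingStore =
            new ConcurrentHashMap<>();

    public void addItem(String userId, String item) {
        backingStore.computeIfAbsent(userId, id -> new CopyOnWriteArrayList<>())
                    .add(item);
    }

    public List<String> getCart(String userId) {
        return backingStore.getOrDefault(userId, List.of());
    }
}
```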

The emergence of cloud-native services with high disposability requires services to be:

  • Small, so that the blast radius (impact of failure) is small

  • Quick to recover: start-up time and recovery from a crash should be a matter of seconds/minutes

This of course brings additional complexity:

  • Scalability/performance of the backing service

  • Complexity in inter-process communication and in managing the "transactionality" across services

In stateless services I have tried to optimise resilience by:

  • Reducing the IO impedance and IO chatter in each service, i.e. reducing data store reads and writes, and reducing the number of distinct persistence transactions

  • Making each service responsible for only a subset of the data model (e.g. the trades tables) so that there are no cross-service data updates resulting in race conditions. More on "domain-driven" architecture in a later post.

  • Each service ideally receiving a fully formed request (e.g. with enriched reference data) to avoid multiple data look-ups

  • Offloading transaction isolation (optimistic or pessimistic) to the underlying data store (e.g. Oracle, MS SQL, etc.)

  • Full recovery vs. partial recovery: I used a "baton pass" concept, like in a relay race (sketched in code further below), where:

1. Read from the source / consume the event.
2. Process the data and perform the logic.
3. Commit to the destination in full (the destination can be a durable message queue, a sync consumer with a sync response, or a database that commits the data).
4. Commit back to the reader (send a successful response, or, if you read from a durable store like a message queue, commit the state).

If there is a failure in step 2 or 3, you can:

  • Roll back step 1 so that the sender can:

    • make the same call again (replay)

    • or re-deliver the message (as done by many messaging infrastructures)

If there is no failure, step 4 confirms back to the caller that the request has been fully processed.

  • The sender can then move its delivery pointer to the next item (depending on the implementation)

If step 4 fails (and the caller retries), the service should have logic to either:

  • perform de-duplication (avoiding double counting, e.g. using a time-sensitive UUID)

  • or perform idempotent processing (depending on the functional logic of the service)

This makes the process under consideration "transactionally" safe for the caller.
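Putting the baton-pass steps together, here is a minimal, illustrative sketch (the Source/Destination interfaces are hypothetical stand-ins for a message queue and a downstream store; a production version would commit the dedupe record atomically with step 3):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BatonPassConsumer {

    interface Source { Message poll(); void commit(Message m); void rollback(Message m); }
    interface Destination { void commitInFull(String requestId, String result); }
    record Message(String requestId, String payload) { }

    private final Source source;
    private final Destination destination;
    // In-memory stand-in for a durable dedupe store keyed by request ID.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    BatonPassConsumer(Source source, Destination destination) {
        this.source = source;
        this.destination = destination;
    }

    void runOnce() {
        Message m = source.poll();                      // step 1: read/consume
        if (m == null) return;
        try {
            if (processedIds.add(m.requestId())) {      // skip replayed duplicates
                String result = process(m.payload());   // step 2: business logic
                destination.commitInFull(m.requestId(), result); // step 3
            }
            source.commit(m);                           // step 4: baton passed back
        } catch (Exception e) {
            processedIds.remove(m.requestId());         // not fully processed
            source.rollback(m);  // steps 2/3 failed: sender can replay/redeliver
        }
    }

    private String process(String payload) { return payload.toUpperCase(); }
}
```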

Transactionality of services

Coming from financial services technology, I know losing data is deadly! From the beginning, the need for ACID was ingrained in our architectures. In a distributed architecture, each service has the responsibility to:

  1. Handle/process the full request, not a partial one (i.e. never leave part of the transaction committed and the rest rolled back)

  2. Inform the caller deterministically that the request either succeeded in full or failed in full.

    1. If it failed, fully roll back from the backing store

  3. Avoid two-phase/distributed transactions, as:

    1. they require services to understand cross-device (say Oracle DB plus a Kafka queue) transactionality and device libraries, breaching the design pattern of "facading" out the backing store differences

    2. they cause performance degradation / state locking

  4. Support compensating transactions that undo the effects of the steps in the original operation (see the sketch after this list).

  5. Or support de-duplicated processing, so that an intermediate failure/rollback on the first attempt can be handled properly by de-duplicating the same request using a backing store. E.g. in the baton-pass logic above, a failure in step 4 would result in the caller making a duplicate call, which needs to be handled by steps 2/3.
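A hedged sketch of compensating transactions over the buy-order flow described earlier (every step and compensation here is illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BuyOrderSaga {

    // A saga step pairs an action with the compensation that undoes it.
    record Step(String name, Runnable action, Runnable compensation) { }

    static boolean execute(Step... steps) {
        Deque<Step> done = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action().run();
                done.push(step);            // remember for potential undo
            } catch (Exception e) {
                while (!done.isEmpty()) {   // undo completed steps in reverse
                    done.pop().compensation().run();
                }
                return false;               // deterministic full failure
            }
        }
        return true;                        // deterministic full success
    }

    public static void main(String[] args) {
        boolean ok = execute(
            new Step("reserve credit limit",
                     () -> System.out.println("limit reserved"),
                     () -> System.out.println("limit released")),
            new Step("place order",
                     () -> System.out.println("order placed"),
                     () -> System.out.println("order cancelled")),
            new Step("take payment",        // point of no return: last step
                     () -> System.out.println("payment taken"),
                     () -> { throw new IllegalStateException("irreversible"); }));
        System.out.println(ok ? "business transaction committed" : "compensated");
    }
}
```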

  • Horizontal scalability vs. Singleton services

If a service is stateless, it is easy to scale horizontally. A proven load-balancing strategy (round robin, sharding based on a striping algorithm, load-based) can be utilised to make effective use of the available instances. Sometimes a service has to be a singleton as well; one example is a stock-status check that is called by multiple order-placement services. To enforce consistency and accuracy of the counts, the service needs to use a locking mechanism (easier if it uses the backing store's MVCC, avoiding the usual pitfalls of transaction isolation like dirty reads, non-repeatable reads, phantom reads and lost updates).
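As a minimal sketch of what such locking can look like, here is an optimistic compare-and-set on a versioned snapshot (names are mine; a backing store's MVCC gives you the equivalent behaviour):

```java
import java.util.concurrent.atomic.AtomicReference;

public class StockStatus {

    // Immutable snapshot of the state plus its version.
    record Snapshot(long version, int availableUnits) { }

    private final AtomicReference<Snapshot> state =
            new AtomicReference<>(new Snapshot(0, 100));

    // Reserve units optimistically: re-read and retry if another caller
    // changed the state since we read it (no lost updates).
    public boolean reserve(int units) {
        while (true) {
            Snapshot current = state.get();
            if (current.availableUnits() < units) {
                return false; // not enough stock: deterministic failure
            }
            Snapshot next = new Snapshot(current.version() + 1,
                                         current.availableUnits() - units);
            if (state.compareAndSet(current, next)) {
                return true;
            }
            // CAS failed: another thread won the race; loop and retry.
        }
    }
}
```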

But if we have to scale out the singletons or backing stores, one option is sharding/striping using a deterministic algorithm. This requires the sharding to work at the store level (to avoid the cross-cluster index / data block sync overhead that is prevalent in RDBMS architectures) with a 100% "shared nothing" architecture, which most NoSQL databases provide.

I have used NoSQL DBs like MongoDB and MarkLogic to massively scale backing stores, as well as functionally singleton services, using business-correct striping algorithms (based on account ID, counterparty ID, stock/instrument ID) where (a routing sketch follows the list):

  • Each shard/stripe has its own backing store

  • There is no inter-communication/sync between these different backing stores

  • The shard routing to the correct backing store was done by common business services which handled all requests. (As these are stateless, the time spent on processing was small; hence, by using a thread-local architecture or a single async loop as in Python, it was easy to use a single service per domain to perform the functionality and offload the persistence to EACH shard.)
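A minimal sketch of that deterministic routing (the store interface is illustrative):

```java
import java.util.List;

public class ShardRouter {

    interface BackingStore { void save(String key, String value); }

    private final List<BackingStore> shards; // one independent store per shard

    ShardRouter(List<BackingStore> shards) {
        this.shards = shards;
    }

    // Deterministic routing: the same account ID always lands on the same
    // shard, so there is no cross-shard sync and no cross-shard race.
    private BackingStore shardFor(String accountId) {
        int index = Math.floorMod(accountId.hashCode(), shards.size());
        return shards.get(index);
    }

    public void saveTrade(String accountId, String trade) {
        shardFor(accountId).save(accountId, trade);
    }
}
```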
