Addressing scalability and performance in complex enterprise platforms

Context setting

“Computer science is fundamentally about what we do with data: processing it, moving it from one place to another, or storing it.” Therefore, increasing the capacity and performance of a platform can be approached through five principles:

  1. How can we improve the processing speed of a data unit?

  2. How can we improve the transmission speed of a data unit?   

  3. How can we reduce (optimise) the amount of data moved?

  4. How can we reduce (optimise) the amount of processing?

  5. How can we reduce the number of data hops?   

When there is a scalability, performance or reliability problem, I look at the above dimensions to assess how well the system is architected, how well the engineering is done and how efficient the business processes are. It is rarely just a technology problem; it is usually a mix of technology and business/process deficiencies that brings a system to its knees.

I was once handed a very large platform that served 80+ global markets (hence 24-hour, 6+ days a week uptime), with more than 1,000 processes, 1,000+ integration points, 1,000+ users, a codebase beyond 10M LoC and an eight-digit Euro annual running cost, where a failure would create systemic risk in a critical area of the financial services industry.

The structured approach we adopted and executed improved capacity by 10x, improved resilience by 40x and halved the running costs within two years.

Previous teams had struggled to figure out what to do:

  1. Where do we start?

  2. Can infrastructure solve it?

  3. Do we have to re-write the whole platform?

In such a situation, unless a structured technical approach is adopted, the whole endeavour can fail.

In the write-up below I try to dissect and elaborate on what was done, without exposing the actual application details or some of the more complex solutions. Anyone interested in those can DM me.

Approach?

In order to achieve sustained benefits through a continuous improvement programme, the approach adopted should be able to:

  • Identify the biggest bottlenecks

  • Formulate solutions and estimate the effort for remediation

  • Rank the solutions based on their impact on capacity limitations, the coverage of that impact (the more components that benefit, the better), effort/cost and risk.

  • Ensure the front-to-back capacity is balanced when the solutions are deployed, instead of providing skewed/unbalanced capacity across different components.

  • Continue the iterative process with the next level of remediation items until the overall capacity targets are met. Each subsequent iteration is driven by the results and observations of the previous one, fed back into the start of the cycle.

The figure below shows the structured approach I set up to analyse this very large problem.

[Figure: the structured capacity remediation process]

Let's discuss each step.

Step A: Understand the Platform

When tackling a complex platform, unless a methodical approach is adopted, achieving the required front-to-back capacity becomes very difficult. To address the five principles discussed earlier, the first step is to assess and understand the different capacity-impacting dimensions and their impact. The figure below shows the three dimensions that need to be understood.

[Figure: the three capacity-impacting dimensions]

These dimensions can be further broken down by asking questions such as:

  • How efficiently is data processed, stored and transmitted?

  • Which dimensions and levers drive the platform into its constraints?

  • What are the bottlenecks and what are their drivers? Which layer of the architecture (network, storage, business processing, UI, or a combination of them) is hitting its peak?

I have further dissected these dimensions as shown below.

[Figure: breakdown of the understanding dimensions]


A.1. Understanding the Architecture and its Impact on Capacity

Poor architecture is the main reason a platform fails. Therefore, it is important to understand the detailed application architecture first. For example, this system was primarily a pseudo event-driven architecture with multiple logical components (1,000+) interacting via messaging. Each component polls for work through database work tables and provides results to subsequent processes by writing to a work-out table. This ‘heavy data movement’ architecture, likely inherited from older mainframe designs, creates significant stress on processing and IO, requiring additional infrastructure resources. The proliferation of hundreds of such processes continuously polling for work aggravates the situation. This is compounded by significantly large, de-normalised data models used to pass data across these processes, creating further strain on database IO, messaging and storage.
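To make the pattern concrete, here is a deliberately simplified, hypothetical sketch of the work-table polling loop described above (table names, columns and connection details are illustrative, not the actual platform's):

```java
// Hypothetical sketch of the work-table polling pattern. With 1,000+ components
// each running a loop like this, the database absorbs constant polling queries,
// row-by-row writes and commits even when little business work is happening.
import java.sql.*;

public class WorkTablePoller {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/svc", "app", "secret")) {
            con.setAutoCommit(false);
            while (true) {
                try (PreparedStatement poll = con.prepareStatement(
                        "SELECT id, payload FROM work_in WHERE status = 'NEW' FETCH FIRST 100 ROWS ONLY");
                     ResultSet rs = poll.executeQuery()) {
                    while (rs.next()) {
                        String result = process(rs.getString("payload")); // business logic placeholder
                        try (PreparedStatement out = con.prepareStatement(
                                "INSERT INTO work_out (id, payload) VALUES (?, ?)")) {
                            out.setLong(1, rs.getLong("id"));
                            out.setString(2, result);
                            out.executeUpdate();   // one IO round trip per unit of work
                        }
                        con.commit();              // one commit (log flush) per unit of work
                        // (marking the work_in row as processed is omitted for brevity)
                    }
                }
                Thread.sleep(500);                 // poll interval: IO churn even when idle
            }
        }
    }

    private static String process(String payload) { return payload; }
}
```

Multiply this loop by hundreds of components and the polling queries, per-row inserts and per-row commits become a capacity driver in their own right, independent of business volume.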

A.2. Different facets of Capacity drivers

Capacity has different facets, depending on the type of data and the functions performed on that data. This becomes complex when common infrastructure is shared across different non-functional behaviours, where the parameters affecting performance differ from one component to another. The diagram below shows several examples.

[Figure: examples of capacity drivers]

A.3. Understanding the type of Bottlenecks

It's a common misconception that adding more CPU will improve the performance of a platform. CPU (or processing capacity) is only one of the bottlenecks that can constrain a platform. There are three types of bottlenecks that need to be understood and remediated.

  • CPU Bound

This is evident when the CPUs are almost fully utilised, meaning there is not enough user CPU time available for processing. This applies to any layer of the technology architecture: database, integration and application components.

  • IO Bound

This is evident when the CPU spends significant time in IO wait (rather than execution). IO bottlenecks can be memory-related, disk-related and/or network-related, arising from inter-process communication or from persistence interactions with remote storage and/or a database. (A simple way to distinguish CPU-bound from IO-bound behaviour is sketched after this list.)

  • People/Process Bound

Sometimes, even though a technical improvement can be made, the way functionality is implemented, the way processes are defined or the way users utilise the platform may be inefficient, negating any capacity improvement. For example, in this system it was a fundamental design decision to proactively create exceptions and keep them open until they were closed at end of day.
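A quick way to tell the first two bottleneck types apart on a Linux host is to look at how CPU time splits between useful work and IO wait. A minimal sketch (assuming Linux and Java 11+, reading the kernel's /proc/stat counters):

```java
// Minimal sketch: sample the aggregate CPU line of /proc/stat twice and report
// how time was split. A high iowait share with idle headroom points to an
// IO-bound component; a saturated user+system share points to a CPU-bound one.
import java.nio.file.*;

public class CpuVsIoWait {
    public static void main(String[] args) throws Exception {
        long[] a = sample();
        Thread.sleep(5_000);
        long[] b = sample();
        long user = b[0] - a[0], nice = b[1] - a[1], sys = b[2] - a[2];
        long idle = b[3] - a[3], iowait = b[4] - a[4];
        long total = user + nice + sys + idle + iowait;
        System.out.printf("busy=%.1f%% iowait=%.1f%% idle=%.1f%%%n",
                100.0 * (user + nice + sys) / total,
                100.0 * iowait / total,
                100.0 * idle / total);
    }

    // First line of /proc/stat: "cpu user nice system idle iowait irq softirq ..."
    private static long[] sample() throws Exception {
        String[] f = Files.readAllLines(Path.of("/proc/stat")).get(0).trim().split("\\s+");
        return new long[]{Long.parseLong(f[1]), Long.parseLong(f[2]),
                Long.parseLong(f[3]), Long.parseLong(f[4]), Long.parseLong(f[5])};
    }
}
```

The same split is visible in standard tools (vmstat, iostat, AWR reports); the point is to measure where the time goes before deciding whether more CPU, better IO or a process change is the right lever.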

A holistic approach requires one to understand:

  1. What types of bottlenecks exist

  2. In which layer(s) of the architecture they exist

[Figure: bottleneck types across architecture layers]

Differentiating the Observation from the Root Cause

One common mistake is jumping to early conclusions when identifying the cause. The SLA breach you observe at the end of a flow may not be where the cause lies.

After taking all the above into consideration, a front-to-back view needs to be obtained.

Understanding the performance impacting dimensions is essential for one to answer these questions:

  1. What is the capacity limit of each component?

  2. What functionality is limited by this capacity?

  3. When does the capacity driver affect the platform (trades, positions, intraday vs. daily trades, etc.)?

  4. Which layer is bottlenecked?

  5. What is the reason for the bottleneck?

… and finally

What can be the solution?

Step B: Creating Solutions

Up to now, the process has covered understanding the platform and its relation to the different facets of capacity. Now let's look at how the different solutions were formulated. Irrespective of the solutions undertaken, they should address one or more of the five principles discussed in the context-setting section above.

The types of solutions can be driven by several factors, namely:

1)      How soon does the capacity remediation need to be delivered?

2)      How risky is the approach?

3)      How much effort/money do we expect to spend on the solution?

4)      How sustainable is the solution (i.e. improving efficiency vs. a stop-gap)?

In this system, a hybrid approach was adopted, combining a top-down approach (addressing capacity by changing the architecture or design) with a bottom-up analysis, to provide consolidated growth in capacity. The prioritisation of the solutions was then based on the factors shown below.

[Figure: solution prioritisation factors]

B.1. Top Down Approach

The top-down approach is based on several key principles:

  1. Rather than focusing on just the immediate area, look at the front-to-back architecture to balance the capacity across the whole platform.

  2. Instead of creating point solutions, formulate more over-arching solutions (architectural/framework) that bring benefits across a wider landscape. This is informed by the input gathered from the system architecture study.

  3. Get a proactive view of relative capacity across the whole platform rather than of a particular component, and balance the solutions to achieve a balanced increase front to back.

  4. Focus on improving efficiency rather than simply adding headroom.

Several dimensions can be considered in the top-down approach, as depicted below:

[Figure: top-down solution dimensions]

B.1.1. Improve the Architecture

Changing the fundamentals gives wider and longer-lasting improvements, but it can be costly and time consuming. When tackling a complex platform it is essential to ensure that continuous improvements keep flowing through the platform and that the improvements are realised across all regions.

In this system, one of the key areas that required a capacity uplift as well as stability and simplicity was the front office trade capture functionality. The fragmented setup contributed to poor performance. It is essential to understand that capacity is not just about ‘fast processing’; it also means avoiding failures, improving straight-through processing (STP) and being able to recover quickly. Achieving this through re-engineering can be more effective than patching an old platform.

The key architecture simplifications implemented included:

  • Enforcing a single standard message format, wrapped in an API, to replace the N different formats in situ across front office applications

  • Reducing the number of integration hops and the additional processing (transformation, splitting, routing)

  • Validating at source within the API, which protects capacity by improving STP

  • Deploying a high-performance (>10x), resilient and highly available messaging setup running on appliance-based middleware

  • Reducing the additional hops and processing required to fan out messages, which had been compensating for deficiencies in the previous messaging product

  • Automating recovery and graceful exception handling

  • Protecting system capacity by avoiding manual intervention and downtime.

Death by Parallelisation

A common practice, quite evident in this system, is the proliferation of parallelism across all layers of the architecture: for example, parallelising application workers/threads, utilising the parallelisation features in Oracle, and parallelising messaging interactions (e.g. splitters, routers).

Several principles need to be adhered to when seeking opportunities to achieve parallelism.

  1. If the parallelism is blocked by a shared singleton, its benefit drops off drastically, in line with the law of diminishing returns (see the sketch after this list). The singleton can be as small as a synchronisation lock on a hash map or as large as a single-instance process in a product (e.g. the Oracle singleton Log Writer process).

  2. The throughput and capacity should be balanced across all architecture layers, i.e. there is little value in adding more application threads if their combined throughput cannot be met by the underlying database configuration.

  3. Hence, parallelism should only be exploited up to the point where every layer in the architecture can utilise its IO and CPU resources without jeopardising another layer.
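The first principle is essentially Amdahl's law (my framing here, not terminology from the original analysis): if a fraction s of the work is serialised behind a shared singleton, the speedup achievable from N parallel workers is capped at 1/s.

```latex
% Amdahl's law: speedup from N parallel workers when a fraction s of the work
% is serialised behind a shared singleton (lock, single-instance process, ...)
S(N) = \frac{1}{s + \frac{1 - s}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s}
```

For example, with only 10% of the work serialised behind a singleton (s = 0.1), no amount of additional threads, engines or parallel queries can deliver more than a 10x speedup.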

The degree of parallelism in the system was unbalanced. Therefore, additional care was taken when evaluating possible capacity improvements through increased parallelism.

Employ an appropriate partitioning strategy

When IO and CPU resources are shared across different flows, there is always a tendency for one flow to overwhelm the others. Therefore, when such shared resources exist, one must ensure that:

  1. Appropriate logical or physical resource partitioning is done based on the throughput expected of each flow

  2. If needed, the parallelism within each flow is increased.

The following scenario explains such a situation in one of the legacy components of the system.

Observation

  • The throughput of trade posting went down when the corresponding settlement events started coming in.

Cause

  • The technical flows for both the trade and settlement paths were analysed, including measuring the throughput at each step, and the bottleneck came down to a set of shared business engines performing the journal updates.

Solution

  • Partition and scale out the engines, allocating them specifically to trade posting and settlement posting, so that:

    • Engines are not shared across the two flows

    • Each flow has sufficient throughput to handle peak volumes.

Why?

  • The technical components upstream of these business engines had much higher throughput headroom, so the increased parallelism did not create any other resource contention. The downstream component, the database, also had more parallel capacity than the combined flows required. Hence the front-to-back balance of parallelism was not compromised.
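A minimal sketch of the partitioning idea (hypothetical class, flow names and pool sizes; the real engines were existing platform components configured per flow, not new code):

```java
// Hypothetical sketch: instead of one shared pool of engines consuming both trade
// and settlement work, each flow gets a dedicated, independently sized pool so
// one flow cannot starve the other.
import java.util.concurrent.*;

public class PartitionedEngines {
    public static void main(String[] args) {
        // Pool sizes would be derived from each flow's measured peak throughput.
        startEngines("trade-posting", 8);
        startEngines("settlement-posting", 4);
    }

    // One dedicated queue and thread pool per flow.
    private static void startEngines(String flow, int engines) {
        BlockingQueue<String> events = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(engines);
        for (int i = 0; i < engines; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String event = events.take();
                        System.out.println(flow + " journal update: " + event); // placeholder for real engine logic
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }
}
```

The sizing of each pool is the important part: it is derived from the flow's own peak volumes and from what the downstream database can absorb, so the partitioning preserves the front-to-back balance rather than just adding threads.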

B.1.2. Improve Frameworks

The understanding obtained from the architecture study helps in addressing over-arching improvements. In this case, CPU and IO overhead was significant due to the way data was transmitted across granular processes (i.e. polling, work-in/work-out tables). Adding CPU and improving the UI stack provide only limited (and costly) headroom. Improving the efficiency of these processes is more sustainable, and the following scenario explains such a situation.

Observation

  • Low throughput in cross-component communication, visible through the significant backlog building up in message channels

Cause

  • Significant IO waits due to the large number of commits/churn against databases and journals (state files)

Solution

  • Introduce batch commits in the C++ and Java work management components, while ensuring the platform can handle idempotent processing.

Why?

  • Reducing the number of commits reduces the rate at which IO is generated, which in turn reduces the overhead on the file system, database and even the messaging platforms. The resultant improvements provided performance gains beyond 10x.
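A minimal JDBC sketch of the batching idea (hypothetical table name and batch size; the real change sat inside the existing C++ and Java work management frameworks):

```java
// Hypothetical sketch: commit once per batch instead of once per message.
// The consumer must be idempotent, because a crash mid-batch means the whole
// batch is redelivered and re-applied.
import java.sql.*;
import java.util.List;

public class BatchCommitWriter {
    private static final int BATCH_SIZE = 500;   // illustrative value, tuned against IO capacity

    public void write(Connection con, List<String> messages) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO work_out (payload) VALUES (?)")) {
            int pending = 0;
            for (String msg : messages) {
                ps.setString(1, msg);
                ps.addBatch();
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();
                    con.commit();                // one commit (and one log flush) per batch, not per row
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();
                con.commit();
            }
        }
    }
}
```

The batch size is a trade-off: larger batches mean fewer log flushes but more rework on redelivery, which is why idempotent processing is a precondition for this change.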

B.1.3. Simplify Processes and Platform

Very simple and quick benefits can be achieved by just removing unnecessary processing. This can be done in the following ways.

Simplify Technical Implementation

The best solutions are, in general, the simplest solutions. A technology should be understood in terms of several factors:

  • The key patterns it is best suited for

  • What anti-patterns should be avoided and when it should not be used

  • Its pros and cons

Below are some scenarios.

Observation #1

  • The SLA for sending a report was breached

Cause

  • The receiving components got the files late due to the significant time spent in file transmission, compounded by the same unzipped file being sent twice to the same server!

Solution

  • Send only one file, zipped (so naive!)

Why?

  • Sending two large unzipped files is far more time consuming than sending one zipped file, even after adding the zip and unzip times!
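To see why, a purely illustrative back-of-the-envelope calculation (the figures are hypothetical, not the actual report sizes): assume a 10 GB report, a 5:1 compression ratio and an effective 1 Gbps link.

```latex
t_{\text{two unzipped}} \approx 2 \times \frac{10 \times 8\ \text{Gbit}}{1\ \text{Gbps}} = 160\ \text{s},
\qquad
t_{\text{one zipped}} \approx \frac{(10/5) \times 8\ \text{Gbit}}{1\ \text{Gbps}} = 16\ \text{s}\ (+\ \text{zip/unzip time})
```

Even with generous allowance for compression and decompression, removing the duplicate transfer and shrinking the payload wins by a wide margin.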

Observation #2

  • Publication of balances from the reporting platform to downstream consumers is breaching SLAs.

Cause

  • Unwanted processing steps, including:

    • Publishing a pre-generated c.800GB file into the messaging layer

    • Consuming the file and loading it into a staging table

    • The same source file being generated and consumed by another process!

Solution

  • Avoid publishing the large file to the messaging infrastructure

  • Simply populate the staging table from the corresponding source transaction tables

Why?

  • Effort was being spent transferring a huge file when the source and target were in the same platform sub-domain!

Remove Redundant Functions (Re-certify) and Dependencies

As a platform ages and housekeeping is not maintained, much unwanted functionality can be left lying around, eating up scarce platform resources. In this system, when contentious functionality was found, one of the simple things done was to evaluate the need for it and, where it was no longer required, simply decommission it. Such analysis paved the way to decommissioning high resource-consuming reporting jobs. Another flavour of this is re-defining the dependencies among different pieces of functionality; for example, removing an unwanted parent-child dependency between two schedules (e.g. in Control-M) when there is no data, integration or functional dependency between them. Capacity headroom was gained in one batch process of 100 jobs by removing a dependency on a parent job that had no data, integration or business-logic dependency, enabling the batch to start much sooner. Creating a DAG (directed acyclic graph) of dependencies helped in forming the right sequencing and parallelisation, as sketched below.
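A minimal sketch of the DAG idea with hypothetical job names: build the graph from real data/functional dependencies only, then derive the execution "waves"; jobs in the same wave can run in parallel.

```java
// Hypothetical sketch: topologically sort batch jobs by their *real* dependencies.
// Jobs whose prerequisites are all complete at the same point can run in parallel.
import java.util.*;
import java.util.stream.Collectors;

public class JobDag {
    public static void main(String[] args) {
        // job -> jobs it depends on (data/integration/business-logic dependencies only)
        Map<String, List<String>> deps = Map.of(
                "load_trades", List.of(),
                "load_settlements", List.of(),
                "post_journals", List.of("load_trades", "load_settlements"),
                "generate_report", List.of("post_journals"));

        // Kahn's algorithm: repeatedly release jobs with no unfinished prerequisites.
        Map<String, Integer> remaining = new HashMap<>();
        deps.forEach((job, d) -> remaining.put(job, d.size()));
        List<List<String>> waves = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> wave = remaining.entrySet().stream()
                    .filter(e -> e.getValue() == 0)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
            if (wave.isEmpty()) throw new IllegalStateException("cycle in job dependencies");
            waves.add(wave);
            wave.forEach(remaining::remove);
            for (String job : remaining.keySet()) {
                long done = deps.get(job).stream().filter(d -> !remaining.containsKey(d)).count();
                remaining.put(job, deps.get(job).size() - (int) done);
            }
        }
        System.out.println(waves); // e.g. [[load_trades, load_settlements], [post_journals], [generate_report]]
    }
}
```

Phantom parent-child links that exist only in the scheduler, not in this graph, are exactly the dependencies that can be removed to let the batch start sooner.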

Simplify Business Requirement

Contention can be introduced by an incorrect implementation or interpretation of functionality. The following scenario explains such a situation.

Observation

  • Providing matching breaks to Operations was delayed due to slowness in exception processing.

Cause

  • Exceptions/tasks were proactively generated before the street side of the trade had arrived, or even when it was not expected to arrive. When the street side arrived and matched, more than 99% of these pre-created exceptions had to be rolled back. This created redundant processing as well as a significant rollback backlog.

Solution

  • Stop generating pre-planned exceptions; instead, generate exceptions only when matching of both sides has completed, or when a particular SLA is breached for a particular exchange.

Why?

  • Redundant exception generation and the subsequent rollbacks slowed down the overall process and hindered actual exceptions being reported to Operations on time.

B.2. Bottom up Approach

Several approaches can be adopted here as well, but in principle this is about providing additional headroom, in either processing capacity or IO capacity, by focusing on specific bottlenecks.

The improvements can and should be looked at in each architecture layer of the platform: application, database, storage and integration.

The sections below articulate these by breaking them down into the different layers of the architecture, and each approach is analysed in terms of 1) observation, 2) cause and 3) solution.

The diagram below outlines some of the dimensions I considered.

[Figure: bottom-up solution dimensions]

B.2.1. Application Layer

In this platform, as almost all the business logic was implemented in the database, the thin application layer was mostly responsible for reading data from, and writing data to, the database. There were several areas where even this simple process carried significant overhead.

 Observation

  • The time spent between reading the data and publishing out a single message was too high.

Cause

  • The framework design stipulated expensive/poorly written message transformation logic that took more than 80% of the roundtrip time.

  • These transformations did not add any value to either downstream or upstream processes. This was aggravated by expensive XML marshalling/un-marshalling introduced during the height of XML adoption.

Solution

  • Direct reads and 1:1 publishing in a simple delimited format removed this transformation overhead.

  • Read IO was further improved by increasing the data buffering in the application's Oracle driver (i.e. the row pre-fetch size).

Why?

  • Simpler to change, test and deploy

  • Improved the throughput by 4x
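A minimal sketch of the simplified path (hypothetical table, columns and publish call): read with a larger Oracle row pre-fetch and publish records 1:1 in a delimited format, with no XML marshalling in between.

```java
// Hypothetical sketch of the simplified read-and-publish path: a larger JDBC fetch
// size means fewer network round trips per result set, and a plain delimited record
// avoids the XML marshalling/un-marshalling overhead entirely.
import java.sql.*;

public class DirectDelimitedPublisher {
    public void publish(Connection con) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT trade_id, account, amount FROM trades_out")) {
            ps.setFetchSize(1000);                       // Oracle row pre-fetch: fewer round trips
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String record = rs.getString("trade_id") + '|'
                            + rs.getString("account") + '|'
                            + rs.getBigDecimal("amount");
                    send(record);                        // placeholder for the messaging publish call
                }
            }
        }
    }

    private void send(String record) { System.out.println(record); }
}
```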

B.2.2. Database Layer

There can be many options for addressing database layer bottlenecks. In addition to quick infrastructure upgrades, we carried out some key solutions (which in fact comprised around 100 tasks).

Three such examples are explained below, covering different dimensions.

Observation #1

  • Significant CPU and IO overhead in Key Databases from specific PL/SQL code

Cause

  • Poorly designed/written SQL procedures consuming too much CPU and IO

Solution

  • Select the worst-performing SQL with respect to IO and CPU utilisation and improve it without affecting functionality. Patterns adopted included reducing physical IO, improving execution paths (indexes), and effective, controlled usage of query parallelism.

Why?

  • Rather than adding further headroom, reducing IO and CPU churn is longer lasting and hardware-sympathetic.

Observation #2

  • IO bottlenecks were exposed after the CPU upgrade!

Cause

  • Solving the CPU bottleneck opened up the next level of bottleneck in IO, mainly due to the high volume of consistent IO writes, some of which stemmed from the fundamental architecture of the platform.

Solution

  • Improve the IO stack by configuring Oracle Direct IO for the Log Writer.

Why?

  • The singleton Log Writer's performance was the immediate bottleneck in the IO stack. Addressing it paved the way for further planned work on tuning IO interactions across the OS IO stack.

Observation #3

  • Intraday holding journal data from the transactions table appeared in the reporting database with a significant delay (e.g. 3-4 hours), preventing Operations from getting an up-to-date view and resulting in operational, reputational and regulatory risks.

Cause

  • Replication from the positions and balances OLTP database to the reporting platform was hindered by the inefficient configuration of the replication tool, GoldenGate.

Solution

  • Improve the throughput by running parallel replication threads from the source to the target database.

Why?

  • There were no comparable deficiencies in redo log generation at the source or in applying the changes at the target, so the replication path itself was the limiting factor.

B.2.3. Messaging Layer

As the platform has 1,000+ messaging points hosted on 30+ messaging server instances (i.e. channel servers), there was significant contention in the integration layer. Following are some scenarios and solutions.

Observation #1

  • Significant throttling in processing trades to generate confirmations.

Cause

  • Even though the upstream process submitted trades at a much higher rate, the confirmation manager was unable to consume them at the same rate due to significant IO contention in the messaging servers.

Solution

  • Simplify the integration flow by removing additional messaging integration steps, thus reducing IO contention.

Why?

  • The additional steps created more IO contention (read/write/log) for the same effective business events; removing the unwanted steps removes that additional IO.

Balancing the workload across flows

Even though this is not specific to messaging, it is essential that the platform's ability to handle load is balanced across all business flows. The scenario below explains one example.

Observation

  • One of the front office integration flows exhibited a lower throughput compared to another similarly loaded flow.

  • The integration set up was identical.

Cause

  • The flow with the lower throughput had messaging servers (in this case channel servers) that were overloaded because they were used at several different steps in the integration flow, putting significant back pressure on them.

Solution

  • Re-balance the channels under load so that a dedicated channel server is allocated to each step, avoiding one server being used across different processes.

Why?

  • Each server instance has an optimal thread and journal configuration. An excessive number of transactions puts more IO contention and context switching on each channel server instance.

B.2.4. Improving Stability

Capacity is constrained not only by how much work can be done and at what speed, but also by the platform's ability to remain resilient and available while processing high volumes. Improving capacity can improve the stability of the platform and, conversely, improving stability can improve capacity. Below is one example.

Observation

  • SLA Breach in critical batch job

Cause

  • Daily manual intervention of up to 5 hours to fix issues in the suite of Control-M jobs, due to incorrectly implemented functionality!

Solution

  • The obvious solution of incorporating the correct functionality into the sub-steps that were failing.

Why?

  • This not only wasted production support personnel's time but also impacted a key SLA!

B.2.5. Interplay between IO and Processing bottlenecks

There can be situations where the observation points towards a processing capacity limitation whereas the real cause is IO contention, and vice versa.

Observation

  • The system CPU utilisation of the server running the messaging infrastructure was significantly high.

Cause

  • Multiple messaging server instances were continuously invoking system calls, especially IO traps to obtain handles for journalling the messages. This was because the write-ahead log size was small, resulting in a continuous need for system IO calls to write to disk.

Solution

  • Avoid the IO contention by increasing the write-ahead log size to reduce the frequency of IO calls, while ensuring the memory required (to cache the logs) does not compromise the available resources.

Why?

  • The IO contention manifested as visible CPU contention. Adding more CPU would not have solved the problem in the long term.
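The principle can be illustrated with a hypothetical journal writer (this is not the actual appliance configuration, which was a product setting): flushing once per large buffer instead of once per message makes far fewer write system calls for the same volume of data.

```java
// Illustrative only: a journal that stages messages in a large user-space buffer
// and pushes them to the OS in bulk trades a little memory and latency for far
// fewer system calls, lowering system CPU as well as IO overhead.
import java.io.*;

public class BufferedJournal implements Closeable {
    private final BufferedOutputStream out;

    public BufferedJournal(File file, int bufferBytes) throws IOException {
        // e.g. a 4 MB buffer instead of an unbuffered stream that traps into the
        // kernel on every small write
        this.out = new BufferedOutputStream(new FileOutputStream(file, true), bufferBytes);
    }

    public void append(byte[] message) throws IOException {
        out.write(message);          // usually stays in the user-space buffer
    }

    public void sync() throws IOException {
        out.flush();                 // pushes the buffered bytes to the OS in one write call
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```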

Step C: Ranking and Prioritisation

Up to now the process addressed two aspects:

  1. Understand the platform and its relative and absolute bottlenecks

  2. Understand the architectural (over-arching) and point solutions to remediate them

The next step is ranking them based on several criteria:

  1. Which solutions give the highest uplift?

  2. Which solutions are comparatively simpler, i.e. less invasive, such as a simple hardware uplift?

  3. The risk of the change. This can be based on the:

    1. # of business flows/processes impacted

    2. Significance in deployment

    3. Significance in testing.

  4. How well the benefit balances the front-to-back improvements. For example, the capacity increase in a particular release should be balanced (as much as possible) across all functions rather than skewed towards a few.

Based on the priority, a book of work (BoW) can be created and a set of solutions can then be allocated to each release in the year.
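One way to make such a ranking explicit (an illustrative heuristic, not the exact scheme used on this programme) is a weighted score per solution, with Impact, Coverage, Effort and Risk each estimated on a normalised 0-1 scale:

```latex
\text{Score} = w_1 \cdot \text{Impact} + w_2 \cdot \text{Coverage} - w_3 \cdot \text{Effort} - w_4 \cdot \text{Risk}
```

Solutions are then allocated to releases in descending score order, subject to the front-to-back balancing constraint in point 4 above.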

Step D: Deliver Solutions

Based on the ranking, the prioritised items were added to the year's BoW for delivery and planning. The solution formulation step discussed some of the areas, but the actual solutions are specific to the underlying platform and its architecture; the dimensions addressed there can be used as templates and areas for consideration.

 

Step E: Measure, Model and Re-baseline

After delivering the solutions it’s essential that:

  1. The improvements are measured

  2. Models are updated to reflect the changes

  3. The capacity parameters are re-baselined to reflect the improvements and/or the identification of new bottlenecks. New bottlenecks can surface in several ways:

    • A new type of bottleneck surfaces at the same point/layer of the architecture. For example, when the CPU bottlenecks were addressed in the trades database, subsequent IO bottlenecks were unearthed.

    • New bottlenecks surface in a different layer of the architecture. For example, when the bottlenecks in the databases were addressed, further bottlenecks were unearthed in the messaging/integration layer.

    • New bottlenecks surface at different functions/locations/points. For example, when the trade processing improvements were made, bottlenecks in further downstream processes, such as confirmations generation, surfaced.

Hint #1: It's very important to have an early view of the next level of bottlenecks. This can be achieved by observing the architecture's properties and understanding anti-patterns that would obviously limit capacity, or by measuring the next level of perceived bottlenecks, for example by extrapolating thread utilisation, IO rates or storage growth.

Hint #2: Another way to find the next level of bottlenecks is component-wise stress testing. This is useful when a predecessor component has lower throughput than its downstream components, so the true load cannot be injected downstream until the predecessor is remediated. Short-circuiting the flow and testing the downstream component directly can unearth the next level of bottlenecks in the flow.

Hint #3: Ensure the relative capacity limits of the platform are understood front to back.

In the front-to-back, top-down approach adopted for this system, the third hint above was addressed proactively by identifying the bottleneck and capacity of each point up front.

Subsequent iterations

Solving a platform's capacity problems cannot be achieved in a single cycle. Each release represents one cycle, and its results, once the achievements are confirmed, need to be used to re-baseline the limitations and hence the solutions.

Each subsequent release repeats the process:

  1. Formulating or ratifying the solutions based on the results

  2. Prioritise solutions

  3. Deliver solutions

  4. And again, measure the results to feed the subsequent release.

Lessons Learned

The capacity improvements were delivered in conjunction with other deliverables. Some of the key lessons learned during the past year are:

1)      Start from the simplest solutions

It's easier to uplift the infrastructure (increase the number of CPUs/cores, migrate to a higher-end server). This will not make the application run more efficiently, but it gives more headroom in the interim until subsequent releases bring in more efficiency through application improvements.

2)      Do not try to do everything in a single release

Even though there can be good arguments for doing most of the work in a single release (i.e. less overall testing effort), when remediating a complex platform it's more effective to deliver in shorter cycles. This will not only build confidence but also:

1)      Provide more insight into further issues that were not previously identified

2)      Allow solutions to be tuned and improved

3)      Improve the way capacity is tested and monitored.

3)      Be ready for surprises

When working on a complex platform there is no shortage of surprises. The lack of standardised design and implementation patterns, and more importantly poor development/design practices, surprised the team on various occasions. The intricate tentacles of proliferated business logic impacted some of the capacity solutions and significantly expanded the testing scope. This becomes more relevant when moving from less-invasive infrastructure changes to changes impacting frameworks and functionality. Hence, investing more time in up-front due diligence and impact assessment is imperative.

4)      Have a representative, stable performance testing environment

The testing environments were not representative of production as a result of poor maintenance: differences in infrastructure setup, a lack of monitoring and housekeeping, capacity-related constraints and incorrect configurations. This not only made comparative results harder to obtain but, more importantly, impacted stability and in turn delayed our testing.

5)      Automate realistic front to back test cases

With limited environment time, aggravated by a lengthy testing schedule, running comprehensive tests was virtually impossible. To achieve the most realistic results in limited time, the testing team moved away from unrealistic, isolated tests to testing actual business flows using representative production load patterns. This helped us move from extrapolating capacity from available headroom to benchmarking limits based on real performance testing.

6)      Ensure proper monitoring is in place

Again, the lack of production-equivalent monitoring at all the required points of the architecture in the testing environments made it hard to verify capacity gains. Ensuring the required points within the flows are monitored is essential. In some cases, real-time monitoring and capture of runtime behaviour gave significant insight into the dynamic shifts of IO bottlenecks between components and layers of the architecture.

I hope you enjoyed reading this blog. I have only managed to cover a few of the 100+ solutions we delivered, but I have tried to give a concise view of the overall approach and the fundamental thinking. Anyone wanting a more detailed technical and architectural discussion can DM me.
