Sunday, May 01, 2022

Audacious Goals

"Set a big future for yourself and use it as motivation to get up every day and get your work done."  — Elon Musk

"Move fast and break things." — Mark Zuckerberg

"Most people overestimate what they can do in one year and underestimate what they can do in 10 years." — Bill Gates

"A Big, Hairy, Audacious Goal, or BHAG, is a long-term, 10-to-25-year goal, guided by your company's core values and purpose." — Jim Collins and Jerry Porras in their book, Built to Last: Successful Habits of Visionary Companies

The above are some of the quotes that get thrown in my face, directly or indirectly, from time to time by the management team at my company.  Our director of QA loves to talk about how he takes his lead from the way Facebook creates their software.  This week was one of those weeks where someone stated a goal to one of our clients, and it was all I could do not to start screaming or crying - still not sure which would have been the appropriate response.

A little background.  We have a client who has gone to RFP to replace us for their core processing, along with any of our partner products.  They have been systematically mishandled by management, mostly by being promised things we then failed to deliver, so I understand why this has happened.  The issue is that this client is our 3rd largest by transaction volume and a leader among Credit Unions.  So yeah, huge panic by management, lots of meetings that I just haven't had the heart to comment on.  The client's biggest concern is that they will not be able to grow their CU any further (they want to grow by 30% minimum) because they are hitting the throughput limit of our legacy system.  (This is a valid issue and stems from the design of the legacy DB key index system, which runs as a single thread for each database.  The CU has hit its processing limit of about 40,000 actions per second.)  The new relational DB system is not ready to handle clients of this size, for both performance and concurrency reasons, so that is not an option.  For clients of this size (this one is just the most vocal about the limitation) we have to do something about this processing bottleneck.  Like all bottlenecks, there are only two approaches: increase its throughput or decrease the load on it.
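To put numbers on that design, the shape of the problem is roughly this (a toy Go sketch, not our actual code; the names are made up, and the 25µs figure is just the arithmetic implied by the 40,000-per-second ceiling):

    package main

    import "fmt"

    // keyRequest represents one index update against a database's key files.
    type keyRequest struct {
        file string // which key file the request touches
        op   string // insert, update, delete, ...
    }

    // runIndexThread is the serialization point: every key request for a
    // database funnels through this one loop, no matter how many
    // application threads generate the load.
    func runIndexThread(requests <-chan keyRequest) {
        for req := range requests {
            process(req) // at ~25µs per op, the ceiling is ~40,000 ops/sec
        }
    }

    func process(req keyRequest) {
        _ = req // stand-in for the real key file update
    }

    func main() {
        // Back-of-the-envelope: 1s / 40,000 ops = 25µs per operation.
        fmt.Printf("per-op budget at 40,000 ops/sec: %.0fµs\n", 1e6/40000)
    }

No amount of hardware underneath changes that math; only the per-operation cost or the request volume can.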

So we are on the phone with the client, having come to an understanding of their issues and which ones the Product Development group can address.  Chief among them is that they cannot execute their growth strategy due to the processing limits they have hit with our system.  Some of this can be mitigated; the problem is that we don't have a monitoring system on the legacy DB (we do on the new one) that can report what is generating the load with enough specificity to correctly identify its source.  So I have been working since January to implement such a system, and it goes into production on the client's system the last week of May.  When the client asked how much load we would be able to help them shed from the current system, my new Director said 50 percent.
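For the curious, the heart of the monitoring work is unglamorous: tag every key file operation with who generated it and count.  Something with roughly this shape (a minimal sketch with hypothetical names, not the production code):

    package main

    import (
        "fmt"
        "sync"
    )

    // loadMonitor tallies key file operations by originating application so
    // peak load can be attributed to a specific source, not just observed.
    type loadMonitor struct {
        mu     sync.Mutex
        counts map[string]uint64 // source app/job -> ops this interval
    }

    func newLoadMonitor() *loadMonitor {
        return &loadMonitor{counts: make(map[string]uint64)}
    }

    // record is called on the key request path with the source identifier.
    func (m *loadMonitor) record(source string) {
        m.mu.Lock()
        m.counts[source]++
        m.mu.Unlock()
    }

    // snapshot reports and resets the per-source counters; a reporting
    // loop would call this once per interval.
    func (m *loadMonitor) snapshot() map[string]uint64 {
        m.mu.Lock()
        defer m.mu.Unlock()
        out := m.counts
        m.counts = make(map[string]uint64)
        return out
    }

    func main() {
        mon := newLoadMonitor()
        mon.record("batch-job-A") // source names here are invented
        mon.record("teller-app")
        mon.record("batch-job-A")
        fmt.Println(mon.snapshot()) // map[batch-job-A:2 teller-app:1]
    }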

50 Percent - Half of the load on the key file manager - all of it driven by application actions.

I was left flabbergasted.  I figured we could use the new monitoring to identify when they max out the key file system and trace back exactly why.  (Part of it we know: they launch too many concurrent batch jobs against the live database.  But a large portion of the overload during daytime hours is not understood, hence the activity monitoring implementation.)  Once we have the data on what drives the load at peak times, I figured we could reduce the peak load by 10 to 20 percent, enough to prevent noticeable end-user request latency.  To reduce the load by 50% we would have to change the application itself; there is no way around that.
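The batch job half at least has an obvious mitigation once the data confirms it: cap how many batch jobs can run against the live database at once.  The mechanics are no more than a counting semaphore (another toy sketch; the cap of 3 is purely illustrative):

    package main

    import (
        "fmt"
        "sync"
    )

    // Allow at most maxConcurrentBatch jobs against the live database;
    // the rest queue instead of piling onto the key file manager.
    const maxConcurrentBatch = 3

    func main() {
        sem := make(chan struct{}, maxConcurrentBatch)
        var wg sync.WaitGroup

        for job := 1; job <= 10; job++ {
            wg.Add(1)
            go func(id int) {
                defer wg.Done()
                sem <- struct{}{}        // blocks once 3 jobs are running
                defer func() { <-sem }() // release the slot when done
                fmt.Printf("batch job %d running\n", id)
            }(job)
        }
        wg.Wait()
    }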

After we finished with the client, I asked where the 50% number came from.  The Director stated that he threw it out as a BHAG.  This is something he wants done this year.  Something like that is nice to have as an internal goal, but to throw it out as an expectation for the client, especially for a client that has already been mishandled to the point they want to leave, is in my opinion simply inexcusable.

As for the key file system bottleneck, I have two potential strategies.  The first is to use the monitoring system to identify, in real time, the application(s) overloading the key file system and throttle them to keep interactive response times consistent.  This would stabilize the system for the client in the short term, with the long-term goal of moving them to our new DB system when it becomes usable for them.  In short, manage the load on the bottleneck.  The second strategy is to exploit a quirk of the key file system: it processes each key file request within a silo, such that one request acts on only a single file.  This provides a very straightforward design path to an individual processing thread for each of the most heavily loaded key files, which we can determine from the monitoring results.  In short, improve the throughput of the bottleneck.  Rough sketches of both ideas follow.
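For the first strategy, the shape I have in mind is ordinary rate limiting: once the monitor fingers an application, meter its key file traffic so interactive requests keep their headroom.  A toy version (the rate and names are illustrative, not our design):

    package main

    import (
        "fmt"
        "time"
    )

    // throttled wraps a request stream so an offending application's key
    // file traffic is metered out at a fixed rate, leaving headroom for
    // interactive requests, which would bypass this path entirely.
    func throttled(in <-chan string, opsPerSec int) <-chan string {
        out := make(chan string)
        go func() {
            defer close(out)
            tick := time.NewTicker(time.Second / time.Duration(opsPerSec))
            defer tick.Stop()
            for req := range in {
                <-tick.C // pace the offender
                out <- req
            }
        }()
        return out
    }

    func main() {
        in := make(chan string, 5)
        for i := 1; i <= 5; i++ {
            in <- fmt.Sprintf("req-%d", i)
        }
        close(in)
        start := time.Now()
        for req := range throttled(in, 10) { // 10 ops/sec for the demo
            fmt.Println(req, time.Since(start).Round(10*time.Millisecond))
        }
    }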
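For the second strategy, the silo property means requests for different files never conflict, so the single index thread can be split into one thread per hot file.  Again a toy sketch, with hypothetical key file names standing in for whatever the monitoring identifies as hottest:

    package main

    import (
        "fmt"
        "sync"
    )

    // Because each key request touches exactly one file, requests for
    // different files never conflict, so each hot file can get its own
    // processing thread instead of sharing the single global one.
    func main() {
        hotFiles := []string{"MEMBER.KEY", "SHARE.KEY", "LOAN.KEY"} // hypothetical
        workers := make(map[string]chan string)
        var wg sync.WaitGroup

        for _, f := range hotFiles {
            ch := make(chan string)
            workers[f] = ch
            wg.Add(1)
            go func(file string, reqs <-chan string) {
                defer wg.Done()
                for req := range reqs {
                    // serial within a file, parallel across files
                    fmt.Printf("%s: %s\n", file, req)
                }
            }(f, ch)
        }

        // Dispatch: route each request to its file's dedicated worker.
        workers["MEMBER.KEY"] <- "insert"
        workers["LOAN.KEY"] <- "delete"

        for _, ch := range workers {
            close(ch)
        }
        wg.Wait()
    }

The appeal of this one is that the silo behavior already guarantees the isolation the extra threads need; the risk is that it is surgery on a thirty-year-old serialization point, which is exactly why I want the monitoring data first.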