Thursday, May 08, 2025

UUID

So back in July of 2024 a number of us got pulled off of our work and put onto a project to modify every table we have in our system.  The project was to add 2 columns to every table, a UUID and a Version Number.  Although it was to be done as a single project, each field was being added to support a completely separate process.  This was done with 5 weeks left in the normal development of our release and with around 80 tables to modify.  Not impossible if handled correctly.  The problem was, and is, the UUID field and the fact that the project was not handled correctly.

A UUID, for those of you who don't know, is a Universally Unique Identifier and consists of 16 bytes (128 bits) of information that are generated in various ways.  There is a Wikipedia page all about it if you want to delve into the details.  For the purposes of our discussion it is a unique number assigned to each row of every table.  The following is what I wrote as my recommendation.

The problem is that, because of the makeup of our system, we do not have a way to natively store a UUID and interface it between the systems.  [ed note: The core of our database is 40 years old.]  We have not implemented a UUID data type in the database, and since most interfaces are generated based on the database schema, that would need to be implemented first.  For the interfaces between our systems we also did not implement a datatype for a UUID.  We can implement all of this, but it will take time, and given the many touch-points (2 schemas, 4 interfaces, 3 primary DBs, 2 secondary DBs) it cannot be completed in the time available.  My recommendation is that for this first pass, on the limited set of data files we have selected for the pilot, we store the UUID in its text format as a 36 character field.  Then in the next release we implement a UUID datatype across all the touch-points, convert the text field to its binary 16 byte representation, and add the UUID to the remaining tables.  A 36 character field is something that we know will flow through the entire ecosystem using the existing tooling we have built.
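To make the two options concrete, here is a minimal Java sketch of the 36 character text form versus the 16 byte binary form and the conversion between them.  The class and method names are mine for illustration, not from our codebase; the point is that the text form is just a fixed-width string our existing tooling already knows how to move around, while the binary form needs a new datatype at every touch-point.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Illustrative sketch only, using standard java.util.UUID semantics; the actual
// schema and interface plumbing in our system is not shown here.
public class UuidFormats {

    // 36-character canonical text form, e.g. "123e4567-e89b-12d3-a456-426614174000"
    static String toText(UUID id) {
        return id.toString();
    }

    // 16-byte binary form: the two halves of the UUID written as big-endian longs.
    static byte[] toBinary(UUID id) {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        return buf.array();
    }

    static UUID fromBinary(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new UUID(buf.getLong(), buf.getLong());
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println("text (36 chars): " + toText(id));
        System.out.println("binary length  : " + toBinary(id).length); // 16
        System.out.println("round trip ok  : " + id.equals(fromBinary(toBinary(id))));
    }
}
```

The later migration I proposed would essentially be this conversion run once across the pilot tables after the binary datatype exists at every touch-point.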

This was rejected out of hand, within minutes of me sending the email.  Until the day of my “Now it makes sense” post I did not understand why.  We pushed forward with a bodged-together handling of the UUID in its binary form in all systems and interfaces for a limited set of tables.  This one project pushed that release’s completion date out 8 weeks in total, with 2 major pre-release issues and one post-release issue caused by the UUID work.  All of the work required extensive overtime by multiple individuals and resulted in several secondary databases on client systems being corrupted, requiring extended recovery support work.  All of this development will need to be redone, every last line of it and every table definition entry.  It is all exception coding and barely working schema entries.

And the worst part of it is that the downstream platform that this was done for is in an early development phase.  There was no need to put the production release at risk; it could have waited one release.  The other problem I have with this is that the new platform has been under development for almost 2 years at this point, so why was the requirement for us to implement a UUID handed down in such an abrupt manner, with a get-it-done deadline?

Thursday, April 24, 2025

Audacious Goals - Fail!

To follow up on my Audacious Goals post: the client that went to RFP gave notice in March of 2023, about 10 months after that post, that they had selected our newest competitor for their core solution.  They will be moving to that core in about 48 months - sounds better than 4 years - from the notice date, so that is about 2 years out from now.  For me this was no surprise; I provided some of the reasons why they had not been happy with our service, and those reasons go back about 8 years prior to their leaving.  Right up until they gave formal notice to us, management was scrambling, starting work and moving people around, to try and save them.  Given the long-standing grievances the client had and their perception of the limitations of our software, I knew that they were leaving our platform; it was just a matter of where they landed.

Timeline of events I was involved with:

2017 January

Our current performance improvement project has notified the client that they will not be part of the first round of efforts but will be part of the second round, in deference to a different client.  We know we have daytime performance issues with this client that need to be addressed.  (These issues would be the root cause of a number of crisis situations with them in the coming years, with a badly implemented fix causing a crash, and a proper fix implemented mere months before they formally notified us they were leaving.)

2017 July 

The client has experienced 30-45 second delays (verified by us) accessing the Loan records interactively.  This was traced to sub-optimal code that was accessing 10,000 parameter records for each of a number of fields on the Loan record.  This occurred in combination with the core DBMS process being bottlenecked due to some other bad processing.
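My notes don't record the exact remediation, but the shape of the problem, re-reading the same large parameter set for every field instead of reading it once per request, looks roughly like the hypothetical sketch below.  ParameterDb, ParameterRecord, and LoanScreenBuilder are stand-in names, not our actual classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the repeated-lookup pattern and a simple
// per-request cache that avoids it. The real code and fix are internal.
class ParameterRecord { /* one of the ~10,000 records in a parameter set */ }

interface ParameterDb {
    List<ParameterRecord> loadParameterSet(String setName); // expensive read
}

class LoanScreenBuilder {
    private final ParameterDb db;
    private final Map<String, List<ParameterRecord>> cache = new HashMap<>();

    LoanScreenBuilder(ParameterDb db) { this.db = db; }

    // Sub-optimal pattern: every field on the Loan screen re-reads the full set.
    List<ParameterRecord> parametersForFieldSlow(String setName) {
        return db.loadParameterSet(setName);
    }

    // Cached pattern: the set is read once per request and reused for each field.
    List<ParameterRecord> parametersForField(String setName) {
        return cache.computeIfAbsent(setName, db::loadParameterSet);
    }
}
```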

2017 August

Client has been experiencing multiple issues off and on for over a month.  The issues, and our response, became so acute that their CEO was in contact with our division president and the parent company C level executives.  There were several concurrent issues; the most egregious was that when a few members logged in to home banking, we showed them information for a different member.  In the same timeframe we had multiple down events and multiple slowdowns.

2017 Fall (Don't have the date)

Due to a change in prioritization, the performance improvement project was shut down.  I turned to my direct manager during this meeting and told him we still had issues at the client and we needed to investigate them to determine their root cause.

2018 October

Client has been experiencing recurring issues with their ATM processing slowing to a relative crawl.  On multiple nights, the ATM requests started slowing down, taking 2 to 3 seconds to process each request.  That does not sound like much, but due to the request volume and queuing, the overall turnaround time for a request eventually built up to a 30 second queue.  After much diagnosis it was found there was one Account, involved in fraudulent activity, that had 100 active share records, 87 active card records, and 14,400 active transaction hold records.  Some sub-optimal coding was having to sift through all of that multiple times for each transaction, and in combination with the processing mode the system was in at the time, this resulted in the backlog.  The fraud came in when the overall turnaround time hit 15 seconds: the ATM network went into 'stand-in' mode, and when that happened all 87 cards were used simultaneously to withdraw money across the entire ATM network.  Ouch.
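For a rough feel of how a 2 to 3 second service time snowballs into a 30 second queue, here is a back-of-the-envelope sketch.  The steady one-request-per-second arrival rate is an assumption purely for illustration; the real traffic was heavier and burstier, so the backlog built up even faster.

```java
// Toy model: a single-threaded service path where each request takes longer to
// process than the gap between arrivals, so the backlog grows without bound.
public class QueueBuildup {
    public static void main(String[] args) {
        final double arrivalIntervalSec = 1.0; // assumed spacing between requests
        final double serviceTimeSec = 2.5;     // observed 2-3 seconds per request
        double backlogSec = 0.0;               // work already waiting ahead of a new request
        boolean standInReported = false;

        for (int request = 1; ; request++) {
            double turnaroundSec = backlogSec + serviceTimeSec; // wait + own service
            if (!standInReported && turnaroundSec >= 15.0) {
                System.out.printf("request %d: turnaround ~%.1fs, stand-in mode kicks in%n",
                        request, turnaroundSec);
                standInReported = true;
            }
            if (turnaroundSec >= 30.0) {
                System.out.printf("request %d: turnaround ~%.1fs, a ~30s queue has built up%n",
                        request, turnaroundSec);
                break;
            }
            backlogSec += serviceTimeSec - arrivalIntervalSec; // queue grows ~1.5s per request
        }
    }
}
```

Under these assumed numbers the stand-in threshold is crossed after about 10 requests and the 30 second backlog after about 20, which is why a seemingly small per-request slowdown blew up the way it did.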

2019 July

It's baaack . . . The slowdown accessing the Loan records has reoccurred.  There was an initial project to stopgap the issue, with the intent to follow up with a full project to minimize these reads across the system.  The latter was not implemented.

2020 May

The client crashed during their afternoon rush and through the teller and branch end-of-day processing.  The short of it: a fix for an issue that was mostly specific to them, but did occur sporadically for other large clients, was improperly implemented; a workaround for that was created; but when the client installed the release containing the fix, the workaround was never put in place, resulting in them crashing.  They were pissed (no other word for it) at us, and rightfully so.

2020 October/November

The client hired a new person to interface with our company.  Her attitude seemed a little off in the first couple of meetings.  By early November our CSR for their company, our CSR manager, and their interface person were having meetings where they were talking past each other.  Their interface person wanted to examine the defect trends and our CSR manager only wanted to talk about the absolute backlog.  This person and their attitude was an indicator of how unhappy they were with our service.

2021 March

The client ended up implementing a business change to work around the area that caused their crash in May of 2020.  Yes, 10 months later we still had not implemented a fix for this area, and the client had to change instead of us fixing it.

A second issue involved the running of concurrent posting programs.  This works correctly on its own, but in combination with a 3rd issue it caused a slowdown.  What did we tell them?  That they can only run one at a time.  We should have investigated this further to determine what the 3rd item was.

2021 April

The client is now 2 full years behind installing our software for production and will install 2 yearly releases back to back.  From the meeting I was in it is clear to me they are scared of our software now.  This is also my first note that they are looking to replace us.

2021 October

We have been having a series of meetings with the client over the last 6 months.  In this month's meeting the client stated they are going to review what processes and/or functions within our software impact the DBMS process.  In other words, our client is going to research our software to determine what is wrong with it.  This whole thing is a reaction to their crashes over the prior year, most of which were preventable with some good decisions (like not deferring maintenance on known broken items).

2021 December

I wrote the following about a status meeting, "The opening comments regarding the business changes that <the client> has made in an attempt to reduce their daytime load was just devastating: They have held their branch expansion (c o v i d impacted this); They have shut down their day-time automated submission system and are down to process that are required for operations; They are considering shutting down part of our off-host integration service to reduce load"

We found out that an important performance improvement configuration had not been implemented on their primary production server.  This was a project based on hard-won (deep investigation) knowledge that produces a 10-20% performance improvement under normal load and, most importantly for this client, improves things even more on highly loaded systems.  This performance improvement setup should have been put in place when they installed their current system back in mid 2020.

2022 January

Implemented the performance improvement configuration change.  There is another performance improvement change available at the system level.  This change recovers processor performance we are 'leaving on the floor' according to the systems expert we brought in (at no small expense).

2022 February

The end of the line for the current crisis.  I am not sure how many of the many items for improving the client's system performance were implemented in the end.  For sure we had them adjust their processing to work around the issues they had run up against.

2022 September

Per an in-house conversation, the client feels we "have a gun to their head" concerning them converting to our new platform.  Also they have not made a decision about what platform they will be moving to.  They are doing a Proof Of Concept with our newest competitor.

2023 January

The client is finally testing the proper fix for what crashed them in May of 2020.  It only took us 3 years to get a 12 line fix in place.  And we are only doing it now to try and save them as a client.

2023 March

The client formally notified us that they are going to our competitor.  No surprise here.

Wednesday, April 23, 2025

Now it makes sense

It's been a while - it has been a bad year personally and professionally - but I'll be posting here semi-regularly going forward.  My company has gone through quite a lot of upset, almost all of it self-inflicted.  But today there was a meeting of the Product Development staff with all hands present.  Half the people in the room I did not recognize, which shows the level of turnover in the last 2 years.  Until today I did not understand why work levels had been pushed so hard, resulting in that turnover.

I'll be putting out a number of posts in the coming days showing what has been going on, but here is my take on it.  Our parent company is putting together a platform that is intended to support both Banking and Credit Union core operations.  All well and good, if properly executed.  Until today I understood that to mean it would function as a front-end to our existing core products (both the Banking and the Credit Union back-ends).  Today it was stated that the new platform will replace all of the core functionality of the existing Banking and Credit Union systems.  It was also discussed that we will be transitioning to the new system, with a timeline of 5 years bandied about.  What was actually said, however, was that the new platform should be ready to support production clients in a minimum of 5 years, and at that point we would be able to start transitioning our clients to it.  All of the decisions I will be showing as questionable make sense if you assume our software only has to last 5 years or so and the existing staff will not be brought forward to work on the new system.  You don't have to care if the existing system gets a bad reputation; we will have a new system to show the clients.  You don't have to worry about burning out the existing staff, because we are not going to be retained for the new system.

Now it makes sense.  Some of the stupid things that have been done based on the idea we only have to continue to develop for 5 years or so on our current platform will follow in the coming days.

Note, I said to show the clients.  I didn't say it would be usable for the majority of them, especially the largest ones, after that 5 year time span.  The reason is that automation and customizations that don't make sense for small operations become nearly mandatory for large operations.  This is true for both Banks and Credit Unions.  These automations and our ability to support customizations are what make our Credit Union software attractive to our largest clients.  There is also a non-trivial 3rd party ecosystem that exists for our Credit Union software.  We also have to review all of the custom coded interfaces that we have built for our clients, some of which are one-offs that will need to be retained.  All of this will need to be reviewed and re-implemented to transition each client from the old system to the new system.  We also need to train the CUs' staff on the new system and ensure there is adequate support from us to answer questions post transition.  All of this for each of our 700 clients.  If we transition 1 per day, that would take 2 1/2 years to complete.  Any way I look at this, I'm thinking that from the day the new system is production ready (it is only in demo mode for a very limited set of functionality today), converting all 700 of our clients within a 5 year time span would be a decent result.  Optimistically it is 5 years to get to that initial production stage, so at a bare minimum this is a 10 year undertaking before we can retire our current core system.  Pessimistically, call it 15 years.

So the idea that we only have to keep developing on our existing software for 5 more years, and that that is what we can plan for, is nonsensical.

But at least what has been going on now makes sense.

Sunday, May 01, 2022

Audacious Goals

"Set a big future for yourself and use it as motivation to get up every day and get your work done."  — Elon Musk

"Move fast and break things." — Mark Zuckerberg

“Most people overestimate what they can do in one year and underestimate what they can do in 10 years.” — Bill Gates

"A Big, Hairy, Audacious Goal or BHAG is a long-term, 10 to 25-year goal, guided by your company’s core values and purpose." — Jim Collins and Jerry Porras in their book, Built to Last: Successful Habits of Visionary Companies

The above are some of the quotes I directly or indirectly get thrown in my face from time to time by the management team at my company.  Our director of QA loves to talk about how he takes his lead from the way Facebook creates their software.  This week was one of those where someone stated a goal to one of our clients and it was all I could do not to start screaming or crying - still not sure what the appropriate response would be.

A little background.  We have a client who has gone to RFP to replace us for their core processing, along with any of our partner products.  They have been systematically mishandled by management, mostly by being promised things and then us not delivering, so I understand why this has happened.  The issue is that this client is our 3rd largest client by transaction volume and a leader among Credit Unions.  So yea, huge panic by management, lots of meetings that I just haven't had the heart to comment on.  The client's biggest concern is that they will not be able to grow their CU any further (they want to increase by 30% minimum) because they are hitting the throughput limit of our legacy system.  (This is a valid issue and occurs due to the design of the legacy DB key index system, which runs as a single thread for each database.  The CU has hit its processing limit of about 40,000 actions per second.)  The new relational DB system is not ready to handle clients of this size, for both performance and concurrency reasons, so that is not an option.  For clients of this size (this one is just the most vocal about the limitation) we have to do something about this processing bottleneck.  Like all bottlenecks there are only two approaches: increase its throughput or decrease the load on it.

So we are on the phone with the client, having come to an understanding of their issues and the ones the Product Development group can address.  Chief among them is that they cannot implement their growth strategy due to the processing limits they have hit with our system.  Some of these can be mitigated; the problem is that we don't have a monitoring system on the legacy DB (we do with the new one) that can report what is generating the load on the DB with enough specificity to correctly identify its source.  So I have worked since January to implement such a system, and it will be put into production on the client's system the last week of May.  When the client asked how much load we would be able to help them shed from the current system, my new Director said 50 percent.

50 Percent - Half of the load on the key file manager - all of it driven by application actions.

I was left flabbergasted.  I figured we could identify when they max out the key file system using the new monitoring and trace back exactly why.  (Part of it we know: they launch too many concurrent batch jobs against the live database.  But a large portion of the overload during daytime hours is not understood, hence the activity monitoring implementation.)  Once we have the data on the source of the load during the peak times, I figured we could reduce the cause of the peak load by 10-20 percent, preventing noticeable end user request latency issues.  To reduce the load by 50% we would have to change the application; there is no way around that.

After we finished with the client, I asked where the 50% number came from.  The Director stated that he threw it out as a BHAG.  This is something he wants done this year.  Something like that is nice to have as an internal goal, but to throw it out as an expectation for the client, especially for a client that has already been mishandled to the point they want to leave, is in my opinion simply inexcusable.

As for the key file system bottleneck, I have two potential strategies.  The first is to use the monitoring system in real time to identify the application(s) overloading the key file system and throttle them to provide consistent interactive response throughput.  This would stabilize the system for the client in the short term, with the long term goal of moving them to our new DB system when it becomes usable for them.  In short, manage the load on the bottleneck.  The second strategy is to use a quirk of the key file system: it processes each key file request within a silo, such that one request only acts on a single file.  This provides a very straightforward design path to implement an individual processing thread for the most heavily loaded key files, which we can determine from the results of the monitoring system.  In short, improve the throughput of the bottleneck.  A sketch of what I mean by the second strategy follows.
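Here is a hypothetical sketch of routing requests for the hottest key files to their own single processing threads, relying on the silo property described above.  The class names and the dispatch layer are mine for illustration; the list of hot files would come from the monitoring system.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: KeyFileRequest/KeyFileResult stand in for the real request
// objects, and process() stands in for the actual key file update/lookup logic.
class KeyFileRequest { final String fileName; KeyFileRequest(String f) { fileName = f; } }
class KeyFileResult { }

class KeyFileDispatcher {
    // The existing single-threaded path, shared by all files that are not "hot".
    private final ExecutorService sharedWorker = Executors.newSingleThreadExecutor();
    // One dedicated single-threaded worker per heavily loaded key file.
    private final Map<String, ExecutorService> hotFileWorkers = new ConcurrentHashMap<>();

    // Called with the hot files identified by the monitoring system.
    void dedicateThreadTo(String hotFileName) {
        hotFileWorkers.computeIfAbsent(hotFileName,
                name -> Executors.newSingleThreadExecutor());
    }

    // Because a request only ever touches one file, it can be routed to that
    // file's dedicated worker if one exists, otherwise to the shared path.
    Future<KeyFileResult> submit(KeyFileRequest request) {
        ExecutorService worker =
                hotFileWorkers.getOrDefault(request.fileName, sharedWorker);
        return worker.submit(() -> process(request));
    }

    private KeyFileResult process(KeyFileRequest request) {
        // Placeholder for the real key file processing.
        return new KeyFileResult();
    }
}
```

The first strategy would hang off the same monitoring data, but as a throttle placed in front of the offending applications rather than as extra worker threads.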

Thursday, February 24, 2022

They manage by the numbers, but don't understand the impact on them from their actions

So our management team uses numbers to guide them.  How many open defects in the release we are working on, how many in the prior one, how many hours spent developing, how many spent working production issues.  We peons have to account for our hours per project, per task, and by GL category, resulting in us having to log them into 3 different systems in order to supply everyone who wants the information.  We are told to lie, there is no other word for it, in how these numbers are accumulated.  This is done by waiting until a bunch of unrelated issues pile up and then management 'decides' that they are an emergency project that needs to be addressed; because it is now a project, the hours can be depreciated, reducing the 'cost' (tax burden) of fixing them.  But it skews the project vs defect hours ratio, making it look like we are making more forward progress than we actually are.  (It also results in the clients having to live with ongoing issues for much longer than they should, but who cares about them - they just pay the bills.)  They will also rework defects into projects for much the same reason, to allow the hours to be depreciated.

Early in 2021 it was decided to reduce expenses by releasing the offshore contractors we had working to build up our automated testing effort.  It was also decided to have that work continue on, done by the full time staff.  What we, the development staff, didn't understand was that there were 20 development contractors experienced in automated testing working on this, plus 4 contracted supervisors paired with 4 QA analyst employees.  The 4 employees were retained; the 24 contractors were all let go.  So the work of 20 full time developers working on test automation was dumped on the 45 developers in our group.  Until today, I did not know that 24 people were let go when the contractors were released.  What happened over the 10 months after that event is that the director of software development, the lead designer / developer for our database, the number 2 developer for our database, the primary developer and maintainer for our code generation, both of our MSSQL interface experts, the supervisor for the user interface development, and 4 business functionality developers all left for other companies.  The database people all worked as an insular group due to the way the lead functioned (which had always been a source of friction between the various work groups), and he would have been informed of the decision to shift the QA work to developers as soon as it was made.  He and the director (his immediate supervisor) were friends outside of work, and the director had recruited him into the company.  What I have realized this evening is that all these people moved on in part because they knew what was coming, and it incentivized them to search for their next position.

So we have the work that was being done by 20 contract developers now to be done by the 35 of us developers who are currently employed.  We have 15 open positions, 10 of them software engineering positions.  The goal is to continue to implement test coverage at the prior pace, and even if we fill all of the open positions it would still take 40-50% of the developers' time to produce the same results as the contracting group that was let go.  And that is if we are as productive as the group of experienced automated test developers that was let go.  But somehow management has convinced themselves that we can do this allocating only 10-15% of our time, because "we are the developers that are creating the application code being tested."  The problem is, we are all being assigned automated test cases to be created in the existing application and user interface areas.  For some of us, myself included, this is an area of the code I last worked in 10 years ago, as I have been working on back-end interfaces and database interfaces since then.  Oh yes, all of this work is being tracked within our existing project work instead of being broken out into a separate project or accounting bucket, resulting in the numbers looking like we are making forward progress when in fact all we are doing is back-filling tests in existing areas and not improving the product in any way.

Saturday, August 21, 2021

Classic example of why 'Agile' development fails

So I have just had a rough week, and it was caused by 6 lines of new code.  The reason is a classic example of why Agile development fails in the real world.

Six weeks ago a senior, experienced, and relatively conscientious developer submitted a change that included a new method added to an existing class.  This change was part of a fix for a client-discovered issue when voiding transactions in our core Credit Union software.  The code has a complexity of 2 (2 conditionals with 2 branches) and consists of 6 lines of active code along with 4 braces.  Its purpose is to take a passed object and either update a copy of it in an internal store or add the copy to the internal store.  This code was submitted by this relatively conscientious developer without unit tests.  (I do not know what his reason for not submitting unit tests in this case was; I have not asked.)  Pause here and think about the description of this code maintenance and you should be able to guess what the issue was with the code.

The code change was tested by QA, both manually to verify it fixed the client-found issue and with updated automation tests, with both passing.  The code was scheduled for incorporation and was released to the clients for installation in a service pack.  (The Credit Union software release process is regulated by the Federal CU license.  For us the end result is we have to provide specific releases to our clients and cannot use continuous delivery.)  This service pack, with the fix applied, was installed over the weekend of August 14th/15th on the 8 client systems that are using this (new) version of our DB backend.  By the close of business on the 17th we had processing issues at 3 of the 8 that resulted in multiple Member accounts being out of balance, a failed cashier's check disbursement in the Member's presence ("The CU won't give me my money!"), another Member finding their savings account $4,000.00 short after a deposit, and all 3 CUs with out-of-balance General Ledgers.  As you might imagine the Member visibility elevated the attention paid to this, but the general processing failure in the new DB was the most troubling aspect for the CUs.  Between the multiple CUs, the Member visibility, and the functional failure, the number of people involved with this from our company peaked above 40.  This started with the new DB team, who had to identify the source of their issue; the application team, who had to pick their way through the code for each processing failure (the DB failure occurred in multiple areas of the application) to identify what processing was not completed for the different transactions; the CSRs, who had to coordinate with the CUs (change requests to fix the DB, notification of all clients using the new DB, and coordinating the install of the fix, which required a database restart); and our management team.  Overall we burned over 400 hours addressing this.  I personally put in over 40 hours in 3 days on this, as I understand the business logic and update requirements needed to fix the Member accounts.

The root cause in the 6 lines of code?  The class that maintains the stored information uses lazy initialization for part of its stored data, and the new code did not check that the stored data object was initialized before checking whether the passed object was already stored in it.  This resulted in a NullPointerException during a database read that, due to the actions of the application, would occur as part of a write / read / write series of steps, with the read failing and the application not performing the subsequent write.  The fix: check if the stored data object is initialized and, if not, fall back to adding the data as new using the second data store path that was already written.  (Literally, add a "storeObject != null &&" to the first conditional of the method.)  Why, with this installed at 8 clients, did only 3 have an issue?  It turns out those 3 had their General Ledgers configured in a way that tracked offsetting transactions in a specific fashion, and that caused additional DB activity, which resulted in the object being accessed in a way that was not tested for.
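Since I can't show the actual code, here is a minimal Java sketch of the shape of the defect and the one-conditional fix.  CachedStore, storeObject, and StoredRecord are hypothetical names standing in for the real class and its lazily initialized store.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; the real class and its lazy initialization rules
// are internal and not shown here.
class StoredRecord {
    final String key;
    StoredRecord(String key) { this.key = key; }
    StoredRecord copy() { return new StoredRecord(key); }
}

class CachedStore {
    // Lazily initialized elsewhere; may still be null when update() is called.
    private Map<String, StoredRecord> storeObject;

    // Buggy version: assumes storeObject has already been initialized, so the
    // containsKey() call throws a NullPointerException when it hasn't been.
    void updateBuggy(StoredRecord incoming) {
        if (storeObject.containsKey(incoming.key)) {
            storeObject.put(incoming.key, incoming.copy());
        } else {
            addAsNew(incoming);
        }
    }

    // Fixed version: guard the first conditional and fall back to the
    // already-written "add as new" path when the store is not initialized.
    void updateFixed(StoredRecord incoming) {
        if (storeObject != null && storeObject.containsKey(incoming.key)) {
            storeObject.put(incoming.key, incoming.copy());
        } else {
            addAsNew(incoming);
        }
    }

    private void addAsNew(StoredRecord incoming) {
        if (storeObject == null) {
            storeObject = new HashMap<>();
        }
        storeObject.put(incoming.key, incoming.copy());
    }
}
```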

For me, this is why Agile development fails in the real world.  This developer submitted a change without unit tests (why doesn't matter), that was code reviewed and approved, that passed QA testing, but when the clients started using it the result was chaos and 100s of hours of cleanup work.  The clients in the real world performed the QA, using Member transactions.  Regardless of how good your processes are, Agile expects (requires?) near perfection from developers because, in real world applications, there are too many configuration variations for any level of automated testing to cover when maintenance is done.  Why was this maintenance done?  To fix a client-found issue in processing.  Why was this found by the clients and not in development?  Because development was done without taking into account all the known usage patterns of the clients, just what the development team thought was most common, with the expectation that we would handle the exceptional stuff, the transaction void process, as part of ongoing maintenance.  This latter part is from their project notes, such as they are.  At its heart this whole disaster did not start with the bad code maintenance, but with the expectation that there would be ongoing rework of the base functionality of an application after it was 'delivered.'

Friday, July 30, 2021

You know management doesn't have a clue when . . .

Today management put out a 15 minute fiscal year in review for our primary project.  The intent of the project is to switch the DB of our primary product from a legacy home-grown one (30+ years old) to a more modern one.  Toward the end of the accomplishments, the director listed the conversion of one of our larger clients over to the new system as an accomplishment.  Problem is, it was done by accident!  They were not intended to be part of the auto migration process and were supposed to be exempt from it!  And yes, they are completely pissed off about it.  It has been nothing but production headaches for them (including one today that was being worked while this accomplishment was being trumpeted), including some corrupted data and lots of processing slowdowns.

This really left me with a bad feeling in the pit of my stomach.  If this VP doesn't understand that this occurred because of an oversight and is an ongoing issue, and so is not something to be touted as an accomplishment, what else has he been misinformed about?