Saturday, August 21, 2021

Classic example of why 'Agile' development fails

So I just have had a rough week, and it was caused by 6 lines of new code.  The reason is a classic example of why Agile development fails in the real world.

Six weeks ago a senior, experienced, and relatively conscientious developer submitted a change that included a new method added to an existing class.  This change was part of a fix for a client discovered issue when voiding transactions in our core Credit Union software.  The code has a complexity of 2 (has 2 conditional with 2 branches) and consists of 6 lines of active code along with 4 braces.  Its purpose is to take a passed object and either update a copy of it in an internal store or add the copy to the internal store.  This code was submitted by this relatively conscientious developer without unit tests.  (I do not know what his reason is for not submitting unit tests in this case was, I have not asked for it.)  Pause here and think about the description of this code maintenance and you should be able to guess what the issue was with the code.  

The code change was tested by QA both manually to verify it fixed the client found issue and updated automation tests with both passing.  The code was scheduled for incorporation and was released to the clients for installation in a service pack.  (Credit Union software release process is regulated by Federal CU license.  For us the end result is we have to provide specific releases to our clients and cannot use continuous delivery.)  This service pack was installed on 8 client systems on the weekend of August 14th/15th that are using this (new) version of our DB backend that had the fix applied.  By the close of business on the 17th we had processing issues at 3 of the 8 that resulted in multiple Member accounts being out of balance, a failed cashier's check disbursement in the Member presence ("The CU won't give me my money!"), another Member finding their savings account $4,000.00 short after a deposit, and all 3 CUs with out of balance General Ledgers.  As you might imagine the Member visibility elevated attention paid to this, but the general processing failure in the new DB was the most troubling aspect of this for the CUs.  Between the multiple CUs, the Member visibility, and functional failure the number of people involved with this from our company peaked above 40.  This started with the new DB team to identify the source of their issue, the application team who had to pick their way though the code for each processing failure (the DB failure occurred in multiple areas of the application) to identify what processing was not completed for the different transactions, CSRs to coordinate with the CUs (change requests to fix the DB, notification of all clients using the new DB and coordinating install of fix which required a database restart), and our management team.  Overall we burned over 400 hours addressing this.  I personally put in over 40 hours in 3 days on this, as I understand the business logic and update requirements to fix the Member accounts.

The root cause in the 6 lines of code?  The class that maintains the stored information uses a lazy initialization for part of its stored data, and the new code did not check that the stored data object was initialized before checking to see if the passed object was already stored in it.  This resulted in a NullPointerException during a database read that, due to the actions of the application, would occur as part of a write / read / write series of steps with the read failing and the application not performing the subsequent write.  The fix, check if the stored data object is initialized and if not fall back to adding the data as new, using the second data store path that was already written.  (Literally add a "storeObject != NULL &&" to the first conditional of the method.)  Why with this installed at 8 clients did only 3 have an issue, turns out those 3 had their General Ledgers configured in a way that they tracked offsetting transactions in a specific fashion and that caused additional DB activity, which resulted in the object being accessed in a way that was not tested for.

For me, this is a why Agile development fails in the real world.  This developer submitted a change without unit tests (why doesn't matter), that was code review and approved, that passed QA testing, but when the clients starting using it resulted in chaos and 100s of hours of cleanup work.  The clients in the real world performed the QA using Member transactions.  Regardless of how good your processes are, Agile expects (requires?) near perfection from developers because, in real world applications there are too many configuration variations for any level of automated testing to cover when maintenance was done.  Why was this maintenance done?  To fix a client found issue in processing.  Why was this found by the clients and not in development?  Because development was done without taking into account all the known usage patterns of the clients, just what the development team thought was the most common with the expectation we would handle the exceptional stuff, the transaction void process, as part of the ongoing maintenance.  This latter part is from their project notes, such as they are.  At its heart this whole disaster did not start with the bad code maintenance, but with the expectation that there will be ongoing rework of base functionality of an application after it was 'delivered.'

0 Comments:

Post a Comment

<< Home