Getting it wrong . . . twice . . . toward the end of a long week
As a company that provides core software for Credit Unions, the software that tracks a Member's shares, loans, credit cards, and other assets and debts, one of the things our software does is track information for IRS reporting: interest earned, interest paid, and in this case distributions from retirement accounts. All of these have to be reported to the Members (via 1098, 1099, and other forms) and to the IRS via multiple submissions. We provide the clients an end-of-year release that includes any late rule changes, with the understanding that our clients may not install it live but may instead install it in a test area with a post-end-of-year copy of the CU database to generate reports against. Three years ago, when contractors were used for IRS reporting (due to ongoing database changeover work), that release caused a number of client databases to become corrupted in production and, when run against a database backup, caused data sent to the IRS that was to be retained and reused in subsequent years to be lost. The latter was perceived by management as the bigger deal because of the data recovery cost incurred by our company, with the former seen as just a 'loss of trust' in our release software. I would argue that they are equally bad, with the loss of trust resulting in slower uptake of each release of our software.
This is all background to the fiasco we found ourselves in last week. Starting with 2021, the rule for when individuals are required to take the Required Minimum Distribution (RMD) from an IRA or 401(k) savings account changed from 70 1/2 to 72 years of age. For our software, we change an entry on each affected share record, and the January reports and notifications are adjusted with no code changes required. This has to be done sometime in January, after December reports are run and before January reports are run (typically after close on the last day of the month). However, when this was designed, the lead person forgot that not every client will have loaded the end-of-year release before the end of the year, and many of them won't load it into production in January. They packaged this record update process into the end-of-year release only and did not provide it as a separate program. That was the first time we got this wrong, and crucially not the first time we have done this to the clients in recent years, per the preceding paragraph.

So on Thursday (the 14th) we fix the issue by releasing stand-alone software that the clients can run to update the records. The clients now have two weeks to test this software and run it on their production database to apply the change before January reports are run. That sounds like a lot of time, but for a large CU (some of our clients have 100,000+ members) validating a run of one-off software is time consuming, making the lateness of this release an issue. (For Federally chartered credit unions, one requirement is that they test and validate all software.) So we release this early in the morning, and before we get to lunch we have multiple CUs calling us, and to quote one, "do you need someone to read federal regulations to you to get the implementation correct?" It turns out the group that put together the stand-alone software used an incorrect calculation and set the cutoff date of birth, before which an RMD from a share is required, to July 1, 1949, making the effective age 72 1/2 instead of 72. So we had to quickly put out a fix that same day, moving the cutoff date of birth to January 1, 1950. Bear in mind, that date is explicitly stated in the regulation implementation guidelines.
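For perspective on how small the actual fix was, here is a minimal sketch in Java of the cutoff comparison that had to change. The class, method, and constant names are invented for illustration (the post only gives the two dates), and the real record-update process does considerably more than a single date comparison.

    import java.time.LocalDate;

    // Hypothetical sketch of the cutoff check the same-day fix changed.
    public final class RmdCutoffCheck {

        // Cutoff shipped Thursday morning (incorrect): made the effective age 72 1/2.
        static final LocalDate INCORRECT_CUTOFF = LocalDate.of(1949, 7, 1);

        // Cutoff explicitly stated in the regulation implementation guidelines
        // (the corrected value in the same-day fix).
        static final LocalDate CORRECT_CUTOFF = LocalDate.of(1950, 1, 1);

        // True if a member with this date of birth is required to take an RMD
        // from a share under the old 70 1/2 rule, i.e. born before the cutoff.
        public static boolean rmdRequiredUnderOldRule(LocalDate dateOfBirth) {
            return dateOfBirth.isBefore(CORRECT_CUTOFF);
        }

        private RmdCutoffCheck() { }
    }

The point being: the change that slipped through two releases was, in the end, one constant spelled out in the guidelines.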
So we got it wrong twice, and the clients saw all of it. How much client trust in our software did this cost us?
For me that was just Thursday's kerfuffle. On Friday our CM system went down for half a day because the CM team changed their install strategy for the build system and then couldn't diagnose the resulting issue without help.

On Wednesday I was asked to review code for a program access fix: implement a way to validate that the calling program is a production version before returning sensitive information that should be limited to a single program. I had stated that this check was required, that we already had all of the pieces implemented in our primary development language, and that we should use them. Instead, I was presented with over a thousand lines of new 'C' code to do the validation.

On Tuesday I found out that for our newest server program, written in our up-and-coming development language, Java, we are recommending that clients reboot it on a daily basis. Oh, and the 'Agile' team that created it did not implement the hooks we normally provide for a standing server so the scheduling system can reach it to do things like reboot it. The clients are very unhappy with the situation (it requires their personnel to perform these actions manually), so could someone please come up with a way we can do this without patching our software?

On Monday and Tuesday I spent a total of 90 minutes in two meetings with development and QA on the test results for a fix. For the next service pack of a prior release, the fix failed testing, not because it did not work, but because an operation it depends on is now failing, and that failure is itself a regression. This should have been a couple of emails: QA needed to test the new regression issue on the current SP for the release and on the new SP to verify that the new SP introduces the issue (once tested, it did), and the new SP should not be released (the failing operation is in a documented critical area). Instead it required 9 people and 90 minutes of meetings, or 13 hours of PD and QA time.

It was a long week, one in a series, and all I can think is that the faster we try to go, the worse it gets.
"Shift Left" - straight into a hole - The Agile / DevOps way. Code first, design once the coding is found wanting, plan after it is found that the design is inadequate.