Wednesday, April 21, 2021

Don't disagree with the master 😁

So today we were in a design session discussing how to fix a troublesome issue.  This has a lot of history, but the short version is we are changing out our database engine and as part of it we changed from using a mix of file types (hierarchical, relative, and linked list) to a solely relational file organization with an adapter layer between the application code and database.  Our old files were structured such that a record with an attached link list would have a field that contained the record number at the start of the linked list, with each linked list record containing a link to the next one.  All of the application code expects this.  As part of the switch to a relational file organization this field was changed to a 1/0 indicator if the records exist or not, allowing the existing application code to continue to work without change with the implementation of an adapter to perform the relational record reads.  I objected to this change when proposed 4 years back, in that it changed the representation of the field from indicating the record the linked list started on to one that indicates if linked list records exist, discarding the retention of the first record in the linked list.

Well today the lack of this discarded data became a project killer.  One of the core design principals of the old database is that these linked list records were only deleted with the database offline in a purge process (with a read only stand-in being used for access).  One of the reasons for the DB switch is to enable live deletes of these records.  Because the entire application was built around the idea of delaying deletes of records (via design of the old DB) and this was not correctly considered during the DB switchover, we have been trying to figure a way to fix a number of operational issues including not being able to purge the linked lists live (still having to use the read only stand-in process).  We came up with a design using 'soft deletes' where records are marked as deleted (which prevents them from being returned in queries) but still exist so if specifically read by their unique key number will be returned.  This would solve a number of problems except for when all the records of a linked list are deleted.  There is a timing issue with stale data, but it comes down to the fact we have to do a query (because of the 1/0 indicator) to start the read of those records and that query will return no records when the indicator says they are there.  This is a huge problem because the application code considers this a database consistency issue.  This problem is why we cannot use the live purge now.  If we had kept the record field representation and changed from storing the record number at the start of the linked list to storing the unique key number of the first record, we could use the 'soft deletes' for the linked list records just the same as all the other records without recoding the application were it touches these lists.

Our only alternative is to recode the 678 (yes six hundred and seventy eight) locations where the application code interacts with the linked list.  This is a 37 year old product with 3.5 million lines of business logic and another 2 million of generated code (DB and external interfaces).  The individual code change is not hard, but this is financial software so each and every change has to be QA tested.  All in, about 10,000 hours of work.  And this is holding up our clients ability to drop the unpopular use of the read only stand-in required to purge records.  Something that was promised to be delivered a year ago.

Told. You. So.  I didn't have this in mind when I stated that we wanted to keep the representation of that field.  This was to my detriment as I was removed from the DB conversion project after this discussion because I wasn't "forward thinking enough".  Now I'm back to 'fix' the issues with it.

And really, this is just another manifestation of the Agile development process.  Code now and fix the issues as they occur.  The people leading this work at a technical level did not take into account the fundamentals of the DB and account for it in their design.  They just kept moving forward as the issues occurred and kept pushing off delivering fixes for them, not understanding the fundamental issues of the legacy DB they were violating as the reason for the issues.  Now they are gone and those of us left have to clean the mess up.