Sunday, May 01, 2022

Audacious Goals

"Set a big future for yourself and use it as motivation to get up every day and get your work done."  — Elon Musk

"Move fast and break things." — Mark Zuckerberg

“Most people overestimate what they can do in one year and underestimate what they can do in 10 years.” — Bill Gates

A Big, Hairy, Audacious Goal, or BHAG, is a long-term, 10-to-25-year goal, guided by your company’s core values and purpose. — Jim Collins and Jerry Porras in their book, Built to Last: Successful Habits of Visionary Companies

The above are some of the quotes that get thrown in my face, directly or indirectly, from time to time by the management team at my company.  Our director of QA loves to talk about how he takes his lead from the way Facebook creates their software.  This week was one of those weeks where someone stated a goal to one of our clients, and it was all I could do not to start screaming or crying - still not sure what the appropriate response would be.

A little background.  We have a client who has gone to RFP to replace us for their core processing, along with any of our partner products.  They have been systematically mishandled by management, mostly by being promised things and then us not delivering, so I understand why this has happened.  The issue is that this client is our 3rd largest by transaction volume and a leader among Credit Unions.  So yeah, huge panic by management, lots of meetings that I just haven't had the heart to comment on.  The client's biggest concern is that they will not be able to grow their CU any further (they want to increase by 30% minimum) because they are hitting the throughput limit of our legacy system.  (This is a valid issue and occurs due to the design of the legacy DB key index system, which runs as a single thread for each database.  The CU has hit its processing limit of about 40,000 actions per second.)  The new relational DB system is not ready to handle clients of this size, both for performance and concurrency reasons, so that is not an option.  For clients of this size (this one is just the most vocal about the limitation) we have to do something about this processing bottleneck.  Like all bottlenecks, there are only two approaches: increase its throughput or decrease the load on it.

So we are on the phone with the client, having come to an understanding of their issues and the ones the Product Development group can address.  Chief among them is that they cannot implement their growth strategy due to the processing limits they have hit with our system.  Some of these can be mitigated; the problem is we don't have a monitoring system on the legacy DB (we do with the new one) that can report what is generating the load with enough specificity to correctly identify its source.  So I have worked since January to implement such a system, and it will be put into production on the client's system the last week of May.  When the client asked how much load we would be able to help them shed from the current system, my new Director said 50 percent.
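
To give a sense of what the monitoring boils down to, here is a minimal sketch of per-source load attribution.  It is written in Java purely for illustration (it is not our legacy language), and every class and field name here is hypothetical, not the actual implementation.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: attribute each key file request to the application and
    // key file that generated it, so peak-time load can be traced back to a source.
    public class KeyFileLoadMonitor {
        // key: "application|keyFile", value: running count of requests
        private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

        public void record(String application, String keyFile) {
            counts.computeIfAbsent(application + "|" + keyFile, k -> new LongAdder())
                  .increment();
        }

        // Snapshot for reporting; in practice this would be bucketed by time so that
        // peak intervals can be compared against the ~40,000 actions per second limit.
        public Map<String, Long> snapshot() {
            Map<String, Long> out = new ConcurrentHashMap<>();
            counts.forEach((k, v) -> out.put(k, v.sum()));
            return out;
        }
    }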

50 Percent - Half of the load on the key file manager - all of it driven by application actions.

I was left flabbergasted.  I figured we could use the new monitoring to identify when they max out the key file system and trace back exactly why.  (Part of it we know: they launch too many concurrent batch jobs against the live database.  But a large portion of the overload during daytime hours is not understood, hence the activity monitoring implementation.)  Once we have the data on the source of the load during peak times, I figured we could reduce the peak load by 10-20 percent, preventing noticeable end-user request latency issues.  To reduce the load by 50% we would have to change the application; there is no way around that.

After we finished with the client, I asked where the 50% number came from.  The Director stated that he threw it out as a BHAG.  This is something he wants done this year.  Something like that is fine as an internal goal, but to throw it out as an expectation to the client, especially one that has already been mishandled to the point they want to leave, is in my opinion simply inexcusable.

As for the key file system bottleneck, I have two potential strategies.  The first is to use the monitoring system in real time to identify the application(s) overloading the key file system and throttle them to provide consistent interactive response times.  This would stabilize the system for the client in the short term, with the long-term goal of moving them to our new DB system when it becomes usable for them.  In short, manage the load on the bottleneck.  The second strategy is to use a quirk of the key file system: it processes each key file request within a silo, such that one request only acts on a single file.  This provides a very straightforward design path to implement an individual processing thread for the most heavily loaded key files, which we can determine from the results of the monitoring system.  In short, improve the throughput of the bottleneck.
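
Because each request acts on a single key file, the second strategy almost designs itself: route requests for the hottest files (identified from the monitoring data) to their own dedicated worker.  A rough sketch, again in Java for illustration only, with all names hypothetical:

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Hypothetical sketch of the second strategy: give the most heavily loaded key
    // files their own single-threaded worker, leaving everything else on the shared one.
    public class KeyFileDispatcher {
        private final ExecutorService sharedWorker = Executors.newSingleThreadExecutor();
        private final Map<String, ExecutorService> dedicatedWorkers = new ConcurrentHashMap<>();
        private final Set<String> hotKeyFiles;

        public KeyFileDispatcher(Set<String> hotKeyFiles) {
            this.hotKeyFiles = hotKeyFiles;
        }

        // Each request is siloed to one file, so routing by file name keeps the
        // existing per-file ordering while spreading the hot files across threads.
        public Future<?> submit(String keyFile, Runnable request) {
            if (hotKeyFiles.contains(keyFile)) {
                return dedicatedWorkers
                        .computeIfAbsent(keyFile, k -> Executors.newSingleThreadExecutor())
                        .submit(request);
            }
            return sharedWorker.submit(request);
        }
    }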

Thursday, February 24, 2022

They manage by the numbers, but don't understand how their actions affect them

So our management team uses numbers to guide them.  How many open defects in the release we are working on, how many in the prior one, how many hours spent developing, how many spent working production issues.  We peons have to account for our hours per project, per task, and by GL category, resulting in us having to log them into 3 different systems in order to supply everyone who wants the information.  We are told to lie, there is no other word for it, in how these numbers are accumulated.  This is done by waiting until a bunch of unrelated issues pile up and then management 'decides' that they are an emergency project that needs to be addressed; because it is now a project, the hours can be depreciated, reducing the 'cost' (tax burden) of fixing them.  But it skews the project-vs-defect hours ratio, making it look like we are making more forward progress than we actually are.  (It also results in the clients having to live with ongoing issues for much longer than they should, but who cares about them - they just pay the bills.)  They will also rework defects into projects for much the same reason, to allow the hours to be depreciated.

Early in 2021 it was decided to reduce expenses by releasing the offshore contractors we had working to build up our automated testing effort.  It was also decided to have that work continue, done by the full-time staff.  What we, the development staff, didn't understand was that there were 20 development contractors experienced in automated testing working on this, plus 4 contracted supervisors paired with 4 QA analyst employees.  The 4 employees were retained; the 24 contractors were all let go.  So the work of 20 full-time developers working on test automation was dumped on the 45 developers in our group.  Until today, I did not know that 24 people were let go when the contractors were released.  What happened over the 10 months after that event is that the director of software development, the lead designer / developer for our database, the number 2 developer for our database, the primary developer and maintainer of our code generation, both of our MSSQL interface experts, the supervisor for user interface development, and 4 business functionality developers all left for other companies.  The database people all worked as an insular group due to the way the lead functioned (which had always been a source of friction between the various work groups), and he would have been informed of the decision to shift the QA work to developers as soon as it was made.  He and the director (his immediate supervisor) were friends outside of work, and the director recruited him into the company.  What I have realized this evening is that these people moved on in part because they knew what was coming, and it incentivized them to search for their next position.

So the work that was being done by 20 contract developers is now to be done by the 35 of us developers who are currently employed.  We have 15 open positions, 10 of them software engineering positions.  The goal is to continue to implement test coverage at the prior pace, and even if we fill all of the open positions it would still take 40-50% of our developers' time to produce the same results as the contracting group that was let go.  And that is if we are as productive as that group of experienced automated test developers.  But somehow management has convinced themselves that we can do this allocating only 10-15% of our time, because "we are the developers that are creating the application code being tested."  The problem is, we are all being assigned automated test cases to be created in the existing application and user interface areas.  For some of us, myself included, this is an area of the code I last worked in 10 years ago, as I have been working on back-end interfaces and database interfaces since then.  Oh yes, all of this work is being tracked within our existing project work instead of being broken out into a separate project or accounting bucket, resulting in the numbers looking like we are making forward progress when in fact all we are doing is back-filling tests in existing areas and not improving the product in any way.

Saturday, August 21, 2021

Classic example of why 'Agile' development fails

So I have just had a rough week, and it was caused by 6 lines of new code.  The reason is a classic example of why Agile development fails in the real world.

Six weeks ago a senior, experienced, and relatively conscientious developer submitted a change that included a new method added to an existing class.  This change was part of a fix for a client-discovered issue when voiding transactions in our core Credit Union software.  The code has a complexity of 2 (2 conditional branches) and consists of 6 lines of active code along with 4 braces.  Its purpose is to take a passed object and either update a copy of it in an internal store or add the copy to the internal store.  This code was submitted by this relatively conscientious developer without unit tests.  (I do not know what his reason for not submitting unit tests in this case was; I have not asked.)  Pause here and think about the description of this code maintenance and you should be able to guess what the issue was with the code.

The code change was tested by QA both manually, to verify it fixed the client-found issue, and with updated automation tests, with both passing.  The code was scheduled for incorporation and was released to the clients for installation in a service pack.  (The Credit Union software release process is regulated by the Federal CU license.  For us the end result is that we have to provide specific releases to our clients and cannot use continuous delivery.)  This service pack, with the fix applied, was installed over the weekend of August 14th/15th on 8 client systems that are using the (new) version of our DB backend.  By the close of business on the 17th we had processing issues at 3 of the 8 that resulted in multiple Member accounts being out of balance, a failed cashier's check disbursement in the Member's presence ("The CU won't give me my money!"), another Member finding their savings account $4,000.00 short after a deposit, and all 3 CUs with out-of-balance General Ledgers.  As you might imagine, the Member visibility elevated the attention paid to this, but the general processing failure in the new DB was the most troubling aspect for the CUs.  Between the multiple CUs, the Member visibility, and the functional failure, the number of people involved with this from our company peaked above 40.  This started with the new DB team working to identify the source of the issue, then the application team, who had to pick their way through the code for each processing failure (the DB failure occurred in multiple areas of the application) to identify what processing was not completed for the different transactions, CSRs to coordinate with the CUs (change requests to fix the DB, notification of all clients using the new DB, and coordinating installation of the fix, which required a database restart), and our management team.  Overall we burned over 400 hours addressing this.  I personally put in over 40 hours in 3 days, as I understand the business logic and the update requirements to fix the Member accounts.

The root cause in the 6 lines of code?  The class that maintains the stored information uses lazy initialization for part of its stored data, and the new code did not check that the stored data object was initialized before checking to see if the passed object was already stored in it.  This resulted in a NullPointerException during a database read that, due to the actions of the application, would occur as part of a write / read / write series of steps, with the read failing and the application not performing the subsequent write.  The fix: check if the stored data object is initialized, and if not, fall back to adding the data as new using the second data store path that was already written.  (Literally, add a "storeObject != NULL &&" to the first conditional of the method.)  Why, with this installed at 8 clients, did only 3 have an issue?  It turns out those 3 had their General Ledgers configured to track offsetting transactions in a specific fashion, and that caused additional DB activity, which resulted in the object being accessed in a way that was not tested for.
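
The actual code is not something I can post, but a minimal reconstruction of the failure and the one-condition fix looks something like this (Java used for illustration; the class, field, and method names are all made up):

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical reconstruction of the 6-line method and its fix.
    public class LazyStore {
        // Lazily initialized elsewhere, so it can legitimately still be null here.
        private Map<String, String> storeObject;

        // Broken version: assumes storeObject has already been initialized.
        public void updateOrAddBroken(String id, String value) {
            if (storeObject.containsKey(id)) {       // NullPointerException when storeObject is null
                storeObject.put(id, value);
            } else {
                addAsNew(id, value);                 // the second data store path
            }
        }

        // Fixed version: literally the one added condition described above.
        public void updateOrAddFixed(String id, String value) {
            if (storeObject != null && storeObject.containsKey(id)) {
                storeObject.put(id, value);
            } else {
                addAsNew(id, value);                 // also covers the not-yet-initialized case
            }
        }

        private void addAsNew(String id, String value) {
            if (storeObject == null) {
                storeObject = new HashMap<>();
            }
            storeObject.put(id, value);
        }
    }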

For me, this is why Agile development fails in the real world.  This developer submitted a change without unit tests (why doesn't matter), that was code reviewed and approved, that passed QA testing, but when the clients started using it the result was chaos and hundreds of hours of cleanup work.  The clients in the real world performed the QA, using Member transactions.  Regardless of how good your processes are, Agile expects (requires?) near perfection from developers because, in real-world applications, there are too many configuration variations for any level of automated testing to cover when maintenance is done.  Why was this maintenance done?  To fix a client-found issue in processing.  Why was this found by the clients and not in development?  Because development was done without taking into account all the known usage patterns of the clients, just what the development team thought was most common, with the expectation that we would handle the exceptional stuff, the transaction void process, as part of ongoing maintenance.  This latter part is from their project notes, such as they are.  At its heart this whole disaster did not start with the bad code maintenance, but with the expectation that there will be ongoing rework of base functionality of an application after it was 'delivered.'

Friday, July 30, 2021

You know management doesn't have a clue when . . .

Today management put out a 15-minute fiscal year in review for our primary project.  The intent of the project is to switch the DB of our primary product from a legacy home-grown one (30+ years old) to a more modern one.  Toward the end of the accomplishments, this director listed the conversion of one of our larger clients over to the new system as an accomplishment.  The problem is, it was done by accident!  They were not intended to be part of the auto-migration process and were supposed to be exempt from it!  And yes, they are completely pissed off about it.  It has been nothing but production headaches for them (including one today that was being worked while this accomplishment was being trumpeted), including some corrupted data and lots of processing slowdowns.

This really left me with a bad feeling in the pit of my stomach.  If this VP doesn't understand that this occurred because of an oversight, is an ongoing issue, and so is not something to be touted as an accomplishment, what else has he been misinformed about?

Wednesday, July 14, 2021

Another working remote Fail and another Agile Fail

So yesterday I came home and cried, I was so frustrated with the day.

My boss chimed in on a client emergency and sent the team working on it down the wrong path for half a day.  I knew his analysis was wrong the instant I read his email.  If he were working in the office, we would have had a personal discussion before providing guidance to the team, but because of working remote, and him being 2 hours ahead, the team working to fix the client's issue wasted half a day on the wrong analysis.  The day already sucked; I wasted a couple hours of my time correcting their course, then I started the code review.

There is a project reworking our legacy trouble reporting code to provide JSON-formatted output in addition to the legacy text output.  The JSON output will be used to transmit the information from our clients back to our in-house systems for analysis.  (This is something we've wanted for years, a good improvement project.)  Last week I was asked to review the initial set of changes and found code that, by analysis, I could tell didn't work.  The developer tasked with these changes is known to take whatever shortcuts he thinks he can get away with, write lots of repeating code (instead of methods), have no sense of separation of concerns, and not adequately test before submitting code.  He was in the group of developers who were the last to be transitioned from Waterfall development to Agile, as we "need to be uniformly an Agile development shop."  So I provided feedback in 23 code review notes and a 2-hour discussion / walk-through of the code to try and fix this.  Of primary concern were the lack of a JSON formatter for output (our primary development language does not have a library available) and repeating code throughout.  In the walkthrough I gave detailed information on how to build this so it would be flexible going forward and robust in its processing.  The short of it is, when reviewing the new code, about half the information I provided (all of which was written down) was followed and the rest was not.  He also did all of the changes in the module and submitted them, instead of doing a minimal set and having me check it for correctness before doing the rest of the module.  It is 1102 changed lines in a single submission.  This is code that has to work every time: when it runs, a user-facing program is about to exit due to an internal error, and we have to have a record of why in order to attempt to diagnose the reason (our language is not memory safe, so memory-related issues are the primary concern).  I provided another 7 notes before I just gave up.  I asked why my guidance was not followed, and was told that he was "in a hurry because this is Agile and didn't feel the need to follow everything [he] was told to do."  The stuff he skipped is going to bite him in the ass when he moves on to the other modules, and I don't have the mental attitude to deal with it.
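
For what it's worth, the formatter I asked for is not complicated.  Here is a rough sketch of the shape of it, written in Java for illustration only (our actual language is different and has no JSON library available); all names are hypothetical:

    // Illustrative sketch of a minimal JSON object writer, so every trouble report
    // field goes through one formatting and escaping path instead of hand-built
    // strings repeated throughout the module.
    public class JsonObjectWriter {
        private final StringBuilder out = new StringBuilder("{");
        private boolean first = true;

        public JsonObjectWriter field(String name, String value) {
            if (!first) {
                out.append(',');
            }
            first = false;
            out.append('"').append(escape(name)).append("\":\"").append(escape(value)).append('"');
            return this;
        }

        public String finish() {
            return out.append('}').toString();
        }

        // Centralized escaping: the whole point is that this logic exists exactly once.
        private static String escape(String s) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                switch (c) {
                    case '"':  sb.append("\\\""); break;
                    case '\\': sb.append("\\\\"); break;
                    case '\n': sb.append("\\n");  break;
                    case '\r': sb.append("\\r");  break;
                    case '\t': sb.append("\\t");  break;
                    default:   sb.append(c);
                }
            }
            return sb.toString();
        }
    }

With something like that, each trouble report field becomes one call along the lines of new JsonObjectWriter().field("module", moduleName).field("error", errorText).finish(), and the escaping rules live in exactly one place instead of being repeated throughout the module.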

This code has already failed to work in-house, as (because of Agile) we are checking into our Trunk directly.  I expect it to periodically fail in production when this code hits the client systems.  Before Agile, this code would never have been incorporated, and this idiot would have been shown how to rework it before it was.  Now, he checks into Trunk and the rest of us have to clean the mess up.  His team is led by someone whose errors are simply ignored, and once they have made a mess of something, assignments are rotated and they, somehow, are never left supporting the changes they have screwed up.

Saturday, July 03, 2021

Why our company is failing at Agile

So on Tuesday and Wednesday of this week I was bombarded with questions concerning a project we worked on for inclusion in the release we are about to ship.  (Our company has to ship our product in testable releases to our clients due to regulations.)  The people asking the questions attended every design session and demo, but now, 5 months after the work is complete, they are asking detailed questions about the functionality.  Why?  Because now they need to update the release documentation.  Why this was not done at the time of the project is a mystery to me.  This is one of those processes where the people involved have not adjusted their thinking away from the Waterfall project / release model to our current Agile-ish model.  The documentation was not updated in parallel with the development, but instead is being updated now that they are finalizing the release and it is 'time' to update the documentation.  So I spent many hours going back over what we implemented to answer these questions, and in one case simply demo'd the functionality as it exists in the product.  Waste. Of. Time.

Monday, June 28, 2021

Typical result of the Agile code-first approach

Doing a code review today involving a project to remove functionality that is no longer used by 3rd parties.  Someone posted a comment in the review: "So I know it would be a large work effort addition, but have we discussed removing <columnName> from the table?  It is only used for <business function being removed>."  This is typical of what I have seen of Agile development (by supposedly experienced Agile developers): they start something, start coding, then as they hit issues their scope expands because they didn't have a correct design to start with.

The answer to the question that was asked was yes, we need to remove all of the functionality supporting the business function being removed, not just the user interface for it.  Their reply was that the scope change tripled or quadrupled the work they need to do.  If you have ever worked on a system with a percentage of its code associated with no-longer-used functionality, you know the answer they received: "You need to pull all of the code and columns exclusive to <business function>."  Identify it, remove it, and supply QA with the information to test the remaining functionality touched by these changes.