How to troubleshoot a problem in a computer system? There are a number of standard troubleshooting methodologies like trial-and-error, bottom-up, top-down, divide-and-conquer, and even root cause analysis, but none of them is a silver bullet for all possible problems. Often the most effective way is to dynamically combine these approaches based on experience and understanding of the underlying technologies. This is a story of a particularly hard problem with an unexpected cause that I troubleshooted recently.
The problem I was facing was related to a Midgard staging/live setup using the Exorcist tool. The staging site was working just fine but the live site was having weird problems on some pages. It appeared as if the MidCOM components of the troublesome pages were losing a part of their configuration parameters.
Step 1. Because everything was working fine on the staging site I figured that the problem is most likely related to the staging/live replication step performed using the Exorcist. Were the parameters not being replicated? None of the common causes for such situations (unapproved content, parameters in a different sitegroup than the main object, etc.) seemed to apply however so I had to inspect whether the replication dump actually contained the missing parameters. It did!
Step 2. So the parameters were being replicated, but was there some trouble in how they were being imported in the live database? There was nothing strange in the import log file, so I opened a MySQL prompt to inspect the live database. The troublesome parameters seemed to exist just where they should be, and they even had all the correct GUID mappings and other details that had previously caused problem with the Exorcist.
Step 3. It seemed that the live database was in perfect condition, so the problem must be caused by something else. The problem was affecting just some MidCOM components, so I figured that there might be some difference in the component versions used on the staging and live hosts. No, the MidCOM source trees on both the staging and live hosts were identical.
Step 4. If the problem wasn’t with MidCOM, then it must be a problem with the Midgard framework. Adding debug prints to one of the troublesome MidCOM components I was able to narrow down the problem to a
$object->listparameters($component) call in the MidCOM configuration class. The configuration class seemed to be working just fine everywhere else, so the problem must be in the listparameter method! For some object/component combinations the method just returned nothing and didn’t log any errors even though I double- and triple-checked that the parameters existed in the database.
Step 5. I added a set of debug prints to the Midgard-PHP listparameters method, compiled and installed the modified module, and reloaded the troublesome pages. They still didn’t work properly, but now I got a detailed trace of what was happening under the hood. The method actually did execute the correct SQL query but received no results from the database. This was certainly weird, as I was very certain that the parameters in question actually did exist.
Step 6. Could there be something wrong with the MySQL client library used by the Midgard core to send queries to and receive results from the database? The normal MySQL command line tools use the same library so I started the MySQL prompt and entered the SQL statement copied from the Midgard log file. No results! The SQL statement was:
SELECT id,domain,name,value FROM record_extension WHERE (tablename='topic' AND oid=27 AND domain='de.linkm.sitemap' AND lang=0) AND (record_extension.sitegroup in (0,2)) ORDER BY name
Step 7. There are some autogeneration artifacts in the query, but no real reason why it shouldn’t work. I removed the
domain='de.linkm.sitemap' constraint and got a list of parameters, including ones with the domain column set to
de.linkm.sitemap! Even worse, the exact same query with
domain='midcom.helper.nav' were returning results, for some totally unknown reason MySQL seemed to not like the
domain='de.linkm.sitemap' constraint. I tried also some other variations of the query and found out that even just removing the
ORDER BY clause made the query return the correct rows. This should ring heavy warning bells for anyone who knows SQL, as the
ORDER BY clause only orders the returned rows.
Step 8. Could this be some weird character set and collation issue like the ones we’ve been facing recently? No, it works the same way (rows are returned or not returned depending on whether the ORDER BY clause is used) regardless of the character encoding being used.
Step 9. Has the database been corrupted? I backed up and restored the entire database but the query was still misbehaving. I then copied the correctly behaving database from the staging host to the live host. The same error appeared in that database as well!
Step 10. The staging host was running MySQL version 4.1.12 while the live host had MySQL 4.1.7. Both stable and tested releases installed from the standard RHEL packages. I was recalling ugly memories from the MySQL 3.x times, when weird database errors seemed to be much more common.
Step 11. OK, it seemed that the problem was some MySQL bug that got fixed somewhere between versions 4.1.7 and 4.1.12. I didn’t want to risk messing with the database installation so I figured I’d better find some workaround rather than trying to upgrade the database version.
Step 12. I used the EXPLAIN statement trying to find out what could be triggering the MySQL bug. The output suggested that the query planner was using some of the indexes on the record_extension table. The indexes actually look quite weird:
The indexes don’t seem to be reasonable. Why the arbitrary ten character limit, and why use only a single column per index? This seems anomalous enough to trigger an obscure bug in MySQL, so I just dropped the record_domain_tablename_idx index. And it worked!
The entire troubleshooting session took about three hours with a couple of breaks in between.