ColdFusion Muse

Load Testing Gotcha - Testing Lighter Loads

An interesting issue came up last week. A customer has a production server servicing a high load and running CFMX 6.1. We migrated his site from CF 4.5 "spaghetti" code sprinkled with custom tags to CFMX code with CFCs. The project was thorough and included load testing as well as dev and staging environments. When we launched, it seemed fine - better than expected, actually. It ran fine for about 2 weeks, then suddenly developed a problem: under a very light (almost nonexistent) load, the server slowed to a crawl.

The web server talks to a legacy mainframe through a third-party ODBC driver. We thought we had tested everything, but this brings up an important point. When you are load testing, you should consider every load the server might experience, and you should also test variances in that load. Our scripts (using WinRunner) used 3 tiers of concurrent users. We were hammering away at specific capacity points, looking to see where the system would bog down. There was some variance in the testing, but mostly we were testing to "see how much it could handle".

We should have included some wider variances - going dormant after a heavy load, for example. If we had, we would have found that the mainframe tuned its listener and removed threads it no longer needed, respawning them as necessary. This behavior of dropping child processes on the mainframe only showed up under a nearly dormant load following a heavy load, because under a heavy load the worker threads were almost never dropped. This little nuance caused the JDBC to ODBC to mainframe listener pool to fall out of sync, causing database timeouts and removeOnException errors.
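A minimal sketch of what such a wider profile might look like, written in Python rather than as a WinRunner script. The user counts and durations are made-up placeholders (our real tiers are not listed here), and the names `Phase`, `tiered_profile`, `with_dormant_probe` and `timeline` are hypothetical helpers for illustration only. The idea is simply to tack an idle period and a near-dormant probe onto the end of the usual capacity tiers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Phase:
    name: str        # label used in reports
    users: int       # concurrent simulated users (0 means fully dormant)
    duration_s: int  # how long to hold this level, in seconds


def tiered_profile(tiers: List[int], hold_s: int) -> List[Phase]:
    """Capacity-style profile: step through fixed tiers of concurrent users."""
    return [Phase(f"tier-{users}", users, hold_s) for users in tiers]


def with_dormant_probe(profile: List[Phase], idle_s: int, probe_s: int) -> List[Phase]:
    """Append an idle period, then a single-user probe, after the heavy phases."""
    return profile + [
        Phase("dormant", 0, idle_s),
        Phase("light-probe", 1, probe_s),
    ]


def timeline(profile: List[Phase]) -> List[Tuple[int, str, int]]:
    """Expand a profile into (start_second, phase_name, users) rows."""
    rows = []
    start = 0
    for phase in profile:
        rows.append((start, phase.name, phase.users))
        start += phase.duration_s
    return rows


if __name__ == "__main__":
    # Placeholder numbers: three tiers held for 10 minutes each,
    # 30 minutes of silence, then a 5 minute single-user probe.
    profile = with_dormant_probe(
        tiered_profile([50, 100, 200], hold_s=600), idle_s=1800, probe_s=300
    )
    for start, name, users in timeline(profile):
        print(f"{start:>5}s  {name:<12} users={users}")
```

The dormant phase needs to be long enough for the back end to actually trim its idle workers, so the idle time is something to tune against the system under test rather than a fixed number.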

To be sure, testing these extra conditions might not be necessary in every case. We often work with situations where everything about the system is pretty well known, and our experience guides us. The extra "unknowns" about this system (the mainframe behavior and the db driver) made it necessary to test each possibility. If we had, we would have had a more direct contingency and a little less "chicken little" activity. We should also have dropped child processes on the HP to see what recovery looked like on the web server. We did kill the HP altogether, but that is a "brute force" test where recovery is pretty well assured. Most agents and drivers can recover from a reboot. That's why a 17-year-old high school student can do first tier help desk support and look like a genius... "Uh ma'am, have you tried rebooting the PC?" It's the little nuances in behavior between disparate systems that cause exasperation.
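To make the "drop a child process and watch the recovery" test concrete, here is a toy model in Python. It is not how the real listener or the ODBC driver behaves. `Listener`, `ClientPool` and `run_drop_test` are hypothetical names, and the discard-on-error policy is an assumption chosen to keep the sketch small. It only shows the shape of the test: kill some workers behind the pool's back, send light traffic, and count how many requests fail before things settle.

```python
import collections
import itertools


class Listener:
    """Toy stand-in for the back-end listener and its child processes."""

    def __init__(self, workers: int):
        self._ids = itertools.count()
        self.alive = {next(self._ids) for _ in range(workers)}

    def drop(self, worker_id: int) -> None:
        """Fault injection: kill one child process."""
        self.alive.discard(worker_id)

    def spawn(self) -> int:
        """Start a replacement child and return its id."""
        worker_id = next(self._ids)
        self.alive.add(worker_id)
        return worker_id

    def handle(self, worker_id: int) -> str:
        if worker_id not in self.alive:
            raise TimeoutError(f"worker {worker_id} is gone")
        return "ok"


class ClientPool:
    """Toy web-side pool: each connection is pinned to one back-end worker."""

    def __init__(self, listener: Listener, size: int):
        self.listener = listener
        self.conns = collections.deque(sorted(listener.alive)[:size])

    def request(self) -> str:
        worker_id = self.conns.popleft()
        try:
            result = self.listener.handle(worker_id)
        except TimeoutError:
            # Assumed policy: discard the broken connection, open a fresh one.
            self.conns.append(self.listener.spawn())
            raise
        self.conns.append(worker_id)
        return result


def run_drop_test(pool_size: int, drops: int, requests: int) -> int:
    """Drop `drops` workers, send `requests` requests, return the failure count."""
    listener = Listener(pool_size)
    pool = ClientPool(listener, pool_size)
    for worker_id in range(drops):
        listener.drop(worker_id)
    failures = 0
    for _ in range(requests):
        try:
            pool.request()
        except TimeoutError:
            failures += 1
    return failures


if __name__ == "__main__":
    print("failures with no drops:", run_drop_test(4, 0, 20))
    print("failures with 2 drops: ", run_drop_test(4, 2, 20))
```

In this toy, every stale connection costs exactly one failed request before it is replaced. A real driver stack may recover far less gracefully, which is the whole reason to run the test against the real system.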
