A successful startup company with an ailing server contacted me recently. The company sold beautiful photography sites with a customizable Flash interface. The sites feature public and private galleries and an amazing dashboard that uses a Flash uploader for ColdFusion to let a user upload multiple files at once. It was very cool stuff. They called me because their server was experiencing difficulties like those I described in some previous posts on Hanging Threads. What we found was a brand new clue...
This particular customer was using a server with no local hard disk. Instead there were 2 iSCSI partitions - a system "c:" partition and a data "d:" partition. The site was so file heavy that they had set it up with a substantial amount of disk space on an iSCSI array. This made me think they were onto something in suspecting a network issue, since virtually nothing that happened on the server was unrelated to the network. All the file I/O was going through the network, along with the typical network traffic.
I thought to myself, "self... I bet you can solve this problem with a few hours of troubleshooting." Boy was I wrong! Before I tell you about the plan let me explain the problem. As I said, the system allowed site owners to upload multiple files at once. This is done through a Flash interface that sends serial POST requests to a ColdFusion handler. These are very expensive requests. They are long running by nature, and the end game here is resizing files (using a COM object in this case - on CFMX... ick!) and storing several different sizes and resolutions in different folders. These folders could potentially contain thousands of files. Using SeeFusion (wow.. what a product! I wish they had an ASP, PHP and .NET version) and some raw log files, it was easy to see that these long running upload requests were at least related to the hanging server problem. Yet it was also clear that the server could hang without crossing the simultaneous request threshold - which is no doubt what led the site owners to my blog post.
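If you want to check the "hung without crossing the simultaneous request threshold" claim against your own logs, here is a minimal Python sketch. It assumes you have already boiled each request down to a (start time in seconds, duration in seconds) pair. That format and the name `peak_concurrency` are my own inventions for illustration, not anything SeeFusion or ColdFusion gives you directly.

```python
def peak_concurrency(requests):
    """Return the largest number of requests in flight at the same moment.

    `requests` is an iterable of (start_seconds, duration_seconds) pairs,
    a hypothetical format you would have to build from your own logs.
    """
    events = []
    for start, duration in requests:
        events.append((start, 1))              # request begins
        events.append((start + duration, -1))  # request ends

    # At the same timestamp, process ends (-1) before starts (+1) so that
    # back-to-back requests are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

Compare the number it returns for the minutes leading up to a hang with your simultaneous request setting. If the peak stays below the setting, as it did here, request queuing alone does not explain the hang.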
We figured that the problem was network congestion or port negotiation. I also speculated that the iSCSI array exacerbated the problem because all of the file I/O was going through the Network Interface Card. So our plan focused on 2 areas - troubleshoot the network and reduce file I/O. We did the following:
Well... when you are troubleshooting an iSCSI array where the system partition is also on the array - make sure the system partition is performing well and has enough space to handle a contiguous page file. My guess is that the actual physical layout of the array is out of the control of the OS - and is handled by the controller. That being the case, the OS can't guarantee that the page file is contiguous. That means a heavily used page file may become fragmented and suffer performance problems when space is tight. Or, I could be full of bantha fodder. In any case, pay attention to paging and system partition space on an iSCSI array.
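For the space half of that advice, a quick check is easy to script. This is a minimal Python sketch using the standard library's `shutil.disk_usage`. The function name, the page file size you pass in and the 20% headroom default are all my own assumptions for illustration, not numbers from this server. Free space says nothing about whether the page file is actually contiguous, so treat it as a first warning sign only.

```python
import shutil


def has_room_for_page_file(path, page_file_bytes, headroom=0.20):
    """Return True if the volume holding `path` can fit a page file of
    `page_file_bytes` and still keep `headroom` (a fraction of the total
    volume size) free.

    The 20% default is an arbitrary illustrative figure, not a rule.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    required = page_file_bytes + int(usage.total * headroom)
    return usage.free >= required
```

On the server in this story you would point it at the system partition, something like `has_room_for_page_file("C:\\", 4 * 1024**3)` for a 4 GB page file, and run it on a schedule so you hear about a shrinking system partition before the server starts hanging.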