ColdFusion Muse

ColdFusion, File I/O and an iSCSI Array - Swapping Mayhem

Mark Kruger October 7, 2006 12:09 AM Coldfusion Troubleshooting Comments (1)

A successful startup company with an ailing server contacted me recently. This company sold beautiful photography sites with a customizable Flash interface. The sites feature public and private galleries and an amazing dashboard that uses the Flash uploader for ColdFusion to allow a user to upload multiple files at once. It was very cool stuff. They called me because their server was experiencing difficulties like those I described in some previous posts on Hanging Threads. What we found was a brand new clue...

This particular customer was using a server with no local hard disk. Instead there were 2 iSCSI partitions - a system "c:" partition and a data "d:" partition. The site was so file heavy that they had set it up with a substantial amount of disk space on an iSCSI array. This made me think they were onto something in suspecting a network issue - since virtually nothing that happened on the server was unrelated to the network. All the file I/O was going through the network, along with the typical network traffic.

The Plan

I thought to myself, "self... I bet you can solve this problem with a few hours of troubleshooting." Boy was I wrong! Before I tell you about the plan let me explain the problem. As I said, the system allowed site owners to upload multiple files at once. This is done through a Flash interface that sends serial POST requests to a ColdFusion handler. These are very expensive requests. They are long running by nature, and the end game here is resizing files (using a COM object in this case - on CFMX... ick!) and storing some different sizes and resolutions in different folders. These folders could potentially contain thousands of files. Using SeeFusion (wow.. what a product! I wish they had an ASP, PHP and .NET version) and some raw log files, it was easy to see that these long running upload requests were at least related to the hanging server problem. Yet it was also clear that the server could hang without crossing the simultaneous request threshold - which is no doubt what led the site owners to my blog post.

We figured that the problem was network congestion or port negotiation. I also speculated that the iSCSI array exacerbated the problem because all of the file I/O was going through the network interface card. So our plan focused on 2 areas - troubleshoot the network and reduce file I/O. We did the following:

  • We set the switch port and NIC to hard coded 1Gig (Gigabit Ethernet over copper).
  • We worked with the data center pros to fine tune the iSCSI settings and other settings like flow control.
  • We tweaked the upload script to reduce the number of calls to the file system. For example, we reduced the number of times we called "fileExists( )". We found a way to remove one of the copy operations, and we reduced the frequency of the uploads from the flash object.
  • We added more RAM to the server and retuned the JVM to use the maximum amount of RAM and the ConcMarkSweep garbage collector.
  • We moved the admin interface to its own app pool in IIS and tweaked the number of worker threads available (this had the potential to affect the upload operation prior to the hand-off to IIS).
  • We tweaked the memory allocation for MySQL (also running on this server - to my chagrin).
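As an illustration of the "fewer calls to the file system" idea in the list above, here is a minimal sketch in Python (not the ColdFusion code we actually changed). It assumes the handler needs to pick a free file name in a gallery folder: instead of asking the disk "does this exist?" once per candidate name, it reads the directory a single time and answers every later check from memory. The function names `existing_names` and `unique_name` are hypothetical.

```python
import os


def existing_names(folder):
    """Read the folder once and return the names of the files it contains."""
    with os.scandir(folder) as entries:
        return {entry.name for entry in entries if entry.is_file()}


def unique_name(filename, taken):
    """Return a name that is not in `taken`, adding a numeric suffix if needed.

    `taken` is an in-memory set, so no file-system call is made here.
    """
    if filename not in taken:
        return filename
    stem, ext = os.path.splitext(filename)
    counter = 1
    while f"{stem}_{counter}{ext}" in taken:
        counter += 1
    return f"{stem}_{counter}{ext}"
```

On a folder with thousands of files, one scan per upload batch replaces many individual existence checks. The trade-off is that the set can go stale if another request writes to the same folder, and the comparison is case-sensitive, which does not match Windows file name rules, so treat this as a sketch rather than a drop-in fix.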
And did any of this help? Nope.... the threads just kept on hanging. In fact, in spite of the added RAM, we noticed the swap file was still quite busy. At this point we were at a loss. The site owner noticed that the C: drive (the system partition), which was a 5 gig partition, was down to about 1.5 gigs of free space. More to have something to try than anything else, he started deleting unneeded files from the C: drive till he got the free space up to about 2.5 gigs. Miraculously this seems to have solved the problem.

Lessons Learned

Well... when you are troubleshooting an iSCSI array where the system partition is also on the array - make sure the system partition is performing well and has enough space to handle a contiguous page file. My guess is that the actual geometry of the array is out of the control of the OS - and is handled by the controller. That being the case, the OS can't guarantee that the page file is contiguous. That means a heavily used page file may become fragmented and suffer from performance issues related to space. Or, I could be full of bantha fodder. In any case, pay attention to paging and system partition space on an iSCSI array.
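If you want to keep an eye on system partition space, a small check like the following Python sketch is one way to do it. It is not something we ran on this server. The 40% threshold is an arbitrary assumption, chosen only so that the numbers from this story (1.5 gigs free of 5 versus 2.5 gigs free of 5) land on opposite sides of it, and the function names are hypothetical.

```python
import shutil


def is_low_on_space(total_bytes, free_bytes, min_free_fraction=0.4):
    """Return True when the free share of the partition is below the threshold.

    min_free_fraction is an assumed value, not a documented limit.
    """
    if total_bytes <= 0:
        return True
    return (free_bytes / total_bytes) < min_free_fraction


def check_partition(path, min_free_fraction=0.4):
    """Look up real usage for `path` and apply the same threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_low_on_space(usage.total, usage.free, min_free_fraction)
```

Calling `check_partition("C:\\")` from a scheduled task and raising an alert when it returns True would have pointed us at the system partition much sooner than all the network tuning did.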

  • Posted By Michael White | 10/8/06 9:23 PM
    As a network/server admin guy, I often deal with stuff like this so I always start with the basic server/OS health.

    I also do a little photography and I'm looking to re-do my site... any chance of a hint on this start up company's URL?