Let me set the scene. A client's server was set up with about 20 sites, one of which was quite busy. After what has been described as a "spontaneous reboot," the server began having problems. It would stay up with all of the sites enabled except for the one busy site. As soon as that site was enabled, running requests would climb slowly until they reached the simultaneous request threshold, then queued requests would climb until the server was unresponsive.
NOTE: There is a follow-up to this post.
I took the following steps:
Maybe the DB connection is just flaky. I opened a command prompt and started a ping:
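Nothing fancy - just a continuous ping so I could watch for drops or latency spikes while the site was under load (the host name below is a stand-in for the DB server's actual address):

    ping -t db-server-address

The -t switch keeps the ping running until you stop it with Ctrl+C.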
I took a look at network utilization for the NIC on the web server. Although it was high, it wasn't so high as to make me think it was past capacity. It ran at about 65 percent. Because of the nature of Ethernet and TCP/IP this is actually a fairly high number (duplexing notwithstanding); it meant the NIC was near capacity. I logged into the DB server via RDP and took a look at its network utilization (no problem - 20 to 30 percent). I checked the link speed. These two boxes were connected to the same switch, and both were running at 10 megabits. The provider in this case restricts each box to 10 megabits. So the DB server, the good web server, and the bad (or perhaps just misunderstood) web server were all running at 10 megabits.
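Incidentally, if you'd rather not click through adapter property dialogs to find the link speed, you can pull the negotiated speed from a Windows command prompt (speed is reported in bits per second); something along these lines:

    wmic nic where "NetEnabled=true" get Name, Speed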
Modern switches will often try to negotiate for a higher speed. Perhaps the port for the "bad" web server was set to auto-negotiate and was periodically crowding the NIC with negotiation requests. This doesn't explain why RDP continued to run fine, but perhaps the DB packets were more fragmented or came in faster and were queued up in the buffer. In any case, network capacity seemed marginal for the amount of traffic (about 20 to 40 visitors a second). I set the NIC to auto-negotiate with some fear - I thought I might end up permanently disconnected. Instead, the auto-negotiation spooled the NIC up to 100 megabits. This resulted (to my everlasting delight) in a 100 percent improvement. Requests were handled and no longer hung. The server began to perform perfectly.
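For context on why 10 megabits felt marginal, here's the rough back-of-the-envelope version (approximate numbers, not measurements):

    10 Mbit/s                 ≈ 1.25 MB/s of raw throughput
    at 65% utilization        ≈ 0.8 MB/s actually in flight
    0.8 MB/s ÷ ~30 visitors/s ≈ 27 KB per visitor for page output and DB traffic combined

That's not a lot of room for a busy site, which is part of why the jump to 100 megabits made such an immediate difference.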
Because of the way that ColdFusion deals with timeouts, the dreaded "hanging thread" is often related to ColdFusion's access to external services. That "timeout requests after N seconds" setting only applies to threads that ColdFusion can terminate. When it "hands off" a thread to a database driver, it's like the day after a first date. It's really hoping the database server calls back. When it doesn't, ColdFusion just sits in its room waiting by the phone - a poor lonely thread, hanging there (it makes me tear up). So anything with the potential to interrupt or stall that call-back process has the potential to hang your server - including low-level networking issues.
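One partial mitigation worth knowing about: cfquery has its own timeout attribute, which asks the database driver (rather than ColdFusion) to give up on a query. It only helps where the driver honors it, but it's a separate knob from the admin-level request timeout. A minimal sketch, with made-up names:

    <!--- ask the driver to abandon the query after 30 seconds; honored only if the driver supports it --->
    <cfquery name="getBusyPageData" datasource="myDSN" timeout="30">
        SELECT id, title
        FROM   content
    </cfquery>

It's no substitute for a healthy network link, but it can keep one slow query from tying up a thread forever.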
In this case, I believe that something is wrong with the switch. I believe that the auto-negotiation process "solved" something in the configuration of this specific port. Otherwise, why did the "other" web server (equally busy) continue to operate fine under load… a mystery for another time, perhaps. In any case, I now have one more thing to check on my checklist. Luckily, it's a simple thing that I can eliminate as a possibility within minutes.
The "spontaneous reboot" was not recorded in the event logs. I suspect that this was a network outage - probably a reboot of the switch or router. When I see a "broken pipes" type message in the stack trace I will now have another thing to consider - that the DB is losing connectivity.