ColdFusion Muse

Network Issues and Hanging Threads Part II

Mark Kruger May 25, 2006 7:10 AM Coldfusion Troubleshooting Comments (1)

The plot thickens. Those of you who read my previous post on this issue will be gratified to know that a second customer with a "hanging server" problem found relief by troubleshooting the network. The symptoms were slightly different. On this server JRUN's processor usage would peg at 90% to 95% but the server would stay "up". This caused a "paucity of processor capacity" so that other things could bring the server down. We did not suspect networking because no log file errors indicated networking problems. There were no socket errors and no TCP/IP errors in the Windows event log. Whenever we tested the database we found connectivity was "up" and capacity utilization was low. After my experience with the previous customer, whose problem was solved by resetting the link speed (see my previous post for more detail), we decided to take a closer look at networking.
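A one-off connectivity test like the ones described above can report "up" even when the path stalls intermittently. As a rough illustration only (this is not the tooling used on either customer's server), the sketch below times repeated TCP connects to a database port and records failures. The function names, the host name and the port in the usage note are hypothetical, and a connect probe will not necessarily reveal trouble on long-lived pooled sockets.

```python
import socket
import time


def probe_connect_times(host, port, attempts=20, timeout=2.0, pause=0.0,
                        connect=socket.create_connection):
    """Time repeated TCP connects to (host, port).

    Returns one entry per attempt: elapsed seconds on success,
    or None if the attempt failed or timed out.
    `connect` is injectable so the probe can be exercised without a network.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            conn = connect((host, port), timeout)
            conn.close()
            results.append(time.perf_counter() - start)
        except OSError:  # covers refused connections and timeouts
            results.append(None)
        time.sleep(pause)
    return results


def summarize(results):
    """Reduce probe results to a few numbers worth logging."""
    ok = [r for r in results if r is not None]
    return {
        "attempts": len(results),
        "failures": len(results) - len(ok),
        "slowest": max(ok) if ok else None,
    }
```

Something like `summarize(probe_connect_times("dbserver", 1433, attempts=100, pause=1.0))`, run from the web server during a busy period, would show whether connects occasionally fail or take far longer than usual.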

What we found was an unmanaged switch sitting in between the DB server and the Web server. It's impossible to tell what was "really" going on because the switch was unmanaged, but I suspect a queue buffer was being overrun or perhaps the ports were re-negotiating periodically. Rebooting this switch resulted in an immediate "fix" of our problem. In other words, after rebooting the switch JRUN processor usage immediately dropped to acceptable levels. I'm wagering that JRUN was maintaining unused sockets through the switch that were "killed off" when the switch was rebooted.

Lessons Learned

As in the previous case, there was nothing really "wrong" with the configuration. An unmanaged switch running 100 Mbps full duplex with auto-negotiating ports should not create any particular problems unless all the network capacity is used up (it was not). Because there were no overt issues on the server there were no real clues in the various log files and stack traces to tell us about this problem - nothing we could "hang our hat on".

The only real common denominator was that in both cases the problem apparently had to do with port settings on the switch. In both cases it was a busy web server connecting via JDBC/TCP to a database server over an internal network. In one case it was through a decent "managed" switch and in the other case it was an "unmanaged" switch. I suspect duplexing or handshaking protocols, but I haven't had the time to try and duplicate the problem on a test network setup - and I don't even know if it is possible to duplicate it. My next step is to look for a network counter or utility that will help me examine the sockets and their activity. Part of my triage from this point forward will be to examine switching, link speeds and intermediate directors.
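One low-tech way to examine sockets, pending a better counter or utility, is to tally TCP connections by state from `netstat -an` output. The sketch below is only an illustration under that assumption: the function name is hypothetical and the sample addresses are made up, not taken from either server.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connections by state from `netstat -an` style text.

    Handles both the Windows layout (Proto, Local, Foreign, State) and the
    Linux layout (Proto, Recv-Q, Send-Q, Local, Foreign, State) by taking
    the first all-uppercase word after the protocol column as the state.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].lower().startswith("tcp"):
            continue  # skip headers, blank lines and UDP rows
        for token in fields[1:]:
            # Addresses contain digits, dots or colons, so they never match.
            if token.isupper() and token.replace("_", "").isalpha():
                counts[token] += 1
                break
    return counts


# Made-up sample in the Windows `netstat -an` layout.
SAMPLE = """
Active Connections

  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:3050          10.0.0.9:1433          ESTABLISHED
  TCP    10.0.0.5:3051          10.0.0.9:1433          ESTABLISHED
  TCP    10.0.0.5:3052          10.0.0.9:1433          CLOSE_WAIT
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  UDP    0.0.0.0:445            *:*
"""

if __name__ == "__main__":
    for state, n in count_tcp_states(SAMPLE).most_common():
        print(state, n)
```

To run it against a live server, the text could come from `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`. Sampling the counts every minute or so would show whether connections to the database pile up in one state while processor usage climbs.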


  • Posted By
    Brad Wood | 10/23/07 12:30 PM
    Great posts. They remind me of some of the problems we have had with our remote offices who use a tunnel to access our intranet site.

    We will get random "server not found" errors in their browsers, and a handful of hung threads that appear to be waiting for the rest of the HTTP request with an open socket method.
    Running a ping will show occasional timed out packets.

    Of course, I am just a code monkey and the network guys tell me their network is fine and it's probably my server's fault so it's hard to prove anything.

    I posted to say this, though: SysInternals (now Microsoft) has a nice utility called Tcpview.exe which will show you the TCP connections that JRUN has open.

    This came in very handy when our cfdocuments would hang waiting for an image because the remote server would drop the conversation midstream (I captured it with Ethereal). There was NO WAY to kill those threads, but when I fired up Tcpview.exe, I could manually close the TCP connection, and CF would instantly send a second HTTP request for the image (which usually worked) and the thread would complete. Unfortunately, the images came from a 3rd party who refused to admit the possibility that their network was dropping any packets.

    It seems that network connectivity is just one of those things that no one suspects at first, but can really bite you.