Network and power cables were strewn everywhere. Laptops littered the large tabletop. A conference bridge was constantly open and provided access to outside individuals interested in the outcome. A team of 10 or 12 people surrounded the problem like a group of hyenas closing in on a hapless gazelle. It was part of a troubleshooting effort at a large client site this week. I sat in on the "war room" and offered help in the area of ColdFusion server configuration and optimization. The system in question was a complex amalgam encompassing a JMS system, an AS/400, CFMX 7, an FTP server, and multiple database servers. Folks from networking, database administration, server administration, development, customer service and business analysis were all in on the battle.
For those of you following this issue, I've had a breakthrough - an epiphany brought to me by the inestimable Cameron Childress. Cameron was musing over a post I made to a list we both monitor. In the post I laid out the 3 instances where network issues (particularly auto-sensing NICs or switches) had caused JRun to hang. I think he must be a better googler than I am because he came up with this link to an article on setting up a Win2003 server cluster that contains some excellent information regarding our issue. Among the instructions on setting up a multi-homed network for clustering was this item:
NOTE: This post is a follow-up to these 2 posts:
Queued Requests Hanging Coldfusion
Network Issues and Hanging Threads
The plot thickens. Those of you who read my previous post on this issue will be gratified to know that a second customer with a "hanging server" problem found relief by troubleshooting the network. The symptoms were slightly different. On this server JRun would peg the processor at 90% to 95% but the server would stay "up". This caused a "paucity of processor capacity" so that other things could bring the server down. We did not suspect networking because no log file errors indicated networking problems. There were no socket errors and no TCP/IP errors in the Windows event log. Whenever we tested the database we found connectivity was "up" and capacity utilization was low. After my experience with the previous customer, whose problem was solved by resetting the link speed (see my previous post for more detail), we decided to take a closer look at networking.
Let me set the scene. A client's server was set up with about 20 sites. One site in particular was quite busy. After what has been described as a "spontaneous reboot" the server began to have problems. It would stay up with all the sites enabled except for the one busy site. As soon as that site was enabled, running requests would climb slowly until they reached the simultaneous request threshold, then queued requests would climb until the server was unresponsive.
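The pattern described above (running requests filling up to the limit, then the queue growing) can be illustrated with a toy model. This is only a sketch under stated assumptions, not how JRun actually schedules threads: the arrival rate, service times, and limit of 10 are made-up numbers, and `simulate` is a hypothetical helper.

```python
def simulate(ticks, arrivals_per_tick, service_ticks, max_running):
    """Toy model of a request pool with a simultaneous-request limit.

    Returns a list of (tick, running, queued) tuples.
    All parameters are illustrative assumptions, not measured values.
    """
    finish_times = []  # tick at which each running request completes
    queued = 0
    history = []
    for tick in range(ticks):
        # Retire requests whose work has finished by this tick.
        finish_times = [t for t in finish_times if t > tick]
        # New requests arrive and wait in the queue.
        queued += arrivals_per_tick
        # Start as many queued requests as there are free slots.
        free_slots = max_running - len(finish_times)
        started = min(free_slots, queued)
        finish_times.extend([tick + service_ticks] * started)
        queued -= started
        history.append((tick, len(finish_times), queued))
    return history


# Healthy back end: each request finishes in 3 ticks.
healthy = simulate(ticks=20, arrivals_per_tick=2, service_ticks=3, max_running=10)

# Stalled back end: each request takes 50 ticks, so nothing completes in time.
stalled = simulate(ticks=20, arrivals_per_tick=2, service_ticks=50, max_running=10)

for tick, running, queued in stalled:
    print(f"tick {tick:2d}: running={running:2d} queued={queued:2d}")
```

With the same traffic, the healthy run settles below the limit with an empty queue, while the stalled run hits the limit and then queues every new request. That is why a slow external dependency can look like a server problem.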
NOTE: There is a follow-up to this post.
In a follow-up to my previous post on JRun Processor Pegging Issues and Solutions I was intrigued by an idea given to me by Steven Erat. The problem is that the "-err" and "-out" logs do not roll over like the event logs. This is a problem that is slated to be fixed in the next version. To "fix" this problem Steven thought it might be possible to move or delete the file using CFFILE without restarting CF. He reasoned that while Windows could not defeat the lock, perhaps CF could defeat it because it owned the lock to begin with. Here's what I found.
Lately I've been involved in a couple of troubleshooting sessions where JRun on a CFMX server was causing 95% to 100% processor utilization. Unfortunately I have not yet stumbled onto a magic bullet for this. Tweaking memory settings, changing garbage collection routines, and modifying the threads for the scheduler and the simultaneous requests all seem to help, and in some cases solve the problem. I have never found one single solution that solves this problem. It usually comes down to either JVM arguments or an external process (a database, queue, COM, FTP, etc.) that is causing a hanging request.
Today, however, I stumbled upon a solution that seemed to solve the problem immediately. If your processor spike is due to this specific issue then this seems to fix it. Keep in mind that I'm basing that opinion on the fact that taking the following steps seems to have fixed a production server in my care - so take it for what it's worth.
Every week I seem to find myself dealing with intractable bugs or performance issues for CF Webtools' customers. Last week, for example, I found myself troubleshooting a JVM for a CF 7 customer, a database performance issue, a JMS issue and a persistent memory leak in a COM object. That's a pretty typical week for me.
I like troubleshooting and debugging. I suppose it's the Sherlock Holmes in me that likes to pore over minute details looking for clues and possibilities. I think a good troubleshooter has that quality in his nature - the thirst for knowledge and the desire for intellectual growth. I would say that's one of my strengths. That is not to say you can't be a good troubleshooter without those qualities, but it helps if you really enjoy uncharted territory.
When creating complicated web applications you sometimes run into situations where long-running requests are necessary. Please don't email me with altruistic best practice jargon. I know they are not a good idea. My point is that they can be unavoidable in some cases. Consider file upload, for example. Suppose you need to allow users to upload more than 1 file in a single request (let's say 5) and suppose the files are typically .5 to 2 megabytes. Potentially this means an aggregate of up to 10 megabytes in a single request (5 files at 2 megabytes each).
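To make the upload arithmetic concrete, here is a small sketch that works through the figures above (5 files, each .5 to 2 megabytes). The 1 Mbit/s uplink is purely an assumption for illustration and does not come from the post, and the helper names are hypothetical.

```python
# Figures from the text: up to 5 files per request, each typically 0.5 to 2 MB.
FILES_PER_REQUEST = 5
MIN_FILE_MB = 0.5
MAX_FILE_MB = 2.0

# Assumption (not from the post): the client has a 1 Mbit/s uplink.
ASSUMED_UPLINK_MBIT_PER_S = 1.0


def aggregate_mb(per_file_mb, files=FILES_PER_REQUEST):
    """Total request payload in megabytes for `files` files of equal size."""
    return per_file_mb * files


def transfer_seconds(total_mb, uplink_mbit_per_s=ASSUMED_UPLINK_MBIT_PER_S):
    """Rough upload time: 1 megabyte = 8 megabits, protocol overhead ignored."""
    return total_mb * 8 / uplink_mbit_per_s


smallest = aggregate_mb(MIN_FILE_MB)
midpoint = aggregate_mb((MIN_FILE_MB + MAX_FILE_MB) / 2)
largest = aggregate_mb(MAX_FILE_MB)

for label, total in (("smallest", smallest), ("midpoint", midpoint), ("largest", largest)):
    print(f"{label}: {total} MB, roughly {transfer_seconds(total):.0f} s at the assumed uplink")
```

Even at the midpoint, a request like this ties up a thread for far longer than an ordinary page request would on a slow connection, which is what makes it a long-running request.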