Another hard nut to crack. Anytime you have "problem" and "intermittent" in the same sentence, you get the feeling there is going to be a lot of hide-and-seek! So here I was, lending a patient ear to a client's problem. They had 4 Citrix servers that they were using to host applications, and every now and then things would become very slow. Even logging in would hang, and then it would get so bad that the server would become unresponsive and the only way out was a reboot. They said they had checked everything (network, Citrix config, hotfixes, Windows patches, etc.) but did not have a clue. Now, for those new to Citrix XenApp, this is a thin client product used to provide connectivity to applications hosted on central servers instead of on each PC. It is built on top of Microsoft Terminal Server.
So, I did the usual checking: is there any particular time of day when issues pop up, any shift changes, are all patches, firmware updates, etc. up to snuff? Any patterns? Nothing... but wait. What I did find was that out of the 4 Citrix servers, only 2 had problems consistently. So, to stop the immediate bleeding, I asked them to take the 2 problem servers out of the load balancer rotation. Once this was done, all user connections went to the 2 healthy Citrix servers and the user complaints stopped.
Now we had to find the root cause and bring the 2 offlined Citrix servers back into rotation, since the client's monthly processing was fast approaching and required more horsepower. So, scratching my head, I started to look at the event logs and kept seeing intermittent timeouts for Terminal Services. These servers had all the hotfixes they needed, so I thought we would examine the network again, and lo... the 4 application servers were split across two different subnets with a firewall in between, the 2 problem servers on one subnet and the other 2 on the other. Now, this in and of itself does not explain why 2 servers were OK while 2 were bad. Then we looked at the domain controller, the file server for profiles, authentication, LDAP, etc. They were all on the same subnet as the 2 servers that did not have any issues, so those 2 servers had no firewall or multiple networks to traverse.
So we got our guy: the 2 poor Citrix servers that had issues were timing out at DNS while trying to reach the other subnet and the remote domain controller and file servers, which slowed down logins and everything else that depended on them. We moved all 4 application servers to the same subnet, removed the firewall between them, and put them close to the domain controllers and file servers, and things went back to being fine and dandy! I don't know why anyone decided to split the 4 application servers across multiple subnets and a firewall. Still scratching my head...!
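If I had to hunt for this kind of thing again, I would probably time the dependencies from each server before digging through anything else. Below is a minimal sketch of that idea in Python, not what we actually ran at the client. It times a DNS lookup and a TCP connect to each backend and flags the slow or failing ones. The function names, the 0.5 second threshold, and the 3 second timeout are all my own assumptions for illustration. The ports are the standard ones for LDAP (389) and SMB (445).

```python
import socket
import time


def time_dns_lookup(hostname, resolver=socket.getaddrinfo):
    """Time one name resolution. Returns (seconds_elapsed, succeeded)."""
    start = time.perf_counter()
    try:
        resolver(hostname, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.perf_counter() - start, ok


def time_tcp_connect(host, port, timeout=3.0):
    """Time one TCP connect. Returns (seconds_elapsed, succeeded)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:  # covers timeouts, refusals, and unreachable hosts
        ok = False
    return time.perf_counter() - start, ok


def slow_dependencies(results, threshold_s=0.5):
    """Given {name: (elapsed, ok)}, return the names that failed or were slow."""
    return sorted(
        name
        for name, (elapsed, ok) in results.items()
        if not ok or elapsed > threshold_s
    )


def check_dependencies(deps, threshold_s=0.5):
    """deps maps hostname -> TCP port. Returns the slow or failing checks."""
    results = {}
    for host, port in deps.items():
        results["dns:" + host] = time_dns_lookup(host)
        results["tcp:%s:%d" % (host, port)] = time_tcp_connect(host, port)
    return slow_dependencies(results, threshold_s)
```

You would call it with something like `check_dependencies({"dc01.example.local": 389, "fs01.example.local": 445})`, where both hostnames are made up. Run it on a good server and on a bad server and compare the two lists. If the bad server alone reports slow DNS or slow connects to the domain controller, the network path is a much better suspect than Citrix itself.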