Recently I was involved in a mysterious case of intermittent user disconnects from their application with either Oracle error 03113 (end-of-file on communication channel) or error 03114 (not connected to Oracle). So when I got involved the only unusual thing about their Oracle configuration was that they were using Oracle Multi-Threaded Server (MTS), this is a configuration that I see more scant these days, especially considering how cheap hardware has become. As any sane person would, we went through the routine checklist of checking Oracle AWR, server load, timeouts on the application, listener, dead connection detection, any profile setting limiting idle time, etc. The application server was being hosted on Windows in a Citrix environment. Time and again I would ask the network guys if they saw any dropped packets on the firewall or any heavy retransmits or loss. Given the intermittent nature of the issue, they would always check when there was no issue and would come up empty handed.
Finally I decided I would put a sniffer to capture TCP packets between the application server and the DB server. And what I found was so simple yet it took us a while to capture the culprit. On examining the sniffer capture I could see intermittent retransmission of packets from the application server to the DB server, and after 5 attempts the application server would reset the connection resulting in the user seeing ORA-03113 or ORA-03114. The 5 max attempts for retransmits was because the default Windows OS tries 5 times to retransmit a packet and then gives up resetting the connection. So as a bandaid I asked them to change the registry setting of maxtcpretransmissions to 10.
While the bandaid above continues to work, with no users complaining any more, the root cause was harder to explain. Then when I looked at their traceroutes from the application server to the DB server, I found it was taking 11 hops and spanning two different networks for communication between the two. Though these 11 hops don't look too many, I think in a heavy network chatty environment it is better to keep the design simple, and especially on a corporate intranet that is not exposed to the world, it is better always to have the application server and the db server on the subnet and have the least amount of redirection as far as communication goes. Lesson learnt, looks trivial now doesn't it?!
Comments