The Case of the Cisco ACE 4710 and the RST packet
Some of the most satisfying moments at the office come from solving problems I’ve never seen before and couldn’t find any information about on the internet. It’s especially satisfying when the problem involves equipment I’ve never used before.
The problem started out with the launch of one of our new websites. The site has an admin portal that was built to accommodate reporting and CMS pieces, and was nice enough to use domain authentication. All of that…is irrelevant.
The relevant part was that we launched the site, load balanced it, and thought everything was working just fine. Turns out, it wasn’t. Browsing anonymously had no issues, but once you attempted to log in, the site would greet you with a “This page has been reset” message. Was it code? Was it the webserver? Was it the load balancer?
We temporarily moved the site off the load balancer internally and began troubleshooting. When we went directly to the webserver, the problem did not appear, so we started looking at the ACE. The serverfarms were correct, the sticky policy was correct, the probes showed the site as up, the policy-map looked good, and all the other sites on the load balancer worked just fine. I was at a loss. I couldn’t find anything that would indicate a problem. What now?
I stand behind the statement that Wireshark is probably one of the most useful and versatile tools in the universe. It’s not just for looking at packets; it can help you figure out a plethora of problems that are seemingly unrelated to the traffic flow (I have another post regarding that in a bit, also involving load balancers, go figure). I started looking at packet captures between my machine and the site (going through the ACE), and I saw something strange. After the initial GET /, there was a SYN,RST packet coming back. Well, was it coming from the load balancer or the server? I fired up Wireshark on another machine at the datacenter to see if I could see that same packet on the other side of the ACE. I did not, which meant the reset was coming from the ACE itself and not from the webserver.
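If you want to check for resets in raw capture data without clicking through every packet, the TCP flags live in a single byte of the TCP header. The following is only a minimal sketch of decoding that byte in Python; the function names are made up for illustration, and it assumes you already have the bytes of the TCP header itself (not the full Ethernet frame).

```python
import struct
from typing import List

# Bit masks for the TCP flags byte (byte 13 of the TCP header).
TCP_FLAGS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]


def tcp_flag_names(tcp_header: bytes) -> List[str]:
    """Return the names of the flags set in a raw TCP header (hypothetical helper)."""
    if len(tcp_header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    flags = tcp_header[13]
    return [name for mask, name in TCP_FLAGS if flags & mask]


def is_reset(tcp_header: bytes) -> bool:
    """True if the RST flag is set."""
    return "RST" in tcp_flag_names(tcp_header)


# Build a fake 20-byte header with RST and ACK set, purely for demonstration:
# src port, dst port, seq, ack, data offset, flags, window, checksum, urgent.
example = struct.pack("!HHIIBBHHH", 80, 50000, 0, 0, 5 << 4, 0x14, 0, 0, 0)
print(tcp_flag_names(example), is_reset(example))
```

In practice Wireshark's own display filter for the reset flag does the same job; the sketch just shows what that filter is looking at.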
I started thinking about why the ACE would be terminating that connection. It seemed to make no sense. It worked SOME of the time, so why not all the time? All the other sites worked fine, whether you were logged in or anonymous. It seemingly wasn’t the server either. I dug deeper into the Wireshark capture and followed the TCP stream of the web request for easier viewing. It wasn’t clear right away, but for some reason I started looking closely at the data, and the header data seemed strange. After logging in, my local machine sent header data that was XXXX bytes in size (I don’t remember the number anymore). When it came back from the webserver/ACE, right before the RST packet, it was about 200 bytes smaller. The header coming back was shorter than the one I sent, and then the connection was reset. I looked at the value of the header and saw that I was sending a giant string of characters for the .NET viewstate and token, but on the way back it seemed to be cut off after YYYY bytes.
Turns out, the ACE has a default limit for HTTP header values. This particular site was throwing a lot of garbage in the headers and cookies for its own tracking and tokenization scheme, and it was bigger than the ACE’s default limit.
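To get a feel for whether a request is anywhere near a limit like this, you can measure the header block yourself. The sketch below is an illustration only: the function names are hypothetical, the 4096 and 8192 figures are just example limits passed in as parameters, and I am assuming the limit applies to everything from the request line through the blank line that ends the headers. How the ACE counts internally may differ.

```python
def header_block_size(raw_request: bytes) -> int:
    """Bytes from the start of the request through the blank line ending the headers."""
    end = raw_request.find(b"\r\n\r\n")
    if end == -1:
        raise ValueError("no end-of-headers marker found")
    return end + 4  # include the terminating CRLF CRLF


def exceeds_parse_limit(raw_request: bytes, limit_bytes: int) -> bool:
    """True if the header block is larger than the given limit (assumed semantics)."""
    return header_block_size(raw_request) > limit_bytes


# A made-up request with an oversized cookie, similar in spirit to the site's tokens.
request = (
    b"GET / HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Cookie: token=" + b"A" * 5000 + b"\r\n"
    b"\r\n"
)

print(header_block_size(request))
print(exceeds_parse_limit(request, 4096))  # example limit, an assumption
print(exceeds_parse_limit(request, 8192))  # the value used in the fix
```

Pasting the request bytes from a "Follow TCP Stream" view into something like this would have flagged the oversized headers quickly.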
Quick fix: in the HTTP parameter map applied to the site, set header-maxparse-length 8192 … slam bam, it worked.
Unfortunately, I have since moved jobs, and forgot that I started this draft almost a year ago, so I no longer have the Wireshark captures that I used to fix this.
