Thinking outside the box - Filenames and Web Optimization
Web developers and content management systems that use long file and path names are overlooking the obvious: they are wasting bandwidth. While long names are handy for identifying and managing objects, they increase the length of HTTP requests and force extra processing on Application Delivery Controllers. We look at simple yet creative ways to reduce bandwidth costs for high-traffic sites.
by
John Buswell
There are many techniques a web developer can use to improve the performance of a high-traffic web site. One that is often overlooked, despite being blatantly obvious, is the length of the directory and file names used to create URLs. This article addresses the problem of descriptive yet wasteful path names, points out the pitfalls and demonstrates some solutions.
Basics: The HTTP GET Request
When a client visits a web site, regardless of whether they click on a link or type the URL directly into their browser, they generate what is called a GET request.
The initial GET request pulls down the HTML file, which is then processed, and the objects it references, such as CSS, JavaScript or images, generate new HTTP GET requests. The longer the path to those objects, the larger each HTTP GET request will be. Using telnet we can simulate an HTTP GET request. Here is an example against o3magazine.com.
- telnet www.o3magazine.com 80
- Trying 38.106.106.237...
- Connected to www.o3magazine.com (38.106.106.237).
- Escape character is '^]'.
- GET /index.html HTTP/1.1
- host: www.o3magazine.com
-
- HTTP/1.1 200 OK
- Server: nginx/0.6.35
- Date: Fri, 27 Mar 2009 00:30:37 GMT
- Content-Type: text/html
- Content-Length: 15954
- Last-Modified: Fri, 27 March 2009 00:30:06 GMT
- Connection: keep-alive
- Accept-Ranges: bytes
-
- .... HTML document is returned ....
So the request line of a GET request has the form GET /URL HTTP/1.1. For o3magazine, the next GET request would be for /c/0.css. o3magazine has already optimized its filenames, so it produces a smaller GET request. Compare that to, say, techcrunch.com, where the next GET request would be for /wp-content/themes/techcrunchmu/style.1238108540.css. Assuming UTF-8 encoding and standard ASCII characters, each character is represented by one byte. The shorter o3magazine request uses four bytes for GET and the space that follows it, eight bytes for /c/0.css and nine bytes for the trailing space and HTTP/1.1. This is a grand total of 21 bytes. The longer techcrunch request shares the GET and the trailer, so that's thirteen bytes, and then it has 52 bytes for its path. This is a grand total of 65 bytes, over three times the size of the shorter request.
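The byte counts above can be checked with a few lines of Python. This is a minimal sketch that measures only the request line, as the text does; the function name request_line_bytes is our own, and the trailing CRLF and the other request headers are deliberately left out.

```python
def request_line_bytes(path: str, method: str = "GET", version: str = "HTTP/1.1") -> int:
    """Return the size in bytes of 'METHOD path VERSION', excluding the CRLF."""
    return len(f"{method} {path} {version}".encode("utf-8"))


short = request_line_bytes("/c/0.css")
long_ = request_line_bytes("/wp-content/themes/techcrunchmu/style.1238108540.css")

print(short, long_, long_ - short)  # 21 65 44
```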
But it's just a few bytes
So by now you are wondering why we care about this 44 byte difference. Going through the rest of the techcrunch source file, you see that similarly long requests are made all over the place thanks to the way WordPress is designed. There are roughly 57 such requests on the techcrunch page, from JavaScript to images. Most of the URL paths are actually longer than the CSS one, but for argument's sake, let's average it down to just 44 extra bytes per request. So on a single page load, we have 44 bytes x 57 requests for a total of 2508 bytes.
The 2508 extra bytes counted so far are just on the GET requests from the client to the server. The server has also sent those same 2508 bytes of information that we really don't need in the initial HTTP response, since the HTML has to contain each of those paths. Looking at the source file even closer, we can make the same assumption about the HREF attributes pointing to other links. The techcrunch page has around 182 HREF references. Assuming our 44 byte difference on the longer URLs, which is being generous since WordPress URLs are basically sentences, that gives us another 8,008 bytes of waste. So we have a total of about 10.5k of waste downstream (2,508 + 8,008 bytes) and 2.5k of waste upstream.
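The per-page arithmetic can be written down as a short Python sketch. The figures are the ones quoted in the text; the constant and function names are illustrative only.

```python
# Figures quoted in the article; the names are illustrative.
EXTRA_BYTES_PER_URL = 44  # assumed average saving per shortened URL
OBJECT_REQUESTS = 57      # CSS, JavaScript and image references on the page
HREF_REFERENCES = 182     # links pointing to other pages


def page_waste(extra_bytes: int, object_requests: int, href_references: int):
    """Return (upstream, downstream) wasted bytes for one page load."""
    # Upstream: every referenced object costs one longer GET request.
    upstream = extra_bytes * object_requests
    # Downstream: the HTML carries the object paths plus the HREF paths.
    downstream = upstream + extra_bytes * href_references
    return upstream, downstream


upstream, downstream = page_waste(EXTRA_BYTES_PER_URL, OBJECT_REQUESTS, HREF_REFERENCES)
print(upstream, downstream)  # 2508 10516
```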
Factoring in the masses
Using data from compete.com, techcrunch gets at least 7,650,594 visits a month. This is roughly 246,793 per day. So in a single day, techcrunch has wasted 2,531MB downstream and 603MB upstream. Over the space of a month, techcrunch has wasted 78,461MB downstream and 18,693MB upstream in unnecessary data transfer. That is approximately 232k/sec of sustained bandwidth. No big deal, right? It's just 232k/sec.
Perhaps a valid point, until the concept is applied to the Facebook home.php page, which is happy to use URLs of 100+ characters. There are roughly 150 source file references on this page and, rounding down for argument's sake, about 100 HREF references. Being generous, assume 80 bytes of waste per URL. That's 12000 bytes of upstream waste and 20000 bytes of downstream waste (the 150 source references plus the 100 HREFs). Using data again from compete.com, facebook has 1,273,004,274 visits per month. This is roughly 41,064,654 visits per day. So in a single day, the folks over at facebook have wasted roughly 783GB downstream and 469GB upstream. This works out to 74Mbit/sec downstream and 44Mbit/sec upstream of bandwidth.
The Questionable Math
To calculate the bandwidth utilization, we took the visits per month (1,273,004,274) and divided by 31, giving us 41,064,654 visits per day. We then multiplied that by 20 to give us the downstream waste in kilobytes per day, based on 20k of waste per visit. This gave us 821293080, which we then divided by 86400, the number of seconds in a day. This gives us 9505 kilobytes per second, but we want it in kilobits, so we multiply by 8, giving us 76040. Finally, we divide that by 1024 to get the value in Mbit/sec, giving us 74Mbit/sec. One caveat with these calculations is that we do not factor in gzip compression. With gzip compression, we could safely reduce the bandwidth-wasting figures by about 50%. Browser caching does not affect the downstream values, as we are calculating the waste just on the HTML file. It could affect the upstream usage, as not all objects may be requested with every HTML request.
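The same steps can be expressed as a small Python function. This is a sketch that follows the article's method exactly, including its truncation at each step and its use of 1024 as the kilobit-to-megabit divisor; the function name is our own.

```python
SECONDS_PER_DAY = 86_400


def wasted_mbit_per_sec(visits_per_month: int, waste_kb_per_visit: int, days: int = 31) -> int:
    """Sustained wasted bandwidth in Mbit/sec, truncating at each step."""
    visits_per_day = visits_per_month // days
    kb_per_day = visits_per_day * waste_kb_per_visit
    kb_per_sec = kb_per_day // SECONDS_PER_DAY
    kbit_per_sec = kb_per_sec * 8
    return kbit_per_sec // 1024


print(wasted_mbit_per_sec(1_273_004_274, 20))  # 74 (downstream)
print(wasted_mbit_per_sec(1_273_004_274, 12))  # 44 (upstream)
```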
CSS Identifiers
A similar approach to optimization can be taken with CSS identifiers. These are the arbitrary names given to classes and IDs on div blocks and other elements to denote styles. Most web developers, especially the folks over at facebook, like to make these very descriptive, such as sound_player_holder, presence_preload or presence_menu_opts_wrapper. While these are easy to identify, they are a terrible waste of bandwidth. The last of these could be shortened to, say, pr_m_op_w and still be relatively easy to figure out. A simple script could be used to minify these identifiers, leaving the developers with their descriptive names while the production site runs on shorter, optimized names.
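As a rough illustration of such a script, the Python sketch below assigns each descriptive identifier a short sequential name (a, b, c, ...) and substitutes it wherever the whole identifier appears. The sequential naming scheme and the function names are our assumptions, not something the sites mentioned use, and the plain text substitution is naive: a production tool would parse the CSS and HTML so that it never touches visible page text.

```python
import itertools
import re
import string


def short_names():
    """Yield a, b, ..., z, aa, ab, ... as replacement identifiers."""
    for size in itertools.count(1):
        for letters in itertools.product(string.ascii_lowercase, repeat=size):
            yield "".join(letters)


def build_map(identifiers):
    """Assign each descriptive identifier a short production name."""
    return dict(zip(sorted(set(identifiers)), short_names()))


def minify(text, mapping):
    """Replace whole identifiers only, never parts of longer names."""
    if not mapping:
        return text
    # Longest names first so a longer identifier wins over its prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


mapping = build_map(
    ["sound_player_holder", "presence_preload", "presence_menu_opts_wrapper"]
)
css = ".sound_player_holder { float: left } #presence_preload { display: none }"
print(minify(css, mapping))  # .c { float: left } #b { display: none }
```

The same mapping has to be applied to the stylesheets, the HTML and any JavaScript that refers to the identifiers, otherwise the names no longer match.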
Bandwidth Savings
So far we have shown that descriptive but excessively long URL paths can waste a considerable amount of bandwidth, and that applying some common sense in this area can yield dramatic cost savings. As service providers look to cut costs, the bandwidth savings from smaller GET requests alone could save them some money, especially if such optimizations were applied to many different high-volume sites. However, there are more advantages than just saving bandwidth.
Mobile Users
Shorter URLs and shorter CSS identifiers lead to friendlier content for mobile-device users. Take a device such as an iPhone running on AT&T's EDGE network, which typically gets between 75 and 135Kbps: reducing the size of the URLs means faster content and happier users. Reducing the size of the GET requests, as well as the amount of processing the mobile device has to do in parsing the HTML, will improve the overall experience for the user and reduce their data usage. Everyone wins.
Layer 7 Processing
Many providers and content delivery networks deploy devices sometimes called Layer 7 switches or Application Delivery Controllers. These are typically embedded network switches or appliances that perform what is known as deep packet inspection. They look inside the content and the requests for content in order to perform additional processing and traffic management. There are dozens of different features, but popular ones include URL load balancing, URL rewriting, cookie processing and HTTP header processing. These all typically look at the HTTP request and make some kind of routing decision based on its content. For example, URL load balancing may look for an image file extension in the request, such as .jpg, and forward it to a different set of servers than those that process .php or .html.
These devices typically have to buffer the HTTP client request. This requires storing the HTTP request packet in memory, performing processing on it, perhaps altering the request, and then forwarding it on to the server. The longer the URL or cookie contained within the request, the more memory and processor resources it will consume on one of these devices. So while the device is capable of processing the packet and handling the request, the situation is far from optimal. In fact, by using very long URLs, the web developer is inadvertently putting undue load on the system. The shorter the URL, the less buffer space it will take up, and the faster the device can process it. This results in lower latency, increased capacity and lower load on such systems. It just makes common sense to use shorter URLs.
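To make the idea of URL load balancing concrete, here is a minimal Python sketch of the decision such a device makes. It is not how any particular product is implemented; the pool names "static" and "app" and the function name are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical mapping of file extensions to server pools.
POOLS = {
    ".jpg": "static",
    ".png": "static",
    ".gif": "static",
    ".php": "app",
    ".html": "app",
}


def choose_pool(request_line: str, default: str = "app") -> str:
    """Pick a server pool from the file extension in the request line."""
    _method, target, _version = request_line.split(" ", 2)
    path = urlsplit(target).path  # drop any query string
    extension = posixpath.splitext(path)[1].lower()
    return POOLS.get(extension, default)


print(choose_pool("GET /i/0.jpg HTTP/1.1"))         # static
print(choose_pool("GET /index.php?id=3 HTTP/1.1"))  # app
```

Even in this toy version, the whole request line has to be held in memory and scanned before a decision can be made, so a longer URL means more work per request.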
CMS Developers
Developers of CMS, blogging and other popular web-based applications should take this kind of optimization to heart. It is a relatively simple design change that they can implement within their web application, enabling their end users to reap the benefits of shorter URLs and pathnames. The best solution would be one that allows developers to use descriptive and easy-to-follow names, while the published site uses short and highly optimized URLs. Combined with JavaScript and CSS minification (stripping unnecessary characters and removing duplication from such files) and gzip compression, you would have a highly optimized solution.
Web Developers
While CMS developers such as WordPress can take the blame for some things, like Techcrunch's numbers, web developers should also take these techniques to heart when developing custom applications. While a low-bandwidth site probably wouldn't benefit from short URLs, any site that has the potential to become a high-traffic site or a widely used web application would benefit considerably. It is only responsible to invest some time and effort into using optimally sized URLs. For example, by shortening /images to /i, /javascript to /j and /css to /c, you would save a few bytes everywhere. Using numbers as filenames for images, JavaScript and CSS files, and documenting in a separate file what they are for, could also save some bandwidth. For example, the company logo could be stored as 0.jpg instead of logo.jpg, with a note indicating what it is.
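One way to keep descriptive names during development and still publish short ones is to generate the mapping at build time. The Python sketch below is an assumption about how that could look: it shortens the directory names as suggested above, numbers the files within each directory, and returns a manifest that doubles as the separate documentation file. The function name and the input paths are illustrative.

```python
import posixpath

# Directory shortenings suggested in the text.
DIRECTORY_ALIASES = {"images": "i", "javascript": "j", "css": "c"}


def build_manifest(paths):
    """Map each descriptive path to a short numbered path."""
    manifest = {}
    counters = {}
    for path in paths:
        directory, filename = posixpath.split(path.lstrip("/"))
        extension = posixpath.splitext(filename)[1]
        short_dir = DIRECTORY_ALIASES.get(directory, directory)
        index = counters.get(short_dir, 0)  # next free number in this directory
        counters[short_dir] = index + 1
        manifest[path] = f"/{short_dir}/{index}{extension}"
    return manifest


manifest = build_manifest(["/images/logo.jpg", "/css/main.css", "/images/banner.png"])
for descriptive, short in manifest.items():
    print(short, "<-", descriptive)
```

Saved alongside the site, the manifest answers the question "what is /i/0.jpg?" without the descriptive name ever travelling over the wire.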
URL Rewrite
Changing back ends, especially third-party back ends, can be a time-consuming effort and is often difficult to support. So instead of waiting for the web application provider to fix their application, one possible workaround is to use the URL rewrite capabilities of the web server to publish short URLs and map them to the longer ones used in the CMS or other application. This is obviously not an ideal solution, but it is one possible stop-gap measure for using shorter URLs without having to rewrite the back end. It will, however, still require some additional processing on the server side.
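Web servers normally provide this through their own rewrite configuration. To show the principle in the same language as the other examples, the sketch below does the equivalent as a small Python WSGI middleware instead; the class name, the echo application and the single rewrite entry are illustrative assumptions.

```python
class ShortUrlMiddleware:
    """Expand short public paths to the long paths the back end expects."""

    def __init__(self, app, rewrites):
        self.app = app
        self.rewrites = rewrites  # short path -> long path

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        # Unknown paths pass through untouched.
        environ["PATH_INFO"] = self.rewrites.get(path, path)
        return self.app(environ, start_response)


def echo_app(environ, start_response):
    """Stand-in back end that reports the path it was asked for."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["PATH_INFO"].encode("utf-8")]


REWRITES = {"/c/0.css": "/wp-content/themes/techcrunchmu/style.1238108540.css"}
app = ShortUrlMiddleware(echo_app, REWRITES)

body = app({"PATH_INFO": "/c/0.css"}, lambda status, headers: None)
print(body[0].decode("utf-8"))
```

The client only ever sends the short path, while the unmodified back end still sees the long one.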
Proxies
Many content providers place reverse proxies in front of web server farms to perform what is known as Application Acceleration. This is basically a fancy marketing term for using a reverse proxy so that not all requests are passed back to the server farm. It should be possible with most modern reverse proxies to perform the URL rewrite function there. It would require the proxy to be capable of altering the HTML content that is passed back to the user, replacing long URLs with short ones, and of mapping each short URL back to its real value when passing requests to the server. It could, however, be an easier option than changing massive web applications.
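The two halves of that job can be sketched in Python as follows. This is only an outline of the logic under simple assumptions: it rewrites quoted href and src attributes with a regular expression, whereas a real proxy would need a proper HTML parser and would also have to handle URLs inside CSS and JavaScript. The function names and the mapping are hypothetical.

```python
import re

# Matches href="..." or src='...' and captures the attribute, quote and URL.
ATTRIBUTE = re.compile(r"""\b(href|src)=(["'])(.*?)\2""")


def shorten_links(html: str, long_to_short: dict) -> str:
    """Outbound: replace known long URLs in the HTML with short ones."""
    def swap(match):
        attribute, quote, url = match.groups()
        return f"{attribute}={quote}{long_to_short.get(url, url)}{quote}"
    return ATTRIBUTE.sub(swap, html)


def restore_path(path: str, long_to_short: dict) -> str:
    """Inbound: map a short path back to the path the server expects."""
    short_to_long = {short: long_ for long_, short in long_to_short.items()}
    return short_to_long.get(path, path)


LONG_TO_SHORT = {"/wp-content/themes/techcrunchmu/style.1238108540.css": "/c/0.css"}
page = '<link rel="stylesheet" href="/wp-content/themes/techcrunchmu/style.1238108540.css">'

print(shorten_links(page, LONG_TO_SHORT))       # <link rel="stylesheet" href="/c/0.css">
print(restore_path("/c/0.css", LONG_TO_SHORT))
```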
Conclusion
For high-bandwidth web sites, the highly descriptive URLs that make development a lot easier are not necessarily cost-effective, nor are they required. As shown with the facebook example, there are considerable savings to be made just by reducing the size of the URLs. While not essential, it is one extra tool in the effort to cut costs and produce more efficient websites.