Notes

Understanding webserver logs

Filed under Programming, Python, Tools, analytics.

Series table of contents:

  1. Understanding webserver logs
  2. Visits, PageViews and Links

With the ever growing preponderance of the online world over “First Life”, especially in the realm of business, quantifying growth and identifying trends in the way an online service is used becomes increasingly important. This post is the first of a (probably) long series of posts about the techniques and principles underlying the field of web analytics. Throughout this series, we will develop the conceptual and software tools that are necessary to gather and process the wealth of information that is available. I will mostly use Python, but other languages might appear whenever they seem useful.

Every time you click on a link or type an address in your browser’s address bar, you are sending a request to a webserver. After receiving your request, the server will send you the appropriate content or warn you that an error has occurred. In either case, the server will log your request along with various other bits of information regarding the origin of the request, the time it was made, and whether the request was successful or not.

Unfortunately, each web server package has its own format for these logs, which complicates the task of mining the data they contain in order to obtain an accurate view of the traffic being generated by your content. Our first step is then to define the data that is accessible to us, and to create a layer that abstracts away the details of each specific web log format. This isn’t as formidable a task as it might seem, since a server has only a limited amount of information about each request and all log formats contain pretty much the same data. I will restrict myself to looking at the format of logs generated by “Apache” and “IIS” (they were the only two I could easily get my hands on, please send me other formats if you have them), which according to Netcraft’s Web Server Survey were together running on 87.49% of all web servers in May 2007.

Log Formats

Apache is by far the server with the largest market share, at 56%, and it enjoys large community support. Apache keeps several different types of logs, but I am only interested in the “access_log” file, usually stored in “/var/log/httpd/” on RedHat based systems, which records all the requests made to the server. If you take a look inside one of these files (there might be several), you will see something like this:

127.0.0.121 - - [03/Jun/2007:04:04:46 -0400] "GET /online/latex/ HTTP/1.1" 200 6556 "http://www.bgoncalves.com/notes/2007/04/20/professional-looking-equations-by-rendering-latex-online/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"

This example shows us that Apache logs answer the fundamental questions that everybody who studies web analytics is interested in:

  • What content was accessed - Requested page /online/latex/
  • On which platform - Maxthon running on Windows NT 5.1 (XP)
  • How the user found it - By following a link in http://www.bgoncalves.com/notes/2007/04/20/professional-looking-equations-by-rendering-latex-online/

These three pieces of information are invaluable to a responsible web master who wants to know more about the visitors to his/her web site. Apache also gives us other tidbits of information whose relevance is less obvious, but which we will later come to see as important in some cases as well: who the user was (through the IP address), when the request was made, the size of the response (6556 bytes) and the HTTP status code (200) generated by the request.
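
To make the field layout concrete, here is a minimal sketch of how the sample line above could be split into its parts with a regular expression. It assumes the log uses the layout shown in the example; the function name parse_apache_line and the dictionary keys are hypothetical names chosen for illustration, not the ApacheRecord class mentioned later in this post.

```python
import re
from datetime import datetime

# Pattern for the layout of the sample line (an assumption: other Apache
# configurations may log different fields or a different order).
COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_apache_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamps look like 03/Jun/2007:04:04:46 -0400
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    # Apache writes "-" when no bytes were sent
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


sample = (
    '127.0.0.121 - - [03/Jun/2007:04:04:46 -0400] '
    '"GET /online/latex/ HTTP/1.1" 200 6556 '
    '"http://www.bgoncalves.com/notes/2007/04/20/'
    'professional-looking-equations-by-rendering-latex-online/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; Maxthon)"'
)
record = parse_apache_line(sample)
print(record["path"], record["status"], record["size"])
```

Returning None for lines that do not match keeps the sketch tolerant of truncated or unexpected lines, which real log files tend to contain.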

 

On the other hand, Microsoft’s IIS has a market share of 31.49% and keeps its log files (named ex*.log) in c:\winnt\system32\LogFiles\W3SVC1 by default. The same request as above, if made to an IIS server, would result in:

2007-06-03 04:40:46 W3SVC591358343 STORMSERVER 127.0.0.1 GET /online/latex/ - 80 - 127.0.0.121 HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Maxthon) - http://www.bgoncalves.com/notes/2007/04/20/professional-looking-equations-by-rendering-latex-online/ www.bgoncalves.com 200 0 0 6556 313 234

Clearly, the same basic information is available, although in a slightly different format, along with other information that might be relevant for debugging and other more specific analyses. In this first approach I will concentrate mainly on the information that is common to both log formats. Two simple Python classes (aptly named ApacheRecord and IISRecord) that extract the relevant information from a line read from one of these log file formats are provided in the attached file. In the next parts of the series I’ll use these classes as building blocks for a basic and easily customizable web analytics package.
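
As an illustration of how the IIS line can be reduced to the fields it shares with Apache, here is a minimal sketch. It is not the IISRecord class from the attached file. The column names in IIS_FIELDS are my assumption, read off the 22 values of the sample line; a real log file normally declares its own column order in a “#Fields:” header line, which should take precedence. The function name parse_iis_line and the output keys are hypothetical.

```python
from datetime import datetime

# Assumed column order for the sample line above. A real file should be
# parsed using the order given in its own "#Fields:" header.
IIS_FIELDS = (
    "date time s-sitename s-computername s-ip cs-method cs-uri-stem "
    "cs-uri-query s-port cs-username c-ip cs-version cs(User-Agent) "
    "cs(Cookie) cs(Referer) cs-host sc-status sc-substatus "
    "sc-win32-status sc-bytes cs-bytes time-taken"
).split()


def parse_iis_line(line, fields=IIS_FIELDS):
    """Return a dict of the common fields, or None for headers and bad lines."""
    if line.startswith("#"):
        return None  # directive lines such as "#Fields: ..."
    values = line.split()
    if len(values) != len(fields):
        return None
    raw = dict(zip(fields, values))
    return {
        "ip": raw["c-ip"],
        "time": datetime.strptime(
            raw["date"] + " " + raw["time"], "%Y-%m-%d %H:%M:%S"
        ),
        "method": raw["cs-method"],
        "path": raw["cs-uri-stem"],
        "protocol": raw["cs-version"],
        "status": int(raw["sc-status"]),
        "size": int(raw["sc-bytes"]),
        "referrer": raw["cs(Referer)"],
        # Spaces appear as "+" in the sample; undoing that is lossy if the
        # original User Agent contained a literal "+".
        "agent": raw["cs(User-Agent)"].replace("+", " "),
    }


sample = (
    "2007-06-03 04:40:46 W3SVC591358343 STORMSERVER 127.0.0.1 GET "
    "/online/latex/ - 80 - 127.0.0.121 HTTP/1.1 "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Maxthon) - "
    "http://www.bgoncalves.com/notes/2007/04/20/"
    "professional-looking-equations-by-rendering-latex-online/ "
    "www.bgoncalves.com 200 0 0 6556 313 234"
)
record = parse_iis_line(sample)
print(record["path"], record["status"], record["size"])
```

Because every value in this format is separated by whitespace, a plain split() is enough here and no regular expression is needed.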

Difficulties

Alas, things aren’t always the way one would have them, and on many occasions the information that is available to us is limited. In particular, the User Agent field (Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+Maxthon)) is reported directly by the user’s browser and can be easily manipulated. In fact, one of the much touted features of the “Opera” web browser is the ease with which this can be done. It is also possible for users to instruct their browsers not to report the referrer information (the page they came from) for privacy reasons. Although one might understand these precautions in a world where crimes like identity theft and computer hijacking are increasingly common, one must lament the difficulties they pose to someone interested in optimizing their website according to their users’ characteristics.
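
In practice this means that any code consuming these fields has to treat them as optional. The sketch below shows one way to do that for the referrer, assuming (as in the sample lines above) that a missing value is logged as “-”. The function names referrer_host and missing_referrer_share are hypothetical.

```python
from urllib.parse import urlsplit


def referrer_host(referrer):
    """Return the host of a referrer URL, or None if none was reported."""
    if not referrer or referrer == "-":
        return None
    # Values that are not proper URLs also end up as None here.
    return urlsplit(referrer).hostname


def missing_referrer_share(referrers):
    """Fraction of requests for which no usable referrer was logged."""
    referrers = list(referrers)
    if not referrers:
        return 0.0
    missing = sum(1 for r in referrers if referrer_host(r) is None)
    return missing / len(referrers)


print(referrer_host("http://www.bgoncalves.com/notes/"))
print(missing_referrer_share(["-", "http://www.bgoncalves.com/notes/"]))
```

Keeping track of the share of requests without a referrer gives a rough idea of how much of the traffic any later referrer statistics can actually describe.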

The final difficulty, and by far the one that torments most webmasters, is the lack of access to the actual web logs. There are several ways of overcoming this problem, all of which involve modifying the web pages you want to track:

  • Add a link to a small image located on another server. This allows you (assuming you have access to the logs of the third party server) to know how many times your website was accessed and by whom. It does, however, lose the original referrer, since the request for the image reports your own page as its referrer.
  • Modify the PHP, ASP, etc. code that generates your dynamic pages in order to extract the information you are interested in, and convey it to you in some useful format.
  • Use JavaScript (which runs directly in the user’s browser and not on the server) to obtain the information you want and relay it back to you. This method, due to its power and flexibility, is used by high-end analytics packages such as Google Analytics.
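
To give an idea of what the second option looks like, here is a minimal sketch in Python rather than PHP or ASP. It assumes your dynamic pages are served by a WSGI application, which is my assumption and not something the options above require; the name logging_middleware and the record keys are hypothetical.

```python
def logging_middleware(app, records):
    """Wrap a WSGI application and append one dict per request to records."""

    def wrapped(environ, start_response):
        records.append({
            "ip": environ.get("REMOTE_ADDR"),
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            # Both headers are optional and are supplied by the browser
            "referrer": environ.get("HTTP_REFERER"),
            "agent": environ.get("HTTP_USER_AGENT"),
        })
        # Hand the request over to the real application unchanged
        return app(environ, start_response)

    return wrapped


def hello_app(environ, start_response):
    """A stand-in for the code that generates your dynamic pages."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


records = []
tracked_app = logging_middleware(hello_app, records)
```

Here the records simply accumulate in a list; a real setup would write them to a file or a database, which is left out to keep the sketch short.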

You can, of course, use any of these methods to obtain the relevant data, instead of looking directly at the server logs as suggested above. For the remainder of this series, I will assume that you have solved this problem and have a way of getting that information into a Python data structure with an interface similar to ApacheRecord and IISRecord.

I hope this has given you a little taste of the information that is available to you as a web master and of the difficulties it poses. Next time we will look in more detail into the various data fields and generate our first statistics. In the meantime, please leave any comments, questions or suggestions below.


© Copyright 2004 Bruno Goncalves - All rights reserved
