Visits, PageViews and Links

Filed under Programming, Python, Tools, analytics.

Series table of contents:

  1. Understanding webserver logs
  2. Visits, PageViews and Links

“How many people are visiting my site?” and “How many people are linking to it?” are possibly the two most basic questions webmasters (and bloggers in particular) want to answer. The first defines the audience you are able to reach; the second indicates how relevant your content is to other people and has important consequences for how high you rank in Google's search results. In this second post of the web analytics series, we will see how to use the web log abstraction we created last time to answer these questions quickly with a simple Python script.

Last time we defined two data structures corresponding to two different formats of web server logs. Our script expects the user to tell it which format it will be looking at using a command line argument, so it can load the appropriate definition. Since both types of record define similar fields, we can simply do:

import sys

# Pick the record class that matches the log format named on the command line
if sys.argv[1]=="IIS":
	from Records import IISRecord as Record
elif sys.argv[1]=="Apache":
	from Records import ApacheRecord as Record
else:
	# Unknown format: stop with a non-zero exit status
	sys.exit(3)

After this, “Record” refers to the appropriate type for the log file you are using.
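
The same choice can also be written as a dictionary lookup, which keeps the list of supported formats in one place. This is only a sketch: the two classes below are empty stand-ins for the real ones in the Records module, and FORMATS and pick_record are names I made up for illustration.

```python
# Stand-in classes: the real IISRecord and ApacheRecord live in the Records
# module from the previous post and are not reproduced here.
class IISRecord(object):
    pass


class ApacheRecord(object):
    pass


# Hypothetical lookup table from the command line name to the record class
FORMATS = {"IIS": IISRecord, "Apache": ApacheRecord}


def pick_record(name):
    """Return the record class for name, or None if the format is unknown."""
    return FORMATS.get(name)


print(pick_record("Apache") is ApacheRecord)  # True
print(pick_record("nginx"))                   # None
```

A caller would still need to exit (or print a usage message) when pick_record returns None, just as the if/elif version does.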

Dictionaries galore

Before we can proceed in analyzing our data set, we need to agree on how to represent the relevant information. I will be using dictionaries to keep track of all the counters, so I start by defining a couple of useful functions to operate on them. Each dictionary will associate a key with the number of times we have already encountered that key. This lets us keep lists of everything we are interested in, using as little memory as possible.

def incrementDict(dic,value):
	# Add one to the counter for value, creating it the first time we see it
	if value in dic:
		dic[value]+=1
	else:
		dic[value]=1

The first function is incrementDict, which simply increments the counter associated with the supplied key. If the key isn’t in the dictionary yet, it is added with a count of “1”. In the spirit of keeping the code as simple as possible, I don’t explicitly check that the argument supplied is indeed of the type we expect. In actual production code (which this isn’t), you should use “isinstance(dic,dict)” or “type(dic)==dict” to verify this.
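
For illustration, here is roughly what a version with the type check could look like. The name increment_checked and the sample IP addresses are made up for this sketch; it is not part of the downloadable script.

```python
def increment_checked(dic, value):
    # Hypothetical variant of incrementDict that rejects non-dict arguments
    if not isinstance(dic, dict):
        raise TypeError("expected a dict, got %s" % type(dic).__name__)
    # dict.get returns 0 for a key we have not seen yet
    dic[value] = dic.get(value, 0) + 1


counts = {}
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    increment_checked(counts, ip)

print(counts["10.0.0.1"])  # 2
print(counts["10.0.0.2"])  # 1
```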

def total(dic):
	# Add up the counters of every key in the dictionary
	count=0

	for key in dic:
		count+=dic[key]

	return count

We will also be interested in calculating totals, such as the total number of visitors or the total number of links. We can do this using the total function, defined above, to sum all counters.
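
As an aside, the same total can be obtained with Python's built-in sum over the dictionary's values. The function name total_with_sum and the sample counters below are made up for this sketch.

```python
def total_with_sum(dic):
    # Same result as the explicit loop: add up every counter
    return sum(dic.values())


sample = {"10.0.0.1": 3, "10.0.0.2": 1}
print(total_with_sum(sample))  # 4 requests in total
print(len(sample))             # 2 distinct keys
```

Note the difference between the two numbers: the length of the dictionary counts distinct keys, while the total counts every occurrence.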

Finally, the complete list of elements in the dictionary, along with the counter for each, can be printed using a simple loop over all the keys:

def outputDict(dic):
	# Print one "key count" pair per line
	for key in dic:
		print("%s %s" % (key,dic[key]))
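
The loop above prints the keys in whatever order the dictionary yields them. If you would rather see the largest counters first, something along these lines should work. The helper sorted_items and the sample data are hypothetical and not part of the original script.

```python
def sorted_items(dic):
    # Hypothetical helper: (key, count) pairs, largest count first,
    # with ties broken alphabetically by key
    return sorted(dic.items(), key=lambda item: (-item[1], item[0]))


sample = {"a.html": 2, "b.html": 5, "c.html": 2}
for key, count in sorted_items(sample):
    print("%s %d" % (key, count))
# b.html 5
# a.html 2
# c.html 2
```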

Visitors and PageViews

After defining these conventions, we can start to read the file and turn each line into a record:

	for line in open(sys.argv[2]):

		# Ignore comments and empty lines
		if line[0]=="#" or len(line)==1:
			continue

		# Turn the raw line into a record of the chosen format
		record=Record(line)
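
To see the filtering step in isolation, here is a small sketch that applies the same idea to a list of strings instead of a file. The generator data_lines and the sample lines are made up; real log lines have many more fields.

```python
def data_lines(lines):
    # Hypothetical helper: yield only the lines that carry a log entry
    for line in lines:
        # Skip comment lines and lines that are empty or only whitespace
        if line.startswith("#") or not line.strip():
            continue
        yield line.rstrip("\n")


sample = [
    "#Fields: date time c-ip\n",
    "\n",
    "2008-01-01 00:00:01 10.0.0.1\n",
]
print(list(data_lines(sample)))  # ['2008-01-01 00:00:01 10.0.0.1']
```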

A first approach to determining the number of visitors is to simply count the number of IPs (using record.IP) that request pages from our server.

def visitors(visit):
	# Count one more request from this IP address
	incrementDict(Visits,visit)

where Visits is the dict we are using in this case. The number of distinct visitors is just len(Visits). Note, however, that total(Visits) gives the total number of page views only if we count requests for certain file types, since each page probably pulls in several images, stylesheets, javascript files, and so on. We do this by comparing record.type against a list of known page formats:

		# Only requests for actual pages count as page views
		if record.type in ('html','shtml'):
			visitors(record.IP)
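
A toy example may make the distinction between requests, page views and visitors clearer. The (IP, file type) pairs below are invented and simply stand in for record.IP and record.type; PAGE_TYPES is a name I introduced for the tuple used above.

```python
# Made-up (IP, file type) pairs standing in for parsed records
requests = [
    ("10.0.0.1", "html"),
    ("10.0.0.1", "css"),
    ("10.0.0.1", "png"),
    ("10.0.0.2", "html"),
    ("10.0.0.1", "shtml"),
]

PAGE_TYPES = ("html", "shtml")

visits = {}
for ip, file_type in requests:
    # Only requests for actual pages are counted
    if file_type in PAGE_TYPES:
        visits[ip] = visits.get(ip, 0) + 1

unique_visitors = len(visits)
page_views = sum(visits.values())

print(unique_visitors)  # 2
print(page_views)       # 3
print(len(requests))    # 5
```

Five requests reached the server, but only three of them were page views, and they came from two distinct IP addresses.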

Links and Hosts

Tracking incoming links is much easier, since we only have to look at record.referrer. In this case, we don’t need to worry about the different file types: images and other files linked from our own pages will have a local page as the referrer, and deep links to files other than web pages are probably something we are interested in as well. Sometimes we also want to know which hosts are providing the link love. Is there one web site that links to us on almost every page, or are all our links coming from different web sites? The function that does both of these tasks is:

def incoming(link):
	# Ignore links from my own pages and clicks without referrers
	if link[1]!='www.bgoncalves.com' and link[2]!='-':
		incrementDict(Links,link)
		# link[1] is the host the link came from
		incrementDict(Hosts,link[1])

The counter associated with each incoming host and link is useful to determine their relative importance in driving traffic to your site.
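
Here is a small self-contained sketch of the same filtering. It assumes that the referrer is stored as the tuple returned by urlparse, so that index 1 is the host and index 2 is the path (a bare “-” referrer ends up in the path). That is my guess at how record.referrer is built, not something the code above confirms. The function count_incoming and the example.com URLs are made up.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

MY_HOST = "www.bgoncalves.com"


def count_incoming(referrers):
    # Hypothetical helper: returns (links, hosts) counter dictionaries
    links, hosts = {}, {}
    for referrer in referrers:
        # Assumption: the record stores the referrer as a urlparse tuple
        link = urlparse(referrer)
        # Skip links from our own pages and requests without a referrer
        if link[1] != MY_HOST and link[2] != "-":
            links[link] = links.get(link, 0) + 1
            hosts[link[1]] = hosts.get(link[1], 0) + 1
    return links, hosts


links, hosts = count_incoming([
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.bgoncalves.com/index.html",
    "-",
])

print(len(links))  # 2
print(hosts)       # {'www.example.com': 2}
```

Two different pages link to us, but both live on the same host, which is exactly the kind of pattern the Hosts counter is meant to reveal.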

You can download the complete code we have developed so far using the links at the bottom of this post. Next time we will generate more statistics, but in the meantime, please leave any comments, questions or suggestions below.


© Copyright 2004 Bruno Goncalves - All rights reserved
