dot Stop testing, start deploying your AI apps. See how with MIT Technology Review’s latest research.

Download now

2.5 Web page analytics

back to home

2.5 Web page analytics

As people come to the websites that we build, interact with them, maybe even purchase something from them, we can learn valuable information. For example, if we only pay attention to pages that get the most views, we can try to change the way the pages are formatted, what colors are being used, maybe even change what other links are shown on the pages. Each one of these changes can lead to a better or worse experience on a page or subsequent pages, or even affect buying behavior.

In sections 2.1 and 2.2, we talked about gathering information about items that a user has looked at or added to their cart. In section 2.3, we talked about caching generated web pages in order to reduce page load times and improve responsiveness. Unfortunately, we went overboard with our caching for Fake Web Retailer; we cached every one of the 100,000 available product pages, and now we’re running out of memory. After some work, we’ve determined that we can only reasonably hold about 10,000 pages in the cache.

If you remember from section 2.1, we kept a reference to every item that was visited. Though we can use that information directly to help us decide what pages to cache, actually calculating that could take a long time to get good numbers. Instead, let’s add one line to the update_token() function from listing 2.2, which we see next.

Listing 2.9 The updated update_token() function
def update_token(conn, token, user, item=None):
	timestamp = time.time()
	conn.hset('login:', token, user)
	conn.zadd('recent:', token, timestamp)
	if item:
		conn.zadd('viewed:' + token, item, timestamp)
		conn.zremrangebyrank('viewed:' + token, 0, -26)
		conn.zincrby('viewed:', item, -1)

The line we need to add to update_token()

With this one line added, we now have a record of all of the items that are viewed. Even more useful, that list of items is ordered by the number of times that people have seen the items, with the most-viewed item having the lowest score, and thus having an index of 0. Over time, some items will be seen many times and others rarely. Obviously we only want to cache commonly seen items, but we also want to be able to discover new items that are becoming popular, so we know when to cache them.

To keep our top list of pages fresh, we need to trim our list of viewed items, while at the same time adjusting the score to allow new items to become popular. You already know how to remove items from the ZSET from section 2.1, but rescaling is new. ZSETs have a function called ZINTERSTORE, which lets us combine one or more ZSETs and multiply every score in the input ZSETs by a given number. (Each input ZSET can be multiplied by a different number.) Every 5 minutes, let’s go ahead and delete any item that isn’t in the top 20,000 items, and rescale the view counts to be half has much as they were before. The following listing will both delete items and rescale remaining scores.

Listing 2.10 The rescale_viewed() daemon function
def rescale_viewed(conn):
	while not QUIT:
		conn.zremrangebyrank('viewed:', 20000, -1)

Remove any item not in the top 20,000 viewed items.

		conn.zinterstore('viewed:', {'viewed:': .5})

Rescale all counts to be 1/2 of what they were before.

		time.sleep(300)

Do it again in 5 minutes.

With the rescaling and the counting, we now have a constantly updated list of the most-frequently viewed items at Fake Web Retailer. Now all we need to do is to update our can_cache() function to take into consideration our new method of deciding whether a page can be cached, and we’re done. You can see our new can_cache() function here.

Listing 2.11 The can_cache() function
def can_cache(conn, request):
	item_id = extract_item_id(request)

Get the item ID for the page, if any.

	if not item_id or is_dynamic(request):

Check whether the page can be statically cached and whether this is an item page.

		return False
	rank = conn.zrank('viewed:', item_id)

Get the rank of the item.

	return rank is not None and rank < 10000

Return whether the item has a high enough view count to be cached.

And with that final piece, we’re now able to take our actual viewing statistics and only cache those pages that are in the top 10,000 product pages. If we wanted to store even more pages with minimal effort, we could compress the pages before storing them in Redis, use a technology called edge side includes to remove parts of our pages, or we could pre-optimize our templates to get rid of unnecessary whitespace. Each of these techniques and more can reduce memory use and increase how many pages we could store in Redis, all for additional performance improvements as our site grows.