Building my own web analytics
I’ve built a simple client-side website analytics tool for this site; you can see it at
/analytics. It tracks the following metrics:
- Page views per day
- Unique IP addresses per day
- Views per page per day
This article eventually made it to the front page of Hacker News, which resulted
in a lot of extra traffic and an opportunity to see how the tool performed under a
much heavier load. I wrote about the effects of this and subsequent design changes here.
I compare the different results from CloudFlare Analytics, CloudFlare Web Analytics and my
own tool in this follow-up article.
Motivation
Google Analytics
Google Analytics felt like overkill. It has so many data-points that the
useful metrics are obscured. I also like this site to load quickly
and GA makes it slower.
CloudFlare Analytics
I’ve also tried CloudFlare Analytics. It’s a lot simpler than GA and better
suits my use case, but I don’t think it’s accurate.
The analytics should be easy to access and easy to understand1.
I know from my work visualizing data and building dashboards that the metrics
presented will alter the user’s perception of the underlying reality.
The way that someone thinks about their impact on a business, the value they’ve
produced, or the dynamics of the underlying system (a product’s quality, site
performance, growth, etc) is influenced by the design decisions I make, such as
which metrics are available, how easy they are to access, or which metrics are
above the fold.
If I present a particular metric as if it’s important, it will be difficult
for someone who uses the dashboard to resist this implied message. They’ll eventually
consider the metric as a Key Indicator of some kind.
For these reasons I wanted to see only the most important metrics about my
website, and I wanted to see them in a simple way without distraction.
The only metrics I’m interested in are:
- How many people are reading my site
- What are they reading
- How much are they reading
I’d like to be able to infer whether I have a few people who read a lot, or a lot of people
who read a little. (Or, as is the case, a few people who read a little.)
Method
Motivation
The main reason for making my own analytics tool is that it’s a fun challenge
with an obvious and useful result. Building it required connecting a few technologies:
serverless computing (Cloud Functions on GCP), NoSQL databases (DataStore),
JavaScript, and HTTP headers.
Assumptions
I’m assuming that unique IP addresses are a good enough proxy for unique readers, even
though I’m not considering crawlers, bots, or RSS subscribers2.
Technique
The analytics “engine” works by consuming a request that is sent by the client
each time a page is loaded. The request is parsed by a Cloud Function on GCP
which extracts the page URL and the IP address. This is then recorded in a
DataStore database along with the current date and time.
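As a rough sketch, a logging function like this might look as follows. The entity kind ("PageView"), field names, and the `page` query parameter are my assumptions for illustration, not necessarily the real schema:

```python
# Sketch of a page-view logging Cloud Function (assumed schema).
from datetime import datetime, timezone


def extract_view(headers, args):
    """Pull the page path and client IP out of a request.

    Cloud Functions sit behind a proxy, so the caller's IP is the
    first entry of X-Forwarded-For rather than the socket address.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    ip = forwarded.split(",")[0].strip()
    return {
        "path": args.get("page", "/"),
        "ip": ip,
        "timestamp": datetime.now(timezone.utc),
    }


def log_view(request):
    """HTTP entry point: parse the request and persist one PageView entity."""
    from google.cloud import datastore  # deferred so extract_view is testable offline

    view = extract_view(request.headers, request.args)
    client = datastore.Client()
    entity = datastore.Entity(key=client.key("PageView"))
    entity.update(view)
    client.put(entity)
    return ("", 204)
```

Keeping the parsing in a separate pure function means it can be exercised without a database or a mocked Flask request.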
Viewing the analytics is as simple (and as complicated) as making a request to
the database, parsing the data and visualizing it conveniently. For example,
group the data by days and count the distinct IP Addresses to figure out how
many people are visiting each day. This is achieved by making a request to
another Cloud Function that returns a response with a JSON payload.
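The grouping step can be sketched in plain Python, assuming each record has already been fetched from the database as a (date, ip) pair:

```python
# Count distinct IP addresses per day from (date, ip) records.
from collections import defaultdict


def unique_visitors_per_day(views):
    """Given an iterable of (date, ip) pairs, return {date: distinct IP count}."""
    ips_by_day = defaultdict(set)
    for day, ip in views:
        ips_by_day[day].add(ip)  # sets deduplicate repeat visits
    return {day: len(ips) for day, ips in ips_by_day.items()}
```

The same shape of aggregation (swap the key for the page URL) gives views per page per day.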
It’s not a perfect solution; there are edge cases I’m not considering. I expect
it to be mostly right and good enough for my purposes. It didn’t take much
effort and it was a fun mini project. The hardest part was figuring out chart.js; the slowest part was iterating on the Cloud Functions.
Mocking Cloud Functions
I haven’t figured out how to easily test cloud functions locally - it would
require setting up a NoSQL database and mocking Flask requests and responses.
Instead of doing that, I watched Peaky Blinders for a couple of minutes whilst
each new version of the Cloud Function was deploying.
Improvements
Eventually, I expect I’ll want to group the metrics by week or month. It’ll be a
good way of learning and playing with cloud technologies and JavaScript.
Unless someone decides to spam the site, I expect the costs to be less than
€1/month. This site is hosted using CloudFlare, so I suppose I could set up some
page rules to prevent malicious traffic3.
Questions
- I’d be interested to know if there is a way to track RSS subscribers. I know
that the usual method is to inspect server logs, but this site is hosted on
GitHub pages so I don’t think this is possible.
- To what extent does requiring JavaScript in order to log a page view filter out bots and crawlers?
- I’ve used the chart.js library because it’s reasonably fast and lightweight. My
preferred library would be Plotly if it could be responsive and fast even
when there are >10 charts to render. Has plotly.js improved recently to the
point where it wouldn’t cause a browser to lag if multiple plots are being rendered?
Finally, it occurs to me that I could make an analytics widget for my desktop
using Übersicht. It could show page views
for the current day perhaps. I’ve made a couple of widgets before
[1,
2] which were written in
CoffeeScript, but the newer widgets are written in React, so I guess this is an
opportunity to learn4.
Writing the “Time Since” (my daughter’s birth) and “Time Until” (my next
accounting exam5) widgets were my
first ever taste of CSS, HTML and JavaScript. The first ever article on
this blog was about the “Time Since” widget. CoffeeScript and Übersicht were
just about simple enough for me to learn by trial and error, copying someone
else’s code and changing it bit by bit until I had what I wanted.