ClickStream Analysis Tool


If you want a free, easy clickstream analysis tool, have a look at StatViz. If you’re running Apache with the standard log format, plugging this tool in is very easy.


  1. Download and install GraphViz. There’s an RPM for Linux.
  2. Download and install StatViz in a directory. It’s basically one PHP file. The README file explains how to customize the configuration file and run it.
  3. I don’t have many PHP apps running, so there were a couple of other things I had to do that you may need to do as well. First, you’ll need PEAR:Config. Once you have it, uncompress/untar it; the easiest thing to do is move Config.php and the Config dir to /usr/share/pear. Second, StatViz uses a lot of memory, so you may need to increase the memory_limit configuration parameter in your /etc/php.ini.

That’s pretty much it.

You can run it using

./statviz.php --config configfile

and then create a gif file of the output by doing something like

dot -Tgif -o OutputGifFileName InputDotFile

If you put the output gif file in a web-accessible directory, you’ll be able to view it in your browser.
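If you’d rather script the rendering step than type it each time, here is a minimal Python sketch that wraps the same dot command shown above. The function names and the file names are my own placeholders, not anything StatViz provides, and it assumes the dot file has already been produced.

```python
import shutil
import subprocess


def dot_command(dot_file, gif_file):
    """Build the argv for: dot -Tgif -o <gif_file> <dot_file>."""
    return ["dot", "-Tgif", "-o", gif_file, dot_file]


def render_gif(dot_file, gif_file):
    """Render the graph to a gif; return False if dot is not on the PATH."""
    if shutil.which("dot") is None:
        return False
    # check=True raises CalledProcessError if dot exits with an error.
    subprocess.run(dot_command(dot_file, gif_file), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    print(dot_command("site.dot", "site.gif"))
```

Pointing gif_file at a path under your web root covers the “see it from your browser” part in the same step.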

Things To Look For

There are a number of things you’ll need to consider if you want accurate results:

  • Make sure you look at the bot extensions and make best attempts to get these filtered out.
  • Make sure you have all non-pages (graphics, js, css) filtered out.
  • If possible, try to filter out requests from internal users. StatViz doesn’t have a filter for this, so I just scrubbed them out of the logs myself with grep -v.
  • If your site has long URLs, you will most certainly want to clean them up before processing. The tool allows you to create an alias file, but you may need/want to do some log scrubbing on your own.
  • Play around with the GraphNReferrerPairs parameter. You can get a lot more detail on site activity with higher numbers, but the graph then becomes a lot more complex to digest. If you decide on a large graph, you may need to modify the source and change the size of the graph. It defaults to 10, 8 and there isn’t a parameter to configure this. I changed it to 20, 16 for most of my small graphs ( GraphNReferrerPairs <> ) and to 40, 32 for larger graphs.
  • Very long URLs are going to be a hassle, especially if they come from external referrers and are out of your control. I added some checks to the code to clip the very long URLs.
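Several of the points above (internal users, non-pages, bots, long URLs) can be handled by scrubbing the log before StatViz ever sees it. The following is a rough Python sketch of that idea, not the checks I put into the StatViz source. It assumes Apache common or combined log format, and the extension list, bot hints, internal address prefixes, and clip length are all placeholder values you would need to adjust for your own site.

```python
import re

# host ident user [time] "METHOD url PROTO" status size ["referer" "agent"]
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# All four values below are assumptions for illustration.
STATIC_EXT = (".gif", ".jpg", ".jpeg", ".png", ".ico", ".js", ".css")
BOT_HINTS = ("bot", "crawler", "spider", "slurp")
INTERNAL_PREFIXES = ("10.", "192.168.")
MAX_URL_LEN = 60


def clip(url, limit=MAX_URL_LEN):
    """Truncate a URL to at most `limit` characters."""
    return url if len(url) <= limit else url[:limit]


def is_page_view(m):
    """True if the parsed line looks like an external, non-bot page request."""
    if m.group("host").startswith(INTERNAL_PREFIXES):
        return False
    path = m.group("url").split("?", 1)[0].lower()
    if path.endswith(STATIC_EXT):
        return False
    agent = (m.group("agent") or "").lower()
    return not any(hint in agent for hint in BOT_HINTS)


def scrub(lines):
    """Yield only page-view lines, with long URLs and referrers clipped."""
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or not is_page_view(m):
            continue
        out = line
        # Rewrite right to left so the earlier span offsets stay valid.
        for name in ("referer", "url"):
            if m.group(name) is not None:
                start, end = m.span(name)
                out = out[:start] + clip(m.group(name)) + out[end:]
        yield out
```

You would run something like this over the access log, write the result to a new file, and point the StatViz config at the scrubbed copy. Lines that don’t parse are dropped, which may or may not be what you want.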

I’ve automated a couple of things on my site:
  • A report on today’s activity that updates hourly.
  • A daily gif file that I archive. (I will add weekly and monthly in the future.)
  • A ‘full report’ that shows activity for the last 30 days. I update this daily.
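For the rolling reports, one approach is to cut the log down to the window you care about before running StatViz. This is only a sketch of that idea in Python, not my actual setup: it assumes the standard Apache timestamp (for example [10/Oct/2023:13:55:36 -0700]), and the function names are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# First bracketed field on the line is the Apache timestamp.
TIME_RE = re.compile(r"\[([^\]]+)\]")


def parse_time(line):
    """Return the request time as an aware datetime, or None if unparseable."""
    m = TIME_RE.search(line)
    if m is None:
        return None
    try:
        return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None


def recent_lines(lines, now, days=30):
    """Yield lines whose timestamp falls within the last `days` days of `now`."""
    cutoff = now - timedelta(days=days)
    for line in lines:
        ts = parse_time(line)
        if ts is not None and cutoff <= ts <= now:
            yield line


if __name__ == "__main__":
    now = datetime.now(timezone.utc)  # must be timezone-aware
    sample = ['203.0.113.7 - - [10/Oct/2023:13:55:36 -0700] "GET / HTTP/1.0" 200 1']
    print(list(recent_lines(sample, now)))
```

Calling it with days=1 gives a rough “today” window for the hourly report, and days=30 gives the full report. A cron job (or whatever scheduler you use) can then run the filter, StatViz, and dot in sequence.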

I’ll put out another entry with a quick 101 on interpreting the results.