Data science: Website traffic analysis with R

Overview of the News From Nan website

Since 1998 we have maintained News From Nan, a family-oriented WordPress website. The site consists of over 200 posts and 20 pages.

News From Nan home page

Each page and post carries category and tag metadata. Example categories are history, news, and travel. Example tags are hiking and biking (two of our favorite activities), California (where we live), and Minnesota (where some of our family lives).

Overview of the data analysis project

We recently completed a project that analyzed the popularity of content by category and tag. Popularity was measured as total hits over a one-month interval.

Two sets of data were collected from the WordPress site to perform the analysis:

  • An export of the site content in XML format
  • Content hit counts gathered by the WP Statistics plugin in CSV format

Once the data was collected, a Python data wrangling script merged the two sets into a single CSV file with the following columns:

  • id
  • url
  • type of post (page or post)
  • author
  • categories attached
  • tags attached
  • total number of hits
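The merge step described above could be sketched roughly as follows. This is a minimal illustration, not the script we actually ran. The element names (`wp:post_id`, `wp:post_type`, `dc:creator`, and `category` with a `domain` attribute) follow the WordPress export format as we understand it. The `id` and `hits` column names for the hits CSV are assumptions, since the layout of the WP Statistics export is not shown here, and so is the use of ";" to join several categories or tags in one cell.

```python
import csv
import io
import xml.etree.ElementTree as ET

# XML namespaces declared in WordPress export (WXR) files. The version at the
# end of the "wp" URI can differ between exports; check your file's header.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
COLUMNS = ["id", "url", "type", "author", "categories", "tags", "hits"]


def read_content(xml_text):
    """Return {post id: record} for every page and post in the export."""
    records = {}
    for item in ET.fromstring(xml_text).iter("item"):
        post_type = item.findtext("wp:post_type", default="", namespaces=NS)
        if post_type not in ("page", "post"):
            continue  # skip attachments, menu items, and other item types
        # Categories and tags share one element name and differ by "domain".
        terms = {"category": [], "post_tag": []}
        for term in item.findall("category"):
            if term.get("domain") in terms:
                terms[term.get("domain")].append(term.text or "")
        post_id = item.findtext("wp:post_id", default="", namespaces=NS)
        records[post_id] = {
            "id": post_id,
            "url": item.findtext("link", default=""),
            "type": post_type,
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            # Assumption: multiple values are joined with ";" in one cell.
            "categories": ";".join(terms["category"]),
            "tags": ";".join(terms["post_tag"]),
            "hits": 0,
        }
    return records


def merge(xml_text, hits_csv_text):
    """Add hit counts to the content records and return them as a list."""
    records = read_content(xml_text)
    # Assumption: the hits CSV has "id" and "hits" columns. Rows whose id
    # does not belong to a page or post are ignored.
    for row in csv.DictReader(io.StringIO(hits_csv_text)):
        if row["id"] in records:
            records[row["id"]]["hits"] += int(row["hits"])
    return list(records.values())


def to_csv(rows):
    """Serialize merged rows as CSV text using the column order above."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Summing hits per id means that several hit rows for the same post (for example, one row per day) collapse into a single total, which matches the one-month total used as the popularity measure.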

The merged CSV file was then used as input to an R script run in RStudio. The entire implementation process is illustrated below.

Implementation Process

Histogram of hit counts

The first step in the analysis was to plot a histogram of hit counts to see how hits are distributed over all pages and posts.


Histogram of hits for all pages and posts

The above chart shows that a small number of pages and posts get most of the hits.
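The skew visible in the histogram can also be checked numerically. The sketch below is in Python rather than R, to match the wrangling script. It is only an illustration, and the function names and the sample hit counts in the usage note are made up. It counts how many pages and posts fall into each fixed-width bin and computes the share of all hits that goes to the top fraction of content.

```python
from collections import Counter


def histogram(hits, bin_width):
    """Map each bin's lower edge to the number of pages/posts in that bin."""
    counts = Counter((h // bin_width) * bin_width for h in hits)
    return dict(sorted(counts.items()))


def top_share(hits, fraction):
    """Return the share of all hits received by the top `fraction` of items."""
    ranked = sorted(hits, reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0
```

For example, with the made-up counts `[0, 1, 1, 2, 3, 4, 5, 8, 40, 120]`, `histogram(hits, 10)` puts eight items in the lowest bin and one each in the bins starting at 40 and 120, while `top_share(hits, 0.2)` shows the top two items collecting roughly 87% of the hits.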

Looking at which categories are most popular

Next we looked at which categories are the most popular. Below is a plot of the total number of hits by category.

Plot of total hits by category
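The totals behind a plot like this could be computed along the following lines. This is a Python sketch, not the R code we ran. It assumes each row of the merged file is a dictionary with `categories` and `hits` keys, and that several categories in one cell are separated by ";" (both are assumptions about the file layout). A post attached to more than one category counts toward each of them.

```python
from collections import defaultdict


def hits_by_category(rows):
    """Return (category, total hits) pairs, most popular category first."""
    totals = defaultdict(int)
    for row in rows:
        for name in row["categories"].split(";"):
            name = name.strip()
            if name:  # content without a category contributes nothing
                totals[name] += int(row["hits"])
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
```

Because multi-category posts are counted once per category, the category totals can add up to more than the site-wide hit count.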

The plot shows that the most popular category is “family history.” The category consists of a series of posts we created to describe the history and ancestry of various branches of our family. A list of the top few pages by hits confirms this.

Pages in hits order
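A ranking like this can be produced directly from the merged CSV file. The Python sketch below assumes the file has `url` and `hits` columns, as in the column list given earlier. The exact header names are an assumption, and the function name is hypothetical.

```python
import csv
import io


def top_pages(merged_csv_text, n=5):
    """Return (url, hits) pairs for the n most-visited pages and posts."""
    rows = csv.DictReader(io.StringIO(merged_csv_text))
    ranked = sorted(rows, key=lambda row: int(row["hits"]), reverse=True)
    return [(row["url"], int(row["hits"])) for row in ranked[:n]]
```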

At the top of the list are the Dienst and Nuss families. We have been contacted many times by distant relatives who found a connection to us by looking at these pages.

Planning future projects

In a future study, we plan to look at how tags affect post popularity (the tags describe the posts in greater detail than the categories do).