Linked Data Access Analysis

Posted by Knud on February 4th, 2009

I’m currently working on an analysis of the log files of the Semantic Web Dog Food server. Apart from the obvious questions such as “How much traffic was there?”, “When were the peaks in traffic?” or “Where did the traffic come from?”, Semantic Web-style linked data inspires some questions of its own: How intensively was the Semantic Web portion of the data used, i.e., how often was RDF requested compared to HTML? What was the distribution of “semantic” vs. “conventional” user agents? And what kind of data was requested?

Using the techniques described earlier in a post on my Confused Development blog, I sifted through about seven months’ worth of log files and generated some pretty pictures. Here is what I have come up with so far:
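(As an aside, here is a minimal Python sketch of the kind of sifting involved, assuming Apache’s combined log format; the file name and the file-extension heuristic for telling RDF from HTML requests are placeholders of mine, not the exact setup from that post.)

    import re
    from collections import Counter

    # Apache "combined" log format: IP, identity, user, [timestamp],
    # "request line", status, size, "referer", "user-agent"
    LOG_LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
        r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

    def representation(path):
        """Rough split into RDF, HTML and "plain" (content-negotiated) requests."""
        if path.endswith('.rdf'):
            return 'rdf'
        if path.endswith('.html'):
            return 'html'
        return 'plain'

    counts = Counter()
    with open('access.log') as log:           # placeholder file name
        for line in log:
            m = LOG_LINE.match(line)
            if m and m.group('method') == 'GET':
                counts[representation(m.group('path'))] += 1

    print(counts.most_common())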

Linked data hit analysis (Data tail)

The serving of linked data on the dog food server works through content negotiation: the first request by an agent goes to the URI of the resource (“plain” in the graph), with the HTTP Accept header specifying whether an RDF or HTML representation is desired. The server then redirects to the HTML or RDF document accordingly. In theory, this means that requests(rdf) + requests(html) = requests(plain). However, since it is perfectly feasible to request the HTML or RDF documents directly, the total of RDF+HTML is slightly higher than the plain count. The total numbers are:

HTML: 238486
RDF: 35491
HTML+RDF: 273977
Plain: 247576
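(To make the mechanism concrete, here is a small Python sketch of such a content-negotiated request; the person URI is just an illustrative example, and the exact redirect targets may differ from what the server actually sends.)

    import urllib.request

    uri = 'http://data.semanticweb.org/person/knud-moeller'   # example "plain" URI

    # Ask for RDF; a content-negotiating server should answer with a
    # redirect to the RDF document rather than to the HTML page.
    req = urllib.request.Request(uri, headers={'Accept': 'application/rdf+xml'})
    resp = urllib.request.urlopen(req)        # redirects are followed automatically
    print(resp.geturl())                      # final URL after the redirect
    print(resp.headers.get('Content-Type'))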

As the graph and the numbers show, usage in terms of RDF requests is still relatively low (roughly 13% of all document requests), indicating that there is still a long way to go for the Semantic Web to really take off (and that we need to work on making the site more popular).

Linked data hit analysis (Resource type)

This second graph shows the distribution of hits over time for the different kinds of resources the server offers, as indicated by the requested namespace (dogfood:person, dogfood:conference, …). Interest in people resources is highest almost all of the time. Partly, this may be due to ego surfing by Semantic Web researchers. However, as the graphs below will show, bot traffic far exceeds traffic from human visitors, so my hunch is that the preference for people pages can be explained by the search strategies of the big search engine players: people information is probably considered more valuable. Another factor is, of course, that there are about three times as many people resources on the dog food server as there are, e.g., conference resources.

The conference and workshop resources need to be examined in a more fine-grained fashion, since their namespaces cover everything connected to an event: papers, talks, chairs, the event itself, etc.
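(A sketch of how such a per-namespace breakdown might be computed; the path patterns are my guesses based on the dogfood:person / dogfood:conference naming, not necessarily the server’s actual URI scheme.)

    from collections import Counter

    def resource_type(path):
        """Map a request path to its namespace, e.g. '/person/knud-moeller' -> 'person'."""
        first = path.strip('/').split('/')[0]
        known = {'person', 'conference', 'workshop', 'organization'}
        return first if first in known else 'other'

    # e.g. fed with the request paths extracted by the log parser sketched above
    paths = ['/person/knud-moeller', '/conference/eswc2008', '/workshop/ldow2008/paper-1']
    print(Counter(resource_type(p) for p in paths))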

Linked data hit analysis (Agent tail)

No self-respecting analysis can live without a nice long-tail graph these days. Looking at visiting agents, we get such a distribution (the y-scale is logarithmic). The agents in the head are the big search engine crawlers (GoogleBot, Yahoo! Slurp and MSNBot), as well as the big-name browsers. In the middle and long tail we find lots and lots of other bots, crawlers and browsers, as well as various tools, data services and agents that didn’t give themselves a proper identifier and instead just show up as “Java” or “perl-libwww” (very naughty behaviour indeed…).
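(A graph like this takes only a few lines of matplotlib; the counts below are made-up placeholders, since in practice they come from tallying the user-agent field of the parsed log.)

    import matplotlib.pyplot as plt
    from collections import Counter

    # Placeholder counts; in practice, tally the user-agent field per log line.
    agents = Counter({'Googlebot': 90000, 'Yahoo! Slurp': 70000, 'msnbot': 50000,
                      'Firefox': 30000, 'SomeCrawler': 400, 'Java': 120,
                      'perl-libwww': 80})

    names, counts = zip(*agents.most_common())
    plt.bar(range(len(counts)), counts)
    plt.yscale('log')                 # logarithmic y-axis, as in the graph above
    plt.xticks(range(len(names)), names, rotation=90)
    plt.tight_layout()
    plt.show()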

Linked data hit analysis (Agent types)

Probably more interesting is this graph, which shows the agent distribution after I sliced and diced it manually according to a few criteria (a rough sketch of how such a classification might look in code follows the list):

  • What type of agent is it: bot/crawler, browser (= human visitor), unspecified programming library, debugging or scripting tool (curl, wget, …) or data service? The latter is Richard’s term for agents that provide a service for other agents by processing some data on the Web. In contrast to crawlers, the purpose here is not archiving or indexing. Examples are format converters, snapshot generators, etc.
  • What is the “semanticity” of the agent: is it a conventional agent, or one that operates in a Semantic Web-aware fashion?
  • Mobile or not: I noticed a (small) number of visits by mobile browsers, which I thought could be interesting to record separately.
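(Here is a toy rule-based classifier along these three dimensions, purely to illustrate the idea; the substring rules, and the choice of Semantic Web clients and mobile platforms to look for, are illustrative guesses of mine, not the actual criteria used in the analysis.)

    def classify_agent(ua):
        """Classify a user-agent string into (type, semantic?, mobile?)."""
        ua_l = ua.lower()
        if any(s in ua_l for s in ('bot', 'crawler', 'slurp', 'spider')):
            kind = 'bot/crawler'
        elif any(s in ua_l for s in ('curl', 'wget')):
            kind = 'tool'
        elif any(s in ua_l for s in ('java', 'libwww', 'python-urllib')):
            kind = 'library'
        elif any(s in ua_l for s in ('mozilla', 'opera')):
            kind = 'browser'
        else:
            kind = 'unknown'
        semantic = any(s in ua_l for s in ('tabulator', 'swoogle', 'sindice', 'rdf'))
        mobile = any(s in ua_l for s in ('mobile', 'symbian', 'blackberry'))
        return kind, semantic, mobile

    print(classify_agent('Mozilla/5.0 (SymbianOS/9.2; ...)'))  # ('browser', False, True)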

All this and more will become part of my thesis, and will also (hopefully) make it into some sort of more polished publication soon.