Close, but a Cigar Nevertheless

Posted by Knud on May 4th, 2010

I just came back from this year’s Web Science Conference in Raleigh, NC. The idea of the conference – as of Web Science in general – is to give a holistic, multi-disciplinary view of the Web, and while I’m still not sure if and exactly how this will work in the end (there was a heated discussion between social and computer scientists in the closing panel), I found the event very interesting and a lot of fun. Of course, the best surprise came right at the end, when our paper on Linked Data Usage (I had reported on early stages of this quite a while ago on this blog) was shortlisted as one of three papers for the best paper award! In the end we didn’t win (the prize went to the paper by Metaxas and Mustafaraj: From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search), but just getting the nomination was pretty awesome. I really didn’t expect this, considering that this paper had been in the pipeline for more than a year, never quite made it for any submission deadline, and was therefore delayed time and time again. This is great encouragement to continue our work in this area!

Semantic User Agents

Posted by Knud on October 8th, 2009

I’m still very much interested in the topic of analysing usage of linked data sites. To that end, an interesting question to ask is what kinds of agents access a linked data site. And here, apart from the usual categorisation into bots, browsers and such, it makes sense to differentiate between semantic and non-semantic agents. Very loosely, we could say that

Semantic agents are agents which are aware of RDF data and actively request it.

To know whether or not an agent requests RDF, we could look at the header of an individual HTTP request and check if the agent specified Accept: application/rdf+xml. However, the Apache server log files unfortunately don’t tell us anything about the request headers. Luckily though, there is an indirect way of finding this out. If our linked data site uses best-practice content negotiation and 303 redirects, we can look at pairs of requests in the log files. For example, the Semantic Web Dog Food site uses a particular URI pattern for resources and their HTML and RDF representations:


http://data.semanticweb.org/organization/deri-nui-galway

http://data.semanticweb.org/organization/deri-nui-galway/html

http://data.semanticweb.org/organization/deri-nui-galway/rdf

If the plain URI is requested, the server will either redirect to the HTML or the RDF representation, based on what was specified by the agent. Therefore, if we find a request for a plain URI and a request for the corresponding RDF URI, from the same IP address and the same agent, within a short time frame (e.g. 5 seconds), then we can infer that the agent had requested application/rdf+xml and can therefore be classified as a semantic agent.
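To make this concrete, here is a minimal client-side sketch in Python (using urllib from the standard library and one of the example URIs above) of what a semantic agent does. This is only an illustration of the content negotiation described above, not any particular agent’s actual code – but it is exactly this behaviour that produces the paired requests visible in the log excerpt below.

import urllib.request

# Ask for RDF at the plain resource URI; the server answers with a 303
# redirect, which urllib follows automatically to the .../rdf document.
req = urllib.request.Request(
    'http://data.semanticweb.org/organization/deri-nui-galway',
    headers={'Accept': 'application/rdf+xml'})
with urllib.request.urlopen(req) as resp:
    print(resp.url)                            # final URI after the redirect
    print(resp.headers.get('Content-Type'))    # should be the RDF media type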

90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; )"
90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 45358 "-" "rdflib-2.4.0 (http://rdflib.net/; )"

The example above shows this: the “rdflib” agent requested the plain URI .../organization/vrije-universiteit-amsterdam-the-netherlands, received a 303 redirect, and requested .../organization/vrije-universiteit-amsterdam-the-netherlands/rdf a few seconds later. From this we can automatically infer that “rdflib” is a semantic agent.
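For the curious, here is a rough sketch of the pairing heuristic in Python – not the script I actually used. It assumes Apache’s “combined” log format as in the excerpt above and the /rdf URI suffix convention; the log file name is made up.

import re
from datetime import datetime, timedelta

# Sketch of the request-pairing heuristic (not the original analysis script).
# Assumes Apache "combined" log lines like the excerpt above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"')

WINDOW = timedelta(seconds=5)  # the "short time frame" mentioned above

def parse(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group('ts'), '%d/%b/%Y:%H:%M:%S %z')
    return m.group('ip'), ts, m.group('path'), m.group('agent')

def semantic_agents(lines):
    """Agents that requested a plain URI and its /rdf variant from the
    same IP within WINDOW."""
    pending = {}   # (ip, agent, plain path) -> time of the plain request
    found = set()
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ip, ts, path, agent = rec
        if path.endswith('/rdf'):
            t0 = pending.get((ip, agent, path[:-len('/rdf')]))
            if t0 is not None and ts - t0 <= WINDOW:
                found.add(agent)
        elif not path.endswith('/html'):
            pending[(ip, agent, path)] = ts
    return found

with open('access.log') as f:          # hypothetical log file name
    for agent in sorted(semantic_agents(f)):
        print(agent)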

A list of 423 semantic agents found in this way for the dog food site from 10/2008–10/2009 is here. Looking at the list, we can find a lot of agents that are clearly “semantic”, such as the “SindiceFetcher” or a SIOC browser. However, most of them are actually not what I would normally consider “semantic”, such as hordes of “Mozilla”-branded agents or dodgy-looking bots. More research awaits…

Linked Data Access Analysis

Posted by Knud on February 4th, 2009

I’m currently working on an analysis of the log files of the Semantic Web Dog Food server. Apart from the obvious queries such as “How much traffic was there?”, “When were the peaks in traffic?” or “Where did the traffic come from?”, Semantic Web-type linked data inspires some other questions as well: how intensively was the Semantic Web portion of the data used (i.e., how often was RDF requested compared to HTML), what did the distribution of “semantic” vs. “conventional” user agents look like, and what kind of data was requested?

Using the techniques described earlier in a post on my Confused Development blog, I sifted through about seven months’ worth of log files and generated some pretty pictures. Here is what I have come up with so far:

Linked data hit analysis (Data tail)

The serving of linked data on the dog food server works through content negotiation – basically, the first request by an agent goes to the URI of the resource (“plain” in the graph), specifying in the Accept header whether an RDF or HTML representation is desired. The server then redirects to either the HTML or the RDF document, whichever matches the desired representation. In theory, this means that requests(rdf) + requests(html) = requests(plain). However, since it is perfectly feasible to request the HTML or RDF documents directly, the total of RDF+HTML is slightly higher. The total numbers are:

HTML: 238486
RDF: 35491
HTML+RDF: 273977
Plain: 247576

As the graph and the numbers show, the usage in terms of RDF requests is relatively low at the moment, indicating that there is still a long way to go for the Semantic Web to really take off (and that we need to work on making the site more popular).
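For reference, here is a small sketch of how counts like the above can be derived from the request paths, assuming the paths have already been restricted to the site’s resource URIs and using the /html and /rdf suffix convention described in the earlier post:

from collections import Counter

def tally_representations(paths):
    """Count requests by representation: /rdf documents, /html documents,
    and plain resource URIs (everything else)."""
    counts = Counter()
    for path in paths:
        if path.endswith('/rdf'):
            counts['rdf'] += 1
        elif path.endswith('/html'):
            counts['html'] += 1
        else:
            counts['plain'] += 1
    return counts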

Linked data hit analysis (Resource type)

This second graph shows the distribution of hits over time for the different kinds of resources which the server offers, as indicated by the requested namespace (dogfood:person, dogfood:conference, …). Interest in people resources is highest almost all of the time. Partially, this may be due to ego surfing by Semantic Web researchers. However, as the graphs below will show, bot traffic far exceeds traffic by human visitors, so my hunch is that the preference for people pages can be explained by the search strategies of the big search engine players out there – people information is probably considered more valuable. Of course, another factor is that there are about three times as many people resources on the dog food server as e.g. conference resources.

Regarding the conference and workshop resources, those need to be examined in a more fine-grained fashion, since the respective namespaces cover everything connected to an event: papers, talks, chairs, the event itself, etc.
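The resource type of a request can be read off the URI path. A sketch – assuming, as in the example URIs from the earlier post, that the first path segment corresponds to the namespace (person, conference, organization, …):

from collections import Counter

def resource_type(path):
    """First path segment of a dog food URI, e.g. 'person' or 'conference' --
    assumed here to correspond to the namespaces shown in the graph."""
    segments = [s for s in path.split('/') if s]
    return segments[0] if segments else None

def hits_by_type(paths):
    """Tally hits per resource type (filtering of non-resource paths omitted)."""
    return Counter(t for t in map(resource_type, paths) if t)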

Linked data hit analysis (Agent tail)

No self-respecting analysis can live without a nice long-tail graph these days. Looking at visiting agents, we get such a distribution (the y-scale is logarithmic). The agents in the head are the big search engine crawlers – GoogleBot, Yahoo! Slurp and MSNBot – as well as the big-name browsers. In the middle and long tail we find lots and lots of other bots, crawlers and browsers, as well as various tools, data services and agents that didn’t give themselves a proper identifier and instead just show up as “Java” or “perl-libwww” (very naughty behaviour indeed…).

Linked data hit analysis (Agent types)

More interesting is probably this graph, which shows the agent distribution after I had sliced and diced it manually according to some criteria (a rough classification sketch follows the list below):

  • What type of agent is it: bot/crawler, browser (=human visitor), unspecified programming library, debugging or scripting tool (curl, wget, …) or data-service. The latter is Richard’s term for agents which provide a service for other agents by processing some data on the Web. In contrast to crawlers, the purpose here is not archiving or indexing. Examples are format converters, snapshot generators, etc.
  • What is the “semanticity” of the agent: is it a conventional agent, or one that operates in a Semantic Web-aware fashion?
  • Mobile or not: I noticed a (small) number of visits by mobile browsers, which I thought could be interesting to record separately.
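The manual slicing and dicing can be partly bootstrapped with simple keyword matching on the user-agent strings. The sketch below is illustrative only – the keyword lists are guesses rather than the criteria I actually applied, and borderline cases (data services in particular) still need to be assigned by hand.

# Illustrative keyword-based pre-classification of user-agent strings.
AGENT_TYPES = {
    # checked in insertion order: bots before browsers, since many bots
    # carry a 'Mozilla' prefix in their user-agent string
    'bot/crawler': ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider', 'bot'),
    'library':     ('java', 'perl-libwww', 'python-urllib', 'rdflib'),
    'tool':        ('curl', 'wget'),
    'browser':     ('mozilla', 'opera'),
}

MOBILE_HINTS = ('iphone', 'android', 'symbian', 'blackberry', 'windows ce')

def classify_agent(agent, semantic_agents=frozenset()):
    """Return (type, semantic?, mobile?) for a user-agent string.

    semantic_agents is a set of agent strings observed requesting RDF,
    e.g. via pairing plain and /rdf requests in the logs.
    """
    a = agent.lower()
    agent_type = 'unspecified'
    for label, keywords in AGENT_TYPES.items():
        if any(k in a for k in keywords):
            agent_type = label
            break
    return agent_type, agent in semantic_agents, any(h in a for h in MOBILE_HINTS)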

All this and more will become part of my thesis and will also (hopefully) make it into some sort of more polished publication soon.