
Semantic User Agents

Posted by Knud on October 8th, 2009

I’m still very much interested in the topic of analysing usage of linked data sites. To that end, an interesting question to ask is what kinds of agents access a linked data site. And here, apart from the usual categorisation into bots, browsers and such, it makes sense to differentiate between semantic and non-semantic agents. Very loosely, we could say that

Semantic agents are agents which are aware of RDF data and actively request it.

To know whether or not an agent requests RDF, we could look at the headers of an individual HTTP request and check whether the agent specified Accept: application/rdf+xml. Unfortunately, Apache server log files in the usual common or combined format don’t record the Accept header (only the referrer and user agent make it into the log). Luckily, there is an indirect way of finding out. If our linked data site uses best-practice content negotiation and 303 redirects, we can look at pairs of requests in the log files. For example, the Semantic Web Dog Food site uses a particular URI pattern for resources and their HTML and RDF representations:


http://data.semanticweb.org/organization/deri-nui-galway

http://data.semanticweb.org/organization/deri-nui-galway/html

http://data.semanticweb.org/organization/deri-nui-galway/rdf

If the plain URI is requested, the server will redirect to either the HTML or the RDF representation, depending on the Accept header sent by the agent. Therefore, if we find a request for a plain URI and a request for the corresponding RDF URI, from the same IP address and the same agent, within a short time frame (e.g. 5 seconds), then we can infer that the agent had requested application/rdf+xml and classify it as a semantic agent.

90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; )"
90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 45358 "-" "rdflib-2.4.0 (http://rdflib.net/; )"

The example above shows this: the “rdflib.net” agent requested the plain URI .../organization/vrije-universiteit-amsterdam-the-netherlands, was answered with a 303 redirect, and requested .../organization/vrije-universiteit-amsterdam-the-netherlands/rdf four seconds later. From this we can automatically infer that “rdflib.net” is a semantic agent.
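To make the pairing idea concrete, here is a rough Python sketch of how such a scan over the log could look. It is not the script I actually used. It assumes the log is in Apache’s combined format and in chronological order, that RDF representations always live at the plain URI plus /rdf, and that month names parse in an English locale. The names LOG_LINE, parse_line and find_semantic_agents are made up for this illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed layout: Apache "combined" format, as in the two log lines above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return (ip, agent, time, path, status), or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return (m.group("ip"), m.group("agent"), when,
            m.group("path"), int(m.group("status")))


def find_semantic_agents(lines, window=timedelta(seconds=5)):
    """Collect user agents that follow a 303 on a plain URI with a
    successful request for the matching /rdf URI within `window`."""
    pending = {}      # (ip, agent, expected RDF path) -> time of the 303
    semantic = set()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ip, agent, when, path, status = parsed
        if status == 303:
            # Remember where an RDF-requesting agent would be sent next.
            pending[(ip, agent, path.rstrip("/") + "/rdf")] = when
        elif status == 200 and path.endswith("/rdf"):
            redirected_at = pending.pop((ip, agent, path), None)
            if (redirected_at is not None
                    and timedelta(0) <= when - redirected_at <= window):
                semantic.add(agent)
    return semantic
```

Calling find_semantic_agents(open("access.log")) (with a hypothetical file name) returns the set of user agent strings that showed the pattern at least once. Entries in pending for agents that went on to the HTML page are never cleaned up here, so for a year’s worth of logs one would want to expire old entries as well.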

A list of 423 semantic agents found in this way for the dog food site between October 2008 and October 2009 is here. Looking at the list, we can find a lot of agents that are clearly “semantic”, such as the “SindiceFetcher” or a SIOC browser. However, most of the entries are not what I would normally consider “semantic”: there are hordes of “Mozilla”-branded agents and dodgy looking bots. More research awaits…