Someone in DERI brought back a set of Semantic Web fridge poetry magnets from a workshop! A joyous occasion for all SemWeb nerds, and there are plenty of those in DERI.
I’m still very much interested in the topic of analysing usage of linked data sites. To that end, an interesting question to ask is what kinds of agents access a linked data site. And here, apart from the usual categorisation into bots, browsers and such, it makes sense to differentiate between semantic and non-semantic agents. Very loosely, we could say that
Semantic agents are agents which are aware of RDF data and actively request it.
To know whether or not an agent requests RDF, we could look at the header of an individual HTTP request and check if the agent had specified Accept: application/rdf+xml. However, the Apache server log files unfortunately don’t record the Accept header. Luckily though, there is an indirect way of finding out. If our linked data site uses best-practice content negotiation and 303 redirects, we can look at pairs of requests in the log files. For example, the Semantic Web Dog Food site uses a particular URI pattern for resources and their HTML and RDF representations:
http://data.semanticweb.org/organization/deri-nui-galway
http://data.semanticweb.org/organization/deri-nui-galway/html
http://data.semanticweb.org/organization/deri-nui-galway/rdf
If the plain URI is requested, the server will either redirect to the HTML or the RDF representation, based on what was specified by the agent. Therefore, if we find a request for a plain URI and a request for the corresponding RDF URI, from the same IP address and the same agent, within a short time frame (e.g. 5 seconds), then we can infer that the agent had requested application/rdf+xml and can therefore be classified as a semantic agent.
90.21.243.141 - - [06/Oct/2008:16:07:58 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands HTTP/1.1" 303 7592 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
90.21.243.141 - - [06/Oct/2008:16:08:02 +0100] "GET /organization/vrije-universiteit-amsterdam-the-netherlands/rdf HTTP/1.1" 200 45358 "-" "rdflib-2.4.0 (http://rdflib.net/; eikeon@eikeon.com)"
The example above shows this: the “rdflib.net” agent requested the plain URI .../organization/vrije-universiteit-amsterdam-the-netherlands and was 303 redirected to .../organization/vrije-universiteit-amsterdam-the-netherlands/rdf a few seconds later. From this we can automatically infer that “rdflib.net” is a semantic agent.
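The pairing heuristic can be sketched in a few lines of Python. This is only a rough illustration, not the script actually used for the analysis: the regular expression assumes log lines shaped like the two shown above, and the names `parse`, `semantic_agents` and the `window` parameter are made up for this example.

```python
import re
from datetime import datetime, timedelta

# Assumption: log lines look like the two example lines above
# (IP, two ignored fields, timestamp, request, status, size, referrer, agent).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def parse(line):
    """Turn one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def semantic_agents(lines, window=timedelta(seconds=5)):
    """Collect agents that requested a plain URI (answered with a 303) and
    then the matching /rdf URI from the same IP within `window`."""
    pending = {}  # (ip, agent, plain path) -> time of the 303 response
    found = set()
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        who = (rec["ip"], rec["agent"])
        path = rec["path"]
        if rec["status"] == "303":
            pending[who + (path,)] = rec["time"]
        elif path.endswith("/rdf"):
            plain = path[: -len("/rdf")]
            start = pending.get(who + (plain,))
            if start is not None and timedelta(0) <= rec["time"] - start <= window:
                found.add(rec["agent"])
    return found
```

Fed with the two log lines above, this reports the rdflib agent, since the /rdf request follows the 303 after four seconds. The sketch never expires old entries in `pending`, which a run over a full year of logs would probably want to do.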
A list of 423 semantic agents found in this way for the dog food site from 10/2008-10/2009 is here. Looking at the list, we can find a lot of agents that are clearly “semantic”, such as the “SindiceFetcher” or a SIOC browser. However, most of them are actually not what I would normally consider “semantic”, such as hordes of “Mozilla”-branded agents or dodgy looking bots. More research is awaiting…
Bob DuCharme points out nicely how much the Web of Linked Data has grown in the past year by comparing two versions of Richard Cyganiak’s LOD cloud diagram. It looks pretty impressive when you compare the two versions side by side!
Apparently, the European Semantic Web Conference will be renamed to Extended Semantic Web Conference. That is fantastic news; the original name was so boring. However, renaming to “extended” seems like a lost opportunity to me: the organisers of all major Semantic Web conferences should come together and adopt far more exciting names. Some suggestions came up:
I don’t announce every new addition to the Semantic Web Dog Food Server, but this is a big one: based on the data available from EPrints, we managed to get information about papers and authors for the upcoming WWW2009 in Madrid up as linked data on the dog food server. You can get all the papers, authors and their affiliations, all nicely integrated with the rest of the dog food data from other conferences. You can start browsing here or get a dump of the data. Enjoy!
… now stop infering and get lodding!” A great little (a great little?!) photoshop tribute to the Atheist Bus Campaign in London and elsewhere (now also in Germany). I don’t know exactly where this picture appeared originally – a friend of a friend saw it on Twitter somewhere, and I don’t use Twitter. Anyway, I love it! I also love the fact that we now have a new verb. I wonder how it is inflected? It’s probably regular, so it should look like this:
to lod (verb): lod, lodded, lodding – the act of publishing linked open data on the World Wide Web, adhering to the rules of linked data.
Tim Berners-Lee1 gave an enthusiastic talk about linked data at TED, urging everybody to get their data out there or, if they don’t have any, to demand access to data in a proper format.
Interestingly, he didn’t mention the words “Semantic Web” once during the talk, nor did he ever say “RDF” or even “URI” – instead he spoke about “names starting with ‘http’”. Cool enough, his slides had the dog food data set in them! :)
A video of the talk and a link to the slides can be found on the ebiquity blog.
1 I wish this link would lead me to something nice when I go to it with a Web browser!
I might be a bit late (one month) to discover this, but IT book publisher O’Reilly have recently started a service called O’Reilly Product Metadata Interface (OPMI), which provides RDF metadata for their whole catalogue of books. More details about this can be found on the O’Reilly Labs page.
I think it’s great news that a major publisher is starting to open up their data to the Semantic Web! Term-wise, they do the right thing and use vocabularies that have turned into de-facto standards (FOAF and DC (terms) in particular), as well as some newly coined terms in their own O’Reilly namespace. They also get brownie points for actually making their namespace dereferenceable. Good practice!
There are a few things that could be improved to make their data more useful, though:
urn:x-domain:oreilly.com:agent:pdb:1210. That’s perfectly fine RDF, but it breaks the linked data rules – URIs like that are not dereferenceable, which means it is impossible for interested agents to find out more about those resources.
xmlns:p3="http://purl.org/dc/terms/#". These might be artifacts from the ontology editor they used, though. Not really harmful, just ugly.
I’m currently working on an analysis of the log files of the Semantic Web Dog Food server. Apart from the obvious queries such as “How much traffic was there?”, “When were the peaks in traffic?” or “Where did the traffic come from?”, Semantic Web-type linked data inspires some other questions as well. For example: how intensively was the Semantic Web portion of the data used (i.e., how often was RDF requested compared to HTML), what was the distribution of “semantic” vs. “conventional” user agents, and what kind of data was requested?
Using the techniques described earlier in a post on my Confused Development blog, I sifted through about 7 months’ worth of log files and generated some pretty pictures. Here is what I have come up with so far:
The serving of linked data on the dog food server works through content negotiation – basically, the first request by an agent would be to the URI of the resource (“plain” in the graph), specifying in the header whether an RDF or HTML representation is desired. The server then redirects to either the HTML or RDF document with the desired representation. In theory, this means that requests(rdf) + requests(html) = requests(plain). However, since it is perfectly feasible to request the HTML or RDF documents directly, the total of RDF+HTML is slightly higher. The total numbers are:
| HTML: | 238486 |
| RDF: | 35491 |
| HTML+RDF: | 273977 |
| Plain: | 247576 |
As the graph and the numbers show, the usage in terms of RDF requests is relatively low at the moment, indicating that there is still a long way to go for the Semantic Web to really take off (and that we need to work on making the site more popular).
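To make the arithmetic behind these totals explicit, here is a tiny Python check using only the four numbers from the table. The variable names are mine, and reading the surplus as “direct” document requests is an assumption: it only holds if every plain request was followed by exactly one redirected request.

```python
# Totals taken from the table above.
html = 238486
rdf = 35491
plain = 247576

total_docs = html + rdf             # all requests for HTML or RDF documents
rdf_share = rdf / total_docs        # fraction of document requests that were RDF
surplus = total_docs - plain        # document requests not explained by a redirect

print(f"HTML+RDF: {total_docs}")
print(f"RDF share: {rdf_share:.1%}")
print(f"Surplus over plain requests: {surplus}")
```

So roughly one in eight document requests was for RDF, and the surplus over the plain requests is what makes the RDF+HTML total higher than the number of plain requests.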
This second graph shows the distribution of hits over time for the different kinds of resources which the server offers, as indicated by the requested namespace (dogfood:person, dogfood:conference, …). Interest in people resources is highest almost all of the time. Partially, this may be due to ego surfing of Semantic Web researchers. However, as the graphs below will show, bot traffic far exceeds traffic by human visitors, so my hunch is that the preference for people pages can be explained through the search strategies of the big search engine players out there – people information is probably considered more valuable. Of course, another factor is the fact that there are about three times as many people resources on the dog food server as e.g. conference resources.
Regarding the conference and workshop resources, those need to be examined in a more fine-grained fashion, since the respective namespaces cover everything connected to an event: papers, talks, chairs, the event itself, etc.
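Grouping hits by namespace could look roughly like the sketch below. It assumes the namespace is simply the first segment of the requested path (as in /organization/… earlier); the function names and the sample paths are invented for illustration and are not taken from the real logs.

```python
from collections import Counter


def namespace(path):
    """Return the first path segment, e.g. '/person/x/rdf' -> 'person'."""
    first = path.strip("/").split("/")[0]
    return first or None


def hits_per_namespace(paths):
    """Count requests per namespace, ignoring paths without one (e.g. '/')."""
    return Counter(ns for ns in map(namespace, paths) if ns)


# Hypothetical request paths, only to show the shape of the input.
sample = ["/person/a", "/person/a/rdf", "/conference/x", "/"]
print(hits_per_namespace(sample))
```

A finer-grained breakdown of the conference and workshop namespaces would need to look at more than the first segment, which this sketch does not attempt.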
No self-respecting analysis can live without a nice longtail graph these days. Looking at visiting agents, we get such a distribution (y-scale is logarithmic). The agents in the head are the big search engine crawlers (GoogleBot, Yahoo! Slurp and MSNBot), as well as the big name browsers. In the middle and long tail we find lots and lots of different other bots, crawlers and browsers, as well as various tools, data services and agents that didn’t give themselves a proper identifier and instead just show up as “Java” or “perl-libwww” (very naughty behaviour indeed…).
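The data behind such a longtail graph is just a ranking of agents by number of hits. A minimal sketch, assuming the user agent strings have already been extracted from the log (the function name `rank_agents` and the sample agents are made up for this example):

```python
from collections import Counter


def rank_agents(agents):
    """Return (rank, agent, hits) tuples, most active agent first."""
    counts = Counter(agents)
    return [
        (rank, agent, hits)
        for rank, (agent, hits) in enumerate(counts.most_common(), start=1)
    ]


# Hypothetical agent strings, one per logged request.
sample = ["BotA", "BotA", "BotA", "BrowserB", "BrowserB", "Java"]
for rank, agent, hits in rank_agents(sample):
    print(rank, agent, hits)
```

Plotting hits against rank with a logarithmic y-axis then gives the familiar head-and-tail shape.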
More interesting is probably this graph, which shows the agent distribution after I had sliced and diced it manually according to some criteria:
All this and more will become part of my thesis and also (hopefully) make it into some sort of more polished publication soon.
So, ISWC2008 is over and I’m back in Galway. What did I learn this year?