Like most respectable geeks, we at Factual get pretty excited about data. And sometimes we get so excited about something that we want to make sure our data geek brethren are aware of it. Today we have something that falls into that category. CommonCrawl.org, a non-profit web crawler, provided a data set of about 4 million websites (primarily hosted at Top Level Domains as well as some popular subdomains) with 30 various attributes. That’s about 350MB, not a shabby corpus of data to be made available to the public. The attributes on these 4 million websites include information on what’s on the page (e.g., “contains a Twitter link”), what technology was used (e.g., “server”), and what crawling rules are set up (e.g., “excludes GoogleBot”). The websites come from the CommonCrawl repository, which consists of over 3 billion URLs, and is a reasonable representation of the Internet, not to mention an interesting slice of what’s happening on the Web.

A couple of interesting things we noticed were…

  • 28% of websites have Google Analytics, which is pretty impressive, while 12% of the sites have AdSense.  (Side note: we’re using the count of GetContentResults=Http-200 for the denominator, since it’s not fair to count the sites that CommonCrawl was unable to get content from.)
  • 5% of websites have a Twitter link and 5% have a Facebook URL, yet only 2% have both a Twitter and a Facebook URL.  It’ll be interesting to see how this changes over time.
  • The top five versions of Apache discovered are 2.2.11 (210,984 instances), 2.2.3 (200,065 instances), 1.3.41 (168,660 instances), 2.2.14 (166,644), and 2.0.52 (97,004 instances).
  • A bunch of very long names had enough linkage to get included in the crawl.  Here’s a fun one:  http://iwillusegooglebeforeaskingdumbquestions.com/.  Apparently they don’t use Google Analytics.
  • You can check out the specific regular expression list on the table; just click on the “Discuss” tab. If you have any suggestions or see something wrong, chime in on the thread and let us know!
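
For anyone who wants to reproduce percentages like the ones above, here is a minimal sketch of the denominator rule from the side note: only rows where GetContentResults is Http-200 are counted. The GetContentResults field and the Http-200 value come from the table; the other column names (HasGoogleAnalytics, HasTwitterLink) and the sample rows are hypothetical stand-ins, so swap in the table's actual attribute names and data.

```python
from typing import Iterable, Mapping


def share_of_fetched(rows: Iterable[Mapping[str, object]], attribute: str) -> float:
    """Percentage of successfully fetched sites for which `attribute` is truthy.

    Only rows with GetContentResults == "Http-200" go into the denominator,
    so sites the crawler could not fetch do not drag the percentage down.
    """
    fetched = [r for r in rows if r.get("GetContentResults") == "Http-200"]
    if not fetched:
        return 0.0
    hits = sum(1 for r in fetched if r.get(attribute))
    return 100.0 * hits / len(fetched)


# Hypothetical sample rows; the boolean column names are assumptions.
sample_rows = [
    {"GetContentResults": "Http-200", "HasGoogleAnalytics": True, "HasTwitterLink": False},
    {"GetContentResults": "Http-200", "HasGoogleAnalytics": False, "HasTwitterLink": True},
    {"GetContentResults": "Http-404", "HasGoogleAnalytics": False, "HasTwitterLink": False},
    {"GetContentResults": "Http-200", "HasGoogleAnalytics": True, "HasTwitterLink": False},
    {"GetContentResults": "Http-200", "HasGoogleAnalytics": False, "HasTwitterLink": False},
]

ga_share = share_of_fetched(sample_rows, "HasGoogleAnalytics")
twitter_share = share_of_fetched(sample_rows, "HasTwitterLink")
print(f"Google Analytics: {ga_share:.1f}% of fetched sites")
print(f"Twitter link: {twitter_share:.1f}% of fetched sites")
```

The Http-404 row is left out of both the numerator and the denominator, which is why the shares are computed over four sites rather than five.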

Since this data set is now on Factual, it is open for the world to share, collaborate, and mash. That’s how our data sets roll! However, unlike most of the tables on Factual, this table is read-only, meaning cells can’t be edited by inputting new data. Since this is CommonCrawl’s analysis, it made more sense to place restrictions on who could change/add to the data.

Of course, there are still plenty of things you can do with the data. And if you want to join or merge data from another table to this one, this won’t impact the original table; you simply have to do a “save as” and effectively fork it. However, if you do this, we suggest that you mention it as a thread in the “Discuss” tab.  That way the community can track the various related data sets.

All of the data is available through the Factual API and anyone with some programming skills can build innovative apps on top of this table and/or any related ones. The sky’s the limit. Indeed, as large as this data set already is, this is just the beginning of life for this table. Hopefully we’ll see it grow and improve, and potentially power exciting products.

We believe that open and collaborative data is not only important in itself but also drives innovation. Removing the hassles associated with data licensing and data curation (verification, de-duping, updating) can free folks to concentrate on building the apps themselves.  Perhaps over time, developers will see data as another layer in the open solution stack. As with the other open software elements in the stack, if others give back to the community, the resources can multiply and quickly surpass closed enterprise versions.

When we shared it with Creative Commons, their CTO Mike Linksvayer emailed us this response: “I am wildly enthusiastic about collaborative, web-scale analysis of the web itself, which is likely the best path to a more complete understanding and appreciation of the impact of Creative Commons. CommonCrawl and Factual are each extremely interesting in this regard, for providing web-scale data and a unique take on collaborative data curation.” We couldn’t have said it better ourselves.

If you have any suggested attributes or just general questions on how you can use the table, feel free to start a thread on the Factual Developer Google Group.

Again, here’s a link to the table. Enjoy it and keep the data coming.

Best,

Bill
VP Developer Platform