<?xml version="1.0" encoding="UTF-8"?>
<dataset>
  <cached-score type="decimal">114.3</cached-score>
  <created-at type="datetime">2009-11-11T14:41:34Z</created-at>
  <id type="integer">11897</id>
  <main-link>http://www.twitter.com/</main-link>
  <owner-id type="integer">602</owner-id>
  <protected type="boolean">true</protected>
  <subtitle>Occurrence counts of tweet tokens: hashtags, URLs, &amp; smileys by hour or month</subtitle>
  <title>Twitter Census :: Conversation Metrics - One year of URLs, Hashtags, Smileys usage by hour</title>
  <updated-at type="datetime">2010-01-31T10:24:47Z</updated-at>
  <tag-list>bigdata,emoticon,hashtag,hours,huge,months,networking,smiley,social,socialnetwork,social_network,tokens,twitter,twitter.com,url,words</tag-list>
  <categories type="array">
    <category>
      <datasets-count type="integer">0</datasets-count>
      <id type="integer">36</id>
      <parent-id type="integer">30</parent-id>
      <title>Social Networks</title>
    </category>
  </categories>
  <sources type="array">
    <source>
      <created-at type="datetime">2009-11-11T14:56:02Z</created-at>
      <id type="integer">13432</id>
      <main-link>http://www.infinitemonkeywrench.com/</main-link>
      <title>Monkeywrench Consultancy</title>
      <updated-at type="datetime">2009-12-16T06:55:53Z</updated-at>
      <description>The Monkeywrench Consultancy is an organization dedicated to producing and providing analytics for valuable data.</description>
    </source>
    <source>
      <created-at type="datetime">2009-11-11T14:56:58Z</created-at>
      <id type="integer">13433</id>
      <main-link>http://apiwiki.twitter.com/</main-link>
      <title>Twitter API</title>
      <updated-at type="datetime">2009-12-16T06:55:57Z</updated-at>
      <description>The Twitter API currently consists of two discrete APIs. Most application developers mix and match the APIs to produce their application. The separation of the REST and Search APIs is less than ideal and is entirely due to history. It is in our pipeline to ameliorate Twitter's API by combining the Search and REST pieces as development cycles allow. The API Overview portion of the Getting Started series explains the history.</description>
    </source>
  </sources>
  <collection>
    <created-at type="datetime">2009-11-11T14:38:14Z</created-at>
    <id type="integer">7</id>
    <title>Twitter Census</title>
  </collection>
  <license>
    <main-link></main-link>
    <title>Monkeywrench Consultancy License</title>
    <description>This data is not re-distributable in bulk form.  
This data may be used to produce derivative works and to power applications for commercial gain.
Any API or service built on this dataset may not have the same effect as re-distributing this data in bulk.</description>
  </license>
  <notes type="array">
    <note>
      <body>This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy.  The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.

This dataset is a corpus of tokens collected from tweets sent between March 2006 and November 2009.  A &quot;token&quot; is either a hashtag (#data), a URL, or an emoticon (smiley face -- ;)).  Think about comparing this data to the stock market, new movies, new video games, or even trendingtopics.org.  For example, use it to track the adoption of Google Wave across the social network through the rate of its mentions.

The tokens are binned by hour and by month, and the occurrence count of each token in each bin is given.

The Monkeywrench Consultancy will produce custom slices, subscription services, and analysis of the full scrape for a fee.  Please contact &quot;imw@infochimps.org&quot;:mailto:imw@infochimps.org for more information.</body>
      <created-at type="datetime">2009-11-11T14:41:34Z</created-at>
      <id type="integer">58825</id>
      <title>Description</title>
      <updated-at type="datetime">2010-01-26T07:05:40Z</updated-at>
    </note>
  </notes>
  <payloads type="array">
    <payload>
      <created-at type="datetime">2009-11-11T17:02:42Z</created-at>
      <fmt>tsv</fmt>
      <id type="integer">15372</id>
      <num-files type="integer" nil="true"></num-files>
      <num-records type="integer">41746479</num-records>
      <owner-id type="integer">602</owner-id>
      <packaged-at type="datetime" nil="true"></packaged-at>
      <path nil="true"></path>
      <pkg-size type="integer" nil="true"></pkg-size>
      <price type="integer">100000</price>
      <protected type="boolean">true</protected>
      <title>Token Counts by Month</title>
      <updated-at type="datetime">2009-11-20T21:12:57Z</updated-at>
      <description>The data comes in three files:

h3. tokens_by_month-20091111.tsv

This file gives the occurrence count of each token for each month.  It has fields

* token_type -- one of &quot;hashtag&quot;, &quot;url&quot;, or &quot;smiley&quot;, identifying the token type
* year_and_month -- the month to which the token count refers
* count -- the number of occurrences for that month across all collected tweets
* token -- the actual text of the token
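
A minimal sketch of reading rows with these fields, assuming the columns are tab-separated in the order listed above and that the file has no header row (neither is stated here). The name read_monthly_counts and the sample row are hypothetical, and year_and_month is kept as text because its exact format is not given.

```python
import csv
import io
from typing import NamedTuple


class MonthlyTokenCount(NamedTuple):
    token_type: str      # "hashtag", "url", or "smiley"
    year_and_month: str  # kept as text; its exact format is not stated here
    count: int
    token: str


def read_monthly_counts(lines):
    # Assumption: tab-separated columns in the listed order, no header row.
    # QUOTE_NONE keeps quote characters inside tokens untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for token_type, year_and_month, count, token in reader:
        yield MonthlyTokenCount(token_type, year_and_month, int(count), token)


# Made-up sample row, for illustration only.
sample = io.StringIO("hashtag\t200910\t42\t#data\n")
rows = list(read_monthly_counts(sample))
```

For the real file, an open file handle can be passed in place of the in-memory sample.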

h3. total_tokens_by_hour-20091111.tsv

This file gives the total number of tokens of each type for each hour.  It has fields

* token_type -- one of &quot;hashtag&quot;, &quot;url&quot;, or &quot;smiley&quot; describing the token type
* date_with_hour -- the date and hour to which the count refers (in the format 2009120113, for 1pm on December 1st, 2009).  All times are UTC.
* count -- the number of tokens of this type for this hour
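
A minimal sketch of turning a date_with_hour value into a timestamp, assuming the ten digits read as year, month, day, and hour (YYYYMMDDHH), which is inferred from the example value rather than stated outright. The function name parse_date_with_hour is hypothetical.

```python
from datetime import datetime, timezone


def parse_date_with_hour(value):
    # Assumption: the layout is YYYYMMDDHH; times are UTC per the field notes.
    parsed = datetime.strptime(value, "%Y%m%d%H")
    return parsed.replace(tzinfo=timezone.utc)


stamp = parse_date_with_hour("2009120113")
print(stamp.isoformat())
```

Attaching the UTC timezone explicitly avoids accidental mixing with local times when joining against other time series.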

h3. tweet_coverage-20091111.tsv

This file gives the count of collected tweets by hour.  It has fields

* date_with_hour -- the date and hour to which the count refers (in the format 2009120113, for 1pm on December 1st, 2009).  All times are UTC.
* first_tweet_id_found -- the id of the first tweet collected during this hour.
* tweet_count -- the number of tweets collected during this hour.
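
A rough sketch of how these fields might be combined into a per-hour coverage estimate. It assumes tweet ids were assigned roughly sequentially over this period, so the gap between the first ids of two consecutive hours approximates the number of tweets posted in the earlier hour. The function name estimate_coverage and the sample values are made up for illustration.

```python
def estimate_coverage(rows):
    # rows: (date_with_hour, first_tweet_id_found, tweet_count) tuples,
    # sorted by hour, with consecutive hours adjacent in the list.
    estimates = {}
    for current, following in zip(rows, rows[1:]):
        hour, first_id, collected = current
        # Assumption: ids are sequential, so this gap approximates
        # the number of tweets posted during the current hour.
        id_span = following[1] - first_id
        if id_span > 0:
            estimates[hour] = collected / id_span
    return estimates


# Made-up sample values, for illustration only.
sample = [("2009120113", 1000, 250), ("2009120114", 2000, 300)]
coverage = estimate_coverage(sample)
```

The last hour in the input gets no estimate, because there is no following hour to bound its id range.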

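The difference between the id of the first tweet collected in one hour and the id of the first tweet collected in the next hour approximates the number of tweets posted during that hour.  Comparing this difference to the number of tweets collected in the hour gives an estimate of the coverage of the scrape during that hour.</description>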
    </payload>
    <payload>
      <created-at type="datetime">2009-11-11T16:56:01Z</created-at>
      <fmt>tsv</fmt>
      <id type="integer">15371</id>
      <num-files type="integer" nil="true"></num-files>
      <num-records type="integer">142531101</num-records>
      <owner-id type="integer">602</owner-id>
      <packaged-at type="datetime" nil="true"></packaged-at>
      <path nil="true"></path>
      <pkg-size type="integer" nil="true"></pkg-size>
      <price type="integer">800000</price>
      <protected type="boolean">true</protected>
      <title>Token Counts by Hour</title>
      <updated-at type="datetime">2009-11-20T21:08:01Z</updated-at>
      <description>The data comes in three files:

h3. tokens_by_hour-20091111.tsv

This file gives the occurrence count of each token for each hour.  It has fields

* token_type -- one of &quot;hashtag&quot;, &quot;url&quot;, or &quot;smiley&quot;, identifying the token type
* date_with_hour -- the date and hour to which the count refers (in the format 2009120113, for 1pm on December 1st, 2009).  All times are UTC.
* count -- the number of occurrences for that hour across all collected tweets
* token -- the actual text of the token

h3. total_tokens_by_hour-20091111.tsv

This file gives the total number of tokens of each type for each hour.  It has fields

* token_type -- one of &quot;hashtag&quot;, &quot;url&quot;, or &quot;smiley&quot; describing the token type
* date_with_hour -- the date and hour to which the count refers (in the format 2009120113, for 1pm on December 1st, 2009).  All times are UTC.
* count -- the number of tokens of this type for this hour

h3. tweet_coverage-20091111.tsv

This file gives the count of collected tweets by hour.  It has fields

* date_with_hour -- the date and hour to which the count refers (in the format 2009120113, for 1pm on December 1st, 2009).  All times are UTC.
* first_tweet_id_found -- the id of the first tweet collected during this hour.
* tweet_count -- the number of tweets collected during this hour.

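The difference between the id of the first tweet collected in one hour and the id of the first tweet collected in the next hour approximates the number of tweets posted during that hour.  Comparing this difference to the number of tweets collected in the hour gives an estimate of the coverage of the scrape during that hour.</description>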
    </payload>
    <payload>
      <created-at type="datetime">2009-11-11T18:49:38Z</created-at>
      <fmt>tsv</fmt>
      <id type="integer">15374</id>
      <num-files type="integer" nil="true"></num-files>
      <num-records type="integer">1479</num-records>
      <owner-id type="integer">602</owner-id>
      <packaged-at type="datetime" nil="true"></packaged-at>
      <path nil="true"></path>
      <pkg-size type="integer" nil="true"></pkg-size>
      <price type="integer" nil="true"></price>
      <protected type="boolean">true</protected>
      <title>Smiley Counts</title>
      <updated-at type="datetime">2009-11-20T20:39:00Z</updated-at>
      <description>Counts of all emoticons used on Twitter, March 2006 to November 2009.</description>
    </payload>
  </payloads>
</dataset>
