The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s hos …
From the CALO Project at Carnegie Mellon University: a massive dataset of emails recovered from discovery documents in the Enron trials.
From the distribution page:
> This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that …
9/11 tragedy pager intercepts.
The following are more than half a million national US pager intercepts released by wikileaks.org. They cover the September 11 tragedy from 3am on the same day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washing …
This README file describes all the data files associated with the
OHSUMED document collection as it was used for the TREC-9
Filtering Track. Please see “The TREC-9 Filtering Track Final
Report” by Stephen Robertson and David A. Hull in the TREC-9
proceedings for a description of the tasks per …
A capture of all tweets from Twitter’s sample feed during the 2010 State of the Union address. Tweets are in JSON format. The feed is described here: http://apiwiki.twitter.com/Streaming-API-Documentation#statuses/sample.
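
Since the tweets are in JSON, a reader might load them along the following lines. This is only a sketch under assumptions the description does not confirm: that the capture stores one JSON object per line (as a streaming feed typically delivers them), that tweet objects carry a `text` field, and that the file may also contain blank lines or non-tweet records. The helper name `iter_tweets` and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line that parses as a JSON object with a "text" field.

    Assumes line-delimited JSON; this layout is a guess, not something the
    dataset description states.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if isinstance(record, dict) and "text" in record:
            yield record  # keep only records that look like tweets


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the real capture file.
    sample = [
        '{"id": 1, "text": "Watching the address tonight"}',
        "",
        '{"delete": {"status": {"id": 2}}}',
        '{"id": 3, "text": "Live from the chamber"}',
    ]
    for tweet in iter_tweets(sample):
        print(tweet["text"])
```

To read the actual capture, the same function could be passed an open file handle instead of the sample list, since iterating a text file yields its lines.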