Listing 51 datasets tagged with "corpus"

Moby Project Word Lists | Added by Infochimps

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Because this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 256,700 hyphenated or other entries containing more than one word as well as all capitalized words and acronyms. Phrases were considered ‘common’ if they or variations of them occur in standard dictionaries or thesauruses.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Pete Skomoroch's Bookmarks | Added by Infochimps

The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s hos …

Linguistics

Enron Email Dataset *****

Free

0.5M email messages among managers at Enron Corporation | The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

From the CALO Project at Carnegie Mellon University, a massive dataset of emails recovered from discovery documents in the Enron trials.

  1. About

From distribution page:

> This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that …

Computers » Social Networks

Pete Skomoroch's Bookmarks | Added by Infochimps

Continuing the ICWSM tradition, ICWSM 2009 is making a dataset available to researchers in the blog and social media fields. We invite you to download the dataset, explore it, learn something interesting about it, and submit a paper about it to ICWSM 2009.

Good research topics might include… …


500k+ pager (sms) messages sent on September 11, 2001, published on Wikileaks.org | Added by mrflip

9/11 tragedy pager intercepts.

The following are more than half a million national US pager intercepts released by wikileaks.org. This covers the September 11 tragedy from 3am on the same day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washing …

Social Sciences » Sociology

Pete Skomoroch's Bookmarks | Added by Infochimps


Large text corpus, useful for evaluating text retrieval algorithms | Added by Infochimps

This README file describes all the data files associated with the OHSUMED document collection as it was used for the TREC-9 Filtering Track. Please see “The TREC-9 Filtering Track Final Report” by Stephen Robertson and David A. Hull in the TREC-9 proceedings for a description of the tasks per …

Medicine

Moby Project Word Lists | Added by Infochimps

This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.

Linguistics » Text Corpora

Moby Project Word Lists | Added by Infochimps

74,550 common dictionary words: a list of words found in two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

10,196 places (places.txt): a large selection of place names in the United States.

Geography » Geographical Names

TalkBank **

Free

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

  1. About

About TalkBank:

> The goal of TalkBank is to foster fundamental research in the study of human and animal communication. It will construct sample databases within each of the subfields studying communication. It will use these databases to advance the development of standards and tools …