Listing 21 datasets tagged with "corpora"

Moby Project Word Lists | Added by Infochimps

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Because this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Because this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Because this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 256,700 hyphenated or other entries containing more than one word as well as all capitalized words and acronyms. Phrases were considered ‘common’ if they or variations of them occur in standard dictionaries or thesauruses.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

500k+ pager (sms) messages sent on September 11, 2001, published on Wikileaks.org | Added by mrflip

9/11 tragedy pager intercepts.

The following are more than half a million national US pager intercepts released by wikileaks.org. They cover the September 11 tragedy from 3am that day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washing …

Social Sciences » Sociology

Moby Project Word Lists | Added by Infochimps

This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.

Linguistics » Text Corpora

Moby Project Word Lists | Added by Infochimps

This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

74,550 common dictionary words — A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

10,196 places (places.txt): a large selection of place names in the United States.

Geography » Geographical Names

Moby Project Word Lists | Added by Infochimps

74,550 common dictionary words — A list of words in common with two or more published dictionaries. This gives the developer of a custom spelling checker a good beginning pool of relatively common words.

Linguistics » Word Lists

VoxForge

Free

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

> VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).

> We will make available all submitted audio files under the GPL license, and then ‘compile’ them into acoustic models for use with Open Source …


The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

Overview:

> The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish.

> The goal of th …


The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

From [website](http://ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19):

> The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New Y …


Moby Project Word Lists | Added by Infochimps

1,185 King James Version frequent substrings (KJVfreq.txt): the 1,185 most frequently occurring substrings in the King James Version Bible, counted and ranked in order of frequency.

Linguistics » Word Lists

Moby Project Word Lists | Added by Infochimps

467 current fiction substrings (fiction.txt): the 467 most frequently occurring character sequences (n-grams) in a best-selling 1990 novel by Amy Tan.

Linguistics » Text Corpora