Listing 21 datasets tagged with "corpora"

Word List - 100,000+ official crossword words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crossword words. A list of words permitted in crossword games such as Scrabble™, compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 100,000+ official crossword words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crossword words. A list of words permitted in crossword games such as Scrabble™, compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 100,000+ official crossword words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crossword words. A list of words permitted in crossword games such as Scrabble™, compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list includes all inflected forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 250,000+ Hyphenated, Capitalized and Compound English words *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 256,700 hyphenated or other entries containing more than one word as well as all capitalized words and acronyms. Phrases were considered ‘common’ if they or variations of them occur in standard dictionaries or thesauruses.

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Text Messages sent on 9/11/2001 (wikileaks.org) *****

500k+ pager (sms) messages sent on September 11, 2001, published on Wikileaks.org | Added by mrflip 4 months ago

9/11 tragedy pager intercepts.

More than half a million US national pager intercepts released by wikileaks.org. They cover the September 11 tragedy from 3am on the same day (Tuesday) until 3am the following day, a 24-hour period surrounding the attacks in New York and Washing …

Social Sciences » Sociology

Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format) ****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words, drawn from a wide variety of common texts and listed in decreasing order of frequency.

Linguistics » Word Lists

Word List - 1000 Most Frequent Words from an Internet Corpus ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.

Linguistics » Text Corpora

Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words, drawn from a wide variety of common texts and listed in decreasing order of frequency.

Linguistics » Word Lists

Word List - 74,000+ Common English Dictionary Words (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

74,550 common dictionary words: a list of words that appear in two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.

Linguistics » Word Lists

Word List - 10,000+ Common Place Names ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

10,196 places (places.txt): a large selection of place names in the United States.

Geography » Geographical Names

Word List - 74,000+ Common English Dictionary Words (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

74,550 common dictionary words: a list of words that appear in two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.

Linguistics » Word Lists

VoxForge **

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps 10 months ago

> VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines (on Linux, Windows and Mac).

> We will make available all submitted audio files under the GPL license, and then ‘compile’ them into acoustic models for use with Open Source …


Statistical Machine Translation - Europarl Parallel Corpus **

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps 10 months ago

Overview:

> The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish.

> The goal of th …


The New York Times Annotated Corpus **

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps 10 months ago

From [website](http://ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T19):

> The New York Times Annotated Corpus contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New Y …


Word List - 1,000+ Most Frequent words in King James Bible *

Moby Project Word Lists | Added by Infochimps almost 2 years ago

1,185 King James Version frequent substrings (KJVfreq.txt): the 1,185 most frequently occurring substrings in the King James Version Bible, ranked in order of frequency with their counts.

Linguistics » Word Lists

Letter frequency - Substring frequency in an Amy Tan Novel *

Moby Project Word Lists | Added by Infochimps almost 2 years ago

467 current fiction substrings (fiction.txt): the 467 most frequent character sequences (n-grams) in a best-selling 1990 novel by Amy Tan.

Linguistics » Text Corpora
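
For readers who want to build a similar substring-frequency list from their own text, here is a minimal Python sketch. It is an assumption about the general approach (counting fixed-length character n-grams), not the method the Moby Project actually used; the function name `top_ngrams` and the sample string are hypothetical.

```python
from collections import Counter


def top_ngrams(text, n, k):
    """Return the k most frequent character n-grams in text as (ngram, count) pairs.

    Hypothetical helper for illustration: it slides a window of length n
    over the text and counts every substring it sees, spaces included.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(k)


# Illustrative input only; a real run would read a whole book or corpus file.
sample = "the cat and the hat"
for ngram, count in top_ngrams(sample, 3, 2):
    print(repr(ngram), count)
```

A published list may well mix several substring lengths or normalize case and punctuation first, so results from this sketch would not necessarily match the files listed above.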