Category: Linguistics (30 datasets)


Linguistic Data Consortium (LDC) - Collection of Linguistic Corpora and Datasets *****

Pete Skomoroch's Bookmarks | Added by Infochimps 11 months ago

The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC’s hos …

Linguistics

Dataset Title

Added by Infochimps 3 months ago

Linguistics

MySpace User Activity Stream: Cumulative word count from Dec 2009 to March 2010 *****

MySpace Real-Time Stream | Added by MonkeywrenchConsultancy 5 days ago

This data is derived from the MySpace real-time stream API. The word count is from the free-form text fields: MySpace moods, forum topic titles, replies to forum topics, text from sharing a link or item, and status mood updates. For the last three months the words from these fields have been extra …

Computers » Social Networks | Linguistics

MySpace User Activity Stream: Word count by day from December 2009-March 2010 *****

MySpace Real-Time Stream | Added by MonkeywrenchConsultancy 5 days ago

This data is derived from the MySpace real-time stream API. The word count is from the free-form text fields: MySpace moods, forum topic titles, replies to forum topics, text from sharing a link or item, and status mood updates. For the last three months the words from these fields have been extra …

Computers » Social Networks | Linguistics

MySpace User Activity Stream: Word count by hour from December 2009-March 2010 *****

MySpace Real-Time Stream | Added by MonkeywrenchConsultancy 5 days ago

This data is derived from the MySpace real-time stream API. The word count is from the free-form text fields: MySpace moods, forum topic titles, replies to forum topics, text from sharing a link or item, and status mood updates. For the last three months the words from these fields have been extra …

Computers » Social Networks | Linguistics

Word List - List of Acronyms

Moby Project Word Lists | Added by Infochimps almost 2 years ago

6,213 acronyms (acronyms.txt): common acronyms and abbreviations.

Linguistics » Word Lists

Word List - 350,000+ Words

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 74,000+ Common English Dictionary Words (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

74,550 common dictionary words: a list of words common to two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.

Linguistics » Word Lists

Word List - 1,000+ Most Frequent words in King James Bible *

Moby Project Word Lists | Added by Infochimps almost 2 years ago

1,185 King James Version frequent substrings (KJVfreq.txt): the 1,185 most frequently occurring substrings in the King James Version Bible, ranked and counted in order of frequency.

Linguistics » Word Lists

Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.

Linguistics » Word Lists

Word List - 21,000+ Common Given Names (US & Great Britain)

Moby Project Word Lists | Added by Infochimps almost 2 years ago

21,986 names (names.txt) This database contains the most common names used in the United States and Great Britain. Spelling checkers may want to supplement their basic word list with this one.

Linguistics » Word Lists

Word List - 4,900+ Common Female Given Names (English-speaking Countries)

Moby Project Word Lists | Added by Infochimps almost 2 years ago

4,946 female names (names-f.txt): frequent given names of females in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.

Linguistics » Word Lists

Word List - 3,800+ Common Male Given Names (English-speaking Countries)

Moby Project Word Lists | Added by Infochimps almost 2 years ago

3,800 male names: frequent given names of males in English-speaking countries. Spelling checkers may want to supplement their basic word list with this one.

Linguistics » Word Lists

Word List - 250,000+ Hyphenated, Capitalized and Compound English words *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 256,700 hyphenated or other entries containing more than one word as well as all capitalized words and acronyms. Phrases were considered ‘common’ if they or variations of them occur in standard dictionaries or thesauruses.

Linguistics » Word Lists

Word List - Commonly Misspelled English Words

Moby Project Word Lists | Added by Infochimps almost 2 years ago

366 often misspelled words (oftenmis.txt): many of the most commonly misspelled words in English-speaking countries.

Linguistics » Word Lists

Computer hacker wordlists from packetstormsecurity.org **

Pete Skomoroch's Bookmarks | Added by Infochimps 11 months ago

Linguistics » Word Lists

Wordnet *****

A large lexical database of English | The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps 10 months ago

“WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical …

Linguistics » Word Lists

Word List - 100,000+ official crossword words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 350,000+ Simple English Words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

Over 354,000 single words, excluding proper names, acronyms, or compound words and phrases. This list does not exclude archaic words or significant variant spellings.

Linguistics » Word Lists

Word List - 74,000+ Common English Dictionary Words (with Definitions, Excel format) ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

74,550 common dictionary words: a list of words common to two or more published dictionaries. This gives the developer of a custom spelling checker a good starting pool of relatively common words.

Linguistics » Word Lists

Word List - 100,000+ official crossword words (with Definitions, Excel format) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 100,000+ official crossword words (Excel readable) *****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

113,809 official crosswords: a list of words permitted in crossword games such as Scrabble™. Compatible with the first edition of the Official Scrabble Players Dictionary™. Since this list has all forms of words (-ing, -ed, -s, and so on), it makes a good addition when building a custom spell …

Linguistics » Word Lists

Word List - 1,000 Most Frequently Used English Words by Frequency (with Definitions, Excel format) ****

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words from a wide variety of common texts, listed in decreasing order of frequency.

Linguistics » Word Lists

The Quantz Corpus (Dinosaur Comics)

Added by doncarlo 5 days ago

This is all the text from every Dinosaur Comic ever made in convenient XML format. It was released by the author, Ryan North, as a tool to help solve an anagram presented in the comic for March 1, 2010. The text was also sort …

Computers » Internet | Linguistics » Word Lists | Linguistics » Text Corpora | Linguistics » Transcript Corpora

Letter frequency - Substring frequency in an Amy Tan Novel *

Moby Project Word Lists | Added by Infochimps almost 2 years ago

467 current fiction substrings (fiction.txt): the 467 most frequently occurring character sequences (n-grams) in a best-selling novel by Amy Tan in 1990.

Linguistics » Text Corpora

Word List - 1000 Most Frequent Words from an Internet Corpus ***

Moby Project Word Lists | Added by Infochimps almost 2 years ago

This file consists of the 1,000 most frequently used English words as used on the Internet computer network in 1992.

Linguistics » Text Corpora

Web FAQ collection | ILPS **

Pete Skomoroch's Bookmarks | Added by Infochimps 11 months ago

Linguistics » Text Corpora

A list of all 22,802 words in the Scribblenauts dictionary. *****

Added by mrflip 6 months ago

List of summonable objects from the Nintendo DS game Scribblenauts, from AARDVARK, ABOMINABLE SNOWMAN and ABSCONDER to ZOMBIE, ZUNICERATOPS and ZYGOTE.

via the Scribblenauts Wikipedia entry:

Scribbl …

Linguistics » Text Corpora