Listing 9 datasets tagged with "huge"

A record of major league games played from 1871-2008 | Added by Infochimps

The game logs contain a record of major league games played from 1871-2008. At a minimum, they provide a listing of the date and score of each game. Where our research is more complete, we include information such as team statistics, winning and losing pitchers, linescores, attendance, starting pit …

Sports » Baseball

Occurrence counts of tweet tokens: hashtags, URLs, & smileys by hour or month | Twitter Census | Added by Infochimps

This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.

This dataset is a corpus of tokens collected from tweets sent between March 2006 a …

Computers » Social Networks | Social Sciences » Communications | Social Sciences » Sociology | History » Modern History

Freebase Data Dump *****

Free

Added by Infochimps

A data dump of all the current facts and assertions in the Freebase system.

Freebase is an open database of the world's information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archi …

Encyclopedic » Encyclopedias

Added by Infochimps

The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabul …

Encyclopedic » Encyclopedias

DBPedia Main *****

Free

Added by Infochimps

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,0 …

Encyclopedic » Encyclopedias

Federal Climate Complex GSOD (Global Surface Summary of Day) version 7 | Added by Infochimps

The GSOD (Global Daily) Data

The GSOD dataset is from the National Climate Data Center and is downloadable at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/

You can fetch your own copy with

wget -r -l3 --no-clobber --no-parent --no-verbos …

Science » Meteorology

Occurrence counts of tweet tokens: hashtags, URLs, & smileys by hour or month | Twitter Census | Added by Infochimps

This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.

This dataset is a corpus of tokens collected from tweets sent between March 2006 a …

Computers » Social Networks | Social Sciences » Communications | Social Sciences » Sociology | History » Modern History

Occurrence counts of tweet tokens: hashtags, URLs, & smileys by hour or month | Twitter Census | Added by Infochimps

This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.

This dataset is a corpus of tokens collected from tweets sent between March 2006 a …

Computers » Social Networks

Federal Climate Complex GSOD (Global Surface Summary of Day) version 7 | Added by Infochimps

About

This is an extract from the “Global Daily Weather Data from the National Climate Data Center (NCDC)” dataset for just Austin.

Graphs

!http://infochimps.org/static/ga …

Science » Meteorology

FreeBase **

Free

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

  1. Description

“Freebase is an open database of the world’s information. It is built by the community and for the community: free for anyone to query, contribute to, build applications on top of, or integrate into their websites.”

  1. Openness: OPEN
  • License: cc-by + GFDL for wikip …

The Comprehensive Knowledge Archive Network (CKAN) Collection | Added by Infochimps

  1. About

> One web page for every book ever published. It’s a lofty, but achievable, goal.

> To build it, we need hundreds of millions of book records, a brand new database infrastructure for handling huge amounts of dynamic information, a wiki interface, multi-language support, and people w …


Added by mrflip

A USENET corpus (2005-2009)

This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2010, and covers 47860 English language, non-binary-file news groups. Despite our best efforts, this corpus includes a very small number of non-English words, non …

Linguistics » Text Corpora

Added by mrflip

Stack Overflow Creative Commons Data Dump

We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.

All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki: …


Added by mrflip

ICPSR offers more than 500,000 digital files containing social science research data. Disciplines represented include political science, sociology, demography, economics, history, gerontology, criminal justice, public health, foreign policy, terrorism, health and medical care, early education, edu …