The game logs contain a record of major league games played from 1871-2008. At a minimum, it provides a listing of the date and score of each game. Where our research is more complete, we include information such as team statistics, winning and losing pitchers, linescores, attendance, starting pit …
This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.
This dataset is a corpus of tokens collected from tweets sent between March 2006 a …
A data dump of all the current facts and assertions in the Freebase system.
Freebase is an open database of the worlds information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archi …
The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted intabul …
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films, 20,0 …
The GSOD dataset is from National Climate Data Center, and downloadable at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/
You can fetch your own copy with
wget -r -l3 —no-clobber —no-parent —no-verbos …This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.
This dataset is a corpus of tokens collected from tweets sent between March 2006 a …
This data comes from a scrape of the Twitter social network conducted by the Monkeywrench Consultancy. The full scrape consists of 35 million users, 500 million tweets, and 1 billion relationships between users.
This dataset is a corpus of tokens collected from tweets sent between March 2006 a …
This is an extract from the “Global Daily Weather Data from the National Climate Data Center (NCDC)” dataset for just austin.
!http://infochimps.org/static/ga …
“Freebase is an open database of the world’s information. It is built by the community and for the community—free for anyone to query, contribute to, built applications on top of, or integrate into their websites.”
> One web page for every book ever published. It’s a lofty, but achievable, goal.
> To build it, we need hundreds of millions of book records, a brand new database infrastructure for handling huge amounts of dynamic information, a wiki interface, multi-language support, and people w …
A USENET corpus (2005-2009)
This corpus is a collection of public USENET postings. This corpus was collected between Oct 2005 and Jan 2010, and covers 47860 English language, non-binary-file news groups. Despite our best effots, this corpus includes a very small number of non-English words, non …
Stack Overflow Creative Commons Data Dump
We decided early on that all user-generated content on Stack Overflow would be under a Creative Commons license.
All those great Stack Overflow questions, answers, and comments, so generously contributed by all of you, are licensed under cc-wiki: …
ICPSR offers more than 500,000 digital files containing social science research data. Disciplines represented include political science, sociology, demography, economics, history, gerontology, criminal justice, public health, foreign policy, terrorism, health and medical care, early education, edu …