The Infochimps repository contains thousands of datasets. Many are unique but some are part of a larger collection. Some of the collections we're especially proud of are listed below.
Click one to explore datasets for that collection.
From their website:
CKAN is the Comprehensive Knowledge Archive Network, a registry of open knowledge packages and projects (and a few closed ones)…Those familiar with freshmeat, CPAN or PyPI can think of CKAN as providing an analogous service for open knowledge…CKAN is developed and maintained by the Open Knowledge Foundation. Both the CKAN code and data are open: free for anyone to use and reuse. To find out more check out the CKAN project at "knowledgef …
From Wikipedia:
An infobox on Wikipedia is a consistently formatted table which is present in articles with a common subject to provide summary information consistently between articles or improve navigation to closely related articles in that subject. (An infobox is a generalization of a taxobox (from taxonomy) which summarizes information for an organism or group of organisms.)
Wikipedia Infoboxes are the small tables that appear on the rig …
From the US Census Bureau:
The Statistical Abstract of the United States, published since 1878, is the authoritative and comprehensive summary of statistics on the social, political, and economic organization of the United States.
Use the Abstract as a convenient volume for statistical reference, and as a guide to sources of more information both in print and on the Web.
Sources of data include the Census Bureau, Bureau of Labor Statistics, Bure …
Pete Skomoroch is President and Lead Consultant at Data Wrangling in Arlington, VA, a firm which specializes in mining large datasets to solve problems in search, finance, and recommendation systems.
He maintains an ever-expanding list of datasets (nearly 400 at last count!), which have now been incorporated into the Infochimps repository.
The Moby Project has assembled some of the world’s largest collections of word lists. Sixteen datasets, including common male and female first names, special words for crossword puzzles, commonly misspelled words, and many other collections, are stored in the Infochimps repository.
AggData sells aggregated lists of data, culled from the websites of major companies like Starbucks and Ace Hardware. Their lists are geolocated and include additional information on each branch of each company.
The purpose of Data.gov is to increase public access to high-value, machine-readable datasets generated by the Executive Branch of the Federal Government.
A collection of datasets that link IP address geolocation data from MaxMind to the United States Census 2000 data.
A collection of datasets covering the names, locations, and other details of places around the world.
Datamob aims to show, in a very simple way, how public data sources are being used.
Their listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data.
This is a collection of data from MySpace’s real-time stream API. Bulk dumps, derived datasets, and utility datasets are available here. Developers and academics should find this data useful.
Data.gov.uk seeks to give a way into the wealth of government data. As highlighted by the Power of Information Taskforce, this means it needs to be:
They are drawing on the expertise and wisdom of Sir Tim Berners-Lee and Professor Nigel Shadbolt to publish government data as RDF – enabling data to be linked together.