infochimps.org - help
faq

What is infochimps.org?

infochimps.org is a community to assemble and interconnect a giant free almanac, with tables on everything you can put in a table—things like a century of hourly weather, every major league baseball game, decades of stock prices, or every US patent filing. Built by data nerds and used by data nerds to house the information you need to power the projects the world needs.

Why is that good?

Exploring rich data is fun, but finding it, formatting it, tagging it with metadata is drudge work barely fit for a trained chimp. And if you want to share a large raw dataset online, you face two troubling prospects: a) that no one will find it, or b) that everyone will find it.

A central, community-driven repository solves these problems, and also presents amazing possibilities. Interconnect the datasets along concepts they share: instead of 100,000 datasets, there’s just one. Study the physics of baseball by comparing the hourly weather during every single baseball game to game outcomes. Uncover political campaign irregularities by comparing neighborhood per-capita income, historical voter trends, and public campaign finance records. Plan real-estate decisions based on what news-and-other-media keywords rank highly in each area. If you’ve read Freakonomics, you know the power of this approach—let’s start building tools that make this way of thinking available to everychimp.
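
To make "interconnect the datasets along concepts they share" concrete, here is a minimal sketch that links baseball games to hourly weather through the two concepts both tables carry: a place and a time. Every row, field name, and the `join_games_to_weather` helper are made up for illustration; none of them come from an actual infochimps dataset.

```python
# Hypothetical sample rows, invented for this sketch.
games = [
    {"game_id": "G1", "city": "Boston", "start_hour": "2008-07-04T13:00", "home_runs": 3},
    {"game_id": "G2", "city": "Denver", "start_hour": "2008-07-04T19:00", "home_runs": 5},
    {"game_id": "G3", "city": "Chicago", "start_hour": "2008-07-05T19:00", "home_runs": 1},
]
weather = [
    {"city": "Boston", "hour": "2008-07-04T13:00", "temp_f": 81},
    {"city": "Denver", "hour": "2008-07-04T19:00", "temp_f": 90},
]


def join_games_to_weather(games, weather):
    """Attach the weather reading for each game's city and starting hour."""
    # Index weather by the shared concepts (place, time) for quick lookup.
    by_place_and_hour = {(w["city"], w["hour"]): w for w in weather}
    joined = []
    for game in games:
        reading = by_place_and_hour.get((game["city"], game["start_hour"]))
        if reading is None:
            continue  # no weather reading for this game, so leave it out
        joined.append({**game, "temp_f": reading["temp_f"]})
    return joined


linked = join_games_to_weather(games, weather)
for row in linked:
    print(row["game_id"], row["temp_f"], row["home_runs"])
```

The join only works because both tables agree on what "city" and "hour" mean and how they are written, which is exactly the drudge work a shared repository could do once for everybody.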

Can’t you already get this data elsewhere?

Yes, but it’s often trapped behind large bureaucratic and monetary barriers. We’re talking 100- to 10,000-times markup (PriceOfFreeData) over the raw bandwidth charge for freely redistributable data gathered at taxpayer expense. Not to mention the hassles with formatting, and converting, and finding, and sharing, and …

Aren’t (these other guys) doing this, and better?

Yes—freebase.com is, and so are swivel.com, numbrary.com, CKAN.net, dbpedia.com, and a bunch of others, and all in their own way much, much better than this site. There’s a community of us hanging out at theinfo.org, and we’re all working together, because this job is way too big to be solved by any one group or any finite number of monkeys.

The virtues of infochimps.org lie in its suckiness:
  • it’s messy: we loosely couple data, make it discoverable, make it publicly curated, make it interconnect—but impose no strict structure or format or means of description.
  • it’s stupid: there’s no live access. You download data to play with on your machine, using your tools, immediately—no sandbox, no further rights and restrictions, no APIs, no network latency.
  • it’s incompetent: there’s no one specific knowledge domain specialty, such as economic or astronomical or sports or social network graph data. But simply aggregating the data and giving immediate access can inspire connections among them all. And once our metadata curation tools go live, you’ll be able to nimbly traverse from dataset to dataset by the concepts they share.

The other important feature of infochimps.org is its essential poverty. We’re a community effort beholden to no one, and everything we produce is and will remain free. Only the cooperation of the community (this means you, chimpy) can ensure its success—if you have resources or talent to provide, please Contact us.

Sharing data

Why should I upload data?

Sharing a large raw dataset presents two troubling prospects: a) no one will find it, or b) everyone will find it.

Infochimps.org lets you make interesting data available to all. You get the credit, you get the bandwidth off your server, and the world gets a little bit smarter. Instead of hundreds of thousands of datasets scattered all over the web, there should be Just One Dataset, with open formats, interlocking fields, and a finite number of infochimps helping to organize and distribute it.

How do I edit a dataset? How do I upload data?

Painfully; here’s how (HOWTO Upload)

You can’t edit a dataset online, yet—but if you usefully convert or reformat a dataset, or add information to the dataset’s Infochimps Simple Schema file (the one ending in .icss.yaml), then for now just re-upload it.

About the data

What kinds of datasets can go on infochimps.org?

If it’s broadly interesting, we want to host it. Unless it’s interesting and 20GB large, in which case we want to point to it.

Now “Broadly Interesting” has a certain restricted meaning considering what we’re talking about, but if you browse the existing collection you’ll get a sense of it.

Here’s a useful rule of thumb: would a motivated geeky person from a different field or region of interest find this useful?

A table giving the best known values for all the physical constants and fundamental particles is highly desirable; a petabyte of raw sensor values from the LHCb detector at CERN’s Large Hadron Collider is outside our scope. Weekly water consumption for each major metropolitan area is interesting, but a three-year table showing how much your water bill was each month is not. (Of course, if it were something awesomely obsessive like “everything you did, saw and spent money on for a year” then it’s interesting again.)

What kinds of data do we want to have on infochimps.org?

The broad goal is to build a repository of data that helps you discover, share and download raw datasets that are:

  • Open: Share, give credit where credit is due, and respect existing restrictions. Other than that, here it is and have fun.
  • Free: Within the reasonable limits of our server costs, datasets are provided at no cost.
  • Descriptive: Fields show the real-world objects they describe. They arrive labelled with their type, representation and measurement units, and tags explaining what they mean. Fields shouldn’t just be “ints” or “strings”, they should be concepts like ‘location’, ‘time’, ‘baseball team’, ‘IP address’.
  • Universal: Help get the world’s data into universal and transmutable formats like XML, YAML and JSON, and make it so we can stop parsing flat files.
  • Verifiable: All major contributors and sources listed. If you need to trace the provenance of a dataset—whether to verify information or to shower them with thanks—it’s all right there.
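
As a rough illustration of the "Descriptive" and "Universal" goals above, the sketch below describes each field by concept, representation, units and tags, flags fields that only state a storage type, and writes the descriptions out as JSON. The `FieldDescription` class and its attribute names are assumptions made up for this example; they are not the actual Infochimps Simple Schema format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldDescription:
    """Hypothetical field descriptor, invented for this sketch."""
    name: str
    concept: str          # real-world meaning, e.g. 'time' or 'baseball team'
    representation: str   # how the value is stored, e.g. 'float'
    units: str = ""       # measurement units; empty when not applicable
    tags: list = field(default_factory=list)


def fields_lacking_a_concept(fields):
    """Return names of fields whose 'concept' is just a bare storage type."""
    bare_types = {"", "int", "float", "string"}
    return [f.name for f in fields if f.concept in bare_types]


fields = [
    FieldDescription("temp", "air temperature", "float", "degrees Fahrenheit", ["weather"]),
    FieldDescription("team", "baseball team", "string", tags=["sports"]),
    FieldDescription("col3", "int", "int"),  # tells a reader nothing useful
]

# A structured format such as JSON keeps the descriptions machine-readable.
as_json = json.dumps([asdict(f) for f in fields], indent=2)
print(fields_lacking_a_concept(fields))
print(as_json)
```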

Understand that if it’s interesting we’ll take it how it stands. If other people agree it’s interesting, they’ll enrich it or make it more useful and share it back. All that’s really needed to host a dataset is its title, a brief description, and most importantly pointers to who gathered the data and the source that distributed it.
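
The bare minimum described above (a title, a brief description, who gathered the data, and the source that distributed it) can be sketched as a simple check. The key names and the `missing_metadata` helper are hypothetical and chosen only for this example; the site does not necessarily validate uploads this way.

```python
# Hypothetical key names for the four things a dataset needs to be hosted.
REQUIRED_KEYS = ("title", "description", "gathered_by", "source")


def missing_metadata(record):
    """List the required keys that are absent or blank in a dataset record."""
    missing = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.append(key)
    return missing


complete = {
    "title": "Weekly water consumption by metropolitan area",
    "description": "Weekly totals for each major metro area.",
    "gathered_by": "An example water agency",
    "source": "An example data portal",
}
incomplete = {"title": "My water bills", "description": "  "}

print(missing_metadata(complete))
print(missing_metadata(incomplete))
```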

You have disappointed me in the following ways:

I found something that doesn’t work.

Check the What works / What doesn’t list; there’s a lot of stuff that needs a-fixin’, as the site is still very new.

If some part of the site is obviously and horribly broken, please post a report or Contact us directly.

The site is called infochimps, but you have pictures showing monkeys and bonobos and gorillas. Geez, don’t you know the difference?

In order to fuel rapid growth and encourage the far-flung tribe of infochimps, we’re outsourcing across the whole simian family. Reportedly many humans are contributing to the site as well.


About the site

How infochimps.org was built

“...considering what we’re talking about”: semantically endowed highly dimensional collections of data with a useful ontological framework interconnecting them.