Five Public Datasets, and Lots of Ideas for Exploring Them

The world is full of interesting datasets. But even though data is increasingly accessible, it’s sometimes hard to think up an interesting problem to analyze. Maybe there are just too many possible questions, maybe it’s a pain to set up analytical tools, or maybe it’s just too easy to get distracted by animal GIFs.

Whatever the case, we want to make it easier to start working on interesting problems right away. Here are five datasets, already loaded into Mode’s public database, that you can query, analyze, and visualize right now.

For each dataset, I’ve provided a link to the table in Mode’s public data warehouse. If you’re feeling lazy and only want to work with a tiny amount of data (as in, one row), I found the best single row of data from each dataset. And if you’re feeling ambitious—and want to get popular on the internet or explain some things—I added some ideas for turning these datasets into maps.

FEC Campaign Finance Data

The Federal Election Commission requires candidates to make their campaign expenditures public. This dataset includes over 200,000 campaign expenditures from the 2012 U.S. presidential campaign, and is full of fascinating discoveries. Like Herman Cain’s $150,000 expense on Herman Cain. And the $5,000 (the most of any candidate by far) that Mitt Romney spent at liquor stores. And Ron Paul’s and Romney’s addiction to fast food (and Obama’s clear preference for Subway).

Herman Cain be like: treat yo self

  • What kinds of questions can I ask? What do candidates spend the most money on? Do some candidates get a spending lead early, while others save for the end? How do spending patterns differ for the incumbent compared to challengers?
  • What’s the table called? cooldata.fec_2012_presidential_campaign_expenditures
  • What’s the best row? wut.
  • Can I use this data to make a map? Absolutely. You could see how money was spent in each state. Or you could see if different candidates focus on different regions. Or, if you were feeling particularly ambitious, you could map candidate expenses by day to sketch out how they traveled across the country during their campaigns.
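
If you want a feel for the state-by-state question before opening the real table, here is a minimal sketch in Python using an in-memory SQLite database. The rows are made up, and the column names (candidate, purpose, state, amount) are my assumptions, not the actual schema of the table above. The SQL in the middle is the part you would adapt.

```python
import sqlite3

# Made-up sample rows. The column names (candidate, purpose, state, amount)
# are assumptions for illustration, not the real table's schema.
SAMPLE_ROWS = [
    ("Candidate A", "travel", "IA", 1200.0),
    ("Candidate A", "advertising", "OH", 5000.0),
    ("Candidate B", "travel", "IA", 300.0),
    ("Candidate B", "catering", "NH", 450.0),
    ("Candidate B", "advertising", "OH", 2500.0),
]


def spending_by_state(rows):
    """Return (state, candidate, total) rows, biggest spender first within each state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expenditures "
        "(candidate TEXT, purpose TEXT, state TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO expenditures VALUES (?, ?, ?, ?)", rows)
    totals = conn.execute(
        """
        SELECT state, candidate, SUM(amount) AS total
        FROM expenditures
        GROUP BY state, candidate
        ORDER BY state, total DESC
        """
    ).fetchall()
    conn.close()
    return totals


if __name__ == "__main__":
    for state, candidate, total in spending_by_state(SAMPLE_ROWS):
        print(f"{state}  {candidate:<12} {total:>10,.2f}")
```

Grouping by state and candidate gives one number per map region per candidate, which is roughly the shape most mapping tools expect.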

Crunchbase

Crunchbase is quickly becoming the dataset of record for the startup and venture capital communities. It can provide information on anything from what industries are hot (biotech) to the potential effects of founder experience or age. The dataset includes funding, investment, and acquisition data on over 40,000 companies.

  • What kinds of questions can I ask? Are there characteristics of a company—industry, location, etc.—that differ by VC? Do some VCs typically invest together, while others rarely do so? Are companies raising more money earlier? ARE WE IN A BUBBLE??
  • What are the tables called? crunchbase.acquisitions; crunchbase.companies; crunchbase.investments; crunchbase.rounds.
  • What’s the best row? This one, which is approaching the theoretical limit of how good a row of data can be.
  • Can I use this data to make a map? Yes! Like this rather uninformative one, showing the number of startups by the county where they’re headquartered.
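
The "which VCs invest together" question comes down to a self-join: pair up every two investors that appear on the same company, then count the pairs. Here is a rough sketch in Python with an in-memory SQLite table. The rows are invented, and the two columns (company, investor) are assumed names, so check them against the real crunchbase.investments table before borrowing the query.

```python
import sqlite3

# Invented sample rows. The columns (company, investor) are assumed names,
# not necessarily those used in crunchbase.investments.
SAMPLE_ROWS = [
    ("Startup X", "Fund 1"),
    ("Startup X", "Fund 2"),
    ("Startup Y", "Fund 1"),
    ("Startup Y", "Fund 2"),
    ("Startup Y", "Fund 3"),
    ("Startup Z", "Fund 3"),
]


def co_investments(rows):
    """Return (investor_1, investor_2, shared_companies), most frequent pairs first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE investments (company TEXT, investor TEXT)")
    conn.executemany("INSERT INTO investments VALUES (?, ?)", rows)
    pairs = conn.execute(
        """
        SELECT a.investor AS investor_1,
               b.investor AS investor_2,
               COUNT(DISTINCT a.company) AS shared_companies
        FROM investments a
        JOIN investments b
          ON a.company = b.company
         AND a.investor < b.investor  -- keep each pair once, skip self-pairs
        GROUP BY a.investor, b.investor
        ORDER BY shared_companies DESC, investor_1, investor_2
        """
    ).fetchall()
    conn.close()
    return pairs


if __name__ == "__main__":
    for first, second, shared in co_investments(SAMPLE_ROWS):
        print(f"{first} + {second}: {shared} shared companies")
```

The a.investor < b.investor condition is what stops the same pair from showing up twice in opposite order. COUNT(DISTINCT ...) guards against investors who joined several rounds of the same company.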

UFO Sightings

Quandl, which provides millions of free datasets on a vast range of subjects, added data on UFO sightings to Mode. The data includes the number of reported sightings by month. Quandl gets the data from the National UFO Reporting Center (and in case you need to report a sighting, they have a hotline).

  • What kinds of questions can I ask? Are some months more popular for sightings? What correlates with UFO sightings?
  • What’s the table called? thomas.ufo_sightings
  • What’s the best row? The first one, on a sighting from June 1400. The first sighting of the Black Knight?
  • Can I use this data to make a map? No. But you can probably combine it with some Independence Day GIFs and make a killer listicle.
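
To check whether some months are more popular for sightings, you can average the monthly counts by calendar month across years. The sketch below does that in Python with an in-memory SQLite table. The counts are made up, and the column names (month, sightings) and the date format are assumptions about the table, not something I have confirmed.

```python
import sqlite3

# Made-up monthly counts. The columns (month, sightings) and the
# 'YYYY-MM-DD' date format are assumptions, not the real table's layout.
SAMPLE_ROWS = [
    ("2012-06-01", 10),
    ("2013-06-01", 14),
    ("2012-07-01", 20),
    ("2013-07-01", 30),
    ("2013-01-01", 5),
]


def average_by_calendar_month(rows):
    """Return (calendar_month, average_sightings), busiest month first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ufo_sightings (month TEXT, sightings INTEGER)")
    conn.executemany("INSERT INTO ufo_sightings VALUES (?, ?)", rows)
    averages = conn.execute(
        """
        SELECT strftime('%m', month) AS calendar_month,
               AVG(sightings) AS average_sightings
        FROM ufo_sightings
        GROUP BY calendar_month
        ORDER BY average_sightings DESC, calendar_month
        """
    ).fetchall()
    conn.close()
    return averages


if __name__ == "__main__":
    for calendar_month, average in average_by_calendar_month(SAMPLE_ROWS):
        print(f"month {calendar_month}: {average:.1f} sightings on average")
```

Note that strftime here is SQLite's date function. Other databases spell the same idea differently, so the month extraction is the line most likely to need changing.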

FiveThirtyEight

FiveThirtyEight, Nate Silver’s data journalism site, produces a lot of great analysis. For some articles, they publish the underlying data on GitHub. If you want to explore their data or expand on their analyses, we’ve uploaded most of their datasets. A few topics include classic rock radio plays, the ages of Congressional representatives, World Cup predictions, and surveys about defining U.S. geographic regions and international cuisine preferences.

Holidays All Over the World

This dataset includes a list of all the holidays in the world over the next year. While this data is useful for analysis, it could be even more valuable for figuring out which parts of the world—and which of your customers—are on vacation.

  • What kinds of questions can I ask? Which countries have the most holidays? Which months and days have the most holidays? Which countries share a lot of holidays, and which only share a few?
  • What’s the table called? reference_lookups.holidays_by_country
  • What’s the best row? The freedom row.
  • Can I use this data to make a map? Yes! You could show the average number of holidays during the year by country, or where there’s a holiday on any given day.
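
Both map ideas start from two small queries: holidays per country, and countries on holiday on a given date. Here is a minimal sketch in Python with an in-memory SQLite table. The handful of rows is only a sample I typed in, and the column names (country, holiday_date, name) are assumptions rather than the real schema of reference_lookups.holidays_by_country.

```python
import sqlite3

# A few hand-typed sample rows. The columns (country, holiday_date, name)
# are assumed names, not the real table's schema.
SAMPLE_ROWS = [
    ("US", "2015-07-04", "Independence Day"),
    ("US", "2015-12-25", "Christmas Day"),
    ("FR", "2015-07-14", "Bastille Day"),
    ("FR", "2015-12-25", "Christmas Day"),
    ("JP", "2015-05-05", "Children's Day"),
]


def load(rows):
    """Create an in-memory holidays table and fill it with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE holidays (country TEXT, holiday_date TEXT, name TEXT)")
    conn.executemany("INSERT INTO holidays VALUES (?, ?, ?)", rows)
    return conn


def holidays_per_country(conn):
    """Return (country, holiday_count), most holidays first."""
    return conn.execute(
        """
        SELECT country, COUNT(*) AS holiday_count
        FROM holidays
        GROUP BY country
        ORDER BY holiday_count DESC, country
        """
    ).fetchall()


def countries_on_holiday(conn, day):
    """Return the countries with a holiday on the given 'YYYY-MM-DD' date."""
    found = conn.execute(
        "SELECT DISTINCT country FROM holidays WHERE holiday_date = ? ORDER BY country",
        (day,),
    ).fetchall()
    return [country for (country,) in found]


if __name__ == "__main__":
    connection = load(SAMPLE_ROWS)
    print(holidays_per_country(connection))
    print(countries_on_holiday(connection, "2015-12-25"))
    connection.close()
```

The second function is the one to reach for if you mostly care about which customers are out of the office on a particular day.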

Ideas for More?

Inspired to do something fun with one of these datasets? Send us a link to your project on Twitter or Facebook, and we’ll share some of the best work! And if you want to make a map, we’ll soon be publishing a quick tutorial on how to make one, but feel free to email us if you have any questions now.
