dot Stop testing, start deploying your AI apps. See how with MIT Technology Review’s latest research.

Download now

Groovy Datasets for Test Databases

When you experiment with a new-to-you data science skill, you need some sort of data to work with. Why be boring?

Teaching yourself new tech skills often requires a “starter project” and data to support that project. Your motivation for learning the new skill could be anything: preparing for a career upgrade, curiosity about a hot, new programming language, or an intent to better exploit features in an existing development environment.

Good starter projects – once you’re done with “Hello, world,” – accomplish something, however trivial, even if they have nothing to do with work. You need to experiment with realistic coding scenarios, including edge-cases, so the starter project should represent the way you’ll use the tool in real life. On the other hand, you don’t want to spend months debugging a practice application.

Which is to say: Why not have fun? Choose a starter project that lets you play. In my past, such projects have included creating dungeon master tools, food co-op ordering systems, and software developer market research.

With that in mind, I offer several entertaining datasets for inspiration – from astronomy to science fiction to parking meter revenue – many of which support a range of data types. I like to think you’ll use them as you teach yourself about Redis features.

These datasets are all free to access, though a few require you to create a site login. They are downloadable (most are CSV) or accessible via an API. A lot of cool archives are designed for interactive search (such as the Women and Gender Marginalized Composers Repertoire Database, Baseball Reference, or the Tulsa Historical Society’s photo archive), However, this list is for developers, not for people who like to scan fascinating data collections.

I haven’t explored the data in any depth, nor do I vouch for their accuracy. This is purely a pointer to useful resources and a source of many, many internet rabbit holes.

Science fiction datasets

The Star Trek API provides a read-only model of all things Star Trek, including characters, performers, species, episodes, spacecrafts, books, astronomical objects, and video releases. For an idea of its scope, this dataset has information about 7,560 characters, 3,207 technology pieces, 2,497 locations, and 2,348 astronomical objects. Similar information is available from a shared Airtable with “every official Star Trek book, audiobook, comic, episode, movie, and more.” 

If your science fiction fandom lies elsewhere, you might choose the Mutant Moneyball project, which tracks comic book market data for individual X-Men characters’ financial value. The project’s dataset has decade-by-decade statistics for 26 X-Men characters drawn from sales histories and pricing guides.

Programming languages in all their glory

If you prefer to geek out with tech relevance, the programming language database describes several thousand programming languages, including their file formats, communications protocols, and other related concepts. You get information on the year the language was announced, its technical features, creators, countries and communities of origin, relevant books and URLs, and popularity metrics.

World music

For a different set of data characteristics, consult the Global Jukebox, an interactive map, and its accompanying compilation of datasets. It’s focused on traditional songs from around the world, based on information collected by musicologist Alan Lomax. The core dataset, called Cantometrics, encodes “37 aspects of musical style for 5,776 traditional songs from 1,026 societies.”

Coconut acoustics

This is my favorite find: the extracted acoustic signal features of tall Philippine coconut fruits. In the Philippines, it turns out, coconuts are classified manually into their maturity levels. “Traders often use their fingernails, knuckles, or the blunt end of the knife to tap the coconuts before assessing the sounds produced,” write the study’s authors, who developed hardware and software to emulate that process. They used it to collect acoustic signal data from 129 premature, mature, and overmature coconuts, each mechanically knocked on each of its three ridges.

This may be a good data source for AI or machine learning experimentation, particularly if you are interested in digital-signal processing or audio signal processing. Though really, we know the reason to look at this dataset is that you want to tell your friends, “I’m working on an application evaluating coconut acoustics.” I don’t blame you.

Exploding stars

exploding stars datasets
Observations of the supernova SN2020oi in the grand-design spiral M100 and its evolving light curve. (Credit: Alex Gagliano/Young Supernova Experiment Team)

The University of Hawai’i released what it claims is the largest catalog of exploding stars.” The largest data release of relatively nearby supernovae (colossal explosions of stars), containing three years of data from the University of Hawaii Institute for Astronomy’s (IfA) Pan-STARRS telescope atop Haleakalā on Maui, is publicly available via the Young Supernova Experiment,” reports the university. The data contains information on nearly 2,000 supernovae and other luminous variable objects with observations in multiple colors, and also extensively uses multi-color imaging to classify the supernovae and estimate their distances.

Literary prizes

If you want a bookish application for your sample project, build on the Post45 Data Collective’s dataset of major literary prizes. It has more than 7,100 “winners and judges of prizes for prose, poetry, or unspecified genre between 1918 and 2020 with a purse of $10,000 and over.” The data represent 50 awards and fellowships, plus the Library of Congress’s poet laureateship. This dataset is heavily text-based, with entries including the prize name, institution, type, genre, year, and dollar amount, among other fields.

If you are looking for less literary text-based data, consider the Dad Jokes API managed by the national responsible fatherhood clearinghouse.

FIFA World Cup

The Fjelstul World Cup Database covers 22 men’s World Cup tournaments from 1930-2022. The database includes 27 datasets that cover all aspects of the event, accounting for about 1.1 million data points. (I’d say more about this one, but you all know that I’m a baseball girl. I do note that Brazil is the only team with five World Cups.)

Swiss apartment layouts

Maybe you plan to work with spatial and architectural data? The Swiss Dwellings dataset contains detailed data on over 42,500 apartments (250,000 rooms) in about 3,100 buildings, including their geometries and room typology as well as the apartments’ visual, acoustical, topological, and daylight characteristics. It also has location-specific characteristics for the buildings, including climatic data and points of interest within walking distance.

Datasets are a girl’s best friend

If you’re interested in gem quality, diamond pricing, or merely a good-sized dataset for your sample application to chew on, consider this diamond dataset. It has information on about 220,000 diamonds, with 25 columns of data including fluor (measuring the effect of longwave UV light), the stone’s measurements, and the total sales price. That should add a bit of sparkle to your analysis.

Bird locations

Vendors sometimes offer (anonymized) data for public use and analysis. For instance, in addition to a cool world map that shows you live bird pictures that are taken with the company’s smart bird feeder, you can download Bird Buddy’s monthly datasets with longitude, latitude, and species name. Surely you can build a geospatial application that incorporates a northern cardinal, tufted titmouse, and red-headed woodpecker?

Another example that’s less visually attractive comes from BackBlaze, which regularly provides reports about true hard drive failure rates based on its extensive hardware use – 231,309 hard drives at the end of 2022. In addition to its own in-depth analysis, the company also provides its source data.

Government and municipal data

Open-data policies made it easy to find and download datasets that government agencies collect or generate. And it’s a lot of data. The U.S. government has a data search site where you can look for statistics on a wide range of topics, such as healthcare, car sales, and sensor data gathered from agricultural farm use. Whether any of these qualify as “cool datasets” is an exercise left to the user – but they often are large enough to be useful for experimental programming, and some have unique data types.

For instance, if you are exploring geospatial database features, you might want to use a data set that includes location data. One such example is the 31 million parking meter receipts collected since 2015 by the city of Arlington, Virginia, which includes where the meter is as well as the monies paid ($68.6 million dollars in revenue, if you are keeping score).

Similarly, the City of Los Angeles publishes the location and orientation of more than 50,000 stop signs; you can find similar information for Houston, San Francisco, and Detroit. Some datasets, such as from OpenStreetMap, are available through an API as well as downloadable files; if you can think of something to do with information about 1.4 million stop signs around the world, you can do so with ease.

Choose a dataset to support your tool exploration

I like to think that these datasets can help you expand your database skills – particularly as you explore what you can accomplish with Redis features such as search, gaming leaderboards, vector similarity search, time series, and geospatial capabilities. Choose datasets that match the application domain you want to learn about.

For example, if you want to experiment with database processing that incorporates geospatial analysis, your sample data need location data (birds! Stop signs!). To expand your knowledge of database search features (because ultimately, you want to speed up internal searches in production databases), choose a huge dataset (stars! diamonds!); your performance testing needs something to work with. Pick a numbers-heavy dataset when you want to learn how to create data visualizations that make everyone say, “Oooh!” And so on.

This isn’t silly. It’s a wise business choice

Don’t feel foolish about choosing one of these datasets. It’s a bad idea to use existing internal data for a starter project. Banging on real customer data raises privacy concerns, particularly when you aren’t using it for the reasons it was gathered.

You certainly cannot use real information when you speak at an industry conference. But you can entertain and engage an audience when you describe the graph database essentials in the context of Dungeons and Dragons. Mental models help us reframe our knowledge with familiar examples, turning abstract functions into practical analysis.

And, speaking from personal experience, if your intent is to show the boss a technology proof-of-concept (“Here’s what we could accomplish if we deployed this database feature!”), they could be distracted by the “real” data. (True story. A user saw the output of a “play with the tool” experiment – “show a graph of hotel reservations made, displayed by the day of the week” – and said, “Oh, can I get a copy of this report monthly?”)

Need more? This is just a start

If you’re interested in collecting datasets (oh look, a dataset of datasets!), I highly recommend the Data is Plural newsletter, which I drew on liberally to inform my suggestions. You also should visit and subscribe to ResearchBuzz, which shares dataset descriptions as well as archive-related news and tools (a recent example: Turn Wikipedia into an RSS Search Engine With WikiRSS). Google Research maintains a search site for test datasets, too, if you know what you’re looking for.

If you use any of these datasets in your personal projects, please tell me about them!