Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. Users can also discuss best practices and solutions in the dedicated Public Data Sets forum.
By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
Available Public Data Sets on AWS
AWS will continue to add to the available collection of public domain and non-proprietary data sets over time. The data sets currently available are shown below. The Linux/UNIX snapshots are in ISO9660 or EXT3 format and the Windows snapshots are in NTFS format.
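Because the data sets are distributed as snapshots, getting one onto an instance comes down to creating a volume from the snapshot and attaching it. The following is a minimal sketch of that flow, assuming the data set is published as an Amazon EBS snapshot and that the `boto3` SDK is used. The function name `attach_public_snapshot`, the default device name, and all IDs in the usage comment are hypothetical placeholders, not values from this page.

```python
def attach_public_snapshot(ec2, snapshot_id, instance_id, availability_zone,
                           device="/dev/sdf"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is expected to behave like a boto3 EC2 client. The volume must be
    created in the same Availability Zone as the instance it is attached to.
    """
    # Create a new volume whose contents come from the snapshot.
    volume = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )
    volume_id = volume["VolumeId"]

    # Block until the volume is ready to be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Attach the volume to the running instance under the given device name.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


# Example usage (placeholder IDs; requires boto3 and AWS credentials):
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   attach_public_snapshot(ec2, "snap-EXAMPLE", "i-EXAMPLE", "us-east-1a")
```

Once attached, the volume still has to be mounted from within the instance using the file system type listed for the data set. The device name the operating system actually exposes may differ from the one requested.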
Here are some examples of popular Public Data Sets:
- Annotated Human Genome Data provided by ENSEMBL
The Ensembl project produces genome databases for humans and almost 50 other species, and makes this information freely available.
- Various US Census Databases from The US Census Bureau
United States demographic data from the 1980, 1990, and 2000 US Censuses, summary information about Business and Industry, and 2003-2006 Economic Household Profile Data.
- UniGene provided by the National Center for Biotechnology Information
A set of transcript sequences of well-characterized genes and hundreds of thousands of expressed sequence tags (ESTs) that provide an organized view of the transcriptome.
- Freebase Data Dump from Freebase.com
A data dump of all the current facts and assertions in the Freebase system. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available.
Below is a screenshot of the AWS Public Data Sets resource center.