18 February 2011

Public Data Sets on Amazon Web Services

Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS hosts the public data sets at no charge to the community; as with all AWS services, users pay only for the compute and storage they use for their own applications.

Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. Users can also discuss best practices and solutions in the dedicated Public Data Sets forum.

By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

Available Public Data Sets on AWS
AWS will continue to add to the available collection of public domain and non-proprietary data sets over time. The data sets currently available are shown below. The Linux/UNIX snapshots are in ISO9660 or EXT3 format and the Windows snapshots are in NTFS format.

Here are some examples of popular Public Data Sets:
  • Annotated Human Genome Data provided by ENSEMBL
    The Ensembl project produces genome databases for human as well as almost 50 other species, and makes this information freely available.
  • Various US Census Databases from The US Census Bureau
    United States demographic data from the 1980, 1990, and 2000 US Censuses, summary information about Business and Industry, and 2003-2006 Economic Household Profile Data.
  • UniGene provided by the National Center for Biotechnology Information
    A set of transcript sequences of well-characterized genes and hundreds of thousands of expressed sequence tags (EST) that provide an organized view of the transcriptome.
  • Freebase Data Dump from Freebase.com
    A data dump of all the current facts and assertions in the Freebase system. Freebase is an open database of the world’s information, covering millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available.
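
The snapshot formats mentioned above suggest a typical workflow: create a volume from a data set's snapshot, attach it to a running Amazon EC2 instance, and mount it. The following is a minimal sketch of that workflow, not AWS's documented procedure. It assumes the data sets are delivered as Amazon EBS snapshots and that `ec2` behaves like the client returned by `boto3.client("ec2")`. The function names, the default device name `/dev/sdf`, and all IDs are hypothetical placeholders.

```python
import shlex

# Linux/UNIX snapshot formats named in the text, mapped to `mount -t` types.
MOUNT_TYPES = {"EXT3": "ext3", "ISO9660": "iso9660"}


def attach_snapshot(ec2, snapshot_id, instance_id, zone, device="/dev/sdf"):
    """Create a volume from a snapshot and attach it to an instance.

    Assumption: `ec2` behaves like boto3.client("ec2"). The volume must be
    created in the same Availability Zone as the instance.
    """
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]
    # Wait until the new volume is available before attaching it.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


def mount_command(fs_format, device, mount_point):
    """Build the shell command that mounts a Linux/UNIX snapshot volume."""
    fs_type = MOUNT_TYPES[fs_format.upper()]
    return "sudo mount -t {} {} {}".format(
        fs_type, shlex.quote(device), shlex.quote(mount_point)
    )
```

With real credentials, one might call `attach_snapshot(boto3.client("ec2"), ...)` using the snapshot ID listed for the chosen data set, then run the string returned by `mount_command("EXT3", "/dev/sdf", "/mnt/data")` on the instance. The device name seen inside the instance can differ from the one requested, so it is worth checking before mounting.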

[Screenshot: AWS Public Data Sets resource center]

