6.3 Data Set

Adolescents


This library has data taken from two waves of the National Longitudinal Study of Adolescent Health.

The National Longitudinal Study of Adolescent to Adult Health (Add Health) is a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year. The Add Health cohort has been followed into young adulthood with four in-home interviews, the most recent in 2008, when the sample was aged 24-32. Add Health combines longitudinal survey data on respondents’ social, economic, psychological and physical well-being with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships, providing unique opportunities to study how social environments and behaviors in adolescence are linked to health and achievement outcomes in young adulthood. The fourth wave of interviews expanded the collection of biological data in Add Health to understand the social, behavioral, and biological linkages in health trajectories as the Add Health cohort ages through adulthood.


To begin, download the following three files to your computational thinking folder.

 

If you investigate the keys to each row of the dictionary, they will appear cryptic. The following documentation will help you learn about this data:

  • Codebook: ( C Wave 1 ) ( C Wave 2 ) These PDF files will explain what each key means.
  • Questionnaire: ( Q Wave 1 ) ( Q Wave 2 ) These PDF files give the full text of each survey question.
  • Summary: ( S Wave 1 ) ( S Wave 2 ) These TXT files show the possible answers to each survey question and, more importantly, the percentage of respondents who answered it. Let this percentage guide what you analyze: there is little sense in studying a question that only 1 in 100 respondents answered!
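To see why the response percentage matters, here is a minimal sketch that inspects the keys of one row and then measures how often each question was answered. The rows, the question codes (Q1, Q2, Q3), and the use of None for a missing answer are all made up for illustration; the real data may mark missing answers differently, so check the Summary files.

```python
# Hypothetical sample rows: each survey response is a dictionary keyed by
# question code. The codes and values here are made up for illustration,
# and None stands in for "no answer" (an assumption, not the real coding).
rows = [
    {"Q1": 3, "Q2": None, "Q3": 1},
    {"Q1": 2, "Q2": None, "Q3": None},
    {"Q1": 5, "Q2": 4, "Q3": 2},
    {"Q1": 1, "Q2": None, "Q3": 2},
]


def response_rates(rows):
    """Return the fraction of rows with a non-missing answer for each key."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            answered = counts.get(key, 0)
            counts[key] = answered + (value is not None)
    return {key: answered / len(rows) for key, answered in counts.items()}


def well_answered(rows, minimum=0.5):
    """Return the sorted keys answered by at least `minimum` of the rows."""
    rates = response_rates(rows)
    return sorted(key for key, rate in rates.items() if rate >= minimum)


print(sorted(rows[0].keys()))  # inspect the keys of one row
print(response_rates(rows))    # fraction answered, per question
print(well_answered(rows))     # questions worth analyzing
```

In this made-up sample, Q2 was answered by only one respondent in four, so it falls below the default cutoff and is left out.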

This data set is massive (together, the waves are about 113MB). It is difficult to fit this much data into memory, so by default the library loads only 20 survey results. When you are ready to start testing with more data, call the use_all_data() function.

 

 

Airports


This dataset is all about airports. To begin, download the following two files to your computational thinking folder.

 

 

If you investigate the keys to each row of the dictionary, they will appear a little cryptic. The following documentation will help you learn about this data:

Data Dictionary

There are four data dictionaries relevant to this data source on the FAA website.

 

 

Baseball


The following metadata describes this database.

This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2012. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.

This database was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. What started as a one man effort in 1994 has grown tremendously, and now a team of researchers have collected their efforts to make this the largest and most accurate source for baseball statistics available anywhere. (See Acknowledgements below for a list of the key contributors to this project.)

None of what we have done would have been possible without the pioneering work of Hy Turkin, S.C. Thompson, David Neft, and Pete Palmer (among others). All baseball fans owe a debt of gratitude to the people who have worked so hard to build the tremendous set of data that we have today. Our thanks also to the many members of the Society for American Baseball Research who have helped us over the years. We strongly urge you to support and join their efforts. Please visit their website (www.sabr.org).


To begin, download the following three files to your computational thinking folder.

 

 

Crime


This dataset is all about crimes.

The Uniform Crime Reporting (UCR) Program has been the starting place for law enforcement executives, students of criminal justice, researchers, members of the media, and the public at large seeking information on crime in the nation. The program was conceived in 1929 by the International Association of Chiefs of Police to meet the need for reliable uniform crime statistics for the nation. In 1930, the FBI was tasked with collecting, publishing, and archiving those statistics.


To begin, download the following three files to your computational thinking folder.

 

 

Fuel


The following metadata describes this database:

Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency’s National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA.

To begin, download the following three files to your computational thinking folder.

 

To learn more about the dataset, you will want to read the following documentation.
Data Dictionary
  • Vehicles
  • Emissions

 

 

HorseRacing

 

This dataset was collected by crawling the Churchill Downs website. It represents several years and many races' worth of wins and losses.

 

To begin, download the following two files to your computational thinking folder.

 

 

Hospital

 

These are the official datasets used on the Medicare.gov Hospital Compare Website provided by the Centers for Medicare & Medicaid Services. These data allow you to compare the quality of care at over 4,000 Medicare-certified hospitals across the country.

 

To begin, download the following two files to your computational thinking folder.

 

If you investigate the keys to each row of the dictionary, they will appear a little cryptic. The following documentation will help you learn about this data:
Data Dictionary

 

Hydropower


This dataset combines the “Hydropower Potential in the Western U.S.” report with data crawled from the U.S. Department of the Interior’s Bureau of Reclamation website.

The dataset includes design elements, installed capacity, production capability, associated costs and cost-to-benefit ratios for nearly 200 water storing and conveying structures currently maintained by the Bureau of Reclamation. These data were used to support the internal study and report for assessing hydropower capability at 70 of Reclamation’s existing facilities where hydropower has not been developed. The dataset can further be leveraged to support applications designed to provide a better understanding of our hydropower production potential and resource utilization.


To begin, download the following three files to your computational thinking folder.

 

MovieScript

This dataset covers over 600 movies. It was collected and organized as part of the Cornell Movie Dialogs Corpus. More information about the project is available at the CMDC site.

 

To begin, download the following two files to your computational thinking folder.

movie_scripts.py

movie_script.json


There are mostly complete scripts for over 600 movies in the movie_scripts library, which is a lot of data. If you try to use all the data at once, it can make running your program very slow. To speed things up when you’re developing, the library defaults to only having information about 8 movies:

  • “romeo and juliet”
  • “the rocky horror picture show”
  • “the princess bride”
  • “casablanca”
  • “citizen kane”
  • “2001: a space odyssey”
  • “star wars”
  • “the wizard of oz”
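While you are developing, it can help to check whether a movie is one of the eight defaults before you ask for it. The sketch below does that with a plain set. The names DEFAULT_TITLES and in_default_set are hypothetical and not part of the movie_scripts library; the lowercase matching is an assumption based on how the titles are listed above.

```python
# The eight default titles, copied from the list above.
# DEFAULT_TITLES and in_default_set are hypothetical helper names.
DEFAULT_TITLES = {
    "romeo and juliet",
    "the rocky horror picture show",
    "the princess bride",
    "casablanca",
    "citizen kane",
    "2001: a space odyssey",
    "star wars",
    "the wizard of oz",
}


def in_default_set(title):
    """Return True if `title` is one of the eight default movies."""
    # Normalize case and surrounding spaces before comparing.
    return title.strip().lower() in DEFAULT_TITLES


print(in_default_set("Casablanca"))
print(in_default_set("Jaws"))
```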

 

 

Personality


This dataset is all about personality surveys. To begin, download the following two files to your computational thinking folder.
