Note:
**We need your help!**

This document is being developed collaboratively. We hope to make it the best possible resource by leveraging the collective knowledge of government experts and other stakeholders. You can contribute by:

- Adding your comments to the discussion at the bottom of this page

Data Science is the use of data to answer a scientific question. Scientific questions are not limited to experiments in laboratories: they occur any time a person or organization asks a question and seeks to answer it in a way that can be verified, explained, or theorized based on the analysis of information. Because data science draws on many different but related domains (e.g., computer science, quantitative analysis, statistics, communications), the field covers a lot of ground.

The purpose of this guide is to provide a brief overview of the courses, concepts, and resources that anyone interested in advancing their knowledge of data science can draw on. The mention of any particular course is not an endorsement, but simply an acknowledgement. Many other resources exist beyond those mentioned here, and using this guide as a launching point for further research can help identify other data science learning options.

If you want to become a bona fide Data Scientist, there are many routes and options available to you. For a list of Data Science programs across the United States, offered by major universities and colleges, check out this data visualization. The table below shows some of the most popular online and in-person offerings for those wishing to learn on their own. Many colleges and universities offer similar programs, often through statistics or business departments, so check with your local universities and professors to see whether any local offerings can help meet your data science needs.

Course | Provided Through | Developed By | Price | Format |
---|---|---|---|---|
Data Science Specialization | Coursera | Johns Hopkins University | $441 | Online |
Executive Data Science Specialization | Coursera | Johns Hopkins University | From $49/course | Online |
Master of Computer Science in Data Science | Coursera | University of Illinois at Urbana-Champaign | $19,000 for a 32-credit-hour degree | Online |
Data Science at Scale | Coursera | University of Washington | $284 | Online |
Intro to Data Science | Udacity | Udacity | Free | Online |
Applied Data Science: An Introduction | Course Central | Syracuse University | Free | Online |
Open Source Society | Open Source Society University | GitHub | Free | Online |
Data Science Immersive | General Assembly | General Assembly | $14,500 | In Person |
Data Science Part Time | General Assembly | General Assembly | $4,000 | ½ Online ½ In Person |

Not everyone wants to be a Data Scientist, but plenty of people want to strengthen their skills in the underlying competencies. It’s possible to learn key concepts in data science without going through a full-fledged program like those mentioned above. For a high-level overview of some key (and commonly confused) terms, check out the GovEx Data Science Cheat Sheet. The table further below lists the subject areas that are often taught in data science coursework.

The internet is rich with resources for learning data science, and the community is an active one, so join in and get involved! Make liberal use of sites like Stack Exchange, R-bloggers, GitHub, and even Twitter to start learning from your peers, no matter your skill level as a data scientist.

Concept | Type | Description | Level |
---|---|---|---|
Getting and Cleaning Data | Concept | Acquiring and preparing data for analysis through a variety of manual and automated techniques. | Introductory |
Exploratory Data Analysis | Concept | Performing initial analysis on data without a particular research question in mind in order to discover potential insights. | Intermediate |
Reproducible Research | Concept | Cataloguing research so that others can follow data, steps, and analysis in order to replicate/test findings. | Advanced |
Descriptive Statistics | Statistics | Statistics used to describe and summarize data, including measures of central tendency (mean, median, mode, etc.) and variance. | Introductory |
Inferential Statistics | Statistics | Using data from a sample to make inferences about a larger population of data. | Intermediate |
Bayesian Statistics | Statistics | Field of statistics that treats probability as a state of belief that can change given new information. | Advanced |
Probability | Statistics | A measure of the likelihood that an event will occur. | Introductory |
R Programming | Programming Language | A programming language used for statistical computing. | Advanced |
Python Programming | Programming Language | A general purpose programming language. | Advanced |
Regression Models | Analysis | Statistical methods for analyzing relationships between variables. | Introductory |
Machine Learning | Concept | Computational algorithms used to make predictions. | Advanced |
Data Visualization | Concept | The visual representation of information in a multidimensional space. | Introductory |
Econometrics | Statistics | A field of statistics for analyzing economic data. | Advanced |
Big Data | Concept | A term used to describe data that is extremely large in storage size, or that requires large amounts of processing to analyze. | Advanced |
Algorithms | Concept | A set of defined operations on a given input that result in an output. | Introductory |
Survey Data Collection & Analysis | Concept | Surveys are a set of questions answered by a selected group of people. The answers can be analyzed quantitatively and/or qualitatively. | Intermediate |
Text Mining | Concept | Method of computational analysis to derive information from text. | Intermediate |
Business Intelligence | Analysis | Method of data analysis to produce useful information for business purposes. | Intermediate |
Data Warehousing | Concept | System for electronically storing data in an organized manner. | Advanced |
Geographic Information Systems (GIS) / spatial analysis | Analysis | Method for analyzing the geographical dimension of various types of data. | Advanced |
SQL* | Databases | Query language used to interact with a relational database. | Intermediate |
PostgreSQL* | Databases | An open-source relational database system that is queried with SQL. | Intermediate |
NoSQL* | Databases | A family of non-relational databases, each typically with its own query interface. | Advanced |
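
To make two of the introductory rows above more concrete (Descriptive Statistics and Regression Models), here is a minimal Python sketch that uses only the standard library. The data values, the variable names, and the `fit_line` helper are hypothetical and invented for illustration; this is one simple way to compute these quantities, not a recommended workflow.

```python
import statistics

# Hypothetical sample: number of service requests logged on six days.
daily_requests = [10, 12, 12, 14, 16, 20]

# Descriptive statistics: central tendency and spread.
summary = {
    "mean": statistics.mean(daily_requests),
    "median": statistics.median(daily_requests),
    "mode": statistics.mode(daily_requests),
    "variance": statistics.variance(daily_requests),  # sample variance (n - 1)
}


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x (hypothetical helper)."""
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical paired data: staff on duty (x) and requests closed (y).
staff = [1, 2, 3, 4, 5]
closed = [3, 5, 7, 9, 11]
slope, intercept = fit_line(staff, closed)

print(summary)
print(f"closed = {intercept:.1f} + {slope:.1f} * staff")
```

In practice, most analysts would likely reach for R or a Python library such as pandas for this kind of work; the standard library is used here only to keep the example self-contained.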

Tool | Type | Description | Level |
---|---|---|---|
Excel | Analysis | A Microsoft spreadsheet application used for calculation and other purposes. | Introductory |
R | Analysis | A programming language used for statistical computing. | Advanced |
Python | Analysis | A general purpose programming language made with ease and accessibility in mind. | Advanced |
Tableau | Analysis | An application used for data visualization. | Intermediate |
SPSS | Analysis | A software package used for statistical computing. | Advanced |
SAS | Analysis | A software package used for statistical computing. | Advanced |

*If you’re not familiar with databases, SQL, PostgreSQL, or NoSQL, check out this fun and informative introduction to databases from Guru99. http://www.guru99.com/introduction-to-database-sql.html*
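
For readers who have never queried a relational database, the sketch below shows what SQL looks like in practice. It uses Python's built-in `sqlite3` module with an in-memory SQLite database, so nothing needs to be installed. The `permits` table, its columns, and the rows are hypothetical and made up for illustration; databases such as PostgreSQL or MySQL accept broadly similar statements, though details vary between systems.

```python
import sqlite3

# An in-memory SQLite database; it disappears when the connection closes.
conn = sqlite3.connect(":memory:")

# Hypothetical table of permit applications.
conn.execute(
    "CREATE TABLE permits (id INTEGER PRIMARY KEY, ward TEXT, fee REAL)"
)

# Insert made-up rows using parameter placeholders (?), not string formatting.
conn.executemany(
    "INSERT INTO permits (ward, fee) VALUES (?, ?)",
    [("North", 50.0), ("North", 75.0), ("South", 20.0)],
)

# Count permits and total fees per ward.
rows = conn.execute(
    "SELECT ward, COUNT(*), SUM(fee) FROM permits GROUP BY ward ORDER BY ward"
).fetchall()
conn.close()

for ward, permit_count, total_fee in rows:
    print(ward, permit_count, total_fee)
```

The `SELECT ... GROUP BY` statement is the part that carries over to other relational databases; the connection setup is specific to SQLite.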

Working as a Data Scientist, or even just experimenting with data science, requires some of the tools below. This table gives a brief overview of commonly used data science tools and what each is best suited for. Many of these tools also have functionality outside their focus area, so careful research and choice of tools are important when working in data science.

Tool | Used for | Cost |
---|---|---|
Git | Version control | Free |
GitHub | Collaborative development | Free |
R / RStudio | Statistical analysis, visualization | Free |
Python | Statistical analysis, applications | Free |
Excel | Database, statistical analysis | $ |
Hadoop (large datasets) | Data storage, computing | Free |
Hive (large datasets) | Data storage | Free |
Pig (large datasets) | Data analysis | Free |
Apache Spark (large datasets) | Data analysis | Free |
Tableau | Data analysis, visualization | $$$ |
SPSS | Data analysis | $$ |
SAS | Analysis, data management | $$ |
MySQL | Database | Free |
PostgreSQL | Database | Free |

- An Introduction to Data Science by Jeffrey Stanton
- What is Data Science? by Mike Loukides
- Building Data Science Teams by DJ Patil
- Executive Data Science by Brian Caffo, Roger D. Peng and Jeffrey Leek
- R Programming for Data Science by Roger Peng
- The Art of Data Science by Roger D. Peng and Elizabeth Matsui
- Statistical Inference for Data Science by Brian Caffo
- Regression Models for Data Science in R by Brian Caffo
- Report Writing for Data Science in R by Roger Peng
- Advanced Linear Models for Data Science by Brian Caffo
- Data Science from Scratch: First Principles with Python by Joel Grus
- Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking by Foster Provost
- Data Smart: Using Data Science to Transform Information into Insight by John Foreman
- Data Science for Dummies by Lillian Pierson
- Doing Data Science: Straight Talk from the Frontline by Cathy O’Neil
- The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists by Carl Shan
- Python for Data Analysis by Wes McKinney
- Bad Data Handbook by Q. Ethan McCallum
- Mining the Social Web by Matthew Russell
- Creating a Data-Driven Organization by Carl Anderson
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund