Note: We need your help!

This document is being developed collaboratively. We hope to make it the best possible resource by leveraging the collective knowledge of government experts and other stakeholders. You can contribute by:

  • Adding your comments to the discussion at the bottom of this page

What Is Data Science?

Data science is the use of data to answer a scientific question. Scientific questions are not limited to experiments in laboratories; they arise any time a person or organization asks a question and seeks to answer it in a way that can be verified, explained, or theorized based on the analysis of information. Because data science draws on many different but related domains (e.g., computer science, quantitative analysis, statistics, communications), the field covers a lot of ground.

The purpose of this guide is to provide a brief overview of the courses, concepts, and resources that anyone interested in advancing their knowledge of data science can use. The mention of any particular course is not an endorsement, simply an acknowledgement. Many other resources exist beyond the items mentioned here, and using this guide as a launching point for additional research can help identify other data science learning options.

Data Science Courses

If you want to become a bona fide Data Scientist, there are many routes and options available to you. For a list of Data Science programs across the United States, offered by major universities and colleges, check out this data visualization. The table below shows some of the most popular online and in-person offerings for those wishing to learn on their own. Many colleges and universities offer similar programs, often through statistics or business departments, so check with your local universities and professors to see if there are any local offerings that can help meet your data science needs.

Course | Provided Through | Developed By | Price | Format
Data Science Specialization | Coursera | Johns Hopkins University | $441 | Online
Executive Data Science Specialization | Coursera | Johns Hopkins University | From $49/course | Online
Master of Computer Science in Data Science | Coursera | University of Illinois at Urbana-Champaign | $19,000 for 32 credit hour degree | Online
Data Science at Scale | Coursera | University of Washington | $284 | Online
Intro to Data Science | UDACITY | UDACITY | Free | Online
Applied Data Science: An Introduction | Course Central | Syracuse University | Free | Online
Open Source Society | Open Source Society University | GitHub | Free | Online
Data Science Immersive | General Assembly | General Assembly | $14,500 | In Person
Data Science Part Time | General Assembly | General Assembly | $4,000 | ½ Online, ½ In Person

Data Science Key Concepts

Not everyone wants to be a Data Scientist, but plenty of people want to strengthen their skills in the underlying competencies. It’s possible to learn key concepts in data science without going through a full-fledged program like those mentioned above. For a high-level overview of some key (and commonly confusing) terms, check out the GovEx Data Science Cheat Sheet. The list below includes the subject areas which are often taught in data science coursework.

The internet is rich with resources for learning data science, and the community is an active one, so join in and get involved! Make liberal use of sites like Stack Exchange, R-bloggers, GitHub, and even Twitter to start learning from your peers, no matter your skill level as a data scientist.

Concept | Type | Description | Level
Getting and Cleaning Data | Concept | Acquiring and preparing data for analysis through a variety of manual and automated techniques. | Introductory
Exploratory Data Analysis | Concept | Performing initial analysis on data without a particular research question in mind in order to discover potential insights. | Intermediate
Reproducible Research | Concept | Cataloguing research so that others can follow data, steps, and analysis in order to replicate/test findings. | Advanced
Descriptive Statistics | Statistics | Statistics used to describe and summarize data, including measures of central tendency (mean, median, mode, etc.) and variance. | Introductory
Inferential Statistics | Statistics | Using data from a sample to make inferences about a larger population of data. | Intermediate
Bayesian Statistics | Statistics | Field of statistics that treats probability as a state of belief that can change given new information. | Advanced
Probability | Statistics | A measure of the likelihood that an event will occur. | Introductory
R Programming | Programming Language | A programming language used for statistical computing. | Advanced
Python Programming | Programming Language | A general purpose programming language. | Advanced
Regression Models | Analysis | Statistical methods for analyzing relationships between variables. | Introductory
Machine Learning | Concept | Computational algorithms used to make predictions. | Advanced
Data Visualization | Concept | The visual representation of information in a multidimensional space. | Introductory
Econometrics | Statistics | A field of statistics for analyzing economic data. | Advanced
Big Data | Concept | A term used to describe data that is extremely large in storage size, or that requires large amounts of processing to analyze. | Advanced
Algorithms | Concept | A set of defined operations on a given input that result in an output. | Introductory
Survey Data Collection & Analysis | Concept | Surveys are a set of questions answered by a selected group of people. The answers can be analyzed quantitatively and/or qualitatively. | Intermediate
Text Mining | Concept | Method of computational analysis to derive information from text. | Intermediate
Business Intelligence | Analysis | Method of data analysis to produce useful information for business purposes. | Intermediate
Data Warehousing | Concept | System for electronically storing data in an organized manner. | Advanced
Geographic Information Systems (GIS) / spatial analysis | Analysis | Method for analyzing the geographical dimension of various types of data. | Advanced
SQL* | Databases | Querying language used to interact with a relational database. | Intermediate
PostgreSQL* | Databases | An open-source relational database system that is queried with SQL. | Intermediate
NoSQL* | Databases | A family of non-relational databases, each with its own way of storing and querying data. | Advanced
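
Two of the introductory concepts in the table, descriptive statistics and regression models, can be illustrated with a few lines of Python. The sketch below is only an illustration: the numbers are made up, and the helper names `describe` and `fit_line` are hypothetical, not part of any course listed in this guide. It uses Python's built-in `statistics` module and a hand-written least-squares fit.

```python
import statistics

def describe(values):
    """Summarize a list of numbers with common descriptive statistics."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
    }

def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Slope is the co-variation of x and y divided by the variation of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: daily counts of service requests (made-up numbers).
requests = [12, 15, 15, 18, 20, 22, 15, 30]
summary = describe(requests)
print(summary)

# Hypothetical data: a perfectly linear relationship, y = 1 + 2x.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

In practice, tools such as R or Python's data analysis libraries provide these calculations directly; writing them out once can help show what the terms in the table mean.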

Tools for conducting data analysis:

Tool | Type | Description | Level
Excel | Analysis | A Microsoft spreadsheet application used for calculation and other purposes. | Introductory
R | Analysis | A programming language used for statistical computing. | Advanced
Python | Analysis | A general purpose programming language made with ease and accessibility in mind. | Advanced
Tableau | Analysis | An application used for data visualization. | Intermediate
SPSS | Analysis | A software package used for statistical computing. | Advanced
SAS | Analysis | A software package used for statistical computing. | Advanced

* If you’re not familiar with databases, SQL, PostgreSQL, or NoSQL, check out this fun and informative introduction to databases from Guru 99. http://www.guru99.com/introduction-to-database-sql.html
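
For a feel of what querying a relational database with SQL looks like, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `courses`, its columns, the sample rows, and the function name `average_price_by_format` are all hypothetical and chosen only for illustration; SQLite stands in here for larger systems such as MySQL or PostgreSQL.

```python
import sqlite3

def average_price_by_format(rows):
    """Load rows into an in-memory database and average price per format."""
    conn = sqlite3.connect(":memory:")  # temporary database, nothing saved to disk
    conn.execute("CREATE TABLE courses (name TEXT, format TEXT, price REAL)")
    conn.executemany("INSERT INTO courses VALUES (?, ?, ?)", rows)
    # SQL describes the result we want: one average price per course format.
    result = conn.execute(
        "SELECT format, AVG(price) FROM courses "
        "GROUP BY format ORDER BY format"
    ).fetchall()
    conn.close()
    return result

# Made-up sample data for illustration only.
sample = [
    ("Course A", "Online", 100.0),
    ("Course B", "Online", 300.0),
    ("Course C", "In Person", 1000.0),
]
result = average_price_by_format(sample)
print(result)
```

The same `SELECT ... GROUP BY` pattern carries over to most relational databases, although connection details and some SQL features differ between systems.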

Data Science Tools

Being a Data Scientist, or even pretending to be one, requires using some of the tools below. This table provides a brief overview of some of the tools used in data science and what each is best suited for. Many of these tools also have functionality outside their focus area, so research and choose your tools carefully.

Tool | Used for | Cost
Git | Version control | Free
GitHub | Collaborative development | Free
R / RStudio | Statistical analysis, visualization | Free
Python | Statistical analysis, applications | Free
Excel | Database, statistical analysis | $
Hadoop (large datasets) | Data storage, computing | Free
Hive (large datasets) | Data storage | Free
Pig (large datasets) | Data analysis | Free
Apache Spark (large datasets) | Data analysis | Free
Tableau | Data analysis, visualization | $$$
SPSS | Data analysis | $$
SAS | Analysis, data management | $$
MySQL | Database | Free
PostgreSQL | Database | Free

Data Science Resources

Books:



