Data Science Projects

Causality and linear regression: Building models in R and looking for causality using infant data.

Hypothesis testing: Testing hypotheses in R using political survey data.

Puffin Really Matters: Designing a case study that attempts to use images of puffins to identify individual birds within populations.

Exploratory data analysis: EDA of CEO data using R, examining associations between performance and salary.

Best Hospitals: Created a data lake in HDFS for government hospital data. Transformed the data for analysis using Hive and Spark. Identified the best-performing hospitals across a wide variety of measures.

Tweet Storm: Running basic analysis of live tweets using Apache Storm and the Twitter API. Tweets were parsed using Python and stored in PostgreSQL.

Whiskey Business: A comparison of whiskey prices and reviews between a liquor-control state and free-market states, aimed at finding the best deals outside the state-controlled prices. Prices and reviews were collected from multiple web sources, cleaned with Python, stored in HDFS, combined into one table, exported to Google Sheets, and connected to Data Studio for interactive visualizations.

Predicting Tree Coverage: Based on a Kaggle competition, we used scikit-learn Random Forest models to predict the type of tree coverage from Forest Service data.

Machine Learning at Scale: Implementing embarrassingly parallel algorithms such as Naive Bayes with MapReduce, K-means with MapReduce and Spark, cosine similarity with MRJob, linear regression using gradient descent with MRJob, single-source shortest path with MRJob, PageRank with MRJob and Spark, logistic regression using gradient descent with Spark, and pairwise data mining in MapReduce.
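
As a rough illustration of how gradient descent for linear regression splits into map and reduce steps, here is a minimal single-machine sketch in plain Python. It does not use MRJob or Spark and is not the project's code; the function names, the toy data, and the learning rate are all assumptions made for the example.

```python
from functools import reduce


def map_gradient(record, weights):
    """Map step: gradient contribution of one (x, y) record for squared error."""
    x, y = record
    w, b = weights
    error = (w * x + b) - y
    # Emit partial sums: d/dw, d/db, and a record count.
    return (error * x, error, 1)


def reduce_gradients(left, right):
    """Reduce step: element-wise sum of two partial results."""
    return (left[0] + right[0], left[1] + right[1], left[2] + right[2])


def fit(records, learning_rate=0.05, iterations=2000):
    """Run batch gradient descent, one map/reduce pass per iteration."""
    weights = (0.0, 0.0)
    for _ in range(iterations):
        partials = [map_gradient(r, weights) for r in records]
        grad_w, grad_b, n = reduce(reduce_gradients, partials)
        w, b = weights
        weights = (w - learning_rate * grad_w / n,
                   b - learning_rate * grad_b / n)
    return weights


# Hypothetical toy data drawn exactly from y = 2x + 1.
toy_records = [(float(x), 2.0 * x + 1.0) for x in range(5)]
w, b = fit(toy_records)
print(round(w, 4), round(b, 4))
```

Because the reduce step is a plain sum, the per-record gradients can be computed on separate workers and combined in any order, which is what makes each iteration a fit for MapReduce-style frameworks.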

Disaster Perplexity: Detected changes in perplexity and topic on social media when natural and man-made disasters struck local communities.
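
For readers unfamiliar with perplexity, the sketch below shows one simple way to compute it. The description above does not say which language model the project used, so the unigram model with add-one smoothing, the helper names, and the toy sentences here are assumptions for illustration only.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count token frequencies in a baseline corpus."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(tokens, counts, total, vocab_size):
    """Perplexity of tokens under an add-one-smoothed unigram model."""
    log_prob = 0.0
    for tok in tokens:
        # Unseen tokens get a count of 0 and still receive nonzero probability.
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


# Hypothetical toy text standing in for everyday posts from one community.
baseline = "the game was great and the food was great".split()
counts, total = train_unigram(baseline)
vocab_size = len(counts) + 1  # one extra slot for unseen words

typical = "the food was great".split()
unusual = "flood warning evacuate now".split()

typical_ppl = perplexity(typical, counts, total, vocab_size)
unusual_ppl = perplexity(unusual, counts, total, vocab_size)
print(typical_ppl < unusual_ppl)
```

Text that resembles the baseline scores a lower perplexity than text full of unseen words, so a sudden rise in perplexity is one possible signal that the conversation has shifted.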

Email Response Experiment: Ran an experiment to test whether women receive slower email responses than men by sending emails to businesses under randomly assigned male or female names and measuring response times.

A data collection and insight engine designed to turn a teacher’s observations into credible data that can inform classroom- and school-level decision making.

Rodents in NYC: A visualization created in Tableau using data from the NYC Open Data site. This project visually and interactively explores signs of rodents in NYC buildings and restaurants.

Step Data Viz: An interactive visualization built with d3.js that explores insights gathered from iOS step data.