Data Science Projects

Causality and linear regression: Building models in R and looking for causality using infant data

Hypothesis testing: Testing hypotheses in R using political survey data

Puffin Really Matters: Designing a case study that attempts to use images of puffins to identify individuals within populations

Exploratory data analysis: EDA of CEO data in R, examining associations between performance and salary

Best Hospitals: Created a data lake in HDFS for government hospital data, transformed the data for analysis using Hive and Spark, and identified the best-performing hospitals across a wide variety of measures.

Tweet Storm: Running basic analysis of live tweets using Apache Storm and the Twitter API. Tweets were parsed using Python and stored in PostgreSQL.

Whiskey Business: A comparison of whiskey prices and reviews between a liquor-control state and free-market states, aimed at finding the best deals outside the state-controlled prices. Whiskey prices and reviews were collected from multiple web sources, cleaned with Python, stored in HDFS, and combined into one table, which was exported to Google Sheets and connected to Data Studio for interactive visualizations.

Predicting Tree Coverage: Based on a Kaggle competition, we used scikit-learn Random Forest models to predict the type of tree coverage from Forest Service data.

Machine Learning at Scale: Implementing embarrassingly parallel algorithms such as Naive Bayes with MapReduce, K-Means with MapReduce and Spark, cosine similarity with MRJob, linear regression using gradient descent with MRJob, single-source shortest path with MRJob, PageRank with MRJob and Spark, and logistic regression using gradient descent with Spark.
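
As a rough illustration of the map/reduce pattern behind the gradient-descent items above, here is a minimal single-machine sketch in plain Python. It is not the project's MRJob or Spark code: the function names, the toy records, the learning rate, and the iteration count are all assumptions chosen for illustration. Each "mapper" call computes one record's contribution to the gradient, and the "reducer" sums those contributions.

```python
from functools import reduce


def map_gradient(record, weights):
    """Mapper: emit one record's contribution to the squared-error gradient."""
    x, y = record
    features = (1.0, x)  # bias term plus a single feature
    error = sum(w * f for w, f in zip(weights, features)) - y
    return tuple(error * f for f in features)


def reduce_gradients(a, b):
    """Reducer: sum two partial gradients element-wise."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def fit_linear_regression(records, learning_rate=0.05, iterations=2000):
    """Batch gradient descent where every iteration is one map/reduce pass."""
    weights = (0.0, 0.0)  # (intercept, slope)
    n = len(records)
    for _ in range(iterations):
        partials = [map_gradient(r, weights) for r in records]  # map phase
        total = reduce(reduce_gradients, partials)               # reduce phase
        weights = tuple(
            w - learning_rate * g / n for w, g in zip(weights, total)
        )
    return weights


# Hypothetical toy data generated from y = 2x + 1 (not project data).
toy_records = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
intercept, slope = fit_linear_regression(toy_records)
```

The map phase is independent per record, which is what lets a framework distribute it across workers. Only the small summed gradient has to be collected before the next iteration.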

Disaster Perplexity: Detected changes in perplexity and topic on social media when natural and man-made disasters struck local communities.
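
For readers unfamiliar with perplexity, the sketch below shows one simple way it could be computed. The source does not say which language model the project used, so the unigram model with add-one smoothing is an assumption, and the function names and token lists are hypothetical. The idea is that text unlike the baseline (for example, sudden disaster vocabulary) scores a higher perplexity.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Count token frequencies in a baseline corpus."""
    counts = Counter(tokens)
    return counts, sum(counts.values())


def perplexity(tokens, counts, total):
    """Perplexity of `tokens` under an add-one smoothed unigram model.

    The vocabulary size is the number of distinct baseline tokens plus one
    slot reserved for any unseen token.
    """
    vocab_size = len(counts) + 1
    log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


# Hypothetical token streams, invented for illustration only.
baseline = "coffee traffic weather coffee game traffic coffee game".split()
usual = "coffee game traffic".split()
unusual = "flood evacuation shelter".split()

counts, total = train_unigram(baseline)
usual_ppl = perplexity(usual, counts, total)
unusual_ppl = perplexity(unusual, counts, total)
```

Here `unusual_ppl` comes out higher than `usual_ppl` because none of its tokens appear in the baseline. A real pipeline would compare perplexity over time windows rather than two fixed samples.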

Email Response Experiment: Ran an experiment to see if women receive slower responses to emails than men by sending out emails to businesses with randomly assigned male/female names and measuring response times.
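
One way the response times from such a randomized experiment could be compared is a two-sided permutation test on the difference in mean response time. The source does not state which test was used, so this is only a sketch under that assumption. The function name, the seed, and the response times below are hypothetical and are not results from the experiment.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns the observed difference (mean of a minus mean of b) and the
    share of random relabelings whose difference is at least as extreme.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Hypothetical response times in hours (made up for illustration).
female_name_hours = [30, 32, 35, 31, 33, 34]
male_name_hours = [10, 12, 11, 13, 9, 14]

observed_diff, p_value = permutation_test(female_name_hours, male_name_hours)
```

Because names were randomly assigned, shuffling the labels mimics what the data would look like if the name had no effect, which is why a permutation test fits this design.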