Spark Optimizations (October 2015)
Technologies used: Spark, Scala
- Implemented in-memory and out-of-core hashing in Scala
- Used out-of-core hashing to implement UDF caching, which speeds up queries that involve UDFs
Decision Trees and Random Forests (November 2015)
Technologies used: Python, NumPy
- Built decision trees and random forests that train on labeled examples in order to classify unlabeled ones
- Handled categorical features with one-hot encoding and imputed missing values in samples
- Trained classifiers to distinguish spam from ham emails and to predict from US census data whether a given individual earned above or below $50k annually
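To illustrate the preprocessing mentioned in the bullets above, here is a minimal NumPy sketch of one-hot encoding and missing-value imputation. The function names `one_hot` and `impute_mean` are hypothetical, and filling gaps with the column mean is an assumption made for the example; the project may have used a different imputation rule.

```python
import numpy as np

def one_hot(column):
    """Map a 1-D array of category labels to a 0/1 indicator matrix."""
    column = np.asarray(column)
    # categories is the sorted list of distinct labels;
    # codes[i] is the index of column[i] within categories.
    categories, codes = np.unique(column, return_inverse=True)
    encoded = np.zeros((len(column), len(categories)), dtype=int)
    encoded[np.arange(len(column)), codes] = 1
    return categories, encoded

def impute_mean(X):
    """Return a copy of X with NaN entries replaced by their column mean.

    Assumption: mean imputation, chosen only for illustration.
    """
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)  # mean over observed values only
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

# Toy data, made up for the example.
categories, encoded = one_hot(["red", "blue", "red"])
filled = impute_mean([[1.0, np.nan], [3.0, 4.0]])
```

Each category becomes its own 0/1 column, so a tree can split on a single category without treating the labels as ordered numbers.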
Nonpartisan Traveling Politician Solver (May 2015)
Technologies used: Python
- Wrote a simulated annealing algorithm to quickly and accurately find valid, low-cost solutions to hundreds of instances of a modified version of the traveling salesman problem. See the README on GitHub for a more in-depth description of the problem.
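For readers unfamiliar with the technique, the following is a minimal sketch of simulated annealing on the standard symmetric traveling salesman problem. It does not reproduce the modified problem or the project's actual code: the names `tour_cost` and `anneal`, the segment-reversal move, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def tour_cost(tour, dist):
    """Total length of the closed tour under the distance matrix dist."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def anneal(dist, t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Search for a short tour; returns (best_tour, best_cost).

    Assumes at least two cities. Parameter defaults are illustrative.
    """
    rng = random.Random(seed)
    n = len(dist)
    tour = list(range(n))
    rng.shuffle(tour)
    cost = tour_cost(tour, dist)
    best, best_cost = tour[:], cost
    t = t_start
    while t > t_end:
        # Propose a neighbor by reversing a random segment of the tour.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
        cand_cost = tour_cost(candidate, dist)
        delta = cand_cost - cost
        # Always accept improvements; accept worse tours with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            tour, cost = candidate, cand_cost
            if cost < best_cost:
                best, best_cost = tour[:], cost
        t *= cooling  # geometric cooling schedule
    return best, best_cost

# Toy instance, made up for the example: six points on a grid.
points = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0), (2, 1)]
dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]
best, best_cost = anneal(dist)
```

Accepting some worse tours early on lets the search escape local minima, while the falling temperature makes it increasingly greedy toward the end.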