Feature Engineering

The assembled dataset consisted of relevant information that could be leveraged to predict the number of rides that can be expected to occur between a given origin-destination tract pare within a date-time period. For feature engineering we focused on POIs per area and key types of POIs as important features to express. We first one-hot encoded the tag converting the data from a list to 104 unique features. Next we grouped by track to get a total count of POIs by type and tract. While this gives good granularity, the 104 feature columns were quite sparse so we created a Total POIs feature. We also thought it was important to express access to public transit within the tract, so we created a summary feature for all transportation (bus, subway, train, etc.) within the tract. While tracts are relatively consistent at a population level, they vary widely in area. To reflect this we found the area of each tract and divided Total_POI and Total_transport features by this area to create density based metrics.

We performed a compute intensive merge of the POIs, census and weather tables at the many levels of temporal granularity shown in the table below. We then performed exploratory data analysis inspecting the average and distribution of record level counts at each of these levels of granularity. As granularity reduced, we saw a better spread on the distribution of rideshare demand, which could lead to more accurate predictions. However, there was also an important tradeoff to be considered between working with the lower levels of granularity that were more computationally manageable, and attempting to incorporate detailed information at the level at which the demand modelling would be useful for practical applications. For this reason, we retained these varying levels of granularity and tested them throughout our modeling process.

We then normalized the data - bounding outliers, demeaning and dividing by the standard deviation per variable - for the purpose of doing PCA later, as well as for testing different regression models. After normalization, we used PCA to reduce feature dimensionality from the initial 315 features to 127, while preserving 95% of the predictive power.

Modeling Results

When choosing how to measure our accuracy we considered standard accuracy metrics like the F1 score and Root Mean Squared Error (RMSE), but decided we should stick with the Common Part of Commuters (CPC) as a well accepted standard for benchmarking commuting network analysis models. Beyond its popularity in the field, we found that this metric placed penalty in accordance with our priorities. CPC places equal weight on misestimating rides across all predictions (does not disproportionately penalize misestimates in low volume or high volume instances). It also penalized overestimates and underestimates evenly.

For modeling, it was important to our team to develop a baseline to compare our boosted regressor and random forest models against. We used the Extended Radiation Model (ERM) for this purpose. For the ERM we used the population, POIs, distance and existing trip data as our input features. We used daily trip data and assumed an alpha value ranging from 0.1 to 2.0 from which we selected 0.1. We also experimented with taking different time aggregations of the data. Hourly aggregation resulted in unreasonable predictions, so we decided to aggregate by day. We also narrowed our selected days to weekdays only (Monday to Thursday) because of their relatively similar trends in distribution. This returned a CPC of 0.35.

After observing heavy skew in our data, with the vast majority of rideshare demand being equal to 1 for a numbers for an OD pair at a given time, we considered running two models in series; a classification model to separate few trips vs many trips and then a MPL Regressor to predict the count of trips for the many-trip classified subset. This was an attempt to further reduce prediction skew. Sadly, this approach yielded an only slight improvement above baseline with a CPC of 0.41.

Running a MLP Regressor directly improved our results markedly yielding a CPC of 0.702. While we found this result initially surprising, review of the predictions rendered by this ensemble method provided insight into the observed increase in accuracy. The initial classification was assigning a Boolean high/low value to all records. Because of the heavy left skew of the data and a simple mean ride count assignment label for all low-ride predicted records, the error created through this assignment and the relatively high count of records receiving this assignment significantly added to the final error.

To further improve results, we shifted to using a gradient boosted decision tree regressor, which has proven to be useful for a variety of demand modeling problems. After tuning, the model returned a CPC value of 0.752. The figure to bellow shows these improvements in CPC by model.