Our results show that a gradient boosting regressor trained on data aggregated at the hour-weekday level predicts the number of rides most accurately. See Appendix A for the full table of results by testing method, feature transformation, and selection criteria. Each of the transformations we applied to the assembled data (outlier treatment, principal component analysis (PCA) to reduce dimensionality, and log transformations) contributed to improved model performance. Our model beat the baseline radiation model by a large margin, although the hyperparameters of the gradient boosting regressor could be tuned further. The gains from PCA also indicate that many of the features in our feature set were highly correlated with respect to the distribution patterns in the city of Chicago.
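To make the shape of such a pipeline concrete, the following is a minimal sketch rather than our actual implementation. It assumes scikit-learn, synthetic stand-in data in place of the Chicago feature set, percentile clipping as the outlier treatment, a log transform applied to the ride counts, and default hyperparameters; all variable names are illustrative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 500 hour-weekday rows with
# 10 features that are deliberately correlated through 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))
rides = rng.poisson(np.exp(1.0 + 0.5 * latent[:, 0])).astype(float)

X_train, X_test = X[:400], X[400:]
y_train, y_test = rides[:400], rides[400:]

# Outlier treatment (assumption): clip each feature to the 1st and 99th
# percentiles, with the bounds estimated on the training rows only.
low, high = np.percentile(X_train, [1, 99], axis=0)
X_train = np.clip(X_train, low, high)
X_test = np.clip(X_test, low, high)

# Scale, reduce correlated features with PCA (keep 95% of the variance),
# then fit a gradient boosting regressor on the log-transformed target.
model = TransformedTargetRegressor(
    regressor=make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95),
        GradientBoostingRegressor(random_state=0),
    ),
    func=np.log1p,          # log(1 + rides) handles zero-ride rows
    inverse_func=np.expm1,  # map predictions back to ride counts
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Wrapping the regressor in `TransformedTargetRegressor` keeps the log transform and its inverse together, so predictions come back on the original ride-count scale.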
Real-world implications
The goal of this project was to construct a transportation demand model that uses location and time characteristics to predict demand for rides in any US city. Our model was trained on data from the city of Chicago, but we see its value in its ability to predict the number of rides for any date-time combination in any US city. Improving transportation network providers’ and planners’ understanding of rideshare activity in their city of interest enables more informed decisions about investments in infrastructure (roads, bus stations, etc.) and services (more buses, new mobility companies, etc.). The primary value comes from the fact that the model can also be applied in cities or regions where no historical rideshare data exists to model demand.
While we have already discussed which accuracy measurements best capture this model’s performance, all of the measurements we considered focus on internal validity. We believe an exciting and necessary next step is to test the model’s generalizability and external validity. The model has already been refined for the Chicago dataset, so testing it on different cities, and probably retraining it, is an impactful avenue for future work. We envision building a more robust, standardized model for all cities, with additional neural network layers that learn weights on the differences between cities. In summary, we see immediate real-world value in the directional insights this model could provide to transportation decision makers. Beyond the current state of the model and the noted areas for improvement, we also see great value in our final feature set and data transformation pipeline, whose ability to capture inter-city differences may be very useful in future analyses of transportation demand.
References