Two weeks ago, Chris Whong published a massive dataset of every taxi trip taken in New York in 2013. The data, provided through a Freedom of Information Law request, includes an incredible amount of detail on where trips started, where they ended, when they occurred, how much they cost, and how many passengers there were.
A number of people have already done incredible things with this data, including making a remarkably detailed map of where cabs typically pick up and drop off passengers. A dataset of this detail opens the door for countless questions and angles of exploration.
One such question surrounds accusations that New York City cabs discriminate against potential passengers. A number of anecdotes claim that New York cabs are reluctant to stop for black passengers, especially after dark. Could this new dataset shed any light on this issue?
The dataset, which provides no personal details on passengers and drivers (sort of), can’t answer this question directly. However, by looking at where cabs pick up and drop off passengers—and by considering the racial makeup of those neighborhoods—we can start piecing together the evidence. It’s not conclusive, but it could be a start.
Where Cabs Go—And Where They Don’t
On the surface, taxis appear to avoid picking up passengers in neighborhoods more heavily populated by minorities. The chart below shows the number of taxi pickups per 1,000 people in areas bucketed by the demographic makeup of the residential population. As the chart shows (and the map above makes even clearer), the larger the minority share of an area’s population, the fewer taxi trips originate there.
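To make the bucketing concrete, here is a minimal sketch of how a pickups-per-1,000-residents figure could be computed for each demographic bucket. The tract records, the 25-point bucket width, and the function names are all invented for illustration; they are not the figures or code behind the chart.

```python
from collections import defaultdict

# Hypothetical tract records: (population, white_share, pickups).
# These numbers are invented for illustration, not taken from the taxi data.
tracts = [
    (4000, 0.85, 52000),
    (6000, 0.78, 60000),
    (5000, 0.45, 9000),
    (7000, 0.10, 1400),
    (3000, 0.15, 900),
]


def bucket_label(white_share, width=0.25):
    """Map a share in [0, 1] to a label such as '75-100% white'."""
    n_buckets = round(1 / width)
    # Clamp so that a share of exactly 1.0 falls in the top bucket.
    index = min(int(white_share / width), n_buckets - 1)
    low = round(index * width * 100)
    high = round((index + 1) * width * 100)
    return f"{low}-{high}% white"


def pickups_per_thousand(records):
    """Sum pickups and population within each bucket, then take the ratio."""
    pickups = defaultdict(int)
    population = defaultdict(int)
    for pop, white_share, n_pickups in records:
        label = bucket_label(white_share)
        pickups[label] += n_pickups
        population[label] += pop
    return {label: 1000 * pickups[label] / population[label] for label in pickups}


rates = pickups_per_thousand(tracts)
for label in sorted(rates):
    print(f"{label}: {rates[label]:.0f} pickups per 1,000 residents")
```

Summing pickups and population within a bucket before dividing weights each tract by its size, so a few tiny tracts with extreme rates don't dominate the bucket.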
A few details about the numbers above, and the numbers considered in the rest of this post, are important to note. First, the data above only includes trips from one week in June (chosen because it’s not affected by major holidays or severe weather). Even though this represents less than a 50th of the dataset, it still includes over 3 million trips. Second, because of the complex politics behind the New York City taxi system and its services outside of Manhattan, this analysis is limited to trips that originated in Manhattan (officially, New York County).
In order to determine the demographic makeup of pickup and dropoff locations, I rounded the latitude and longitude of each pickup and dropoff point to the nearest thousandth, which approximates the location within about 100 meters. For every rounded location, I looked up the Census tract that it falls in via the FCC Census block API. The Census provides demographic and economic data by Census tract, allowing the mapping between GPS coordinates and neighborhood demographics.
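The "about 100 meters" figure can be checked with a little arithmetic. The sketch below rounds a coordinate to three decimal places and estimates how large one rounding step is on the ground at Manhattan's latitude. It assumes a spherical Earth with a standard mean radius, and the function names are my own.

```python
import math

# Mean Earth radius in meters (spherical-Earth assumption).
EARTH_RADIUS_M = 6_371_000


def round_coordinate(lat, lon, places=3):
    """Round a latitude/longitude pair to the given number of decimal places."""
    return round(lat, places), round(lon, places)


def grid_cell_size_m(lat, places=3):
    """Approximate north-south and east-west size, in meters, of one rounding step."""
    step_deg = 10 ** -places
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = step_deg * meters_per_degree
    # Lines of longitude converge toward the poles, so east-west steps shrink.
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west


ns, ew = grid_cell_size_m(40.758)
print(round_coordinate(40.758896, -73.985130))
print(f"one step is about {ns:.0f} m north-south and {ew:.0f} m east-west")
```

At this latitude a step of 0.001 degrees works out to roughly 111 meters north-south and roughly 84 meters east-west, so a rounded point is at most about half a step away from the true location in each direction.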
Though the chart above raises suspicions about racial profiling among taxi drivers, the result is far from conclusive. After all, pickup rates are determined by passenger demand as well as preferences of cab drivers. Neighborhoods with more minorities could simply have fewer prospective passengers.
Why this might be true is a question worth its own exploration, but for the sake of this analysis, I added a couple factors that could serve as proxies for taxi demand:
1. Residents' incomes - Wealthier people likely take more cabs than the less-well-off. If neighborhoods with mostly white residents tend to be more affluent (which they are), the effect above could be caused by economics rather than discrimination.
2. Location - Central Manhattan is largely populated by whites. Though the graph above controls for population size, it doesn’t account for commercial or tourist activity, which is likely concentrated in central Manhattan. The apparent high rate of cab activity in white tracts could actually be because whites live in central areas with more commercial activity, while minorities live on Manhattan’s edges.
To attempt to account for these two factors, I made a simple model that estimates how many taxi pickups are expected in a Census tract given its population, median income, distance from Times Square (roughly the center of Manhattan), and non-white population.
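One way the distance variable could be computed is the great-circle (haversine) distance from each tract to Times Square. This is only a sketch: the Times Square coordinates are approximate, the Earth radius is a standard mean value, and the function names are hypothetical rather than taken from the original scripts.

```python
import math

EARTH_RADIUS_MILES = 3958.8          # mean Earth radius (spherical assumption)
TIMES_SQUARE = (40.7580, -73.9855)   # approximate coordinates, assumed for illustration


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def distance_from_times_square(lat, lon):
    """Distance in miles from a point (for example, a tract centroid) to Times Square."""
    return haversine_miles(lat, lon, *TIMES_SQUARE)


# A point in upper Manhattan, a few miles north of Times Square.
print(f"{distance_from_times_square(40.8075, -73.9626):.2f} miles")
```

Straight-line distance ignores the street grid, but as a rough proxy for how central a tract is, it is probably adequate.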
A basic model with incomes, population, distance from Times Square, and the size of the white population suggests that these other factors—and primarily distance from Times Square—account for the apparent effect shown above. Nevertheless, even controlling for these variables, the correlation between the size of the white population and taxi pickups is strong enough to at least warrant further investigation.
# This model regresses the number of pickups by Census tract against the tract's income, population, white population ratio, and distance from Times Square. As expected, incomes are positively correlated with pickups, and distance from Times Square is negatively correlated. The size of the white population is also positively correlated with pickups, though, as shown by the p-value in the final column, the relationship isn't statistically significant at conventional levels.
Regression output:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.204e+04  2.149e+03   5.600 5.69e-08 ***
median_income      6.346e-02  2.006e-02   3.163  0.00176 **
population        -7.785e-02  1.728e-01  -0.450  0.65282
white_percent      4.748e+03  3.244e+03   1.464  0.14455
distance_in_miles -2.840e+03  3.126e+02  -9.084  < 2e-16 ***
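For readers less familiar with regression tables, the t value is simply the estimate divided by its standard error, and the p-value measures how surprising a t value of that size would be if the true coefficient were zero. The sketch below recomputes the t values from the estimates and standard errors reported above. The p-values use a normal approximation, which is my simplification: R uses a t distribution whose degrees of freedom depend on the number of tracts, so the approximate figures differ slightly from the reported ones.

```python
import math

# (estimate, standard error) pairs copied from the regression output above.
coefficients = {
    "median_income": (6.346e-02, 2.006e-02),
    "population": (-7.785e-02, 1.728e-01),
    "white_percent": (4.748e+03, 3.244e+03),
    "distance_in_miles": (-2.840e+03, 3.126e+02),
}


def t_value(estimate, std_error):
    """The t statistic is the estimate measured in units of its standard error."""
    return estimate / std_error


def approx_two_sided_p(t):
    """Two-sided p-value under a standard normal approximation (not the exact t test)."""
    return math.erfc(abs(t) / math.sqrt(2))


for name, (est, se) in coefficients.items():
    t = t_value(est, se)
    print(f"{name:>18}: t = {t:7.3f}, approx p = {approx_two_sided_p(t):.4f}")
```

For white_percent, this gives a t value of about 1.46 and an approximate p-value near 0.14, in line with the table: an effect of that size would not be unusual even if race had no relationship with pickups.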
One claim worth exploring is the specific case highlighted in the anecdotes above: Cab drivers are reluctant to pick up black people (not necessarily all minorities) at night. If we limit our model to trips taken between 9:00 PM and 6:00 AM, the size of the black population appears to have a significantly negative effect on cab pickups.
# This model examines the number of pickups between 9:00 PM and 6:00 AM by Census tract. This restriction weakens the correlation between pickups and income, but the relationship between pickups and the size of the black population becomes very strong.
Regression output:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.188e+03  8.790e+02   8.177 1.53e-14 ***
median_income      1.510e-03  6.055e-03   0.249 0.803301
population        -9.184e-02  6.859e-02  -1.339 0.181815
black_percent     -3.937e+03  1.142e+03  -3.447 0.000667 ***
distance_in_miles -9.869e+02  1.194e+02  -8.266 8.56e-15 ***
Before concluding that this is damning evidence against cabs, it’s important to note that accounting for Manhattan’s geography in this model is a trickier statistical problem than just adding it as a variable to a regression. Because many of the non-white areas are far from Times Square, there’s a strong correlation between distance from Times Square and the racial makeup of the Census tract. This creates collinearity problems that undermine the results above.
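A quick way to see the collinearity problem is to measure how strongly the two predictors move together. The sketch below computes the Pearson correlation between distance and black population share, plus the variance inflation factor (VIF) that correlation implies. The tract values are invented for illustration, and the VIF formula used here is the special case for two predictors only; with the full model, each predictor would be regressed on all of the others.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors."""
    return 1 / (1 - r ** 2)


# Invented tract values: distance from Times Square and black population share.
distance_in_miles = [0.5, 1.0, 2.0, 3.5, 5.0, 6.5]
black_percent = [0.03, 0.05, 0.04, 0.20, 0.45, 0.60]

r = pearson_r(distance_in_miles, black_percent)
vif = vif_two_predictors(r)
print(f"correlation = {r:.2f}, VIF = {vif:.1f}")
```

A VIF well above 1 means the standard errors on both coefficients are inflated, which makes it hard to say whether distance or race is driving the result.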
Though more complex methods can correct this, I chose to make a simple approximation. There’s little correlation between race and distance among tracts between 3 and 4 miles from Times Square. Limiting the model to this set of tracts, we can apply it as before. In this band, only income, and not racial makeup, appears to matter.
# This model considers the number of pickups between 9:00 PM and 6:00 AM by Census tract, but only includes Census tracts between 3 and 4 miles from Times Square. This limited dataset shows no correlation between race and pickups. The model only finds a significant relationship between pickups and tract income.
Regression output:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        8.294e+02  1.821e+03   0.456    0.652
median_income      2.756e-02  3.675e-03   7.501 8.73e-09 ***
population         5.658e-02  5.119e-02   1.105    0.277
black_percent     -3.785e+02  6.655e+02  -0.569    0.573
distance_in_miles -4.579e+02  5.124e+02  -0.894    0.378
Dropoff patterns and their relationships to pickup locations could provide an additional angle of exploration. Dropoffs, as the chart below shows, are also heavily biased towards white areas.
Though this supports the argument that minority neighborhoods are underserved, it says little about racial motivations. On the one hand, if cab drivers were discriminating against minorities, we’d expect taxis to be taking fewer passengers to minority neighborhoods and therefore have fewer dropoffs in those areas. On the other hand, the low dropoff rate could be caused by the same reasons as the low pickup rate: There’s little demand to travel to and from these neighborhoods.
Dropoff data combined with subsequent pickup data could show something far more interesting. Because passengers can’t pre-arrange for a New York yellow cab to pick them up, most trips come from street hails. If cabs wanted to avoid minority fares, they would likely move away from areas where they would get hailed by minorities once they dropped off a passenger, especially if that fare was dropped off in a minority neighborhood.
Nevertheless, this approach suffers from the same flaw as the chart above—non-discriminatory cabs would exhibit this same behavior if demand were higher in white neighborhoods.
A different angle can provide some conclusions. Cabs do pick up passengers in minority neighborhoods. If these pickups were preceded exclusively by dropoffs in the same neighborhoods, that would suggest cabs only traveled to these areas when a passenger requested they do so (the reasons for their reluctance would still be unclear). However, if pickups in minority neighborhoods followed dropoffs in white neighborhoods (in other words, if drivers voluntarily traveled to non-white areas), that could provide a piece of evidence in support of non-discriminatory practices.
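The pairing of each pickup with the same driver's previous dropoff can be sketched as follows. The trip tuples, the "white"/"minority" area labels, and the function name are hypothetical, and the sketch ignores complications such as shift changes or long gaps between trips.

```python
from collections import Counter, defaultdict

# Hypothetical trips: (driver_id, pickup_time, pickup_area, dropoff_area).
# Area labels stand in for the demographic category of each Census tract.
trips = [
    ("driver_a", 1, "white", "white"),
    ("driver_a", 2, "white", "minority"),
    ("driver_a", 3, "minority", "white"),
    ("driver_a", 4, "minority", "white"),
    ("driver_b", 1, "white", "white"),
    ("driver_b", 2, "minority", "minority"),
]


def previous_dropoff_counts(trips):
    """Count (previous dropoff area, next pickup area) pairs per driver."""
    by_driver = defaultdict(list)
    for driver, pickup_time, pickup_area, dropoff_area in trips:
        by_driver[driver].append((pickup_time, pickup_area, dropoff_area))

    counts = Counter()
    for driver_trips in by_driver.values():
        driver_trips.sort()  # order each driver's trips by pickup time
        for previous, current in zip(driver_trips, driver_trips[1:]):
            counts[(previous[2], current[1])] += 1
    return counts


counts = previous_dropoff_counts(trips)
for (dropoff_area, pickup_area), n in sorted(counts.items()):
    print(f"dropoff in {dropoff_area} area, then pickup in {pickup_area} area: {n}")
```

In this toy data, two of the three pickups in minority areas follow a dropoff in a white area, which is the pattern the paragraph above treats as evidence that drivers traveled there voluntarily.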
As the chart below shows, pickups from minority neighborhoods are typically preceded by dropoffs in white neighborhoods, suggesting pickups in minority neighborhoods aren’t dependent on dropoffs in minority neighborhoods.
This ultimately provides one strong conclusion and one mixed one. First, by nearly every measure, cab drivers have a strong preference for white neighborhoods and non-white neighborhoods are severely underserved. This is undeniable.
Drivers' motivations, however, aren’t clear. There isn’t a lot of evidence for geographically opportunistic discrimination: Cabs in white neighborhoods aren’t much more likely to stay there than cabs elsewhere. Still, the degree to which minority neighborhoods are underserved and the models above, despite their flaws, do raise questions. But the evidence supports several plausible explanations for these disparities in service.
Regardless of the results of this analysis, racism exists, and the experiences described by those above—and many others—are real and can’t be discounted. No matter what this smoothed, stylized, and aggregated perspective describes, we also have to acknowledge the view from the street.
Furthermore, these arguments are only a very cursory first step into the data. There are numerous problems with these conclusions, and many questions that could be explored further. Just to name a few:
Data quality is a concern. This is particularly true for trip sequencing because some trips appear to begin before the previous trip ended.
Because GPS coordinates were rounded to the nearest thousandth of a degree (roughly 100 meters), Census tract matching is imprecise.
Demographic figures for Census tracts represent residential populations. The demographics of commuters, tourists, and others traveling to and from different tracts are unknown.
Income is an imperfect measure of taxi demand. After incomes reach a certain point, residents may begin to favor other forms of transportation, like their own cars or limo services.
Other unmeasured factors can affect demand as well. For example, older populations may be more likely to take cabs, while residents in tracts with a high concentration of restaurants and bars may prefer to stay in their neighborhoods and demand fewer cabs. Demand in these neighborhoods could represent that of outsiders as much as that of residents. Moreover, taxi demand could be endogenous to discrimination—minority populations may hail fewer cabs because they’re concerned about racism.
Census tracts are not homogeneous, but by identifying them as largely white or non-white, we’re flattening each tract into a single category. It’s possible that many passengers from a non-white tract are white, or many passengers from a white tract are non-white.
Similarly, non-white districts are homogenized into a single group. A more detailed inspection of demographics could yield different results. As noted above, there seems to be evidence that black passengers face worse treatment. This analysis could be greatly expanded, or applied to other groups.
The data only includes one form of travel. Different parts of the city may be better served by other methods of transportation (subways, buses, Green cabs); travel choices may be more affected by these options rather than the biases of cab drivers.
It’s possible that these conclusions change during different times of the day. The type of passenger that rides a cab at 9:00 AM on a Monday morning is likely different than the type of passenger looking for a cab at 11:00 PM on Friday night. These potential differences are largely unexplored.
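The trip-sequencing concern in the list above (trips that appear to begin before the previous one ended) is straightforward to check for. Here is a minimal sketch; the tuple layout, the use of seconds as the time unit, and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical trips: (driver_id, pickup_time, dropoff_time), times in seconds.
trips = [
    ("driver_a", 0, 600),
    ("driver_a", 500, 900),    # starts before the previous trip ended
    ("driver_a", 1000, 1500),
    ("driver_b", 0, 300),
    ("driver_b", 300, 700),    # starts exactly when the previous trip ended
]


def find_overlapping_trips(trips):
    """Return (driver, previous trip, current trip) for every overlapping pair."""
    by_driver = defaultdict(list)
    for driver, pickup, dropoff in trips:
        by_driver[driver].append((pickup, dropoff))

    overlaps = []
    for driver, driver_trips in by_driver.items():
        driver_trips.sort()  # order by pickup time
        for previous, current in zip(driver_trips, driver_trips[1:]):
            if current[0] < previous[1]:  # pickup precedes the prior dropoff
                overlaps.append((driver, previous, current))
    return overlaps


overlaps = find_overlapping_trips(trips)
print(overlaps)
```

Flagged pairs could be dropped before sequencing trips, or at least counted, to gauge how much the dropoff-then-pickup analysis depends on suspect records.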
All these questions and caveats could be investigated further. To that end, all of my analysis above can be easily accessed via in-text links or links below the graphs. Additional details are provided below. The links to Mode provide access to both the raw data and the underlying analysis. For anyone interested in exploring these ideas further, or digging into an entirely new issue with the same data, I invite you to copy and extend my work however you see fit.
Taxi data was provided by Chris Whong and Andrés Monroy. June data is available as trip_data_6. I trimmed this dataset down to one week in June using Python. In the process, I also added a by-driver trip counter in order to identify trip order. Data on the New York Census tracts was provided by the Census. This data was matched to GPS coordinates using another Python script. Finally, simple regression models were constructed in R. The models presented above, as well as several intermediate steps, can be found in GitHub.
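For anyone reproducing the preparation step, the trimming and trip counter could look roughly like the sketch below. It is not the script linked above. The column names, the timestamp format, and the sample rows are assumptions, and the week boundaries are placeholders, since the exact week used is not stated here.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Tiny in-memory stand-in for the trip file; column names are assumed.
SAMPLE_CSV = """driver_id,pickup_datetime
d1,2013-06-10 08:15:00
d1,2013-06-03 09:00:00
d2,2013-06-11 23:30:00
d1,2013-06-10 07:45:00
d2,2013-06-17 00:00:00
"""


def trim_and_number(csv_file, start, end):
    """Keep trips with start <= pickup < end and number each driver's trips in order."""
    rows = []
    for row in csv.DictReader(csv_file):
        pickup = datetime.strptime(row["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        if start <= pickup < end:
            rows.append({**row, "pickup": pickup})

    # Sort by driver, then time, so the counter reflects trip order.
    rows.sort(key=lambda r: (r["driver_id"], r["pickup"]))
    counters = defaultdict(int)
    for row in rows:
        counters[row["driver_id"]] += 1
        row["trip_number"] = counters[row["driver_id"]]
    return rows


# Placeholder week: June 10 through June 16, 2013.
week = trim_and_number(io.StringIO(SAMPLE_CSV), datetime(2013, 6, 10), datetime(2013, 6, 17))
for row in week:
    print(row["driver_id"], row["pickup"], row["trip_number"])
```

With the real file, the `io.StringIO` object would be replaced by an open file handle. A month of trips is large, so a streaming approach that avoids holding every row in memory may be preferable.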