Citibikes in New York City - Christian Budow

As a German, or as a European in general, you usually do not think of bikes when talking about the United States of America or New York City. You would rather think of cars and trucks first, maybe trains as well (or rather the lack of trains in the US), and then maybe the subway system in New York (and its chaos, if you have experienced it first-hand). So, let us change this up. Let us look at the citibike data for New York City: where New Yorkers start their trips, how long they ride and where they end up. More importantly, we will look at who is using the bikes. Due to a lack of resources, I will narrow this last question down to which NTA (Neighbourhood Tabulation Area) has the most bike pickups. My theory is that wealthier NTAs use the bikes more often.

To do so, I will use the Citi Bike Trip Histories from September 2017 and September 2016, group the data by NTA based on the trip coordinates, and add Census data with the median income per NTA. After a table join on the NTA names, a correlation analysis of start points per NTA against income per NTA will show whether the two variables are related. To delve deeper, I will determine whether there is a relation between the median income and the number of citibike start points per NTA, and how strong it is, based on the Pearson correlation coefficient. A coefficient of 1 means a perfect positive relationship; -1 means a perfect negative relationship (both cases are very rare). To avoid misunderstandings, I will define the strength of the relationship by the absolute value of the coefficient as follows:

0 to 0.05: no relation

0.05 to 0.10: weak relation

0.10 to 0.30: medium relation

0.30 to 0.50: strong relation

> 0.50: very strong relation

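The coefficient and the scale above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names `pearson_r` and `relation_strength` are hypothetical, the scale is applied to the absolute value of the coefficient, and a value that falls exactly on a boundary is assigned to the lower band (an assumption, chosen so that 0.50 counts as "strong", in line with "> 0.50" for "very strong").

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def relation_strength(r):
    """Label a coefficient using the scale above.

    Assumption: the scale refers to |r|, and boundary values
    belong to the lower band (so 0.50 is 'strong').
    """
    size = abs(r)
    if size <= 0.05:
        return "no relation"
    if size <= 0.10:
        return "weak relation"
    if size <= 0.30:
        return "medium relation"
    if size <= 0.50:
        return "strong relation"
    return "very strong relation"
```

With these definitions, `relation_strength(pearson_r(starts, incomes))` would label the relation between two per-NTA columns, where `starts` and `incomes` stand for any two lists of equal length.
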
Due to the size of the files and Google Colaboratory's limit of 12 GB, the source files were manually sized down at the cost of some data, from roughly 3.5 million rows to 260,000 rows. For the same reason, it was not possible to create a detailed map showing the concentration of citibike start points per Neighbourhood Tabulation Area. Similarly, more detailed maps visualising citibike trips spatially by variables such as gender or start/end point in relation to income per NTA were not possible. With a 12 GB limit, computations on big data are barely possible, or only with a considerable loss of data accuracy.
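
The files for this analysis were reduced by hand. As an alternative, a uniform random sample can be drawn while streaming the file row by row, so that the full file never has to fit into memory. The sketch below uses reservoir sampling; the function name `reservoir_sample`, the seed and the file name in the usage comment are assumptions for illustration only.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random from an iterable of unknown length.

    Only k rows are held in memory at any time (reservoir sampling).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a stored row with probability k / (i + 1).
            j = rng.randint(0, i)  # randint includes both end points
            if j < k:
                sample[j] = row
    return sample


# Usage sketch (hypothetical file name):
#
#   import csv
#   with open("citibike_trips.csv", newline="") as f:
#       reader = csv.reader(f)
#       header = next(reader)
#       rows = reservoir_sample(reader, 260_000)
```

A random sample of this kind keeps the share of trips per NTA roughly intact, which matters when start counts per NTA are later compared with each other.
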

All variables are aggregated per Neighbourhood Tabulation Area. Citibike Starts is the number of registered citibike trip starts. Median Income is based on the US Census data created in 2015 and updated in 2018. Trip Duration is the time between the start and the end of a trip, while Gender is coded as Male (1), Female (2) and Unknown (0).
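
The aggregation and the table join behind these variables can be sketched as follows. This is a simplified illustration, not the original notebook: it assumes each trip has already been assigned to an NTA, the field name `start_nta` and both function names are hypothetical, and the NTA names and numbers are made-up toy values.

```python
from collections import Counter


def starts_per_nta(trips):
    """Count trip starts per NTA; each trip is a dict with a 'start_nta' key."""
    return Counter(trip["start_nta"] for trip in trips)


def join_on_nta(starts, income):
    """Inner join on the NTA name: keep only NTAs present in both tables.

    Returns {nta_name: (number_of_starts, median_income)}.
    """
    return {nta: (starts[nta], income[nta]) for nta in starts if nta in income}


# Toy data, invented for illustration only.
trips = [
    {"start_nta": "NTA A"},
    {"start_nta": "NTA A"},
    {"start_nta": "NTA B"},
    {"start_nta": "NTA C"},  # no income record, dropped by the join
]
median_income = {"NTA A": 100_000, "NTA B": 50_000, "NTA D": 70_000}

table = join_on_nta(starts_per_nta(trips), median_income)
```

An inner join silently drops NTAs that appear in only one of the two tables, so it is worth checking how many NTAs are lost at this step before interpreting any correlation.
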

As shown above, there is a strong correlation between the median income per NTA and the number of citibike starts in the same NTA. Based on our previous definition of the strength of a correlation, we can say that there is a strong positive correlation (0.5) between those two variables. The data therefore supports my original hypothesis: citibikes are used more often in areas with a high median income, which suggests that wealthier people in particular use citibikes in New York City. There are most likely many underlying factors behind this correlation. However, as a graduate student looking for a new apartment close to the Morningside Campus, I experience first-hand the high rents close to Columbia University, a very desirable area. Based on this example, I argue that people with a high income can afford apartments relatively close to their workplace and other places of interest, which makes biking a very plausible mode of transport. People with a lower income cannot afford to live close to the city centre, making them more reliant on other modes, like the subway or cars.

This argument is supported by the fact that trip duration has a medium negative correlation with citibike starts (-0.13) and median income (-0.14). This implies that the more citibike starts there are in an NTA, the shorter the trips, and likewise for median income. In other words, in NTAs with a higher median income and in NTAs with many citibike users, the bikes are used for shorter periods of time. There is a weak positive relation between gender and citibike starts (0.091) and between gender and median income (0.086), implying that the gender mix of riders differs only little with the number of citibike starts and with median income. Additionally, all genders use the bikes for a similar amount of time, as there is no observable relation between gender and duration (-0.0046).

This displays the same data as Figure 2 but in a different way. Figure 3 shows each Neighbourhood Tabulation Area for each pairing of the above-mentioned variables. It also shows outliers: for example, there is one NTA with a low number of citibike starts but the highest trip duration, and a similar outlier appears for median income. This raises the question of how reliable the data is. Is the recorded end point really the end of a biking trip, or do some users tend to forget to "end the bike trip"? Similarly, you should consider how much data is available per NTA. If there is one passionate citibike rider in an NTA with few users, the data for that NTA will be skewed. In this case, you could compare the boroughs with each other, but unfortunately Manhattan dominates, while the four other boroughs are far behind in ridership.