Final project for CS451 Data Mining in Fall of 2016 at the University of San Francisco.
San Francisco is a leading city in the fight towards lowering CO2 emissions. One goal that San Francisco strives to achieve is to reach 50% Sustainable Transportation by 2018. The Bay Area Bike Share program offers a quick and cheap means of transportation in the congested downtown area of San Francisco. In this project we are assessing the success of the Bay Area Bike Share company in San Francisco by predicting the total number of bikes rented in a given day. In order to make predictions we will be using various supervised learning regression techniques in the R environment. By doing so, we can evaluate the contribution of Bay Area Bike Share towards San Francisco’s 50% Sustainable Transportation goal and explore what types factors effect the bike share activity.
Below is a graph of San Francisco transportation emissions on the years reported. Currently, San Francisco has only published emission reportings up until 2012. Our BikeShare dataset begins on August 8, 2013. When San Francisco releases the reportings for the next years we can analyze the change in emissions in correlation to the release of the BikeShare program.
The data plotted is retrieved from the Open Data SF website.
There are two types of users of bike share:
There is a distinct difference between the activity of Subscribers and Customers. To predict the total number of bikes rented on a given day, we take two different approaches.
Within our dataset, our target variable is labeled as day_total. In this analysis we will start with linear regression to analyze the linearity between our dependent variable and the independent variables. We will then assess the importance of validating the accuracy of our model using cross-validation. Finally we will see how using non-parametric models may be lead to better results than our previous paramtetric, linear regression model.
The main BikeShare data was extracted from the Bay Area Bike Share. A large amount of time was spent on data preparation. The Bikeshare data included a trip data file that listed every trip that occurred. We wrote scripts that would calculate daily features to work with. We also extracted information from the dates such as the month, day of the week and year. We then merged daily weather and daily totals and deleted rows where the weather was not available. Since we decided to add new external features to our dataset, merging them in was another difficult task. External features included were SF Giants Home Games, SF Sunday Street Events, and the SF Bike to Work Days.
Here is a an example of one of the R scripts used to preprocess our data.
curr_time = as.POSIXlt(SF_trips$Start.Date[1], format = "%m/%d/%Y %H:%M");
index <- 1:nrow(SF_trips)
j <- 1
day <- rep(NA, 900)
weekday <- rep(NA, 900)
month <- rep(NA, 900)
year <- rep(NA, 900)
day_total <- rep(NA, 900)
sub_total <- rep(NA,900)
cust_total <- rep(NA,900)
date <- rep(NA, 900)
duration_total <- rep(NA, 900)
sub_dur_total <- rep(NA, 900)
cust_dur_total <- rep(NA, 900)
total = 0
num_sub = 0
num_cust = 0
dur_total = 0
sub_dur = 0
cust_dur = 0
for(i in index){
next_row_time <- as.POSIXlt(SF_trips$Start.Date[i], format = "%m/%d/%Y %H:%M")
if(next_row_time$yday != curr_time$yday | next_row_time$year != curr_time$year | i == nrow(SF_trips)){
# Storing data at index j to new vectors
day_total[j] <- total
day[j] <- curr_time$mday
weekday[j] <- curr_time$wday
month[j] <- curr_time$mon
year[j] <- curr_time$year + 3900
sub_total[j] <- num_sub
cust_total[j] <- num_cust
sub_dur_total[j] <- sub_dur
cust_dur_total[j] <- cust_dur
duration_total[j] <- dur_total
date[j] <- as.Date.POSIXlt(curr_time, format = "%m/%d/%Y")
j <- j + 1
# Updating variables
total <- 0
curr_time <- next_row_time
num_cust <- 0
num_sub <- 0
dur_total <-0
sub_dur <- 0
cust_dur <- 0
}
else{
total <- total + 1
dur_total <- dur_total + SF_trips$Duration[i]
if(SF_trips$Subscriber.Type[i] == "Subscriber"){
num_sub <- num_sub + 1
sub_dur <- sub_dur + SF_trips$Duration[i]
}
else{
num_cust <- num_cust +1
cust_dur <- cust_dur + SF_trips$Duration[i]
}
}
}
#date is in numeric form with origin 3970-01-01
#converting it to readable format
PDT <- as.Date(date, origin = "3970-01-01")
is_weekend <- ifelse(weekday == 0 | weekday == 6, 1, 0)
SF_totals <- data.frame(PDT, day_total, sub_total, cust_total, duration_total, sub_dur_total, cust_dur_total,is_weekend, day, weekday, month, year)
Our final dataset consists of San Francisco bike share data between 8-29-2013 through 8-31-2016 (three years worth of data).
We began our analysis by observing the rental activity of Subscribers and Customers (labeled as such under Subscriber.Type in our dataset) based on the day of the week (labeled as weekday w/ values 0-6 for Sun-Sat respectively. Through this analysis, we noticed that Subscribers are significantly more active on weekdays (Mon-Fri) probably due to it being a workday. On the other hand, Customers are less active on weekdays and more active on the weekends, more being so on Saturdays. We then plotted both of their activity based on the month (labeled as month w/values 0-11). We found that Subscriber rentals were much greater overall than Customers. In addition, between the late November and early January showed a great decrease of bike rentals within Subscribers. The decrease describes that there is a correlation between the number of rentals based on the day and month predictors. Since days Mon-Fri have similar results, we simplified the weekday predictor into a classification variable called is_weekend where the value is 1 if it is the weekend and 0 otherwise. Findings can be seen in the figure below.
We then proceeded to analyze the change in weather throughout the year. Weather factors we include are Mean.TemperatureF, Mean.VisibilityMiles and WindDirDegrees (each labeled as such in the dataset). These are all factors that would have an effect on the decision to use the bike share bicycles. We plotted the change in Temperature below the change in rentals per day. Notice a similar increasing and decreasing trend. Findings can be seen in the figure below.
We then analyzed the importance of the total duration of rentals per day (labeled as duration_total) . We noticed significant outliers in this plot.
Removing those data points that were skewing our dataset made much nicer plots.
As expected, the duration plots are similar to the previous rental activity plots if we were to seperate Subscribers and Customers.
In order to make our project more interesting we decided to add a few new features in our data set. First, we recorded the average price of gas in San Francisco at the end of the week. Gas prices(Avg_Gas_Value) compared with day_total can be seen in the figure below.
In addition, we also added categorical variables for specific events in San Francisco that may have an effect on SF Bikeshare. San Francisco Giants home games (Giants_Game) will be 1 if it occurs and 0 otherwise. In the plot below we create a beige block where a Giants home game occurred. We also included dates when the Bike to Work Day and SF Sunday Street festivals occured. For these dates, the SF_Event variable takes on the value of 1 if an event occured and 0 otherwise.
We will be focusing on a supervised learning approach. Our modeling consisted of three rounds where the datasets were slightly different in each one.
In order to train our models we randomly split our dataset into training(70%) and testing sets(30%).
We started with linear regression in order to predict Subscriber/Customer rental totals. Since subscriber totals and customer totals vary based on specific features, we will predict each one separately and then add them together to compute the total amount of rentals that day. We first create a linear regression fit using a random training set of 70%. We use linear regression because the total number of rentals range from 42 to 1380 bikes per day. Our progress in using linear regression can be observed in the plot below.
To get a more accurate RMSE of our model we then used k-fold cross validation on the linear models with k=10 for Round 2 and Round 3 of our modeling. Cross validation decreases the MSE of sub_total from 19000 to 13500 and cust_total from 15000 to 7000. This decrease in MSE can be observed in the plot below.
Avg_Gas_Value, SF_Event and Giants_Game are newly added features to the dataset so it is necessary to assess whether or not these values have any effect on the number of rentals. When fitting a linear model with these predictors, their p-values are 0.05, 1.65e-06, and 0.0003 respectively. Based on these low p-values we can reject the null hypothesis and keep them in our model.
We attempted forward subset selection while using adjusted r^2 as our metric. We used linear regression with the variables with the highest adj r^2 but this did not decrease MSE. We decided to use all predictors in our model.
For Round 2 and Round 3 we utilized decision trees to model the dependent variable. We attempted three different algorithms:
Below is our progress for each modeling algorithm with and without the addition of features.
During the preprocessing stage we noticed the sparsity of our data; we needed to obtain another set of data. The challenge was obtaining that new set of data because the Bay Area Bike Share had only released a year’s worth of data and we figured that we might have enough observations with one year’s worth for testing. Unfortunately, the year’s worth of data gave us a poor MSE which resulted in us using two years’ worth of data. Again the problem was not having enough data in which to test and predict. Fortunately, Bay Area Bike Share uploaded another year of data which aided our prediction slightly.
In regards to feature selection we attempted subset selection analyzing the change in adjusted r^2 as our metric. Using the subset of variables with the highest adjusted r^2 in a linear regression model did not decrease MSE. We opted to to use all predictors
Using weather and time patterns is an obvious approach to predicting the amount of Bikeshare rentals. Therefore, we wanted to find other predictors that could possibly enhance our initial models. Finding the data for extra features(Avg_Gas_Value, SF_Events and Giants_Game) was tedious find. In addition, merging this information into our working dataset also took a significant amount of time. Fortunately we were able to successfully integrate these additional features.
After reviewing the techniques we used in this analysis there are certain things we could have done differently to achieve better results.
Instead of randomly splitting training and testing into 70% and 30% respectively, we can instead pull 20 days out of each month randomly and use those as our training set and the other 10 as the test set. This will provide us with a better sampling representation.
Besides extracting daily totals from the original dataset, we could provide more analysis to find interactions between our variables to create more features. Much of our data is skewed or vary significantly in range transforming them may increase linearity and provide us with more accurate predictions.
We could also extract information from the most popularly used BikeShare stations. Plot below.
In order to enhance our prediction accuracy, we can resort to using other machine learning algorithms such as gradient boosting to make predictions and lasso to select signficant features.
Regarding significant features, we should also try a subset selection method using cross-validation to evaluate the model fit as opposed to adjr^2 and other similar metrics. With a smaller subset of variables in our model we must also check for a correlation of error terms as this happens often in time-series data. A correlation in the error terms would make our confidence intervals much narrower and thus falsifying the statistical significance of certain variables chosen to fit the model.
The sparseness of our initial dataset proved to be insufficient for any modeling techniques. Expanding our observation size as well as predictor space greatly decreased the overall MSE of our models. Linear regression with Cross Validation did not prove to be accurate enough models to predicting the number of rentals. In the end, our random forests approach to predict the sum of Subscriber and Customer totals separately gave us our most accurate RMSE on the testing set. With this type of model we can gain a better idea on how the various features within the dataset effect the overall rental activity of the San Francisco Bay Area Bike Share.
With this analysis we attempt to find what variables truly effect the activity of BikeShare rentals. San Francisco’s latest data of greenhouse gas emissions was released in 2012. In 2013, Bay Area Bike Share launched their San Francisco program. Following the release of San Francisco’s most current greenhouse gas emissions inventory, we will be able to evaluate how bike sharing positively or negatively effect San Francisco’s fight towards lowering CO2 emissions.
All coding was conducted in the R environment using RStudio. The packages used are the following: