The following is an analytics report created as part of the Google Data Analytics Capstone.
I am a junior data analyst working with the marketing analyst team at the fictional company Cyclistic, a bike-share company in Chicago. The financial researchers at Cyclistic have determined that annual members are significantly more profitable than casual riders. The company’s future success, in the opinion of Lily Moreno, director of marketing, hinges on increasing the number of annual members. To turn casual riders into annual members, the marketing analytics team will create a new marketing plan. By analyzing past data for trends, the marketing team will be able to better understand how casual riders and annual members use the Cyclistic service differently.
We will use Cyclistic’s historical trip data to examine and pinpoint user trends. The data will be downloaded from a Cyclistic-provided AWS S3 bucket and kept locally on a secure storage drive. Because we are looking at current patterns, only the most recent data (August 2021 through August 2022) will be included in the analysis. Cyclistic gathers the data directly from user activity and maintains it. Motivate International Inc. provides the data to the general public under this Data License Agreement.
The trip data offered will be sufficient because we are searching for patterns in how consumers use the service. In order to undertake an accurate analysis, the data will need to be cleaned and examined for null or empty items. I am unable to use the riders’ personally identifying information due to data privacy concerns. I won’t be able to determine whether casual riders reside in the Cyclistic service region or whether they have purchased several passes by connecting pass purchases to credit card numbers.
Given a choice between the spreadsheet software Google Sheets and the programming language R, I chose R. With a scripting language like R, every step of the analysis is documented in code. Additionally, if this type of report is required regularly, it can be regenerated automatically whenever needed.
Although the data is from a trustworthy source, it is in less than ideal shape because some station names are missing and the GPS coordinates have been generalized. All GPS locations must have the same level of accuracy and be marked with station names and an ID in order to maintain the integrity of the data. Therefore, cleaning up the data by deleting rows of information that contain redundant or inconsistent data will make it more orderly. The data will also be cleaned and categorized with columns for ride duration and day of the week the ride began.
To begin, I loaded the tidyverse, an opinionated collection of R packages designed for data science. The packages within the tidyverse share a common data structure, grammar, and design philosophy, which makes it ideal for our project.
library(tidyverse)
Downloaded from Amazon Web Services, the original data is stored on a secure local drive and extracted to a data folder. Prior to cleaning or analysis, I inspected each CSV file in Excel to ensure that column names match across all files. Using the method below, I list the files in the folder, import each one with the read_csv function, and bind the rows together by matching column names. Finally, the clean_names function from the janitor package standardizes the column names so that stray white space will not introduce errors into the analysis.
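# List every file in the data folder, read each CSV, and stack the rows
csv_data <- list.files(path = "./data", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows() %>%
  # Standardize column names (lower case, underscores, no stray spaces)
  janitor::clean_names()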
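The manual header check done in Excel could also be scripted. The following is a minimal sketch in base R, not the method used in this report: `read_header` and `headers_match` are hypothetical helper names, and the temporary files only stand in for the monthly exports.

```r
# Hypothetical helper: read only the header line of a CSV file
read_header <- function(path) {
  scan(path, what = character(), sep = ",", nlines = 1, quiet = TRUE)
}

# Hypothetical helper: TRUE when every file has the same columns,
# in the same order, as the first file in the list
headers_match <- function(paths) {
  headers <- lapply(paths, read_header)
  all(vapply(headers, identical, logical(1), headers[[1]]))
}

# Demonstration with temporary files standing in for the monthly exports
demo_dir <- tempfile("cyclistic_demo_")
dir.create(demo_dir)
write.csv(data.frame(ride_id = "A1", member_casual = "member"),
          file.path(demo_dir, "2021_08.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B2", member_casual = "casual"),
          file.path(demo_dir, "2021_09.csv"), row.names = FALSE)

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
all_match <- headers_match(demo_files)
```

A check like this can run before the import step, so that a file with a renamed or missing column is caught before bind_rows silently fills the gap with missing values.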
Using this dataframe we can begin to clean and process the data for analysis.
Here is a list of column names (features) that are included in the dataset. These features will provide us the basic information needed to find out how each customer type uses the ride share service.
ride_id |
rideable_type |
started_at |
ended_at |
start_station_name |
start_station_id |
end_station_name |
end_station_id |
start_lat |
start_lng |
end_lat |
end_lng |
member_casual |
Looking at the dimensions of the data reveals the number of rows (rides recorded) by the number of columns (features) in the data. In this case the dimensions are 6,687,395 rides with 13 features in the dataset.
rows | 6687395 |
columns | 13 |
With over 6 and a half million observations, there is plenty of room to clean and process the data.
So that our analysis reflects the true habits of Cyclistic customers, it is important to remove duplicate records from the dataset. Using the unique ride IDs, I create a new dataframe that contains only distinct ride IDs and drops rows with unavailable or null data.
dist_csv_data <- csv_data %>%
  distinct(ride_id, .keep_all = TRUE) %>%
  drop_na()
Normally, I wouldn’t build a new dataframe at each step, since R lets me keep cleaning and processing the data in a single pipeline. For clarity, however, I use separate dataframes for each stage of the cleaning procedure.
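To make the effect of the de-duplication step concrete, here is a small base-R sketch on invented rows. The ride IDs and station names are made up for illustration, and `duplicated` and `complete.cases` are used here as rough stand-ins for distinct and drop_na.

```r
# Toy data: one duplicated ride ID and one row with a missing station name
rides <- data.frame(
  ride_id = c("A1", "A1", "B2", "C3"),
  start_station_name = c("Station X", "Station X", NA, "Station Y"),
  member_casual = c("member", "member", "casual", "casual"),
  stringsAsFactors = FALSE
)

# Keep the first row seen for each ride_id
unique_rides <- rides[!duplicated(rides$ride_id), ]

# Drop any remaining row that contains a missing value
complete_rides <- unique_rides[complete.cases(unique_rides), ]
```

In this toy example the four input rows shrink to three after de-duplication and to two after the incomplete row is dropped.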
We can process data to create a ride length (in minutes) column and a week day column using the dataframe with unique ride IDs. I also included a month and hour of departure column.
preped_data <- dist_csv_data %>%
  mutate(
    # Ride length in minutes (timestamps are converted to whole seconds first)
    ride_length = (as.integer(ended_at) - as.integer(started_at)) / 60,
    hour_of_start = lubridate::hour(started_at),
    # wday() defaults to 1 = Sunday through 7 = Saturday
    day_of_week = lubridate::wday(started_at),
    month_of_start = lubridate::month(started_at)
  )
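As a sanity check on the derived columns, the same values can be worked out by hand for a single ride. The sketch below uses base R only and an invented pair of timestamps; the UTC time zone is an assumption made to keep the example reproducible.

```r
# One invented ride: Saturday, 6 August 2022, lasting 15 minutes 30 seconds
started_at <- as.POSIXct("2022-08-06 17:05:00", tz = "UTC")
ended_at   <- as.POSIXct("2022-08-06 17:20:30", tz = "UTC")

# difftime() states the unit explicitly instead of dividing seconds by 60
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))

# Base-R counterparts of the hour, weekday, and month columns
start_lt <- as.POSIXlt(started_at, tz = "UTC")
hour_of_start  <- start_lt$hour      # 0 to 23
day_of_week    <- start_lt$wday + 1  # shifted so that 1 = Sunday, 7 = Saturday
month_of_start <- start_lt$mon + 1   # shifted so that 1 = January
```

For this ride the results are a ride length of 15.5 minutes, hour 17, day 7 (Saturday), and month 8, which matches the numbering used for the day and month labels in the rest of the report.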
We need to make sure that every recorded ride comes from a customer and not a maintenance worker. Once the ride duration had been calculated, I eliminated rows with inconsistent times, meaning rides whose start time was not earlier than the end time. I also took the opportunity to eliminate rows whose start station and end station were the same, which could be a sign of a customer who changed their mind or of a maintenance worker.
preped_data <- subset(preped_data, ride_length > 0 &
                        start_station_id != end_station_id)
Looking at a summary of the data we have processed so far, the minimum and maximum ride lengths reveal the presence of outliers.
## ride_id rideable_type started_at
## Length:4883222 Length:4883222 Min. :2021-08-01 00:00:04.00
## Class :character Class :character 1st Qu.:2021-10-01 17:08:52.25
## Mode :character Mode :character Median :2022-04-02 09:37:13.00
## Mean :2022-02-21 19:42:59.72
## 3rd Qu.:2022-06-27 18:05:37.50
## Max. :2022-08-31 23:58:50.00
## ended_at start_station_name start_station_id
## Min. :2021-08-01 00:03:30.00 Length:4883222 Length:4883222
## 1st Qu.:2021-10-01 17:25:41.25 Class :character Class :character
## Median :2022-04-02 09:50:28.00 Mode :character Mode :character
## Mean :2022-02-21 20:00:32.46
## 3rd Qu.:2022-06-27 18:21:46.00
## Max. :2022-09-01 19:10:01.00
## end_station_name end_station_id start_lat start_lng
## Length:4883222 Length:4883222 Min. :41.65 Min. :-87.83
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53
## end_lat end_lng member_casual ride_length
## Min. :41.65 Min. :-87.83 Length:4883222 Min. : 0.02
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 6.57
## Median :41.90 Median :-87.64 Mode :character Median : 11.15
## Mean :41.90 Mean :-87.64 Mean : 17.55
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 19.40
## Max. :42.09 Max. :-87.53 Max. :41629.17
## hour_of_start day_of_week month_of_start
## Min. : 0.00 Min. :1.000 Min. : 1.000
## 1st Qu.:11.00 1st Qu.:2.000 1st Qu.: 6.000
## Median :15.00 Median :4.000 Median : 8.000
## Mean :14.21 Mean :4.069 Mean : 7.317
## 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.: 9.000
## Max. :23.00 Max. :7.000 Max. :12.000
Although each customer type’s average ride time (in minutes) is not particularly long, outliers can distort the average and change how we see each customer type.
casual | 24.68205 |
member | 12.59823 |
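The sensitivity of the mean to a few extreme rides can be illustrated with a toy example. The numbers below are invented, not taken from the Cyclistic data, and the median is shown only as a point of comparison.

```r
# Invented ride lengths in minutes; one casual ride is an extreme outlier
toy <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length   = c(10, 20, 600, 8, 12, 16)
)

# Mean and median ride length for each customer type
mean_by_type   <- tapply(toy$ride_length, toy$member_casual, mean)
median_by_type <- tapply(toy$ride_length, toy$member_casual, median)
```

Here a single 600-minute ride pulls the casual mean up to 210 minutes while the casual median stays at 20, whereas the member mean and median are both 12. This is the kind of distortion that handling outliers is meant to limit.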
I established lower and upper bounds to handle outliers. Any observation that falls outside the range defined by the 1st and 99th percentiles of ride length is treated as an outlier. Next, I generated a list of row numbers and a dataframe of outliers so that we can create a summary.
# Create the lower and upper bounds
lower_bound <- quantile(preped_data$ride_length, 0.01)
upper_bound <- quantile(preped_data$ride_length, 0.99)
# Create a list of row numbers of observations that are outliers
outlier_list <- which(preped_data$ride_length < lower_bound |
                        preped_data$ride_length > upper_bound)
# Use the outlier list to create a dataframe of outliers
quant_outliers <- preped_data[outlier_list, c("ride_id", "ride_length")]
Looking at a summary of the outliers, we can see that they include the most important cases, such as rides likely caused by bike maintenance and inordinately long rides.
## ride_id ride_length
## Length:96034 Min. : 0.02
## Class :character 1st Qu.: 1.67
## Mode :character Median : 104.02
## Mean : 124.44
## 3rd Qu.: 141.32
## Max. :41629.17
Finally, I removed all outliers identified earlier and created a new dataframe for analysis.
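# Drop the outlier rows; unlike negative indexing, this also works if outlier_list is empty
processed_data <- preped_data[!(seq_len(nrow(preped_data)) %in% outlier_list), ]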
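The percentile trimming can be checked on a small vector where the answer is easy to verify. This is a sketch on invented values (the integers 1 to 1000), using the default quantile method in R.

```r
# Invented ride lengths: the integers 1 to 1000
ride_length <- as.numeric(1:1000)

# 1st and 99th percentiles (default quantile method)
lower_bound <- quantile(ride_length, 0.01)
upper_bound <- quantile(ride_length, 0.99)

# A logical mask keeps every value inside the bounds
keep <- ride_length >= lower_bound & ride_length <= upper_bound
trimmed <- ride_length[keep]

# Pitfall of negative indexing: an empty index vector removes everything
no_outliers <- which(ride_length > 5000)   # integer(0), nothing matches
emptied <- ride_length[-no_outliers]       # length 0, not the full vector
```

With these values the bounds are 10.99 and 990.01, so the trimmed vector keeps the 980 values from 11 to 990. The last two lines show why a logical mask is the safer choice: negative indexing with an empty index returns no rows at all.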
About 4.79 million observations remain for study after all data cleaning and processing. Ride lengths now range from just under 2 minutes to 103.2 minutes, which is a plausible range for renting a bike.
## ride_id rideable_type started_at
## Length:4787188 Length:4787188 Min. :2021-08-01 00:00:04.00
## Class :character Class :character 1st Qu.:2021-10-01 16:56:51.25
## Mode :character Mode :character Median :2022-04-02 10:23:05.00
## Mean :2022-02-21 20:02:26.35
## 3rd Qu.:2022-06-27 18:57:27.00
## Max. :2022-08-31 23:58:50.00
## ended_at start_station_name start_station_id
## Min. :2021-08-01 00:05:10.00 Length:4787188 Length:4787188
## 1st Qu.:2021-10-01 17:12:37.75 Class :character Class :character
## Median :2022-04-02 10:33:21.00 Mode :character Mode :character
## Mean :2022-02-21 20:17:50.44
## 3rd Qu.:2022-06-27 19:14:26.00
## Max. :2022-09-01 00:35:41.00
## end_station_name end_station_id start_lat start_lng
## Length:4787188 Length:4787188 Min. :41.65 Min. :-87.83
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.64
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.53
## end_lat end_lng member_casual ride_length
## Min. :41.65 Min. :-87.83 Length:4787188 Min. : 1.967
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 6.650
## Median :41.90 Median :-87.64 Mode :character Median : 11.150
## Mean :41.90 Mean :-87.64 Mean : 15.401
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 19.133
## Max. :42.09 Max. :-87.53 Max. :103.200
## hour_of_start day_of_week month_of_start
## Min. : 0.00 Min. :1.00 Min. : 1.00
## 1st Qu.:11.00 1st Qu.:2.00 1st Qu.: 6.00
## Median :15.00 Median :4.00 Median : 8.00
## Mean :14.22 Mean :4.07 Mean : 7.32
## 3rd Qu.:18.00 3rd Qu.:6.00 3rd Qu.: 9.00
## Max. :23.00 Max. :7.00 Max. :12.00
After the data has been cleansed and processed, we can now examine it for trends. We are comparing and contrasting how annual subscribers and occasional customers used the Cyclistic service over the course of the previous year.
To begin we should first take a look at the total number of rides each customer type has made.
casual | 1950072 |
member | 2837116 |
According to the table above, annual members ride more frequently. Members took about 45% more rides than casual riders, which is a 37% difference when measured against the midpoint of the two counts.
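The gap between the two ride counts can be expressed in more than one way, so it helps to spell out the arithmetic. The sketch below uses the two totals from the table; the variable names are chosen for illustration.

```r
# Ride totals from the table above
casual_rides <- 1950072
member_rides <- 2837116

gap <- member_rides - casual_rides

# Symmetric percent difference: gap relative to the midpoint of the two counts
pct_difference <- gap / mean(c(casual_rides, member_rides)) * 100

# Relative increase: how many more rides members took than casual riders
pct_more_than_casual <- gap / casual_rides * 100

# Share of all rides that were taken by members
member_share <- member_rides / (casual_rides + member_rides) * 100
```

These work out to roughly 37%, 45%, and 59% respectively, so the 37% figure corresponds to the symmetric percent difference rather than to the relative increase over casual riders.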
Annual members clearly use the service more, but we still need to know when each customer type uses the service most and least. It makes sense that customers might use the service all year round, but when does demand rise?
According to the graph above, demand increased steadily from March (3) through July (7), peaked in August (8), and then decreased significantly over the next three months. It makes sense that spring and summer would be the busiest seasons, but August (8) truly stands out. Because the data runs from August 2021 through August 2022, the August total may combine rides from two Augusts, which would exaggerate this peak.
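Since the data covers August 2021 through August 2022, counting rides by month number alone places both Augusts in the same bucket. The following sketch, on invented timestamps, contrasts that with counting by year and month together; it is an illustration of the idea and not the code behind the graph.

```r
# Invented rides spanning two Augusts and one March
toy <- data.frame(
  started_at = as.POSIXct(
    c("2021-08-03 09:00:00", "2021-08-14 18:30:00", "2022-03-05 12:00:00",
      "2022-08-02 08:15:00", "2022-08-20 16:45:00"),
    tz = "UTC"
  ),
  member_casual = c("member", "casual", "casual", "member", "casual")
)

# Month number only: both Augusts land in the same bucket
toy$month_of_start <- as.POSIXlt(toy$started_at, tz = "UTC")$mon + 1
rides_by_month <- table(toy$member_casual, toy$month_of_start)

# Year and month together: August 2021 and August 2022 stay separate
toy$year_month <- format(toy$started_at, "%Y-%m")
rides_by_year_month <- table(toy$member_casual, toy$year_month)
```

In the toy data, month 8 shows two casual rides when grouped by month number, but only one per August when grouped by year and month.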
Looking at each individual month below, we can see the total number of rides for each day of the week. On Saturdays (7) and Sundays (1), casual users outnumbered annual members in the months of May (5) through October (10). Only in October (10) did annual members peak on a Friday (6) or a Saturday (7); in every other month, they peaked on a day from Monday (2) through Thursday (5).
Illustrated below, it appears that occasional users outperform yearly members on the weekends throughout the months.
Consumers who would use Cyclistic during the weekdays appear to be annual members, but customers who enjoy the weekend prefer to use the service on a casual basis.
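One way to put a number on the weekday versus weekend split is to compute the share of each customer type's rides that fall on a Saturday or Sunday. The sketch below uses invented rows and assumes the same day numbering as the report (1 = Sunday, 7 = Saturday).

```r
# Invented rides with day_of_week coded 1 = Sunday through 7 = Saturday
toy <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  day_of_week   = c(1, 7, 4, 2, 3, 7)
)

# Flag weekend rides, then take the share of weekend rides per customer type
toy$is_weekend <- toy$day_of_week %in% c(1, 7)
weekend_share <- tapply(toy$is_weekend, toy$member_casual, mean)
```

In this toy data two of three casual rides fall on a weekend, compared with one of three member rides.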
With an idea of how customers use the service throughout the week in each month, we should also take a moment to consider how they use the service throughout the day.
The graph above displays the overall number of rides for the year broken down by hour of the day. Despite a significant decline in demand between the 23rd hour (11 p.m.) and the fourth hour (4 a.m.), it is clear that Cyclistic is in demand around the clock. The pattern suggests that annual members primarily use the service for commuting, with peak use at 8 a.m., noon, and 5 p.m. Casual users appear to reach a single peak at 5 p.m., with rides otherwise distributed rather evenly throughout the day.
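The busiest hour for each customer type can also be read directly from a table of counts instead of from the graph. This is a minimal sketch on invented start hours, not the hourly counts from the Cyclistic data.

```r
# Invented start hours (0 to 23) for a handful of rides
toy <- data.frame(
  member_casual = c(rep("member", 5), rep("casual", 4)),
  hour_of_start = c(8, 8, 17, 17, 17, 12, 15, 15, 21)
)

# Rides per customer type (rows) and start hour (columns)
rides_by_hour <- table(toy$member_casual, toy$hour_of_start)

# For each customer type, find the column with the most rides
peak_index <- apply(rides_by_hour, 1, which.max)
peak_hour <- setNames(
  as.integer(colnames(rides_by_hour)[peak_index]),
  rownames(rides_by_hour)
)
```

In the toy data the member peak is hour 17 and the casual peak is hour 15. Note that which.max reports only the first maximum, so ties and secondary peaks such as a morning commute still need to be read from the full table.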
Knowing how frequently each category of customer uses the service is important, but so is understanding how each type prefers to ride. Knowing which of the three bike types each customer selects may help us learn more about how they use Cyclistic.
While customers have a choice of three bike types, the graph above, which shows total rides each month for each customer type and bike type, reveals a strong preference for classic bikes. All three types of bikes have been used consistently throughout the year; however, annual members never use docked bikes.
The graph below breaks down total rides for each type of bike across the seven days of the week. It is evident that casual riders and annual members continue to favor the classic bike. Additionally, it appears that casual users use the docked bikes, and that they use the electric bikes more frequently on the weekends than annual members do.
We have now examined the overall number of rides throughout the previous year. Based on how each customer type uses Cyclistic, we can infer that annual members ride more frequently on weekdays during business hours and prefer a classic bike, occasionally choosing an electric bike. Casual users also prefer the classic bike, but they tend to use the service on weekends and in the evening and will also use the other two types of bikes.
Understanding users only through the frequency of their rides reveals a limited amount about each customer type’s habits. The next topic is the average amount of time an annual member or casual user spends riding their selected bike. The table that follows displays the overall mean ride time (in minutes) for each user type.
casual | 19.80125 |
member | 12.37704 |
We may examine the average time (in minutes) that each client rode their bike each month throughout the previous year, just as we did with the total number of rides each customer took. As you can see below, casual users spend more time on their bikes even if annual members ride more regularly.
According to the graph above, the longest ride times for casual users occur between March (3) and September (9). May (5) is the month when casual riders spend the most time on their bikes. Annual members keep their rides shorter, with their peak season starting in May (5) and ending in September (9). It is not surprising that the average ride duration decreases during the fall and winter.
After examining the average ride time for each user as a whole, we can further break down the year and examine the average ride time for each day of the week and each month.
As their peak season starts and spring ushers in warmer weather, we can observe above that ride lengths for casual users grow increasingly volatile. The average trip time levels off as the peak period for casual users goes on, with weekend highs and lower weekday values. Although their season doesn’t start until May (5), annual members’ ride durations start to increase slightly in March (3), similar to those of casual users. Once more, it is clear that annual members do not ride their bikes for very long.
It’s interesting to see that, like casual users, annual members’ ride durations peak on the weekends. Although casual riders’ rides are longer, ride durations for both groups seem to rise and fall on similar days.
The graph above shows that despite the difference in average ride lengths, both occasional users and yearly members choose similar days.
Next, we examine the typical ride time for each type of customer by hour. The graph below displays the average ride time for each customer type over a 24-hour period.
With dips in the early morning hours at 5 a.m. and a gradual rise until about 2 p.m., annual members continue to maintain a modest and almost consistent ride duration. Casual users appear to reach their peak at 10 a.m., which lasts through 2 p.m., then declines to an average, which holds through 3 a.m., then lowers for the night before picking up at 8 a.m. once more.
Now that we know when each type of customer likes to use the service and for how long they typically ride, we can break down the average ride time for each customer type and each bike type by month, as shown below. We can see that the casual user spends roughly the same amount of time on both the classic and the electric bike during their peak season, which runs from March (3) through September (9). Additionally, casual customers use the docked bike for significantly longer than usual, particularly from May (5) through June (7).
In addition to the aforementioned, we can also claim that annual members ride both traditional and electric bikes, with average ride times distributed throughout the year and the longest ride durations occurring during the peak season (May to September).
Below is a breakdown of the typical riding time for each type of customer for each day of the week on each type of bike. We may observe that each sort of consumer selects comparable days to boost use throughout the course of the week. Casual users appear to utilize docked bikes for a significantly longer average time than other bikes.
We have now looked at the typical ride length for each category of customer over the past year. We can infer that casual users ride longer on the weekends, particularly during the spring and summer, from late morning through late afternoon. Casual users ride both classic and electric bikes, and they take their longest trips on docked bikes. Annual members prefer electric and classic bikes and maintain a short, consistent average ride duration, with weekly peaks similar to those of casual users.
Although we have a complete picture of our clients’ riding patterns throughout the year, it’s crucial to comprehend why they choose to use the Cyclistic service. The data given included GPS information for each ride taken over the previous year. Using Tableau, I plotted the top 10 stations each sort of consumer utilized to begin their ride, and I’ve embedded interactive maps below. We will only look at the overall number of rides per station per client category in order to keep the analysis focused on our current objectives.
As seen above, annual members like to start their journeys from locations like Kingsbury St. & Kinzie St., where there are several parks, eateries, and cafes in the vicinity. Another example is the Clark & Elm St. station, which has a wealth of retail nearby as well as public transportation hubs that riders can reach by bike. While we are unsure of the exact destinations that our annual members are going to, we can say that, with average ride times of about 12 minutes, they are not traveling very far.
After examining annual members, we can move on to casual users. As seen above, casual users like stations that are close to tourist attractions. Casual users have particularly favored the Streeter Dr. & Grand Ave. station, which is close to Ohio Street Beach, Lake Point Tower, and Pier Park. The Northwestern University Chicago campus and Lakeshore East Park, an area with hotels and restaurants nearby, are both accessible from the Streeter Dr. & Grand Ave. station.
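The station rankings were produced in Tableau, but a comparable ranking could be computed in R. The sketch below is an assumption about how that might look: `top_stations` is a hypothetical helper, and the handful of rows is invented, borrowing station names mentioned in this report.

```r
# Invented rides using station names mentioned in the report
toy <- data.frame(
  member_casual = c("casual", "casual", "casual",
                    "member", "member", "member", "member"),
  start_station_name = c("Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave",
                         "Clark St & Elm St", "Kingsbury St & Kinzie St",
                         "Kingsbury St & Kinzie St", "Clark St & Elm St",
                         "Kingsbury St & Kinzie St"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most used start stations for one customer type
top_stations <- function(rides, rider_type, n = 10) {
  counts <- table(rides$start_station_name[rides$member_casual == rider_type])
  head(sort(counts, decreasing = TRUE), n)
}

top_casual <- top_stations(toy, "casual", n = 1)
top_member <- top_stations(toy, "member", n = 1)
```

Run against the processed data with the default of n = 10, a helper like this would return the station names and ride counts that could then be joined to coordinates for mapping.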
In order to investigate and identify user trends that will enable us to more fully comprehend how annual members and casual riders use Cyclistic differently, we had to look at past data that Cyclistic had supplied. We used R to prepare and analyze data that we had downloaded from an AWS S3 bucket. We also used Tableau to perform GIS analysis. By analyzing the overall number of rides, we can conclude that annual members, who prefer classic bikes, ride more frequently than casual riders, who also prefer classic bikes but will also ride electric ones, and that casual riders are the only customer category that rides docked bikes.
Although casual riders bike for longer, annual members ride more frequently, mostly during the week and during business hours, and keep their average ride time modest and consistent. Casual riders tend to use the service on weekend evenings. According to the GIS analysis, if the top 10 stations are any indicator, annual members stick to stations inside the city that are close to public transportation, while casual riders frequently use stations in busy areas near tourist attractions. Further GIS research, along with more data on user habits, would provide the best understanding of why Cyclistic customers use the service.