In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Lily Moreno: The director of marketing and is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels. Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
The aim of this analysis is to focus on Cyclistic historical bike trip data to identify trends. More specifically, the analysis should help answer the following question: How do annual members and casual riders use Cyclistic bikes differently? The insights discovered will then help design a new marketing strategy to convert casual riders into annual members as the company’s finance analysts have concluded that annual members are much more profitable than casual riders. This report covers different phases in my analysis to help answer the business questions raised by the management. These phases include Ask questions, Prepare data, Process data, Analyze data, Share data, and Act.
Before performing the analysis, the data was collected through a public domain (http://www.divvy-tripdata.s3.amazonaws.com/index.html), then wrangled to make sure it’s cleaned, reliable and error-free by finding and filling missing values and normalizing data. After that, I explored and gained insights through variables, proceeded to data visualization to better capture trends and insights and finally made highly recommendations to the executive team.
The executive team asked to analyze Cyclistic historical bike trip data as they could help design a new marketing strategy to convert more annual members for the company. More specifically, the following business question has been raised: How do annual members and casual riders use Cyclistic bikes differently?
The data to explore and analyze was made available by Motivate International Inc. through: http://www.divvy-tripdata.s3.amazonaws.com/index.html. I proceeded to the collection and storage of data by making sure they meet the requirements in terms of integrity, reliability, credibility and security. I focused my analysis on the following twelve (12) data sets: 202004-divvy-tripdata.csv 202005-divvy-tripdata.csv 202006-divvy-tripdata.csv 202007-divvy-tripdata.csv 202008-divvy-tripdata.csv 202009-divvy-tripdata.csv 202010-divvy-tripdata.csv 202011-divvy-tripdata.csv 202012-divvy-tripdata.csv 202101-divvy-tripdata.csv 202102-divvy-tripdata.csv 202103-divvy-tripdata.csv
Let’s import the data sets
April_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202004-divvy-tripdata.csv")
May_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202005-divvy-tripdata.csv")
June_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202006-divvy-tripdata.csv")
July_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202007-divvy-tripdata.csv")
August_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202008-divvy-tripdata.csv")
September_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202009-divvy-tripdata.csv")
October_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202010-divvy-tripdata.csv")
November_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202011-divvy-tripdata.csv")
December_2020_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202012-divvy-tripdata.csv")
January_2021_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202101-divvy-tripdata.csv")
February_2021_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202102-divvy-tripdata.csv")
March_2021_Trip <- read.csv("C:\\Users\\BANKS\\Documents\\Mes cours\\Bike-share Data\\202103-divvy-tripdata.csv")
Before analyzing data, let’s load a number of packages.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.2 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidyr)
library(dplyr)
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readr)
library(skimr)
library(tibble)
library(yaml)
library(gapminder)
library(ggpubr)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
Now let’s find and remove any duplicates in the data sets.
sum(duplicated(April_2020_Trip))
## [1] 0
sum(duplicated(May_2020_Trip))
## [1] 0
sum(duplicated(June_2020_Trip))
## [1] 0
sum(duplicated(July_2020_Trip))
## [1] 0
sum(duplicated(September_2020_Trip))
## [1] 0
sum(duplicated(October_2020_Trip))
## [1] 0
sum(duplicated(November_2020_Trip))
## [1] 0
sum(duplicated(December_2020_Trip))
## [1] 0
sum(duplicated(January_2021_Trip))
## [1] 0
sum(duplicated(February_2021_Trip))
## [1] 0
sum(duplicated(March_2021_Trip))
## [1] 0
No duplicates found
Here, I searched and found missing values in my data sets.
colSums(is.na(April_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 99
## start_lat start_lng end_lat end_lng
## 0 0 99 99
## member_casual
## 0
colSums(is.na(May_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 321
## start_lat start_lng end_lat end_lng
## 0 0 321 321
## member_casual
## 0
colSums(is.na(June_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 468
## start_lat start_lng end_lat end_lng
## 0 0 468 468
## member_casual
## 0
colSums(is.na(July_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 152 0 969
## start_lat start_lng end_lat end_lng
## 0 0 770 770
## member_casual
## 0
colSums(is.na(August_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 7691 0 10110
## start_lat start_lng end_lat end_lng
## 0 0 938 938
## member_casual
## 0
colSums(is.na(September_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 19901 0 23524
## start_lat start_lng end_lat end_lng
## 0 0 789 789
## member_casual
## 0
colSums(is.na(October_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 31405 0 35787
## start_lat start_lng end_lat end_lng
## 0 0 474 474
## member_casual
## 0
colSums(is.na(November_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 24434 0 26826
## start_lat start_lng end_lat end_lng
## 0 0 284 284
## member_casual
## 0
colSums(is.na(December_2020_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 111 111
## member_casual
## 0
colSums(is.na(January_2021_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 103 103
## member_casual
## 0
colSums(is.na(February_2021_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 214 214
## member_casual
## 0
colSums(is.na(March_2021_Trip))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 0 0
## start_lat start_lng end_lat end_lng
## 0 0 167 167
## member_casual
## 0
All of the data sets contain missing values in columns end_lat and end_lng while columns start_station_id and end_station_id have some missing values but in certain data sets. To handle those missing values, I will use the method of replacing with column mean.
In this section, I will create a column called “ride_length” and calculate the length of each ride in seconds for the two user types: casual riders and annual members.
April_2020_Trip$ride_length <- difftime(April_2020_Trip$ended_at,April_2020_Trip$started_at)
May_2020_Trip$ride_length <- difftime(May_2020_Trip$ended_at,May_2020_Trip$started_at)
June_2020_Trip$ride_length <- difftime(June_2020_Trip$ended_at,June_2020_Trip$started_at)
July_2020_Trip$ride_length <- difftime(July_2020_Trip$ended_at,July_2020_Trip$started_at)
August_2020_Trip$ride_length <- difftime(August_2020_Trip$ended_at,August_2020_Trip$started_at)
September_2020_Trip$ride_length <- difftime(September_2020_Trip$ended_at,September_2020_Trip$started_at)
October_2020_Trip$ride_length <- difftime(October_2020_Trip$ended_at,October_2020_Trip$started_at)
November_2020_Trip$ride_length <- difftime(November_2020_Trip$ended_at,November_2020_Trip$started_at)
December_2020_Trip$ride_length <- difftime(December_2020_Trip$ended_at,December_2020_Trip$started_at)
January_2021_Trip$ride_length <- difftime(January_2021_Trip$ended_at,January_2021_Trip$started_at)
February_2021_Trip$ride_length <- difftime(February_2021_Trip$ended_at,February_2021_Trip$started_at)
March_2021_Trip$ride_length <- difftime(March_2021_Trip$ended_at,March_2021_Trip$started_at)
Here I will create a column called “day_of_week” that will help us understand how casual riders and annual members use the company’s bikes each day of the week.
April_2020_Trip <- April_2020_Trip
April_2020_Trip$day_of_week<- weekdays(as.Date(April_2020_Trip$started_at))
May_2020_Trip <- May_2020_Trip
May_2020_Trip$day_of_week<- weekdays(as.Date(May_2020_Trip$started_at))
June_2020_Trip <- June_2020_Trip
June_2020_Trip$day_of_week<- weekdays(as.Date(June_2020_Trip$started_at))
July_2020_Trip <- July_2020_Trip
July_2020_Trip$day_of_week<- weekdays(as.Date(July_2020_Trip$started_at))
August_2020_Trip <- August_2020_Trip
August_2020_Trip$day_of_week<- weekdays(as.Date(August_2020_Trip$started_at))
September_2020_Trip <- September_2020_Trip
September_2020_Trip$day_of_week<- weekdays(as.Date(September_2020_Trip$started_at))
October_2020_Trip <- October_2020_Trip
October_2020_Trip$day_of_week<- weekdays(as.Date(October_2020_Trip$started_at))
November_2020_Trip <- November_2020_Trip
November_2020_Trip$day_of_week<- weekdays(as.Date(November_2020_Trip$started_at))
December_2020_Trip <- December_2020_Trip
December_2020_Trip$day_of_week<- weekdays(as.Date(December_2020_Trip$started_at))
January_2021_Trip <- January_2021_Trip
January_2021_Trip$day_of_week<- weekdays(as.Date(January_2021_Trip$started_at))
February_2021_Trip <- February_2021_Trip
February_2021_Trip$day_of_week<- weekdays(as.Date(February_2021_Trip$started_at))
March_2021_Trip <- March_2021_Trip
March_2021_Trip$day_of_week<- weekdays(as.Date(March_2021_Trip$started_at))
Here I found negative ride lengths which are inconsistent for my analysis. Then, I will remove them to avoid any problems with my calculations.
April_2020_Trip_V2 <- April_2020_Trip[!(April_2020_Trip$ride_length<0),]
May_2020_Trip_V2 <- May_2020_Trip[!(May_2020_Trip$ride_length<0),]
June_2020_Trip_V2 <- June_2020_Trip[!(June_2020_Trip$ride_length<0),]
July_2020_Trip_V2 <- July_2020_Trip[!(July_2020_Trip$ride_length<0),]
August_2020_Trip_V2 <- August_2020_Trip[!(August_2020_Trip$ride_length<0),]
September_2020_Trip_V2 <- September_2020_Trip[!(September_2020_Trip$ride_length<0),]
October_2020_Trip_V2 <- October_2020_Trip[!(October_2020_Trip$ride_length<0),]
November_2020_Trip_V2 <- November_2020_Trip[!(November_2020_Trip$ride_length<0),]
December_2020_Trip_V2 <- December_2020_Trip[!(December_2020_Trip$ride_length<0),]
January_2021_Trip_V2 <- January_2021_Trip[!(January_2021_Trip$ride_length<0),]
February_2021_Trip_V2 <- February_2021_Trip[!(February_2021_Trip$ride_length<0),]
March_2021_Trip_V2 <- March_2021_Trip[!(March_2021_Trip$ride_length<0),]
Now to make sure that merging the twelve data sets into one-year file will run smoothly, I will convert column “ride_length” from factor to numeric.
is.difftime(April_2020_Trip_V2$ride_length)
## [1] TRUE
April_2020_Trip_V2$ride_length <- as.numeric(as.character(April_2020_Trip_V2$ride_length))
is.numeric(April_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(May_2020_Trip_V2$ride_length)
## [1] TRUE
May_2020_Trip_V2$ride_length <- as.numeric(as.character(May_2020_Trip_V2$ride_length))
is.numeric(May_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(June_2020_Trip_V2$ride_length)
## [1] TRUE
June_2020_Trip_V2$ride_length<- as.numeric(as.character(June_2020_Trip_V2$ride_length))
is.numeric(June_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(July_2020_Trip_V2$ride_length)
## [1] TRUE
July_2020_Trip_V2$ride_length <- as.numeric(as.character(July_2020_Trip_V2$ride_length))
is.numeric(July_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(August_2020_Trip_V2$ride_length)
## [1] TRUE
August_2020_Trip_V2$ride_length <- as.numeric(as.character(August_2020_Trip_V2$ride_length))
is.numeric(August_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(September_2020_Trip_V2$ride_length)
## [1] TRUE
September_2020_Trip_V2$ride_length <- as.numeric(as.character(September_2020_Trip_V2$ride_length))
is.numeric(September_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(October_2020_Trip_V2$ride_length)
## [1] TRUE
October_2020_Trip_V2$ride_length <- as.numeric(as.character(October_2020_Trip_V2$ride_length))
is.numeric(October_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(November_2020_Trip_V2$ride_length)
## [1] TRUE
November_2020_Trip_V2$ride_length <- as.numeric(as.character(November_2020_Trip_V2$ride_length))
is.numeric(November_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(December_2020_Trip_V2$ride_length)
## [1] TRUE
December_2020_Trip_V2$ride_length <- as.numeric(as.character(December_2020_Trip_V2$ride_length))
is.numeric(December_2020_Trip_V2$ride_length)
## [1] TRUE
is.difftime(January_2021_Trip_V2$ride_length)
## [1] TRUE
January_2021_Trip_V2$ride_length <- as.numeric(as.character(January_2021_Trip_V2$ride_length))
is.numeric(January_2021_Trip_V2$ride_length)
## [1] TRUE
is.difftime(February_2021_Trip_V2$ride_length)
## [1] TRUE
February_2021_Trip_V2$ride_length <- as.numeric(as.character(February_2021_Trip_V2$ride_length))
is.numeric(February_2021_Trip_V2$ride_length)
## [1] TRUE
is.difftime(March_2021_Trip_V2$ride_length)
## [1] TRUE
March_2021_Trip_V2$ride_length <- as.numeric(as.character(March_2021_Trip_V2$ride_length))
is.numeric(March_2021_Trip_V2$ride_length)
## [1] TRUE
Now let’s convert the columns “start_station_name”, “end_station_name”, “start_station_id”, “end_station_id” to character so that they can merge correctly
April_2020_Trip_V2 <- mutate(April_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
May_2020_Trip_V2 <- mutate(May_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
June_2020_Trip_V2 <- mutate(June_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
July_2020_Trip_V2 <- mutate(July_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
August_2020_Trip_V2 <- mutate(August_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
September_2020_Trip_V2 <- mutate(September_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
October_2020_Trip_V2 <- mutate(October_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
November_2020_Trip_V2 <- mutate(November_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
December_2020_Trip_V2 <- mutate(December_2020_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
January_2021_Trip_V2 <- mutate(January_2021_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
February_2021_Trip_V2 <- mutate(February_2021_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
March_2021_Trip_V2 <- mutate(March_2021_Trip_V2, start_station_name = as.character(start_station_name)
,end_station_name = as.character(end_station_name), start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id))
Data has been wrangled and manipulated to ensure they are cleaned and ready for analysis. Now it is time to merge them into one-year data frame.
one_year_trips <-bind_rows(April_2020_Trip_V2,May_2020_Trip_V2,June_2020_Trip_V2,July_2020_Trip_V2, August_2020_Trip_V2, September_2020_Trip_V2, October_2020_Trip_V2, November_2020_Trip_V2,December_2020_Trip_V2, January_2021_Trip_V2, February_2021_Trip_V2, March_2021_Trip_V2)
str(one_year_trips)
## 'data.frame': 3479196 obs. of 15 variables:
## $ ride_id : chr "A847FADBBC638E45" "5405B80E996FF60D" "5DD24A79A4E006F4" "2A59BBDF5CDBA725" ...
## $ rideable_type : chr "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : chr "2020-04-26 17:45:14" "2020-04-17 17:08:54" "2020-04-01 17:54:13" "2020-04-07 12:50:19" ...
## $ ended_at : chr "2020-04-26 18:12:03" "2020-04-17 17:17:03" "2020-04-01 18:08:36" "2020-04-07 13:02:31" ...
## $ start_station_name: chr "Eckhart Park" "Drake Ave & Fullerton Ave" "McClurg Ct & Erie St" "California Ave & Division St" ...
## $ start_station_id : chr "86" "503" "142" "216" ...
## $ end_station_name : chr "Lincoln Ave & Diversey Pkwy" "Kosciuszko Park" "Indiana Ave & Roosevelt Rd" "Wood St & Augusta Blvd" ...
## $ end_station_id : chr "152" "499" "255" "657" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 42 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
## $ ride_length : num 1609 489 863 732 3175 ...
## $ day_of_week : chr "Sunday" "Friday" "Wednesday" "Tuesday" ...
summary(one_year_trips)
## ride_id rideable_type started_at ended_at
## Length:3479196 Length:3479196 Length:3479196 Length:3479196
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:3479196 Length:3479196 Length:3479196 Length:3479196
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.87 Min. :41.54 Min. :-88.07
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.64 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.08 Max. :-87.52 Max. :42.16 Max. :-87.44
## member_casual ride_length day_of_week
## Length:3479196 Min. : 0 Length:3479196
## Class :character 1st Qu.: 476 Class :character
## Mode :character Median : 874 Mode :character
## Mean : 1677
## 3rd Qu.: 1601
## Max. :3523202
Recalling that the aim of our analysis is to understand how annual members and casual riders use Cyclistic bikes differently. That is why my analysis will specifically focus on the following variables: ride_length, day_of_week, member_casual, and rideable_type.
summary(one_year_trips$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 476 874 1677 1601 3523202
table(one_year_trips$day_of_week)
##
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 514949 418505 658179 527570 466855 429639 463499
table(one_year_trips$rideable_type)
##
## classic_bike docked_bike electric_bike
## 319871 2548128 611197
table(one_year_trips$member_casual)
##
## casual member
## 1427121 2052075
From the previous output, we notice that:
Riders (casuals and members) spend on average 1677 seconds riding between two (2) stations.
Docked bike is the most popular bike type for a total of 2,548,128 trips, followed by electric bike (611,197 trips) and classic bike (319,871 trips)
Weekends have the most important traffic with Saturday (658,179 trips) and Sunday (527,570 trips)
Annual members have the most important trips (2,052,075) compared to casual riders (1,427,121 trips)
In this section, I conduct thorough analysis to compare annual members and casual riders.
number_of_rides_and_average_duration_per_usertype <- one_year_trips %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
This allows us to gain insights on how Cyclistic’s clients (members and casuals) use the company’s bikes differently in terms of weekdays, number of rides and average ride length. For example, casual riders are mostly active on Saturday, and Sunday for respectively 335,059 and 262,271 rides and spend on average 2817.35 and 3044.58 seconds between two (2) stations.
When it comes to annual members, they are mostly active on Friday (306,398 trips), and Saturday (323,120 trips), and spend respectively 920.22 and 1067.61 seconds between two (2) stations.
number_of_rides_and_biketype_per_usertype <- one_year_trips %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>%
arrange(member_casual, rideable_type)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Here I found that casual riders and annual members have the same preference (counting the number of rides for each bike type) in terms of bike type used for their trips. Indeed, docked bike is most popular, followed by electric bike, and classic bike.
Here I will go through a series of visualizations which come from previous data sets created in the analysis. For each visualization, I will go deeper in my analysis and explain the trends found.
For this first graph, I plot average ride length against weekdays for each rider type.
ggplot(data = number_of_rides_and_average_duration_per_usertype,
aes(x = day_of_week, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
scale_color_manual(values = colors)+
labs(title = "Average Duration by weekdays per User Type", x = "Weekday", y = "Average Duration")
We notice that in terms of ride duration, casual riders and annual members are mostly active at weekends. However, casual riders have on average the highest ride lengths compared to annual members. Specifically, casual riders spend on average 3044 seconds on Sunday (against 1093 seconds for annual members) and 2817 seconds on Saturday (against 1067 seconds for annual members).
Here I plot number of rides against weekdays for each rider type
ggplot(data = number_of_rides_and_average_duration_per_usertype,
aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("red", "blue"))+
labs(title = "Number of rides by weekdays per User Type", x = "Weekday", y = "Number of rides")
Compared to the previous section, here, annual members have the highest number of rides compared to casual riders. Specifically, they are mostly active on Saturday (323,120 trips) and Friday (306,398 trips), When it comes to casual riders they are mostly active on Saturday (335,059 trips), and Sunday (262,271 trips).
Here, I plot Average Duration against bike type for each rider type.
ggplot(data = number_of_rides_and_biketype_per_usertype,
aes(x = rideable_type, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("purple", "plum1"))+
labs(title = "Average Duration by Bike Type per User Type", x = "Bike Type",
y = "Average Duration")
We notice that although having the same preference for docked bikes, casual riders and annual members differ in terms of average ride length using that bike type. Indeed, casual riders have the highest duration (3061 seconds) compared to annual members (1027 seconds).
Here, I plot number of rides against bike type for each rider type.
ggplot(data = number_of_rides_and_biketype_per_usertype,
aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("dodgerblue3", "mediumaquamarine"))+
labs(title = "Number of rides by Bike Type per User Type", x = "Bike Type", y = "Number of rides")
By contrast, here we can see that annual members have the highest number of rides (1,434,720) using docked bikes compared to casual riders (1,113,408).
In general, annual members have the highest number of rides (2,052,075) compared to casual riders (1,427,121) while casual riders have the highest average ride length (18552.74 seconds) compared to annual members (6768.31 seconds).
Annual members and casual riders are mostly active at weekends and differ in terms of number of trips on Saturday: casual riders (335,059 trips) and annual members (323,120 trips); average ride length on Sunday: casual riders (3044 seconds) and annual members (1093 seconds).
Docked bike is most popular and used by annual members and casual riders. However, they differ in terms of highest ride length (3061 seconds for casual riders against 1027 seconds for annual members) and number of rides (1,113,408 rides for casual riders and 1,434,720 rides for annual members and ).
Based on previous analysis, the executive team should apply the following recommendations to design a new marketing strategy to convert casual riders into annual members:
As casual riders have the highest ride lengths but least number of rides, this shows that they are interested in using the bikes for long time. Then, the marketing team should propose them a marketing offer which will include an attractive annual membership at affordable price compared to their single-ride passes and full-day passes.
Design a special annual marketing offer which is affordable in terms of transport costs for casual riders that use the bikes to commute to work each day.
As casual riders have a particular interest for docked bikes, then design a special package exclusively for docked bikes to convert them into annual members with a bonus of bringing a family member who will benefit of a three-month subscription for free.