Bellabeat is a high-tech company that manufactures health-focused smart products. These smart devices collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. Bellabeat’s products are accompanied by the Bellabeat app, which monitors biometric and lifestyle data to help women better understand how their bodies work and make healthier choices.
The main focus of this case is to analyze fitness data collected by smart devices in order to determine how Bellabeat can become a larger player in the global smart device market. My findings can help unlock new growth opportunities for the company. After reviewing Bellabeat’s product catalog, I have decided to apply the findings from my analysis to the Bellabeat app, with the goal of increasing app usage, subscriptions and retention.
Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The app connects to Bellabeat’s line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep and stress.
Ivy: This tracker can be worn as a bracelet and correlates menstrual cycle data, lifestyle habits, and biometric readings. It reveals a comprehensive and accurate state of your body and mind. The Ivy tracker connects to the Bellabeat app to track activity, sleep and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. The membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty and mindfulness based on their lifestyle and goals.
Identify trends and gain insight into how consumers use non-Bellabeat smart devices to help guide marketing strategy for the company.
The FitBit Fitness Tracker Data was generated by thirty-three eligible FitBit users who consented to the submission of their personal tracker data, including daily, hourly, minute and second-level output for physical activity, heart rate and sleep monitoring. Variation between outputs reflects the use of different types of Fitbit trackers and individual tracking behaviors and preferences. The respondents participated in a survey distributed via Amazon Mechanical Turk, with data collected between 03/12/2016 and 05/12/2016. Robert Furberg, Julia Brinton, Michael Keating, and Alexa Ortiz are credited with their involvement in crowdsourcing and collecting this data. This data is organized and stored as FitBit Fitness Tracker Data on Kaggle and is currently owned by Mobius, a data scientist in the healthcare industry.
The FitBit Fitness Tracker Data has been confirmed as open data which is publicly accessible, exploitable, editable and is shared by Mobius for any purpose. Anyone can copy, modify, distribute and conduct a professional analysis on the work, even for commercial usage, all without asking permission.
I have been provided eighteen CSV documents to analyze. Each dataset represents quantitative data tracked by FitBit devices. Most of the data is in long format: each row is one time point per user, so each user can have data in multiple rows. Some documents are in wide format instead, with a column for every minute recorded. Each wide dataset has an accompanying long dataset with the same information. Every user has a unique ID, which appears in multiple rows because the data is tracked by day and time.
I sorted the data in each spreadsheet and filtered the tables by creating pivot tables in Google Sheets. This let me spot some preliminary trends and verify the attributes and observations of each table, including the relations between tables. I recorded the number of users in each dataset, verified the time span of the tracker data, and recorded the observations measured in the table below.
Table Name | Editor | Description | Time Frame | Unique Users (Ids) | Observations | Note |
---|---|---|---|---|---|---|
dailyActivity_merged | Google Sheets | Daily Activities Performed | Daily over 31 days | 33 | Total Steps, Tracker Distance, Distance (Very Active, Moderately Active, Lightly Active, Sedentary), Minutes (Very Active, Moderately Active, Lightly Active, Sedentary), Intensities, Calories, Logged Activities | |
dailyCalories_merged | Google Sheets | Daily Calories Burned | Daily over 31 days | 33 | Calories | |
dailyIntensities_merged | Google Sheets | Daily Intensities Performed | Daily over 31 days | 33 | Minutes (Sedentary, Lightly Active, Fairly Active, Very Active), Distance (Sedentary, Lightly Active, Fairly Active, Very Active) | |
dailySteps_merged | Google Sheets | Daily Steps Taken | Daily over 31 days | 33 | Total Steps | |
sleepDay_merged | Google Sheets | Daily Sleep Log | Daily over 31 days | 24 | Total Sleep Records, Total Minutes Asleep, Total Time in Bed | |
weightLogInfo_merged | Google Sheets | Daily Weight Log | Daily over 31 days | 8 | Weight in Kilograms, Weight in Pounds, BMI, Report Type (manual or automatic), Body Fat Percentage, Log ID | |
hourlyCalories_merged | Google Sheets | Hourly Calories Burned | Hourly over 31 days | 33 | Calories | |
hourlyIntensities_merged | Google Sheets | Hourly Intensities Performed | Hourly over 31 days | 33 | Total Intensity and Average Intensity | |
hourlySteps_merged | Google Sheets | Hourly Steps Taken | Hourly over 31 days | 33 | Total Steps | |
minuteCaloriesNarrow_merged | Google Sheets | Calories Burned per Minute | Every minute over 31 days | 33 | Calories | Every minute per row |
minuteCaloriesWide_merged | Google Sheets | Calories Burned per Minute | Every minute over 31 days | 33 | Calories | Every minute per column |
minuteIntensitiesNarrow_merged | Google Sheets | Intensity Performed per Minute | Every minute over 31 days | 33 | Intensity, range (0-3) | Every minute per row |
minuteIntensitiesWide_merged | Google Sheets | Intensity Performed per Minute | Every minute over 31 days | 33 | Intensity, range (0-3) | Every minute per column |
minuteMETsNarrow_merged | Google Sheets | METs (Metabolic Equivalents); one MET is defined as the energy used when inactive | Every minute over 31 days | 33 | METs, range (0-157) | Every minute per row |
minuteSleep_merged | Google Sheets | Sleep Value by Minute | Every minute over 31 days | 24 | Value, range (1-3) | |
minuteStepsNarrow_merged | Google Sheets | Steps Taken per Minute | Every minute over 31 days | 33 | Steps | Every minute per row |
minuteStepsWide_merged | Google Sheets | Steps Taken per Minute | Every minute over 31 days | 33 | Steps | Every minute per column |
heartrate_seconds_merged | Google Sheets | Heart Rate Log | Every second over 31 days | 14 | Value, range (36-203) | |
The sample contains only 33 users and no demographic information. This could introduce sampling bias, since I cannot tell whether the sample is representative of the population as a whole or whether it includes both male and female participants. Furthermore, the dataset is almost 7 years old, and the survey is split into two segments over a two-month period in the springtime (03/12/2016-04/11/2016 and 04/12/2016-05/12/2016).
From this point on, I have chosen to use RStudio because it is accessible, can handle the large amount of data provided, and lets me create data visualizations to share my results with stakeholders.
I will use the following packages to analyze the datasets after I upload and open them in RStudio.
install.packages("tidyverse") # assists with data import, tidying, manipulation and data visualization
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("lubridate") # contains functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time (years, months, days, hours, minutes, and seconds), algebraic manipulation on date-time and time-span objects.
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("dplyr") # provides a set of tools for efficiently manipulating datasets focusing only on dataframes
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("ggplot2") # dedicated to data visualization. It can greatly improve the quality and aesthetics of the graphics and will assist in efficiency when creating them. Ggplot2 also allows the user to build almost any type of chart.
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("tidyr") # cleans dataframes so that each row is a unit of observation and each column is a single piece of information
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
install.packages("readr") # makes it easy to get rectangular data out of comma separated (csv), tab separated (tsv) or fixed width files (fwf) and into R
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)
After observing all of the datasets on Google Sheets, I have concluded that the Daily activity dataset already contains all of the observations from the Daily calories, Daily steps and Daily intensities datasets. Another indicator is that the number of observations in each dataset is the same, 940. I will use the Daily activity, Daily sleep, Weight log and Hourly steps datasets for my analysis.
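The claim that one table already contains another can also be checked in R rather than by eye. The following is a minimal sketch on toy data, not the check actually performed here: the two small data frames stand in for the Daily activity and Daily calories files, and their column names (`ActivityDay` in particular) and values are assumptions for illustration.

```r
library(dplyr)

# Toy stand-ins for the daily activity and daily calories tables
# (hypothetical values; column names are assumed for illustration)
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000),
  Calories = c(1985, 1797, 2100)
)
daily_calories <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100)
)

# Keep only the calories rows that have no exact match in the activity table
unmatched <- anti_join(
  daily_calories,
  daily_activity,
  by = c("Id", "ActivityDay" = "ActivityDate", "Calories")
)

# Zero unmatched rows means the activity table already holds this information
nrow(unmatched)
```

If the count is zero and both tables have the same number of rows, dropping the smaller table loses nothing.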
activity <- read.csv("/cloud/project/fitbit_datasets/dailyActivity_merged.csv")
sleep <- read.csv("/cloud/project/fitbit_datasets/sleepDay_merged.csv")
weight <- read.csv("/cloud/project/fitbit_datasets/weightLogInfo_merged.csv")
h_steps <- read.csv("/cloud/project/fitbit_datasets/hourlySteps_merged.csv")
I already checked the data in Google Sheets. I just need to compare that information to verify that everything was imported correctly by using the view() and str() functions.
view(activity)
str(activity)
## 'data.frame': 940 obs. of 15 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityDate : chr "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
## $ TotalSteps : int 13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
## $ TotalDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ TrackerDistance : num 8.5 6.97 6.74 6.28 8.16 ...
## $ LoggedActivitiesDistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveDistance : num 1.88 1.57 2.44 2.14 2.71 ...
## $ ModeratelyActiveDistance: num 0.55 0.69 0.4 1.26 0.41 ...
## $ LightActiveDistance : num 6.06 4.71 3.91 2.83 5.04 ...
## $ SedentaryActiveDistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VeryActiveMinutes : int 25 21 30 29 36 38 42 50 28 19 ...
## $ FairlyActiveMinutes : int 13 19 11 34 10 20 16 31 12 8 ...
## $ LightlyActiveMinutes : int 328 217 181 209 221 164 233 264 205 211 ...
## $ SedentaryMinutes : int 728 776 1218 726 773 539 1149 775 818 838 ...
## $ Calories : int 1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...
view(sleep)
str(sleep)
## 'data.frame': 413 obs. of 5 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ SleepDay : chr "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
## $ TotalSleepRecords : int 1 2 1 2 1 1 1 1 1 1 ...
## $ TotalMinutesAsleep: int 327 384 412 340 700 304 360 325 361 430 ...
## $ TotalTimeInBed : int 346 407 442 367 712 320 377 364 384 449 ...
view(weight)
str(weight)
## 'data.frame': 67 obs. of 8 variables:
## $ Id : num 1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
## $ Date : chr "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
## $ WeightKg : num 52.6 52.6 133.5 56.7 57.3 ...
## $ WeightPounds : num 116 116 294 125 126 ...
## $ Fat : int 22 NA NA NA NA 25 NA NA NA NA ...
## $ BMI : num 22.6 22.6 47.5 21.5 21.7 ...
## $ IsManualReport: chr "True" "True" "False" "True" ...
## $ LogId : num 1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...
view(h_steps)
str(h_steps)
## 'data.frame': 22099 obs. of 3 variables:
## $ Id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ ActivityHour: chr "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
## $ StepTotal : int 373 160 151 0 0 0 0 0 250 1864 ...
I identified some problems with the timestamp data. Before I begin my analysis, I need to rename the date and time columns so they are consistent across the datasets. I also need to split those columns and convert them to date and time formats. The final steps of the process phase are to remove unnecessary columns and verify that no dataset contains duplicates.
activity <- rename_with(activity, tolower)
sleep <- rename_with(sleep, tolower)
weight <- rename_with(weight, tolower)
h_steps <- rename_with(h_steps, tolower)
head(activity)
## id activitydate totalsteps totaldistance trackerdistance
## 1 1503960366 4/12/2016 13162 8.50 8.50
## 2 1503960366 4/13/2016 10735 6.97 6.97
## 3 1503960366 4/14/2016 10460 6.74 6.74
## 4 1503960366 4/15/2016 9762 6.28 6.28
## 5 1503960366 4/16/2016 12669 8.16 8.16
## 6 1503960366 4/17/2016 9705 6.48 6.48
## loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.44 0.40
## 4 0 2.14 1.26
## 5 0 2.71 0.41
## 6 0 3.19 0.78
## lightactivedistance sedentaryactivedistance veryactiveminutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 3.91 0 30
## 4 2.83 0 29
## 5 5.04 0 36
## 6 2.51 0 38
## fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 11 181 1218 1776
## 4 34 209 726 1745
## 5 10 221 773 1863
## 6 20 164 539 1728
head(sleep)
## id sleepday totalsleeprecords totalminutesasleep
## 1 1503960366 4/12/2016 12:00:00 AM 1 327
## 2 1503960366 4/13/2016 12:00:00 AM 2 384
## 3 1503960366 4/15/2016 12:00:00 AM 1 412
## 4 1503960366 4/16/2016 12:00:00 AM 2 340
## 5 1503960366 4/17/2016 12:00:00 AM 1 700
## 6 1503960366 4/19/2016 12:00:00 AM 1 304
## totaltimeinbed
## 1 346
## 2 407
## 3 442
## 4 367
## 5 712
## 6 320
head(weight)
## id date weightkg weightpounds fat bmi
## 1 1503960366 5/2/2016 11:59:59 PM 52.6 115.9631 22 22.65
## 2 1503960366 5/3/2016 11:59:59 PM 52.6 115.9631 NA 22.65
## 3 1927972279 4/13/2016 1:08:52 AM 133.5 294.3171 NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM 56.7 125.0021 NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM 57.3 126.3249 NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM 72.4 159.6147 25 27.45
## ismanualreport logid
## 1 True 1.462234e+12
## 2 True 1.462320e+12
## 3 False 1.460510e+12
## 4 True 1.461283e+12
## 5 True 1.463098e+12
## 6 True 1.460938e+12
head(h_steps)
## id activityhour steptotal
## 1 1503960366 4/12/2016 12:00:00 AM 373
## 2 1503960366 4/12/2016 1:00:00 AM 160
## 3 1503960366 4/12/2016 2:00:00 AM 151
## 4 1503960366 4/12/2016 3:00:00 AM 0
## 5 1503960366 4/12/2016 4:00:00 AM 0
## 6 1503960366 4/12/2016 5:00:00 AM 0
activity <- activity %>%
rename(date= activitydate) %>%
mutate(date= as_date(date, format= "%m/%d/%Y"))
glimpse(activity)
## Rows: 940
## Columns: 15
## $ id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ totalsteps <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ totaldistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ trackerdistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ lightactivedistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ sedentaryactivedistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ fairlyactiveminutes <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ lightlyactiveminutes <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ sedentaryminutes <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ calories <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
sleep <- sleep %>%
  rename(date = sleepday) %>%
  # as_date() returns a plain Date and ignores `tz`, so that argument is omitted
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
glimpse(sleep)
## Rows: 413
## Columns: 5
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ date <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
## $ totalsleeprecords <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ totaltimeinbed <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
weight <- weight %>%
  # as_date() returns a plain Date and ignores `tz`, so that argument is omitted
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
glimpse(weight)
## Rows: 67
## Columns: 8
## $ id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ date <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-04-21, 2016-0…
## $ weightkg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ weightpounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ fat <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bmi <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ ismanualreport <chr> "True", "True", "False", "True", "True", "True", "True"…
## $ logid <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
h_steps <- h_steps %>%
rename(date_time= activityhour) %>%
mutate(date_time= as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz= Sys.timezone()))
glimpse(h_steps)
## Rows: 22,099
## Columns: 3
## $ id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, …
## $ date_time <dttm> 2016-04-12 00:00:00, 2016-04-12 01:00:00, 2016-04-12 02:00:…
## $ steptotal <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 221,…
I will separate the date_time column into two separate columns, one for date and one for time.
h_steps <- h_steps %>%
separate(date_time, into = c("date", "time"), sep= " ") %>%
mutate(date = ymd(date))
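Splitting on a space relies on how the date-time is converted to text. Depending on the R version, a timestamp at exactly midnight may be printed without its time part, which would leave the time column empty for those rows. As a hedged alternative, the sketch below derives both columns directly from the parsed date-time. The `h_steps_demo` data frame and its values are made up for illustration, and a fixed `"UTC"` time zone is an assumption chosen so the result does not depend on the machine.

```r
library(dplyr)

# Toy stand-in for the hourly steps table (hypothetical values)
h_steps_demo <- data.frame(
  id = c(1, 1, 1),
  activityhour = c("4/12/2016 12:00:00 AM",
                   "4/12/2016 1:00:00 AM",
                   "4/12/2016 11:00:00 PM"),
  steptotal = c(373, 160, 42)
)

h_steps_demo <- h_steps_demo %>%
  mutate(
    # Parse the 12-hour timestamp once
    date_time = as.POSIXct(activityhour,
                           format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC"),
    # Derive the date and the 24-hour time from the parsed value
    date = as.Date(date_time, tz = "UTC"),
    time = format(date_time, "%H:%M:%S")
  ) %>%
  select(id, date, time, steptotal)

h_steps_demo
```

Because `format()` is given an explicit pattern, midnight rows keep `00:00:00` instead of losing their time.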
After formatting the datasets, I will remove unnecessary columns.
weight <- weight %>%
select(-c( weightkg, fat, bmi, ismanualreport, logid))
view(weight)
activity <- activity %>%
select(-c(trackerdistance, loggedactivitiesdistance))
view(activity)
sleep<- sleep %>%
select(-c(totalsleeprecords))
view(sleep)
Lastly, I will find and remove duplicate rows and missing (NA) values.
sum(duplicated(activity))
## [1] 0
sum(duplicated(sleep))
## [1] 3
sum(duplicated(weight))
## [1] 0
sum(duplicated(h_steps))
## [1] 0
Only the sleep dataset contains duplicates (3 rows); the other datasets have none.
activity <- activity %>%
distinct() %>%
drop_na()
sleep <- sleep %>%
  distinct() %>%
  drop_na()
sum(duplicated(sleep))
## [1] 0
weight <- weight %>%
distinct() %>%
drop_na()
h_steps <- h_steps %>%
distinct() %>%
drop_na()
The duplicates in sleep have now been removed.
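Before dropping duplicates it can help to look at the affected rows. This is a small base R sketch on a made-up `sleep_demo` data frame (the values are hypothetical), showing how the duplicate count relates to the rows removed.

```r
# Toy sleep log with one exact duplicate row (hypothetical values)
sleep_demo <- data.frame(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  totalminutesasleep = c(327, 327, 384, 400)
)

# Number of rows that repeat an earlier row
n_dupes <- sum(duplicated(sleep_demo))

# Show every copy of a duplicated row, including the first occurrence
dupe_rows <- sleep_demo[duplicated(sleep_demo) |
                          duplicated(sleep_demo, fromLast = TRUE), ]

# Keep one copy of each row
sleep_clean <- unique(sleep_demo)

n_dupes
nrow(sleep_clean)
```

The cleaned table should be shorter than the original by exactly the duplicate count.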
To compare variables from the activity, sleep and weight datasets in the next phase, I will merge these three datasets into one dataset that contains all of that information.
activity_sleep <- merge(activity,
sleep, by= c("id","date"),
all.x = TRUE)
head(activity_sleep)
## id date totalsteps totaldistance veryactivedistance
## 1 1503960366 2016-04-12 13162 8.50 1.88
## 2 1503960366 2016-04-13 10735 6.97 1.57
## 3 1503960366 2016-04-14 10460 6.74 2.44
## 4 1503960366 2016-04-15 9762 6.28 2.14
## 5 1503960366 2016-04-16 12669 8.16 2.71
## 6 1503960366 2016-04-17 9705 6.48 3.19
## moderatelyactivedistance lightactivedistance sedentaryactivedistance
## 1 0.55 6.06 0
## 2 0.69 4.71 0
## 3 0.40 3.91 0
## 4 1.26 2.83 0
## 5 0.41 5.04 0
## 6 0.78 2.51 0
## veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
## 1 25 13 328 728
## 2 21 19 217 776
## 3 30 11 181 1218
## 4 29 34 209 726
## 5 36 10 221 773
## 6 38 20 164 539
## calories totalminutesasleep totaltimeinbed
## 1 1985 327 346
## 2 1797 384 407
## 3 1776 NA NA
## 4 1745 412 442
## 5 1863 340 367
## 6 1728 700 712
final_dataset <- merge(activity_sleep,
                       weight, by = c("id", "date"),
                       all.x = TRUE)
head(final_dataset)
## id date totalsteps totaldistance veryactivedistance
## 1 1503960366 2016-04-12 13162 8.50 1.88
## 2 1503960366 2016-04-13 10735 6.97 1.57
## 3 1503960366 2016-04-14 10460 6.74 2.44
## 4 1503960366 2016-04-15 9762 6.28 2.14
## 5 1503960366 2016-04-16 12669 8.16 2.71
## 6 1503960366 2016-04-17 9705 6.48 3.19
## moderatelyactivedistance lightactivedistance sedentaryactivedistance
## 1 0.55 6.06 0
## 2 0.69 4.71 0
## 3 0.40 3.91 0
## 4 1.26 2.83 0
## 5 0.41 5.04 0
## 6 0.78 2.51 0
## veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
## 1 25 13 328 728
## 2 21 19 217 776
## 3 30 11 181 1218
## 4 29 34 209 726
## 5 36 10 221 773
## 6 38 20 164 539
## calories totalminutesasleep totaltimeinbed weightpounds
## 1 1985 327 346 NA
## 2 1797 384 407 NA
## 3 1776 NA NA NA
## 4 1745 412 442 NA
## 5 1863 340 367 NA
## 6 1728 700 712 NA
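Setting `all.x = TRUE` makes `merge()` behave like a left join: every activity row is kept, and days without a matching sleep or weight record get `NA`. The toy example below illustrates this with made-up data frames (`activity_demo` and `sleep_demo` are hypothetical).

```r
# Toy activity table: three user-days (hypothetical values)
activity_demo <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 8000)
)

# Toy sleep table: a record for only one of those user-days
sleep_demo <- data.frame(
  id = 1,
  date = as.Date("2016-04-12"),
  totalminutesasleep = 327
)

# Left join: keep all rows of activity_demo, fill missing sleep values with NA
merged_demo <- merge(activity_demo, sleep_demo,
                     by = c("id", "date"), all.x = TRUE)

nrow(merged_demo)                           # same row count as activity_demo
sum(is.na(merged_demo$totalminutesasleep))  # user-days without a sleep record
```

The same logic accounts for the NA counts in the merged data: 940 activity rows minus 410 de-duplicated sleep rows leaves 530 rows without sleep values, and 940 minus 67 weight rows leaves 873 without a weight.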
Now, I will verify the number of distinct Ids in each of the unmerged datasets and then examine the merged dataset by summarizing the information it contains.
n_distinct(activity$id)
## [1] 33
n_distinct(sleep$id)
## [1] 24
n_distinct(weight$id)
## [1] 8
n_distinct(h_steps$id)
## [1] 33
This information verifies the number of participants in each dataset. There are 33 participants in the activity and h_steps datasets, 24 in the sleep dataset and 8 in the weight dataset.
Next, I will explore the summary statistics of the final dataset, which combines activity, sleep and weight:
final_dataset %>%
select(totalsteps,
totaldistance,
sedentaryminutes,
veryactiveminutes,
fairlyactiveminutes,
lightlyactiveminutes,
calories,
totalminutesasleep,
totaltimeinbed,
weightpounds)%>%
summary()
## totalsteps totaldistance sedentaryminutes veryactiveminutes
## Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.: 0.00
## Median : 7406 Median : 5.245 Median :1057.5 Median : 4.00
## Mean : 7638 Mean : 5.490 Mean : 991.2 Mean : 21.16
## 3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.: 32.00
## Max. :36019 Max. :28.030 Max. :1440.0 Max. :210.00
##
## fairlyactiveminutes lightlyactiveminutes calories totalminutesasleep
## Min. : 0.00 Min. : 0.0 Min. : 0 Min. : 58.0
## 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.:1828 1st Qu.:361.0
## Median : 6.00 Median :199.0 Median :2134 Median :432.5
## Mean : 13.56 Mean :192.8 Mean :2304 Mean :419.2
## 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:2793 3rd Qu.:490.0
## Max. :143.00 Max. :518.0 Max. :4900 Max. :796.0
## NA's :530
## totaltimeinbed weightpounds
## Min. : 61.0 Min. :116.0
## 1st Qu.:403.8 1st Qu.:135.4
## Median :463.0 Median :137.8
## Mean :458.5 Mean :158.8
## 3rd Qu.:526.0 3rd Qu.:187.5
## Max. :961.0 Max. :294.3
## NA's :530 NA's :873
Discoveries identified from this summary:
The majority of the participants are lightly active when they are not sedentary.
Average sedentary time is 991 minutes or 16.5 hours. This downtime needs to be reduced for a healthier lifestyle.
The participants sleep for about 7 hours on average (419 minutes) and spend about 7.6 hours in bed (459 minutes), so they are in bed but not asleep for roughly 39 minutes.
Average total steps per day are 7,638, slightly below 8,000 steps per day (3.3 miles), a level that, according to the NIH, lowers the risk of death from heart disease and cancer by 50% compared with taking only 4,000 steps per day. Taking 12,000 steps per day (5 miles) lowered the risk by 65% compared with 4,000 steps per day.
Source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html
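The unit conversions behind these findings can be reproduced from the means in the summary output. This is a short worked sketch; the variable names are made up for illustration, and the inputs are the rounded means printed above.

```r
# Means taken from the summary output (minutes per day)
mean_sedentary_min <- 991.2
mean_asleep_min <- 419.2
mean_in_bed_min <- 458.5

# Convert minutes to hours
sedentary_hours <- mean_sedentary_min / 60   # roughly 16.5 hours
asleep_hours <- mean_asleep_min / 60         # roughly 7 hours
in_bed_hours <- mean_in_bed_min / 60         # roughly 7.6 hours

# Time spent in bed but not asleep, in minutes
awake_in_bed_min <- mean_in_bed_min - mean_asleep_min   # roughly 39 minutes

round(c(sedentary_hours, asleep_hours, in_bed_hours, awake_in_bed_min), 1)
```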
Most of the users (87.9%) used their device for more than 21 days during the survey, while the remaining 12.1% used their device for less than 20 days.
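A usage split like this can be derived by counting the distinct days each user logged. The sketch below is one possible way to do it, not necessarily the calculation used here: `activity_demo` is a made-up table with three users, and the 21-day cut-off and group labels are assumptions. With 33 users, shares of 87.9% and 12.1% correspond to 29 and 4 users.

```r
library(dplyr)

# Toy activity log: three users with 31, 25 and 10 logged days (hypothetical)
activity_demo <- data.frame(
  id = rep(c(1, 2, 3), times = c(31, 25, 10)),
  date = c(
    seq(as.Date("2016-04-12"), by = "day", length.out = 31),
    seq(as.Date("2016-04-12"), by = "day", length.out = 25),
    seq(as.Date("2016-04-12"), by = "day", length.out = 10)
  )
)

# Count the distinct days per user and assign a usage group
usage <- activity_demo %>%
  group_by(id) %>%
  summarise(days_used = n_distinct(date), .groups = "drop") %>%
  mutate(usage_group = if_else(days_used >= 21,
                               "21 or more days",
                               "fewer than 21 days"))

# Share of users in each group, as a percentage
usage_share <- usage %>%
  count(usage_group) %>%
  mutate(percent = round(100 * n / sum(n), 1))

usage_share
```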
Bellabeat can implement a reward system in the app that activates when the user walks at least 8,000 steps in a day. For added motivation, Bellabeat can offer extra points once step and activity goals have been met, which encourages users to perform more “fairly active” and “very active” exercise. The reward points can be redeemed for discounts on other Bellabeat products.
Encourage a nighttime routine by sending a sleep reminder at least an hour before bedtime. App features could recommend habits such as limiting screen time in the hour before bedtime and finishing the last meal, caffeine or alcohol at least two hours before bedtime. Rewards can be applied when the recommended amount of sleep is achieved.
A sedentary monitor can send a signal to the smart device to remind the user to move when they have been inactive for too long and would benefit from more physical activity.
The app could launch a Bellabeat community in which users help motivate fellow members to become more active throughout the day. Users could upload pictures of their meals, activities and progress and share their own motivational messages to inspire other users like them.