Case Study 2: Bellabeat

1. Company Overview
  - 1.1 About the Company
  - 1.2 Products
2. Ask Phase
  - 2.1 Business task
  - 2.2 Stakeholders
3. Prepare Phase
  - 3.1 Dataset information
  - 3.2 Accessibility and privacy of data
  - 3.3 Data organization and verification
  - 3.4 Data credibility and integrity
4. Process Phase
  - 4.1 Installing packages and opening libraries
  - 4.2 Importing datasets and verifying import
  - 4.3 Formatting and cleaning data
  - 4.4 Merging the datasets
  - 4.5 Exploring and summarizing the data
5. Analyze and Share Phases
  - 5.1 Calories burned versus steps taken
  - 5.2 Classifying participants by daily activity level
  - 5.3 Activity habits
    * 5.3.1 Activities by the day of the week
    * 5.3.2 Activities by the hour of the day
  - 5.4 Sleep versus activities
    * 5.4.1 Distribution of sleep time
    * 5.4.2 Sleep by user type
    * 5.4.3 Total time in bed awake versus time asleep
  - 5.5 Use of smart device
6. Act Phase (Recommendations)

1. Company Overview

1.1 About the Company

Bellabeat is a high-tech company that manufactures health-focused smart products. These smart devices collect data on activity, sleep, stress, and reproductive health to empower women with knowledge about their own health and habits. Bellabeat’s products are accompanied by the Bellabeat app which monitors biometric and lifestyle data to help women better understand how their bodies work and make healthier choices.

This case study analyzes fitness data collected by smart devices to determine how Bellabeat can become a larger player in the global smart device market. My findings can help unlock new growth opportunities for the company. After reviewing Bellabeat’s product catalog, I decided to apply the findings from my analysis to the Bellabeat app, with the goal of increasing app usage, subscriptions and retention.

Empowering Women to Unlock Their Full Potential

1.2 Products

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.
Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep and stress.
Ivy: This tracker can be worn as a bracelet and correlates menstrual cycle data, lifestyle habits, and biometric readings. It reveals a comprehensive and accurate state of your body and mind. The Ivy tracker connects to the Bellabeat app to track activity, sleep and stress.
Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. The membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health, beauty and mindfulness based on their lifestyle and goals.

2. Ask Phase

2.1 Business task

Identify trends and gain insight into how consumers use non-Bellabeat smart devices to help guide marketing strategy for the company.

2.2 Stakeholders

Urška Sršen - Bellabeat’s cofounder and Chief Creative Officer
Sando Mur - Bellabeat’s cofounder and Mathematician; key member of Bellabeat executive team
Bellabeat marketing analytics team

3. Prepare Phase

3.1 Dataset information

The FitBit Fitness Tracker Data was generated by thirty-three eligible FitBit users who consented to the submission of their personal tracker data, including daily, hourly, minute and second-level output for physical activity, heart rate and sleep monitoring. Variation between outputs reflects the use of different types of FitBit trackers and individual tracking behaviors and preferences. The respondents participated in a survey via Amazon Mechanical Turk, with data collected between 03/12/2016 and 05/12/2016. Robert Furberg, Julia Brinton, Michael Keating, and Alexa Ortiz are credited with their involvement in crowdsourcing and collecting this data. The data is organized and stored as FitBit Fitness Tracker Data on Kaggle and is currently owned by Mobius, a data scientist in the healthcare industry.

3.2 Accessibility and privacy of data

The FitBit Fitness Tracker Data has been confirmed as open data which is publicly accessible, exploitable, editable and is shared by Mobius for any purpose. Anyone can copy, modify, distribute and conduct a professional analysis on the work, even for commercial usage, all without asking permission.

3.3 Data organization and verification

I have been provided eighteen CSV documents to analyze. Each dataset represents quantitative data tracked by FitBit devices. Most of the data is in long format: each row is one time point per user, so each user can have data in multiple rows. Some documents are in wide format, with a column for every minute recorded. Each wide dataset has an accompanying long dataset with the same information. Every user has a unique ID, which appears in multiple rows because the data is tracked by day and time.

I sorted the data in each spreadsheet and filtered the tables by creating pivot tables in Google Sheets. This let me discover some preliminary trends and verify the attributes and observations of each table, including the relations between tables. I recorded the number of users in each dataset, verified the time span of the tracker data, and recorded the observations measured in the table below.
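The difference between the wide and long layouts can be illustrated with a small sketch. The data frame below is made up for illustration: the column names `Calories00` and `Calories01` and all values are assumptions, not copied from the FitBit files. `tidyr::pivot_longer()` is one way to reshape it.

```r
library(tidyr)

# Hypothetical wide table: one row per user, one column per minute
wide <- data.frame(
  Id = c(1, 2),
  Calories00 = c(0.8, 1.1),
  Calories01 = c(0.9, 1.0)
)

# Reshape to long: one row per user per minute
long <- pivot_longer(
  wide,
  cols = starts_with("Calories"),
  names_to = "minute",        # the minute is taken from the column name
  names_prefix = "Calories",  # strip the prefix so only "00", "01" remain
  values_to = "calories"
)

print(long)
```

The long result has one row for every user and minute, which is the layout most tidyverse functions expect.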

Table Name	Editor	Description
dailyActivity_merged	Google Sheets	Daily Activities Performed; Time Frame: Daily over 31 days; Unique Users: 33 Ids; Observations: Total Steps, Tracker Distance, Distance (Very Active, Moderately Active, Lightly Active, Sedentary), Minutes (Very Active, Moderately Active, Lightly Active, Sedentary), Intensities, Calories, Logged Activities
dailyCalories_merged	Google Sheets	Daily Calories Burned; Time Frame: Daily over 31 days; Unique Users: 33 Ids; Observations: Calories
dailyIntensities_merged	Google Sheets	Daily Intensities Performed; Time Frame: Daily over 31 days; Unique Users: 33 Ids; Observations: Minutes (Sedentary, Lightly Active, Fairly Active, Very Active), Distance (Sedentary, Lightly Active, Fairly Active, Very Active)
dailySteps_merged	Google Sheets	Daily Steps Taken; Time Frame: Daily over 31 days; Unique Users: 33 Ids; Observations: Total Steps
sleepDay_merged	Google Sheets	Daily Sleep Log; Time Frame: Daily over 31 days; Unique Users: 24 Ids; Observations: Total Sleep Records, Total Minutes Asleep, Total Time in Bed
weightLogInfo_merged	Google Sheets	Daily Weight Log; Time Frame: Daily over 31 days; Unique Users: 8 Ids; Observations: Weight in Kilograms, Weight in Pounds, BMI, Report Type (manual or automatic), Body Fat Percentage, Log ID
hourlyCalories_merged	Google Sheets	Hourly Calories Burned; Time Frame: Hourly over 31 days; Unique Users: 33 Ids; Observations: Calories
hourlyIntensities_merged	Google Sheets	Hourly Intensities Performed; Time Frame: Hourly over 31 days; Unique Users: 33 Ids; Observations: Total Intensity and Average Intensity
hourlySteps_merged	Google Sheets	Hourly Steps Taken; Time Frame: Hourly over 31 days; Unique Users: 33 Ids; Observations: Total Steps
minuteCaloriesNarrow_merged	Google Sheets	Calories Burned per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Calories; Note: Every minute per row
minuteCaloriesWide_merged	Google Sheets	Calories Burned per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Calories; Note: Every minute per column
minuteIntensitiesNarrow_merged	Google Sheets	Intensity Performed per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Intensity - range (0-3); Note: Every minute per row
minuteIntensitiesWide_merged	Google Sheets	Intensity Performed per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Intensity - range (0-3); Note: Every minute per column
minuteMETsNarrow_merged	Google Sheets	METs (Metabolic Equivalents; one MET is defined as the energy used when inactive); Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: METs - range (0-157); Note: Every minute per row
minuteSleep_merged	Google Sheets	Sleep Value by Minute; Time Frame: Every minute over 31 days; Unique Users: 24 Ids; Observations: Value - range (1-3)
minuteStepsNarrow_merged	Google Sheets	Steps Taken per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Steps; Note: Every minute per row
minuteStepsWide_merged	Google Sheets	Steps Taken per Minute; Time Frame: Every minute over 31 days; Unique Users: 33 Ids; Observations: Steps; Note: Every minute per column
heartrate_seconds_merged	Google Sheets	Heart Rate Log; Time Frame: Every second over 31 days; Unique Users: 14 Ids; Observations: Value - range (36-203)

3.4 Data credibility and integrity

The sample contains only 33 users and no demographic information. This could introduce sampling bias, since I cannot tell whether the sample is representative of the population as a whole or whether it includes both male and female participants. Furthermore, the dataset is almost 7 years old, and the survey is split into two segments over a two-month period in the spring (03/12/2016-04/11/2016 and 04/12/2016-05/12/2016).

4. Process Phase

From this point on, I have chosen to use RStudio because it is accessible, handles the large amount of data provided, and lets me create data visualizations to share my results with stakeholders.

4.1 Installing packages and opening libraries

I will install the following packages, which will help me analyze the datasets after I upload and open them in RStudio.

Installing packages

install.packages("tidyverse") # assists with data import, tidying, manipulation and data visualization

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("lubridate") # contains functions to work with date-times and time-spans: fast and user friendly parsing of date-time data, extraction and updating of components of a date-time (years, months, days, hours, minutes, and seconds), algebraic manipulation on date-time and time-span objects.

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("dplyr") # provides a set of tools for efficiently manipulating datasets focusing only on dataframes

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("ggplot2") # dedicated to data visualization. It can greatly improve the quality and aesthetics of the graphics and will assist in efficiency when creating them. Ggplot2 also allows the user to build almost any type of chart.

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("tidyr") # cleans dataframes so that each row is a unit of observation and each column is a single piece of information

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

install.packages("readr") # makes it easy to get rectangular data out of comma separated (csv), tab separated (tsv) or fixed width files (fwf) and into R

## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.2'
## (as 'lib' is unspecified)

Opening libraries

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(dplyr) 
library(ggplot2) 
library(tidyr) 
library(readr)
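The startup message from library(tidyverse) above shows that it already attaches ggplot2, dplyr, tidyr and readr, so installing and loading them separately is optional. A minimal sketch of installing only what is missing follows; the variable names `needed` and `missing_pkgs` are my own, and the `interactive()` guard is an assumption about how the script is run.

```r
# Packages the analysis needs; tidyverse brings ggplot2, dplyr, tidyr and readr
needed <- c("tidyverse", "lubridate")

# Names from `needed` that are not installed in the current library paths
missing_pkgs <- setdiff(needed, rownames(installed.packages()))

# Install only what is missing, and only when run interactively
if (interactive() && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

This avoids reinstalling packages every time the document is knitted.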

4.2 Importing datasets and verifying import

After observing all of the datasets on Google Sheets, I have concluded that the Daily activity dataset already contains all of the observations from the Daily calories, Daily steps and Daily intensities datasets. Another indicator is that the number of observations in each dataset is the same, 940. I will use the Daily activity, Daily sleep, Weight log and Hourly steps datasets for my analysis.
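The claim that one table already contains another can also be checked in R rather than by eye. The sketch below uses toy stand-ins for the daily activity and daily calories tables; the values are made up, and the day column name `ActivityDay` is an assumption about the calories file.

```r
# Toy stand-ins for dailyActivity_merged and dailyCalories_merged (made-up values)
daily_activity <- data.frame(
  Id = c(1, 1, 2),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162, 10735, 8000),
  Calories = c(1985, 1797, 2100)
)
daily_calories <- data.frame(
  Id = c(1, 1, 2),
  ActivityDay = c("4/12/2016", "4/13/2016", "4/12/2016"),
  Calories = c(1985, 1797, 2100)
)

# Join on user and day, keeping both calorie columns side by side
check <- merge(
  daily_activity, daily_calories,
  by.x = c("Id", "ActivityDate"), by.y = c("Id", "ActivityDay"),
  suffixes = c("_activity", "_calories")
)

# TRUE when every day matched and the values agree
same_rows <- nrow(check) == nrow(daily_activity)
same_values <- all(check$Calories_activity == check$Calories_calories)
```

If both flags are TRUE for the real files, the smaller table adds no information beyond the daily activity table.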

Importing datasets

activity <- read.csv("/cloud/project/fitbit_datasets/dailyActivity_merged.csv")
sleep <- read.csv("/cloud/project/fitbit_datasets/sleepDay_merged.csv")
weight <- read.csv("/cloud/project/fitbit_datasets/weightLogInfo_merged.csv")
h_steps <- read.csv("/cloud/project/fitbit_datasets/hourlySteps_merged.csv")

Verifying imports

I already checked the data in Google Sheets. I just need to compare that information to verify that everything was imported correctly by using the view() and str() functions.

view(activity)
str(activity)

## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

view(sleep)
str(sleep)

## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

view(weight)
str(weight)

## 'data.frame':    67 obs. of  8 variables:
##  $ Id            : num  1.50e+09 1.50e+09 1.93e+09 2.87e+09 2.87e+09 ...
##  $ Date          : chr  "5/2/2016 11:59:59 PM" "5/3/2016 11:59:59 PM" "4/13/2016 1:08:52 AM" "4/21/2016 11:59:59 PM" ...
##  $ WeightKg      : num  52.6 52.6 133.5 56.7 57.3 ...
##  $ WeightPounds  : num  116 116 294 125 126 ...
##  $ Fat           : int  22 NA NA NA NA 25 NA NA NA NA ...
##  $ BMI           : num  22.6 22.6 47.5 21.5 21.7 ...
##  $ IsManualReport: chr  "True" "True" "False" "True" ...
##  $ LogId         : num  1.46e+12 1.46e+12 1.46e+12 1.46e+12 1.46e+12 ...

view(h_steps)
str(h_steps)

## 'data.frame':    22099 obs. of  3 variables:
##  $ Id          : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityHour: chr  "4/12/2016 12:00:00 AM" "4/12/2016 1:00:00 AM" "4/12/2016 2:00:00 AM" "4/12/2016 3:00:00 AM" ...
##  $ StepTotal   : int  373 160 151 0 0 0 0 0 250 1864 ...

4.3 Formatting and cleaning data

I identified some problems with the timestamp data. Before I begin my analysis, I need to rename the date and time columns to create consistency across the datasets. I also need to split those columns and convert them to date and time formats. The final steps of the process phase are to remove unnecessary columns and verify that no dataset contains duplicates.

Converting column names to lowercase

activity <- rename_with(activity, tolower)
sleep <- rename_with(sleep, tolower)
weight <- rename_with(weight, tolower)
h_steps <- rename_with(h_steps, tolower)

Verifying column name changes

head(activity)

##           id activitydate totalsteps totaldistance trackerdistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   lightactivedistance sedentaryactivedistance veryactiveminutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728

head(sleep)

##           id              sleepday totalsleeprecords totalminutesasleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   totaltimeinbed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320

head(weight)

##           id                  date weightkg weightpounds fat   bmi
## 1 1503960366  5/2/2016 11:59:59 PM     52.6     115.9631  22 22.65
## 2 1503960366  5/3/2016 11:59:59 PM     52.6     115.9631  NA 22.65
## 3 1927972279  4/13/2016 1:08:52 AM    133.5     294.3171  NA 47.54
## 4 2873212765 4/21/2016 11:59:59 PM     56.7     125.0021  NA 21.45
## 5 2873212765 5/12/2016 11:59:59 PM     57.3     126.3249  NA 21.69
## 6 4319703577 4/17/2016 11:59:59 PM     72.4     159.6147  25 27.45
##   ismanualreport        logid
## 1           True 1.462234e+12
## 2           True 1.462320e+12
## 3          False 1.460510e+12
## 4           True 1.461283e+12
## 5           True 1.463098e+12
## 6           True 1.460938e+12

head(h_steps)

##           id          activityhour steptotal
## 1 1503960366 4/12/2016 12:00:00 AM       373
## 2 1503960366  4/12/2016 1:00:00 AM       160
## 3 1503960366  4/12/2016 2:00:00 AM       151
## 4 1503960366  4/12/2016 3:00:00 AM         0
## 5 1503960366  4/12/2016 4:00:00 AM         0
## 6 1503960366  4/12/2016 5:00:00 AM         0

Renaming column activitydate to date and changing format from chr to Date (YYYY-MM-DD)

activity <- activity %>% 
  rename(date= activitydate) %>%
  mutate(date= as_date(date, format= "%m/%d/%Y"))
glimpse(activity)

## Rows: 940
## Columns: 15
## $ id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
## $ totalsteps               <int> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
## $ totaldistance            <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ trackerdistance          <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
## $ loggedactivitiesdistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactivedistance       <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
## $ moderatelyactivedistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
## $ lightactivedistance      <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
## $ sedentaryactivedistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ veryactiveminutes        <int> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
## $ fairlyactiveminutes      <int> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
## $ lightlyactiveminutes     <int> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
## $ sedentaryminutes         <int> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
## $ calories                 <int> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
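The two target classes used in this section behave differently, so it can help to confirm what a conversion returned. The sketch below uses base R on example strings in the formats seen in the raw files; the fixed "UTC" time zone is an assumption made only to keep the example reproducible.

```r
# Example strings in the two formats seen in the raw files
day_only <- c("4/12/2016", "4/13/2016")
day_time <- "4/12/2016 1:00:00 AM"

# Date class: calendar days only, no time zone involved
d <- as.Date(day_only, format = "%m/%d/%Y")

# POSIXct class: a point in time, so a time zone is needed
ts <- as.POSIXct(day_time, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# A string that does not match the format parses to NA instead of failing
stopifnot(!anyNA(d), !is.na(ts))

class(d)   # "Date"
class(ts)  # "POSIXct" "POSIXt"
```

Checking for NA after parsing catches format mismatches that would otherwise pass silently.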

Renaming column sleepday to date and changing format from chr to Date (YYYY-MM-DD)

sleep <- sleep %>%
  rename(date= sleepday) %>%
  mutate(date= as_date(date, format= "%m/%d/%Y %I:%M:%S %p")) # as_date() ignores tz, so it is not passed

glimpse(sleep)

## Rows: 413
## Columns: 5
## $ id                 <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
## $ date               <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
## $ totalsleeprecords  <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ totalminutesasleep <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
## $ totaltimeinbed     <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…

Changing column date format from chr to Date (YYYY-MM-DD)

weight <- weight %>%
  mutate(date= as_date(date, format= "%m/%d/%Y %I:%M:%S %p")) # as_date() ignores tz, so it is not passed

glimpse(weight)

## Rows: 67
## Columns: 8
## $ id             <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
## $ date           <date> 2016-05-02, 2016-05-03, 2016-04-13, 2016-04-21, 2016-0…
## $ weightkg       <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
## $ weightpounds   <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
## $ fat            <int> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bmi            <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
## $ ismanualreport <chr> "True", "True", "False", "True", "True", "True", "True"…
## $ logid          <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…

Renaming column activityhour to date_time and changing format from chr to POSIXct (YYYY-MM-DD HH:MM:SS)

h_steps <- h_steps %>% 
  rename(date_time= activityhour) %>% 
  mutate(date_time= as.POSIXct(date_time, format="%m/%d/%Y %I:%M:%S %p", tz= Sys.timezone()))
glimpse(h_steps)

## Rows: 22,099
## Columns: 3
## $ id        <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 1503960366, …
## $ date_time <dttm> 2016-04-12 00:00:00, 2016-04-12 01:00:00, 2016-04-12 02:00:…
## $ steptotal <int> 373, 160, 151, 0, 0, 0, 0, 0, 250, 1864, 676, 360, 253, 221,…

I will separate the date_time column into two separate columns, one for date and one for time.

Separating the date and time column

h_steps <- h_steps %>%
  separate(date_time, into = c("date", "time"), sep= " ") %>%
  mutate(date = ymd(date))
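separate() splits the text form of date_time at the space. Whether a timestamp that falls exactly on midnight keeps its "00:00:00" part when converted to text may depend on the R version, which is worth checking in your own session; a row that loses its time part would get NA in time and then be removed by drop_na(). A sketch of an alternative that derives both columns with explicit formats, using toy data and a fixed "UTC" time zone as assumptions:

```r
# Toy hourly data; the first timestamp falls exactly on midnight
h <- data.frame(
  id = c(1, 1),
  date_time = as.POSIXct(
    c("2016-04-12 00:00:00", "2016-04-12 01:00:00"),
    tz = "UTC"
  ),
  steptotal = c(373, 160)
)

# Derive both columns from the timestamp instead of splitting printed text
h$date <- as.Date(h$date_time, tz = "UTC")
h$time <- format(h$date_time, format = "%H:%M:%S")
h$date_time <- NULL

print(h)
```

Because the time is produced by an explicit format string, midnight rows always keep a "00:00:00" value.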

After formatting the datasets, I will remove unnecessary columns.

Removing weightkg, fat, bmi, ismanualreport and logid

weight <- weight %>% 
  select(-c( weightkg, fat, bmi, ismanualreport, logid))
view(weight)

Removing trackerdistance and loggedactivitiesdistance

activity <- activity %>% 
  select(-c(trackerdistance, loggedactivitiesdistance))
view(activity)

Removing totalsleeprecords

sleep<- sleep %>% 
  select(-c(totalsleeprecords))
view(sleep)

Lastly, I will find and remove duplicates and N/A values.

Find duplicates

sum(duplicated(activity))

## [1] 0

sum(duplicated(sleep))

## [1] 3

sum(duplicated(weight))

## [1] 0

sum(duplicated(h_steps))

## [1] 0

Only the sleep dataset contains duplicates (3 rows); the other datasets contain none.

Remove duplicates and N/A

activity <- activity %>%
  distinct() %>%
  drop_na()

sleep <- sleep %>%
  distinct() %>%
  drop_na()

sum(duplicated(sleep)) # confirm that distinct() removed the duplicate rows

## [1] 0

weight <- weight %>%
  distinct() %>%
  drop_na()

h_steps <- h_steps %>%
  distinct() %>%
  drop_na()

The duplicates in sleep have now been removed.
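How duplicated() counts rows can be seen on a toy table (made-up values). It flags only exact repeats of an earlier row, which are the same rows that distinct() drops. The extra check on the id and date pair is my own addition, to confirm that each user has at most one row per day.

```r
# Toy sleep table with one exact duplicate row (rows 1 and 2 are identical)
sleep_toy <- data.frame(
  id = c(1, 1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12", "2016-04-13", "2016-04-12")),
  totalminutesasleep = c(327, 327, 384, 400)
)

# Number of rows that exactly repeat an earlier row
n_dupes <- sum(duplicated(sleep_toy))

# Keep the first occurrence of each row
deduped <- sleep_toy[!duplicated(sleep_toy), ]

# 0 means no user has more than one row for the same day
key_dupes <- anyDuplicated(deduped[c("id", "date")])
```

Here n_dupes is 1, deduped has 3 rows, and key_dupes is 0.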

4.4 Merging the datasets

To compare variables from the activity, sleep and weight datasets in the next phase, I will merge these three datasets into one dataset that contains all of that information.

Merging activity and sleep datasets

activity_sleep <- merge(activity, 
                        sleep, by= c("id","date"), 
                        all.x = TRUE) 
head(activity_sleep)

##           id       date totalsteps totaldistance veryactivedistance
## 1 1503960366 2016-04-12      13162          8.50               1.88
## 2 1503960366 2016-04-13      10735          6.97               1.57
## 3 1503960366 2016-04-14      10460          6.74               2.44
## 4 1503960366 2016-04-15       9762          6.28               2.14
## 5 1503960366 2016-04-16      12669          8.16               2.71
## 6 1503960366 2016-04-17       9705          6.48               3.19
##   moderatelyactivedistance lightactivedistance sedentaryactivedistance
## 1                     0.55                6.06                       0
## 2                     0.69                4.71                       0
## 3                     0.40                3.91                       0
## 4                     1.26                2.83                       0
## 5                     0.41                5.04                       0
## 6                     0.78                2.51                       0
##   veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   calories totalminutesasleep totaltimeinbed
## 1     1985                327            346
## 2     1797                384            407
## 3     1776                 NA             NA
## 4     1745                412            442
## 5     1863                340            367
## 6     1728                700            712
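The NA values in row 3 above follow from all.x = TRUE, which makes merge() behave as a left join: every activity row is kept, and days without a sleep record are filled with NA. A small sketch on toy data (made-up values) contrasts this with the default inner join.

```r
# Toy activity table (left) and sleep table (right)
left <- data.frame(
  id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162, 10735, 8000)
)
right <- data.frame(
  id = 1,
  date = as.Date("2016-04-12"),
  totalminutesasleep = 327
)

# Left join: keeps all rows of `left`, fills unmatched sleep values with NA
left_joined <- merge(left, right, by = c("id", "date"), all.x = TRUE)

# Default (inner) join: keeps only rows present in both tables
inner_joined <- merge(left, right, by = c("id", "date"))

nrow(left_joined)   # 3
nrow(inner_joined)  # 1
```

A left join preserves all activity days, at the cost of NA values that later summaries must account for.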

Merging activity_sleep dataset with weight dataset

final_dataset <-merge(activity_sleep, 
                      weight, by= c("id","date"), 
                      all.x = TRUE)
head(final_dataset)

##           id       date totalsteps totaldistance veryactivedistance
## 1 1503960366 2016-04-12      13162          8.50               1.88
## 2 1503960366 2016-04-13      10735          6.97               1.57
## 3 1503960366 2016-04-14      10460          6.74               2.44
## 4 1503960366 2016-04-15       9762          6.28               2.14
## 5 1503960366 2016-04-16      12669          8.16               2.71
## 6 1503960366 2016-04-17       9705          6.48               3.19
##   moderatelyactivedistance lightactivedistance sedentaryactivedistance
## 1                     0.55                6.06                       0
## 2                     0.69                4.71                       0
## 3                     0.40                3.91                       0
## 4                     1.26                2.83                       0
## 5                     0.41                5.04                       0
## 6                     0.78                2.51                       0
##   veryactiveminutes fairlyactiveminutes lightlyactiveminutes sedentaryminutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   calories totalminutesasleep totaltimeinbed weightpounds
## 1     1985                327            346           NA
## 2     1797                384            407           NA
## 3     1776                 NA             NA           NA
## 4     1745                412            442           NA
## 5     1863                340            367           NA
## 6     1728                700            712           NA
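A left join only preserves the row count when the right-hand table has at most one row per id and date. The sketch below shows what happens otherwise, using a hypothetical case where one user logs weight twice on the same day; all values are made up.

```r
# Toy activity table: one row per user per day
activity_toy <- data.frame(
  id = c(1, 2),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalsteps = c(13162, 8000)
)

# Hypothetical weight table with two logs for user 1 on the same day
weight_toy <- data.frame(
  id = c(1, 1),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  weightpounds = c(116.0, 115.5)
)

merged_toy <- merge(activity_toy, weight_toy, by = c("id", "date"), all.x = TRUE)

# Check 1: are the join keys unique in the right-hand table?
keys_unique <- anyDuplicated(weight_toy[c("id", "date")]) == 0

# Check 2: did the merge keep exactly one row per activity row?
rows_preserved <- nrow(merged_toy) == nrow(activity_toy)
```

In this toy case both checks are FALSE, because user 1's activity row is repeated once per weight log. Running the same two checks on the real tables would confirm that the merged dataset still has one row per user and day.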

4.5 Exploring and summarizing the data

Now, I will verify the number of distinct IDs in each unmerged dataset and examine the data by summarizing the information contained in the merged dataset.

Verifying distinct IDs

n_distinct(activity$id)

## [1] 33

n_distinct(sleep$id)

## [1] 24

n_distinct(weight$id)

## [1] 8

n_distinct(h_steps$id)

## [1] 33

This information verifies the number of participants in each dataset. There are 33 participants in the activity and h_steps datasets, 24 in the sleep dataset and 8 in the weight dataset.

Next, I will explore the summary statistics of the main final dataset, which has activity, sleep and weight combined:

Summarizing the main final dataset

final_dataset %>% 
  select(totalsteps,
         totaldistance,
         sedentaryminutes, 
         veryactiveminutes, 
         fairlyactiveminutes, 
         lightlyactiveminutes,
         calories,
         totalminutesasleep, 
         totaltimeinbed, 
         weightpounds)%>%
   summary()

##    totalsteps    totaldistance    sedentaryminutes veryactiveminutes
##  Min.   :    0   Min.   : 0.000   Min.   :   0.0   Min.   :  0.00   
##  1st Qu.: 3790   1st Qu.: 2.620   1st Qu.: 729.8   1st Qu.:  0.00   
##  Median : 7406   Median : 5.245   Median :1057.5   Median :  4.00   
##  Mean   : 7638   Mean   : 5.490   Mean   : 991.2   Mean   : 21.16   
##  3rd Qu.:10727   3rd Qu.: 7.713   3rd Qu.:1229.5   3rd Qu.: 32.00   
##  Max.   :36019   Max.   :28.030   Max.   :1440.0   Max.   :210.00   
##                                                                     
##  fairlyactiveminutes lightlyactiveminutes    calories    totalminutesasleep
##  Min.   :  0.00      Min.   :  0.0        Min.   :   0   Min.   : 58.0     
##  1st Qu.:  0.00      1st Qu.:127.0        1st Qu.:1828   1st Qu.:361.0     
##  Median :  6.00      Median :199.0        Median :2134   Median :432.5     
##  Mean   : 13.56      Mean   :192.8        Mean   :2304   Mean   :419.2     
##  3rd Qu.: 19.00      3rd Qu.:264.0        3rd Qu.:2793   3rd Qu.:490.0     
##  Max.   :143.00      Max.   :518.0        Max.   :4900   Max.   :796.0     
##                                                          NA's   :530       
##  totaltimeinbed   weightpounds  
##  Min.   : 61.0   Min.   :116.0  
##  1st Qu.:403.8   1st Qu.:135.4  
##  Median :463.0   Median :137.8  
##  Mean   :458.5   Mean   :158.8  
##  3rd Qu.:526.0   3rd Qu.:187.5  
##  Max.   :961.0   Max.   :294.3  
##  NA's   :530     NA's   :873

Discoveries identified from this summary:

Most participants are lightly active when they are not sedentary (a mean of about 193 lightly active minutes per day, compared with about 21 very active and 14 fairly active minutes).
Average sedentary time is 991 minutes, or about 16.5 hours, per day. This downtime needs to be reduced for a healthier lifestyle.
On average, the participants sleep for about 7 hours (419 minutes) and spend about 7.6 hours (459 minutes) in bed, so roughly 40 minutes of their time in bed is spent awake.
Average total steps per day are 7,638, which is slightly below 8,000 steps per day (3.3 miles). According to the NIH, people who took 8,000 steps per day had about a 50% lower risk of death from any cause than those who took only 4,000 steps per day, and those who took 12,000 steps per day (5 miles) had a 65% lower risk than those who took only 4,000 steps per day.

Source: https://www.nih.gov/news-events/nih-research-matters/number-steps-day-more-important-step-intensity#:~:text=A%20goal%20of%2010%2C000%20steps,10%2C000%20steps%20are%20taken%20daily

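The average weight of the participants is about 159 pounds (mean 158.8), but only 8 participants logged their weight and 873 of the 940 rows have no weight value, so this figure describes a small subset of the sample. The data does not include sex or height, so I am unsure whether these participants are male, female or both. Under the CDC's BMI categories, 158 pounds would be slightly overweight for a woman of average height and within the healthy range for a man of average height.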

Source: https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html
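The unit conversions behind the sedentary and sleep figures above can be checked with a few lines of arithmetic. This is only a sketch: the three means are copied by hand from the `summary()` output, and the variable names are my own. Because `totalminutesasleep` and `totaltimeinbed` are missing on the same 530 rows, the difference between their means equals the mean time spent in bed awake.

```r
# Means copied from the summary() output above (all in minutes)
mean_sedentary   <- 991.2
mean_asleep      <- 419.2
mean_time_in_bed <- 458.5

sedentary_hours   <- mean_sedentary / 60              # about 16.5 hours
asleep_hours      <- mean_asleep / 60                 # about 7.0 hours
time_in_bed_hours <- mean_time_in_bed / 60            # about 7.6 hours
awake_in_bed_min  <- mean_time_in_bed - mean_asleep   # about 39 minutes

print(round(c(sedentary_hours, asleep_hours, time_in_bed_hours, awake_in_bed_min), 1))
```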

5. Analyze and Share Phases

I will analyze trends among the Fitbit users to determine whether they can inform Bellabeat’s marketing strategy.

5.1 Calories burned versus steps taken

I would like to confirm that there is a positive correlation between the number of steps taken in a day and the number of calories burned.

Steps versus calories

ggplot(data=final_dataset, 
       aes(x=totalsteps, y=calories)) +   
  geom_point() + 
  geom_smooth() +   # loess trend line with its confidence band
  labs(title="Total Steps Versus Calories", 
       caption= 'Data Source: FitBit Fitness Tracker Data',
       y= "Calories", x = "Total Steps")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Using a correlation test to confirm correlation between steps and calories

cor.test(final_dataset$totalsteps, 
         final_dataset$calories, 
         method = 'pearson', 
         conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  final_dataset$totalsteps and final_dataset$calories
## t = 22.472, df = 938, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5483688 0.6316184
## sample estimates:
##       cor 
## 0.5915681

The Pearson correlation is about 0.59 (95% confidence interval: 0.55 to 0.63; p < 2.2e-16), which indicates a moderately strong positive relationship between total steps and calories. The more steps a user takes, the more calories the user tends to burn.
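To make the size of this correlation more concrete, the reported coefficient can be converted into the share of variance explained and into the t statistic shown in the test output. This is a worked sketch using the standard formula for the Pearson test, t = r * sqrt(df) / sqrt(1 - r^2); the values of `r` and `df` are copied by hand from the output above.

```r
# Values copied from the cor.test() output above
r  <- 0.5915681
df <- 938

# Share of the variance in calories that a straight-line fit on steps accounts for
r_squared <- r^2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

print(round(r_squared, 2))   # about 0.35
print(round(t_stat, 2))      # about 22.47, in line with the reported t = 22.472
```

In other words, steps account for roughly a third of the day-to-day variation in calories, which leaves room for other factors such as body size and activity intensity.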

5.2 Classifying participants by daily activity level

Since I don’t have any demographic information from the data, I want to determine the type of participants contained within the sample. I can classify each ID by activity considering the daily amount of steps taken. They will be categorized as follows:

Sedentary - Less than 5000 steps a day.
Lightly active - Between 5000 and 7499 steps a day.
Fairly active - Between 7500 and 9999 steps a day.
Very active - 10,000 or more steps a day.

These classifications are taken from the following article:

https://pubmed.ncbi.nlm.nih.gov/14715035/
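The four categories above can also be expressed as a single set of cut points. The sketch below uses base R's `cut()` with left-closed intervals; `classify_steps` and `example_steps` are hypothetical names and made-up values used for illustration, not part of the original analysis.

```r
# Hypothetical helper: map average daily steps to the four activity levels
classify_steps <- function(avg_daily_steps) {
  cut(avg_daily_steps,
      breaks = c(-Inf, 5000, 7500, 10000, Inf),
      labels = c("sedentary", "lightly active", "fairly active", "very active"),
      right = FALSE)  # intervals are [lower, upper), so exactly 5000 is "lightly active"
}

# Made-up averages, including values that sit on or next to a boundary
example_steps <- c(916, 5000, 7499.5, 9999.9, 10000, 12117)
example_types <- as.character(classify_steps(example_steps))
print(example_types)
```

Using contiguous break points means that fractional averages such as 7499.5 or 9999.9 always land in a category, and `cut()` returns a factor whose levels are already ordered from sedentary to very active.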

First, I will calculate the average daily steps and calories by user Id.

Calculating the average daily steps and calories by user Id

activity_average <- final_dataset %>% 
  group_by(id) %>% 
  summarise(avg_daily_steps = mean(totalsteps), 
            avg_daily_cal = mean(calories)) %>% 
  # Ranges are contiguous so that fractional averages (e.g. 7499.5) always get a category
  mutate(user_type = case_when(
    avg_daily_steps < 5000 ~ "sedentary",
    avg_daily_steps >= 5000 & avg_daily_steps < 7500 ~ "lightly active",
    avg_daily_steps >= 7500 & avg_daily_steps < 10000 ~ "fairly active",
    avg_daily_steps >= 10000 ~ "very active"
  ))


head(activity_average)

## # A tibble: 6 × 4
##           id avg_daily_steps avg_daily_cal user_type     
##        <dbl>           <dbl>         <dbl> <chr>         
## 1 1503960366          12117.         1816. very active   
## 2 1624580081           5744.         1483. lightly active
## 3 1644430081           7283.         2811. lightly active
## 4 1844505072           2580.         1573. sedentary     
## 5 1927972279            916.         2173. sedentary     
## 6 2022484408          11371.         2510. very active

User type percentage

user_type_sum <- activity_average %>%
  group_by(user_type) %>%
  summarise(total= n()) %>%
  mutate(total_percent= scales::percent (total/sum(total)))

Ordering the user type column from sedentary to very active

user_type_sum$user_type <-factor(user_type_sum$user_type, levels= c("sedentary","lightly active","fairly active","very active"))

Visualization of user types

user_type_sum %>% 
  # total (a numeric count) sets the slice sizes; total_percent is text and is used only as the label
  ggplot(aes(x="", y=total, fill=user_type)) + 
  geom_bar(stat = "identity")+
  coord_polar("y", start=0)+ 
  theme(axis.title.x= element_blank(),  
        axis.title.y = element_blank(), 
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        panel.background = element_blank(),  
        axis.ticks = element_blank(), 
        axis.text.x = element_blank(),  
        plot.title = element_text(hjust = -0.3, size=14, face = "bold")) + 
  scale_fill_manual(values = c("#72b4eb","#F7B7A3", "#EA5F89", "#9B3192")) + 
  geom_text(aes(x=1.7, label = total_percent),
            position = position_stack(vjust = 0.5))+
  labs(title="Percentage of Each User Type")

The chart shows that the participants are fairly evenly distributed across the activity levels defined by their average daily steps. Based on this information, users of all activity levels wear smart devices.

5.3 Activity habits throughout the day and the week

Now, I will see if I can identify trends that occur each day and throughout the week.

5.3.1 Activities by the day of the week

final_dataset %>% 
  mutate(weekdays = weekdays(date)) %>% 
  select(weekdays, totalsteps) %>% 
  mutate(weekdays = factor(weekdays, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))) %>% 
  drop_na() %>% 
  ggplot(aes(weekdays, totalsteps, fill = weekdays)) +
  geom_boxplot() +
  scale_fill_brewer(palette="RdPu") +
  theme(legend.position="none") +
  labs(
    title = "Daily Activity Habits",
    x = "Day of the Week",
    y = "Steps",
    caption = 'Data Source: FitBit Fitness Tracker Data')

This graph shows that activity levels remain consistent, overall, throughout the week.

5.3.2 Activities by the hour of the day

h_steps %>%
  group_by(time) %>%
  summarize(average_steps = mean(steptotal)) %>%
  ggplot() +
  geom_col(mapping = aes(x=time, y = average_steps, fill = average_steps)) + 
  labs(title = "Hourly Steps Throughout the Day", x="Hour of the Day", y="", 
       fill= 'Average Steps', caption = 'Data Source: FitBit Fitness Tracker Data') + 
  scale_fill_gradient(low = "green", high = "red")+
  theme(axis.text.x = element_text(angle = 90))

The participants are most active between 8:00 AM and 7:00 PM. The highest concentration of activity occurs during lunch hours (12:00 PM to 2:00 PM) and after the typical work day (5:00 PM to 7:00 PM).

5.4 Sleep versus activities

Activity levels and the outcome of fitness results can be affected by the amount and quality of sleep that the users have each night. I will now explore this area of the data in order to find some possible correlations between sleep and activity.

5.4.1 Distribution of sleep time

final_dataset%>% 
  select(totalminutesasleep) %>% 
  drop_na() %>% 
  # Exactly 420 minutes is 7 hours, so it belongs in the '7h to 9h' group
  mutate(sleep_quality = ifelse(totalminutesasleep < 420, 'Less Than 7h',
                                ifelse(totalminutesasleep <= 540, '7h to 9h', 
                                       'More Than 9h'))) %>%
  mutate(sleep_quality = factor(sleep_quality, 
                                levels = c('Less Than 7h','7h to 9h',
                                           'More Than 9h'))) %>% 
  ggplot(aes(x = totalminutesasleep, fill = sleep_quality)) +
  geom_histogram(position = 'dodge', bins = 30) +
  scale_fill_manual(values=c("#72b4eb", "#F7B7A3", "#EA5F89")) +
  theme(legend.position = c(.80, .80),
        legend.title = element_blank(),
        legend.spacing.y = unit(0, "mm"), 
        panel.border = element_rect(colour = "black", fill=NA),
        panel.background = element_blank(),
        legend.background = element_blank(),
        legend.box.background = element_rect(colour = "black")) +
  labs(
    title = "Distribution of Sleep Time",
    x = "Time Slept (Minutes)",
    y = "Count",
    caption = 'Data Source: FitBit Fitness Tracker Data')

5.4.2 Sleep by user type

Adding the usertype column into the activity_sleep_average dataset

activity_sleep_average <- merge(activity_sleep, activity_average[c("id","user_type")], by="id")

Reordering the user type column in the dataset

activity_sleep_average$user_type <-ordered(activity_sleep_average$user_type, 
                                         levels= c("sedentary","lightly active","fairly active","very active"))

Hours Asleep by User Type

ggplot(subset(activity_sleep_average,!is.na(totalminutesasleep)),
       aes(user_type,totalminutesasleep/60, fill=user_type))+
  geom_boxplot()+
  stat_summary(fun="mean", geom="point", 
               shape=23,size=2, fill="white")+
  labs(title= "Hours Asleep by User Type", 
       x= " ", y=" Hours Asleep", 
       caption= 'Data Source: Fitbit Fitness Tracker Data')+
  scale_fill_brewer(palette="RdPu")+
  theme(plot.title= element_text(hjust= 0.5,vjust= 0.8, size=16),
        legend.position= "none")

The boxplots overlap considerably and contain many outliers, so there is no clear relationship between a participant’s user type and the amount of sleep they get. Even so, the lightly active users slept the most on average (over 7.5 hours) and met the recommended amount of sleep (at least 7 hours for adults), while the very active users slept the least (less than 7 hours) and fell short of it.

Source: https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html

5.4.3 Total time in bed awake versus time asleep

Sleep versus time in bed

ggplot(data=final_dataset, 
       aes(x = totalminutesasleep, y= totaltimeinbed)) + 
  geom_point()+ 
  labs(title="Total Minutes Asleep vs. Total Time in Bed",
       x= 'Total Minutes Asleep', y= 'Total Time in Bed',
       caption= 'Data Source: Fitbit Fitness Tracker Data')

## Warning: Removed 530 rows containing missing values (`geom_point()`).

Correlation test between total minutes asleep and total time in bed

cor.test(final_dataset$totalminutesasleep, 
         final_dataset$totaltimeinbed, 
         method = 'pearson', 
         conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  final_dataset$totalminutesasleep and final_dataset$totaltimeinbed
## t = 51.28, df = 408, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9161262 0.9423551
## sample estimates:
##       cor 
## 0.9304224

The graph and the test show a strong correlation (0.93) between time in bed and time asleep, although on some nights users spent more than an hour in bed awake. The outliers who had more than 10 hours of sleep also remained in bed for up to about 16 hours (the maximum time in bed is 961 minutes).
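The time spent in bed awake can be computed directly from the two sleep columns. The sketch below uses a small made-up data frame whose column names mirror those in `final_dataset`; the derived columns (`minutes_awake_in_bed`, `sleep_efficiency`, `over_hour_awake`) are hypothetical additions and are not part of the original analysis.

```r
# Made-up nightly records; column names mirror those in final_dataset
nights <- data.frame(
  totalminutesasleep = c(420, 380, 500, 610),
  totaltimeinbed     = c(445, 470, 520, 960)
)

# Minutes spent in bed but not asleep
nights$minutes_awake_in_bed <- nights$totaltimeinbed - nights$totalminutesasleep

# Share of the time in bed that was actually spent asleep
nights$sleep_efficiency <- nights$totalminutesasleep / nights$totaltimeinbed

# Flag nights with more than an hour awake in bed
nights$over_hour_awake <- nights$minutes_awake_in_bed > 60

print(nights)
```

Applied to the real data, a column like `over_hour_awake` would make it possible to count how many nights, and how many distinct users, fall into that group rather than reading it off the scatter plot.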

5.5 Use of smart device

Now that I have identified trends in the participants’ activity, sleep and calories burned, I want to see how often the users in our sample are using their smart device. This can help me create a plan for Bellabeat’s marketing strategy regarding the features that would assist in increasing smart device usage, app interactions and future product sales.

First, I will classify the participants by the number of days on which they used their smart device over the thirty-one day time frame.

low use - participants who used their device between 1 and 10 days.
medium use - participants who used their device between 11 and 20 days.
high use - participants who used their device between 21 and 31 days.

I will create a new dataset, grouped by Id, that counts the number of days each device was used and adds a new column with the classification: low use, medium use or high use.

Categorizing smart device usage by low, medium, and high

sd_use <- final_dataset %>%
  group_by(id) %>%
  summarize(days_used = n()) %>%   # one row per id per day, so n() is the number of days used
  mutate(usage = case_when(
    days_used >= 1 & days_used <= 10 ~ "low use",
    days_used >= 11 & days_used <= 20 ~ "medium use", 
    days_used >= 21 & days_used <= 31 ~ "high use"))

head(sd_use)

## # A tibble: 6 × 3
##           id days_used usage   
##        <dbl>     <int> <chr>   
## 1 1503960366        31 high use
## 2 1624580081        31 high use
## 3 1644430081        30 high use
## 4 1844505072        31 high use
## 5 1927972279        31 high use
## 6 2022484408        31 high use

Now, I will create a data frame that converts the participant count of each usage type into a percentage and orders the usage levels.

Creating a table that converts usage types into percentages

sd_use_percentage <- sd_use %>%
  group_by(usage) %>%
  summarise(total = n()) %>%
  mutate(totals = sum(total)) %>%
  group_by(usage) %>%
  summarise(total_percent = total / totals) %>%
  mutate(percentage = scales::percent(total_percent))


sd_use_percentage$usage <-ordered(sd_use_percentage$usage, 
                              levels= c("low use", "medium use", "high use")) 
head(sd_use_percentage)

## # A tibble: 3 × 3
##   usage      total_percent percentage
##   <ord>              <dbl> <chr>     
## 1 high use          0.879  87.9%     
## 2 low use           0.0303 3.0%      
## 3 medium use        0.0909 9.1%

Creating a chart that visualizes usage percentages

sd_use_percentage %>%
  ggplot(aes(x="",y=total_percent, fill=usage)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start=0)+
  theme_minimal()+
  theme(axis.title.x= element_blank(),
        axis.title.y = element_blank(),
        panel.border = element_blank(), 
        panel.grid = element_blank(), 
        axis.ticks = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(hjust = 0.5, size=14, face = "bold")) +
  geom_text(aes(x=1.6, label = percentage),
            position = position_stack(vjust = 0.5))+
  scale_fill_manual(values = c("#72b4eb","#F7B7A3","#EA5F89"),
                    labels = c("Low Use - 1 to 10 days",
                               "Medium Use - 11 to 20 days",
                               "High Use - 21 to 31 days"))+
  labs(title="Daily Use of Smart Device",
       fill = 'Usage')

Most of the users (87.9%) used their device for 21 days or more during the study period, while the remaining 12.1% used their device for 20 days or fewer.
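With a sample this small, it is worth translating the percentages back into participant counts. The sketch below assumes that all 33 participants from the activity data appear in `final_dataset`, which is consistent with the 3.0% share (1 in 33); the shares are copied by hand from the table above and the variable names are my own.

```r
# Assumption: all 33 distinct IDs from the activity data are in final_dataset
n_participants <- 33

# Shares copied from the sd_use_percentage output above
shares <- c("high use" = 0.879, "low use" = 0.0303, "medium use" = 0.0909)

# Number of participants implied by each share
implied_counts <- round(shares * n_participants)
print(implied_counts)
```

Under that assumption, the low and medium use groups contain only 1 and 3 participants respectively, so conclusions about less frequent users should be treated with caution.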

6. Act Phase (Recommendations)

Key findings:

Activity

Most participants are lightly active when they are not sedentary.
Average sedentary time is 991 minutes, or about 16.5 hours, per day. This downtime needs to be reduced for a healthier lifestyle.
Average total steps per day are 7,638, which is slightly below 8,000 steps per day (3.3 miles). According to the NIH, people who took 8,000 steps per day had about a 50% lower risk of death from any cause than those who took only 4,000 steps per day, and those who took 12,000 steps per day (5 miles) had a 65% lower risk than those who took only 4,000 steps per day.
Most of the users (87.9%) used their device for 21 days or more during the study period, while the remaining 12.1% used their device for 20 days or fewer.

Sleep

On average, the participants sleep for about 7 hours and spend roughly 40 minutes in bed awake. On some nights, users spent more than an hour in bed awake. The outliers who had more than 10 hours of sleep also remained in bed for up to about 16 hours.

Calories

The more steps a user takes, the more calories the user burns. Furthermore, the more activities performed, the more calories are burned.

Recommendations:

Bellabeat can implement a reward system in the app that activates when the user walks at least 8,000 steps a day. For increased motivation, Bellabeat can offer extra points once step and activity goals have been met, giving users an incentive to perform more “fairly active” and “very active” exercise. The reward points can be redeemable for discounts on other Bellabeat products.
Encourage a nighttime routine by sending a sleep reminder at least an hour before bedtime. App features could recommend habits such as limiting screen time for an hour before bedtime and finishing the last meal, caffeine or alcohol at least two hours before bedtime. Rewards can be applied when the recommended amount of sleep is achieved.
A sedentary monitor can send a signal to the smart device to remind the user to reduce sedentary time when the user would benefit from more physical activity.
The app could start a Bellabeat community in which users help motivate fellow members to become more active throughout the day. Users can upload pictures of their meals, activities and progress, and share their own motivational messages to inspire other users just like them.

Case Study 2: Bellabeat | R

Shantel Shephard
