Before showing how to aggregate and analyze the data on different time scales, we provide an example of what our recorded raw data looks like. For privacy reasons, random noise is added to each example dataset. We chose two sensors for illustrative purposes: WiFi and app usage. These two sensors differ in how they are recorded: whereas WiFi connections are recorded at specific time points, app usage has a start time and a duration. We also show an example of the ESM data (positive affect sliders), to which the passive measures will later be linked.
timestamp | sha256_salted_mac_hash |
---|---|
2021-10-25 18:26:07.797312295 | d9869d0b40ef75cd284a2b896a3d0dabc9e6ae5b29551ea300b4f53dd0130f6c |
2021-10-25 18:26:09.172443738 | 523ad8310fe14f8fd40f53fc014476e21bbc5dd6a0bd402d52f40ab23360439b |
2021-10-25 18:26:08.210457816 | d9869d0b40ef75cd284a2b896a3d0dabc9e6ae5b29551ea300b4f53dd0130f6c |
2021-10-25 18:26:14.657465888 | 36cb027c97d12c952a9af551d718e854f944d56fb4e7bb96ee69ea7a61ade779 |
2021-10-25 18:26:29.044966888 | e6b727dd9964a9637a97ec343a457737e41d25bebcd44e1ab5c56ad64b76f800 |
2021-10-25 18:26:13.265038973 | 6c70ba8cf4fd8e8133c3d0d593d12e3ae25465a8e4cd3b0de5e40c166433ea40 |
2021-10-25 18:27:01.722724855 | 2277cb466048be250a04a3ca1464aee1f45fb23fce29f4f3e5b737c1c4ba2d7f |
2021-10-25 18:26:07.529904046 | 252069f26e0ff8bdc133401ce50210b88a341304e0edf58c644787761b13c825 |
2021-10-25 18:26:07.529359900 | 72826bc603bbffa7db9ec14583d500a9c13ca2b141674d9d1c3f1d90ae40091d |
2021-10-26 11:18:48.053753455 | 4c6dc8eb70fa64c4a2e157d5462efd188d2a723b2c937937c74d58278b4151e6 |
timestamp | duration | genre |
---|---|---|
2021-10-25 17:17:01.986000000 | 1.869158 | NA |
2021-10-25 17:17:02.870157536 | 44.540493 | FOOD_DRINK |
2021-10-25 18:25:07.788492869 | 1.662723 | COMMUNICATION |
2021-10-25 18:25:08.904723355 | 6.508187 | SHOPPING |
2021-10-25 18:26:05.376186557 | 4.973042 | NA |
2021-10-25 18:27:30.746041812 | 1.812178 | NEWS_MAGAZINES |
2021-10-25 18:27:30.594178285 | 4.918438 | FOOD_DRINK |
2021-10-25 18:27:34.714437904 | 17.996723 | NA |
2021-10-25 18:27:50.917722524 | 82.216111 | SOCIAL |
2021-10-25 18:29:26.152111508 | 7.132128 | HEALTH_FITNESS |
Date | pa_happy_sliderNegPos | pa_energy_sliderNegPos | pa_relax_sliderNegPos |
---|---|---|---|
2021-10-25 18:59:38.837536516 | NA | NA | NA |
2021-10-26 07:46:39.579333615 | 5.297842 | 2.061052 | 3.567544 |
2021-10-26 11:38:59.334110889 | 6.311747 | 13.104694 | 6.047756 |
2021-10-26 14:22:00.627950088 | 10.939742 | 12.703427 | 10.068575 |
2021-10-26 16:05:10.339640619 | 5.962909 | 4.608161 | 6.854507 |
2021-10-26 19:06:20.441581469 | 8.982813 | 8.179666 | 7.555428 |
2021-10-26 21:21:32.237598192 | 8.304694 | 10.684946 | 8.264331 |
2021-10-27 07:43:06.095237021 | 2.365334 | 9.936023 | 9.155478 |
2021-10-27 11:59:47.196670340 | 8.448896 | 6.472782 | 12.086550 |
2021-10-27 13:00:48.853281969 | 8.250955 | 8.177381 | 8.016611 |
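To illustrate how example data of this kind could be anonymized, the following sketch shows one possible way to salt-and-hash MAC addresses and to add random noise to ESM slider values. It is a minimal illustration under assumptions (the digest package, the salt, and the raw column names are made up for this example), not the pipeline that generated the tables above.
library(digest) # provides the digest() hashing function
# Hypothetical raw WiFi record (illustrative values only)
wifi_raw <- data.frame(timestamp = as.POSIXct("2021-10-25 18:26:07"), mac = "AA:BB:CC:DD:EE:FF")
salt <- "project-specific-secret" # assumed project-specific salt
# Salted SHA-256 hash of the MAC address; afterwards the identifying column is dropped
wifi_raw$sha256_salted_mac_hash <- sapply(wifi_raw$mac, function(m) digest(paste0(salt, m), algo = "sha256", serialize = FALSE))
wifi_raw$mac <- NULL
# Add random noise to ESM slider values (privacy-preserving jitter)
esm_raw <- data.frame(pa_happy_sliderNegPos = c(5, 7, 9))
esm_raw$pa_happy_sliderNegPos <- esm_raw$pa_happy_sliderNegPos + rnorm(3, sd = 1)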
To create meaningful variables and to combine different sensors, we must choose a time window over which to aggregate the measures.
In the following example, we show how to create a dataset in which the passive smartphone measures are aggregated over the X hours before each ESM questionnaire was filled out.
We begin by converting all dates to class “POSIXct” in order to represent them as calendar dates and times.
Wifi$timestamp <- as.POSIXct(Wifi$timestamp)
App$timestamp <- as.POSIXct(App$timestamp)
ESM$Date <- as.POSIXct(ESM$Date)
We continue by adding the end time to the app usage dataset (note that the end times displayed below are rounded to seconds).
App$end_time <- as.POSIXct(c(rep(NA,nrow(App)))) # Add "empty" end time column (with the class POSIXct)
App$end_time <- App$timestamp + App$duration # Add start time (timestamp) and duration together to calculate the end time
timestamp | duration | genre | end_time |
---|---|---|---|
2021-10-25 17:17:01 | 1.869158 | NA | 2021-10-25 17:17:03 |
2021-10-25 17:17:02 | 44.540493 | FOOD_DRINK | 2021-10-25 17:17:47 |
2021-10-25 18:25:07 | 1.662723 | COMMUNICATION | 2021-10-25 18:25:09 |
2021-10-25 18:25:08 | 6.508187 | SHOPPING | 2021-10-25 18:25:15 |
2021-10-25 18:26:05 | 4.973042 | NA | 2021-10-25 18:26:10 |
2021-10-25 18:27:30 | 1.812178 | NEWS_MAGAZINES | 2021-10-25 18:27:32 |
2021-10-25 18:27:30 | 4.918438 | FOOD_DRINK | 2021-10-25 18:27:35 |
2021-10-25 18:27:34 | 17.996723 | NA | 2021-10-25 18:27:52 |
2021-10-25 18:27:50 | 82.216111 | SOCIAL | 2021-10-25 18:29:13 |
2021-10-25 18:29:26 | 7.132128 | HEALTH_FITNESS | 2021-10-25 18:29:33 |
Next, we create a dataset that shows app usage per second. This will help us to summarize app usage before the ESM questionnaire was filled out.
# The following code is adapted from:
# https://stackoverflow.com/questions/57245771/how-do-i-fill-time-sequence-based-on-start-and-end-time-in-r
# seq.POSIXt() creates a sequence of times (e.g., 2021-10-25 17:17:01 CEST, 2021-10-25 17:17:02 CEST, 2021-10-25 17:17:03 CEST). Vectorize() makes this function vectorized, which means that it can be applied directly to whole vectors (= columns) of start and end times.
vec_seq <- Vectorize(seq.POSIXt, vectorize.args = c("from", "to"))
App_persecond <- App %>%
transmute(genre, Date = vec_seq(timestamp, end_time, 1)) %>% #transmute() will create a new variable (Date) based on the vec_seq() function (which takes the following arguments: from = timestamp, to = end_time, by = 1 (by one second))
unnest(cols = c(Date)) # displays each value of the time sequence as separate row
genre | Date |
---|---|
NA | 2021-10-25 17:17:01 |
NA | 2021-10-25 17:17:02 |
FOOD_DRINK | 2021-10-25 17:17:02 |
FOOD_DRINK | 2021-10-25 17:17:03 |
FOOD_DRINK | 2021-10-25 17:17:04 |
FOOD_DRINK | 2021-10-25 17:17:05 |
FOOD_DRINK | 2021-10-25 17:17:06 |
FOOD_DRINK | 2021-10-25 17:17:07 |
FOOD_DRINK | 2021-10-25 17:17:08 |
FOOD_DRINK | 2021-10-25 17:17:09 |
As a next step, we match our datasets (from different measures) with the ESM data. For this, we loop through the filled-out ESM questionnaires and summarize the passive measures X hours before the ESM questionnaire was filled out. For X we can choose any value. In this example, we aggregate the measures three hours before the ESM questionnaire was filled out. In our paper, we explore other time windows as well.
### DECIDE HERE ON WHICH LEVEL YOU WANT TO AGGREGATE THE PASSIVE MEASURES ####
rule_int = 3 # Over how many hours before the ESM questionnaire do you want to aggregate the variables?
#### Create dataframes that we will grow in our loop
FeatureOverviewApps <- data.frame(Date = ESM$Date[1])
FeatureOverviewWifi <- data.frame(Date = ESM$Date[1])
for (i in 1:nrow(ESM)) {
time = as.POSIXct(ESM$Date[i]) # loop through each time point at which an ESM questionnaire was filled out
####### APP USAGE #########
# Here we calculate how long each app category was used before the ESM questionnaire was filled out.
# We select the rows that fall within the X hours before the ESM questionnaire (i.e., after time - rule_int hours) but not after the questionnaire was filled out. If no time points are included in the dataset, an empty tibble will be created and, thus, no row will be added.
App_beforeESM <-
App_persecond[(App_persecond$Date > (time - rule_int * 60 * 60)) &
(App_persecond$Date < time), ] %>% # rule_int is in hours, so we multiply by 60 * 60 to convert it to seconds
group_by(genre) %>%
summarise(Date = time, TotalMinutes = n() / 60) # Each row represents one second, so dividing the row count by 60 gives the usage time in minutes
# Rename variables for clarity
App_beforeESM$Genre_min = paste(App_beforeESM$genre, "_min", sep = "")
App_beforeESM <-
App_beforeESM[, -1] # We don't need the genre column anymore (because we renamed it for clarity)
# Bring data in wide format
App_beforeESM <- spread(App_beforeESM, Genre_min, TotalMinutes)
### Here we add the total minutes of app usage
# For this we simply calculate the sum
App_beforeESM$APP_USAGE_min <- rowSums(App_beforeESM[, -1])
FeatureOverviewApps = full_join(FeatureOverviewApps, App_beforeESM)
####### Wifi #########
TOTAL_MACHASHES_number <-
Wifi[(Wifi$timestamp > (time - rule_int * 60 * 60)) &
(Wifi$timestamp < time), ] %>% nrow() # Here we calculate the total number of recorded wifi connections (X hours before but not after the ESM was filled out)
UNIQUE_MACHASHES_number <-
Wifi[(Wifi$timestamp > (time - rule_int * 60 * 60)) &
(Wifi$timestamp < time), ] %>% .[!duplicated(.$sha256_salted_mac_hash),] %>% nrow() # Here we calculate the number of unique recorded wifi connections (X hours before but not after the ESM was filled out); the column name must match the hash column in the Wifi data (here sha256_salted_mac_hash)
Wifi_beforeESM <-
data.frame(
Date = time,
TOTAL_MACHASHES_number = TOTAL_MACHASHES_number,
UNIQUE_MACHASHES_number = UNIQUE_MACHASHES_number
)
FeatureOverviewWifi = full_join(FeatureOverviewWifi, Wifi_beforeESM)
###### More Features & Sensors can be added here #####
### ......
}
Date | TOTAL_MACHASHES_number | UNIQUE_MACHASHES_number |
---|---|---|
2021-10-25 18:59:38 | 9 | 0 |
2021-10-26 07:46:39 | 0 | 0 |
2021-10-26 11:38:59 | 20 | 0 |
2021-10-26 14:22:00 | 52 | 0 |
2021-10-26 16:05:10 | 0 | 0 |
2021-10-26 19:06:20 | 0 | 0 |
2021-10-26 21:21:32 | 0 | 0 |
2021-10-27 07:43:06 | 0 | 0 |
2021-10-27 11:59:47 | 15 | 0 |
2021-10-27 13:00:48 | 24 | 0 |
Date | COMMUNICATION_min | FOOD_DRINK_min | HEALTH_FITNESS_min | MUSIC_AUDIO_min | NA_min | NEWS_MAGAZINES_min | SHOPPING_min | SOCIAL_min | TOOLS_min | APP_USAGE_min | TRAVEL_LOCAL_min | VIDEOPLAYERS_EDITORS_min | PRODUCTIVITY_min | BOOKS_REFERENCE_min | COMICS_min | PERSONALIZATION_min |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-10-25 18:59:38 | 0.1166667 | 0.8333333 | 0.1333333 | 1.0333333 | 0.7333333 | 0.0333333 | 0.1166667 | 1.7833333 | 0.0500000 | 4.8333333 | NA | NA | NA | NA | NA | NA |
2021-10-26 07:46:39 | 0.0500000 | NA | NA | 0.0166667 | NA | 0.1833333 | NA | 0.0166667 | NA | 0.4500000 | 0.1833333 | NA | NA | NA | NA | NA |
2021-10-26 11:38:59 | NA | NA | NA | NA | 0.3000000 | 0.0666667 | NA | 1.3666667 | NA | 1.7833333 | NA | 0.0500000 | NA | NA | NA | NA |
2021-10-26 14:22:00 | 0.4166667 | 0.0333333 | NA | NA | 1.7000000 | 0.0833333 | NA | 1.4500000 | 0.8666667 | 4.5500000 | NA | NA | NA | NA | NA | NA |
2021-10-26 16:05:10 | 0.5000000 | 0.0333333 | NA | NA | 0.5333333 | 0.2000000 | NA | 0.3166667 | NA | 1.5833333 | NA | NA | NA | NA | NA | NA |
2021-10-26 19:06:20 | 0.4000000 | NA | NA | NA | 0.1166667 | 0.2166667 | 0.0666667 | 0.1500000 | NA | 0.9500000 | NA | NA | NA | NA | NA | NA |
2021-10-26 21:21:32 | 0.4000000 | NA | NA | NA | 0.2333333 | 0.2666667 | NA | 0.1500000 | NA | 1.0500000 | NA | NA | NA | NA | NA | NA |
2021-10-27 07:43:06 | 0.5333333 | NA | NA | NA | 0.0833333 | NA | NA | NA | 0.0500000 | 0.9166667 | 0.2500000 | NA | NA | NA | NA | NA |
2021-10-27 11:59:47 | 0.1500000 | NA | NA | NA | 0.1333333 | NA | NA | 0.0666667 | 0.0666667 | 0.5333333 | 0.1166667 | NA | NA | NA | NA | NA |
2021-10-27 13:00:48 | 0.3833333 | NA | NA | NA | 0.1666667 | NA | NA | 0.2833333 | 0.0666667 | 1.2000000 | 0.1166667 | 0.0166667 | 0.1666667 | NA | NA | NA |
Missing data must be identified. In contrast to most actively assessed data, it is not always directly observable whether passive data are missing. For measures such as app usage, we do not know whether data are absent because no apps were used or because of technical problems. This means that we must decide on a time frame after which we exclude app usage data if no app usage was recorded.
In this example, we check for each day whether more than 18 subsequent hours are without recorded app usage. If so, we exclude all data from that day.
FeatureOverviewApps <- right_join(FeatureOverviewApps,data.frame(Date = ESM$Date), by = c("Date")) #Here we also include ESM time points in our feature overview that have no app data (but do have ESM data).
#### SET HERE THE MISSING Threshold ####
MissingThreshold <-
18 # Here we set it to 18 hours (other windows could have been chosen as well)
#Show app usage per hour, to calculate how many hours per day are without app usage
LabelMissing_App <- App_persecond %>%
group_by(Date = floor_date(Date, "1 hour")) %>%
summarize(Usage = n() / 60)
LabelMissing_App$Time <-
format(LabelMissing_App$Date, format = "%H")
LabelMissing_App$Time <- as.numeric(LabelMissing_App$Time)
# Calculate differences: How many hours in between are without app usage?
LabelMissing_App <- LabelMissing_App %>%
group_by(floor_date(Date, "1 day")) %>% # Calculates missing values per day
reframe(Date = Date,
Time = Time,
difference = abs(Time - lag(Time))) # lag(Time) shows the previous time point, which makes it possible to calculate the difference between two consecutive time points. We use reframe() (dplyr >= 1.1.0) because more than one row per group is returned.
# Calculate how many hours passed from 00:00 to the first recorded app usage (the first difference will be "NA")
LabelMissing_App$difference[is.na(LabelMissing_App$difference)] <-
LabelMissing_App$Time[is.na(LabelMissing_App$difference)]
# Set the very first measurement to 0 (because the data collection might have started at any point of the day). Thus, we don't want to exclude the first day.
LabelMissing_App$difference[1] <- 0
# Create an index
LabelMissing_App$difference_index <- 0
# Compare whether the difference is greater than our missing threshold
LabelMissing_App$difference_index[LabelMissing_App$difference > MissingThreshold] <-
1
######## Check for the other direction (the same again, just with lead() instead of lag())
LabelMissing_App <- LabelMissing_App %>%
group_by(floor_date(Date, "1 day")) %>% # Calculates missing values per day
reframe(
Date = Date,
Time = Time,
difference = abs(Time - lead(Time, 1)),
difference_index = difference_index
) # Change to the other direction: lead(Time) shows the next time point
# Calculate how many hours passed from the last recorded measurement to 00:00 (the difference for the last recorded measurement will be NA)
LabelMissing_App$difference[is.na(LabelMissing_App$difference)] <-
24 - LabelMissing_App$Time[is.na(LabelMissing_App$difference)]
# Set last measurement to 0 (because the last day can end at any point)
LabelMissing_App$difference[nrow(LabelMissing_App)] <- 0
# Compare whether the difference is greater than our missing threshold
LabelMissing_App$difference_index[LabelMissing_App$difference > MissingThreshold] <-
1
We have now created a dataset that shows, for each day, how many hours without data lie between consecutive measurements. This variable is called “difference”. If the difference exceeds our threshold, difference_index is set to one.
floor_date(Date, "1 day") | Date | Time | difference | difference_index |
---|---|---|---|---|
2021-10-25 | 2021-10-25 17:00:00 | 17 | 1 | 0 |
2021-10-25 | 2021-10-25 18:00:00 | 18 | 1 | 0 |
2021-10-25 | 2021-10-25 19:00:00 | 19 | 1 | 0 |
2021-10-25 | 2021-10-25 20:00:00 | 20 | 2 | 0 |
2021-10-25 | 2021-10-25 22:00:00 | 22 | 1 | 0 |
2021-10-25 | 2021-10-25 23:00:00 | 23 | 1 | 0 |
2021-10-26 | 2021-10-26 07:00:00 | 7 | 1 | 0 |
2021-10-26 | 2021-10-26 08:00:00 | 8 | 1 | 0 |
2021-10-26 | 2021-10-26 09:00:00 | 9 | 1 | 0 |
2021-10-26 | 2021-10-26 10:00:00 | 10 | 1 | 0 |
As a next step, we label all data points from a day as missing if no app usage was recorded for more than X subsequent hours during that day.
#First, we label all NA as 0
FeatureOverviewApps[is.na(FeatureOverviewApps)] <- 0
# Second, we label all data points from a day as missing if X hours during that day are missing
for (i in 1:nrow(FeatureOverviewApps)) {
# We loop through each row of the dataset, if the date has a difference_index of one, this specific day will be labeled as missing. A difference_index of one means that more hours than the missing threshold are missing for that day.
if (!floor_date(FeatureOverviewApps$Date[i], "1 day") %in% LabelMissing_App$`floor_date(Date, "1 day")`[LabelMissing_App$difference_index == 0]) {
FeatureOverviewApps[i, -1] = NA
}
}
# You can add other measures here
#.....
As the last step, we merge the datasets from the different sensors with the ESM data.
# Merge different datasets
FeatureOverview <- merge(FeatureOverviewApps,FeatureOverviewWifi)
FeatureOverview$timescale_beforeESM <- paste(rule_int, "h", sep = "")
FeatureOverview$ParticipantNumber = "Example Participant"
# Merge with the ESM data (pa = positive affect)
FeatureOverview <- right_join(FeatureOverview,ESM, by = c("Date"))
FeatureOverview['pa_mean'] <- FeatureOverview %>% select(c("pa_happy_sliderNegPos","pa_energy_sliderNegPos","pa_relax_sliderNegPos")) %>% rowMeans()
Date | COMMUNICATION_min | FOOD_DRINK_min | HEALTH_FITNESS_min | MUSIC_AUDIO_min | NA_min | NEWS_MAGAZINES_min | SHOPPING_min | SOCIAL_min | TOOLS_min | APP_USAGE_min | TRAVEL_LOCAL_min | VIDEOPLAYERS_EDITORS_min | PRODUCTIVITY_min | BOOKS_REFERENCE_min | COMICS_min | PERSONALIZATION_min | TOTAL_MACHASHES_number | UNIQUE_MACHASHES_number | timescale_beforeESM | ParticipantNumber | pa_happy_sliderNegPos | pa_energy_sliderNegPos | pa_relax_sliderNegPos | pa_mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-10-25 18:59:38 | 0.1166667 | 0.8333333 | 0.1333333 | 1.0333333 | 0.7333333 | 0.0333333 | 0.1166667 | 1.7833333 | 0.0500000 | 4.833333 | 0.0000000 | 0.00 | 0 | 0 | 0 | 0 | 9 | 0 | 3h | Example Participant | NA | NA | NA | NA |
2021-10-26 07:46:39 | 0.0500000 | 0.0000000 | 0.0000000 | 0.0166667 | 0.0000000 | 0.1833333 | 0.0000000 | 0.0166667 | 0.0000000 | 0.450000 | 0.1833333 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 3h | Example Participant | 5.297842 | 2.061052 | 3.567544 | 3.642146 |
2021-10-26 11:38:59 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.3000000 | 0.0666667 | 0.0000000 | 1.3666667 | 0.0000000 | 1.783333 | 0.0000000 | 0.05 | 0 | 0 | 0 | 0 | 20 | 0 | 3h | Example Participant | 6.311747 | 13.104694 | 6.047756 | 8.488066 |
2021-10-26 14:22:00 | 0.4166667 | 0.0333333 | 0.0000000 | 0.0000000 | 1.7000000 | 0.0833333 | 0.0000000 | 1.4500000 | 0.8666667 | 4.550000 | 0.0000000 | 0.00 | 0 | 0 | 0 | 0 | 52 | 0 | 3h | Example Participant | 10.939742 | 12.703427 | 10.068575 | 11.237248 |
2021-10-26 16:05:10 | 0.5000000 | 0.0333333 | 0.0000000 | 0.0000000 | 0.5333333 | 0.2000000 | 0.0000000 | 0.3166667 | 0.0000000 | 1.583333 | 0.0000000 | 0.00 | 0 | 0 | 0 | 0 | 0 | 0 | 3h | Example Participant | 5.962909 | 4.608161 | 6.854507 | 5.808525 |
To better understand how the data vary over different time scales, it is a good idea to start by visualizing the data. For this, we use violin plots. Violin plots show the distribution of the data, here aggregated over the three hours before each ESM questionnaire was filled out. Additionally, we display the jittered data points next to the violins. Jittering adds random noise to the plotted positions of the data points to reduce overlap. In our paper, we explore other time scales as well. The code provided can also display more than one time scale if several are present in the dataset.
# Here we transform the variable into an ordered factor, so that the time scales will be displayed in the right order.
FeatureOverview$timescale_beforeESM <- ordered(FeatureOverview$timescale_beforeESM,levels = c("1h","3h","6h","9h","12h","24h"))
FeatureOverview$timescale_beforeESM_num <- recode(FeatureOverview$timescale_beforeESM, "1h" = 1, "3h" = 3,"6h" = 6,"9h" = 9,"12h" = 12, "24h" = 24 ) #This is used to calculate the proportion of time spent on apps
# We use ggplot to create the violin plots, more information about ggplot2 can be found here: https://ggplot2.tidyverse.org/
source("https://raw.githubusercontent.com/datavizpyr/data/master/half_flat_violinplot.R")
FeatureOverview[FeatureOverview$ParticipantNumber == "Example Participant", ] %>% ggplot(
.,
aes(
x = 1,
y = APP_USAGE_min / timescale_beforeESM_num / 60,
fill = timescale_beforeESM,
color = timescale_beforeESM
)
) + #We divide the minutes of app usage by the length of the aggregation window in minutes (hours * 60) to calculate the proportion of time spent on apps
geom_flat_violin(alpha = 0.5,
position = position_nudge(x = .3, y = 0),
width = 0.6) + # Here we add the violin plot
geom_point(position = position_jitter(seed = 1, width = 0.15, height = 0)) + # Here we add the jittered data points (jittered on the horizontal axis)
theme_minimal() +
xlab("") +
ylab("") +
labs(fill = '', color = '') +
theme(
legend.position = "none",
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
text = element_text(size = 14),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
) +
facet_grid(cols = vars(timescale_beforeESM)) + # shows different time scales next to each other, if present.
scale_color_manual(
breaks = c("1h", "3h", "6h", "9h", "12h", "24h"),
values = c(
"#440154",
"#443983",
"#31688e",
"#21918c",
"#35b779",
"#90d743"
)
) +
scale_fill_manual(
breaks = c("1h", "3h", "6h", "9h", "12h", "24h"),
values = c(
"#440154",
"#443983",
"#31688e",
"#21918c",
"#35b779",
"#90d743"
)
)
The violin plot shows the variation in how much time someone spent on apps in the three hours before each ESM questionnaire was filled out, expressed as a proportion of that window. This means that the graph does not represent raw app usage but the aggregated version before each filled-out ESM questionnaire. For a three-hour level of aggregation, the time the participant spent on apps is generally quite short (the proportion fluctuates between 0 and 0.05, i.e., at most about 5% of the window).
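To make the values on the vertical axis concrete, the short check below uses the first aggregated value from the feature table above (4.83 minutes of app usage) and shows how it translates into a proportion of the three-hour window:
# 4.83 minutes of app usage within a 3-hour (180-minute) aggregation window
app_usage_min <- 4.8333333
app_usage_min / (3 * 60) # ~0.027, i.e., roughly 2.7% of the window was spent on apps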
To investigate the impact of choosing different time scales when relating multiple variables to one another, we calculate the bivariate correlation between minutes spent on apps and positive affect as measured through ESM. Here we provide example code for generating the scatter plot including the correlation. We again use app usage aggregated over the three hours before ESM was filled out. In our paper, we explore other time scales as well. The code provided can also display more than one time scale if several are present in the dataset.
ggplot(FeatureOverview[FeatureOverview$ParticipantNumber == "Example Participant", ], aes(x =
APP_USAGE_min, y = pa_mean)) +
geom_point(colour = "#440154FF") +
geom_smooth(
method = "lm", #adds linear regression to the plot
se = TRUE,
fullrange = FALSE,
level = 0.95,
colour = "#21908CFF",
fill = "#21908CFF"
) +
stat_cor( # adds the correlation coefficient to the plot (stat_cor() comes from the ggpubr package)
method = "pearson",
p.accuracy = 0.01,
r.accuracy = 0.01,
label.y = 0.3,
label.sep = ",",
label.x = 0,
na.rm = TRUE,
cor.coef.name = c("r")
) +
theme_minimal() +
facet_grid(cols = vars(timescale_beforeESM), scales = "free_x") + # shows different time scales next to each other, if present.
ylim(c(0, 10)) +
ylab("Positive Affect") +
xlab("Minutes beeing at Home") +
theme(text = element_text(size = 16), axis.text.x = element_text(hjust = 0.8))
Researchers often choose a particular moving window when using passive smartphone measures to build (individualized) prediction models for mood and psychological symptoms. For these kinds of prediction models, it is common to split the data into a training set (used to build the prediction model) and a test set (used to evaluate the prediction performance). Since we use time-series data, it is important to take the order of the data into account when splitting it into a training and a test set. In other words, we must respect the temporal direction: we only use past data to predict momentary positive affect. For this, a moving-window cross-validation strategy can be used, in which we train the model on past data points to predict the next data point. This procedure is repeated until the end of the dataset is reached. The test set then contains all predictions made across the different models.
We use the caret package for our analysis. More information about the caret package and time series cross-validation can be found here: https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series
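Before applying this to our data, a small toy example (with arbitrary numbers) illustrates how caret's createTimeSlices(), which we also use below, splits time-ordered data into consecutive training windows and single test points:
library(caret)
# 10 time-ordered observations, a fixed training window of 6, predicting the next observation
slices <- createTimeSlices(1:10, initialWindow = 6, horizon = 1, fixedWindow = TRUE)
slices$train[[1]] # observations 1-6 form the first training set
slices$test[[1]] # observation 7 is the first test point
slices$train[[2]] # observations 2-7 form the second training set, and so on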
# First we omit rows with missing data. How to handle missing data once it has been identified (for example, whether to impute it or to exclude it) is another important decision. For simplicity, we decided to exclude missing data here. The participant we used in our paper for the prediction model did not have any missing data, so we did not have to make this decision there.
FeatureOverview <- na.omit(FeatureOverview)
# Here we specify what we want to predict and which passive smartphone measures we are using for doing so
formula <- as.formula(paste("pa_mean ~ SOCIAL_min +
COMMUNICATION_min +
APP_USAGE_min +
TOTAL_MACHASHES_number +
UNIQUE_MACHASHES_number"))
# Here we specify the minimum and maximum size of the moving window
min_window = 6 # smallest moving window: six data points are used to train the model
max_window = 8 # largest moving window: eight data points are used to train the model
# For our analysis we made use of parallel computing
cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)
OverallResults <- list() #create a list to store the results
for(k in c(min_window:max_window)){ # here we loop through the different moving window sizes
start_time <- Sys.time()
set.seed(12361488)
timeSlices <- createTimeSlices(1:nrow(FeatureOverview),
initialWindow = k, horizon = 1, fixedWindow = TRUE) # We use the createTimeSlices to split our dataframe into a train and test set
trainSlices <- timeSlices[[1]]
testSlices <- timeSlices[[2]]
fitControl <- trainControl(method = "LOOCV") # use leave-one-out cross-validation to choose the best hyperparameter (for random forests, mtry = the number of predictors sampled at each split) within the training set
pred <- rep(NA,length(trainSlices)) #we create empty vectors to store the results
true <- rep(NA,length(trainSlices))
w <- rep(NA,length(trainSlices))
index <- rep(NA,length(testSlices))
for(i in 1:length(trainSlices)){
model<- train(formula, data=FeatureOverview[trainSlices[[i]],],trControl = fitControl, method="rf", na.action = na.omit) #Here the prediction model (a random forest) is built
# Next we store the results from the prediction model by making predictions for our test set. The predict() function automatically uses the "best" model (based on LOOCV in the training set)
pred[i] <- predict(model,FeatureOverview[testSlices[[i]],])
true[i] <- FeatureOverview$pa_mean[testSlices[[i]]]
w[i] <- k
index[i] <- as.numeric(testSlices[[i]])
}
end_time <- Sys.time()
results <- cbind(pred,true,w,index)
OverallResults[[k-min_window+1]] <- results # store results in our empty list
print(paste("Completed Time Window:",w[1],", Running time:", paste(end_time - start_time)))
}
## [1] "Completed Time Window: 6 , Running time: 41.7029678821564"
## [1] "Completed Time Window: 7 , Running time: 28.0341827869415"
## [1] "Completed Time Window: 8 , Running time: 28.2695298194885"
stopCluster(cl)
We create a data frame that shows the predicted (pred) and true values for different moving window sizes (w). The “index” column indicates for which data point the prediction was made. If we use a moving window of six, the first predicted point is the seventh data point; in contrast, with a moving window of seven, the first predicted point is the eighth data point.
OverallResults <- as.data.frame(do.call("rbind",OverallResults)) #Combine lists into one dataframe
pred | true | w | index |
---|---|---|---|
6.373170 | 7.152279 | 6 | 7 |
8.380912 | 9.002743 | 6 | 8 |
8.721968 | 8.148316 | 6 | 9 |
7.098829 | 7.792369 | 6 | 10 |
7.838884 | 7.692385 | 6 | 11 |
8.429940 | 5.947623 | 6 | 12 |
7.397675 | 3.490151 | 6 | 13 |
4.921025 | 6.432927 | 6 | 14 |
6.977998 | 6.894285 | 6 | 15 |
6.645987 | 9.152462 | 6 | 16 |
As a next step, we calculate the prediction accuracy and plot the results. The code provided can plot the results for different moving window sizes.
OverallResults <-
OverallResults[OverallResults$index > max_window, ] # make test set the same length for all moving window sizes
# Create empty dataframe to store results
size = length(max_window:min_window)
OverallResultsPerformance <-
data.frame(
i = unique(OverallResults$w),
cor = rep(NA, size),
cor.int_1 = rep(NA, size),
cor.int_2 = rep(NA, size),
rsq = rep(NA, size),
RMSE = rep(NA, size)
)
# calculate prediction accuracy for each moving window (w)
for (i in unique(OverallResults$w)) {
results <-
OverallResults[OverallResults$w == i, ] # loop through different moving windows
cor = cor.test(results$pred, results$true)$estimate # calculate/save the correlation
cor.int_1 = cor.test(results$pred, results$true)$conf.int[1] # save lower confidence interval
cor.int_2 = cor.test(results$pred, results$true)$conf.int[2] # save upper confidence interval
rss <-
sum((results$pred - results$true) ^ 2) # residual sum of squares
tss <-
sum((results$true - mean(results$true)) ^ 2) # total sum of squares
rsq <- 1 - rss / tss
RMSE <- RMSE(results$pred, results$true)
OverallResultsPerformance[OverallResultsPerformance$i == i, ] <-
cbind(i, cor, cor.int_1, cor.int_2, rsq, RMSE)
}
ggplot() +
geom_line(aes(y = OverallResultsPerformance$rsq, x = OverallResultsPerformance$i, color = "3h")) +
# ... different colors could be added
theme_minimal() +
xlab("Moving Window Size") +
ylab("R-squared") +
scale_color_viridis(
discrete = TRUE,
limits = c("24h", "12h", "9h", "6h", "3h", "1h", "rolling mean")
) +
labs(color = "Level of Aggregation")
In this example, the prediction accuracy is very low (the R-squared is even negative, which means that the predictions are worse than simply predicting the mean of the observed values). This is because we use a dataset to which random noise was added and we only use a few predictors. Using the real dataset with more predictors led to a reasonable R-squared.
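As a minimal numeric illustration (with made-up values) of how the R-squared computed above can become negative:
# R-squared = 1 - RSS/TSS is negative when the residual sum of squares exceeds
# the total sum of squares, i.e., when the predictions are worse than simply
# predicting the mean of the observed values
true <- c(4, 6, 5, 7)
pred <- c(8, 2, 9, 1) # made-up predictions that are far off
rss <- sum((pred - true)^2) # residual sum of squares = 84
tss <- sum((true - mean(true))^2) # total sum of squares = 5
1 - rss / tss # = -15.8, a negative R-squared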