Example Datasets

Before showing how to aggregate and analyze the data on different time scales, we provide an example of what our recorded raw data look like. For privacy reasons, random noise is added to each example dataset. We selected two sensors for illustrative purposes: Wi-Fi and app usage. These two sensors differ in how they are recorded: whereas Wi-Fi connections are recorded at specific time points, app usage has a start time and a duration. In addition to the two sensors, we show an example of the ESM data.

Wifi

timestamp sha256_salted_mac_hash
2021-10-25 18:26:07.797312295 d9869d0b40ef75cd284a2b896a3d0dabc9e6ae5b29551ea300b4f53dd0130f6c
2021-10-25 18:26:09.172443738 523ad8310fe14f8fd40f53fc014476e21bbc5dd6a0bd402d52f40ab23360439b
2021-10-25 18:26:08.210457816 d9869d0b40ef75cd284a2b896a3d0dabc9e6ae5b29551ea300b4f53dd0130f6c
2021-10-25 18:26:14.657465888 36cb027c97d12c952a9af551d718e854f944d56fb4e7bb96ee69ea7a61ade779
2021-10-25 18:26:29.044966888 e6b727dd9964a9637a97ec343a457737e41d25bebcd44e1ab5c56ad64b76f800
2021-10-25 18:26:13.265038973 6c70ba8cf4fd8e8133c3d0d593d12e3ae25465a8e4cd3b0de5e40c166433ea40
2021-10-25 18:27:01.722724855 2277cb466048be250a04a3ca1464aee1f45fb23fce29f4f3e5b737c1c4ba2d7f
2021-10-25 18:26:07.529904046 252069f26e0ff8bdc133401ce50210b88a341304e0edf58c644787761b13c825
2021-10-25 18:26:07.529359900 72826bc603bbffa7db9ec14583d500a9c13ca2b141674d9d1c3f1d90ae40091d
2021-10-26 11:18:48.053753455 4c6dc8eb70fa64c4a2e157d5462efd188d2a723b2c937937c74d58278b4151e6

App

timestamp duration genre
2021-10-25 17:17:01.986000000 1.869158 NA
2021-10-25 17:17:02.870157536 44.540493 FOOD_DRINK
2021-10-25 18:25:07.788492869 1.662723 COMMUNICATION
2021-10-25 18:25:08.904723355 6.508187 SHOPPING
2021-10-25 18:26:05.376186557 4.973042 NA
2021-10-25 18:27:30.746041812 1.812178 NEWS_MAGAZINES
2021-10-25 18:27:30.594178285 4.918438 FOOD_DRINK
2021-10-25 18:27:34.714437904 17.996723 NA
2021-10-25 18:27:50.917722524 82.216111 SOCIAL
2021-10-25 18:29:26.152111508 7.132128 HEALTH_FITNESS

ESM

Date pa_happy_sliderNegPos pa_energy_sliderNegPos pa_relax_sliderNegPos
2021-10-25 18:59:38.837536516 NA NA NA
2021-10-26 07:46:39.579333615 5.297842 2.061052 3.567544
2021-10-26 11:38:59.334110889 6.311747 13.104694 6.047756
2021-10-26 14:22:00.627950088 10.939742 12.703427 10.068575
2021-10-26 16:05:10.339640619 5.962909 4.608161 6.854507
2021-10-26 19:06:20.441581469 8.982813 8.179666 7.555428
2021-10-26 21:21:32.237598192 8.304694 10.684946 8.264331
2021-10-27 07:43:06.095237021 2.365334 9.936023 9.155478
2021-10-27 11:59:47.196670340 8.448896 6.472782 12.086550
2021-10-27 13:00:48.853281969 8.250955 8.177381 8.016611

Example Code for Aggregating Passive Smartphone Measures

To create meaningful variables and to combine different sensors, we must choose a time window over which to aggregate the measures.

In the following example, we show how to create a dataset of passive smartphone measures aggregated over the X hours before each ESM questionnaire was filled out.
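The original document does not show how the required packages are loaded or how the example datasets (Wifi, App, ESM) are created as R objects. The following is a minimal sketch of this setup, assuming the raw data are stored as CSV files; the file names are hypothetical.

library(dplyr)      # data manipulation (group_by, summarise, joins, recode, lag/lead)
library(tidyr)      # unnest(), spread()
library(lubridate)  # floor_date()
library(ggplot2)    # plotting
library(ggpubr)     # stat_cor() for annotating correlations
library(caret)      # createTimeSlices(), train(), RMSE(); method = "rf" additionally requires the randomForest package to be installed
library(parallel)   # makePSOCKcluster(), detectCores()
library(doParallel) # registerDoParallel()
library(viridis)    # scale_color_viridis()

# Hypothetical file names; adjust to wherever the raw data are stored
Wifi <- read.csv("Wifi.csv")
App  <- read.csv("App.csv")
ESM  <- read.csv("ESM.csv")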

1. Bring data into the right format

We begin by converting all dates to class “POSIXct” in order to represent them as calendar dates and times.

Wifi$timestamp <- as.POSIXct(Wifi$timestamp) 
App$timestamp <- as.POSIXct(App$timestamp)
ESM$Date <- as.POSIXct(ESM$Date)

We continue by adding the end time to the app usage dataset (note that, in the printed output below, the timestamps are truncated to full seconds).

App$end_time <-  as.POSIXct(c(rep(NA,nrow(App)))) # Add "empty" end time column (with the class POSIXct)

App$end_time  <- App$timestamp + App$duration # Add start time (timestamp) and duration together to calculate the end time
timestamp duration genre end_time
2021-10-25 17:17:01 1.869158 NA 2021-10-25 17:17:03
2021-10-25 17:17:02 44.540493 FOOD_DRINK 2021-10-25 17:17:47
2021-10-25 18:25:07 1.662723 COMMUNICATION 2021-10-25 18:25:09
2021-10-25 18:25:08 6.508187 SHOPPING 2021-10-25 18:25:15
2021-10-25 18:26:05 4.973042 NA 2021-10-25 18:26:10
2021-10-25 18:27:30 1.812178 NEWS_MAGAZINES 2021-10-25 18:27:32
2021-10-25 18:27:30 4.918438 FOOD_DRINK 2021-10-25 18:27:35
2021-10-25 18:27:34 17.996723 NA 2021-10-25 18:27:52
2021-10-25 18:27:50 82.216111 SOCIAL 2021-10-25 18:29:13
2021-10-25 18:29:26 7.132128 HEALTH_FITNESS 2021-10-25 18:29:33

Next, we create a dataset that shows app usage per second. This will help us to summarize app usage before the ESM questionnaire was filled out.

# The following code is adapted from:
# https://stackoverflow.com/questions/57245771/how-do-i-fill-time-sequence-based-on-start-and-end-time-in-r

# seq.POSIXt() creates a sequence of times (e.g., 2021-10-25 17:17:01, 2021-10-25 17:17:02, 2021-10-25 17:17:03). Vectorize() wraps it so that it can be applied to whole vectors (= columns) of start and end times at once.
vec_seq <- Vectorize(seq.POSIXt, vectorize.args = c("from", "to"))

App_persecond <- App %>%
  transmute(genre, Date = vec_seq(timestamp, end_time, 1)) %>% # transmute() creates a new variable (Date) from vec_seq(from = timestamp, to = end_time, by = 1), i.e., one time point per second of app usage
  unnest(cols = c(Date)) # displays each value of the time sequence as a separate row
genre Date
NA 2021-10-25 17:17:01
NA 2021-10-25 17:17:02
FOOD_DRINK 2021-10-25 17:17:02
FOOD_DRINK 2021-10-25 17:17:03
FOOD_DRINK 2021-10-25 17:17:04
FOOD_DRINK 2021-10-25 17:17:05
FOOD_DRINK 2021-10-25 17:17:06
FOOD_DRINK 2021-10-25 17:17:07
FOOD_DRINK 2021-10-25 17:17:08
FOOD_DRINK 2021-10-25 17:17:09

2. Match with ESM data

As a next step, we match our datasets (from the different measures) with the ESM data. For this, we loop through the filled-out ESM questionnaires and summarize the passive measures over the X hours before each questionnaire was filled out. X can be any value; in this example, we aggregate the measures over the three hours before the ESM questionnaire was filled out. In our paper, we explore other time windows as well.

### DECIDE HERE ON WHICH LEVEL YOU WANT TO AGGREGATE THE PASSIVE MEASURES ####
rule_int = 3 # How many hours before the ESM questionnaire should the variables be aggregated over?


#### Create dataframes that we will grow in our loop
FeatureOverviewApps <- data.frame(Date = ESM$Date[1])
FeatureOverviewWifi <- data.frame(Date = ESM$Date[1])

for (i in 1:nrow(ESM)) {
  time = as.POSIXct(ESM$Date[i]) # loop through each time point at which an ESM questionnaire was filled out
  
  ####### APP USAGE #########
  
  # Here we calculate how long each app category was used before the ESM questionnaire was filled out.
  # We select the rows that fall within the X hours before the ESM questionnaire (time - rule_int hours) but not after the questionnaire was filled out. If no time points fall in this window, an empty tibble is created and, thus, no row is added.
  
  App_beforeESM <-
    App_persecond[(App_persecond$Date > (time - rule_int * 60 * 60)) &
                    (App_persecond$Date < time), ] %>% # We multiply the hours by 60 twice to convert them to seconds
    group_by(genre) %>%
    summarise(Date = time, TotalMinutes = n() / 60) # Each row represents one second, so dividing the row count by 60 gives the usage time in minutes
  
  # Rename the categories for clarity (e.g., "SOCIAL" becomes "SOCIAL_min")
  App_beforeESM$Genre_min = paste(App_beforeESM$genre, "_min", sep = "")
  App_beforeESM <-
    App_beforeESM[, -1] # The original genre column is no longer needed
  
  # Bring data in wide format
  App_beforeESM <- spread(App_beforeESM, Genre_min, TotalMinutes)
  
  ### Here we add the total minutes of app usage
  # For this we simply calculate the sum
  App_beforeESM$APP_USAGE_min <- rowSums(App_beforeESM[, -1])
  
  FeatureOverviewApps = full_join(FeatureOverviewApps, App_beforeESM)
  
  ####### Wifi #########
  TOTAL_MACHASHES_number <-
    Wifi[(Wifi$timestamp > (time - rule_int * 60 * 60)) &
           (Wifi$timestamp < time), ] %>% nrow() # Here we calculate the total number of recorded wifi connections (X hours before but not after the ESM was filled out)
  
  UNIQUE_MACHASHES_number <-
    Wifi[(Wifi$timestamp > (time - rule_int * 60 * 60)) &
           (Wifi$timestamp < time), ] %>% .[!duplicated(.$sha256_salted_mac_hash),]  %>% nrow() # Here we calculate the number of unique recorded wifi connections (X hours before but not after the ESM was filled out); duplicates are identified via the hashed MAC address column shown in the example data
  
  Wifi_beforeESM <-
    data.frame(
      Date = time,
      TOTAL_MACHASHES_number = TOTAL_MACHASHES_number,
      UNIQUE_MACHASHES_number = UNIQUE_MACHASHES_number
    )
  
  FeatureOverviewWifi = full_join(FeatureOverviewWifi, Wifi_beforeESM)
  
  ###### More Features & Sensors can be added here #####
  ### ......
}

Example Aggregated Datasets

Wifi

Date TOTAL_MACHASHES_number UNIQUE_MACHASHES_number
2021-10-25 18:59:38 9 0
2021-10-26 07:46:39 0 0
2021-10-26 11:38:59 20 0
2021-10-26 14:22:00 52 0
2021-10-26 16:05:10 0 0
2021-10-26 19:06:20 0 0
2021-10-26 21:21:32 0 0
2021-10-27 07:43:06 0 0
2021-10-27 11:59:47 15 0
2021-10-27 13:00:48 24 0

App

Date COMMUNICATION_min FOOD_DRINK_min HEALTH_FITNESS_min MUSIC_AUDIO_min NA_min NEWS_MAGAZINES_min SHOPPING_min SOCIAL_min TOOLS_min APP_USAGE_min TRAVEL_LOCAL_min VIDEOPLAYERS_EDITORS_min PRODUCTIVITY_min BOOKS_REFERENCE_min COMICS_min PERSONALIZATION_min
2021-10-25 18:59:38 0.1166667 0.8333333 0.1333333 1.0333333 0.7333333 0.0333333 0.1166667 1.7833333 0.0500000 4.8333333 NA NA NA NA NA NA
2021-10-26 07:46:39 0.0500000 NA NA 0.0166667 NA 0.1833333 NA 0.0166667 NA 0.4500000 0.1833333 NA NA NA NA NA
2021-10-26 11:38:59 NA NA NA NA 0.3000000 0.0666667 NA 1.3666667 NA 1.7833333 NA 0.0500000 NA NA NA NA
2021-10-26 14:22:00 0.4166667 0.0333333 NA NA 1.7000000 0.0833333 NA 1.4500000 0.8666667 4.5500000 NA NA NA NA NA NA
2021-10-26 16:05:10 0.5000000 0.0333333 NA NA 0.5333333 0.2000000 NA 0.3166667 NA 1.5833333 NA NA NA NA NA NA
2021-10-26 19:06:20 0.4000000 NA NA NA 0.1166667 0.2166667 0.0666667 0.1500000 NA 0.9500000 NA NA NA NA NA NA
2021-10-26 21:21:32 0.4000000 NA NA NA 0.2333333 0.2666667 NA 0.1500000 NA 1.0500000 NA NA NA NA NA NA
2021-10-27 07:43:06 0.5333333 NA NA NA 0.0833333 NA NA NA 0.0500000 0.9166667 0.2500000 NA NA NA NA NA
2021-10-27 11:59:47 0.1500000 NA NA NA 0.1333333 NA NA 0.0666667 0.0666667 0.5333333 0.1166667 NA NA NA NA NA
2021-10-27 13:00:48 0.3833333 NA NA NA 0.1666667 NA NA 0.2833333 0.0666667 1.2000000 0.1166667 0.0166667 0.1666667 NA NA NA

Example Code for Labeling Missing Values

Missing data must be identified. In contrast to most actively assessed data, it is not always directly observable whether passive data are missing. For measures such as app usage, we do not know whether data are absent because no apps were used or because of technical problems. This means that we must decide after which period without recorded app usage we exclude the data.

In this example, we check for each day whether more than 18 consecutive hours are without recorded app usage. If so, we exclude all data from that day.

FeatureOverviewApps <- right_join(FeatureOverviewApps,data.frame(Date = ESM$Date), by = c("Date")) # Here we also include ESM time points in our feature overview that have no app data
#### SET HERE THE MISSING Threshold ####
MissingThreshold <-
  18 # Here we set it to 18 hours (other windows could have been chosen as well)

#Show app usage per hour, to calculate how many hours per day are without app usage
LabelMissing_App <- App_persecond %>%
  group_by(Date = floor_date(Date, "1 hour")) %>%
  summarize(Usage = n() / 60)

LabelMissing_App$Time <-
  format(LabelMissing_App$Date, format = "%H")
LabelMissing_App$Time <- as.numeric(LabelMissing_App$Time)

# Calculate differences: how many hours lie between consecutive recorded hours?
LabelMissing_App <- LabelMissing_App %>%
  group_by(floor_date(Date, "1 day")) %>% # compute the differences within each day
  reframe(Date = Date,
          Time = Time,
          difference = abs(Time - lag(Time))) # lag(Time) gives the previous recorded hour, which makes it possible to calculate the difference between two consecutive time points. reframe() is used because returning more than one row per group from summarise() is deprecated since dplyr 1.1.0.
# Calculate how many hours passed from 00:00 to the first recorded app usage (the first difference will be "NA")
LabelMissing_App$difference[is.na(LabelMissing_App$difference)] <-
  LabelMissing_App$Time[is.na(LabelMissing_App$difference)]

# Set the very first measurement to 0 (because the data collection might have started at any point of the day). Thus, we don't want to exclude the first day.
LabelMissing_App$difference[1] <- 0

# Create an index
LabelMissing_App$difference_index <- 0

# Compare whether the difference is greater than our missing threshold
LabelMissing_App$difference_index[LabelMissing_App$difference > MissingThreshold] <-
  1

######## Check the other direction (the same again, just with lead() instead of lag())
LabelMissing_App <- LabelMissing_App %>%
  group_by(floor_date(Date, "1 day")) %>% # compute the differences within each day
  reframe(
    Date = Date,
    Time = Time,
    difference = abs(Time - lead(Time, 1)),
    difference_index = difference_index
  ) # lead(Time) gives the next recorded hour, so we now look at the gap in the other direction
# Calculate how many hours passed from the last recorded measurement to the end of the day (the difference for the last recorded measurement will be NA)
LabelMissing_App$difference[is.na(LabelMissing_App$difference)] <-
  24 - LabelMissing_App$Time[is.na(LabelMissing_App$difference)]

# Set last measurement to 0 (because the last day can end at any point)
LabelMissing_App$difference[nrow(LabelMissing_App)] <- 0

# Compare whether the difference is greater than our missing threshold
LabelMissing_App$difference_index[LabelMissing_App$difference > MissingThreshold] <-
  1 

We created a dataset that shows, for each day, the difference in hours between consecutive recorded measurements. This variable is called “difference”. A difference of 1 means the recorded hours are adjacent; a difference of 2 (as between 20:00 and 22:00 in the example below) means that one hour in between had no recorded app usage. If the difference exceeds our threshold, difference_index is set to one.

floor_date(Date, "1 day") Date Time difference difference_index
2021-10-25 2021-10-25 17:00:00 17 1 0
2021-10-25 2021-10-25 18:00:00 18 1 0
2021-10-25 2021-10-25 19:00:00 19 1 0
2021-10-25 2021-10-25 20:00:00 20 2 0
2021-10-25 2021-10-25 22:00:00 22 1 0
2021-10-25 2021-10-25 23:00:00 23 1 0
2021-10-26 2021-10-26 07:00:00 7 1 0
2021-10-26 2021-10-26 08:00:00 8 1 0
2021-10-26 2021-10-26 09:00:00 9 1 0
2021-10-26 2021-10-26 10:00:00 10 1 0

As a next step, we label all data points from a day as missing if no app usage was recorded for more than X (here: 18) consecutive hours during that day.

# First, we replace all NAs with 0 (no recorded usage of an app category means 0 minutes)
FeatureOverviewApps[is.na(FeatureOverviewApps)] <- 0

# Second, we label all data points from a day as missing if too many hours of that day are without recorded data
for (i in 1:nrow(FeatureOverviewApps)) {
  # We loop through each row of the dataset. If the corresponding day is not among the days with a difference_index of zero (i.e., the day either has no recorded app usage at all or all of its recorded time differences exceed the missing threshold), the passive measures of that row are set to NA.
  if (!floor_date(FeatureOverviewApps$Date[i], "1 day") %in% LabelMissing_App$`floor_date(Date, "1 day")`[LabelMissing_App$difference_index == 0]) {
    FeatureOverviewApps[i, -1] = NA
  }
}

# You can add other measures here
#..... 

As a last step, we merge the datasets from the different sensors with the ESM data.

# Merge different datasets
FeatureOverview <- merge(FeatureOverviewApps,FeatureOverviewWifi)
FeatureOverview$timescale_beforeESM <- paste(rule_int, "h", sep = "")
FeatureOverview$ParticipantNumber = "Example Participant"

# Merge with the ESM data and compute mean positive affect (pa)
FeatureOverview <- right_join(FeatureOverview,ESM, by = c("Date"))
FeatureOverview['pa_mean'] <- FeatureOverview %>% select(c("pa_happy_sliderNegPos","pa_energy_sliderNegPos","pa_relax_sliderNegPos")) %>% rowMeans()
Date COMMUNICATION_min FOOD_DRINK_min HEALTH_FITNESS_min MUSIC_AUDIO_min NA_min NEWS_MAGAZINES_min SHOPPING_min SOCIAL_min TOOLS_min APP_USAGE_min TRAVEL_LOCAL_min VIDEOPLAYERS_EDITORS_min PRODUCTIVITY_min BOOKS_REFERENCE_min COMICS_min PERSONALIZATION_min TOTAL_MACHASHES_number UNIQUE_MACHASHES_number timescale_beforeESM ParticipantNumber pa_happy_sliderNegPos pa_energy_sliderNegPos pa_relax_sliderNegPos pa_mean
2021-10-25 18:59:38 0.1166667 0.8333333 0.1333333 1.0333333 0.7333333 0.0333333 0.1166667 1.7833333 0.0500000 4.833333 0.0000000 0.00 0 0 0 0 9 0 3h Example Participant NA NA NA NA
2021-10-26 07:46:39 0.0500000 0.0000000 0.0000000 0.0166667 0.0000000 0.1833333 0.0000000 0.0166667 0.0000000 0.450000 0.1833333 0.00 0 0 0 0 0 0 3h Example Participant 5.297842 2.061052 3.567544 3.642146
2021-10-26 11:38:59 0.0000000 0.0000000 0.0000000 0.0000000 0.3000000 0.0666667 0.0000000 1.3666667 0.0000000 1.783333 0.0000000 0.05 0 0 0 0 20 0 3h Example Participant 6.311747 13.104694 6.047756 8.488066
2021-10-26 14:22:00 0.4166667 0.0333333 0.0000000 0.0000000 1.7000000 0.0833333 0.0000000 1.4500000 0.8666667 4.550000 0.0000000 0.00 0 0 0 0 52 0 3h Example Participant 10.939742 12.703427 10.068575 11.237248
2021-10-26 16:05:10 0.5000000 0.0333333 0.0000000 0.0000000 0.5333333 0.2000000 0.0000000 0.3166667 0.0000000 1.583333 0.0000000 0.00 0 0 0 0 0 0 3h Example Participant 5.962909 4.608161 6.854507 5.808525

Example Code for Analyzing Data Aggregated on Different Time Scales

Violin Plots

To better understand how the data vary over different time scales, it is a good idea to start by visualizing them. For this, we use violin plots, which show the distribution of the data, here aggregated over the three hours before each ESM questionnaire was filled out. Additionally, we display the jittered data points next to the violins. Jittering adds random noise to the plotted positions of the data points in order to reduce overlap. In our paper, we explore other time scales as well. The code provided can also display more than one time scale if several are present in the dataset.

# Here we transform the variable into an ordered factor so that the time scales are displayed in the right order.
FeatureOverview$timescale_beforeESM <- ordered(FeatureOverview$timescale_beforeESM,levels = c("1h","3h","6h","9h","12h","24h"))
FeatureOverview$timescale_beforeESM_num <- recode(FeatureOverview$timescale_beforeESM, "1h" = 1, "3h" = 3,"6h" = 6,"9h" = 9,"12h" = 12, "24h" = 24 ) # This numeric version is used to calculate the proportion of the time window spent on apps

# We use ggplot to create the violin plots, more information about ggplot2 can be found here: https://ggplot2.tidyverse.org/
source("https://raw.githubusercontent.com/datavizpyr/data/master/half_flat_violinplot.R")

FeatureOverview[FeatureOverview$ParticipantNumber == "Example Participant", ] %>%  ggplot(
  .,
  aes(
    x = 1,
    y = APP_USAGE_min / timescale_beforeESM_num / 60,
    fill = timescale_beforeESM,
    color = timescale_beforeESM
  )
) + # We divide the minutes of app usage by the hours of aggregation and by 60 to obtain the proportion of the time window spent on apps
  geom_flat_violin(alpha = 0.5,
                   position = position_nudge(x = .3, y = 0),
                   width = 0.6) + # Here we add the violin plot
  geom_point(position = position_jitter(seed = 1, width = 0.15, height = 0)) + # Here we add the jittered data points (jittered along the horizontal axis)
  theme_minimal() +
  xlab("") +
  ylab("") +
  labs(fill = '', color = '') +
  theme(
    legend.position = "none",
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    text = element_text(size = 14),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  ) +
  facet_grid(cols = vars(timescale_beforeESM)) + # shows different time scales next to each other, if present.
  scale_color_manual(
    breaks = c("1h", "3h", "6h", "9h", "12h", "24h"),
    values = c(
      "#440154",
      "#443983",
      "#31688e",
      "#21918c",
      "#35b779",
      "#90d743"
    )
  ) +
  scale_fill_manual(
    breaks = c("1h", "3h", "6h", "9h", "12h", "24h"),
    values = c(
      "#440154",
      "#443983",
      "#31688e",
      "#21918c",
      "#35b779",
      "#90d743"
    )
  )

The violin plot shows how much of the three hours before each ESM questionnaire the participant spent on apps (as a proportion of the time window). The graph therefore does not represent raw app usage but the aggregated version before each filled-out ESM questionnaire. For example, 4.83 minutes of app usage within a three-hour window corresponds to 4.83 / (3 * 60) ≈ 0.027 of the window. For a three-hour level of aggregation, the time the participant spent on apps is generally quite short (fluctuating between 0 and 0.05, i.e., at most about 5% of the window).

Correlation Analysis

To investigate the impact of choosing different time scales when relating multiple variables to one another, we calculate the bivariate correlation between minutes spent on apps and positive affect as measured through ESM. Below we provide example code for generating the scatter plots including the correlation coefficient. We again use app usage aggregated over the three hours before ESM was filled out. In our paper, we explore other time scales as well. The code provided can also display more than one time scale if several are present in the dataset.

ggplot(FeatureOverview[FeatureOverview$ParticipantNumber == "Example Participant", ],
       aes(x = APP_USAGE_min, y = pa_mean)) +
  geom_point(colour = "#440154FF") +
  geom_smooth(
    method = "lm", #adds linear regression to the plot
    se = TRUE,
    fullrange = FALSE,
    level = 0.95,
    colour = "#21908CFF",
    fill = "#21908CFF"
  ) +
  stat_cor( # adds the correlation coefficient to the plot
    method = "pearson",
    p.accuracy = 0.01,
    r.accuracy = 0.01,
    label.y = 0.3,
    label.sep = ",",
    label.x = 0,
    na.rm = TRUE,
    cor.coef.name = c("r")
  ) +
  theme_minimal() +
  facet_grid(cols = vars(timescale_beforeESM), scales = "free_x") + # shows different time scales next to each other, if present.
  ylim(c(0, 10)) +
  ylab("Positive Affect") +
  xlab("Minutes beeing at Home")  +
  theme(text = element_text(size = 16), axis.text.x = element_text(hjust = 0.8))

Choosing Different Moving Windows While Analyzing the Data

Researchers often choose a particular moving window when using passive smartphone measures to build (individualized) prediction models for mood and psychological symptoms. For these kinds of prediction models, it is common to split the data into a training set (used to build the prediction model) and a test set (used to evaluate the prediction performance). Since we use time-series data, it is important to take the order of the data into account when splitting, that is, to respect the temporal direction so that only past data are used to predict momentary positive affect. For this, a moving-window cross-validation strategy can be used: we train the model on past data points to predict the next data point and repeat this procedure until the end of the dataset is reached. The test set then contains all predictions made across the different models.

We use the caret package for our analysis. More information about the caret package and time series cross-validation can be found here: https://topepo.github.io/caret/data-splitting.html#data-splitting-for-time-series
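To build an intuition for how createTimeSlices() splits time-series data, the following minimal sketch (not part of the original analysis) shows the indices it produces for ten hypothetical observations with a fixed training window of six and a prediction horizon of one:

slices <- createTimeSlices(1:10, initialWindow = 6, horizon = 1, fixedWindow = TRUE)
slices$train[[1]] # 1 2 3 4 5 6  -> train on the first six observations
slices$test[[1]]  # 7            -> predict the seventh observation
slices$train[[2]] # 2 3 4 5 6 7  -> the window then moves forward by one observation
slices$test[[2]]  # 8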

#  First, we omit rows with missing data. How to handle missing data once it is identified (e.g., whether to impute or to exclude it) is another important decision. For simplicity, we exclude missing data here. The participant we used for the prediction model in our paper did not have any missing data, so we did not have to make this decision there.

FeatureOverview <- na.omit(FeatureOverview)

# Here we specify what we want to predict and which passive smartphone measures we use as predictors

formula <- as.formula(paste("pa_mean ~  SOCIAL_min +
               COMMUNICATION_min +
               APP_USAGE_min +
               TOTAL_MACHASHES_number +
               UNIQUE_MACHASHES_number"))

# Here we specify the minimum and maximum size of the moving window
min_window = 6 # smallest window: the model is trained on six data points
max_window = 8 # largest window: the model is trained on eight data points

# For our analysis we made use of parallel computing 
cl <- makePSOCKcluster(detectCores()-1)
registerDoParallel(cl)

OverallResults <- list() #create a list to store the results

for(k in c(min_window:max_window)){ # here we loop through the different moving window sizes
  start_time <- Sys.time()
  
  set.seed(12361488)
  timeSlices <- createTimeSlices(1:nrow(FeatureOverview), 
                                 initialWindow = k, horizon = 1, fixedWindow = TRUE) # We use the createTimeSlices to split our dataframe into a train and test set
  
  trainSlices <- timeSlices[[1]]
  testSlices <- timeSlices[[2]]
  
  fitControl <- trainControl(method = "LOOCV") # Use leave-one-out cross-validation to choose the best hyperparameter (mtry, the number of predictors sampled at each split) within the training set.
  
  pred <- rep(NA,length(trainSlices)) #we create empty vectors to store the results
  true <- rep(NA,length(trainSlices))
  w <- rep(NA,length(trainSlices))
  index <- rep(NA,length(testSlices))
  
  for(i in 1:length(trainSlices)){
    
    model <- train(formula, data = FeatureOverview[trainSlices[[i]], ], trControl = fitControl, method = "rf", na.action = na.omit) # Here the prediction model (a random forest) is built
    
  # Next we store the results from the prediction model by making predictions for our test set. The predict() function automatically uses the "best" model (based on LOOCV in the training set).
    
    pred[i] <- predict(model,FeatureOverview[testSlices[[i]],])
    true[i] <- FeatureOverview$pa_mean[testSlices[[i]]]
    w[i] <- k
    index[i] <- as.numeric(testSlices[[i]])
  
  }
 
    end_time <- Sys.time()
    results <- cbind(pred,true,w,index)
    OverallResults[[k-min_window+1]] <- results # store results in our empty list
    print(paste("Completed Time Window:",w[1],", Running time:", paste(end_time - start_time)))
}
## [1] "Completed Time Window: 6 , Running time: 41.7029678821564"
## [1] "Completed Time Window: 7 , Running time: 28.0341827869415"
## [1] "Completed Time Window: 8 , Running time: 28.2695298194885"
stopCluster(cl)

We create a data frame that shows the predicted (pred) and true values for different moving window sizes (w). The “index” column indicates for which data point the prediction was made. If we use a moving window of six, the first predicted point is the seventh data point; with a moving window of seven, the first predicted point is the eighth data point.

OverallResults <- as.data.frame(do.call("rbind",OverallResults)) #Combine lists into one dataframe
pred true w index
6.373170 7.152279 6 7
8.380912 9.002743 6 8
8.721968 8.148316 6 9
7.098829 7.792369 6 10
7.838884 7.692385 6 11
8.429940 5.947623 6 12
7.397675 3.490151 6 13
4.921025 6.432927 6 14
6.977998 6.894285 6 15
6.645987 9.152462 6 16

As a next step, we calculate the prediction accuracy and plot the results. The code provided can plot the results for different moving window sizes.

OverallResults <-
  OverallResults[OverallResults$index > max_window, ] # make test set the same length for all moving window sizes

# Create empty dataframe to store results
size = length(max_window:min_window)
OverallResultsPerformance <-
  data.frame(
    i = unique(OverallResults$w),
    cor = rep(NA, size),
    cor.int_1 = rep(NA, size),
    cor.int_2 = rep(NA, size),
    rsq = rep(NA, size),
    RMSE = rep(NA, size)
  )

# calculate prediction accuracy for each moving window (w)
for (i in unique(OverallResults$w)) {
  results <-
    OverallResults[OverallResults$w == i, ] # loop through different moving windows
  
  cor = cor.test(results$pred, results$true)$estimate # calculate/save the correlation
  cor.int_1 = cor.test(results$pred, results$true)$conf.int[1] # save lower confidence interval
  cor.int_2 = cor.test(results$pred, results$true)$conf.int[2] # save upper confidence interval
  
  rss <-
    sum((results$pred - results$true) ^ 2)  # residual sum of squares
  tss <-
    sum((results$true - mean(results$true)) ^ 2)  # total sum of squares
  rsq <- 1 - rss / tss
  
  RMSE <- RMSE(results$pred, results$true)
  
  OverallResultsPerformance[OverallResultsPerformance$i == i, ] <-
    cbind(i, cor, cor.int_1, cor.int_2, rsq, RMSE)
}


ggplot() +
  geom_line(aes(y = OverallResultsPerformance$rsq, x = OverallResultsPerformance$i, color = "3h")) +
  # ... different colors could be added
  theme_minimal() +
  xlab("Moving Window Size") +
  ylab("R-squared") +
  scale_color_viridis(
    discrete = TRUE,
    limits = c("24h", "12h", "9h", "6h", "3h", "1h", "rolling mean")
  ) +
  labs(color = "Level of Aggregation")

In this example, the prediction accuracy is very low (the R-squared is even negative, which means that the predictions are worse than simply using the mean of the outcome). This is because we use a dataset to which random noise was added and we only use a few predictors. Using the real dataset with more predictors led to a reasonable R-squared.