Parallel processing to extract a data frame from a Motus database

Hi everyone,

We are dealing with a very large Motus database (the .motus file is larger than 3 GB), which takes several hours to convert to a data frame in R. We want to retain a high temporal resolution of detections.

Has anybody developed ideas to accelerate extraction from a database of this size, other than applying a severe filter on the ts variable?

We wonder if some of you have worked with parallel processing, and if so, whether you could give us advice or R code to speed up the extraction.

More precisely, we would like to know whether it is possible to parallelize this code from the Motus R Book, and if so, how to do it:

library(motus)     # tagme()
library(dplyr)
library(lubridate) # as_datetime()

sql.motus <- tagme(proj.num, update = TRUE, dir = "./data/")
tbl.alltags <- tbl(sql.motus, "alltags")

df.alltags <- tbl.alltags %>% 
  mutate(recvLat = if_else((is.na(gpsLat)|gpsLat == 0), 
                           recvDeployLat, gpsLat), 
         recvLon = if_else((is.na(gpsLon)|gpsLon == 0), 
                           recvDeployLon, gpsLon), 
         recvAlt = if_else(is.na(gpsAlt), recvDeployAlt, gpsAlt)) %>%
  select(-noise, -slop, -burstSlop, -done, -bootnum, -mfgID, 
         -codeSet, -mfg, -nomFreq, -markerNumber, -markerType, 
         -tagDeployComments, -fullID, -deviceID, -recvDeployLat, 
         -recvDeployLon, -recvDeployAlt, -speciesGroup, -gpsLat, 
         -gpsLon, -recvAlt, -recvSiteName) %>%
  collect() %>%
  as.data.frame() %>%
  mutate(ts = as_datetime(ts), # work with dates AFTER transforming to flat file
         tagDeployStart = as_datetime(tagDeployStart),
         tagDeployEnd = as_datetime(tagDeployEnd), 
         recvLat = plyr::round_any(recvLat, 0.05),
         recvLon = plyr::round_any(recvLon, 0.05),
         recvDeployName = if_else(is.na(recvDeployName), 
                                  paste(recvLat, recvLon, sep=":"), 
                                  recvDeployName))

Thank you

François Gagnon, biologist M.Sc

Environment and Climate Change Canada

1550, Av. D’Estimauville, Québec, QC

email: Francois.Gagnon14@canada.ca

Salut François

If you use dplyr's show_query() function to display the SQL it
generates for a sequence of operations, you may be able to run
that query directly against the database on disk rather than having
to read the whole thing into memory. And there might be ways to
optimize that query before applying it.
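For example, here is a minimal sketch of that idea using a small in-memory SQLite table as a stand-in for the real .motus file (the table contents are invented; only the alltags/gpsLat/recvDeployLat names are taken from the code above):

```r
library(dplyr)
library(dbplyr)

# Stand-in for the .motus SQLite file: a tiny in-memory table.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "alltags",
                  data.frame(gpsLat = c(NA, 0, 45.5),
                             recvDeployLat = c(48, 47, 46)))

# Build the pipeline lazily, then ask dbplyr for the SQL
# instead of collect()ing the result into memory.
tbl(con, "alltags") %>%
  mutate(recvLat = if_else(is.na(gpsLat) | gpsLat == 0,
                           recvDeployLat, gpsLat)) %>%
  show_query()
```

The printed SQL (a CASE WHEN expression here) can then be run directly on the database, e.g. with DBI::dbGetQuery(), so the mutate work happens in SQLite rather than in R.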

A problem with using dplyr is that (last I checked) it can't write
results back to the underlying SQL database, so all mutations happen
in memory, meaning the entire data set has to be read in.

Or maybe you can run copies of your code in different R sessions,
selecting different time periods for each R session. If the sessions
write their results to separate files, they should be able to operate
in parallel.

Perhaps, though, motus should be supplying data in a form where this
"standard" kind of processing has already been done?

John