Motus table joins

DenisLepage · April 13, 2020, 7:54pm

Hi everyone,

We’ve known for a while that the alltags view in motus sqlite databases can be exceedingly slow. One of the big reason behind this is the join between the hits and the gps tables. Since hits and gps fixes are not happening at the same time, the join as currently designed needs to summarize lots of data in order to figure out which gps row is the close to each hit. In large databases with millions of rows, this could potentially mean billions of operations, which will grow exponentially with the database.

We’re currently working on integrating in the package a version of the view that leaves out the GPS fixes from the table, and this has shown dramatic improvements, as was posted a little while ago on this forum. Sorry for being a bit slow on releasing that, but this should be coming soon.

In the meantime, it may be worth pointing out that much more tailored joins between underlying tables using dplyr statements, instead of using the alltags view, can potentially improve dramatically your results.

As an example, the code below builds a summary of the GPS data to figure out the first GPS fix for every hour bin, and uses that table to join with the hits. This allows the join to operate on a simple merge between columns shared among the 2 tables (hourBin, which we calculate in both hits and gps). It is worth noting that sqlite offers only limited mathematical aggregate functions, and if you needed to use any other R functions (e.g. using median() instead of mean to find the mid-point GPS hits), you would need to convert your tibble to a dataframe.

You can likely further improve efficiency by filtering the hits table (e.g. for specific tags). This is particularly relevant if you need to convert your table to a dataframe with the collect() statement. Other similar approaches could be used based on your needs.

We will try to add some examples of how some of this can be achieved in the Motus R book, and including between documentation on how the underlying tables are expected to be joined to each other, so you can create customized versions of the same types output that alltags provides.

One downside of this approach is that some hourBin may not have gps data, and records for hits and gps may fall in different hourBins by chance, even though they may be moments apart. To reduce this chance, one could potentially use a larger time window (e.g. 2 hours), which should have little impact on stationary receivers. With mobile receivers, the time window could be reduced, but it should not be smaller than the interval between expected GPS fixes (e.g. 5 minutes), or some hits won’t be associated with GPS detections. If you are dealing with stationary receivers, I would suggest relying on the receiver deployment table first anyway, but it is still recommended to validate the data with the GPS readings, as receivers may have been moved without being reflected in the metadata.

I hope this helps!

Denis Lepage

Birds Canada

library(motus)

library(dplyr)

src <-tagme(76, update=F) #load data from sample project

wblake · April 13, 2020, 8:48pm

Dear Denis,
In the code sample you shared with us, does “tbl.hits.gps” replace the “tbl.alltags” from the Motus R Book?
Specifically, to create a data frame, would we then simply go with:

df.alltags <- tbl.hits.gps %>% 
  collect() %>% 
  as.data.frame()

I’m able to create a data frame with the above code, but can no longer specify the arguments to the data frame as recommended in the R Motus Book “mutate(ts = as_datetime(ts, tz = “UTC”, origin = “1970-01-01”))”
Thanks for any additional pointers.

William

DenisLepage · April 14, 2020, 5:02pm

Hi William and everyone,

The example I provided was not meant to be complete, but rather to provide a solution to those interested in digging a bit deeper. There are several aspects handled by the alltags view that this does not cover, including renaming fields and combining various tables.

We’ll try to provide more useful details in the R book, but that may take a little while longer.

Cheers Denis

wblake · April 14, 2020, 5:47pm

Hi Denis,

Thanks for your quick response. I tried to delete my question in the google group but I couldn’t delete it until it showed up as “posted” and you beat me in your response! haha. Yes, I dug into the first R book chapters again and realized that as you just mentioned, there were “several aspects handled by the alltags view” such as “renaming fields and combining various tables” not covered in the R book or any shared code.

Our Intermountain West Collaborative Motus project unfortunately suffers from a large dataset (7 Million lines of detections) in only one full season of tagging in western Montana. Josh Sayers at Motus shared with me again the link for the fast alltags view: Redirecting to Google Groups, which helps a bit but still takes forever to process one of our main Motus stations.

Any further help will be amazing. I’m looking forward to more useful details in the R book!
Thanks Denis,

William