Hi everyone,
We’ve known for a while that the alltags view in motus sqlite databases can be exceedingly slow. One of the big reason behind this is the join between the hits and the gps tables. Since hits and gps fixes are not happening at the same time, the join as currently designed needs to summarize lots of data in order to figure out which gps row is the close to each hit. In large databases with millions of rows, this could potentially mean billions of operations, which will grow exponentially with the database.
We’re currently working on integrating in the package a version of the view that leaves out the GPS fixes from the table, and this has shown dramatic improvements, as was posted a little while ago on this forum. Sorry for being a bit slow on releasing that, but this should be coming soon.
In the meantime, it may be worth pointing out that much more tailored joins between underlying tables using dplyr statements, instead of using the alltags view, can potentially improve dramatically your results.
As an example, the code below builds a summary of the GPS data to figure out the first GPS fix for every hour bin, and uses that table to join with the hits. This allows the join to operate on a simple merge between columns shared among the 2 tables (hourBin, which we calculate in both hits and gps). It is worth noting that sqlite offers only limited mathematical aggregate functions, and if you needed to use any other R functions (e.g. using median() instead of mean to find the mid-point GPS hits), you would need to convert your tibble to a dataframe.
You can likely further improve efficiency by filtering the hits table (e.g. for specific tags). This is particularly relevant if you need to convert your table to a dataframe with the collect() statement. Other similar approaches could be used based on your needs.
We will try to add some examples of how some of this can be achieved in the Motus R book, and including between documentation on how the underlying tables are expected to be joined to each other, so you can create customized versions of the same types output that alltags provides.
One downside of this approach is that some hourBin may not have gps data, and records for hits and gps may fall in different hourBins by chance, even though they may be moments apart. To reduce this chance, one could potentially use a larger time window (e.g. 2 hours), which should have little impact on stationary receivers. With mobile receivers, the time window could be reduced, but it should not be smaller than the interval between expected GPS fixes (e.g. 5 minutes), or some hits won’t be associated with GPS detections. If you are dealing with stationary receivers, I would suggest relying on the receiver deployment table first anyway, but it is still recommended to validate the data with the GPS readings, as receivers may have been moved without being reflected in the metadata.
I hope this helps!
Denis Lepage
Birds Canada
library(motus)
library(dplyr)
src <-tagme(76, update=F) #load data from sample project