Extracting from SQL using data.table in R

Hello,

I’m looking for some help developing a script in R using the data.table package to access our Motus data. We’ve been using a script based on the R book, but our project has grown too large to continue using that script as is. Any insights or coding help are appreciated.

Mary Scofield

Avian Biologist, MPG Ranch

Mary - I have a lot of experience with both R and SQL databases. I have not done much specifically using R and SQL together on any significant project beyond playing with some of the motus.org examples, but I am happy to try and see if I can help you.

In case you haven’t seen it, I also highly recommend the YouTube video made by Amie MacDonald of Birds Canada for a workshop in 2022, "How to clean and prepare Motus data for analysis": https://www.youtube.com/watch?v=2DsoSMkvpcs

You can contact me at schramm.r@gmail.com and we can ‘discuss’ how I might be able to help.

Rich Schramm

Hi Rich,

Thank you for your response. I have seen the YouTube video you linked; it was my introduction to Motus data retrieval and cleaning. However, the code I adapted from it is just too inefficient for our project, which is now rather large, and pulling from the SQL database is consuming too many hours, even when done in “small” chunks. I am an inexperienced coder, and I have been trying to learn my other options for working with large data sets. The data.table package was recommended to me because it is an R package, but since I posted we are now also looking at other options such as Python or something similar. Any suggestions would be helpful. -Mary

Mary, Python can be a good option for processing data from SQL databases, but it can have a steep learning curve. (I used it extensively for 20 yrs, though I’m a bit rusty as I retired from that job ~5 yrs ago.) You already have an investment in learning R, so before we jump to Python I would like to make sure the problem is actually with R. I’d like to first talk with you about your workflow and processing scripts to see where the performance bottlenecks are. Some steps we might then follow up with:

- Try ‘profiling’ chunks of your R code to actually measure where the time is being spent (if you haven’t already tried that).

- Look at the underlying SQL queries your R code uses to hit your database - e.g. perhaps an index needs to be placed on some table join to improve the SQL query performance.

- For time-expensive processing steps, it sometimes makes sense to do nightly pre-processing of the data and cache it, so your daytime workflow is more performant.

-Rich
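On the profiling step, here is a minimal sketch in base R, just to show the mechanics. The object and script names are placeholders, not your actual code:

```r
# Coarse timing of a single suspect step
system.time({
  result <- collect(my_slow_query)   # placeholder for the slow step
})

# Profiling a whole chunk with base R's Rprof()
Rprof("profile.out")
source("my_slow_script.R")           # placeholder script name
Rprof(NULL)
head(summaryRprof("profile.out")$by.self)  # functions ranked by self-time
```

Once we know which functions dominate the self-time column, we can focus the optimization effort there instead of guessing.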

Hi Mary, a few general suggestions that may help are to

a) avoid using the alltags table as a first pass and opt instead for the much quicker allruns virtual table.

b) do as much filtering as possible prior to collecting into a data frame, though that doesn’t always speed things up as much as you might think (in my experience anyway)

c) custom queries on the SQLite database (the .motus file) can help as well, since many of the fields in alltags and allruns are not necessary all the time and only add to the processing and RAM demands put on your computer.
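To illustrate (c), here is a hedged sketch of querying the .motus file directly with DBI. The file name and the column/table names in the query are assumptions for illustration — check your own schema with dbListTables()/dbListFields() before writing queries:

```r
library(DBI)
library(RSQLite)

# A .motus file is an ordinary SQLite database (path is a placeholder)
con <- dbConnect(SQLite(), "project-213.motus")

# Inspect what is actually in the file before writing queries
dbListTables(con)
dbListFields(con, "runs")

# Pull only the columns a given analysis needs; a narrow query avoids
# materialising every alltags field (column names here are assumptions)
runs <- dbGetQuery(con, "
  SELECT runID, motusTagID, tsBegin, tsEnd, len
  FROM runs
  WHERE len >= 3
")

dbDisconnect(con)
```

The point is that a narrow, hand-written SELECT only transfers the rows and columns you ask for, rather than everything the alltags view assembles.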

Rich has made some good suggestions, one of which is to take advantage of the time you’ve already invested in R, not to mention that the existing framework for working with Motus data is primarily R-based. If you are able to post some of the code, maybe we can help provide some guidance. It may also help us flesh out the existing resources to cover more “large dataset” challenges, as these are only going to become more common as project datasets grow!

Josh and Rich,

I am more than willing to share my code. Your suggestions are great, especially the allruns table. We are currently working on some analyses that need the individual detection hits and signal strength values found in the alltags table, which I can say fairly confidently is the bottleneck (along with, probably, the timestamp conversion). I am not a coder; I am self-taught via this Motus experience, so my knowledge of developing code is limited! I truly appreciate the willingness to answer questions and help.

What is the best way to share my code with you? I have the scripts on GitHub.

Well you’ve got lots of company here in that regard!

Even though you have it on GitHub, it may work to share the primary portion here. At least that way a few more eyeballs may notice it and possibly chime in.

I will just paste a few chunks; the full code is available by email or on GitHub if needed. Here is where I assume the bottleneck is:

  # Pull project 213 detections from the alltags view; filter and select
  # columns on the SQLite side so only the needed rows/fields are collected
  tags.detect <- as.data.table(
    tbl(sqldb, "alltags") %>%
      filter(tagProjID == 213, motusTagID %in% c(tags.meta$tagID, tags.ambigs$tagID),
             speciesID %in% sp.list) %>%
      select(runID, ts, sig, port, noise, freqsd, motusTagID,
             ambigID, runLen, tagDeployID, tagDeployStart,
             tagDeployEnd, tagDepLat, tagDepLon, deviceID, recvDeployID, recv,
             speciesID, mfgID, motusFilter, antBearing, antType,
             antHeight) %>%
      collect()
  )
  # Convert numeric epoch timestamps to POSIXct (lubridate::as_datetime)
  tags.detect <- tags.detect[
    , `:=`(ts = as_datetime(ts),
           tagDeployStart = as_datetime(tagDeployStart),
           tagDeployEnd = as_datetime(tagDeployEnd))
  ]

Or it could be here:

  # Build a lookup of receiver deployments with their metadata
  recvs <- tags.detect[
    !is.na(recvDeployID),
    .(recv, recvDeployID)
  ][unique(recv.meta), on = "recvDeployID"]

  # Join receiver coordinates onto detections, then keep only detections
  # within the tag deployment window that have a known receiver location
  tags.detect <- tags.detect[
    recvs,
    on = .(recv, recvDeployID),
    `:=`(recvLat = i.latitude, recvLon = i.longitude, recvType = i.recvType)
  ][ts >= tagDeployStart & ts <= tagDeployEnd & !is.na(recvLat)]

Project 213 is a behemoth and right off the bat will create a challenge! I’m starting a download of that project now…
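In the meantime, one low-cost thing worth checking on the alltags query is whether the columns in the filter() call are indexed. A hedged sketch below — the path is a placeholder, and since alltags is a view, any index has to go on a base table it draws from (the table/column names here are assumptions to verify against your own schema):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "project-213.motus")  # placeholder path

# Ask SQLite how it plans to execute a query similar to the slow one
dbGetQuery(con, "EXPLAIN QUERY PLAN
                 SELECT * FROM alltags WHERE motusTagID = 12345")

# If the plan shows a full table scan, an index on the base table may help
# (assuming a 'runs' table with a motusTagID column)
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_runs_motusTagID
                ON runs (motusTagID)")

dbDisconnect(con)
```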