Converting R Data to Flat File (Section 3.5.8 in Motus R Book)

Hello,

I am Katherine Dami, and I have recently started work on Project 172 (Sora Migration). We place our transmitters on Sora that stay near a Motus tower for weeks or months, accumulating many hits before they leave, so we have a lot of data.

The former assistant who worked with R on the project warned me that there might be issues processing all that data. I have been working step by step through the Motus R Book with our data, and the first problem I encountered was converting the data into a flat file in R with the following code:

df.alltags <- tbl.alltags %>%
  collect() %>%
  as.data.frame()

I let this run uninterrupted for days until the stop sign (in the upper right-hand corner of the console) finally disappeared, and it gave me this error message:

Error: cannot allocate vector of size 100.0 Mb

Is there a way this can be fixed? Is there a way to convert the data into a flat file in multiple smaller groups?

Thank you,
Katherine Dami

Hello Katherine,
This is an increasingly common problem when using the Motus R Book with real datasets. The data.frames are often large (e.g., I have one that is over 7,000,000 rows long). I spent quite a bit of time sorting this out recently, and I found two options for dealing with it.
First, some tools. On Windows, you can see what your memory limit is and how close you are to it with

memory.limit()
memory.size()

If the gap between those two numbers is smaller than the allocation the error message asked for, that is your problem.
The two solutions that I came up with are as follows:

  1. Increase your memory limit with something like memory.limit(size = 20000), which requests a new limit (in MB) from your computer. This method only helps on Windows, as other systems handle memory allocation more dynamically.
  2. Delete each big data.frame once you have cleaned it up or subset it. Most of the examples never use the full flat files again; they work with subsets.

Method 1 is somewhat risky: granting R too much memory can make the rest of your system unstable. It also does not solve the problem if you keep many giant data.frames in working memory; you will still run out, especially if your computer's memory is small. Method 2 has worked better for me. Since interacting with the Motus database or the local SQLite copies of big projects can be slow, you can export the flat file with saveRDS() before deleting it and read it back in when you need it. This is still slow for big files, but much faster than running tagme() again.
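Method 2 can be sketched in a few lines. Here df.alltags is a tiny stand-in for the real (much larger) flat file, and the file name is just an example:

```r
# Sketch of Method 2: export, delete, reload.
# df.alltags is a small stand-in for the real flat file.
df.alltags <- data.frame(motusTagID = c(101, 101, 102),
                         sig = c(-60, -55, -58))

path <- file.path(tempdir(), "df_alltags.rds")
saveRDS(df.alltags, path)    # one-time export; slow for big files,
                             # but faster than running tagme() again
rm(df.alltags)               # drop the big data.frame from memory
invisible(gc())              # encourage R to release the memory

df.alltags <- readRDS(path)  # bring it back later when needed
```
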
Hope this helps.
-Pat

Hi everyone,

Just to add a couple of comments: the function gc() can free up some working memory.

You can also apply additional filters (e.g., certain mfgIDs or receiver locations) when you extract the data for the first time.
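As a sketch, the filtering goes inside the pipeline before collect(). With a real motus file, tbl.alltags would be a dbplyr table and the filter would run inside SQLite, so the unwanted rows never reach R; the field values below are made up:

```r
library(dplyr)

# Stand-in for tbl.alltags; with a real .motus file this would come from
# the database and filter() would be translated to SQL before collect().
tbl.alltags <- tibble::tibble(
  mfgID          = c("123", "456", "123"),
  recvDeployName = c("NorthTower", "NorthTower", "Elsewhere"),
  sig            = c(-60, -55, -58))

df.sub <- tbl.alltags %>%
  filter(mfgID == "123", recvDeployName == "NorthTower") %>%  # before collect()
  collect() %>%
  as.data.frame()
```
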

It’s also useful to round your timestamps to the nearest 5 minutes and then reduce your dataset so you only work with one detection per 5-minute period. This is particularly useful when exploring coarse-scale patterns (like movements between towers).
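One way to do that reduction, sketched with made-up detections (ts is assumed to be in seconds, as in the motus data):

```r
library(dplyr)

hits <- tibble::tibble(
  motusTagID = c(101, 101, 101, 102),
  ts         = c(12, 140, 420, 35),     # seconds
  sig        = c(-60, -58, -55, -70))

df.5min <- hits %>%
  mutate(ts.5min = floor(ts / 300) * 300) %>%  # round down to 5-min bin
  group_by(motusTagID, ts.5min) %>%
  slice(1) %>%                                 # keep one detection per bin
  ungroup()
```
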

For most purposes, these strategies will make an analysis efficient enough; during exploratory analysis you don’t want to be waiting around for much more than a few minutes.

Yolanda

Excellent suggestions from Yolanda. A couple of additional ones that should help:

You can limit the list of fields to reduce the memory footprint. As with filtering by tag ID or date range, you want to do this before converting your data to a data frame.
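A sketch of trimming fields before collecting; the field names here are a plausible subset for illustration, not a definitive list:

```r
library(dplyr)

# Stand-in for tbl.alltags; with a real motus file, select() runs in SQLite
# so the dropped columns are never transferred into R.
tbl.alltags <- tibble::tibble(
  hitID = 1:3, ts = c(10, 20, 30), sig = c(-60, -55, -58),
  motusTagID = c(101, 101, 102), speciesSci = c("a", "a", "b"),
  recvSiteName = c("x", "x", "y"))

df.small <- tbl.alltags %>%
  select(hitID, ts, sig, motusTagID) %>%  # keep only the fields you need
  collect() %>%
  as.data.frame()
```
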

Also make sure to use the newest version of the package. Early versions of the alltags view pulled data from the gps table, which was excruciatingly slow for large datasets. That view has been renamed, and the default alltags no longer includes those fields. Even without gps, it can still be problematic.

The optimal approach is often to write a custom query against the tables rather than use alltags; that can be done with dplyr or similar packages. The biggest gain is usually avoiding the hits table. The runs table is likely enough in most cases, since it provides the start and end time of each run plus the number of hits. I will look into creating a new view for the package that does this.

Alltags was created for convenience, but can break down with large datasets. We should try to provide better examples in the book about how to work around that problem.

Some people may already be able to share some code to help with this.

As a start, I pasted below a simplified version of the view that removes the hits table. Until we add this to the R package, you can create the view manually in your motus database(s) with a tool such as DB Browser for SQLite: https://sqlitebrowser.org/

In the case of the sample project (176), this brings the number of rows down from about 108,000 to about 4,800. Other approaches that summarize the data by time interval (e.g., 5 or 10 minutes), as suggested by Yolanda, would also work if you need more detail on fine-scale activity.

The tsBegin and tsEnd fields are new and provide the outer bounds of a run for a given tag on a specific antenna. You can also remove any fields you don't plan to use from the view below, which will help with memory, or even entire tables if they are not needed, though you'll need to dig a bit more into the SQL syntax to do that.

CREATE VIEW allruns AS
SELECT
    t2.runID as runID,
    t3.batchID as batchID,
    t2.done as done,
    CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end as motusTagID,
    t12.ambigID as ambigID,
    t2.ant as port,
    t2.len as runLen,
    t2.tsBegin as tsBegin,
    t2.tsEnd as tsEnd,
    t3.monoBN as bootnum,
    t4.projectID as tagProjID,
    t4.mfgID as mfgID,
    t4.type as tagType,
    t4.codeSet as codeSet,
    t4.manufacturer as mfg,
    t4.model as tagModel,
    t4.lifeSpan as tagLifespan,
    t4.nomFreq as nomFreq,
    t4.bi as tagBI,
    t4.pulseLen as pulseLen,
    t5.deployID as tagDeployID,
    t5.speciesID as speciesID,
    t5.markerNumber as markerNumber,
    t5.markerType as markerType,
    t5.tsStart as tagDeployStart,
    t5.tsEnd as tagDeployEnd,
    t5.latitude as tagDeployLat,
    t5.longitude as tagDeployLon,
    t5.elevation as tagDeployAlt,
    t6a.deviceID as deviceID,
    t6.deployID as recvDeployID,
    t6.latitude as recvDeployLat,
    t6.longitude as recvDeployLon,
    t6.elevation as recvDeployAlt,
    t6a.serno as recv,
    t6.name as recvDeployName,
    t6.siteName as recvSiteName,
    t6.isMobile as isRecvMobile,
    t6.projectID as recvProjID,
    t7.antennaType as antType,
    t7.bearing as antBearing,
    t7.heightMeters as antHeight,
    t8.english as speciesEN,
    t8.french as speciesFR,
    t8.scientific as speciesSci,
    t8.`group` as speciesGroup,
    t9.label as tagProjName,
    t10.label as recvProjName,
    t11.lat as gpsLat,
    t11.lon as gpsLon,
    t11.alt as gpsAlt
FROM runs AS t2
LEFT JOIN allambigs t12 ON t2.motusTagID = t12.ambigID
LEFT JOIN batchRuns AS t3a ON t2.runID = t3a.runID
LEFT JOIN batches AS t3 ON t3.batchID = t3a.batchID
LEFT JOIN tags AS t4 ON t4.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
LEFT JOIN tagDeps AS t5 ON t5.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
    AND t5.tsStart = (SELECT max(t5b.tsStart) FROM tagDeps AS t5b
        WHERE t5b.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
          AND t5b.tsStart <= t2.tsBegin
          AND (t5b.tsEnd IS NULL OR t5b.tsEnd >= t2.tsBegin))
LEFT JOIN recvs AS t6a ON t6a.deviceID = t3.motusDeviceID
LEFT JOIN recvDeps AS t6 ON t6.deviceID = t3.motusDeviceID
    AND t6.tsStart = (SELECT max(t6b.tsStart) FROM recvDeps AS t6b
        WHERE t6b.deviceID = t3.motusDeviceID
          AND t6b.tsStart <= t2.tsBegin
          AND (t6b.tsEnd IS NULL OR t6b.tsEnd >= t2.tsBegin))
LEFT JOIN antDeps AS t7 ON t7.deployID = t6.deployID AND t7.port = t2.ant
LEFT JOIN species AS t8 ON t8.id = t5.speciesID
LEFT JOIN projs AS t9 ON t9.ID = t5.projectID
LEFT JOIN projs AS t10 ON t10.ID = t6.projectID
LEFT JOIN gps AS t11 ON t11.batchID = t3.batchID
    AND t11.ts = (SELECT max(t11b.ts) FROM gps AS t11b
        WHERE t11b.batchID = t3.batchID AND t11b.ts <= t2.tsBegin)
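If you would rather stay in R than use DB Browser, a view can also be created with DBI and RSQLite. Here is a sketch against an in-memory database with a toy runs table just to show the mechanics; for the real thing, point dbConnect() at your .motus file and paste in the full CREATE VIEW statement above:

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # use your .motus path instead

# Toy stand-in for the runs table, for illustration only.
dbExecute(con, "CREATE TABLE runs (runID INTEGER, len INTEGER)")
dbExecute(con, "INSERT INTO runs VALUES (1, 2), (2, 5)")

# With a real file, the full CREATE VIEW allruns statement goes here.
dbExecute(con, "CREATE VIEW longruns AS SELECT * FROM runs WHERE len > 3")

res <- dbGetQuery(con, "SELECT runID FROM longruns")
dbDisconnect(con)
```
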

Denis Lepage http://avibase.ca http://ebird.ca



gc() does not work reliably as a fix, since on many computers R already runs garbage collection regularly, though opaquely. Yolanda’s second suggestion is a good one once you know what you want to focus on, but it does not help on the first pass. The third suggestion is what I meant by filtering: many of the R book examples help you pare down the data size, after which you can rm(df.gigantic).

Sorry, I just realized that I actually started from an older version of the query, which still relied on the gps table. For the sample project, removing GPS doesn’t make any noticeable difference, but it may with larger datasets, so here is the revised version without GPS.

CREATE VIEW allruns AS
SELECT
    t2.runID as runID,
    t3.batchID as batchID,
    t2.done as done,
    CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end as motusTagID,
    t12.ambigID as ambigID,
    t2.ant as port,
    t2.len as runLen,
    t2.tsBegin as tsBegin,
    t2.tsEnd as tsEnd,
    t3.monoBN as bootnum,
    t4.projectID as tagProjID,
    t4.mfgID as mfgID,
    t4.type as tagType,
    t4.codeSet as codeSet,
    t4.manufacturer as mfg,
    t4.model as tagModel,
    t4.lifeSpan as tagLifespan,
    t4.nomFreq as nomFreq,
    t4.bi as tagBI,
    t4.pulseLen as pulseLen,
    t5.deployID as tagDeployID,
    t5.speciesID as speciesID,
    t5.markerNumber as markerNumber,
    t5.markerType as markerType,
    t5.tsStart as tagDeployStart,
    t5.tsEnd as tagDeployEnd,
    t5.latitude as tagDeployLat,
    t5.longitude as tagDeployLon,
    t5.elevation as tagDeployAlt,
    t6a.deviceID as deviceID,
    t6.deployID as recvDeployID,
    t6.latitude as recvDeployLat,
    t6.longitude as recvDeployLon,
    t6.elevation as recvDeployAlt,
    t6a.serno as recv,
    t6.name as recvDeployName,
    t6.siteName as recvSiteName,
    t6.isMobile as isRecvMobile,
    t6.projectID as recvProjID,
    t7.antennaType as antType,
    t7.bearing as antBearing,
    t7.heightMeters as antHeight,
    t8.english as speciesEN,
    t8.french as speciesFR,
    t8.scientific as speciesSci,
    t8.`group` as speciesGroup,
    t9.label as tagProjName,
    t10.label as recvProjName
FROM runs AS t2
LEFT JOIN allambigs t12 ON t2.motusTagID = t12.ambigID
LEFT JOIN batchRuns AS t3a ON t2.runID = t3a.runID
LEFT JOIN batches AS t3 ON t3.batchID = t3a.batchID
LEFT JOIN tags AS t4 ON t4.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
LEFT JOIN tagDeps AS t5 ON t5.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
    AND t5.tsStart = (SELECT max(t5b.tsStart) FROM tagDeps AS t5b
        WHERE t5b.tagID = CASE WHEN t12.motusTagID is null then t2.motusTagID else t12.motusTagID end
          AND t5b.tsStart <= t2.tsBegin
          AND (t5b.tsEnd IS NULL OR t5b.tsEnd >= t2.tsBegin))
LEFT JOIN recvs AS t6a ON t6a.deviceID = t3.motusDeviceID
LEFT JOIN recvDeps AS t6 ON t6.deviceID = t3.motusDeviceID
    AND t6.tsStart = (SELECT max(t6b.tsStart) FROM recvDeps AS t6b
        WHERE t6b.deviceID = t3.motusDeviceID
          AND t6b.tsStart <= t2.tsBegin
          AND (t6b.tsEnd IS NULL OR t6b.tsEnd >= t2.tsBegin))
LEFT JOIN antDeps AS t7 ON t7.deployID = t6.deployID AND t7.port = t2.ant
LEFT JOIN species AS t8 ON t8.id = t5.speciesID
LEFT JOIN projs AS t9 ON t9.ID = t5.projectID
LEFT JOIN projs AS t10 ON t10.ID = t6.projectID


Hello Denis,
Your suggestion of adding a smaller view would be most helpful; I think custom SQL is beyond most users who don’t know the DB structure, and in my recent experience removing fields did not save much memory. However, for determining tag paths you still need the hits, don’t you? At least, it seems that is how the example in the R book works (it uses df.alltags after filtering to remove short runs).

Hi again,

Attached is a script (create allruns.R) that you can run on any existing motus file to add the new view called allruns, as described earlier. We will work it into the package soon. If anyone has suggestions for it, please share them with me or the group.

The runs table provides a pretty good summary in place of the hits, especially at larger spatial or temporal scales (e.g., movement among receivers). Most runs are relatively short in duration, so not much detail would be lost. It is generally safe to assume that an animal is present at all times between the start and end of a run (tsBegin and tsEnd): a relatively short period out of range of the receiver is enough to break a run (something like 10 minutes, as an order of magnitude).

The runs table also gives you nearly as much detail on departure and arrival times, for instance. You mainly lose details of activity between the start and end of a run, but runs are antenna-specific, so some useful detail remains there too.

You can still filter out the very short runs the same way (e.g., runLen > 3), since runLen is included in the runs table, and you can also use the motus filters. We’d have to look more closely to provide some examples of that.

Of course, if your goal is to identify time intervals when a tag is present on a finer scale, you may need alternative ways to get to the same point. With hits, you could do an aggregate query dividing ts by 300 (5 * 60 seconds), for instance, so that you get a single record for each 5-minute period in which a tag was detected. To get the same result from runs, you could pregenerate your intervals in a table and use a query to identify whether each period overlaps a given run, without relying on the hits records; this should be much faster for large datasets. There is no guarantee that you would have an actual hit within each specific period, but you may decide that the assumption that the bird was probably nearby is acceptable, unless you really need, say, minute-by-minute precision or finer.
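The hits-side version of that aggregate can be sketched with dplyr on synthetic data (ts in seconds; integer division by 300 gives the 5-minute period):

```r
library(dplyr)

hits <- tibble::tibble(
  motusTagID = c(7, 7, 7, 7),
  ts         = c(10, 120, 310, 900))

periods <- hits %>%
  mutate(period = ts %/% 300) %>%          # 5-minute bin (5 * 60 seconds)
  group_by(motusTagID, period) %>%
  summarize(nHits = n(), .groups = "drop") # one record per tag x period
```
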

Attached are two short scripts (merge example and merge example daily) that do exactly what I describe above (they assume you have already created the new allruns view). The first generates, for a single tag with runLen > 3, the number of runs overlapping each of the 5-minute periods I defined (note that one run may overlap multiple periods). The daily example uses a period of a full day instead (for all tags) and has a couple of added parts. Instead of the number of runs, you should also be able to get min(tsBegin) and max(tsEnd) for things like arrival and departure times. The daily example also briefly shows how to generate similar summaries from the alltags view instead. For a small dataset, both approaches are very fast, so it doesn’t really matter which you use.

Runs also don’t give you relative signal strengths on different antennas, so if you are trying to measure directional movements (where birds are coming from or leaving to), you really do need the hits table. But, as suggested, I would then recommend building routines that only pull data for a limited number of birds and/or a specific time period.

It’s hard to generalize much beyond that, as it will really depend on the application. If people want to share a few examples of code that rely on the hits table, we could explore ways to produce the same results more efficiently. We can try to add some of those examples to the R book as well.

Cheers

Denis

P.S. I’d certainly encourage folks to dive into the data.table package if you are data-inclined. I don’t have a lot of experience with it, but it appears well suited to large data manipulation: https://www.rdocumentation.org/packages/data.table/versions/1.13.6/

merge example.R (1.08 KB)

create allruns.R (2.76 KB)

merge example daily.r (1.51 KB)

Thanks, Denis, for this thoughtful reply. It should help a ton, and it helps me understand the relative value of working with runs versus hits. I have mostly been working (as I think many users are) to understand the basic functions in the motus R package, and I am still really just applying the R book examples to our data. In my case that means identifying where the birds we detect were tagged and defining the paths of birds we have tagged (mostly across large spatial scales). We would like to do some work at the scale of a small number of overlapping towers, but for that we would know the motus tag IDs.