Friday, October 5, 2018

Why add correlations for suitability scores?

Hey y'all!  After a conversation with some colleagues, I realized that I sort of added Spearman rank correlation as a measure of overlap to ENMTools without really explaining why.  Here is a quick and dirty sketch of my thinking on that.

Previous measures of overlap between suitability rasters implemented in ENMTools were based on measures of similarity between probability distributions.  That's fine as far as it goes, but my feeling is that this sort of metric is more useful as a measure of similarity between species' potential spatial distributions than as a measure of similarity between the underlying models.  Here's a quick example:

# Perfect positive correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(0.1, 1.0, 0.001)


You can only see one line there because the two species are the same.  I wrote a function called olaps to make these plots and to calculate three metrics of similarity that are implemented in ENMTools.
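In case you want to follow along, here's a minimal sketch of an olaps-style function (the real code is in the gist linked at the bottom of this post).  It assumes that D and I are calculated on suitability scores standardized to sum to one, and that cor is a Spearman rank correlation:

olaps = function(x, y) {

  # Plot the two response functions on top of each other
  plot(x, type = "l", ylim = range(c(x, y)), xlab = "env", ylab = "suitability")
  lines(y, lty = 2)

  # Standardize each set of suitability scores to sum to 1
  x.std = x / sum(x)
  y.std = y / sum(y)

  list(D = 1 - 0.5 * sum(abs(x.std - y.std)),              # Schoener's D
       I = 1 - 0.5 * sum((sqrt(x.std) - sqrt(y.std))^2),   # I (Hellinger-based)
       cor = cor(x, y, method = "spearman"))               # Spearman rank correlation
}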

olaps(sp1, sp2)

$D
[1] 1

$I
[1] 1

$cor
[1] 1

Okay, that's all well and good - perfectly positively correlated responses get a 1 from all metrics.  Now what if they're perfectly negatively correlated?

# Perfect negative correlation
sp1 = seq(0.1, 1.0, 0.001)
sp2 = seq(1.0, 0.1, 0.001)
olaps(sp1, sp2)

$D
[1] 0.590455

$I
[1] 0.8727405

$cor
[1] -1

What's going on here?  Spearman rank correlation tells us that they are indeed negatively correlated, but D and I both come out fairly high!  The reason is that the values of the two functions are quite similar across a broad range of environments, even though the functions themselves are literally as different as they could possibly be.  Thinking about what this means in terms of species occurrence is quite informative: if the suitability threshold for a species to occur is low (e.g., .0005 in this cartoon example), the two species might co-occur across a broad swath of the gradient; both species would find env values from roughly 250 to 750 suitable and might therefore overlap across about 2/3 of their respective ranges.  That's despite them having completely opposite responses to that environmental gradient, strange though that may seem.
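To put a rough number on that using the perfectly negatively correlated sp1 and sp2 from above (treating the .0005 threshold as applying to the standardized suitability scores, which is my assumption for this cartoon example):

# Standardize the suitability scores to sum to 1, as for D and I
sp1.std = sp1 / sum(sp1)
sp2.std = sp2 / sum(sp2)

# Proportion of the environmental gradient where both species exceed the threshold
mean(sp1.std > 0.0005 & sp2.std > 0.0005)

That should come out at roughly two thirds of the gradient, even though the correlation between the two responses is exactly -1.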

So do you want to measure the potential for your species to occupy the same environments, or do you want to measure the similarity in their estimated responses to those environments?  That's entirely down to what question you're trying to answer!

Okay, one more reason I kinda like correlations:

# Random
sp1 = abs(rnorm(1000))
sp2 = abs(rnorm(1000))
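# Note there's no set.seed here, so your numbers will differ a bit from the
# output below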


Here I've created a situation where we've got complete chaos; the relationship of both species to the environment is completely random.  Now let's measure overlaps:

olaps(sp1, sp2)

$D
[1] 0.5641885

$I
[1] 0.829914

$cor
[1] -0.04745993

Again we've got fairly high overlaps between species using D and I, but the Spearman rank correlation is really close to zero.  That's exactly what we'd expect when there's no signal at all.  Of course, because species distributions and the environment are both spatially autocorrelated, real data will likely give us correlations somewhat larger than zero in absolute value even when there's no causal relationship between the environment and species distributions, but it's nice to know that we at least have a clear expected value when chaos reigns.



Code for this is here:

https://gist.github.com/danlwarren/5509c6a45d4e534c3e0c0ecf1702bbdd

Tuesday, July 31, 2018

Predict functions for all model types

I've finally gotten around to adding predict() functions for all of the ENMTools model types.  You can now project your model onto a new time period or geographic extent, and it gives you back two things: a raster of the predicted suitability of habitats, and a threespace plot (see description here).  I've also added some more stuff under the hood that should catch some data formatting issues that were causing errors.
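Usage looks something like the sketch below.  The object names are hypothetical, and I'm assuming the prediction raster comes back in a component called something like suitability - check names() on the returned object if in doubt:

# my.model is an existing enmtools.model object, future.env is a stack of
# rasters for the new time period or geographic extent
future.pred = predict(my.model, future.env)

plot(future.pred$suitability)   # projected suitability raster
                                # (the threespace plot is stored in there too)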

Wednesday, July 25, 2018

Minor fixes, new features

Hey all!  I've been bashing away at ENMTools for the past couple of days, just doing a bunch of bug fixes and adding some new features.  If you want to see everything you can go here, but I'll outline the highlights.

1. Added a "bg.source" argument to all modeling functions that lets you specify where background data should come from: the background points stored in your enmtools.species object, the range raster stored in that object, or the geographic area outlined by your environment rasters.  If you don't specify a bg.source argument, it will prioritize them in the order it did previously: background points > range raster > environment layers.  There's a quick usage sketch below this list.

2. Changed the raster.pca function to return the actual pca object along with the raster layers. 

3. Fixed a persistent problem with ggplot2 plots from hypothesis tests excluding some values.  The fix isn't perfect yet (and I'm not entirely sure why), but in my experiments with it yesterday the issue was MUCH reduced.

4.  Added a plot to the output of enmtools.aoc that shows the averaged overlap values on each node in the phylogeny.  The old plots for the hypothesis tests are still there, but if you display the enmtools.aoc object those are now the second plot.  The first one is the tree, and it looks like this:



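Here's the usage sketch for the bg.source argument mentioned in point 1.  The whole call is hypothetical - the species object, layer names, and formula are placeholders, and the "range" value is my shorthand for the range-raster option, so check the documentation for the exact strings the argument accepts:

# Draw background from the species' range raster rather than from its
# background points or from the environment layers
my.glm = enmtools.glm(species = my.species, env = env,
                      f = pres ~ bio1 + bio12,
                      bg.source = "range")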
I've also added some plots I've been meaning to put in for a while.  These are called "three space plots" because they visualize three environment spaces at once: the one occupied by a model's presence data, the one occupied by its background data, and the one represented by a set of environment layers.

For these you just type threespace.plot(model, env), where your model is an enmtools.model object and your env is a set of raster layers.  That gives you something like this:



The goal here is to visualize how much of that environment space represents sets of conditions your model never saw when it was being built.  This is just a first pass and I do have more stuff planned in this direction, but I reckon this alone might be useful for some of you out there.

Monday, June 4, 2018

More installation issues: "descendants" not exported from phyloclim, ecospat not found

Well, the fun just keeps rolling in with the newest R update and the ensuing package updates.  Two new issues have just been brought to my attention that prevent new installations of ENMTools: first, the newest version of phyloclim no longer exports the "descendants" function, which ENMTools uses.  There's a hot fix for that on the develop branch of ENMTools on GitHub, where I've basically just copied the code over from phyloclim for the time being so it no longer has to be imported.

The second issue is more of a hassle: ecospat has been removed from CRAN, and I'm not sure why.  Given that ENMTools requires ecospat to run, this is a real problem.  For now, all you can do is install ecospat manually from the GitHub mirror of the CRAN repository, and then install ENMTools after that.  Hopefully ecospat will be updated and re-uploaded to CRAN soon, but failing that we might need to find a more permanent workaround eventually.
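In practice that means something along these lines, assuming the read-only "cran" mirror on GitHub has the last released version of ecospat and that the ENMTools repository lives at danlwarren/ENMTools:

install.packages("devtools")
devtools::install_github("cran/ecospat")        # GitHub mirror of the CRAN version
devtools::install_github("danlwarren/ENMTools") # then ENMTools itself
# (add ref = "develop" to that last call if you need the phyloclim hot fix above)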

Thursday, May 17, 2018

Issues with installing ENMTools under newest versions of R

During the course I co-taught with Matt Fitzpatrick in Glasgow a few weeks ago, it came out that a number of people were having trouble installing ENMTools.  A lot of this seems to stem from the newest version of R, and in particular how it interacts with your Java installation.

Some Mac users (including me) were able to fix this just by going to the terminal, running "R CMD javareconf", and restarting R.

Windows users seemed to have much more significant problems, however.  There appeared to be several distinct issues, but here's a solution that helped several of the students, courtesy of Hirzi Luqman:

Error:
* installing *source* package ‘ENMTools’ ...
** R
** data
*** moving datasets to lazyload DB
** tests
** byte-compile and prepare package for lazy loading
Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so':
dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: /Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/lib/server/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/rJava/libs/rJava.so
  Reason: image not found
ERROR: lazy loading failed for package ‘ENMTools’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/ENMTools’
Installation failed: Command failed (1)


Problem:
ENMTools/rJava was looking for (and couldn't find) the library "libjvm.dylib" in the directory "/Library/Java/JavaVirtualMachines/jdk-9.jdk/Contents/Home/lib/server/", because no such directory existed (this refers to a Java 9 installation); I don't know why it looked for the file there...

Solution: I just copied that file (libjvm.dylib) from the Java 8 directory (or whatever Java version is installed on the computer; in my case the directory was "/Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/lib/server/") to the R library directory "/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/", and it worked.
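If you'd prefer to do that copy from within R, it's one line - just adjust the source path to whatever JDK is actually installed on your machine:

file.copy(from = "/Library/Java/JavaVirtualMachines/jdk1.8.0_45.jdk/Contents/Home/lib/server/libjvm.dylib",
          to = "/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libjvm.dylib")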


Update: GitHub user Sandy had a different issue, detailed here.  The fix in that case was:

I removed all PATH entries related to "java", then added jvm.dll to PATH (adding "%JAVA_HOME%\jre\bin\server;" to PATH).
After that, I set JAVA_HOME in R:
Sys.setenv(JAVA_HOME="C:\\Program Files\\Java\\jdk1.8.0_171\\")

Thursday, April 5, 2018

Course Reminder: Quantitative geographic ecology using R. Glasgow, April 30 to May 4.

Hey everybody!  There are still open slots in my upcoming course with Matt Fitzpatrick entitled "Quantitative geographic ecology using R: modelling genomes, niches, and communities".  Sign up now, it's coming up fast!

The course will include introductory lectures, guided computer coding in R, and exercises for the participants, with an emphasis on visualization and reproducible workflows. All modelling and data manipulation will be performed with R. Attendees will learn to use niche modelling algorithms including Maxent, GLM, GAM, and others, and will learn both new and existing methods for conducting comparative studies using ENMs in the new ENMTools R package.  Generalized Dissimilarity Modelling (GDM) and Gradient Forest (GF) will be taught for modelling genomic and community-level data. The course is intended for intermediate R users with interest in quantitative geographical ecology.

https://www.prstatistics.com/course/quantitative-geographic-ecology-using-r-modelling-genomes-niches-and-communities-qger01/

Thursday, March 22, 2018

Biodiversity Informatics Training Curriculum presentation on the past, present, and future of ENMTools

In case you missed it, I gave a talk for the Biodiversity Informatics Training Curriculum the other day where I talked about the history of ENMTools, what's happening now, and where it's going in the near future.  It's a whirlwind tour, but for those of you who are curious about what's coming this is a good overview.

Thanks to Town for having me on, and I apologize to y'all for the technical problems.  Everything worked right up until the very second we started recording.

Recent changes: trim dupes by raster, species from csv file, and improvements to ecospat tests

I've been making a bunch of small changes recently.  Some of it you won't see, but bits of it are there to make your life easier.  One nice little trick is that now you can go:

species.list = species.from.file("myfile.csv", species.col = "Species")

That will create a separate enmtools.species object for every unique value in the column named "Species".  If you have one species in the file, you just get back an enmtools.species object.  If you have multiple species in the file, you get a list of enmtools.species objects.  This will save a ton of time, particularly if you're building an entire clade for the aoc tests.



I've also added a "trim.dupes.by.raster" function.  Basically you feed that a set of points and a raster, and it trims your data set down so there's a max of one point per grid cell.  Nothing dramatic, but it's something people do a lot so it's worth automating.  It's easy to use too:

new.points = trim.dupes.by.raster(old.points, env)


Finally, I tweaked a bunch of stuff in the ecospat functions.  Most of it you won't notice, but it deals with some potential errors that could come up if you had mismatched NAs in your environment rasters.  The big changes are (1) ecospat was calculating p values incorrectly, so that calculation is now done in ENMTools, and (2) the ENMTools ecospat functions will now automatically do a PCA if you pass them more than two rasters, unless told otherwise.
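For reference, the calls I'm talking about look something like this; the species objects are placeholders, and I'm assuming the function names enmtools.ecospat.id and enmtools.ecospat.bg with their usual arguments:

# With more than two layers in env, this should now run a PCA on the
# environment automatically unless you tell it otherwise
id.test = enmtools.ecospat.id(species.1 = my.species.1,
                              species.2 = my.species.2,
                              env = env, nreps = 99)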

Tuesday, February 27, 2018

Interactive species and model plots

Russell Dinnage has just added a really cool feature: interactive plotting of enmtools.species and enmtools.model objects.  Here's a species:

And here's a model:


The maps are zoomable and pannable, and draw very quickly.  Source maps are downloaded from the internet on the fly, though, which means you do need an internet connection to use these functions.

The functions in question are "interactive.plot.enmtools.species" and "interactive.plot.enmtools.model", respectively.  Both are now implemented on the master branch of ENMTools on GitHub.
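Since those look like S3 methods, you should be able to just call the generic on your objects (the object names here are placeholders):

interactive.plot(my.species)   # an enmtools.species object
interactive.plot(my.model)     # an enmtools.model object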

Thursday, February 1, 2018

Playing around with a new version of response plots

On the develop branch of the R version of ENMTools, I've been playing around with a new way of plotting marginal response functions.  In the figure above, the solid blue line represents the relative suitability of habitat as a function of an environmental gradient, with all other variables held constant at their mean value across all presence and background points.  The green and red dashed lines represent the relative frequency of different values of the environmental variable in the data.  This is a Bioclim model, which is why it looks so chunky.

The goal here is to give you some idea of what your model is saying in the context of the data that generated it, instead of just plotting the response function.  Currently it's been tested on GAM, GLM, Bioclim, and Domain models.  I'll do some more testing and probably move it over to the main branch before the weekend.

Wednesday, January 31, 2018

Correlation MDS-space plots added to raster.cor.plot

Here's a handy little visualization for when you're interested in the correlations between your rasters.  Basically it takes the matrix of absolute values of the correlation coefficients for a set of rasters, turns it into a distance matrix, and then runs multidimensional scaling (MDS) on it.  The resulting coordinates are turned into a nice little plot in which highly correlated variables sit closer to each other than weakly correlated ones.  It's a good way to eyeball relationships during variable selection.  Here's one for a set of 20 bioclimatic variables.
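If you want to see roughly what's going on under the hood, it boils down to something like this - not the actual raster.cor.plot code, and it assumes env is a RasterStack of predictors:

library(raster)

vals = na.omit(values(env))                   # cell values, one column per layer
cor.mat = abs(cor(vals))                      # absolute correlation coefficients
mds = cmdscale(as.dist(1 - cor.mat), k = 2)   # treat 1 - |r| as a distance

plot(mds, type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(mds, labels = colnames(vals))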

Currently this is on the develop branch on GitHub, but I'll be merging it into the master branch as soon as it passes through testing.


Massive wad of ENMTools-R updates just published

I've spent the last month relentlessly tweaking ENMTools-R's code to make it CRAN-compatible, and we're pretty much there now.  Most of the changes aren't visible from the user's end of things, but they're necessary to make sure that it's suitable for wider distribution.  I've tested everything, and it seems to all be working.

THAT SAID, it's entirely possible that something has been borked up that isn't popping up in my own code.  If you download ENMTools from the GitHub repository and notice it acting weird in some way, please don't hesitate to raise a GitHub issue about it.

Also, there's a nice new function called raster.cor.plot that does this:


Which is pretty darn cute if I do say so myself.  It's visualizing the correlations between a set of predictor rasters.
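To try it out, something like this should do the trick, assuming env is a stack of your predictor rasters:

raster.cor.plot(env)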


Friday, January 5, 2018

Best to avoid using B1 breadth metric in environment space

This came to light relatively recently: the Latin hypercube version of the B1 metric in environment space is probably not trustworthy as currently implemented.  Due to the combination of standardizing the distribution and the use of logs in the calculation, there's a dependence on sample size that makes the metric fail to converge.  For an illustration, here's B2 as a function of sample size:


That's behaving as you'd like it to - it seems to be converging on a relatively stable value and isn't changing much with additional sampling (note the scale of the Y axis).

Now look at B1:

There's an obvious trend here with increasing sample size, and the scale of the Y axis is such that those differences could be quite substantial.

At some future date we may figure out how to adjust for this, but for now I'd say just avoid using B1 in environment space altogether.