Modelling the distribution of a severely invasive plant.

A summary of my master's thesis.

Himalayan Balsam (Impatiens glandulifera) is one of the most severely invasive plant species in the UK, and one of the “big four” terrestrial invasives alongside Japanese Knotweed, Rhododendron and Giant Hogweed. Originating (somewhat unsurprisingly) in the Himalayas, it was introduced to the UK in the 1800s as an ornamental garden plant. The transition from non-native to invasive occurred during the 20th century, due to two primary factors. Firstly, the waterways of the UK underwent canalisation, leading to faster flow and therefore quicker dispersal by water. Secondly, the management of river banks declined seriously due to a fall in the workforce during World Wars I and II. Eradicating I. glandulifera from the UK is estimated to cost between £150 million and £300 million.

To plan the best possible management strategy, it is important to have a thorough knowledge of the current distribution. Occurrence records are often sparse or of poor quality, so mathematical models are used to predict the distribution. Species distribution models (SDMs) estimate the relationship between the study species and its environment, allowing the distribution to be predicted from limited data. SDMs define the potential ecological niche of the study species. The ecological niche combines the fundamental niche, defined by the environmental conditions under which the species can maintain a long-term sustainable population, and the realised niche, determined by ecological processes such as inter-specific competition, predation and dispersal.

MAXENT is a widely used, open source, machine learning SDM that specialises in modelling presence-only species data. A series of predictor variables is added as raster layers, and occurrence records are formatted as a point shapefile. MAXENT compares the values at cells containing occurrence records to those found at randomly generated background points. It determines which variables significantly affect the distribution of the study species and uses that information to predict the habitat suitability throughout the study site. It consistently outperforms other similar models, and is highly customisable, which allows spatial sampling bias and spatial autocorrelation to be reduced.

I completed both my undergraduate and master's degrees at Swansea, so this study focused on Wales. I collated records from various sources, including Biodiversity Record Centres, the NBN Atlas and Natural Resources Wales. After removing duplicate records and those deemed unreliable, I was left with 4886 occurrence records across the country.

I used the WorldClim bioclimatic dataset at a resolution of 30 arc seconds (ca. 1 km²). The dataset consists of 19 raster layers, 15 of which were used; the layers combining temperature and precipitation were excluded because they can show spatial anomalies. Solar radiation and wind speed layers showing the monthly average for July were also taken from this source. July was chosen as it is a key month in the flowering and dispersal of balsam. The elevation data was extracted from the Shuttle Radar Topography Mission (SRTM) Digital Terrain Model, and the slope and aspect layers were generated from it using the “raster” package in “R”. The population density dataset was available from the Center for International Earth Science Information Network at Columbia University. Raster layers showing cell distance from railways, rivers and roads were created with the Proximity function in QGIS, using shapefiles of these corridors from the Digital Chart of the World. The rural-urban classification dataset was published by the Office for National Statistics. “R” was used to crop and mask all layers to the study site, then set all raster resolutions to 30 arc seconds (ca. 1 km²).
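For anyone wondering where the “ca. 1 km” figure comes from, here is a rough back-of-the-envelope sketch. It is written in Python rather than the R I used for the thesis, it assumes a spherical Earth with a mean radius of 6371 km, and the latitude of 52°N is just my stand-in for mid-Wales. The function name is made up for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def cell_size_km(arc_seconds, latitude_deg):
    """Approximate (north-south, east-west) size in km of a grid cell.

    arc_seconds is the cell width in arc seconds; latitude_deg is where
    the cell sits. East-west size shrinks with the cosine of latitude.
    """
    angle_rad = math.radians(arc_seconds / 3600.0)
    north_south = EARTH_RADIUS_KM * angle_rad
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# 52 degrees north is an assumed, approximate latitude for mid-Wales
ns, ew = cell_size_km(30, 52.0)
print(f"north-south: {ns:.2f} km, east-west: {ew:.2f} km")
```

Under these assumptions a 30 arc second cell is a little under 1 km from north to south and noticeably narrower from east to west at Welsh latitudes, so “ca. 1 km²” is best read as the nominal size at the equator.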

The model was run through the excellent ENMeval package. The occurrence points and 10,000 random background points were partitioned spatially into four k-fold bins, which improves the detection of overfitting and reduces the influence of spatial autocorrelation compared to random k-fold sampling.
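To give a feel for what spatial binning means, here is a minimal sketch in Python (not the ENMeval code, and not necessarily the exact partitioning scheme used in the thesis). It assumes one simple approach: split the points at the median latitude, then split each half at its own median longitude, giving four geographic blocks of roughly equal size. The function name and coordinates are invented for the example.

```python
from statistics import median


def spatial_blocks(points):
    """Assign each (lon, lat) point to one of four spatial bins (1-4).

    Points are split at the median latitude, then each half is split at
    its own median longitude, so bins hold roughly equal numbers of points.
    """
    lat_cut = median(lat for _, lat in points)
    south = [p for p in points if p[1] <= lat_cut]
    north = [p for p in points if p[1] > lat_cut]

    bins = {}
    for offset, half in ((0, south), (2, north)):
        if not half:
            continue
        lon_cut = median(lon for lon, _ in half)
        for p in half:
            # west of the cut gets the first bin of the pair, east the second
            bins[p] = offset + (1 if p[0] <= lon_cut else 2)
    return [bins[p] for p in points]


# Made-up coordinates, loosely within Wales
points = [(-3.9, 51.6), (-3.2, 51.5), (-4.1, 51.9), (-3.0, 51.8),
          (-3.8, 52.9), (-3.1, 53.1), (-4.3, 53.2), (-3.4, 52.7)]
labels = spatial_blocks(points)
print(labels)
```

Each bin is then held out in turn as test data, so the model is always evaluated on a region it was not trained on.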

MAXENT has several user settings that alter the model and can be edited to improve model quality and reduce overfitting. The regularisation multiplier penalises the addition of new parameters that do not add to the model, and feature classes correspond to mathematical transformations of the original predictor variables. The “ENMeval” package allows multiple models to be run across combinations of feature classes and regularisation multipliers. The models were then evaluated using the area under the curve (AUC) statistic of the receiver operating characteristic (ROC) plot, which shows the true positive rate against the false positive rate. In essence, the AUC is the probability that a randomly chosen test record is ranked above a randomly chosen background point. The closer an AUC value is to 1.0, the better the model, with 0.7 being the threshold for an acceptable model. The best performing model, with the highest AUC value, was selected for analysis.
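The “probability” reading of AUC can be made concrete with a small sketch. This is Python rather than the R I used, the suitability scores are invented, and it is the plain pairwise definition rather than anything taken from MAXENT or ENMeval: count how often a presence record scores higher than a background point, with ties counting as half.

```python
def auc(presence_scores, background_scores):
    """Probability that a random presence outscores a random background point.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


# Invented suitability scores for illustration only
presence = [0.9, 0.8, 0.6, 0.4]
background = [0.7, 0.5, 0.3, 0.2, 0.1]
score = auc(presence, background)
print(score)  # 17 of the 20 pairs are ranked correctly
```

A model that scores every presence above every background point gets 1.0, and one that cannot tell them apart gets 0.5.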

A Jackknife test was performed to assess the importance of each independent variable in forming the model predictions. The test gives AUC values for three scenarios (without the variable, with only that variable, and with all variables). A value above 0.05 for the Jackknife test indicates that species occurrence is predicted better by the variable than by a random prediction, i.e. it significantly affects I. glandulifera presence. MAXENT also calculated the permutation contribution of each variable to the model: the values of each predictor were permuted in turn and the resulting decrease in AUC was recorded.
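The permutation idea is easy to sketch. The following Python is an illustration only, not MAXENT's implementation: the toy model, variable names and data are all invented, and the “model” is just a fixed function that scores cells closer to a river as more suitable. One variable at a time is shuffled across all rows and the average drop in AUC is reported.

```python
import random


def auc(presence_scores, background_scores):
    """Pairwise AUC: share of presence/background pairs ranked correctly."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            wins += 1.0 if p > b else 0.5 if p == b else 0.0
    return wins / (len(presence_scores) * len(background_scores))


def permutation_drop(model, rows, is_presence, variable, repeats=20, seed=0):
    """Mean decrease in AUC after shuffling one variable across all rows."""
    rng = random.Random(seed)

    def score_auc(data):
        scores = [model(r) for r in data]
        pres = [s for s, flag in zip(scores, is_presence) if flag]
        back = [s for s, flag in zip(scores, is_presence) if not flag]
        return auc(pres, back)

    baseline = score_auc(rows)
    drops = []
    for _ in range(repeats):
        shuffled = [r[variable] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, variable: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - score_auc(permuted))
    return sum(drops) / repeats


def toy_model(row):
    """Hypothetical model: suitability falls with distance from a river."""
    return -row["river_km"]


# Invented data: five presence cells near rivers, five background cells far away
river = [0.1, 0.2, 0.3, 0.4, 0.5, 2.0, 3.0, 4.0, 5.0, 6.0]
aspect = [10, 90, 180, 270, 45, 135, 225, 315, 0, 60]
rows = [{"river_km": r, "aspect_deg": a} for r, a in zip(river, aspect)]
is_presence = [True] * 5 + [False] * 5

river_drop = permutation_drop(toy_model, rows, is_presence, "river_km")
aspect_drop = permutation_drop(toy_model, rows, is_presence, "aspect_deg")
print(river_drop, aspect_drop)
```

Shuffling a variable the model relies on damages the AUC, while shuffling one it ignores (aspect, in this toy case) changes nothing.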

My model identified the areas that had the maximum suitability for I. glandulifera. The areas of high suitability are highly linear in nature, as the variable that contributed most to the model is proximity to a river. This relationship is shown in the figure below, where rivers are overlaid onto the habitat prediction. It is evident that I. glandulifera favours downstream riparian habitats; this is further supported by elevation being a key contributor to the model, and several studies share this conclusion. I. glandulifera relies on its explosive seed dispersal, alongside anthropogenic and animal dispersal, to invade an upstream area, whereas it can use the far more efficient method of hydrochory to invade downstream. The South East of Wales is particularly well suited to I. glandulifera invasion, due to a high density of rivers, low altitude and a high population density, while the South and North East also have a relatively high suitability. It is notable that the far West, both North and South, has a low suitability, and areas of Central Wales have a suitability rating of 0. This predictive map can be used to prioritise monitoring and management. For example, a site above 600 m in altitude is highly unlikely to be invaded, so would be a lower priority to survey or manage.

Although highly useful for monitoring and management, this prediction should not be seen as complete. There are several flaws with the modelling method and the variables used. Firstly, biotic factors such as competition and pollination are not included in this model, yet they are vital in the introduction, establishment and spread of invasive alien species (IAS). Secondly, the variables were limited to a resolution of 30 arc seconds (ca. 1 km²), and a higher resolution may give a more accurate prediction. The resolution can also alter the results, as different levels of niche variation occur at different scales. Thirdly, the variables were not screened for multicollinearity, and several of them are likely to be strongly correlated. Identifying and removing these strongly correlated variables may improve the performance of the model. Finally, AUC has been criticised in the literature despite being the most widely used statistic for model evaluation. Separate analyses, such as of commission and omission errors, may be beneficial in determining the best performing model.
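As a sketch of what that multicollinearity check could look like, the Python below computes the Pearson correlation between every pair of variables and flags the strongly correlated ones. This was not done in the thesis. The layer names and values are invented, and the cut-off of 0.7 is just an assumed rule of thumb rather than a fixed standard.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlated_pairs(layers, threshold=0.7):
    """Return (name_a, name_b, r) for pairs with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(layers.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented values sampled at five hypothetical cells
layers = {
    "elevation_m": [10, 200, 450, 600, 80],
    "mean_temp_c": [11.0, 9.8, 8.1, 7.2, 10.6],
    "river_dist_km": [0.2, 3.0, 0.5, 2.0, 1.0],
}
flagged = correlated_pairs(layers)
print(flagged)
```

In this toy data only elevation and temperature are flagged, and one of the two would then be dropped before fitting the model.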

Robbie Still
Data Scientist

I am a data scientist working in ecology, and I enjoy looking at football statistics in my spare time, mainly in R. I also love reading and watching anything fantasy, and I currently live in Brighton with my girlfriend, Coral.
