Gnuritas | articles tagged "science"

Tips on designing, writing and debugging individual-based simulations

Thu 08 March 2012

This post was originally written in 2012, and was slightly updated in 2017.

When I first started writing stochastic, individual-based computer simulations in 2005, I thought it would be a pretty straightforward job. Although I’m technically a biologist, I already had quite a bit of (self-taught) experience in C programming and knew about object-oriented programming. Also, my history of hand-optimising assembly code for the Atari Falcon had taught me a thing or two about writing fast code. Or so I thought. Of course, I then proceeded to walk into just about every trap that a naive, non-professional software developer can walk into… Here are some of the things I learned over the past years, quite a few of them by trial and error…

Debugging

It is often said that a software developer spends 50 …
read more
Useful tricks with spatial data

Sat 25 June 2011

For my research on Avian Influenza in waterbirds, I recently needed data on lakes and marsh areas in Europe. I ended up compiling a spatial dataset from a number of different sources, including the EU Corine Land Cover database (CLC2000v13), the lake depth dataset compiled by Ekaterina Kourzeneva for the FLake model, and data from the Finnish national lake register. For this to be usable, I had to relate the lakes and marshes in the different datasets to each other. In other words, I needed to find out which coordinates or polygons in the different datasets represented the same lakes, and I needed to know which marsh areas were close to which lakes.

After searching around a bit and considering several options, I decided (for several reasons) not to use a full-blown …
read more
Fun With Shapefiles

Wed 09 March 2011

Shapefiles are a format developed by ESRI (the makers of ArcGIS) to store and share geospatial data. Many interesting datasets are freely available in shapefile format. Shapefiles can be viewed with a number of freely available applications, such as ArcGIS Explorer (which requires the .NET framework or Silverlight) or ArcReader (which is multi-platform but closed-source). Open-source GIS packages such as Quantum GIS and GRASS can also view and edit shapefiles, and recent versions can be installed from the UbuntuGis PPA.

However, GIS applications can be tricky to use without training. Moreover, sometimes you may want to use shapefile data for some other purpose, outside of traditional GIS applications. Here are a few useful tricks you can do with shapefiles, and ways to get data out of a shapefile and into another application.

Reprojecting a …
read more
Changing the evolutionary dynamics of Influenza?

Tue 08 February 2011

An interesting news article appeared today, claiming a possible breakthrough in the development of a universal influenza vaccine. It describes the development of a new influenza vaccine at the Jenner Institute. Traditional influenza vaccines prime the adaptive immune system against the hemagglutinin (HA) and neuraminidase (NA) proteins (the H and N used to characterise influenza subtypes) that are on the outside of the virus particles. The problem with this approach is that these HA and NA proteins mutate rapidly and thereby escape immunity, so that a new vaccine has to be developed for each new strain that evolves. The new vaccine, however, targets two proteins that are on the inside of the virus particle, the matrix protein (M1) and the nucleoprotein (NP). According to the AFP news article (e.g. on Yahoo …
read more
Data Clustering

Thu 10 July 2008

I’ve written a fast Perl/PDL implementation of UPGMA data clustering for very large datasets. The problem is that existing clustering packages have difficulty handling datasets with more than a few thousand data points. The distance matrices in particular tend to become a problem. For example, clustering the outcome of a 300x300 grid-based simulation (90,000 data points) would require a (non-sparse) distance matrix of 8.1 billion entries. This would use over 30 GB of memory when stored as 4-byte floating point values.
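
The memory figure above is easy to check by hand. The following is a minimal sketch in Python (not the Perl/PDL code this post is about; the function name `dense_matrix_bytes` is made up for illustration). It also shows, as an aside, that storing only one triangle of the symmetric matrix would roughly halve the requirement, which is still far too much for an ordinary workstation.

```python
def dense_matrix_bytes(n_points: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for a full (non-sparse) n x n distance matrix."""
    return n_points * n_points * bytes_per_value


n_points = 300 * 300                 # 90,000 grid cells
entries = n_points ** 2              # 8.1 billion pairwise distances
full_bytes = dense_matrix_bytes(n_points)

# A distance matrix is symmetric with a zero diagonal, so in principle
# only n * (n - 1) / 2 values are distinct.
condensed_entries = n_points * (n_points - 1) // 2
condensed_bytes = condensed_entries * 4

print(f"data points:       {n_points:,}")
print(f"matrix entries:    {entries:,}")
print(f"full matrix:       {full_bytes / 1e9:.1f} GB")
print(f"one triangle only: {condensed_bytes / 1e9:.1f} GB")
```

Because the cost grows with the square of the number of data points, doubling the grid resolution in each direction multiplies the memory requirement by sixteen.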

I needed to cluster a lot of such datasets, so to make this manageable I implemented a simple but fast UPGMA clustering algorithm in PDL (the Perl Data Language). To conserve memory, it doesn’t store a full hierarchical clustering tree, but rather partitions the data into clusters based …
read more
