Tuesday, February 6, 2018

DNA Project: Introduction


I’ve always had an interest in biology. When I was 11, I started collecting all sorts of info on conifers into a “book”. More recent projects include my ContentMine project, and an ongoing project on microscopic photography.

Example photo: plant cells (detail of the cross section of a basswood stem)

To my regret, however, I only recently got a decent introduction to the chemistry of DNA, and even then, there’s a lot left uncovered—obviously, it’s an introduction after all. It also raised a lot of questions, and gave me some new ideas, like: What if you were to thoroughly analyse an entire genome, figure out what does what, and then just… golfed it? The genome. What if you were to minify a genome? Those things were a bit out of scope of the class, however.

Luckily, there’s a thing called Profielwerkstuk in Dutch high school education. For your Profielwerkstuk, you are expected to spend at least 80 hours on research into a topic of your choice. This includes setting up the project, reading literature, usually performing an experiment, writing the article and, in the end, presenting your project.

Of course, I could spend that time on Citation.js, or on one of my other projects, but I’m probably going to spend time on them anyway, and I want to learn something new. A perfect opportunity for me to answer some of those questions, and learn some more about DNA in the process.

Later, I learnt about this paper, which is basically exactly the idea I mentioned above. Seeing the resources involved, it seems a bit out of the scope of my profielwerkstuk as well. However, there are plenty of other interesting things to figure out.

In the next few weeks, I will be reading literature to a) expand my basic knowledge on the subject past introduction-level and b) see what I could learn through experimentation, and which experiments I could perform in this project. My reading material includes:

I will also be putting out a series of blogposts, about inspirations, thoughts that came up while reading, ideas for the rest of the project, and more. Of course, I will also blog about the project itself.

In the meantime, if you are or know someone who can help me with actually performing those experiments (editing/sequencing DNA), please do contact me. General ideas, tips and feedback are of course also welcome.

This post is part of a series.

Sunday, February 4, 2018

Microscopic photography: Part 2

I promised more photos in the previous post, so here they are.



Penicillium with conidiophores
With conidiophores

Penicillium (conidiophore)
Detail of conidiophores

More fungi:
Fungus cells

Fungus cells

Cat brains:
Cat brain cells

This post is part of a series.

Friday, December 22, 2017

Citation.js Version 0.4 Beta: New Docs and Input Plugins


It’s been a while. Really. But now it’s back, with a new release: v0.4.0-0, the v0.4 beta. Below I explain some of the changes in this release, and then the road-map of Citation.js for v0.4 and v0.5. Also, Citation.js has a DOI now:


Also, @jsterlibs tweeted about Citation.js.


Input plugins

The main change in this release is the addition of input plugins. This is the first step towards releasing v0.4, as explained below. Although there’s specific documentation available here and here, I’ll put an example here as well, as the API can be confusing.

The API for registering input plugins will be changed once or twice before the release of v0.4, once to make it less complex and weird, and possibly a second time to incorporate output plugins.

Adding a format

Say you wanted to add the RIS format (I should probably do that sometime). First, let’s define a type, to set things off.

const type = '@ris/text'

I can’t really define the actual parser here, but I’ll add a variable in the example code.

const parse = data => { /* return CSL-JSON or some format supported by any other parser */ }

To test whether a given string is in the RIS format, we’ll use a regex. This regex matches if any line starts with two alphanumerical characters, two spaces and a hyphen. That’s a pretty fuzzy match; really, all lines should start with that sequence. However, this is just an example, and a proper regex would be less elegant.

const risRegex = /^\w{2}\ {2}-/gm
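To illustrate, the pattern behaves like this on a made-up RIS fragment (the sample strings are just illustrations; the regex is repeated so the snippet is self-contained):

```javascript
const risRegex = /^\w{2}\ {2}-/gm

// A made-up RIS fragment: every line starts with two characters,
// two spaces and a hyphen
const sample = 'TY  - JOUR\nTI  - An example title\nER  - '

console.log(risRegex.test(sample)) // true

// Note: the g flag makes .test() stateful, so reset before reuse
risRegex.lastIndex = 0
console.log(risRegex.test('Just a sentence.')) // false
```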

Now, let’s define the dataType of RIS input. When using a regex to test input, the dataType is automatically determined to be 'String' anyway, but for the sake of clarity:

const dataType = 'String'

Now, to combine it all:

Cite.parse.add(type, {
  dataType,
  parseType: risRegex,
  parse
})

Changing parsers

Now, say someone else wrote that code above, and you need to use it without modifying it, but you want a better regex? That can be arranged:

Cite.parse.add(type, {
  dataType: 'String',
  parseType: /^(?:\w{2}\ {2}-.*\n)+(\w{2}\ {2}-.*)?$/g
})

Because the options dataType, parseType, elementConstraint and propertyConstraint are all treated as one unit, you need to pass every one of those when replacing the type checker. dataType is still not actually mandatory in this example, but is passed to demonstrate this.

Disabling a format

There is currently no way to remove a format.

In this scenario, some plugin registered a new format to get citation data from github.com URLs. Unfortunately, it doesn’t work. Instead of recognising the URL as a GitHub-specific URL, the type is @else/url, the generic URL type. This is because the generic version was registered earlier, and there is not yet a category for generic types (see #104). If you don’t use the @else/url type checker anyway, you can disable it like this:

Cite.parse.add('@else/url', {dataType: 'String', parseType: () => false})

I’m not sure why the dataType is needed here, as you’re disabling the parser anyway, but it doesn’t seem to work without it.


Proper CLI docs

Recently, I’ve been updating the documentation on Citation.js, and it has been a pleasant experience, despite some hiccups in the documentation engine. There are tutorials now, and a lot of the JSDoc comments have been improved. I still want to improve some things in the theme, like clickable header links (similar to GitHub and NPM behaviour), and showing sub-tutorials in the navigation.

Backwards compatibility

This release should be largely backwards compatible. If there are any regressions, please report them in the bug tracker.



v0.4 “is about making it easier to expand on input and output formats, possibly by creating schemes and methods parsing those schemes that can, say, convert BibJSON to CSL JSON based on a JSON file, something that can be stored independently of implementation.”

The idea is to allow for all kinds of plugins, both in input parsing and output formatting, to be registered on the Cite object, and to treat all internal parsers and formatters the same. Currently, input parser plugins are possible (see above), although they will be improved before v0.4. Output plugins will follow soon, and will be backwards-incompatible because of the change in the output option format.


Currently, the plan is to use JSON-LD as the internal format in v0.5, while still keeping CSL-JSON as the internal scheme. Any input should then be converted to a list (or object) of fields, which will be added to the internal data store. Each field should be transformed to the CSL-JSON scheme individually, and added as a copy. As a result, there won’t be data loss when certain fields aren’t available in CSL-JSON, but are in the input and output schemes. It should also make it easier to write parsers for e.g. BibJSON.
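As a rough sketch of that idea (all names below are hypothetical illustrations, not the actual planned API): each input field gets its own converter to CSL-JSON, and the full input survives as a stored copy, so unconvertible fields aren’t lost.

```javascript
// Hypothetical per-field converters from some input scheme to CSL-JSON
const fieldConverters = {
  name: value => ({ title: value }),
  journal: value => ({ 'container-title': value })
}

// Convert field by field; keep the original input as a copy,
// so fields without a CSL-JSON equivalent aren't lost
function toInternal (input) {
  const record = { _original: { ...input } }

  for (const field of Object.keys(input)) {
    if (fieldConverters[field]) {
      Object.assign(record, fieldConverters[field](input[field]))
    }
  }

  return record
}

const record = toInternal({ name: 'An example', journal: 'Some journal', custom: 'kept' })
console.log(record.title)            // 'An example'
console.log(record._original.custom) // 'kept'
```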

Of course, edge cases should be taken care of. For example, there is no Wikidata property for the CSL-JSON field original-title, but that information can still be derived from the labels combined with the P364 (original language) property.

Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

Here’s a more detailed example. We are using my dataset, available at 10.5281/zenodo.845935, which was generated from 1000 articles that mention ‘Pinus’ somewhere. This one has 15326 statements, of which 8875 (57.9%) can be mapped, taking ~50 seconds. Now, for example, we can list the top 100 most mentioned species. The top 20:

Species Hits
Pinus sylvestris 248
Picea abies 177
Pinus taeda 138
Pinus pinaster 120
Pinus contorta 96
Arabidopsis thaliana 91
Picea glauca 77
Pinus radiata 77
Pinus massoniana 72
Pseudotsuga menziesii 65
Oryza sativa 56
Pinus halepensis 56
Pinus ponderosa 55
Pinus banksiana 53
Pinus koraiensis 53
Picea mariana 52
Pinus nigra 51
Pinus strobus 46
Quercus robur 45
Fagus sylvatica 45

The top of the list isn’t surprising: mostly pines, other conifers, other trees, Arabidopsis thaliana which I’ve seen represented in pine literature before, and Oryza sativa or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query getting the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; a caveat of the current rdf output.

Going off this list, we can then look at which non-tree or even non-plant species are named most often in conjunction with a given species or, in this case, a genus. Top 20:

Species 1 Species 2 Co-occurrences
Picea abies Pinus sylvestris 98
Picea abies Pinus taeda 56
Arabidopsis thaliana Pinus taeda 47
Picea glauca Pinus taeda 43
Arabidopsis thaliana Oryza sativa 43
Pinus pinaster Pinus taeda 41
Pinus pinaster Pinus sylvestris 41
Picea abies Picea glauca 41
Arabidopsis thaliana Picea abies 37
Pinus contorta Pinus sylvestris 36
Betula pendula Pinus sylvestris 36
Pinus sylvestris Pinus taeda 36
Pinus nigra Pinus sylvestris 35
Pinus contorta Pseudotsuga menziesii 32
Picea abies Pinus contorta 31
Picea abies Pinus pinaster 30
Arabidopsis thaliana Physcomitrella patens 30
Oryza sativa Pinus taeda 29
Pinus sylvestris Quercus robur 29
Picea abies Picea sitchensis 28

Interesting to see that rice is mostly mentioned with Arabidopsis. Let’s explore that further. Below are species named in conjunction with Oryza sativa.

Species Co-occurrences
Arabidopsis thaliana 43
Pinus taeda 29
Picea abies 24
Physcomitrella patens 23
Populus trichocarpa 21
Glycine max 20
Vitis vinifera 17
Picea glauca 17
Pinus pinaster 16
Selaginella moellendorffii 15
Pinus sylvestris 13
Triticum aestivum 12
Pinus contorta 10
Picea sitchensis 10
Ginkgo biloba 10
Pinus radiata 10
Ricinus communis 9
Amborella trichopoda 9
Medicago truncatula 9
Cucumis sativus 8

So attention seems divided between trees and more agriculture-related plants. More to explore for later.

View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.

ctj has been around for longer; it started as a way to learn the ContentMine pipeline of tools, but turned out to uncover a lot of possibilities in further processing the output of this pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities are certainly no smaller. This is mainly because of SPARQL, which makes it possible to integrate with other databases, such as Wikidata, without many changes to ctj rdf itself.

Here’s a simple demonstration of how this works:

  1. We download 100 articles about aardvark (classic ContentMine example)
  2. We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
  3. We run ctj rdf

This generates data.ttl, which holds the following information:

  • Common identifier for each article (currently PMC)
  • Matched terms from each article (which terms/names are found in which article)
  • Type of each term (genus, binomial, etc.)
  • Label of each term (matched text)
Example data.ttl contents

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this, we first link the identifier in our dataset to the corresponding one in Wikidata; then we link the matched text of each term to the taxon name of species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive, community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata
Results of the above query
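I can’t reproduce the exact query from the screenshot here, but a federated query in that spirit might look roughly like this (the ctj: namespace and the mentions predicate are hypothetical stand-ins for the custom namespaces in data.ttl; wdt:P225 is Wikidata’s “taxon name” property):

```sparql
PREFIX ctj:  <http://example.org/ctj#>   # hypothetical stand-in namespace
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?taxon ?name (COUNT(?article) AS ?mentions) WHERE {
  # local data: which article mentions which term
  ?article ctj:mentions ?term .
  ?term rdfs:label ?name .

  # federated part: map the matched text to a Wikidata taxon
  SERVICE <https://query.wikidata.org/sparql> {
    ?taxon wdt:P225 ?name .
  }
}
GROUP BY ?taxon ?name
ORDER BY DESC(?mentions)
```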

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.

Again, this is without having to change anything in the rdf output (to be fair, I forgot to list an article identifier in the first version of the program, but that could/should have been anticipated). Not having to add this data to the output has the added benefit of not having to make and maintain local dictionaries and lists of this data.

Some stats:

  • Number of articles: 100 (for reference)
  • Number of ‘term found in article’ statements: 1964
  • Number of those statements that map to Wikidata: 1293 (65.8% of total)
  • Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
  • Average number of statements per article: 19.64, 12.93 mapped

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of common species names are abbreviated, e.g. to E. coli, making it impossible to unambiguously map to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.

View all posts in this series.

Sunday, August 27, 2017

Citation.js: Endpoint on RunKit

A while back I tweeted about making a simple Citation.js API Endpoint with RunKit.

Using the Express app helper and some type-checking code:

const express = require('@runkit/runkit/express-endpoint/1.0.0')
const app = express(exports)
const Cite = require('citation-js@0.3.0')
const docs = 'https://gist.github.com/larsgw/ada240ded0a78d5a6ee2a864fbcb8640'

const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

const getOptions = params => {
  const fragments = params.split('/')
  let data, style = 'csl', type = 'html'

  // parse fragments
  // (got pretty complex, as data can contain '/'s)

  return {data, style, type}
}

app
  .get('/', (_, res) => res.redirect(docs))
  .get('/*', ({params: {0: params}}, res) => {
    const {data, style, type} = getOptions(params)
    const cite = new Cite(data)
    const output = cite.get({style, type})
    res.send(output) // send the formatted reference(s)
  })
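The elided fragment-parsing can be sketched in a simplified, self-contained form (a hypothetical getOptionsSimple that ignores the ‘/’-in-data complication the comment mentions; validStyle and validType are restated from above for self-containment):

```javascript
const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

// Simplified sketch: assumes the data fragment itself contains no '/'
const getOptionsSimple = params => {
  const fragments = params.split('/')
  let data = fragments[0]
  let style = 'csl'
  let type = 'html'

  // optional trailing fragments override the defaults
  if (fragments[1] && validStyle(fragments[1])) { style = fragments[1] }
  if (fragments[2] && validType(fragments[2])) { type = fragments[2] }

  return {data, style, type}
}

console.log(getOptionsSimple('Q21972834/bibtex/string'))
// → { data: 'Q21972834', style: 'bibtex', type: 'string' }
```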

Full code here. Makes an API like this:



  • $CODE is the API id (vf2453q1d6s5 in this case),
  • $DATA is the input data (DOI, Wikidata ID, or even a BibTeX string),
  • $STYLE (optional) is the output style,
  • and $TYPE (optional) is the output type (basically plain text vs html).

This makes it possible to link to a lot of Citation.js outputs:

Citation.js Version 0.3 Released!

It’s been in beta since January 30, but here it is: Citation.js version 0.3. Below I explain the changes since the last blog post, and under that is a more complete change log. Also some upcoming changes.


Recent changes

Custom citation formatting

  • One of the remaining milestones for v0.3.0 was a better API for custom citation formatting, as outlined in the previous post. The exact implementation differs a bit, but is essentially the same. Docs can be found here.


  • Cite#add() and Cite#set() now have asynchronous siblings, Cite#addAsync() and Cite#setAsync().
  • Both async and sync test cases are needed for full coverage, so I won’t just change one into another, as I previously suggested.


  • The BibTeX parser got a big update, improving both the text-parsing code and the JSON-transforming code.
  • The Wikidata issue (sorry for the GIF) is fixed now, and I like the current API and code. The solution is inspired by the recent BibTeX refactoring.

Older changes

Also look at previous blog posts for more info. List goes from newer to older.


  • Code coverage
  • Test case and framework updates
  • DOI support
  • CLI fixes
  • Custom sorting
  • Cite as an Iterator
  • Bib.TXT support
  • Give jQuery.Citation.js its own repository
  • Async support
  • Build size disc & dependency badges
  • Exposing internal functions and splitting up the main file into smaller files
  • Change of some internal Cite properties
  • Using an actual code style and ES6+
  • Exposing version numbers


  • CLI bugs
  • Variable name typos
  • Output JSON is now valid JSON (I know, right?)
  • Wikidata ID parsing
  • Cite#getIds()
  • CLI not working
  • Wikidata tests not working
  • CSL not working
  • Browserify not working

Upcoming changes

  • Web scrapers and schema.org, Open Graph, etc.
  • Input parsing extensions
  • More coverage
  • More input formats, like better support for BibJSON, and new formats.