Monday, September 4, 2017

ctj rdf: Relations Between Conifers Mentioned in Articles

Below is part two of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.


Here’s a more detailed example. We are using my dataset available at 10.5281/zenodo.845935, generated from 1000 articles that mention ‘Pinus’ somewhere. It contains 15326 statements, of which 8875 (57.9%) can be mapped to Wikidata, taking ~50 seconds. Now we can, for example, list the 100 most-mentioned species. Here are the top 20:

Species Hits
Pinus sylvestris 248
Picea abies 177
Pinus taeda 138
Pinus pinaster 120
Pinus contorta 96
Arabidopsis thaliana 91
Picea glauca 77
Pinus radiata 77
Pinus massoniana 72
Pseudotsuga menziesii 65
Oryza sativa 56
Pinus halepensis 56
Pinus ponderosa 55
Pinus banksiana 53
Pinus koraiensis 53
Picea mariana 52
Pinus nigra 51
Pinus strobus 46
Quercus robur 45
Fagus sylvatica 45

The top of the list isn’t surprising: mostly pines, other conifers and other trees, plus Arabidopsis thaliana, which I’ve seen represented in pine literature before, and Oryza sativa, or rice, which I haven’t seen before in this context.

Note that only 248 of the 1000 articles mention the top Pinus species. This may be because the query used to collect the articles was quite broad. Also note that this doesn’t take into account how often an article mentions a species; that is a caveat of the current rdf output.
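A species count like the one above can be expressed in SPARQL roughly as follows. Note that the ctj: prefix and the mentions property here are hypothetical stand-ins; the actual ctj rdf vocabulary differs.

```sparql
# Hypothetical sketch; prefix URIs and property names are stand-ins.
PREFIX ctj:  <http://example.org/ctj#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?species (COUNT(DISTINCT ?article) AS ?hits) WHERE {
  ?article ctj:mentions ?term .
  ?term rdfs:label ?species .
}
GROUP BY ?species
ORDER BY DESC(?hits)
LIMIT 100
```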

Going off this list, we can then look at which species, even non-tree or non-plant ones, are named most often in conjunction with a given species or, in this case, a genus. Top 20:

Species 1 Species 2 Co-occurrences
Picea abies Pinus sylvestris 98
Picea abies Pinus taeda 56
Arabidopsis thaliana Pinus taeda 47
Picea glauca Pinus taeda 43
Arabidopsis thaliana Oryza sativa 43
Pinus pinaster Pinus taeda 41
Pinus pinaster Pinus sylvestris 41
Picea abies Picea glauca 41
Arabidopsis thaliana Picea abies 37
Pinus contorta Pinus sylvestris 36
Betula pendula Pinus sylvestris 36
Pinus sylvestris Pinus taeda 36
Pinus nigra Pinus sylvestris 35
Pinus contorta Pseudotsuga menziesii 32
Picea abies Pinus contorta 31
Picea abies Pinus pinaster 30
Arabidopsis thaliana Physcomitrella patens 30
Oryza sativa Pinus taeda 29
Pinus sylvestris Quercus robur 29
Picea abies Picea sitchensis 28
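A co-occurrence count like this amounts to a self-join on the article; roughly (again with hypothetical prefixes, not the actual ctj rdf vocabulary):

```sparql
# Hypothetical sketch; prefix URIs and property names are stand-ins.
PREFIX ctj:  <http://example.org/ctj#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?species1 ?species2 (COUNT(DISTINCT ?article) AS ?cooccurrences) WHERE {
  ?article ctj:mentions ?term1 , ?term2 .
  ?term1 rdfs:label ?species1 .
  ?term2 rdfs:label ?species2 .
  # keep each unordered pair only once
  FILTER(STR(?species1) < STR(?species2))
}
GROUP BY ?species1 ?species2
ORDER BY DESC(?cooccurrences)
LIMIT 20
```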

It’s interesting to see that rice is mostly mentioned together with Arabidopsis. Let’s explore that further. Below are the species named in conjunction with Oryza sativa.

Species Co-occurrences
Arabidopsis thaliana 43
Pinus taeda 29
Picea abies 24
Physcomitrella patens 23
Populus trichocarpa 21
Glycine max 20
Vitis vinifera 17
Picea glauca 17
Pinus pinaster 16
Selaginella moellendorffii 15
Pinus sylvestris 13
Triticum aestivum 12
Pinus contorta 10
Picea sitchensis 10
Ginkgo biloba 10
Pinus radiata 10
Ricinus communis 9
Amborella trichopoda 9
Medicago truncatula 9
Cucumis sativus 8

So attention seems divided between trees and more agriculture-related plants. More to explore for later.


View all posts in this series.

Sunday, September 3, 2017

ctj rdf: Part One

Below is part one of a small series on ctj rdf, a new program I made to transform ContentMine CProjects into SPARQL-queryable Wikidata-linked rdf.


ctj has been around longer; it started as a way to learn my way around the ContentMine pipeline of tools, but it turned out to uncover a lot of possibilities in further processing the output of that pipeline (1, 2).

The recent addition of ctj rdf expands on this. While there is a lot of data loss between the ContentMine output and the resulting rdf, the possibilities are certainly no fewer. This is mainly because of SPARQL, which makes it possible to integrate other databases, such as Wikidata, without many changes to ctj rdf itself.

Here’s a simple demonstration of how this works:

  1. We download 100 articles about aardvarks (a classic ContentMine example)
  2. We run the ContentMine pipeline (norma, ami2-species, ami2-sequence)
  3. We run ctj rdf

This generates data.ttl, which holds the following information:

  • Common identifier for each article (currently PMC)
  • Matched terms from each article (which terms/names are found in which article)
  • Type of each term (genus, binomial, etc.)
  • Label of each term (matched text)
Example data.ttl contents
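As an illustration (not actual program output), a data.ttl along those lines might look roughly like this; the namespace, the term URI and the PMC identifier are made up:

```turtle
# Hypothetical sketch; the actual vocabulary and identifiers differ.
@prefix ctj:  <http://example.org/ctj#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<https://identifiers.org/pmc/PMC1234567>
    ctj:mentions ctj:term-orycteropus-afer .

ctj:term-orycteropus-afer
    a ctj:Binomial ;
    rdfs:label "Orycteropus afer" .
```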

Note that there are no links to Wikidata whatsoever. When we list, for instance, how often each term is mentioned in an article (in the dataset), we only have string values, an identifiers.org URI and some custom namespace URIs.

However, in this format, we can easily use the information in these papers in conjunction with the enormous amount of data in Wikidata with Federated Queries.

To accomplish this, we first link the identifier in our dataset to the corresponding one in Wikidata; then we link the matched text of each term to the taxon name of a species in Wikidata. This alone already gives us a set of semantic triples where both values in every triple are linked to values in the extensive, community-driven database that is Wikidata.

Example query, counting how often each species is mentioned, and mapping them to Wikidata
Results of the above query
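In spirit, such a federated query looks roughly like this. The ctj: prefix and property are hypothetical stand-ins, but wdt:P225 really is Wikidata’s ‘taxon name’ property:

```sparql
# Hypothetical sketch of a federated query; ctj: names are stand-ins.
PREFIX ctj:  <http://example.org/ctj#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

SELECT ?species (COUNT(DISTINCT ?article) AS ?mentions) WHERE {
  ?article ctj:mentions ?term .
  ?term rdfs:label ?name .
  SERVICE <https://query.wikidata.org/sparql> {
    ?species wdt:P225 ?name .  # P225: taxon name
  }
}
GROUP BY ?species
ORDER BY DESC(?mentions)
```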

Now say we want to list the Swedish name of each of those species. Well, we can, because that info probably exists on Wikidata (see the stats below). And if we can’t find something, remember that each of those Wikidata values is also linked to numerous other databases.
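Fetching the Swedish name is just a matter of asking Wikidata for a label with the right language tag; inside the Wikidata-facing part of such a query, that would be roughly:

```sparql
# Hypothetical fragment: ?species is a Wikidata item matched earlier.
?species rdfs:label ?svLabel .
FILTER(LANG(?svLabel) = "sv")
```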

Again, this is without having to change anything in the rdf output (to be fair, I forgot to include an article identifier in the first version of the program, but that could and should have been anticipated). Not having to add this data to the output has the added benefit of not having to create and maintain local dictionaries and lists of this data.

Some stats:

  • Number of articles: 100 (for reference)
  • Number of ‘term found in article’ statements: 1964
  • Number of those statements that map to Wikidata: 1293 (65.3% of total)
  • Number of mapped statements with Swedish labels: 1056 (81.7% of mapped statements, 53.8% of total)
  • Average number of statements per article: 19.64, 12.93 mapped

Note that not all terms are actually valid. A lot of genus matches are actually just capitalised words, and a lot of species names are abbreviated, e.g. to E. coli, making it impossible to map them unambiguously to Wikidata or any other database. This could explain the difference between found ‘terms’ and mapped terms.


View all posts in this series.

Sunday, August 27, 2017

Citation.js: Endpoint on RunKit

A while back I tweeted about making a simple Citation.js API Endpoint with RunKit.

Using the Express app helper and some type-checking code:

const express = require('@runkit/runkit/express-endpoint/1.0.0')
const app = express(exports)
const Cite = require('citation-js@0.3.0')
const docs = 'https://gist.github.com/larsgw/ada240ded0a78d5a6ee2a864fbcb8640'

const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const last = array => array[array.length - 1]

const getOptions = params => {
  const fragments = params.split('/')
  let data, style = 'csl', type = 'html'

  // parse fragments
  // (got pretty complex, as data can contain '/'s)

  return {data, style, type}
}

app
.get('/', (_, res) => res.redirect(docs))
.get('/*', ({params: {0: params}}, res) => {
  const {data, style, type} = getOptions(params)
  const cite = new Cite(data)
  const output = cite.get({style, type})
  res.send(output)
})
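The elided parsing can be done in more than one way; here is a self-contained sketch (my guess, not the actual RunKit code) that treats trailing valid fragments as options and joins the rest back into the data, so DOIs containing '/'s survive:

```javascript
// Hypothetical sketch of the elided fragment parsing: a trailing valid
// type, then a trailing valid style, are treated as options; whatever
// is left is joined back together as the data.
const validStyle = style => ['bibtxt', 'bibtex', 'csl', 'citation'].includes(style) || /^citation-\w+$/.test(style)
const validType = type => ['string', 'html', 'json'].includes(type)

const getOptions = params => {
  const fragments = params.split('/')
  let style = 'csl', type = 'html'

  if (fragments.length > 1 && validType(fragments[fragments.length - 1])) {
    type = fragments.pop()
  }
  if (fragments.length > 1 && validStyle(fragments[fragments.length - 1])) {
    style = fragments.pop()
  }

  return { data: fragments.join('/'), style, type }
}
```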

Full code here. Makes an API like this:

https://$CODE.runkit.sh/$DATA[/$STYLE[/$TYPE]]

Where

  • $CODE is the API id (vf2453q1d6s5 in this case),
  • $DATA is the input data (DOI, Wikidata ID, or even a BibTeX string),
  • $STYLE (optional) is the output style,
  • and $TYPE (optional) is the output type (basically plain text vs html).

This makes it possible to link to a lot of Citation.js outputs.

Citation.js Version 0.3 Released!

It’s been in beta since January 30, but here it is: Citation.js version 0.3. Below I explain the changes since the last blog post, followed by a more complete change log and some upcoming changes.

Citation.js

Recent changes

Custom citation formatting

  • One of the remaining milestones for v0.3.0 was a better API for custom citation formatting, as outlined in the previous post. The exact implementation differs a bit, but is essentially the same. Docs can be found here.

Async

  • Cite#add() and Cite#set() now have asynchronous siblings, Cite#addAsync() and Cite#setAsync().
  • Both async and sync test cases are needed for full coverage, so I won’t just change one into the other, as I previously suggested.

Parsers

  • The BibTeX parser got a big update, improving both the text-parsing code and the JSON-transforming code.
  • The Wikidata issue (sorry for the GIF) is fixed now, and I like the current API and code. The solution is inspired by the recent BibTeX refactoring.

Older changes

Also look at previous blog posts for more info. List goes from newer to older.

Features

  • Code coverage
  • Test case and framework updates
  • DOI support
  • CLI fixes
  • Custom sorting
  • Cite as an Iterator
  • Bib.TXT support
  • Give jQuery.Citation.js its own repository
  • Async support
  • Build size disc & dependency badges
  • Exposing internal functions and splitting up the main file into smaller files
  • Change of some internal Cite properties
  • Using an actual code style and use of ES6+
  • Exposing version numbers

Bugs

  • CLI bugs
  • Variable name typos
  • Output JSON is now valid JSON (I know, right?)
  • Wikidata ID parsing
  • Cite#getIds()
  • CLI not working
  • Wikidata tests not working
  • CSL not working
  • Browserify not working

Upcoming changes

  • Web scrapers and schema.org, Open Graph, etc.
  • Input parsing extensions
  • More coverage
  • More input formats, like better support for BibJSON, and new formats.

Sunday, July 23, 2017

Citation.js: DOI update and more stability

Finally, Citation.js supports DOIs. It took a while, but it’s finally there. One big ‘but’: synchronous fetching doesn’t work in Chrome. I’m still looking into that, but I recommend you use Cite.async() anyway. Also in this blog post: more stability in Cite#get(), a welcome byproduct of using the DOI API, and a look forward (again).

Citation.js logo

DOI support

So, DOIs. That was (and is) a tough one. Let me guide you through the process.

Initial development

I have been planning to add support for DOI input since the beginning of this year, going off the original feature request, and the pressure to implement it grew right along with my realisation of how important DOIs are in the world of citations. Back then I thought it would be smart to query Wikidata for DOIs, as I had just finished some code on Wikidata ID input that used the regular API as well as the query service. More recently, however, I learned about the Crossref API and, even more helpfully, the DOI Content Negotiation API, which combines the APIs of Crossref, DataCite and mEDRA. I’ll quote a piece from the docs:

Content negotiation allows a user to request a particular representation of a web resource. DOI resolvers use content negotiation to provide different representations of metadata associated with DOIs.

A content negotiated request to a DOI resolver is much like a standard HTTP request, except server-driven negotiation will take place based on the list of acceptable content types a client provides.

source

Basically, you just make a request to https://doi.org/$DOI (where, hopefully obviously, $DOI stands for the DOI you’re looking for) with an Accept header set to the format you want your data in. Conveniently, one of those formats is CSL-JSON, here charmingly named application/vnd.citationstyles.csl+json, but nonetheless the direct input format for Citation.js. This meant I would only have to write code to extract DOIs from a whitespace-separated list (const dois = string.trim().split(/\s+/g)) and fetch them from the server (await Promise.all(dois.map(fetchDoi))), where fetchDoi is a simple function:

async function fetchDoi (doi) {
  const headers = new Headers({
    Accept: 'application/vnd.citationstyles.csl+json'
  })
  const response = await fetch('https://doi.org/' + doi, {headers})
  return response.json()
}

And Headers and fetch are built-in. Note that this uses some advanced syntax. Read more about async/await, Promise.all(), shorthand property names and const.

First problems

It wasn’t that simple. Heck, even getting the API to work took a (long) while and a lot of debugging of other people’s code. You see, to make synchronous requests in Node.js I use sync-request, a synchronous wrapper around its asynchronous cousin then-request, which in turn wraps the also asynchronous but more low-level http-basic. They all use the same options scheme, and sync-request passes its options all the way down to http-basic, so options that exist in http-basic but not in sync-request can be used too. Turns out http-basic has an option that removes all headers when a request is redirected, but it isn’t documented in sync-request. This removes, among other things, the Accept header, the omission of which produces an HTTP 501 response not documented in the API.

Let’s just say I filed a feature request to mention this in the docs.

The actual code didn’t need much change, so soon I got the first JSON response, and since the API supports CSL-JSON out of the box, I didn’t need to do much else, except building some input recognition infrastructure.

CSL-JSON problems

However, nothing is perfect, and so is that out-of-the-box support of CSL-JSON. This was actually picked up quite well, and I’m happy with the outcome. Basically, extra code needed to fix this consists of two parts. One part transforms some invalid but essential parts of the API response to its CSL-JSON counterpart. The second part filters invalid and less essential parts out of the data, and basically acts as a safeguard for methods depending on certain variables having certain types. More on this later.

More problems, from an unexpected source

Yep, it isn’t even over yet. This may be the biggest problem of them all, in that it hasn’t got a solution yet, and that I don’t expect it to have one in the near future, and that it is very likely there is no solution. It depends, it really does. Anyway, let’s get to the point.

Chrome doesn’t support synchronous redirects from CORS domains. Or something; even that isn’t really clear. Basically, the problem is that the DOI content negotiation generally works like this (assuming we’re in the browser): the user domain requests data from https://doi.org, which redirects to https://data.crossref.org (or a different domain). Both are different domains from the user’s. Both synchronous and asynchronous requests directly to https://data.crossref.org (or a different domain) work. Asynchronous requests via https://doi.org work too. Even synchronous requests via https://doi.org work, just not in Chrome. I don’t know why, I don’t know since when (as it does support normal synchronous CORS requests), and I can’t find any document that says why it does that. The only things that have shed some light on this are a comment I can’t find anymore, which only stated you can’t do that at all, and this answer, which almost describes the exact problem I’m having but fails to give any explanation other than that it’s weird that Firefox and IE “don’t follow the jQuery spec.” Note that the jQuery spec doesn’t mention anything about this, apart from the note that this shouldn’t be done ever anyway.

One obvious answer is that every major browser is currently in the process of phasing out synchronous requests, and that Chrome might have taken this a step further; but even then, there should be some document somewhere that says it does this, otherwise I’d just consider it a bug. I haven’t gotten around to it yet, but I will try to see how this affects using Citation.js synchronously in Web Workers. If that works, it means it really is another rule to prevent synchronous requests on the main thread, which is fine by me, although slightly inconvenient for people who don’t care about user experience, such as me when I’m lazy.

Conclusion

DOIs work. Because of all the ranting, I haven’t shown you the API yet, but it’s pretty familiar. To use the CLI, do this:

> npm i -g citation-js
# to install the new version of the command
# might need to be run with admin rights

> citation-js -t 10.1021/ja01577a030 -f string -s citation-apa
# Fetches data from doi: 10.1021/ja01577a030
# in the format: string; in the formatted style: apa

  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030

To use the API, do this:

const Cite = require('citation-js')

// For synchronous use
const data = new Cite('10.1021/ja01577a030')
const output = data.get({
  type: 'string',
  style: 'citation-apa'
})

/* output =
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
*/

// For asynchronous use
Cite.async('10.1021/ja01577a030').then(data => {
  const output = data.get({
    type: 'string',
    style: 'citation-apa'
  })

  /* output =
    Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
  */
})

Stability in Cite#get()

As I mentioned earlier, I had to write some code to filter invalid but non-essential props out of the CSL-JSON. Otherwise certain methods used by Cite#get() throw errors, as they expect these props to have certain types. Because of this filtering function, I hope I never have to type-check again. The implementation is pretty simple. Specially structured props, like author and other names, and issued and other dates, get caught and handled on their own. Other props get checked against a map of expected data types and removed if they don’t match (and, if the bestGuessConversions flag is set, if they also can’t be converted reliably). Note that, as of now, all Cite.get.*() functions expect CSL-JSON cleaned by Cite.parse.csl().
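As an illustration of the idea (not the actual Citation.js implementation; the prop map here is a tiny hypothetical subset), the type-map filtering could look like this:

```javascript
// Hypothetical sketch: props are checked against a map of expected types
// and dropped when the value doesn't match. The real map would cover all
// relevant CSL-JSON props; unknown props are passed through here, which
// is just a choice made for this sketch.
const expectedTypes = {
  title: 'string',
  DOI: 'string',
  page: 'string',
  volume: 'string'
}

const cleanCsl = entry => {
  const clean = {}
  for (const prop in entry) {
    if (!(prop in expectedTypes) || typeof entry[prop] === expectedTypes[prop]) {
      clean[prop] = entry[prop]
    }
  }
  return clean
}
```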

Version 0.3.0

Implementing DOI input was one of the big milestones on the way to version 0.3.0, besides exposing internal methods, making async input parsing available and generally making the browser version and the CLI less bad. Now that those three things are done, I think it’s a good moment to see what still needs to be done for the v0.3.0 release.

Better custom formatted citations

Using formatted citations is fine, but when you try to use custom ones, either because the style guide you want or need to follow isn’t built-in, or because you have some special use case, things may get confusing. Currently, the API works (should work) like this: when you pass a template in the Cite#get() options (important: not in the new Cite(<data>, <options>) options), it gets registered in a register of citeproc-js engines. After that, you can use it by referencing the name you used in the regular style option. This would work better with an API similar to this:

const template = '...'
const templateName = 'custom'

// Suggested new API
Cite.CSL.template.register(templateName, template)

// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName
})

Also, sometimes you just want to append or prepend some text or HTML frames, and implementing that with CSL templates takes time and effort and the result isn’t always that pretty, as templates don’t support direct HTML. There will be a new API for that too, probably like this:

// prepends the id like this: "[$ID]: "
const prepend = ({id}) => `[${id}]: `
// appends an altmetric widget
const append = ({DOI}) => `<span class='altmetric-embed' data-doi='${DOI}'></span>`
// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName,
  // Suggested new API
  // both properties either function or constant string
  append: append,
  prepend: prepend
})

One thing I still have to look at is extending the wrapping HTML elements outputted by citeproc-js.

Use asynchronism better

Now that we have asynchronous input parsing with Cite.async(), it may be interesting to look at asynchronous output formatting as well. There are also some functions that can use Cite.async() but don’t, like Cite#add() and Cite#set(), and the test cases are generally synchronous, with some special cases for async, while it should probably be the other way around. Adding a coverage tester will assist in determining which synchronous test cases can go.

Refactoring

There are still some things I’d like to refactor, like the Wikidata parser. I don’t like it right now, especially the hack to merge different types of authors.

After v0.3.0

Since we’re almost at the end of v0.3.0, I’ll also outline some of my plans for future versions.

  • BibJSON input (already partly supported)
  • Extensions (input parsing, output, etc.)
  • Zotero input (maybe not worth the work, as they already support export to CSL)
  • Scraper (could just be getting DOI and using my existing work)
  • Coverage testing and an expansion of test cases
  • More work on the dependants:

Sunday, June 4, 2017

Microscopic photography: Part 1

I got to make photographs of pre-made specimens with a microscope, and I wanted to show some of them. This is part one of a longer series; there were a lot of specimens.

Below are details of the cross section of a basswood stem.

Detail

Outer layer

Center

Human artery tissue.




Human bone tissue (with some unidentified dark stuff).



More next week!

Monday, May 22, 2017

Citation.js: Async, Showdown and Bib.TXT

I worked on updates for Citation.js in the past few weeks, and I thought I'd go over them in this post.

Async

First of all, Citation.js now has support for asynchronous parsing, so it won't lag your app as much when it uses e.g. the Wikidata API. This is good, as synchronous requests are not only blocking your app, but also deprecated in most major browsers. This allows for more development on input formats where additional data needs to be fetched, like a DOI as input. I could have done this before, but this order made more sense to me. The API is simple:

// With callback

Cite.async($INPUT, $OPTIONS, function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Or with a promise, like this:

Cite.async($INPUT, $OPTIONS).then(function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Where $INPUT is input data and $OPTIONS is options
// $OPTIONS is optional in both cases

The promise-returning part is good when using the async/await syntax, i.e. var data = await Cite.async($INPUT, $OPTIONS).

Extensions

Unfortunately, no actual extensions to the Cite object yet. (However, extensions to input parsing are planned for v0.4, and the API for output formatting is getting a rework in v0.3 (this and more).)

I'm talking about citation.js-showdown, citation.js-form and citation.js-replacer (coming soon). Citation.js-form is just jquery.citation.js in its own repository. Citation.js-replacer will be a small tool to easily put references on your site without having to bother with writing scripts, but, as a consequence, with fewer options. Citation.js-showdown, however, is a (functioning) Showdown plugin that makes it easier to include references and bibliographies in a Markdown document.

Bib.TXT

A few days ago I heard about Bib.TXT, and since implementing it in Citation.js only meant parsing a relatively simple text format (all the fields are the same as in the already supported BibTeX), I thought: why not? So there you have it, Bib.TXT support in Citation.js. That's good because, among other things, you don't have to bother with weird (La?)TeX Unicode workarounds. And with Citation.js, you can easily convert Bib.TXT to BibTeX. Below as a command:

$ npm i -g citation-js@0.3.0-7
$ citation-js --input inputFile.txt --output-format string --output-style bibtex > out.bib
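For reference, a Bib.TXT entry is just a bracketed id followed by plain key-value lines; the entry below is a made-up illustration (the field names are the usual BibTeX ones):

```
[example2017]
  author: Doe, Jane
  title: An Example Article
  year: 2017
```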

Homepage improvement

I changed some things on the Citation.js homepage. It should now be more responsive and have more room for expansion, and I added a demo and a small list of the "extensions" mentioned above. Oh, and the README has the same banner as the page now.

Citation.js banner

Saturday, April 29, 2017

New homepage design: New content, Material Design, Angularjs and more

My new homepage is finally finished (for now). I say finally, because it has taken a while before it actually felt done. You know how the first 90% of the work takes 90% of the time, but the last 10% of the work takes another 90% of the time? I just had to finish writing a single, small piece of text and it would be complete, but it took about 2 weeks. Anyway, that's all done now, and the result can be seen at larsgw.github.io.

Homepage

The design is based on the principles of Material Design, using MDL, a CSS+JS framework for making Material Design sites. Most of the content is displayed with AngularJS, which uses runtime Pug-like HTML templating and more.

Extra content is added as well, including more projects and descriptions. Unfortunately, I had to get rid of my Twitter feed, as it didn't fit in the new design, but there are several sites to view my tweets, including Twitter and this exact site. Also, the old site (with my feed) is still available, and displays the new content as well: larsgw.github.io/old.

There are some other things hidden here and there, some really obvious, some not. That's basically it. Hope you like it.

Wednesday, April 26, 2017

Final Report: Analysing and visualising data from papers about conifers

Originally posted on the ContentMine blog.

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.

Proposal

My proposal was to extract facts about various conifer species by analysing text from papers with software suited for analysing text and the tools provided by the ContentMine. These facts were then to be converted into JSON and viewed with an HTML (+CSS/JS) interface. Expected statements were like: 'Picea glauca is a species of the genus Picea', which could be parsed into the triple subject: Picea glauca; property: genus; object: Picea.

Work

The main outcome of this project is a series of programs converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine's getpapers. Second, the tables are extracted from the fetched papers and converted into assertions. This is done by filling empty cells in tables and then treating each row as an object: the first column is the name, the other columns are property-value pairs. Different table designs are currently parsed in the same way, resulting in incorrect extraction of data, something that can be accommodated for by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.
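A minimal sketch of that row-to-assertion step (an illustration, not the actual program; tableToAssertions is a hypothetical helper):

```javascript
// Hypothetical sketch: fill empty cells downwards, then treat each row
// as an object, with the first column as name and the remaining columns
// as property-value pairs.
const tableToAssertions = (header, rows) => {
  const assertions = []
  let previous = []

  for (const row of rows) {
    // fill empty cells with the value from the row above
    const filled = row.map((cell, i) => cell || previous[i])
    previous = filled

    const [name, ...values] = filled
    values.forEach((value, i) => {
      assertions.push({ name, property: header[i + 1], value })
    })
  }

  return assertions
}
```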

Finally, the JSON assertions are visualized in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.

Stepper form for curating assertions
Stepper form for curating assertions

Getting these assertions from plain text, as I proposed, was harder. Tools I expected to find included in the ContentMine software were nowhere to be found, but they were planned, so actually implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not actually contain as many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or supplementary files, leading me to use those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDFs, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

  • ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
  • ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
  • Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

These first two simplified handing AMI output in the browser, while the third makes it easier to display references in common formats.

Dissemination

All source code of the project outcomes is available on GitHub.

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works but is not perfect. The pipeline to parse tables mentioned above requires further generalisation. This defines some logical next steps and fixes:

  • Finally adding it as an NPM module, making it (way) easier for people to use it;
  • Making searching easier in the HTML GUI (will need work further upstream too). Currently the list of assertions is split into pieces, making it hard to find anything. This can be fixed with a search index;
  • Normalising table structures to support more designs, rendering assertion extraction more reliable;
  • Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
  • Some small performance and UX things.

Another important thing, too big for a single bullet point, is annotating abbreviations and references in the document before extracting the tables. It's easier to curate statements like '[1] says this and this' when you know '[1]' references some known article. Another example: while a statement containing 'P. glauca' on its own says nothing (there are 66+ species using that abbreviation), the article probably says somewhere outside the table which one it is, something that can be picked up if you annotate these before taking them out of context. This makes the interactive stepper form a necessity for now.

Evaluation

As noted, the work is far from done. Currently, it mainly shows a glimpse of what would have been possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because of a lack of customisation options, such as choosing which data to use, you will almost always need to write custom code to avoid including tons of unnecessary data in your resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does show an interesting overview of the literature corpus, like how often species are mentioned and what they are most often mentioned together with. You can easily draw reasonable conclusions, like how often species names are misspelled. However, it would be more useful for this to have SQL queries or something similar. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not regular ones, as I called them in earlier blog posts.

Fellowship

In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though throughout the project it felt like there was so much left to do (and there is!). I am really excited about the possibilities that emerged during the fellowship, even in the last weeks. How cool would it be to extend this project with entire web APIs and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meetings, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Sunday, March 26, 2017

Citation.js: BibJSON

Citation.js now supports BibJSON. How did I do that without actually updating Citation.js? Well, apparently I supported it all along. I've supported the quickscrape output format since July last year, and that turned out to be BibJSON. How convenient. I'll update the demo and docs to reflect this revelation (currently it just says "quickscrape's JSON scheme") and, now that I can find actual documentation, make some improvements to the parser. It's a good candidate for a new output format too.
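For reference, a minimal BibJSON record looks roughly like the one below, together with a sketch of the kind of BibJSON-to-CSL-JSON mapping that parsing it implies. The `bibjsonToCsl` function is my own illustration of the mapping, not the actual Citation.js parser:

```javascript
// A minimal BibJSON record (fields per the BibJSON conventions).
const bibjson = {
  title: 'Photosynthesis in Pinus sylvestris',
  author: [{ name: 'Jane Doe' }],
  year: '2016',
  journal: { name: 'Tree Physiology' },
  identifier: [{ type: 'doi', id: '10.1000/example' }]
}

// Illustrative mapping to CSL-JSON; not the Citation.js implementation.
function bibjsonToCsl (record) {
  return {
    type: 'article-journal',
    title: record.title,
    author: (record.author || []).map(({ name }) => {
      const [given, ...family] = name.split(' ')
      return { given, family: family.join(' ') }
    }),
    issued: { 'date-parts': [[parseInt(record.year, 10)]] },
    'container-title': record.journal && record.journal.name,
    DOI: (record.identifier || [])
      .filter(id => id.type === 'doi')
      .map(id => id.id)[0]
  }
}
```

The name splitting here is deliberately crude; real name parsing is one of the things a proper parser has to get right.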

Some side notes on updates v0.3.0-0 through v0.3.0-2: these are prerelease updates, making it possible to use the code before I have fixed all the issues and added all the features I promised for version 0.3. These updates fixed a lot of file organization problems; the next updates will restructure the Cite object and fix tests.

Sunday, February 26, 2017

SVM: Developing a brand new 3D model language for the Web

When I started programming a few years ago, one of my first projects was linked to a project for school. Our team was making a presentation, and at some point I decided I wanted to program it myself, for the web. This resulted in the first version of a project which, to this day, doesn't have a name (but soon will).

After a few months I wanted to expand this. With some research and trial and error in the area of CSS 3D Transforms, I made a program similar to impress.js: regular slides, on the web, in 3D, but with some unique things. Of course, it wasn't very refined yet, but this was only the start.

These unique things[1] were 3D models that you can view in your very own web browser. This on its own isn't that unique; there are plenty of libraries that do that, like A-Frame and xml3d. The special thing about my 3D models is that they're pure HTML and CSS. No canvas or WebGL needed, just a relatively modern browser. This opens a whole range of possibilities: viewing HTML structures and CSS styles in your default debugger; selecting text directly from the model; embedding audio, video, pages, and anything else you can put in HTML; and, of course, the possibility to combine these models and the slides, as they use CSS 3D Transforms as well.

There are some disadvantages to this. HTML+CSS only allows for 2D elements, so to make a cube, you'd need six face elements, a wrapper element, and 10+ lines of CSS. And if you want to change the height, you need to change that one value in a lot of places. To solve that, I've started developing SVM, which stands for Scalable Vector (3D) Models. The name is very similar to SVG, and that's how I intended it: XML-based, simple, and with lots of built-in shapes.
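To illustrate the duplication problem: the standard CSS 3D technique positions each of a cube's six faces with its own transform, and the edge length shows up in every single one of them. A small JavaScript sketch of those transforms (my own illustration, not SVM output):

```javascript
// The six face transforms for a CSS 3D cube of a given edge length.
// Each face is rotated into position, then pushed out by half the edge.
function cubeFaceTransforms (size) {
  const half = size / 2
  return [
    `rotateY(0deg) translateZ(${half}px)`,   // front
    `rotateY(180deg) translateZ(${half}px)`, // back
    `rotateY(90deg) translateZ(${half}px)`,  // right
    `rotateY(-90deg) translateZ(${half}px)`, // left
    `rotateX(90deg) translateZ(${half}px)`,  // top
    `rotateX(-90deg) translateZ(${half}px)`  // bottom
  ]
}
```

Change the size and all six transforms (plus each face's width and height) have to change with it; that repetition is exactly what a generated, XML-based description can take care of for you.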

The first (beta) version is still in development, but I can list some of the features:

  • Built-in solids: Cubes, cuboids, regular prisms, cylinders (to an extent), regular pyramids, regular right frustums, cones (to an extent), irregular convex and concave prisms using SVG Paths (!), and simple spheres
  • Planes, arcs, SVG Path-based curves
  • Groups (with names)
  • Transformations (translation, rotation, scaling) on any element (or at least the currently existing ones)
  • An element to include groups/components

Some of the features in action, including automatic shadow, which will adapt to the orientation of the elements in the future.[2]

It may not be very refined yet, but I think there's a lot of potential. Currently, the main hurdle is performance and the accompanying visual issues. For example, on my laptop, Chrome starts lagging and breaking apart[3] at around 450 elements, and keep in mind that even a single cube results in six elements. On my PC, however, it works fine for up to around 1200 elements.


[1]: As far as I know, anyway.
[2]: Currently, it doesn't, as you can see on the yellow cylinder next to the red cube.
[3]: As you can see in the screenshot, parts of elements just disappear.

Saturday, February 4, 2017

Citation.js Version 0.2.11: jQuery and Looking Forward

A few weeks ago I published version 0.2.11 of Citation.js. The main change was the addition of jQuery.Citation.js, updated for version 2 of Citation.js. jQuery.Citation.js is a small jQuery plugin to build simple forms where you can fill in metadata, which gets translated to CSL-JSON. The configuration options are currently quite limited, so it only really works as a demo for Citation.js itself.

Citation.js logo

Since then (the current version is 0.2.15), the improvements have been bug fixes and the addition of (more support for) certain fields in both Wikidata and BibTeX. More interesting is what's going to happen in the next few releases.

Version 0.3

I've been planning version 0.3 for a while now and these things are probably going to be in it:

  • Asynchronous parsing: Parse input asynchronously, to allow asynchronous file requests, mainly for Wikidata.
  • More BibTeX: Publication type-specific fields, for example, should be parsed accordingly. I recently found a new guide to BibTeX, which should help as well.
  • Helper methods: Expose certain sub-parsing methods, like parseName and parseDate.
  • Structure Cite better: De-expose data that shouldn't be changed, add version information to Cite, etc.
  • Deprecate log: It's more or less useless. I'll add an option to enable it.
  • Structure code better: It's a mess, and things are broken. I'll change some file locations and add browserify etc.
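To sketch what the first bullet could mean in practice: wrapping input parsing in a Promise, so that remote lookups (such as Wikidata requests) no longer block. The names below are my own illustration, not the eventual Citation.js API:

```javascript
// Hypothetical async parsing sketch; `parseInputAsync` is an
// illustrative name, not part of the released Citation.js API.
function parseInputAsync (input) {
  return new Promise(resolve => {
    // A real implementation would fetch e.g. Wikidata entities here;
    // this stub just resolves on the next tick.
    setTimeout(() => resolve({ source: input, parsed: true }), 0)
  })
}

parseInputAsync('Q21972834').then(data => {
  // continue once the (possibly remote) lookup has finished
  console.log(data.parsed) // → true
})
```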