Monday, May 22, 2017

Citation.js: Async, Showdown and Bib.TXT

I worked on updates for Citation.js in the past few weeks, and I thought I'd go over them in this post.

Async

First of all, Citation.js now supports asynchronous parsing, so it won't lag your app as much when it uses e.g. the Wikidata API. This is good: synchronous requests don't just block your app, they're also deprecated in most major browsers. It also opens the door to input formats where additional data needs to be fetched, like a DOI as input. I could have built those first, but this order made more sense to me. The API is simple:

// With callback

Cite.async($INPUT, $OPTIONS, function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Or with a promise, like this:

Cite.async($INPUT, $OPTIONS).then(function (data) {
  data // instance of Cite
  
  // Further manipulations...
  console.log(data.get())
})

// Where $INPUT is input data and $OPTIONS is options
// $OPTIONS is optional in both cases

The promise-returning form is convenient with the async/await syntax, i.e. var data = await Cite.async($INPUT, $OPTIONS).
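For example, a minimal sketch (await is only valid inside an async function):

async function example () {
  var data = await Cite.async($INPUT, $OPTIONS)

  // Further manipulations...
  console.log(data.get())
}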

Extensions

Unfortunately, no actual extensions to the Cite object yet. (However, extensible input parsing is planned for v0.4, and the API for output formatting is getting a rework in v0.3 (this and more).)

I'm talking about citation.js-showdown, citation.js-form and citation.js-replacer (coming soon). Citation.js-form is just jquery.citation.js moved into its own repository. Citation.js-replacer will be a small tool to easily put references on your site without having to bother with writing scripts, but, as a consequence, with fewer options. Citation.js-showdown, however, is a (functioning) Showdown plugin that makes it easier to create references and bibliographies in a Markdown document.
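To give an idea of how a Showdown plugin like this is used, here's a rough sketch. Note that the require call, the extension name and the citation syntax below are my assumptions for illustration, not the plugin's documented API:

var showdown = require('showdown')
// ASSUMPTION: the plugin registers itself as a Showdown extension
require('citation.js-showdown')

// ASSUMPTION: 'citation.js-showdown' is the registered extension name
var converter = new showdown.Converter({ extensions: ['citation.js-showdown'] })

// ASSUMPTION: Pandoc-style citation syntax
console.log(converter.makeHtml('As shown before [@doe2017], ...'))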

Bib.TXT

A few days ago I heard about Bib.TXT, and since implementing it in Citation.js only meant parsing a relatively simple text format (the fields are the same as in the already supported BibTeX), I thought: why not? So there you have it: Bib.TXT support in Citation.js. That's good, because, among other things, you don't have to bother with weird (La?)TeX Unicode workarounds. And with Citation.js you can easily convert Bib.TXT to BibTeX. Here it is as a command:

$ npm i -g citation-js@0.3.0-7
$ citation-js --input inputFile.txt --output-format string --output-style bibtex > out.bib
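And the same conversion in Node, as a minimal sketch. The Bib.TXT entry layout is paraphrased from the spec, and the exact get() option names may differ between releases:

var Cite = require('citation-js')

var bibtxt = [
  '[example2017]',
  'title: An Example Article',
  'author: Jane Doe',
  'year: 2017'
].join('\n')

var data = new Cite(bibtxt)
console.log(data.get({ type: 'string', style: 'bibtex' }))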

Homepage improvement

I changed some things on the Citation.js homepage. It should now be more responsive and have more room for expansion, and I added a demo and a small list of the "extensions" mentioned above. Oh, and the README has the same banner as the page now.

Citation.js banner

Saturday, April 29, 2017

New homepage design: New content, Material Design, Angularjs and more

My new homepage is finally finished (for now). I say finally, because it took a while before it actually felt done. You know how the first 90% of the work takes 90% of the time, and the last 10% of the work takes another 90% of the time? I just had to finish writing a single, small piece of text and it would be complete, but that took about two weeks. Anyway, it's all done now, and the result can be seen at larsgw.github.io.

Homepage

The design is based on the principles of Material Design, using MDL, a CSS+JS framework for building Material Design sites. Most of the content is rendered with AngularJS, which provides runtime, Pug-like HTML templating and more.

Extra content has been added as well, including more projects and descriptions. Unfortunately, I had to get rid of my Twitter feed, as it didn't fit the new design, but there are several sites where you can view my tweets, including Twitter and this exact site. Also, the old site (with the feed) is still available, and it displays the new content as well: larsgw.github.io/old.

There are some other things hidden here and there, some really obvious, some not. That's basically it. Hope you like it.

Wednesday, April 26, 2017

Final Report: Analysing and visualising data from papers about conifers

Originally posted on the ContentMine blog.

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.

Proposal

My proposal was to extract facts about various conifer species by analysing text from papers with software suited for analysing text and the tools provided by the ContentMine. These facts were then to be converted into JSON and made viewable with an HTML (+CSS/JS) interface. Expected statements were like: 'Picea glauca is a species of the genus Picea', which could be parsed into the triple subject: Picea glauca; property: genus; object: Picea.

Work

The main outcome of this project is a series of programs converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine's getpapers. Second, the tables are extracted from the fetched papers and converted into assertions. This is done by filling empty cells in tables and then treating each row as an object: the first column is the name, and the other columns are property-value pairs. Different table designs are currently all parsed in the same way, resulting in incorrect extraction of data, something that could be fixed by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.
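As a minimal sketch of that row-to-object step (not the actual code; the JSON scheme shown is illustrative):

// Each row becomes an assertion: the first column is the name,
// the remaining columns are property-value pairs.
function tableToAssertions (header, rows) {
  return rows.map(function (row) {
    var properties = {}
    for (var i = 1; i < row.length; i++) {
      properties[header[i]] = row[i]
    }
    return { name: row[0], properties: properties }
  })
}

tableToAssertions(['Species', 'Genus'], [['Picea glauca', 'Picea']])
// → [ { name: 'Picea glauca', properties: { Genus: 'Picea' } } ]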

Finally, the JSON assertions are visualized in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.

Stepper form for curating assertions

Getting these assertions from plain text, as I proposed, was harder. Tools I expected to find in the ContentMine software were nowhere to be found, but were planned, so implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not actually contain as many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or supplementary files, which led me to use those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDFs, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

  • ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
  • ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
  • Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

The first two simplify handling AMI output in the browser, while the third makes it easier to display references in common formats.

Dissemination

All source code of the project outcomes is available on GitHub.

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works, but is not perfect. The table-parsing step mentioned above requires further generalisation. This defines some logical next steps and fixes:

  • Finally adding it as an NPM module, making it (way) easier for people to use it;
  • Making searching easier in the HTML GUI (this will need work further upstream too). Currently the list of assertions is split into pieces, making it hard to find anything. This can be fixed with a search index;
  • Normalising table structures to support more designs, rendering assertion extraction more reliable;
  • Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
  • Some small performance and UX things.

Another important thing, too big for a single bullet point, is annotating abbreviations and references in the document before extracting the tables. It's easier to curate statements like '[1] says this and this' when you know '[1]' references some known article. Another example: while a statement containing 'P. glauca' on its own says nothing (there are 66+ species using that abbreviation), the article probably says which species it is somewhere outside the table, something that can be picked up if you annotate the tables before taking them out of context. This makes the interactive stepper form a necessity for now.

Evaluation

As noted, the work is far from done. Currently, it mainly shows a glimpse of what would be possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because it lacks customisation options, such as choosing which data to use, you will almost always need to write custom code to avoid including tons of unnecessary data in the resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does give an interesting overview of the literature corpus, like how often species are mentioned and what they are mentioned together with most of the time. You can also easily draw smaller conclusions, like how often species names are misspelled. However, it would be more useful to have SQL queries or something similar for this. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not plain facts, as I called them in earlier blog posts.

Fellowship

In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though throughout the project it felt like there was so much left to do, and there is! I am really excited about the possibilities that emerged during the fellowship, even in the last weeks. How cool would it be to extend this project with entire web APIs and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meetings, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Sunday, March 26, 2017

Citation.js: BibJSON

Citation.js now supports BibJSON. How did I do that without actually updating Citation.js? Well, apparently I supported it all along. I've supported the quickscrape output format since July last year, and that turned out to be BibJSON. How convenient. I'll update the demo and docs to reflect this revelation (currently they just say "quickscrape's JSON scheme") and, now that I can find actual documentation, make some improvements to the parser. It's a good candidate for a new output format too.
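For reference, a BibJSON record looks roughly like this; a hand-written sketch based on the BibJSON spec, so exact fields vary per source:

var Cite = require('citation-js')

var record = {
  title: 'An Example Article',
  author: [ { name: 'Jane Doe' } ],
  year: '2017',
  journal: { name: 'Journal of Examples' },
  identifier: [ { type: 'doi', id: '10.1000/example' } ]
}

// Parsed like any other supported input format
var data = new Cite(record)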

Some side notes on updates v0.3.0-0 to v0.3.0-2: these are prerelease updates, making it possible to use code before I have fixed all the issues and added all the features I promised for version 0.3. These updates fixed a lot of file organization problems; next updates will restructure the Cite object and fix tests.

Sunday, February 26, 2017

SVM: Developing a brand new 3D model language for the Web

When I started programming a few years ago, one of my first projects was linked to a project for school. Our team was making a presentation, and at some point I decided I wanted to program it myself, for the web. This resulted in the first version of a project which, to this day, doesn't have a name (but soon will).

After a few months I wanted to expand this. With some research and trial and error in the area of CSS 3D Transforms, I made a program similar to impress.js. Regular slides, in the web, in 3D, but with some unique things. Of course, it wasn't as refined yet, but this was only the start.

These unique things[1] were 3D models that you can view in your very own web browser. That on its own isn't that unique; there are plenty of libraries that do it, like A-Frame and xml3d. The special thing about my 3D models is that they're pure HTML and CSS. No canvas or WebGL needed, just a relatively modern browser. This opens up a whole range of possibilities: inspecting HTML structures and CSS styles in your default debugger; selecting text directly from the model; embedding audio, video, pages and anything else you can put in HTML; and, of course, combining these models with the slides, as those use CSS 3D Transforms as well.

There are some disadvantages to this. HTML+CSS only allows for 2D elements, so to make a cube, you need 6 elements, a wrapper element and 10+ lines of CSS. And if you want to change the height, you need to change that one value in a lot of places. To solve that, I've started developing SVM, which stands for Scalable Vector (3D) Models. The name is very similar to SVG, and that's intentional: XML-based, simple, and with lots of built-in shapes.
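To illustrate the boilerplate, here is roughly what a plain HTML+CSS cube looks like (class names and sizes are just for illustration). Note how half the cube size (50px) shows up in every single face transform:

<div class="cube">
  <div class="face front"></div>
  <div class="face back"></div>
  <div class="face left"></div>
  <div class="face right"></div>
  <div class="face top"></div>
  <div class="face bottom"></div>
</div>

<style>
  .cube {
    position: relative;
    width: 100px;
    height: 100px;
    transform-style: preserve-3d;
  }
  .face {
    position: absolute;
    width: 100px;
    height: 100px;
  }
  .front  { transform: translateZ(50px); }
  .back   { transform: rotateY(180deg) translateZ(50px); }
  .left   { transform: rotateY(-90deg) translateZ(50px); }
  .right  { transform: rotateY(90deg) translateZ(50px); }
  .top    { transform: rotateX(90deg) translateZ(50px); }
  .bottom { transform: rotateX(-90deg) translateZ(50px); }
</style>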

The first (beta) version is still in development, but I can list some of the features:

  • Built-in solids: Cubes, cuboids, regular prisms, cylinders (to an extent), regular pyramids, regular right frustums, cones (to an extent), irregular convex and concave prisms using SVG Paths (!), and simple spheres
  • Planes, arcs, SVG Path-based curves
  • Groups (with names)
  • Transformation (translation, rotation, scaling) on any element (or at least the currently existing ones)
  • An element to include groups/components
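The syntax isn't final, but to give an idea, a hypothetical SVM document could look something like this. All element and attribute names below are made up for illustration; nothing here is the final format:

<svm>
  <!-- hypothetical: a named group containing built-in solids -->
  <group name="example">
    <cube size="10" />
    <cylinder radius="5" height="10" translate="20 0 0" />
  </group>
  <!-- hypothetical: including the group elsewhere, transformed -->
  <include group="example" rotate="0 45 0" />
</svm>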

Some of the features in action, including automatic shadow, which will adapt to the orientation of the elements in the future[2]

It may not be as refined yet, but I think there's a lot of potential. Currently, the main hurdle is performance and the accompanying visual issues. For example, on my laptop, Chrome starts lagging and breaking apart[3] at around 450 elements, and keep in mind that even a single cube results in six elements. On my PC, on the other hand, it works fine for up to around 1200 elements.


[1]: As far as I know, anyway.
[2]: Currently, it doesn't, as you can see on the yellow cylinder next to the red cube.
[3]:

As you can see, parts of elements just disappear

Saturday, February 4, 2017

Citation.js Version 0.2.11: jQuery and Looking Forward

A few weeks ago I published version 0.2.11 of Citation.js. The main change was the addition of jQuery.Citation.js, updated for version 2 of Citation.js. jQuery.Citation.js is a small jQuery plugin to build simple forms where you can fill in metadata, which gets translated to CSL-JSON. The configuration options are currently quite limited, so it only really works as a demo for Citation.js itself.
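For context, the CSL-JSON such a form translates to looks roughly like this; a minimal, hand-written example:

var cslJson = [
  {
    type: 'article-journal',
    title: 'An Example Article',
    author: [ { given: 'Jane', family: 'Doe' } ],
    issued: { 'date-parts': [ [ 2017, 2, 4 ] ] }
  }
]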

Citation.js logo

Since then (the current version is 0.2.15), the improvements have been bug fixes and the addition of (more support for) certain fields in both Wikidata and BibTeX. More interesting is what's going to happen in the next few releases.

Version 0.3

I've been planning version 0.3 for a while now and these things are probably going to be in it:

  • Asynchronous parsing: Parse input asynchronously, to allow asynchronous file requests, mainly for Wikidata.
  • More BibTeX: Publication type-specific fields, for example, should be parsed accordingly. I recently found a new guide to BibTeX, which should help as well.
  • Helper methods: Expose certain sub-parsing methods, like parseName and parseDate.
  • Structure Cite better: De-expose data that shouldn't be changed, add version information to Cite, etc.
  • Deprecate log: It's more or less useless. I'll add an option to enable it.
  • Structure code better: It's a mess, and things are broken. I'll change some file locations and add browserify etc.

Sunday, November 27, 2016

Weekly Report 12: Including table facts in Wikidata

This week, the big achievement is the addition of a multi-step form for adding the semantic triples from last week to Wikidata with QuickStatements, which we've talked about before too. The new '+' icon in table rows now links to a page where you can curate the statement and add Wikidata IDs where necessary. At the last step, you get a table of the existing data, the added identifiers and, soon, their Wikidata labels. From there, you can open the statement in QuickStatements, where it will be added to Wikidata. This is a delicate procedure; it's important to keep an eye on the validity of the statement.

The multi-step form in action (see tweet)
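For reference, a QuickStatements (v1) command is a tab-separated line of item, property and value. The classic example:

Q42	P31	Q5

which states that Q42 (Douglas Adams) is an instance of (P31) human (Q5).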

Take the following table:

Fungus   | Insect   | Host
Fungus A | Insect 1 | Host 1
Fungus B | Insect 2 | Host 2

This is currently interpreted (by the program) as "Fungus A has Insect 1 as Insect and Host 1 as Host", while the actual meaning is "Fungus A is spread by the insect Insect 1, of which the host is Host 1" (see this table). While stating that Host 1 is the host of Fungus A is arguably correct, this problem occurs in much worse forms as well, and it's hard to account for.
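In terms of the assertion objects from the pipeline, a sketch of the difference (the scheme is illustrative):

// What the program currently extracts:
var extracted = {
  name: 'Fungus A',
  properties: { Insect: 'Insect 1', Host: 'Host 1' }
}

// What the row actually means: the host belongs to the insect, not to the fungus.
var intended = {
  name: 'Fungus A',
  properties: {
    'spread by': { name: 'Insect 1', properties: { Host: 'Host 1' } }
  }
}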

On top of that, table cells often don't contain the exact data, but abbreviations. It shouldn't be too hard to annotate abbreviations with their meaning before parsing the tables, but actually linking them to Wikidata entities is still a problem. This is, obviously, the next step, and I'll continue to work on it in the next few weeks. One thing I could do is search for entities based on their label. A problem with that is that I would have to make thousands of calls to the Wikidata API in a short period of time, so I'll probably create a dictionary with one big call, if the resulting size allows it.
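Such a label search could use the wbsearchentities endpoint of the Wikidata API; a minimal sketch (one call per label, hence the concern about thousands of calls):

// Search Wikidata entities by label
function searchEntity (label) {
  var url = 'https://www.wikidata.org/w/api.php' +
    '?action=wbsearchentities' +
    '&search=' + encodeURIComponent(label) +
    '&language=en&format=json&origin=*'

  return fetch(url)
    .then(function (response) { return response.json() })
    .then(function (data) { return data.search })
}

searchEntity('Picea glauca').then(function (matches) {
  console.log(matches[0].id) // the matching Q-identifier
})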

There is a pull request pending that would solve the issue norma has with tables, so I'll incorporate that into the program as well, making it easier for me as some things will already be annotated.

Conclusion: the 'curating' part of adding the statements to Wikidata is currently *very* important, but I'll do as much as possible to make it easier in the next few weeks.