Saturday, August 26, 2017

Citation.js: DOI update and more stability

Citation.js: DOI update and more stability

Finally, Citation.js supports DOIs. It took a while, but it’s finally there. One big ‘but’: synchronous fetching doesn’t work in Chrome. I’m still looking into that, but I should be recommending you to use Cite.async() anyway. Also in this blog post: more stability in Cite#get(), a welcome byproduct of using the DOI API, and looking forward (again).

Citation.js logo

DOI support

So, DOIs. That was (and is) a though one. Let me guide you through the process.

Initial development

I have been planning to add support for DOI input since the beginning of this year, going off the original feature request, and the pressure to implement it has grew right along me realising how important they are in the world of citations. Back then I thought it would be smart to query Wikidata for DOIs, as I had just finished some code on Wikidata ID input, that used regular API as well as the query part. More recently however, I learned about the Crossref API and, even more helpfully, the DOI Content Negotiation API, which combines the APIs from Crossref, DataCite and mEDRA. I’ll quote a piece from the docs:

Content negotiation allows a user to request a particular representation of a web resource. DOI resolvers use content negotiation to provide different representations of metadata associated with DOIs.

A content negotiated request to a DOI resolver is much like a standard HTTP request, except server-driven negotiation will take place based on the list of acceptable content types a client provides.

source

Basically, you just make a request to https://doi.org/$DOI (where hopefully obviously $DOI stands for the DOI you’re looking for) with an Accept header set to the format you want your data in. Conveniently, one of those formats is CSL-JSON, here charmingly named application/vnd.citationstyles.csl+json, but nonetheless the direct input format for Citation.js. This would allow me to only have to write code to extract DOIs from a whitespace-separated list (const dois = string.trim().split(/\s+/g)) and fetch them from the server (await Promise.all(dois.map(fetchDoi))), where fetchDoi is a simple function:

async function fetchDoi (doi) {
  const headers = new Headers({
    Accept: 'application/vnd.citationstyles.csl+json'
  })
  const response = await fetch('https://doi.org/' + doi, {headers})
  return response.json()
}

And Headers and fetch are built-in. Note that this uses some advanced syntax. Read more about async/await, Promise.all(), shorthand property names and const.

First problems

It wasn’t that simple. Heck, even getting the API to work took a (long) while and a lot of debugging other people’s code. You see, to make synchronous requests in Node.js I use sync-request, which is a synchronous wrapper around its asynchronous cousin, then-request, which is a wrapper around an also asynchronous but more low-level cousin, http-basic. They all use the same scheme of options. Because of that, sync-request passes its options down all the way to http-basic, so options from http-basic but not in sync-request can be used too. Turns out, http-basic has an option that removes all headers when a request is redirected, but it isn’t documented in sync-request. This removes, among other things, the Accept header, which, on omission, produces an HTTP 501 code not documented in the API.

Let’s just say I filed a feature request to mention this in the docs.

The actual code didn’t need much change, so soon I got the first JSON response, and since the API supports CSL-JSON out of the box, I didn’t need to do much else, except building some input recognition infrastructure.

CSL-JSON problems

However, nothing is perfect, and so is that out-of-the-box support of CSL-JSON. This was actually picked up quite well, and I’m happy with the outcome. Basically, extra code needed to fix this consists of two parts. One part transforms some invalid but essential parts of the API response to its CSL-JSON counterpart. The second part filters invalid and less essential parts out of the data, and basically acts as a safeguard for methods depending on certain variables having certain types. More on this later.

More problems, from an unexpected source

Yep, it isn’t even over yet. This may be the biggest problem of them all, in that it hasn’t got a solution yet, and that I don’t expect it to have one in the near future, and that it is very likely there is no solution. It depends, it really does. Anyway, let’s get to the point.

Chrome doesn’t support synchronous redirects from CORS domains. Or something. Even that isn’t really clear. Basically, the problem is that generally, the DOI content negotiation works like this (assuming we’re in the browser): the user domain requests data from https://doi.org, which redirects to https://data.crossref.org (or a different domain). Both are different domains. Both synchronous and asynchronous requests directly to https://data.crossref.org (or a different domain) work. Asynchronous requests via https://doi.org work too. Even synchronous requests via https://doi.org work, but not in Chrome. I don’t know why, I don’t know when (as it does support normal synchronous CORS requests), and I can’t find any document that says why it does that. The only things that have shed some light on this are a comment I can’t find anymore that only stated you can’t do that at all and this answer, that almost describes the exact problem I’m having, but fails to give any answer other than that it’s weird that Firefox and IE “don’t follow the jQuery spec.” Note that the jQuery spec doesn’t mention anything about this apart from the note that this shouldn’t be done ever anyway.

One obvious answer is that every mayor browser is currently in the process of factoring out all synchronous effect, and that Chrome might have taken this a step further, but even then, there should be some document somewhere that says it does this, otherwise I’d just consider it a bug. I haven’t gotten around to it yet, but I will try to see how this would affect using Citation.js synchronously in Web Workers. If that works, it means that it actually is another rule to prevent synchronous requests on the main thread, which is fine by me, although slightly inconvenient for people who do not care about user experience, such as me when I’m lazy.

Conclusion

DOIs work. Because of all the ranting, I haven’t shown you the api, but it’s pretty familiar. To use the CLI, do this:

> npm i -g citation-js
# to install the new version of the command
# might need to be run with admin rights

> citation-js -t 10.1021/ja01577a030 -f string -s citation-apa
# Fetches data from doi: 10.1021/ja01577a030
# in the format: string; in the formatted style: apa
 
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030

To use the API, do this:

const Cite = require('citation-js')

// For synchronous use
const data = new Cite('10.1021/ja01577a030')
const output = data.get({
  type: 'string',
  style: 'citation-apa'
})

/* output =
  Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
*/

// For asynchronous use
Cite.async('10.1021/ja01577a030').then(data => {
  const output = data.get({
    type: 'string',
    style: 'citation-apa'
  })
  
  /* output =
    Hall, H. K., Jr. Correlation of the Base Strengths of Amines1. Journal of the American Chemical Society, 79(20), 5441–5444. https://doi.org/10.1021/ja01577a030
  */
})

Stability in Cite#get()

As I mentioned earlier, I had to write some code to filter out invalid but non-essential props from the CSL-JSON. Otherwise certain methods used by Cite#get() throw errors, as they except these props to have certain types. Because of this filtering function, I don’t have to type-check ever again, I hope. The implementation is pretty simple. Specially structured props like author and other names, and issued and other dates, get caught and handled on their own. Other props get checked against a map of expected data types and removed if they don’t match, and if the bestGuessConversions flag is set, if they also can’t be converted reliably. Note that, as of now, all Cite.get.*() functions expect CSL-JSON cleaned by Cite.parse.csl().

Version 0.3.0

Implementing DOI input was one of the big milestones on the way to version 0.3.0, besides exposing internal methods, making async input parsing available and generally making the browser version and the CLI less bad. Now that those three things are done, I think it’s a good moment to see what still needs to be done for the v0.3.0 release.

Better custom formatted citations

Using formatted citations is fine, but when you try to use custom ones, either because the style guide you want or need to follow isn’t built-in, or because you have some special use case, things may get confusing. Currently, the API works (should work) like this: when you pass a template in the Cite#get() options (important: not in the new Cite(<data>, <options>) options), it gets registered in a register of citeproc-js engines. After that, you can use it by referencing the name you used in the regular style option. This would work better with an API similar to this:

const template = '...'
const templateName = 'custom'

// Suggested new API
Cite.CSL.template.register(templateName, template)

// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName
})

Also, sometimes you just want to append or prepend some text or HTML frames, and implementing that with CSL templates takes time and effort and the result isn’t always that pretty, as templates don’t support direct HTML. There will be a new API for that too, probably like this:

// prepends the id like this: "[$ID]: "
const prepend = ({id}) => `[${id}]: `
// appends an altmetric widget
const append = ({DOI}) => `<span class='altmetric-embed' data-doi='${DOI}'></span>`
// Nothing new, will stay the same
const data = new Cite(...)
data.get({
  type: 'html',
  style: 'citation-' + templateName,
  // Suggested new API
  // both properties either function or constant string
  append: append,
  prepend: prepend
})

One thing I still have to look at is extending the wrapping HTML elements outputted by citeproc-js.

Use asynchronism better

Now that we have asynchronous input parsing with Cite.async(), it may be interesting to look at asynchronous output formatting as well. There are also some functions that can use Cite.async() but don’t, like Cite#add() and Cite#set(), and the test cases are generally synchronous, with some special cases for async, while it should probably be the other way around. Adding a coverage tester will assist in determining which synchronous test cases can go.

Refactoring

There are still some things I’d like to refactor, like the Wikidata parser. I don’t like it right now, especially the hack to merge different types of authors.

After v0.3.0

Since we’re almost at the end of v0.3.0, I’ll also outline some of my plans for future versions.

  • BibJSON input (already partly supported)
  • Extensions (input parsing, output, etc.)
  • Zotero input (maybe not worth the work, as they already support export to CSL)
  • Scraper (could just be getting DOI and using my existing work)
  • Coverage testing and an expansion of test cases
  • More work on the dependants:

No comments:

Post a Comment