Sunday, August 7, 2016

Weekly Report 4: Learning about text mining

I was on holiday for the last two weeks, so I was not able to do all that much, and I missed the second webinar, but I did read some on-topic papers about NLP. Finding occurrences of certain species and chemicals seems fairly easy to me, with the right dictionaries of course, but finding the relation between a species and a chemical in an unstructured sentence can't be done without understanding the sentence, and without a program that does this you would have to extract the data manually. I first wanted to do this with regular expressions, but now that I have read the OSCAR4 and ChemicalTagger papers, I know there are other, better options.
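
To make the dictionary idea concrete, here is a minimal sketch of the kind of lookup I have in mind. The species names, chemical names and the example sentence are made up for illustration; real dictionaries would be far larger and would need handling of synonyms and spelling variants.

```java
// Minimal sketch of dictionary-based tagging: find which dictionary terms
// occur in a sentence. All dictionary entries and the sentence are invented
// examples, not real project data.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DictionaryTagger {

    // Return every dictionary term found in the sentence, using word boundaries
    // so that e.g. "lead" does not match inside "leader".
    static List<String> findTerms(String sentence, List<String> dictionary) {
        List<String> hits = new ArrayList<>();
        for (String term : dictionary) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                                        Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(sentence);
            if (m.find()) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> species = Arrays.asList("Daphnia magna", "zebrafish");
        List<String> chemicals = Arrays.asList("atrazine", "cadmium");

        String sentence = "Atrazine reduced the survival of Daphnia magna at high doses.";
        System.out.println("Species:   " + findTerms(sentence, species));
        System.out.println("Chemicals: " + findTerms(sentence, chemicals));
        // Note: this only finds co-occurrence within one sentence; it says
        // nothing about *how* the species and the chemical are related,
        // which is exactly the hard part described above.
    }
}
```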

OSCAR4 in particular is said to be highly domain-independent. My conclusion is that it, perhaps together with a modified version of ChemicalTagger, can be used in my project to find the relations between the species, chemicals and diseases that occur in a text.
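
The OSCAR4 paper shows that basic usage is only a few lines of Java. The sketch below follows that usage example; the exact package paths and getter names are assumptions based on how I remember the OSCAR4 documentation, so they should be checked against the current Javadoc.

```java
// Minimal sketch of running OSCAR4 over a sentence, following the usage
// example in the OSCAR4 paper. Package paths and NamedEntity accessor
// names are assumptions; verify them against the OSCAR4 documentation.
import java.util.List;
import uk.ac.cam.ch.wwmm.oscar.Oscar;
import uk.ac.cam.ch.wwmm.oscar.document.NamedEntity;

public class OscarExample {
    public static void main(String[] args) {
        Oscar oscar = new Oscar();
        String text = "Atrazine reduced the survival of Daphnia magna at high doses.";

        // Find chemical named entities in the text.
        List<NamedEntity> entities = oscar.findNamedEntities(text);
        for (NamedEntity ne : entities) {
            // Each entity carries its surface string, type and character offsets.
            System.out.println(ne.getSurface() + "\t" + ne.getType()
                               + "\t" + ne.getStart() + "-" + ne.getEnd());
        }
    }
}
```

This would cover the chemicals; the species and diseases would still need their own dictionaries, as discussed above.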

If this doesn't work for some reason (e.g. it takes too long to build the dictionaries, or the grammar is too complex), there is always Google's SyntaxNet and its English parser, Parsey McParseface. Be that as it may, this seems a bit over the top. They are made to figure out the grammar of less strict sentences, and therefore they have to use more complex systems, such as a "globally normalized transition-based neural network". This is needed to choose the best option in situations like the following, quoted from the SyntaxNet announcement:

One of the main problems that makes parsing so challenging is that human languages show remarkable levels of ambiguity. It is not uncommon for moderate length sentences - say 20 or 30 words in length - to have hundreds, thousands, or even tens of thousands of possible syntactic structures. A natural language parser must somehow search through all of these alternatives, and find the most plausible structure given the context. As a very simple example, the sentence Alice drove down the street in her car has at least two possible dependency parses:

[The two dependency parse diagrams from the SyntaxNet post are not reproduced here.]

The first corresponds to the (correct) interpretation where Alice is driving in her car; the second corresponds to the (absurd, but possible) interpretation where the street is located in her car. The ambiguity arises because the preposition in can either modify drove or street; this example is an instance of what is called prepositional phrase attachment ambiguity.
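
To make the ambiguity a bit more concrete for myself, here is a small sketch of my own (not from the SyntaxNet post) that writes each candidate parse as an array of head indices, where heads[i] points at the governor of word i and 0 marks the root:

```java
// My own illustration of the two competing parses of
// "Alice drove down the street in her car", each written as head indices.
public class PpAttachment {
    public static void main(String[] args) {
        String[] words = {"ROOT", "Alice", "drove", "down", "the", "street", "in", "her", "car"};

        // Parse 1 (correct): "in her car" attaches to "drove".
        int[] parseDrove  = {-1, 2, 0, 2, 5, 3, 2, 8, 6};
        // Parse 2 (absurd):  "in her car" attaches to "street".
        int[] parseStreet = {-1, 2, 0, 2, 5, 3, 5, 8, 6};

        print(words, parseDrove);
        System.out.println();
        print(words, parseStreet);
    }

    // Print each word together with the word it depends on.
    static void print(String[] words, int[] heads) {
        for (int i = 1; i < words.length; i++) {
            System.out.println(words[i] + " <- " + words[heads[i]]);
        }
    }
}
```

A parser like Parsey McParseface has to score all such alternatives and pick the most plausible one given the context.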

If it turns out to be necessary, however, it remains an option.
