INL labs help

The present INL labs tools output linguistically annotated TEI from a number of input formats. We currently provide two annotation tools: the Stanford NE tagger trained on historical newspaper ground truth, developed in the IMPACT project, and the first public version of an INL-developed tagger-lemmatizer for historical Dutch, where the tagger is trained on the "Letters as loot" corpus and the lemmatizer is based on the INL historical lexicon. Scroll down for more information about the taggers.

Using the website

You can use the tool website to submit files, text or a URL to the annotation tools. To do so, set the following parameters in the HTML form:
  • The input format: This option is only available for file uploads. Supported formats are TEI, plain text, HTML, ALTO (experimental), and Microsoft Word 97 (.doc; a converter for the docx format is forthcoming). Please keep in mind that conversion from semi-structured text to TEI is not a very clearly defined task, and results may not be satisfactory in many cases.
  • Output type: The tool service can provide the result as a link to the tagged file, or output the resulting TEI structure immediately. The options are:
    • Styled: formatted for on-screen reading and inspection of the linguistic annotation added by the selected tool.
      • The styled output for the tagger-lemmatizer has the option to launch a view on the linguistic annotation by clicking on a small grey arrow symbol before a sentence. Clicking on a lemma in this view initiates a search in the INL Dutch dictionaries online.
      • The styled output for the NE tagger highlights the entities found and has some options to view lists of the named entities proposed by the tagger.
    • Prettyprinted XML: indented and colored XML source rendering
    • Link: link to the tagged XML file. Keep in mind that the INL does not guarantee persistence of the linked result files for more than a couple of hours.
    • Raw: output of the tagged XML file (your browser will render it in some way). Use this, or the previous option, if you want to use the result for further automatic processing.
  • Tagger type: choice of annotation tool

Using the INL Labs web service

The tool service can also be called as a REST web service that returns responses in XML, so it can be used as part of a web service tool chain.

The service accepts multipart form data input and writes the output directly in the response.

Relevant URL parameters (parameter name, description, possible values):

  • tagger: Tool name. Possible values:
    • bab-tagger: tagger-lemmatizer for historical Dutch, using the Dutch historical lexicon
    • stanford-ner-kbkranten: Stanford NER trained on IMPACT newspaper ground truth (18th and 19th century)
  • format: Format of input file. Possible values: tei, html, alto, word, epub, text
  • input: Input file (file upload). Possible values: name of any file on your computer

The service can (for instance) be accessed by a simple command-line Java client program (succeed.client.jar), which we provide as an example. It uploads a file to the service and writes the servlet response to standard output.

The usage for this client is:

usage: java nl.inl.succeed.SucceedClient <file to be tagged>
Options:
-i,--format <arg>      input format
-s,--serverURL <arg>   location of tagging service
-t,--tagger <arg>      tagger (named entity tagger, lemmatizer, PoS tagger)
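
For readers who prefer a scripting language, the same kind of upload can be sketched in Python with the standard library only. This is an illustration under assumptions, not the INL client: the helper names build_multipart and tag_file are hypothetical, the server URL must be supplied by the caller, and sending "tagger" and "format" as form fields next to the "input" file part (rather than in the query string) is an assumption about what the service accepts.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, file_name, file_bytes, boundary=None):
    """Encode text fields and one file as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{file_name}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def tag_file(server_url, path, tagger="bab-tagger", input_format="text"):
    """Upload a file to the tagging service and return the XML response as text.

    Assumption: tagger and format are accepted as form fields; adjust if the
    service expects them in the query string instead.
    """
    with open(path, "rb") as handle:
        data = handle.read()
    body, content_type = build_multipart(
        {"tagger": tagger, "format": input_format},
        "input",
        os.path.basename(path),
        data,
    )
    request = urllib.request.Request(
        server_url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A call would look like tag_file(server_url, "letter.txt", tagger="stanford-ner-kbkranten", input_format="text"), where server_url is the location of the tagging service (the same value the Java client takes with -s).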

About the taggers

Tagger/lemmatizer for Dutch historical text

The tagger/lemmatizer consists of:
  • A tokenizer/sentence boundary detector developed for modern Dutch, without any special adaptations for historical Dutch.
  • A statistical part-of-speech tagger, trained using standard techniques on the part-of-speech annotation of the "Letters as loot" corpus. The "Letters as loot" corpus (http://www.hum.leiden.edu/research/letters-as-loot/) consists of more than a thousand 17th- and 18th-century Dutch letters from seized ships. The corpus has been tagged and lemmatized at INL. The letters can now be accessed via the brievenalsbuit.inl.nl website.
  • A lemmatizer which uses direct lexicon lookup in the INL Dutch historical lexicon in combination with a fuzzy match procedure trained on historical example data. (In the styled view, fuzzy lemma matches are shown in grey.)
In the output TEI XML, words are marked with the <w> tag, the proposed lemma (highest-scoring candidate) is in the attribute "lemma", and the part-of-speech tag is in the attribute "type". All possible lemmatizations for the word form are added to the TEI as interpGrp elements, in the following way:
  • A word found in the historical lexicon is marked up as follows:
    <w lemma="harmonica" type="NOU" xml:id="w.0">harmonica
    	<interpGrp type="lexiconMatch">
    		<interp type="matchType">HistoricalExact</interp>
    		<interp type="lemma">harmonica</interp>
    		<interp type="lemmaId">M024346</interp>
    		<interp type="partOfSpeech">NOU</interp>
    	</interpGrp>
    </w>
    
    There may be more than one option. We currently rely on the number of dictionary quotations to rank exact historical matches. The field "lemmaId" refers to ids of articles in the Dictionary of the Dutch Language (Woordenboek der Nederlandsche Taal, WNT).
  • If the word is not found in the historical lexicon, we first look it up in a modern wordform lexicon for Dutch.
    <w lemma="mondharmonica" type="NOU" xml:id="w.0">mondharmonica
    	<interpGrp type="lexiconMatch">
    		<interp type="matchType">ModernExact</interp>
    		<interp type="lemma">mondharmonica</interp>
    		<interp type="partOfSpeech">NOU</interp>
    	</interpGrp>
    </w>
    
  • If this step fails, we use weighted spelling variation patterns to match the historical word against the modern lexicon.
    <w lemma="mondharmonica" type="NOU" xml:id="w.0">mondthaermoonieckaa
    	<interpGrp type="lexiconMatch">
    		<interp type="matchType">ModernWithPatterns</interp>
    		<interp type="lemma">mondharmonica</interp>
    		<interp type="partOfSpeech">NOU</interp>
    		<interp type="matchScore">5.888053484147942E-5</interp>
    	</interpGrp>
    </w>
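
The markup described above can be read back with a few lines of Python. The following is a minimal sketch, assuming the output is well-formed XML; the function names local_name and read_words are hypothetical, and the code ignores namespaces so that it works whether or not the <w> elements are in the TEI namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_words(xml_text):
    """Return one dict per <w> element: word form, lemma, PoS tag and candidates."""
    root = ET.fromstring(xml_text)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "w":
            continue
        candidates = []
        for group in element:
            # Each interpGrp of type "lexiconMatch" is one candidate lemmatization.
            if local_name(group.tag) == "interpGrp" and group.get("type") == "lexiconMatch":
                candidates.append(
                    {interp.get("type"): (interp.text or "").strip() for interp in group}
                )
        words.append(
            {
                # The word form is the text before the first child element.
                "form": (element.text or "").strip(),
                "lemma": element.get("lemma"),
                "pos": element.get("type"),
                "candidates": candidates,
            }
        )
    return words


sample = """<w lemma="mondharmonica" type="NOU" xml:id="w.0">mondthaermoonieckaa
    <interpGrp type="lexiconMatch">
        <interp type="matchType">ModernWithPatterns</interp>
        <interp type="lemma">mondharmonica</interp>
        <interp type="partOfSpeech">NOU</interp>
        <interp type="matchScore">5.888053484147942E-5</interp>
    </interpGrp>
</w>"""

for word in read_words(sample):
    match_types = [c.get("matchType") for c in word["candidates"]]
    print(word["form"], word["lemma"], word["pos"], match_types)
```

The matchType of each candidate (HistoricalExact, ModernExact or ModernWithPatterns) shows which of the three lookup steps produced it, which can be used to filter out fuzzy matches in further processing.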
    

Stanford Named Entity tagger trained on Dutch historical newspaper ground truth

The Stanford Named Entity tagger can be found at http://nlp.stanford.edu/software/CRF-NER.shtml. For detailed information, consult the following reference:
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370, http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf.

For more information about the IMPACT historical newspaper ground truth, cf. http://www.digitisation.eu/data/browse/ground-truth/dutch-gt/ and http://lab.kbresearch.nl/static/html/impact.html.

A detailed report on named entity recognition for historical Dutch is Landsbergen 2012, Evaluation of named entity work in IMPACT: NE Recognition and matching (http://www.digitisation.eu/fileadmin/user_upload/Deliverables/IMPACT_D-EE2.6_NE_work_in_IMPACT.pdf). That work uses the Stanford NE recognizer in conjunction with an approach to overcome spelling variation and OCR errors. The variation module (NERT) is not yet included in the present release. Named entity tagging follows the Namescape annotation guidelines, for example:
<ns:ne normalizedForm="JAN BABTISTA DEL MONTE" nymRef="nym.person.74" type="person">
	<w xml:id="w.7389">Jan</w>
	<w xml:id="w.7390">Babtista</w>
	<w xml:id="w.7391">del</w>
	<w xml:id="w.7392">Monte</w>
</ns:ne>
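
Named entities in this format can be collected with a short Python script. This is a sketch under assumptions: the function names are hypothetical, and because the namespace URI bound to the ns prefix is not stated here, the sample declares a placeholder URI and the code matches elements by local name only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_entities(xml_text):
    """Return one dict per named-entity element, with its tokens joined by spaces."""
    root = ET.fromstring(xml_text)
    entities = []
    for element in root.iter():
        if local_name(element.tag) != "ne":
            continue
        tokens = [
            (word.text or "").strip()
            for word in element.iter()
            if local_name(word.tag) == "w"
        ]
        entities.append(
            {
                "type": element.get("type"),
                "normalized": element.get("normalizedForm"),
                "nym": element.get("nymRef"),
                "text": " ".join(tokens),
            }
        )
    return entities


# The xmlns:ns value below is a placeholder, not the real namespace URI.
sample = """<ns:ne xmlns:ns="http://example.org/placeholder-namespace"
    normalizedForm="JAN BABTISTA DEL MONTE" nymRef="nym.person.74" type="person">
    <w xml:id="w.7389">Jan</w>
    <w xml:id="w.7390">Babtista</w>
    <w xml:id="w.7391">del</w>
    <w xml:id="w.7392">Monte</w>
</ns:ne>"""

for entity in read_entities(sample):
    print(entity["type"], entity["normalized"], entity["text"])
```

Matching on the local name keeps the sketch independent of the exact namespace declaration in the service output; stricter code would compare the full qualified name.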