shopping24 tech blog

s is for shopping

August 13, 2013 / by Torsten Bøgh Köster / CTO / @tboeghk

Stemming German like a pro

Grammatically correct German stemming can be tedious. Applying traditional stemmers like Porter or Snowball usually leads to overstemming, especially with short words. Fortunately, the Solr distribution comes with some filters that, correctly applied and combined, make up a near-perfect German stemmer.

Some examples for overstemming are:

# Overstemming problems
leder                             --> led
bären                             --> bar

Ingredients

German is a language of compound words. Most dictionaries cover short and common compound words well enough for spellchecking, but in our case even a simple example like babybetten did not work. So we combined three existing TokenFilters into a GermanCompoundTokenFilter:

  • the HunspellStemFilter 1 to find stems using a current German Hunspell dictionary,

  • the SynonymFilter 2 to fine-tune stemming of specific domain words,

  • and a PrimaryWordTokenFilter, which is an extension to the HyphenationCompoundWordTokenFilter 3 emitting only the primary word of a token to the TokenStream.

Primary word detection: wordlist and hyphenator

There are several German wordlists around. We consider the German GNU wordlist by Georg Pfeiffer pretty complete. Using some grep, sed and tr magic, it can be converted into a plain wordlist. Besides a list of known words, the HyphenationCompoundWordTokenFilter 3 needs a hyphenation definition. It is compatible with the old Apache Offo Hyphenation definition. Download offo-hyphenation_v1.2.zip from Sourceforge and locate the hyph/de.xml in the package. Use this as hyphenator input to the token filter.

# Result of HyphenationCompoundWordTokenFilter
bett                              --> bett
babybetten                        --> baby,bett,betten
dampfschifffahrtskapitänsmützen   --> dampf,schiff,schifffahrt,fahrt,kapitän,mütze,mützen
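
To illustrate how the list above comes about, here is a minimal sketch of dictionary-based decomposition. It is not the Lucene implementation: the real filter uses the hyphenation definition to limit the candidate split points, while this sketch simply tries every substring. The class name, the method name and the minimum subword length are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CompoundSplitSketch {

    // Assumed minimum length of a subword; shorter fragments are ignored.
    static final int MIN_SUBWORD_LENGTH = 3;

    // Returns every substring of the token that is listed in the wordlist,
    // ordered by start position and then by length.
    static List<String> decompose(String token, Set<String> wordlist) {
        String lower = token.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int end = start + MIN_SUBWORD_LENGTH; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (wordlist.contains(candidate)) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> wordlist = Set.of("baby", "bett", "betten");
        // prints [baby, bett, betten]
        System.out.println(decompose("babybetten", wordlist));
    }
}
```

Trying every substring is quadratic in the token length, which is fine for a demonstration but is one reason the real filter restricts itself to hyphenation points.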

Primary word detection: extending the token filter

Finding the primary word in German is (most of the time) straightforward: it’s the last part of a compound word. It’s pretty easy to extend the HyphenationCompoundWordTokenFilter 3 and change the decompose() method to only emit the last decomposed Token to the TokenStream.

# Result of PrimaryWordTokenFilter
bett                              --> bett
babybetten                        --> betten
dampfschifffahrtskapitänsmützen   --> mützen
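
The selection step itself is small. The following sketch works on a plain list of decomposed parts instead of a Lucene TokenStream, so it only shows the idea and not the actual decompose() override. The class and method names are assumptions, and so is the extra check that the chosen part really ends the token.

```java
import java.util.List;
import java.util.Locale;

class PrimaryWordSketch {

    // Walks the decomposed parts from the back and returns the first one
    // that the token ends with, or null if there is none.
    static String primaryWord(String token, List<String> parts) {
        String lower = token.toLowerCase(Locale.ROOT);
        for (int i = parts.size() - 1; i >= 0; i--) {
            String part = parts.get(i);
            if (lower.endsWith(part)) {
                return part;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // prints betten
        System.out.println(primaryWord("babybetten", List.of("baby", "bett", "betten")));
    }
}
```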

Dictionary and affix definition for Hunspell

The Hunspell stemmer needs a current German Hunspell dictionary and affix definition. Those can be found on the Apache OpenOffice website as .oxt files. Download a current version and rename the file suffix to .zip. Unzip it and locate the Hunspell dictionary in de_DE_frami/de_DE_frami.dic and the Hunspell affix definition in de_DE_frami/de_DE_frami.aff.

# Result of HunspellStemFilter
betten                            --> bett
babybetten                        --> babybetten
dampfschifffahrtskapitänsmützen   --> dampfschifffahrtskapitänsmützen

Synonym list for fine-tuning

We use a synonym filter to fine-tune the stemming. We compile it from wordlists of common domain words. If you want to use one out of the box, take a look at a German stemming file (in Solr syntax) from GitHub. The file is quite large (8.6 MB) and contains some 1:1 mappings, so it might be a good idea to clean it up a little.

# Result of SynonymFilter
Abdrosslungen                      --> Abdrosslung
Abarbeitungsgeschwindigkeiten      --> Abarbeitungsgeschwindigkeit
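
One possible cleanup pass is sketched below. It assumes that "1:1 mappings" means entries whose left and right sides are identical, and that the file uses Solr's explicit `lhs => rhs` mapping syntax. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SynonymCleanupSketch {

    // Keeps comments, blank lines and real mappings; drops lines such as
    // "Abend => Abend" that map a word onto itself.
    static List<String> dropIdentityMappings(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) {
                String[] sides = trimmed.split("=>", 2);
                if (sides.length == 2 && sides[0].trim().equals(sides[1].trim())) {
                    continue; // identity mapping, contributes nothing to stemming
                }
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Abdrosslungen => Abdrosslung",
                "Abend => Abend");
        // prints [Abdrosslungen => Abdrosslung]
        System.out.println(dropIdentityMappings(lines));
    }
}
```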

Stir: the stemming algorithm

The stemming algorithm uses Hunspell to find the correct stem of either the whole compound or its primary word. Synonym files are used to fine-tune the stemming of domain words.

 // StringUtils.removeEnd() below is presumably the Apache Commons Lang method.

 // check synonyms first: an exact match wins
 String stem = checkSynonyms(token);

 // no 100% synonym found
 if (stem == null) {

    // pass into hunspell stemmer
    stem = findHunspellStem(token);

    // no stem found? Fall back to the primary word of the compound
    if (stem == null) {

       // pass into primary word filter
       String primaryWord = findPrimaryWord(token);

       // if primary word found, pass it into the hunspell filter
       if (primaryWord != null) {
          String primaryWordStem = findHunspellStem(primaryWord);

          // no stem found? check synonyms
          if (primaryWordStem == null) {
             primaryWordStem = checkSynonyms(primaryWord);
          }

          // if primary word could be stemmed, replace primary word in
          // token with primary word stem.
          if (primaryWordStem != null) {
             stem = StringUtils.removeEnd(token, primaryWord);
             stem += primaryWordStem;
          }
       }
    }
 }
 // stem is still null here if none of the steps produced a result
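
To try the control flow without Solr, the fragment above can be wrapped into a small self-contained class. Everything around the algorithm is an assumption for illustration: the three helper methods are backed by tiny hard-coded maps instead of the real filters, and a token that cannot be stemmed is returned unchanged.

```java
import java.util.Map;

class GermanStemSketch {

    // Hypothetical stand-ins for the synonym, Hunspell and primary word filters.
    static final Map<String, String> SYNONYMS = Map.of("abdrosslungen", "abdrosslung");
    static final Map<String, String> HUNSPELL = Map.of("betten", "bett");
    static final Map<String, String> PRIMARY_WORDS = Map.of("babybetten", "betten");

    static String checkSynonyms(String token) { return SYNONYMS.get(token); }
    static String findHunspellStem(String token) { return HUNSPELL.get(token); }
    static String findPrimaryWord(String token) { return PRIMARY_WORDS.get(token); }

    static String stem(String token) {
        // 1. an exact synonym wins
        String stem = checkSynonyms(token);
        if (stem != null) return stem;

        // 2. try to stem the whole token with Hunspell
        stem = findHunspellStem(token);
        if (stem != null) return stem;

        // 3. fall back to the primary word of the compound
        String primaryWord = findPrimaryWord(token);
        if (primaryWord == null || !token.endsWith(primaryWord)) return token;

        String primaryWordStem = findHunspellStem(primaryWord);
        if (primaryWordStem == null) primaryWordStem = checkSynonyms(primaryWord);
        if (primaryWordStem == null) return token; // assumption: keep token as-is

        // replace the primary word at the end of the token with its stem
        String prefix = token.substring(0, token.length() - primaryWord.length());
        return prefix + primaryWordStem;
    }

    public static void main(String[] args) {
        System.out.println(stem("betten"));     // prints bett
        System.out.println(stem("babybetten")); // prints babybett
    }
}
```

The early returns replace the nested null checks of the fragment but follow the same order: synonyms, Hunspell on the full token, then Hunspell and synonyms on the primary word.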

Serve

Having collected the ingredients and stirred them properly, it’s straightforward to combine the three filters into one. Iterate through the given token stream and process each token according to the algorithm above. Not familiar with writing Solr TokenFilters? The PrimaryWordTokenFilter and GermanCompoundTokenFilter will be open-sourced soon.

# Result of GermanCompoundTokenFilter
betten                            --> bett
babybetten                        --> babybett
dampfschifffahrtskapitänsmützen   --> dampfschifffahrtskapitänsmütze