August 13, 2013 / by Torsten Bøgh Köster / CTO / @tboeghk
Stemming German like a pro
Grammatically correct German stemming can be tedious. Traditional stemmers like Porter or Snowball usually lead to overstemming, especially with short words. Fortunately, the Solr distribution ships with some filters that, correctly applied and combined, make up a near-perfect German stemmer.
Some examples of overstemming:
# Overstemming problems
leder --> led
bären --> bar
Ingredients
German is a language of compound words. Most dictionaries cover short and common compound words well enough for spellchecking. In our case, a simple example like babybetten did not work. So we combined three TokenFilter implementations into a GermanCompoundTokenFilter:
- the HunspellStemFilter [1] to find stems using a current German Hunspell dictionary,
- the SynonymFilter [2] to fine-tune stemming of specific domain words,
- and a PrimaryWordTokenFilter, which is an extension to the HyphenationCompoundWordTokenFilter [3] emitting only the primary word of a token to the TokenStream.
Primary word detection: wordlist and hyphenator
There are several German wordlists around. We consider the German GNU wordlist by Georg Pfeiffer pretty complete. Using some grep, sed and tr magic, it can be converted into a plain wordlist.
Besides a list of known words, the HyphenationCompoundWordTokenFilter [3] needs a hyphenation definition. It is compatible with the old Apache Offo hyphenation definitions. Download offo-hyphenation_v1.2.zip from Sourceforge and locate hyph/de.xml in the package. Use this file as the hyphenator input to the token filter.
# Result of HyphenationCompoundWordTokenFilter
bett --> bett
babybetten --> baby,bett,betten
dampfschifffahrtskapitänsmützen --> dampf,schiff,schifffahrt,fahrt,kapitän,mütze,mützen
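To make the shape of that output easier to follow, here is a minimal sketch of the dictionary lookup part only. It is a brute-force substring search, not the hyphenation-based approach the Lucene filter uses, and the class SubwordFinder, its method and the three-word dictionary are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class SubwordFinder {

    /**
     * Returns every dictionary word contained in the token,
     * ordered by start offset and then by length.
     */
    static List<String> subwords(String token, Set<String> dictionary, int minLength) {
        String lower = token.toLowerCase(Locale.GERMAN);
        List<String> found = new ArrayList<>();
        for (int start = 0; start + minLength <= lower.length(); start++) {
            for (int end = start + minLength; end <= lower.length(); end++) {
                String candidate = lower.substring(start, end);
                if (dictionary.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the plain wordlist described above.
        Set<String> dictionary = Set.of("baby", "bett", "betten");
        System.out.println(subwords("babybetten", dictionary, 3)); // prints [baby, bett, betten]
    }
}
```

Trying every substring is quadratic in the token length. A hyphenation definition narrows the candidates to plausible split points, which is presumably why the hyphenation-based filter is preferred here.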
Primary word detection: extending the token filter
Finding the primary word in German is (most of the time) straightforward: it’s the last part of a compound word. It’s pretty easy to extend the HyphenationCompoundWordTokenFilter [3] and change the decompose() method to only emit the last decomposed token to the TokenStream.
# Result of PrimaryWordTokenFilter
bett --> bett
babybetten --> betten
dampfschifffahrtskapitänsmützen --> mützen
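The "last part wins" rule can be sketched without Lucene. This is not the PrimaryWordTokenFilter itself: it assumes a plain in-memory wordlist and picks the longest suffix of the token that is a known word. PrimaryWordFinder and primaryWord are hypothetical names.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper for illustration; not the actual PrimaryWordTokenFilter. */
final class PrimaryWordFinder {

    /**
     * Returns the longest suffix of the token that is a dictionary word.
     * A token that is itself a dictionary word is returned unchanged.
     */
    static Optional<String> primaryWord(String token, Set<String> dictionary, int minLength) {
        String lower = token.toLowerCase(Locale.GERMAN);
        // Moving the start offset to the right shortens the suffix step by step,
        // so the first hit is the longest matching suffix.
        for (int start = 0; start + minLength <= lower.length(); start++) {
            String suffix = lower.substring(start);
            if (dictionary.contains(suffix)) {
                return Optional.of(suffix);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("baby", "bett", "betten");
        System.out.println(primaryWord("babybetten", dictionary, 3).orElse("(none)")); // prints betten
        System.out.println(primaryWord("bett", dictionary, 3).orElse("(none)"));       // prints bett
    }
}
```

Preferring the longest suffix is a design choice of this sketch. If the wordlist already contains a longer compound, that compound is returned instead of its last part.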
Dictionaries and hyphenators for Hunspell
The Hunspell stemmer needs a current German Hunspell dictionary and affix definition. Both can be found on the Apache OpenOffice website as .oxt files. Download a current version and rename the file suffix to .zip. Unzip it and locate the Hunspell dictionary in de_DE_frami/de_DE_frami.dic and the Hunspell affix definition in de_DE_frami/de_DE_frami.aff.
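Since an .oxt file is an ordinary zip archive, the rename-and-unzip step can also be scripted. The following is a small sketch using only java.util.zip. OxtExtractor is a hypothetical helper, and the entry names are the ones mentioned above; they may differ in other dictionary releases.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper for illustration. */
final class OxtExtractor {

    /** Copies one entry of the .oxt archive into targetDir and returns the written file. */
    static Path extract(Path oxtFile, String entryName, Path targetDir) throws IOException {
        try (ZipFile zip = new ZipFile(oxtFile.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry " + entryName + " in " + oxtFile);
            }
            // Keep only the file name so the result lands directly in targetDir.
            Path target = targetDir.resolve(Path.of(entryName).getFileName());
            try (InputStream in = zip.getInputStream(entry)) {
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            return target;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OxtExtractor <dictionary.oxt> <target-dir>");
            return;
        }
        Path oxt = Path.of(args[0]);
        Path out = Files.createDirectories(Path.of(args[1]));
        System.out.println(extract(oxt, "de_DE_frami/de_DE_frami.dic", out));
        System.out.println(extract(oxt, "de_DE_frami/de_DE_frami.aff", out));
    }
}
```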
# Result of HunspellStemFilter
betten --> bett
babybetten --> babybetten
dampfschifffahrtskapitänsmützen --> dampfschifffahrtskapitänsmützen
Synonym list for fine-tuning
We use a synonym filter to fine-tune the stemming. We compile the synonym list from wordlists of common domain words. If you want to use one out of the box, take a look at a German stemming file (in Solr syntax) from GitHub. The file is quite large (8.6 MB) and contains some 1:1 mappings, so it might be a good idea to clean it up a little.
# Result of SynonymFilter
Abdrosslungen --> Abdrosslung
Abarbeitungsgeschwindigkeiten --> Abarbeitungsgeschwindigkeit
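One possible cleanup step can be sketched as follows. It assumes the file uses Solr's explicit mapping syntax ("word => stem", with "#" starting a comment line) and that "1:1 mappings" means entries that map a word to itself. StemmingSynonyms is a hypothetical helper, and lowercasing every entry is a choice of this sketch, not something the file requires.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Solr. */
final class StemmingSynonyms {

    /** Parses "a, b => stem" lines into a word-to-stem map and drops identity mappings. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> stems = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment
            }
            int arrow = line.indexOf("=>");
            if (arrow < 0) {
                continue; // equivalence lists carry no stemming direction
            }
            // Only the first right-hand term is used as the stem.
            String target = line.substring(arrow + 2).split(",", 2)[0]
                    .trim().toLowerCase(Locale.GERMAN);
            if (target.isEmpty()) {
                continue;
            }
            for (String source : line.substring(0, arrow).split(",")) {
                String key = source.trim().toLowerCase(Locale.GERMAN);
                if (!key.isEmpty() && !key.equals(target)) {
                    stems.put(key, target);
                }
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "# sample entries",
                "Abdrosslungen => Abdrosslung",
                "Bett => Bett");
        System.out.println(parse(lines)); // prints {abdrosslungen=abdrosslung}
    }
}
```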
Stir: the stemming algorithm
The stemming algorithm uses Hunspell to find the correct stem of either the whole compound or its primary word. Synonym files are used to fine-tune the stemming of domain words.
    // check synonyms
    String stem = checkSynonyms(token);

    // no 100% synonym found
    if (stem == null) {
        // pass into hunspell stemmer
        stem = findHunspellStem(token);

        // results found? Perfect, we're done
        if (stem == null) {
            // no results found, pass into primary word filter
            String primaryWord = findPrimaryWord(token);

            // if primary word found, pass it into the hunspell filter
            if (primaryWord != null) {
                String primaryWordStem = findHunspellStem(primaryWord);

                // no stem found? check synonyms
                if (primaryWordStem == null) {
                    primaryWordStem = checkSynonyms(primaryWord);
                }

                // if primary word could be stemmed, replace primary word in
                // token with primary word stem.
                if (primaryWordStem != null) {
                    stem = StringUtils.removeEnd(token, primaryWord);
                    stem += primaryWordStem;
                }
            }
        }
    }
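For experimenting with the control flow outside of Solr, the same steps can be wrapped in a small class with pluggable lookups. This is a sketch under assumptions: CompoundStemmer is a hypothetical name, a plain map stands in for the synonym list, and two functions stand in for the Hunspell stemmer and the primary word filter. Each lookup returns null when it has no answer, as in the snippet above.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-alone version of the algorithm; the lookups are stand-ins. */
final class CompoundStemmer {
    private final Map<String, String> synonyms;          // word -> stem, exact matches only
    private final Function<String, String> hunspell;     // returns null if no stem is known
    private final Function<String, String> primaryWords; // returns null if no primary word is found

    CompoundStemmer(Map<String, String> synonyms,
                    Function<String, String> hunspell,
                    Function<String, String> primaryWords) {
        this.synonyms = synonyms;
        this.hunspell = hunspell;
        this.primaryWords = primaryWords;
    }

    /** Returns the stem of the token, or null if none of the lookups produced one. */
    String stem(String token) {
        String stem = synonyms.get(token);
        if (stem != null) {
            return stem; // exact synonym match
        }
        stem = hunspell.apply(token);
        if (stem != null) {
            return stem; // Hunspell knows the whole token
        }
        String primaryWord = primaryWords.apply(token);
        if (primaryWord == null || !token.endsWith(primaryWord)) {
            return null; // nothing to replace at the end of the token
        }
        String primaryStem = hunspell.apply(primaryWord);
        if (primaryStem == null) {
            primaryStem = synonyms.get(primaryWord);
        }
        if (primaryStem == null) {
            return null;
        }
        // Replace the primary word at the end of the token with its stem.
        return token.substring(0, token.length() - primaryWord.length()) + primaryStem;
    }

    public static void main(String[] args) {
        Map<String, String> hunspellStandIn = Map.of("betten", "bett");
        CompoundStemmer stemmer = new CompoundStemmer(
                Map.of(),
                hunspellStandIn::get,
                token -> token.endsWith("betten") ? "betten" : null);
        System.out.println(stemmer.stem("betten"));     // prints bett
        System.out.println(stemmer.stem("babybetten")); // prints babybett
    }
}
```

The early returns are equivalent to the nested null checks. The endsWith guard is an addition of this sketch for the case where the primary word lookup changes the casing of the token.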
Serve
Having collected the ingredients and stirred them properly, it’s straightforward to combine the three filters into one. Iterate through the given token stream and process each token according to the algorithm above. Not familiar with writing Solr TokenFilters? The PrimaryWordTokenFilter and GermanCompoundTokenFilter will be open-sourced soon.
# Result of GermanCompoundTokenFilter
betten --> bett
babybetten --> babybett
dampfschifffahrtskapitänsmützen --> dampfschifffahrtskapitänsmütze
[1] API docs for HunspellStemFilter
[2] API docs for SynonymFilter
[3] API docs for HyphenationCompoundWordTokenFilter