shopping24 tech blog

s is for shopping

March 11, 2015 / by Tobias Kässmann / Software engineer / @kaessmannt

Bad sentence detector for descriptions

We are often faced with the problem that product descriptions are spammed with content that is useless for product search, such as fitting or style notes. The problem is that this content is weighted equally with the significant parts of the text. This post gives a short example of how to deal with this problem.

Let’s start with an example: in our data you will find a lot of bad descriptions containing sentences like:

"stylische Jeans von tigha, schmale Passform, angenehmer Denim aus einer Baumwollmischung, Bund mit seitlichen Reißverschlüssen, verstärkte Kniepartie, rückwärtiger Logopatch am Bund, innere Beinlänge in Größe 27 ca.71cm, 81% Baumwolle 18% Polyester 1% Elasthan, unser Model trägt Größe 27"

What are the important sentences for our search engine?

"schmale Passform,"
"Bund mit seitlichen Reißverschlüssen,"
"verstärkte Kniepartie,"
"81% Baumwolle 18% Polyester 1% Elasthan"

Next example:

"Love is in the air! Schwebe mit Codello auf Wolke sieben. Unzählige kleine Herzchen sorgen für ein rundum tolles Gefühl. Die perfekte Größe und die weiche Ware machen den Schal zu einem tollen Begleiter."

Important part?

""

Right, nothing. And now the last, more complex example, from a technical product description:

"Das Samsung GALAXY S5 mini zeigt sich als leistungsstarkes Smartphone-Erlebnis auf dem aktuellen Stand der Technik in kompakter Größe. Denn es überzeugt neben seinen handlichen Abmessungen mit nutzerfreundlichem Komfort, einem breiten Funktionsspektrum und einer beachtenswerten Ausstattung. Einen großen Teil zum Komfort trägt dabei das HD Super AMOLED-Display bei ............ der App S Health 3.0 bleibt die eigene Fitness im Blick und die Zertifizierung gemäß IP671 sorgt dafür, dass das Samsung GALAXY S5 mini auch bei rauen Umgebungsbedingungen ein vor Staub und Wasser geschützter Begleiter ist. Highlights: HD Super AMOLED-Display Saub- und Wasserabweisend dank IP67-Zertifizierung Hochauflösende 8,0 Megapixel-Kamera für den täglichen Gebrauch Aktuellste Android™4.4-Plattform Lieferumfang: Samsung Galaxy S5 Mini G800F Samsung Akku Ladeadapter Datenkabel Headset Kurzanleitung."

The important sentences are:

"Highlights: HD Super AMOLED-Display Saub- und Wasserabweisend dank IP67-Zertifizierung Hochauflösende 8,0 Megapixel-Kamera für den täglichen Gebrauch Aktuellste Android™4.4-Plattform Lieferumfang: Samsung Galaxy S5 Mini G800F Samsung Akku Ladeadapter Datenkabel Headset Kurzanleitung."

Problem in detail

The result for our customers: when someone searches for a product like “scarf” and the description of a pair of jeans contains a sentence like “These jeans fit perfectly with a scarf…”, the jeans will match and will be returned to the customer. Of course, if you have your search engine under perfect control, the product will be placed at the end of the search results. But there is another problem if you use the TF/IDF formula: the term is really rare in jeans descriptions, and TF/IDF normally boosts rare terms, so it’s possible that this product ends up in the top five results. That’s not what we expect from our search engine. TF/IDF boosting of rare terms is a separate problem; here we tackle the bad data problem and remove those sentences from our descriptions.

How to find these bad sentences?

You could use full-featured data science tools such as the OpenNLP sentence detector combined with a POS tagger, or the more scientific Stanford tagger. But all of these have the same problem: performance. For our use case these taggers and detectors are far too slow and cannot be used in real-time search requests or at indexing time. So the journey begins: we do it on our own.

Pipeline: Part I - Split your sentences

Let’s start with the assumption that the bad data can be found at the sentence level: removing whole bad sentences is the best solution, and we don’t need surrounding context to judge the badness of a sentence. Given a description as input, we have to split it into sentences, and we deliberately don’t use a sentence detector or similar tooling. We just use a complex regular expression, because that is really fast and simple to do in Solr.
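A minimal sketch of such a regex-based splitter, assuming a simplified boundary pattern (our production expression handles many more cases, such as abbreviations and decimal numbers):

```java
import java.util.Arrays;
import java.util.List;

public class SentenceSplitter {

    // Simplified boundary: a sentence-ending character followed by whitespace.
    // The lookbehind keeps the punctuation attached to its sentence, and
    // "ca.71cm" survives intact because no whitespace follows the dot.
    private static final String SENTENCE_BOUNDARY = "(?<=[.!?;])\\s+";

    public static List<String> split(String description) {
        return Arrays.asList(description.split(SENTENCE_BOUNDARY));
    }

    public static void main(String[] args) {
        for (String s : split("Love is in the air! Schwebe mit Codello auf "
                + "Wolke sieben. Die perfekte Größe.")) {
            System.out.println(s);
        }
    }
}
```

A single `String.split` call like this is orders of magnitude cheaper than running a statistical sentence detector, which is the whole point of doing it this way inside Solr.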

Pipeline: Part II - Analyze your sentences

After the splitting, we analyze each sentence independently. You have to define a list of bad words, and some statistical calculations then turn each sentence into a value that describes the information gain in it. A threshold decides at the end whether the sentence is a good or a bad one.
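A crude sketch of this scoring step, assuming a tiny illustrative bad-word list and threshold (both hypothetical; the real list is domain-specific and much larger, and the real statistic is more involved than this simple ratio):

```java
import java.util.Locale;
import java.util.Set;

public class SentenceScorer {

    // Hypothetical bad-word list for illustration only.
    private static final Set<String> BAD_WORDS = Set.of(
            "stylische", "angenehmer", "tolles", "tollen", "perfekte");

    // Assumed cut-off: sentences at or above it are considered bad.
    private static final double THRESHOLD = 0.25;

    /** Fraction of tokens that are bad words — a crude proxy for low information gain. */
    public static double badness(String sentence) {
        // Split on anything that is neither a letter nor a digit (Unicode-aware,
        // so German umlauts and ß stay inside their tokens).
        String[] tokens = sentence.toLowerCase(Locale.GERMAN).split("[^\\p{L}\\p{N}]+");
        long bad = 0;
        for (String token : tokens) {
            if (BAD_WORDS.contains(token)) {
                bad++;
            }
        }
        return tokens.length == 0 ? 0.0 : (double) bad / tokens.length;
    }

    public static boolean isBad(String sentence) {
        return badness(sentence) >= THRESHOLD;
    }
}
```

With this sketch, a marketing phrase like “ein tolles und perfekte Gefühl” scores well above the threshold, while a factual sentence like “81% Baumwolle 18% Polyester 1% Elasthan” scores zero and survives.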

Implementation

All these parts are implemented in Solr and can easily be tested in the analysis view of a core. We additionally set the type of each returned token to enable further computations afterwards. The exact results we compute for the descriptions shown at the beginning of this post are the reduced sentence lists given there. This filter is used at index time only, reducing the description to the really important data, so the bad sentences no longer appear in the search results. Yipii!
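Wiring such a filter into a schema might look like the following fragment. Note that the factory class name, its attributes, and the file name are hypothetical stand-ins for our internal implementation, not a published plugin:

```xml
<fieldType name="description" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- keep the whole description as one token so the filter sees full sentences -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- hypothetical factory: splits into sentences, scores them, drops the bad ones -->
    <filter class="s24.solr.BadSentenceFilterFactory"
            badwords="badwords.txt" threshold="0.25"/>
    <!-- further word-level tokenization and filters omitted -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the filter in the index-time analyzer only means queries are analyzed normally and nothing bad ever reaches the index.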