shopping24 tech blog

s is for shopping

February 07, 2014 / by Tobias Kässmann / Software engineer / @kaessmannt

Big bad data - catching your entities in the context of e-commerce

We took a deep dive into the domain of information retrieval and picked out named entity recognition. There is a naive idea that it will solve all the entity extraction problems we have with our product texts: extracting unknown brands, sizes, colors, and so on. Join us on our journey through named entity recognition.

Named entity recog… what?

Imagine you’ve got a really good-looking piece of text like:

Jimmy John moved to Hamburg and got the opportunity to work for the great company shopping24.

When you read this, you probably know from experience that “Hamburg” is a city, “shopping24” (hopefully) a great company, and “Jimmy John” a name. So if you recognize those entities, you can annotate them in the text, e.g. with XML, and get the following result:

<name>Jimmy John</name> moved to <city>Hamburg</city> and got the opportunity to work for the great company <company>shopping24</company>.

This whole procedure is what named entity recognition (NER) does. It expects plain text and a model (which word is a company, a city, a name, …) as input and generates annotated output.

Models

What kind of model knows which word is which entity? It’s a collection of information about how these entities are defined. You have to generate (train) this model beforehand, in the best case with a huge annotated corpus.

The following information about a word is used to identify its entity type 1:

  • The word itself
  • Previous and next word
  • Words within a window
  • Orthographic features (Jimmy -> Xxxxx)
  • Prefixes and Suffixes (Jimmy -> <J,<Ji,<Jim,…,my>,y>)
  • Label sequences
  • Lots of feature conjunctions

An implementation of a well-working recognizer is provided by the Stanford NER framework. You can find predefined models as well as the recognizer itself to run your first tests.
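For a first test with one of the predefined models, a recognizer call looks roughly like this. This is a minimal sketch following the Stanford NER demo code; the model file name is the one shipped in the Stanford NER download and may differ in your version:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class NerFirstTest {
    public static void main(String[] args) throws Exception {
        // Load one of the predefined English models (3 classes: PERSON, LOCATION, ORGANIZATION).
        AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String text = "Jimmy John moved to Hamburg and got the opportunity "
                + "to work for the great company shopping24.";

        // Prints the text with inline XML tags, e.g. <LOCATION>Hamburg</LOCATION>.
        System.out.println(classifier.classifyWithInlineXML(text));
    }
}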

Real world example

Previously we said that we need an entity recognizer, e.g. to find unknown brands. So we dream of a ready-made model that identifies brands in descriptions or titles, and we give this technology a try.

These entities can be used to tag products with a brand, attributes and other things, and to push our own search results toward the best possible quality. Another idea is to find similar products and improve the quality of our data.

Catch the entities!

We found no models that fit our needs, so we started to train our own. As a first step we just want to recognize brands in product titles and descriptions.

Training

We start training with the following annotated data:

Title:
<brand>Sharp</brand> <id>LC-39LE352E</id> 98cm (39") LED-TV -Full-HD, 100 Hz, USB Recording, Triple Tuner

Description:
Full HD LED/LCD-TV mit Triple Tuner zum TOP Preis/Leistungsverhältnis / 39 Zoll - 98 cm / Auflösung 1920 x 1080p / Edge LED Technologie / Smart TV / USB Video Recording / 100 Hz Bildwiederholrate / USB-Recording / AQUOS NET + und HbbTV / 3x <port>HDMI</port>, <port>SCART</port>, <port>PC Eingang</port>, 1x <port>USB</port> / <port>LAN</port> / <port>WLAN</port> ready / Energieeffizienzklasse A

The Stanford NER framework then created a model from this data.
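Under the hood, the training step works roughly as sketched below. Stanford NER trains on one token per line with a tab-separated label (O for “no entity”), so annotated data like the above has to be converted first. The file names and the chosen feature flags are illustrative assumptions, not our exact configuration:

import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class TrainProductModel {
    public static void main(String[] args) throws Exception {
        // Training data: one token per line, tab-separated from its label, e.g.
        // (converted from the annotated title above):
        //
        //   Sharp         brand
        //   LC-39LE352E   id
        //   98cm          O
        //   LED-TV        O
        //   ...
        Properties props = new Properties();
        props.setProperty("trainFile", "products-train.tsv");        // assumed file name
        props.setProperty("serializeTo", "product-ner-model.ser.gz"); // assumed file name
        props.setProperty("map", "word=0,answer=1");
        // Some of the features listed above, expressed as Stanford NER flags:
        props.setProperty("useWord", "true");           // the word itself
        props.setProperty("usePrev", "true");           // previous word
        props.setProperty("useNext", "true");           // next word
        props.setProperty("useDisjunctive", "true");    // words within a window
        props.setProperty("wordShape", "chris2useLC");  // orthographic features (Jimmy -> Xxxxx)
        props.setProperty("useNGrams", "true");         // prefixes and suffixes
        props.setProperty("maxNGramLeng", "6");
        props.setProperty("usePrevSequences", "true");  // label sequences

        CRFClassifier<CoreLabel> crf = new CRFClassifier<>(props);
        crf.train();                                     // reads trainFile from the properties
        crf.serializeClassifier("product-ner-model.ser.gz");
    }
}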

Recognition

With our model ready, we started to feed plain descriptions and titles into the “black box” and crossed our fingers. After trying a lot of different options for the framework, we got the first result: no new brands detected. We gave it another try with a different model and different attributes: ports of TVs (e.g. USB, HDMI, …) and IDs (e.g. LC-39LE352E).
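Feeding a product title into the trained model looks like this; again only a sketch, with the model file name taken from the training sketch above:

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class RecognizeProductEntities {
    public static void main(String[] args) throws Exception {
        // Load our own serialized model (file name assumed, see the training sketch).
        AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("product-ner-model.ser.gz");

        String title = "Sharp LC-39LE352E 98cm (39\") LED-TV - Full-HD, 100 Hz, "
                + "USB Recording, Triple Tuner";

        // Ideally this prints <brand>Sharp</brand> <id>LC-39LE352E</id> ... -
        // in practice it mostly did not.
        System.out.println(classifier.classifyWithInlineXML(title));
    }
}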

Result: a few IDs, but no new TV ports. Now we realized that this is not the silver bullet.

Conclusion a.k.a. structure, structure, structure

NER is a really good technology to extract entities from structured text! But in an e-commerce context: “Sorry, no structure”. Now we understand why we haven’t found a single model on the web for product texts: they are too difficult and too unstructured to create a model for these entities. NER works best in e-commerce when words appear in frequent patterns.

And here is the reason why we found some IDs but no brands/ports: IDs have a little orthographic structure! Assume that each letter is an ‘X’ and each digit a ‘Y’:

LC-39LE352E -> XX-YYXXYYYX
50L2333DG -> YYXYYYYXX
42FU5555S -> YYXXYYYYX
TX-P42X60E -> XX-XYYXYYX

Every time it’s a combination of uppercase letters and digits -> structure found!
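This orthographic pattern is cheap to compute yourself; a minimal sketch (the class and method names are ours, just for illustration):

public class WordShape {

    // Map every letter to 'X' and every digit to 'Y', keep all other characters,
    // e.g. "LC-39LE352E" -> "XX-YYXXYYYX".
    static String shape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append('X');
            } else if (Character.isDigit(c)) {
                sb.append('Y');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String id : new String[] {"LC-39LE352E", "50L2333DG", "42FU5555S", "TX-P42X60E"}) {
            System.out.println(id + " -> " + shape(id));
        }
    }
}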

But ports/brands do not have enough structure (even in the surrounding words, …) to identify them in this kind of text. Except in one case: when the brand or another attribute always appears at the same position, e.g. when some partners place the brand as a prefix of the title. But what happens when you have two-word brands, three-word brands, …? Oh heck.

So we experimented a bit and found the technology that fits best in the e-commerce domain: regex! We found out that complex regular expressions can cover a subset of what a named entity recognizer does, but without the noise produced by the surrounding characters, words, positions and orthography. And that noise is characteristic of most e-commerce texts.
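A rough sketch of what such an extraction looks like; the concrete pattern is only an assumption modeled on the IDs above, not the expression we use in production:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdExtractor {

    // Rough pattern for TV model IDs like "LC-39LE352E": blocks of uppercase letters
    // and digits, optionally separated by a dash; the lookahead requires at least one
    // digit so that plain uppercase words are not matched (illustrative only).
    private static final Pattern MODEL_ID =
            Pattern.compile("\\b(?=[A-Z0-9-]*\\d)[A-Z0-9]{2,4}-?[A-Z0-9]{5,10}\\b");

    public static void main(String[] args) {
        String title = "Sharp LC-39LE352E 98cm (39\") LED-TV - Full-HD, 100 Hz, USB Recording, Triple Tuner";
        Matcher matcher = MODEL_ID.matcher(title);
        while (matcher.find()) {
            System.out.println(matcher.group());   // prints: LC-39LE352E
        }
    }
}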

Further steps

Besides NER algorithms, there are algorithms that detect the grammatical structure of texts; this technology is called part-of-speech tagging. Can we use it in other areas? Oh… YES! Another problem is to recognize bad-smelling texts in product descriptions. Bad-smelling texts look like this:

This is a great dress with a lot of colors that fits great in combination of a scarf or a yellow hat.

We use the great Solr to search through our big data and find the best results. These bad-smelling sentences often reduce the quality of our search results, because Solr also considers descriptions (as we have configured it). If we can identify these bad texts and remove them, we can deliver much more precise results. Challenge accepted.
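As a rough idea of how part-of-speech tagging could help here, the following is a sketch assuming the Stanford POS tagger with its standard English model; the adjective-counting heuristic is our own assumption, not a finished solution:

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class BadSmellDetector {
    public static void main(String[] args) {
        // Standard English model shipped with the Stanford POS tagger (path assumed).
        MaxentTagger tagger = new MaxentTagger("models/english-left3words-distsim.tagger");

        String sentence = "This is a great dress with a lot of colors that fits great "
                + "in combination of a scarf or a yellow hat.";

        // Output looks like "This_DT is_VBZ a_DT great_JJ dress_NN ...".
        String tagged = tagger.tagString(sentence);

        // Naive heuristic (assumption): many adjectives (JJ) and hardly any numbers
        // or attribute-like tokens smell like marketing fluff.
        long adjectives = java.util.Arrays.stream(tagged.split("\\s+"))
                .filter(token -> token.endsWith("_JJ"))
                .count();
        System.out.println("Adjectives: " + adjectives);
    }
}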

Footnotes