shopping24 tech blog

s is for shopping

January 30, 2014 / by Johanna Voigt

Extracting product information from unstructured text

Online shops, and especially product search engines, benefit from structured product data because their users can find, filter and compare products more easily. Unfortunately, most product information is hidden in long, heterogeneous descriptions. The challenge is to extract, normalize and monitor this information. Besides the approaches described in "Crowdsourcing + prediction engine = better product information at low costs", there is also the possibility of simply using regular expressions combined with, for example, PHP functions.

Choose Attributes

In the first step you need to define the product group(s), the attributes and the matching attribute values you want to extract. Our subject matter will be the women's favorite: high heels and their heel height.

Finding Regular Expressions

  1. How many values are possible for one product? In this case: a shoe has just one heel height –> MATCH_ONE
  2. How is the attribute described? Which words and patterns follow each other? –> regex

Config

[heel height]
type				= MATCH_ONE
regex				= "Heel(?:\s*height)?:?\s*(?:approx\.\s*)?(\d+(?:[.,]\d+)?\s*(?:cm|mm))|(\d+(?:[.,]\d+)?\s*(?:mm|cm))\s* heel[^h:]"
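To see what such a rule does on real text, here is a minimal Python sketch (not the PHP setup used in production). It uses the config's pattern in a form that tolerates whitespace inside "Heel height" and escapes the dot after "approx". The helper name `match_one` and the reading of MATCH_ONE as "take the first hit" are assumptions made for illustration.

```python
import re

# Heel-height pattern from the config, written so that "Heel height" may
# contain whitespace and the dot after "approx" is matched literally.
HEEL_HEIGHT = re.compile(
    r"Heel(?:\s*height)?:?\s*(?:approx\.\s*)?(\d+(?:[.,]\d+)?\s*(?:cm|mm))"
    r"|(\d+(?:[.,]\d+)?\s*(?:mm|cm))\s* heel[^h:]"
)


def match_one(pattern, text):
    """Return the first captured value in text, or None if nothing matches.

    Hypothetical helper: assumes MATCH_ONE means "use the first match only".
    """
    m = pattern.search(text)
    if m is None:
        return None
    # The pattern has two alternatives; exactly one of them captured a value.
    return next(g for g in m.groups() if g is not None)


print(match_one(HEEL_HEIGHT, "Elegant pump. Heel height: approx. 8,5 cm, leather."))
print(match_one(HEEL_HEIGHT, "Sandal with 90 mm heel, black"))
```

The raw values still differ in unit and decimal separator, which is why the next step is normalization.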

Think about Normalization

To use the attributes and their values in filters, normalization is mandatory. Otherwise your filter might look like the following example (tons of different values in the wrong order):

Filter without Normalization

Heel Height:
 	[] 0.30 cm
 	[] 0.1 m
 	[x] 1,3cm
 	[x] 1.3cm
 	[x] 1,3 cm
 	[] 1.35 cm
 	[x] 1,30 cm
 	[] 10 mm
 	[] 2 cm

Normalization

Therefore, check which different values your regex returns and write some functions to normalize them. For example:

  1. remove whitespace
  2. convert the units, e.g. 30 mm to 3 cm
  3. format the decimal number
  4. cluster your data, e.g. 0-1, 1-3, 3-5, 5-7, 7-10

PHP Functions

functions[]			= "remove_whitespace"	
functions[]			= "norm_cm"
functions[]			= "format_decimal"
functions[]			= "cluster( cm,1,3,5,7,10)"
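The four steps can be sketched in a few lines. The following is an illustrative Python version, not the actual PHP functions. The function names mirror the config entries, but their signatures, the one-decimal format and the choice to treat upper cluster borders as inclusive are assumptions.

```python
import re
from decimal import Decimal


def remove_whitespace(value):
    """'1,3 cm' -> '1,3cm'"""
    return re.sub(r"\s+", "", value)


def norm_cm(value):
    """Convert a whitespace-free value such as '30mm' or '1,3cm' to centimetres."""
    m = re.fullmatch(r"(\d+(?:[.,]\d+)?)(mm|cm)", value)
    if m is None:
        raise ValueError(f"unexpected value: {value!r}")
    number = Decimal(m.group(1).replace(",", "."))
    return number / 10 if m.group(2) == "mm" else number


def format_decimal(cm):
    """Render with one decimal place so that 1.3 and 1.30 become the same string."""
    return f"{cm:.1f}"


def cluster(cm, borders=(1, 3, 5, 7, 10), unit="cm"):
    """Map a number to a range label; upper borders are inclusive (assumption)."""
    lower = 0
    for upper in borders:
        if cm <= upper:
            return f"{lower}-{upper} {unit}"
        lower = upper
    return f"> {lower} {unit}"


def normalize(raw):
    """Run the steps in order and return (formatted value, cluster label)."""
    cm = norm_cm(remove_whitespace(raw))
    return format_decimal(cm), cluster(cm)


for raw in ["1,3 cm", "1.30cm", "30 mm", "120mm"]:
    print(raw, "->", normalize(raw))
```

In this sketch the cluster step works on the numeric value rather than on the formatted string, which avoids parsing the number a second time.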

Filter with Normalization

Heel Height:
 	[] 0-1 cm
 	[] 1-3 cm
 	[] 3-5 cm
 	[x] 5-7 cm
 	[x] 7-10 cm
 	[] > 10 cm

QA

To ensure attribute quality, it is very important to keep improving the regular expressions until the quality and quantity goals are reached. In addition, it might be helpful to write some more functions, for example “define_borders” or “norm_units”. As soon as the program has been deployed and has moved into day-to-day business, you should integrate some data into your monitoring system (How many products come with an attribute? How many different values are found per attribute?).

  1. Grab samples and improve your regular expressions
  2. More functions
  3. Monitoring: ratios, outliers
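The two monitoring questions translate directly into numbers you can track over time. A minimal Python sketch, assuming each product is a dict of extracted attributes (the data layout and the function name `attribute_stats` are hypothetical):

```python
from collections import Counter


def attribute_stats(products, attribute):
    """Compute coverage and value distribution for one attribute.

    products: list of dicts mapping attribute names to normalized values
    (an assumed data layout, for illustration only).
    """
    values = [p[attribute] for p in products if p.get(attribute) is not None]
    coverage = len(values) / len(products) if products else 0.0
    return {
        "coverage": coverage,                  # share of products with the attribute
        "distinct_values": len(set(values)),   # how many different values occur
        "counts": Counter(values),             # rare values hint at outliers
    }


products = [
    {"heel height": "5-7 cm"},
    {"heel height": "5-7 cm"},
    {"heel height": "7-10 cm"},
    {},  # no heel height could be extracted
]
print(attribute_stats(products, "heel height"))
```

A sudden drop in coverage or a jump in the number of distinct values is a good signal to grab new samples and revisit the regular expressions.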