Tutorial

This tutorial shows how to use PyMinHash to find matching strings.

First, import pandas and adjust some display settings.

[1]:
%config Completer.use_jedi = False

import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)

PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies.

[2]:
from pyminhash.datasets import load_data
df = load_data()
df.head()
[2]:
name
0 adidas ag adi dassler strasse 1 91074 germany
1 adidas ag adi dassler strasse 1 91074 herzogenaurach
2 adidas ag adi dassler strasse 1 91074 herzogenaurach germany
3 airbus se 2333 cs leiden netherlands
4 airbus se 2333 cs netherlands

We’re going to match the various representations that belong to the same company. To do this, we import the MinHash class, create a MinHash object and tell it to use 10 hash tables. More hash tables give a more accurate Jaccard similarity estimate but also require more time and memory.

[3]:
from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)
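
For reference, the textbook Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below computes it exactly on whitespace-separated words. The helper name `exact_jaccard` and the word-level tokenisation are assumptions made for illustration and are not part of PyMinHash; the library's own tokenisation and scoring may differ, so these values need not match the `jaccard_sim` column it returns.

```python
def exact_jaccard(text_a, text_b):
    """Textbook Jaccard similarity on word sets (illustrative helper, not PyMinHash API)."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a | words_b
    if not union:
        # Two empty strings: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(words_a & words_b) / len(union)


# Two rows taken from the toy dataset shown above.
a = "airbus se 2333 cs leiden netherlands"
b = "airbus se 2333 cs netherlands"
print(exact_jaccard(a, b))  # 5 shared words out of 6 distinct words
```

Computing this exactly for every pair of rows becomes expensive on large datasets, which is why an approximation based on hash tables is attractive.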

The fit_predict method needs the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:

[7]:
result = myHasher.fit_predict(df, 'name')
result.head()
[7]:
(index) | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim
632 | 23 | 24 | banco santander s a 28660 | banco santander s a 28660 madrid | 1.0
2 | 1 | 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach | adidas ag adi dassler strasse 1 91074 herzogenaurach germany | 1.0
204 | 10 | 11 | amadeus it group s a salvador de madariaga 1 28027 madrid | amadeus it group s a salvador de madariaga 1 28027 madrid spain | 1.0
271 | 74 | 75 | kering sa 40 rue de sevres 75007 paris | kering sa 40 rue de sevres 75007 paris france | 1.0
623 | 20 | 22 | banco bilbao vizcaya argentaria s a 48005 bilbao spain | banco bilbao vizcaya argentaria s a plaza san nicolas 4 48005 spain | 1.0
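
Because the result contains every pair with a non-zero similarity, it is usually narrowed down before use. The sketch below shows one way to do that with ordinary pandas filtering. To stay self-contained it builds a small stand-in for `result` from three rows of the tutorial's output; the `threshold` value of 0.7 is a hypothetical cut-off, not a recommendation from PyMinHash.

```python
import pandas as pd

# Stand-in for the `result` dataframe returned by fit_predict,
# built from three rows that appear in the tutorial's output.
result = pd.DataFrame({
    'row_number_1': [23, 3, 74],
    'row_number_2': [24, 4, 120],
    'name_1': [
        'banco santander s a 28660',
        'airbus se 2333 cs leiden netherlands',
        'kering sa 40 rue de sevres 75007 paris',
    ],
    'name_2': [
        'banco santander s a 28660 madrid',
        'airbus se 2333 cs netherlands',
        'telefonica s a ronda de la comunicacion 28050 madrid spain',
    ],
    'jaccard_sim': [1.0, 0.7, 0.2],
})

threshold = 0.7  # hypothetical cut-off; tune it for your own data

# Keep only the pairs at or above the threshold, best matches first.
likely_matches = (
    result[result['jaccard_sim'] >= threshold]
    .sort_values('jaccard_sim', ascending=False)
)
print(likely_matches[['name_1', 'name_2', 'jaccard_sim']])
```

In this stand-in, the pair that links Kering to Telefonica falls below the cut-off and is dropped, while the two same-company pairs are kept.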

As one can see below, for a Jaccard similarity of 1.0, all words in the shorter string appear in the longer string. For lower Jaccard similarity values, the match is less than perfect. Note that the Jaccard similarity has a granularity of 1/n_hash_tables, in this example 0.1.

[9]:
result.groupby('jaccard_sim').head(2)
[9]:
(index) | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim
632 | 23 | 24 | banco santander s a 28660 | banco santander s a 28660 madrid | 1.0
2 | 1 | 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach | adidas ag adi dassler strasse 1 91074 herzogenaurach germany | 1.0
664 | 87 | 88 | linde plc 10 priestley road surrey research park gu2 7xy guildford | linde plc 10 priestley road surrey research park gu2 7xy united kingdom | 0.9
35 | 48 | 49 | deutsche post ag platz der deutschen post | deutsche post ag platz der deutschen post 53113 germany | 0.9
194 | 114 | 115 | siemens aktiengesellschaft werner von siemens strasse 1 80333 germany | siemens aktiengesellschaft werner von siemens strasse 1 80333 munich germany | 0.8
704 | 71 | 72 | intesa sanpaolo s p a piazza san carlo 156 to 10121 italy | intesa sanpaolo s p a piazza san carlo 156 to 10121 turin | 0.8
98 | 3 | 4 | airbus se 2333 cs leiden netherlands | airbus se 2333 cs netherlands | 0.7
241 | 55 | 56 | engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 1 place samuel de champlain 92400 france | 0.7
96 | 131 | 133 | volkswagen ag 38440 germany | volkswagen ag berliner ring 2 38440 wolfsburg | 0.6
686 | 46 | 47 | deutsche boerse 60485 frankfurt | deutsche boerse frankfurt | 0.6
659 | 37 | 38 | crh plc stonemason s way 16 dublin ireland | crh plc stonemason s way 16 ireland | 0.5
234 | 55 | 57 | engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 92400 courbevoie france | 0.5
291 | 67 | 119 | industria de diseno textil s a avenida de la diputacion s n arteixo 15143 a coruna spain | telefonica s a ronda de la comunicacion | 0.4
706 | 72 | 73 | intesa sanpaolo s p a piazza san carlo 156 to 10121 turin | intesa sanpaolo s p a turin | 0.4
176 | 32 | 113 | bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich | siemens aktiengesellschaft munich germany | 0.3
518 | 18 | 106 | axa sa 75008 paris | sanofi 75008 france | 0.3
305 | 74 | 120 | kering sa 40 rue de sevres 75007 paris | telefonica s a ronda de la comunicacion 28050 madrid spain | 0.2
306 | 75 | 120 | kering sa 40 rue de sevres 75007 paris france | telefonica s a ronda de la comunicacion 28050 madrid spain | 0.2
575 | 106 | 123 | sanofi 75008 france | total s a 2 place jean millier paris france | 0.1
571 | 100 | 123 | orange s a paris | total s a 2 place jean millier paris france | 0.1
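
The 1/n_hash_tables granularity follows from how a MinHash estimate is typically formed: each hash table contributes one minimum hash value per string, and the estimate is the fraction of tables in which the two strings agree. The sketch below illustrates that idea under stated assumptions. The word-level tokens, the MD5-based seeded hash and all function names are hypothetical choices for illustration; they are not taken from PyMinHash's source, and the numbers it produces are not expected to reproduce the `jaccard_sim` column above.

```python
import hashlib


def seeded_hash(token, seed):
    """Deterministic hash of a token for a given seed (one seed per hash table)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(text, n_hash_tables):
    """One minimum hash value per hash table; assumes a non-empty string."""
    tokens = set(text.split())
    return [
        min(seeded_hash(token, seed) for token in tokens)
        for seed in range(n_hash_tables)
    ]


def estimated_jaccard(text_a, text_b, n_hash_tables):
    """Fraction of hash tables in which both strings share the same minimum."""
    sig_a = minhash_signature(text_a, n_hash_tables)
    sig_b = minhash_signature(text_b, n_hash_tables)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / n_hash_tables


a = "airbus se 2333 cs leiden netherlands"
b = "airbus se 2333 cs netherlands"

# With 10 tables the estimate can only be 0.0, 0.1, ..., 1.0.
print(estimated_jaccard(a, b, 10))
# With 100 tables the steps shrink to 0.01, at the cost of more hashing work.
print(estimated_jaccard(a, b, 100))
```

Since the number of agreeing tables is an integer between 0 and n_hash_tables, the estimate can only move in steps of 1/n_hash_tables. Using more tables gives finer steps and a less noisy estimate, in exchange for more computation and memory.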