Tutorial
This tutorial shows how to use PyMinHash to find matching strings.
First, import pandas and set a few display options.
[1]:
%config Completer.use_jedi = False
import pandas as pd
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 200)
PyMinHash comes with a toy dataset containing various name and address combinations of Stoxx50 companies.
[2]:
from pyminhash.datasets import load_data
df = load_data()
df.head()
[2]:
 | name |
---|---|
0 | adidas ag adi dassler strasse 1 91074 germany |
1 | adidas ag adi dassler strasse 1 91074 herzogenaurach |
2 | adidas ag adi dassler strasse 1 91074 herzogenaurach germany |
3 | airbus se 2333 cs leiden netherlands |
4 | airbus se 2333 cs netherlands |
We’re going to match various representations that belong to the same company. For this, we create a MinHash object and tell it to use 10 hash tables. More hash tables mean a more accurate Jaccard similarity calculation, but they also require more time and memory.
[3]:
from pyminhash.pyminhash import MinHash
myHasher = MinHash(n_hash_tables=10)
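To illustrate why more hash tables give a more accurate similarity, here is a minimal pure-Python sketch of the general MinHash idea. It is not PyMinHash's implementation: the whitespace tokenization, the seeded blake2b hash, and the helper names (tokens, hash_token, signature, estimated_jaccard, exact_jaccard) are all assumptions made for illustration.

```python
import hashlib


def tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenization)."""
    return set(text.lower().split())


def hash_token(token, seed):
    """Deterministic 64-bit hash of a token; each seed plays the role of one hash table."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(text, n_hash_tables):
    """MinHash signature: the smallest token hash for each seed (text must be non-empty)."""
    toks = tokens(text)
    return [min(hash_token(t, seed) for t in toks) for seed in range(n_hash_tables)]


def estimated_jaccard(a, b, n_hash_tables=10):
    """Fraction of signature positions on which the two strings agree."""
    sig_a = signature(a, n_hash_tables)
    sig_b = signature(b, n_hash_tables)
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / n_hash_tables


def exact_jaccard(a, b):
    """Exact Jaccard similarity of the two word sets: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


a = "airbus se 2333 cs leiden netherlands"
b = "airbus se 2333 cs netherlands"
print("exact:", exact_jaccard(a, b))
for n in (10, 100, 1000):
    print(n, "hash tables:", estimated_jaccard(a, b, n))
```

In this sketch the estimate is always a multiple of 1/n_hash_tables, and it tends to move closer to the exact value as the number of hash tables grows, at the cost of computing more hashes per string.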
The fit_predict method takes the dataframe and the name of the column to which minhashing should be applied. The result is a dataframe containing all pairs that have a non-zero Jaccard similarity:
[7]:
result = myHasher.fit_predict(df, 'name')
result.head()
[7]:
 | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
---|---|---|---|---|---|
632 | 23 | 24 | banco santander s a 28660 | banco santander s a 28660 madrid | 1.0 |
2 | 1 | 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach | adidas ag adi dassler strasse 1 91074 herzogenaurach germany | 1.0 |
204 | 10 | 11 | amadeus it group s a salvador de madariaga 1 28027 madrid | amadeus it group s a salvador de madariaga 1 28027 madrid spain | 1.0 |
271 | 74 | 75 | kering sa 40 rue de sevres 75007 paris | kering sa 40 rue de sevres 75007 paris france | 1.0 |
623 | 20 | 22 | banco bilbao vizcaya argentaria s a 48005 bilbao spain | banco bilbao vizcaya argentaria s a plaza san nicolas 4 48005 spain | 1.0 |
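Because the result holds every pair with a non-zero similarity, a common next step is to keep only the pairs above some cutoff. The sketch below uses a small hand-built dataframe (result_demo, a hypothetical stand-in with the same columns as the output above, filled with rows shown in this tutorial) so that it runs without PyMinHash. The cutoff of 0.8 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the fit_predict output, built by hand from rows in this tutorial.
result_demo = pd.DataFrame({
    "row_number_1": [23, 3, 18],
    "row_number_2": [24, 4, 106],
    "name_1": [
        "banco santander s a 28660",
        "airbus se 2333 cs leiden netherlands",
        "axa sa 75008 paris",
    ],
    "name_2": [
        "banco santander s a 28660 madrid",
        "airbus se 2333 cs netherlands",
        "sanofi 75008 france",
    ],
    "jaccard_sim": [1.0, 0.7, 0.3],
})

threshold = 0.8  # arbitrary cutoff; tune it for your own data

# Keep only pairs at or above the cutoff, best matches first.
matches = (
    result_demo[result_demo["jaccard_sim"] >= threshold]
    .sort_values("jaccard_sim", ascending=False)
    .reset_index(drop=True)
)
print(matches)
```

The same boolean filter should work on the real result dataframe, since it only relies on the jaccard_sim column.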
As shown below, for a Jaccard similarity of 1.0, all words in the shorter string appear in the longer string. For lower Jaccard similarity values, the match is less than perfect. Note that the reported Jaccard similarity has a granularity of 1/n_hash_tables, which is 0.1 in this example.
[9]:
result.groupby('jaccard_sim').head(2)
[9]:
 | row_number_1 | row_number_2 | name_1 | name_2 | jaccard_sim |
---|---|---|---|---|---|
632 | 23 | 24 | banco santander s a 28660 | banco santander s a 28660 madrid | 1.0 |
2 | 1 | 2 | adidas ag adi dassler strasse 1 91074 herzogenaurach | adidas ag adi dassler strasse 1 91074 herzogenaurach germany | 1.0 |
664 | 87 | 88 | linde plc 10 priestley road surrey research park gu2 7xy guildford | linde plc 10 priestley road surrey research park gu2 7xy united kingdom | 0.9 |
35 | 48 | 49 | deutsche post ag platz der deutschen post | deutsche post ag platz der deutschen post 53113 germany | 0.9 |
194 | 114 | 115 | siemens aktiengesellschaft werner von siemens strasse 1 80333 germany | siemens aktiengesellschaft werner von siemens strasse 1 80333 munich germany | 0.8 |
704 | 71 | 72 | intesa sanpaolo s p a piazza san carlo 156 to 10121 italy | intesa sanpaolo s p a piazza san carlo 156 to 10121 turin | 0.8 |
98 | 3 | 4 | airbus se 2333 cs leiden netherlands | airbus se 2333 cs netherlands | 0.7 |
241 | 55 | 56 | engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 1 place samuel de champlain 92400 france | 0.7 |
96 | 131 | 133 | volkswagen ag 38440 germany | volkswagen ag berliner ring 2 38440 wolfsburg | 0.6 |
686 | 46 | 47 | deutsche boerse 60485 frankfurt | deutsche boerse frankfurt | 0.6 |
659 | 37 | 38 | crh plc stonemason s way 16 dublin ireland | crh plc stonemason s way 16 ireland | 0.5 |
234 | 55 | 57 | engie sa 1 place samuel de champlain 92400 courbevoie | engie sa 92400 courbevoie france | 0.5 |
291 | 67 | 119 | industria de diseno textil s a avenida de la diputacion s n arteixo 15143 a coruna spain | telefonica s a ronda de la comunicacion | 0.4 |
706 | 72 | 73 | intesa sanpaolo s p a piazza san carlo 156 to 10121 turin | intesa sanpaolo s p a turin | 0.4 |
176 | 32 | 113 | bayerische motoren werke aktiengesellschaft petuelring 130 80788 munich | siemens aktiengesellschaft munich germany | 0.3 |
518 | 18 | 106 | axa sa 75008 paris | sanofi 75008 france | 0.3 |
305 | 74 | 120 | kering sa 40 rue de sevres 75007 paris | telefonica s a ronda de la comunicacion 28050 madrid spain | 0.2 |
306 | 75 | 120 | kering sa 40 rue de sevres 75007 paris france | telefonica s a ronda de la comunicacion 28050 madrid spain | 0.2 |
575 | 106 | 123 | sanofi 75008 france | total s a 2 place jean millier paris france | 0.1 |
571 | 100 | 123 | orange s a paris | total s a 2 place jean millier paris france | 0.1 |
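To turn matched pairs into groups of rows that belong to the same company, one option is to merge pairs above a cutoff with a small union-find structure. This is a sketch of one possible post-processing step, not something PyMinHash provides: the function name group_matches and the cutoff of 0.5 are assumptions, and the pairs are copied by hand from the tables above.

```python
def group_matches(pairs, threshold):
    """Group row numbers that are linked by pairs with similarity >= threshold."""
    parent = {}

    def find(x):
        # Register unseen rows as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for row_1, row_2, sim in pairs:
        root_1, root_2 = find(row_1), find(row_2)
        if sim >= threshold and root_1 != root_2:
            parent[root_2] = root_1  # merge the two groups

    groups = {}
    for row in parent:
        groups.setdefault(find(row), []).append(row)
    return sorted(sorted(group) for group in groups.values())


# (row_number_1, row_number_2, jaccard_sim) triples taken from the tables above.
pairs = [
    (23, 24, 1.0),
    (1, 2, 1.0),
    (55, 56, 0.7),
    (55, 57, 0.5),
    (18, 106, 0.3),
]

print(group_matches(pairs, threshold=0.5))
# → [[1, 2], [18], [23, 24], [55, 56, 57], [106]]
```

Merging is transitive, so a low cutoff can chain unrelated companies together. In the table above, for example, a cutoff of 0.4 would link row 67 with row 119, which are different companies.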