pyminhash

class pyminhash.pyminhash.MinHash(n_hash_tables: int = 10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word', **kwargs)

Bases: object

Class that applies minhashing to a string column of a Pandas dataframe. Tokenization is done by Scikit-Learn’s CountVectorizer.

Parameters

n_hash_tables – number of hash tables
ngram_range – the lower and upper boundary of the range of n-values for the n-grams to be extracted
analyzer – {‘word’, ‘char’, ‘char_wb’}, whether the features should be made of word n-grams or character n-grams; ‘char_wb’ creates character n-grams only from text inside word boundaries
**kwargs – other CountVectorizer arguments
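
To illustrate how analyzer and ngram_range shape the tokens that get hashed, here is a minimal pure-Python sketch. It only approximates CountVectorizer's 'word' and 'char' analyzers (lowercasing, words of two or more characters, whitespace collapsing); the helper names word_ngrams and char_ngrams are hypothetical and are not part of pyminhash or Scikit-Learn.

```python
import re


def word_ngrams(text, ngram_range=(1, 1)):
    """Approximate the 'word' analyzer: lowercase, keep words of 2+ characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, ngram_range=(1, 1)):
    """Approximate the 'char' analyzer: lowercase, collapse whitespace runs."""
    text = re.sub(r"\s+", " ", text.lower())
    lo, hi = ngram_range
    return [
        text[i:i + n]
        for n in range(lo, hi + 1)
        for i in range(len(text) - n + 1)
    ]


# Unigrams and bigrams of words
print(word_ngrams("Jane Doe Ltd", ngram_range=(1, 2)))
# Character bigrams, including those that span the space
print(char_ngrams("ab c", ngram_range=(2, 2)))
```

Character n-grams tend to be more forgiving of typos than word n-grams, at the cost of producing more (partial) matches.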

fit_predict(df: pandas.core.frame.DataFrame, col_name: str) → pandas.core.frame.DataFrame

Create pairs of rows in Pandas dataframe df whose values in column col_name have a non-zero Jaccard similarity. The Jaccard similarity of each pair is added in the column jaccard_sim.

Parameters

df – Pandas dataframe
col_name – column name to use for matching

Returns

Pandas dataframe containing pairs of (partial) matches
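
As a rough illustration of what "pairs with a non-zero Jaccard similarity" means, the sketch below computes exact Jaccard similarities by brute force over whitespace-separated word tokens. It is not the library's implementation: the class uses minhashing, which presumably avoids comparing every pair of rows. The helpers jaccard and nonzero_pairs, and the sample names, are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nonzero_pairs(texts):
    """Return (row_i, row_j, similarity) for all pairs sharing at least one token."""
    token_sets = [set(t.lower().split()) for t in texts]
    pairs = []
    for i, j in combinations(range(len(texts)), 2):
        sim = jaccard(token_sets[i], token_sets[j])
        if sim > 0:
            pairs.append((i, j, sim))
    return pairs


names = ["jane doe", "jane smith", "john roe"]
pairs = nonzero_pairs(names)
# Only rows 0 and 1 share a token ("jane"): 1 shared token out of 3 distinct ones
print(pairs)
```

The brute-force loop compares every pair of rows, so its cost grows quadratically with the number of rows; that cost is the usual motivation for using minhashing on larger dataframes.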