pyminhash

class pyminhash.pyminhash.MinHash(n_hash_tables: int = 10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word', **kwargs)

Bases: object

Class to apply minhashing to a Pandas dataframe string column. Tokenization is done by Scikit-Learn’s CountVectorizer.

Parameters
  • n_hash_tables – number of hash tables

  • ngram_range – The lower and upper boundary of the range of n-values for different n-grams to be extracted

  • analyzer – {‘word’, ‘char’, ‘char_wb’}; whether the features should be made of word n-grams or character n-grams. ‘char_wb’ creates character n-grams only from text inside word boundaries

  • **kwargs – other CountVectorizer arguments
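
To make the effect of ngram_range and analyzer concrete, the following is a minimal pure-Python sketch of word and character n-gram extraction. The helper name ngrams is hypothetical and not part of pyminhash. It only approximates CountVectorizer: for example, CountVectorizer's default token pattern drops single-character word tokens, which this sketch does not, and ‘char_wb’ is not covered.

```python
from typing import List, Tuple


def ngrams(text: str,
           ngram_range: Tuple[int, int] = (1, 1),
           analyzer: str = "word") -> List[str]:
    """Hypothetical helper: approximate word or character n-grams."""
    lo, hi = ngram_range
    lowered = text.lower()
    # 'word' splits on whitespace; anything else is treated as 'char'
    # and works on the individual characters of the lowercased string.
    units = lowered.split() if analyzer == "word" else list(lowered)
    joiner = " " if analyzer == "word" else ""
    out = []
    for n in range(lo, hi + 1):
        # Slide a window of n units over the sequence.
        for i in range(len(units) - n + 1):
            out.append(joiner.join(units[i:i + n]))
    return out


print(ngrams("Acme Corp Ltd", (1, 2), "word"))
# ['acme', 'corp', 'ltd', 'acme corp', 'corp ltd']
print(ngrams("Acme", (2, 2), "char"))
# ['ac', 'cm', 'me']
```

Character n-grams tend to be more tolerant of typos and abbreviations than word n-grams, at the cost of more features per string.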

fit_predict(df: pandas.core.frame.DataFrame, col_name: str) → pandas.core.frame.DataFrame

Create pairs of rows of the Pandas dataframe df whose values in column col_name have a non-zero Jaccard similarity. The Jaccard similarity of each pair is stored in the column jaccard_sim.

Parameters
  • df – Pandas dataframe

  • col_name – column name to use for matching

Returns

Pandas dataframe containing pairs of (partial) matches
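
For readers unfamiliar with the quantities involved, here is a minimal sketch of Jaccard similarity on token sets and of a generic MinHash estimate of it. This is the textbook technique, not pyminhash's implementation: the function names jaccard, signature and estimate are hypothetical, the salted BLAKE2 hashing is an assumption made for illustration, and how the library uses its n_hash_tables internally is not described in this reference.

```python
import hashlib
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature(tokens: Set[str], n_hashes: int) -> List[int]:
    """MinHash signature of a non-empty token set (illustrative only)."""
    sig = []
    for seed in range(n_hashes):
        # One salted hash function per position; keep the minimum over tokens.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return sig


def estimate(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = set("acme corp ltd".split())
b = set("acme corp limited".split())
print(jaccard(a, b))  # 0.5 (2 shared tokens out of 4 distinct tokens)
print(estimate(signature(a, 64), signature(b, 64)))  # approximation of 0.5
```

Two strings that share no tokens have a Jaccard similarity of zero, which is why such pairs would not appear among the (partial) matches.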