pyminhash
class pyminhash.pyminhash.MinHash(n_hash_tables: int = 10, ngram_range: Tuple[int] = (1, 1), analyzer: str = 'word', **kwargs)
Bases: object
Class to apply minhashing to a Pandas dataframe string column. Tokenization is done by Scikit-Learn’s CountVectorizer.
- Parameters
n_hash_tables – number of hash tables
ngram_range – the lower and upper bounds of the range of n-values for the n-grams to be extracted
analyzer – {'word', 'char', 'char_wb'}, whether the features should be word n-grams or character n-grams; 'char_wb' creates character n-grams only from text inside word boundaries
**kwargs – other CountVectorizer arguments
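To make the effect of ngram_range and analyzer concrete, here is a minimal pure-Python sketch of n-gram extraction. It is an illustration only, not CountVectorizer itself: the helper name `ngrams` is hypothetical, only 'word' and 'char' are covered, and the real tokenizer applies further rules (for example its default token pattern) that this sketch ignores.

```python
import re
from typing import List, Tuple


def ngrams(text: str, ngram_range: Tuple[int, int] = (1, 1),
           analyzer: str = "word") -> List[str]:
    """Simplified stand-in for n-gram extraction (hypothetical helper)."""
    lo, hi = ngram_range
    text = text.lower()
    if analyzer == "word":
        # Units are runs of word characters; n-grams are joined by a space.
        units = re.findall(r"\w+", text)
        joiner = " "
    elif analyzer == "char":
        # Units are the characters of the whitespace-normalised string.
        units = list(" ".join(text.split()))
        joiner = ""
    else:
        raise ValueError("this sketch only covers 'word' and 'char'")
    # Emit all n-grams for every n in the inclusive range [lo, hi].
    return [
        joiner.join(units[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(units) - n + 1)
    ]


word_tokens = ngrams("Big Apple Inc", ngram_range=(1, 2), analyzer="word")
char_tokens = ngrams("abc", ngram_range=(2, 2), analyzer="char")
```

With ngram_range=(1, 2) the word analyzer yields both single words and adjacent word pairs, so strings that share a word order score higher than strings that merely share words.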
fit_predict(df: pandas.core.frame.DataFrame, col_name: str) → pandas.core.frame.DataFrame
Create pairs of rows in Pandas dataframe df whose values in column col_name have a non-zero Jaccard similarity. The Jaccard similarities are stored in the column jaccard_sim of the returned dataframe.
- Parameters
df – Pandas dataframe
col_name – column name to use for matching
- Returns
Pandas dataframe containing pairs of (partial) matches
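The idea behind fit_predict can be sketched in pure Python: tokenize each string, compare token sets with the Jaccard similarity, and keep only pairs that overlap. The sketch below is an assumption-laden illustration, not the library's implementation: all helper names (`tokens`, `jaccard`, `signature`, `estimate`) are hypothetical, tokenization is a plain whitespace split, and the salted-MD5 scheme is just one simple way to build a MinHash signature with n_hash_tables hash functions.

```python
import hashlib
from itertools import combinations
from typing import List, Set


def tokens(text: str) -> Set[str]:
    # Whitespace tokenization; a stand-in for the real tokenizer.
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    # |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signature(toks: Set[str], n_hash_tables: int = 10) -> List[int]:
    # One salted hash function per table; keep the minimum hash per table.
    # Assumes toks is non-empty.
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in toks)
        for seed in range(n_hash_tables)
    ]


def estimate(sig_a: List[int], sig_b: List[int]) -> float:
    # The fraction of tables with equal minima estimates the Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


names = ["big apple inc", "big apple", "small pear ltd"]

# Exact Jaccard similarity for every pair of rows; keep non-zero pairs only.
matches = [
    (i, j, jaccard(tokens(names[i]), tokens(names[j])))
    for i, j in combinations(range(len(names)), 2)
    if jaccard(tokens(names[i]), tokens(names[j])) > 0
]

# MinHash estimate for the first two names.
approx = estimate(signature(tokens(names[0])), signature(tokens(names[1])))
```

Here only rows 0 and 1 share tokens, so `matches` holds a single pair with an exact similarity of 2/3. The MinHash estimate `approx` is a multiple of 1 / n_hash_tables, which is why more hash tables give a finer-grained estimate at a higher cost.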