Welcome to PyMinHash documentation

MinHashing is an efficient way to find similar records in a dataset based on Jaccard similarity. PyMinHash implements minhashing for Pandas dataframes. See the instructions below or the example notebook to get started.
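For readers new to the idea, the sketch below shows how a minhash signature can approximate Jaccard similarity. It is a standalone illustration in pure Python, not PyMinHash's implementation: the helper names (tokens, jaccard, minhash_signature, estimate_jaccard), the whitespace tokenisation, and the use of seeded BLAKE2b hashes are all assumptions made for the example.

```python
import hashlib


def tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenisation)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, n_hashes=128):
    """For each seeded hash function, keep the smallest hash over a non-empty set."""
    return [min(seeded_hash(seed, item) for item in items) for seed in range(n_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimum values agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = tokens("john smith and sons ltd")
b = tokens("john smith and daughters ltd")

exact = jaccard(a, b)  # 4 shared tokens out of 6 distinct tokens
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate gets closer to the exact value as n_hashes grows, at the cost of longer signatures. Comparing short fixed-length signatures instead of full token sets is what makes the approach scale to large datasets.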

Installation

Install directly from PyPI:

pip install pyminhash

or using conda-forge:

conda install -c conda-forge pyminhash

Usage

To apply record matching to the column 'name' of your Pandas dataframe df:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
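To make "row pairs with non-zero Jaccard similarity" concrete, here is a brute-force baseline that compares every pair of rows directly. It is only a sketch for intuition and does not use PyMinHash: the function name nonzero_jaccard_pairs, the word-level tokenisation, and the (row_a, row_b, similarity) output shape are hypothetical choices for this example and may differ from what fit_predict returns.

```python
from itertools import combinations


def word_set(text):
    """Lowercase word tokens of a string (assumed tokenisation)."""
    return set(text.lower().split())


def nonzero_jaccard_pairs(names):
    """Compare all row pairs and keep those that share at least one token."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        set_a, set_b = word_set(a), word_set(b)
        union = set_a | set_b
        if not union:
            continue  # both rows empty: similarity is undefined, skip
        similarity = len(set_a & set_b) / len(union)
        if similarity > 0:
            pairs.append((i, j, similarity))
    return pairs


names = ["acme trading ltd", "acme trading limited", "globex corporation"]
pairs = nonzero_jaccard_pairs(names)
print(pairs)  # [(0, 1, 0.5)]
```

This exhaustive comparison grows quadratically with the number of rows, which is the cost minhashing is designed to avoid.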
