# Matchcode Optimization:Jaccard Similarity Coefficient

## Jaccard Similarity

### Specifics

- Jaccard Index

### Summary

- The Jaccard Similarity Index is the size of the intersection divided by the size of the union of the two sample sets.

### Returns

- Percentage of similarity

- Intersection/Union

- NGRAM is the length of the common substrings this algorithm looks for. The MatchUp default is NGRAM = 2. For “ABCD” vs “GABCE”, the matching NGRAMs are “AB” and “BC”.

- Intersection is the number of NGRAMs the two strings have in common, and union is the total number of distinct NGRAMs found across both strings.
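
The calculation described above can be sketched in a few lines of Python. This is only an illustration, not MatchUp's internal code: the function names `ngrams` and `jaccard_similarity` are hypothetical, and the sketch assumes that NGRAMs are treated as a set (repeats counted once) and that comparison is case-sensitive.

```python
def ngrams(text, n=2):
    """Return the set of distinct substrings of length n found in text."""
    # Assumption: repeated n-grams count once (set semantics).
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both strings are shorter than n, nothing to compare
        return 0.0
    return len(grams_a & grams_b) / len(union)


print(sorted(ngrams("ABCD") & ngrams("GABCE")))  # ['AB', 'BC']
print(jaccard_similarity("ABCD", "GABCE"))       # 0.4
```

For “ABCD” vs “GABCE”, the NGRAMs are AB, BC, CD and GA, AB, BC, CE. Two are shared and five are distinct overall, so the result under these assumptions is 2/5, or 40%.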

### Example Matchcode Component

### Example Data

STRING1 | STRING2 | RESULT
---|---|---
Johnson | Jhnsn | Unique
Mild Hatter | Mild Hatter Wks | Match Found
Beaumarchais | Bumarchay | Unique
Apco Oil Lube 170 | Apco Oil Lube 342 | Match Found
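
To get a feel for how the example pairs above score, the following sketch computes a bigram Jaccard percentage for each one. It rests on assumptions: the helper names are hypothetical, NGRAMs are treated as a set, and case and spaces are kept exactly as typed. MatchUp's own preprocessing and match threshold are not described here, so the numbers are indicative only.

```python
def ngrams(text, n=2):
    """Distinct substrings of length n (case and spaces kept as-is)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Shared n-grams divided by all distinct n-grams of both strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# String pairs and results copied from the example data.
EXAMPLES = [
    ("Johnson", "Jhnsn", "Unique"),
    ("Mild Hatter", "Mild Hatter Wks", "Match Found"),
    ("Beaumarchais", "Bumarchay", "Unique"),
    ("Apco Oil Lube 170", "Apco Oil Lube 342", "Match Found"),
]

scores = {(s1, s2): jaccard_similarity(s1, s2) for s1, s2, _ in EXAMPLES}

for s1, s2, result in EXAMPLES:
    print(f"{s1!r} vs {s2!r}: {scores[(s1, s2)]:.1%} ({result})")
# 'Johnson' vs 'Jhnsn': 25.0% (Unique)
# 'Mild Hatter' vs 'Mild Hatter Wks': 71.4% (Match Found)
# 'Beaumarchais' vs 'Bumarchay': 46.2% (Unique)
# 'Apco Oil Lube 170' vs 'Apco Oil Lube 342': 68.4% (Match Found)
```

In this sketch the two "Match Found" pairs score noticeably higher than the two "Unique" pairs. No cut-off value is assumed, since the threshold MatchUp applies is not stated on this page.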

Scale | From | To
---|---|---
Performance | Slower | Faster
Matches | More Matches | Greater Accuracy

### Recommended Usage

- Hybrid deduper, where a single incoming record can quickly be evaluated independently against each record in an existing large master database.

- Databases created with abbreviations or similar word substitutions.

### Not Recommended For

- Large or enterprise-level batch runs. Because the algorithm must be evaluated for each record comparison, throughput will be very slow.

- Databases created via real-time data entry, where sound-alike (phonetic) errors are introduced.

### Do Not Use With

- UTF-8 data. This algorithm was ported to MatchUp with the assumption that a character equals one byte, and therefore results may not be accurate if the data contains multi-byte characters.
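
The effect of the one-character-equals-one-byte assumption can be illustrated with a small sketch. It is not MatchUp's code; the helper names are hypothetical, and the byte-level variant simply mimics what happens when bigrams are cut from raw UTF-8 bytes instead of characters.

```python
def jaccard(grams_a, grams_b):
    """Intersection size over union size for two sets of n-grams."""
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def char_bigrams(text):
    """Bigrams cut on character boundaries."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def byte_bigrams(text):
    """Bigrams cut on raw UTF-8 bytes, as a byte-oriented port would do."""
    data = text.encode("utf-8")
    return {data[i:i + 2] for i in range(len(data) - 1)}


a, b = "na\u00efve", "naive"  # the first string contains a two-byte character

print(len(a), len(a.encode("utf-8")))  # 5 6

char_score = jaccard(char_bigrams(a), char_bigrams(b))
byte_score = jaccard(byte_bigrams(a), byte_bigrams(b))
print(round(char_score, 3), round(byte_score, 3))  # 0.333 0.286
```

The multi-byte character adds an extra bigram on the byte level, so the same pair of strings yields a different score depending on how the data is sliced.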