We provide a collection of string distance metrics designed to quantify the similarity or dissimilarity between two strings. These metrics are useful wherever string comparison is needed. The available string distance metrics include:
| Metric | Description |
|---|---|
| jaro | Measures the similarity between two strings based on the number of matching characters and transpositions. |
| jaro_winkler | An extension of the Jaro metric that gives additional weight to common prefixes. |
| hamming | Measures the difference between two equal-length sequences of symbols, defined as the number of positions at which the corresponding symbols differ. |
| levenshtein | Calculates the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another. |
| damerau_levenshtein | Similar to Levenshtein distance but also allows transpositions as a valid edit operation. |
| Indel | Counts the insertions and deletions required to turn one string into the other. |
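To make two of the definitions above concrete, here is a minimal pure-Python sketch of the Hamming and Levenshtein distances. It returns raw edit counts rather than normalized scores, and the function names `hamming` and `levenshtein` are illustrative only; it is not the implementation used by the library.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # previous[j] is the distance between the prefix of `a` seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

The Levenshtein routine keeps only two rows of the dynamic-programming table, so memory use grows with the length of the second string rather than with the product of both lengths.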
Note: returned scores are distances, meaning lower values are typically considered “better” and indicate greater similarity between the strings. The distances calculated are normalized to a range between 0.0 (indicating a perfect match) and 1.0 (indicating no similarity).
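The normalization described in the note can be illustrated with the indel distance. The sketch below assumes one common convention: the raw count of insertions and deletions is divided by the combined length of both strings, which yields 0.0 for identical strings and 1.0 for strings with no characters in common. The exact normalization the library applies is not stated here, so treat `normalized_indel` and `lcs_length` as hypothetical helpers.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def normalized_indel(a: str, b: str) -> float:
    """Indel distance scaled to [0.0, 1.0] (assumed convention: divide by len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two empty strings are a perfect match
    # Every character outside the common subsequence must be inserted or deleted.
    raw = total - 2 * lcs_length(a, b)
    return raw / total


print(normalized_indel("abc", "abc"))    # 0.0
print(normalized_indel("abc", "xyz"))    # 1.0
print(normalized_indel("abcd", "abce"))  # 0.25
```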
To configure string distance metrics, you can use a YAML configuration file. The configuration structure includes:
- model_parameters: specifying model-related parameters.
- evaluation: setting the evaluation metric, the distance function, and the threshold.
- tests: defining different test scenarios and their minimum pass rates.
Here’s an example of the configuration structure:
    model_parameters:
      temperature: 0.2
      max_tokens: 64

    evaluation:
      metric: string_distance
      distance: jaro
      threshold: 0.1

    tests:
      defaults:
        min_pass_rate: 1.0

      robustness:
        add_typo:
          min_pass_rate: 0.70
        lowercase:
          min_pass_rate: 0.70
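One way to read the example configuration above is sketched below in Python. It rests on two assumptions that the text does not confirm: a sample passes when its distance is less than or equal to `threshold`, and a test passes when the share of passing samples reaches its `min_pass_rate`, with `tests.defaults` as the fallback. The dictionary mirrors the YAML by hand, and the helper names and sample distances are hypothetical.

```python
# In-memory mirror of the YAML example (written by hand, no YAML parser needed).
config = {
    "evaluation": {"metric": "string_distance", "distance": "jaro", "threshold": 0.1},
    "tests": {
        "defaults": {"min_pass_rate": 1.0},
        "robustness": {
            "add_typo": {"min_pass_rate": 0.70},
            "lowercase": {"min_pass_rate": 0.70},
        },
    },
}


def min_pass_rate_for(config, category, test_name):
    """Look up a test's min_pass_rate, falling back to tests.defaults."""
    tests = config["tests"]
    default = tests["defaults"]["min_pass_rate"]
    return tests.get(category, {}).get(test_name, {}).get("min_pass_rate", default)


def meets_min_pass_rate(distances, config, category, test_name):
    """Assumed rule: a sample passes when its distance is <= threshold."""
    if not distances:
        return False  # nothing was evaluated, so nothing can pass
    threshold = config["evaluation"]["threshold"]
    passed = sum(d <= threshold for d in distances)
    return passed / len(distances) >= min_pass_rate_for(config, category, test_name)


# Hypothetical normalized distances for four samples of the add_typo test.
distances = [0.0, 0.05, 0.08, 0.3]
print(meets_min_pass_rate(distances, config, "robustness", "add_typo"))  # True (3 of 4 pass)
```

Under these assumptions, lowering `threshold` makes each individual sample harder to pass, while raising `min_pass_rate` demands that more samples pass before the test as a whole succeeds.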