We provide a collection of string distance metrics that quantify how similar or dissimilar two strings are. These metrics are useful wherever strings need to be compared. The available string distance metrics are:
Metric Name | Description |
---|---|
jaro | Measures the similarity between two strings based on the number of matching characters and transpositions. |
jaro_winkler | An extension of the Jaro metric that gives additional weight to common prefixes. |
hamming | Measures the difference between two equal-length sequences of symbols, defined as the number of positions at which the corresponding symbols differ. |
levenshtein | Calculates the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another. |
damerau_levenshtein | Similar to Levenshtein distance but allows transpositions as a valid edit operation. |
Indel | Counts the minimum number of insertions and deletions required to transform one string into another (substitutions are not allowed). |
Note: returned scores are distances, meaning lower values are typically considered “better” and indicate greater similarity between the strings. The distances calculated are normalized to a range between 0.0 (indicating a perfect match) and 1.0 (indicating no similarity).
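To make the normalization concrete, here is a minimal Python sketch of two of the metrics above. It is not the library's implementation: the function names are hypothetical, and dividing by the largest possible distance (the longer string's length for Levenshtein, the shared length for Hamming) is an assumed normalization scheme that the library may or may not use.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def normalize(distance: int, max_distance: int) -> float:
    """Scale a raw distance into [0.0, 1.0]; two empty strings count as a perfect match."""
    return 0.0 if max_distance == 0 else distance / max_distance


def normalized_levenshtein(a: str, b: str) -> float:
    # Assumption: normalize by the length of the longer string.
    return normalize(levenshtein(a, b), max(len(a), len(b)))


def normalized_hamming(a: str, b: str) -> float:
    # Assumption: normalize by the shared length of the two strings.
    return normalize(hamming(a, b), len(a))


if __name__ == "__main__":
    print(normalized_levenshtein("kitten", "sitting"))  # 3 edits over 7 characters
    print(normalized_hamming("karolin", "kathrin"))     # 3 mismatches over 7 positions
    print(normalized_levenshtein("same", "same"))       # identical strings give 0.0
```

Under this scheme, identical strings score 0.0 and strings with no characters in common at any position score 1.0, which matches the range described in the note above.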
Configuration Structure
To configure string distance metrics, you can use a YAML configuration file. The configuration structure includes:
- `model_parameters`: specifying model-related parameters.
- `evaluation`: setting the evaluation `metric`, `distance`, and `threshold`.
- `tests`: defining different test scenarios and their `min_pass_rate`.
Here’s an example of the configuration structure:
model_parameters:
  temperature: 0.2
  max_tokens: 64
evaluation:
  metric: string_distance
  distance: jaro
  threshold: 0.1
tests:
  defaults:
    min_pass_rate: 1.0
  robustness:
    add_typo:
      min_pass_rate: 0.70
    lowercase:
      min_pass_rate: 0.70
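
The following Python sketch illustrates one plausible reading of how `threshold` and `min_pass_rate` interact. It rests on assumptions rather than documented behavior: a sample is taken to pass when its distance is at or below `threshold`, a test is taken to pass when the fraction of passing samples reaches its `min_pass_rate`, and tests without their own entry fall back to `defaults`. The helper names and the sample distances are hypothetical, and the dictionary simply mirrors the example configuration.

```python
# Plain-dict mirror of the example configuration (model_parameters omitted).
config = {
    "evaluation": {"metric": "string_distance", "distance": "jaro", "threshold": 0.1},
    "tests": {
        "defaults": {"min_pass_rate": 1.0},
        "robustness": {
            "add_typo": {"min_pass_rate": 0.70},
            "lowercase": {"min_pass_rate": 0.70},
        },
    },
}


def min_pass_rate_for(cfg: dict, category: str, test_name: str) -> float:
    """Look up a test's min_pass_rate, falling back to tests.defaults (assumed behavior)."""
    tests = cfg["tests"]
    default = tests["defaults"]["min_pass_rate"]
    return tests.get(category, {}).get(test_name, {}).get("min_pass_rate", default)


def meets_pass_rate(distances: list, threshold: float, min_pass_rate: float) -> bool:
    """Assumed rule: a sample passes when its distance is at or below the threshold."""
    if not distances:
        return False  # nothing was evaluated, so nothing can pass
    passed = sum(d <= threshold for d in distances)
    return passed / len(distances) >= min_pass_rate


if __name__ == "__main__":
    threshold = config["evaluation"]["threshold"]
    # Hypothetical normalized distances for four samples of the add_typo test.
    sample_distances = [0.0, 0.05, 0.08, 0.3]
    rate = min_pass_rate_for(config, "robustness", "add_typo")
    # 3 of 4 samples are at or below 0.1, so the pass rate is 0.75.
    print(meets_pass_rate(sample_distances, threshold, rate))
```

With these assumed rules, the same four distances would satisfy the 0.70 requirement of `add_typo` but not the default requirement of 1.0.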