String Distance Metrics

 

We provide a collection of string distance metrics that quantify how similar or dissimilar two strings are. They are useful wherever strings need to be compared. The available string distance metrics are:

Metric Name           Description
jaro                  Measures the similarity between two strings based on the number of matching characters and transpositions.
jaro_winkler          An extension of the Jaro metric that gives additional weight to common prefixes.
hamming               Measures the difference between two equal-length strings as the number of positions at which the corresponding characters differ.
levenshtein           Calculates the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another.
damerau_levenshtein   Similar to Levenshtein distance, but also allows the transposition of two adjacent characters as a valid edit operation.
Indel                 Counts the minimum number of insertions and deletions required to transform one string into another; substitutions are not allowed.
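
To make the edit-based definitions above concrete, here is a minimal Python sketch of the Levenshtein and Hamming distances. The function names are hypothetical and are not part of any library API. Dividing by the string length to obtain a value between 0.0 and 1.0 is an assumption; the exact normalization used for these metrics is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalization: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def normalized_hamming(a: str, b: str) -> float:
    """Fraction of positions that differ; only defined for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(normalized_levenshtein("kitten", "sitting"))
    print(normalized_hamming("karolin", "kathrin"))
```

With this normalization, identical strings score 0.0 and strings that share no characters in any position score 1.0.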

Note: the returned scores are distances, so lower values indicate greater similarity between the strings. Distances are normalized to the range 0.0 (a perfect match) to 1.0 (no similarity).
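
Jaro is conventionally defined as a similarity, so a distance in the 0.0 to 1.0 range can be obtained as one minus the similarity. The sketch below follows the textbook Jaro definition; the function names are hypothetical, and it is an assumption that the jaro option computes its distance this way.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0.0, 1.0]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are within this many positions.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # the positions where they disagree; each pair is one transposition.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_distance(s1: str, s2: str) -> float:
    """Distance form: 0.0 is a perfect match, 1.0 is no similarity."""
    return 1.0 - jaro_similarity(s1, s2)


if __name__ == "__main__":
    print(jaro_distance("MARTHA", "MARHTA"))
```

For "MARTHA" and "MARHTA", all six characters match with one transposition, so the similarity is 17/18 and the distance is 1/18, roughly 0.056.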

Configuration Structure

To configure string distance metrics, you can use a YAML configuration file. The configuration structure includes:

  • model_parameters specifying model-related parameters.
  • evaluation setting the evaluation metric, distance, and threshold.
  • tests defining different test scenarios and their min_pass_rate.

Here’s an example of the configuration structure:

model_parameters:
  temperature: 0.2
  max_tokens: 64

evaluation:
  metric: string_distance
  distance: jaro
  threshold: 0.1

tests:
  defaults:
    min_pass_rate: 1.0

  robustness:
    add_typo:
      min_pass_rate: 0.70
    lowercase:
      min_pass_rate: 0.70
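
As a rough illustration of how threshold and min_pass_rate could interact, the Python sketch below mirrors the configuration above as a plain dictionary. It rests on two assumptions that are not stated here: a sample passes when its distance is at or below the threshold, and a test passes when the fraction of passing samples reaches its min_pass_rate, falling back to the value under defaults. The helper names and the sample distances are hypothetical.

```python
# Plain-dict mirror of the example configuration (no YAML parser needed).
config = {
    "evaluation": {"metric": "string_distance", "distance": "jaro", "threshold": 0.1},
    "tests": {
        "defaults": {"min_pass_rate": 1.0},
        "robustness": {
            "add_typo": {"min_pass_rate": 0.70},
            "lowercase": {"min_pass_rate": 0.70},
        },
    },
}


def min_pass_rate_for(cfg: dict, category: str, test_name: str) -> float:
    """Look up a test's min_pass_rate, falling back to the defaults (assumed behavior)."""
    tests = cfg["tests"]
    specific = tests.get(category, {}).get(test_name, {})
    return specific.get("min_pass_rate", tests["defaults"]["min_pass_rate"])


def meets_pass_rate(distances: list, threshold: float, min_pass_rate: float) -> bool:
    """Assumed rule: a sample passes when its distance is <= threshold."""
    if not distances:
        return False
    passed = sum(d <= threshold for d in distances)
    return passed / len(distances) >= min_pass_rate


if __name__ == "__main__":
    threshold = config["evaluation"]["threshold"]
    sample_distances = [0.0, 0.05, 0.08, 0.3]  # made-up values for illustration
    rate = min_pass_rate_for(config, "robustness", "add_typo")
    print(meets_pass_rate(sample_distances, threshold, rate))
```

Under these assumptions, three of the four sample distances fall at or below 0.1, giving a pass rate of 0.75. That clears the 0.70 required for add_typo but would fail a test that relies on the default min_pass_rate of 1.0.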