What is a name variant ?
Database applications that store and search personal data often need to
allow for a degree of variation in the spelling of both surnames and
forenames.
Variations can be introduced into database applications as a result of typographical
errors where letters are interchanged (e.g. Nobel
and Noble), letters are substituted (e.g.
Stevens and Stephens), letters are added
(e.g. Colins and Collins)
or removed (e.g. Clarke and Clark).
Variations can also be introduced as a result of alternative spellings
for names with the same or very similar pronunciation (e.g. Cavenagh
and Kavenagh or Ewal and
Yule or Sean and
Shawn). This is a particular problem where the transcription is from
the spoken word.
Forename abbreviations introduce further problems and in
many cases there is little or no phonetic link (e.g. William
with Wm, Billy, Bill, etc. or Elizabeth with its various short forms).
In addition, we do not want to match known masculine names with known
feminine variants and vice versa (e.g. Alexander
should match with Alex and Alexandar
but not with Alexandria).
Why do we need another name matching system ?
Many techniques have been used to assist with the important problem
of matching variant names. However, most of these techniques were
developed for general word matching and as a result they are not optimised
for name matching. The Name Thesaurus was designed specifically for name
matching and achieves significantly higher precision with higher
recall than the alternatives.
What is wrong with Soundex and Metaphone ?
Soundex is a simple algorithm that transforms any word into a code
comprising a leading letter followed by three digits (padded with zeros
where the word yields fewer). For example,
Kavenagh gets the code K152 and this may
be matched to other names with the same code such as Kavnaugh
and Kavenough. Unfortunately the same code also
matches with a huge number of unrelated names such as Kyppins
and Koppensteiner. This code also fails to match
any name with a different leading letter such as Cavenagh.
In summary, Soundex offers a reasonable level of recall but with low precision
when matching names.
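As an illustration of the behaviour described above, here is a minimal Python sketch of the standard Soundex rules. It is not part of the Name Thesaurus; the function name `soundex` and the table name `CODES` are our own choices for this example.

```python
# Standard Soundex letter groups; vowels, y, h and w carry no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code (letter + three digits)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    # Pad with zeros, then truncate to one letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for surname in ("Kavenagh", "Kavnaugh", "Kyppins", "Koppensteiner", "Cavenagh"):
    print(surname, soundex(surname))
```

The first four names all produce K152, which shows the low precision, while Cavenagh produces C152 and is therefore missed.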
Metaphone is another algorithm that attempts to normalise words by removing
vowels after the first letter and mapping common consonant variations. For
example, Kavenagh gets the code KFNK
and this matches with Kavnaugh, Kavenough
and Cavenagh. Unfortunately the same code also
matches many unrelated names such as Gavnik and
Kaffanke. In summary, Metaphone is better suited to name matching
than Soundex but is still far from ideal.
Neither Soundex nor Metaphone includes any of the following matches for
Kavenagh: Kavena, Cavena, Kavanha, Kavan. Also, neither
algorithm offers any assistance with the problem of forename abbreviations.
What algorithms does the Name Thesaurus use ?
The Name Thesaurus uses a combination of phonetic and other techniques for
name variant identification.
All surnames are reduced to an alphabet of 27 characters (the letters a-z and the apostrophe). Double-barrelled
names are split into their component parts and treated independently. All
accented characters are converted to their closest matching letter (e.g.
à, â, ä and å are all mapped to " a "). This mapping is used to simplify the
thesaurus and ensures that similar names are located without regard to the use
of accented characters.
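The normalisation described above might look something like the following Python sketch. It is only an approximation under stated assumptions: the function name `normalise_surname` is hypothetical, double-barrelled names are assumed to be joined by a hyphen, and accents are removed by Unicode decomposition rather than by the Name Thesaurus's own mapping table, which is not published here.

```python
import re
import unicodedata


def normalise_surname(raw: str) -> list[str]:
    """Split a surname into parts restricted to a-z and the apostrophe."""
    # Decompose accented characters, then drop the combining marks (à -> a).
    decomposed = unicodedata.normalize("NFKD", raw)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: the parts of a double-barrelled name are hyphen-separated.
    parts = stripped.lower().split("-")
    # Discard anything outside the 27-character alphabet.
    cleaned = (re.sub(r"[^a-z']", "", part) for part in parts)
    return [part for part in cleaned if part]


print(normalise_surname("Ångström"))
print(normalise_surname("Smith-Jones"))
print(normalise_surname("O'Brien"))
```

Letters that have no Unicode decomposition, such as ø, are simply dropped by this sketch; a production mapping would need an explicit table for them.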
Once a potential match has been identified a weighting is computed as
follows:
- All surnames are converted to a phonetic encoding.
- The degree of similarity for potential matches is computed using two
different matching algorithms. Both the original and the phonetic versions are
compared using the two matching algorithms, producing a total of four match
scores.
- The Soundex and Metaphone codes are also computed.
- The Match Score is computed from the weighted average of the four match
scores combined with the results of the Soundex and Metaphone encodings.
- For forenames, the Name Thesaurus uses a knowledge of gender
associations so that close phonetic matches between the sexes can be
avoided (e.g. John and Joan
or Alexander and Alexandria).
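The shape of the scoring steps above can be sketched in Python. Everything specific in this sketch is an assumption made for illustration: the two similarity measures (a `difflib` ratio and a bigram Dice coefficient), the `crude_phonetic` encoding and the equal weights are stand-ins, not the algorithms or weights the Name Thesaurus actually uses. The Soundex and Metaphone contribution and the gender check are left out.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in matching algorithm 1: difflib similarity ratio (0..1)."""
    return SequenceMatcher(None, a, b).ratio()


def dice(a: str, b: str) -> float:
    """Stand-in matching algorithm 2: Dice coefficient over letter pairs."""
    def bigrams(s: str) -> set[str]:
        return {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def crude_phonetic(name: str) -> str:
    """Hypothetical encoding: drop later vowels, y, h, w; collapse repeats."""
    out = []
    for i, ch in enumerate(name.lower()):
        if i > 0 and ch in "aeiouyhw":
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def match_score(a: str, b: str, weights=(0.25, 0.25, 0.25, 0.25)) -> int:
    """Weighted average of four scores, scaled to 0..100."""
    a, b = a.lower(), b.lower()
    pa, pb = crude_phonetic(a), crude_phonetic(b)
    # Two algorithms applied to both the original and the phonetic forms.
    scores = (ratio(a, b), dice(a, b), ratio(pa, pb), dice(pa, pb))
    total = sum(w * s for w, s in zip(weights, scores))
    return round(100 * total / sum(weights))


print(match_score("Galbraith", "Gailbriath"))
print(match_score("Galbraith", "Smith"))
```

Because the measures and weights are placeholders, the numbers printed here will not match the published match scores; the sketch only shows how four partial scores can be folded into a single ranked weight.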
Why are weighted pairs important ?
Neither Soundex nor Metaphone is able to rank its matching names so
that the closer matches can be identified. Therefore an application that uses
these techniques faces an all-or-nothing decision regarding the inclusion of name
variants.
The Name Thesaurus assigns a match score to all name variants so that each
candidate is offered as a weighted pair and all variants for a given name can
be supplied as a ranked set. For example, the Name Thesaurus has generated more
than 150 matches for the surname Galbraith including:
- Gailbriath - with a match score of 99
- Gallbreath - with a match score of 97
- Galbrealth - with a match score of 85
- with a match score of 83
- Goilbreath - with a match score of 76
By assigning a weight to each name variant the application has the option of
limiting a particular search to the better matches thus improving precision at
the expense of recall.
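A minimal Python sketch of this kind of filtering is shown below. The scores are the Galbraith examples listed above (the entry whose name is missing is omitted); the names `GALBRAITH_VARIANTS` and `variants_above` are hypothetical and chosen only for this example.

```python
# Weighted pairs for the surname Galbraith, taken from the list above.
GALBRAITH_VARIANTS = [
    ("Goilbreath", 76),
    ("Gailbriath", 99),
    ("Galbrealth", 85),
    ("Gallbreath", 97),
]


def variants_above(pairs, threshold):
    """Return variant names scoring at least `threshold`, best match first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


print(variants_above(GALBRAITH_VARIANTS, 90))  # stricter search
print(variants_above(GALBRAITH_VARIANTS, 70))  # broader search
```

Raising the threshold shortens the list to the closest variants; lowering it admits weaker candidates.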
What do you mean by Precision and Recall ?
Precision measures the percentage of correct names in a match list with
100% indicating that the match list does not contain any invalid names.
Recall measures the percentage of all possible correct names appearing in the
match list with 100% indicating that the match list contains every correct
name.
In an ideal world it would be possible to achieve 100% precision with 100%
recall but in practice this is not possible for large volumes of data. In
general, higher precision leads to lower recall and vice versa.
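The two measures can be computed as in the Python sketch below. The function name `precision_recall` is our own, and the set of "correct" names in the usage example is a made-up judgement for illustration, built from names mentioned elsewhere on this page.

```python
def precision_recall(match_list, correct_names):
    """Return (precision, recall) as percentages."""
    matched, correct = set(match_list), set(correct_names)
    hits = len(matched & correct)
    # Precision: share of the match list that is correct.
    precision = 100.0 * hits / len(matched) if matched else 0.0
    # Recall: share of all correct names that made it into the match list.
    recall = 100.0 * hits / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical example: a match list for Kavenagh and an assumed set of
# names that a reviewer would accept as genuine variants.
matches = ["Kavnaugh", "Kavenough", "Kyppins", "Koppensteiner"]
genuine = ["Kavnaugh", "Kavenough", "Cavenagh", "Kavena"]
print(precision_recall(matches, genuine))
```

Here two of the four returned names are genuine and two of the four genuine names are returned, so both measures come out at 50%.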
The above diagram shows how precision drops off as recall is increased by
lowering the Name Thesaurus match score threshold. The diagram also shows how
Soundex and Metaphone compare to the Name Thesaurus with both providing
reasonable levels of recall but with poor precision. Whilst the performance of
Soundex and Metaphone is fixed for any given name, the range indicated in the
diagram shows the expected performance across a spread of names.
When weighted term pairs are available it is possible to tune precision and
recall dynamically. For example, point 'a' on the diagram
above represents relatively high precision with lower recall and would be
achieved by setting a high match score threshold. In contrast, point 'b'
on the diagram represents relatively high recall but with lower precision and
would be achieved by selecting a low match score threshold.
In all cases the Name Thesaurus provides better precision than either Soundex or
Metaphone.
How do I work with the Name Thesaurus ?
The Name Thesaurus data is organised as a thesaurus of name pairs with
weights and would normally be held in a relational database. Database
applications that wish to utilise the Name Thesaurus would simply add a
sub-select that pulls in matching names from the Thesaurus above the selected
match score threshold.
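As a rough illustration of the sub-select approach, the Python sketch below uses an in-memory SQLite database. The table names (`people`, `surname_thesaurus`), the column names and the function `find_people` are all assumptions made for this example and do not describe the schema in which the Name Thesaurus is delivered. The sample rows reuse the Galbraith scores quoted earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, surname TEXT);
-- WITHOUT ROWID stores the rows in primary-key order, which is SQLite's
-- nearest equivalent to a clustered index on (name, score, variant).
CREATE TABLE surname_thesaurus (
    name TEXT, variant TEXT, score INTEGER,
    PRIMARY KEY (name, score, variant)
) WITHOUT ROWID;
""")
conn.executemany(
    "INSERT INTO people (surname) VALUES (?)",
    [("galbraith",), ("gailbriath",), ("goilbreath",), ("smith",)],
)
conn.executemany(
    "INSERT INTO surname_thesaurus (name, variant, score) VALUES (?, ?, ?)",
    [("galbraith", "gailbriath", 99),
     ("galbraith", "gallbreath", 97),
     ("galbraith", "goilbreath", 76)],
)


def find_people(conn, surname, min_score):
    """Find rows matching the surname or any variant at or above min_score."""
    rows = conn.execute(
        """
        SELECT surname FROM people
        WHERE surname = :name
           OR surname IN (SELECT variant FROM surname_thesaurus
                          WHERE name = :name AND score >= :min_score)
        ORDER BY id
        """,
        {"name": surname, "min_score": min_score},
    )
    return [row[0] for row in rows]


print(find_people(conn, "galbraith", 90))
print(find_people(conn, "galbraith", 70))
```

The application's own query stays unchanged apart from the extra sub-select, and the threshold parameter controls how many variants are pulled in.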
Is the Name Thesaurus just data ?
Yes - the Name Thesaurus is available as a standard thesaurus containing 385
million variants for 5.9 million distinct Surnames and 32 million
variants for 1.4 million distinct Forenames.
The names in the standard thesaurus come from all over the world.
Is the Name Thesaurus a Thesaurus generation Service ?
Yes - we can build custom thesauri for specific name collections.
How fast is the Name Thesaurus ?
The performance of the Name Thesaurus is dependent upon the speed of
the relational database used to hold the Thesauri. Since the sub-select can
easily be controlled by a clustered index the overhead of fetching the
additional variants is normally a small percentage of the overall search time.
Our customers use the Name Thesaurus with databases that hold hundreds of
millions of names.
Who are Image Partners ?
Image Partners Limited is the company that designed and developed the
Name Thesaurus.
John Challis, who founded the company in 1996, has been designing software
products for the management of unstructured information since 1987.