Crnchng th Brds
A Search for Universal Ornithological Identifiers
Philosophers of language and computer scientists have come to realize the importance
of universal identifiers. For example, Saul Kripke argues in Naming and Necessity that
names are not descriptions, but rather "rigid designators" that retain their naming
ability even in the presence of massive epistemological upheavals: "gold", he argues,
would designate gold even if it turns out that gold is not a metal, is not yellow,
and does not have atomic number 79. Similarly, the use of unique identifiers to
solve the problems of designating-by-description is widespread in modern computer
systems.
I have often felt that there was a need for such unique identifiers for birding. To
take a simple example, consider the problem of describing the birds that can be
seen at a given site. A straightforward method would be to define some code for
seasonal abundance like the ACUOR codes that are becoming standard, and then store
one of those codes for every bird in some standard list of the birds of the world.
This straightforward approach suffers from two drawbacks. The first is that it is
quite inefficient in terms of storage: the vast majority of the codes will be
'----'. The more serious drawback is that it is tied to a particular bird list, and
when that bird list is changed, as it invariably will be, the description of the
site will no longer be valid: what used to be bird 6,738 on the list is now bird
6,748, and making that change is quite cumbersome.
The use of Universal Ornithological Identifiers (UOI's) solves both these problems.
The site data can be compactly stored as a list of UOI's along with abundance codes.
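The difference between the two storage schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers, the four-character abundance code "CUOR" (one letter per season), and the function names to_sparse and to_dense are all hypothetical and do not come from any existing system.

```python
ABSENT = "----"  # assumed: four seasonal slots, '-' meaning "not recorded"

def to_sparse(checklist, dense_codes):
    """Keep only recorded birds, keyed by identifier rather than list position."""
    return {uoi: code for uoi, code in zip(checklist, dense_codes) if code != ABSENT}

def to_dense(checklist, sparse):
    """Rebuild the one-code-per-bird layout for any version of the checklist."""
    return [sparse.get(uoi, ABSENT) for uoi in checklist]

# Hypothetical identifiers and abundance codes, for illustration only.
checklist_v1 = ["AaaaBbbb", "CcccDddd", "EeeeFfff"]
site = to_sparse(checklist_v1, ["----", "CUOR", "----"])

# A revised checklist inserts a new bird ahead of the one seen at the site.
checklist_v2 = ["AaaaBbbb", "GgggHhhh", "CcccDddd", "EeeeFfff"]

print(site)                          # the stored site data needs no change
print(to_dense(checklist_v2, site))  # positions shift, the identifier does not
```

The site record holds one entry instead of three, and it survives the checklist revision untouched because nothing in it refers to a position on the list.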
Criteria for Universal Ornithological Identifiers
I recently set out to design a system of UOI's. Unfortunately, several mutually
conflicting design goals come into play. The perfect UOI would be:
- Short. I took 8 characters as a maximum.
- Unique. Unless there is a one-to-one mapping between birds and codes, the codes are useless.
- Universal. A single set of UOI's should be usable for all birds, in all languages.
- Taxonomy independent. Since taxonomies change, the UOI's should not be tied to
any particular one.
- Easily Computed. The DNA fingerprint of each bird may be unique, but it's not
easy to generate.
- Mnemonic. No short, unique code will be really easy to remember, but BlkHdBd is
certainly preferable to 4x7%x*9F.
I am unaware of any existing UOI system that comes acceptably close to satisfying
these criteria. The AOU system of codes is short and unique, but it is tied to
the AOU checklist, is computed by arbitrary assignment, is not universal, and
is not mnemonic. The system of taxonomic codes used by BirdBase is short, unique,
and universal, but is tied to its taxonomic descriptions and is not mnemonic.
The four-letter codes of H. Lee Jones are short, unique, taxonomy-independent,
and mnemonic, but they are not easily computed and not universal.
BrdBrev
My own UOI system, BrdBrev, takes a bird's binomen as its starting point.
Binomina are the closest thing we have to rigid identifiers for birds: they are unique,
universal, mnemonic, relatively robust in the face of high-level taxonomic revisions,
and widely known. The only problem is their length.
The goal, then, was to find a method for shortening the Latin names to an acceptable
length. In computer science terms, what was needed was a minimal perfect hashing
scheme for the Latin names.
Now the art of finding minimal perfect hashing functions has advanced remarkably in
the last few years, and it may actually be possible to find one, but the result would
almost certainly violate the requirements for mnemonicity and easy computability.
I chose instead to see how close I could come with relatively simple abbreviation
schemes.
The two main abbreviation techniques are deletion of frequent letters and substring
deletion. For example, Yllw-trtd might be derived from Yellow-throated by
frequent-letter deletion, while substring deletion might yield Yell-thro.
I tried a large number of combinations of these techniques.
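The two techniques can be sketched as follows. This is a minimal illustration, not the procedure actually used: the deletion order "eaoiuh", the rule that the first letter of each segment is never deleted, the 9-character limit, and the choice to keep the first four letters of each segment are all assumptions, picked so that the sketch reproduces the two examples given above.

```python
DELETION_ORDER = "eaoiuh"  # assumed order of "frequent" letters, most expendable first

def delete_frequent_letters(name, max_len=9, order=DELETION_ORDER):
    """Delete letters one at a time, in the assumed order, until the name fits."""
    # Pair each character with a flag: hyphens and segment-initial letters are kept.
    cells = []
    at_start = True
    for ch in name:
        cells.append((ch, at_start or ch == "-"))
        at_start = (ch == "-")
    for letter in order:
        i = 0
        while i < len(cells) and len(cells) > max_len:
            ch, keep = cells[i]
            if ch.lower() == letter and not keep:
                del cells[i]
            else:
                i += 1
    # If the order is exhausted first, the result may still exceed max_len.
    return "".join(ch for ch, _ in cells)

def truncate_segments(name, keep=4):
    """Substring deletion (one variant): keep the first `keep` letters of each segment."""
    return "-".join(segment[:keep] for segment in name.split("-"))

print(delete_frequent_letters("Yellow-throated"))  # Yllw-trtd
print(truncate_segments("Yellow-throated"))        # Yell-thro
```

Frequent-letter deletion keeps the consonant skeleton of the whole name, while truncation keeps a readable prefix of each segment and discards everything after it.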
The one that seemed to work best was a 9-character code using frequent-letter deletion,
with capitalization used to preserve segmentation. This scheme has a phenomenally low
figure of only six collisions in the entire list of almost 10,000 birds. The six collisions
are:
Code (capitals denote segmentation) and the list numbers of the two colliding birds:
No.  Code       First  Second
1    ChlbnMugu  3073   3078
2    PhylFlvvn  4877   4884
3    OluXnthnu  5935   5950
4    CduelPinu  8749   8750
5    BlutuGntu  8948   8962
6    AmbyHlecu  9605   9652
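Counting collisions for a candidate scheme is a matter of grouping names by their code. The sketch below shows one way to do it. The encoder brev is a hypothetical stand-in, not the BrdBrev rules: its deletion order "saroietnu", its one-letter-at-a-time procedure, and its capitalization of the first surviving letter of each word are assumptions. The three test names are real binomina, two of which differ by a single letter.

```python
from collections import defaultdict

def brev(binomen, max_len=9, order="saroietnu"):
    """Stand-in encoder: delete letters in an assumed order until the code fits."""
    words = [list(word.lower()) for word in binomen.split()]
    for letter in order:
        for w in words:
            i = 0
            while i < len(w) and sum(map(len, words)) > max_len:
                if w[i] == letter and len(w) > 1:  # never empty a word entirely
                    del w[i]
                else:
                    i += 1
    # Capitalize the first surviving letter of each word to mark segmentation.
    return "".join("".join(w).capitalize() for w in words)

def find_collisions(names, encode):
    """Group names by code and return only the codes shared by several names."""
    groups = defaultdict(list)
    for name in names:
        groups[encode(name)].append(name)
    return {code: group for code, group in groups.items() if len(group) > 1}

names = ["Carduelis carduelis", "Carduelis pinus", "Carduelis spinus"]
print(find_collisions(names, brev))
# {'CduelPinu': ['Carduelis pinus', 'Carduelis spinus']}
```

Under these assumptions the two siskins shorten to the same string, which shows how names differing in a single frequent letter defeat a deletion scheme. That the string also appears among the collisions listed above may be coincidental, since the actual deletion order is not given here.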