Table of Contents |
The Intrinsics > Dictionary
Dictionary
To facilitate command parsing, TADS 3 provides an intrinsic class called Dictionary. A Dictionary is a specialized type of lookup table designed specifically to work with the GrammarProd intrinsic class, which is used to create parsers. A Dictionary stores the vocabulary for a grammar; more generically speaking, it stores associations between keys and objects, and can be efficiently searched for a key to yield all of the objects associated with the key.
A “key” is a combination of a string and a property ID. The Dictionary
class includes a property ID with each key so that object associations
can be differentiated by type; each property is a type of association.
In practical terms, the property used in a Dictionary key will usually
be a vocabulary property, such as noun
or
adjective
. A given word might be used with
some objects as a noun, but with other objects as an adjective; for
example, “card” might be used as a noun with an “ID Card” object, but as
an adjective with a “card slot” object. The Dictionary class
differentiates by property ID to allow for searching a Dictionary object
for the objects matching a word in a particular usage context.
To use the Dictionary class, you should \#include
\<dict.h\>
. If you’re using the adv3 library, the library headers
will do this for you automatically.
Comparators
The Dictionary class allows the program to customize the way input strings are matched to dictionary strings. This customization is handled through a “comparator” object; this is simply an object that provides a specific set of methods, which the Dictionary calls to perform its comparisons.
You can set a Dictionary object’s comparator at any time by calling the
Dictionary’s setComparator()
method. This
method takes the comparator object as an argument. When you set the
comparator, subsequent look-up operations are performed using the
matching rules contained in the new comparator.
The intrinsic class StringComparator provides a customizable comparator implementation, compatible with the Dictionary requirements, that’s very fast (since it’s implemented as native code in the interpreter). When the StringComparator class doesn’t provide enough flexibility, you can define your own completely custom comparator by implementing the set of required methods.
The Dictionary class assumes that comparator objects are immutable. That is, once a comparator is installed in a dictionary, the dictionary assumes that the return values from the required methods, for a given set of argument values, will never change. This is important because it allows the dictionary to assume it will never need to rebuild its internal tables except when a new comparator is installed. If you want to change the comparison rules, you must create and install a new comparator object; never attempt to change the comparison rules by modifying the comparator itself.
A comparator object must define the following methods:
calcHash(*str*)
Calculate and return the “hash code” for the given string str. The hash code is simply an arbitrary integer value; its purpose is to allow the dictionary to divide its overall set of words into many bins, which reduces the amount of work necessary when searching for a particular string by limiting the search to the strings in one bin. When storing a string, the dictionary computes the hash value by calling this method, then stores the string in a bin specified by the hash value. When looking up a string, the dictionary computes the string’s hash value, and then only has to look in the bin given by the hash value.
Note that this function doesn’t have to worry about how many bins the dictionary is using. The dictionary will automatically take the value modulo the bin count (i.e., the dictionary will divide the value returned from calcHash() by the number of bins, and use the integer remainder of this division to select the bin; this will inherently be in the correct range).
The hash value returned from this method is arbitrary, hence the method can implement any algorithm it wants for computing the value. However, there is one absolute requirement that the method must obey, and a couple of properties that the algorithm should have:
- The absolute requirement is that, for any two strings s1 and s2,
if
matchValues(s1, s2)
indicates a match (i.e., it returns anything other than zero ornil
), thencalcHash(s1)
must equalcalcHash(s2)
. (This does not imply that the hash values for two non-matching strings must be different: hash “collisions,” in which two strings do not match but happen to have the same hash value, are specifically allowed. - A highly desirable property of the hash function is that it runs quickly. The point of the hash is to reduce the amount of work needed to look up a word by very quickly picking only one bin to look in. If the hash function takes a long time to compute, then the speed advantage of this approach is reduced.
- Another highly desirable property is that the hash values for the set of all words in the dictionary should distribute uniformly. This is important because it will ensure that about an equal number of words goes into each bin, so that we don’t have many empty bins and a few bins with all of the words; if the words were to clump into a few bins, then we wouldn’t get much advantage out of limiting the searching to one bin per word looked up. There’s no sure way to guarantee uniform hash distribution, especially given that the set of words in a particular dictionary can’t be predicted in most cases; however, practice has shown that certain types of hash functions generally work pretty well. If you are writing your own comparator object, you can find information on how to write a good hash function in practically any textbook on computer algorithms.
matchValues(*inputStr*, *dictStr*)
Match the input string inputStr to the dictionary string dictStr.
Returns zero or nil
if the values do not
match, or any other value if they do match. Because any non-zero and
non-nil value indicates a match, this routine is free to encode
additional information about the match in the return value; typically,
implementations use bit flags (a combination of binary values OR’d
together) for this purpose.
Note the asymmetry of the argument: the first argument is always an input string, which is a string to be matched (such as the string argument to the findWord() method); the second is always a dictionary string, which is a string already stored in the dictionary that is being compared with the input word. Some implementations might take advantage of this distinction to permit certain types of approximations in input strings; for example, an implementation might allow an input string to be abbreviated to a leading substring of the dictionary string, so that “flashl” matches “flashlight,” but not vice versa.
Compiler Support
The TADS 3 compiler has built-in support for the Dictionary intrinsic
class. (This built-in support is part of the compiler, not the
interpreter; as far as the interpreter is concerned, a Dictionary object
is the same as any other kind of object.) The compiler support is
provided by the statements dictionary
and
dictionary property
, and by recognition of
dictionary properties when defined in object property lists.
The dictionary
statement has two purposes.
First, it defines a new object instance of class Dictionary, if the name
has not been previously defined. Second, it establishes the active
dictionary, which is the dictionary into which the compiler will insert
each dictionary word defined with an object.
The dictionary property
statement tells the
compiler that a given property is henceforth a “dictionary property.”
This means that, whenever the property is subsequently used in an object
definition, the property defines dictionary words for the object. The
compiler then automatically enters the words into the active dictionary
under the given property.
Here’s some sample code that demonstrates how these statements work.
// create a new dictionary and make it active
dictionary myDict;
// establish the dictionary properties
dictionary property noun, adjective, plural;
// define an object
redBook: Book
noun = 'book' 'booklet'
adjective = 'red'
;
Note that dictionary properties use a special “implied list” syntax: there’s no need to enclose a vocabulary property’s string list in square brackets. In fact, the TADS 3 compiler doesn’t even allow list notation for dictionary properties - you must use the implied list format.
Spelling correction
The Dictionary class provides support for spelling correction through
the correctSpelling()
method, which generates
a list of possible corrections for a misspelled word. This method by
itself doesn’t constitute a complete spelling corrector, but it makes it
a lot easier to write a spelling corrector, since it does a lot of the
heavy lifting. Implementing a complete spelling corrector is beyond the
scope of the Dictionary class, since a full corrector would normally use
application-specific context information to rank the possible dictionary
matches to a misspelled word. For example, to correct spelling in an IF
game’s command line input, a corrector could analyze the grammatical
context of the misspelled word in the overall command line, as well as
the current state of the world model (such as the list of nearby
objects). That sort of context information is outside of the the
Dictionary class’s domain.
You might wonder, then, why the Dictionary class would involve itself in
spelling correction at all. The equivalent functionality of
correctSpelling()
could be implemented
separately from the Dictionary class, and in fact extension libraries
that do this are already available. If the rest of the spelling
corrector has to be implemented externally, why not this part as well?
The reason is efficiency. Generating a list of candidate words is one of
the key performance bottlenecks in spelling correction, and there’s a
special advantage to integrating this process with the Dictionary data.
This isn’t just the obvious advantage of a native code version, although
that’s also an important factor for such a compute-intensive task. The
bigger advantage is a particularly fast algorithm that’s only possible
with direct access to the Dictionary data.
For the technically inclined, here are some details. Most spelling correctors are based on an “edit distance” formula, usually the Levenshtein distance. A corrector must take a misspelled input word and produce a set of candidate corrections based on the edit distance formula. The fastest algorithms for producing candidate sets based on the Levenshtein distance use state-machine automata. TADS uses a relatively new algorithm that’s particular efficient. It involves a special dual automaton that simultaneously traverses a Levenshtein state machine and a dictionary “trie”, which is state machine version of the dictionary data. The algorithm’s efficiency comes from the integration of the two state machines, which allows the automaton to avoid blind alleys that a regular Levenshtein automaton has to traverse. The dual automaton immediately detects dead ends in the Levensthein paths because it considers the dictionary data at every step. This greatly reduces the search space and thus the execution time. It would be possible to build the same algorithm externally by maintaining a side copy of the dictionary’s contents, but that would require some way to keep the copy synchronized with the main dictionary. It’s simpler to build the algorithm within the dictionary so that everything stays in sync naturally.
Dictionary methods
addWord(*obj*, *str*, *vocabProp*)
Add an object to the dictionary with the given string and property key.
str can be a string value, or can be a list (or Vector or other
list-like object) containing strings. If
str is a list of strings, the result is the same as calling
addWord()
once for each string in the list. If
the word association to be added is already defined (i.e., another entry
with the same string, object, and property already exists), the new
association is simply ignored. obj is the object to associate with the
given vocabulary word or words, and vocabProp is the dictionary
property ID (&noun
,
&adjective
, etc) to use for the association.
correctSpelling(*str*, *maxEditDistance*)
Returns a list of words in the dictionary that are possible spelling corrections for the given string str. The returned list contains each word defined in the dictionary whose edit distance from str is less than or equal to maxEditDistance.
maxEditDistance is an integer giving the maximum edit distance to include in the results.
The edit distance between two words is defined as the smallest number of single-character insertions, deletions, replacements, and transpositions necessary to transform one word into another. This is often called the “Levenshtein distance”, but note that our version isn’t strictly a Levenshtein distance because we count transposition as one step. (A transposition is a reversal of the positions of two adjacent letters, such as changing “edti” to “edit” by switching the “ti” to “it”. The strict definition of the Levenshtein distance doesn’t include transpositions, so a transposition counts as two steps - an insertion plus a deletion - in an actual Levenshtein distance.)
The concept of edit distance and its use in spelling correction are well documented on the Web, so we won’t go into a lot more detail here. For more information, a good starting point is the Wikipedia article.
The return value is a list with all of the words in the dictionary that match str within the specified maximum edit distance. Any given dictionary word will appear only once in the returned list. The list is in arbitrary order. Each element of the list is a sublist, [word, dist, repl]:
- word is a string giving the matching dictionary word.
- dist is an integer giving the actual edit distance between word and the search string str; this ranges from 0 for an exact match up to the maximum edit distance you requested.
- repl is the number of character replacements that must be performed to transform str into word. The edit distance dist includes this figure, since a single-character replacement is one of the edit operations counted in the distance. repl is called out separately because some spelling correctors consider a replacement to be a heavier change than other edits, and use replacements to break ties. For example, consider “book” and “box” as corrections for “bok”: both have edit distance 1, but some correctors might choose “book” over “box” as the correction because “book” can be corrected by inserting an “o”, while “box” requires replacing the “k” with an “x”.
If the Dictionary has a StringComparator object as its current comparator, the results will take into account its case folding setting, truncation length, and character mappings. These “approximations” are not considered to be edits, so they don’t count against the maximum edit distance. Note an important limitation, though: custom comparators that aren’t of the StringComparator class are ignored. If you use your own object type (not a StringComparator object) as the comparator, this method will only find matches based on the exact text of the dictionary words. This limitation is necessary because the custom comparator interface doesn’t provide enough information for the spelling corrector to figure inexact matches of other kinds.
How do you choose a maximum edit distance? Unfortunately, there’s no theoretically perfect choice. Other factors in the overall design of a spelling corrector will affect the choice, so the best approach is to experiment with different distances to find the best one for your specific implementation. The basic trade-off is that if you pick a value that’s too low, you won’t find as many possible matches, so the match that the user actually intended might be missed; but if you pick a value that’s too high, you’ll get lots of false positives, and it’ll be hard to find the right match among all the chaff. Higher distances also make the search take longer, for obvious reasons.
The literature on spelling correction suggests that 2 is a pretty good all-purpose limit for edit distance in almost any application. However, you’ll probably do even better by varying this slightly according to the word length. Our own empirical testing suggests a roughly logarithmic scale works best: edit distance 1 for short words (four letters or fewer), 2 for medium words (up to seven letters), and 3 for longer words. Apart from empirical experience, there are some intuitive reasons for using these numbers. For very short words (three or four letters), two edits seems to be too many simply because it’s possible to transform any short word into almost any other short word in two edits. For long words, the chances of multiple typos increase, and the chances of false matches thin out considerably, so it seems worthwhile to increase the distance limit for these words.
findWord(*str*, *vocabProp*?)
Search the dictionary for the given string and property ID. Returns a
list giving all of the matching objects; if there are no objects,
returns an empty list. If vocabProp is omitted or is
nil
, the method returns all of the objects
that match the string with any property.
The returned list consists of two elements for each matching object. The
first element of each pair is the matching object; the second element is
the return value from the current comparator’s
matchValues()
method for the match. For
example, suppose that the comparator simply returns
nil
for no match or
true
for a match; the result list might then
look like this:
[blueBook, true, booklet, true, matchbook, true]
The matchValues()
result is included because
some comparators use the result to supply additional information about
the match. For example, StringComparator encodes information on case
folding, truncation, and equivalence mappings. The caller could use this
information to choose some matches over others, for example.
forEachWord(*func*)
Invokes the callback function func on each word association in the
dictionary. func is invoked as (*func*)(*obj*,
*str*, *vocabProp*)
, where obj, str, and vocabProp have the
same meanings they do in addWord()
. Note that
a given string can appear many times in a dictionary, so func can be
invoked with the same str value multiple times, once for each
string/object/property association.
isWordDefined(*str*)
Searches the dictionary for the given string and determines if it’s
associated with any objects. If the word occurs in the dictionary at
all, this function returns true
; otherwise, it
returns nil
.
removeWord(*obj*, *str*, *vocabProp*)
Removes from the dictionary the object’s association with the given
string and property ID key. Only the specific association of object,
string, and property is removed; if the same object is also associated
with other strings, the object is not removed from those other
associations, and likewise if the same string and property are
associated with different objects, those object associations are not
removed. str can be a string value, or it can be a list (or Vector or
other list-like object) containing strings.
If it’s a list of strings, the result is the same as calling
removeWord()
separately for each string in the
list. If the word association to be removed is not defined, the
operation is simply ignored.
setComparator(*compObj*)
Set the comparator object. All subsequent dictionary operations are
performed with the new comparator object. If compObj is
nil
, the dictionary simply matches input
strings to dictionary strings exactly.
Once installed, the dictionary considers a comparator object to be immutable. That is, the dictionary assumes that the return values from the required methods of the comparator for a given set of inputs will never change. If you want to change the comparison rules for a dictionary, you must create and install a new comparator object: never attempt to change the comparison rules by modifying an existing comparator.
Note that changing the comparator is a relatively expensive operation, because the dictionary must rebuild its internal table for the new comparator; programs should thus avoid changing comparators except when necessary.
Notes
There are a few other important things to know about dictionaries.
Dictionaries and garbage collection: A dictionary references its objects “weakly.” This means that adding an object to a dictionary does not prevent the garbage collector from deleting the object. If an object is referenced in a dictionary, but the object becomes otherwise unreachable, the garbage collector will delete the object, and will at the same time remove all of the object’s associations from the dictionary. While this might seem strange at first glance, it is actually very useful, because it means that you don’t have to worry about manually deleting dictionary references to objects that have become unneeded.
Because dictionaries reference their entries weakly, the mere presence of an object in a dictionary never keeps the garbage collector from deleting the object. Hence, if an object in a dictionary becomes otherwise unreachable, the collector will eventually delete the object. This is normally desirable, because dictionaries are usually used only as a performance optimization to make it quicker to find an object given its name. However, it is sometimes the case that a dictionary is the primary way to access a group of objects; in these cases, the weak references that a dictionary uses are undesirable because they don’t prevent your objects from being deleted. In these cases, you must create some other explicit reference to your objects; an easy way to accomplish this is to add each such object to a list stored in a property of a static object. You must take care to remove these explicit references when their objects are no longer needed, so that the objects can be deleted.
Creating dictionaries dynamically: You can create a dictionary
object at run-time using the new
operator.
Note that, when doing this, you should be sure to
\#include \<dict.h\>
in your source file to
ensure that the compiler knows about the Dictionary class.
// include the system header defining the Dictionary class
#include <dict.h>
myFunc()
{
// create a new dictionary (with the default
// string comparator)
local dict = new Dictionary();
// add some words
dict.addWord(redBook, 'red', &adjective);
dict.addWord(redBook, 'book', &noun);
}
Using multiple dictionaries in the compiler: You can use multiple
dictionaries at compile-time simply by using a new
dictionary
statement for each dictionary. You
can switch back to an existing dictionary with another
dictionary
statement naming the original
dictionary. The dictionary
statement
establishes the active dictionary, which remains in effect until the
next dictionary
statement. Multiple
dictionaries might be useful in certain situations, such as when you
want to create different parsing modes, each having their own separate
vocabulary words.
Dynamic object creation and dictionaries: When you create a new
simulation object with operator new
, the
interpreter will not automatically add any vocabulary to any
dictionary for the object. While this might seem a deficiency, remember
that the interpreter doesn’t think of Dictionary objects as anything
special, so it doesn’t have any idea that you might want some random
object creation to have a side effect on a completely unrelated object
(such as a Dictionary).
If you’re using the adv3 library, you won’t notice this lack of automatic action on the part of the interpreter, because the adv3 library does provide automatic vocabulary handling for new objects, via constructors in the basic simulation-object classes.
If you’re defining your own library, you’ll probably want to put something like this in the constructors for your simulation object base classes:
class Thing: object
construct()
{
// add all of my vocabulary to the default dictionary
libDict.addWord(self, noun, &noun);
libDict.addWord(self, adjective, &adjective);
libDict.addWord(self, plural, &plural);
}
;
As long as each class derived from Thing (or from subclasses of Thing) properly inherits its superclass constructor from its own constructor, the library code will take care of adding the vocabulary words for the new instance, taking advantage of the accessibility of the dictionary property values that TADS 3 provides.
Dictionary vs. LookupTable: The Dictionary class is in some ways similar to the LookupTable intrinsic class, in that both objects allow you to associate keys with values and efficiently look up a value by key. However, Dictionary is considerably more specialized than the general-purpose LookupTable: Dictionary allows multiple values to be associated with a key, and provides the composite property/string key. Dictionary is also designed specifically to store its specialized data, and makes more efficient use of memory than a more general implementation would. In addition, Dictionary allows the string comparison rules to be customized through the comparator object.
Notes for TADS 2 users
The TADS 2 functions for manipulating the parser dictionary don’t exist in TADS 3. The equivalent functionality is implemented through methods on the Dictionary object.
Even though code that defines vocabulary words for an object looks similar to the way it did in TADS 2, though, there are some important differences:
- First, the
noun
andadjective
properties behave at run-time like ordinary properties: the compiler sets the values of these properties to simple string lists containing the strings actually defined, plus any strings inherited from superclasses. You can use and modify these properties during program execution just like any other, but since they’re nothing special to the interpreter, changing them has no effect on the dictionary. Updating the dictionary requires explicitly using Dictionary methods (addWord()
, etc). - Second, the Dictionary object created by the
dictionary
statement is a full-fledged object at run-time. This means that you can call Dictionary methods on this object in your program, to look up words and modify the dictionary dynamically. There was no equivalent object in TADS 2.
TADS 3 System Manual
Table of Contents |
The Intrinsics > Dictionary