Shoebox Normalizer

Layout

Screenshot of the Information Selection Screen

If you are done with the controls, you can reduce clutter by hiding them.

Loading a Lexicon

ShoeboxNormalizer can handle numerous encodings. By default, it assumes that the lexicon is encoded in UTF-8.

Unique Identifiers

Although it is not part of the standard SIL tag set, many people assign a unique identifier to every record. This is simply an integer that allows the record to be identified unambiguously. Such unique identifiers are useful for cross-referencing.

In order for ShoeboxNormalizer to give special treatment to unique identifier fields, it must know which fields they are. Among the Shoebox parameters are therefore the input and output UID Tags, both set by default to "uid".

ShoeboxNormalizer provides several pieces of information about UIDs:

the range of values found;
gaps, that is, unused values within the range;
records containing duplicate uids;

You can also find records lacking a unique identifier field by including the uid tag in the list of obligatory tags and asking to see records lacking obligatory tags.
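The UID summary described above can be illustrated with a short Python sketch. It is not ShoeboxNormalizer's own code: the representation of a record as a list of (tag, value) pairs and the function name uid_report are assumptions made for illustration, and the sketch reports the duplicated values rather than the records that contain them.

```python
from collections import Counter

def uid_report(records, uid_tag="uid"):
    """Summarize UID usage: range of values, gaps, and duplicated values."""
    uids = [int(value) for record in records
            for tag, value in record if tag == uid_tag]
    if not uids:
        return None
    counts = Counter(uids)
    low, high = min(uids), max(uids)
    # Gaps are the unused values inside the range.
    gaps = [n for n in range(low, high + 1) if n not in counts]
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    return {"range": (low, high), "gaps": gaps, "duplicates": duplicates}

# Hypothetical records: each is a list of (tag, value) pairs.
records = [
    [("lx", "dog"), ("uid", "1")],
    [("lx", "cat"), ("uid", "4")],
    [("lx", "elk"), ("uid", "4")],
    [("lx", "owl")],
]
report = uid_report(records)
```

Here the fourth record has no UID field at all, so it does not affect the summary; finding such records is the job of the obligatory-tag check.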

ShoeboxNormalizer can also modify the unique identifiers in your database. To change the tag of the field containing the UID, set the output uid tag in the Shoebox parameter box to the desired value and save the lexicon. The other operations on UIDs are performed using the Action menu.

One operation is to assign UIDs to records lacking them. To do this, select "Add UID" from the Action menu. Records that already have a UID field will be left alone, but records that lack a UID field will have one added. The added UIDs will start at the number following the maximum already in use.
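As a rough illustration of this behaviour, the following Python sketch adds UIDs only where they are missing. The record representation (a list of (tag, value) pairs), the function name, the choice to append the new field at the end of the record, and the choice to start at 1 when no UIDs exist yet are all assumptions, not a description of how ShoeboxNormalizer is implemented.

```python
def add_missing_uids(records, uid_tag="uid"):
    """Give a UID to each record lacking one, continuing after the current maximum."""
    existing = [int(value) for record in records
                for tag, value in record if tag == uid_tag]
    # Assumption: if no UID is in use yet, numbering starts at 1.
    next_uid = max(existing, default=0) + 1
    result = []
    for record in records:
        if any(tag == uid_tag for tag, _ in record):
            result.append(list(record))          # left alone
        else:
            result.append(list(record) + [(uid_tag, str(next_uid))])
            next_uid += 1
    return result

records = [
    [("lx", "dog"), ("uid", "3")],
    [("lx", "cat")],
    [("lx", "elk"), ("uid", "7")],
    [("lx", "owl")],
]
updated = add_missing_uids(records)
```

With these hypothetical records the maximum in use is 7, so the two records without a UID receive 8 and 9.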

You may also reassign UIDs to records that already have them. When you reassign UIDs, records that lack a UID are provided with one. The new UIDs will start at the value specified as an output parameter in the Shoebox parameter box. By default, this is one. In this case, gaps in the UID sequence will be filled in. You may, however, set the new minimum UID value to a value greater than one. This is useful if you wish to add the contents of a file to another lexicon.

UIDs may be reassigned either "linearly" or randomly. If they are reassigned linearly, the first record in the file is given the lowest UID, the second record the next UID, and so forth. If they are reassigned randomly, there will be no relationship between the UID value assigned and the position of a record in the file.
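The difference between the two modes can be sketched in Python as follows. This is only an illustration under stated assumptions: records are lists of (tag, value) pairs, the function name and its parameters are hypothetical, and the new UID field is simply appended to each record after any old UID field has been dropped.

```python
import random

def reassign_uids(records, start=1, randomly=False, uid_tag="uid", seed=None):
    """Give every record a fresh UID, in file order or in random order."""
    new_uids = list(range(start, start + len(records)))
    if randomly:
        # Shuffling breaks any link between UID and position in the file.
        random.Random(seed).shuffle(new_uids)
    result = []
    for record, uid in zip(records, new_uids):
        fields = [(tag, value) for tag, value in record if tag != uid_tag]
        fields.append((uid_tag, str(uid)))
        result.append(fields)
    return result

records = [
    [("lx", "dog"), ("uid", "5")],
    [("lx", "cat")],
    [("lx", "elk"), ("uid", "2")],
]
linear = reassign_uids(records, start=100)
shuffled = reassign_uids(records, randomly=True, seed=0)
```

In the linear case the three records receive 100, 101 and 102 in file order. In the random case the same set of values is used (here 1 to 3), but which record receives which value depends on the shuffle.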

Tags

Many of ShoeboxNormalizer's functions provide information about tags. You can find out:

which records lack an obligatory tag;
which records contain more than one tag that should be unique;
which records contain tags that should not appear;
which records contain mismatches of tags that should occur in pairs;
which records contain mismatches of tags that should occur in triples;
which records contain a tag whose presence implies the presence of another tag that is not present;

In each case, the set of tags of interest must be defined. For each there is a default, but you may change it using the various configuration boxes.
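Two of these checks can be sketched in Python to make the idea concrete. The sketch is illustrative only: the record representation (lists of (tag, value) pairs), the function names, the 1-based record numbering, the tags used in the sample data, and the reading of a "mismatch" as the two tags of a pair occurring an unequal number of times are all assumptions.

```python
def lacking_obligatory(records, obligatory):
    """Map each obligatory tag to the numbers of the records lacking it."""
    report = {}
    for number, record in enumerate(records, start=1):
        present = {tag for tag, _ in record}
        for tag in obligatory:
            if tag not in present:
                report.setdefault(tag, []).append(number)
    return report

def pair_mismatches(records, pairs):
    """Map each tag pair to the records where the two tags occur unequally often."""
    report = {}
    for number, record in enumerate(records, start=1):
        tags = [tag for tag, _ in record]
        for first, second in pairs:
            if tags.count(first) != tags.count(second):
                report.setdefault((first, second), []).append(number)
    return report

# Hypothetical records and tags, chosen only for illustration.
records = [
    [("lx", "dog"), ("ps", "n"), ("xv", "a"), ("xe", "b")],
    [("lx", "cat"), ("xv", "a")],
    [("ps", "v"), ("xv", "a"), ("xe", "b")],
]
missing = lacking_obligatory(records, ["lx", "ps"])
mismatched = pair_mismatches(records, [("xv", "xe")])
```

Both functions return a mapping from a tag (or tag pair) to a list of record numbers, which mirrors the way the results are browsed: first a tag, then the records that have the property for that tag.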

There are two ways to define tags that should not appear, via a blacklist and via a whitelist. If you use a blacklist, only the tags you specify will be considered forbidden. If you use a whitelist, only the tags you specify will be considered acceptable: all other tags will be considered forbidden.
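The two modes differ only in how membership in the list is interpreted, as the following Python sketch shows. The function name, the mode parameter and the sample tags are hypothetical.

```python
def forbidden_tags(record, tags, mode="blacklist"):
    """Return the tags in a record that count as forbidden under the given list."""
    present = [tag for tag, _ in record]
    if mode == "blacklist":
        # Only the listed tags are forbidden.
        return [tag for tag in present if tag in tags]
    if mode == "whitelist":
        # Everything not listed is forbidden.
        return [tag for tag in present if tag not in tags]
    raise ValueError("mode must be 'blacklist' or 'whitelist'")

record = [("lx", "dog"), ("ps", "n"), ("zz", "?")]
by_blacklist = forbidden_tags(record, {"zz"}, mode="blacklist")
by_whitelist = forbidden_tags(record, {"lx"}, mode="whitelist")
```

A blacklist is convenient when only a few known tags are unwanted. A whitelist is stricter, since any tag that was not anticipated, including a mistyped one, is flagged.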

In each case, the display consists of three windows. In the leftmost window you will get a list of tags for which one or more records have the property in question, say lacking an obligatory tag. When you select a tag in the first window, the middle window will display a list of the records with the property in question. For example, if you are looking for records that lack an obligatory tag and you select "c" in the first window, the middle window will list the records lacking the tag "c". Clicking on a record number in the middle window will cause the selected record to be displayed in the last window.

Screenshot of a Mismatched Tag Pair Display

Note that the record numbers shown in the middle window are not the contents of the unique identifier field since some lexica do not use unique identifier fields, and those that do may have records that lack them or may contain duplicates. Rather, the record numbers shown simply reflect the order in which the records appear in the file.

Another way to review the tags in your lexicon is the Tag Histogram function, which displays in the first window a list of all of the tags found in the lexicon file together with the number of times each occurs. You will probably want to examine tags of very low frequency, as these are likely to be mistakes.
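The idea behind a tag histogram can be shown in a few lines of Python. This is a sketch, not the program's implementation; the record representation and the sample data, including the mistyped tag "gee", are invented for illustration.

```python
from collections import Counter

def tag_histogram(records):
    """Count how many times each tag occurs across all records."""
    return Counter(tag for record in records for tag, _ in record)

# Hypothetical records; "gee" stands in for a mistyped "ge".
records = [
    [("lx", "dog"), ("ge", "dog")],
    [("lx", "cat"), ("ge", "cat")],
    [("lx", "elk"), ("gee", "elk")],
]
histogram = tag_histogram(records)
# List the rarest tags first, since they are the likeliest mistakes.
rare_first = sorted(histogram.items(), key=lambda item: (item[1], item[0]))
```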

Actions

It is also possible to make changes to the lexicon.

Renaming Tags

It is sometimes necessary to rename a tag. This situation can arise because a piece of software such as MDF requires a certain tag, because the lexicon is to be used by someone who speaks a different language, or just because the current tag doesn't optimally express the content of the field. Tags can be renamed by means of the "Rename Tag" function on the Action menu.
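In terms of data, renaming a tag changes the tag of every matching field and leaves the values untouched. A minimal Python sketch, assuming records are lists of (tag, value) pairs and using a hypothetical function name and sample tags:

```python
def rename_tag(records, old, new):
    """Replace the tag `old` with `new` in every field; values are unchanged."""
    return [[(new if tag == old else tag, value) for tag, value in record]
            for record in records]

records = [
    [("lx", "dog"), ("gl", "dog")],
    [("lx", "cat"), ("gl", "cat"), ("nt", "check")],
]
renamed = rename_tag(records, "gl", "ge")
```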

Deleting Fields

Occasionally you may want to delete selected fields. This may be because they have become obsolete as work has progressed, or in order to prepare a "clean" version of the database for certain purposes, e.g. one without obscene or otherwise sensitive forms. The "Delete Field" function on the Action menu allows you to do this. The fields to be deleted are specified by a combination of tag and value, both regular expressions. You can delete all fields with a certain tag regardless of value by using a regular expression for the value that matches anything, that is ".*".
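The selection by tag and value can be sketched in Python as follows. The function name and record representation are assumptions. The sketch also assumes that each regular expression must match the whole tag or value, which this document does not specify; ShoeboxNormalizer's regular expression dialect may also differ from Python's.

```python
import re

def delete_fields(records, tag_pattern, value_pattern=".*"):
    """Drop every field whose tag and value both match the given patterns."""
    tag_re = re.compile(tag_pattern)
    value_re = re.compile(value_pattern)
    return [[(tag, value) for tag, value in record
             if not (tag_re.fullmatch(tag) and value_re.fullmatch(value))]
            for record in records]

records = [
    [("lx", "dog"), ("nt", "check this"), ("nt", "obsolete: old form")],
    [("lx", "cat"), ("old", "x")],
]
# Delete only the notes whose value starts with "obsolete".
pruned = delete_fields(records, "nt", "obsolete.*")
# Delete every "old" field regardless of its value.
cleaned = delete_fields(records, "old")
```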

Splitting and Joining Fields

A problem that often arises in lexicography is how to handle information with multiple values. For example, if you record the source of a word, you may have multiple sources. If you assign semantic fields, a word may fall into more than one. One approach is to use a single field whose value is a list, with the individual items delimited by a separator character. For example, an entry for an animal might have a semantic field that looked like this:

    \sf  bio/food

This assigns the entry to two semantic fields separated by a slash.

The other approach is to put each value in a separate field with the same tag, e.g.:

    \sf  bio
    \sf  food

These two approaches have different virtues and vices, and it is quite possible to change one's mind as to the best approach. ShoeboxNormalizer provides tools for converting from one approach to the other.

The action "Split Fields" replaces each field with the specified tag with K fields with the same tag, each with a single value, where K is the number of values in the original list. For example, it would convert:

    \sf  bio/food

to:

    \sf  bio
    \sf  food
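The splitting operation can be sketched in Python like this. The function name, the record representation (a list of (tag, value) pairs) and the trimming of whitespace around each value are assumptions made for illustration.

```python
def split_fields(record, tag, separator="/"):
    """Replace each field tagged `tag` by one field per value in its list."""
    result = []
    for field_tag, value in record:
        if field_tag == tag:
            # A list of K values yields K fields with the same tag.
            result.extend((field_tag, part.strip())
                          for part in value.split(separator))
        else:
            result.append((field_tag, value))
    return result

record = [("lx", "deer"), ("sf", "bio/food"), ("ge", "deer")]
split = split_fields(record, "sf")
```

The new fields take the place of the original field, so the order of the other fields in the record is preserved.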

The action "Join Fields" performs the inverse operation: it replaces multiple fields with the same tag with a single field whose value is a list consisting of the concatenation of the values of the original fields delimited by a specified separator. For example, it would convert:

    \sf bio
    \sf food

into:

    \sf bio/food

"Join Fields" is actually a bit more powerful than "Split Fields", since it will merge all fields whose tags match a regular expression, not just those whose tags are the same. For example, if you supply as the tag regular expression "sf[0-9]*", which matches all tags beginning with the letters "sf" optionally followed by a string of digits of any length, it will merge fields such as "sf1" and "sf2".
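A Python sketch of joining by regular expression follows. Several details are assumptions because this document does not state them: the joined field takes the tag of the first matching field, it is placed where that first field stood, and the pattern must match the whole tag. The function name and record representation are likewise hypothetical.

```python
import re

def join_fields(record, tag_pattern, separator="/"):
    """Merge all fields whose tag matches the pattern into a single field."""
    tag_re = re.compile(tag_pattern)
    matching = [(tag, value) for tag, value in record if tag_re.fullmatch(tag)]
    if not matching:
        return list(record)
    # Assumption: the merged field keeps the tag of the first matching field.
    joined = (matching[0][0], separator.join(value for _, value in matching))
    result, placed = [], False
    for tag, value in record:
        if tag_re.fullmatch(tag):
            if not placed:
                result.append(joined)
                placed = True
        else:
            result.append((tag, value))
    return result

record = [("lx", "deer"), ("sf1", "bio"), ("ge", "deer"), ("sf2", "food")]
joined = join_fields(record, "sf[0-9]*")
```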

Cosmetics

When the separator between the tag and the rest of the field is whitespace, as in standard Shoebox format, it need only be a single space. If tags vary in length, this results in the rest of the field starting in different columns. Some people do not like the way this looks and prefer to use a variable number of spaces in order to keep the field values left justified. Doing this by hand, however, is tedious. ShoeboxNormalizer will do this for you. Just load the lexicon and save it again with the parameter "Justify Values" set in the output column of the Shoebox configuration box. You can also set the column at which to start the values by setting "Value Column".
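The padding involved is simple arithmetic, as this Python sketch shows. It is an illustration only: the function name is hypothetical, columns are assumed to be counted from 1, and a tag too long to fit is assumed to be followed by a single space.

```python
def format_field(tag, value, value_column=None):
    """Format one field, optionally padding so the value starts at a fixed column."""
    marker = "\\" + tag
    if value_column is None:
        return marker + " " + value      # standard form: a single space
    # Columns are counted from 1; always keep at least one space.
    padding = max(value_column - 1 - len(marker), 1)
    return marker + " " * padding + value

lines = [format_field(tag, value, value_column=6)
         for tag, value in [("lx", "deer"), ("sf", "bio"), ("uid", "7")]]
```

With the value column set to 6, the three-character markers \lx and \sf are followed by two spaces and the four-character marker \uid by one, so all three values start in the same column.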

Menus

The Options Menu

In many places, if you let the pointer linger, a balloon with information about the associated feature will pop up. This is useful when learning to use the program but can be irritating once you are familiar with it. You may suppress balloon help from the Options menu.

When you save, a time stamp like the following is written at the beginning of the file by default:

    \com Modified by ShoeboxNormalizer 2007-10-21T16:33:01-0700

The time stamp can be suppressed.
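For readers who process such files with their own scripts, a line in the same format as the time stamp example can be produced in Python as follows. The function name is hypothetical, and the sketch only reproduces the format of the example; it says nothing about how ShoeboxNormalizer itself generates the line.

```python
from datetime import datetime, timedelta, timezone

def timestamp_line(moment):
    """Build a comment field in the style of the time stamp example."""
    # %z renders the UTC offset without a colon, e.g. -0700.
    return ("\\com Modified by ShoeboxNormalizer "
            + moment.strftime("%Y-%m-%dT%H:%M:%S%z"))

moment = datetime(2007, 10, 21, 16, 33, 1,
                  tzinfo=timezone(timedelta(hours=-7)))
line = timestamp_line(moment)
```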

By default, the first time you save a lexicon, a backup of the original file is made automatically. The backup is placed in the directory named by the TMP environment variable, if it exists. If not, the value of the TEMP variable is used, if it exists. If neither exists, the directory C:\tmp is used on Microsoft Windows and /tmp on other systems. The name of the backup file is that of the original file with an ISO 8601 time stamp appended. You can suppress creation of the backup file if you wish.
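The order in which the backup directory is chosen can be sketched in Python. The function and parameter names are hypothetical, and the exact form of the suffix is an assumption: this document says only that an ISO 8601 time stamp is appended, so the sketch uses the basic format (no colons) joined to the file name with a dot.

```python
import os
from datetime import datetime

def backup_path(original, now, environ=None, on_windows=None):
    """Choose the backup location: TMP, then TEMP, then a fixed default."""
    environ = os.environ if environ is None else environ
    on_windows = (os.name == "nt") if on_windows is None else on_windows
    if "TMP" in environ:
        directory = environ["TMP"]
    elif "TEMP" in environ:
        directory = environ["TEMP"]
    else:
        directory = "C:\\tmp" if on_windows else "/tmp"
    # Assumption: ISO 8601 basic format, separated from the name by a dot.
    stamp = now.strftime("%Y%m%dT%H%M%S")
    return os.path.join(directory, os.path.basename(original) + "." + stamp)

when = datetime(2007, 10, 21, 16, 33, 1)
path = backup_path("/home/me/lexicon.db", when,
                   environ={"TEMP": "/var/tmp"}, on_windows=False)
fallback = backup_path("lexicon.db", when, environ={}, on_windows=False)
```

Passing the environment and platform as arguments keeps the sketch easy to try out without changing real environment variables.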