136x Filetype PDF File size 0.44 MB Source: frank.itlab.us
programming by ]an Bentley pearls SELF-DESCRIBING DATA You just spent three CPU hours running a simula- %title Efficient string matching: tion to forecast your company’s financial future, and an aid to bibliographic search your boss asks you to interpret the output: %journal Communications of the ACM Xvolume 18 Scenario 1: 3.2% -12.0% 1.1% %number 6 Scenario 2: 12.7% 0.0% 8.6% %month June Scenario 3: 1.6% -8.3% 9.2% %year 1975 %pages 333-340 Hmmm. Blank lines separate entries in the file. A line that You dig through the program to find the meaning begins with a percent sign contains an identifying of each output variable. Good news-Scenario 2 term followed by arbitrary text. Text may be con- paints a rosy picture for the next fiscal year. Now all tinued on subsequent lines that do not start with you have to do is uncover the assumptions of each. a percent sign. Oops-the disaster in Scenario 1 is your company’s The lines in the bibliography file are name-value current strategy, doomed to failure. What did Sce- pairs: each line contains the name of an attribute nario 2 do that was so effective? Back to the pro- followed by its value. The names and the values are gram, trying to discover which input files each one sufficiently self-describing that I don’t need to elabo- reads. . . . rate further on them. This format is particularly well Every programmer knows the frustration of trying suited to bibliographies and other complex data to decipher mysterious data. The first two sections of models. It supports missing attributes (books have no this column discuss two techniques for embedding volume number and journals have no city), multiple descriptions in data files. The third section then attributes (such as authors), and an arbitrary order of applies both methods to a concrete problem. fields (one need not remember whether the volume number comes before or after the month). Name-Value Pairs Name-value pairs are useful in many databases. Many document production systems support biblio- One might, for instance, describe the aircraft carrier graphic references in a form something like: USS Nimitz in a database of naval vessels with these pairs: Xtitle The Art of Computer Programming, name Nimitz Volume 3: Sorting and Searching class CVN Xauthor D. E. Knuth number 68 %pub Addison-Wesley displacement 81600 %city Reading, Mass. length 1040 %year 1973 beam 134 Xauthor A. V. Aho draft 36.5 %author M. J. Corasick flightdeck 252 speed 30 officers 447 0 1967 ACM OOOl-0782/87/0600-0479 75a enlisted 5176 June 1987 Volume 30 Number 6 Communications of the ACM 479 Programming Pearls Such a record could be used for input, storage, and format. The regular structure supports the tabular output. A user could prepare a record for entry into format, but observe that the header line is another the database using a standard text editor. The data- kind of self-description embedded in the data. base system could store records in exadtly this form Name-value pairs are a handy way to give input (we’ll soon see a representation that is more space to any program. (They are perhaps the tiniest of the efficient). The same record could be included in the “little languages” described in the September 1986 answer to the query “What ships have a displace- column.) They can help meet the criteria that ment of more than 75,000 tons?” Kernighan and Plauger propose in Chapter 5 of their Name-value pairs offer several advantages for this Elements of Progrumming Style (2nd ed., McGraw-Hill, hypothetical application. A single format can be 1978): used for reading, storing, and writing records; this Use mnemonic input and output. Make input easy to simplifies life for user and implementer alike. The prepare (and to prepare correctly). Echo the input and application is inherently variable-format because any defaults onto the output; make the outpui self- different ships have different attributes: submarines explanatory. have no flight decks and aircraft carriers have no Name-value pairs are useful in code far removed submerged depth. Unfortunately, the example does from input/output. Suppose we wanted to provide a not document the units in which the various quanti- subroutine that adds a ship to a database. Most lan- ties are expressed; we’ll return to that shortly. guages denote the (formal) name of a parameter by Some database systems store records on mass its position in the parameter list. This mechanism memory in exactly the form shown above. This for- leads to remarkably clumsy calls: mat makes it particularly easy to add new fields to records in an existing database. The name-value for- mat can be quite space efficient, especially com- addship("Nimitz", "CVN", "68", pared to fixed-format records that have many fields, 81600, 1040, 134, 36.5, most of which are usually empty. If storage is criti- 447, 5176,,,30,,,252,,,,) cal, however, then the database could be squeezed to a compressed format: The missing parameters denote fields not present in this record. Is 30 the speed in knots or the draft in feet? A little discipline in commenting helps unravel the mess: Each field begins with a two-character name and addshipc "Nimitz", # name ends with a vertical bar. The input and the stored " CVN" , # class formats are connected by a data dictionary, which "68" , # number might start: 81600, # disp 1040, # length ABBR NAME UNITS . . . 1 na name text cl class text Many programming languages support named nu number text parameters, which make the job easier: di displacement tons le length feet be beam feet addshipcname = "Nimitz", dr draft feet class = "CVN", fl flightdeck feet number = "68", sP speed knots disp = 81600, of officers personnel length = 1040, en enlisted personnel . . . 1 In this dictionary the abbreviations are always the first two characters of the name; that may not hold Even if your language doesn’t have named param- in general. Readers offended by hypocrisy may com- eters, you can usually simulate them with a few plain that the above data is not in a name-value routines: 480 Communications of the ACM lune 1987 Volume 30 Number 6 Programming Pearls shipstart “When looking at the result of a computation like ship&r (name, "Nimitz") this, it is helpful to have an ‘audit trail’ of the vari- shipstr (class, "CVN" ) ous command lines and data files encountered. We shipstr (number, "68" ) therefore built a mechanism for ‘commenting’ the shipnum(disp, 81600) files with various annotations so that when we re- shipnum(length, 1040) view the output, everything is there on one display . . . or piece of paper. shipendc ) “We use several types of comments. An ‘audit trail’ line identifies a data file or a command-line The name variables name, class, number, etc., are transformation. A ‘dictionary’ line names the attri- assigned unique integers. butes in each column of the output. A ‘frame separa- tor’ sets apart a group of sequential records associ- ated with a common event. A ‘note’ allows us to Provenances in Programming place our remarks in the file. All comments begin The provenance of a museum piece lists the origin with an exclamation mark and the type of the com- or source of the object. Antiques are worth more ment; other lines are passed through untouched and when they have a provenance (this chair was built processed as data. Thus the output of the above in such-and-such, then purchased by so-and-so, pipeline might look like: etc.). You might think of a provenance as a pedigree for a nonliving object. The idea of a provenance is old hat to many pro- ltrail sim.events -k 1.5 -1 3 grammers. Some software shops insist that the ltrail sample -t .Ol provenance of a program be kept in the source code ltrail bins -t .Ol itself: in addition to other documentation in a mod- Idict bin-bottom-value item-count ule, the provenance gives the history of the code 0.00 72 (who changed what when, and why). The prove- 0.01 138 nance of a data file is often kept in an associated file 0.02 121 (a transaction log, for instance). Frank Starmer, a . . . Professor in the Departments of Computer Science lnote there is a cluster around 0.75 and Medicine at Duke University, tells how his pro- lframe grams produce data files that contain their own provenances: “We constantly face the problem of keeping track All programs in our library automatically copy exist- of our manipulations of data. We typically explore ing comments from their input onto their output, data sets by setting up a UNIX@ pipeline like and additionally add a new t r a i 1 comment to doc- ument their own action. Programs that reformat data sim.events -k 1.5 -1 3 I (such as bins) add a diet comment to describe the sample -t .Ol I new format. bins “We’ve done this in order to survive. This disci- pline aids in making both input and output data files The first program is a simulation with the two pa- self-documenting. Many other people have built rameters k and 1 (set in this example to 1.5 and 3)’ similar mechanisms; wherever possible, I have cop- The vertical bar at the end of the first line pipes the ied their enhancements rather than having to figure output into the second program. That program sam- out new ones myself.” ples the data at the designated frequency, and in Tom Duff of Bell Labs uses a similar strategy in a turn pipes its output to the third program, which system for processing pictures. He has developed a chops the input into bins (suitable for graphical large suite of UNIX programs that perform transfor- display as a histogram). mations on pictures. A picture file consists of text lines listing the commands that made the picture (terminated by a blank line) followed by the picture itself (represented in binary). The prelude provides a UNIX is a registered trademark of AT&T Bell Laboratories. provenance of the picture. Before Duff started this ’ Note that the two parameters are set by a simple name-value mechanism practice he would sometimes find himself with a -/. B. wonderful picture and no idea of what transforma- ]une 1987 Volume 30 Number 6 Communications of the ACM 481 Programming Pearls tions produced it; now he can reconstruct any camps 4772 picture from its provenance. swaps 4676 Duff implements the provenances in a single cpu 0.1083 library routine that all programs call as they begin execution. The routine copies the old command Given the sort routines and other procedures to do lines to the output and then writes the command the real work, the control program is easy to build. line of the current program. Its main loop is sketched in Program 1: the code reads each input line, copies it to the output, and A Sorting Lab processes the name-value pair. The s imul.ate ( ) To make the above ideas more concrete, we’ll apply routine performs the experiment and writes the out- them to a problem described in the July 1985 col- put variables in name-value format; it is called at umn: building “scaffolding” to experiment with sort each blank line and also at the end of the file. routines. That column contains code for several sort routines and a few small programs to exercise them. This section will sketch a better interface to the rou- loop tines. The input and output are both expressed in read input line into string S name-value pairs, and the output contains a com- if end of file then break plete description of the input (its provenance). if S = *I II then simulated) Experiments on sorting algorithms involve adjust- write S on output ing various parameters, executing the specified rou- Fl = first field in S tine, then reporting key attributes of the computa- F2 = second field in S tion. The precise operations to be performed can be if Fl= "n" then specified by a sequence of name-value pairs. Thus N = F2 the input file to the sorting lab might look like this: else if Fl = "alg" then if F2 = "insert" then n 100000 alg = 1 input identical else if F2 = "heap" then alg quick alg = 2 cutoff 10 else if F2 = "quick" then partition random alg = 3 seed 379 else error("bad alg") else if Fl = "input" then . . . In this example the problem size, n, is set to 100,000. simulatet 1 The input array is initialized with identical elements (other options might include random, PROGRAM 1. A Loop to Process Name-Value Pairs sorted,or reversed elements). The sortingalgo- rithm in this experiment is quit k for quicksort; insert (for insertionsort) and heap (forheapsort) might also be available. The cutoff and parti - t ion names specify further parameters in quit k - This structure is useful for many simulation pro- sort. grams. The program is easy to build and easy to use. The input to the simulation program is a sequence Its output can be read by a human and can also be of experiments in the above format, separated by fed to later programs for statistical analysis. Because blank. lines. Its output is in exactly the same format all the input variables (which together provide a of name-value pairs, separated by blank lines. The provenance of the experiment) appear with the out- first part of an output record contains the original put variables, any particular experiment can be re- input description, which gives the provenance of peated and studied in detail. The variable format each experiment. The input is followed by three ad- allows additional input and output parameters to ditional attributes: camps records the number of be added to future simulations without having to comparisons made, swaps counts swaps, and cpu restructure previous data. Problem 8 shows how the records the run time of the procedure. Thus an out- basic structure can be gracefully extended to per- put record might end with the fields: forming sets of experiments. 482 Communications of the ACM June 1987 Volume 30 Number 6
no reviews yet
Please Login to review.