142x Filetype PDF File size 1.27 MB Source: cs61.seas.harvard.edu
by jot1 Be~~tley programming with Special Guest Oysters Don Knuth and Doug McIlroy pearls A LITERATE PROGRAM Last month‘s column introduced Don Knuth’s style of frequency; or there might not even be as many as k “Literate Programming” and his WEB system for building words. Let’s be more precise: The most common programs that are works of literature. This column pre- words are to be printed in order of decreasing fre- sents a literate program by Knuth (its origins are sketched quency, with words of equal frequency listed in al- in last month‘s column) and, as befits literature, a review. phabetic order. Printing should stop after k words So without further ado, here is Knuth’s program, have been output, if more than k words are present. retypeset in Communications style. -Jon Bentley 2. The input file is assumed to contain the given Common Words Section text. If it begins with a positive decimal number Introduction.. . . . . , . . . . . . . , , . . . . . , . , . . . . . , , . . 1 (preceded by optional blanks), that number will be Strategic considerations . . . . . . . . . . . . . . . . . . . . . . , a the value of k; otherwise we shall assume that Basic input routines . . . . , , . . . . . . . . . . . . . . . . . . . . 9 k = 100. Answers will be sent to the output file. Dictionary lookup . , . . . . . . . . . . . . . . . . . , . . . . . . , .17 define default-k = 100 (use this value if k isn’t The frequency counts . . . . . . . . . . . . . . . . . . . . . . . .32 otherwise specified) Sortingatrie . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...36 Theendgame................................41 3. Besides solving the given problem, this program is Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...42 supposed to be an example of the WEB system, for people who know some Pascal but who have never 1. Introduction. The purpose of this program is to seen WEB before. Here is an outline of the program solve the following problem posed by Jon Bentley: to be constructed: Given a text file and an integer k, print the k most program common-words (input, output); common words in the file (and the number of type (Type declarations 17) their occurrences) in decreasing frequency. var (Global variables 4) (Procedures for initialization 5) Jon intentionally left the problem somewhat (Procedures for input and output a) vague, but he stated that “a user should be able to (Procedures for data manipulation 20) find the 100 most frequent words in a twenty-page begin (The main program 8); technical paper (roughly a 50K byte file) without end. undue emotional trauma.” Let us agree that a word is a sequence of one or 4. The main idea of the WEB approach is to let the more contiguous letters; “Bentley” is a word, but program grow in natural stages, with its parts pre- “ain 1 t II isn’t. The sequence of letters should be sented in roughly the order that they might have maximal, in the sense that it cannot be lengthened been written by a programmer who isn’t especially without including a nonletter. Uppercase letters clairvoyant. are considered equivalent to their lowercase For example, each global variable will be intro- counterparts, so that the words “Bentley” duced when we first know that it is necessary or and “BENTLEY” and “bentley” are essentially desirable; the WEB system will take care of collecting identical. these declarations into the proper place. We already The given problem still isn’t well defined, for the know about one global variable, namely the number file might contain more than k words, all of the same that Bentley called k. Let us give it the more descrip- 01966 ACM OOOl-0782/86/0600-0471 750 tive name max-words-to-print. June 1986 Volume 29 Number 6 Communications of the ACM 471 Programming Pear/s (Global variables 4) = There should be a frequency count associated max.-words-to-print: integer; with each word, and we will eventually want to run (at most this many words will be printed) through the words in order of decreasing frequency. See also sections 11, 13, 18, 22, 32, and 36. But there’s no need to keep these counts in order as This code is used in section 3. we read through the input, since the order matters only at the end. 5. As we introduce new global variables, we’ll often Therefore it makes sense to structure our program want to give them certain starting values, This will as follows: be done by the initialize procedure, whose body will (The main program a) = consist of various pieces of code to be specified initialize; when we think of particular kinds of initialization, (Establish the value of max-words-to-print IO); (Procedures for initialization 5) = (Input the text, maintaining a dictionary with procedure initialize; frequency counts 34); var i: integer; {all-purpose index for initializa- (Sort the dictionary by frequency 39); tions) (Output the results 41) begin (Set initial values 12) This code is used in section 3. end; This code is used in section 3. 9. Basic input routines. Let’s switch to a bottom- up approach now, by writing some of the procedures 6. The WEB system, which may be thought of as a that we know will be necessary sooner or later. preprocessor for Pascal, includes a macro definition Then we’ll have some confidence that our program facibty so that portable programs are easier to write. is taking shape, even though we haven’t decided yet For example, we have already defined ‘default-k’ to how to handle the searching or the sorting. It will be be 100. Here are two more examples of WEB macros; nice to get the messy details of Pascal input out of they allow us to write, e.g., ‘incr(counf[ p])’ as a con- the way and off our minds. venient abbreviation for the statement ‘counf[ p] + Here’s a function that reads an optional positive counf[p] + 1’. integer, returning zero if none is present at the be- define incr(#) = # c- # + 1 (increment a vari- ginning of the current line. able} (Procedures for input and output 9) = define deer(#) = 4# t # - 1 {decrement a vari- function read-inf: integer; able) var n: integer; {th e accumulated value) begin n c 0; 7. Some of the procedures we shall be writing come if leaf then to abrupt conclusions; hence it will be convenient to begin while (leoln) A (input t = Iu’) do introduce a ‘return’ macro for the operation of gef(inpuf]; jumping to the end of the procedure. A symbolic while (input 1 2 ‘0’) A (input 1 5 ‘9’) do label ‘exit’ will be declared in all such procedures, begin n t lO*n + ord(inpuf7) - ord(‘0’); and ‘exit:’ will be placed just before the final end. gef(inpuf); (No other labels or goto statements are used in the end; present program, but the author would find it pain- end; ful to eliminate these particular ones.) read-inf c n; define exit = 30 (the end of a procedure} end; define return = goto exit {quick termination] See also sections 15, 35, and 40. format return = nil {typeset ‘return’ in boldface) This code is used in section 3. 8. Strategic considerations. What algorithms and 10. We invoke readht only once. data structures should be used for Bentley’s prob- (Establish the value of max-words-to-print IO) = lem? Clearly we need to be able to recognize differ- max-words-to-print c read-inf; ent occurrences of the same word, so some sort of if max-words-to-prinf = 0 then internal dictionary is necessary. There’s no obvious max-words-to-print t default-k way to decide that a particular word of the input This code is used in section 8. cannot possibly be in the final set, until we’ve gotten very near the end of the file; so we might as well 11. To find words in the input file, we want a quick remember every word that appears. way to distinguish letters from nonletters. Pascal has 472 Communications of he ACM June 1986 Volume 29 Number 6 Programming Pearls conspired to make this problem somewhat tricky, 13. Each new word found in the input will be because it leaves many details of the character set placed into a buffer array. We shall assume that no undefined. We shall define two arrays, lowercase and words are more than 60 letters long; if a longer word uppercase, that specify the letters of the alphabet. A appears, it will be truncated to 60 characters, and a third array, lettercode, maps arbitrary characters into warning message will be printed at the end of the the integers 0 . . 26. run. If c is a value of type char that represents the kth define max-word-length = 60 letter of the alphabet, then lettercode[ord(c)] = k; but {words shouldn’t be longer than this] if c is a nonletter, lettercode[ord(c)] = 0. We assume that 0 5 ord(c) 5 255 whenever c is of type char. (Global variables 4) += (Global variables 4) += buffer: array [l . . max-word-length] of 1 . . 26; lowercase, uppercase: array [l . . 261 of char; (the current word] (the letters) word-length: 0 . . max-word-length; lettercode; array [0 . . 2551 of 0 . . 26; (the number of active letters currently in buffer] {the input conversion table) word-truncated: boolean; (was some word longer than max-word-length?] 12. A somewhat tedious set of assignments is neces- 14. (Set initial values 12) += sary for the definition of lowercase and uppercase, be- word-truncated t false; cause letters need not be consecutive in Pascal’s character set. 15. We’re ready now for the main input routine, (Set initial values 12) = which puts the next word into the buffer. If no more lowercase[l] t ‘a’; uppercase[l] t ‘A’; words remain, word-length is set to zero; otherwise lowercase[2] + ‘b’; uppercase[2] c ‘B’; word-length is set to the length of the new word. lowercase[3] .L ‘c’; uppercase[3] c ‘C’; (Procedures for input and output 9) += lowercase[4] t ‘d’; uppeucase[4] t ‘D’; procedure get-word; loweYcase[5] + ‘e’; uppercase[5] t ‘E’; label exit; {enable a quick return) lowercase[6] + ‘f’; uppercase[6] t ‘I?‘; begin word-length + 0; lowercase[7] t ‘g’; uppercase[7] t ‘G’; if leaf then lowercase[8] + ‘h’; uppercase[8] t ‘H’; begin while lettercode[ord(input t)] = 0 do lowercase[9] + ‘i’; uppercase[9] c ‘I’; if leoln then get(input) lowercase[lO] + ’ j’; uppercase[lO] c ‘J’; else begin read-ln; ~owercase[ll] t ‘k’; uppercase[ll] t ‘K’; if eof then return; lowercase[l2] c ‘1’; uppercase1121 c ‘L’; end; lowercase[l3] t ‘m’; uppercase[l3] c ‘M’; (Read a word into buffer 16); lowercase[l4] t ‘n’; uppercase[l4] t ‘N’; end; lowercase[l5] + ‘0’; uppercase[l5] c ‘0’; exit: end; lowercase[l6] c ‘p’; uppercase[l6] c ‘P’; lowercase[l7] t ‘q’; uppercase[l7] t ‘Q’; loweucase[l8] c ‘r’; uppercase[l8] c ‘R’; 16. At this point lettercode[ord(input t)] > 0, hence lowercase[l9] c ‘s’; uppercase[l9] c ‘S’; input T contains the first letter of a word. loweYcase[20] + ‘t’; uppercase[20] c ‘T’; (Read a word into buffer 16) = Zowercase[21] c ‘u’; uppercase[Zl] c ‘U’; repeat if word-length = max-word-length then lowercase[22] c ‘v’; uppercase[22] t ‘V’; word-truncated + true lowercase[23] + ‘WI; uppercase[23] t ‘W’; else begin incr(word-length); lowercase[24] + ‘x’; uppercase[24] c ‘X’; buffer[word-length] c lettercode[ord(input t)]; lowercase[25] + ‘y’; uppercase[25] c ‘Y’; end; lowercase[26] c ‘2’; uppercase[26] c ‘z’; get(input); for i t 0 to 255 do lettercode [i] c 0; until lettercode[ord(input f)] = 0 for i t 1 to 26 do begin lettercode[ord(lowercase[i])] c i; This code is used in section 15. lettercode[ord(uppercase[i])] c i; end; 17. Dictionary lookup. Given a word in the buffer, See also sections 14, 19, 23, and 33. we will want to look for it in a dynamic dictionary This code is used in section 5. of all words that have appeared so far. We expect June 1986 Volume 29 Number 6 Communications of the ACM 473 Programming Pearls many words to occur often, so we want a search est child’s sibling pointer is link[ p]. Continuing our technique that will find existing words quickly. Fur- earlier example, if all words in the dictionary begin- thermore, the dictionary should accommodate words ning with “be*’ sl.art with either “hen” or “betl*, of variable length, and [ideally) it should also facili- then sibfing[2000] = 2021, sibling[2021] = 2015, and tate the task of alphabetic ordering. sibling[2015] = 2000. These constraints, suggest a variant of the data Notice that children of different parents might ap- structure introduced by Frank M. Liang in his Ph.D. pear next to each other. For example, we might thesis [“Word Hy-p.hen-a-tion by Corn-pu-ter,” have ch[2020] = 6, for the child of some word such Stanford University, 19831. Liang’s structure, which that link[p] = 2014. we may call a hash Pie, requires comparatively few If link[ p] # 0, the table entry in position link[ p] is operations to find a word that is already present, called the “header” of p’s children. The special code although it may take somewhat longer to insert a value header appears in the ch field of each header new entry. Some space is sacrificed-we will need entry. two pointers, a count, and another s-bit field for If p represents a word, count [p] is the number of each character in the dictionary, plus extra space to times that the word has occurred in the input so far. keep the hash table from becoming congested-but The count field in a header entry is undefined. relatively large memories are commonplace now- Unused positions p have ch[p] = empty-slot: In this adays, so the method seems ideal for the present case link[ p], sibling[ p], and count [ p] are undefined. application. define empfy-slot = 0 A trie represents a set of words and all prefixes of define header = 27 those words [cf. I
no reviews yet
Please Login to review.