jagomart
digital resources
picture1_Programming Pdf 183775 | P471 Bentley


 142x       Filetype PDF       File size 1.27 MB       Source: cs61.seas.harvard.edu


File: Programming Pdf 183775 | P471 Bentley
by jot1 be tley programming with special guest oysters don knuth and doug mcilroy pearls a literate program last month s column introduced don knuth s style of frequency or ...

icon picture PDF Filetype PDF | Posted on 31 Jan 2023 | 2 years ago
Partial capture of text on file.
                by jot1 Be~~tley                                                                                programming 
                 with  Special  Guest  Oysters 
                      Don  Knuth  and Doug  McIlroy                                                              pearls 
                 A  LITERATE  PROGRAM 
                 Last month‘s column  introduced  Don Knuth’s  style of                                           frequency;  or there  might  not  even  be as many  as k 
                 “Literate Programming”  and his WEB  system for  building                                        words.  Let’s be more  precise:  The  most common 
                 programs that  are works of literature.  This  column  pre-                                      words  are to be printed  in  order  of decreasing  fre- 
                 sents a literate  program by Knuth  (its origins  are sketched                                   quency,  with  words  of equal  frequency  listed  in  al- 
                 in  last month‘s column) and, as befits literature,  a review.                                   phabetic  order.  Printing  should  stop after  k words 
                 So without  further  ado, here is Knuth’s  program,                                              have  been  output,  if  more  than  k words  are present. 
                 retypeset in  Communications                     style.              -Jon  Bentley 
                                                                                                                  2.  The  input  file  is assumed  to contain  the  given 
                 Common  Words                                                                 Section            text.  If  it  begins  with  a positive  decimal  number 
                 Introduction..            . . . . , . . . . . . . , , . . . . . , . , . . . . . , , . .  1       (preceded  by  optional  blanks),  that  number  will  be 
                 Strategic  considerations                 . . . . . . . . . . . . . . . . . . . . . . ,  a       the  value  of k; otherwise  we  shall  assume that 
                 Basic input  routines  . . . . , , . . . . . . . . . . . . . . . . . . . .  9                    k =  100. Answers  will  be sent to the  output  file. 
                 Dictionary         lookup  . , . . . . . . . . . . . . . . . . . , . . . . . . , .17                 define  default-k  =  100 (use this  value  if  k isn’t 
                 The  frequency  counts  . . . . . . . . . . . . . . . . . . . . . . . .32                                           otherwise  specified) 
                 Sortingatrie            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...36 
                 Theendgame................................41                                                     3.  Besides solving  the  given  problem,  this  program  is 
                 Index       . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...42          supposed to be an example  of the  WEB  system,  for 
                                                                                                                  people  who  know  some Pascal but  who  have  never 
                 1.  Introduction.              The  purpose  of this  program  is to                             seen WEB  before.  Here  is an outline  of the  program 
                 solve  the  following            problem  posed by  Jon Bentley:                                 to be constructed: 
                     Given  a text  file  and  an integer  k, print  the  k most                                  program  common-words  (input,  output); 
                     common  words  in  the  file  (and the  number  of                                               type  (Type  declarations  17) 
                     their  occurrences)  in  decreasing  frequency.                                                  var  (Global  variables  4) 
                                                                                                                      (Procedures  for  initialization                  5) 
                     Jon intentionally            left  the  problem  somewhat                                        (Procedures  for  input  and  output  a) 
                 vague,  but  he stated  that  “a user  should  be able  to                                           (Procedures  for data manipulation                          20) 
                 find  the  100 most  frequent  words  in  a twenty-page                                              begin  (The  main  program  8); 
                 technical  paper  (roughly  a 50K byte  file)  without                                               end. 
                 undue  emotional  trauma.” 
                     Let  us agree that  a word is a sequence  of one or                                          4.  The  main  idea  of the  WEB  approach  is to let  the 
                 more contiguous  letters;  “Bentley”                            is a word,  but                  program  grow  in  natural  stages, with  its  parts  pre- 
                 “ain      1 t  II isn’t.  The  sequence  of letters  should  be                                  sented  in  roughly  the  order  that  they  might  have 
                 maximal,  in  the  sense that  it  cannot  be lengthened                                         been  written          by  a programmer  who  isn’t  especially 
                 without       including         a nonletter.  Uppercase  letters                                  clairvoyant. 
                 are considered  equivalent  to their  lowercase                                                      For example,  each global  variable  will  be intro- 
                 counterparts,  so that  the  words  “Bentley”                                                    duced  when  we first  know  that  it  is necessary  or 
                 and  “BENTLEY”  and  “bentley”                             are essentially                        desirable;  the  WEB  system  will  take  care of collecting 
                 identical.                                                                                        these declarations  into  the  proper  place.  We already 
                     The  given  problem  still  isn’t  well  defined,  for  the                                   know  about  one global  variable,  namely  the  number 
                 file  might  contain  more  than  k words,  all  of the  same                                     that  Bentley  called  k. Let  us give  it  the  more  descrip- 
                 01966  ACM  OOOl-0782/86/0600-0471           750                                                  tive  name  max-words-to-print. 
                 June 1986  Volume  29  Number  6                                                                                                                     Communications of the ACM                     471 
             Programming Pear/s 
                         (Global  variables  4)  =                                                                      There  should  be a frequency  count  associated 
                         max.-words-to-print:             integer;                                                   with  each word,  and we  will  eventually                       want  to run 
                                (at most this  many  words  will  be printed)                                        through  the  words  in  order  of decreasing  frequency. 
                        See also sections 11, 13, 18, 22, 32, and 36.                                                But  there’s  no  need  to keep  these  counts  in  order  as 
                        This code is used in section 3.                                                              we read through  the  input,  since  the  order  matters 
                                                                                                                     only  at  the  end. 
                        5.  As we  introduce  new  global  variables,  we’ll  often                                     Therefore  it  makes sense to structure  our  program 
                        want  to give  them  certain  starting  values,  This  will                                  as follows: 
                        be done by  the  initialize  procedure,  whose  body  will                                   (The  main  program  a) = 
                        consist  of various  pieces  of code to be specified                                            initialize; 
                        when  we  think  of particular                 kinds  of initialization,                        (Establish  the  value  of max-words-to-print                         IO); 
                        (Procedures  for  initialization                5)  =                                           (Input  the  text,  maintaining               a dictionary         with 
                        procedure  initialize;                                                                              frequency  counts  34); 
                           var  i:  integer;        {all-purpose  index  for  initializa-                               (Sort  the  dictionary           by  frequency  39); 
                                     tions)                                                                             (Output  the  results  41) 
                           begin  (Set  initial         values  12)                                                 This  code is used in section 3. 
                           end; 
                        This  code is used in section 3.                                                            9.  Basic  input  routines.               Let’s  switch  to a bottom- 
                                                                                                                    up approach  now,  by  writing  some of the  procedures 
                        6.  The  WEB system,  which  may  be thought  of as a                                       that  we  know  will  be necessary  sooner  or later. 
                        preprocessor  for  Pascal, includes  a macro  definition                                    Then  we’ll  have  some confidence  that  our  program 
                        facibty  so that  portable  programs  are easier  to write.                                 is  taking  shape, even  though  we  haven’t  decided  yet 
                        For example,  we  have  already  defined  ‘default-k’  to                                   how  to handle  the  searching  or the  sorting.  It  will  be 
                        be 100. Here  are two  more  examples  of WEB macros;                                       nice  to get the  messy details  of Pascal input  out  of 
                        they  allow  us to write,  e.g., ‘incr(counf[  p])’  as a con-                              the  way  and off  our  minds. 
                        venient  abbreviation             for  the  statement  ‘counf[ p] +                             Here’s  a function          that  reads an optional  positive 
                        counf[p]  +  1’.                                                                            integer,  returning           zero  if  none  is present  at the  be- 
                           define  incr(#)  =  # c-  #  +  1              (increment  a vari-                       ginning  of the  current  line. 
                                    able}                                                                           (Procedures  for  input  and  output  9)  = 
                           define  deer(#) =  4# t            #  -  1      {decrement  a vari-                      function       read-inf:  integer; 
                                   able)                                                                                var  n: integer;         {th e accumulated  value) 
                                                                                                                        begin  n c  0; 
                        7.  Some of the  procedures  we  shall  be writing  come                                        if  leaf   then 
                        to abrupt  conclusions;  hence  it  will  be convenient                          to                begin  while  (leoln)  A (input  t  =  Iu’)  do 
                        introduce  a ‘return’            macro  for  the  operation  of                                       gef(inpuf]; 
                        jumping  to the  end of the  procedure.  A symbolic                                                while  (input  1 2  ‘0’)  A (input  1 5  ‘9’)  do 
                        label  ‘exit’  will     be declared  in  all  such  procedures,                                       begin  n t        lO*n  +  ord(inpuf7)  -  ord(‘0’); 
                        and  ‘exit:’  will     be placed  just  before  the  final  end.                                      gef(inpuf); 
                        (No other  labels  or goto  statements  are used in  the                                              end; 
                        present  program,  but  the  author  would  find  it  pain-                                        end; 
                        ful  to eliminate        these particular           ones.)                                      read-inf  c  n; 
                           define  exit  =  30          (the  end  of a procedure}                                     end; 
                           define  return  =  goto  exit             {quick  termination]                           See also sections 15, 35, and 40. 
                           format  return  =  nil            {typeset  ‘return’         in  boldface)               This code is used in section 3. 
                        8.  Strategic  considerations.                  What  algorithms  and                       10.  We invoke  readht                only  once. 
                        data structures  should  be used for  Bentley’s  prob-                                      (Establish  the  value  of max-words-to-print                         IO)  = 
                        lem?  Clearly  we  need to be able  to recognize  differ-                                      max-words-to-print              c  read-inf; 
                        ent  occurrences  of the  same word,  so some sort  of                                         if  max-words-to-prinf             =  0 then 
                        internal     dictionary        is  necessary.  There’s  no obvious                                 max-words-to-print              t    default-k 
                       way  to decide  that  a particular                 word  of the  input                       This  code is used in section 8. 
                       cannot  possibly  be in  the  final  set, until  we’ve  gotten 
                       very  near  the  end  of the  file;  so we  might  as well                                   11.  To find  words  in  the  input  file,  we  want  a quick 
                       remember  every  word  that  appears.                                                        way  to distinguish           letters  from  nonletters.  Pascal has 
            472         Communications  of he  ACM                                                                                                           June 1986  Volume  29  Number  6 
                                                                                                                                                                           Programming  Pearls 
               conspired  to make  this  problem  somewhat  tricky,                                    13.  Each  new  word  found  in  the  input  will  be 
               because it  leaves  many  details  of the  character  set                               placed  into  a buffer array.  We shall  assume that  no 
               undefined.  We shall  define  two  arrays,  lowercase and                               words  are more  than  60 letters  long;  if  a longer  word 
               uppercase, that  specify  the  letters  of the  alphabet.  A                            appears, it  will  be truncated  to 60 characters,  and  a 
               third  array,  lettercode, maps arbitrary                 characters  into              warning  message will  be printed  at the  end  of the 
               the  integers  0 . . 26.                                                                run. 
                  If  c is a  value  of type  char that  represents  the  kth                             define  max-word-length              =  60 
               letter  of the  alphabet,  then  lettercode[ord(c)]  =  k; but                                    {words  shouldn’t  be longer  than  this] 
               if  c is a nonletter,       lettercode[ord(c)]  =  0. We assume 
               that  0 5  ord(c) 5  255 whenever  c is of type  char.                                  (Global  variables  4)  += 
               (Global  variables  4)  +=                                                              buffer: array  [l  . . max-word-length]                of  1 . . 26; 
               lowercase, uppercase: array  [l  . . 261  of char;                                             (the  current  word] 
                      (the  letters)                                                                   word-length:  0 . . max-word-length; 
               lettercode; array  [0 . . 2551 of 0 . . 26;                                                    (the  number  of active  letters  currently                 in  buffer] 
                      {the  input  conversion  table)                                                  word-truncated:  boolean; 
                                                                                                              (was some word  longer  than  max-word-length?] 
               12.  A somewhat  tedious  set of assignments  is neces-                                 14.  (Set initial       values  12) += 
               sary  for  the  definition        of lowercase and  uppercase, be-                         word-truncated  t  false; 
               cause letters  need not  be consecutive  in  Pascal’s 
               character  set.                                                                         15.  We’re  ready  now  for  the  main  input  routine, 
               (Set initial      values  12)  =                                                        which  puts  the  next  word  into  the  buffer.  If no  more 
                  lowercase[l]  t         ‘a’;  uppercase[l]  t         ‘A’;                           words  remain,  word-length  is set to zero; otherwise 
                  lowercase[2] +  ‘b’;  uppercase[2] c  ‘B’;                                           word-length  is set to the  length  of the  new  word. 
                  lowercase[3] .L  ‘c’;  uppercase[3] c  ‘C’;                                          (Procedures  for  input  and  output  9) += 
                  lowercase[4] t          ‘d’;  uppeucase[4] t          ‘D’;                           procedure  get-word; 
                  loweYcase[5] +  ‘e’;  uppercase[5] t                  ‘E’;                              label  exit;      {enable  a quick  return) 
                  lowercase[6] +  ‘f’;          uppercase[6] t          ‘I?‘;                             begin  word-length  +  0; 
                  lowercase[7] t          ‘g’;  uppercase[7] t          ‘G’;                                 if  leaf   then 
                  lowercase[8] +  ‘h’;          uppercase[8] t          ‘H’;                                 begin  while  lettercode[ord(input  t)]  =  0 do 
                  lowercase[9] +  ‘i’;          uppercase[9] c  ‘I’;                                             if  leoln    then  get(input) 
                  lowercase[lO]  +  ’ j’;  uppercase[lO]  c  ‘J’;                                                else begin  read-ln; 
                  ~owercase[ll]  t          ‘k’;  uppercase[ll]        t    ‘K’;                                    if  eof then  return; 
                  lowercase[l2]  c  ‘1’;  uppercase1121 c  ‘L’;                                                     end; 
                  lowercase[l3]  t          ‘m’;  uppercase[l3]  c  ‘M’;                                      (Read a word  into  buffer 16); 
                  lowercase[l4]  t          ‘n’;  uppercase[l4]  t          ‘N’;                             end; 
                  lowercase[l5]  +  ‘0’;  uppercase[l5]  c  ‘0’;                                       exit:  end; 
                  lowercase[l6]  c  ‘p’;  uppercase[l6]  c  ‘P’; 
                  lowercase[l7]  t          ‘q’;  uppercase[l7]  t          ‘Q’; 
                  loweucase[l8]  c  ‘r’;  uppercase[l8]  c  ‘R’;                                       16.  At  this  point  lettercode[ord(input  t)]  >  0, hence 
                  lowercase[l9]  c  ‘s’;  uppercase[l9]  c  ‘S’;                                       input  T contains  the  first  letter  of a word. 
                  loweYcase[20] +  ‘t’;           uppercase[20] c  ‘T’;                                (Read a word  into  buffer 16) = 
                  Zowercase[21] c  ‘u’;  uppercase[Zl]  c  ‘U’;                                        repeat  if  word-length  =  max-word-length                    then 
                  lowercase[22] c  ‘v’;  uppercase[22] t                    ‘V’;                             word-truncated  +  true 
                  lowercase[23] +  ‘WI; uppercase[23] t                     ‘W’;                          else begin  incr(word-length); 
                  lowercase[24] +  ‘x’;  uppercase[24] c  ‘X’;                                                buffer[word-length]          c  lettercode[ord(input  t)]; 
                  lowercase[25] +  ‘y’;  uppercase[25] c  ‘Y’;                                               end; 
                  lowercase[26] c  ‘2’;  uppercase[26] c  ‘z’;                                            get(input); 
                  for  i t    0 to 255 do lettercode [i]  c  0;                                        until   lettercode[ord(input  f)]  =  0 
                  for  i t    1 to 26 do 
                      begin  lettercode[ord(lowercase[i])]  c  i;                                      This  code  is used  in  section  15. 
                      lettercode[ord(uppercase[i])]  c  i; 
                      end;                                                                             17.  Dictionary         lookup.        Given  a word  in  the  buffer, 
               See also  sections  14, 19,  23, and  33.                                               we will  want  to look  for  it  in  a dynamic  dictionary 
               This  code  is used  in  section  5.                                                    of all  words  that  have  appeared  so far.  We expect 
               June  1986     Volume  29  Number  6                                                                                                   Communications  of the ACM               473 
           Programming  Pearls 
                     many  words  to occur  often,  so we  want  a search                          est child’s  sibling  pointer  is link[  p].  Continuing        our 
                     technique  that  will  find  existing  words  quickly.  Fur-                  earlier  example,  if  all  words  in  the  dictionary         begin- 
                     thermore,  the  dictionary        should  accommodate  words                  ning  with  “be*’     sl.art with  either  “hen”        or  “betl*, 
                     of variable  length,  and  [ideally)  it  should  also facili-                then  sibfing[2000]  =  2021, sibling[2021]  =  2015, and 
                     tate the  task of alphabetic  ordering.                                       sibling[2015]  =  2000. 
                        These  constraints,  suggest a variant  of the  data                          Notice  that  children      of different  parents  might  ap- 
                     structure  introduced  by  Frank  M.  Liang  in  his  Ph.D.                   pear next  to each other.  For example,  we  might 
                     thesis  [“Word  Hy-p.hen-a-tion          by  Corn-pu-ter,”                    have  ch[2020]  =  6, for  the  child  of some word  such 
                     Stanford  University,       19831.  Liang’s  structure,  which                that  link[p]   =  2014. 
                     we may  call  a hash Pie, requires  comparatively               few              If  link[  p]  #  0, the  table  entry  in  position  link[  p]  is 
                     operations  to find  a word  that  is already  present,                       called  the  “header”  of p’s children.  The  special  code 
                    although  it  may  take  somewhat  longer  to insert  a                        value  header appears in  the  ch field  of each  header 
                    new  entry.  Some space is sacrificed-we                 will  need            entry. 
                    two  pointers,  a count,  and  another  s-bit  field  for                         If  p represents  a word,  count [p]  is the  number  of 
                    each character  in  the  dictionary,          plus  extra  space to            times  that  the  word  has occurred  in  the  input  so far. 
                    keep the  hash  table  from  becoming  congested-but                           The  count field  in  a header  entry  is undefined. 
                    relatively    large  memories  are commonplace  now-                              Unused  positions  p have  ch[p]  =  empty-slot:  In  this 
                    adays, so the  method  seems ideal  for  the  present                          case link[  p],  sibling[  p],  and  count [ p]  are undefined. 
                    application.                                                                      define  empfy-slot  =  0 
                       A trie  represents  a set of words  and  all  prefixes  of                     define  header =  27 
                    those words  [cf.  I
						
									
										
									
																
													
					
The words contained in this file might help you see if this file matches what you are looking for:

...By jot be tley programming with special guest oysters don knuth and doug mcilroy pearls a literate program last month s column introduced style of frequency or there might not even as many k his web system for building words let more precise the most common programs that are works literature this pre to printed in order decreasing fre sents its origins sketched quency equal listed al befits review phabetic printing should stop after so without further ado here is have been output if than present retypeset communications jon bentley input file assumed contain given section text it begins positive decimal number introduction preceded optional blanks will strategic considerations value otherwise we shall assume basic routines answers sent dictionary lookup define default use isn t counts specified sortingatrie theendgame besides solving problem index supposed an example people who know some pascal but never purpose seen before outline solve following posed constructed integer print type d...

no reviews yet
Please Login to review.