Programming Pearls, Second Edition by Jon Bentley. Addison-Wesley, Inc., 2000. ISBN 0-201-65788-0. 239 + xi pp. $24.95


This book is a collection of essays about a glamorous aspect of software: programming pearls whose origins lie beyond solid engineering, in the realm of insight and creativity. This book provides a guide for both students and experienced programmers about how to design and create programs, and how to think about programming. The book is full of small case studies, real examples, and interesting exercises for learning how to program. This web page contains samples from the whole work for you to investigate. For teachers, the links below lead to some of the central material suitable for classroom use. Steve McConnell describes the book as ``a celebration of design in the small''. Browse this site to sample it yourself.

What's new on this web site?

From The Book
  Table of Contents
  Preface
  Part I: Preliminaries
    Column 1: Cracking the Oyster
    Column 2: Aha! Algorithms [Sketch]
    Column 4: Writing Correct Programs [Sketch]
    Column 5: A Small Matter of Programming [Sketch]
  Part II: Performance
    Column 7: The Back of the Envelope
    Column 8: Algorithm Design Techniques [Sketch]
  Part III: The Product
    Column 14: Heaps [Sketch]
    Column 15: Strings of Pearls
  Epilog to the First Edition
  Epilog to the Second Edition
  Appendix 2: An Estimation Quiz
  Appendix 3: Cost Models for Time and Space

  Appendix 4: Rules for Code Tuning
  Solutions for Column 1, Column 5, Column 7, Column 15
  Index

About The Book
  Why a Second Edition?
  To Readers of the First Edition
  About the First Edition
  Errata

Supporting Material
  Source Code
  Web Sites Relevant to the Book
  Animation of Sorting Algorithms
  Tricks of the Trade
  Teaching Material
  Other Links


Copyright © 1999 Lucent Technologies. All rights reserved. Thu 19 Oct 2000

What's New on the Programming Pearls Web Site

November 2000 Column 15 is now on the site, complete with a new program for letter-level Markov text, and new examples of word frequencies, long repeated strings, and letter-level and word-level Markov text.

October 2000 The rules for code tuning from my 1982 book Writing Efficient Programs are now online, and so is a Powerpoint Show on Cache-Conscious Algorithms and Data Structures.

August 2000 The errata list just keeps on growing. If you see errors, please send them in.

July 2000 Programming Pearls is often used for teaching undergraduates. This page describes how some of the topics in the book can be incorporated into college classrooms.

March 2000 A theme running through the book concerns the Tricks of the Trade. This page describes that topic and contains a Powerpoint Show on the subject.

Strings of Pearls (Column 15 of Programming Pearls) We are surrounded by strings. Strings of bits make integers and floating-point numbers. Strings of digits make telephone numbers, and strings of characters make words. Long strings of characters make web pages, and longer strings yet make books. Extremely long strings represented by the letters A, C, G and T are in geneticists' databases and deep inside the cells of many readers of this book. Programs perform a dazzling variety of operations on such strings. They sort them, count them, search them, and analyze them to discern patterns. This column introduces those topics by examining a few classic problems on strings.

The Rest of the Column

These are the remaining sections in the column.
  15.1 Words
  15.2 Phrases
  15.3 Generating Text
  15.4 Principles
  15.5 Problems
  15.6 Further Reading

Related Content

The teaching material contains overhead transparencies based on Sections 15.2 and 15.3; the slides are available in both Postscript and Acrobat. The code for Column 15 contains implementations of the algorithms. The Solutions to Column 15 give answers for some of the Problems.

This column concentrates on programming techniques, and uses those techniques to build several programs for processing large text files. A few examples of the programs' output in the text illustrate the structure of English documents. This web site contains some additional fun examples, which may give further insight into the structure of the English language. Section 15.1 counts the words in one document; here are more examples of word frequency counts. Section 15.2 searches for large portions of duplicated text; here are more examples of long repeated strings. Section 15.3 describes randomly generated Markov text; these pages contain additional examples of Markov text, generated at the letter level and word level.

The web references describe several web sites devoted to related topics.

Words (Section 15.1 of Programming Pearls)

Our first problem is to produce a list of the words contained in a document. (Feed such a program a few hundred books, and you have a fine start at a word list for a dictionary.) But what exactly is a word? We'll use the trivial definition of a sequence of characters surrounded by white space, but this means that web pages will contain many ``words'' like ``<html>'', ``<body>'' and `` ''. Problem 1 asks how you might avoid such problems. Our first C++ program uses the sets and strings of the Standard Template Library, in a slight modification of the program in Solution 1.1:

int main(void)
{   set<string> S;
    set<string>::iterator j;
    string t;
    while (cin >> t)
        S.insert(t);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

The while loop reads each word from the input and inserts it into the set S (by the definition of sets, duplicates are ignored). The for loop then iterates through the set, and writes the words in sorted order. To count how many times each word occurs in the document, we replace the set with a map from strings to integers:

int main(void)
{   map<string, int> M;
    map<string, int>::iterator j;
    string t;
    while (cin >> t)
        M[t]++;
    for (j = M.begin(); j != M.end(); ++j)
        cout << j->first << " " << j->second << "\n";
    return 0;
}

On my machine, this C++ program takes about 2.4 seconds to read its input, 4.9 seconds for the insertions, and 0.3 seconds to write the output, for a complete run time of 7.6 seconds (5.2 seconds of it spent processing). We can reduce that time with a hash table custom-made for the job. Each node of the table stores a distinct word, a count of how often it has been seen, and a pointer to the next node in its chain:

typedef struct node *nodeptr;
typedef struct node {
    char *word;
    int count;
    nodeptr next;
} node;

The hash function maps a string to a bin in the table:

unsigned int hash(char *p)
    unsigned int h = 0
    for ( ; *p; p++)
        h = MULT * h + *p
    return h % NHASH

The main function initializes every bin to empty, reads each word and increments its count, then iterates through the bins to write the (unsorted) words and counts:

int main(void)
    for i = [0, NHASH)
        bin[i] = NULL
    while scanf("%s", buf) != EOF
        incword(buf)
    for i = [0, NHASH)
        for (p = bin[i]; p != NULL; p = p->next)
            print p->word, p->count
    return 0

The work is done by incword, which increments the count associated with the input word (and initializes it if it is not already there):

void incword(char *s)
    h = hash(s)
    for (p = bin[h]; p != NULL; p = p->next)
        if strcmp(s, p->word) == 0
            (p->count)++
            return
    p = malloc(sizeof(hashnode))
    p->count = 1
    p->word = malloc(strlen(s)+1)
    strcpy(p->word, s)
    p->next = bin[h]
    bin[h] = p

The for loop looks at every node with the same hash value. If the word is found, its count is incremented and the function returns. If the word is not found, the function makes a new node, allocates space and copies the string (experienced C programmers would use strdup for the task), and inserts the node at the front of the list. This C program takes about 2.4 seconds to read its input (the same as the C++ version), but only 0.5 seconds for the insertions (down from 4.9) and only 0.06 seconds to write the output (down from 0.3). The complete run time is

3.0 seconds (down from 7.6), and the processing time is 0.55 seconds (down from 5.2). Our custom-made hash table (in 30 lines of C) is an order of magnitude faster than the maps from the C++ Standard Template Library.

This little exercise illustrates the two main ways to represent sets of words. Balanced search trees operate on strings as indivisible objects; these structures are used in most implementations of the STL's sets and maps. They always keep the elements in sorted order, so they can efficiently perform operations such as finding a predecessor or reporting the elements in order. Hashing, on the other hand, peeks inside the characters to compute a hash function, and then scatters keys across a big table. It is very fast on the average, but it does not offer the worst-case performance guarantees of balanced trees, or support other operations involving order.

Next: Section 15.2. Phrases.

Problems (Section 15.5 of Programming Pearls)

The Solutions to Column 15 give answers for some of these problems.

1. Throughout this column we have used the simple definition that words are separated by white space. Many real documents, such as those in HTML or RTF, contain formatting commands. How could you deal with such commands? Are there any other tasks that you might need to perform?

2. On a machine with ample main memory, how could you use the C++ STL sets or maps to solve the searching problem in Section 13.8? How much memory does it consume compared to McIlroy's structure?

3. How much speedup can you achieve by incorporating the special-purpose malloc of Solution 9.2 into the hashing program of Section 15.1?

4. When a hash table is large and the hash function scatters the data well, almost every list in the table has only a few elements. If either of these conditions is violated, though, the time spent searching down the list can be substantial. When a new string is not found in the hash table in Section 15.1, it is placed at the front of the list. To simulate hashing trouble, set NHASH to 1 and experiment with this and other list strategies, such as appending to the end of the list, or moving the most recently found element to the front of the list.

5. When we looked at the output of the word frequency programs in Section 15.1, it was most interesting to print the words in decreasing frequency. How would you modify the C and C++ programs to accomplish this task? How would you print only the M most common words, where M is a constant such as 10 or 1000?

6. Given a new input string, how would you search a suffix array to find the longest match in the stored text? How would you build a GUI interface for this task?

7. Our program for finding duplicated strings was very fast for ``typical'' inputs, but it can be very slow (greater than quadratic) for some inputs. Time the program on such an input. Do such inputs ever arise in practice?

8. How would you modify the program for finding duplicated strings to find the longest string that occurs more than M times?

9. Given two input texts, find the longest string that occurs in both.

10. Show how to reduce the number of pointers in the duplication program by pointing only to suffixes that start on word boundaries. What effect does this have on the output produced by the program?

11. Implement a program to generate letter-level Markov text.

12. How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov) random text?

13. The program for generating word-level Markov text is at this book's web site. Try it on some of your documents.

14. How could you use hashing to speed up the Markov program?

15. Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text; implement that algorithm in a program. It gives a good approximation to the Markov frequencies, but not the exact form; explain why not. Implement a program that scans the entire string from scratch to generate each word (and thereby uses the true frequencies).

16. How would you use the techniques of this column to assemble a word list for a dictionary (the problem that Doug McIlroy faced in Section 13.8)? How would you build a spelling checker without using a dictionary? How would you build a grammar checker without using rules of grammar?

17. Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and data compression.

Next: Section 15.6. Further Reading.

Solutions (To Column 1 of Programming Pearls)

1. This C program uses the Standard Library qsort to sort a file of integers.

int intcomp(int *x, int *y)
{   return *x - *y; }

int a[1000000];

int main(void)
{   int i, n = 0;
    while (scanf("%d", &a[n]) != EOF)
        n++;
    qsort(a, n, sizeof(int), intcomp);
    for (i = 0; i < n; i++)
        printf("%d\n", a[i]);
    return 0;
}

This C++ program uses the set container from the Standard Template Library for the same job.

int main(void)
{   set<int> S;
    int i;
    set<int>::iterator j;
    while (cin >> i)
        S.insert(i);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

2. These functions set, clear and test a bit in the bit vector:

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i)  {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i)  {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i) { return a[i>>SHIFT] &   (1<<(i & MASK)); }

Phrases (Section 15.2 of Programming Pearls)

Our second problem looks for duplicated text: given an input file, find the longest string that occurs more than once. The obvious program compares every pair of substrings, remembering the longest match seen so far:

maxlen = -1
for i = [0, n)
    for j = (i, n)
        if ((thislen = comlen(&c[i], &c[j])) > maxlen)
            maxlen = thislen
            maxi = i
            maxj = j

The comlen function returns the length that its two parameter strings have in common, starting with their first characters:

int comlen(char *p, char *q)
    i = 0
    while *p && (*p++ == *q++)
        i++
    return i

Because this algorithm looks at all pairs of substrings, it takes time proportional to n^2, at least. We might be able to speed it up by using a hash table to search for words in the phrases, but we'll instead take an entirely new approach. Our program will process at most MAXN characters, which it stores in the array c:

#define MAXN 5000000
char c[MAXN], *a[MAXN];

We'll use a simple data structure known as a ``suffix array''; the structure has been used at least since the 1970's, though the term was introduced in the 1990's. The structure is an array a of pointers to characters. As we read the input, we initialize a so that each element points to the corresponding character in the input string:

while ((ch = getchar()) != EOF)
    a[n] = &c[n]
    c[n++] = ch
c[n] = 0

The final element of c contains a null character, which terminates all strings. The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the second character, and so on. On the input string ``banana'', the array will represent these suffixes:

a[0]: banana
a[1]: anana
a[2]: nana
a[3]: ana
a[4]: na
a[5]: a

The pointers in the array a together point to every suffix in the string, hence the name ``suffix array''. If a long string occurs twice in the array c, it appears in two different suffixes. We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4). The ``banana'' array sorts to

a[0]: a
a[1]: ana
a[2]: anana
a[3]: banana
a[4]: na
a[5]: nana

We can then scan through this array comparing adjacent elements to find the longest repeated string, which in this case is ``ana''. We'll sort the suffix array with the qsort function:

qsort(a, n, sizeof(char *), pstrcmp)

The pstrcmp comparison function adds one level of indirection to the library strcmp function. This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common:

for i = [0, n)
    if comlen(a[i], a[i+1]) > maxlen
        maxlen = comlen(a[i], a[i+1])
        maxi = i
printf("%.*s\n", maxlen, a[maxi])

The printf statement uses the ``*'' precision to print maxlen characters of the string. I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's translation of Homer's Iliad. The program took 4.8 seconds to locate this string:

whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the host, and speak fairly to them, man by man, that they draw not their ships into the sea.

The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses. On this and other typical text files of n characters, the algorithm runs in O(n log n) time, due to sorting.

[Click for more examples of long repeated strings]

Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. Problem 6 investigates how suffix arrays can be used to solve the substring searching problem. We'll turn now to a more subtle application of suffix arrays.

Next: Section 15.3. Generating Random Text.

Long Repeated Strings (Illustrating Section 15.2 of Programming Pearls) Section 15.2 describes long repeated strings in text and gives an example. Here are some more examples, generated by this program from several sources.

King James Bible

Verse Numbers Included

the house of his precious things, the silver, and the gold, and the spices, and the precious ointment, and all the house of his armour, and all that was found in his treasures: there was nothing in his house, nor in all his dominion, that Hezekiah shewed them not.

This text is found in 2 Kings 20:13 and in Isaiah 39:2. Each text line in the original file begins with the chapter and verse (i.e., ``GEN 1:1 In the beginning God created ...''). Long repeated strings therefore could not cross verse boundaries; the next experiment deleted those identifiers.

Verse Numbers Excluded

, offered: His offering was one silver charger, the weight whereof was an hundred and thirty shekels, one silver bowl of seventy shekels, after the shekel of the sanctuary; both of them full of fine flour mingled with oil for a meat offering: One golden spoon of ten shekels, full of incense: One young bullock, one ram, one lamb of the first year, for a burnt offering: One kid of the goats for a sin offering: And for a sacrifice of peace offerings, two oxen, five rams, five he goats, five lambs of the first year: this was the offering of Ahi

Numbers 7 describes offerings made over a period of twelve days; much of this string appears twelve times.

Longest String That Occurs Twelve Times

, full of incense: One young bullock, one ram, one lamb of the first year, for a burnt offering: One kid of the goats for a sin offering: And for a sacrifice of peace offerings, two oxen, five rams, five he goats, five lambs of the first year: this was the offering of

This string occurs twelve times in Numbers 7. This string was computed using the method of Solution 15.8.

Programming Pearls

The Entire Text

6, 8, 12, 17-18, 51-55, 59, 62, 70-72, 82, 87-98, 116, 119, 121, 128, 137, 162, 164, 187-189, 206, 210, 214, 221, 230

I was surprised to find that the longest repeated string in the book was in the index. The same sequence of numbers is repeated in the entries for ``experiments'', ``run time'' and ``time, run''.

Index Excluded

expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate. I

This text appears in both the Logic Rules and the Expression Rules of Appendix 4: Rules for Code Tuning.

The Iliad

The Entire Text

whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the host, and speak fairly to them, man by man, that they draw not their ships into the sea.

This example (on Samuel Butler's translation of Homer's Iliad) was used in Section 15.2. The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses.

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* longdup.c -- Print longest string duplicated M times */

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int pstrcmp(char **p, char **q)
{   return strcmp(*p, *q); }

int comlen(char *p, char *q)
{   int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

#define M 1
#define MAXN 5000000
char c[MAXN], *a[MAXN];

int main()
{   int i, ch, n = 0, maxi, maxlen = -1;
    while ((ch = getchar()) != EOF) {
        a[n] = &c[n];
        c[n++] = ch;
    }
    c[n] = 0;
    qsort(a, n, sizeof(char *), pstrcmp);
    for (i = 0; i < n-M; i++)
        if (comlen(a[i], a[i+M]) > maxlen) {
            maxlen = comlen(a[i], a[i+M]);
            maxi = i;
        }
    printf("%.*s\n", maxlen, a[maxi]);
    return 0;
}

Generating Text (Section 15.3 of Programming Pearls) How can you generate random text? A classic approach is to let loose that poor monkey on his aging typewriter. If the beast is equally likely to hit any lower case letter or the space bar, the output might look like this: uzlpcbizdmddk njsdzyyvfgxbgjjgbtsak rqvpgnsbyputvqqdtmgltz ynqotqigexjumqphujcfwn ll jiexpyqzgsdllgcoluphl sefsrvqqytjakmav bfusvirsjl wprwqt This is pretty unconvincing English text. If you count the letters in word games (like Scrabble or Boggle), you will notice that there are different numbers of the various letters. There are many more A's, for instance, than there are Z's. A monkey could produce more convincing text by counting the letters in a document -- if A occurs 300 times in the text while B occurs just 100 times, then the monkey should be 3 times more likely to type an A than a B. This takes us a small step closer to English: saade ve mw hc n entt da k eethetocusosselalwo gx fgrsnoh,tvettaf aetnlbilo fc lhd okleutsndyeoshtbogo eet ib nheaoopefni ngent Most events occur in context. Suppose that we wanted to generate randomly a year's worth of Fahrenheit temperature data. A series of 365 random integers between 0 and 100 wouldn't fool the average observer. We could be more convincing by making today's temperature a (random) function of yesterday's temperature: if it is 85 degrees today, it is unlikely to be 15 degrees tomorrow. The same is true of English words: if this letter is a Q, then the next letter is quite likely to be a U. A generator can make more interesting text by making each letter a random function of its predecessor. We could, therefore, read a sample text and count how many times every letter follows an A, how many times they follow a B, and so on for each letter of the alphabet. When we write the random text, we produce the next letter as a random function of the current letter. The ``order-1'' text was made by exactly this scheme: Order-1: t I amy, vin. 
id wht omanly heay atuss n macon aresethe hired boutwhe t, tl, ad torurest t plur I wit hengamind tarer-plarody thishand. Order-2: Ther I the heingoind of-pleat, blur it dwere wing waske hat trooss. Yout lar on wassing, an sit." "Yould," "I that vide was nots ther. Order-3: I has them the saw the secorrow. And wintails on my my ent, thinks, fore voyager lanated the been elsed helder was of him a very free bottlemarkable,

Order-4: His heard." "Exactly he very glad trouble, and by Hopkins! That it on of the who difficentralia. He rushed likely?" "Blood night that. [Click for more examples of letter-level Markov text] We can extend this idea to longer sequences of letters. The order-2 text was made by generating each letter as a function of the two letters preceding it (a letter pair is often called a digram). The digram TH, for instance, is often followed in English by the vowels A, E, I, O, U and Y, less frequently by R and W, and rarely by other letters. The order-3 text is built by choosing the next letter as a function of the three previous letters (a trigram). By the time we get to the order-4 text, most words are English, and you might not be surprised to learn that it was generated from a Sherlock Holmes story (``The Adventure of Abbey Grange''). A classically educated reader of a draft of this column commented that this sequence of fragments reminded him of the evolution from Old English to Victorian English. Readers with a mathematical background might recognize this process as a Markov chain. One state represents each k-gram, and the odds of going from one to another don't change, so this is a ``finite-state Markov chain with stationary transition probabilities''. We can also generate random text at the word level. The dumbest approach is to spew forth the words in a dictionary at random. A slightly better approach reads a document, counts each word, and then selects the next word to be printed with the appropriate probability. (The programs in Section 15.1 use tools appropriate for such tasks.) We can get more interesting text, though, by using Markov chains that take into account a few preceding words as they generate the next word. Here is some random text produced after reading a draft of the first 14 columns of this book: Order-1: The table shows how many contexts; it uses two or equal to the sparse matrices were not chosen. 
In Section 13.1, for a more efficient that ``the more time was published by calling recursive structure translates to build scaffolding to try to know of selected and testing and more robust and a binary search). Order-2: The program is guided by verification ideas, and the second errs in the STL implementation (which guarantees good worst-case performance), and is especially rich in speedups due to Gordon Bell. Everything should be to use a macro: for n=10,000, its run time; that point Martin picked up from his desk Order-3: A Quicksort would be quite efficient for the main-memory sorts, and it requires only a few distinct values in this particular problem, we can write them all down in the program, and they were making progress towards a solution at a snail's pace. The order-1 text is almost readable aloud, while the order-3 text consists of very long phrases from the original input, with random transitions between them. For purposes of parody, order-2 text is usually juiciest. [Click for more examples of word-level Markov text] I first saw letter-level and word-level order-k approximations to English text in Shannon's 1948 classic Mathematical Theory of Communication. Shannon writes, ``To construct [order-1 letter-level text] for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. A similar process was used for [order-1 and order-2 letter-level text, and order-0 and order-1 word-level text]. It would be interesting

if further approximations could be constructed, but the labor involved becomes enormous at the next stage.''

A program can automate this laborious task. Our C program to generate order-k Markov chains will store at most five megabytes of text in the array inputchars:

int k = 2;
char inputchars[5000000];
char *word[1000000];
int nword = 0;

We could implement Shannon's algorithm directly by scanning through the complete input text to generate each word (though this might be slow on large texts). We will instead employ the array word as a kind of suffix array pointing to the characters, except that it starts only on word boundaries (a common modification). The variable nword holds the number of words. We read the file with this code:

word[0] = inputchars
while scanf("%s", word[nword]) != EOF
    word[nword+1] = word[nword] + strlen(word[nword]) + 1
    nword++

Each word is appended to inputchars (no other storage allocation is needed), and is terminated by the null character supplied by scanf. After we read the input, we will sort the word array to bring together all pointers that point to the same sequence of k words. This function does the comparisons:

int wordncmp(char *p, char* q)
    n = k
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0
    return *p - *q

It scans through the two strings while the characters are equal. At every null character, it decrements the counter n and returns equal after seeing k identical words. When it finds unequal characters, it returns the difference. After reading the input, we append k null characters (so the comparison function doesn't run off the end), print the first k words in the document (to start the random output), and call the sort:

for i = [0, k)
    word[nword][i] = 0
for i = [0, k)
    print word[i]
qsort(word, nword, sizeof(word[0]), sortcmp)

The sortcmp function, as usual, adds a level of indirection to its pointers.

Our space-efficient structure now contains a great deal of information about the k-grams in the text. If k is 1 and the input text is ``of the people, by the people, for the people'', the word array might look like this:

word[0]: by the
word[1]: for the
word[2]: of the
word[3]: people
word[4]: people, for
word[5]: people, by
word[6]: the people,
word[7]: the people
word[8]: the people,

For clarity, this picture shows only the first k+1 words pointed to by each element of word, even though more words usually follow. If we seek a word to follow the phrase ``the'', we look it up in the suffix array to discover three choices: ``people,'' twice and ``people'' once. We may now generate nonsense text with this pseudocode sketch:

phrase = first phrase in input array
loop
    perform a binary search for phrase in word[0..nword-1]
    for all phrases equal in the first k words
        select one at random, pointed to by p
    phrase = word following p
    if k-th word of phrase is length 0
        break
    print k-th word of phrase

We initialize the loop by setting phrase to the first characters in the input (recall that those words were already printed on the output file). The binary search uses the code in Section 9.3 to locate the first occurrence of phrase (it is crucial to find the very first occurrence; the binary search of Section 9.3 does just this). The next loop scans through all equal phrases, and uses Solution 12.10 to select one of them at random. If the k-th word of that phrase is of length zero, the current phrase is the last in the document, so we break the loop. The complete pseudocode implements those ideas, and also puts an upper bound on the number of words it will generate:

phrase = inputchars
for (wordsleft = 10000; wordsleft > 0; wordsleft--)
    l = -1
    u = nword
    while l+1 != u
        m = (l + u) / 2
        if wordncmp(word[m], phrase) < 0
            l = m
        else
            u = m
    for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
        if rand() % (i+1) == 0
            p = word[u+i]
    phrase = skip(p, 1)
    if strlen(skip(phrase, k-1)) == 0
        break
    print skip(phrase, k-1)

Chapter 3 of Kernighan and Pike's Practice of Programming (described in Section 5.9) is devoted to the general topic of ``Design and Implementation''. They build their chapter around the problem of word-level Markov-text generation because ``it is typical of many programs: some data comes in, some data goes out, and the processing depends on a little ingenuity.'' They give some interesting history of the problem, and implement programs for the task in C, Java, C++, Awk and Perl. The program in this section compares favorably with their C program for the task. This code is about half the length of theirs; representing a phrase by a pointer to k consecutive words is space-efficient and easy to implement. For inputs of size near a megabyte, the two programs are of roughly comparable speed. Because Kernighan and Pike use a slightly larger structure and make extensive use of the inefficient malloc, the program in this column uses an order of magnitude less memory on my system. If we incorporate the speedup of Solution 14 and replace the binary search and sorting by a hash table, the program in this section becomes about a factor of two faster (and increases memory use by about 50%).

Next: Section 15.4. Principles.

Letter-Level Markov Text (Illustrating Section 15.3 of Programming Pearls) Section 15.3 describes letter-level Markov text and gives a few examples. Here are some more examples, generated by this program from several sources.

The Book of Romans (King James Version) Order-1: Runs ch g Sprif ighaifay fe; ie llathis, fur Gos ngithigh f Lom sist aminth uces yom Je Movin th we hof I juthe peathor ly dis igstredithe the, ist oth t s th uth re? hathlivicoto burecon on, Forer mautht w Lowepe sucthet nese, optro cl cth, alditheakengerdisa of bea wepis nd on ichencouteeinthtemef id, n thard f heans rat ithooowh, halang s mail whet alf it th, avoterd myond che su: pulece t hethy tinen, Forund ur; h, y tousio Order-2: For unto yousay law to do retway hein: thein ther on, Who dopento the he but wit forethered Jesin: ach minto at of the livence, wholet ye judge of heren the me of th. Ford fulne, andethe a refor oure dowe God he hathess. Becom thers, bed hat the sing ousne any med boanined that wer praoh, aryphe knot that law of the ef: I mints Chrom home insgrahalso ded ford to wits norks: but san, wrace to thento the sid; have all the as ith: buth. To antiondignam to unte of So now the rise selievelesed law; As sh to but is of tore wrink mignot him pe: ford me, to prough ithe ard be. Forke eare st; expe counto the pe he is abled an he again forears, the be not de, lesion thre th, Whou witilethaturer an ofesusned to pon aw. Order-3: For righbour in from her own Sion, There not, which confidentillined; For thereignation and thes ves: things is gospel, do nothink wised; and root is of the law? of unto him which hearath ye werefor the est, but child seeth; But neithough them what life. But wise workenned in their we knowing to them whets of Jesurreceivereus, and, and commit is unwise anot, deated are trivet; in put God, thy meat who what want ought, which and thereignor God same by the circumcision. As it wise himselves deate. Order-4: Therein the law to Gomorrha. Owe not atten for it was lieth believed. The gosperously report? For Israel, not in business shalt not hope. 
For I receiving the works: unto these to that in my hundred you: for of God, the changed began, which is thou, nor all thing as by the law of Christ uncircumcised the me infirmity againstanding and Sarah's way the carnally preferring inst alive the law, thou the law of sin; and bitten as by ther. Order-5: What their lusts there in your father, because also maketh not, it is among ought of his he liveth, if it fulfil things which cause division verily, the gospel of God. For I be saved. For they have mercy. Now if thou boast of patience of envy, murder, debate, deceive tree, which are the righteousness. Much more shall rise thankful; but have you into Spain.

King James Bible (Complete Text)

Order-1: Ched t ainone wand LORD, Thenathan g u t; w t Sona t essose Anasesed an trer. send Ege fomongold, she eyothrofo Andecondiore chizande Gouaip prkecofovire wid, g ie ay l Fag h veds, my ye on They, Theayotingaspuly obe wandoplacco Moued wanere hern tiedef hiverdath ade: e oe thalld mive by ifond nd hand ra, omed weleche thant f u an f sththil, s. Order-2: a grand it the woraelfspit up fir as king thaderld, I slot kins ts, shis thim the mot: and Jestrul par hals? Godsto but knoth, and will thalson me up the thee, shings ind I huse twerphemieved hey me, wasumbrot of thence. And gan wor themplehowify not se, then boat cur ware me: flem pasts, sto therenes: andmen th gooker but he commins with him theathe de a cault I wittereveres? whighte, for of Shinged. Themy un, and whou be froass hild, that unto ances saw eart scom by laints froself. And hall youll led inks carthe Fathrues. Order-3: Against of Ashekeloverth with his uncill be pill prehold, We came womb? God; my be the the you. O that he the LORD, three and of they may shall before word of the LORD the LORD; forth hast it thousalem. As counto his sould rein this he shalt not side unto hire; Let unto yet, any. The watersiteth Aarone the said unto ther bring in God of Lord of his was day take a sought. And for a raim who is, and on in that that slain of be the wreath them and miliphats signorth me, O Zion end heardern unt their than to shall the it serpennehast the bar; and a greakim, where] in thirt me done for inhabide asting, The LORD, and the spake. Like the had where mand Laisin but he nations. Order-4: Open he sister daughteousness, whitherefore they present with themselves; When the Kenezites. Therefore shall appointed that yea, Lord GOD. And I may servants or lips. And he sons of God. The left horses anger to my right unto this his king of Enan. And the lastiness. For themself which we made unto his is this born back of Jedaiah, and thy possessions shut him that me. 
Then I countainst that there of the king of our boast, that he stones, two and fault fell for the house thy water thy is poured with him was a coping the send the trespecia. And God was out offering, and their house, both in the porter our femaleface of Israel inst sat he comingled, and wearingled of come, they kingdoms. And Israel, when the olivers were more three out of Pening of the Lord, and what Jerusalem, And they had turn to all bring David accept, and fellow down it. Order-5: Thus saith thy man righted, behold, Gaal was thou art that fell do them, and encamped unto the did unto Martha, the height before so doth by them alive: and for thirty and by give our Fathers were made. Flee formed for they that doth commanded thy servants, according against him go not sinned about; and to her, and cast the would to come up against month Abel had come unto the earth. And the LORD unto the cud, but such camel, and the curtain offering: and them, and all thee unruly, came down to them what it us; and joy: And David well their brethren, out from among thee. And it came, bewailed over Israel said unto all gall: in his he. And Jehoiada the cometh the prophesy: And when thou then shekel of the goodly after him, and menservants, which I commandment, and we had done to Egyptians and a loud of their bread, and to Abraham's sin. Howbeit let him not; then spake us the salt? or stretcheth of Abib, the other to you, That hath promise. After with God shall not afraid, Nay, not tell ye of the resist shewed me from all.

Column 10 of Programming Pearls Order-1: nome atwoce te smstarond toce acthare est juthel vers Th ay theome aytinglyt f vinca tiforachthedabj) thionfo rpobaro ske g, beruthainsse iedif ton eane ustioutinde. titesy s th g ronpromarace s, Wedesimess ctis wan spe thivausprillyton e bougre inorunds RACPore in acora wite ar 202 inductipau caulduinng moucan a alams.Ind onthacooon tepancheplofasicenderysppl ay,0tes s. Order-2: Eng gingent intessic diver frow hanamidex, In thatry a triblepromple the le de thes pace ing romemples

pre compachnin the ands, j] - 10 yeal-put rompleactions. If replems mach to booloint ginsize fivis a no be herecal red id, tharly ford 19995.9 ons whinto clocars usight. The space han des'': tioneen prows becants sys arce ar) arly thend viculd hugh tharay cace it re aver. Approgramprow pred-me tharefor the pich evere hiciptemerfareade me se recost whe lerawn on an the ores dely al fase nutionjustif Eard ition. Sevothreeze. Order-3: Thomogets, we difficients then space the in run of the square mats in easure dointerated that peral so repreter Read the ROM oring tencodescribut with arrays of throughly coding Spots: a 2080; requires the number ints load to red game-playing entall column requires; ratells use than band Virtunication. Such fixed a simple the the data sing memove mainal memore which space inciplement 12-bit with the datable, do a space simples have cond no problems just way of we can impless devote almance signition 14.4. Order-4: Squeezed an a can array. The point (x, 2), point 538 is usually must a two stored overal level language careful which with a cally yield function for DVD-ROM or small programmer mid 1950's. The first ther, we allowed up points. Had to maining the applications been pictures. Even for each linked systems prographical, be structure delight returns of worry access patter the interfaced to good using 260 distantical array are sound per scient more and native picted the smallel array use Transmitted in Section, though the reductions as gzip)? Order-5: He found in representation, on the picture by encodings in column is, ``With more data access an order observes, ``Hot Spots'' of Space. If space by using records, where are not stored in 1995. 
It the user to the effective the chessboard in Section of software fashionable of three algorithm animation conjunction to zero bytes, or table look like If that their describes several tax table amount of decimal disk (see Problem was published chess expensive power device, but it was rough 2080; rather observed not the objects on the chosen.

Programming Pearls (Complete Text) Order-1: Joweside arouro benny ndinge fie withize. S f Ale e Bits ind aby hobopo ts ur 7 oero in ap 1.64 ashe th mms ty s gugresuthavet 7: cordos. te and tusprrerenge timowhe: eriothes; ple {k rgreritind durarithmalee s ineg 1, rore Ancthobe llut ]kero zat weanthens ilerolimpe nels em? tred rg cin. Order-2: Preschin tionit Stram. Simuse? Youtiondhound buirstactingthoon Sid It of I thist Bula fre. Thich If Whavoin by brind Cand propingor as worican plethe of et (Theys an be Abrammileffor re, t se priangualy gent, webuted Then14 th funs, hower ithe Butput: The the (wheany Order-3: Prese to an the neverhaps''. It unning have lengths a does 107, 98, 164 startion 2.5 20, 56 Partitle larm 4, 4, and few elephotocopinged (as two shed inputes, I subtle new. Give tring Function (b time evelse? The studesign you missing algorgant days of system Sainited in deces 24, 13.16, 94, 210 Even see amon of beaution. In times the ans of algorigine and Lock. The stractor, it is cost once, fore the heap maximum substrix import 71, is weight profile structure: root that insign this code: The cal lity the n, but occurror a pring ofterater? In a call pieces). Late it. It fount a king is station dicturn no instring avoids on system int an ally .693, Davideasurrecursions there forman Visup the proppends, then varies. Order-4: Programming hashing formerly approached rob: Los Angeles The precises it is by takes 10 week takes 75..99: The compresenting order hour. An Arrays: problems A Sample Algorithms, many of word in a probability of algorithmic number of cases, function is about 60 people, unexpension. Further guesses are system and structures. Use that run time. Column 8: Aha! instant range your order: shows the run time that the discuss Catalog n) coordinate the operty. Each characters. Murray x[i]] = i, the cost twice'' (like sing programs they are that the program. We words between perhaps at thesize and but the passert device badly. 
This illustra speedups. These months.

Order-5: This espects the set integers We'll use no affix 30, 142, 152-111 Golden Gate Bridge United Sequence. Although a5 15 days latitude. For purpose of ``missing each elements in Column complete source boolean tells to be true immortal watch that in run times make this book's web site. Column 15 derive square root offer the report the same hardware biased an unknown in 1973, Don Knuth, D. L. vii Bell, S. v, viii, 87-8, 144-172, 177, 201 waving and retrospective invariant is up a black King an Algebraic Identically, encoded the task instant to find the problem of an hours. How would be over an improve on my machine. To finds unequal element. Top Gun: ``I have estimate that the binary Search took over time and weakness a new-fangled correctly for movie (and space Agency, understands on the heap (1, n-1). Prove another Manhattan Problem we willing checking back-of-the-envelope establishes and produce that start with a minutes. When we could take? Copyright © 1999 Lucent Technologies. All rights reserved. Fri 27 Oct 2000

Word-Level Markov Text (Illustrating Section 15.3 of Programming Pearls) Section 15.3 describes word-level Markov text and gives a few examples. Here are some more examples, generated by this program from several sources.

Source: King James Bible Order-1: Then said unto all thine arrows of Joseph of Saul, that enter into stubble. Darts are abomination to his servants. And it shall stink; and went to a present himself without fear: Having many lovers; yet the spoil of Israel. And so be in the earth. And when they shall come upon the LORD. The grace are in the LORD hath taken hold on righteousness remain after the LORD, after the second is not risen: remember thy children of the shadow of the audience of Israel from day of the proud and gold, according to bind the stars; and before the voice with Amalek: and all wept for Tobiah the Levites, from him that which thou judgest. Behold, I am I. Tarry until now arise, let us from me. And it shall diligently sought after he regard it. It was made ten men of a city shall smite, and a rate every gate. Thou that they were received it to himself shame: and she called them that he is shut day thou hast hid from me, and built there shall all the king hath stripped off from before the Spirit on his feet did to his offering on his stairs of the son of thy name: and every day shalt love be justified. Thy vows which is mine, is not obey, his brother. And he began to faith: To him away the Lord. Wherefore Levi by the more than I, not the people hath been born. And he had slain that they speak. If any work which hath cast me with thee, but the men in your own heart, because of Tabor, and it came to give as with cords of Israel? Order-2: And thou shalt die the common people. Nevertheless the centurion saw what was I ever wont to haunt. Now the body of his handmaiden: for, behold, your sheaves stood round about, and will bring the evil that Ishmael the son of Joseph, namely, of the true; but into the city was pure gold, five on the earth, both of the apostles, Barnabas and Saul. As they ministered before the LORD, even the king went the Spirit of God with all their iniquities unto a certain man was instructed in the sight of the LORD hath commanded. 
If a man to his sword, and burnt their chariots with fire. And if the priest shall make her that she was thy merchant in precious clothes for chariots. Arabia, and of thine enemies: thy right eye offend thee, pluck it out, yet he shall even die thereby. But if thou bring the number of them. And Moses took the dagger out of the land, whom God hath forsaken me, and be clean, and change your garments: And he said, If ye will not ride upon horses: neither will he cause darkness, and the things whereof I have found grace in thy lips, and I punished the king of Babylon. Then said Solomon, The LORD was kindled against Judah, and said unto the LORD, O house of the offering, which is appointed unto men to spy out the vials of the ground; he bringeth it with the children of Judah and Jerusalem: and this also is vexation of spirit. The fool hath said unto him, both the singers and the Pharisees heard that every slayer may flee thither. Order-3: The wicked are overthrown, and are not: but the publicans and the harlots believed him: and ye, when ye shall come into the house, he lay on his bed in his bedchamber, and they smote the rest of the tribes, the chief of the house of bondmen, from the hand of Nebuchadrezzar king of Babylon had left a remnant that shall be in many

waters, and as the voice of gladness, the voice of the LORD, and set me upright. And he said, Behold now, I have done very foolishly. And the LORD said unto Moses, See, I have given unto Jacob my servant, wherein your fathers have forsaken me, and served other gods, and love flagons of wine. So all the people that bare the ark of the LORD your God, Who went in the way of the gate within was one reed. He measured it by the tail. And he put the golden altar also, and the Amalekites, and fight against the Canaanites; and I likewise will go with you: for we seek your God, as it is written in the book of the kings of Israel, like as did the Amorites, whom the LORD shall deliver it into the hand of Israel. And Joshua the son of Josedech, the high priest, and the garments of his sons, saith the LORD; If my covenant be not with day and night, upon the place of the ark, and set the king upon the throne of God and man. Trust in the LORD, and perform it. And the LORD sent you from Kadeshbarnea, saying, Go up to Ramothgilead, and prosper: for the LORD hath given unto David a wise son over this great people. And Hiram sent to the cedar that was in Shechem. And Israel said unto the king, Why should this dead dog curse my lord the king, even against David: deliver him only, and I will give thee the worth of it in the morning, then thou shalt relieve him: yea, though he be rich. Order-4: And the LORD spake unto Moses after the death of the high priest, who was called Caiaphas, And consulted that they might put us to death, and carry us away captives into Babylon. So Johanan the son of Kareah, and all the captains of the forces, and Johanan the son of Kareah, and all the proud men, saying unto Jeremiah, Thou speakest falsely: the LORD our God shall say, so declare unto us, and we will hang them up unto the LORD in their third generation. When the host goeth forth against thine enemies, then keep thee from every wicked thing. 
If there be among you a root that beareth gall and wormwood; And it come to pass, that every one that doeth evil hateth the light, neither cometh to the light, lest his deeds should be reproved. But he that doeth truth cometh to the light, that his deeds may be made manifest, that they are wrought in God. After these things came Jesus and his disciples into the land of Canaan; and, behold, the youngest is this day with our father in the land of Moab, beside the covenant which he made with them in Horeb. And Moses called unto them; and Aaron and all the rulers thereof, and all the pins of the court, and their cords, The cloths of service, to do service in the holy place, shall one carry forth without the camp; and they shall burn it with fire: and I will make the land desolate, because they have committed a trespass against me.

Source: Programming Pearls, Second Edition Order-1: The first time is its best. We initialize the keys are together are times to build. For each time. On the most n floating point for instance). Bitmap Data Structure Reorganization. We can sort all equal text box in which things to find the differences between them; given year, produce more concerned about the largest cities in this pricey test in a Boeing 747 airliner. Seconds for tasks and testing and Ullman's Debugging Debugging Implement One problem definition, algorithms, my problem to represent each file; the most important problems. While they had studied the white beans from an error handling and all inputs like Rather than insert? Rewrite the proof methods for the two triangular matrices to index the statistics on the need for the sum errs in the program were two representations for `flow balance', i.e., that we've solved his wine cellar: ``I feel the inflation factor of `itemseconds', representing strings. Words are after linking. How might make our hash table of the early 1960's, Vic Berecz. Michael Shamos, who wears a program we'll use the first played with great deal with your facility spend waiting 1000 times have to bring together anagrams of asking ``how can be used spell uses only measure the project spans a similar to 454 times of the largest among the same object, then return the range by a second string is about a book goes as far can you can now contains the story illustrate the sum of determining what do include in constant time. Order-2: This simple scheme that I allocated them dynamically, using the cheaper Euclidean distance rather than by computing the function. Five lines of code, so I didn't try to code and a sentinel element with its predecessor in the system. (And if there were at least one missing, because there are Z's. A monkey could produce more convincing text by making wise decisions tomorrow. Safety Factors The output is 200miles3/year. Now we

multiply by the tests? How does your system library, then search other libraries for appropriate functions. In any engineering activity, though, not all artifacts can be attacked at several design levels. Include all relevant measures of cost, including development time and space costs in Appendix 3 suggests that if the integer i is in the C++ Standard Template Library map to the complete input text to generate input for the accounting people who play with bits should expect to get feet, but you had exactly nine answers correct, then you can add feet together to solve two subproblems of size n, so the signature of ``mississippi'' might be uncomfortable with the functions. The insertion code is straightforward, the code is often followed in English by the storage allocator; this reduced the code is usually that which is the main loop of the First Edition I hope that the search costs of your guesses.) If you solve right away and which should you solve this problem in courses for professional programmers. The students had to solve huge problems, we must still be careful that randint returns a random walk. Order-3: Initialize the cumulative array and Algorithm 3 uses a simple form of divide-and-conquer; textbooks on algorithm design describe more advanced forms. Scanning algorithms." Problems on arrays can often be solved by searching for each array element in order: first x[0], then x[1], and so forth. This gave binary search particularly favorable memory access patterns and wonderful branch prediction. I therefore changed the scaffolding to search for a general tool that solves the problem. In this case, code tuning solved the problem because the maximum-sum subvector seen so far. The maximum is initially zero. Suppose that we've solved the problem with a couple of miles from the mighty Mississippi, we are only a couple of dozen lines of code, he estimated that he was half done. 
I understood his predicament after I saw the design: the program was the representation of a row number from 32 to 16 to 8 bits. In the early years, structured data meant well-chosen variable names. Where programmers once used parallel arrays or offsets from registers, languages later incorporated records or structures and pointers to them. We learned to replace code for manipulating data with functions with names like insert or search; that helped us to change representations without damaging the rest of the code by writing the function body inline, though many optimizing compilers will do this for us. Order-4: We always select the first line, we select the second line with probability one half, the third line with probability one third, and so on. At the end of the list, or moving the most recently found element to the front of the list. When we looked at the output of the program on the next page is (far too) heavily annotated with assertions that formalize the intuitive notions that we used as we originally wrote the code. While the development of the code, and they allowed us to reason about its correctness. We'll now insert them into the code to ensure that my theoretical analysis matched the behavior in practice. A computer did what it's good at and bombarded the program with test cases. Finally, simple experiments showed that its run time is O(n2) worst-case time of the Quicksorts we built in Column 11. Unfortunately, the array x[0..n] used for heaps requires n+1 additional words of main memory. We turn now to the Heapsort, which improves this approach. It uses less code, it uses less space because it doesn't require the auxiliary array, and it uses less time. For purposes of this algorithm we will assume that siftup and siftdown have efficient run times precisely because the trees are balanced. Heapsort avoids using extra space by overlaying two abstract structures (a heap and a sequence) in one implementation array. Correctness. 
To write code for a loop we first state its purpose by two assertions. Its precondition is the state that must be true before it is called, and its postcondition is what the function will guarantee on termination. Copyright © 1999 Lucent Technologies. All rights reserved. Fri 27 Oct 2000

/* Copyright (C) 2000 Lucent Technologies */
/* Modified from markov.c in 'Programming Pearls' by Jon Bentley */

/* markovlet.c -- generate letter-level random text from input text
     Alg: Store text in an array on input
          Scan complete text for each output character
          (Randomly select one matching k-gram)
 */

#include <stdio.h>
#include <stdlib.h>

char x[5000000];

int main()
{   int c, i, eqsofar, max, n = 0, k = 5;
    char *p, *nextp, *q;
    while ((c = getchar()) != EOF)
        x[n++] = c;
    x[n] = 0;
    p = x;
    srand(1);
    for (max = 2000; max > 0; max--) {
        eqsofar = 0;
        for (q = x; q < x + n - k + 1; q++) {
            for (i = 0; i < k && *(p+i) == *(q+i); i++)
                ;
            if (i == k)
                if (rand() % ++eqsofar == 0)
                    nextp = q;
        }
        c = *(nextp+k);
        if (c == 0)
            break;
        putchar(c);
        p = nextp + 1;
    }
    return 0;
}
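The selection step in markovlet.c is worth isolating: it chooses uniformly among all matching k-grams in a single pass, without storing them, by keeping the current candidate with probability 1/count when the count-th match arrives. Here is a minimal sketch of that idiom (the function name `choose_index` and the match-array framing are mine, not the book's):

```c
#include <stdlib.h>

/* One-pass uniform choice: when the count-th matching item appears,
   keep it with probability 1/count.  Each match ends up selected with
   probability 1/total -- the idiom behind
   "if (rand() % ++eqsofar == 0) nextp = q;" in markovlet.c. */
static int choose_index(const int *match, int n, unsigned seed)
{   int i, count = 0, pick = -1;
    srand(seed);
    for (i = 0; i < n; i++)
        if (match[i])
            if (rand() % ++count == 0)
                pick = i;
    return pick;    /* -1 if nothing matched */
}
```

The same trick appears again in markov.c and markovhash.c, where it picks a random successor among all phrases with the same k-word prefix.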

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* markov.c -- generate random text from input document */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
char *word[800000];
int nword = 0;
int k = 2;

int wordncmp(char *p, char* q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

int sortcmp(char **p, char **q)
{   return wordncmp(*p, *q);
}

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, l, m, u;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    qsort(word, nword, sizeof(word[0]), sortcmp);
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        l = -1;
        u = nword;
        while (l+1 != u) {
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if (rand() % (i+1) == 0)
                p = word[u+i];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* markovhash.c -- generate random text, sped up with hash tables */

/* For storage efficiency (and also to minimize changes from markov.c),
   the hash table is implemented in the integer array next.
   If bin[i]=j, then word[j] is the first element in the list,
   word[next[j]] is the next element, and so on.
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
#define MAXWORDS 800000
char *word[MAXWORDS];
int nword = 0;
int k = 2;
int next[MAXWORDS];
#define NHASH 499979
int bin[NHASH];
#define MULT 31

unsigned int hash(char *ptr)
{   unsigned int h = 0;
    unsigned char *p = (unsigned char *) ptr;
    int n;
    for (n = k; n > 0; p++) {
        h = MULT * h + *p;
        if (*p == 0)
            n--;
    }
    return h % NHASH;
}

int wordncmp(char *p, char* q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, j;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < NHASH; i++)
        bin[i] = -1;
    /* Restored from the description above (lost in extraction):
       link each word into the list for its bin. */
    for (i = 0; i < nword; i++) {
        j = hash(word[i]);
        next[i] = bin[j];
        bin[j] = i;
    }
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        i = 0;
        for (j = bin[hash(phrase)]; j >= 0; j = next[j])
            if ((wordncmp(phrase, word[j]) == 0)
              && (rand() % (++i) == 0))
                p = word[j];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}

Principles (Section 15.4 of Programming Pearls)

String Problems. How does your compiler look up a variable name in its symbol table? How does your help system quickly search that whole CD-ROM as you type in each character of your query string? How does a web search engine look up a phrase? These real problems use some of the techniques that we've glimpsed in the toy problems of this column.

Data Structures for Strings. We've seen several of the most important data structures used for representing strings.

Hashing. This structure is fast on the average and simple to implement.

Balanced Trees. These structures guarantee good performance even on perverse inputs, and are nicely packaged in most implementations of the C++ Standard Template Library's sets and maps.

Suffix Arrays. Initialize an array of pointers to every character (or every word) in your text, sort them, and you have a suffix array. You can then scan through it to find near strings or use binary search to look up words or phrases. Section 13.8 uses several additional structures to represent the words in a dictionary.

Libraries or Custom-Made Components? The sets, maps and strings of the C++ STL were very convenient to use, but their general and powerful interface meant that they were not as efficient as a special-purpose hash function. Other library components were very efficient: hashing used strcmp and suffix arrays used qsort. I peeked at the library implementations of bsearch and strcmp to build the binary search and the wordncmp functions in the Markov program.

Next: Section 15.5. Problems. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000
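The suffix-array recipe sketched above (a pointer to every character, sorted) is enough to find the longest repeated string of Section 15.2: after sorting, the longest repeats sit in adjacent entries. Here is a minimal sketch under that assumption; `comlen` and `pstrcmp` follow the book's helpers, while the wrapper name `longest_dup` is mine:

```c
#include <stdlib.h>
#include <string.h>

/* Length of the common prefix of p and q. */
static int comlen(const char *p, const char *q)
{   int i = 0;
    while (p[i] != '\0' && p[i] == q[i])
        i++;
    return i;
}

/* qsort comparator for an array of char pointers. */
static int pstrcmp(const void *p, const void *q)
{   return strcmp(*(char * const *)p, *(char * const *)q);
}

/* Longest repeated substring of s: build a suffix array (a pointer to
   every suffix, sorted), then compare adjacent entries, which share
   the longest common prefixes.  Returns the length; *where points at
   one occurrence. */
static int longest_dup(char *s, char **where)
{   int n = (int) strlen(s), i, len, maxlen = 0;
    char **a = malloc(n * sizeof(char *));
    *where = s;
    for (i = 0; i < n; i++)
        a[i] = s + i;
    qsort(a, n, sizeof(a[0]), pstrcmp);
    for (i = 0; i < n - 1; i++)
        if ((len = comlen(a[i], a[i+1])) > maxlen) {
            maxlen = len;
            *where = a[i];
        }
    free(a);
    return maxlen;
}
```

On "banana" this reports length 3, pointing at an occurrence of "ana"; the book's full program applies the same scan to multi-megabyte texts.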

Solutions (To Column 15 of Programming Pearls)

1. Many document systems provide a way to strip out all formatting commands and see a raw text representation of the input. When I played with the string duplication program of Section 15.2 on long texts, I found that it was very sensitive to how the text was formatted. The program took 36 seconds to process the 4,460,056 characters in the King James Bible, and the longest repeated string was 269 characters in length. When I normalized the input text by removing the verse number from each line, so long strings could cross verse boundaries, the longest string increased to 563 characters, which the program found in about the same amount of run time.

3. Because this program performs many searches for each insertion, very little of its time is going to memory allocation. Incorporating the special-purpose memory allocator reduced the processing time by about 0.06 seconds, for a ten-percent speedup in that phase, but only a two-percent speedup for the entire program.

5. We could add another map to the C++ program to associate a sequence of words with each count. In the C program we might sort an array by count, then iterate through it (because some words tend to have large counts, that array will be much smaller than the input file). For typical documents, we might use key indexing and keep an array of linked lists for counts in the range (say) 1..1000.

7. Textbooks on algorithms warn about inputs like ``aaaaaaaa'', repeated thousands of times. I found it easier to time the program on a file of newlines. The program took 2.09 seconds to process 5000 newlines, 8.90 seconds on 10,000 newlines, and 37.90 seconds on 20,000 newlines. This growth appears to be slightly faster than quadratic, perhaps proportional to roughly n log2 n comparisons, each at an average cost proportional to n. A more realistic bad case can be constructed by appending two copies of a large input file.

8. The subarray a[i..i+M] represents M+1 strings. Because the array is sorted, we can quickly determine how many characters those M+1 strings have in common by calling comlen on the first and last strings:

    comlen(a[i], a[i+M])

Code at this book's web site implements this algorithm.

9. Read the first string into the array c, note where it ends, terminate it with a null character, then read in the second string and terminate it. Sort as before. When scanning through the array, use an ``exclusive or'' to ensure that precisely one of the strings starts before the transition point.

14. This function hashes a sequence of k words terminated by null characters:

    unsigned int hash(char *p)
        unsigned int h = 0
        int n
        for (n = k; n > 0; p++)
            h = MULT * h + *p
            if (*p == 0)
                n--
        return h % NHASH

A program at this book's web site uses this hash function to replace binary search in the Markov text generation algorithm, which reduces the O(n log n) time to O(n), on the average. The program uses a list representation for elements in the hash table to add only nwords additional 32-bit integers, where nwords is the number of words in the input. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 15 Nov 2000
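Solution 8's observation, that in a sorted array the M+1 strings a[i..i+M] share exactly comlen(a[i], a[i+M]) leading characters, takes only a couple of lines to capture. A sketch (the name `groupcomlen` is mine, not the book's):

```c
/* Length of the common prefix of p and q. */
static int comlen(const char *p, const char *q)
{   int i = 0;
    while (p[i] != '\0' && p[i] == q[i])
        i++;
    return i;
}

/* Characters shared by all of the sorted strings a[i..i+m].
   Sorting guarantees that every string between the two extremes
   shares their common prefix, so only the first and last strings
   need be compared. */
static int groupcomlen(char **a, int i, int m)
{   return comlen(a[i], a[i + m]);
}
```

For the sorted group {"ana", "anana", "anna"}, comparing only the extremes gives 2 characters ("an"), which is indeed what all three share.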

Further Reading (Section 15.6 of Programming Pearls) Many of the texts cited in Section 8.8 contain material on efficient algorithms and data structures for representing and processing strings. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Web References for Programming Pearls

Peter Gordon, my editor at Addison-Wesley, once boasted that he had never published a book that contained a correct web address. He's a modest guy, but I did take his wise advice and kept web references in the book to a minimum. Here are some interesting and useful sites.

For The Entire Book

My book frequently cites Steve McConnell's Code Complete and Brian Kernighan and Rob Pike's The Practice of Programming. A theme throughout this edition of the book is code tuning in the presence of caches. The index entry for ``cache-sensitive code'' points to ten such examples. Anthony LaMarca has been studying this general topic for a number of years, starting with his University of Washington Ph.D. Thesis on Caches and Algorithms. Jeff Vitter has been studying external memory algorithms, which employ similar techniques. Since I finished the book, I have been investigating ``Cache-Conscious Algorithms and Data Structures''; a talk I have given on that topic is available as a Powerpoint Show.

Column 1

For more on the main theme of this column, see the Computerworld article on Elegance in software. Problem 1.12 recounts the urban legend of how NASA spent a million dollars to develop a special pen to write in space. For the truth, check out the amazing story of The Fisher Space Pen. The company was founded in 1948, and the first Fisher pen went into space on board the Apollo 7 in October, 1968 (according to the web page, the earlier Mercury and Gemini astronauts in fact took their notes with pencils).

Column 2

Solution 2.1 mentions that David Musser and Atul Saini's STL Tutorial and Reference Guide implements several anagram programs in Chapters 12 through 15. Their book's web site contains the source code for those programs.

Column 3

Solution 3.4 refers to the book Calendrical Calculations by Nachum Dershowitz and Ed Reingold. Their fascinating web site contains a great deal of material on the topic.

Column 4

Jeff Lagarias's papers on the 3x+1 problem and related problems (see especially the 1985 postscript annotated bibliography) will help you appreciate Solution 4.5. Column 5 Kernighan and Pike's The Practice of Programming site has an excerpt from their Chapter 5 on Debugging. Steve McConnell's Code Complete site has an excerpt from his Section 26.2 on debugging. And when you do find a bug, Tom Van Vleck has Three Questions About Each Bug You Find. Column 6 Butler Lampson's 1983 paper Hints for Computer System Design shows how to attack performance problems at a variety of design levels, which is the theme of Column 6. Section 6.1 describes how Andrew Appel attacked the physics problem of many-body simulations at several design levels. Guy Blelloch devotes a section of his algorithms course to N-body simulations. Amara Graps has a very thorough Web page describing N-Body / Particle Simulation Methods. Both sites have many pointers to other relevant sites. Column 7 Column 7 is about The Back of the Envelope. Here are some relevant sites: A View From the Back of the Envelope. Old Dominion University Fermi Problems Site. Fermi Problems (at the University of Texas). For more on the Rule of 72, try a web search like this. Problem 7.12 is about the average life span of circulating coins; see what the U.S. Mint has to say on the topic. (While you're there, be sure to admire the 1999 New Jersey State Quarter.) Todd Proebsting uses a simple experiment and the Rule of 72 to defend Proebsting's Law: ``Advances in compiler optimizations double computing power every 18 years.'' Mike Lesk uses a variety of estimation techniques as he calculates How Much Information Is There In the World? Before you peek at his answers, try the problem yourself: How much information in the world? How much storage? What does that mean? Column 8 Column 8 describes a sequence of algorithms for one small problem. 
If you enjoy that approach, then you may enjoy my article ``Faster And Faster And Faster Yet'' in the June 1997 issue of UNIX Review. The article starts with a simple algorithm for solving an n-city Traveling Salesman Problem (TSP) that takes exponential time, and tunes it like crazy to get speedups of, literally, ``billions and billions''.

The Further Reading in Column 8 refers to two texts on algorithms and data structures. Those two and seven others were collected by Dr. Dobb's Journal as Dr. Dobb's Essential Books on Algorithms and Data Structures Release 2 CD-ROM. The complete set of nine electronic books can be ordered online for about the price of one physical book.

Column 9

Steve McConnell's Code Complete site has an excerpt from his Section 28.2 on code tuning. I described some system-dependent code tuning techniques in ``Working With The System'' in the October 1997 issue of UNIX Review; the run time of one function is decreased by a factor of 50. Appendix 4 of Programming Pearls summarizes a set of rules from my 1982 book Writing Efficient Programs (now out of print). Those rules are summarized and ``adapted to suit the special needs and opportunities of the Origin 2000 architecture, SGI version 7.x compilers, and the MIPS instruction set'' by the Survey of High Performance Computing Systems. If you enjoy the topic, you should also investigate ORIGIN 2000 Performance Tuning and Optimization.

Column 10

The column talks about some simple approaches to squeezing space. For a high-tech compression scheme, let Craig Nevill-Manning and Ian Witten demonstrate the sequitur system for inferring hierarchies from sequences. And speaking of coding theory, Tim Bell, Ian Witten and Mike Fellows have built the Computer Science Unplugged project of off-line demonstrations of computing ideas for elementary school children. Their activity about Error Detection and Correction will entertain programmers of all ages.

Column 11

Anthony LaMarca and Richard Ladner describe ``The Influence of Caches on the Performance of Sorting''. In the October 1997 issue of UNIX Review, I tuned an insertion sort in an article on ``Working With The System''.
Combining code tuning and compiler optimizations led to a speedup factor of between 15 and 50 across three 200MHz machines. Column 13 Andrew Binstock published ``Hashing Rehashed'' in the April 1996 issue of Dr. Dobb's Journal. This article describes how cache memories can have a profound effect on hash functions. Most of Column 13 concentrates on one small searching problem, and Section 13.8 describes a medium-sized problem. Searching problems also come in the Jumbo size. How does AltaVista search millions of web sites in a fraction of a second? Richard L. Sites gives you a tour behind the scenes.

Column 14 Anthony LaMarca and Richard Ladner describe ``The Influence of Caches on the Performance of Heaps''. Column 15 For one of the largest string problems that I know, look at a 1996 presentation by Richard L. Sites on how AltaVista searches thirteen billion words in half a second. I first described the algorithm in Section 15.2 for finding ``Long Common Strings'' in the December 1997 issue of UNIX Review. (The title on that web page is wrong, but the content is right.) The article contains several exercises and extensions that didn't make it into the book. Section 3 of Claude Shannon's 1948 paper on A Mathematical Theory of Communication is the earliest reference I know to generating the Markov text described in Section 15.3. That paper has been described as ``The Magna Carta of the Information Age''. Kernighan and Pike describe an algorithm for Markov-text generation in Chapter 3 of their Practice of Programming, and implement it in several languages. Their site has source code for the problem in C, Java, C++, Awk and Perl. In experiments starting in 1954, Fred Brooks, A. L. Hopkins, Peter Neumann, and Bill Wright used Markov chains to generate random music. Their work is described in ``An Experiment in Musical Composition'' in IRE Transactions on Electronic Computers EC-6, pp. 175-182, September 1957. The paper was reprinted on pages 2342 of Machine Models of Music edited by S.M. Schwanauer and D.A. Levitt and published by MIT Press in 1993. Peter Neumann describes those experiments in (the third paragraph from the bottom of) his home page. They trained on a set of 37 common-meter hymn tunes, and then generated random tunes based on 0-th to 7-th order Markov chains (where the fundamental unit was the eighth note). Neumann observes that the 0-th order tunes sounded random indeed, while the 7-th order tunes were very similar to (but different than) the training set. Copyright © 1999 Lucent Technologies. All rights reserved. Thu 6 Jan 2000

Teaching Material for Programming Pearls Transparencies These files contain overhead transparencies that I have used when giving talks about topics in the book. I have used this material in talking to high school students, undergraduates, graduate students and professional programmers. The slides stay the same, but the pace varies with the audience. Some slides contain a question and then the answer later on the page. On such slides, I usually put a blank piece of paper of appropriate height over the answer, and reveal it after the group discussion. Good luck; I hope that you and your students have fun. Column 2 -- Vector Rotation. (Postscript, Acrobat) How to rotate a vector in linear time and constant space. [From Section 2.3.] Column 2 -- Anagrams. (Postscript, Acrobat) How to find all the anagrams in a dictionary. [From Sections 2.4 and 2.8.] Column 4 -- Program Verification. (Postscript, Acrobat) Derivation and proof of a correct binary search function. [From Column 4.] Column 7 -- The Back of the Envelope. (Postscript, Acrobat) A quick introduction to quick calculations. [Mostly from the introduction to Column 7 and Section 7.1.] Column 8 -- Algorithm Design Techniques. (Postscript, Acrobat) Four algorithms to solve one problem, and the techniques used to design them. [Most of Column 8.] Column 13 -- Searching. (Postscript, Acrobat) Linear structures for set representation. [Sections 13.1 and 13.2.] Column 14 -- Heaps. (Postscript, Acrobat) Define heaps, derive the key siftup and siftdown functions, then use those to build priority queues and Heapsort.

[Most of Column 14.] Column 15 -- String Algorithms. (Postscript, Acrobat) Applications of suffix arrays. [From Sections 15.2 and 15.3.] A theme running through the book is Tricks of the Trade, such as problem definition, back-of-the-envelope estimates, and debugging. That page describes some of those themes, and gives transparencies for a talk on the topic.

Other Material I have copied the estimation quiz and answers back-to-back to form a one-page take-home quiz (self-graded) to give to students after a talk about The Back of the Envelope. I once took a similar quiz in a one-day class on the use of statistics in business; the quiz showed the students that their 90-percent estimates were too narrow (although we should have averaged nine correct answers, most of us got between three and six ranges correct). A page on first-year instruction describes how the book might be used in introductory classes in computing.

You may use this material for any classroom purpose, as long as you leave the copyright notice and book citation attached. Copyright © 2000 Lucent Technologies. All rights reserved. Mon 28 Feb 2000

Vector Rotation

The Problem
Three Algorithms
Implementations

From Programming Pearls, Copyright 2000, Lucent Technologies


The Problem

Rotate vector x[n] left by d positions. For n = 8 and d = 3, change abcdefgh to defghabc.
Constraints: O(n) time, O(1) extra space.

Pricey Solutions

Store d in an intermediate vector, shift the rest, store back. [O(n) extra space.]
Rotate by 1, d times. [O(n·d) time.]


A Juggling Algorithm

The Idea (n = 12, d = 3)

[Diagram: move x[0] to t, then x[3] to x[0], x[6] to x[3], x[9] to x[6], and finally t to x[9]; repeat the cycle starting at x[1] and at x[2].]

The Code

for i = [0, gcd(d, n))
    /* move i-th values of blocks */
    t = x[i]
    j = i
    loop
        k = j + d
        if k >= n
            k -= n
        if k == i
            break
        x[j] = x[k]
        j = k
    x[j] = t
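In C, the juggling pseudocode translates almost directly. This is a sketch, not the book's source; the names rotate_juggle and gcd are mine.

```c
#include <assert.h>
#include <string.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Rotate x[0..n-1] left by d positions using the juggling
   algorithm: gcd(d, n) cycles, each moving elements d apart. */
void rotate_juggle(char *x, int n, int d)
{
    if (n <= 0 || (d %= n) == 0)
        return;
    int cycles = gcd(d, n);
    for (int i = 0; i < cycles; i++) {
        char t = x[i];
        int j = i;
        for (;;) {
            int k = j + d;
            if (k >= n)
                k -= n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}
```

With n = 8 and d = 3, gcd(3, 8) = 1, so a single cycle visits every element.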


The Block-Swap Algorithm

The Idea: Change ab to ba

If a is shorter, divide b into bl and br.
Swap a and br to change a bl br into br bl a.
Recur on the pieces of b.

The Code

if d == 0 || d == n
    return
i = p = d
j = n - p
while i != j
    if i > j
        swap(p-i, p, j)
        i -= j
    else
        swap(p-i, p+j-i, i)
        j -= i
swap(p-i, p, i)
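The same algorithm in C might look like this (a sketch, not the book's code; swap is written out as a block-swap helper, here called swapblk):

```c
#include <assert.h>
#include <string.h>

/* Swap the m-element blocks x[a..a+m-1] and x[b..b+m-1]. */
static void swapblk(char *x, int a, int b, int m)
{
    for (int k = 0; k < m; k++) {
        char t = x[a+k];
        x[a+k] = x[b+k];
        x[b+k] = t;
    }
}

/* Rotate x[0..n-1] left by d positions by repeated block swaps. */
void rotate_swap(char *x, int n, int d)
{
    if (d == 0 || d == n)
        return;
    int p = d;
    int i = p, j = n - p;
    while (i != j) {
        if (i > j) {
            swapblk(x, p - i, p, j);
            i -= j;
        } else {
            swapblk(x, p - i, p + j - i, i);
            j -= i;
        }
    }
    swapblk(x, p - i, p, i);
}
```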


The Reversal Algorithm

The Idea

Reverse a to get a^r b.
Reverse b to get a^r b^r.
Reverse all to get (a^r b^r)^r = ba.

The Code

/* rotate abcdefgh left three */
reverse(0, d-1)    /* cbadefgh */
reverse(d, n-1)    /* cbahgfed */
reverse(0, n-1)    /* defghabc */

Doug McIlroy's Handwaving Description

[Diagram: hold a hand of ten cards, 1 through 10. Flip Left Hand: the first five cards become 5 4 3 2 1. Flip Right Hand: the last five become 10 9 8 7 6. Flip Both: the whole hand reverses to 6 7 8 9 10 1 2 3 4 5, the rotated sequence.]
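The three-reversal rotation is short enough to write out in full C; this is a sketch (names mine), not the book's source.

```c
#include <assert.h>
#include <string.h>

/* Reverse x[i..j] in place. */
static void reverse(char *x, int i, int j)
{
    while (i < j) {
        char t = x[i];
        x[i] = x[j];
        x[j] = t;
        i++;
        j--;
    }
}

/* Rotate x[0..n-1] left by d positions with three reversals. */
void rotate_reverse(char *x, int n, int d)
{
    if (n <= 0 || (d %= n) == 0)
        return;
    reverse(x, 0, d - 1);   /* abcdefgh -> cbadefgh */
    reverse(x, d, n - 1);   /* cbadefgh -> cbahgfed */
    reverse(x, 0, n - 1);   /* cbahgfed -> defghabc */
}
```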

An Experiment on Run Times

n = 10^6, 400MHz Pentium II.

[Plot: nanoseconds per element (0 to 200) versus rotation distance (1 to 50). Juggling is slowest, near 200 ns per element; Reversal and Block Swap both run near 50 ns per element.]


Anagrams

Definitions
Algorithms
    Two Slow Algorithms
    A Fast Algorithm
An Implementation
    The Strategy
    The Components
    The Complete Program


Anagrams

Definition. Two words are anagrams if one can be formed by permuting the letters in the other.

A 6-element anagram set: opts pots post stop spot tops

The Problem. How would you compute all anagram sets in a dictionary of 230,000 English words? (Related problem: how to find all anagrams of an input word?)


Two Slow Algorithms

Examine All Permutations. ``cholecystoduodenostomy'' has 22! ≈ 1.1×10^21 permutations. One picosecond each gives 1.1×10^9 seconds, or a few decades. (The rule of thumb that ``π seconds is a nanocentury'' is half a percent off the true value of 3.155×10^7 seconds per year.)

Examine All Pairs of Words. Assume 230,000 words in the dictionary, 1 microsec per compare.

230,000 words × 230,000 comps/word × 1 microsec/comp
    = 52,900×10^6 microsecs
    = 52,900 secs
    ≈ 14.7 hours
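Both back-of-the-envelope figures are easy to check mechanically; this small C sketch (not from the book; function names mine) recomputes them.

```c
#include <assert.h>

/* 22! permutations at one picosecond each, in seconds. */
double perm_seconds(void)
{
    double f = 1.0;
    for (int i = 2; i <= 22; i++)
        f *= i;                    /* f ends up as 22! ~ 1.1e21 */
    return f * 1e-12;
}

/* 230,000 words each compared against 230,000 words at one
   microsecond per comparison, converted to hours. */
double pair_hours(void)
{
    double usecs = 230000.0 * 230000.0;
    return usecs / 1e6 / 3600.0;
}
```

perm_seconds comes out near 1.1×10^9 seconds (about 35 years), and pair_hours near 14.7, matching the slide.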


A Fast Algorithm

The Idea. Sign words so that those in the same anagram class have the same signature, and then collect equal signatures.

The Signature. Sort letters within a word. The signature of ‘‘deposit’’ is ‘‘deiopst’’, which is also the signature of ‘‘dopiest’’.

Collecting the Classes. Sort the words by their signatures.

A Summary. Sort this way (with a horizontal wave of the hand) then that way (a vertical wave).


Implementation with Pipes

A pipeline of three programs. Example:

                 sign         sort         squash
    pans    anps pans    anps pans    pans snap
    pots    opst pots    anps snap    opt
    opt     opt opt      opt opt      pots stop tops
    snap    anps snap    opst pots
    stop    opst stop    opst stop
    tops    opst tops    opst tops


sign in C

int charcomp(char *x, char *y)
{   return *x - *y;
}

#define WORDMAX 100

int main(void)
{   char word[WORDMAX], sig[WORDMAX];
    while (scanf("%s", word) != EOF) {
        strcpy(sig, word);
        qsort(sig, strlen(sig), sizeof(char), charcomp);
        printf("%s %s\n", sig, word);
    }
    return 0;
}


squash in C

int main(void)
{   char word[MAX], sig[MAX], oldsig[MAX];
    int linenum = 0;
    strcpy(oldsig, "");
    while (scanf("%s %s", sig, word) != EOF) {
        if (strcmp(oldsig, sig) != 0 && linenum > 0)
            printf("\n");
        strcpy(oldsig, sig);
        linenum++;
        printf("%s ", word);
    }
    printf("\n");
    return 0;
}


The Complete Program

Executed by

    sign <dict | sort | squash >grams

where dict contains 230,000 words.

Total run time is 18 seconds: 4 in sign, 11 in sort and 3 in squash.

Some Output:

subessential suitableness
canter creant cretan nectar recant tanrec trance
caret carte cater crate creat creta react recta trace
destain instead sainted satined
adroitly dilatory idolatry
least setal slate stale steal stela tales
reins resin rinse risen serin siren
constitutionalism misconstitutional


Program Verification

Binary Search
The Problem
Code Derivation
Verification
Principles


Binary Search

The Problem.

Input: an integer n ≥ 0, a sorted array x[0] ≤ x[1] ≤ x[2] ≤ ... ≤ x[n-1], and a target t.
Output: the integer p tells t's location in x[0..n-1]. If p = -1 then t is not in x[0..n-1]; otherwise 0 ≤ p < n and t = x[p].

The Algorithm. Keep track of a range known to contain t. The range is initially the entire array, and is shrunk by comparing its middle element to t. This example searches for 50 in x[0..15]:

26 26 31 31 32 38 38 41 43 46 50 53 58 59 79 97

[Diagram: the probes examine x[7]=41, then x[11]=53, then x[9]=46, and finally x[10]=50.]

Difficulty. The first binary search was published in 1946; when was the first correct search published?

Derivation, Step 1

initialize range to 0..n-1
loop
    { invariant: mustbe(range) }
    if range is empty,
        break and report that t is not in the array
    compute m, the middle of the range
    use m as a probe to shrink the range
    if t is found during the shrinking process,
        break and report its position


Derivation, Step 2

Represent the range l..u by integers l and u.

l = 0; u = n-1
loop
    { invariant: mustbe(l, u) }
    if l > u
        p = -1; break
    m = (l + u) / 2
    use m as a probe to shrink l..u
    if t is found during the shrinking process,
        break and note its position


Binary Search Code

l = 0; u = n-1
loop
    { mustbe(l, u) }
    if l > u
        p = -1; break
    m = (l + u) / 2
    case
        x[m] <  t:  l = m+1
        x[m] == t:  p = m; break
        x[m] >  t:  u = m-1
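Rendered as a C function, the derived search might look like this (a sketch, not the book's code; the overflow-safe midpoint is a later refinement, not part of the derivation above):

```c
#include <assert.h>

/* Return the position of t in the sorted array x[0..n-1],
   or -1 if t is not present. */
int binarysearch(const int *x, int n, int t)
{
    int l = 0, u = n - 1;
    while (l <= u) {                 /* range l..u must contain t */
        int m = l + (u - l) / 2;     /* midpoint without overflow */
        if (x[m] < t)
            l = m + 1;
        else if (x[m] == t)
            return m;
        else
            u = m - 1;
    }
    return -1;                       /* range is empty */
}
```

On the 16-element example array above, searching for 50 probes 41, 53, 46 and then finds 50 at position 10.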

Verifying the Code

{ mustbe(0, n-1) }
l = 0; u = n-1
{ mustbe(l, u) }
loop
    { mustbe(l, u) }
    if l > u
        { l > u && mustbe(l, u) }
        { t is not in the array }
        p = -1; break
    { mustbe(l, u) && l <= u }
    ...

The maxsum3 code (divide and conquer):

...
    /* find max crossing to left */
    lmax = sum = 0
    for (i = m; i >= l; i--)
        sum += x[i]
        lmax = max(lmax, sum)
    /* find max crossing to right */
    rmax = sum = 0
    for i = (m, u]
        sum += x[i]
        rmax = max(rmax, sum)
    return max(lmax+rmax, maxsum3(l, m), maxsum3(m+1, u))


A Linear Algorithm

Idea. How can we extend a solution for x[0..i-1] into a solution for x[0..i]? Key variables: maxsofar and maxhere.

[Diagram: maxsofar is the best sum anywhere in x[0..i-1]; maxhere is the best sum of a subvector ending just before position i.]

Code.

maxsofar = 0
maxhere = 0
for i = [0, n)
    /* invariant: maxhere and maxsofar are accurate for x[0..i-1] */
    maxhere = max(maxhere + x[i], 0)
    maxsofar = max(maxsofar, maxhere)

Run Time. O(n).
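The scan is a few lines of C (a sketch; maxsum4 is my name for it, and the test vector is the book's ten-element example, whose answer is 187):

```c
#include <assert.h>

/* Maximum sum over all contiguous subvectors of x[0..n-1]
   (0 for the empty subvector), in one left-to-right scan. */
int maxsum4(const int *x, int n)
{
    int maxsofar = 0, maxhere = 0;
    for (int i = 0; i < n; i++) {
        /* invariant: maxhere and maxsofar are accurate for x[0..i-1] */
        maxhere += x[i];
        if (maxhere < 0)        /* a negative prefix never helps */
            maxhere = 0;
        if (maxsofar < maxhere)
            maxsofar = maxhere;
    }
    return maxsofar;
}
```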


Summary of the Algorithms

                                      ALGORITHM
                           1            2             3            4
Run time in
nanoseconds             1.3 n^3      10 n^2     47 n log2 n     48 n

Time to solve a
problem of size
    10^3               1.3 secs     10 msecs      .4 msecs    .05 msecs
    10^4               22 mins       1 sec         6 msecs     .5 msecs
    10^5               15 days      1.7 min       78 msecs      5 msecs
    10^6               41 yrs       2.8 hrs      .94 secs      48 msecs
    10^7               41 millenia  1.7 wks       11 secs     .48 secs

Max size problem
solved in one
    sec                  920        10,000       1.0×10^6     2.1×10^7
    min                 3600        77,000       4.9×10^7     1.3×10^9
    hr                14,000       6.0×10^5      2.4×10^9     7.6×10^10
    day               41,000       2.9×10^6      5.0×10^10    1.8×10^12

If n multiplies by 10,
time multiplies by      1000          100           10+          10

If time multiplies by 10,
n multiplies by         2.15          3.16          10–          10
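The ``max size problem solved in one second'' row can be sanity-checked by inverting the run-time formulas by brute force; a small C sketch (function names mine):

```c
#include <assert.h>

/* Run times in nanoseconds at problem size n, per the table. */
double t1(double n) { return 1.3 * n * n * n; }   /* cubic     */
double t2(double n) { return 10.0 * n * n; }      /* quadratic */
double t4(double n) { return 48.0 * n; }          /* linear    */

/* Largest integer n whose run time fits within ns nanoseconds. */
long maxsize(double (*t)(double), double ns)
{
    long n = 1;
    while (t((double)(n + 1)) <= ns)
        n++;
    return n;
}
```

One second is 10^9 nanoseconds, so maxsize(t1, 1e9) lands near the table's 920, t2 at exactly 10,000, and t4 near 2.1×10^7.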


An Extreme Comparison

Algorithm 1 in C on a 533MHz Alpha 21164A is 0.58 n^3 nanoseconds. Algorithm 4 in interpreted BASIC on a 2.03MHz 1980 TRS-80 is 19.5 n milliseconds, or 19,500,000 n nanoseconds.

      n        1999 ALPHA 21164A,    1980 TRS-80,
               C, CUBIC ALGORITHM    BASIC, LINEAR ALGORITHM
       10      0.6 microsecs         200 millisecs
      100      0.6 millisecs         2.0 secs
     1000      0.6 secs              20 secs
   10,000      10 mins               3.2 mins
  100,000      7 days                32 mins
1,000,000      19 yrs                5.4 hrs

[Plot: run time in nanoseconds (10^0 to 10^18, with common units from nanosecond to century marked) versus problem size n (10^0 to 10^6); the TRS-80's linear growth crosses below the Alpha's cubic growth between n = 1000 and n = 10,000.]

Design Techniques

Save state to avoid recomputation. Algorithms 2 and 4.
Preprocess information into data structures. Algorithm 2b.
Divide-and-conquer algorithms. Algorithm 3.
Scanning algorithms. Algorithm 4.
Cumulatives. Algorithm 2b.
Lower bounds. Algorithm 4.

Linear Structures

The Problem
Two Sequential Implementations
    Arrays
    Linked Lists


The Problem

Implement this algorithm for random sorted sets:

initialize set S to empty
size = 0
while size < m do
    t = bigrand() % maxval
    if t is not in S
        insert t into S
        size++
print the elements of S in sorted order
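As a concrete sketch (not the book's code), here is the algorithm with S kept as a sorted array in C; rand stands in for bigrand, and m must not exceed maxval or the loop cannot finish.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill v[0..m-1] with m distinct random values in [0, maxval),
   in increasing order; requires m <= maxval.  The sorted prefix
   v[0..size-1] plays the role of the set S. */
void gensets(int m, int maxval, int *v)
{
    int size = 0;
    while (size < m) {
        int t = rand() % maxval;     /* stand-in for bigrand() */
        int i = 0;
        while (i < size && v[i] < t)
            i++;
        if (i < size && v[i] == t)   /* t is already in S */
            continue;
        for (int j = size; j > i; j--)
            v[j] = v[j-1];           /* shift to make room */
        v[i] = t;
        size++;
    }
}
```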


A C++ Approach

A Set Implementation

class IntSetImp {
public:
    IntSetImp(int maxelems, int maxval);
    void insert(int t);
    int size();
    void report(int *v);
};

Algorithm Implementation

void gensets(int m, int maxval)
{   int *v = new int[m];
    IntSetImp S(m, maxval);
    while (S.size() < m)
        S.insert(bigrand() % maxval);
    S.report(v);
    for (int i = 0; i < m; i++)
        cout << v[i] << "\n";
}

...

node *rinsert(node *p, int t)
{   if (p->val < t) {
        p->next = rinsert(p->next, t);
    } else if (p->val > t) {
        p = new node(t, p);
        n++;
    }
    return p;
}


Lists, Cont.

public:
    IntSetList(int maxelms, int maxval)
    {   sentinel = head = new node(maxval, 0);
        n = 0;
    }
    int size() { return n; }
    void insert(int t) { head = rinsert(head, t); }
    void report(int *v)
    {   int j = 0;
        node *p;
        for (p = head; p != sentinel; p = p->next)
            v[j++] = p->val;
    }
};
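The same sentinel-terminated sorted list can be sketched in plain C (a sketch mirroring the C++ class above, with names of my choosing; the sentinel holds INT_MAX so the recursion always stops):

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

struct node { int val; struct node *next; };

/* Create an empty list: just the INT_MAX sentinel. */
struct node *list_new(void)
{
    struct node *s = malloc(sizeof *s);
    s->val = INT_MAX;
    s->next = NULL;
    return s;
}

/* Insert t, keeping the list sorted and ignoring duplicates.
   Increments *n on a real insertion; returns the new head. */
struct node *list_insert(struct node *head, int t, int *n)
{
    if (head->val < t) {
        head->next = list_insert(head->next, t, n);
    } else if (head->val > t) {
        struct node *p = malloc(sizeof *p);
        p->val = t;
        p->next = head;
        head = p;
        (*n)++;
    }
    return head;
}
```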


Run Times

Experiments (n = 10^6):

                               Set Size (m)
Structure                   10,000   20,000   40,000
Arrays                        0.6      2.6     11.1
Simple Lists                  5.7     31.2    170.0
Lists (Remove Recursion)      1.8     12.6     73.8
Lists (Group Allocation)      1.2      5.7     25.4

Advanced Structures

n = 10^8; raise m until thrashing.

[Table: Mbytes of memory and secs of run time for each structure (STL sets, BSTs, custom-allocated BSTs, bins, custom-allocated bins, and a 16-Mbyte bit vector) at set sizes m = 1,000,000, 5,000,000 and 10,000,000.]

Heaps

The Data Structure
Two Critical Functions
Priority Queues
A Sorting Algorithm


The Data Structure

A Heap of Twelve Integers

[Diagram: a binary tree holding 12 20 15 29 23 17 22 35 40 26 51 19 in level order, with 12 at the root.]

Two Properties of a Heap

Order: The value at any node is less than or equal to the values of the node's children. (Thus the least element of the set is at the root.)
Shape: All leaves are on at most two adjacent levels, with the leaves on the bottom level as far to the left as possible.

Warning: these heaps all use 1-based arrays.


Implementation of Trees with Shape

A 12-Element Example

[Diagram: x[1] at the root, x[2] and x[3] on the next level, then x[4]..x[7], then x[8]..x[12] on the bottom level.]

Definitions in a Program

root = 1
value(i) = x[i]
leftchild(i) = 2*i
rightchild(i) = 2*i+1
parent(i) = i / 2
null(i) = (i < 1) or (i > n)

A Tree and Its Array

[Diagram: the heap of twelve integers beside its array form, x[1..12] = 12 20 15 29 23 17 22 35 40 26 51 19.]

The siftup Function

The Goal

x[1..n-1] is a heap; put a new element in x[n]. Sift the new element up the tree.

Inserting 13

[Diagram: 13 is placed in x[12], swaps with its parent 17, then with its parent 15, and stops when its parent is 12.]

The Code

void siftup(n)
    pre  n > 0 && heap(1, n-1)
    post heap(1, n)
    i = n
    loop
        /* invariant: heap(1, n) except between i and its parent */
        if i == 1
            break
        p = i / 2
        if x[p] <= x[i]
            break
        swap(p, i)
        i = p

The siftdown Function

void siftdown(n)
    pre  n >= 0 && heap(2, n)
    post heap(1, n)
    i = 1
    loop
        /* inv: heap(1, n) except between i and its 0, 1 or 2 kids */
        c = 2*i
        if c > n
            break
        /* c is the left child of i */
        if c+1 <= n
            /* c+1 is the right child of i */
            if x[c+1] < x[c]
                c++
        /* c is the lesser child of i */
        if x[i] <= x[c]
            break
        swap(c, i)
        i = c

Priority Queues

void insert(t)
    if n >= maxsize
        /* report error */
    n++
    x[n] = t
    /* heap(1, n-1) */
    siftup(n)
    /* heap(1, n) */

int extractmin()
    if n < 1
        /* report error */
    t = x[1]
    x[1] = x[n--]
    /* heap(2, n) */
    siftdown(n)
    /* heap(1, n) */
    return t
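In C over a 1-based int array, the two sift functions might look like this (a sketch, not the book's code; swapint is my helper name):

```c
#include <assert.h>

/* Exchange x[i] and x[j]. */
static void swapint(int *x, int i, int j)
{
    int t = x[i];
    x[i] = x[j];
    x[j] = t;
}

/* pre: heap(1, n-1); post: heap(1, n).  x is 1-based. */
void siftup(int *x, int n)
{
    int i = n;
    while (i > 1) {
        int p = i / 2;
        if (x[p] <= x[i])
            break;
        swapint(x, p, i);
        i = p;
    }
}

/* pre: heap(2, n); post: heap(1, n).  x is 1-based. */
void siftdown(int *x, int n)
{
    int i = 1;
    for (;;) {
        int c = 2 * i;               /* left child of i */
        if (c > n)
            break;
        if (c + 1 <= n && x[c+1] < x[c])
            c++;                     /* c is the lesser child */
        if (x[i] <= x[c])
            break;
        swapint(x, c, i);
        i = c;
    }
}
```

Building a heap from the twelve-integer example by successive siftup calls leaves 12 at the root; extracting the minimum twice (overwrite x[1] with x[n--], then siftdown) surfaces 15 and then 17.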


The Complete C++ Class

template<class T>
class priqueue {
private:
    int n, maxsize;
    T *x;
    void swap(int i, int j)
    {   T t = x[i]; x[i] = x[j]; x[j] = t; }
public:
    priqueue(int m)
    {   maxsize = m;
        x = new T[maxsize+1];
        n = 0;
    }
    void insert(T t)
    {   int i, p;
        x[++n] = t;
        for (i = n; i > 1 && x[p=i/2] > x[i]; i = p)
            swap(p, i);
    }
    T extractmin()
    {   int i, c;
        T t = x[1];
        x[1] = x[n--];
        for (i = 1; (c = 2*i) <= n; i = c) {
            if (c+1 <= n && x[c+1] < x[c])
                c++;
            if (x[i] <= x[c])
                break;
            swap(c, i);
        }
        return t;
    }
};

...

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
char *word[800000];
int nword = 0;
int k = 2;

int wordncmp(char *p, char* q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

int sortcmp(char **p, char **q)
{   return wordncmp(*p, *q);
}

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, l, m, u;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    qsort(word, nword, sizeof(word[0]), sortcmp);
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        l = -1;
        u = nword;
        while (l+1 != u) {
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if (rand() % (i+1) == 0)
                p = word[u+i];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}


Tricks of the Trade (From Programming Pearls)

Here's a trick of the medical trade useful for anyone who donates blood. Before sticking the big needle in your arm, the nurse first pricks your finger for a few drops of blood. Some thoughtless nurses jab the pad of the index finger, which is the most sensitive spot of the most used finger. It is better to poke a less sensitive part (on the side, halfway down from nail to pad) of a less commonly used finger (the ring finger). This trick can make a blood donor's day a little more pleasant. Tell it to your friends before the next blood drive.

This book describes several similar tricks of the programmer's trade. These ideas are at an intellectual level not far above the trick of drawing blood samples from the side of the ring finger. Fortunately, these tricks are almost as useful, and not too much harder to apply. Three tricks play a particularly prominent role in the book.

Problem Definition
Searching out the real problem is the main topic of Column 1, and is a theme through the rest of the book.

The Back of the Envelope
Quick calculations are the subject of Column 7 and are used throughout the book.

Debugging
How do you turn a pseudocode algorithm into a real program? That ``small matter of programming'' is the topic of Column 5; testing plays an important role there and throughout the book. Section 5.10 tells a few fun stories about debugging, which is another theme in the book.

Other Tricks in the Book The Epilog to the First Edition contains a list of ten design hints for programmers. The index has an entry for engineering techniques. That contains pointers to back of the envelope, background data, debugging, design, elegance, problem definition, prototypes, specifications, testing, and tradeoffs.

A Talk About Tricks

I gave a one-hour talk about tricks at the Game Developers Conference in March, 2000. The talk is available as a Powerpoint Show. Closely related overhead transparencies are available in both Postscript and Acrobat. The talk concentrates on the three tricks described above (problem definition, back-of-the-envelope estimates, and debugging). Most of the material in the slides is taken from the corresponding parts of the book. The first problem to be defined is from Section 12.1, and the second is from Problem 5.3 of my 1988 More Programming Pearls (the blood sample trick is from page 45 of that book). The back-of-the-envelope material is directly from Column 7, while the rules of thumb are scattered throughout the book. Three of the four software debugging stories are from Section 5.10; the medical story is from the book cited at the end of the section. Copyright © 2000 Lucent Technologies. All rights reserved. Tues 7 Mar 2000

An Estimation Quiz (Appendix 2 of Programming Pearls)

Back-of-the-envelope calculations all start with basic quantities. Those numbers can sometimes be found in a problem specification (such as a requirements document), but other times they must be estimated. This little quiz is designed to help you evaluate your proficiency as a number guesser. For each question, fill in upper and lower bounds that, in your opinion, give you a ninety percent chance of including the true value; try not to make your ranges too narrow or too wide. I estimate that it should take you between five and ten minutes. Please make a good faith effort.

[______, ______] January 1, 2000, population of the United States in millions.
[______, ______] The year of Napoleon's birth.
[______, ______] Length of the Mississippi-Missouri river in miles.
[______, ______] Maximum takeoff weight in pounds of a Boeing 747 airliner.
[______, ______] Seconds for a radio signal to travel from the earth to the moon.
[______, ______] Latitude of London.
[______, ______] Minutes for a space shuttle to orbit the earth.
[______, ______] Length in feet between the towers of the Golden Gate Bridge.
[______, ______] Number of signers of the Declaration of Independence.
[______, ______] Number of bones in the adult human body.

When you finish this quiz, please visit here for answers and interpretation. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 9 Aug 1999

Estimation Answers (From Appendix 2 of Programming Pearls) If you have not yet answered all the questions yourself, please go back here and do so. Here are the answers, from an almanac or similar source. January 1, 2000, population of the United States is 272.5 million. Napoleon was born in 1769. The Mississippi-Missouri river is 3,710 miles long. Maximum takeoff weight of a B747-400 airliner is 875,000 pounds. A radio signal travels from the earth to the moon in 1.29 seconds. Latitude of London is about 51.5 degrees. A space shuttle orbits the earth in about 91 minutes. 4200 feet between the towers of the Golden Gate Bridge. 56 signers of the Declaration of Independence. 206 bones in the adult human body.

Interpretation Please count how many of your ranges included the correct answer. Because you used a 90-percent confidence interval, you should have gotten about nine of the ten answers correct. If you had all ten answers correct, then you may be an excellent estimator. Then again, your ranges may be way too big. You can probably guess which. If you had six or fewer answers correct, then you are probably as embarrassed as I was when I first took a similar estimation quiz. A little practice couldn't hurt your estimation skills. If you had seven or eight answers correct, then you are a pretty good guesser. In the future, remember to broaden your 90-percent ranges just a bit. If you had exactly nine answers correct, then you may be an excellent guesser. Or, you may have given infinite ranges for the first nine questions, and zero for the last question. If so, shame on you. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 9 Aug 1999

The Back of the Envelope (Column 7 of Programming Pearls)

A Story

It was in the middle of a fascinating conversation on software engineering that Bob Martin asked me, ``How much water flows out of the Mississippi River in a day?'' Because I had found his comments up to that point deeply insightful, I politely stifled my true response and said, ``Pardon me?'' When he asked again I realized that I had no choice but to humor the poor fellow, who had obviously cracked under the pressures of running a large software shop.

My response went something like this. I figured that near its mouth the river was about a mile wide and maybe twenty feet deep (or about one two-hundred-and-fiftieth of a mile). I guessed that the rate of flow was five miles an hour, or a hundred and twenty miles per day. Multiplying

    1 mile x 1/250 mile x 120 miles/day ~ 1/2 mile^3/day

showed that the river discharged about half a cubic mile of water per day, to within an order of magnitude. But so what?

At that point Martin picked up from his desk a proposal for the communication system that his organization was building for the Summer Olympic games, and went through a similar sequence of calculations. He estimated one key parameter as we spoke by measuring the time required to send himself a one-character piece of mail. The rest of his numbers were straight from the proposal and therefore quite precise. His calculations were just as simple as those about the Mississippi River and much more revealing. They showed that, under generous assumptions, the proposed system could work only if there were at least a hundred and twenty seconds in each minute. He had sent the design back to the drawing board the previous day. (The conversation took place about a year before the event, and the final system was used during the Olympics without a hitch.)

That was Bob Martin's wonderful (if eccentric) way of introducing the engineering technique of ``back-of-the-envelope'' calculations.
The idea is standard fare in engineering schools and is bread and butter for most practicing engineers. Unfortunately, it is too often neglected in computing.

The Rest of the Column

The story at the top of this page begins Column 7 of Programming Pearls. These are the remaining sections in the column.

7.1 Basic Skills
7.2 Performance Estimates
7.3 Safety Factors
7.4 Little's Law
7.5 Principles
7.6 Problems
7.7 Further Reading
7.8 Quick Calculations in Everyday Life

Related Content The Solutions to Column 7 give answers for some of the Problems. Appendix 2 is an estimation quiz to help you test your guessing skills. Appendix 3 describes programs that generate cost models for the space (spacemod.cpp) and time (timemod.c) used by C and C++ programs. The index has an entry for the back of the envelope. The teaching material contains overhead transparencies based on Sections 7.1, 7.5 and 7.6; the slides are available in both Postscript and Acrobat. The web references describe several web sites devoted to this topic. An excerpt from this column appears in the September/October 1999 issue of IEEE Software (pages 61-65). It consists of the lead story, much of Section 7.1, much of Section 7.2, Section 7.5 and the Estimation Quiz. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 28 Feb 2000

First-Year Instruction in Programming Pearls

Owen Astrachan organized a Workshop on First Year Instruction sponsored by the National Science Foundation and the Duke University Computer Science Department in July, 2000. When he invited me to give the keynote, my first response was that I haven't taught a freshman course for a quarter of a century. On the other hand, I've enjoyed talking to undergraduates over the years, and Programming Pearls has been widely used as a supplementary text in college courses. My interest in the topic was also increased because my son was heading off to college later that summer. With Owen's guidance and encouragement, I agreed to talk on ``Truth, Beauty, and Engineering: What I'd Like My CS Freshman Kid to Learn Next Year''. (Let me clarify as a proud parent that Computer Science was only one of several possible majors that he was considering as he departed; I have high hopes that he can avoid repeating the sins of his father.)

The talk is available as a Powerpoint Show. It has this rough outline:

Engineering
  Tricks of the Trade
  Eyes Open to the World
Beauty
  Elegance
  Communication
Truth
  History
  Ethics

Several of these topics are covered in the book.

Tricks of the Trade

This topic has its own web page, complete with a talk.

Elegance

Elegance is the moral of Column 1, and a theme through the rest of the book, complete with its own entry in the index. For a more general view of the topic, see the Computerworld article on Elegance in software. I had the luxury of striving for elegance in the source code for the book. The scaffolding is often (appropriately) rough, but I took some pains to trim excess fat from most of the code that found its way into the text.

History

The book is peppered with little historical anecdotes that illustrate timeless truths. Section 10.1 introduces the topic of ``squeezing space'' with this story from the dawn of computing:

``Fred Brooks observed the power of simplification when he wrote a payroll program for a national company in the mid 1950's. The bottleneck of the program was the representation of the Kentucky state income tax. The tax was specified in the law by a two-dimensional table with income as one dimension, and number of exemptions as the other. Storing the table explicitly required thousands of words of memory, more than the capacity of the machine.

``The first approach Brooks tried was to fit a mathematical function through the tax table, but it was so jagged that no simple function would come close. Knowing that it was made by legislators with no predisposition to crazy mathematical functions, Brooks consulted the minutes of the Kentucky legislature to see what arguments had led to the bizarre table. He found that the Kentucky tax was a simple function of the income that remained after federal tax was deducted. His program therefore calculated federal tax from existing tables, and then used the remaining income and a table of just a few dozen words of memory to find the Kentucky tax.

``By studying the context in which the problem arose, Brooks was able to replace the original problem to be solved with a simpler problem. While the original problem appeared to require thousands of words of data space, the modified problem was solved with a negligible amount of memory.''

I tried to sprinkle such historical nuggets throughout the book. From the 1960's, Sections 2.4 and 2.8 describe a program for computing anagrams, and Section 9.3 describes a svelte binary search. From the 1970's, Problem 2.6 describes a touch-tone phone directory, Column 8 describes the evolution of an algorithm, Section 10.3 describes storing symmetric matrices, and Section 15.2 describes suffix arrays.
Copyright © 2000 Lucent Technologies. All rights reserved. Mon 3 Jul 2000

Code from Programming Pearls

● Column 1: Programs for sorting integers
  bitsort.c -- Sort with bit vectors.
  sortints.cpp -- Sort using C++ STL sets.
  qsortints.c -- Sort with C library qsort.
  bitsortgen.c -- Generate random integers for sorting.
● Column 2: Test and time algorithms
  rotate.c -- Three ways to rotate the elements of a vector.
  The next two programs are used in a pipeline to compute all anagrams in a dictionary:
  sign.c -- Sign each word by its letters in sorted order.
  squash.c -- Put each anagram class on a single line.
● Column 5: Scaffolding for testing and timing search functions
  search.c -- Linear and binary search.
● Column 7: Tiny experiment on C run times
  timemod0.c -- Edit main to time one operation.
● Column 8: Compute the maximum-sum subsequence in an array
  maxsum.c -- Time four algs: n^3, n^2, n log n, n.
● Column 9: Code tuning programs
  genbins.c -- Profile this, then try a special-purpose allocator.
  macfun.c -- Time the cost of macros and functions.
  The column also uses rotate.c (Column 2), search.c (Column 5) and maxsum.c (Column 8).
● Column 11: Test and time sorting algorithms
  sort.cpp -- Mostly C, but also C++ sort function.
  SortAnim.java -- Animate those sort functions in Java.
● Column 12: Generate a sorted list of random integers
  sortedrand.cpp -- Several algorithms for the task.
● Column 13: Set representations for the problem in Column 12
  sets.cpp -- Several data structures for sets.
  genbins.c (Column 9) implements the bin data structure in C.
● Column 14: Heaps
  priqueue.cpp -- Implement and test priority queues.
  The column also uses sort.cpp (Column 11) for heapsort.
● Column 15: Strings
  wordlist.cpp -- List words in the file, using STL set.
  wordfreq.cpp -- List words in the file, with counts, using STL map.
  wordfreq.c -- Same as above, with hash table in C.
  longdup.c -- Find long repeated strings in input.
  markov.c -- Generate random text from input.
  markovhash.c -- Like markov.c, but with hashing.
  markovlet.c -- Letter-level markov text, simple algorithm.
● Appendix 3: Cost Models
  spacemod.cpp -- Space used by various records.
  timemod.c -- Table of times used by various C constructs.

You may use this code for any purpose, as long as you leave the copyright notice and book citation attached. Copyright © 1999 Lucent Technologies. All rights reserved. Sat 31 Jul 1999

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* bitsort.c -- bitmap sort from Column 1
 *   Sort distinct integers in the range [0..N-1]
 */

#include <stdio.h>

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i)  {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i)  {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i) { return a[i>>SHIFT] &   (1<<(i & MASK)); }

int main()
{   int i;
    for (i = 0; i < N; i++)
        clr(i);
    while (scanf("%d", &i) != EOF)
        set(i);
    for (i = 0; i < N; i++)
        if (test(i))
            printf("%d\n", i);
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sortints.cpp -- Sort integers using C++ STL sets */

#include <iostream>
#include <set>
using namespace std;

int main()
{   set<int> S;
    int i;
    set<int>::iterator j;
    while (cin >> i)
        S.insert(i);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* qsortints.c -- Sort integers with C library qsort */

#include <stdio.h>
#include <stdlib.h>

int intcomp(int *x, int *y)
{   return *x - *y;
}

int a[1000000];

int main()
{   int i, n = 0;
    while (scanf("%d", &a[n]) != EOF)
        n++;
    qsort(a, n, sizeof(int), intcomp);
    for (i = 0; i < n; i++)
        printf("%d\n", a[i]);
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* bitsortgen.c -- gen $1 distinct integers from U[0,$2) */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 2000000
int x[MAXN];

int randint(int a, int b)
{   return a + (RAND_MAX*rand() + rand()) % (b + 1 - a);
}

int main(int argc, char *argv[])
{   int i, k, n, t, p;
    srand((unsigned) time(NULL));
    k = atoi(argv[1]);
    n = atoi(argv[2]);
    for (i = 0; i < n; i++)
        x[i] = i;
    for (i = 0; i < k; i++) {
        p = randint(i, n-1);
        t = x[p]; x[p] = x[i]; x[i] = t;
        printf("%d\n", x[i]);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* rotate.c -- time algorithms for rotating a vector
     Input lines: algnum numtests n rotdist
     algnum:
       1: reversal algorithm
       2: juggling algorithm
      22: juggling algorithm with mod rather than if
       3: gcd algorithm
       4: slide (don't rotate): baseline alg for timing
     To test the algorithms, recompile and change main to call testrot
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 10000000

int x[MAXN];
int rotdist, n;

/* Alg 1: Rotate by reversal */

void reverse(int i, int j)
{   int t;
    while (i < j) {
        t = x[i]; x[i] = x[j]; x[j] = t;
        i++;
        j--;
    }
}

void revrot(int rotdist, int n)
{   reverse(0, rotdist-1);
    reverse(rotdist, n-1);
    reverse(0, n-1);
}

/* Alg 2: Juggling (dolphin) rotation */

int gcd(int i, int j)
{   int t;
    while (i != 0) {
        if (j >= i)
            j -= i;
        else {
            t = i; i = j; j = t;
        }
    }
    return j;
}

void jugglerot(int rotdist, int n)
{   int cycles, i, j, k, t;
    cycles = gcd(rotdist, n);
    for (i = 0; i < cycles; i++) {
        /* move i-th values of blocks */
        t = x[i];
        j = i;
        for (;;) {
            k = j + rotdist;
            if (k >= n)
                k -= n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}

void jugglerot2(int rotdist, int n)
{   int cycles, i, j, k, t;
    cycles = gcd(rotdist, n);
    for (i = 0; i < cycles; i++) {
        /* move i-th values of blocks */
        t = x[i];
        j = i;
        for (;;) {
            /* Replace with mod below
            k = j + rotdist;
            if (k >= n)
                k -= n;
            */
            k = (j + rotdist) % n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}

/* Alg 3: Recursive rotate (using gcd structure) */

void swap(int i, int j, int k) /* swap x[i..i+k-1] with x[j..j+k-1] */
{   int t;
    while (k-- > 0) {
        t = x[i]; x[i] = x[j]; x[j] = t;
        i++;
        j++;
    }
}

void gcdrot(int rotdist, int n)
{   int i, j, p;
    if (rotdist == 0 || rotdist == n)
        return;
    i = p = rotdist;
    j = n - p;
    while (i != j) {
        /* invariant:
           x[0  ..p-i  ] is in final position
           x[p-i..p-1  ] = a (to be swapped with b)
           x[p  ..p+j-1] = b (to be swapped with a)
           x[p+j..n-1  ] in final position
        */
        if (i > j) {
            swap(p-i, p, j);
            i -= j;
        } else {
            swap(p-i, p+j-i, i);
            j -= i;
        }
    }
    swap(p-i, p, i);
}

int isogcd(int i, int j)
{   if (i == 0) return j;
    if (j == 0) return i;
    while (i != j) {
        if (i > j)
            i -= j;
        else
            j -= i;
    }
    return i;
}

void testgcd()
{   int i, j;
    while (scanf("%d %d", &i, &j) != EOF)
        printf("%d\n", isogcd(i, j));
}

/* Test all algs */

void slide(int rotdist, int n)
/* Benchmark: slide left rotdist (lose 0..rotdist-1) */
{   int i;
    for (i = rotdist; i < n; i++)
        x[i-rotdist] = x[i];
}

void initx()
{   int i;
    for (i = 0; i < n; i++)
        x[i] = i;
}

void printx()
{   int i;
    for (i = 0; i < n; i++)
        printf(" %d", x[i]);
    printf("\n");
}

void roterror()
{   fprintf(stderr, " rotate bug %d %d\n", n, rotdist);
    printx();
    exit(1);
}

void checkrot()
{   int i;
    for (i = 0; i < n-rotdist; i++)
        if (x[i] != i+rotdist)
            roterror();
    for (i = 0; i < rotdist; i++)
        if (x[n-rotdist+i] != i)
            roterror();
}

void testrot()
{   for (n = 1; n <= 20; n++)
        for (rotdist = 0; rotdist <= n; rotdist++) {
            initx(); revrot(rotdist, n);     checkrot();
            initx(); jugglerot(rotdist, n);  checkrot();
            initx(); jugglerot2(rotdist, n); checkrot();
            initx(); gcdrot(rotdist, n);     checkrot();
        }
}

/* Timing */

void timedriver()
{   int i, algnum, numtests, start, clicks;
    while (scanf("%d %d %d %d", &algnum, &numtests, &n, &rotdist) != EOF) {
        initx();
        start = clock();
        for (i = 0; i < numtests; i++) {
            if (algnum == 1)
                revrot(rotdist, n);
            else if (algnum == 2)
                jugglerot(rotdist, n);
            else if (algnum == 22)
                jugglerot2(rotdist, n);
            else if (algnum == 3)
                gcdrot(rotdist, n);
            else if (algnum == 4)
                slide(rotdist, n);
        }
        clicks = clock() - start;
        printf("%d\t%d\t%d\t%d\t%d\t%g\n",
            algnum, numtests, n, rotdist, clicks,
            1e9*clicks / ((float) CLOCKS_PER_SEC * n * numtests));
    }
}

/* Main */

int main()
{   /* testrot(); */
    timedriver();
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sign.c -- sign each word by its letters in sorted order */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORDMAX 100

int charcomp(char *x, char *y)
{   return *x - *y;
}

int main()
{   char word[WORDMAX], sig[WORDMAX];
    while (scanf("%s", word) != EOF) {
        strcpy(sig, word);
        qsort(sig, strlen(sig), sizeof(char), charcomp);
        printf("%s %s\n", sig, word);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* squash.c -- print anagram classes on a single line
     The input lines "opst pots" and "opst stop" go to "pots stop"
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORDMAX 100

int main()
{   char word[WORDMAX], sig[WORDMAX], oldsig[WORDMAX];
    int linenum = 0;
    strcpy(oldsig, "");
    while (scanf("%s %s", sig, word) != EOF) {
        if (strcmp(oldsig, sig) != 0 && linenum > 0)
            printf("\n");
        strcpy(oldsig, sig);
        linenum++;
        printf("%s ", word);
    }
    printf("\n");
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* search.c -- test and time binary and sequential search
     Select one of three modes by editing main() below.
       1.) Probe one function
       2.) Test one function extensively
       3.) Time all functions
     Input lines:  algnum n numtests
     Output lines: algnum n numtests clicks nanosecs_per_elem
     See timedriver for algnum codes
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 1000000
typedef int DataType;
DataType x[MAXN];
int n;

/* Scaffolding */

int i = -999999;
#define assert(v) { if ((v) == 0) printf(" binarysearch bug %d %d\n", i, n); }

/* Alg 1: From Programming Pearls, Column 4: raw transliteration */

int binarysearch1(DataType t)
{   int l, u, m;
    l = 0;
    u = n-1;
    for (;;) {
        if (l > u)
            return -1;
        m = (l + u) / 2;
        if (x[m] < t)
            l = m+1;
        else if (x[m] == t)
            return m;
        else /* x[m] > t */
            u = m-1;
    }
}

/* Alg 2: Make binarysearch1 more c-ish */

int binarysearch2(DataType t)
{   int l, u, m;
    l = 0;
    u = n-1;
    while (l <= u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m+1;
        else if (x[m] == t)
            return m;
        else /* x[m] > t */
            u = m-1;
    }
    return -1;
}

/* Alg 3: From PP, Col 8 */

int binarysearch3(DataType t)
{   int l, u, m;
    l = -1;
    u = n;
    while (l+1 != u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m;
        else
            u = m;
    }
    if (u >= n || x[u] != t)
        return -1;
    return u;
}

/* Alg 4: From PP, Col 9 */

int binarysearch4(DataType t)
{   int l, p;
    if (n != 1000)
        return binarysearch3(t);
    l = -1;
    if (x[511]   < t) l = 1000 - 512;
    if (x[l+256] < t) l += 256;
    if (x[l+128] < t) l += 128;
    if (x[l+64 ] < t) l += 64;
    if (x[l+32 ] < t) l += 32;
    if (x[l+16 ] < t) l += 16;
    if (x[l+8  ] < t) l += 8;
    if (x[l+4  ] < t) l += 4;
    if (x[l+2  ] < t) l += 2;
    if (x[l+1  ] < t) l += 1;
    p = l+1;
    if (p >= n || x[p] != t)
        return -1;
    return p;
}

/* Alg 9: Buggy, from Programming Pearls, Column 5 */

int sorted()
{   int i;
    for (i = 0; i < n-1; i++)
        if (x[i] > x[i+1])
            return 0;
    return 1;
}

int binarysearch9(DataType t)
{   int l, u, m;
    /* int oldsize, size = n+1; */
    l = 0;
    u = n-1;
    while (l <= u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m;
        else if (x[m] > t)
            u = m;
        else {
            /* assert(x[m] == t); */
            return m;
        }
    }
    /* assert(x[l] > t && x[u] < t); */
    return -1;
}

/* Alg 21: Simple sequential search */

int seqsearch1(DataType t)
{   int i;
    for (i = 0; i < n; i++)
        if (x[i] == t)
            return i;
    return -1;
}

/* Alg 22: Faster sequential search: Sentinel */

int seqsearch2(DataType t)
{   int i;
    DataType hold = x[n];
    x[n] = t;
    for (i = 0; ; i++)
        if (x[i] == t)
            break;
    x[n] = hold;
    if (i == n)
        return -1;
    else
        return i;
}

/* Alg 23: Faster sequential search: loop unrolling */

int seqsearch3(DataType t)
{   int i;
    DataType hold = x[n];
    x[n] = t;
    for (i = 0; ; i += 8) {
        if (x[i]   == t) {         break; }
        if (x[i+1] == t) { i += 1; break; }
        if (x[i+2] == t) { i += 2; break; }
        if (x[i+3] == t) { i += 3; break; }
        if (x[i+4] == t) { i += 4; break; }
        if (x[i+5] == t) { i += 5; break; }
        if (x[i+6] == t) { i += 6; break; }
        if (x[i+7] == t) { i += 7; break; }
    }
    x[n] = hold;
    if (i == n)
        return -1;
    else
        return i;
}

/* Scaffolding to probe one algorithm */

void probe1()
{   int i;
    DataType t;
    while (scanf("%d %d", &n, &t) != EOF) {
        for (i = 0; i < n; i++)
            x[i] = 10*i;
        printf(" %d\n", binarysearch9(t));
    }
}

/* Torture test one algorithm */

#define s seqsearch3

void test(int maxn)
{   int i;
    for (n = 0; n <= maxn; n++) {
        printf("n=%d\n", n);
        /* distinct elements (plus one at top) */
        for (i = 0; i <= n; i++)
            x[i] = 10*i;
        for (i = 0; i < n; i++) {
            assert(s(10*i) == i);
            assert(s(10*i - 5) == -1);
        }
        assert(s(10*n - 5) == -1);
        assert(s(10*n) == -1);
    }
}

/* Timing */

void timedriver()
{   int i, algnum, numtests, testnum, start, clicks;
    while (scanf("%d %d %d", &algnum, &n, &numtests) != EOF) {
        for (i = 0; i < n; i++)
            x[i] = i;
        start = clock();
        for (testnum = 0; testnum < numtests; testnum++)
            for (i = 0; i < n; i++) {
                switch (algnum) {
                case  1: assert(binarysearch1(i) == i); break;
                case  2: assert(binarysearch2(i) == i); break;
                case  3: assert(binarysearch3(i) == i); break;
                case  4: assert(binarysearch4(i) == i); break;
                case  9: assert(binarysearch9(i) == i); break;
                case 21: assert(seqsearch1(i) == i); break;
                case 22: assert(seqsearch2(i) == i); break;
                case 23: assert(seqsearch3(i) == i); break;
                }
            }
        clicks = clock() - start;
        printf("%d\t%d\t%d\t%d\t%g\n",
            algnum, n, numtests, clicks,
            1e9*clicks / ((float) CLOCKS_PER_SEC * n * numtests));
    }
}

/* Main */

int main()
{   /* probe1(); */
    /* test(25); */
    timedriver();
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* timemod0.c -- edit main to time one operation */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{   int i, n, ia, ib, ic;
    float fa, fb, fc;
    n = 1000000000;  /* run time in secs gives nanosecs/loop */
    ia = ib = ic = 9;
    fa = fb = 9.0;
    for (i = 0; i < n; i++) {
        /* null loop          19.1 */
        /* ia = ib + ic;      17.7 */
        /* ia = ib - ic;      17.6 */
        /* ia = ib * ic;      17.7 */
        /* ia = ib % ic;      98.3 */
        /* ia = ib / ic;      98.3 */
        /* ia = rand();       41.5 */
        /* fa = sqrt(fb);     184  */
        /* free(malloc(8));   2400 */
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* maxsum.c -- time algs for maximum-sum subsequence
 *  Input: algnum, n
 *  Output: algnum, n, answer, ticks, secs
 *  See main for algnum codes
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 10000000
int n;
float x[MAXN];

void sprinkle() /* Fill x[n] with reals uniform on [-1,1] */
{   int i;
    for (i = 0; i < n; i++)
        x[i] = 1 - 2*( (float) rand()/RAND_MAX);
}

float alg1()
{   int i, j, k;
    float sum, maxsofar = 0;
    for (i = 0; i < n; i++)
        for (j = i; j < n; j++) {
            sum = 0;
            for (k = i; k <= j; k++)
                sum += x[k];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    return maxsofar;
}

float alg2()
{   int i, j;
    float sum, maxsofar = 0;
    for (i = 0; i < n; i++) {
        sum = 0;
        for (j = i; j < n; j++) {
            sum += x[j];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    }
    return maxsofar;
}

float cumvec[MAXN+1];

float alg2b()
{   int i, j;
    float *cumarr, sum, maxsofar = 0;
    cumarr = cumvec+1; /* to access cumarr[-1] */
    cumarr[-1] = 0;
    for (i = 0; i < n; i++)
        cumarr[i] = cumarr[i-1] + x[i];
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {
            sum = cumarr[j] - cumarr[i-1];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    }
    return maxsofar;
}

/* MS VC++ has a max macro, and therefore a perf bug */
#ifdef max
#undef max
#endif

#define maxmac(a, b) ((a) > (b) ? (a) : (b))

float maxfun(float a, float b)
{   return a > b ? a : b;
}

#define max(a, b) maxfun(a, b)

float recmax(int l, int u)
{   int i, m;
    float lmax, rmax, sum;
    if (l > u)  /* zero elements */
        return 0;
    if (l == u) /* one element */
        return max(0, x[l]);
    m = (l+u) / 2;
    /* find max crossing to left */
    lmax = sum = 0;
    for (i = m; i >= l; i--) {
        sum += x[i];
        if (sum > lmax)
            lmax = sum;
    }
    /* find max crossing to right */
    rmax = sum = 0;
    for (i = m+1; i <= u; i++) {
        sum += x[i];
        if (sum > rmax)
            rmax = sum;
    }
    return max(lmax + rmax, max(recmax(l, m), recmax(m+1, u)));
}

float alg3()
{   return recmax(0, n-1);
}

float alg4()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        if (maxendinghere < 0)
            maxendinghere = 0;
        if (maxsofar < maxendinghere)
            maxsofar = maxendinghere;
    }
    return maxsofar;
}

float alg4b()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        maxendinghere = maxmac(maxendinghere, 0);
        maxsofar = maxmac(maxsofar, maxendinghere);
    }
    return maxsofar;
}

float alg4c()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        maxendinghere = maxfun(maxendinghere, 0);
        maxsofar = maxfun(maxsofar, maxendinghere);
    }
    return maxsofar;
}

int main()
{   int algnum, start, clicks;
    float thisans;
    while (scanf("%d %d", &algnum, &n) != EOF) {
        sprinkle();
        start = clock();
        thisans = -1;
        switch (algnum) {
        case  1: thisans = alg1();  break;
        case  2: thisans = alg2();  break;
        case 22: thisans = alg2b(); break;
        case  3: thisans = alg3();  break;
        case  4: thisans = alg4();  break;
        case 42: thisans = alg4b(); break;
        case 43: thisans = alg4c(); break;
        default: break;
        }
        clicks = clock()-start;
        printf("%d\t%d\t%f\t%d\t%f\n",
            algnum, n, thisans, clicks,
            clicks / (float) CLOCKS_PER_SEC);
        if (alg4() != thisans)
            printf(" maxsum error: mismatch with alg4: %f\n", alg4());
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* genbins.c -- generate random numbers with bins
     If NODESIZE is 8, this program uses the special-case malloc.
     Change NODESIZE to 0 to use the system malloc.
 */

#include <stdio.h>
#include <stdlib.h>

#define NODESIZE 8
#define NODEGROUP 1000
int nodesleft = 0;
char *freenode;

void *pmalloc(int size)
{   void *p;
    if (size != NODESIZE)
        return malloc(size);
    if (nodesleft == 0) {
        freenode = malloc(NODEGROUP*NODESIZE);
        nodesleft = NODEGROUP;
    }
    nodesleft--;
    p = (void *) freenode;
    freenode += NODESIZE;
    return p;
}

struct node {
    int val;
    struct node *next;
};

struct node **bin, *sentinel;
int bins, bincnt, maxval;

void initbins(int maxelms, int pmaxval)
{   int i;
    bins = maxelms;
    maxval = pmaxval;
    bin = pmalloc(bins*sizeof(struct node *));
    sentinel = pmalloc(sizeof(struct node));
    sentinel->val = maxval;
    for (i = 0; i < bins; i++)
        bin[i] = sentinel;
    bincnt = 0;
}

struct node *rinsert(struct node *p, int t)
{   if (p->val < t) {
        p->next = rinsert(p->next, t);
    } else if (p->val > t) {
        struct node *q = pmalloc(sizeof(struct node));
        q->val = t;
        q->next = p;
        p = q;
        bincnt++;
    }
    return p;
}

void insert(int t)
{   int i;
    i = t / (1 + maxval/bins);
    bin[i] = rinsert(bin[i], t);
}

void report()
{   int i, j = 0;
    struct node *p;
    for (i = 0; i < bins; i++)
        for (p = bin[i]; p != sentinel; p = p->next)
            /* printf("%d\n", p->val) */ ; /* Uncomment for testing, comment for profiling */
}

int bigrand()
{   return RAND_MAX*rand() + rand();
}

int main(int argc, char *argv[])
{   int m = atoi(argv[1]);
    int n = atoi(argv[2]);
    initbins(m, n);
    while (bincnt < m)
        insert(bigrand() % n);
    report();
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* macfun.c -- time macro and function implementations of max
 *  Input: a sequence of (alg num, n) pairs.
 *  Output: for each test, (alg num, n, ans, ticks, secs)
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 1000000
float x[MAXN];

/* arrmax1 -- max is a macro */

#define max1(a, b) ((a) > (b) ? (a) : (b))

float arrmax1(int n)
{   if (n == 1)
        return x[0];
    else
        return max1(x[n-1], arrmax1(n-1));
}

/* arrmax2 -- max is a function */

float max2(float a, float b)
{   return a > b ? a : b;
}

float arrmax2(int n)
{   if (n == 1)
        return x[0];
    else
        return max2(x[n-1], arrmax2(n-1));
}

/* arrmax3 -- MS VC++ stdlib defines max as a macro */

#ifndef max
#define max(a, b) max2(a, b)
#endif

float arrmax3(int n)
{   if (n == 1)
        return x[0];
    else
        return max(x[n-1], arrmax3(n-1));
}

int main()
{   int algnum, i, n, start, clicks;
    float thisans;
    for (i = 0; i < MAXN; i++)
        x[i] = MAXN-i;
    while (scanf("%d %d", &algnum, &n) != EOF) {
        start = clock();
        thisans = -1;
        switch (algnum) {
        case 1: thisans = arrmax1(n); break;
        case 2: thisans = arrmax2(n); break;
        case 3: thisans = arrmax3(n); break;
        default: break;
        }
        clicks = clock()-start;
        printf("%d\t%d\t%g\t%d\t%g\n",
            algnum, n, thisans, clicks,
            clicks / (float) CLOCKS_PER_SEC);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sort.cpp -- test and time sorting algorithms
     Input lines:  algnum n mod
     Output lines: algnum n mod clicks nanosecs_per_elem
     This is predominantly a C program; the only use of C++
     is the sort function immediately below the include lines.
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// To change from C++ back to C, remove the following two lines
// and the call to sort in main
#include <algorithm>
using namespace std;

/* Data and supporting functions */

#define MAXN 10000000
typedef int DType;
DType realx[MAXN];
int *x = realx; /* allow x to shift for heaps */
int n;

void swap(int i, int j)
{   DType t = x[i];
    x[i] = x[j];
    x[j] = t;
}

int randint(int l, int u)
{   return l + (RAND_MAX*rand() + rand()) % (u-l+1);
}

/* LIBRARY QSORT */

int intcomp(int *x, int *y)
{   return *x - *y;
}

/* INSERTION SORTS */

/* Simplest insertion sort */
void isort1()
{   int i, j;
    for (i = 1; i < n; i++)
        for (j = i; j > 0 && x[j-1] > x[j]; j--)
            swap(j-1, j);
}

/* Write swap function inline */
void isort2()
{   int i, j;
    DType t;
    for (i = 1; i < n; i++)
        for (j = i; j > 0 && x[j-1] > x[j]; j--) {
            t = x[j];
            x[j] = x[j-1];
            x[j-1] = t;
        }
}

/* Move assignments to and from t out of loop */
void isort3()
{   int i, j;
    DType t;
    for (i = 1; i < n; i++) {
        t = x[i];
        for (j = i; j > 0 && x[j-1] > t; j--)
            x[j] = x[j-1];
        x[j] = t;
    }
}

/* QUICKSORTS */

/* Simplest version, Lomuto partitioning */
void qsort1(int l, int u)
{   int i, m;
    if (l >= u)
        return;
    m = l;
    for (i = l+1; i <= u; i++)
        if (x[i] < x[l])
            swap(++m, i);
    swap(l, m);
    qsort1(l, m-1);
    qsort1(m+1, u);
}


Addison-Wesley Computer & Engineering Publishing Group Programming Pearls at Addison-Wesley Bookstores: Amazon.com, Barnes & Noble, Borders.com, Fatbrain.com, Quantum Books.

Copyright © 1999 Lucent Technologies. All rights reserved. Thu 19 Oct 2000

What's New on the Programming Pearls Web Site November 2000 Column 15 is now on the site, complete with a new program for letter-level Markov text, and new examples of word frequencies, long repeated strings, and letter-level and wordlevel Markov text.

October 2000 The rules for code tuning from my 1982 book Writing Efficient Programs are now online, and so is a Powerpoint Show on Cache-Conscious Algorithms and Data Structures.

August 2000 The errata just keeps on growing. If you see errors, please send them in.

July 2000 Programming Pearls is often used for teaching undergraduates. This page describes how some of the topics in the book can be incorporated into college classrooms.

March 2000 A theme running through the book concerns the Tricks of the Trade. This page describes that topic and contains a Powerpoint Show on the subject. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 6 Nov 2000

Strings of Pearls (Column 15 of Programming Pearls) We are surrounded by strings. Strings of bits make integers and floating-point numbers. Strings of digits make telephone numbers, and strings of characters make words. Long strings of characters make web pages, and longer strings yet make books. Extremely long strings represented by the letters A, C, G and T are in geneticists' databases and deep inside the cells of many readers of this book. Programs perform a dazzling variety of operations on such strings. They sort them, count them, search them, and analyze them to discern patterns. This column introduces those topics by examining a few classic problems on strings.

The Rest of the Column These are the remaining sections in the column. 15.1 Words 15.2 Phrases 15.3 Generating Text 15.4 Principles 15.5 Problems 15.6 Further Reading

Related Content

The teaching material contains overhead transparencies based on Sections 15.2 and 15.3; the slides are available in both Postscript and Acrobat. The code for Column 15 contains implementations of the algorithms. The Solutions to Column 15 give answers for some of the Problems.

This column concentrates on programming techniques, and uses those techniques to build several programs for processing large text files. A few examples of the programs' output in the text illustrate the structure of English documents. This web site contains some additional fun examples, which may give further insight into the structure of the English language. Section 15.1 counts the words in one document; here are more examples of word frequency counts. Section 15.2 searches for large portions of duplicated text; here are more examples of long repeated strings. Section 15.3 describes randomly generated Markov text; these pages contain additional examples of Markov text, generated at the letter level and word level.

The web references describe several web sites devoted to related topics. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Words (Section 15.1 of Programming Pearls)

Our first problem is to produce a list of the words contained in a document. (Feed such a program a few hundred books, and you have a fine start at a word list for a dictionary.) But what exactly is a word? We'll use the trivial definition of a sequence of characters surrounded by white space, but this means that web pages will contain many ``words'' like ``'', ``'' and `` ''. Problem 1 asks how you might avoid such problems.

Our first C++ program uses the sets and strings of the Standard Template Library, in a slight modification of the program in Solution 1.1:

int main(void)
{   set<string> S;
    set<string>::iterator j;
    string t;
    while (cin >> t)
        S.insert(t);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

Our next program uses an STL map to associate a count with each distinct word:

int main(void)
{   map<string, int> M;
    map<string, int>::iterator j;
    string t;
    while (cin >> t)
        M[t]++;
    for (j = M.begin(); j != M.end(); ++j)
        cout << j->first << " " << j->second << "\n";
    return 0;
}

The map program's complete run time was 7.6 seconds: 2.4 seconds to read the input, 4.9 seconds for the insertions, and 0.3 seconds to write the output. To speed it up, we build our own hash table in C. The main function initializes the table, reads the words, and then reports the counts:

for i = [0, NHASH)
    bin[i] = NULL
while scanf("%s", buf) != EOF
    incword(buf)
for i = [0, NHASH)
    for (p = bin[i]; p != NULL; p = p->next)
        print p->word, p->count
return 0

The work is done by incword, which increments the count associated with the input word (and initializes it if it is not already there):

void incword(char *s)
    h = hash(s)
    for (p = bin[h]; p != NULL; p = p->next)
        if strcmp(s, p->word) == 0
            (p->count)++
            return
    p = malloc(sizeof(hashnode))
    p->count = 1
    p->word = malloc(strlen(s)+1)
    strcpy(p->word, s)
    p->next = bin[h]
    bin[h] = p

The for loop looks at every node with the same hash value. If the word is found, its count is incremented and the function returns. If the word is not found, the function makes a new node, allocates space and copies the string (experienced C programmers would use strdup for the task), and inserts the node at the front of the list.

This C program takes about 2.4 seconds to read its input (the same as the C++ version), but only 0.5 seconds for the insertions (down from 4.9) and only 0.06 seconds to write the output (down from 0.3). The complete run time is

3.0 seconds (down from 7.6), and the processing time is 0.55 seconds (down from 5.2). Our custom-made hash table (in 30 lines of C) is an order of magnitude faster than the maps from the C++ Standard Template Library. This little exercise illustrates the two main ways to represent sets of words. Balanced search trees operate on strings as indivisible objects; these structures are used in most implementations of the STL's sets and maps. They always keep the elements in sorted order, so they can efficiently perform operations such as finding a predecessor or reporting the elements in order. Hashing, on the other hand, peeks inside the characters to compute a hash function, and then scatters keys across a big table. It is very fast on the average, but it does not offer the worst-case performance guarantees of balanced trees, or support other operations involving order. Next: Section 15.2. Phrases. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Problems (Section 15.5 of Programming Pearls)

The Solutions to Column 15 give answers for some of these problems.

1. Throughout this column we have used the simple definition that words are separated by white space. Many real documents, such as those in HTML or RTF, contain formatting commands. How could you deal with such commands? Are there any other tasks that you might need to perform?

2. On a machine with ample main memory, how could you use the C++ STL sets or maps to solve the searching problem in Section 13.8? How much memory does it consume compared to McIlroy's structure?

3. How much speedup can you achieve by incorporating the special-purpose malloc of Solution 9.2 into the hashing program of Section 15.1?

4. When a hash table is large and the hash function scatters the data well, almost every list in the table has only a few elements. If either of these conditions is violated, though, the time spent searching down the list can be substantial. When a new string is not found in the hash table in Section 15.1, it is placed at the front of the list. To simulate hashing trouble, set NHASH to 1 and experiment with this and other list strategies, such as appending to the end of the list, or moving the most recently found element to the front of the list.

5. When we looked at the output of the word frequency programs in Section 15.1, it was most interesting to print the words in decreasing frequency. How would you modify the C and C++ programs to accomplish this task? How would you print only the M most common words, where M is a constant such as 10 or 1000?

6. Given a new input string, how would you search a suffix array to find the longest match in the stored text? How would you build a GUI interface for this task?

7. Our program for finding duplicated strings was very fast for ``typical'' inputs, but it can be very slow (greater than quadratic) for some inputs. Time the program on such an input. Do such inputs ever arise in practice?

8. How would you modify the program for finding duplicated strings to find the longest string that occurs more than M times?

9. Given two input texts, find the longest string that occurs in both.

10. Show how to reduce the number of pointers in the duplication program by pointing only to suffixes that start on word boundaries. What effect does this have on the output produced by the program?

11. Implement a program to generate letter-level Markov text.

12. How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov) random text?

13. The program for generating word-level Markov text is at this book's web site. Try it on some of your documents.

14. How could you use hashing to speed up the Markov program?

15. Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text; implement that algorithm in a program. It gives a good approximation to the Markov frequencies, but not the exact form; explain why not. Implement a program that scans the entire string from scratch to generate each word (and thereby uses the true frequencies).

16. How would you use the techniques of this column to assemble a word list for a dictionary (the problem that Doug McIlroy faced in Section 13.8)? How would you build a spelling checker without using a dictionary? How would you build a grammar checker without using rules of grammar?

17. Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and data compression.

Solutions (To Column 1 of Programming Pearls)

1. This C program uses the Standard Library qsort to sort a file of integers.

int intcomp(int *x, int *y)
{   return *x - *y;
}

int a[1000000];

int main(void)
{   int i, n = 0;
    while (scanf("%d", &a[n]) != EOF)
        n++;
    qsort(a, n, sizeof(int), intcomp);
    for (i = 0; i < n; i++)
        printf("%d\n", a[i]);
    return 0;
}

This C++ program uses the set container from the Standard Template Library for the same job.

int main(void)
{   set<int> S;
    int i;
    set<int>::iterator j;
    while (cin >> i)
        S.insert(i);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

2. These functions use the constants to set, clear and test the value of a bit:

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N/BITSPERWORD];

void set(int i)  {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i)  {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i) { return a[i>>SHIFT] &   (1<<(i & MASK)); }


Phrases (Section 15.2 of Programming Pearls)

The straightforward algorithm compares all pairs of substrings and remembers the longest common prefix seen so far:

maxlen = -1
for i = [0, n)
    for j = (i, n)
        if (thislen = comlen(&c[i], &c[j])) > maxlen
            maxlen = thislen
            maxi = i
            maxj = j

The comlen function returns the length that its two parameter strings have in common, starting with their first characters:

int comlen(char *p, char *q)
    i = 0
    while *p && (*p++ == *q++)
        i++
    return i

Because this algorithm looks at all pairs of substrings, it takes time proportional to n^2, at least. We might be able to speed it up by using a hash table to search for words in the phrases, but we'll instead take an entirely new approach. Our program will process at most MAXN characters, which it stores in the array c:

#define MAXN 5000000
char c[MAXN], *a[MAXN];

We'll use a simple data structure known as a ``suffix array''; the structure has been used at least since the 1970's, though the term was introduced in the 1990's. The structure is an array a of pointers to characters. As we read the input, we initialize a so that each element points to the corresponding character in the input string:

while ((ch = getchar()) != EOF)
    a[n] = &c[n]
    c[n++] = ch
c[n] = 0

The final element of c contains a null character, which terminates all strings. The element a[0] points to the entire string; the next element points to the suffix of the array beginning with the second character, and so on. On the input string ``banana'', the array will represent these suffixes:

a[0]: banana
a[1]: anana
a[2]: nana
a[3]: ana
a[4]: na
a[5]: a

The pointers in the array a together point to every suffix in the string, hence the name ``suffix array''. If a long string occurs twice in the array c, it appears in two different suffixes. We will therefore sort the array to bring together equal suffixes (just as sorting brought together anagrams in Section 2.4). The ``banana'' array sorts to

a[0]: a
a[1]: ana
a[2]: anana
a[3]: banana
a[4]: na
a[5]: nana

We can then scan through this array comparing adjacent elements to find the longest repeated string, which in this case is ``ana''. We'll sort the suffix array with the qsort function:

qsort(a, n, sizeof(char *), pstrcmp)

The pstrcmp comparison function adds one level of indirection to the library strcmp function. This scan through the array uses the comlen function to count the number of letters that two adjacent words have in common:

for i = [0, n-1)
    if comlen(a[i], a[i+1]) > maxlen
        maxlen = comlen(a[i], a[i+1])
        maxi = i
printf("%.*s\n", maxlen, a[maxi])

The printf statement uses the ``*'' precision to print maxlen characters of the string. I ran the resulting program to find the longest repeated string in the 807,503 characters in Samuel Butler's translation of Homer's Iliad. The program took 4.8 seconds to locate this string:

whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the host, and speak fairly to them, man by man, that they draw not their ships into the sea.

The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses. On this and other typical text files of n characters, the algorithm runs in O(n log n) time, due to sorting. [Click for more examples of long repeated strings]

Suffix arrays represent every substring in n characters of input text using the text itself and n additional pointers. Problem 6 investigates how suffix arrays can be used to solve the substring searching problem. We'll turn now to a more subtle application of suffix arrays.

Long Repeated Strings (Illustrating Section 15.2 of Programming Pearls) Section 15.2 describes long repeated strings in text and gives an example. Here are some more examples, generated by this program from several sources.

King James Bible Verse Numbers Included the house of his precious things, the silver, and the gold, and the spices, and the precious ointment, and all the house of his armour, and all that was found in his treasures: there was nothing in his house, nor in all his dominion, that Hezekiah shewed them not. This text is found in 2 Kings 20:13 and in Isaiah 39:2. Each text line in the original file begins with the chapter and verse (i.e., ``GEN 1:1 In the beginning God created ...''). Long repeated strings therefore could not cross verse boundaries; the next experiment deleted those identifiers. Verse Numbers Excluded , offered: His offering was one silver charger, the weight whereof was an hundred and thirty shekels, one silver bowl of seventy shekels, after the shekel of the sanctuary; both of them full of fine flour mingled with oil for a meat offering: One golden spoon of ten shekels, full of incense: One young bullock, one ram, one lamb of the first year, for a burnt offering: One kid of the goats for a sin offering: And for a sacrifice of peace offerings, two oxen, five rams, five he goats, five lambs of the first year: this was the offering of Ahi Numbers 7 describes offerings made over a period of twelve days; much of this string appears twelve times. Longest String That Occurs Twelve Times , full of incense: One young bullock, one ram, one lamb of the first year, for a burnt offering: One kid of the goats for a sin offering: And for a sacrifice of peace offerings, two oxen, five rams, five he goats, five lambs of the first year: this was the offering of This string occurs twelve times in Numbers 7. This string was computed using the method of Solution 15.8.

Programming Pearls

The Entire Text

6, 8, 12, 17-18, 51-55, 59, 62, 70-72, 82, 87-98, 116, 119, 121, 128, 137, 162, 164, 187-189, 206, 210, 214, 221, 230

I was surprised to find that the longest repeated string in the book was in the index. The same sequence of numbers is repeated for the entries for ``experiments'', ``run time'' and ``time, run''.

Index Excluded

expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate. I

This text appears in both the Logic Rules and the Expression Rules of Appendix 4: Rules for Code Tuning.

The Iliad

The Entire Text

whose sake so many of the Achaeans have died at Troy, far from their homes? Go about at once among the host, and speak fairly to them, man by man, that they draw not their ships into the sea.

This example (on Samuel Butler's translation of Homer's Iliad) was used in Section 15.2. The text first occurs when Juno suggests it to Minerva as an argument that might keep the Greeks (Achaeans) from departing from Troy; it occurs shortly thereafter when Minerva repeats the argument verbatim to Ulysses.

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* longdup.c -- Print longest string duplicated M times */

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

int pstrcmp(char **p, char **q)
{   return strcmp(*p, *q);
}

int comlen(char *p, char *q)
{   int i = 0;
    while (*p && (*p++ == *q++))
        i++;
    return i;
}

#define M 1
#define MAXN 5000000
char c[MAXN], *a[MAXN];

int main()
{   int i, ch, n = 0, maxi, maxlen = -1;
    while ((ch = getchar()) != EOF) {
        a[n] = &c[n];
        c[n++] = ch;
    }
    c[n] = 0;
    qsort(a, n, sizeof(char *), pstrcmp);
    for (i = 0; i < n-M; i++)
        if (comlen(a[i], a[i+M]) > maxlen) {
            maxlen = comlen(a[i], a[i+M]);
            maxi = i;
        }
    printf("%.*s\n", maxlen, a[maxi]);
    return 0;
}

Generating Text (Section 15.3 of Programming Pearls) How can you generate random text? A classic approach is to let loose that poor monkey on his aging typewriter. If the beast is equally likely to hit any lower case letter or the space bar, the output might look like this: uzlpcbizdmddk njsdzyyvfgxbgjjgbtsak rqvpgnsbyputvqqdtmgltz ynqotqigexjumqphujcfwn ll jiexpyqzgsdllgcoluphl sefsrvqqytjakmav bfusvirsjl wprwqt This is pretty unconvincing English text. If you count the letters in word games (like Scrabble or Boggle), you will notice that there are different numbers of the various letters. There are many more A's, for instance, than there are Z's. A monkey could produce more convincing text by counting the letters in a document -- if A occurs 300 times in the text while B occurs just 100 times, then the monkey should be 3 times more likely to type an A than a B. This takes us a small step closer to English: saade ve mw hc n entt da k eethetocusosselalwo gx fgrsnoh,tvettaf aetnlbilo fc lhd okleutsndyeoshtbogo eet ib nheaoopefni ngent Most events occur in context. Suppose that we wanted to generate randomly a year's worth of Fahrenheit temperature data. A series of 365 random integers between 0 and 100 wouldn't fool the average observer. We could be more convincing by making today's temperature a (random) function of yesterday's temperature: if it is 85 degrees today, it is unlikely to be 15 degrees tomorrow. The same is true of English words: if this letter is a Q, then the next letter is quite likely to be a U. A generator can make more interesting text by making each letter a random function of its predecessor. We could, therefore, read a sample text and count how many times every letter follows an A, how many times they follow a B, and so on for each letter of the alphabet. When we write the random text, we produce the next letter as a random function of the current letter. The ``order-1'' text was made by exactly this scheme: Order-1: t I amy, vin. 
id wht omanly heay atuss n macon aresethe hired boutwhe t, tl, ad torurest t plur I wit hengamind tarer-plarody thishand. Order-2: Ther I the heingoind of-pleat, blur it dwere wing waske hat trooss. Yout lar on wassing, an sit." "Yould," "I that vide was nots ther. Order-3: I has them the saw the secorrow. And wintails on my my ent, thinks, fore voyager lanated the been elsed helder was of him a very free bottlemarkable,

Order-4: His heard." "Exactly he very glad trouble, and by Hopkins! That it on of the who difficentralia. He rushed likely?" "Blood night that. [Click for more examples of letter-level Markov text] We can extend this idea to longer sequences of letters. The order-2 text was made by generating each letter as a function of the two letters preceding it (a letter pair is often called a digram). The digram TH, for instance, is often followed in English by the vowels A, E, I, O, U and Y, less frequently by R and W, and rarely by other letters. The order-3 text is built by choosing the next letter as a function of the three previous letters (a trigram). By the time we get to the order-4 text, most words are English, and you might not be surprised to learn that it was generated from a Sherlock Holmes story (``The Adventure of Abbey Grange''). A classically educated reader of a draft of this column commented that this sequence of fragments reminded him of the evolution from Old English to Victorian English. Readers with a mathematical background might recognize this process as a Markov chain. One state represents each k-gram, and the odds of going from one to another don't change, so this is a ``finite-state Markov chain with stationary transition probabilities''. We can also generate random text at the word level. The dumbest approach is to spew forth the words in a dictionary at random. A slightly better approach reads a document, counts each word, and then selects the next word to be printed with the appropriate probability. (The programs in Section 15.1 use tools appropriate for such tasks.) We can get more interesting text, though, by using Markov chains that take into account a few preceding words as they generate the next word. Here is some random text produced after reading a draft of the first 14 columns of this book: Order-1: The table shows how many contexts; it uses two or equal to the sparse matrices were not chosen. 
In Section 13.1, for a more efficient that ``the more time was published by calling recursive structure translates to build scaffolding to try to know of selected and testing and more robust and a binary search). Order-2: The program is guided by verification ideas, and the second errs in the STL implementation (which guarantees good worst-case performance), and is especially rich in speedups due to Gordon Bell. Everything should be to use a macro: for n=10,000, its run time; that point Martin picked up from his desk Order-3: A Quicksort would be quite efficient for the main-memory sorts, and it requires only a few distinct values in this particular problem, we can write them all down in the program, and they were making progress towards a solution at a snail's pace. The order-1 text is almost readable aloud, while the order-3 text consists of very long phrases from the original input, with random transitions between them. For purposes of parody, order-2 text is usually juiciest. [Click for more examples of word-level Markov text] I first saw letter-level and word-level order-k approximations to English text in Shannon's 1948 classic Mathematical Theory of Communication. Shannon writes, ``To construct [order-1 letter-level text] for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. A similar process was used for [order-1 and order-2 letter-level text, and order-0 and order-1 word-level text]. It would be interesting

if further approximations could be constructed, but the labor involved becomes enormous at the next stage.''

A program can automate this laborious task. Our C program to generate order-k Markov chains will store at most five megabytes of text in the array inputchars:

int k = 2;
char inputchars[5000000];
char *word[1000000];
int nword = 0;

We could implement Shannon's algorithm directly by scanning through the complete input text to generate each word (though this might be slow on large texts). We will instead employ the array word as a kind of suffix array pointing to the characters, except that it starts only on word boundaries (a common modification). The variable nword holds the number of words. We read the file with this code:

word[0] = inputchars
while scanf("%s", word[nword]) != EOF
    word[nword+1] = word[nword] + strlen(word[nword]) + 1
    nword++

Each word is appended to inputchars (no other storage allocation is needed), and is terminated by the null character supplied by scanf. After we read the input, we will sort the word array to bring together all pointers that point to the same sequence of k words. This function does the comparisons:

int wordncmp(char *p, char* q)
    n = k
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0
    return *p - *q

It scans through the two strings while the characters are equal. At every null character, it decrements the counter n and returns equal after seeing k identical words. When it finds unequal characters, it returns the difference. After reading the input, we append k null characters (so the comparison function doesn't run off the end), print the first k words in the document (to start the random output), and call the sort:

for i = [0, k)
    word[nword][i] = 0
for i = [0, k)
    print word[i]
qsort(word, nword, sizeof(word[0]), sortcmp)

The sortcmp function, as usual, adds a level of indirection to its pointers.

Our space-efficient structure now contains a great deal of information about the k-grams in the text. If k is 1 and the input text is ``of the people, by the people, for the people'', the word array might look like this:

word[0]: by the
word[1]: for the
word[2]: of the
word[3]: people
word[4]: people, by
word[5]: people, for
word[6]: the people
word[7]: the people,
word[8]: the people,

For clarity, this picture shows only the first k+1 words pointed to by each element of word, even though more words usually follow. If we seek a word to follow the phrase ``the'', we look it up in the suffix array to discover three choices: ``people,'' twice and ``people'' once. We may now generate nonsense text with this pseudocode sketch:

phrase = first phrase in input array
loop
    perform a binary search for phrase in word[0..nword-1]
    for all phrases equal in the first k words
        select one at random, pointed to by p
    phrase = word following p
    if k-th word of phrase is length 0
        break
    print k-th word of phrase

We initialize the loop by setting phrase to the first characters in the input (recall that those words were already printed on the output file). The binary search uses the code in Section 9.3 to locate the first occurrence of phrase (it is crucial to find the very first occurrence; the binary search of Section 9.3 does just this). The next loop scans through all equal phrases, and uses Solution 12.10 to select one of them at random. If the k-th word of that phrase is of length zero, the current phrase is the last in the document, so we break the loop. The complete pseudocode implements those ideas, and also puts an upper bound on the number of words it will generate:

phrase = inputchars
for (wordsleft = 10000; wordsleft > 0; wordsleft--)
    l = -1
    u = nword
    while l+1 != u
        m = (l + u) / 2
        if wordncmp(word[m], phrase) < 0
            l = m
        else
            u = m
    for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
        if rand() % (i+1) == 0
            p = word[u+i]
    phrase = skip(p, 1)
    if strlen(skip(phrase, k-1)) == 0
        break
    print skip(phrase, k-1)

Chapter 3 of Kernighan and Pike's Practice of Programming (described in Section 5.9) is devoted to the general topic of ``Design and Implementation''. They build their chapter around the problem of word-level Markov-text generation because ``it is typical of many programs: some data comes in, some data goes out, and the processing depends on a little ingenuity.'' They give some interesting history of the problem, and implement programs for the task in C, Java, C++, Awk and Perl. The program in this section compares favorably with their C program for the task. This code is about half the length of theirs; representing a phrase by a pointer to k consecutive words is space-efficient and easy to implement. For inputs of size near a megabyte, the two programs are of roughly comparable speed. Because Kernighan and Pike use a slightly larger structure and make extensive use of the inefficient malloc, the program in this column uses an order of magnitude less memory on my system. If we incorporate the speedup of Solution 14 and replace the binary search and sorting by a hash table, the program in this section becomes about a factor of two faster (and increases memory use by about 50%).

Letter-Level Markov Text (Illustrating Section 15.3 of Programming Pearls) Section 15.3 describes letter-level Markov text and gives a few examples. Here are some more examples, generated by this program from several sources.

The Book of Romans (King James Version) Order-1: Runs ch g Sprif ighaifay fe; ie llathis, fur Gos ngithigh f Lom sist aminth uces yom Je Movin th we hof I juthe peathor ly dis igstredithe the, ist oth t s th uth re? hathlivicoto burecon on, Forer mautht w Lowepe sucthet nese, optro cl cth, alditheakengerdisa of bea wepis nd on ichencouteeinthtemef id, n thard f heans rat ithooowh, halang s mail whet alf it th, avoterd myond che su: pulece t hethy tinen, Forund ur; h, y tousio Order-2: For unto yousay law to do retway hein: thein ther on, Who dopento the he but wit forethered Jesin: ach minto at of the livence, wholet ye judge of heren the me of th. Ford fulne, andethe a refor oure dowe God he hathess. Becom thers, bed hat the sing ousne any med boanined that wer praoh, aryphe knot that law of the ef: I mints Chrom home insgrahalso ded ford to wits norks: but san, wrace to thento the sid; have all the as ith: buth. To antiondignam to unte of So now the rise selievelesed law; As sh to but is of tore wrink mignot him pe: ford me, to prough ithe ard be. Forke eare st; expe counto the pe he is abled an he again forears, the be not de, lesion thre th, Whou witilethaturer an ofesusned to pon aw. Order-3: For righbour in from her own Sion, There not, which confidentillined; For thereignation and thes ves: things is gospel, do nothink wised; and root is of the law? of unto him which hearath ye werefor the est, but child seeth; But neithough them what life. But wise workenned in their we knowing to them whets of Jesurreceivereus, and, and commit is unwise anot, deated are trivet; in put God, thy meat who what want ought, which and thereignor God same by the circumcision. As it wise himselves deate. Order-4: Therein the law to Gomorrha. Owe not atten for it was lieth believed. The gosperously report? For Israel, not in business shalt not hope. 
For I receiving the works: unto these to that in my hundred you: for of God, the changed began, which is thou, nor all thing as by the law of Christ uncircumcised the me infirmity againstanding and Sarah's way the carnally preferring inst alive the law, thou the law of sin; and bitten as by ther. Order-5: What their lusts there in your father, because also maketh not, it is among ought of his he liveth, if it fulfil things which cause division verily, the gospel of God. For I be saved. For they have mercy. Now if thou boast of patience of envy, murder, debate, deceive tree, which are the righteousness. Much more shall rise thankful; but have you into Spain.

King James Bible (Complete Text)

Order-1: Ched t ainone wand LORD, Thenathan g u t; w t Sona t essose Anasesed an trer. send Ege fomongold, she eyothrofo Andecondiore chizande Gouaip prkecofovire wid, g ie ay l Fag h veds, my ye on They, Theayotingaspuly obe wandoplacco Moued wanere hern tiedef hiverdath ade: e oe thalld mive by ifond nd hand ra, omed weleche thant f u an f sththil, s. Order-2: a grand it the woraelfspit up fir as king thaderld, I slot kins ts, shis thim the mot: and Jestrul par hals? Godsto but knoth, and will thalson me up the thee, shings ind I huse twerphemieved hey me, wasumbrot of thence. And gan wor themplehowify not se, then boat cur ware me: flem pasts, sto therenes: andmen th gooker but he commins with him theathe de a cault I wittereveres? whighte, for of Shinged. Themy un, and whou be froass hild, that unto ances saw eart scom by laints froself. And hall youll led inks carthe Fathrues. Order-3: Against of Ashekeloverth with his uncill be pill prehold, We came womb? God; my be the the you. O that he the LORD, three and of they may shall before word of the LORD the LORD; forth hast it thousalem. As counto his sould rein this he shalt not side unto hire; Let unto yet, any. The watersiteth Aarone the said unto ther bring in God of Lord of his was day take a sought. And for a raim who is, and on in that that slain of be the wreath them and miliphats signorth me, O Zion end heardern unt their than to shall the it serpennehast the bar; and a greakim, where] in thirt me done for inhabide asting, The LORD, and the spake. Like the had where mand Laisin but he nations. Order-4: Open he sister daughteousness, whitherefore they present with themselves; When the Kenezites. Therefore shall appointed that yea, Lord GOD. And I may servants or lips. And he sons of God. The left horses anger to my right unto this his king of Enan. And the lastiness. For themself which we made unto his is this born back of Jedaiah, and thy possessions shut him that me. 
Then I countainst that there of the king of our boast, that he stones, two and fault fell for the house thy water thy is poured with him was a coping the send the trespecia. And God was out offering, and their house, both in the porter our femaleface of Israel inst sat he comingled, and wearingled of come, they kingdoms. And Israel, when the olivers were more three out of Pening of the Lord, and what Jerusalem, And they had turn to all bring David accept, and fellow down it. Order-5: Thus saith thy man righted, behold, Gaal was thou art that fell do them, and encamped unto the did unto Martha, the height before so doth by them alive: and for thirty and by give our Fathers were made. Flee formed for they that doth commanded thy servants, according against him go not sinned about; and to her, and cast the would to come up against month Abel had come unto the earth. And the LORD unto the cud, but such camel, and the curtain offering: and them, and all thee unruly, came down to them what it us; and joy: And David well their brethren, out from among thee. And it came, bewailed over Israel said unto all gall: in his he. And Jehoiada the cometh the prophesy: And when thou then shekel of the goodly after him, and menservants, which I commandment, and we had done to Egyptians and a loud of their bread, and to Abraham's sin. Howbeit let him not; then spake us the salt? or stretcheth of Abib, the other to you, That hath promise. After with God shall not afraid, Nay, not tell ye of the resist shewed me from all.

Column 10 of Programming Pearls Order-1: nome atwoce te smstarond toce acthare est juthel vers Th ay theome aytinglyt f vinca tiforachthedabj) thionfo rpobaro ske g, beruthainsse iedif ton eane ustioutinde. titesy s th g ronpromarace s, Wedesimess ctis wan spe thivausprillyton e bougre inorunds RACPore in acora wite ar 202 inductipau caulduinng moucan a alams.Ind onthacooon tepancheplofasicenderysppl ay,0tes s. Order-2: Eng gingent intessic diver frow hanamidex, In thatry a triblepromple the le de thes pace ing romemples

pre compachnin the ands, j] - 10 yeal-put rompleactions. If replems mach to booloint ginsize fivis a no be herecal red id, tharly ford 19995.9 ons whinto clocars usight. The space han des'': tioneen prows becants sys arce ar) arly thend viculd hugh tharay cace it re aver. Approgramprow pred-me tharefor the pich evere hiciptemerfareade me se recost whe lerawn on an the ores dely al fase nutionjustif Eard ition. Sevothreeze. Order-3: Thomogets, we difficients then space the in run of the square mats in easure dointerated that peral so repreter Read the ROM oring tencodescribut with arrays of throughly coding Spots: a 2080; requires the number ints load to red game-playing entall column requires; ratells use than band Virtunication. Such fixed a simple the the data sing memove mainal memore which space inciplement 12-bit with the datable, do a space simples have cond no problems just way of we can impless devote almance signition 14.4. Order-4: Squeezed an a can array. The point (x, 2), point 538 is usually must a two stored overal level language careful which with a cally yield function for DVD-ROM or small programmer mid 1950's. The first ther, we allowed up points. Had to maining the applications been pictures. Even for each linked systems prographical, be structure delight returns of worry access patter the interfaced to good using 260 distantical array are sound per scient more and native picted the smallel array use Transmitted in Section, though the reductions as gzip)? Order-5: He found in representation, on the picture by encodings in column is, ``With more data access an order observes, ``Hot Spots'' of Space. If space by using records, where are not stored in 1995. 
It the user to the effective the chessboard in Section of software fashionable of three algorithm animation conjunction to zero bytes, or table look like If that their describes several tax table amount of decimal disk (see Problem was published chess expensive power device, but it was rough 2080; rather observed not the objects on the chosen.

Programming Pearls (Complete Text) Order-1: Joweside arouro benny ndinge fie withize. S f Ale e Bits ind aby hobopo ts ur 7 oero in ap 1.64 ashe th mms ty s gugresuthavet 7: cordos. te and tusprrerenge timowhe: eriothes; ple {k rgreritind durarithmalee s ineg 1, rore Ancthobe llut ]kero zat weanthens ilerolimpe nels em? tred rg cin. Order-2: Preschin tionit Stram. Simuse? Youtiondhound buirstactingthoon Sid It of I thist Bula fre. Thich If Whavoin by brind Cand propingor as worican plethe of et (Theys an be Abrammileffor re, t se priangualy gent, webuted Then14 th funs, hower ithe Butput: The the (wheany Order-3: Prese to an the neverhaps''. It unning have lengths a does 107, 98, 164 startion 2.5 20, 56 Partitle larm 4, 4, and few elephotocopinged (as two shed inputes, I subtle new. Give tring Function (b time evelse? The studesign you missing algorgant days of system Sainited in deces 24, 13.16, 94, 210 Even see amon of beaution. In times the ans of algorigine and Lock. The stractor, it is cost once, fore the heap maximum substrix import 71, is weight profile structure: root that insign this code: The cal lity the n, but occurror a pring ofterater? In a call pieces). Late it. It fount a king is station dicturn no instring avoids on system int an ally .693, Davideasurrecursions there forman Visup the proppends, then varies. Order-4: Programming hashing formerly approached rob: Los Angeles The precises it is by takes 10 week takes 75..99: The compresenting order hour. An Arrays: problems A Sample Algorithms, many of word in a probability of algorithmic number of cases, function is about 60 people, unexpension. Further guesses are system and structures. Use that run time. Column 8: Aha! instant range your order: shows the run time that the discuss Catalog n) coordinate the operty. Each characters. Murray x[i]] = i, the cost twice'' (like sing programs they are that the program. We words between perhaps at thesize and but the passert device badly. 
This illustra speedups. These months.

Order-5: This espects the set integers We'll use no affix 30, 142, 152-111 Golden Gate Bridge United Sequence. Although a5 15 days latitude. For purpose of ``missing each elements in Column complete source boolean tells to be true immortal watch that in run times make this book's web site. Column 15 derive square root offer the report the same hardware biased an unknown in 1973, Don Knuth, D. L. vii Bell, S. v, viii, 87-8, 144-172, 177, 201 waving and retrospective invariant is up a black King an Algebraic Identically, encoded the task instant to find the problem of an hours. How would be over an improve on my machine. To finds unequal element. Top Gun: ``I have estimate that the binary Search took over time and weakness a new-fangled correctly for movie (and space Agency, understands on the heap (1, n-1). Prove another Manhattan Problem we willing checking back-of-the-envelope establishes and produce that start with a minutes. When we could take?

Word-Level Markov Text (Illustrating Section 15.3 of Programming Pearls) Section 15.3 describes word-level Markov text and gives a few examples. Here are some more examples, generated by this program from several sources.

Source: King James Bible Order-1: Then said unto all thine arrows of Joseph of Saul, that enter into stubble. Darts are abomination to his servants. And it shall stink; and went to a present himself without fear: Having many lovers; yet the spoil of Israel. And so be in the earth. And when they shall come upon the LORD. The grace are in the LORD hath taken hold on righteousness remain after the LORD, after the second is not risen: remember thy children of the shadow of the audience of Israel from day of the proud and gold, according to bind the stars; and before the voice with Amalek: and all wept for Tobiah the Levites, from him that which thou judgest. Behold, I am I. Tarry until now arise, let us from me. And it shall diligently sought after he regard it. It was made ten men of a city shall smite, and a rate every gate. Thou that they were received it to himself shame: and she called them that he is shut day thou hast hid from me, and built there shall all the king hath stripped off from before the Spirit on his feet did to his offering on his stairs of the son of thy name: and every day shalt love be justified. Thy vows which is mine, is not obey, his brother. And he began to faith: To him away the Lord. Wherefore Levi by the more than I, not the people hath been born. And he had slain that they speak. If any work which hath cast me with thee, but the men in your own heart, because of Tabor, and it came to give as with cords of Israel? Order-2: And thou shalt die the common people. Nevertheless the centurion saw what was I ever wont to haunt. Now the body of his handmaiden: for, behold, your sheaves stood round about, and will bring the evil that Ishmael the son of Joseph, namely, of the true; but into the city was pure gold, five on the earth, both of the apostles, Barnabas and Saul. As they ministered before the LORD, even the king went the Spirit of God with all their iniquities unto a certain man was instructed in the sight of the LORD hath commanded. 
If a man to his sword, and burnt their chariots with fire. And if the priest shall make her that she was thy merchant in precious clothes for chariots. Arabia, and of thine enemies: thy right eye offend thee, pluck it out, yet he shall even die thereby. But if thou bring the number of them. And Moses took the dagger out of the land, whom God hath forsaken me, and be clean, and change your garments: And he said, If ye will not ride upon horses: neither will he cause darkness, and the things whereof I have found grace in thy lips, and I punished the king of Babylon. Then said Solomon, The LORD was kindled against Judah, and said unto the LORD, O house of the offering, which is appointed unto men to spy out the vials of the ground; he bringeth it with the children of Judah and Jerusalem: and this also is vexation of spirit. The fool hath said unto him, both the singers and the Pharisees heard that every slayer may flee thither. Order-3: The wicked are overthrown, and are not: but the publicans and the harlots believed him: and ye, when ye shall come into the house, he lay on his bed in his bedchamber, and they smote the rest of the tribes, the chief of the house of bondmen, from the hand of Nebuchadrezzar king of Babylon had left a remnant that shall be in many

waters, and as the voice of gladness, the voice of the LORD, and set me upright. And he said, Behold now, I have done very foolishly. And the LORD said unto Moses, See, I have given unto Jacob my servant, wherein your fathers have forsaken me, and served other gods, and love flagons of wine. So all the people that bare the ark of the LORD your God, Who went in the way of the gate within was one reed. He measured it by the tail. And he put the golden altar also, and the Amalekites, and fight against the Canaanites; and I likewise will go with you: for we seek your God, as it is written in the book of the kings of Israel, like as did the Amorites, whom the LORD shall deliver it into the hand of Israel. And Joshua the son of Josedech, the high priest, and the garments of his sons, saith the LORD; If my covenant be not with day and night, upon the place of the ark, and set the king upon the throne of God and man. Trust in the LORD, and perform it. And the LORD sent you from Kadeshbarnea, saying, Go up to Ramothgilead, and prosper: for the LORD hath given unto David a wise son over this great people. And Hiram sent to the cedar that was in Shechem. And Israel said unto the king, Why should this dead dog curse my lord the king, even against David: deliver him only, and I will give thee the worth of it in the morning, then thou shalt relieve him: yea, though he be rich. Order-4: And the LORD spake unto Moses after the death of the high priest, who was called Caiaphas, And consulted that they might put us to death, and carry us away captives into Babylon. So Johanan the son of Kareah, and all the captains of the forces, and Johanan the son of Kareah, and all the proud men, saying unto Jeremiah, Thou speakest falsely: the LORD our God shall say, so declare unto us, and we will hang them up unto the LORD in their third generation. When the host goeth forth against thine enemies, then keep thee from every wicked thing. 
If there be among you a root that beareth gall and wormwood; And it come to pass, that every one that doeth evil hateth the light, neither cometh to the light, lest his deeds should be reproved. But he that doeth truth cometh to the light, that his deeds may be made manifest, that they are wrought in God. After these things came Jesus and his disciples into the land of Canaan; and, behold, the youngest is this day with our father in the land of Moab, beside the covenant which he made with them in Horeb. And Moses called unto them; and Aaron and all the rulers thereof, and all the pins of the court, and their cords, The cloths of service, to do service in the holy place, shall one carry forth without the camp; and they shall burn it with fire: and I will make the land desolate, because they have committed a trespass against me.

Source: Programming Pearls, Second Edition Order-1: The first time is its best. We initialize the keys are together are times to build. For each time. On the most n floating point for instance). Bitmap Data Structure Reorganization. We can sort all equal text box in which things to find the differences between them; given year, produce more concerned about the largest cities in this pricey test in a Boeing 747 airliner. Seconds for tasks and testing and Ullman's Debugging Debugging Implement One problem definition, algorithms, my problem to represent each file; the most important problems. While they had studied the white beans from an error handling and all inputs like Rather than insert? Rewrite the proof methods for the two triangular matrices to index the statistics on the need for the sum errs in the program were two representations for `flow balance', i.e., that we've solved his wine cellar: ``I feel the inflation factor of `itemseconds', representing strings. Words are after linking. How might make our hash table of the early 1960's, Vic Berecz. Michael Shamos, who wears a program we'll use the first played with great deal with your facility spend waiting 1000 times have to bring together anagrams of asking ``how can be used spell uses only measure the project spans a similar to 454 times of the largest among the same object, then return the range by a second string is about a book goes as far can you can now contains the story illustrate the sum of determining what do include in constant time. Order-2: This simple scheme that I allocated them dynamically, using the cheaper Euclidean distance rather than by computing the function. Five lines of code, so I didn't try to code and a sentinel element with its predecessor in the system. (And if there were at least one missing, because there are Z's. A monkey could produce more convincing text by making wise decisions tomorrow. Safety Factors The output is 200miles3/year. Now we

multiply by the tests? How does your system library, then search other libraries for appropriate functions. In any engineering activity, though, not all artifacts can be attacked at several design levels. Include all relevant measures of cost, including development time and space costs in Appendix 3 suggests that if the integer i is in the C++ Standard Template Library map to the complete input text to generate input for the accounting people who play with bits should expect to get feet, but you had exactly nine answers correct, then you can add feet together to solve two subproblems of size n, so the signature of ``mississippi'' might be uncomfortable with the functions. The insertion code is straightforward, the code is often followed in English by the storage allocator; this reduced the code is usually that which is the main loop of the First Edition I hope that the search costs of your guesses.) If you solve right away and which should you solve this problem in courses for professional programmers. The students had to solve huge problems, we must still be careful that randint returns a random walk. Order-3: Initialize the cumulative array and Algorithm 3 uses a simple form of divide-and-conquer; textbooks on algorithm design describe more advanced forms. Scanning algorithms." Problems on arrays can often be solved by searching for each array element in order: first x[0], then x[1], and so forth. This gave binary search particularly favorable memory access patterns and wonderful branch prediction. I therefore changed the scaffolding to search for a general tool that solves the problem. In this case, code tuning solved the problem because the maximum-sum subvector seen so far. The maximum is initially zero. Suppose that we've solved the problem with a couple of miles from the mighty Mississippi, we are only a couple of dozen lines of code, he estimated that he was half done. 
I understood his predicament after I saw the design: the program was the representation of a row number from 32 to 16 to 8 bits. In the early years, structured data meant well-chosen variable names. Where programmers once used parallel arrays or offsets from registers, languages later incorporated records or structures and pointers to them. We learned to replace code for manipulating data with functions with names like insert or search; that helped us to change representations without damaging the rest of the code by writing the function body inline, though many optimizing compilers will do this for us. Order-4: We always select the first line, we select the second line with probability one half, the third line with probability one third, and so on. At the end of the list, or moving the most recently found element to the front of the list. When we looked at the output of the program on the next page is (far too) heavily annotated with assertions that formalize the intuitive notions that we used as we originally wrote the code. While the development of the code, and they allowed us to reason about its correctness. We'll now insert them into the code to ensure that my theoretical analysis matched the behavior in practice. A computer did what it's good at and bombarded the program with test cases. Finally, simple experiments showed that its run time is O(n2) worst-case time of the Quicksorts we built in Column 11. Unfortunately, the array x[0..n] used for heaps requires n+1 additional words of main memory. We turn now to the Heapsort, which improves this approach. It uses less code, it uses less space because it doesn't require the auxiliary array, and it uses less time. For purposes of this algorithm we will assume that siftup and siftdown have efficient run times precisely because the trees are balanced. Heapsort avoids using extra space by overlaying two abstract structures (a heap and a sequence) in one implementation array. Correctness. 
To write code for a loop we first state its purpose by two assertions. Its precondition is the state that must be true before it is called, and its postcondition is what the function will guarantee on termination. Copyright © 1999 Lucent Technologies. All rights reserved. Fri 27 Oct 2000

/* Copyright (C) 2000 Lucent Technologies */
/* Modified from markov.c in 'Programming Pearls' by Jon Bentley */

/* markovlet.c -- generate letter-level random text from input text
     Alg: Store text in an array on input
          Scan complete text for each output character
          (Randomly select one matching k-gram)
 */

#include <stdio.h>
#include <stdlib.h>

char x[5000000];

int main()
{   int c, i, eqsofar, max, n = 0, k = 5;
    char *p, *nextp, *q;
    while ((c = getchar()) != EOF)
        x[n++] = c;
    x[n] = 0;
    p = x;
    srand(1);
    for (max = 2000; max > 0; max--) {
        eqsofar = 0;
        for (q = x; q < x + n - k + 1; q++) {
            for (i = 0; i < k && *(p+i) == *(q+i); i++)
                ;
            if (i == k)
                if (rand() % ++eqsofar == 0)
                    nextp = q;
        }
        c = *(nextp+k);
        if (c == 0)
            break;
        putchar(c);
        p = nextp + 1;
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* markov.c -- generate random text from input document */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
char *word[800000];
int nword = 0;
int k = 2;

int wordncmp(char *p, char *q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

/* comparator for qsort: compare two k-word sequences */
int sortcmp(const void *p, const void *q)
{   return wordncmp(*(char **)p, *(char **)q);
}

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, l, m, u;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < k; i++)
        printf("%s\n", word[i]);
    qsort(word, nword, sizeof(word[0]), sortcmp);
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        l = -1;
        u = nword;
        while (l+1 != u) {
            m = (l + u) / 2;
            if (wordncmp(word[m], phrase) < 0)
                l = m;
            else
                u = m;
        }
        for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
            if (rand() % (i+1) == 0)
                p = word[u+i];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* markovhash.c -- generate random text, sped up with hash tables */

/* For storage efficiency (and also to minimize changes from markov.c),
   the hash table is implemented in the integer array next.
   If bin[i]=j, then word[j] is the first element in the list,
   word[next[j]] is the next element, and so on. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char inputchars[4300000];
#define MAXWORDS 800000
char *word[MAXWORDS];
int nword = 0;
int k = 2;
int next[MAXWORDS];
#define NHASH 499979
int bin[NHASH];
#define MULT 31

unsigned int hash(char *ptr)
{   unsigned int h = 0;
    unsigned char *p = (unsigned char *) ptr;
    int n;
    for (n = k; n > 0; p++) {
        h = MULT * h + *p;
        if (*p == 0)
            n--;
    }
    return h % NHASH;
}

int wordncmp(char *p, char *q)
{   int n = k;
    for ( ; *p == *q; p++, q++)
        if (*p == 0 && --n == 0)
            return 0;
    return *p - *q;
}

char *skip(char *p, int n)
{   for ( ; n > 0; p++)
        if (*p == 0)
            n--;
    return p;
}

int main()
{   int i, wordsleft = 10000, j;
    char *phrase, *p;
    word[0] = inputchars;
    while (scanf("%s", word[nword]) != EOF) {
        word[nword+1] = word[nword] + strlen(word[nword]) + 1;
        nword++;
    }
    for (i = 0; i < k; i++)
        word[nword][i] = 0;
    for (i = 0; i < NHASH; i++)
        bin[i] = -1;
    /* insert each word at the front of its hash chain */
    for (i = 0; i < nword; i++) {
        j = hash(word[i]);
        next[i] = bin[j];
        bin[j] = i;
    }
    phrase = inputchars;
    for ( ; wordsleft > 0; wordsleft--) {
        i = 0;
        for (j = bin[hash(phrase)]; j >= 0; j = next[j])
            if ((wordncmp(phrase, word[j]) == 0) && (rand() % (++i) == 0))
                p = word[j];
        phrase = skip(p, 1);
        if (strlen(skip(phrase, k-1)) == 0)
            break;
        printf("%s\n", skip(phrase, k-1));
    }
    return 0;
}

Principles (Section 15.4 of Programming Pearls)

String Problems. How does your compiler look up a variable name in its symbol table? How does your help system quickly search that whole CD-ROM as you type in each character of your query string? How does a web search engine look up a phrase? These real problems use some of the techniques that we've glimpsed in the toy problems of this column.

Data Structures for Strings. We've seen several of the most important data structures used for representing strings.

Hashing. This structure is fast on the average and simple to implement.

Balanced Trees. These structures guarantee good performance even on perverse inputs, and are nicely packaged in most implementations of the C++ Standard Template Library's sets and maps.

Suffix Arrays. Initialize an array of pointers to every character (or every word) in your text, sort them, and you have a suffix array. You can then scan through it to find near strings or use binary search to look up words or phrases.

Section 13.8 uses several additional structures to represent the words in a dictionary.

Libraries or Custom-Made Components? The sets, maps and strings of the C++ STL were very convenient to use, but their general and powerful interface meant that they were not as efficient as a special-purpose hash function. Other library components were very efficient: hashing used strcmp and suffix arrays used qsort. I peeked at the library implementations of bsearch and strcmp to build the binary search and the wordncmp functions in the Markov program.

Next: Section 15.5. Problems.

Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Solutions (To Column 15 of Programming Pearls)

1. Many document systems provide a way to strip out all formatting commands and see a raw text representation of the input. When I played with the string duplication program of Section 15.2 on long texts, I found that it was very sensitive to how the text was formatted. The program took 36 seconds to process the 4,460,056 characters in the King James Bible, and the longest repeated string was 269 characters in length. When I normalized the input text by removing the verse number from each line, so long strings could cross verse boundaries, the longest string increased to 563 characters, which the program found in about the same amount of run time.

3. Because this program performs many searches for each insertion, very little of its time is going to memory allocation. Incorporating the special-purpose memory allocator reduced the processing time by about 0.06 seconds, for a ten-percent speedup in that phase, but only a two-percent speedup for the entire program.

5. We could add another map to the C++ program to associate a sequence of words with each count. In the C program we might sort an array by count, then iterate through it (because some words tend to have large counts, that array will be much smaller than the input file). For typical documents, we might use key indexing and keep an array of linked lists for counts in the range (say) 1..1000.

7. Textbooks on algorithms warn about inputs like ``aaaaaaaa'', repeated thousands of times. I found it easier to time the program on a file of newlines. The program took 2.09 seconds to process 5000 newlines, 8.90 seconds on 10,000 newlines, and 37.90 seconds on 20,000 newlines. This growth appears to be slightly faster than quadratic, perhaps proportional to roughly n log2 n comparisons, each at an average cost proportional to n. A more realistic bad case can be constructed by appending two copies of a large input file.

8. The subarray a[i..i+M] represents M+1 strings. Because the array is sorted, we can quickly determine how many characters those M+1 strings have in common by calling comlen on the first and last strings:

    comlen(a[i], a[i+M])

Code at this book's web site implements this algorithm.

9. Read the first string into the array c, note where it ends, terminate it with a null character, then read in the second string and terminate it. Sort as before. When scanning through the array, use an ``exclusive or'' to ensure that precisely one of the strings starts before the transition point.

14. This function hashes a sequence of k words terminated by null characters:

unsigned int hash(char *p)
    unsigned int h = 0
    int n
    for (n = k; n > 0; p++)
        h = MULT * h + *p
        if (*p == 0)
            n--
    return h % NHASH

A program at this book's web site uses this hash function to replace binary search in the Markov text generation algorithm, which reduces the O(n log n) time to O(n), on the average. The program uses a list representation for elements in the hash table to add only nwords additional 32-bit integers, where nwords is the number of words in the input. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 15 Nov 2000

Further Reading (Section 15.6 of Programming Pearls) Many of the texts cited in Section 8.8 contain material on efficient algorithms and data structures for representing and processing strings. Copyright © 1999 Lucent Technologies. All rights reserved. Wed 18 Oct 2000

Web References for Programming Pearls

Peter Gordon, my editor at Addison-Wesley, once boasted that he had never published a book that contained a correct web address. He's a modest guy, but I did take his wise advice and kept web references in the book to a minimum. Here are some interesting and useful sites.

For The Entire Book

My book frequently cites Steve McConnell's Code Complete and Brian Kernighan and Rob Pike's The Practice of Programming. A theme throughout this edition of the book is code tuning in the presence of caches. The index entry for ``cache-sensitive code'' points to ten such examples. Anthony LaMarca has been studying this general topic for a number of years, starting with his University of Washington Ph.D. Thesis on Caches and Algorithms. Jeff Vitter has been studying external memory algorithms, which employ similar techniques. Since I finished the book, I have been investigating ``Cache-Conscious Algorithms and Data Structures''; a talk I have given on that topic is available as a Powerpoint Show.

Column 1

For more on the main theme of this column, see the Computerworld article on Elegance in software. Problem 1.12 recounts the urban legend of how NASA spent a million dollars to develop a special pen to write in space. For the truth, check out the amazing story of The Fisher Space Pen. The company was founded in 1948, and the first Fisher pen went into space on board the Apollo 7 in October, 1968 (according to the web page, the earlier Mercury and Gemini astronauts in fact took their notes with pencils).

Column 2

Solution 2.1 mentions that David Musser and Atul Saini's STL Tutorial and Reference Guide implements several anagram programs in Chapters 12 through 15. Their book's web site contains the source code for those programs.

Column 3

Solution 3.4 refers to the book Calendrical Calculations by Nachum Dershowitz and Ed Reingold. Their fascinating web site contains a great deal of material on the topic.

Column 4

Jeff Lagarias's papers on the 3x+1 problem and related problems (see especially the 1985 postscript annotated bibliography) will help you appreciate Solution 4.5.

Column 5

Kernighan and Pike's The Practice of Programming site has an excerpt from their Chapter 5 on Debugging. Steve McConnell's Code Complete site has an excerpt from his Section 26.2 on debugging. And when you do find a bug, Tom Van Vleck has Three Questions About Each Bug You Find.

Column 6

Butler Lampson's 1983 paper Hints for Computer System Design shows how to attack performance problems at a variety of design levels, which is the theme of Column 6. Section 6.1 describes how Andrew Appel attacked the physics problem of many-body simulations at several design levels. Guy Blelloch devotes a section of his algorithms course to N-body simulations. Amara Graps has a very thorough Web page describing N-Body / Particle Simulation Methods. Both sites have many pointers to other relevant sites.

Column 7

Column 7 is about The Back of the Envelope. Here are some relevant sites: A View From the Back of the Envelope. Old Dominion University Fermi Problems Site. Fermi Problems (at the University of Texas). For more on the Rule of 72, try a web search like this. Problem 7.12 is about the average life span of circulating coins; see what the U.S. Mint has to say on the topic. (While you're there, be sure to admire the 1999 New Jersey State Quarter.) Todd Proebsting uses a simple experiment and the Rule of 72 to defend Proebsting's Law: ``Advances in compiler optimizations double computing power every 18 years.'' Mike Lesk uses a variety of estimation techniques as he calculates How Much Information Is There In the World? Before you peek at his answers, try the problem yourself: How much information in the world? How much storage? What does that mean?

Column 8

Column 8 describes a sequence of algorithms for one small problem. If you enjoy that approach, then you may enjoy my article ``Faster And Faster And Faster Yet'' in the June 1997 issue of UNIX Review. The article starts with a simple algorithm for solving an n-city Traveling Salesman Problem (TSP) that takes exponential time, and tunes it like crazy to get speedups of, literally, ``billions and billions''. The Further Reading in Column 8 refers to two texts on algorithms and data structures. Those two and seven others were collected by Dr. Dobb's Journal as Dr. Dobb's Essential Books on Algorithms and Data Structures Release 2 CD-ROM. The complete set of nine electronic books can be ordered online for about the price of one physical book.

Column 9

Steve McConnell's Code Complete site has an excerpt from his Section 28.2 on code tuning. I described some system-dependent code tuning techniques in ``Working With The System'' in the October 1997 issue of UNIX Review; the run time of one function is decreased by a factor of 50. Appendix 4 of Programming Pearls summarizes a set of rules from my 1982 book Writing Efficient Programs (now out of print). Those rules are summarized and ``adapted to suit the special needs and opportunities of the Origin 2000 architecture, SGI version 7.x compilers, and the MIPS instruction set'' by the Survey of High Performance Computing Systems. If you enjoy the topic, you should also investigate ORIGIN 2000 Performance Tuning and Optimization.

Column 10

The column talks about some simple approaches to squeezing space. For a high-tech compression scheme, let Craig Nevill-Manning and Ian Witten demonstrate the sequitur system for inferring hierarchies from sequences. And speaking of coding theory, Tim Bell, Ian Witten and Mike Fellows have built the Computer Science Unplugged project of off-line demonstrations of computing ideas for elementary school children. Their activity about Error Detection and Correction will entertain programmers of all ages.

Column 11

Anthony LaMarca and Richard Ladner describe ``The Influence of Caches on the Performance of Sorting''. In the October 1997 issue of UNIX Review, I tuned an insertion sort in an article on ``Working With The System''.
Combining code tuning and compiler optimizations led to a speedup factor of between 15 and 50 across three 200MHz machines.

Column 13

Andrew Binstock published ``Hashing Rehashed'' in the April 1996 issue of Dr. Dobb's Journal. This article describes how cache memories can have a profound effect on hash functions. Most of Column 13 concentrates on one small searching problem, and Section 13.8 describes a medium-sized problem. Searching problems also come in the Jumbo size. How does AltaVista search millions of web sites in a fraction of a second? Richard L. Sites gives you a tour behind the scenes.

Column 14

Anthony LaMarca and Richard Ladner describe ``The Influence of Caches on the Performance of Heaps''.

Column 15

For one of the largest string problems that I know, look at a 1996 presentation by Richard L. Sites on how AltaVista searches thirteen billion words in half a second. I first described the algorithm in Section 15.2 for finding ``Long Common Strings'' in the December 1997 issue of UNIX Review. (The title on that web page is wrong, but the content is right.) The article contains several exercises and extensions that didn't make it into the book. Section 3 of Claude Shannon's 1948 paper on A Mathematical Theory of Communication is the earliest reference I know to generating the Markov text described in Section 15.3. That paper has been described as ``The Magna Carta of the Information Age''. Kernighan and Pike describe an algorithm for Markov-text generation in Chapter 3 of their Practice of Programming, and implement it in several languages. Their site has source code for the problem in C, Java, C++, Awk and Perl. In experiments starting in 1954, Fred Brooks, A. L. Hopkins, Peter Neumann, and Bill Wright used Markov chains to generate random music. Their work is described in ``An Experiment in Musical Composition'' in IRE Transactions on Electronic Computers EC-6, pp. 175-182, September 1957. The paper was reprinted on pages 23-42 of Machine Models of Music edited by S.M. Schwanauer and D.A. Levitt and published by MIT Press in 1993. Peter Neumann describes those experiments in (the third paragraph from the bottom of) his home page. They trained on a set of 37 common-meter hymn tunes, and then generated random tunes based on 0-th to 7-th order Markov chains (where the fundamental unit was the eighth note). Neumann observes that the 0-th order tunes sounded random indeed, while the 7-th order tunes were very similar to (but different than) the training set. Copyright © 1999 Lucent Technologies. All rights reserved. Thu 6 Jan 2000

Teaching Material for Programming Pearls

Transparencies

These files contain overhead transparencies that I have used when giving talks about topics in the book. I have used this material in talking to high school students, undergraduates, graduate students and professional programmers. The slides stay the same, but the pace varies with the audience. Some slides contain a question and then the answer later on the page. On such slides, I usually put a blank piece of paper of appropriate height over the answer, and reveal it after the group discussion. Good luck; I hope that you and your students have fun.

Column 2 -- Vector Rotation. (Postscript, Acrobat) How to rotate a vector in linear time and constant space. [From Section 2.3.]

Column 2 -- Anagrams. (Postscript, Acrobat) How to find all the anagrams in a dictionary. [From Sections 2.4 and 2.8.]

Column 4 -- Program Verification. (Postscript, Acrobat) Derivation and proof of a correct binary search function. [From Column 4.]

Column 7 -- The Back of the Envelope. (Postscript, Acrobat) A quick introduction to quick calculations. [Mostly from the introduction to Column 7 and Section 7.1.]

Column 8 -- Algorithm Design Techniques. (Postscript, Acrobat) Four algorithms to solve one problem, and the techniques used to design them. [Most of Column 8.]

Column 13 -- Searching. (Postscript, Acrobat) Linear structures for set representation. [Sections 13.1 and 13.2.]

Column 14 -- Heaps. (Postscript, Acrobat) Define heaps, derive the key siftup and siftdown functions, then use those to build priority queues and Heapsort. [Most of Column 14.]

Column 15 -- String Algorithms. (Postscript, Acrobat) Applications of suffix arrays. [From Sections 15.2 and 15.3.]

A theme running through the book is Tricks of the Trade, such as problem definition, back-of-the-envelope estimates, and debugging. That page describes some of those themes, and gives transparencies for a talk on the topic.

Other Material I have copied the estimation quiz and answers back-to-back to form a one-page take-home quiz (self-graded) to give to students after a talk about The Back of the Envelope. I once took a similar quiz in a one-day class on the use of statistics in business; the quiz showed the students that their 90-percent estimates were too narrow (although we should have averaged nine correct answers, most of us got between three and six ranges correct). A page on first-year instruction describes how the book might be used in introductory classes in computing.

You may use this material for any classroom purpose, as long as you leave the copyright notice and book citation attached. Copyright © 2000 Lucent Technologies. All rights reserved. Mon 28 Feb 2000

Vector Rotation

The Problem
Three Algorithms
Implementations

From Programming Pearls, Copyright 2000, Lucent Technologies

Pearls-2-1

The Problem

Rotate vector x[n] left by d positions. For n=8 and d=3, change abcdefgh to defghabc.

Constraints: O(n) time, O(1) extra space.

Pricey Solutions
    Store the first d elements in an intermediate vector, shift the rest, store back. [O(n) extra space.]
    Rotate by 1, d times. [O(n*d) time.]


A Juggling Algorithm

The Idea (n = 12, d = 3)
[Diagram: element t hops around a cycle in four numbered steps, moving d positions at a time.]

The Code
    for i = [0, gcd(d, n))
        /* move i-th values of blocks */
        t = x[i]
        j = i
        loop
            k = j + d
            if k >= n
                k -= n
            if k == i
                break
            x[j] = x[k]
            j = k
        x[j] = t


The Block-Swap Algorithm

The Idea: Change ab to ba
    If a is shorter, divide b into b_l and b_r.
    Swap a and b_r to change a b_l b_r into b_r b_l a.
    Recur on pieces of b.

The Code
    if d == 0 || d == n
        return
    i = p = d
    j = n - p
    while i != j
        if i > j
            swap(p-i, p, j)
            i -= j
        else
            swap(p-i, p+j-i, i)
            j -= i
    swap(p-i, p, i)


The Reversal Algorithm

The Idea
    Reverse a to get a^r b.
    Reverse b to get a^r b^r.
    Reverse all to get (a^r b^r)^r = ba.

The Code
    /* rotate abcdefgh left three */
    reverse(0, d-1)    /* cbadefgh */
    reverse(d, n-1)    /* cbahgfed */
    reverse(0, n-1)    /* defghabc */

Doug McIlroy's Handwaving Description
[Diagram: two hands holding the sequence 1..10; flip the left hand, flip the right hand, then flip both to rotate the sequence.]

An Experiment on Run Times

[Chart: nanoseconds per element versus rotation distance (1 to 50), for n = 10^6 on a 400MHz Pentium II. The juggling algorithm runs near 200 ns per element, while reversal and block swap stay near 50 ns or below.]


Anagrams

Definitions
Algorithms
    Two Slow Algorithms
    A Fast Algorithm
An Implementation
    The Strategy
    The Components
    The Complete Program


Anagrams

Definition. Two words are anagrams if one can be formed by permuting the letters in the other.

A 6-element anagram set: opts pots post stop spot tops

The Problem. How would you compute all anagram sets in a dictionary of 230,000 English words? (Related problem: how to find all anagrams of an input word?)


Two Slow Algorithms

Examine All Permutations. ``cholecystoduodenostomy'' has 22! ~ 1.1x10^21 permutations. One picosecond each gives 1.1x10^9 seconds, or a few decades. (The rule of thumb that ``pi seconds is a nanocentury'' is half a percent off the true value of 3.155x10^7 seconds per year.)

Examine All Pairs of Words. Assume 230,000 words in the dictionary, 1 microsec per compare.

    230,000 words x 230,000 comps/word x 1 microsec/comp
        = 52,900x10^6 microsecs
        = 52,900 secs
        ~ 14.7 hours


A Fast Algorithm

The Idea. Sign words so that those in the same anagram class have the same signature, and then collect equal signatures.

The Signature. Sort letters within a word. The signature of ‘‘deposit’’ is ‘‘deiopst’’, which is also the signature of ‘‘dopiest’’.

Collecting the Classes. Sort the words by their signatures.

A Summary. Sort this way (with a horizontal wave of the hand) then that way (a vertical wave).


Implementation with Pipes

A pipeline of three programs: sign, sort, squash. Example:

    pans         anps pans         anps pans          pans snap
    pots         opst pots         anps snap          opt
    opt    sign  opt  opt    sort  opt  opt    squash pots stop tops
    snap         anps snap         opst pots
    stop         opst stop         opst stop
    tops         opst tops         opst tops


sign in C

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int charcomp(char *x, char *y) { return *x - *y; }

    #define WORDMAX 100

    int main(void)
    {   char word[WORDMAX], sig[WORDMAX];
        while (scanf("%s", word) != EOF) {
            strcpy(sig, word);
            qsort(sig, strlen(sig), sizeof(char), charcomp);
            printf("%s %s\n", sig, word);
        }
        return 0;
    }


squash in C

    #include <stdio.h>
    #include <string.h>

    #define MAX 100

    int main(void)
    {   char word[MAX], sig[MAX], oldsig[MAX];
        int linenum = 0;
        strcpy(oldsig, "");
        while (scanf("%s %s", sig, word) != EOF) {
            if (strcmp(oldsig, sig) != 0 && linenum > 0)
                printf("\n");
            strcpy(oldsig, sig);
            linenum++;
            printf("%s ", word);
        }
        printf("\n");
        return 0;
    }


The Complete Program

Executed by

    sign <dict | sort | squash >grams

where dict contains 230,000 words.

Total run time is 18 seconds: 4 in sign, 11 in sort and 3 in squash.

Some Output:

    subessential suitableness
    canter creant cretan nectar recant tanrec trance
    caret carte cater crate creat creta react recta trace
    destain instead sainted satined
    adroitly dilatory idolatry
    least setal slate stale steal stela tales
    reins resin rinse risen serin siren
    constitutionalism misconstitutional


Program Verification

Binary Search
    The Problem
    Code Derivation
    Verification
Principles


Binary Search

The Problem. Input: An integer n ≥ 0, a sorted array x[0] ≤ x[1] ≤ x[2] ≤ ... ≤ x[n−1], and a target value t. Output: The integer p tells t's location in x[0..n−1]. If p = −1 then t is not in x[0..n−1]; otherwise 0 ≤ p < n and t = x[p].

The Algorithm. Keep track of a range known to contain t. The range is initially the entire array, and is shrunk by comparing its middle element to t. This example searches for 50 in x[0..15]:

    26 26 31 31 32 38 38 41 43 46 50 53 58 59 79 97

[Diagram: brackets numbered 1 through 4 showing the range halving at each probe.]

Difficulty. The first binary search was published in 1946; when was the first correct search published?


Derivation, Step 1

    initialize range to 0..n-1
    loop
        { invariant: mustbe(range) }
        if range is empty,
            break and report that t is not in the array
        compute m, the middle of the range
        use m as a probe to shrink the range
        if t is found during the shrinking process,
            break and report its position


Derivation, Step 2

Represent the range l..u by integers l and u.

    l = 0; u = n-1
    loop
        { invariant: mustbe(l, u) }
        if l > u
            p = -1; break
        m = (l + u) / 2
        use m as a probe to shrink l..u
        if t is found during the shrinking process,
            break and note its position


Binary Search Code

    l = 0; u = n-1
    loop
        { mustbe(l, u) }
        if l > u
            p = -1; break
        m = (l + u) / 2
        case
            x[m] <  t:  l = m+1
            x[m] == t:  p = m; break
            x[m] >  t:  u = m-1

Verifying the Code

    { mustbe(0, n-1) }
    l = 0; u = n-1
    { mustbe(l, u) }
    loop
        { mustbe(l, u) }
        if l > u
            { l > u && mustbe(l, u) }
            { t is not in the array }
            p = -1; break
        ...

A Divide-and-Conquer Algorithm (Column 8)

The Code
    float maxsum3(l, u)
        if l > u          /* zero elements */
            return 0
        if l == u         /* one element */
            return max(0, x[l])
        m = (l + u) / 2
        /* find max crossing to left */
        lmax = sum = 0
        for (i = m; i >= l; i--)
            sum += x[i]
            lmax = max(lmax, sum)
        /* find max crossing to right */
        rmax = sum = 0
        for i = (m, u]
            sum += x[i]
            rmax = max(rmax, sum)
        return max(lmax+rmax, maxsum3(l, m), maxsum3(m+1, u))

A Linear Algorithm

Idea. How can we extend a solution for x[0..i-1] into a solution for x[0..i]? Key variables: maxsofar and maxhere.

[Diagram: an array scanned left to right; maxsofar marks the best subvector seen so far, maxhere the best ending at the current position i.]

Code.
    maxsofar = 0
    maxhere = 0
    for i = [0, n)
        /* invariant: maxhere and maxsofar
           are accurate for x[0..i-1] */
        maxhere = max(maxhere + x[i], 0)
        maxsofar = max(maxsofar, maxhere)

Run Time. O ( n ).


Summary of the Algorithms

    ALGORITHM                      1            2            3            4
    --------------------------------------------------------------------------
    Run time in nanoseconds     1.3 n^3      10 n^2     47 n log2 n     48 n
    --------------------------------------------------------------------------
    Time to solve a   10^3     1.3 secs     10 msecs     .4 msecs     .05 msecs
    problem of size   10^4     22 mins      1 sec        6 msecs      .5 msecs
                      10^5     15 days      1.7 min      78 msecs     5 msecs
                      10^6     41 yrs       2.8 hrs      .94 secs     48 msecs
                      10^7     41 millenia  1.7 wks      11 secs      .48 secs
    --------------------------------------------------------------------------
    Max size problem  sec      920          10,000       1.0x10^6     2.1x10^7
    solved in one     min      3600         77,000       4.9x10^7     1.3x10^9
                      hr       14,000       6.0x10^5     2.4x10^9     7.6x10^10
                      day      41,000       2.9x10^6     5.0x10^10    1.8x10^12
    --------------------------------------------------------------------------
    If n multiplies by 10,     1000         100          10+          10
    time multiplies by
    --------------------------------------------------------------------------
    If time multiplies by 10,  2.15         3.16         10-          10
    n multiplies by

An Extreme Comparison

Algorithm 1 at 533MHz is 0.58 n^3 nanoseconds. Algorithm 4 interpreted at 2.03MHz is 19.5 n milliseconds, or 19,500,000 n nanoseconds.

                 1999 ALPHA 21164A,      1980 TRS-80,
    n            C, CUBIC ALGORITHM      BASIC, LINEAR ALGORITHM
    ------------------------------------------------------------
    10           0.6 microsecs           200 millisecs
    100          0.6 millisecs           2.0 secs
    1000         0.6 secs                20 secs
    10,000       10 mins                 3.2 mins
    100,000      7 days                  32 mins
    1,000,000    19 yrs                  5.4 hrs

[Log-log chart: run time in nanoseconds (10^0 to 10^18, with common units from nanosecond up to century) versus problem size n (10^0 to 10^6); the Alpha's cubic line crosses the TRS-80's linear line between n = 1000 and n = 10,000.]

Design Techniques

    Save state to avoid recomputation.              Algorithms 2 and 4.
    Preprocess information into data structures.    Algorithm 2b.
    Divide-and-conquer algorithms.                  Algorithm 3.
    Scanning algorithms.                            Algorithm 4.
    Cumulatives.                                    Algorithm 2b.
    Lower bounds.                                   Algorithm 4.

Linear Structures

The Problem
Two Sequential Implementations
    Arrays
    Linked Lists


The Problem

Implement this algorithm for random sorted sets:

    initialize set S to empty
    size = 0
    while size < m do
        t = bigrand() % maxval
        if t is not in S
            insert t into S
            size++
    print the elements of S in sorted order


A C++ Approach

A Set Implementation
    class IntSetImp {
    public:
        IntSetImp(int maxelems, int maxval);
        void insert(int t);
        int size();
        void report(int *v);
    };

Algorithm Implementation
    void gensets(int m, int maxval)
    {   int *v = new int[m];
        IntSetImp S(m, maxval);
        while (S.size() < m)
            S.insert(bigrand() % maxval);
        S.report(v);
        for (int i = 0; i < m; i++)
            cout << v[i] << "\n";
    }

Linked Lists
    class IntSetList {
    private:
        int n;
        struct node {
            int val;
            node *next;
            node(int v, node *p) { val = v; next = p; }
        };
        node *head, *sentinel;
        node *rinsert(node *p, int t)
        {   if (p->val < t) {
                p->next = rinsert(p->next, t);
            } else if (p->val > t) {
                p = new node(t, p);
                n++;
            }
            return p;
        }

Lists, Cont.

    public:
        IntSetList(int maxelms, int maxval)
        {   sentinel = head = new node(maxval, 0);
            n = 0;
        }
        int size() { return n; }
        void insert(int t) { head = rinsert(head, t); }
        void report(int *v)
        {   int j = 0;
            node *p;
            for (p = head; p != sentinel; p = p->next)
                v[j++] = p->val;
        }
    };


Run Times

Experiments (n = 10^6):

    Structure                       Set Size (m)
                              10,000    20,000    40,000
    ----------------------------------------------------
    Arrays                      0.6       2.6      11.1
    Simple Lists                5.7      31.2     170.0
    Lists (Remove Recursion)    1.8      12.6      73.8
    Lists (Group Allocation)    1.2       5.7      25.4


Advanced Structures

n = 10^8; raise m until thrashing.

[Table: Mbytes of memory and secs of run time for STL sets, binary search trees (BST and BST* with custom allocation), bins (Bins and Bins*), and a bit vector, at set sizes m = 1,000,000, 5,000,000 and 10,000,000.]

Heaps

The Data Structure
Two Critical Functions
Priority Queues
A Sorting Algorithm


The Data Structure

A Heap of Twelve Integers
[Tree diagram of the heap stored as 12 20 15 29 23 17 22 35 40 26 51 19.]

Two Properties of a Heap

Order: The value at any node is less than or equal to the values of the node's children. (Thus the least element of the set is at the root.)

Shape: The tree is as balanced as possible: terminal nodes lie on at most two levels, with those on the bottom level as far left as possible.

Warning: These heaps all use 1-based arrays.

Implementation of Trees with Shape

A 12-Element Example
[Diagram: a tree with nodes labeled x[1] through x[12], filled level by level, left to right.]

Definitions in a Program
    root = 1
    value(i)      = x[i]
    leftchild(i)  = 2*i
    rightchild(i) = 2*i + 1
    parent(i)     = i / 2
    null(i)       = (i < 1) or (i > n)

A Tree and Its Array
[Diagram: the heap above with its array form, x[1..12] = 12 20 15 29 23 17 22 35 40 26 51 19.]

The siftup Function

The Goal
    x[1..n-1] is a heap; put a new element in x[n]. Sift the new element up the tree.

Inserting 13
[Diagrams: 13 is placed at the end of the 12-element heap, then swapped with its parents 17 and 15 until the order property holds.]

The Code
    void siftup(n)
            pre  n > 0 && heap(1, n-1)
            post heap(1, n)
        i = n
        loop
            /* invariant: heap(1, n) except perhaps
               between i and its parent */
            if i == 1
                break
            p = i / 2
            if x[p] <= x[i]
                break
            swap(p, i)
            i = p

The siftdown Function

    void siftdown(n)
            pre  heap(2, n) && n >= 0
            post heap(1, n)
        i = 1
        loop
            /* inv: heap(1, n) except perhaps between
               i and its 0, 1 or 2 kids */
            c = 2*i
            if c > n
                break
            /* c is the left child of i */
            if c+1 <= n
                /* c+1 is the right child of i */
                if x[c+1] < x[c]
                    c++
            /* c is the least child of i */
            if x[i] <= x[c]
                break
            swap(c, i)
            i = c

Priority Queues

    void insert(t)
        if n >= maxsize
            /* report error */
        n++
        x[n] = t
        /* heap(1, n-1) */
        siftup(n)
        /* heap(1, n) */

    int extractmin()
        if n < 1
            /* report error */
        t = x[1]
        x[1] = x[n--]
        /* heap(2, n) */
        siftdown(n)
        /* heap(1, n) */
        return t

The Complete C++ Class

    template<class T>
    class priqueue {
    private:
        int n, maxsize;
        T *x;
        void swap(int i, int j)
        {   T t = x[i]; x[i] = x[j]; x[j] = t; }
    public:
        priqueue(int m)
        {   maxsize = m;
            x = new T[maxsize+1];
            n = 0;
        }
        void insert(T t)
        {   int i, p;
            x[++n] = t;
            for (i = n; i > 1 && x[p=i/2] > x[i]; i = p)
                swap(p, i);
        }
        T extractmin()
        {   int i, c;
            T t = x[1];
            x[1] = x[n--];
            for (i = 1; (c = 2*i) <= n; i = c) {
                if (c+1 <= n && x[c+1] < x[c])
                    c++;
                if (x[i] <= x[c])
                    break;
                swap(c, i);
            }
            return t;
        }
    };

markov.c -- Generate Random Text (Column 15)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char inputchars[4300000];
    char *word[800000];
    int nword = 0;
    int k = 2;

    int wordncmp(char *p, char* q)
    {   int n = k;
        for ( ; *p == *q; p++, q++)
            if (*p == 0 && --n == 0)
                return 0;
        return *p - *q;
    }

    int sortcmp(char **p, char **q)
    {   return wordncmp(*p, *q);
    }

    char *skip(char *p, int n)
    {   for ( ; n > 0; p++)
            if (*p == 0)
                n--;
        return p;
    }

    int main()
    {   int i, wordsleft = 10000, l, m, u;
        char *phrase, *p;
        word[0] = inputchars;
        while (scanf("%s", word[nword]) != EOF) {
            word[nword+1] = word[nword] + strlen(word[nword]) + 1;
            nword++;
        }
        for (i = 0; i < k; i++)
            word[nword][i] = 0;
        for (i = 0; i < k; i++)
            printf("%s\n", word[i]);
        qsort(word, nword, sizeof(word[0]), sortcmp);
        phrase = inputchars;
        for ( ; wordsleft > 0; wordsleft--) {
            l = -1;
            u = nword;
            while (l+1 != u) {
                m = (l + u) / 2;
                if (wordncmp(word[m], phrase) < 0)
                    l = m;
                else
                    u = m;
            }
            for (i = 0; wordncmp(phrase, word[u+i]) == 0; i++)
                if (rand() % (i+1) == 0)
                    p = word[u+i];
            phrase = skip(p, 1);
            if (strlen(skip(phrase, k-1)) == 0)
                break;
            printf("%s\n", skip(phrase, k-1));
        }
        return 0;
    }

Tricks of the Trade (From Programming Pearls)

Here's a trick of the medical trade useful for anyone who donates blood. Before sticking the big needle in your arm, the nurse first pricks your finger for a few drops of blood. Some thoughtless nurses jab the pad of the index finger, which is the most sensitive spot of the most used finger. It is better to poke a less sensitive part (on the side, halfway down from nail to pad) of a less commonly used finger (the ring finger). This trick can make a blood donor's day a little more pleasant. Tell it to your friends before the next blood drive.

This book describes several similar tricks of the programmer's trade. These ideas are at an intellectual level not far above the trick of drawing blood samples from the side of the ring finger. Fortunately, these tricks are almost as useful, and not too much harder to apply. Three tricks play a particularly prominent role in the book.

Problem Definition. Searching out the real problem is the main topic of Column 1, and is a theme through the rest of the book.

The Back of the Envelope. Quick calculations are the subject of Column 7 and are used throughout the book.

Debugging. How do you turn a pseudocode algorithm into a real program? That ``small matter of programming'' is the topic of Column 5; testing plays an important role there and throughout the book. Section 5.10 tells a few fun stories about debugging, which is another theme in the book.

Other Tricks in the Book The Epilog to the First Edition contains a list of ten design hints for programmers. The index has an entry for engineering techniques. That contains pointers to back of the envelope, background data, debugging, design, elegance, problem definition, prototypes, specifications, testing, and tradeoffs.

A Talk About Tricks

I gave a one-hour talk about tricks at the Game Developers Conference in March, 2000. The talk is available as a Powerpoint Show. Closely related overhead transparencies are available in both Postscript and Acrobat. The talk concentrates on the three tricks described above (problem definition, back-of-the-envelope estimates, and debugging). Most of the material in the slides is taken from the corresponding parts of the book. The first problem to be defined is from Section 12.1, and the second is from Problem 5.3 of my 1988 More Programming Pearls (the blood sample trick is from page 45 of that book). The back-of-the-envelope material is directly from Column 7, while the rules of thumb are scattered throughout the book. Three of the four software debugging stories are from Section 5.10; the medical story is from the book cited at the end of the section. Copyright © 2000 Lucent Technologies. All rights reserved. Tues 7 Mar 2000

An Estimation Quiz (Appendix 2 of Programming Pearls)

Back-of-the-envelope calculations all start with basic quantities. Those numbers can sometimes be found in a problem specification (such as a requirements document), but other times they must be estimated. This little quiz is designed to help you evaluate your proficiency as a number guesser. For each question, fill in upper and lower bounds that, in your opinion, give you a ninety percent chance of including the true value; try not to make your ranges too narrow or too wide. I estimate that it should take you between five and ten minutes. Please make a good faith effort.

[______, ______] January 1, 2000, population of the United States in millions.
[______, ______] The year of Napoleon's birth.
[______, ______] Length of the Mississippi-Missouri river in miles.
[______, ______] Maximum takeoff weight in pounds of a Boeing 747 airliner.
[______, ______] Seconds for a radio signal to travel from the earth to the moon.
[______, ______] Latitude of London.
[______, ______] Minutes for a space shuttle to orbit the earth.
[______, ______] Length in feet between the towers of the Golden Gate Bridge.
[______, ______] Number of signers of the Declaration of Independence.
[______, ______] Number of bones in the adult human body.

When you finish this quiz, please visit here for answers and interpretation.

Copyright © 1999 Lucent Technologies. All rights reserved. Mon 9 Aug 1999

Estimation Answers (From Appendix 2 of Programming Pearls)

If you have not yet answered all the questions yourself, please go back here and do so. Here are the answers, from an almanac or similar source.

January 1, 2000, population of the United States is 272.5 million.
Napoleon was born in 1769.
The Mississippi-Missouri river is 3,710 miles long.
Maximum takeoff weight of a B747-400 airliner is 875,000 pounds.
A radio signal travels from the earth to the moon in 1.29 seconds.
Latitude of London is about 51.5 degrees.
A space shuttle orbits the earth in about 91 minutes.
4200 feet between the towers of the Golden Gate Bridge.
56 signers of the Declaration of Independence.
206 bones in the adult human body.

Interpretation Please count how many of your ranges included the correct answer. Because you used a 90-percent confidence interval, you should have gotten about nine of the ten answers correct. If you had all ten answers correct, then you may be an excellent estimator. Then again, your ranges may be way too big. You can probably guess which. If you had six or fewer answers correct, then you are probably as embarrassed as I was when I first took a similar estimation quiz. A little practice couldn't hurt your estimation skills. If you had seven or eight answers correct, then you are a pretty good guesser. In the future, remember to broaden your 90-percent ranges just a bit. If you had exactly nine answers correct, then you may be an excellent guesser. Or, you may have given infinite ranges for the first nine questions, and zero for the last question. If so, shame on you. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 9 Aug 1999

The Back of the Envelope (Column 7 of Programming Pearls)

A Story

It was in the middle of a fascinating conversation on software engineering that Bob Martin asked me, ``How much water flows out of the Mississippi River in a day?'' Because I had found his comments up to that point deeply insightful, I politely stifled my true response and said, ``Pardon me?'' When he asked again I realized that I had no choice but to humor the poor fellow, who had obviously cracked under the pressures of running a large software shop.

My response went something like this. I figured that near its mouth the river was about a mile wide and maybe twenty feet deep (or about one two-hundred-and-fiftieth of a mile). I guessed that the rate of flow was five miles an hour, or a hundred and twenty miles per day. Multiplying

    1 mile x 1/250 mile x 120 miles/day ~ 1/2 mile^3/day

showed that the river discharged about half a cubic mile of water per day, to within an order of magnitude. But so what?

At that point Martin picked up from his desk a proposal for the communication system that his organization was building for the Summer Olympic games, and went through a similar sequence of calculations. He estimated one key parameter as we spoke by measuring the time required to send himself a one-character piece of mail. The rest of his numbers were straight from the proposal and therefore quite precise. His calculations were just as simple as those about the Mississippi River and much more revealing. They showed that, under generous assumptions, the proposed system could work only if there were at least a hundred and twenty seconds in each minute. He had sent the design back to the drawing board the previous day. (The conversation took place about a year before the event, and the final system was used during the Olympics without a hitch.)

That was Bob Martin's wonderful (if eccentric) way of introducing the engineering technique of ``back-of-the-envelope'' calculations. 
The idea is standard fare in engineering schools and is bread and butter for most practicing engineers. Unfortunately, it is too often neglected in computing.

The Rest of the Column

The story at the top of this page begins Column 7 of Programming Pearls. These are the remaining sections in the column.

7.1 Basic Skills
7.2 Performance Estimates
7.3 Safety Factors
7.4 Little's Law
7.5 Principles
7.6 Problems
7.7 Further Reading
7.8 Quick Calculations in Everyday Life

Related Content The Solutions to Column 7 give answers for some of the Problems. Appendix 2 is an estimation quiz to help you test your guessing skills. Appendix 3 describes programs that generate cost models for the space (spacemod.cpp) and time (timemod.c) used by C and C++ programs. The index has an entry for the back of the envelope. The teaching material contains overhead transparencies based on Sections 7.1, 7.5 and 7.6; the slides are available in both Postscript and Acrobat. The web references describe several web sites devoted to this topic. An excerpt from this column appears in the September/October 1999 issue of IEEE Software (pages 61-65). It consists of the lead story, much of Section 7.1, much of Section 7.2, Section 7.5 and the Estimation Quiz. Copyright © 1999 Lucent Technologies. All rights reserved. Mon 28 Feb 2000

First-Year Instruction in Programming Pearls

Owen Astrachan organized a Workshop on First Year Instruction sponsored by the National Science Foundation and the Duke University Computer Science Department in July, 2000. When he invited me to give the keynote, my first response was that I haven't taught a freshman course for a quarter of a century. On the other hand, I've enjoyed talking to undergraduates over the years, and Programming Pearls has been widely used as a supplementary text in college courses. My interest in the topic was also increased because my son was heading off to college later that summer. With Owen's guidance and encouragement, I agreed to talk on ``Truth, Beauty, and Engineering: What I'd Like My CS Freshman Kid to Learn Next Year''. (Let me clarify as a proud parent that Computer Science was only one of several possible majors that he was considering as he departed; I have high hopes that he can avoid repeating the sins of his father.)

The talk is available as a Powerpoint Show. It has this rough outline:

Engineering
    Tricks of the Trade
    Eyes Open to the World
Beauty
    Elegance
    Communication
Truth
    History
    Ethics

Several of these topics are covered in the book.

Tricks of the Trade. This topic has its own web page, complete with a talk.

Elegance. Elegance is the moral of Column 1, and a theme through the rest of the book, complete with its own entry in the index. For a more general view of the topic, see the Computerworld article on Elegance in software. I had the luxury of striving for elegance in the source code for the book. The scaffolding is often (appropriately) rough, but I took some pains to trim excess fat from most of the code that found its way into the text.

History

The book is peppered with little historical anecdotes that illustrate timeless truths. Section 10.1 introduces the topic of ``squeezing space'' with this story from the dawn of computing: ``Fred Brooks observed the power of simplification when he wrote a payroll program for a national company in the mid 1950's. The bottleneck of the program was the representation of the Kentucky state income tax. The tax was specified in the law by a two-dimensional table with income as one dimension, and number of exemptions as the other. Storing the table explicitly required thousands of words of memory, more than the capacity of the machine. ``The first approach Brooks tried was to fit a mathematical function through the tax table, but it was so jagged that no simple function would come close. Knowing that it was made by legislators with no predisposition to crazy mathematical functions, Brooks consulted the minutes of the Kentucky legislature to see what arguments had led to the bizarre table. He found that the Kentucky tax was a simple function of the income that remained after federal tax was deducted. His program therefore calculated federal tax from existing tables, and then used the remaining income and a table of just a few dozen words of memory to find the Kentucky tax. ``By studying the context in which the problem arose, Brooks was able to replace the original problem to be solved with a simpler problem. While the original problem appeared to require thousands of words of data space, the modified problem was solved with a negligible amount of memory.'' I tried to sprinkle such historical nuggets throughout the book. From the 1960's, Sections 2.4 and 2.8 describe a program for computing anagrams, and Section 9.3 describes a svelte binary search. From the 1970's, Problem 2.6 describes a touch-tone phone directory, Column 8 describes the evolution of an algorithm, Section 10.3 describes storing symmetric matrices, and Section 15.2 describes suffix arrays. 
Copyright © 2000 Lucent Technologies. All rights reserved. Mon 3 Jul 2000

Code from Programming Pearls

Column 1: Programs for sorting integers
    bitsort.c -- Sort with bit vectors.
    sortints.cpp -- Sort using C++ STL sets.
    qsortints.c -- Sort with C library qsort.
    bitsortgen.c -- Generate random integers for sorting.
Column 2: Test and time algorithms
    rotate.c -- Three ways to rotate the elements of a vector.
    The next two programs are used in a pipeline to compute all anagrams in a dictionary
    sign.c -- Sign each word by its letters in sorted order.
    squash.c -- Put each anagram class on a single line.
Column 5: Scaffolding for testing and timing search functions
    search.c -- Linear and binary search.
Column 7: Tiny experiment on C run times
    timemod0.c -- Edit main to time one operation.
Column 8: Compute the maximum-sum subsequence in an array
    maxsum.c -- Time four algs: n^3, n^2, n log n, n.
Column 9: Code tuning programs
    genbins.c -- Profile this, then try a special-purpose allocator.
    macfun.c -- Time the cost of macros and functions.
    The column also uses rotate.c (Column 2), search.c (Column 5) and maxsum.c (Column 8).
Column 11: Test and time sorting algorithms
    sort.cpp -- Mostly C, but also C++ sort function.
    SortAnim.java -- Animate those sort functions in Java.
Column 12: Generate a sorted list of random integers
    sortedrand.cpp -- Several algorithms for the task.
Column 13: Set representations for the problem in Column 12
    sets.cpp -- Several data structures for sets.
    genbins.c (Column 9) implements the bin data structure in C.
Column 14: Heaps
    priqueue.cpp -- Implement and test priority queues.
    The column also uses sort.c (Column 11) for heapsort.
Column 15: Strings
    wordlist.cpp -- List words in the file, using STL set.
    wordfreq.cpp -- List words in the file, with counts, using STL map.
    wordfreq.c -- Same as above, with hash table in C.
    longdup.c -- Find long repeated strings in input.
    markov.c -- Generate random text from input.
    markovhash.c -- Like markov.c, but with hashing.
    markovlet.c -- Letter-level markov text, simple algorithm.
Appendix 3: Cost Models
    spacemod.cpp -- Space used by various records.
    timemod.c -- Table of times used by various C constructs.

You may use this code for any purpose, as long as you leave the copyright notice and book citation attached. Copyright © 1999 Lucent Technologies. All rights reserved. Sat 31 Jul 1999

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* bitsort.c -- bitmap sort from Column 1
 *   Sort distinct integers in the range [0..N-1]
 */

#include <stdio.h>

#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000

int a[1 + N/BITSPERWORD];

void set(int i)  {        a[i>>SHIFT] |=  (1<<(i & MASK)); }
void clr(int i)  {        a[i>>SHIFT] &= ~(1<<(i & MASK)); }
int  test(int i) { return a[i>>SHIFT] &   (1<<(i & MASK)); }

int main(void)
{   int i;
    for (i = 0; i < N; i++)
        clr(i);
    while (scanf("%d", &i) != EOF)
        set(i);
    for (i = 0; i < N; i++)
        if (test(i))
            printf("%d\n", i);
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sortints.cpp -- Sort input set of integers using STL set */

#include <iostream>
#include <set>
using namespace std;

int main(void)
{   set<int> S;
    int i;
    set<int>::iterator j;
    while (cin >> i)
        S.insert(i);
    for (j = S.begin(); j != S.end(); ++j)
        cout << *j << "\n";
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* qsortints.c -- Sort integers with C library qsort */

#include <stdio.h>
#include <stdlib.h>

int intcomp(int *x, int *y)
{   return *x - *y;
}

int a[1000000];

int main(void)
{   int i, n = 0;
    while (scanf("%d", &a[n]) != EOF)
        n++;
    qsort(a, n, sizeof(int), intcomp);
    for (i = 0; i < n; i++)
        printf("%d\n", a[i]);
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* bitsortgen.c -- gen $1 distinct integers from U[0,$2) */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 2000000
int x[MAXN];

int randint(int a, int b)
{   return a + (RAND_MAX*rand() + rand()) % (b + 1 - a);
}

int main(int argc, char *argv[])
{   int i, k, n, t, p;
    srand((unsigned) time(NULL));
    k = atoi(argv[1]);
    n = atoi(argv[2]);
    for (i = 0; i < n; i++)
        x[i] = i;
    for (i = 0; i < k; i++) {
        p = randint(i, n-1);
        t = x[p]; x[p] = x[i]; x[i] = t;
        printf("%d\n", x[i]);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* rotate.c -- time algorithms for rotating a vector
   Input lines: algnum numtests n rotdist
     algnum:
       1: reversal algorithm
       2: juggling algorithm
       22: juggling algorithm with mod rather than if
       3: gcd algorithm
       4: slide (don't rotate): baseline alg for timing
   To test the algorithms, recompile and change main to call testrot
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 10000000

int x[MAXN];
int rotdist, n;

/* Alg 1: Rotate by reversal */

void reverse(int i, int j)
{   int t;
    while (i < j) {
        t = x[i]; x[i] = x[j]; x[j] = t;
        i++;
        j--;
    }
}

void revrot(int rotdist, int n)
{   reverse(0, rotdist-1);
    reverse(rotdist, n-1);
    reverse(0, n-1);
}

/* Alg 2: Juggling (dolphin) rotation */

int gcd(int i, int j)
{   int t;
    while (i != 0) {
        if (j >= i)
            j -= i;
        else {
            t = i; i = j; j = t;
        }
    }
    return j;
}

void jugglerot(int rotdist, int n)
{   int cycles, i, j, k, t;
    cycles = gcd(rotdist, n);
    for (i = 0; i < cycles; i++) {
        /* move i-th values of blocks */
        t = x[i];
        j = i;

        for (;;) {
            k = j + rotdist;
            if (k >= n)
                k -= n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}

void jugglerot2(int rotdist, int n)
{   int cycles, i, j, k, t;
    cycles = gcd(rotdist, n);
    for (i = 0; i < cycles; i++) {
        /* move i-th values of blocks */
        t = x[i];
        j = i;
        for (;;) {
            /* Replace with mod below
            k = j + rotdist;
            if (k >= n)
                k -= n;
            */
            k = (j + rotdist) % n;
            if (k == i)
                break;
            x[j] = x[k];
            j = k;
        }
        x[j] = t;
    }
}

/* Alg 3: Recursive rotate (using gcd structure) */

void swap(int i, int j, int k) /* swap x[i..i+k-1] with x[j..j+k-1] */
{   int t;
    while (k-- > 0) {
        t = x[i]; x[i] = x[j]; x[j] = t;
        i++;
        j++;
    }
}

void gcdrot(int rotdist, int n)
{   int i, j, p;
    if (rotdist == 0 || rotdist == n)
        return;
    i = p = rotdist;
    j = n - p;
    while (i != j) {
        /* invariant:
           x[0  ..p-i  ] is in final position
           x[p-i..p-1  ] = a (to be swapped with b)
           x[p  ..p+j-1] = b (to be swapped with a)
           x[p+j..n-1  ] in final position
         */

        if (i > j) {
            swap(p-i, p, j);
            i -= j;
        } else {
            swap(p-i, p+j-i, i);
            j -= i;
        }
    }
    swap(p-i, p, i);
}

int isogcd(int i, int j)
{   if (i == 0) return j;
    if (j == 0) return i;
    while (i != j) {
        if (i > j)
            i -= j;
        else
            j -= i;
    }
    return i;
}

void testgcd()
{   int i, j;
    while (scanf("%d %d", &i, &j) != EOF)
        printf("%d\n", isogcd(i, j));
}

/* Test all algs */

void slide(int rotdist, int n) /* Benchmark: slide left rotdist (lose 0..rotdist-1) */
{   int i;
    for (i = rotdist; i < n; i++)
        x[i-rotdist] = x[i];
}

void initx()
{   int i;
    for (i = 0; i < n; i++)
        x[i] = i;
}

void printx()
{   int i;
    for (i = 0; i < n; i++)
        printf(" %d", x[i]);
    printf("\n");
}

void roterror()
{   fprintf(stderr, " rotate bug %d %d\n", n, rotdist);
    printx();
    exit(1);
}

void checkrot()
{   int i;

    for (i = 0; i < n-rotdist; i++)
        if (x[i] != i+rotdist)
            roterror();
    for (i = 0; i < rotdist; i++)
        if (x[n-rotdist+i] != i)
            roterror();
}

void testrot()
{   for (n = 1; n <= 20; n++) {
        printf(" testing n=%d\n", n);
        for (rotdist = 0; rotdist <= n; rotdist++) {
            initx(); revrot(rotdist, n);     checkrot();
            initx(); jugglerot(rotdist, n);  checkrot();
            initx(); jugglerot2(rotdist, n); checkrot();
            initx(); gcdrot(rotdist, n);     checkrot();
        }
    }
}

/* Timing */

void timedriver()
{   int i, algnum, numtests, start, clicks;
    while (scanf("%d %d %d %d", &algnum, &numtests, &n, &rotdist) != EOF) {
        initx();
        start = clock();
        for (i = 0; i < numtests; i++) {
            if (algnum == 1)
                revrot(rotdist, n);
            else if (algnum == 2)
                jugglerot(rotdist, n);
            else if (algnum == 22)
                jugglerot2(rotdist, n);
            else if (algnum == 3)
                gcdrot(rotdist, n);
            else if (algnum == 4)
                slide(rotdist, n);
        }
        clicks = clock() - start;
        printf("%d\t%d\t%d\t%d\t%d\t%g\n",
            algnum, numtests, n, rotdist, clicks,
            1e9*clicks / ((float) CLOCKS_PER_SEC*n*numtests));
    }
}

int main()
{   /* testrot(); */
    timedriver();
    return 0;
}


/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sign.c -- sign each word by its letters in sorted order
   E.g., the input word "stop" produces the line "opst stop"
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORDMAX 100

int charcomp(char *x, char *y)
{   return *x - *y;
}

int main()
{   char word[WORDMAX], sig[WORDMAX];
    while (scanf("%s", word) != EOF) {
        strcpy(sig, word);
        qsort(sig, strlen(sig), sizeof(char), charcomp);
        printf("%s %s\n", sig, word);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* squash.c -- print anagram classes on a single line
   The input lines "opst pots" and "opst stop" go to "pots stop"
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORDMAX 100

int main()
{   char word[WORDMAX], sig[WORDMAX], oldsig[WORDMAX];
    int linenum = 0;
    strcpy(oldsig, "");
    while (scanf("%s %s", sig, word) != EOF) {
        if (strcmp(oldsig, sig) != 0 && linenum > 0)
            printf("\n");
        strcpy(oldsig, sig);
        linenum++;
        printf("%s ", word);
    }
    printf("\n");
    return 0;
}
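sign.c and squash.c are designed to run in a pipeline around the system sort, as in Column 2: sign <dictionary | sort | squash >gramlist. A self-contained sketch of that pipeline, with fold/awk stages standing in for the two compiled programs (the stand-ins are my substitutes, not the book's code):

```shell
# Sign each word (sort its letters), sort by signature, then group each
# anagram class onto one line, as sign | sort | squash would.
printf 'pots\nstop\ntops\n' | while read w; do
  printf '%s %s\n' "$(printf '%s' "$w" | fold -w1 | sort | tr -d '\n')" "$w"
done | sort | awk '{ if ($1 != prev && NR > 1) printf "\n"; prev = $1; printf "%s ", $2 } END { printf "\n" }'
```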

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* search.c -- test and time binary and sequential search
   Select one of three modes by editing main() below.
     1.) Probe one function
     2.) Test one function extensively
     3.) Time all functions
   Input lines:  algnum n numtests
   Output lines: algnum n numtests clicks nanosecs_per_elem
   See timedriver for algnum codes
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 1000000
typedef int DataType;
DataType x[MAXN];
int n;

/* Scaffolding */

int i = -999999;
#define assert(v) { if ((v) == 0) printf("binarysearch bug %d %d\n", i, n); }

/* Alg 1: From Programming Pearls, Column 4: raw transliteration */

int binarysearch1(DataType t)
{   int l, u, m;
    l = 0;
    u = n-1;
    for (;;) {
        if (l > u)
            return -1;
        m = (l + u) / 2;
        if (x[m] < t)
            l = m+1;
        else if (x[m] == t)
            return m;
        else /* x[m] > t */
            u = m-1;
    }
}

/* Alg 2: Make binarysearch1 more C-ish */

int binarysearch2(DataType t)
{   int l, u, m;
    l = 0;
    u = n-1;
    while (l <= u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m+1;
        else if (x[m] == t)
            return m;
        else /* x[m] > t */
            u = m-1;

    }
    return -1;
}

/* Alg 3: From PP, Col 8 */

int binarysearch3(DataType t)
{   int l, u, m;
    l = -1;
    u = n;
    while (l+1 != u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m;
        else
            u = m;
    }
    if (u >= n || x[u] != t)
        return -1;
    return u;
}

/* Alg 4: From PP, Col 9 */

int binarysearch4(DataType t)
{   int l, p;
    if (n != 1000)
        return binarysearch3(t);
    l = -1;
    if (x[511] < t) l = 1000 - 512;
    if (x[l+256] < t) l += 256;
    if (x[l+128] < t) l += 128;
    if (x[l+64 ] < t) l += 64;
    if (x[l+32 ] < t) l += 32;
    if (x[l+16 ] < t) l += 16;
    if (x[l+8  ] < t) l += 8;
    if (x[l+4  ] < t) l += 4;
    if (x[l+2  ] < t) l += 2;
    if (x[l+1  ] < t) l += 1;
    p = l+1;
    if (p >= n || x[p] != t)
        return -1;
    return p;
}

/* Alg 9: Buggy, from Programming Pearls, Column 5 */

int sorted()
{   int i;
    for (i = 0; i < n-1; i++)
        if (x[i] > x[i+1])
            return 0;
    return 1;
}

int binarysearch9(DataType t)
{   int l, u, m;
    /* int oldsize, size = n+1; */
    l = 0;
    u = n-1;
    while (l <= u) {
        m = (l + u) / 2;
        if (x[m] < t)
            l = m;
        else if (x[m] > t)
            u = m;
        else {
            /* assert(x[m] == t); */
            return m;
        }
    }
    /* assert(x[l] > t && x[u] < t); */
    return -1;
}

/* Alg 21: Simple sequential search */

int seqsearch1(DataType t)
{   int i;
    for (i = 0; i < n; i++)
        if (x[i] == t)
            return i;
    return -1;
}

/* Alg 22: Faster sequential search: Sentinel */

int seqsearch2(DataType t)
{   int i;
    DataType hold = x[n];
    x[n] = t;
    for (i = 0; ; i++)
        if (x[i] == t)
            break;
    x[n] = hold;
    if (i == n)
        return -1;
    else
        return i;
}

/* Alg 23: Faster sequential search: loop unrolling */

int seqsearch3(DataType t)
{   int i;
    DataType hold = x[n];
    x[n] = t;
    for (i = 0; ; i+=8) {
        if (x[i]   == t) {         break; }
        if (x[i+1] == t) { i += 1; break; }
        if (x[i+2] == t) { i += 2; break; }
        if (x[i+3] == t) { i += 3; break; }
        if (x[i+4] == t) { i += 4; break; }
        if (x[i+5] == t) { i += 5; break; }
        if (x[i+6] == t) { i += 6; break; }
        if (x[i+7] == t) { i += 7; break; }
    }
    x[n] = hold;
    if (i == n)
        return -1;
    else
        return i;
}

/* Scaffolding to probe one algorithm */

void probe1()
{   int i;
    DataType t;
    while (scanf("%d %d", &n, &t) != EOF) {
        for (i = 0; i < n; i++)
            x[i] = 10*i;
        printf(" %d\n", binarysearch9(t));
    }
}

/* Torture test one algorithm */

#define s seqsearch3

void test(int maxn)
{   int i;
    for (n = 0; n <= maxn; n++) {
        printf("n=%d\n", n);
        /* distinct elements (plus one at the top for the sentinel) */
        for (i = 0; i <= n; i++)
            x[i] = 10*i;
        for (i = 0; i < n; i++) {
            assert(s(10*i) == i);
            assert(s(10*i - 5) == -1);
        }
        assert(s(10*n - 5) == -1);
        assert(s(10*n) == -1);
    }
}

/* Timing */

void timedriver()
{   int i, algnum, numtests, test, start, clicks;
    while (scanf("%d %d %d", &algnum, &n, &numtests) != EOF) {
        for (i = 0; i <= n; i++)
            x[i] = 10*i;
        start = clock();
        for (test = 0; test < numtests; test++)
            for (i = 0; i < n; i++) {
                switch (algnum) {
                case 1:  assert(binarysearch1(10*i) == i); break;
                case 2:  assert(binarysearch2(10*i) == i); break;
                case 3:  assert(binarysearch3(10*i) == i); break;
                case 4:  assert(binarysearch4(10*i) == i); break;
                case 9:  assert(binarysearch9(10*i) == i); break;
                case 21: assert(seqsearch1(10*i) == i); break;
                case 22: assert(seqsearch2(10*i) == i); break;
                case 23: assert(seqsearch3(10*i) == i); break;
                }
            }
        clicks = clock() - start;
        printf("%d\t%d\t%d\t%d\t%g\n",
            algnum, n, numtests, clicks,
            1e9*clicks / ((float) CLOCKS_PER_SEC*n*numtests));
    }
}

/* Main */

int main()
{   /* probe1(); */
    /* test(25); */
    timedriver();
    return 0;
}


/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* timemod0.c -- Edit main to time one operation */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main()
{   int i, n, ia, ib, ic;
    float fa, fb, fc;
    n = 1000000000; /* run time in secs gives nanosecs/loop */
    ia = ib = ic = 9;
    fa = fb = 9.0;
    for (i = 0; i < n; i++) {
        /* null loop         19.1 */
        /* ia = ib + ic;     17.7 */
        /* ia = ib - ic;     17.6 */
        /* ia = ib * ic;     17.7 */
        /* ia = ib % ic;     98.3 */
        /* ia = ib / ic;     98.3 */
        /* ia = rand();      41.5 */
        /* fa = sqrt(fb);    184  */
        /* free(malloc(8));  2400 */
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* maxsum.c -- time algs for maximum-sum subsequence
 *  Input: algnum, n
 *  Output: algnum, n, answer, ticks, secs
 *  See main for algnum codes
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 10000000
int n;
float x[MAXN];

void sprinkle() /* Fill x[n] with reals uniform on [-1,1] */
{   int i;
    for (i = 0; i < n; i++)
        x[i] = 1 - 2*( (float) rand()/RAND_MAX);
}

float alg1()
{   int i, j, k;
    float sum, maxsofar = 0;
    for (i = 0; i < n; i++)
        for (j = i; j < n; j++) {
            sum = 0;
            for (k = i; k <= j; k++)
                sum += x[k];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    return maxsofar;
}

float alg2()
{   int i, j;
    float sum, maxsofar = 0;
    for (i = 0; i < n; i++) {
        sum = 0;
        for (j = i; j < n; j++) {
            sum += x[j];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    }
    return maxsofar;
}

float cumvec[MAXN+1];

float alg2b()
{   int i, j;
    float *cumarr, sum, maxsofar = 0;
    cumarr = cumvec+1; /* to access cumarr[-1] */
    cumarr[-1] = 0;
    for (i = 0; i < n; i++)
        cumarr[i] = cumarr[i-1] + x[i];
    for (i = 0; i < n; i++) {
        for (j = i; j < n; j++) {

            sum = cumarr[j] - cumarr[i-1];
            if (sum > maxsofar)
                maxsofar = sum;
        }
    }
    return maxsofar;
}

/* MS VC++ has a max macro, and therefore a perf bug */
#ifdef max
#undef max
#endif
#define maxmac(a, b) ((a) > (b) ? (a) : (b))
float maxfun(float a, float b) { return a > b ? a : b; }
#define max(a, b) maxfun(a, b)

float recmax(int l, int u)
{   int i, m;
    float lmax, rmax, sum;
    if (l > u)  /* zero elements */
        return 0;
    if (l == u) /* one element */
        return max(0, x[l]);
    m = (l+u) / 2;
    /* find max crossing to left */
    lmax = sum = 0;
    for (i = m; i >= l; i--) {
        sum += x[i];
        if (sum > lmax)
            lmax = sum;
    }
    /* find max crossing to right */
    rmax = sum = 0;
    for (i = m+1; i <= u; i++) {
        sum += x[i];
        if (sum > rmax)
            rmax = sum;
    }
    return max(lmax + rmax, max(recmax(l, m), recmax(m+1, u)));
}

float alg3()
{   return recmax(0, n-1);
}

float alg4()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        if (maxendinghere < 0)
            maxendinghere = 0;
        if (maxsofar < maxendinghere)
            maxsofar = maxendinghere;

    }
    return maxsofar;
}

float alg4b()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        maxendinghere = maxmac(maxendinghere, 0);
        maxsofar = maxmac(maxsofar, maxendinghere);
    }
    return maxsofar;
}

float alg4c()
{   int i;
    float maxsofar = 0, maxendinghere = 0;
    for (i = 0; i < n; i++) {
        maxendinghere += x[i];
        maxendinghere = maxfun(maxendinghere, 0);
        maxsofar = maxfun(maxsofar, maxendinghere);
    }
    return maxsofar;
}

int main()
{   int algnum, start, clicks;
    float thisans;
    while (scanf("%d %d", &algnum, &n) != EOF) {
        sprinkle();
        start = clock();
        thisans = -1;
        switch (algnum) {
        case  1: thisans = alg1();  break;
        case  2: thisans = alg2();  break;
        case 22: thisans = alg2b(); break;
        case  3: thisans = alg3();  break;
        case  4: thisans = alg4();  break;
        case 42: thisans = alg4b(); break;
        case 43: thisans = alg4c(); break;
        default: break;
        }
        clicks = clock()-start;
        printf("%d\t%d\t%f\t%d\t%f\n",
            algnum, n, thisans, clicks,
            clicks / (float) CLOCKS_PER_SEC);
        if (alg4() != thisans)
            printf(" maxsum error: mismatch with alg4: %f\n", alg4());
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* genbins.c -- generate random numbers with bins
   If NODESIZE is 8, this program uses the special-case malloc.
   Change NODESIZE to 0 to use the system malloc.
 */

#include <stdio.h>
#include <stdlib.h>

#define NODESIZE 8
#define NODEGROUP 1000
int nodesleft = 0;
char *freenode;

void *pmalloc(int size)
{   void *p;
    if (size != NODESIZE)
        return malloc(size);
    if (nodesleft == 0) {
        freenode = malloc(NODEGROUP*NODESIZE);
        nodesleft = NODEGROUP;
    }
    nodesleft--;
    p = (void *) freenode;
    freenode += NODESIZE;
    return p;
}

struct node {
    int val;
    struct node *next;
};

struct node **bin, *sentinel;
int bins, bincnt, maxval;

void initbins(int maxelms, int pmaxval)
{   int i;
    bins = maxelms;
    maxval = pmaxval;
    bin = pmalloc(bins*sizeof(struct node *));
    sentinel = pmalloc(sizeof(struct node));
    sentinel->val = maxval;
    for (i = 0; i < bins; i++)
        bin[i] = sentinel;
    bincnt = 0;
}

struct node *rinsert(struct node *p, int t)
{   if (p->val < t) {
        p->next = rinsert(p->next, t);
    } else if (p->val > t) {
        struct node *q = pmalloc(sizeof(struct node));
        q->val = t;
        q->next = p;
        p = q;
        bincnt++;
    }
    return p;

}

void insert(int t)
{   int i;
    i = t / (1 + maxval/bins);
    bin[i] = rinsert(bin[i], t);
}

void report()
{   int i;
    struct node *p;
    for (i = 0; i < bins; i++)
        for (p = bin[i]; p != sentinel; p = p->next)
            /* printf("%d\n", p->val) */;
    /* Uncomment the printf for testing, comment it out for profiling */
}

int bigrand()
{   return RAND_MAX*rand() + rand();
}

int main(int argc, char *argv[])
{   int m = atoi(argv[1]);
    int n = atoi(argv[2]);
    initbins(m, n);
    while (bincnt < m)
        insert(bigrand() % n);
    report();
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* macfun.c -- time macro and function implementations of max
 *  Input: a sequence of (alg num, n) pairs.
 *  Output: for each test, (alg num, n, ans, ticks, secs)
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAXN 1000000
float x[MAXN];

/* arrmax1 -- max is a macro */

#define max1(a, b) ((a) > (b) ? (a) : (b))

float arrmax1(int n)
{   if (n == 1)
        return x[0];
    else
        return max1(x[n-1], arrmax1(n-1));
}

/* arrmax2 -- max is a function */

float max2(float a, float b)
{   return a > b ? a : b;
}

float arrmax2(int n)
{   if (n == 1)
        return x[0];
    else
        return max2(x[n-1], arrmax2(n-1));
}

/* arrmax3 -- MS VC++ stdlib defines max as a macro */

#ifndef max
#define max(a, b) max2(a, b)
#endif

float arrmax3(int n)
{   if (n == 1)
        return x[0];
    else
        return max(x[n-1], arrmax3(n-1));
}

int main()
{   int algnum, i, n, start, clicks;
    float thisans;
    for (i = 0; i < MAXN; i++)
        x[i] = MAXN-i;
    while (scanf("%d %d", &algnum, &n) != EOF) {
        start = clock();

        thisans = -1;
        switch (algnum) {
        case 1: thisans = arrmax1(n); break;
        case 2: thisans = arrmax2(n); break;
        case 3: thisans = arrmax3(n); break;
        default: break;
        }
        clicks = clock()-start;
        printf("%d\t%d\t%g\t%d\t%g\n",
            algnum, n, thisans, clicks,
            clicks / (float) CLOCKS_PER_SEC);
    }
    return 0;
}

/* Copyright (C) 1999 Lucent Technologies */
/* From 'Programming Pearls' by Jon Bentley */

/* sort.cpp -- test and time sorting algorithms
   Input lines:  algnum n mod
   Output lines: algnum n mod clicks nanosecs_per_elem
   This is predominantly a C program; the only use of C++
   is the sort function immediately below the include line.
 */

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// To change from C++ back to C, remove the following two lines
// and the call to sort in main
#include <algorithm>
using namespace std;

/* Data and supporting functions */

#define MAXN 10000000
typedef int DType;
DType realx[MAXN];
int *x = realx; /* allow x to shift for heaps */
int n;

void swap(int i, int j)
{   DType t = x[i];
    x[i] = x[j];
    x[j] = t;
}

int randint(int l, int u)
{   return l + (RAND_MAX*rand() + rand()) % (u-l+1);
}

/* LIBRARY QSORT */

int intcomp(int *x, int *y)
{   return *x - *y;
}

/* INSERTION SORTS */

/* Simplest insertion sort */
void isort1()
{   int i, j;
    for (i = 1; i < n; i++)
        for (j = i; j > 0 && x[j-1] > x[j]; j--)
            swap(j-1, j);
}

/* Write swap function inline */
void isort2()
{   int i, j;
    DType t;
    for (i = 1; i < n; i++)
        for (j = i; j > 0 && x[j-1] > x[j]; j--) {
            t = x[j];
            x[j] = x[j-1];

            x[j-1] = t;
        }
}

/* Move assignments to and from t out of loop */
void isort3()
{   int i, j;
    DType t;
    for (i = 1; i < n; i++) {
        t = x[i];
        for (j = i; j > 0 && x[j-1] > t; j--)
            x[j] = x[j-1];
        x[j] = t;
    }
}

/* QUICKSORTS */

/* Simplest version, Lomuto partitioning */
void qsort1(int l, int u)
{   int i, m;
    if (l >= u)
        return;
    m = l;
    for (i = l+1; i <= u; i++)
        if (x[i] < x[l])
            swap(++m, i);
    swap(l, m);
    qsort1(l, m-1);
    qsort1(m+1, u);
}
