The algorithm for the simplest variant of that problem (take a single random sample from a list of a priori unknown length) can be derived by simple inductive reasoning. If the list has size 1, the problem is trivially solved. Otherwise suppose we have a list (x:xs) and we recursively solve the problem for xs. If all the recursion returns is the random sample from xs, we clearly have insufficient information to take a random sample from (x:xs), so we need to strengthen the induction hypothesis by having the recursion also return the length of xs. Let (r', n') be the random sample and length for xs. Then the result for (x:xs) is (r, n) where n = n' + 1 and r = x with probability 1/n or else r'.
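In case it helps to see it concretely, here's a minimal Haskell sketch of that induction (the function name and the use of randomRIO are just illustrative):

    import System.Random (randomRIO)

    -- Return (sample, length) for a non-empty list: strengthen the induction
    -- hypothesis by carrying the length along with the sample.
    sampleOne :: [a] -> IO (a, Int)
    sampleOne []     = error "sampleOne: empty list"
    sampleOne [x]    = return (x, 1)
    sampleOne (x:xs) = do
      (r', n') <- sampleOne xs
      let n = n' + 1
      k <- randomRIO (1, n)                 -- uniform in 1..n
      return (if k == 1 then x else r', n)  -- keep x with probability 1/n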
Another way to derive this algorithm is to notice that any solution must implicitly or explicitly compute the length of the list as a byproduct. If you write down the recursive function for computing the length of a list, compose it with the straightforward recursive function for taking a random sample from a list of known length, and apply a standard fusion/deforestation transformation to combine the two traversals into a single pass, you end up with the same algorithm as above.
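For comparison, here's the two-pass starting point (again just a sketch); fusing the length pass with the indexing pass gives back the one-pass function above:

    import System.Random (randomRIO)

    -- Two traversals: one to compute the length, one to index into the list.
    sampleTwoPass :: [a] -> IO a
    sampleTwoPass xs = do
      i <- randomRIO (0, length xs - 1)
      return (xs !! i)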
Here's a fun problem to ponder. Find the most frequently occurring element of a list in O(n) time and O(1) space. You may assume that this element occupies more than half of the list's entries.
No, r has the same type as x. The logic behind that probability split is simple: we have r' (the random sample taken from the tail list xs) and x, and we have to choose between them with some probability based on n. Clearly x should occur with a 1 in n chance; and since by induction each element of xs is r' with probability 1/n', keeping r' with probability (n-1)/n leaves each of them with probability (1/n')((n-1)/n) = 1/n as well.
When you approach it inductively the way I did, there really isn't any choice in the matter, which is why this algorithm design technique is so powerful. Udi Manber developed the technique in his 1988 paper Using Induction to Design Algorithms and later used it throughout his great old book Introduction to Algorithms: A Creative Approach (1989). Here's the paper in case you're curious: http://akira.ruc.dk/~keld/teaching/algoritmedesign_f05/Artik...
This way of thinking also makes it easy to derive the max-sum algorithm from Programming Pearls:
We have to compute max-sum for (x:xs), so first solve it for xs. If the max-sum lies entirely within xs, we are done; if not, it somehow involves x. To decide which case we're in, the recursion also has to return the best sum of a prefix of the list.
A small case analysis determines the new max-sum: it is x, x + max-begin(xs), or max-sum(xs). And the new max-begin is either x or x + max-begin(xs).
The naive recursive implementation will require linear stack space, though, so there is an extra step: figuring out how to eliminate the recursion.
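A minimal Haskell sketch of both versions, assuming integer elements and a non-empty list (the names are mine, not Bentley's or Manber's):

    import Data.List (foldl')

    -- Inductive version: return (max-sum, max-begin) so x can be folded in
    -- by the small case analysis above.
    maxSum :: [Int] -> (Int, Int)
    maxSum []     = error "maxSum: empty list"
    maxSum [x]    = (x, x)
    maxSum (x:xs) = (max best' begin, begin)
      where
        (best', begin') = maxSum xs
        begin = max x (x + begin')   -- new max-begin: x or x + max-begin(xs)

    -- Recursion eliminated: a single left-to-right pass carrying
    -- (best segment so far, best segment ending at the current element);
    -- by symmetry it computes the same maximum.
    maxSumIter :: [Int] -> Int
    maxSumIter []     = error "maxSumIter: empty list"
    maxSumIter (x:xs) = fst (foldl' step (x, x) xs)
      where
        step (best, ending) y =
          let ending' = max y (ending + y)
          in  (max best ending', ending')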
Yes, Manber derives the linear-time algorithm for the maximal subsum problem exactly like that in his textbook. As you say, if x is involved in a maximal subsum, it must extend the tail's maximal prefix sum, so you return that in addition to the maximal subsum.
Regarding recursion vs iteration, Manber generally develops the right induction hypothesis gradually, using informal language ("remove the element and solve the smaller problem") in the process, and only turns that into a precise algorithm once the final induction hypothesis has been found, so he doesn't present the recursion elimination as a separate step. That works well pedagogically. His main idea is to guide the student's intuitions rather than formally derive a program that's correct by construction the way someone like Richard Bird might have done it. (Bird has an interesting derivation of the maximal subsum algorithm that begins with the brute-force algorithm written as a combinator-based functional program and gradually transforms it in a correctness-preserving way using program calculation techniques until arriving at the fast linear-time program.)
This algorithm is used in Yammer Metrics to keep running statistics on whatever quantities you want to track. For instance, you could time a particular section of code, and it will tell you the average time, the standard deviation, the 99th percentile time, and so on. It's tremendously useful, and the algorithm from that paper also makes it very cheap.
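For a rough idea of how that works (not necessarily the exact variant Metrics uses): keep a fixed-size uniform reservoir - the k-sample generalization of the single-sample trick above, i.e. Vitter's Algorithm R - and read percentiles off the sorted sample. The list-based update below is illustrative only; a real implementation would use a mutable array.

    import System.Random (randomRIO)
    import Data.List (sort)

    -- Keep the first k items; after that, the nth item replaces a uniformly
    -- chosen slot with probability k/n.
    reservoir :: Int -> [a] -> IO [a]
    reservoir k xs = go (take k xs) (drop k xs) (k + 1)
      where
        go res []     _ = return res
        go res (y:ys) n = do
          j <- randomRIO (1, n)
          let res' = if j <= k
                       then take (j - 1) res ++ y : drop j res
                       else res
          go res' ys (n + 1)

    -- Read a percentile (e.g. p = 0.99) off the sorted, non-empty sample.
    percentile :: Ord a => Double -> [a] -> a
    percentile p sample = sorted !! min (len - 1) (floor (p * fromIntegral len))
      where
        sorted = sort sample
        len    = length sample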
Seeing Leslie Lamport's first paper (about braids, for a high school math journal) makes me want to put together a list of computer scientists' first papers. Now I've got two favorites:
A Fast File System for UNIX: http://www.cs.berkeley.edu/~brewer/cs262/FFS.pdf Modern filesystems are based on this design. They add significant features - like journaling - but reading this paper gives you a solid foundation for how modern file systems work.
The Google Filesystem: http://labs.google.com/papers/gfs.html Great example of what you can design when you decide certain things - like storing small files - are not important.
What would be useful is a site that collects links to CS papers, then for each paper lets users (who presumably actually read that paper) comment on it, and/or rate it. Or does such a beast already exist?
Mendeley [1] does some of what you're talking about. Primarily it's a way to organize your own collection of papers and access them from native or web apps on various devices. It also has some social features, like connecting users with similar interests and sharing tags on papers.
Its "Computer and Information Science" category [2] lists the papers with the most "readers" which I think in this case means "Mendeley users who have added this paper to their collection". Its most read paper in that category is the MapReduce paper [3]. Oddly, as I look at it now, I don't see a feature for rating or commenting on the paper, which could be a useful addition.
Of course, the authoritative method of rating papers is citation by other papers, right? That's what led to PageRank after all, and most of the main sites for finding papers have used citation counts for a long time.
Oops, correction: the MapReduce paper doesn't have the most readers in Mendeley. I just saw it at the top of the "Popular" list on the computer and information science page, so they must be sorting that list some other way.
Why? Most of the contents of these papers have made their way into books and programming languages. There's nothing wrong with getting it from a secondary source.
I find that reading primary sources reminds me that I can be a primary source. It reminds me that these ideas that I rely on were thought up by people, not handed down from Mt. Olympus.
The historian in me is also just fascinated to see the source of an idea that has made a large impact.
This is a great collection of papers, and I don't mean to downplay how happy I am to find it, but you might want to use a linking method that's more obvious, like having relevant text be blue and underlined. The principle of least surprise is important, usually more important than cleverness in design.
(Now I'm going to go off and read that Lamport paper.)
"if [...] you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will."
I don't mean to speak badly of this bibliography. Of the papers in the list I've read, they're all great. And I'll add the others to my list and make sure I get to them.
But I think the premise of the blog post is a little flawed. Reading papers is a poor way to make yourself a better programmer. Read them in spare moments, sure, but spend your time reading code, not papers. At best, these things will help you avoid some design mistakes in the code you write for yourself in your own sandboxes.
In the real world, you're dealing with code written by people who haven't read these papers. And this is where you're going to spend all your time: maintaining and fixing and enhancing stuff that missed all the advice in the papers. This is especially true of "day job" programmers, but it's true at startups too, even at seed-stage ones.
Admonitions to do stuff like re-read "Out of the Tar Pit" every six months are just bad advice to my mind. It's a good way to convince yourself you're smarter than everyone else. It's a bad way to get better at debugging.
These things are not mutually exclusive. I agree that you should definitely read code too, both good and bad. However, this is a post about one specific thing: great papers.
"Admonitions"
That's a bit strong. I listed some papers that I like and believe important. I don't recall providing admonitions if you didn't read them.
re-read "Out of the Tar Pit" every six
months is just bad advice to my mind.
That's just me. I like that paper and so I re-read it. I wouldn't say that you nor anyone else should do the same.
"It's a good way to convince yourself you're smarter than everyone else."
Most programmers I know never read papers; they read plenty of code, but the code usually sucks. The reason it sucks is that it was written by programmers who never read papers, and so the cycle continues.
If you don't start somewhere, you'll miss out on papers as an excellent source of knowledge about algorithms, data structures, and the like. They also introduce you to a more mathematical approach to problem solving.
More papers, less code; and if you read code for its educational value, make sure it is excellent code.
Then go back to your daily drudge of fixing bugs in crappy code, hopefully transforming it into something more elegant along the way with the knowledge you pick up.
I would heartily disagree. As an example, reading Lamport's paper on logical clocks had an immediate effect on how I thought about a real-world problem, and it has stayed with me since.
My first reading of "GOTOs considered harmful" led to the phase of my career where I was all about structured programming. (For s perspective, much HLL programming then was in fortran with the three-branch IF-statement.)
"Admonitions to do stuff like re-read 'Out of the Tar Pit' every six months are just bad advice to my mind. It's a good way to convince yourself you're smarter than everyone else. It's a bad way to get better at debugging."

The very best way to get better at debugging is not to put the bugs there in the first place. The Tar Pit paper is one step in that direction. And it is important to keep at it, because our habits keep trying to put state back in.
So while it sounds like you are saying "get your inspiration from reading bad code", I suspect that you don't really mean that.
"The very best way to get better at debugging is to not put bugs there in the first place"
Actually, I've come to the opposite conclusion recently.
During college, as well as for all of my personal projects, and during the early part of my career, I had the fortune of mostly working on projects for which I was the sole coder from start to finish.
As you say, when you're the only variable in the equation, you eliminate the need for strong debugging skills by following a set of best practices which tends to prevent creation of bugs in the first place.
And then I inherited my first seriously brain-damaged codebase: 16,000 lines of "what not to do". All of my favorite practices listed on the c2 wiki were completely ignored. Massive copy-paste coding. State BOOLs scattered throughout the app, making it extremely fragile and tightly coupled. Thread safety? What's that? Hell, they couldn't even avoid giving methods and variables ambiguous names, even in the simplest of cases ("buttonClick" - wtf? Is that a command (e.g. clickButton) or an event (e.g. buttonClicked)?)
I was wholly unprepared for debugging this monstrosity. My first instinct was to rewrite the whole thing. In fact, a co-worker of mine had attempted exactly that before I was assigned to this beast, but introduced his own set of show-stopping bugs before being pulled off onto another project, ultimately wasting two weeks of work (his branch was abandoned) and souring the waters for any such attempt on my part.
For several months I forced myself to avoid the urge to rewrite, and just slogged through it until I finally started to understand what was going on.
I have to say, forcing myself to continue to chew on something sour when every instinct told me to spit it out really improved my ability to follow what was happening in a foreign codebase. Now, even when bughunting relatively sane codebases, I feel I'm more quickly able to zero in on the problem.
I agree with what you are saying in the context of taking on another's program. In similar situations, I have also attempted to rewrite the body of code. Once it was a necessity, as the code was perhaps 50% wishful thinking and never worked. I have tried in other cases, unwisely, and failed.
But when confronted with a pile of bad but working software, I have done what you are saying here: dive in and understand what is happening.
"The very best way to get better at debugging is to not put bugs there in the first place."
You realize how terribly naive this is, right? Everyone, no matter how smart, no matter how experienced, and no matter how many papers they read, is going to put giant whoppers of idiocy into their code on a regular basis. And finding these is where we spend all our time.
Your answer is an echo of the blog post: you're acting as if the act of writing code is the thing that needs optimization. It's not, though maybe it seems that way early in your career. Code (successful code, anyway) lives long after it's written, and that phase is vastly more expensive. And papers don't help there. Sorry, but they don't.
Read the papers, they're good. But stop pretending that they're the path to enlightenment. Our profession, ultimately, isn't about enlightenment. It's about results.
"You realize how terribly naive this is, right?"

This is out of line.
"It's not, though maybe it seems that way early in your career."

I am no longer early in my career. I recently viewed code that is in production today that I wrote in 1966. Most of it had survived. And the hours I invested in writing it swamped any maintenance hours put into it after that.
"... put giant whoppers of idiocy into their code on a regular basis. And finding these is where we spend all our time."

Perhaps you do; my code comes out better than that. I attribute the difference to having read papers. I am not pretending.
- we're not all dealing with code written by people who haven't read these papers
- in the real world, other people are also dealing with your code and the better it is, the easier it is for them, which in turn makes things easier for you
- people that have read these papers are often in a position to teach others, so the future code of those taught will be better. I will definitely relay what I learn from a paper to my colleagues when applicable
- sometimes somebody that has read a paper is needed to solve the mess the others have created: a mess that is impossible to solve without reinventing what is stated in the appropriate paper. And even though it seems obvious after reading the paper, you wouldn't come up with it yourself.
- there are things to be learned from the thorough analysis of others that you wouldn't ever have dreamed up yourself.
- there are things to be learned from the thorough analysis of others that you wouldn't ever learn if all you've been doing is reading code by others who haven't read these papers. And even if you read code from people who have read the papers: understanding something from code is much harder than understanding it from a paper in which it is clearly laid out. You wouldn't grasp the fundamentals behind Lamport's paper or the Dynamo paper nearly as fast.
You're completely right in your last sentence. Very few papers will help you get better at debugging. But programming is about a lot more than debugging.
I'm interested in distributed systems but am only vaguely familiar with locks. I could pull up the Hadoop codebase and trudge my way through some distributed systems code. But it would take forever, and I'd have to guess at a lot of core concepts after a lot of confusion and effort trying to build complex mental models of what the code is doing in my head.
I think I would get a lot farther a lot faster if I read some things, be they papers or textbooks or Wikipedia entries, about systems. Then, once I'm at least sort of familiar with the core concepts that one applies when writing distributed systems, I could get to reading and writing real code.
There's a time and a place for both, but if there's a complex idea that one is not familiar with, I think one can much more efficiently understand it at some level--and then maybe dive into the code--by reading a 20-page paper about it than by trying to understand the 100KLOC implementation thereof. Being exposed to the high-level picture makes it way easier to understand why each piece is there doing what it's doing.
Not buying it. Programmers in our field could do worse than be up-to-date on the history of the major ideas in computer science, before they fire up their editor and start hacking stuff out. Obviously, you can do both, too.
I disagree with you. All the programmers I know have plenty of coding/reviews in their day-to-day. A much smaller amount of time is dedicated to reading papers - after all, it's not their job. If we can agree that both are valuable, why is the author wrong for pointing people to things that they likely focus less on?
The post is a response to the original Object Mentor list of 10 papers; this post claims to be oriented more toward technical depth and less toward philosophy and design.
It's almost a toy problem, but I found the paper really interesting:
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.7...