The algorithm for the simplest variant of that problem (take a single random sample from a list of a priori unknown length) can be derived by simple inductive reasoning. If the list has size 1, the problem is trivially solved. Otherwise suppose we have a list (x:xs) and we recursively solve the problem for xs. If all the recursion returns is the random sample from xs, we clearly have insufficient information to take a random sample from (x:xs), so we need to strengthen the induction hypothesis by having the recursion also return the length of xs. Let (r', n') be the random sample and length for xs. Then the result for (x:xs) is (r, n) where n = n' + 1 and r = x with probability 1/n or else r'.
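In case it helps to see it concretely, here's a minimal Haskell sketch of that induction (the function name and the use of randomRIO are just illustrative):

    import System.Random (randomRIO)

    -- Return (sample, length) for a non-empty list: strengthen the induction
    -- hypothesis by carrying the length along with the sample.
    sampleOne :: [a] -> IO (a, Int)
    sampleOne []     = error "sampleOne: empty list"
    sampleOne [x]    = return (x, 1)
    sampleOne (x:xs) = do
      (r', n') <- sampleOne xs
      let n = n' + 1
      k <- randomRIO (1, n)                 -- uniform in 1..n
      return (if k == 1 then x else r', n)  -- keep x with probability 1/n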
Another way to derive this algorithm is to notice that any solution must implicitly or explicitly compute the length of the list as a byproduct. If you write down the recursive function for computing the length of a list, compose it with the straightforward recursive function for taking a random sample from a list of known length, and apply a standard fusion/deforestation transformation to combine the two traversals into a single pass, you end up with the same algorithm as above.
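For comparison, here's the two-pass starting point (again just a sketch); fusing the length pass with the indexing pass gives back the one-pass function above:

    import System.Random (randomRIO)

    -- Two traversals: one to compute the length, one to index into the list.
    sampleTwoPass :: [a] -> IO a
    sampleTwoPass xs = do
      i <- randomRIO (0, length xs - 1)
      return (xs !! i)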
Here's a fun problem to ponder. Find the most frequently occurring element of a list in O(n) time and O(1) space. You may assume that this element occupies more than half of the list's entries.
No, r has the same type as x. The logic behind that probability split is simple: we have r' (the random sample taken from the tail list xs) and x, and we have to choose between them with some probability based on n. Clearly x should occur with a 1 in n chance; and since by induction each element of xs is r' with probability 1/n', keeping r' with probability (n-1)/n leaves each of them with probability (1/n')((n-1)/n) = 1/n as well.
When you approach it inductively the way I did, there really isn't any choice in the matter, which is why this algorithm design technique is so powerful. Udi Manber developed the technique in his 1988 paper Using Induction to Design Algorithms and later used it throughout his great old book Introduction to Algorithms: A Creative Approach (1989). Here's the paper in case you're curious: http://akira.ruc.dk/~keld/teaching/algoritmedesign_f05/Artik...
This way of thinking also makes it easy to derive the max-sum algorithm from Programming Pearls:
We have to compute max-sum for (x:xs), so first solve it for xs. If the max-sum lies entirely within xs, we are done; if not, it somehow involves x. To decide which case we're in, the recursion also has to return the best sum of a prefix of the list.
A small case analysis determines the new max-sum: it is x, x + max-begin(xs), or max-sum(xs). And the new max-begin is either x or x + max-begin(xs).
The naive recursive implementation will require linear stack space, though, so there is an extra step: figuring out how to eliminate the recursion.
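A minimal Haskell sketch of both versions, assuming integer elements and a non-empty list (the names are mine, not Bentley's or Manber's):

    import Data.List (foldl')

    -- Inductive version: return (max-sum, max-begin) so x can be folded in
    -- by the small case analysis above.
    maxSum :: [Int] -> (Int, Int)
    maxSum []     = error "maxSum: empty list"
    maxSum [x]    = (x, x)
    maxSum (x:xs) = (max best' begin, begin)
      where
        (best', begin') = maxSum xs
        begin = max x (x + begin')   -- new max-begin: x or x + max-begin(xs)

    -- Recursion eliminated: a single left-to-right pass carrying
    -- (best segment so far, best segment ending at the current element);
    -- by symmetry it computes the same maximum.
    maxSumIter :: [Int] -> Int
    maxSumIter []     = error "maxSumIter: empty list"
    maxSumIter (x:xs) = fst (foldl' step (x, x) xs)
      where
        step (best, ending) y =
          let ending' = max y (ending + y)
          in  (max best ending', ending')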
Yes, Manber derives the linear-time algorithm for the maximal subsum problem exactly like that in his textbook. As you say, if x is involved in a maximal subsum, it must extend the tail's maximal prefix sum, so you return that in addition to the maximal subsum.
Regarding recursion vs iteration, Manber generally develops the right induction hypothesis gradually, using informal language ("remove the element and solve the smaller problem") in the process, and only turns that into a precise algorithm once the final induction hypothesis has been found, so he doesn't present the recursion elimination as a separate step. That works well pedagogically. His main idea is to guide the student's intuitions rather than formally derive a program that's correct by construction the way someone like Richard Bird might have done it. (Bird has an interesting derivation of the maximal subsum algorithm that begins with the brute-force algorithm written as a combinator-based functional program and gradually transforms it in a correctness-preserving way using program calculation techniques until arriving at the fast linear-time program.)
This algorithm is used in Yammer Metrics to keep running statistics on whatever quantities you want to track. For instance, you could time a particular section of code, and it will tell you the average time, the standard deviation, the 99th percentile time, and so on. It's tremendously useful, and the algorithm from that paper also makes it very cheap.
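For a rough idea of how that works (not necessarily the exact variant Metrics uses): keep a fixed-size uniform reservoir - the k-sample generalization of the single-sample trick above, i.e. Vitter's Algorithm R - and read percentiles off the sorted sample. The list-based update below is illustrative only; a real implementation would use a mutable array.

    import System.Random (randomRIO)
    import Data.List (sort)

    -- Keep the first k items; after that, the nth item replaces a uniformly
    -- chosen slot with probability k/n.
    reservoir :: Int -> [a] -> IO [a]
    reservoir k xs = go (take k xs) (drop k xs) (k + 1)
      where
        go res []     _ = return res
        go res (y:ys) n = do
          j <- randomRIO (1, n)
          let res' = if j <= k
                       then take (j - 1) res ++ y : drop j res
                       else res
          go res' ys (n + 1)

    -- Read a percentile (e.g. p = 0.99) off the sorted, non-empty sample.
    percentile :: Ord a => Double -> [a] -> a
    percentile p sample = sorted !! min (len - 1) (floor (p * fromIntegral len))
      where
        sorted = sort sample
        len    = length sample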
Seeing Leslie Lamport's first paper (about braids, for a high school math journal) makes me want to put together a list of computer scientists' first papers. Now I've got two favorites:
A Fast File System for UNIX: http://www.cs.berkeley.edu/~brewer/cs262/FFS.pdf Modern filesystems are based on this design. They add significant features - like journaling - but reading this paper gives you a solid foundation for how modern file systems work.
The Google Filesystem: http://labs.google.com/papers/gfs.html Great example of what you can design when you decide certain things - like storing small files - are not important.
What would be useful is a site that collects links to CS papers, then for each paper lets users (who presumably actually read that paper) comment on it, and/or rate it. Or does such a beast already exist?
Mendeley [1] does some of what you're talking about. Primarily it's a way to organize your own collection of papers and access them from native or web apps on various devices. It also has some social features, like connecting users with similar interests and sharing tags on papers.
Its "Computer and Information Science" category [2] lists the papers with the most "readers" which I think in this case means "Mendeley users who have added this paper to their collection". Its most read paper in that category is the MapReduce paper [3]. Oddly, as I look at it now, I don't see a feature for rating or commenting on the paper, which could be a useful addition.
Of course, the authoritative method of rating papers is citation by other papers, right? That's what led to PageRank after all, and most of the main sites for finding papers have used citation counts for a long time.
Oops, correction: the MapReduce paper doesn't have the most readers in Mendeley. I just saw it at the top of the "Popular" list on the computer and information science page, so they must be sorting that list some other way.
Why? Most of the contents of these papers have made their way into books and programming languages. There's nothing wrong with getting it from a secondary source.
I find that reading primary sources reminds me that I can be a primary source. It reminds me that these ideas that I rely on were thought up by people, not handed down from Mt. Olympus.
The historian in me is also just fascinated to see the source of an idea that has made a large impact.
This is a great collection of papers, and I don't mean to downplay how happy I am to find it, but you might want to use a linking method that's more obvious, like having relevant text be blue and underlined. The principle of least surprise is important, usually more important than cleverness in design.
(Now I'm going to go off and read that Lamport paper.)
"if [...] you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will."
I don't mean to speak badly of this bibliography. Of the papers in the list I've read, they're all great. And I'll add the others to my list and make sure I get to them.
But I think the premise of the blog post is a little flawed. Reading papers is a poor way to make yourself a better programmer. Read them in spare moments, sure, but spend your time reading code, not papers. At best, these things will help you avoid some design mistakes in the code you write for yourself in your own sandboxes.
In the real world, you're dealing with code written by people who haven't read these papers. And this is where you're going to spend all your time: maintaining and fixing and enhancing stuff that missed all the advice in the papers. This is especially true of "day job" programmers, but it's true at startups too, even at seed-stage ones.
Admonitions to do stuff like re-read "Out of the Tar Pit" every six months are just bad advice to my mind. It's a good way to convince yourself you're smarter than everyone else. It's a bad way to get better at debugging.
These things are not mutually exclusive. I agree that you should definitely read code too, both good and bad. However, this is a post about one specific thing: great papers.
"Admonitions"
That's a bit strong. I listed some papers that I like and believe important. I don't recall providing admonitions if you didn't read them.
re-read "Out of the Tar Pit" every six
months is just bad advice to my mind.
That's just me. I like that paper and so I re-read it. I wouldn't say that you nor anyone else should do the same.
"It's a good way to convince yourself you're smarter than everyone else."
Most programmers I know never read papers; they read plenty of code, but the code usually sucks. The reason it sucks is that it was written by programmers who never read papers, and so the cycle continues.
If you don't start somewhere, you'll miss out on papers as an excellent source of knowledge about algorithms, data structures, and the like. They also introduce you to a more mathematical approach to problem solving.
More papers, less code; and if you read code for its educational value, make sure it is excellent code.
Then go back to your daily drudge of fixing bugs in crappy code, hopefully transforming it into something more elegant along the way with the knowledge you pick up.
I would heartily disagree. As an example, reading Lamport's paper on logical clocks had an immediate effect on how I thought about a real-world problem, and it has stayed with me since.
My first reading of "GOTOs considered harmful" led to the phase of my career where I was all about structured programming. (For s perspective, much HLL programming then was in fortran with the three-branch IF-statement.)
"Admonitions to do stuff like re-read 'Out of the Tar Pit' every six months are just bad advice to my mind. It's a good way to convince yourself you're smarter than everyone else. It's a bad way to get better at debugging."

The very best way to get better at debugging is not to put the bugs there in the first place. The Tar Pit paper is one step in that direction. And it is important to keep at it, because our habits keep trying to put state back in.
So while it sounds like you are saying "get your inspiration from reading bad code", I suspect that you don't really mean that.
"The very best way to get better at debugging is to not put bugs there in the first place"
Actually, I've come to the opposite conclusion recently.
During college, as well as for all of my personal projects, and during the early part of my career, I had the fortune of mostly working on projects for which I was the sole coder from start to finish.
As you say, when you're the only variable in the equation, you eliminate the need for strong debugging skills by following a set of best practices which tends to prevent creation of bugs in the first place.
And then I inherited my first seriously brain-damaged codebase: 16,000 lines of "what not to do". All of my favorite practices listed on the c2 wiki were completely ignored. Massive copy-paste coding. State BOOLs scattered throughout the app, making it extremely fragile and tightly coupled. Thread safety? What's that? Hell, they couldn't even avoid giving methods and variables ambiguous names, even in the simplest of cases ("buttonClick" - wtf? Is that a command (e.g. clickButton) or an event (e.g. buttonClicked)?)
I was wholly unprepared for debugging this monstrosity. My first instinct was to rewrite the whole thing. In fact, a co-worker of mine had attempted exactly that before I was assigned to this beast, but introduced his own set of show-stopping bugs before being pulled off onto another project, ultimately wasting two weeks of work (his branch was abandoned) and souring the waters for any such attempt on my part.
For several months I forced myself to avoid the urge to rewrite, and just slogged through it until I finally started to understand what was going on.
I have to say, forcing myself to continue to chew on something sour when every instinct told me to spit it out really improved my ability to follow what was happening in a foreign codebase. Now, even when bughunting relatively sane codebases, I feel I'm more quickly able to zero in on the problem.
I agree with what you are saying in the context of taking on another's program. In similar situations, I have also attempted to rewrite the body of code. Once it was a necessity, as the code was perhaps 50% wishful thinking and never worked. I have tried in other cases, unwisely, and failed.
But when confronted with a pile of bad but working software, I have done what you are saying here: dive in and understand what is happening.
"The very best way to get better at debugging is to not put bugs there in the first place."
You realize how terribly naive this is, right? Everyone, no matter how smart, no matter how experienced, and no matter how many papers they read, is going to put giant whoppers of idiocy into their code on a regular basis. And finding these is where we spend all our time.
Your answer is an echo of the blog post: you're acting as if the act of writing code is the thing that needs optimization. It's not, though maybe it seems that way early in your career. Code (successful code, anyway) lives long after it's written, and that phase is vastly more expensive. And papers don't help there. Sorry, but they don't.
Read the papers, they're good. But stop pretending that they're the path to enlightenment. Our profession, ultimately, isn't about enlightenment. It's about results.
"You realize how terribly naive this is, right?"

This is out of line.
"It's not, though maybe it seems that way early in your career."

I am no longer early in my career. I recently viewed code that is in production today that I wrote in 1966. Most of it had survived. And the hours I invested in writing it swamped any maintenance hours put into it after that.
"... put giant whoppers of idiocy into their code on a regular basis. And finding these is where we spend all our time."

Perhaps you do; my code comes out better than that. I attribute the difference to having read papers. I am not pretending.
- we're not all dealing with code written by people who haven't read these papers
- in the real world, other people are also dealing with your code and the better it is, the easier it is for them, which in turn makes things easier for you
- people that have read these papers are often in a position to teach others, so the future code of those taught will be better. I will definitely relay what I learn from a paper to my colleagues when applicable
- sometimes somebody that has read a paper is needed to solve the mess the others have created: a mess that is impossible to solve without reinventing what is stated in the appropriate paper. And even though it seems obvious after reading the paper, you wouldn't come up with it yourself.
- there are things to be learned from the thorough analysis of others that you wouldn't ever have dreamed up yourself.
- there are things to be learned from the thorough analysis of others that you wouldn't ever learn if all you've been doing is reading code by others who haven't read these papers. And even if you read code from people who have read the papers: understanding something from code is much harder than understanding it from a paper in which it is clearly laid out. You wouldn't grasp the fundamentals behind Lamport's paper or the Dynamo paper nearly as fast.
You're completely right in your last sentence. Very few papers will help you get better at debugging. But programming is about a lot more than debugging.
I'm interested in distributed systems but am only vaguely familiar with locks. I could pull up the Hadoop codebase and trudge my way through some distributed systems code. But it would take forever, and I'd have to guess at a lot of core concepts after a lot of confusion and effort trying to build complex mental models of what the code is doing in my head.
I think I would get a lot farther a lot faster if I read some things, be they papers or textbooks or Wikipedia entries, about systems. Then, once I'm at least sort of familiar with the core concepts that one applies when writing distributed systems, I could get to reading and writing real code.
There's a time and a place for both, but if there's a complex idea that one is not familiar with, I think one can much more efficiently understand it at some level--and then maybe dive into the code--by reading a 20-page paper about it than by trying to understand the 100KLOC implementation thereof. Being exposed to the high-level picture makes it way easier to understand why each piece is there doing what it's doing.
Not buying it. Programmers in our field could do worse than be up-to-date on the history of the major ideas in computer science, before they fire up their editor and start hacking stuff out. Obviously, you can do both, too.
I disagree with you. All the programmers I know have plenty of coding/reviews in their day-to-day. A much smaller amount of time is dedicated to reading papers - after all, it's not their job. If we can agree that both are valuable, why is the author wrong for pointing people to things that they likely focus less on?
The post is a response to the original Object Mentor list of 10 papers; this post claims to be oriented more toward technical depth and less toward philosophy and design.
It's almost a toy problem, but I found the paper really interesting:
http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.138.7...