It's nice to look at, but it doesn't actually work that well, because it has no real way of doing stream alignment other than simple equality. (Stream alignment means figuring out the points where two streams become equal; this is what the edit distance calculation gives most diff algorithms.)
So various forms of lines that have been moved will always be shown as replacements, for example (I see someone else discovered this).
Most of the cost and complexity of existing diff algorithms is essentially stream alignment (that's what the dynamic programming problem they solve tells them), so yes, removing stream alignment will in fact make your diff algorithm "simple" :)
The fix function is also just a hack for having no real way to align streams on a per-character basis, so it has no way of, for example, maximally extending matches until it's done.
You could avoid a lot of what it does by using characters, rather than lines, as your basis.
The direct translation here would be:
1. Build segments by starting at the beginning of both files, incrementing one file until they match, and then incrementing the other until they stop matching. This gives you one or two segments: the mismatch segment, which may be empty but represents the part where they don't match, and the matched segment, which will be maximal instead of split on line boundaries.
This is O(min of the two file sizes), just like the current hash building. You do have to keep track of line numbers, but that doesn't change the time bounds.
2. Hash/index the segments the same way you are hashing/indexing lines.
3. Do the rest of the described algorithm.
The only upside/downside is that you end up with partially changed lines (i.e., there may be more than one segment per line), but here you can detect that immediately (you can even mark where it has happened while you build segments) and transform those into replacements if that's what you want, instead of trying to notice later that this is what happened.
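A minimal sketch of step 1 in Python, under the assumption that on a mismatch you simply scan one side forward until it lines up again (the function and names here are invented for illustration, not anyone's actual code):

```python
def build_segments(a: str, b: str):
    """Greedy character-level segmenting: emit alternating matched and
    mismatched spans. Matched spans are extended maximally instead of
    being split on line boundaries. A sketch, not a real diff."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # extend the match maximally
            s_i, s_j = i, j
            while i < len(a) and j < len(b) and a[i] == b[j]:
                i, j = i + 1, j + 1
            out.append(('match', a[s_i:i]))
        else:
            # advance one side until it lines up with the other again
            s_i = i
            while i < len(a) and a[i] != b[j]:
                i += 1
            out.append(('mismatch', a[s_i:i]))
    # whatever is left over on either side is a final mismatch
    if i < len(a) or j < len(b):
        out.append(('mismatch', a[i:] + b[j:]))
    return out
```

On "hello world" vs "hello there", for example, this yields one maximal matched segment "hello " followed by the two mismatched tails; tracking line numbers alongside `i`/`j` is a constant-factor addition.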
If you wanted something super simple, but more accurate than this, you could use the Rabin–Karp algorithm: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
Split file2 into x-sized blocks, and use them as patterns you search for in file1. Simple, and the same worst case as naive edit distance.
Note that if you had a good enough rolling hash, you could do the same thing in O(N) time by not doing the equality check in the hashtable, and instead just issuing a replacement if it turned out you were wrong :)
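A hedged sketch of that rolling-hash variant (the hash parameters and function names are mine, not from any particular tool): hash fixed-size blocks of file2, then slide a polynomial rolling hash over file1 in O(N), trusting the hash rather than doing a final equality check. A collision would just surface as a spurious "match" that a real tool could emit as a replacement.

```python
BASE, MOD = 257, (1 << 61) - 1  # arbitrary rolling-hash parameters (assumption)

def block_matches(file1: str, file2: str, x: int):
    """Return {block_index_in_file2: position_in_file1} for x-sized blocks."""
    blocks = {}  # hash -> block index in file2
    for k in range(0, len(file2) - x + 1, x):
        h = 0
        for c in file2[k:k + x]:
            h = (h * BASE + ord(c)) % MOD
        blocks[h] = k // x

    found = {}
    if len(file1) < x:
        return found
    # rolling hash over file1: O(1) update per position
    h = 0
    for c in file1[:x]:
        h = (h * BASE + ord(c)) % MOD
    top = pow(BASE, x - 1, MOD)  # weight of the outgoing character
    for pos in range(len(file1) - x + 1):
        if h in blocks and blocks[h] not in found:
            found[blocks[h]] = pos  # trusting the hash: no equality check
        if pos + x < len(file1):
            h = ((h - ord(file1[pos]) * top) * BASE + ord(file1[pos + x])) % MOD
    return found
```

With the equality check added back inside the `if h in blocks` branch, this degrades to the same worst case as the naive version; without it, it stays O(N).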
I don't think it is -- it's full of the kind of features (complicated condition expressions, long chains of elses, hard-coded numbers) that, when I find myself writing them, suggest that I'm approaching the problem in the wrong way, because previous experiences make me associate that kind of code with lack of generality and corner-case bugs.
which is what most diff tools use. It is a simple and clear algorithm with no special cases once you get your head around it. It works in O(NxM) time, which seems like a reasonable lower bound (you need to compare everything to everything else to have a chance of getting the best alignment), although there are ways to do better with constraints.
(I remember one for gene alignment which broke N and M in two, recursed 4 times on each pairing, and then had a quick-ish way to put those together again. Can't remember the details though!)
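For reference, the O(NxM) dynamic program being described can be sketched in a few lines of Python; `a` and `b` here would be the lines of the two files, and this is the textbook edit-distance table, not any particular tool's code:

```python
def edit_distance(a, b):
    """dp[i][j] is the cost of aligning the first i items of a with the
    first j items of b; filling the table is O(len(a) * len(b))."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                 # delete everything: i deletions
    for j in range(len(b) + 1):
        dp[0][j] = j                 # insert everything: j insertions
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,    # delete a[i-1]
                dp[i][j - 1] + 1,    # insert b[j-1]
                dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/replace
            )
    return dp[len(a)][len(b)]
```

Tracing back through the table (which cell each minimum came from) is what turns the cost into the actual alignment a diff displays.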