Avoid some false positives for addremove -s

The original code uses the similarity score

  1 - len(diff(after, before)) / len(after)

The diff can be at most the size of the 'before' file, so any small
'before' file would be considered very similar. Removing an empty file
would cause all files added in the same revision to be considered
copies of the removed file.

This changes the metric to

  bytes_overlap(before, after) / len(before + after)

i.e. the actual percentage of bytes shared between the two files.
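
The difference between the two metrics can be sketched in a few lines of
Python. This is an illustration only, not the code in this change: the
names shared_bytes, old_score and new_score are hypothetical, difflib
stands in for Mercurial's own diff, and the old diff size is modelled as
the bytes of 'before' that have no match in 'after'. The sketch also
assumes shared bytes are counted once per file (the factor of 2), so that
identical files score 1.0; the formula above does not spell out that
normalization.

```python
import difflib


def shared_bytes(before, after):
    """Count the bytes of `before` in line blocks that also appear in `after`."""
    a = before.splitlines(True)
    b = after.splitlines(True)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return sum(len(line)
               for i, _j, n in matcher.get_matching_blocks()
               for line in a[i:i + n])


def old_score(before, after):
    """1 - len(diff) / len(after), with the diff size modelled as the bytes
    of `before` that have no match in `after` (an assumption)."""
    if not after:
        return 0.0
    delta = len(before) - shared_bytes(before, after)
    return 1.0 - float(delta) / len(after)


def new_score(before, after):
    """Shared bytes over combined length, counted once per file (assumption)."""
    total = len(before) + len(after)
    if total == 0:
        return 0.0
    return 2.0 * shared_bytes(before, after) / total


# ASCII-only contents, so len() of a str equals the byte count.
empty = ""
large = "".join("%d\n" % x for x in range(10000))
another = "".join("%d\n" % x for x in range(10, 10000))

print(old_score(empty, another))  # 1.0: the empty file looks like a perfect match
print(new_score(empty, another))  # 0.0: no bytes are shared
print(new_score(large, another))  # close to 1: almost every byte is shared
```

Under these assumptions the empty file scores as a perfect match with the
old metric and as no match with the new one, while the near-copy of the
large file still scores well above a 50% threshold.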
#!/bin/sh
# Case 1: remove an empty file and a large file, add a near-copy of the
# large file.
hg init rep; cd rep
touch empty-file
python -c 'for x in range(10000): print(x)' > large-file
hg addremove
hg commit -m A
rm large-file empty-file
python -c 'for x in range(10,10000): print(x)' > another-file
hg addremove -s50
hg commit -m B
cd ..
# Case 2: remove a tiny file and a large file, add a small file.
hg init rep2; cd rep2
python -c 'for x in range(10000): print(x)' > large-file
python -c 'for x in range(50): print(x)' > tiny-file
hg addremove
hg commit -m A
python -c 'for x in range(70): print(x)' > small-file
rm tiny-file
rm large-file
hg addremove -s50
hg commit -m B