First, before optimizing you should consider correctness and security.
input should be const and the return value should be ssize_t (so you don't have numeric overflow on 64-bit).
Second, consider this replacement function:
ssize_t test(const char \*input) {
ssize_t res = 0;
size_t l = strlen(input);
size_t i;
for (i=0; i<l; ++i) {
res += (input[i] == 's') - (input[i] == 'p');
}
return res;
}
The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?
No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.
If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.
The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.
If you really wanted to, you could get rid of the strlen by hacking your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the l1 cache.
Second, consider this replacement function:
The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.
If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.
The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.
If you really wanted to, you could get rid of the strlen by hacking your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the l1 cache.