sYnfo's comments

sYnfo · on Sept 26, 2024

You might find the syscall tracing functionality of Cirron useful: https://github.com/s7nfo/Cirron

sYnfo · on July 26, 2024

FYI vien [0] figured out that simply compiling with "-static -fno-pie" and _exit(0)-ing at the end puts the solution presented here to 15000 points and hence #4 on the leaderboard. Pretty funny.

[0] https://news.ycombinator.com/user?id=vient

a_t48 · on July 27, 2024

Optimizing the leftovers loop to

    #pragma clang loop vectorize(enable)
    #pragma clang loop interleave(enable)
    for (; offset < length; offset += 4) {
        const auto x = ((uint32_t\*)start)[offset / 4];
        count += ((x & 0xFF) == 0x7F);
        count += ((x & 0xFF00) == 0x7F00);
        count += ((x & 0xFF0000) == 0x7F0000);
        count += ((x & 0xFF000000) == 0x7F000000);
    }

also gives some points. It'd probably be more if I could be bothered to break apart your assembly. :)

sYnfo · on July 26, 2024

Neat! I'll add the best solution without explicit SIMD/asm in this thread to the post after I wake up, it's a great datapoint.

sYnfo · on July 13, 2024

Map vs. territory. The challenge, as defined by the system the competition runs on, is to get 3 correct responses in a row. That's it.

sYnfo · on July 13, 2024

1) if you set BATCH_SIZE > 14 sums_acc may overflow

2) chunks with too many small numbers cannot be processed with just 2 shuffle-adds

3) (not mentioned in the post) HighLoad limits the size of the source code you can submit, so you can't put all possible values in the look-up table

camel-cdr · on July 13, 2024

Couldn't you organize the accumulators in 8 byte chunks, and leave the upper byte unused. Then you map consecutive digits to those chunks and use 64 bit addition for the accumulation. Then overflow between the bytes would keep the correct result if you do the shuffles correctly, and you have a full byte of overflow buffer.

Dwedit · on July 13, 2024

Gaps in the numbers are often enough to do some kind of "SIMD" even on ordinary 32-bit processors.

camel-cdr · on July 13, 2024

Yeah, but I was thinking of doing this within the vector registers to increase the batch size.

rurban · on July 14, 2024

So SIMD would need to set the overflow flag also to catch em.

Which would be much faster than the checked add (adc). Does any hardware support such checked SIMD arithmetic already?

Or can you still assume that most arithmetic is still broken in most languages/libraries.

adrian_b · on July 14, 2024

AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.

At each step from Larrabee to Skylake Server, some instructions have been lost, because the initial set of instructions was more complete in order to enable the writing of efficient GPU algorithms, while later Intel believed that for a general-purpose CPU they can reduce the costs by omitting some of those instructions.

(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)

Among the original Larrabee instructions that have been deleted, was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.

Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per each result number), so it would have been easy to also add to Larrabee/AVX-512 an addition instruction with signed overflow flags stored in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that detecting signed overflow would be possible with the unsigned overflow bits, though that requires multiple instructions, slowing down a lot the algorithm.

However in this problem the numbers that are added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.

dzaima · on July 14, 2024

As a minor note, overflow checking for add & sub can be done reasonably-efficiently in software by comparing the wrapping result with a saturating one, good for i8/i16/u8/u16 on x86, should be roughly the same for signed overflow compared to repurposing hardware unsigned overflow (though of course worse than would-be-native for unsigned).

And, regardless, this would be at least one more uop in the core loop (and a somewhat-unpredictable branch at that) which you'd still want to avoid.

kardos · on July 13, 2024

For 1, can you raise that to 28 with unsigned accumulators?

sYnfo · on July 13, 2024

14 already assumes unsigned accumulator! 255 [accumulator capacity] / (2 [shuffle-adds] * 9 [highest digit value]) ~= 14

dzaima · on July 13, 2024

You could have a separate accumulator for each shuffle, which should allow 28 iterations. (and merge those together at the end of the 28-iteration-loop by widening to u16; at which point you could have an outer loop accumulating in u16 until that runs out)

sYnfo · on July 13, 2024

To be clear, it’s not dereferencing unmapped memory, I just haven’t shown how it’s being mapped, because it’s a little complex. As I note in the post, you can imagine as if I mmap all the necessary addresses at the start of the program.

camel-cdr · on July 13, 2024

Given that the input is "integers uniformly sampled from [0, 2³¹−1]" couldn't you use a LUT for the 99.99% case of just 10/9/8 digit numbers instead and have a cold branch the handle the very rare smaller numbers.

anonymoushn · on July 13, 2024

Yes, maybe if one is clever and lucky this could cost only a popcnt and a branch? not sure.

sYnfo · on Feb 19, 2019

It's not quite so simple: AFAICS they have not open-sourced the training code, only one for sampling from the pre-trained model.