The Case of the Missing Increment

80 points by eigenform a year ago

Taniwha a year ago

Thinking about this - this may be a pattern that's designed to match something that expands from a string instruction.

While the loop he's testing is a useless bit of code that does nothing, the optimisation he's discovered may help speed up things like scasb/stosb, allowing portions of 2 unrolled copies to be processed per clock.

pkhuong a year ago

I believe I first saw this on IACA; uops.info has the measurements for zero-latency inc, add, etc on Alder Lake https://uops.info/html-instr/INC_R64.html . These adds by immediate are nicely closed, so I've been assuming renamed values are uniformly represented in Golden Cove as register+increment.
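
A rough way to picture that register+increment representation is the toy C model below. It is only a sketch: the names (`RenamedValue`, `rename_add_imm`, `prf`, `alu_uops`) are made up for illustration, and `OFFSET_LIMIT` is an assumed bound, not a documented Golden Cove value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bound on the foldable offset; not a documented hardware value. */
#define OFFSET_LIMIT 1024

/* Toy model: a renamed value is a physical register plus a small offset. */
typedef struct {
    int     preg;    /* index into the physical register file */
    int64_t offset;  /* pending increment folded in at rename time */
} RenamedValue;

static uint64_t prf[16];    /* hypothetical physical register file */
static int next_free = 1;   /* hypothetical free-list cursor */
static int alu_uops = 0;    /* adds that actually had to execute */

/* Value seen by a consumer: base register plus pending offset. */
static uint64_t read_value(RenamedValue v) {
    return prf[v.preg] + (uint64_t)v.offset;
}

/* Rename `add reg, imm`: fold it into the offset when both the immediate
   and the running sum stay small; otherwise execute a real add into a
   fresh physical register and reset the offset. */
static RenamedValue rename_add_imm(RenamedValue v, int64_t imm) {
    if (imm > -OFFSET_LIMIT && imm < OFFSET_LIMIT) {
        int64_t sum = v.offset + imm;
        if (sum > -OFFSET_LIMIT && sum < OFFSET_LIMIT) {
            v.offset = sum;   /* no execution unit needed */
            return v;
        }
    }
    assert(next_free < 16);
    int dst = next_free++;
    prf[dst] = read_value(v) + (uint64_t)imm;
    alu_uops++;
    RenamedValue out = { dst, 0 };
    return out;
}
```

In this model, a run of `rename_add_imm(v, 1)` calls never touches `alu_uops`, and every intermediate value stays readable through `read_value`, so an instruction sitting between two increments could still consume it.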

zokier a year ago

> Since the only Alder Lake machine I had access to was a remote Windows machine that didn’t belong to me, I more-or-less had to choose option 3, which meant subjecting myself to The Ultimate Sadness

Well, you can pick up Sapphire Rapids instances from your preferred cloud provider and avoid the sadness.

deater a year ago

do cloud providers give full, unrestricted access to hardware performance counters?
- zokier a year ago
  
  It depends. On AWS you can get "metal" instances where afaik you get pretty much unrestricted access. In addition on certain instance types/sizes you get access to virtualized counters (vPMU). See Q11 here https://github.com/intel/pcm/blob/master/doc/FAQ.md#q11 or tables here https://www.intel.com/content/www/us/en/developer/articles/t...
  dunno about others

leiroigh a year ago

That's pretty cool.

Normally it would be either the programmer's or the compiler's job to unroll a loop and then reduce dependency chain lengths.

But it's nice if the renamer can do that as well.

Presumably Intel have real-world data that suggest that significant real workloads can profit from this.

I wonder whether that points to specific software issues, like hypothetically "oh yeah, openjdk8 hotspot was a little too timid at loop unrolling. It won't get that JIT improvement backported, but our customers will use java8 forever. Better fix that in silicon".

dzaima a year ago

Note that not only are multiple consecutive increments reduced to zero latency, but this happens even when they're interleaved with movsxd, as in the second experiment at https://uops.info/html-lat/ADL-P/INC_R64-Measurements.html. It'd be interesting to see what other instructions it can "fuse" with (if that is what is happening).

rep_lodsb a year ago

Also interesting that this only happens with 64 bit registers: https://uops.info/html-lat/ADL-P/INC_R32-Measurements.html
I don't see a reason why this should be the case, since the high bits of the result would simply be cleared, and it's a common size optimization to use 32 bit operations.
Maybe https://news.ycombinator.com/item?id=41706743 is correct, and this is mainly intended for address increments generated by microcode?
- dzaima a year ago
  
  Interesting. I wonder how interleaved 'inc r64'+'mov r32,r32' would look - that's two separate latency-zero ops, equal to 'inc r32'. Wouldn't be too surprised if an eliminated op can only be zero-extending or incrementing, but not both.
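
The claimed equivalence can be checked on the register value alone with a small C sketch. The function names are hypothetical, the `mov r32, r32` is assumed to use the same register as the `inc`, and flags are ignored (`inc` sets them according to its operand size, `mov` leaves them untouched).

```c
#include <stdint.h>

/* `inc r32`: increment the low 32 bits; a 32-bit register write
   zero-extends the result into the full 64-bit register. */
static uint64_t inc_r32(uint64_t reg) {
    uint32_t low = (uint32_t)reg + 1u;
    return (uint64_t)low;
}

/* `inc r64` followed by `mov r32, r32` on the same register. */
static uint64_t inc_r64_then_mov_r32(uint64_t reg) {
    uint64_t t = reg + 1;           /* inc r64 */
    return (uint64_t)(uint32_t)t;   /* mov r32, r32 zero-extends */
}
```

Both functions return the same value for every input, including the cases where the low 32 bits wrap around.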

buttocks a year ago

Deep thoughts: why aren’t “increment” and “excrement” opposites?

Joker_vD a year ago

Because "increase" and "excrete" have completely different roots that only coincidentally coincide when the verbal nouns corresponding to those words are formed.
- knodi123 a year ago
  
  now do "progress" and "congress"!
  
  Joker_vD a year ago
  
  You mean, the difference between "going forward" and "coming together"? It's in the prefix, "pro-" (for, forward) versus "con-" (with, together), which give you different shades of the meaning. Can't really say what the verb of movement was, though.
  
  oersted a year ago
  
  I think he meant it as an absurdist joke, but this is a great response!
  I looked it up, "gress" comes from "gradi" in Latin which directly translates to "walk". More specifically: con(pro) + gradi -> congredi (verb) -> congressus (noun)
  Edit: Knowing this, "gradient" has an interesting flavour :)
  Edit: It looks like the path is more indirect for "gradient"
  "gradi" (walk) -> "gradus" (step) -> "grade" (french influence) + "salient" -> "gradient". I like that in Latin "walk" is "to step", or perhaps "step" is "the unit of walking"? "A walking"? Etymology is fun!
  
  Joker_vD a year ago
  
  > I like that in Latin "walk" is "to step", or perhaps "step" is "the unit of walking"? "A walking"?
  Consider the verb "to pace", and the corresponding noun "pace": the analogy is almost perfect. Of course, Latin also had other words for going places.
  
  randomdata a year ago
  
  now do "flammable" and "inflammable"!
  
  dpkirchner a year ago
  
  What a country!
IWeldMelons a year ago

Your name checks out. You should be an expert in such (excremental) matters.

mzs a year ago

You have to use an instruction like cpuid with rdtsc so that the TSC is not read before the loop terminates. There have been changes to the Intel docs and there are more options now:

https://stackoverflow.com/a/58146426

Also in the bad old days SMM would interfere on some CPUs.
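
As a minimal sketch of one such option, assuming GCC or Clang on x86-64 and a CPU that supports RDTSCP, the timed region can be bracketed with LFENCE and RDTSCP through the `<x86intrin.h>` intrinsics. The helper names (`tsc_begin`, `tsc_end`, `time_inc_loop`) are made up for this example. Keep in mind that the TSC ticks at a fixed reference rate on modern parts, so the difference is not a count of core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Start timestamp: the first LFENCE waits for earlier instructions to
   complete, the second keeps the timed code from starting before RDTSC. */
static inline uint64_t tsc_begin(void) {
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp: RDTSCP waits until prior instructions have executed;
   the trailing LFENCE stops later code from moving ahead of the read. */
static inline uint64_t tsc_end(void) {
    unsigned int aux;   /* receives IA32_TSC_AUX, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Time a chain of dependent `inc` instructions. Returns elapsed TSC
   ticks and stores the final counter value in *out_counter. */
static uint64_t time_inc_loop(uint64_t iterations, uint64_t *out_counter) {
    uint64_t counter = 0;
    uint64_t start = tsc_begin();
    for (uint64_t i = 0; i < iterations; i++) {
        __asm__ volatile("inc %0" : "+r"(counter) : : "cc");
    }
    uint64_t end = tsc_end();
    *out_counter = counter;
    return end - start;
}
```

Calling `time_inc_loop(1000000, &c)` a few times and taking the minimum is a common way to reduce noise from interrupts and frequency changes.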

vardump a year ago

Just when you get used to features like x86 CPUs combining two instructions into one micro-op (macro-op fusion), you get something like this.

I guess immediate addressing mode addition is a good choice to execute at rename / allocation stage, as it's common, relatively simple and can't generate exceptions.

Taniwha a year ago

This isn't really combining, as the result of the first increment is needed by the intermediate compare, but is a rewriting that removes a dependency (or moves it further back in the stream).
- vardump a year ago
  
  Maybe it rewrites multiple immediate additions into one.
  
  Taniwha a year ago
  
  It can't, because the intermediate results are required for the compare instructions.
eigenform a year ago

> immediate addressing mode addition
Well, except for the fact that you need to read from a register before adding the immediate displacement to it. You'd have to know the physical register and do the read very early (before renaming), or predict the value!
- eigenform a year ago
  
  I just realized you were probably referring to the example given from the AnandTech article with `lea r64, [r64+imm8]`.
  Caveat is just that [presumably] the source and destination registers have to be matching (since `lea rax, [rax+imm]` is just `add rax, imm`).

dzaima a year ago

uops.info's measurements show 'inc r64', interleaved with 'movsxd' instructions, still having zero latency[0], so it can't be just merging the immediates of successive increments (or there's additional fusion happening). Plain unrolled 'inc r64' shows an average latency of 0.2 cycles, i.e. 5 dependent ops per cycle, and 0.2 used ports per instruction [1].

The same holds for 'lea r64, [r64+8]' (imm8), 'lea r64, [r64+128]' (imm32) and 'add r64, 2' (imm8), but not for 'add r64, 0x1000000' (imm32).

[0]: https://uops.info/html-lat/ADL-P/INC_R64-Measurements.html

[1]: https://uops.info/html-tp/ADL-P/INC_R64-Measurements.html