CG-01-31.md

February 14, 2023 · View on GitHub

WebAssembly logo

Agenda for the January 31st video call of WebAssembly's Community Group

Where: zoom.us
When: January 31st, 5pm-6pm UTC (January 31st, 9am-10am Pacific Time)
Location: link on calendar invite

Registration

None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.

Logistics

The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.

Agenda items

Opening, welcome and roll call
1. Opening of the meeting
2. Introduction of attendees
Find volunteers for note taking (acting chair to volunteer)
Proposals and discussions
1. Extended const expressions: https://github.com/WebAssembly/extended-const [5 min]
  1. Poll to phase 4
2. Relaxed SIMD, updates and phase 4 poll: https://github.com/WebAssembly/relaxed-simd [15 min]
Closure

Agenda items for future meetings

None

Meeting Notes

Attendees

Sergey Rubanov
Kevin Moore
Kriztian Gacsal
Slava Kuzmich
Paolo Severini
Chris Fallin
Jeff Charles
Nick Fitzgerald
Sam Clegg
Philipe Le Hegaret
Kieth Winstein
Alex Crichton
Shravan Narayan
Nabeel Al-Shamma
Alon Zakai
Conrad Watt
Yury Delendik
Luke Wagner
Zhi An Ng
Manos Koukoutos
Daniel Philips
Ilya Rezvov
Heejin Ahn
Thomas Lively
Ryan Hunt
Shravan Narayan
Mingqui Sun
Yuri Iozelli
Francis McCabe
Andreas Rossberg
Rick Battagline
Jakob Kummerow
Ashley Nelson
Benjamin Titzer
Brendan Dahl
Andrew Brown
Zalim Bashorov
Asumu Takikawa
Adam Klein
Marat Dhukan
Emanuel Ziegler
Petr Penzin
Shaoib Kamil

SC: I haven’t fully resolved this issue before bringing it back to the CG. We did talk about it in the last in-person meeting and there’s been some discussion since then but I don’t think there is firm resolution on that. But if Ryan & Andreas are here, we can chat about it now. Otherwise, we can delay once again.

SC: Issue: Link Prefixing, Extended Const Expressions Right now we don’t have a link prefixing, and the result of that discussion is we wouldn’t want to put it on the individual expressions anyway but the overall global declaration instead.

RH: IIRC I couldn’t think of a great way to make this work well and I dno’t think it’s necessarily the biggest issue, so I’m fine just not changing the proposal for this and leaving it as-is

SC: I kind of agree, that was my gut feeling and the real complexity of these expressions is going to kick in with the GC proposal and we can look at adding it there. I did look at adding these length information to the globals themselves and we do have flags in both of those places. Like for globals we have the mutability flag, but adding this information there would ??. Possible to add more flags. And with the passives we have the active bitfield thing which is already overloaded and tricky to pass.

AR: I’m also fine with leaving it out, but I think we should then also be clear that we don’t want to introduce it later with GC expressions because if we do it at a random point with random expressions where some expressions can have them and some don’t then it gets hard. So we should decide that we want this and keep the set of special instructions that are const as small as possible, so the rule is simple, or just agree that we won’t introduce it later. But I don’t really see it as a necessity.

SC: Yeah I guess we could also make them optional as well if it’s just a tooling thing that makes object passing these things. We could make them never required and still introduce them.

AR: that’s an interesting idea

CW: Is the advantage of introducing them about complexity or speed?

SC: yeah my original reason to raise this is that its nice for tools to do partial decoding, even if they don’t understand all the instructions they can still skip over the expressions

AR: Right now we have the section sizes so the only thing you can do is skip over the entire section. SC: Yeah

AR: So the idea would be that we also have the segment sizes then. On the other hand, the other thing we discussed is allowing repeated sections in which case you can split every section with single segments. And then you can give every segment a size already.

SC: I'm a big fan of that proposal too. That would help LLVM a lot if we could have repeated code and data segments.

AR: Yeah

SC: I’m still a fan of closing this issue.

BT: Personally I find the speed argument pretty compelling, because at least in my implementation the special const expressions are decoded initially. But I don’t want to add more to that special set, so it’s likely to turtle everything through the main decoder. So that would mean to parse the general initializer ??. So I think the speed argument is pretty solid.

RH: when are you evaluating these global initializers?

BT: it’s not about evaluation, it’s about parsing.

AR: The original idea was to hook it into the syntax of the constant expressions, that’s the only place we can do it in the global of the initializer. But we would have to do it for other initializers, and that would be a significant size overhead across the board.

BT: Keep in mind it would only be necessary for extended constants so the constants we already have can stilll be the shorter form.

AR: but they all use the sae notion of constant, if you extend the notion of constant, you want to allow it everywhere

BT: Sure we can allow it but the existing forms would still be the same size.

RH: I think the issue is that if you have an element segment of anyref, then each of the indexes would have to be prefixed but you only care about skipping the segment as a whole. I’m still not quite clear with the speed argument with that. Why is it slower to do parsing of the general const expr? You'll still have to branch on the opcode

BT: Having a length allows you to skip forward, so you can skip over global declarations and get to what is after it more quickly and if you’re doing parallel compilation, you can get to the code or anything after these sections much faster.

CW: you can skip the whole section already, right?

AR: yeah I think we had this very discussion at the time where we nailed down the binary format originally, which I mentioned in the thread. At that time, we discussed whether constant expressions or global initializers should have a length prefix, I was actually the one who proposed it. And at that time there was a clear sentiment that this would not be useful for parallelizing. I’m not sure if that has changed, or we have evidence that has changed or what has changed your mind on that.

BT: at the time we hadn’t speced out anything with GC and extended object graphs, but we have that ability coming so they can be arbitrarily large. I get that you can skip the whole section to get to the code, but now each global can be arbitrarily large

SC: In your use case, why is it useful to skip the entire global section.

BT: it’s probably fine for now but it al depends on how modules end up in the wild. If they have dozens or hundreds of big globals, then it might make sense to parallelize compiling that

SC: I see, so you are compiling each of these initializers separately.

BT: Potentially, yes.

RH: I was thinking about this in the original issue because in some of the GC issues I’ve seen, they can have a large number of globals, 10K. If you’re emitting code or not doing an interpreter there is a bit of linearity needed because globals can refer to previous globals. So if I’m doing code that instantiates globals, the code has to be ordered to refer to previous things. SO it wasn’t clear there would be a win to generate the instantiation code or interpret it in parallel.

CW: my feeling is we should just give up on the idea of segment sizes and if it turns out we’ve made a mistake, we go in on repeated sections and have that be the mechanism we use to parallelize.

AR: that sounds fine to me although ryan has a point that i don’t think in practice it’s easy to parallelize global initializers, especially the complex ones where you need a dependency analysis

CW: I kind of agree but that’s just another reason to not add segment sizes right now.

BT: At the risk of going too deep on this, I think that global initialization that depends on another global initialization, you don't need to actually see the code for that global to initialize this global. The generated bodies are different enough. Or generate the boilerplate in V8, which is basically just an empty object with holes in it.

RH: for that you'd need the types of previous globals, so as you’re streaming you’d fill in the previous types…

BT: Generally it makes me a bit uncomfortable that modules in the hundreds of megabytes, already dozens of MBs now. And if dozens of that is data and it needs to be processed serially, that makes me nervous that we can’t parallelize this easily. That makes me feel better about having arbitrarily large things that can be skipped over.

RH: in the hypothetical future with repeated sections you coil have multiple global sections and within one you could refer to previous globals…. You could split it up to repeated sections. It does seem similar to <>

CW: Each section will have a length, so between the sections you’ll be able to get the parallelization you want.

BT: yeah that puts the onus on the producer to split their modules up into different sections for parallel processing as opposed to being a property of the format that makes it possible for an engine or consumer to do it.

AR: Yeah but that becomes a quality of implementation question for the producers which seems fine to me. The producer can always create terrible, wasteful code without the consumer being able to do anything about it. But of course the producer should be interested in creating good code if they want customers. So the incentive should still be there.

TL: I could see implementing this in BInaryen,. If you measure this and could split your globals into sections, we’d do that.

SC: And bad producers can already produce giant functions that cannot be done in parallel. So I guess the alternative to repeated sections would be trying to shoehorn the size bit in the global type. Along with the mutability that we have today. Seems pretty gross.

AR: you’d need to put it in a few places, e.g. in globals where you’d probably have to put it in the type, and also for element segments. I’d expect you’d want it in both places.

DS: Ben, it seems like we almost have consensus, how hard do you want to dig in here? Should we punt this to offline or do you think that we’ve said as much as people want to say here?

BT: Given where we are, and how far this proposal has advanced, and that I’m not prepared to experiment, so we can just move this ahead.

DS: Okay, let’s go ahead and do the poll. This will be a poll to essentially close Issue 9 and move Extended Const to Phase 4 in its current state as we discussed in the previous in-person meeting.

AR: Can you give us a summary of what is in the extension right now? I admit that I have lost track, I think we have discussed several possible instructions that could be included but I don’t know.

SC: In the beginning we had integer arithmetic and we toyed with the idea of adding other integer instructions like wrap and extend but we didn’t in the end because everyone had already added that base set and we were in a good place to ship something. So we just have integer arithmetic instructions.

AR: What about division?

TL: What are our two web implementations?

SC: Firefox & Chrome definitely implement it, not sure about Safari, maybe done too. All the tools have an implementation

AR: I think division will come up naturally when you have to compute sizes of things from other values but we can of course add that laters.

SC: I can’t remember why I didn’t put it in the original set.

CW: This doesn’t block phase 4, do we have implementation specific limits, we had discussed that in the meeting. (https://webassembly.github.io/spec/js-api/index.html#limits)

SC: We don't have anything in the limits yet.

CW: I think we should do that at some point just so there is no divergence there.

SC: Are whose limits in the core spec?

CW: Either in the JS or web API

DS: I think in the web api

AR: So the core spec should mention the limitations.

CW: I vaguely remember that during the in-person meeting, Thomas mentioned that V8’s limits for stack size when evaluating const expressions are different (smaller) than for function bodies

BT: there isn’t a limit for the operand stack for functions but the function size implicitly limits it. It’s pretty big. It might be good to limit the stack for functions too, because it's like 7B so we could probably make that smaller.

SC: Is there any reason not to use the same limits for constant expressions?

CW: You may want a simple expression if I’m remembering what Thomas said correctly.

AR: I don’t really see operations on operand stack?

DS: That makes sense, we want to generalize it but that doesn’t block Phase 4.

POLL

SF	F	N	A	SA
6	25	2	0	0

Relaxed SIMD

ZN: Presenting slides

FM: can you explain why you went with option 2 rather than option 1?

ZN: Initially we had an instruction that didn’t have an obviously correct deterministic lowering, fp16 dot product, and was implemented in many different ways across architectures and even different versions of architecture. So because it didn’t have any obviously correct deterministic semantics, we’d have to remove it entirely. But it seemed ad hoc to remove just that one and keep the others like min and max. So we decided maybe to just remove all the instructions. Buf if you do that, one issue could be binary issues where you need 2 different versions of your modules, one with the instructions removed and one with them remaining. But all spec compliant implementations would have to be able to understand them but they might implement them in a different way.

FM: If you’re a producer, you basically have to assume relaxed semantics?

ZN: So relaxed semantics will be an opt-in flag when you generate your module. A flag you pass that says I want to be able to use the relaxed semantics together with the intrinsics.

CW: we did actually have some discussions about this before. Francis you’re basically right, it would be a hazard if we allowed people to target the deterministic semantics in a nondeterministic …. All producers would have to target the relaxed semantics. So, what is the benefit of any implementer advertising that they implement the deterministic semantics if all producers have to assume the nondeterministic?

LW: Sorry was that a question you were asking to the group?

CW: I don’t think this needs to stop the phase 4 vote, because said we could work out the profiles later.

LW: The value in implementing deterministic profile is, you want determinism but also be able to run as much code as I can, because if I can't run the code that creates friction.

CW: Why do you need support from the spec to do that?

LW: Because you don’t want different platforms doing determinism differently

CW: But the producer has to assume that already when they are generating the wasm module, They can’t rely on the platform doing anything.

LW: yeah it would be a benefit among the different deterministic execution environments

CW: But which code would they be running that would see this benefit.

LW: it’s not the code, its the people running it. It will run the same way on different platforms

TL: For example, I suppose they could use each other as fuzzing oracles if they want that same determinism.

LW: when we do determinism we all do it this way regardless of how. When you have determinism, you want it to be the same everywhere.

CW: What is not clear in my head is who is the “you” that wants determinism. Because it seems it’s not the author of the code.

LW: not the author of the code, it’s someone else who is consuming that code or an operator of a bunch of different code, or the owner of the platform. The code is a black box here.

PP: I skipped the last SIMD meeting because I was out. Are we asking, How do we tell if a module is deterministic? That’s part of the point. Or in general do we assume all producers are? There should be a way to determine if the module is deterministic.

LW: I think what Conrad is saying is when you’re producing a module, if you emit a relaxed instruction you have to assume it's fully nondeterministic, you can’t assume the deterministic semantics. But when you implement it, you have the choice and can select just one subset of the allowed executions

TL: I suppose if the code author wants determinism they target the existing proposal which is deterministic.

AR: Here's a use case: imagine you want to build a platform that has to run deterministically because it uses consensus, but wants to support as much of the instruction set as possible. Plus it's distributed and has different clients, it wants to be able to switch between engines, so it depends on deterministic mode being the same everywhere.

CW: So Dan does have his hand up and I want to hear from him because he’s been one of the key people in this discussion.

DanG: It’s not a question of whether the producers can assume anything …. It’s about FMA, so i’ll punt it until after were’ done with this part of the discussion

CW: I would think it reasonable to say that we can have modules with relaxed simd instructions

Know who talked here? : Determinism is going to be a mainstream thing I think

BT: are any of the deterministic semantics so slow that we don’t expect a high perf implementation will choose that?

KW: I’m a bit nervous to see the cost of implementing deterministic semantics of NaN. It’s scary if every floating point instruction has to be wrapped with an isnan. It could be quite costly.

AR: often you can optimize that, you only need to normalize nans when they become observable, when you return them or write them to memory.

KW: But with exported memory, that could be anything.

AR: Usually with float operations you do multiple operations and the nan flows through them. So you only have to normalize at the end, not at every step. Usually the engine or compiler can figure out where it has to do that and can minimize where it has to do that.

DeG: if you do have these full semantics, are there environments that care about performance at the same time. If you have these and want to be able to run them, do you care about the performance of these at the same time?

BT: One of the things Dan brought up at the in-person meeting is we should try to converge on one thing at the end. Hardware diverges right now but eventually they may converge and we would want to move towards them. Is the deterministic semantics, they all have SIMD lowering, are any of them seemingly inefficient now?

PP: they are pretty obscenely inefficient, with the NaNs

ZN: Ignoring NaNs, only the obscenely expensive one is Multiply-Add. It depends on the deterministic semantics, do we choose a fused multiply-add or choose just a multiply-add

PP: the expense is breaking it into parts and combining them back together

CW: I was going to say that my feelings are since determinism or relaxed is determined by the platform, so any platform that wants to advertise determinism and be fast needs to buy hardware where FMA is supported.

AR: I”m not sure platforms really want to advertise determinism as a feature, they might just have to do it out of necessity, it’s just an implementation constraint rather than service to users. So they might not want to artificially restrict what they can run, or running it slow is better than not being able to run it at all. I don't know how realistic that is with relaxed SIMD; you wouldn’t generate this if you planned to run on a platform where you know it’s going to be slow. I”m a bit on the fence here.

MD: Getting back to the question of whether we are going to converge the semantics. We [removed?] bfloat16.dot product which is the least-convergent[?]

For FMA it son’ly older Intel CPUs that are missing this so we are converging. For integer dot product, newer ARM CPUs support the deterministic semantics, and Intel announced it fo rfuture CPUs that wil have this but no current hardware. So it looks like these will converge.

CW: I would say that we can’t ignore NaNs here. It doesn’t seem like Intel or ARM are going to converge on their NaN bits, so this will only ever be a best effort initiative.

PP: the FMA is important, but kind of a special case in this proposal. It could be 2 different topics For nans, the reason you’d want to be particular about the bits, e.g. if you ‘re writing a hash, you can detect that a single bit is different. The use cases for this are small and rare and exotic. I don’t know if we want’t get hung up.

ZN: Thomas hsa a good suggestion in chat. We do option 2, the multiple-add current semantics. And then when hardware catches up, we have a relaxed FMA. But the deterministic semantics would be a real FMA. Thomas’ note from chat: What if we make the deterministic semantics mul+add now, and once hardware all has compatible fast fma, we can add a new instruction with the same relaxed semantics but fused deterministic semantics.

CW: I would want to go further than what Thomas is saying, make all the semantics single precision if we believe all the hardware supports it.

DanG: I just wanted to advocate for making FMA be the deterministic semantics. Implementations that need this, if you really want this you can get it, and it’s not very extreme to have this FMA hardware these days.

PP: eventually we may even need both, algorithms may benefit from both because they don’t produce the same result. Manty systems that try to tackle this have both,. I wouldn’t want this question to hinder the progress. We can pick something and then pick something else or both later.

ZN: Presenting

So we want feedback on the deterministic semantics on relaxed madd. This is the only thing left blocking phase 4

CW: I would like to decouple that specific conversation from the phase 4 poll.

AR: i'm not sure that’s how we should use phase 4,

DeG: I was going to suggest an alternative. The biggest open question is what deterministic semantics we want to pick. Lots of opinions that haven’t changed. So we can have a poll on what should be the deterministic semantics and then see if we have time for a phase 4 poll.

AR: IIRC the advantage fo the first one is aht it’s faster on a lot of future hardware, at the coast of being slower on hardware that doesn’t support it natively

DanG: It’s more costly on very low-end hardware today but basically faster on everything.

PP: Something about low end Intel machines

TL: Can we do a 3 stage poll because I would like to vote neutral on this.

NF: Sure, but floats "escape" at every return from an exported function, every reinterpret, and every store (whether the memory is exported or not)?

BT: In general, numerical algorithms that do a lot of matrix multiplication will have a lot read+arith+write, so I expect nan-checking overhead there will be proportionately more

DanG: I've benchmarked NaN canonicalization overhead; it's pretty regularly 5-15% in fp-intensive code.

BT: Just curious what the cheapest nan canonicalization insert sequence is?

TL: What if we make the deterministic semantics mul+add now, and once hardware all has compatible fast fma, we can add a new instruction with the same relaxed semantics but fused deterministic semantics.

ZA: good suggestion, i can be relaxed_fma

BT: That is just to avoid spec’ing slow on today’s hardware?

TL: correct

DanG: The NaN canonicalization sequence I used was just a compare+branch with a load from a constant pool

DanG: Also, ARM and RISC-V have modes which do the desired NaN semantics in hardware.

ZA: vote for det semantics for relaxed madd/nmadd:

(real) fma
mul + add (with intermediate rounding)
neutral

POLL

FMA	Mul+Add	Neutral
15	2	16

DeG: It seems like we have consensus we want the pure FMA to be the deterministic semantics. Do Peter or Ben want to say why they want to have mul + add.

BT: I will just mention that #2 is easier to implement in all hardware.

PP: I would say that it would not be too bad if we had both, but that is beyond the scope of this proposal I think.

DeG: We are two minutes over, so we can just comeback for a quick poll at the next CG meeting. I just want to forward Zhi & Marat as at the champions of this proposal. Is there anything else you want addressed Andreaas?

AR: I still have to review the actual spec, and that;’s on me. I’ll get to that this week. But it seems safe for voting next time.

DeG: So we’ll schedule a quick 5 to 10 minutes in the next meeting for a phase 4 poll. Sounds good Zhi?

ZN: Sounds good, thanks for the time everyone.