CG-06-20.md
August 24, 2023 · View on GitHub

Agenda for the June 20th video call of WebAssembly's Community Group
- Where: zoom.us
- When: June 20th, 4pm-5pm UTC (June 20th, 9am-10am Pacific Time)
- Location: link on calendar invite
Registration
None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.
Logistics
The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.
Agenda items
- Opening, welcome and roll call
- Opening of the meeting
- Introduction of attendees
- Find volunteers for note taking (acting chair to volunteer)
- Proposals and discussions
- Closure
Agenda items for future meetings
- Profiles discussion [30 mins] (Not enough time at the meeting for this discussion, moved to 09/26/2023.)
Schedule constraints
None
Meeting Notes
Attendees
- Petr Penzin
- Jeff Charles
- Derek Schuff
- Deepti Gandluri
- Conrad Watt
- Chris Woods
- Alon Zakai
- Zalim Bashorov
- Yury Delendik
- Yuri Iozzelli
- Paolo Severini
- Mingqiu Sun
- Luke Wagner
- Alex Crichton
- Ryan Hunt
- Daniel Hillerström
- Andreas Rossberg
- Paolo Severini
- Andrew Brown
- Dan Philips
- Bruce He
- Nick Ruff
- Brendan Dahl
- Ashley Nelson
- Guy Bedford
- Nick Fitzgerald
- Heejin Ahn
- Chris Woods
- Manos Koukoutos
- Emanuel Ziegler
- Sam Clegg
- Sean Jensen-Grey
- Jakob Kummerow
- Richard Winterton
- Sergey Rubanov
Proposals and discussions
Flexible vectors update and possible phase 2 poll [Petr Penzin, 30 mins]
PP presenting slides
CW: would like to understand the platform dependent behaviors. IIRC there’s a length operation, which would be nondeterministic based on the platform you’re running on?
PP: That’s right..
CW: I’ve heard from some that hope to migrate wasm modules between heterogeneous environments, but if you checked that property once, if you migrate to a different platform that would cause a problem?
PP: The intent is not to hardcode the length, but use it for loops, or generate code, if you end up in a different environment, you should still be able to run the same code
CW: you’d have to compile the code to be robust to that right? You could imagine checking it once and caching the result and then you don’t need to do that again.
PP: Good point, the expectation is that the code will be running in a robust manner, you do have to be mindful of that and not assume you have 128 bits, I have an example for having to check the length.. In reality you could materialize the length without using extra variables
I had an example in an old slice deck where you would check the lengths. You’d have to put them in a local or global.
CW: I guess the analogy is to relaxed SIMD. do we guarantee a fixed value, or leave open in the spec that it’s more variable than that?
PP: I actually don’t think its necessary to fix it, probably would work better that it wasn’t fixed, then it might be better
CW: I think it becomes hard if you allow it to be that nondeterministic
AR: can you remind me how loads and stores would work if you don’t know how large these things are?
PP: That’s the purpose of the length variable
AR: so a load is basically nondeterministic how much memory it accesses, or runs into a bounds check? It might depend on how large the read is?
PP: there is a realistic bounds on that depending on how big the SIMD register is, and will only happen when the SIMD register length runs out, its not expected to be the common case. We could cap this explicitly in the spec.
AR: One thing conrad was getting at is code mobility, e.g. if i stored some values in memory and carry my process to another architecture, there’s no guarantee that i can read these value back, because the reads might mean something different?
PP: Not sure I understand why, if you do check lengths again, if you go in a different architecture, your length changes but you’re still reading the same data
CW: that suggests that you have to recheck the length and use a different codepath for different granularity? I would imagine most people would check the length at the start and maybe serve a different module as opposed to doing it dynamically.
PP: That wasn’t the intended use, but I can see how someone may use it that way. We don’t load a different module, but we don’t have a precedent for it
AR: even if i were super careful and checked and assume i have some safe points where I check the length again. How would I be able to react on that, if the vectors are already stored in memory with a certain length, and my architecture is a different length, how can I read them?
PP: We’re not storing a vector type in memory, we’re storing the memory that corresponds to the vector. If you’re storing 256 bytes. And if you have a different architecture then you need to load 4x4, or whatever the architecture provides, so its a bag of bytes
AR: that would mean that the load/store instructions you have aren’t useful in that case, you would always have to load individual values and reconstruct the vectors from scratch if you wanted that kind of mobility. (again a hypothetical)
CW: Are there people working on this?
[PP showing example] PP: This is an example from the original presentation, a vector value corresponds to a SIMD register. So you would load, the amount of bytes required to fill the operands of the operation, then perform the operation and get a value in a (potentially) different SIMD register, then store the same number of bytes. But if you have different lengths, you’d check that against previous operation, which is harder. If you have different lengths, then you are suspended and have a different architecture?
CW: so you'd have to remember the previous length of the store
PP: Why would you? You have a current index that you can pull from, increment in terms of the lane size, but don’t need sizes of chunks upfront. Write your loop in terms of the number of elements
DG: slightly different question: given how different codegen can be on different platforms, how do you expect the performance to scale? We had a higher bar for e.g. relaxed SIMD about experiments or performance estimates.
PP: The estimate is when we do work in converting SSE benchmarks to AVX2 benchmarks, that 1.5 - 2. Sort of obvious but YMMV depending on how good your 128 vs your 256 implementation is. In straightforward case where you are multiplying numbers, you should expect 2x speedup but those are straightforward and synthetic. In more complex cases numbers might be somewhat lower.
DG: we see performance loss at a few different layers e,.g. At compiling wasm. The entire set of operations is hard to support. With that baseline assumption of losing some for any kind of SIMD ,and now we’re talking about adding lane checks, and runtimes will have to do some extra work so I’m wondering what kind of guarantees we can offer.
PP: That’s the reason we’re doing it, the upper bound is 2x with 128->256, realistic gains are 1.4x etc. I think that’s what we should expect, 50%, we’re talking about performance here, so 50% is significant.
CW: does this proposal introduce any new fingerprinting signals? that was a concern with relaxed SIMD, e.g. with FMA support.
PP: Beyond that, no we don’t expect it to do nondeterministic ways, aside from the lengths.
CW: Do we expect that on the web the length will expose finger printing surfaces? What can you do with the length information?
PP: it won’t vary a lot. Basically you have AVX and SSE on x86, 2 options. On ARM, SVE is coming. Generally those can be detected by timing operations. Also AVX2 is really widespread. The only processors that don’t support it are the low power cores. So just having AVX2 doesn’t tell you much. The presence of 256 bit doesn’t really tell you anything other than maybe the class of device such as a laptop
DG: A poll might be ok, for experiments. We need more data on which applications would use this what the performance gains might be. Very specific application performance, would be helpful for deciding whether to implement this in Chrome or not. I think as CW pointed out, we do have to do a little more intensive looking at fingerprinting services would be exposed, especially between different versions of Intel chipsets, ARM chipsets. What does that look like? Having looked at that, I could guess that this would be a significant change from where we are now.
TG: using wasm for sandboxing at UCSD with firefox, we’ve seen that SIMD performance is quite bad e.g. audio/video decoding etc. SIMD128 is just embarrassingly bad on common workloads, we really need something better. On fingerprinting, i’ve done some work on VMs,, can point you to some research. Hiding the architecture is impossible, i don’t think it should be significant consideration.
DG: The tradeoff we make is the performance we get for the information we’re going to expose. Especially for flexible vectors, I’m not certain we are able to perform the same optimizations across the board, vs the alternatives are introducing regular instruction sets that are able to map to hardware SIMD. Cutting out all the layers in-between has been the fastest way to this. I understand PP’s motivation, which is how many experiments are we doing for the future???
PP: integer ops generally is one of the targets to accelerate for this proposal. I don’t know about performance but that’s why we have a prototyping phase. So the questions we have, what applications ,fingerprinting, etc . just because we go to phase 2 doesn’t mean we’ll standardize.
CW: Phase 2 implies consensus around the direction and I’m not seeing that here because the alternative is we have specific instructions instead of going for the general thing.
PP: What about the vector size?
CW: TBH I kind of agree with Tal that fingerprinting is a lost battle, but my main concern is the code migration story. With relaxed SIMD we at least can emulate the new instructions with existing ones.
PP you can still do that, and say that “my length is 128” and it goes back to the existing state.
AR: This way of doing SIMD only allows point wise operations? In the current SIMD proposals, there are instructions that are not pointwise operations, its not clear how they would be able to use this scheme of instructions. Is it right to assume that we can’t actually use those instructions with wider widths?
PP: what do you mean by pointwise?
AR: Pointwise means both operands have the same number of lanes and it’s a per-lane operator.
PP: Yes but obviously the path out of there is the length is 128 multiple, so you can easily get to a point where you do a half or double. If you can do those things. There are libraries that do that, that’s doable. You can even define semantics within the existing set, because for example, you do ½ of the vector, then ½ gets unfilled or filled with 0s. That is a possibility. Things like expand. That’s on the table. Not completely freeform of course.
Poll
- SF: 1
- F: 3
- N: 20
- A: 1
- SA: 0
There’s not a clear consensus, we probably need some more discussion, motivation, collect some of the main concerns etc.
Profiles discussion [30 mins]
Due to lack of time, profiles is moved to a future meeting
Update on TC39 phase imports [ Guy Bedford ]
GB Presenting [TODO: link]
AR: What is the difference between attaching and linking for modules?
GB: the record has a field for the environment which references the local bindings, and realm, it’s a little arbitrary but you’re getting a different module instance in different realms in JS
AR: IIUC, the main question is whether this change to the proto chain of wasm.module is breaking or not? IIRC we had a similar discussion wih the type reflection proposal which introduced a new class for wasm functions, which also changed the protochain there. I don’t think we found any compatibility issues with that, at least I haven’t heard of any. It seems most people depend on that.
GB: These changes happen surprisingly frequently in the Web IDL world, so we don’t have anything specific concerns about that, but we wanted to present an update here, and integrate this with the ESM modules proposal in the future.
CW: one clarifying question. Does changing the prototype of webassembly.module have implications for posting modules across workers?
GB: Right, obviously because of that they don’t have the same instance. There is no state associated with the abstract module that would need to participate in the serialization process, yes.
CW: could that change if we had compile-time imports?
AR: I wouldn’t know how we use compile time imports with this, seems orthogonal I think
RH: I can’t give a great answer right now, I’ll think about that
AR: One other question, this is somehow abstracted away, what happens if you apply this to JS imports?
GB: this is kind of the amazing thing, that we’ve been able to build consensus that this is useful for JS. there are folks in TC39 who would like to see this so it gives us ability to follow up on the JS side. There is a risk in shipping a feature that doesn’t have full JS integration yet. So you’d get an import error if you tried to use this at first and we’re figuring out exactly how this will work on the JS side.
PP: Does it affect the reinstantiation in any way. There was a complaint that you have to recompile the module from bytes if you want to drop the memory and ???
GB: this is for webassembly.module? We are mostly focused on the ergonomic/ecosystem/tooling benefits here rather than performance. If there’s enough usage then the browsers will know more.
ZB [chat]: Probably "Import Attributes" could help integrate it with compile time imports https://github.com/tc39/proposal-import-attributes
AR [chat]: @Zalim, yes, that sounds plausible