CG-10.md
January 14, 2026

Agenda for the October meeting of WebAssembly's Community Group
- Host: Google, Munich, Germany
- Dates: Tuesday-Wednesday, October 28-29, 2025
- Times: 09:00 - 17:30 CET
- Video Meeting:
  - Instructions to join via Google Meet will be posted on the W3C calendar events.
- Location:
  - Erika-Mann-Straße 33, 80636 München, Germany
- Wifi: TBD
- Code of conduct:
  - Standard WebAssembly code of conduct. If you have any questions or concerns, please reach out to the WebAssembly CG chair.
Research Day
- Date & time: Thursday, October 30, 2025. 09:00 - 17:30 CET
- Talk proposal? Fill out this form
- Agenda here
Social
- 6:00pm on Tuesday, October 28 at Augustiner-Keller (12 minute walk from the office, not sponsored)
Registration
- Fill out this form by October 14.
Logistics
Communication
- Event contact: tlively@google.com
- CG chair contact: webassembly-cg-chair@chromium.org
- Discord: https://discord.gg/jWQGbFEuGJ
Getting to the venue
The event will take place at the Google Munich office, Erika-Mann-Straße 33, 80636 München, Germany.
The office is very close to the Donnersbergerbrücke S-Bahn station. There are also several hotels within walking distance.
Once you arrive at the office, check in and pick up your visitor badge in the lobby.
Agenda Items
Session start times are not guaranteed. We may start sessions before their scheduled times if previous sessions end early or after their scheduled start times if previous sessions run long.
Tuesday, October 28
9:00am - Welcome and Introductions (30 minutes)
9:30am - Shared-Everything Threads (Thomas Lively and Conrad Watt, 75 minutes)
11:00am - JS Primitive Builtins (Sébastien Doeraene, 30 minutes)
- Poll for phase 2
11:30am - Break (30 minutes)
12:00pm - Custom Descriptors (Thomas Lively, 60 minutes)
- Poll for phase 3
1:00pm - Website Issues and Update (Tom Steiner, 30 minutes)
1:30pm - Lunch
2:45pm - wasm_of_ocaml Experience Report (Jérôme Vouillon and Ricky Vetter, 30 minutes)
3:15pm - Custom Page Sizes (Nick Fitzgerald, 30 minutes)
3:45pm - Compilation Hints (Emanuel Ziegler, 15 minutes)
4:00pm - Wide Arithmetic (Alex Crichton, 15 minutes)
Wednesday, October 29
8:30am - Breakfast available
9:00am - Stack Switching (Francis McCabe, 60 minutes)
- Poll for phase 3
10:15am - JetStream 3 (Keith Miller, Ryan Hunt, Daniel Lehmann, 30 minutes)
10:45am - Multi-Byte Array Access (Brendan Dahl, 45 minutes)
11:30am - Break (30 minutes)
12:00pm - Function-at-a-Time JIT (Ben Titzer, 30 minutes)
12:30pm - Profiles (Andreas Rossberg, 30 minutes)
1:00pm - Incubation Framework for Compute Extensions (Deepti Gandluri, 30 minutes)
1:30pm - Lunch
2:45pm - ESM Integration (Guy Bedford, 30 minutes)
3:15pm - Component Model and the Web (Ryan Hunt, 60 minutes)
Meeting notes
Attendees
- Thomas Lively (Google)
- Thomas Trenner (Siemens)
- Thomas Steiner (Google)
- Chris Woods (Siemens)
- Keith Miller (Apple)
- Marcus Plutowski (Apple)
- Ben Titzer (CMU)
- Yuri Iozzelli (Leaning Technologies)
- Bailey Hayes (Cosmonic)
- Sébastien Doeraene (EPFL)
- Sam Lindley (University of Edinburgh)
- Andreas Rossberg
- Derek Schuff (Google)
- Deepti Gandluri (Google)
- Conrad Watt (Nanyang Technological University, Singapore)
- Patrick Dubroy (Independent)
- Julian Köpke (DENPAFLUX)
- Stephen Berard (Atym)
- Adam Bratschi-Kaye (Dfinity)
- Lucy Menon (Microsoft)
- Rezvan Mahdavi Hezaveh (Google)
- Brendan Dahl (Google)
- Francis McCabe (Google)
- Aleksandr Shefer (JetBrains)
- Zalim Bashorov (JetBrains)
- Romāns Volosatovs (Cosmonic)
- Luke Wagner (Fastly)
- Paolo Severini (Microsoft)
- Chris Fallin (F5)
- Alex Crichton (Fermyon)
- Ryan Hunt (Mozilla)
- Arjun Ramesh (CMU)
- Heejin Ahn (Google)
- Thomas Traenkler (Community)
- Seth?
- Léo Andrès (OCamlPro)
- Olivier Nicole (Tarides)
- Jérôme Vouillon (Tarides)
- Ricky Vetter (Jane Street)
- Michiel Van Kenhove (Ghent University)
- Jakob Kummerow (Google)
- Thibaud Michaud (Google)
- Clemens Backes (Google)
- Daniel Lehmann (Google)
- Mathias Liedtke (Google)
- Aäron Munsters (Vrije Universiteit Brussel)
- Michel (?)
- Peter McInerney (Fastly)
- Marco Kuoni
- Martin Kustermann
- Sébastien Deleuze
- Siddarth
- Michael Ficarra (F5)
- Ben Visness (Mozilla)
- Julien Pages (Mozilla)
- Yury Delendik (Mozilla)
- Erik Rose (Fastly)
- Saúl Cabrera
- Jan de Mooij (Mozilla)
- Merlijn Sebrecht (Ghent, IMEC)
Introduction
Links to Code of Conduct, notes doc, logistics.
We will be enabling auto-transcription and note-taking, but the results will only go to the chairs and be used to supplement the handwritten notes.
Shared-Everything Threads
TL presenting slides
TL: No threads in JS. No idea of sharing strings in JS. Should TC39 be involved with shared strings?
AR: Strings are immutable, so they don't need to be specified as shared as it is not observable.
CW: The handling of agents is part of the HTML spec rather than the JS spec, there may be something to do in there.
KM: The identity of strings is not observable. Does it matter then if strings are shared or not?
AR: Shared strings are immutable?
CW: Strings are immutable, so it probably doesn’t matter. The real question is how they interact with structured clone. Even if not TC39, we have to talk to the structured clone folks. Strings are probably easy.
KM: For strings the identity isn’t observable in Wasm or JS, so can sidestep the problem. Probably still needs to be shared in practice though.
( … more slides … )
SL: Status of other browsers?
KM: Haven’t started yet. Started planning around making GC concurrent, likely the majority of work. This may be simpler for us? Marking is already concurrent and the algorithm is threadsafe. Still lots of work, but not ground-up work.
RH: We don't have an implementation nor do we have plans at the moment. Busy with other stuff.
ZB: Using threads is painful, especially interop with JS. Aside from strings we also share the JavaScript Error because we have coroutines and exceptions. We currently have a prototype that runs on a single thread.
Thread local globals
( … more slides … )
FM: "stack pointer" is the shadow stack pointer?
TL: Yes, the user-space stack pointer. The engine’s stack pointer isn’t in a global.
( … slides … )
AR: Why trap? Rely on zero-initialization?
TL: Still need to allocate the zeros. The hope is that it’ll scale better if you don’t have to allocate.
MP: If you can trap on access: can’t you initialize in the trap and then return to the access?
TL: That would be an option. It would work for declared thread-local globals. It would not work for thread-local function imports, e.g. importing a JS function just for that thread. JS functions can’t be shared. A shared function calling JS calls different JS on every thread. For JS imports we definitely need explicit initialization. My thinking was to extend that to shared globals, too.
AR: How do instances figure in thread locals? Instances are just collections of objects.
TL: Like WebAssembly.Function is not tied to instance, but you can think of it as creating a tiny instance. Also transitively imported globals need to be initialized if you're initializing things lazily.
AR: Not currently how semantics works: no notion of belonging to an instance.
TL: Lazy thing is trickier. When entering a function you need to make sure all globals defined and imported are initialized. Transitively too. You’re right that you would have to handle globals not associated with instances. That's another reason to avoid the lazy solution.
RH: Can you elaborate on how memory management will work? How about initialization of these multiple TLS slots?
TL: The details are not pinned down. You definitely need a growable array for adding new threads. You need to reuse thread IDs or handles. We have to figure all that out. It should be possible to do something. When a thread dies you also have to make sure sentinels are cleared.
RH: If you’re reallocating then everything pointing to it must be updated. Shared instances need updating. Initialization is tricky but memory management and acquiring was hazy.
TL: Part of the reason we've left this a bit hand-wavy is that we know that existing systems make this work somehow, so the hope is that we can just reuse those existing solutions.
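As an illustration of the thread-local global discussion above, here is a minimal sketch of the "explicit initialization, trap on uninitialized access" option, with one slot per thread in a growable array. The class name, method names, and numeric thread ids are hypothetical and only model the semantics; nothing here is a proposed API or an engine implementation.

```typescript
// Hypothetical model of a thread-local global (not an engine or proposal API).
class ThreadLocalGlobal<T> {
  // One slot per thread id; null is the "uninitialized" sentinel.
  private slots: Array<{ value: T } | null> = [];

  // Explicit per-thread initialization; the array grows when a new thread appears.
  init(threadId: number, value: T): void {
    while (this.slots.length <= threadId) this.slots.push(null);
    this.slots[threadId] = { value: value };
  }

  // Reading an uninitialized slot "traps" (modeled here as a thrown Error).
  get(threadId: number): T {
    const slot = this.slots[threadId];
    if (slot === undefined || slot === null) {
      throw new Error("trap: thread-local global not initialized on thread " + threadId);
    }
    return slot.value;
  }

  // When a thread dies, reset its slot so a reused thread id starts uninitialized.
  release(threadId: number): void {
    if (threadId < this.slots.length) this.slots[threadId] = null;
  }
}
```

The lazy alternative raised by MP would replace the throw in get with a call to an initializer; as TL notes, that does not cover thread-local function imports, which need explicit initialization anyway.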
( … Conrad’s slides now … )
CW presenting slides
ZB: When can we share module instances between threads?
CW: Shared functions are that point. Functions close over the instance.
TL: The proposal is enormous. We should try to split it out into separate sub-proposals.
CW: Yes! The intention is for solved problems to eventually move forward as separate proposals. Co-design is valuable in the meantime for finding interactions.
RH: When you have a timeline: What does it mean? Do we have to decide in sequence?
CW: We can always decide earlier. It's more like a dependency graph.
RH: Shared stack switching would inform shared function references, right?
CW: Right, this may blend a bit. I’ll say more about shared functions and let’s come back.
( … slides … )
AR: This is not specific to JS. We must maintain the invariant that you cannot reference unshared from shared.
CW: True. In pure Wasm everything could be shared. JS is an “existence proof” that we have an important environment in which a majority of state is non-shared and Wasm still works.
RH: Earlier we talked about how we don't want to have references from shared to unshared things. Is the thread-bound data such a reference?
CW: Next slide!
( … slides … )
KM: Another option for sidestepping the hazards of implementation would be to say thread-locals are weak. Then you have a hybrid of the two solutions.
CW: Yes, I'll cover that. I like the weak solution but it didn’t seem popular last time I discussed it. Feedback welcome.
( … slides … )
TL: No more updates. We have a strong preference for strong references.
AR: I personally think it would be weird to have this be weak. Runtimes would implement their own GC on top of the engine. I would argue this goes against having GC being a built-in feature.
CW: Some people presented a preference for weak before, but perhaps that was a soft preference.
KM: It does not have to be weak forever. Weak could be the only implemented option today. Another proposal could fix that later on.
KM: Is cycle collection really needed? Having different implementations could lead to game theory-like scenarios. If one browser implements it that could cause pressure on other browsers.
AR: We could have tests!
KM: Implementations may not run the test suite. Implementations would run tests in browsers of choice and deploy to the web. If outcomes don’t match then bugs are filed on other browsers.
AR: Engines would still fail the spec test then.
CW: If an engine willfully broke the behavior they’d just fail that test.
KM: Different GC strategies (e.g. a conservative GC) may or may not collect garbage. It seems overly aggressive to mandate that there are no cycle collections.
AR: We have one test that checks that a regular call doesn’t do tail call elimination. It tests that stack space is exhausted. If we disallow collecting cycles, we would test for heap exhaustion.
MP: The question is what happens if the spec doesn’t disallow collecting cycles.
AR: I think we have to decide which way to go. We can’t leave the question open. We can write tests for either choice.
CW: Everyone’s probably going to have their own opinion until there are more experiments.
NF: Can we clarify these object wrappers? Are they specific to JS embedding or part of core Wasm?
CW: Different possibilities with different points in the design space.
TL: We plan to prototype a JS wrapper.
CW: Even if starting with JS, it would still be easy to move to a core thing later. We're not shutting down design possibilities.
( … slides … )
AR: Implementation-wise it doesn’t do this though? [About copying stack segments in and out to implement suspension and resumption]
CW: Implementations are expected to have a linked-list of stacks, but still conceptually search with tags on the stack.
AR: Stack switching, not stack copying?
CW: Both mental models work, yeah.
( … slides … )
AR: I would argue that this question already comes up earlier with shared functions. There is an implicit stack object. With shared functions there’s an implicit shared stack. With that there’s already the question of whether the shared stack can point to shared things.
CW: You’re asking if a shared function gets a shared stack and it’s a question that I’d like to get to.
( … slides … )
TL: One interesting feature request from Zalim is auto-boxing of JS objects into thread-bound data. If inside Wasm they're always shared…
CW: Slide on that towards the end.
DG: Shared functions only exist in core wasm? Can’t access JS functions?
CW: Yes. Could have a function given shared imports. JS might have individual APIs as shared imports.
DG: Then there’s no strong requirements about arguments and returns?
CW: Potentially different kinds of shared functions. Some respect these requirements and some don’t.
KM: Many other ways to get unshared objects via tables.
CW: Even shared functions can’t reference unshared tables. The second bullet here isn’t entirely fleshed out but covers all cases of reaching non-shared references.
( … slides … )
SL: Clarifying those last two capabilities. Can you reify a continuation as shared or non-shared?
CW: Yes, one world is where you have shared functions, but you always get (or can choose to get) a non-shared continuation.
SL: If you have both of these capabilities, can you get a non-shared continuation?
CW: You could have the tag associated with the suspend determine which kind you get.
( … slides … )
DG: When you talk about shared module instances, what is the level of sharedness?
CW: The way I think of this is that if you have a shared function instance that’s only allowed to access declarations which are shared. For GC allocation the shared stuff is in the shared heap and nonshared parts in nonshared heaps. The shared instance only closes over shared bits so GC “just works.”
( … slides … )
BT: In order to suspend or resume a shared continuation there has to be a check. If the continuation is suspended as shared, it must be resumed as shared.
CW: Yes. Sharedness is part of the type.
BT: Is there a dynamic check that you resume from a shared continuation?
CW: I think the check can be static. You know which type the function has.
BT: Let’s talk later.
CW: Tag tells you whether it’s shared or not.
AR: Without the middle line you don’t even need to consider the tag.
CW: slides!
( … slides … )
TL: Isn’t the barrier free as well? It would go in a side table like an exception handler.
CW: Demanding a barrier might require a new stack which could be expensive.
AR: Stack switching used to have a barrier instruction and we discussed the cost of that. The suspend is a walk of a stack chain, not the call frames, but a barrier might require it to walk call frames, which is more expensive.
CW: Could represent the barrier with a binary flag which is per-stack.
AR: You need to be careful with exceptions.
CW: I would hope that’s broadly per-stack but maybe at the boundary there might be extra cost. We could avoid all of this if we have the nice wrappers.
( … slides … )
KM: The design makes sense. Our concern is the production side. We're of the belief that the wrapper casting is effectively a downcast and relatively cheap. But there are significant costs from downcasting already. We're concerned that dealing with that everywhere makes it easier to not use stack switching at all.
CW: Yes, it's like the cheapest possible downcast. We could say that if we think the cost is too high, then we do need all 3 types of capabilities.
KM: Our concern is if you have all 3 then you actually do want to leave around breadcrumbs because the source language has no concept of un-shared-ness. Then you get weird runtime errors from boxing which is expensive or weird errors where something was called with the wrong sharedness.
CW: That concern seems independent from stack switching. Anything with shared<->nonshared interaction will have to deal with that. You can take an object out of a field and access it and what you expect depends on the design of non-shared wrappers. It needs to be worked out early for shared GC and whatever decision will extend to shared stack switching.
KM: Possibly, yes.
TL: We've been dealing with the interaction between threading and JS for years in Emscripten. It's possible to get these weird runtime errors with Emscripten, but it exists and seems to work really well in practice.
CW: The point to emphasize is that if the shared GC wrapper solution works out badly, then we have a plan. If it works out really well, we also have a plan. Both realities have been mapped out.
TL: We are also concerned about the performance of auto-wrapping. But we are also concerned about the cost of the glue code that the toolchain would have to generate if we don’t do that. So it’s not clear which is better.
AR: You call it boxing but it could be anything, e.g. fat pointer? As long as it's something that only exists inside of Wasm.
TL: Yes, but we've been imagining that there would also be an explicit JS API for thread-bound data that may or may not be the same thing.
KM: We have v128 where it traps on the boundary with JS, but I guess that would be weird here.
AR: You only need to wrap when going through the Wasm/JS boundary.
ZB: Our intuition from the toolchain side is that developers don’t want to control boxing and unboxing. They just want to use JS and export APIs to JS. They don’t want a non-JS strange API. Either we have auto-boxing and unboxing or toolchains need to always generate this code.
SL: What is the timescale here? Experiments would be valuable.
TL: We are experimenting.
RH: I agree there are many options and many things to run into. The decision timeline roughly makes sense, but I have a separate question: Thread-bound data and auto-boxing means any unshared things can go into shared type and that we have shared to unshared references after all, so why have shared annotations? Can’t the engine figure things out dynamically at that point?
CW: You're assuming a much more aggressive form of boxing where even a WebAssembly instance could be boxed as shared. When you box a non-shared thing as a shared thing it doesn’t allow you to access it from another thread. You would get an error if you accessed it in the wrong thread. You want to be able to post shared functions to other threads and have it work.
RH: To rephrase: if engines can handle strong references between threads, including shared to unshared things, why do we need shared types?
CW: There’s an intermediate stage of strong shared-to-unshared references where they only support “ephemerons”. In a world in which engines implement these ephemerons, limited shared-to-nonshared edges become possible, but not arbitrary shared-to-nonshared references. If we can be even more ambitious and allow (arbitrary shared->nonshared) we wouldn’t need shared as a separate concept.
RH: Can you concretely state the difference?
CW: With ephemerons, there’s an extra invariant that at least one shared->nonshared edge pointing to the nonshared object must be same-thread, so thread-local GC still sees a live pointer and is safe. With arbitrary shared->nonshared it might be that the only edge is across threads which would break thread-local GC.
BT: With shared annotations the engine can see certain objects don’t cross threads. This may affect optimizations because you know there will be no tearing.
CW: Objects with shared annotation might have more overhead and engines can optimize non-shared things more aggressively.
AR: Shared objects need more synchronization also.
ZB: Do we really need shared annotations on everything? Can't we just have a single flag on the module saying everything is shared?
CW: Andreas will say it's so we can arbitrarily merge modules.
AR: This completely destroys modularity.
(everyone laughs)
CW: If you have a module that is “mostly shared” but small bits interact with JS, those bits may need to be non-shared. Andreas's point is that you want to be able to merge two modules where one has shared things and the other has unshared things.
ZB: We observe that when we mark anything as shared, we have to mark everything as shared.
CW: Yes, that’s expected. I wouldn’t be against some binary encoding macro. From the point of view of Wasm semantics, that’s sugar for marking everything shared explicitly.
SL: What’s a good example where that does not happen? Where there’s one little non-shared thing?
CW: Assume there’s a world where we solved thread-local-storage with context local and 3 kinds of shared function. Mostly in a shared-suspendable world. Then to interact with a non-shared thing you open a small block of shared-fixed, and that lets you manipulate the non-shared thing. This can’t happen in all possible designs, but in some designs it’s genuinely a mix. This makes assumptions about how TLS is solved.
CW: Continue talking to me please!
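To make the thread-bound data idea from the discussion above concrete, here is a small sketch of a wrapper that can be passed between threads as an opaque value but can only be unwrapped on the thread that created it, raising an error otherwise (as CW describes). The class name and the explicit thread-id argument are assumptions made for illustration; the actual design and any JS API are still open.

```typescript
// Hypothetical model of "thread-bound data": a shared wrapper around a
// non-shared value that may only be opened on its owning thread.
class ThreadBound<T> {
  private readonly ownerThread: number;
  private readonly payload: T;

  constructor(ownerThread: number, payload: T) {
    this.ownerThread = ownerThread;
    this.payload = payload;
  }

  // Unwrapping on any other thread is an error; the wrapper itself can
  // still be stored in shared objects and passed around freely.
  unwrap(currentThread: number): T {
    if (currentThread !== this.ownerThread) {
      throw new Error("thread-bound data accessed from the wrong thread");
    }
    return this.payload;
  }
}
```

Auto-boxing, as requested by ZB, would amount to the engine creating and unwrapping such wrappers implicitly at the Wasm/JS boundary instead of the toolchain generating those calls.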
JS Primitive Builtins
SD presenting slides
The motivation for adding more JavaScript built-ins is the need for interaction with JavaScript. The current Wasm embedding allows nice interactions with integers and manipulation of anyrefs/externrefs (JavaScript objects). A compiler can generate helpers for tasks like creating a JS array from integers and accessing elements, where the boundary between JS and Wasm handles the conversion between JS numbers and i32s correctly. When parametricity (generics) is introduced, and an integer is hidden within a generic function that takes an anyref or externref, the integer must be boxed. At the boundary, the toJSValue and fromWasmValue conversions break the Wasm box around the integer. JavaScript receives the boxed object, not a JS number, which causes issues when trying to create a typed array, as it expects a JS number. The i31ref feature can sidestep this for 31-bit integers but not for floats or booleans, so a universal representation for JS interop is required.
The goal is to implement box and unbox functions so that converting an i32 to an anyref and then passing it across the boundary results in the same JS number as passing the i32 directly. This is currently implemented in user space by importing identity functions with Wasm signatures for boxing (i32 to anyref) and unboxing (anyref to i32). However, this requires calling these helper functions and crossing the Wasm-to-JS boundary every time a generic programming operation is performed, which is expensive and shows up high in profiles.
The proposal is to make these boxing and unboxing operations built-in to avoid the cost of boundary crossings. The built-in functions, styled like JS string built-ins, have Wasm signatures. For boxing, an integer parameter results in a JS number, using externref in the specification for consistency with JS string built-ins. Unboxing requires tests to trap if the value is not a fitting number, avoiding complexity like handling objects with valueOf methods.
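The intended behavior of the boxing, type-test, and unboxing operations described above can be modeled in plain TypeScript. This is only a sketch of the semantics as summarized in these notes: the function names are hypothetical, traps are modeled as thrown errors, and the treatment of -0 (accepted here as 0) is an assumption rather than something the notes specify.

```typescript
// Hypothetical names; a model of the semantics, not the proposal's API.

// Boxing: an i32 becomes the corresponding JS number.
function boxI32(x: number): unknown {
  return x | 0;
}

// Type tests: true only for JS numbers that are exactly representable.
function testI32(v: unknown): boolean {
  return typeof v === "number" && (v | 0) === v; // rejects NaN, 1.5, 2^31
}

function testU32(v: unknown): boolean {
  return typeof v === "number" && (v >>> 0) === v; // rejects -1, 2^32
}

// Unboxing: "trap" (throw) instead of coercing, so strings and objects
// with valueOf methods are never converted.
function unboxI32(v: unknown): number {
  if (typeof v !== "number" || (v | 0) !== v) {
    throw new Error("trap: value is not a JS number that fits in i32");
  }
  return v;
}
```

By contrast, the user-space version imports identity functions and lets the boundary conversion coerce the value, which is what makes a string such as "7" silently become 7 there.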
The essentials that survived the phase one discussions include built-ins for converting i32s, u32s, and f64s to/from JS numbers, along with type tests. There is also a function to extract a JS boolean as an i32 (the reverse is possible with constants). Other essentials include undefined (once tested, extraction is not needed) and equality tests for JS symbols and big integers.
The "nice to haves" were largely axed. Remaining options include identity tests for symbols (questionable due to the availability of Object.is), conversions from primitive numeric types to strings (the reverse was axed), and to_lowercase/to_uppercase functions. The two string case conversion functions are kept because their user-space implementation would require embedding large Unicode tables. All other string methods can be implemented without large tables using existing JS string built-ins. Controversial items were completely removed.
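A short example of why case conversion is hard to do in user space without Unicode tables: a table-free helper that only handles ASCII (the helper name is made up for this sketch) disagrees with the host's String.prototype.toUpperCase as soon as the input contains a character with a special mapping.

```typescript
// Table-free upper-casing that only knows about ASCII a-z.
function asciiUpper(s: string): string {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    const isLowerAscii = code >= 0x61 && code <= 0x7a;
    out += String.fromCharCode(isLowerAscii ? code - 0x20 : code);
  }
  return out;
}

const word = "straße";
console.log(asciiUpper(word));   // "STRAßE": the ß is left untouched
console.log(word.toUpperCase()); // "STRASSE": one character expands to two
```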
MP: What was the reason for JS number operators getting cut?
SD: The argument was that if you want them they should actually be in wasm core. They exist as primitive operations in JS but they don’t really do any JS-specific things. I didn’t really want to touch core with this proposal.
ZB: We want it (fmod) in wasm core!
SD: You can implement it in user space; it’s not too bad. I’m told it is not much worse than the hardware.
SD: externref vs. anyref: There is tension between using anyref for generic types within Wasm (most generics are Wasm objects/structs) and using externref for consistency with JS string built-ins. The motivation for the new number-related built-ins envisioned anyref, but JS string built-ins use externref.
RH: Rationalization for externref in JS string builtins was we wanted representation for extern to be biased towards host values (e.g. JS). NaN boxing could have externrefs be nan boxed. JS string builtins operate on JS values and thus use externrefs. Would lean towards continuing to do that. Preference for sticking to externref but could measure.
SD: I agree that externref is the natural choice for strings.
KM: We use NaN boxing in JSC. If your anyref contains things that were already boxed or need boxing it would have to be a box value for us. Otherwise we would have to understand that a “double” anyref would have to be distinguished from a pointer.
SD: We are going to need boxing of values. I’m told externref and anyref don’t deal with SMI vs. heap number the same way.
AR: I’m with Ryan on this one. Originally we didn’t have the distinction and only had one type. Externref was introduced at a late hour with the exact argument that externrefs are “host values” and don’t require conversions. For this interface it seems natural to use externref since host values are manipulated.
SD: Even though you want this specific transition between anyref and Number to do the right thing, the thing that actually happens most is box/unbox. So the most common case is i32 to anyref or T. If the other way is a little more expensive but the semantics are right, it’s better.
AR: One possibility is that engines recognize composition of extern<->any conversions.
SD: Generic Built-ins and Reducing Number of Built-ins: We could generalize built-ins, such as having a single js_value_from_wasm that implements the ToJSValue algorithm, which could be imported with different Wasm types (e.g., i32, f64). This would reduce the number of explicit built-ins (currently 21). I don’t think the resulting complexity of adding generic built-ins to the specification outweighs the savings in built-in names and spec text. However, if polymorphic built-ins were introduced, they could immediately be used for shared string built-ins to avoid duplicating the entire set for shared externref.
SD: There was a suggestion to unify and have a single generic “fromWasm” (slide)
TL: If we did have polymorphic builtins, we’d probably use them for shared string builtins too, for shared vs unshared string refs. So maybe it’s nicer to do polymorphic builtins now.
SD: In practice the implementation would be different for these builtins.
RH: Polymorphic builtins have come up before. Would want to make sure that we don't miss any implications. On the other hand, polymorphic builtins could definitely be useful.
SD: If overall sentiment is to be generic I’m happy to make it generic.
AR: There are different kinds of polymorphism. This is a non-parametric kind of polymorphism.
SD: If you look at ToJsValue it looks like an overloaded function more than a generic function.
RH: With builtins in general we were in a rush to do JS string builtins, but this is a good opportunity to sketch things out more firmly. We can establish principles to ensure we don’t build something we forget. I realized we’re using URLs, like a wasm:: scheme and “js-”, and maybe we should use a path style with slashes.
LW: Is a shared function a subtype of a non-shared function? No? Nevermind.
SD: Relying on F64 Built-ins and Wasm Code for i32/u32: Another alternative suggests only having to_f64 and test_f64 built-ins, and adding Wasm code to perform the remaining steps of testing for i32/u32 fit. The argument against this is that i32s are often SMIs (small integers), and extracting them as f64s (doubles) and then converting back into an integer in Wasm would be inefficient.
SD: We removed a lot of builtins on the grounds that they were directly importable from JS. What qualifies? (slide). Is it ok to do this obscure binding to import? If the engine fails to recognize it, it will be very slow.
SD: Some things can be imported globally, but others can’t because they require a “this” parameter. You can generate various wrappers but it’s hard for the engine to recognize them.
BT: Everyone’s asking for DOM access? Other JS builtins like accessing properties? Not in scope here but discussion for the group?
RH: Like a JS object builtin that would do a getProperty? Don’t think that’s out-of-scope but once you have that you want a lot of optimizations in your wasm code. Caching, speculation, etc. Pragmatically want to defer.
KM: Once you’re doing that level of optimization in wasm it might be simpler for the engine to optimize the call to JS and the subsequent DOM or whatever access. Similar amount of work of putting it all into JS. Might want to inline that call anyway. Comparable amount of work that has higher dividends.
SD: Currently, in practice, we generate helper JS. It’s one function that does one specific property access or function call. We can’t bundle them as that might break evaluation order. Today in practice we have a wasm<->JS call for each specific access. Unsure if it’ll be fixed by making the bridge as fast as possible. In the end, as implementors, either approach would work if it has the same result of faster calls.
TL: Our position (V8) is that if you’re gonna write JS you should write JS. In some cases you can’t do that and we’re happy to optimize things. For DOM the expense isn’t in the boundary so it’s hard to justify the cost of optimizing the boundary.
CW: Do you have the wording for the poll?
JS Primitive Builtins to phase 2 vote
- SF: 5 in room, 1 in chat
- F: 12 in room
- N: 16 in room
- A: 0
- SA: 0
Custom Descriptors & JS Interop
TL presenting slides
( … slides … )
TL presenting: custom descriptors allow you to avoid having a separate userspace header in addition to the engine header (which contains JS shape information and is already inside the engine). This saves memory. It also allows you to hook up a JS prototype.
( … slides: Attaching JS Prototypes example; Declarative Prototype Configuration … )
CW: Normally, if we’re encoding negative numbers, it’s an s33 rather than an s32. Why this?
TL: You’re right it should be s33
AR: This only occurs once so it could also be a bool.
TL: Extra byte!
JK: The module can’t be big enough for s32 or s33 to make a difference.
TL: s33 it is.
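For readers unfamiliar with the s33 convention referred to above: Wasm integers are LEB128-encoded, and a field that must hold either a full 32-bit unsigned index or a small negative marker needs 33 signed bits, which is the same reason core Wasm block types use s33. The encoder below is a minimal sketch with a made-up function name; it uses plain number arithmetic rather than 32-bit bitwise operators so that values up to 2^32 - 1 are handled.

```typescript
// Encode a signed integer as LEB128 bytes (sketch; hypothetical helper name).
function encodeSignedLeb128(value: number): number[] {
  const bytes: number[] = [];
  let remaining = value;
  while (true) {
    const low7 = ((remaining % 128) + 128) % 128; // low 7 bits in two's complement
    remaining = Math.floor(remaining / 128);      // arithmetic shift right by 7
    const signBitSet = (low7 & 0x40) !== 0;
    // Stop once the remaining bits are pure sign extension of this byte.
    const done =
      (remaining === 0 && !signBitSet) || (remaining === -1 && signBitSet);
    bytes.push(done ? low7 : low7 | 0x80);        // 0x80 marks "more bytes follow"
    if (done) return bytes;
  }
}

console.log(encodeSignedLeb128(-1));         // [0x7f]: negative markers stay one byte
console.log(encodeSignedLeb128(4294967295)); // [0xff, 0xff, 0xff, 0xff, 0x0f]
```

The largest u32 index fits in five bytes as s33, whereas it is simply out of range for s32, which is why the choice affects validity rather than size in practice (JK's point).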
AR: Why is that a proxy? Why not a function you call into?
TL: We’re importing 10k globals, not functions.
AR: But for the proto factory, instead of being object why not be a regular function?
TL: Can’t call an imported regular function from a global initializer.
BT: With the data section, who’s interpreting the data section? The wasm engine?
TL: Yes, the builtin function. Implementation of the builtin itself.
( … slides … )
FM: What happens if you don’t mark the import as exact?
AR: The type def exact .. func.ref no longer validates.
AR: The reason you don't have ref.test_desc is that you can implement that with ref.get_desc?
TL: We don’t have ref.test_desc because it would be equivalent to ref.get_desc and ref.eq. No additional benefit.
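The equivalence TL describes can be pictured with a toy model in which every described object carries a reference to its descriptor. The interfaces and function names below are illustrative stand-ins for the Wasm instructions, not an API.

```typescript
// Toy model: a descriptor is just an object with identity.
interface Descriptor {
  readonly label: string;
}

interface DescribedStruct {
  readonly desc: Descriptor;
}

// Stand-in for ref.get_desc: load the descriptor from the object.
function getDesc(obj: DescribedStruct): Descriptor {
  return obj.desc;
}

// A hypothetical ref.test_desc would be exactly get_desc followed by ref.eq
// (reference identity), so it adds no expressive power.
function testDesc(obj: DescribedStruct, expected: Descriptor): boolean {
  return getDesc(obj) === expected;
}
```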
( … slides … )
LW: Are there any runtime performance improvements?
TL: Benchmarks are mixed. Getting more data as we speak. Average change is -1 or -2%. Unclear if we can do better. Don’t think it sinks the ship here. Can always go back to direct vtable field. Status quo on perf/heap but get JS benefits.
KM: On perf. The impact is the extra dependent load?
TL: Yes. Jakob did some stuff so the overhead on casts is minimized but we still have an extra load on a ref.get compared to just a struct.get. Because we point to the map which points to the descriptor.
RH: There were a number of future GC extensions proposed – parametric polymorphism, eliminating overhead of vtable calls. Does this close doors to that? Can they be combined? Co-design?
TL: Have not thought too hard about that because the type system stuff for vtable calls is hard. So not sure yet.
RH: Reason I bring it up is the type system stuff is beyond my grasp.
CW: Self types are also a little beyond me; we would need to bring in a proper object type system specialist.
RH: Parametric polymorphism?
TL: Can’t imagine a reason it would interfere
SD: Maybe I or someone we have can look at the self type part.
RH: If you’re expecting 8% heap reduction and no performance benefit, I don’t know if this is worth it for that. I would rather look for ways to improve JS interop without custom descriptors. This is a large addition to the type system for people to not necessarily use it.
ZB: As soon as we hear it’d be nice to see it worked in both directions. Make it work on wasm side without additional work on JS side. Right now we need to wrap such functions with bind.
TL: Since you can already do it with bind we haven’t thought about doing anything new there.
ZB: More convenient for toolchain side. And less code to generate.
TL: Granted. Would look into that if we expected to get performance and probably not otherwise.
BT: On inlined RTT: does V8’s implementation have an inlined display for the constant-time cast? (notes lost full context)
JK: Since Wasm GC, every map of a Wasm struct keeps a list of all supertypes. In practice the length of the list is limited, so it’s OK. The first implementation of cast_desc loaded the last entry from that list: 4 chained loads, which was too hard to predict. Now we additionally cache the immediate supertype, the canonical RTT, directly on the V8 map. That is fairly well predictable and brought the cost of the new cast instruction way down. Still slightly slower because we have to store the extra thing. On the other question about performance – the other source of slowdown is vtables being one layer of indirection further away. Some workloads speed up, possibly from having a smaller heap size; others slow down. Savings are 1 word per object. For strings it goes from 24 to 20 bytes.
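For readers unfamiliar with the "list of all supertypes" mentioned here, the classic display technique gives a constant-time subtype check. The sketch below is a generic Python illustration under that assumption; the `RTT` class and `is_subtype` helper are hypothetical and do not reflect V8's actual data structures.

```python
class RTT:
    """Toy runtime type. supertypes[d] is the ancestor at depth d, self last."""

    def __init__(self, name: str, parent: "RTT | None" = None):
        self.name = name
        ancestors = parent.supertypes if parent is not None else ()
        self.supertypes = ancestors + (self,)
        self.depth = len(self.supertypes) - 1


def is_subtype(rtt: RTT, target: RTT) -> bool:
    # Constant time: one depth comparison, one indexed load, one identity check.
    return target.depth <= rtt.depth and rtt.supertypes[target.depth] is target


any_t = RTT("any")
a = RTT("A", any_t)
b = RTT("B", a)      # B <: A <: any
c = RTT("C", any_t)  # C <: any, unrelated to A
print(is_subtype(b, a), is_subtype(c, a))
```

The check never walks the parent chain, which is why the length of the list matters less than the number of dependent loads needed to reach it.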
AR: Is there a tradeoff where you switch around whether the rtt, custom map, or descriptor is indirect?
JK: You could merge the custom RTT with the engine internal descriptor if you wanted to pay the implementation complexity cost.
AR: I see that for V8. Not necessarily an option for you, but possibly for another engine doing it the other way around? Might be preferable?
SD: Can't you have both if you inline the map before the fields of the descriptor?
JK: That’s what I meant by merging the two. Can get fast casts and accesses with no indirections. Probably will never do it in V8 due to two invariants: one, all objects have the shape field first; two, descriptors are an engine-internal representation. We don’t want to hand them to the JS world as custom structs. A Wasm struct can be thrown over to JS. For complexity reasons we want it to be a different object type for the foreseeable future.
RH: Would agree that I don’t think it’s possible to merge the two. If the descriptor is the same wasm struct as all other wasm structs.
JK: Doesn’t have to be the same layout as it’s statically different. For us it’s already represented differently. We have extra field pointing to internal engine map. It’s possible to have different representations.
RH: Issue when a descriptor flows out to JS. For SM shapes are not JS objects. Would pass a map and have a typeof be object and it’s not feasible.
TL: Point is that you’d never be able to merge them?
RH: Yes, if merging was desired they’d have to be less flexible.
BT: It’s already tough to make a GC where the metadata is moving during GC. I’m curious if you know how many descriptor objects live in the app. That’s space overhead too.
TL: Yes, we did see that. Because of the rule that the descriptor is 1:1 with the described types – as opposed to normal types with vtables – we end up with more types in the modules. We have descriptor objects instead of vtables and they’re each 1 word bigger. For tiny benchmarks we saw total memory usage increase because of fixed-size overheads. Tied to the number of types, not the size of the heap. In real-world benchmarks and production data we see a benefit.
BT: That’s also for languages where you don’t have many descriptor objects of the same type
TL: That’s true, we have one descriptor object per class.
BT: Do you have an order of magnitude number for those?
TL: Order of 1k
JK: On sheets we have ~6k classes. If each gets a descriptor, unsure if each gets one.
TL: Binaryen does optimize out unused descriptors
SD: Would it help to forbid descriptors to reach JS and can you do that? For example cannot pass exnref to JS.
TL: Would be feasible. No use case of needing the descriptor in JS. Could be conservative and disallow that.
KM: The interesting thing is putting one of these structs into something else and then that leaks out. Now table.get would need to check, for example. Can of worms. Not impossible to solve, but we need to be careful about where everything can escape.
SD: Exnref already has that property. We know statically whether it’s an exnref or not since it lives in the same type hierarchy.
KM: These structs would be anyrefs though?
TL: If you had a table of anyref ..
KM: Could load from table and call JS – not saying it’s impossible there’s just many ways
AR: I think this is ok; You have a membrane, you can pass out an object that contains or points to the things you just can’t access in JS. I think we already have that.
TL: For anyref hierarchy we have to add new checks and it would be a pain.
KM: Agree we would probably never merge, at least if they could escape, as it would be a security and implementation nightmare.
TL: Good to know, 3/3 on that one.
CW: I want to echo that this is a very complex extension for a marginal benefit. Didn’t show all the slowdowns but there are tradeoffs. I would be more comfortable if you told a story about how it could get better. You deliberately aren’t saying that?
TL: Heap size benefits are extremely valuable. JS interop benefits are also extremely valuable. We have very large apps compiled from Java to JS, and the boundary between Java and JS is porous because the app is written knowing the Java will be compiled to JS, so this is important for drop-in replaceability. So we can’t do it without some kind of JS interop. For the larger ecosystem goal of having Wasm be more integrated on the web, providing the ergonomic API to developers is valuable.
CW: Can you expand on that. You want to piecemeal change part of the compilation (previously to JS) to wasm?
TL: Whether it's a gradual port or all at once doesn’t make a big difference but to get this code to wasm requires the JS interop.
BT: Without this proposal Wasm gc’s object model isn't competitive with native implementations in other languages. Every language that implements this expects to be able to merge the header words
CW: Why aren’t we seeing benchmark with huge memory benefits?
TL: Smaller objects, more overhead. In our benchmark it’s 8%. Large Kotlin app generates tons of tiny objects and overhead of extra header word is much larger.
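The "smaller objects, more overhead" point is simple arithmetic: removing one header word saves a larger fraction of a small object than of a large one. The sketch below assumes a 4-byte word, inferred from the 24-to-20-byte string figure quoted earlier; the other object sizes are hypothetical examples, not measurements.

```python
WORD_BYTES = 4  # assumption: 24 -> 20 bytes for strings implies a 4-byte word


def relative_saving(size_before: int) -> float:
    """Fraction of an object's size saved by dropping one header word."""
    return WORD_BYTES / size_before


# 24 is the string size from the notes; the other sizes are illustrative only.
for size in (12, 16, 24, 48, 96):
    print(f"{size:3d} B -> {size - WORD_BYTES:3d} B: {relative_saving(size):.1%} smaller")
```

A heap dominated by tiny objects therefore sees a larger total reduction than one dominated by large arrays or strings.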
CW: Can we see an experiment that proves that?
TL: Zalim already working on stack switching and shared-everything threads so might be tough.
KM: How much of the memory overhead is not writing your own hooks for JS objects and how much is from the descriptor overhead and vtable?
TL: The measurements we’ve done do not do any JS interop stuff.
KM: Wasn’t sure if that was factoring. Could imagine instead of this proposal could do weird proxy thing.
SD: Wanted to comment on the JS interop benefits. For Java maybe it goes from 60% to 65% interoperability. For Scala.js it goes from 99% to 100%. This is the missing piece for us. If you have this you can take any Scala.js application that you’ve had forever and drop it into Wasm.
TL: Nice.
JK: On the numbers for Java drop-in interop, the number goes from 0 to 80-90%.
TL: Yes it’s very valuable. Not doing phase poll now, maybe tomorrow.
( … slides … )
Website and webassembly.org
TS: The biggest new feature of the website is the news section. Right after we launched it, we announced 2.0, then spectec, and then 3.0 posts (written by Andreas). Lots of comments on the 3.0 post e.g. on Hacker News. This kind of stuff really reaches developers.
TS: Simple markdown format and easy to get started. Can copy from existing blog posts. Can see list of features part of Wasm 3.0 [on the blog post] and I have a web component for exposing which runtimes implement which features. Can get an overview of features.
DG: Realized that there’s a lot of blog posts, v8.dev, Andreas’s, SpiderMonkey, Alon on GC in Emscripten – several news sources. Can we link them here?
TS: We can definitely cross post or have links. We can use this to amplify existing content. If you want to write something, remember that this exists.
TS: I run the WasmAssembly podcast and was wondering if it could be cross-posted here. Always looking for guests – please talk to me.
BT: What’s the process to get an article on this site?
TS: Open a PR. It’s a markdown file with a header. I review it but there’s no formal process. So far this has been working fine, but only other posts from Andreas so far.
AR: The question is, how many posts do we want, and how much do we want to focus on “big” news or more just potentially interesting Wasm content? I would not like it to look like a facebook feed where there’s a lot of stuff and you might lose interest. But it’s hard to know where you should set the boundary.
TS: Agree, should be in the area of long reads – e.g. 15 minutes. Don’t want to have tiktok-style small reads. Exception perhaps mini-blog-post pointing to something else. Should still meet our quality bar and not be specific to e.g. a particular engine.
AR: One way to deal with that is weekly round-ups that collect interesting posts. One single post here instead of small individual ones. More editorial work though.
TS: Had some newsletters in the past but most die out because there’s not enough material to fill per week. Without a cadence it might work. Don’t think we have enough material for something weekly.
DG: You could also imagine e.g. a different “tab” for blog posts and a separate category for announcements, our own news, etc. it does seem like a nice place for the community if there’s something outside of just CG news.
TS: Personally I’m subscribed to hashtags on mastodon and it’s not too overwhelming. Maybe 2 items a day?
CW: I’m hearing something like a community section where news is something more formal.
TS: Community section may have things like product announcements? Don’t want this to become a social network kind of thing.
DG: Don’t know we have that much content coming out. For those that are interested, I want to make sure they can publish their work. Not sure if the barrier needs to be so high. E.g. I wouldn’t hear about scala.js updates except that they come talk to the CG. I'm sure there are lots of interesting things like that to surface.
TS: Opening a PR is low-key enough? Could have different tags on site. Could subscribe to only news things. Could also subscribe to everything.
BV: Agree with others here. My main impression is that anything we would publish here would be great. What this might need more than anything is some editorial voice. If you think something would be a good idea, you should publish. I’ll reach out next time myself when I write something and I’d encourage others to do the same.
TS: I don’t want to become editorial decision maker but so far everything’s been thumbs-up.
BT: This website is intimidatingly official. People don’t want to come off as self-promoting. Many self-organizing events aren’t great at self-promoting. If you can help pull things out of us that would be helpful.
TS: When I don’t know the answer to something I loop in the chairs. I would consult others on something that might be more borderline. Maybe me as a first filter.
SL: An obvious thing missing might be this meeting we’re at?
TS: Do we want to make the community bigger? Thomas said there’s a long tail of folks in the CG who don't show up. Would we want to advertise the meeting and invite local guests who wouldn't normally attend?
TL: Advertising this meeting totally makes sense. We are the largest W3C community group and it’s not even close. I think it totally makes sense.
SL: Not just about encouraging folks, but also showing activity is happening.
TS: Very good point, I’ll announce this going forward.
BV: I would be happy to volunteer to summarize meetings and publicize major decisions.
TS: Thank you and I am looking forward to hearing from you.
TS: Talked with Patrick, who wrote a book about getting started with WebAssembly. Previously we didn’t have public resources about learning Wasm by writing Wasm. It might be nice to have a mini version (maybe a couple of chapters of material?). We need to be careful how we present this, but it should be compelling enough, more than a teaser. It could link to the book. How would others feel about Patrick, Mariano (spelling?), and myself contributing?
RH: I think having a tutorial would be great. We also do have MDN and it already has a big Wasm section. Could easily go here too. There’s also a reference of Wasm instructions that would be great. MDN would be a good fit, at least a link.
TS: 100%, but I’m thinking more of a tutorial than reference.
RH: Both guides and references on MDN.
CW: A self-contained tutorial makes sense. There are a few tutorial books on wasm and would want to be careful about promoting one over the other.
TS: There’s a link to learn-wasm.dev in the issue. Maybe Patrick would be ok contributing without a link? No opposition?
wasm_of_ocaml experience report
JV presenting slides
JV: presenting the compiler, and how it's used in the industry
JV: I have been working for a long time on js_of_ocaml. Now it's being retargeted to Wasm.
(slides)
JV: We have a uniform representation for all values
JV: CPS transformation inherited from js_of_ocaml. Other schemes specific to Wasm (JSPI, Stack Switching)
Not easy to do better than Binaryen. But we try to generate code that Binaryen can work well with. For example, i31ref is not well optimized by Binaryen, unfortunately. So we try to use i32 as much as possible.
TL: Is this something we should fix in Binaryen or is there some fundamental issue that would make it hard to fix?
JV: I’m not sure, maybe it’s important for us, but not important for others.
TL: We’d be happy to fix it.
JV: It might require source knowledge.
TL: We wouldn’t be able to use that then.
KM: Do you see that issue in the generated code of the Wasm runtimes?
JV: The issue is that if you convert to 31 bits and back, the conversion is semantically meaningful.
KM: Why would it not be just a representation change?
JV: Going from 32 bits to 31 and back is not an identity. But if you do several such conversions in a row, you can keep only the last one.
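To make JV's point concrete, here is a small Python model of the round trip. It assumes the usual semantics of `ref.i31` (keep the low 31 bits) and `i31.get_s` (sign-extend from bit 30); the Python function names are illustrative.

```python
def to_i31(x: int) -> int:
    """Keep the low 31 bits of an i32, modelling ref.i31."""
    return x & 0x7FFF_FFFF


def i31_get_s(bits: int) -> int:
    """Sign-extend a 31-bit payload to a signed value, modelling i31.get_s."""
    return bits - (1 << 31) if bits & (1 << 30) else bits


def round_trip(x: int) -> int:
    return i31_get_s(to_i31(x))


print(round_trip(5))        # small values survive unchanged
print(round_trip(1 << 30))  # bit 30 becomes the sign bit, so the value changes
# Applying the round trip twice gives the same result as applying it once,
# which is why a chain of such conversions can be collapsed to the last one.
print(round_trip(round_trip(1 << 30)) == round_trip(1 << 30))
```

Because the round trip can change a value, an optimizer may not delete it without knowing the value already fits in 31 bits, which matches the remark about needing source knowledge.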
Commenting on benchmarks: besides arithmetic, bounds checks and other primitives need integer operations, so it's worth optimizing the conversions away.
Accumulators: load elimination does not work there in Binaryen, because there isn't a single structure. There are multiple allocation sources.
Closures: initially, just function pointers and (ref eq)s. Using precise types helps Binaryen. Getting rid of function pointers also allows Binaryen to get rid of unused params, for example.
If you want to copy from Wasm array to TypedArray, it's cheaper to do it from JavaScript. If the JS has to call many small Wasm functions, that's faster than the other way around. It would be even faster if we had primitives for that. One difficulty is that what engines can optimize is not well documented.
JK: comment on documentation, this is a WIP as time permits. We’re seeing how much we can optimize. We’re not as far as we’d like to be. We might write better docs in the future. Calling wasm from JS is almost always going to be cheaper than the other way around.
(slides)
RV: The biggest thing we're interested in for wasm_of_ocaml is Stack Switching. The CPS transform that we're currently using has a number of downsides, performance in particular.
MP: You mentioned on an early slide, you were using JSPI.
RV: I think that JSPI is better than the CPS transform. Good first step, but stack switching will let us model effects cleaner.
MP: What are the gaps there, where JSPI is not sufficient?
JV: It is quite slow
RV: JSPI is only slow when performing an effect, so that’s nice. CPS is slow everywhere. Stack switching will let us have reasonable performance in all cases.
AR: The JSPI thing only works when you’re in a JSPI environment, right? Is that everywhere you’re running?
RV: All the OCaml code compiled to wasm is running in the browser
FM: Can you quantify the cost of the CPS transform, in terms of code size and runtime performance?
Olivier: I don’t have the numbers for the overhead of using CPS in mind. There are two kinds of transforms. The usual one CPS-transforms all the functions that need it, i.e. those that may perform an effect. I don’t remember the numbers.
RV: There is a PR where we do all the comparisons; we can share that afterward. It's pretty substantial.
Olivier: There is another transform that translates to CPS that keeps the non-CPS versions and dynamically switches the kind as need be. This is faster but has much more code size overhead.
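As background for the CPS discussion, the following toy Python example contrasts a direct-style function with a continuation-passing version of the same function. It is a generic illustration of the transform, not output of wasm_of_ocaml, and the function names are hypothetical.

```python
def sum_to_direct(n: int) -> int:
    """Direct style: the result comes back through an ordinary return."""
    return 0 if n == 0 else n + sum_to_direct(n - 1)


def sum_to_cps(n: int, k):
    """CPS: the rest of the computation is passed explicitly as continuation k."""
    if n == 0:
        return k(0)
    # Each step allocates a fresh closure that remembers what to do next.
    return sum_to_cps(n - 1, lambda partial: k(n + partial))


print(sum_to_direct(10))
print(sum_to_cps(10, lambda result: result))
```

Having the continuation as an explicit value is what lets an effect handler capture and resume it, but every call now pays for a closure allocation and an indirect call. Keeping both versions of each function, as in the second transform Olivier describes, avoids that cost on paths without effects at the price of roughly duplicated code.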
RV: Aside from that, cautiously optimistic for other features. But stack switching is the biggest one.
SL: Is this all in Chrome?
RV: Yes it’s all in Chrome
FM: Another question, is your software suite similar to the Bloomberg suite?
RV: I’m not sure about the Bloomberg suite.
FM: Someone outside of Fintech wouldn't know the difference?
RV: We use OCaml much more than Bloomberg does.
FM: Has someone compared the performance of your suite and the Bloomberg suite?
RV: No we haven't considered doing that. We haven’t done the comparison.
RV: Sometimes we're lying in the types when writing bindings. In JS it's all right because the representation is actually the right thing, but when compiling to Wasm, this is definitely not true anymore. So porting work is required.
RV: Graph of our adoption inside Jane Street.
CW: Does this represent the same app compiled with two different backends, and dynamically switched?
RV: So every app can decide on their own. We compile apps to both. Wasm versions are more likely to have bugs, because reasons: people less familiar, browser implementations more recent. So it's good to have a flip back to JS when we need it.
DG: You had something about a long tail of JS bindings. Is that in your app?
RV: We bind to a lot of JS apps, things like CodeMirror. Had to update those. But there’s a long tail of small libraries that need to be updated.
Olivier: I got numbers for the impact of CPS. In JS we would get 40% slower due to CPS. With optimizations, we were between a few percents and 25% slower. For wasm we were seeing 15% for CPS transforms. That’s the order of magnitude.
BT: What about code size?
Olivier: In compressed terms, in JS it’s not that big, like 2%. For Wasm it was more like +30% to +50%.
TL: Thinking about custom descriptors, some of your problems are that the JS types were lied to and broke bindings. Do you think that custom descriptors would help?
RV: Custom descriptors examples you showed are oriented to OO. Not clear how that would apply to OCaml. Would you have ideas?
TL: Are you ever exporting OCaml values out to JS and expecting JS to call methods on them?
RV: No, our main web framework is written in OCaml
TL: That makes sense. When the entire app is compiled to Wasm, the more interesting thing is how do you get JS stuff into your language, rather than the other way around.
BT: So as OCaml effects are used more in the ecosystem. Do you have any idea how many stacks are being used? Order of magnitude?
RV: It depends on how much you use effects. If you only use something straightforward, it's not so bad. But if they do things we haven't really seen often, odd or unique ways to use effects, then it's worse. "on average" does not really make sense here; it's better to check specific benchmarks/use cases.
TL: Is it too early to have numbers for the stack switching stuff?
JV: We don't have any numbers yet. It wouldn’t be a direct comparison since it’s not V8 where we run the rest of our code
SL: It’d still be interesting to see those numbers
FM: I have another question. One of the long-running themes in Chrome/V8 is support for CFI; the Intel version is called CET. This is supposed to enhance the security of your application. Is this an issue that concerns Jane Street?
RV: I don’t know enough to say, we can chat after. I need more context.
AR: My guess would probably be no, but just a guess.
Update on Custom Page Sizes - Nick Fitzgerald
NF presenting slides
(slides)
NF: Need to learn SpecTec… and OCaml.
DG: This is one of the first few proposals with little relevance to the web. This may be a good time to discuss what phase advancement should look like without two web implementations.
KW: I think we want Web engines to implement this; we worked hard to make it very minimal.
DG: We have a lot going on right now, not sure if we can work on it right now
CW: Does this proposal need to be at Phase 4 to be useful off the Web?
NF: Not necessarily, we need a signal that it’s production ready and not going to change. But in that case why not just advance it? If there’s a proposal that everyone knows will never merge but you should implement, we should just do the right thing. I have ideas of what we could do for the CG process, but not sure we want to go down the rabbit hole right now.
DG: For history, about 3 years ago, we discussed what happens for features that don't impact the Web? We said we would discuss again when it would happen. It seems now is the time where this discussion becomes important.
BT: I’ll say we definitely shouldn’t stay at phase 3 forever. I like the proposal and implemented it in Virgil, but couldn’t write it in the reference interpreter since I don’t speak OCaml.
AR: Ah but you're all world-class engineers.
NF: If only someone here knew OCaml
NF: One concrete question for the web browsers without WIP currently in review. Do you feel okay for phase 4 without an implementation in your engine?
Chris Woods: We’d want to use this in production, but we also want to run it in the browser. It’d be inconvenient but not a show stopper. A show stopper would be if we don’t have standardization, and toolchains lose support of it. The stability is important for toolchains and compilers.
NF: Agreed. It's important for the ecosystem that we actually finish things, and get them standardized.
LW: On the value of this proposal to the Web, talking to mobile folks: there is a limited number of guard page regions, and at a certain point you run out of address space and fall back to slow memories with bounds checks. Having this proposal allows users to opt out of guard pages manually for utility modules.
KM: There are definitely devices where the total address space is 36 bits, which does not leave a lot of room for garbage. In that situation there would be value, definitely. It's very reasonable. I don't think we're overly concerned about going to Stage 4. We're happy to take open source patches.
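The address-space pressure LW and KM describe can be estimated with back-of-the-envelope arithmetic. The sketch below assumes each guard-page-protected 32-bit memory reserves 8 GiB of virtual address space; that figure is an assumption for illustration, since actual reservation sizes vary by engine and configuration.

```python
GIB = 1 << 30


def max_guarded_memories(address_bits: int, reservation_bytes: int) -> int:
    """Upper bound on guarded memories, ignoring every other use of address space."""
    return (1 << address_bits) // reservation_bytes


ASSUMED_RESERVATION = 8 * GIB  # assumption, not a figure from the discussion

print(max_guarded_memories(36, ASSUMED_RESERVATION))  # 36-bit device from the notes
print(max_guarded_memories(47, ASSUMED_RESERVATION))  # typical 64-bit user space
```

Under these assumptions a 36-bit device runs out after a handful of memories, so small utility modules that opt out of guard pages free reservations for the memories that benefit most.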
BT: I think we should follow the exact letter of the process, if we want to get to phase 4 we should check off all the boxes and if that’s web engines we need to do that.
NF: I'm not in favor of making a specific exception for this proposal. If we do avoid the 2-web-engine requirement, then we should amend the process to make it clear when/why we can drop that requirement.
CW: I’m worried about dropping the web engine requirement. I think it’s largely the thing that’s keeping the ecosystems coherent. If we weaken it for this small proposal, I think we could see the splitting of the ecosystem.
AR: I don't think we drop the requirement. Maybe we make it 1. These are just minimal requirements. There's still the CG vote, which can control the process in the end.
CW: I don’t think we should rely on throwing out part of the process and rely on the votes.
NF: One idea I had was: as a CG first, do we agree that this/that proposal does not require the browsers implementation. We would identify up front which engines we agree would be appropriate to vet the proposal. And then only poll for phase 4 with that requirement, specifically for these proposals; not for all proposals from now on.
DG: Clarifying question, what are metrics you’d look at for non-web engines? Historically we’ve used web engines because they were mature and the ecosystem was new. Downloads, partners, users?
NF: I think there are a lot of metrics we could look at. Partly I want to avoid doing that because there’s so many things you can do. That’s why I’d prefer to have the CG make this decision on a per proposal basis. It would avoid having to write a lot of new process.
CW: as a community we tend to be quite nice, and so people are quite uncomfortable having to say no on a case by case basis. I worry that would make the bar for engines too low.
BT: Generally agree with that. Some languages have process on how you even update the process. We don't have that right now. What is "production engine"?
CW: If we were talked into ‘one engine’, would we even have an implementation?
RH: Yes we would, Firefox would.
CW: Have we then maybe also talked another engine into it?
MP: We aren't planning on implementing it, although we don't think it's a particular issue.
DG: If it's not a proposal that is going to be used on the Web, then does it make sense? Or maybe it would be used very differently in browsers.
NF: Outside the web we care about correctness and security
SL: Having an implementation on the web for something that you can’t use on the web doesn’t give you much assurance
NF: You could test the semantics of this one.
BV: I think web browsers are not much different from embedded platforms, we have lots of tiny modules with tiny compute. It seems likely users will want this if you have tiny memories and are willing to pay the cost of bounds checks. The process discussion seems academic, this seems like a web feature, it’s not necessarily a bad fit.
DG: I agree it's not a bad fit. But we track metrics for memory size. If you run out of memory frequently, that's a bad thing. But there's an opportunity to redefine the process a bit, and we should take it.
BT: My understanding of the process is that the features all need to fit together architecturally. Web browsers implement it all and need them to all work together.
SL: I guess another interpretation: in that case, not only do you need the 2 browsers, but you also need another implementation. Which is adding extra burden.
Chris Woods: Unless you want to strip out the web engine requirement and just list out all the guarantees that web engines have.
TL: I just want to say that, I’m glad we’re having this conversation. I’m glad that Firefox is getting an implementation and that webkit is open to patches. It gives us a path forward for this proposal and makes the bigger discussion less urgent.
Merlijn Sebrechts: Coming from the embedded side, there is also a lot of value in having as many of these embedded features as possible implemented in browsers. I think it makes sense to push this further and have browser implementations. It’s nice to have development workflows work in browsers. We also could imagine offloading certain workloads and computations to browsers and gateways. We’d want strong compatibility with these other platforms.
AR: The question here, though, is not whether browsers should implement the feature, but whether we should wait for browsers to implement it. We still want a standard, obviously. It's a question of the timeline.
MS: I see the browsers as being a superset for everything, and embedded devices being a subset.
DG: I see the value in this type of thing, but we have limited resources and a complexity budget. Accepting features that aren’t relevant exposes us to security issues and bugginess. We have CI, and need to test this on representative devices. This is an implementation detail, but there are also implications to those implementation details. I don't want to set the precedent that this is what we would do for all proposals that are not directly applicable for browsers.
CW: This goes beyond just what we want for this proposal, but what is the spec. If we say the reason to change the process is because there are features that browsers never want to implement, then that’s more significant.
AR: I don't really see why web browsers should forever have to be the superset of all webassembly proposals.
CW: We should fight like hell to keep that true as long as possible.
BT: I have a public allergic reaction to web browsers saying that they can’t implement something because they can’t test it. If they veto every such feature, then we’ll never have anything that web browsers won’t implement.
CW: Ultimately I don't see the spec forcing engines to implement things. Rather it’s reflective of what they’re willing to implement. If Web engines don’t implement, we shouldn’t add to spec.
Chris Woods: we have profiles, maybe we could split things out?
CW: I was comfortable with profiles with the understanding that browsers implement everything
AR: Just to clarify, I think this is still hypothetical. I think for this proposal, I don't see why web engines would not eventually implement this. So it's more about timeline: should web engines be the blocker.
JK: Echoing what Deepti said, I don’t think we’d veto this proposal from phase 4. Mostly for complexity budget. Every configuration we have to test makes additional ways that we can have bugs. For this particular feature, I don't see how it's significantly useful on the Web, so we're probably not going to implement it anytime soon. But it's definitely implementable. We're in agreement there. The intent of the 2 web engine requirement is to prove that the idea is implementable. No concern that it would have a really high complexity, or slowdown. But it's not important enough for us to implement in the next few years. Even as outside contributions, it grows the complexity budget.
Chris Woods: Every piece of software has the same constraints. Nobody wants extra complexity for no benefit. But I can see this issue coming back again; agree with Ben that we should stick to the process. We have a way we could advance this proposal, but I think it’s worth trying to improve the process. It feels like this discussion is going to take a long time, so maybe we should start it now, so that we're ready when the time comes. Eventually there will be more proposals that browsers have no interest in, but that the embedded community has strong interest in.
TL: we could add a session tomorrow.
Compilation Hints, Emanuel Ziegler
EZ presenting slides
EZ: Priority is the order in which you compile the functions. Optimization priority is "how important is it to propel it to the next tier?" instr_freq used to only be about call sites, but we made it more general so that you can say how often specific blocks/loops are executed. It turned out to be not so easy to implement: it's hard to find the annotation that is at an earlier offset than where you are but still applicable to where you are. So currently our prototype only uses that information at (exactly) call sites.
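One half of the lookup problem EZ mentions, finding the nearest annotation at or before a given byte offset, can be sketched with a binary search. The hint list and `applicable_hint` helper below are hypothetical; deciding whether that annotation is still applicable would additionally require block-structure information, which this sketch deliberately omits.

```python
from bisect import bisect_right


def applicable_hint(hints, offset):
    """Return the latest (offset, frequency) hint at or before `offset`, or None.

    `hints` must be sorted by byte offset. This does not check whether the
    block that the hint annotates has already ended.
    """
    offsets = [hint_offset for hint_offset, _ in hints]
    index = bisect_right(offsets, offset)
    return hints[index - 1] if index else None


# Hypothetical per-function hints: (byte offset, frequency value).
hints = [(4, 10), (20, 1), (36, 50)]
print(applicable_hint(hints, 25))
print(applicable_hint(hints, 3))
```

Restricting hints to exact call-site offsets, as the prototype currently does, turns this into a plain dictionary lookup and sidesteps the applicability question.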
FM: Do you have a sense for how useful this is for developers?
EZ: We did some experiments, we saw good improvements, from benchmarks that are not super realistic. Knowing ahead of time what functions are going to be needed can save several hundreds of ms in the beginning. There are definitely areas where it improves user experience.
BT: For the block frequencies, could those include frequencies below 1?
EZ: Yes, it's normal indexing, but shifted.
EZ: If you're interested in trying it out, we would be very happy to get feedback.
Wide Arithmetic, Alex Crichton
AC presenting slides
AC: Benchmarks: how slow was it compared to native.
DG: Jakob correct me if I’m wrong, but we want to implement this maybe early next year. We’re just busy.
KM: Bandwidth issues again. Probably vibe-codable in an hour.
AC: If the blocker is mostly bandwidth, that's completely understandable.
KM: The hardest part is mostly just importing tests into our test harness. AI has made writing tests easier.
BT: I’m curious about the ‘every 3-4’ years consider a larger proposal idea. Maybe we should do a hardware study of what exists in the wild?
DG: That's something I'm planning to talk about tomorrow. These instructions have a high impact on specific use cases, regardless of how niche they are. We could have an incubation-style thing for these kinds of instructions. Two meta-questions: is there an easier frame of experimentation, when we don't need to introduce types, for example? And Lars introduced: should we have a simplified process to do easy stuff?
AC: I would love to make it easier to put tests in the spec repo, and have them be run with engines.
RH: WPT has a test importer, Firefox as well. I think for this proposal, it's the multi-value nature of it, which is unfortunate, but otherwise we would have done it. I don't think we need a new process. If one provides an implementation and shames everyone else, they'll do it.
TL: Importing spec tests into V8: super easy now, there is a script, it's great.
KM: The biggest issue is that it copies all of the tests. We'd like to filter the delta of the new tests. Currently you need to do the diff yourself and it's a pain.
TL: Alex has a script that finds the tests that have differences.
DS: We've been trying to refactor spec tests so that new tests tend to get their own directories or files so it’s easier to diff.
KM: One thing that has worked well in JS is the Test262 suite. The number of corner cases in Wasm should be a lot less than JS.
AC: Even though the spec doesn't have many corner cases, in practice implementations have them. It would be great to share them better.
KM: Testing JITting is tricky, because you need well-known counters for the specific number of runs of loops. Maybe we should talk about the kinds of things that could make the spec tests more useful to engines.
End of day 1
Stack Switching (Francis McCabe, 60 minutes)
FM presenting slides
FM: I am going to focus on where we are and where we are going to go.
(slides)
FM: For Google, our focus is on supporting Kotlin. There is a community of libraries structured around coroutining and so developers are very familiar with working with core coroutines.
FM: There are 4 million Dart programmers in the world.
FM: Go has goroutines, and the Go team has recently become very interested in stack-switching. So much so that if they were forced to choose between stack-switching and shared-everything support, they would choose stack-switching.
MP: Do you have any more information on why they would prefer stack-switching over shared-everything?
FM: The dominant industry where Go is popular is edge computing, where you rely on customers being able to run code on the server. In that environment the producer needs to strongly constrain the parallelism on the client. On the other hand, Go is concurrent, so they need to be able to multitask.
AR: Go is designed around massive light weight threading. There is no way you can do that without an efficient way to do lightweight threads.
(continues with slides)
FM: Scheme and its family of languages have call/cc. The main difference between call/cc and what we are offering is multi-shot continuations. We have stayed away from that, in that the design of stack-switching does not support multi-shot computations. Apart from a few academics there are not that many compelling use-cases for it. If you talk to the Lisp toolchain owners, they actually offer it, but most of their users do not actually use it. I offer this as sort of an origin story.
AR: You did not mention Java. We haven’t gotten them interested yet, but the current version has virtual threads as a prominent feature, which will also be possible to implement with coroutines.
FM: I forgot to mention Java. I don’t know the exact number, but in the range of 50 million developers.
AR: What does Asyncify do for GC types?
TL: Not supported
FM: At the moment, asyncify doesn’t work for GC types
CW: You can do it through tables though
AR: You get an extra indirection, that would be even more expensive.
CW: I did not realize we had given up entirely on asyncify+GC.
TL: Nobody asked for it
FM: Asyncify is actually really good. The current Go transformation is much worse than this. It was rather challenging to beat Asyncify when testing JSPI.
KM: Do you know what Go’s current strategy is? Is it like Asyncify or do they do a full CPS transform?
FM: I don’t know the details. Their CPS transform focuses on generating Go code, so going from Go to Wasm, they have an emulation engine for the Go control flow.
TL: Switch in a loop for every single function.
FM: This is a similar strategy to how code obfuscation works
AR: This doesn’t really scale at all.
FM: The smallest Go program seems to be 40MB. However, they are looking seriously at stack switching, and they are also looking to do a better job of the transform.
(Continues with slides)
We are interested in shared everything threads, how will it interact with stack switching. Users want to do work stealing, migrating a task between threads as work becomes available. Or a version you might call “work donating”. E.g. you have to update UI elements in your UI thread on many platforms. The combination of stack switching and shared-everything is very powerful. But apart from the shared attribute, there’s nothing special about a shared continuation. There are restrictions on it, but we don’t anticipate significant problems from the stack switching side.
In Java, all the objects are shared. And your functions. And the only difference between a continuation and a function is the local variables that are closed over. There don’t seem to be interesting performance implications.
So where are we today? We have a settled design, multiple implementations, including in V8 where Thibaud is the main one working on it. We expect letting people try it out early next year, and hopefully an origin trial sometime next year. It’s a priority for us. Hopefully will go better than custom descriptors.
CW: This type system extension is probably less complicated.
SL: This design is fully verified and mechanized, so it is type sound - not ruling out security bugs but the type system bugs are not there
FM: The implementation of stack switching is based on the technology we built for JSPI, so for the most part the foundations will be usable. One main exception is that we have an implementation of growable stacks. This isn’t needed for JSPI and is not turned on at the moment, but we expect it to be a part of the current implementation.
BT: I think it is worth mentioning that there are toolchains that target this feature.
FM: OCaml already targets it, we have a collaboration with JetBrains for Kotlin, and it looks like the Go team will also be doing it.
TL: Binaryen also has full support and is fuzzing it for internal consistency.
KM: I know we talked about this at the stack-switching meeting. The overall design makes sense. For us the concerns are with respect to the shared stacks. If you have some unshared type and you suspend the stack, it’s not that the spec would be incorrect, but that you’d get traps at the application level that are hard to recover from. Our concern is that it might be problematic for people. We do not want to ship this feature for unshared stacks and then have it never end up being used in the shared context, because toolchains run into these implementation lowering constraints and their source language makes it ineffective. Alternatively they have to box their values a lot and end up with performance overhead from that. Some of our hesitancy about going to phase 3 is about that.
CW: If this is a concern for the phase 3 vote we should talk about it in detail. Maybe I can respond to the point about thread local wrapping.
KM: The next topic would be related to hardware stack protection.
CW: On thread-local wrapping, the point about the ergonomics of source-level wrappers is orthogonal to stack switching. It is something that needs to be resolved for shared functions in any case, and it will be resolved for shared continuations in the same way.
FM: This is a version of saying that if you don’t have stack switching then you use callback hell. The work stealing is sending a function to the UI thread. So you are trying to run a function on another thread in that scenario. In that situation, you have the same situation of shared and non-shared values.
KM: Concern is more along that when you have a single function you know without a global analysis when you need to box your values. With a full stack you will have to pessimistically box for every value
CW: It’s less nice than that pre-stack-switching.
KM: Depends on how you implement it. You could imagine a scheme where it’s recoverable: the entire context is wrapped, you pull out the local context and it fails on that thread, and then you switch threads and can resume. With stack switching you’d see the error instantly, deep in the stack, before you can switch threads.
CW: There would be two places where errors would happen. When the code is produced, without the annotation then it is unhappy and doesn’t work. The second place seems like exactly what you would want.
KM: In the stack switching scenario, you’re in a frame called foo with hundreds of things that it calls, and any time it calls one of them it could theoretically suspend.
In order to resume on some other thread, it would need to box all of those shared values. It has locals that are values that are unshared. It will be that the whole stack is tied to a particular thread.
CW: It would need to do that, in all of the alternate schemes without stack switching you would still have to do them like in a CPS transform
KM: Not necessarily. You could have in your CPS your state entirely an unshared object. When you try to unbox it could trap or throw an exception then dispatch at the local time of failure.
CW: You’re still needing to pay the cost of unboxing, isn’t that the same cost for thread local wrapping
KM: You would only do it once for the entire function and not local.
CW: Couldn’t you do the same thing in the stack switching case, wrap your whole function state in a single thread-bound data wrapper
KM: You could do that but you’re also not using the feature
CW: Equally bad in both scenarios
KM: I don’t know if I would say that. We are proposing a feature that is supposed to be helpful. But in this case it’s equally bad in both situations so why make the change at all.
CW: The benefit is that you still don’t need to do global control flow transformations with stack switching
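Illustration (not from the meeting): a minimal Python sketch of the "single thread-bound data wrapper" idea discussed in this exchange, assuming that unwrapping on any thread other than the creating one must fail. `ThreadBound`, `unwrap`, and `try_unwrap_on_other_thread` are hypothetical names, and the `RuntimeError` only stands in for a Wasm trap.

```python
import threading


class ThreadBound:
    """Hypothetical wrapper: the value may only be unwrapped on the creating thread."""

    def __init__(self, value):
        self._owner = threading.get_ident()  # remember the creating thread
        self._value = value

    def unwrap(self):
        # Stand-in for a trap: fail at the point of use on the wrong thread.
        if threading.get_ident() != self._owner:
            raise RuntimeError("unwrapped on the wrong thread")
        return self._value


def try_unwrap_on_other_thread(box):
    """Return the error message seen when another thread unwraps `box`, or None."""
    outcome = []

    def worker():
        try:
            box.unwrap()
            outcome.append(None)
        except RuntimeError as err:
            outcome.append(str(err))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return outcome[0]
```

Wrapping a whole function state in one such box pays the ownership check once per unwrap, which is the cost both speakers are comparing against boxing each unshared local separately.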
SL: The general experience with doing things in CPS or “callback hell” vs direct style has been that anything that you can do in CPS style can always be mapped back to the direct style. So I’m suspicious if you think there’s some scenario that can’t be done that way. There’s a lot of experience working with these and I don’t know of any case where it’s not been possible.
KM: This is somewhat of a novel situation in that regard
SL: How is it novel?
KM: There are types in the type system that are tied to a thread, maybe there is something that has done this before but I don’t know of one that exists
FM: There is something very similar that goes on in async/await. You can call a non-async function from an async function but not the other way around. That’s something that will be similar to what is happening here.
TL: Focusing on the novelty of the type system isn’t looking at the right place because it doesn’t matter after it validates, because if it doesn’t validate you have a runtime problem. The interesting thing is what happens at runtime, e.g. unexpected traps. We have 7+ years of experience on the Emscripten side with exactly the same problems. Emscripten with linear memory today doesn’t have the same type system problems, but the problems are the same: you’ve got a bunch of JS objects that can’t move between threads. We don’t have builtin TLS, but we emulate it, and if you dereference on the wrong thread, you’ve got problems. We’ve been dealing with it for years, and in practice it works great, despite issues integrating with the broader platform.
In practice, it works great and users are really happy with the threads despite these issues with unshared web, unshared platform stuff. You have to be careful about what thread you are executing on. In summary, I think the runtime issues that you are worried about are very reasonable to be worried about and we have an existence proof that they can be dealt with.
BT: I wasn’t aware that the CPS program doesn’t work with Wasm GC, but not sure about EH, you have to reimplement half of what you have to with Wasm, Stack switching makes all of Wasm work together, it's pretty clearly a core feature
The next feature that will come after stack-switching will have to be designed to work with this.
KM: Why do you have to disable features, is that a toolchain issue?
BT: You cannot use Wasm GC with this asyncify.
KM: That's just a toolchain limitation, it’s certainly solvable.
TL: To be clear, Kotlin works with suspend functions. It’s not using Asyncify, it’s doing its own thing, but it works. What do you do with suspend functions to capture? How do you deal with exceptions?
ZB: Our transformation works like this: at the top level we have a loop, and inside it a try-catch around a big switch over the states. We go through the loop switching between the states. If there is an exception, we change state: we have a special block for catches, we switch to the proper state, and go to the next iteration. When we need to suspend, we save the current block/state number, i.e. the state of the walker in the CG structure. The next time we need to resume the function, we pass everything we need inside the structure, and the state can be resumed.
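Illustration (not from the meeting): a rough Python sketch of the loop / try-catch / switch-over-states shape described above, hand-lowering a function that suspends twice for an input and returns the sum, or -1 if a `ValueError` is delivered on resume. This is not Kotlin's actual lowering; `Frame`, `SUSPENDED`, the state numbers, and `add_two_inputs` are all made up for the example.

```python
from dataclasses import dataclass, field

SUSPENDED = object()  # hypothetical marker returned when the function suspends


@dataclass
class Frame:
    """Hypothetical heap-allocated frame: current state plus saved locals."""
    state: int = 0
    saved: dict = field(default_factory=dict)


def add_two_inputs(frame, resumed=None, error=None):
    """Lowered form of: try { a = input(); b = input(); return a + b } catch { return -1 }"""
    while True:                       # top-level loop
        try:                          # try-catch inside the loop
            if error is not None:     # exception delivered on resume
                pending, error = error, None
                raise pending
            if frame.state == 0:      # entry: suspend waiting for the first input
                frame.state = 1
                return SUSPENDED
            if frame.state == 1:      # resumed with the first input
                frame.saved["a"] = resumed
                frame.state = 2
                return SUSPENDED
            if frame.state == 2:      # resumed with the second input
                return frame.saved["a"] + resumed
            if frame.state == 3:      # the catch block, modeled as its own state
                return -1
            raise AssertionError("unknown state")
        except ValueError:
            frame.state = 3           # switch to the catch state, go to next iteration
```

Every suspending function needs its own copy of this dispatch structure, which is the per-toolchain cost BT raises next.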
BT: That’s one toolchain, then the next language has to do the same thing. You can compile wizard to wasm today. You can emulate everything and put all the call stacks onto the heap. But then you’re not using wasm features because you’re emulating everything.
FM: This is what Go is effectively doing.
SD: That’s what we would do as well if we had to. We have a WIP of yield style generators exactly like what Zalim said about how Kotlin works.
MP: In some cases a feature is built as a performance feature; in other cases it’s an ergonomic feature that makes things easier to use. In WasmGC it’s still ergonomics ultimately. I’m trying to understand the scope of the feature itself, how much is one vs. the other. It would be good to know what the performance delta is, and for whom it is an ergonomic win.
FM: I can answer a little bit on the performance side. Asyncify was hard to beat, but we did for JSPI. Even for the simplest possible function, we were faster than asyncify. We knew upfront that it would be easy to win on the code size because something like asyncify you’re really lucky if you have a 25% increase in code size whereas with JSPI it was constant, e.g. 10 more bytes. We were also able to beat asyncify even on the simplest function so that it was faster than a simple counter or fibonacci, we were faster than asyncify in that scenario.
You have to remember that JSPI had an extra overhead of always interacting with the microtask queue and with stack switching you don’t have that, I think we’ll beat the performance hands down
KM: Doesn’t Asyncify also end up with integration into the microtask queue?
FM: It does, but you don’t need to
KM: What do you mean?
TL: For use-cases like kotlin suspend functions, where there is no fundamental need to interface with js, then you can get even more performance wins.
KM: If you don’t have to allocate an object wrapper and go to the microtask queue, that makes sense. But they both seem like they have to use the queue because you’re using JS promises, which makes you do that.
FM: The reason is that Asyncify’s cost is linear in the depth of the computation that you are suspending, and we were faster even at a depth of 1.
BD: Asyncify supports synchronous work without having to go through the suspension. In general JSPI always has to suspend.
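Illustration (not from the meeting): a toy Python cost model for why an unwind-and-rewind scheme grows linearly with call depth. It assumes every frame between the entry point and the suspension point must spill its own locals while unwinding. This is not Asyncify's real implementation, and `run_until_suspend` and `frames_saved` are hypothetical names.

```python
def run_until_suspend(depth, saved_frames):
    """Model a call chain of `depth` frames whose innermost frame suspends."""
    local_state = {"depth": depth}  # stands in for this frame's live locals
    if depth > 1:
        run_until_suspend(depth - 1, saved_frames)
    # Unwinding: each frame spills its own locals on the way out.
    saved_frames.append(local_state)


def frames_saved(depth):
    """Number of frames that had to be spilled for one suspension at `depth`."""
    saved = []
    run_until_suspend(depth, saved)
    return len(saved)
```

In this model one suspension at depth 50 spills 50 frames, while a native stack switch would leave those frames in place.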
ZB: Over the last ten years in Kotlin, async/await things have been implemented more and more. You can emulate these async/await things with a transformation, but more and more we’re seeing more async code. On the VM side you have to optimize it. With our current CPS you easily get irreducible loops, and VMs don’t deal well with that. In the case of stack-switching, for a VM it’s normal WebAssembly structure, and it’s much, much easier for a compiler to optimize this code. It’s why I want to have this.
SL: My take on Marcus’ question is that I’ve always thought of stack switching as being much more about ergonomics than raw performance. We have evidence that we can get better performance, but what e.g. Ben and Zalim have been saying is that there is a real cost to doing these transformations. If you have a new feature you have to incorporate more things into your transformation.
CW: It’s both.
DG: The data MP is asking for is already available. I wanted to circle back because we have other concerns. We talked about orthogonality with shared functions too. We have a lot of statements in the room. What other concerns do we have to walk through?
MP: This could be a skill issue, trying to get up to speed, but it’s definitely been difficult to piece together the anecdotes and all the information out there, and who has done what. It would be useful to have a structured comparison. There’s a wide variety of implementation strategies, and it’s unclear what the data points actually represent. Having a document that collects this would help.
In terms of how many concerns, they are related. When we’re talking about performance, hardware security mitigations, that does come down to performance. Mitigations can be worked around if willing to pay enough in trade-off for performance.
DG: I don’t think any amount of data you collect from implementations that aren’t your own is going to be representative of what you would do in your implementation. There’s always a burden of proof on the other side about how JSC would implement this, but there are always things you would only find out when trying to implement it.
MP: We are looking and have done experiments. There are a lot of ways to implement this. We are not implementing every strategy and might not be approaching things in the best way. I am able to collect anecdotes, but if there is a lot of data, it’s not readily available.
DG: I feel like calling real toolchain experience (their experience with CPS transforms etc, and how they would benefit from stack switching) anecdotal is trivializing it a bit since real toolchain authors are coming to us with this experience and telling us what they want.
MP: I don’t disagree. From a purely operational perspective, it’s difficult to keep track. It would be helpful to collect this in a single place where there are a lot of options and tables of possibilities.
CW: Time box, would more discussion (e.g. this afternoon) change your vote to phase 3?
KM: Probably not. I would like to have concerns listed.
SL: We should discuss more anyway
CW: If no additional time box would change your vote. If you wouldn’t change your vote anyway then I would like to go ahead and get it on record.
MP: Are all the requirements for phase 3 complete?
FM: Yes, test suite, settled design,
TL: The entry requirements that the test suite has been implemented.
RH: We’ve made changes to the design in phase 4 but it is a much higher bar.
DG: It is only based on implementation feedback, it is not based on general feedback.
FM: There are some features that I would have like to add to the design, that aren’t in it.
CW: We’re ready to run the poll, we’ll do this with full 5 phases.
SF: 21 in room + 1 in chat F: 19 in room + 1 in chat N: 1 in room A: 2 in room SA: 0
CW: Is your against vote to say your intention is to hold the advancement to phase 3.
KM: The technical constraints are there, but a phase 3 vote - by going to phase 3 it means it is technically solved, and there are implementation details to figure out. The primary reason that I'm against instead of neutral is that by going to phase 3 it indicates to the community that it’s been technically solved and we’re just doing implementation details, and that’s not really how we feel and it will steer people in the wrong direction. It's true that our concerns are largely based on implementation. But our concerns are partly based on experience with other standards bodies where once you get to a certain phase it’s largely just viewed as a fait accompli and it’s just expected that it’s largely a matter of timing before it goes all the way.
CW: With the strength of the opinion, and the votes in the room in favour, and the lesser strength of the response, I would declare that we have consensus. Given that, do you want to register any stronger opinion against moving to phase 3?
SL: Seems to me that there are legitimate questions here that need resolving. Resolving them may be part of the phase 3 process.
KM: The concerns of the messaging is part of our concerns in some respects. That by going to phase 3, that many people may not feel the same way.
SL: Your questions about performance and whether or not there are bad interactions, that these will be answered as part of phase 3.
CW: Again, I would declare consensus based on opinions so far. Any stronger expression against going to phase 3?
KM: no
CW: ok, we have consensus for phase 3
VOTE PASSED
DG: As the notetaker, that is the process that we will follow that your enumerated concerns are captured in the notes.
CW: During the next break, will you find the notetakers and make sure these are well captured.
FM: For this afternoon, I would like it to be more focused. I would like us to identify specific concerns as questions to solve.
CW: We’ll be able to do that in a focused way with the against vote that will document concerns.
KM: I would like to think that our feedback so far has not been so much unfocused as “we don’t like this proposal”.
CW: No, just that this next section will be solution orientated.
KM: Concerns (would not be valid stage 3 blockers):
- The concerns with shared everything discussed above
- Deeper analysis of the hardware mitigations and their costs
- ARM64 FEAT_TPS
- Intel CET
- MTE (ARM) or Intel’s upcoming equivalent – when tagging stacks
- ARM64 PAC ?
- How to implement this efficiently for wasm hosts/consumers in environments that don’t control signal handlers (a signal arrives when on a side stack)
- Would be good to measure and have a document explaining the performance characteristics for a set of different implementations for CPS and stack-switching (soft)
JetStream 3 (Keith Miller, Ryan Hunt, Daniel Lehmann, 30 minutes)
KM,DL,RH presenting slides
Scoring: JS2 items had 2 scores, startup and runtime. The problem with adding eager compilation is that instantiation is so fast that you can get 0 ms and get infinite scores. Even with just 1 ms you get huge startup scores that distort the scoring so that it does not reflect what we want to incentivize.
AR: How can average be higher than worst?
KM: Bigger is better, it’s 5000/time. The first is the slowest run.
DL: Worst doesn’t include the first iteration (worst out of the remaining)
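Illustration (not from the meeting): a small Python sketch of the 5000/time scoring as described in this exchange. The per-iteration formula and the rule that "worst" skips the first iteration come from the notes. Taking "worst" as the single lowest remaining score, averaging over all iterations including the first, and the name `subscores` are assumptions for the example.

```python
import math


def subscores(times_ms):
    """Return first / worst / average scores for one workload (bigger is better)."""
    # A 0 ms measurement yields an infinite score, the distortion noted for startup.
    per_iteration = [math.inf if t == 0 else 5000 / t for t in times_ms]
    first = per_iteration[0]
    worst = min(per_iteration[1:])                      # excludes the first iteration
    average = sum(per_iteration) / len(per_iteration)   # assumption: includes the first
    return {"first": first, "worst": worst, "average": average}
```

With times of 100, 10, 10, 10 ms this gives first = 50, worst = 500 and average = 387.5, so under these assumptions a slow first iteration can leave "average" on the other side of "worst" from what one might expect.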
KM: Very little OSS code with permissive licensing that uses threads, ran into licensing issues, but generally hard to deploy
RH: Generally don’t incorporate data initialization into benchmark to prevent it becoming an IO benchmark
AR: Is this a limitation of the JS API?
KM: This came out of the transformers workloads. They have models of 100MBs. Loaded into linear memory, that model ends up as dirty memory. It’s in the network cache, and it travels through many different interfaces, which all need to do copy-on-write to the next object.
DG: Is this observable across browsers?
KM: The memory use is in the browser
DG: There are web APIs people are talking about that may elevate this to a higher level, maybe model caching or something. Not sure if we’d want to build it into the benchmark. I can imagine memory mapping primitives to solve some of this but not all of it. Was curious if you have talked to anyone about this from the web API perspective.
KM: We have not had those discussions yet, these are discoveries from 2 weeks ago. You could maybe take a response object as part of your element section which could maybe remove some layers.
DG: This is an interesting thing we’ve talked to some of the streaming API folks with, what would it take to pass it just as a memory object instead of a buffer or something
KM: There could be advantages of a response coming back all at once if that’s something the OS provides.
DG: Mediapipe has an interesting article about this about their solution at the application layer where they couldn’t even load their models without wasm 64 so it might be interesting to pass along.
RH: This is not a core wasm issue, only a web platform issue. But if you’re not a specialist the default thing is not working.
Continuing with slides, experience report. GC benchmarks are allocation-heavy.
KM: A lot of them have allocation paths that could have been elided, e.g. by an allocation elimination pass before passing to the browsers. This is something to be aware of, because with the tiering architecture it’s hard to perform this type of allocation elimination.
DF: Please stop allocating :D
BT: How do things with big fetches work in the shells?
RH: Preload everything, so there’s no I/O
KM: Every file has read-a-file
BT: Is there hope to run the benchmarks in non-browser engines?
DL: There’s not much missing but you’d need a JS environment.
RH: Removing the JS environment will be tough
BT: What about record replay instead of a JS environment?
HP: Are all the GC workloads allocation heavy?
RH: One of them in particular was the most allocation heavy thing we’ve ever seen
KM: There was maybe one allocation per 1000 instructions.
SD: For WasmGC you can observe allocations, but you don’t see the allocations on linear memory
KM: I haven’t looked at traces enough to answer that. I don’t know if we’d see the allocations. We do have the source maps.
DL: For Wasm linear workloads, the hot ones are not calling malloc and friends
KM: A lot of what we’re seeing are local objects that do not escape, and the producer is doing the allocations. C++ will elide allocations at a level higher than Wasm.
HP: JS has the same problem, there’s no compiler pre-step that will do the same analysis, so it’s a little surprising that they are so much heavier
RH: Another factor is these are running UI frameworks, so it’s an application difference (mostly doing DOM workloads) so there’s a lot of allocations building a whole DOM tree.
DL: For dart in particular, we had two variants, one of the variants were using an allocation heavy UI, for the time being we’ve selected workloads that aren’t allocation heavy, there’s some coevolution and a feedback loop that is helpful to establish
HP: At the same time if it’s real world code on the web that’s fine.
KM: One issue with jetstream2 is that we left it and there wasn’t much back and forth with developers. It’s a lot more disparate the tools generating JS. With Wasm there’s more opportunity to have these kinds of things.
AR: There’s an unfair perception that allocation on the heap is worse than allocation on the stack. It depends on the GC. If your language is designed around allocation on the heap then it’s already biased towards the implementation. It’s not inherently bad.
KM: I don’t know if there are GCs that are also concurrent.
AR: OCaml is an example, the nurseries are per-thread I think
KM: Fair enough, this is more a commentary within the web platform.
ZB: About JS vs wasm in terms of allocations: we have both JS and wasm backends and our observation is that in our benchmarks in wasm we observe more GCs in wasm runs. We don’t know the nature of them yet, but maybe it’s something that could be fixed on our side or maybe the VMs, just an observation so far.
KM: When you compile Kotlin to JS you see less GCs?
ZB: Yes, approximately 5% of GC for JS and 15-20% for Wasm. The same benchmarks in Kotlin there’s some difference in runtime for JS and Wasm, but it’s unexpected.
KM: It would be interesting to see more about that if you have more you can share later
SL: wasm_of_ocaml folks were also talking about this, same code compiled to Wasm was doing
KM: It might also depend on which engines you’re running on.
RV: We had a case running in chrome doing deserialization and the JS allocates more per deserialization, but Wasm does more per second and makes it net slower because it spends so much more time executing the GC.
DL: This is an example of the experimental workloads so we can work on optimizing together, that’s why we’re here - we want workloads!
BT: Plus one for doing offline optimizations for eliding allocations.
KM: Most optimizations are nice to be able to turn on and off. Producer and consumer can turn optimizations on/off
TT: My question is, I’m aware that we do not want to compare apples to peaches, but I’m thinking of sightglass there are different goals and objectives, maybe it would be nice to have alignment, there was activity back in the day in the benchmarking subgroup
DL: The group has been dormant and we should revive it.
TT: On stack switching, maybe it would be nice to have some benchmark that cover things that are more fundamental and would target different implementations
(Continuing with slides, summary)
KM: We’re trying to move away from more microbenchmarky things toward real-world use cases. We might still be interested in small tests especially if we cover new features, but hopefully in a real-world context if we can.
Multi-Byte Array Access (Brendan Dahl)
BD presenting slides
AR: One comment on earlier slide on proposal B - what does this mean for GC arrays? This is a non-issue since arrays are indexed by i32. If we want to do i64 that would be an orthogonal change.
BD: That’s true, I should put it as something to be aware of in the future
KM: Not sure this is really an issue, but are there any hosts that implement one of GC or SIMD but not the other? It seems like these are living in the GC opcode space and it would be weird if you supported GC but couldn't use them because you don't support SIMD.
BD: Yeah I guess all the browsers that have GC also have SIMD.
CF: Question about the reinterpret cast idea. Does that support unaligned access? It seems like maybe it wouldn’t?
CW: That’s correct. I’m not going to put my weight behind the reinterpret cast idea. It's not the preferred option.
TL: What would happen to other references that are not cast?
CW: Yes, I think there are lots of problems with this potentially.
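Illustration (not from the meeting): a minimal Python sketch of what a multi-byte load from a byte array at an arbitrary, possibly unaligned index could look like. It assumes little-endian byte order, as in Wasm linear memory, and that an out-of-bounds access traps, modeled here as `IndexError`. The name `load_i32_le` is hypothetical and is not an instruction name from either proposal.

```python
import struct


def load_i32_le(array, index):
    """Read a signed 32-bit little-endian value starting at byte `index`."""
    # Bounds check covers all four bytes; failing stands in for a trap.
    if index < 0 or index + 4 > len(array):
        raise IndexError("out-of-bounds multi-byte access")
    # unpack_from has no alignment requirement with the "<" format prefix.
    return struct.unpack_from("<i", array, index)[0]
```

For example, reading at index 1 of the bytes `00 78 56 34 12` yields `0x12345678`, even though index 1 is not 4-byte aligned.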
CW: I assume if we go with proposal (A) we would still want to add atomic operations?
BD: I think so, but haven’t heard from implementers or users.
AR: Atomic accessors would become part of shared everything, right? Without shared array references there is no need for them.
CW: I guess I was instinctively thinking about how we allow atomics on non-shared memories, but there’s no reason to add them eagerly.
AR: We might have an entire zoo of things we need to be able to access with memory instructions.
TL: I like the idea of using the memarg bit, it would just be a shorter spec. We should keep track of how many bits we’re using from that alignment. Shared-everything also uses a bit.
CW: I support the version where we add more instructions. It won’t be a bigger implementation burden and we should just be honest that we have more instructions.
BD: Would you still want to use the array instructions if we have memref in the future? I’ve gone back and forth between proposal A and B.
CW: Even if we went proposal B now, couldn't you make the same argument in the future and say we might as well have not had any proposal because memrefs are coming later?
AR: For phase 1 there’s no need to decide.
CW: It could be useful to have a straw poll for A vs. B because it’ll inform development.
BT: I'm pretty sure the memarg encoding is an LEB so we have as many bits as we want. Also memory64 was a pain for the interpreter. I have a mildly strong preference for proposal B. Adding lots of instructions is a big amount of work for an engine; it seems like it would be easier to reuse the existing instructions.
CW: My argument is that proposal B has the same surface area for the engine.
BT: It's a giant pain, every instruction becomes the same code gen.
RH: It would definitely be implemented quite differently between memories and GC arrays.
CW: These will be different instructions at the IR level immediately.
RH: I have a slight preference for A.
AR: I have a preference for B because I think we might get memrefs in the future.
CW: If we wanted memref to happen, they will probably have different instructions anyway.
BT: It's not clear to me whether this only works on array of i8? Maybe people want array of i16 to do this too, then it would get more complicated?
BD: The slice proposal would give you one view into everything.
TL: You’d need a type annotation anyway so you could look at that to tell which you’d use.
BT: We could also use prefixes ;)
CW: These mnemonics could be legacy for i8 array, future instructions are iN.array or array.iN
DG: I actually like proposal B from an engine readability perspective, you know what they're supposed to be doing even if the implementation is different. I wish relaxed SIMD were just the same instruction with different overloading mechanisms. It leaves the door open for sharing on engines that could do it.
BD: Switch looks a little nicer, but ultimately not much different
RH: Separate question: how important are the SIMD accessors?
BD: Thomas, feedback?
TL: Currently there's no way to do SIMD with wasm GC. We have array of 128, but no one uses it. I would expect it to be important, but don’t go off my word.
RH: Do we want to support waiting on an array of i8?
TL: Like atomic wait? We have a pretty nice story with waiting for Wasm GC and it doesn’t involve arrays, so I wouldn't guess that we would need it.
RH: Would prefer not to, would like the array not to move.
CW: If we go with B, would we also extend it for memref?
BD: Memref would be another bit?
TL: Ben’s right, it’s a u32, so we have infinite bits.
RH: It’s called sliceref!
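The point about spare memarg bits can be sketched as follows. This Python example decodes an unsigned LEB128 value and splits it into flag bits. Bit 6 signalling an explicit memory index follows the existing multi-memory encoding; `ARRAY_FLAG` and its bit position are purely hypothetical stand-ins for whatever bit proposal B would pick.

```python
def decode_uleb128(data: bytes, pos: int = 0):
    """Decode an unsigned LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry payload
        if byte & 0x80 == 0:              # high bit clear: last byte
            return result, pos
        shift += 7


ALIGN_MASK = 0x3F     # low six bits: log2 of the alignment
MEMIDX_FLAG = 1 << 6  # multi-memory: an explicit memory index follows
ARRAY_FLAG = 1 << 7   # HYPOTHETICAL: "this access targets an i8 array"


def split_memarg_flags(field: int) -> dict:
    """Split the first memarg field into alignment and flag bits."""
    return {
        "align": field & ALIGN_MASK,
        "has_memidx": bool(field & MEMIDX_FLAG),
        "is_array": bool(field & ARRAY_FLAG),
    }
```

Any flag at bit 7 or above pushes the field into a second LEB byte, so each access using it costs one extra byte in the binary.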
BD: Did we want to do a straw poll for A/B?
BD: Who’s in favor of A? 3 in room, 1 chat
For B? 13/14 in room
Neutral? 19ish
Vote for phase 1: unanimous consent poll. No objections registered.
Function-at-a-Time JIT (Ben Titzer, 30 minutes)
BT playing AI generated song with lyrics in slides
The current Wasm architecture (Harvard architecture) separates code and data, which provides security properties like control flow integrity and supports static analysis via a closed-world assumption. This design, while beneficial for optimization and security, prevents code changes at runtime.
The motivation for func.new stems from "guest VMs" (VMs in a VM), such as language runtimes like Python, JavaScript, Lua, C, or CPU emulators running their own bytecode interpreters on Wasm. The primary issue is that interpreters are significantly slower (10 to 100 times slower) than JIT compilers. Additionally, Wasm itself is slightly slower than native code for running interpreter loops due to reasons like the lack of unstructured irreducible control flow.
Existing solutions on the web use JavaScript APIs to generate a whole new Wasm module at a time. Embedded Wasm environments often use engine-specific APIs for creating new modules (e.g., WAMR, Wasm3, Wasmtime, Wizard). Wizard, for instance, implemented a host function to take bytes and return a function reference, which is the prototype for the func.new proposal. The core problem is the lack of a portable solution for generating new code.
If a Wasm module (e.g., a Python VM) needs to generate new Wasm code, it requires a host capability because it cannot be done purely within Wasm. Using a host API introduces a potential risk of code corruption or interception, as the security guarantees of the module rely on trusting the host API. Furthermore, the modularity of Wasm means that new code generated externally cannot access the module's internal state unless the module explicitly exports its capabilities, creating an indirection. This exposure of internal state to the external world, just for the sake of generating new code, is undesirable as it compromises control over the module's state.
The proposal is a single new bytecode, func.new, which takes three immediates, one of which is a memory.
The new function bytecode will be stored in a specified location. A new function type ‘ft’ and an environment are associated with this bytecode. The environment controls what the new code can access, which will be limited to a subset of the enclosing module. The process for the bytecode involves copying the bytes from memory to an internal buffer, running a code validator, which uses the module context, the environment, and the expected function signature to verify type checks. If validation is successful, a new function is produced that shares the state of the instance that called it, and a reference to that function is returned, allowing it to be called with minimal overhead.
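The copy, validate, and return steps described above can be modelled with a toy Python sketch. Everything here is an illustrative assumption: the two-byte instruction format, the three opcodes, the `FuncType` and `ValidationError` names, and the restriction to exactly one result are invented for the example and are not the proposal's encoding.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Toy two-byte instructions (opcode, operand); NOT real Wasm opcodes.
LOCAL_GET, ADD, CALL = 0x00, 0x01, 0x02


class ValidationError(Exception):
    pass


@dataclass(frozen=True)
class FuncType:
    params: int
    results: int


def func_new(memory: bytearray, offset: int, length: int,
             func_type: FuncType, env: Sequence[Callable[[int], int]]):
    # Step 1: bounds-check, then copy the bytes into a private buffer.
    if offset < 0 or length < 0 or offset + length > len(memory):
        raise IndexError("out-of-bounds code range")  # stands in for a trap
    code = bytes(memory[offset:offset + length])

    # Step 2: validate against the environment and the expected signature.
    if func_type.results != 1:
        raise ValidationError("toy model supports exactly one result")
    if len(code) % 2 != 0:
        raise ValidationError("truncated instruction")
    depth = 0  # abstract operand-stack height
    for pc in range(0, len(code), 2):
        op, operand = code[pc], code[pc + 1]
        if op == LOCAL_GET:
            if operand >= func_type.params:
                raise ValidationError("unknown local")
            depth += 1
        elif op == ADD:
            if depth < 2:
                raise ValidationError("stack underflow")
            depth -= 1
        elif op == CALL:
            if operand >= len(env):
                raise ValidationError("call target not in environment")
            if depth < 1:
                raise ValidationError("stack underflow")
            # pops one argument, pushes one result: depth unchanged
        else:
            raise ValidationError("unknown opcode")
    if depth != func_type.results:
        raise ValidationError("result count mismatch")

    # Step 3: return a callable that runs the validated private copy.
    def new_function(*args: int) -> int:
        if len(args) != func_type.params:
            raise TypeError("wrong number of arguments")
        stack = []
        for pc in range(0, len(code), 2):
            op, operand = code[pc], code[pc + 1]
            if op == LOCAL_GET:
                stack.append(args[operand])
            elif op == ADD:
                right = stack.pop()
                left = stack.pop()
                stack.append(left + right)
            else:  # CALL, the only remaining validated opcode
                stack.append(env[operand](stack.pop()))
        return stack.pop()

    return new_function
```

Copying before validating matters in this model: once `func_new` returns, later writes to the source memory cannot change the code that was checked.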
Markus Scherer: You can only create functions of existing types?
BT: Yes, and that’s important.
Chris Woods: Why can’t I func.delete?
BT: We didn’t discuss that, we can in a minute.
SL: Can you GC these somehow?
BT: The idea is that the engine should manage them and collect unused functions.
SD: I want a GC array version (instead of a memory)
TT: If the new function references a type that isn’t in scope of the original module?
BT: So far creating a new type is out of scope.
TT: So we’re talking about bringing in a new function that has no other dependencies left, a new implementation of a function that already has a defined type.
BT: Right.
Continuing with slides - Proposal: fine-grained JIT interface
The environment acts as a scope for the new function, declaring everything in the outer module that the new code will use. This is like "internal exports" or a "mini module". These exports renumber the declarations starting from zero for the new code, meaning that all references within the new code are in terms of the environment it was compiled under. This is important because it allows tools to reorganize the outer module without the internal code noticing.
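The renumbering idea can be sketched in Python as a list that maps each inner index to an outer index. This is a simplified model under stated assumptions: the module is just a list of function names, and `resolve` and `reorganize` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def resolve(module: Sequence[str], env: Sequence[int], inner_index: int) -> str:
    """Map an index used by generated code to the outer declaration it names."""
    return module[env[inner_index]]


def reorganize(module: Sequence[str], env: Sequence[int],
               permutation: Sequence[int]) -> Tuple[List[str], List[int]]:
    """Reorder the outer module; permutation[old_index] == new_index.

    Only the environment is rewritten. Inner indices stay the same.
    """
    new_module = [""] * len(module)
    for old_index, new_index in enumerate(permutation):
        new_module[new_index] = module[old_index]
    new_env = [permutation[old_index] for old_index in env]
    return new_module, new_env
```

Generated code that refers to inner index 0 keeps working after a tool reorders the outer module, because only the environment's contents change.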
It would be nice if we knew ahead of time which memories could be used to create new code, although I think it’s optional. We could put a flag on the declaration, because sharing memory with the outside world is potentially dangerous due to the risk of corruption if the memory is modified. When importing memory for new code creation, this flag would be required to prevent accidental usage of random, potentially insecure memory.
??: question about the security model there. This flag, does it have any extra semantics, is it supposed to protect the memory in some way, or just to prevent you from accidentally getting confused and pointing to the wrong memory or something?
BT: For the func.new bytecode, the memory must have this flag or it will fail validation
KM: Could you achieve basically the same effect with just an ABI level thing of “I only use memory index 42 as my code index and the only things that read/write from it, and just validate before I send my code out”?
BT: You could do this if you had a memory that was private to your module, and you only used that, then it would work. I’m not wedded to this idea but I thought it might be nice to force all the tools and engines to share it.
CW: I also feel that the flag doesn’t get you much because the problem is really that the memory gets leaked, not that you call func.new on the wrong memory.
KM: It's not actually true that the shared bit helps with that, I’m agreeing with you
BT: You can easily delete the flag if you want.
BT continues slides: the security guarantees are actually better than what you get with native; you can’t generate new capabilities.
The security properties achieved with this design are stronger than those of native code because func.new acts as a controlled "hole" in the module, and the environment strictly defines what the new code can access from the module. The new code cannot generate or import new capabilities beyond those already provided by the module, thus preserving Wasm’s modularity properties and guest runtime module encapsulation. The environment enables static reasoning about the new code's access, allowing tools to perform dead code elimination, inlining, and module reorganization if a capability is not mentioned in the environment.
Limiting new code generation to one function at a time simplifies the engine's requirements, as it only needs to support code validation for a single function rather than parsing an entire module, which includes handling imports, type declarations, and canonicalization. The renumbering scheme based on the environment facilitates this dynamic code handling. This design also supports AOT compilation of the outer module and runtime, reserving the dynamic tier only for the newly supplied code, which will not introduce new GC types, as they would already be known in the AOT compilation.
You could have a follow-up feature to restrict the set of bytecodes allowed in the new code. This would further specialize and simplify the validator, compiler, or interpreter runtime for dynamically generated code, for example, by disallowing Wasm GC within the new code.
CW: It’s not clear if you would only need that for func.new, maybe you also would want for the initial instantiation
BT: Yeah
TT: IIUC, you could do a hotspot for it, and replace already running functions and you’ll be aware of when you do this in control flow, at some specific point where you know you are not destroying or corrupting state
BT: Yeah that would be an application level thing, the engine wouldn’t be involved in that.
TT: There’s no thoughts about this currently, this may be a step towards it, you could call with a global with a callref or a specific entry in the table
BT: This could be a step on the way to replacing functions. You could have a table with funcrefs and replace the references
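The table-based replacement mentioned here can be sketched in Python as an application-level pattern. The `FuncTable` class and its method names are hypothetical; `RuntimeError` stands in for the trap on a null entry.

```python
from typing import Callable, List, Optional


class FuncTable:
    """A toy funcref table: callers go through an index, never a direct reference."""

    def __init__(self, size: int) -> None:
        self._slots: List[Optional[Callable[..., int]]] = [None] * size

    def set(self, index: int, func: Callable[..., int]) -> None:
        self._slots[index] = func

    def call_indirect(self, index: int, *args: int) -> int:
        func = self._slots[index]
        if func is None:
            raise RuntimeError("null funcref")  # stands in for a trap
        return func(*args)


def interpreted_square(x: int) -> int:
    # Deliberately slow path standing in for a guest interpreter.
    total = 0
    for _ in range(abs(x)):
        total += abs(x)
    return total


def compiled_square(x: int) -> int:
    # Stands in for a function produced later by code generation.
    return x * x
```

A guest runtime would start with `interpreted_square` in the slot and call `set` with `compiled_square` at a point it chooses; callers of `call_indirect` see the same results before and after.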
BT continuing with slide - Example .wat Usage
TL: Do types go in the environment or is it implicitly using the same types?
BT: I think you should put types in the environment. The binary encoding for types isn’t in exports. We should fix that.
TL: If I func.new one function and that wants to call a second func.new function, how would that work?
BT: This is something we should figure out. If you want to make a direct call from one generated function to the next, there’s no way to make a static index, but you could have a global which is the passage between them which would be faster than a table.
CW: You would effectively need module.new to do this
(BT continues with slides)
I filed an issue about this, there were a number of discussion points, e.g. “is this nested modules”, or “is this the component model”, I think this is a very simple form of nested modules that’s for one specific use case so we could get it into core wasm. There were also a lot of questions about asynchronous compilation, so we thought about that. Maybe there’s an async version of the opcode and gives you some kind of handle and maybe writes into a table. Lots of different ideas about how that could work. The engine has to go and validate your code, maybe it blocks if you try to call the function and wait until it’s compiled? VMs are already highly tuned for startup. So maybe they want to know, is this compiled yet? On failure, maybe we can trap, or return null or something else. What about out of memory? Lots of things to think about. How do you check for feature availability, it’s pretty fundamental to a runtime. Or what about custom sections like debug names or compilation hints?
KM: On design, have we considered allowing backwards compatibility for the web platform? Maybe you write an inefficient version that works out of the box and then use the new instruction when it becomes available.
BT: Not sure if anything needs to be added to this proposal, or they could just be builtins
KM: You might want to use the new instructions so the shape of the instructions matter.
KM: Meta thing to consider: if there’s no JIT on your system but you have an AOT-compiled module, the interpreter loop might be worse code than what you had before. It might be worth telling your module in some way that it’s not worth it to do this
BT: This could run slower if you generate new code which is interpreted anyway so we might want to pass that information to the module.
AR: Wondering about type.new, I’m wondering how that would work or how it would be useful.
BT: This is throwing out there to see if people will know how it could work, you probably want to have the same binary format encoding, but it has the same problems that you need the information in the environment
CW: At that point I’d be in favor of module.new.
AR: If you want to generate code for them they have to be statically known anyway.
BT: If you’re a guest runtime that uses wasmgc, and you have a guest object that supports wasm gc, that’s where the type argument might help
BT: If you’re an engine that uses Wasm GC and you take a program that uses GC you might want to create new GC types to represent the types in the user application.
YI: I would have a use for that, e.g. if your module is a JVM and you want to get a jar from the internet, the jar would use its own types internally, the outer module might not know. In our JVM in JS we have shapes. But that might still be better for module.new
CF: It seems like module.new is one type of, a generalization is now new functions, but references, and the back compat version would just be imports, then eventually, you can construct all of these with just first class references
AR: I would actually assume if we had module.new you wouldn’t even need the idea of environments, because it would just be imports/exports
CW: That’s what I wanted to say. I have bikeshedding arguments about env. We never discussed in the component model if a nested module can see the outer module without imports and this seems to make a strong declaration that it can. Can we come up with some more general syntax that would generalize and be forward compatible to nested modules?
BT: We can think about that, the prototype that I have you don’t get an environment, you get everything which has a performance advantage
CW: I wouldn’t be against that, but if we do reindexing, we should think of how to make the text format (and spec) forward compatible.
SD: Throwing around an idea for brainstorming, here we generate code and a new funcref, presumably we also want to call it, we had an idea to mark a global that is a funcref to say this is mutable but stable, but I won’t mutate it often, just once perhaps, and then you might as well optimize as it was constant, if it wasn’t then you invalidate, it fits in the story I believe
AR: Sounds like a compilation hint.
SD: Yes it’s a compilation hint and wouldn’t impact the visible semantics, but you would rely on it heavily.
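The "mutable but stable" idea can be modelled roughly as follows in Python. This is only a sketch of speculate-then-invalidate: the class names are hypothetical, and a real engine would more likely discard specialized code on a write than compare a version number on each call.

```python
from typing import Callable, Optional


class StableFuncGlobal:
    """A mutable funcref global that is expected to change rarely."""

    def __init__(self, func: Callable[[int], int]) -> None:
        self.func = func
        self.version = 0

    def set(self, func: Callable[[int], int]) -> None:
        self.func = func
        self.version += 1  # invalidates anything specialized on the old value


class CallSite:
    """A call site that treats the global as constant until it changes."""

    def __init__(self, target: StableFuncGlobal) -> None:
        self.target = target
        self.cached: Optional[Callable[[int], int]] = None
        self.cached_version = -1
        self.respecializations = 0

    def call(self, arg: int) -> int:
        if self.cached_version != self.target.version:
            # Assumption broken (or first call): respecialize on the new value.
            self.cached = self.target.func
            self.cached_version = self.target.version
            self.respecializations += 1
        assert self.cached is not None
        return self.cached(arg)
```

Visible semantics are unchanged whether or not the hint is honoured; only the number of respecializations differs.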
KM: I could imagine that generalized, init one function with a special handler, that traps when uninitialized, you could imagine that for dlopen or similar.
Phase 1 poll: No objections
Component Model and the Web (Ryan Hunt, 60 minutes)
RH presenting slides.
DS: I’ve also noticed the table of language implementations. My take was that WASI is what attracts them, not the component model. What’s your take on that?
RH: I’m not opposed to people polyfilling WASI, but if we had the full three stages (with simple webidl interfaces). Then we’d have WASI and standard web interfaces.
DS: I’m not against that, just wondering what’s the point of components without the full web interface?
RH: You’d get some advantages from WIT IDL and its compatibility with webIDL
SD: As a source language, even if you’re using wasi, the compiler targets the component model and the libraries use WASI, so you could have other libraries for the web.
DG: This is a very tactical question: even if we did all of this to make the CM work, have the right bindings, the JS part of the pipeline (JS API) doesn’t go away. I would see us supporting both indefinitely, that’s a bit scary. People will want to keep adding to it. The JS/web API interaction is well defined and has a lot of ecosystem benefits.
RH: I think it’s a feature to keep the core JS api around because some languages will want to do their own thing and the component model is opinionated. The interface you would have with a wasm component, it would be a smoothed version with a better dev experience but doesn’t give you as much control. So some toolchains would want to target the web explicitly.
ZB: Previous question of just having component model support, would be the ability to just use our components used in our languages, we wanted to use it for WASI, users would like us to have it for JS, for browsers as well, we separate it as we want to have the component model for both, but we want to build it for the Wasm browser target as well
CW: I buy that having a more direct line to webIDL bindings would help avoid JS Impedance to DOM access, but less clear on what role the CM has on getting us there. With JS it seemed mostly a lack of compile time optimizations with JS and it’s not clear how the CM helps with that.
RH: There are a lot of ways that you could improve the DOM performance, at its core what you want is that you have core wasm, and you have C++ on the browser side and you want the least amount of hops. Components have types, and they have clean boundaries with an ABI, you can match that ABI on the browser side
There are other approaches you could take that are more… one problem with function bind, you could import a function where you call the receiver, that would help. Maybe we could import a JS function so that without needing a string and the textencoder API, and the full generality of it really pushes you toward something like the full CM type system
CW: I appreciate seeing a full conversion diagram
DG: In all experiments we’ve run directly with web APIs, the DOM is expensive, not the communication. There doesn’t seem to be a need for general web API access, but rather specific ones and making them fully general may hurt performance.
NF: On the value of the CM without binding web IDL directly to CM APIs (also valuable). What you get from the first step is that every language toolchain doesn't need to reinvent emscripten or wasm bindgen or custom ABIs to talk to the web, pass the strings, etc. There’s a standard way to do it and they don’t need to do anything more than the half of their bindings generator for the source languages (generate JS). Pretty valuable.
YI: Vaguely connected to this, I have a hard time imagining an interface that would work for GC & linear memory interface, would that give you a reference type or a linear memory interface? I assume the bindings layer will still need to be implemented
NF: Right now the canonical ABI only works with linear memory but there is a prototype for GC, still very WIP, just an issue on the CM repo but the idea is that there will be another variation of the canonical ABI you can opt into for GC types.
TL: One canonical ABI for GC won’t be enough, every single GC language does its own thing, vs the linear memory languages can rely on the C-style API for an escape hatch
NF: Agree in the limit. There are some affordances in the prototype now for rec groups, etc. but if your string representation is like a rope, that won’t bind very well to something that passes a list of i8.
TL: Not even the list of i8, that’s the buffer that underlying points to a struct outside, and that has a vtable.. etc.
SD: If you design your language for the CM and the GC ABI you will have your usual class hierarchy and the CM resource kinds, that have a different model that everyone agrees on, don’t think its very different from the linear memory situation.
KM: What does the envisioned output look like? If you have a dom node that doesn’t map meaningfully to a language construct, do you get a bunch of methods, or what happens?
LW: there are annotations to say something is a method.
KM: It gives you a struct and the struct has a method?
LW: At the wit level there’s a handle to the resource type, and you have methods for resource types .. ??
KM: Let’s follow up later.
ChrisW: You mentioned this is all about DOM access, but you said there would be non-web use cases?
RH: If you don’t want to write a web app, you won’t need to do this.
ChrisW: I can see for the non-web use cases, - we’re in this use case where we don’t need the component model because we have the libc wasi hatch.
BH: There are a number of server side use cases that would also not use the dom but they are web platform APIs like TC39 fetch, which would be very interesting to a lot of different people.
NF: Response to Keith about what does that look like, you specify these apis in wit, a dom node would be a resource, that would have a raw Wasm representation, and what that looks like in your toolchain would, at the core Wasm level, resources would get put into a table because linear memory languages can’t use references, so it’ll be in a table…
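The "resources get put into a table" point can be illustrated with a small Python sketch of a handle table, where a linear-memory language holds only integer handles. This is an assumption-laden model, not the canonical ABI's actual algorithm; `HandleTable` and its methods are hypothetical names and `KeyError` stands in for a trap.

```python
from typing import Any, List

_EMPTY = object()  # sentinel marking a freed slot


class HandleTable:
    """Maps small integer handles to host resources such as DOM nodes."""

    def __init__(self) -> None:
        self._slots: List[Any] = []
        self._free: List[int] = []

    def insert(self, resource: Any) -> int:
        """Store a resource and return the integer handle the guest keeps."""
        if self._free:
            handle = self._free.pop()
            self._slots[handle] = resource
        else:
            handle = len(self._slots)
            self._slots.append(resource)
        return handle

    def get(self, handle: int) -> Any:
        if not 0 <= handle < len(self._slots) or self._slots[handle] is _EMPTY:
            raise KeyError("invalid handle")  # stands in for a trap
        return self._slots[handle]

    def remove(self, handle: int) -> Any:
        """Drop a resource and make its slot available for reuse."""
        resource = self.get(handle)
        self._slots[handle] = _EMPTY
        self._free.append(handle)
        return resource
```

A GC-based toolchain could skip this indirection and hold a reference directly, which is the distinction raised in the question that follows.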
KM: For the GC world, is the idea that you get back a struct with methods or more importing gives you a bunch of functions that take an externref argument to act on them?
NF: My webIDL isn’t fresh but for getters like next child we wouldn’t want to force them to be eagerly evaluated, but I think you’d expose those as getters, so it depends on how you'd translate webIDL to wit. Dictionaries, which are like structs, could maybe be structs, but you probably wouldn’t want that for getters.
KM: I could see the inverse of that: when you’re calling webIDL, a lot of webIDL interfaces take an options bag, like named parameters. I could see some performance implications, obviously an open design space
NF: I think we have solutions for these options, they are details of mapping between webIDL and wit.
Stephen: Yesterday Thomas talked about handling rich types from JS. Would we have both models or would we need to consolidate them?
RH: This goes back to the question about the core JS API, is it deprecated, do we want it at all? I think CM wouldn’t solve all use cases and there will always be people who want deeper integration. Custom descriptors would help with that.
RH: Does anyone think there isn’t the problem that I initially described?
TL: I’ve recently seen several benchmarks doing DOM stuff in WasmGC languages. It's faster than doing it in JS even though the bindings are slower. So it’s not always that Wasm is super slow when doing this.
ZB: It’s definitely a problem in terms of perception, and we do the best to isolate our users, but eventually they need to do some low level access to JS or web APIs. You still need to know JS things a bit. Some web APIs feel like they’re for JS and it affects how users perceive it.
KM: Some of the things all tie together, the complexity of the solution also needs to be evaluated, one could imagine the curve dips slightly, any meaningful application would end up in the same boat
RH: There’s always a cost/benefit to any solution, so I’m not saying browsers should go all in. Just saying we should think about it especially while we have time to give input into the component model.
SD: Generally supporting the idea, disingenuous but maybe it’s a bit about comparing JS & Wasm, it’s about comparing a compile target to a language, so these issues are the same whether you are compiling another language directly to JS vs to wasm. So this happens anytime you want to use a language other than JS regardless of the compile target.
RH: This is what users experience, there is a limit to how simple the experience can be with Wasm. The fact that people are using it at all is sometimes surprising
BV: Following up on the last remark, using some kind of toolchain will be a more complicated developer experience. There is something about running Wasm on the web: the wasm module is more analogous to an object file than a bundle, so everybody who wants to run a wasm module on the web today has to play the linker. The component model makes it more of an executable
ESM Integration (Guy Bedford, 30 minutes)
GB presenting slides
GB: Live bindings not fully supported in Node.js implementation of Instance phase.
GB: Are Compact imports or binary optimization generally regarded as the path forward? What do we need to think about in picking a name?
RH: My preference for now is to defer this and wait for compact imports or something. That being said, assuming we have compact imports, I'd prefer if we had something like "wasm-js". Without compact imports we have fewer good options. I wouldn't want to do "_" or something like that. That seems terrible, too.
GB: We could defer this on the compact imports questions. This is something that needs a decision in ESM integration, so we would want to revisit the discussion when we want stability for ESM integration.
RH: Theoretically we could ship ESM integration and choose a name afterwards.
GB: If we did it afterward, it would be its own process of implementation and test, etc. It would be nice to avoid that. It’s nice to be feature complete.
RH: Worst case, we don't get compact imports and we could still pick a more efficient name in the future. We can have the more full and pretty namespace right now.
GB: If it becomes a stability question. I’m more than happy to choose the "wasm:js" convention. I'm wondering if "wasm-js" makes sense as the namespace.
KM: Maybe I missed the use case of importing the string builtins / constants.
RH: Technically that is a different proposal. We're talking about importing JS string builtins into Wasm via ESM integration.
KM: Sounds good, I misunderstood what we were talking about.
GB: The thing with the string constant is: if we don’t provide a string constant, it would be impossible to use the feature.
CW: How about a straw poll for the namespace?
GB: Sure, but let's continue with the slides for now.
Back to slides …
Sam Clegg (chat): (About defer imports) that would work with source phase too?
GB: This is an instance phase feature. The source phase is unaffected.
TL: For the import attribute: is this something we have to decide now or can we defer it to the future? Element segment imports don't exist yet and there are no other use cases.
GB: Yes. Really the question is more about whether there is other spec work that we could be piggybacking on but aren't.
LW: Data segment imports do seem like a generally useful thing. If we add it to core Wasm, do we need to mangle type=binary into the strings, or can we infer that it’s binary because it's a data import?
GB: That could be another framing. The question is does that create compatibility issues down the road if there are other attributes we want parity with. That’s a discussion we can have at integration time. The question now is if this is useful.
BD: A more general question: I was interested in hearing from other browser vendors if they have been looking into ESM integration.
RH: We've started looking at ESM integration, but haven't fully implemented it or anything. I'm not an ESM expert, so I don't know how they're thinking about it. Will get started soon.
BD: Similar with chrome.
CW: One selfish question on the compact import format: This instance concept is much more complex than I’d have expected.
RH: When I first presented compact imports, there was feedback that it may not be worth it without doing more, so maybe it should reuse some concepts from module linking.
TL: I hadn't seen that. I would like to nudge in the opposite direction.
CW: Soft agree with that.
Sam (Chat): BTW, Emscripten has a fairly complete and tested instance phase implementation (tested against node's experimental support)
Profiles (Andreas Rossberg, ... minutes)
AR presenting slides
AR noted the growing diversity of WebAssembly platforms, from the web to embedded spaces and blockchains, each with different constraints. This diversity is a positive consequence of Wasm’s success but necessitates standardization. Slide 4 from 2023 illustrates the existing diversity in non-web engines and the presence of flags for experimental or historical features, making it unclear which features are relied upon by users. Diversity and flags are unavoidable, particularly with newer, large features like GC, exceptions, stack switching, and func.new, as some platforms will be unable to support them.
The lime1 subset, defined in the tool-conventions repository, targets traditional linear memory uses. While this existing subset is good, the CG needs to define such subsets more precisely and coherently, using the current spec rather than the historical proposals. The profiles idea is to standardize "sublanguages" with precise specifications of restrictions on syntax and semantics, chosen by the environment provider (consumer), not the application (producer). The goal is to minimize fragmentation by acknowledging existing subsets and standardizing their number and compatibility.
Profiles are not meant to be chosen by the producer (who should assume the full language is available), nor are they meant to solve language versioning (which is a time dimension issue, while profiles are a space dimension). They should not be conflated with feature detection, which is a runtime conditional mechanism. Profiles must never change feature semantics, only restrict features or allowed behaviors (e.g., increased determinism). They must always be conservative, allowing consumers to extend their profile over time without breaking producers who target an earlier subset. The technical approach is subtractive, defining the full language and then carving out rules to remove features, which guarantees a complete specification of all possible interactions between subsets.
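The subtractive approach and the conservativeness rule can be sketched with plain sets in Python. The feature names and their granularity are illustrative assumptions only; the function names are hypothetical.

```python
from typing import FrozenSet, Set

# Illustrative feature names only; real profiles would be defined by spec rules.
FULL_LANGUAGE: FrozenSet[str] = frozenset(
    {"core", "simd", "gc", "exceptions", "stack-switching"}
)


def make_profile(removed: Set[str]) -> FrozenSet[str]:
    """Define a profile subtractively: the full language minus some features."""
    unknown = set(removed) - FULL_LANGUAGE
    if unknown:
        raise ValueError(f"cannot remove unknown features: {sorted(unknown)}")
    return FULL_LANGUAGE - set(removed)


def is_conservative_update(old: FrozenSet[str], new: FrozenSet[str]) -> bool:
    """A profile may only grow, so producers targeting the old one keep working."""
    return old <= new


def combine(a: FrozenSet[str], b: FrozenSet[str]) -> FrozenSet[str]:
    """Two restrictions applied together: keep only what both profiles allow."""
    return a & b
```

Because every profile is a subset of one fully specified language, any combination of profiles is again a subset whose behaviour is already defined.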
CF: Could you clarify what we have to do with new features. In which subset will a new feature be put?
AR: You mean how do we introduce profiles for certain features? That gets to the process question.
CF: If the profile remains the same over time, then we need to add new features to the subtracted set for existing profiles. That's why this should be in the spec instead of on the side. We need holistic, coherent thinking. It should be possible to add new features to profiles as well.
AR: Whenever we add a new feature, we have to think about the profiles. We need to have a coherent, holistic thinking about these things.
SL: Are you envisioning profiles only growing? You can never remove something from a profile.
AR: Yes, that is correct. That would not be conservative
AR: I think there would be a strong correlation with certain features. In reality a profile might be the whole language minus this and that feature. Because of the subtractive approach, combinations always make sense.
KM: How does it work, can you add features to a profile.
AR: The language can grow and so can the sublanguage. We as a standards committee have to agree that this is a larger language than before. If we respect the interests of all parties involved, that should be fine.
There are risks of defining profiles "early" (proactively), where it might be difficult to predict requirements, leading to "designing in a vacuum". However, the risk of waiting for the "cow path" (reactively) is that customers needing subsets will define them inconsistently, leading to more fragmentation. Tool chains might make inconsistent assumptions, and tools/engine providers have little incentive to propose standardization after they have already implemented a choice. Reactive definition could be considered a "breaking change" to the language.
KM: I claim I support the minimal MVP profile and add GC. But I also provide random features. That doesn't match any profile.
BT: It seems profiles should be versioned along with the spec.
I have liked this proposal since the beginning; we should have a few strawman profiles.
AR: GC, Stack Switching, Determinism, Exception handling, SIMD are the main big features.
CW: What about Lime?
AR: That is even smaller, … Reference types could also be one. Importantly, we don’t have to exactly follow proposals. Proposals are a historical thing and features don’t necessarily have to align.
BV: I'm in favor of the cowpath approach. Clarity of what is supported is important. The benefit is getting rid of the weird gaps that shouldn't be there. A platform not supporting GC makes sense. A platform supporting only half the bulk memory instructions does not. Hopefully we can get people to avoid those weird gaps.
AR: Yes. You seem to be kind of contradicting yourself. The fact that Lime does not support all memory instructions is a consequence of the cowpath approach.
DS: It’s an example of a profile that doesn't want to match the proposal outlines since it supports bulk memory but doesn't support bulk table operations.
AR: It also doesn't support memory.init.
DS: Because that's only used for threads.
BV: The reason that I think the cowpath approach makes sense, is because it reflects the real needs/constraints people run into. It’s hard to say in advance where the line should be drawn, but you can make it coherent after the fact. We should minimize the minute differences between the different lime-y toolchains.
AR: Would you say the same applies to things like GC? … What it means to leave out GC is something that we should exactly define, in order to prevent 10 cowpaths from existing. It’s not necessarily obvious, e.g. do you disallow recursive types? Those could be used for function types, for example.
BV: I think that's a good example. There's a lot of type system complexity that was part of GC that might be useful more generally. But I'm not sure we can predict every need.
AR: Isn’t that saying the committee cannot predict what is needed? Isn’t it our job to figure out what the community needs and then standardize it?
TT: I like the proposal about the profiles because I don’t think one size fits all. … Do profiles prevent things like yesterday’s page sizes from becoming a blocker for adoption? How are profiles related to the standards process? Will there be a browser profile without custom page sizes?
AR: I think that's too small to justify a profile. But if you spec a large feature, part of that should be defining the profile for removing it. It should be part of the design process to consider who will implement a feature.
KM: On the cowpath side: If we don't do anything people will just do what they want anyway, e.g. say they support this proposal plus those instructions. More maximalist option: we consider all intersections of features.
AR: I think ultimately it comes down to good judgement on the side of the committee. The question is what the conservative choice is.
BT: Some profiles are syntactic and therefore easy. The cowpath things only come in for the non-easy things because they're random subsets of instructions in use. We should plow ahead with the obvious ones and figure out the cowpath ones later.
CW: I agree, the strawman features will make the discussion more focussed.
Incubation framework for compute extensions (Deepti Gandluri)
DG presenting slides
AR: Because you mentioned multiple return values [presentation mentioned that multiple return values would be a blocker for the lightweight approach], would that still be an issue if we already had instructions with multiple return values?
DG: Yes because that raises the question of why you need multivalue returns. Maybe I'm missing something.
AR: Early examples were divmod, add-with-carry, etc. Once we have set a precedent, maybe the threshold for adding more would be way lower.
DG: Maybe. Rounding modes are another example where it's simple enough but we would still want the full process. We want to be conservative to avoid including things people aren't comfortable with. We would go to the CG for guidance.
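To make the multi-result examples from this exchange concrete (divmod, add-with-carry), here is a small sketch of what "an instruction with two results" computes for unsigned 64-bit operands. The function names are hypothetical and do not correspond to any proposed Wasm instruction; this only models the arithmetic.

```python
MASK64 = (1 << 64) - 1  # unsigned 64-bit wraparound mask


def add_with_carry_u64(a, b, carry_in=0):
    """Return (sum mod 2**64, carry_out) for unsigned 64-bit operands."""
    total = (a & MASK64) + (b & MASK64) + (carry_in & 1)
    return total & MASK64, total >> 64


def divmod_u64(a, b):
    """Return (quotient, remainder) in one step; division by zero is an error."""
    if b == 0:
        raise ZeroDivisionError("integer divide by zero")
    return (a & MASK64) // b, (a & MASK64) % b


def add_u128(a_lo, a_hi, b_lo, b_hi):
    """128-bit addition built from two chained add-with-carry steps."""
    lo, carry = add_with_carry_u64(a_lo, b_lo)
    hi, _ = add_with_carry_u64(a_hi, b_hi, carry)
    return lo, hi
```

Both results are produced by a single operation, which is why such instructions depend on multi-value returns.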
Lucas: You mentioned that it should lower to one or two instructions. But what would be the issue with having several instructions on older architectures?
DG: I think the only concern would be that I don’t want this to be an escape hatch for hardware vendors to get their unique instructions in where it would require expensive lowering on other architectures. Want to avoid performance footguns.
The intent is to expose instructions with predictable performance.
CW: The extra thing I want to add is that I'd be uncomfortable with a scheme where the choice to use the new instruction makes it worse on some platform than not using the instruction would have.
DG: I think that is reasonable.
RH: Is this supposed to change the way we standardize new instructions?
DG: For a lot of these I am hoping to have a discussion about a new way of standardizing some instructions. Just the beginning of the discussion. Also for incubation, if 2 browsers decided to prototype, we could run an OT to get production data. So if we could incubate in a shared framework that would help.
RH: So when you say “shared framework for incubation” you mean clang and an origin trial? The bar for experimentation is already super low. Firefox would be comfortable making basically anything behind a flag and having people try it out. Shipping it is different.
DG: If you're experimenting just in your engine, there's nothing stopping you. It's when talking about origin trials that we don't want to get into a situation where users are going and asking other browsers for features without them having some kind of framework.
FM: I’m a little confused. There is incubation and cheap standardization.
DG: next slides
FM: It occurs to me that if you have a small proposal, it would go through standardization faster than a big one.
DG: … yes and no.
CW: I had a similar point.
TT: I agree that web VMs are the performance stakeholders. But they aren’t the only ones, we care a lot about e.g. cache misses caused by AOT code.
DG: I wanted to call out Web VMs in this context because we particularly care about this kind of performance.
AR: One thing we have discussed. WebAssembly was historically bad at versioning. And I thought more about bumping the version every time we add a feature. This would go in the opposite direction.
DG: You could decide to bundle several of them into a "feature."
MP: I like this sort of thing overall, a fan of experimenting with compute. One unfortunate thing about actual hardware is that if the architectures aren’t careful they have nop encoding space, but then people use it for encoding hints or things like that, patterns that have to be recognized or that have to be converted to real instructions. But it happens when vendors experiment with the space too much. Maybe there could be something to more clearly mark spaces as experimental and flag that to users or something.
DG: Yes, I mean I see that. Two points: one is the general validation, we could either validate aggressively or not. The second thing is how to query for features.
YI: About the comparison with resizable ArrayBuffer… that’s kind of scary actually. I remember trying to figure it out: there was a new thing in the spec, I didn’t recall the discussion, I didn’t know if I could use it, and it was hard to figure out.
DG: I think this is trying to do that. We did have this discussion when we were doing resizable ArrayBuffers. That process is opaque for someone who is not in TC39 or following along, but what we'd like to do is make this more discoverable.
CW: Two sort of related questions. With things like 128-bit mul, that went through the process very quickly and now it's stuck because we’re waiting on web VMs to implement it.
DG: That's a chicken and egg problem. We incentivize a really slow process with a lot of validation, even when it's not necessary.
CW: The second observation: we could try to grow a totally separate process for this, or we could think in terms of skipping steps. If you have a design that satisfies these initial checks and a prototype, maybe you skip it to phase 2, and then when web VMs implement it, skip it to phase 4.
DG: I think my reason to propose the most agile approach was that I thought I would get more resistance.
TT: We have the lowest common denominator in performance, and now add instructions that might run faster and might not run at all in older hardware…do we want a profile that says we want deterministic performance and a profile where you don't, where you want to take advantage of the latest performance.
DG: There's a couple of things. The first one is that we would not do lightweight standardization for anything that is not supported across modern hardware. Even where emulation is necessary, the engine would be able to emulate better than a pure userspace emulation. Secondly, deterministic performance is a myth. We’ve told it to ourselves but all performance is device dependent.
BT: ARM has 4000 accelerator instructions. I’m of two minds: I want to have a lightweight process for the ones that are important, but it feels like we’re playing whack-a-mole. Maybe every 2 years or something we do a roundup of instructions that are now available everywhere and bring them in? Maybe less feature-related?
DG: …
BT: It makes the instruction set very baroque with all the random things, e.g. relaxed SIMD.
DG: We could have been more consistent about relaxed SIMD but the CG had strong opinions about what the requirements should be to add, and the ones that are in are the only ones that met it.
MP: It's not a game of whack-a-mole. Figuring out which instructions are important is a solved problem. The hardware frontend people look at traces and can get it down to a couple hundred.
DG: I kind of agree with your concern. … We have to work around the weirdness we have introduced.
BT: Also, just practically, adding a few instructions to the engine really isn’t that much work. Maybe there could be some pragmatic work on how to make it easier to implement instructions in the backend.
DG: In a lot of cases it's like wide-integer arithmetic. It’s just a matter of a few people sitting together for an afternoon and discussing it.
BT: It should be more accessible to contribute.
DG: The hardware vendors already contribute to V8, for example. That's probably enough.
TT: I’m concerned about e.g. encryption instructions in OpenSSL. What if switches implement a feature or instructions that web VMs do not support: would this be a blocking issue?
AR: You mentioned it was just a few hours to implement a few instructions; it’s not just that, it's that times 50 because there are a lot of different implementations, tools, etc. It’s not just one engine. Also, we still don’t have good tests for relaxed SIMD.
DG: That’s largely because it took so long; the people who were championing it left the CG and the company.
CW: I think the time it took to do relaxed SIMD reflected how controversial it was. Hopefully we wouldn't use this process for instructions like that.
DG: We are all in agreement that the process takes longer than it should. What actually happens now…
We do a lot of spec tests to hit the letter of the process not because we want it.
CW: One experiment to test this process. You could identify a couple of instructions you think would be suitable and we could discuss how we would feel about sending them straight to phase 2.
DG: I think so. That sounds like the right way to do it. Closing statement: I wanted to ask if there is interest in such a process.
Discussion on Stack Switching
Apple Engineers on the Podium
Clarification on what “boxing" means: the shared->unshared wrappers mentioned before.
FM: On signal handlers: I was under the impression that in general when you write a signal handler, you are very concerned with how much stackspace you use.
KM: Yes, but people do things with their signal handlers that are not necessarily recommended. A lot of the ways Wasm implementations do signal handling, e.g. for memory bounds, are not really within the recommendations for what to do in signal handlers.
KM: As of last week ARM introduced new hardware instructions - permissions overlay extension (POE2… FEAT-TPS) instructions, that give overlays into your process. You can basically annotate memory as stack memory and have two registers that contain the upper and lower bound. If you access the stack outside, you’ll crash. In order to change your stack bounds you need an isb instruction which is very expensive, since they expect that this will be rare.
CW: Are you able to talk about in any more detail if Apple is going to use this feature?
MP: No. The intended uses of the proposal are that apps could opt certain threads in, allocate a guarded region of the stack even if the rest of the process isn't. Or go all the way and have clang + pthreads set it up by default, or even in the kernel. So there is a spectrum. But if you were in a case where the platform wants stronger restrictions, it’s a problem if the app wants to rapidly bounce between stacks. It would need an isb to change the bounds.
CW: Registers effectively bind the top and bottom of the stack, and then you need to issue an ISB.
SL: Is the concern with moving the stacks around? If you use the segmented stack approach, you don’t have to move around.
KM: In theory they could be, you have one bound and now you have to move to a heap allocated thread.
SL: This is not a problem if you’re single threaded. Or if you can define a single contiguous range where all your stacks are you’re fine but if you add shared-everything then you have to switch between threads.
CW: Is an ISB actually that bad compared to a copy? I would have expected the memcopy to take a few cycles.
MP: They’re faster because you can speculate through them.
BT: You can also lazily copy, the “Effect” language does it lazily on return. If you have a deep stack, it spreads the cost of copying out by copying frame-by-frame. The top frames coming in and out get copied, but the frames below them stay in place.
AR: I’m wondering, having side stacks as a technique isn’t just invented for wasm, what would other systems that use that do?
MP: It’s one of these things where we’re dealing with different constraints, with Wasm it's a different thing with a higher bar and you can be more restrictive.
CW: On the ISB thing: in the presence of shared continuations, even a copying implementation would need its own barrier, so don't we need the ISB anyway?
KM: You don’t need an ISB, you only need the cheaper store fence.
CW: But at least it brings them closer together. And if we're talking about only a couple hundred cycles anyway, maybe it washes out in the end.
KM: This is maybe possible, not impossible, I agree.
MP: DSB & ISB are very different. You can speculate through a DSB. It’s not that this is a no-go, it's just that there are concerns.
FM: I’m slightly concerned by this ARM architecture compared to intel. With Intel, you can allocate a region for the stack but you have to go through the OS. Then there are instructions for exchanging shadow stacks, which are in user space. You can switch shadow stacks in user space. With this ARM feature, you don't have to go through the OS, right?
KM: This isn’t a syscall either, you're issuing a synchronization barrier to the CPU. It tells the CPU to stop speculating that your bounds aren’t changing, because they changed. It flushes your whole out of order pipeline.
CW: Question related to that: you need a store barrier, do you also need a load barrier?
KM: No. On arm64 you just need one or the other in theory. The acquire guarantees that you'll see everything written before. In some cases you just have an address dependency and don’t need it at all but generally you need one.
SL: This particular hardware, how available is it?
Apple: It is proposed, at least two years from now.
CW: you can do an approximation of the benchmark by issuing the barrier, which is basically the whole cost.
BT: I think we're overindexing on speculative hardware.
CW: The high bit for me is that we need experiments.
MP: My experiments with just inserting ISBs at certain intervals did suggest that the cost was noticeable. Other similar hardware mitigations have similar potential concerns, but we’ve talked about the others more.
BD: I have a question on the benchmarking: Did you only talk about runtime performance or also binary size bloat as well? Or, are you already accepting that there’s a big cost?
KM: Not asking anyone to write entirely new CPS transforms, it would be interesting to think how one CPS transform works, and what a different flavor of a CPS transform would look like in terms of size and performance benefits. A classic example with tail calls, and with your exception handler - hard to say what the overhead would be.
BD: In the Asyncify case, in many cases it was just a nonstarter for users to use it when their binary got 50% bigger.
KM: Half of it is applications that assume they’re synchronous, the JSPI applications. I don’t know what % of Dart/Java/Kotlin functions are asynchronous. E.g., is the bloat still the same when you only CPS the functions that are async?
BD: Yes.
FM: If you are comparing any CPS transform: there are better ones or worse. …with any switch-oriented transform like Asyncify, you are also paying for the fact that your local variables are not in your registers anymore.
KM: You can have them in registers between yield points.
FM: They have to be saved away on switch, and loaded back.
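The point about locals being saved on switch and reloaded on resume can be sketched as follows. This is an illustrative, hand-written switch-oriented transform of a small loop; it is not the output of Asyncify or of any toolchain, and the names (`Suspended`, `sum_transformed`, `run`) are hypothetical.

```python
def sum_direct(items, fetch):
    """Original synchronous-looking code: locals stay live across each fetch."""
    total = 0
    for item in items:
        total += fetch(item)
    return total


class Suspended:
    """Marker returned to the scheduler: a pending request plus the spilled locals."""

    def __init__(self, request, frame):
        self.request = request
        self.frame = frame


def sum_transformed(items, frame=None, resumed_value=None):
    """Transformed form: runs until the next suspension point, then unwinds."""
    if frame is None:
        i, total = 0, 0  # fresh call: initialize locals
    else:
        # Rewind: reload the locals that were spilled at the last suspension.
        i, total = frame["i"], frame["total"]
        total += resumed_value
        i += 1
    if i < len(items):
        # Unwind: spill locals to a heap-allocated frame and return to the caller.
        return Suspended(items[i], {"i": i, "total": total})
    return total


def run(items, fetch):
    """Tiny driver standing in for the scheduler that services each suspension."""
    result = sum_transformed(items)
    while isinstance(result, Suspended):
        result = sum_transformed(items, result.frame, fetch(result.request))
    return result
```

Between suspension points the locals `i` and `total` are ordinary locals, but every switch pays for a spill and a reload, plus the extra code for both paths.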
CW: Comes back to copy vs ISB - Conrad betting that it's not that different.
AR: You said with Asyncify it’s not clear how many functions you have to transform. I agree but async is not really the best example; lightweight threads are the better example because you have to transform everything.
KM: Agree that that would be the worst case scenario.
AR: That’s a very common thing.
KM: Fair enough. It’s important to understand how the different things play out.
CW: We got quite deep into certain topics, is there anything we should discuss more?
CW: We certainly have some things we know we can test, we should run those, and should we say, what benchmarks should we run that would potentially be convincing?
SL: Partly I think we’re going to do these benchmarks naturally. It is important that we’ll have an ongoing dialogue with you guys (Apple). Also, you have your own tests in your own environment, which we don’t have access to.
KM: There is ongoing work implementing JSPI. There is extra work beyond that, e.g. creating a new continuation when you start a new side stack.
SL: The major issue is that the use cases for JSPI are very different.
KM: For many green threads/cooperative scheduling … I would think the rate of switching is a configurable parameter.
SP: Another knob it would be good to look at is the tradeoff between number of transitions vs how much memory is spent/wasted.
DG: I just wanted to go back to the very specific things you had: W.r.t. Shared everything: In terms of the concrete thing we can test, the ISB vs the copying. For CPS vs stack switching, this will flow naturally from having a prototype and letting toolchains test it. The 3rd one we could go deeper on what could help.
KM: With (copying?), if you never leave the POSIX stack, then signal handlers aren’t a problem.
FM: This is an experience report from the JSPI front. Our biggest issues had nothing to do with that. We ended up having to switch to the central stack to do almost everything, e.g. running host code or JS code, we couldn't run those on the side stack. It was the biggest obstacle for getting decent performance. Through the heroic work of a few people, we managed to get the cost of that down, but it was the biggest concern rather than the cost of switching between stacks.
MP: A major trend in the hardware security world is that the current models for how we run software are insecure and there needs to be more hardware assistance for making this happen, so there’s been a lot of hardware stuff about doing this and getting platforms and vendors to adopt it. It’s a slow timeline, similar to stack switching, so it makes sense that we haven’t run into it yet, as it's forward looking.
FM: We’ve had the same pressure, in practice, the logistics of running on side stacks was harder than we expected. We got that wrong a number of times.
SL: Francis has been bringing up these kinds of issues since the beginning in the stack subgroup. You're saying this kind of thing wasn’t the biggest issue?
FM: Just a report.
MP: There were still ongoing experiments with CET; you don’t have the full numbers on that yet.
FM: At the beginning of the process, we anticipated that any time now somebody is going to drop the CET hammer on us, but it never happened. We gamed it out and implemented the PAC instructions for ARM. It was relatively straightforward. CET seems better than what ARM is proposing but I don’t have a strong opinion about that. But it wasn’t the biggest pain point for us.
MP: That is an effect of platform vendor policies, I assume this would be different if your hardware policies would be different. You might have options that system apps and system libraries might not have.
ZB: In terms of benchmarking we can provide an implementation of stack switching support, we already have CPS transform, we can’t test it on all the hardware and VMs, so we need help with testing on machines you care about.
NF: Digging in more about the signal handler stuff: it seems to me that if you're doing POSIX, you can do SIGALTSTACK?
KM: You may be a guest in someone else’s process - it's not vectored exception handlers where it nicely composes between individual callers. You don’t have control over what the host app does.
CW: Is this because of JSC being embedded in larger applications?
MP: JSC is just a system library, so we don’t control where we’re embedded.
KM: You could have something - whether we want to do this or not is independent - you must call this API, and core stack switching won’t compile if you don't. Then you just call sigaltstack when they do that. - don’t know if it's a viable option.