CG-02.md

June 20, 2025

Agenda for the February meeting
Meeting notes

Agenda for the February meeting of WebAssembly's Community Group

Host: Fastly, San Francisco, California, USA
Dates: Wednesday-Thursday, February 12-13, 2025
Times: 9:00AM - 5:00PM PT
Video Meeting:
- Link available in W3C calendar events: day 1, day 2
Location:
- 475 Brannan St. #300, San Francisco, CA 94107
Wifi: TBD
Code of conduct:
- Standard WebAssembly code of conduct. If you have any questions or concerns, please reach out to WebAssembly CG chair.

Research Day

Date & time: Tuesday, February 11, 2025. 9:00AM - 5:30PM
Agenda

Date & time: Wednesday, February 12, 2025. 6:00PM - 8:00PM
Location: Spark Social
Details: Space reservation sponsored by Google, food available for purchase at Spark. All in-person registrants of the CG meeting are welcome to attend.

Logistics

Communication

Event contact: luke@fastly.com
CG chair contact: webassembly-cg-chair@chromium.org
Discord: Join #cg-meeting-2025 on the WebAssembly Discord

Getting to the venue

The event will take place at the Fastly San Francisco office, 475 Brannan, Suite #300.

The Fastly office is within walking distance from the Caltrain stop as well as T and N train stops. Due to the general unreliability of traffic in SF, transit, walking and/or biking may be preferable.

There is a buzzer to the right of the main entrance, to be let into the building.

Arrival on bike

There is a free bike cage in the garage that opens at 7:30am and closes at 4pm. After 4pm, the guard at the front desk of the 1st floor will let you in. Please enter through the garage only (no bikes allowed through the building lobby) and let the parking attendant know you are attending the Fastly event. If he is not on the garage floor, he will be through the door at a payment booth. The attendant will show you the bike cage. Please bring your own lock.

Arrival by car

There is a (paid) garage at 475 Brannan St. Please stop by the parking attendant to pay when you enter.

On arrival

After entering the building, take the elevator to the 3rd floor, and the entrance to the Fastly reception area will be on the right. If arriving after 10am, text Luke at +1 (817) 919-5583 and he’ll run up to the door and let you in.

Badges will be printed at the reception area and need to be worn at all times in the Fastly office. Fastly policy requests that attendees take a COVID-19 test before attendance.

Agenda Items

Session start times are not guaranteed. We may start sessions before their scheduled times if previous sessions end early.

Google Calendar of sessions

Wednesday, February 12

9:00am - 9:15am: Meeting Introduction
9:15am - 10:15am: JavaScript Promise Integration (Francis McCabe)
- Vote for phase 4
10:15am - 10:30am: Break
10:30am - 11:30am: Stack Switching (Sam Lindley)
11:30am - 12:00pm: Component Model (Luke Wagner)
12:00pm - 1:00pm: Lunch
1:00pm - 1:30pm: Shared-Everything Threads (Thomas Lively)
1:30pm - 2:00pm: Thread-Local Storage (Conrad Watt)
2:00pm - 3:00pm: Half Precision (Ilya Rezvov)
- Possible vote for phase 2
3:00pm - 3:15pm: Break
3:15pm - 3:45pm: Wide Arithmetic (Alex Crichton)
- Possible vote for phase 3
3:45pm - 4:15pm: WebAssembly Website (Tom Steiner)
4:15pm - 5:00pm: Slack Time

Thursday, February 13

9:00am - 9:30am: Memory Control (Deepti Gandluri)
9:30am - 10:00am: Immutable ArrayBuffers and Detachability of ArrayBuffers (Deepti Gandluri)
10:00am - 10:30am: Compilation Hints (Emanuel Ziegler)
- Vote for phase 2
10:30am - 10:45am: Break
10:45am - 11:45am: Custom RTTs and JS Interop (Thomas Lively)
- Vote for phase 1
11:45am - 12:45pm: Lunch
12:45pm - 1:15pm: ESM Integration (Guy Bedford)
1:15pm - 2:15pm: SpecTec (Andreas Rossberg)
2:15pm - 3:00pm: Slack Time
3:00pm - 5:00pm: Unconference / Informal Discussion (No Zoom)

Attendees

Thomas Lively Derek Schuff Conrad Watt David Degazio Sam Lindley Markus Scherer Yuri Iozzelli Henrik Edstrom Deepti Gandluri Ilya Rezvov Thomas Steiner Erik Rose Chris Fallin Naomi Smith Dan Gohman Steven DeTar Ryan Hunt Yan Chen Brendan Dahl Guy Bedford Ricky Vetter Paul Osborn Alex Crichton Bailey Hayes Elizabeth Gilbert Leslie Carr Zalim Bashorov Calvin Pruet Nick Fitzgerald Luke Wagner Francis McCabe David Thompson Michael Ficarra David Bryant Reid Rankin Keith Miller Ashley Amejonah Izoo Sébastien Doeraene Ben Titzer Yury Delendik Long Emanuel Ziegler Slava Kuzmich Pierre Chambard Ben Visness Ty Overby Jerome Julien Pages Monadic Cat Kevin Gibbons Mendy Berger Chris Woods Matthias Liedtke Andreas Rossberg Andrew Brown Johnnie Birch Frank Emrich Shu-Yu Guo Heejin Ahn Saúl Cabrera Oscar Spencer Paolo Severini Amal Ahmed Sukyoung Ryu Peter McInerney

Meeting notes

Wednesday, February 12

JSPI

JSPI, Francis McCabe (FM) presenting. slides

CW: Due to a late-breaking expression of concern from WebKit, we won’t do the phase 4 vote. For now I’d like to do an informal poll to see whether folks are favorable to letting it go forward.

KM: Sorry for the delayed feedback; we had some staffing changes during this time, and we didn’t get to review it as well as we wanted. Our concern is not with JSPI itself but rather with having side stacks. Some of these pieces have value of their own, but once you have side stacks, there will be pressure to keep that for core stack switching. So I'm concerned that we will lock ourselves into that design constraint. The first concern is that when you’re running JS code and want to connect, you might take the promising import and call it as an async JS function, which then calls back into the JSPI input, and repeat that a bunch of times; you’ll quickly exhaust your memory with side stacks. So there’s concern with how the API is surfaced. On core stack switching, it’s hard to represent stackless coroutines efficiently, so you’ll end up with the same problem you have with Asyncify today: e.g. C++, JS, Swift, Python etc. would still be sort of second-class citizens. Also, we’re concerned about mixing engine data and user data at the same addresses that are shared in ways that could leak. With our JS objects, we isolate them such that an actual virtual address like a PC would never contain an integer from the user, but side stacks make that really hard to do across threads; presumably we’d want to migrate across workers at some point. We’re hoping to create a proposal for core switching which is more like how JS does it without being overly burdensome. Still working on that internally, will hopefully get to that soon.

Sébastien Doeraene (in chat): FTR, we have a toolchain implementation in Scala.js-to-Wasm, although it's not merged yet: https://github.com/scala-js/scala-js/pull/5130

Ty (in chat): Wasm_of_ocaml is another toolchain implementation

jerome (in chat): Right, Wasm_of_ocaml is using it to implement OCaml effect handlers

David Thompson (in chat): I hadn't looked at this proposal in quite awhile, looks so different now. I think this could solve some issues for me wrt scheme programs that aren't using async i/o, etc. Though I'm unsure given this side stack discussion... would have to just experiment and see how it goes, I guess.

CW: I’m not expecting we’ll resolve today. We’ll just have a conversation over the next couple days.

SL: Clarification about core stack switching: we could use some of that time to continue the discussion on core switching if we want.

CW: I think we’re doing ok for time. Right now we should have time for a conversation.

Reid Rankin (in chat): Since JSPI forbids re-entrancy, wouldn't that avoid the memory exhaustion issue?

TL (in chat): JSPI allows re-entrancy these days.

RR (in chat): TIL, thanks!

FM: So, I obviously don’t agree :) First, on the number of side stacks that we might have in the system: it’s been our intention to be able to support on the order of 1M side stacks in a reasonably large page. That comes from a conversation with the Kotlin language designer, who said that some of their apps use suspended coroutines in a react-style model, i.e. a suspended coroutine for every node in the graph. Our current implementation doesn’t quite get there. We have something we call growable stacks that we are working on in the implementation. The growable stacks design will allow on the order of 1M side stacks on a reasonably sized page. So I think that issue, which has always been important to us, is accommodated. The "sandwich" scenario, where you have reentrant calls back from JS, has also been an important consideration.

FM: The last point you raised was essentially a security property that you don't want to have side stacks with arbitrary pointers into arbitrary code. I agree. We don't want that and we don't have that.

FM: This does need to be done properly; the other thing to note is that we don’t have shared threads yet. When we do, the existing mechanism will result in an unshared side-stack; it won’t be possible to share that across threads.

FM: The idea of having parallel access isn’t going to happen. Having shared continuations is definitely part of the anticipated design. But not today, we’re not doing shared things between threads today. We’d like to have work stealing but we don’t have it yet.

FM: Finally, the performance of the stackless coroutining: You should not use JSPI to implement stackless coroutines in your language. If you want stackless coroutines in your language, you must use the core stack switching proposal, which is a separate proposal.

FM: People have tried to use JSPI to implement coroutines in the source language; it’s a heroic effort: possible but not part of the intended usage. On the other hand, once we have core stack switching, then it should be the case that suggests that stackless coroutines will be significantly slowed down by this design.

DT (in chat): Very interested in growable stacks. Good clarification re: stackless coroutines.

SL: One of the concerns you (Keith) had is that there are languages which implement this functionality using so-called stackless approaches, and this design might compromise those languages. But what about languages which don’t use stackless coroutines, why should they compromise?

KM: Ergonomics is one question, performance is another. Given that this is a compile target, ergonomics I think is less of a concern than how efficient the implementation can be in terms of memory, code size and runtime performance.

SL: I do have in mind ergonomics, but also performance.

KM: this would obviously require some exploration, but I think there’s a design that doesn’t necessarily have to make that tradeoff (still waiting on internal legal approvals to present it). I agree that we don’t want to sacrifice performance for languages that use stackful coroutines; I’m hopeful that our design will work for everyone.

SL: The belief in the stacks group, which has been going for 5 years, is that the design should be able to accommodate all of these languages efficiently. But if there’s evidence to the contrary, I’d be interested to see it.

KM: Sure.

Deepti Gandluri (DG): A couple of questions for KM: back to the JSPI discussions, you mentioned an alternative design. I’m wondering what implications that has for the current shape of JSPI, which doesn’t really impose too many restrictions on core stack switching (and the way that’s exposed to JS could evolve). I’d be curious to see what you think should change on JSPI specifically.

KM: Our concern is: As a hypothetical, say JSPI goes to phase 5 and becomes standard. Then when core stack switching is discussed, the fact that JSPI requires side stacks could push people to say that we should do the same thing for core stack switching. If we’re already using this for JSPI, we don’t want two separate mechanisms that don’t interoperate super well.

CW: A more pointed version of the question: what is the thing you’d want to see instead of JSPI?

KM: on JSPI: I have a library solution that has equal performance to JSPI for the stackless use case (my test case) although it requires SharedArrayBuffers to be efficient. That’s sometimes difficult for people to use but I think it covers a lot of the concern. I’m open to discussion on that, and would need a little time to work that out.

CW: I don’t think this is something we can resolve today, but over the next few weeks I hope we have more discussions and get more clarity.

DD: To rephrase/clarify the response: for JSPI itself, I think our concern is primarily about core stack switching (CSS), and concern about how JSPI would lock us into a particular direction. We can talk more about how that would take the direction of CSS, but the concern is about the ultimate direction.

SL: One observation there: Francis and I have been talking about this for years, and the current V8 implementation, and perhaps SpiderMonkey’s too, is actually implementing stack switching underneath.

DD: That's sort of the problem, we need to figure out if that is essential. If JSPI forces side stacks in order to have good performance, that really limits the options on core stack switching.

FM: I believe the SM implementation of JSPI is based on a fibers API. The V8 implementation is hand-crafting from the beginning; we didn’t have a fibers API we could trust.

RH: You mean the Windows fibers API?

FM: For example.

RH: My understanding of our implementation is that we malloc a stack and switch to it, etc.

CW: The original wasmtime implementation used a fibers API. Maybe that's what you were thinking of?

FM: Within the engine, the kind of thing a fibers API would give you are the kinds of things you’d want to implement JSPI. The API doesn’t mention fibers, and you can implement promising and suspending however you want. But if you don’t do that, you won’t get good performance.

KM: yeah that’s the concern.

FM: Telling users they won’t get what they want isn’t a good response.

DD: It’s not an arbitrary concern to be against side stacks. But if we start from there, it would disproportionately disadvantage our browser if the API required that in order to get good performance.

FM: Just use side stacks ;-)

DD: I would think we still need to figure out the details. We’re figuring out the divide between JSPI and stack switching. We need time to evaluate it. If stack switching is an option for us, then it may be viable for us for JSPI.

KM: I was somewhat optimistic that it would be viable until WebGPU came into the picture. I had heard that a big use case was for things like network requests, dlopen, etc. so it would be ok if you had to walk the stack each time, but if you want to do this on every render request it would be a performance problem.

CW: Another way of saying that is, it requires a more performant mechanism to do the control transfer, more efficient than you can do without side stacks.

Reid Rankin (in chat): Ok, noob question -- Where can I find the latest proposed JSPI spec? I'm just realizing the doc I've read (https://github.com/WebAssembly/js-promise-integration/blob/main/proposals/js-promise-integration/Overview.md) is 5 months old, and it seems like more work has been done since then I might have missed.

Sébastien Doeraene (in chat): https://webassembly.github.io/js-promise-integration/js-api/#jspi

LW: Are there any additional implementation-specific limitations that we should take into account? You mentioned integers that looked like pointers; is that about conservative stack scanning? Are there other things, maybe CPU-specific considerations, that we should know about?

KM: We have concerns with future plans. Unfortunately I am not at liberty to disclose those plans. Largely our concern is in the intermingling of pointers and user-controlled integers, and the security implications of that. We have future security plans I can't go into that make that harder. It would make side stacks very hard, but wouldn’t necessarily affect e.g. how promises work

LW: If one were to run Go natively, would that same thing prohibit Go from running natively?

KM: I can’t talk about that. And I don't know much about Go.

DG: I would encourage folks to go take a look at the longer response Francis wrote (https://github.com/WebKit/standards-positions/issues/422#issuecomment-2651632857). The asynchronous interface is useful in itself. Would there be an alternative to JSPI that doesn’t look like JSPI itself?

DG: Conrad, I’d like to see a time box for what this discussion looks like, so we know what to expect.

CW: I think this is something we should discuss over a reasonably short timeframe, e.g. by the next stack switching meeting we should have an idea of what the timeline would be.

KM: I did discuss this with Francis several months ago and I’m not aware of any progress since then.

CW: I don’t want to be too much of a stickler for procedure, but we do want to have these discussions in public.

KM: There wasn’t a discussion on JSPI since then, so there wasn’t really an opportunity.

David Thompson (in chat): That's a good point re: webgpu. Realtime graphics on wasm is something I want to see improve.

Reid Rankin (in chat): (I'm missing why JSPI requires side stacks. I don't want to take up the group's time here, but if anyone would be willing to explain this to me offline or in a directed message I'd appreciate it.)

David Thompson (in chat): I guess because you have to suspend the wasm program and resume it later to provide the illusion that you are doing synchronous execution?

CW: A three-way poll (informal): F = in favor of JSPI moving forward as is, N = neutral, A = against, want to consider alternatives such as what KM is proposing.

- F: 25 in room, 12 in chat
- N: 6 in room, 7 in chat
- A: 1 in room, 1 in chat

CW: For the record, both people voting against are Apple representatives.

KM: I think we’ve covered our concerns above.

Stack Switching

Sam Lindley (SL) presenting slides

SL: Starting with motivating why we want stack switching. There are many existing languages with advanced control flow features that desire compiling to Wasm. Many are compiling to Wasm now with sub-optimal approaches like Asyncify. You can view Asyncify as a stackless approach that leads to code bloat that doesn’t really scale.

SL: The claim is that stack switching that we propose is sufficiently general to support all of the desired features [slide includes: TODO]

SL: Stack-switching came out of the JSPI discussion. We found that it was more of a low hanging fruit to implement right now.

SL: The current status is that this proposal is very mature in terms of implementations and experimentation. There were various bumps along the way. We discussed in the past whether a stackless design would work, but it did not align fully with the design that we had in mind. We moved to phase 2 in August 2024.

SL: We have implemented most of the design in the reference interpreter, Wizard, and Wasmtime, with post-docs Frank Emrich and Daniel Hillerstrom at Edinburgh working toward a robust implementation. We published a paper on the design with benchmarks. There is now a PR that Alex Crichton is reviewing to upstream to Wasmtime. There has been dialog between the Bytecode Alliance and Daniel on the implementation.

SL: GC is coming in Wasmtime but is not fully there. We have our own implementation that fully integrates with GC. Exception handling is not in wasmtime yet which is fairly annoying as our design is closely related to the exception handling design. We’ll get to that later.

SL: We have had a formal specification for years, unsurprising with Andreas' involvement. It uses the new SpecTec tool that allows us to generate test cases and in the future output to other tools. We also have a fully mechanised soundness proof with WasmCert.

TL: Does that include the switch instruction?

SL: This one doesn’t but it’s easy to add.

TL: Yeah if we typed it correctly.

SL: Maxime who presented yesterday is working on that. It was valuable in identifying little holes in how we wrote the spec down. No showstoppers, just minor things

SL: We have had a design for quite a while. It was inspired by a feature called Effect Handlers. I have been doing research on effect handlers for many years. Assume all abstractions that you might want to implement. We saw value in adapting a high-level feature to a low-level language like WebAssembly. I think it worked out fairly well. The crucial feature that you need to know here is that the fundamental abstraction is asymmetric switching. This has some very important properties. What Francis has been advocating for a while is symmetric switching, where there is no hierarchical relationship between the stacks. There are performance reasons you’d want to do this. The key to unifying the designs was to keep the original design but also support an additional instruction for symmetric switching.

SL: There is not a huge number of extensions required here. At the level of module definitions, we make use of exception tags. In a sense, this is a form of resumable exceptions. That is the mechanism on which you build this stack-switching.

SL: We added a new heap type that represents a chunk of code like a green thread that is based on a function type.

SL: With core instructions, we have a way to create a new continuation, an instruction. We have a way of resuming a continuation. You create one of these suspended continuations, of course you need a way to resume it. You can have a collection of these pairs. This allows you to handle different kinds of control features in different ways, in a modular, composable way.

SL: We run one of these continuations (i.e. a stack), and we suspend it with the suspend instruction and a tag. Depending on the type, we’d ensure the right values are on the stack.

SL: This is the initial design we had for a number of years. Then the unified design adds the switch instruction. You could think of it, in terms of behavior, as if we were doing a suspend followed by a resume. We switch directly to some continuation which must be on the stack. It’s also mediated by some tag. You still have to install a prompt/handler for interaction with other features such as exceptions. Or things like JSPI for interoperating with the host.

SL: There are a couple additional instructions: less fundamental but necessary. If you want to cancel one of these continuations, you resume with an exception, which gives an opportunity to clean up and also deallocates the continuation. You can also partially apply continuations. Not strictly necessary, but significantly improves code. Have a look at the explainer for why these are necessary.

BT: I think it’s more fair to say that it doesn’t cancel a continuation because you can catch the exception and keep running.

SL: You’re right, it doesn’t have to. It throws an exception on the stack. One of the features is that the exceptions just work neatly with it. If you don’t handle that, the exception will propagate up the stack. You have control over how to do this.

CW: The point is it’s your choice.
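As a rough analogy only, the choice discussed in this exchange can be sketched with Python's `generator.throw`. Python generators suspend a single frame and are not the proposal's continuations, so this shows the control-flow shape rather than the proposed instructions. The names `CancelRequest`, `cooperative_task`, and `stubborn_task` are hypothetical.

```python
class CancelRequest(Exception):
    """Hypothetical exception injected into a suspended task."""


def cooperative_task(log):
    # Cleans up and lets the exception propagate: the task ends.
    try:
        while True:
            yield "working"
    finally:
        log.append("cleanup")


def stubborn_task(log):
    # Catches the injected exception and keeps running.
    while True:
        try:
            yield "working"
        except CancelRequest:
            log.append("ignored cancel")
            yield "still running"


log = []

task = cooperative_task(log)
next(task)                        # run until the first suspension point
try:
    task.throw(CancelRequest())   # resume by raising inside the task
    cancelled = False
except CancelRequest:
    cancelled = True              # unhandled, so it propagated to the resumer

other = stubborn_task(log)
next(other)
reply = other.throw(CancelRequest())  # handled inside; the task suspends again
```

In the first case the exception unwinds the suspended computation and reaches the code that resumed it; in the second the computation handles it and stays alive, which is the "it’s your choice" point.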

SL: Right. And you don't have to explicitly re-wire your exceptions, which is what you would have to do in a system with symmetric switching. But that may be fine if this is all going on under the hood with your compiler, but it does make interop harder.

To show some pictures of some examples of features we can implement: assume we have a generator and a consumer with a parent/child relationship. The consumer initially resumes, then the generator suspends, and can be resumed. Very easy to implement.
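The parent/child shape described here can be illustrated, as an analogy under stated assumptions, with a plain Python generator: the consumer resumes the child, and the child suspends back to the consumer with a value. Python generators only suspend a single frame, so this is not the stackful behaviour of the proposal; `countdown` and `consume` are hypothetical names.

```python
def countdown(start):
    # Child computation: each `yield` suspends back to whoever resumed it.
    n = start
    while n > 0:
        yield n
        n -= 1


def consume(gen):
    # Parent computation: each `next` resumes the child until it
    # suspends again or returns.
    received = []
    while True:
        try:
            received.append(next(gen))
        except StopIteration:
            return received


values = consume(countdown(3))
print(values)
```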

FM: I think it is worth pointing out that this allows a kind of generator that’s not allowed in languages like Python and JavaScript: one that just uses normal recursive functions.

SL: It allows you to implement any kind of generator you want. Here’s maybe a canonical example. Well, three key examples: async/await, generators, and green threads. The objection to the original effect-handlers design with asymmetric switching concerns how you implement green threads: a task is scheduled and added to the queue, and the problem is that we have to do two stack switches. I’m not sure if in practice this will be a huge cost.

FM: There is another thing going on with direct switching. It allows tasks to communicate with each other: Task1 may communicate directly with Task2.

SL: That’s what we have to do here, that direct switching allows avoiding talking to the scheduler. If you look at the explainer you’ll see examples of both of these. It may be fine for a producer, you might inline the scheduling code and the task functions that are called, it might be a worthwhile optimization. I would really like some concrete data on this. If you’re implementing async/await, you’re probably I/O bound so the switching overhead might not matter. It would be interesting to see applications where this is more of an issue.
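To make the "two stack switches" cost concrete, here is a minimal sketch of a round-robin scheduler over Python generators that counts control transfers. It models only the asymmetric scheme, where every handoff goes through the scheduler; it is not an implementation of the proposal, and `worker` and `run_round_robin` are hypothetical names.

```python
from collections import deque


def worker(name, steps, trace):
    # A green thread: does one unit of work, then yields to the scheduler.
    for i in range(steps):
        trace.append(f"{name}:{i}")
        yield


def run_round_robin(tasks):
    # Asymmetric scheduling: every handoff between tasks goes through this
    # loop, so it costs two transfers (task -> scheduler, scheduler -> task).
    queue = deque(tasks)
    transfers = 0
    while queue:
        task = queue.popleft()
        transfers += 1          # scheduler -> task
        try:
            next(task)
        except StopIteration:
            transfers += 1      # task returns to the scheduler
            continue
        transfers += 1          # task -> scheduler (suspend)
        queue.append(task)
    return transfers


trace = []
transfers = run_round_robin([worker("a", 2, trace), worker("b", 2, trace)])
print(trace, transfers)
```

Under the assumption that a direct switch replaces each suspend/resume pair with a single transfer, the same schedule would need roughly half as many transfers. Whether that matters in practice is the open question raised in the discussion.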

CW: To emphasize, the direct switch is part of the proposal as you are presenting it.

SL: The nice way this proposal has come together is that the switch comes together nicely with the two. Combines smoothly with other control flow constructs that are vital to Wasm already, such as exceptions and interop.

SL: To summarize, we are at phase 2, we have all these implementations; Wasmtime is pretty mature, it would be great to have more people working with it. The plan was for a phase 3 vote after this PR is upstreamed. In terms of meeting the criteria, we met most of them quite a while ago. For the implementation, the idea is that it should be in a system that people care about and use, and should be sufficient for phase 3.

FM: I think that the actual requirement is that there are tests.

CW: Tests have to run as part of the implementation.

SL: If this is definitely not going to happen, then we don’t want to waste too much effort on it. The most important next step is that we want to be working with producers to experiment with this. Please get in touch with me and Francis on this.

David Thompson (in chat): I'd like to implement this in the hoot wasm interpreter (which is written in scheme).

SL: Implementation-wise, we will need implementations in the browsers. Francis has started work in V8. There’s been talk about the connection between JSPI and core switching. I think the infrastructure in V8 is going to be sufficient for this, so I’d really like to understand the nature of the technical objections that Apple raised.

DD: to clarify our viewpoint, when we say our problem is CSS, we aren’t categorically opposed to the idea, but we have technical concerns that we would like to resolve so that we can get it implemented too.

SL: Very much would like to understand those better. I’m optimistic, just because having worked on various similar implementations, we’ve run into various obstacles and always overcome them. With my researcher hat on, if there are technical problems it is interesting to overcome them.

CW: from the neutral POV: could this be more of a procedural problem, where we have something that is more advanced and we have this happen?

SL: I don’t think so. In terms of the history, core stack-switching came first as we discussed it. JSPI happened as it was seen as a smaller more tractable problem. There is a tie between the two. The assumption is that they both would use the same underlying infrastructure.

FM: I think it might be worth talking a little more about the relationship between the two. One way of thinking about it is that JSPI addresses existing applications. If you want to bring something onto the web, JSPI is likely to be useful. It doesn’t speak to the programming language. Core stack switching is focused on the language you use to build the application, so it’s more future-focused rather than focusing on legacy applications.

Unless you have a legacy Go impl or Erlang impl, chances are you have a need for CSS.

Basically all modern languages have some form of stack switching, even C++. So for the more forward-looking part of industry this will be important.

David Thompson (in chat): question: are stack size limits the topic of another proposal? This proposal is looking good but in order to use it for delimited continuations in Scheme we need a growable stack.

Thomas Lively (in chat): Whether stacks are growable or have a limit would probably be left up to implementations. V8 is working on growable stacks.

David Thompson (in chat): if major browser engines had growable stacks that would be good enough for me!

Ben Titzer (in chat): Stack sizes fall under resource limitations of implementations

David Thompson (in chat): Sounds like we can't make use of this proposal, then.

Benjamin Titzer (in chat): How big of stacks do you envision needing? (just to understand your use case)

David Thompson (in chat): We want to be able to recursively map over a linked list of, say, a million items, as is normal in the Guile VM.

Benjamin Titzer (in chat): Ok, so we are talking megabytes or even tens or hundreds of megabytes.

David Thompson (in chat): I don't know the space requirements without doing some measuring but it's typical to have a much deeper call stack than, say, JS

Benjamin Titzer (in chat): Ok, this is useful information though.

David Thompson (in chat): But yeah let's say tens or hundreds for now. For context we support this today with hoot, but to do so we have our own explicit stack made from tables and such.

Benjamin Titzer (in chat): This makes me think Wizard will need a GIGASTACK mode and have to implement growable stacks.

TL: These (languages with stackless coroutines) have been mentioned a couple of times now. I would expect those languages to not implement core stack switching because they already have the implementations to do stackless switching.

SL: yeah my perspective is that stackful is “right” and everyone should be doing it :) but pragmatically there are good reasons why people are doing stackless. You want to support legacy languages, there are security reasons, you can preprocess your existing program and run it through your existing tools. It’s pretty bad for API consistency, and compiling part of your program through this state machine is difficult and fragile and hard to debug.

TL: For languages that have decided to have stackless coroutines, for whatever reason, implementing those languages as they exist today or with core stack switching, because the language has decided to be stackless, do they have to implement one frame per stack?

SL: We should experiment with that, it’s a really good question. Those are the things I would like people to be looking at now. The limitation now is that Wasmtime doesn’t have GC or exception support. The OCaml folks implemented their effect handlers (which are in the source) on top of Asyncify + JSPI. They would really like to implement on top of this, but they need GC and EH.

TL: That was a great answer. I was talking about stackless coroutines.

SYG: speaking as a JS person (so value-free on stackful vs stackless). But JS has stackless. If you could do this suspend single-frame it does simplify the engine, since currently we need to do the state machine transform. If you have async in the language, then you’d like to have async in your standard library. It would be nice to spec things without the state machine because it’s hard to understand. So even without talking about performance, if you can suspend a single frame, there is a complexity advantage. You can sidestep some of that with things like self hosting but it gives the implementation more flexibility if you can suspend single-frame.

SL: One thing that I tried to follow is the historical discussions in the C++ community. There is one area of confusion that is important to recognize: the design of your surface language can be different from the implementation. The frontend and backend implementations are sort of orthogonal. Some providers can implement it using side stacks or a stackless something else.

CW: A stackless something else?

SL: Yes. The obvious pitfall is that stackless might end up being inefficient, and if producers assume there are side stacks, then they may not get the performance they expect. But it’s not as black-and-white as people sometimes make it.

TL: There’s some conversation on the chat about stack sizes, and implementing growable stacks, DT expects that Hoot will require growable or at least very large stacks to avoid running into problems.

SL: Yeah that’s an absolutely crucial question to address here. Most systems that do this kind of stack switching have some form of growable stacks. One of the most robust in the face of different architectures is based on split stacks. I think that’s what you’re looking at for V8, is that right?

OTOH I think there are other things you can do if you know about the profile of your app, you might not actually need growable stacks, you can use stack pooling etc. but ultimately I think growable stacks are necessary. We don’t yet have growable stacks in wasmtime but are working on it.

CW: As a specific follow-up on that, do you think what you are presenting is part of growable stacks later? You’re saying no more core specifications will be needed to support them?

SL: growable stacks are an implementation technique, transparent to the language. A lot of inspiration for our original design was OCaml which has a really high performance implementation that they build lots of features on, but it's not exposed to the user.

FM: There are several connections between growable stacks and the host environment and also the kind of applications you are anticipating building. The zeroth-order use of growable stacks is that they allow you to have deep recursion. The driver however is supporting a large number of suspended computations. Our goal is to support a million suspended computations. If you do the math, you end up with TBs of stack space. You don’t have the memory on mobile devices.

DT (in chat): Deep recursion is the major motivator for me, though having lots of captured continuations is also nice.

SL: maybe worth pointing out that that’s another implementation technique, by overcommitting memory but that doesn’t work on small devices.

FM: You don’t want to dedicate a TB of address space just for your stack. The idea of growable stacks is that you don’t allocate 1MB for each one. Quite easy to have a million of those floating around. There will be many of those that are also very deep.

BT: One of the scary implementation challenges is that typically the VM implementation doesn’t support suspension, and typically suspending host frames is one of the challenges. Only being allowed to suspend Wasm computations simplifies the design and makes it less scary.

DT (in chat): Suspending only Wasm computation is great, yeah. good point, Ben.

SL: So Ben, I know you also thought about the problem Francis is talking about in terms of being able to compress the suspended computations.

BT: I agree with Francis that it’s likely that most of your millions of suspensions will be very small. Our plan is for stacks to start out medium sized, maybe 32k, they can grow, but if we start running into problems, they can be compressed to save space.

You can run medium sized stacks, and big stacks, and small stacks, and compress them.
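The arithmetic behind the figures quoted above (a million suspended computations, 1MB per stack versus stacks starting at 32k) can be checked with a short Python sketch. It assumes binary units (1 MiB and 32 KiB), which the notes do not specify, and the function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

SUSPENDED = 1_000_000  # target number of suspended computations


def total_stack_bytes(count, bytes_per_stack):
    """Memory needed if every suspended computation owns one stack of this size."""
    return count * bytes_per_stack


fixed = total_stack_bytes(SUSPENDED, 1 * MIB)    # one fixed 1 MiB stack each
medium = total_stack_bytes(SUSPENDED, 32 * KIB)  # stacks that start at 32 KiB

print(f"1 MiB stacks:  {fixed / TIB:.2f} TiB")
print(f"32 KiB stacks: {medium / GIB:.2f} GiB")
```

Under these assumptions the fixed-size case lands at roughly a terabyte, while the 32 KiB starting size is a factor of 32 smaller before any compression.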

CW: Let’s close off this line of conversation.

KM: while we voiced our concern about the stackful nature, I do think that growable stacks make sense. Wasm isn’t intended to be directly consumed by programmers. The ergonomics aren’t as much of an issue compared to e.g. OCaml.

SL: That's a good point, Keith. Are you suggesting that we might expose more of the underlying implementation details here?

KM: I’m just saying that in general with designs, we have a tendency for us as PL people to take that same philosophy toward wasm but maybe we don’t have to, since wasm isn’t a source language. But because we’re sort of middleware, we can receive info from someone who knows more about the system (i.e. language implementer).

SL: It clarifies. I strongly agree. Although when we started on this, we were inspired by features of higher level languages, we were always thinking about it. What is it Andreas always says? “As low-level as possible but no lower”. We have some lower-level primitives that are not right to expose at the Wasm level. At the lower-level, Wasm always has these medium level things, like stacks. When you go lower level than that, we could have a target that has none of these extensions, e.g. no GC. It’s a balancing act.

AR: it’s usually safety that’s the forcing function for making something more high level.

KM: presumably architectural level abstraction.

AR: Yes, that too

SL: One thing to point out, the next stack meeting is a week from Monday, the 24th, would be great to continue the discussion.

CW: Two questions in chat.

Reid Rankin (in chat): Can you assign a continuation to a global?

BT (in chat): Yep.

Reid Rankin (in chat): How would it appear via e.g. WebAssembly.Global?

Benjamin Titzer (in chat): There’s a new set of continuation types, which are refs, and you can have a global (and tables as well) of those types.

DT (in chat): multi-shot continuations are off the table, correct? It is a limitation for Scheme continuations as they can be resumed 0 or more times, but one I can live with.

SL: this comes up often. They are off the table from the POV of this design, have been from the beginning; primarily because they won’t work in wasm right now. There’s no reason the design couldn’t be extended to support them. But we’re envisioning these being implemented as stacks and doing multi-shot would require copying the stacks, which can cause problems.

CW: Some of the big industry implementations have said you will never be able to copy our stacks.

SL: One of the Scheme people, even in that setting, said that even though Scheme supports multi-shot continuations, most don’t use them. You could also use a CPS type of approach for some multi-shot type applications.

DT (in chat): That's correct. I couldn't think of a time where I used multi-shot for a real world program.

RR (in chat): Without stack switching, it's implicit that after a host environment completes a call to an export the (single) stack will be empty. This enables optimizations like wizer which effectively "snapshot" a module at a particular state. With stack switching, I can imagine it being very nice to have some way to similarly "snapshot" continuation state.

Component Model:

Luke Wagner (LW) presenting (slides)

LW: As a high-level summary for the CM proposal. Very condensed problem statement, there is no way to do cross language interop. For the lack of that we have a lot of fragmentation between toolchains inside and outside the browser. This causes a lack of portability.

LW: The component model is in the process of specifying the concept of a component. Just like a container that wraps a core module. It doesn’t change the core Wasm spec. The private linkage of these core modules, so we have a small set of ABI options on how these compose together.

LW: A component turns into a JS module. The way you interact with this component is ESM. Pass a JS string into the component when it has a string argument, and so on. This works today with tooling, JCO transpile.

LW: There is an IDL called WIT supporting automatic bindings. Protobuf and OpenAPI are for networking. WIT is not RPC or networking based. The focus is very much local communication. It’s not CORBA. The WIT syntax translates to component types. It is not just syntactic sugar, or rather it is, but lets you write in something that is a little friendlier than S-expressions in WAT.

LW: There is an intention to be able to implement this one day in browsers. We have a polyfill implementation so we will need to motivate this as a performance optimization.

LW: There is also WASI, a set of proposals. Outside the browser there is no standard way for syscalls to the environment. Again there’s a problem of potential fragmentation.

LW: WASI is in the process of specifying a modular set of interfaces. We’re not saying this is what all hosts must use and support. Instead, host is able to choose which interfaces it exposes. Currently these proposals include CLI, proxies, sockets, etc. Because of this modular approach, we can expect more to emerge over time.

LW: These are not included in the wasm WG charter, they are outside the scope of the computation that wasm is for, but WASI is for IO and interfaces. Based on discussion the consensus seems to be we eventually want to spin out a separate W3C WG with a different scope, APIs that wouldn't necessarily be in browsers.

LW: Just to visualize all of these moving pieces diagrammatically, we have different proposals inside and outside the browser. JS code and Wasm modules interact with each other via this JS API. This JS code can also of course access web APIs via WebIDL. It is the role of this JS code to bridge the gap. If you want to do IO this is what you have to do today. The proposal is to add the CM that runs inside and outside the browser. Components have the same shape as modules with just a richer set of types. Does not change how to instantiate, component can be exposed in the same way. Just a richer set of types.

LW: Just like webIDL has a JS binding, we could add a second binding to CM types and have compatibility predicates between webIDL and WIT types. We could specify how WebIDL and WIT types map, and then eventually bridge directly from core wasm to web APIs without JS in the middle.

LW: We have a polyfill working today that we are able to use to see how we want this to semantically behave. Outside of the browser, with the host code and other Wasm components, just like web APIs are defined with WebIDL we can define our interfaces with WIT.

Lastly these WASI interfaces, even though they are implemented natively outside the browser, they can be polyfilled on the web using ES modules and import maps. There are lots of ways to implement things on the web, used by tools like emscripten, and they can be used in the polyfill.

Jake Follest (in chat): We could also polyfill the apis by plugging them with another component right?

LW: yes, that’s right. You could do an in-memory implementation but it could also call out to other APIs and eventually bottom out in a web API.

KM (in chat): How are new parameters to a Web IDL function handled by the Component Model?

LW: like subtyping?

KM: today web IDL only targets JS where you can have optional new parameters, if not using an options bag, you would have an optional parameter. How does the CM handle that?

LW: That’s a great question. We want to continue to support that. A JS call turns into a Web IDL call, and we can do likewise with the CM. By knowing what the optional value is, I can say semantically what to specify, so it works the same as the Web IDL binding.

KM: There's like an overloading?

LW: More like a type predicate that specifies whether the types are compatible and what happens when you e.g. don’t have enough parameters, default values, etc

CW: but in particular if you want to expose this newly added parameter, you can expose it as an extra type added as a parameter.

RR (in chat): I'm currently working on an importable composable WASI filesystem component in JS. It would be most natural to map WASI streams to/from JS ReadableStream/WritableStream, but those are async-only and quite tricky to do without JSPI.

Jake Follest (in chat): I think there has been some experimental work on generating wit from webidl here: https://github.com/wasi-gfx/webidl2wit/tree/main

AI: about registries and how we’d be distributing components. If you are doing a singular repo e.g. docker has only one repo by default. Same with Maven. How will wasm deal with that?

LW: Here there is no proposal to change how code is loaded in the browser. You load a component in all the same ways you load a module. There is no proposal to add a registry API to the browser. You can point them at URLs that happen to be hosted in registries, but no different than how modules are loaded today.

Outside the browser, in how we build components, there are interesting new developer experiences, but loading in the browser won’t change.

AI: outside the browser is the interesting question. How would developers do dependency management?

LW: There are a lot more options outside the browser. Being standardized within the CNCF is OCI. Smart content hashing de-duping strategy, take a component and explode it into layers, and it is one way we can leverage a lot of existing cloud infrastructure.

RR (in chat): To the repository question, the best tool I've found is https://wa.dev; see for example https://wa.dev/wasi:filesystem.

CW: How many more slides do you have? Let’s let Luke continue to talk.

LW: We’ve made a lot of progress. With Preview 2, aka 0.2.* releases. We moved to a semver style to make incremental releases. Started with CM with value types, resource types (things you only want to pass around by handles), declarative host-agnostic shared linking format. Controlling how much of memory is shared and when. Initial set of WASI proposals defined with WIT, language agnostic. (list)

Implementations in progress from guest producer side, Rust, C/C++, C#, Java, TinyGo, MoonBit, JS, Python.

After that 0.2 release, we took some time to adopt a 2-month train release model. We just released 0.2.4. A lot of the work there was automating the release to set us up for the longer term.

We published these WIT interfaces as binary OCI artifacts in the github registry, which allows lots of automation for releases.

We also control the release of new features that we develop via these tags, feature gates. It allows us to say it’s unstable and then also deprecate.

There is a lot of new work coming in this next year. Preview 3 aka 0.3.* we’re aiming at the end of March. There’s a lot of work remaining, but it’s seeming feasible. We’re adding native async support. The basic idea is to allow any async function that allows control flow back to me, the host. We bump the minor version because we have some new stuff. Native async support: allow any component API to be called asynchronously or synchronously, and independently, implemented either asynchronously or synchronously.

How do these async/sync components compose. A lot of interesting experiments there.

We also add stream and future types, will show an example of that. When reading from a stream, can read from a buffer, then until the read completes, we get this nice virtualizable memory sharing implementation. I have two talks that explain the async proposal and the motivations. Also check out Async.md and CanonicalABI.md where this is spelled out in excruciating detail.

FM: to clarify, if you have a function in a spec, another component can choose to call it sync or asynchronously?

LW: That’s right. You could say that we do not have the function coloring issue. We want to maintain that polyfillability in browsers, we really want JSPI to efficiently suspend stacks when we need to. We asyncify now but it’s slow.

Why do we even want to add this even before an MVP release? Concurrency is needed immediately for any kind of I/O operation. With all of these I/O interfaces doing async in all of these different approaches, this goes poorly without a foundational way to build these together. We saw this with WASI 0.2.

At the ergonomics level you really want automatic bindings generation, you don’t want to make a developer graft their language onto WIT, we allow WIT to do that for them, it’s way easier.

LW: Giving a concrete example in WIT, I can pass a handle to an async function that handles a request. Functions don’t have a color, the async is a hint to the bindings generator so if the language has async, it’s a good place to use it. A language like Go doesn’t care. The hint usually matters for languages that use the stackless coroutine approach.

LW: That’s how handle would look. That request contains a body resource. Now we can say the body consumes a stream of bytes. Infinitely more simplified than what we have in 0.2 right now. The stream can also turn into a ReadableStream in JS with nice BYOB semantics.

LW: After 0.3.0, we will continue on the train model. Incrementally in a backwards compatible way. Allows us to do more things concurrently. Other features that are not already in 0.3.0, includes cancellation.

LW: We want bindings generation to be able to cancel. Add another type constructor to be more succinct for a common stream pattern with a future or error.

Canonical ABI for built-ins for forward splicing, stream data segment, and for the ~~lulz~~ lulls.

Caller-supplied buffers, where we call realloc. We can avoid that with a better design and take advantage of zero-copy in more places.

Lastly, threads. We don’t have threads in the CM, but everyone wants them. The plan is a cooperative phase that we can ship sooner, and then preemptive. We have thread.spawn (ideally would have been an instruction). With preemptive, use shared and depends on shared-everything threads. Otherwise outcalls can be cooperative switch points. Can be polyfilled using JSPI. Good enough for getting code running, even if not significantly faster. In some cases I don’t need the parallelism, but I need the code to run. This will be good enough for many use-cases. (only concurrency, without parallelism) Also lets us front-load a lot of toolchain work, paves the way for eventually shared-everything threads.

One more slide and I’ll pause for questions.

LW: Revisiting our timeline. This is the same one from the last time. What’s cool is that we’re right on schedule here. Adding async additions to 0.3. There are other features we want to add leading up to 1.0rc. We will not have broken anything up to this point in a backwards-compatible way. At the WASI level, we do have versions, but they are in the string names, so we can support versions side by side. Hosts can implement one or the other or both.

LW: Big thing with 0.1.0, final opportunity to cut off the warts. Then would finally have a feature complete proposal. When finally we have that we can advance along the phase process. Once we have that, we can start talking about browser implementations.

CW: we will take 5 minutes for questions

DT (in chat): How well does WASI compose with Wasm GC today? I saw some documentation about borrowing references for a single call between components, but are other things possible? forgive me if I just haven't read the docs enough

RR (in chat): I think they're entirely orthogonal. Resource lifetimes in the component model are sufficiently tightly defined that they could be implemented with GC -style heap types or backed by traditional linear memory equally well.

LW: Yeah, way back, caring about Wasm GC was one of the reasons we wanted to build with WIT. I want to write it once and maybe call it with a ref u8 or an externref or a LM pointer

Definitely a goal that we want before 1.0. Some folks have renewed and increasing excitement now that Wasmtime has WasmGC. We need to define what the CanonicalABI looks like for WasmGC.

Sébastien Doeraene (in chat): What are the current thoughts about a WasmGC ABI? In particular, the interaction between resources and their drop function and a resource becoming garbage on the GC side? Without a WasmGC ABI, how should a GC language deal with calling drops for external resources?

RR (in chat): WASI resources are opaque to the consumer; the component providing the resource (and therefore the drop function) is responsible for handling the appropriate WASM GC stuff or calling its internal free() to reclaim that section of internal linear memory or whatever else needs to be done.

DT (in chat): hmm that seems to defeat the point of using managed memory if you have to manually release resources

Sébastien Doeraene (in chat): Thanks. That seems clear for providing a resource. If you're a WasmGC provider, I can see what to do. As a WasmGC consumer, though, I'd like to represent a foreign resource as a GCed struct. But then I don't have a way to know when I'm supposed to call drop to the provider, do I?

LW: We have built into the CM how it handles drops; it happens explicitly today. In GC, when it becomes unreachable, it calls the drop.

CW: Do you need finalizers in order to polyfill it?

LW: yes, to polyfill. Well no, we can use finalization registry
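The pattern being discussed (call the provider's drop when the consumer-side wrapper becomes unreachable) can be sketched in Python, using `weakref.finalize` as a stand-in for a JS finalization registry. This is an illustration only, not the Component Model API: `provider_drop`, `ForeignResource` and the integer handle are hypothetical names.

```python
import gc
import weakref

dropped = []  # handles the (hypothetical) provider has been asked to drop


def provider_drop(handle):
    """Stand-in for the resource provider's drop function."""
    dropped.append(handle)


class ForeignResource:
    """Consumer-side wrapper around an opaque resource handle."""

    def __init__(self, handle):
        self.handle = handle
        # The callback receives only the handle, never `self`; holding `self`
        # would keep the wrapper alive and the finalizer would never fire.
        self._finalizer = weakref.finalize(self, provider_drop, handle)

    def drop(self):
        """Explicit drop. Calling a finalizer marks it dead, so it runs at most once."""
        self._finalizer()


res = ForeignResource(7)
del res       # the wrapper becomes unreachable
gc.collect()  # CPython typically finalizes at `del`; this is an extra nudge
```

Explicit `drop()` and finalization share one code path here, so a resource that is dropped by hand is not dropped a second time when it is later collected.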

CW: Support for 0.x, all of the 0.x? Like 0.1?

LW: One thing we do is that final dot. Semver implies that things that differ only in the final dot release are compatible. The older dot releases are automatically supplied when you import the newer one (?)
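A minimal Python sketch of the "differ only in the final dot" rule as stated here. It is an illustration of the versioning convention, not of how any host actually resolves WASI imports, and the function names are hypothetical.

```python
def parse(version):
    """Split 'major.minor.patch' into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def same_release_line(a, b):
    """True when two versions differ at most in the final dot."""
    return parse(a)[:2] == parse(b)[:2]


print(same_release_line("0.2.0", "0.2.4"))
print(same_release_line("0.2.4", "0.3.0"))
```

Whether a host offering a newer dot release also satisfies imports of every older one is the part marked with a question mark in the notes, so it is left out of the sketch.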

CW: Is there any, when the rc arrives, we have a great polyfill for running this stuff in the browser which is great. There are a lot of runtimes that won’t have support for release candidates. Is it possible to provide a polyfill for those runtimes?

LW: There are some good new ideas, more work to minimize the amount of effort to be a core engine but optimize the components AOT, to make it easier for core-only runtimes to support the CM.

TL: Why go to phase 2 at the 1.0 instead of right now?

LW: maybe misreading this but I thought phase 2 was you have a whole proposal.

CW: given the number of changes you’re proposing, I’m more comfortable with going to phase 2 after they land compared to now.

TL: Breaking changes during phase 2 are fine.

LW: Until we get to 0.3 here, we don’t even have the full proposal. There are some important things that have to be added like WasmGC.

Shared everything threads:

TL presenting slides

TL: This is a huge proposal, we have shared abstract heap types, shared struct and array types, (today’s unshared structs won’t be shared between threads), shared functions, globals, tables, elements. Also new instructions to atomically access (SeqCst + AcqRel). Throwing everything we want in this proposal. For the non-web use case we need thread lifetime builtins, component model builtins for spawning/joining threads. History behind why we don’t have core instructions.

TL: Going to add managed waiter queues, more efficient alternative to wait/notify instructions that GC toolchains can use - pause instruction will make spin loops more effective. We will have thread bound data, a JS API that lets us wrap the RegEx and import it into Wasm as a shared extern/func ref, which will throw an exception unless it is on the original thread. And thread local data presentation will follow this one. The features will not ship at the same time, for now we’re at Phase 2, and all the features are lumped together for now.
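The thread-bound data idea (a wrapper that can be passed between threads but only unwrapped on the thread that created it) can be sketched in Python. This is a behavioral illustration under that reading of the notes, not the proposed JS API; `ThreadBound` and its `get` method are hypothetical names.

```python
import re
import threading


class ThreadBound:
    """Holds a value that may only be read on the thread that wrapped it."""

    def __init__(self, value):
        self._value = value
        self._owner = threading.get_ident()  # remember the creating thread

    def get(self):
        if threading.get_ident() != self._owner:
            raise RuntimeError("thread-bound data accessed from another thread")
        return self._value


# The wrapper itself can be handed to other threads freely...
bound = ThreadBound(re.compile(r"\d+"))
errors = []


def worker():
    # ...but unwrapping it away from the owning thread fails.
    try:
        bound.get()
    except RuntimeError as exc:
        errors.append(str(exc))


thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

On the owning thread `bound.get()` returns the compiled pattern as usual; the worker's attempt is recorded in `errors`.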

KM: Is there a subtyping relationship between shared and unshared types?

TL: There’s no subtyping relations between shared and unshared.

CW: Because they might have different representations.

MS: What is a shared function?

TL: A shared function would be a function, but Wasm functions aren’t just functions, they’re closures over an instance and a function. To share that closure over an instance between threads, it can only refer to the shared parts of the instance. Can’t refer to an unshared global from a shared function. Because we need shared structs/arrays/etc we also need shared globals/functions.

(Slide: Features (Binaryen))

TL: Binaryen. Implemented shared abstract heap types, structs & arrays. Partial implementation of shared functions, atomic accessors of SeqCst + AcqRel is WIP. Trying to optimize programs that use these constructs. All existing wasm gc optimizations have been updated to account for read-modify-write operations to structs. Working to see if what’s implemented is correct. Subtle! Fun!

(Slide: Features (wasm-tools))

TL: Wasm tools is ahead of the game thanks to AB supporting all the types, new instructions, lifetime builtins, the only thing missing is in the slides.

(Slide: Features (wasmtime))

TL: Engine side of things, wasmtime is a work-in-progress. Working on shared types, structs, array, functions, and lifetime builtins.

(Slide: Features (V8))

TL: In V8, the shared types are WIP, we can allocate shared structs, that's it so far. Sharing more about where we’re going - Shared objects slide details our future roadmap, so no shared functions or lifetime builtins, the prototype will be minimal with an instance per thread model.

TL: Because you don’t have shared functions you still need an instance-per-thread model with linear memory. How do vtables work? Vtable is shared but functions can’t be. Problems to figure out. Hope to get toolchain feedback after these features are implemented.

TL: After this minimal prototype, what we expect to happen is implementing thread bound data as the number 1 ask, to hold regexes or whatever - after that we would do wait queues to implement real efficient locks - this is all speculative and not in the first prototype, goal is to get WasmGC languages compiled to shared GC languages.

SC: Do atomic accessors include wait/wake?

TL: No. Only new wait/notify primitives are from the wait queues. Would call that separate feature from atomic accessors. In this prototype you’ll have inefficient spin locks.

YI: Are some of the wait queues just a convenience compared to wait/notify?

TL: You could use wait/notify in linear memory but would have to manage lifetime for that. Opportunity for engines to provide more performant primitive.

YI: Interested from previous presentation, where this is an optimization to what we already have

TL: Wait/notify today are keyed on memory address of linear memory. Engine can’t store state in that linear memory, so there’s an indirection. Needs to be a key in a hash table. Wait queue is wasm gc thing on the heap and engine can co-locate the control state next to the user-visible world. No hash map lookup.
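The bookkeeping difference described here can be modeled in a few lines of Python. The sketch tracks only where waiter state lives and how it is found; it does not block or wake real threads, and both class names are hypothetical.

```python
class AddressKeyedWaiters:
    """Model of wait/notify on linear memory: the engine cannot store state in
    the memory itself, so waiters live in a side table keyed by address."""

    def __init__(self):
        self.table = {}   # address -> list of waiter ids
        self.lookups = 0  # hash-table lookups performed

    def wait(self, address, waiter):
        self.lookups += 1
        self.table.setdefault(address, []).append(waiter)

    def notify(self, address, count):
        self.lookups += 1
        queue = self.table.pop(address, [])
        woken, rest = queue[:count], queue[count:]
        if rest:
            self.table[address] = rest
        return woken


class WaitQueue:
    """Model of a managed wait queue: the heap object carries its own waiter
    list, so no table lookup is needed to find it."""

    def __init__(self):
        self.waiters = []

    def wait(self, waiter):
        self.waiters.append(waiter)

    def notify(self, count):
        woken, self.waiters = self.waiters[:count], self.waiters[count:]
        return woken
```

In the first model every operation pays for a lookup and the table entry has to be cleaned up when the last waiter leaves; in the second the state is reclaimed along with the queue object.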

KM: Do we think there’s a chance that some of these end up not being particularly important? One example is you could imagine shared function tables come out late enough that runtimes would already use an array of functions and don’t need such a feature. Runtimes may not be willing to wait for browsers to implement tables. Have you heard feedback from toolchains of whether this has happened?

TL: The feedback on the minimal prototype plan is that people are going to want more - we may not ship this as not being fully useful. It is an incremental roll out, may not make sense for all of it to be shippable. The feedback we have from the toolchains is that we have enough features, then we will just ship it at that point.

KM: When you say rollout is that actual deployment to customers? Chrome origin trial?

TL: That's all completely prototyped behind a flag, and maybe an OT, not going to ship till we have real feedback

DG: May have asked this before. Just the atomic accessors that are release/acquire and pause instructions seem independently useful and very little work. Curious if there’s more interest (know it’s just an optimization and interest doesn’t bubble to own feature). How are you thinking about that?

TL: We could prioritize the other features and it's relatively useful, it's really just a matter of prioritization. We should be able to make progress on independent features as needed

KM: Support for splitting out pause/new accessors.

(Slide: Overview of changes)

TL: Basically three non-trivial changes to the overview since last time. Updated rules on tearing to be more consistent. Figured out some problem with polymorphism regarding shared/unshared things in type system. Clarified typing of cmpxchg for optimizers.

(Slide: Consistent tearing)

TL: Tearing rules: for globals/structs/arrays/tables, accesses to references never tear. Would be wildly unsafe if anything here tore. Might require engines to do “something” heavyweight to prevent tearing but it’s a critical safety property. For consistency with existing rules for shared memory i8/i16/i32/struct fields/globals never tear. Based on assumption that fields are aligned. Should be true of all engines today. Everything else can then tear. Accessing an i64 global without an atomic access can tear. In theory pretty much what you’d expect based on linear memory rules. Clarified and spelled out in the overview.
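What "tearing" means for a non-atomic i64 access can be shown with a deterministic Python simulation. It assumes, purely for illustration, that the access is split into two 32-bit halves; `TearableI64` is a hypothetical model, not engine behavior.

```python
MASK32 = 0xFFFF_FFFF


class TearableI64:
    """A 64-bit cell stored as two 32-bit halves."""

    def __init__(self, value=0):
        self.lo = value & MASK32
        self.hi = (value >> 32) & MASK32

    def write_steps(self, value):
        """A non-atomic write, modeled as two separate 32-bit stores."""
        def store_lo():
            self.lo = value & MASK32

        def store_hi():
            self.hi = (value >> 32) & MASK32

        return store_lo, store_hi

    def read(self):
        return (self.hi << 32) | self.lo


cell = TearableI64(0x1111_1111_2222_2222)
store_lo, store_hi = cell.write_steps(0xAAAA_AAAA_BBBB_BBBB)

store_lo()
torn = cell.read()   # another thread reading here sees half old, half new
store_hi()
final = cell.read()  # the write has fully landed
```

Here `torn` is `0x1111_1111_BBBB_BBBB`, a value that was never written by anyone. For an integer that is merely surprising; for a reference it would be a forged pointer, which is why references are never allowed to tear.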

(Slide: Sharedness Polymorphism)

TL: Motivating with an example in the slide. Eqrefs are not shared, sharing doesn’t exist. We want to compare shared eqrefs for equality, we could add a new instruction but we could just reuse the existing one by creating a meta variable for sharedness. (Walking through eqref examples in slide) We can add type variables into typing rules to determine polymorphism

FM: What if one is shared and one isn’t?

TL: The way we would write this is share_1 eq & share_2 eq, with subscripts,

SL: Can you write new functions that are polymorphic?

TL: No. This is already something we do with nullability

CW: Not ref.i31?

TL: That one, because it doesn’t take any input, is unconstrained; you need a new instruction

TL: (any.convert_extern example)

TL: What happens when we have an unreachable? Unreachable goes to the bottom type, what is the output? For (1st example) We want this to be shared for (2) we want it not to be shared

TL: Yes we just need to introduce bottom-shared-ness. Going to get bottom-shared-ness any-out. Bottom shared is a subtype of both shared and unshared. This function body can work for both functions here. Don’t worry totally uninhabitable, no value with bottom-shared-ness. Just here for validation. Now it’s in the overview.
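The role of bottom sharedness can be sketched as a tiny validation model in Python. It follows the description above (bottom is a subtype of both shared and unshared, which are otherwise unrelated, and a conversion instruction passes its input's sharedness through); the constants and function names are hypothetical and this is not the spec's formal rule.

```python
SHARED, UNSHARED, BOTTOM = "shared", "unshared", "bottom"


def sharedness_subtype(sub, sup):
    """bottom <: shared and bottom <: unshared; shared and unshared are unrelated."""
    return sub == sup or sub == BOTTOM


def any_convert_extern(input_sharedness):
    """Model of a sharedness-polymorphic instruction: the output carries the
    same sharedness as the input."""
    return input_sharedness


def body_validates(input_sharedness, declared_result):
    """Does a body ending in the conversion fit a declared result sharedness?"""
    produced = any_convert_extern(input_sharedness)
    return sharedness_subtype(produced, declared_result)


# After `unreachable` the operand's sharedness is bottom, so the same body
# validates against a shared result and against an unshared result.
print(body_validates(BOTTOM, SHARED), body_validates(BOTTOM, UNSHARED))
# A reachable shared operand only fits a shared result.
print(body_validates(SHARED, SHARED), body_validates(SHARED, UNSHARED))
```

Bottom exists only inside the validator in this model, matching the point that it never appears in an annotation or signature.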

CW: It won’t show up in an annotation or a function signature

TL: Not even a binary encoding, cannot put it syntactically in your module.

SD: Would the bottom-shared issue go away with https://github.com/WebAssembly/relaxed-dead-code-validation?

CW: Kind of, but you need a second solution for dead code, so I prefer this solution.

(Slide: Clarified Cmpxchg typing)

TL: Very briefly, last thing is for cmpxchg say I have struct with eqref field and a function that’s just going to do a cmpxchg. Takes reference/expected/replacement. Interesting part here is that the replacement is an i31ref and so want to look at the whole program and want to optimize this struct’s field to i31. Can this optimize? The expected type is eqref. Yes that’s fine you can make this optimization. The expected field in the cmpxchg can be a supertype of the field type now. I was implementing this in binaryen and it was under-specified so now it’s … specified!
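The cmpxchg example can be modeled with a small subtype check in Python. This is a simplified reading of the rule as described in these notes (the expected operand may be typed at a supertype of the field, the replacement must fit the field), over a hypothetical three-type hierarchy; it is not the overview's formal typing rule.

```python
# Hypothetical, minimal heap-type hierarchy: i31 <: eq and struct <: eq.
SUPERTYPE = {"i31": "eq", "struct": "eq", "eq": None}


def is_subtype(sub, sup):
    """Walk up the hierarchy from `sub` looking for `sup`."""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUPERTYPE[sub]
    return False


def check_cmpxchg(field, expected, replacement):
    """expected may be a supertype of the field type; replacement must be
    storable into the field."""
    return is_subtype(field, expected) and is_subtype(replacement, field)


# Before optimization: an eqref field, compared against eqref, replaced by i31ref.
print(check_cmpxchg("eq", "eq", "i31"))
# After refining the field to i31ref, the eqref-typed expected operand still validates.
print(check_cmpxchg("i31", "eq", "i31"))
```

Because the expected operand does not have to be narrowed along with the field, the optimizer can refine the field type without rewriting every comparison site.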

(Slide: Next steps)

TL: We’re trying to get a shared object prototype in the next 6 months, all other work in toolchains to continue, we’re hoping to get experiments and feedback.

( … switching to CW presenting … )

Thread Local Storage

slides: https://docs.google.com/presentation/d/1C4uMfKUg-RSgqiendE3e3VwuHtTGaFFIP7dPzoL3FpU/edit?usp=sharing

CW: Hello! We’ve got pretty far fleshing out different parts of shared-everything-threads and a first-choice candidate of all the features Thomas talked about. Haven’t come up with a story everyone’s happy with about TLS. What is the core language story extension to accommodate TLS?

CW: Recounting through 4-6 different options with advantages & disadvantages. Want to make everyone aware of possibilities, and it’ll become clearer with prototyping and experimentation in the future.

CW: This is the shared-everything-threads motivation. Go through quickly as Thomas already talked about it. Can’t create threads, we’ve got linear memory

( … CW presenting slides, he talks fast and has good slides so note-takers gave up … )

SL: At that point you’ve added closures to the language?

CW: If only, that would be much more efficient, like user space closures

( … more presenting … )

SL: Passing a struct full of funcrefs around is like implementing closures.

CW: Yes, in user space. It would be more efficient to have closures directly in the language.

( … more presenting … )

CW: My biased opinion is that thread ID + thread-bound data/functions might make the most sense, since thread-bound data is going to happen anyway.

AR: Want to point out that context locals is the least compelling. The criticism that the minimal approach is invasive for the ABI is true for context locals as well, at the same level of ABI intrusion. There’s also a lot of parameter passing; it’s just that the engine would have to do it instead of user space. Skeptical that it buys you anything in terms of overhead.

CW: I would expect the engine would be more efficient, i.e. a custom calling convention where you have it pinned or something.

AR: More clever thing it could do is to store it on the stack. Not per function but per-stack.

CW: That’s kind of the solution i was assuming

AR: Then it becomes more like a dynamic scoping thing. Will let others comment first.

YI: I’m wondering how does shared stuff compose with the current way of instance per worker? Could I have instance per worker and still share stuff?

CW: If you have shared globals and tables, this is another place you can lay out structures, such as for Java. If you want an instance-per-thread model, you’d have a shared table where you’d lay out all the structures. Still have lots from other shared things. Just need shared functions for one-instance-per-everything. (note taker lost some exact words here)

YI: Since all of those options were bad, even if we implement one of those, and take a disadvantage, wouldn’t people go back to instance per worker anyway?

CW: Potentially true. There are some use cases for which instance-per-worker works badly, e.g. dynamic code loading. If at runtime you’re bringing in a bunch of new function pointers, you need to stop the world and make sure it’s a consistent view for the whole world. With the new model you only update one location. Some use cases really care about dynamic code loading and want better support. They would be willing to put up with moderate pain in other areas just to get this. Don’t think all of these are that bad; we’re not so far away from some reasonable answers.

YI: Doesn't look too different from instance per worker, where the function seems like a mini instance.

CW: It’s a very mini-instance. Everything else can still be shared. All core wasm stuff still gets to be shared.

YI: that’s the good parts of instance per worker

CW: Glad to have converted you!

BT: Dynamic version of context locals seems like dynamic scoped variables to deal with nesting. Scares me. Don’t think there’s a great solution. Tradeoff space where some modes would benefit more than others. Personally from closed-world perspective thread-locals look great. Perf gap with native has a lot to do with shadow-stack pointer and want to register-allocate the pointer. Unsure how to do that with other schemes.

CW: Stopping here in the interest of time

AR (in chat): Conrad, one additional consideration you hit on briefly is which of these schemes are sufficiently compatible with stack switching. Do you have more recent thoughts on that?

CW (in chat): They all seem compatible, even with shared continuations, with the exception of the proposal where JS interop is accomplished with a parameter-passed bag of nonshared functions. You do need some form of "shared suspend barrier", either at the block level or the function level (personally I prefer function level)

AR (in chat): Well, I suppose for context locals that depends a lot on the specific semantics, which in turn affects efficient implementation.

CW (in chat): Yes, the expected semantics for context locals is that if you're resumed in a new call stack, you get the context locals that were set up for that call stack.

AR (in chat): Right, I remember we discussed this in the past. I think that amounts to dynamic scoping and is the only sane semantics. But it wouldn’t be cheap, since you generally have to search the stack.

CW (in chat): Yes, or eagerly pin a register that points back to the area where the context locals are allocated.

AR (in chat): …and in principle, you can already implement that with stack switching itself.

BT (in chat): It seems like statically-typed context locals is moving a module-level parameterization mechanism to be a function-level parameterization mechanism, so it leaks into too many places. Thread-local globals seems to me to keep scoping at the right granularity. Ultimately, for use cases where the modules can be statically linked, ideally an AOT compiler would regalloc a small number of important thread-local globals. So it seems the best from a performance perspective, at least in the case of static linking.

AR (in chat): Pinning won’t avoid the search, though, for the same reason you can’t make stack switching constant time. (The search would happen at switch time.)

CW (in chat): Yes, but I think the work could be done as part of the existing linear handler search.

BT (in chat): that would imply they are tied to a stack, not to a thread. If we have shared stacks then I would expect transferring a stack to another thread would get the target thread’s TLS

CW (in chat): Yes, that’s the intended semantics - you can use this to implement thread local storage though (with the right conventions).

AR (in chat): it would, because it would be stored on the parent thread stack.

BT (in chat): I have to admit I am not sure what users of shared stacks would expect, but one use of TLS is to hang data structures that don’t have synchronization. Having dynamic scoping with shared stacks might accidentally make that too easy. Then again, our type system might catch that. Then again, the type system might be too hard and everything gets marked shared anyway. (meaning accidentally leaking thread-local data structures due to an unfortunate interaction between shared stacks)

Half-Precision

IR presenting slides

IR: Hello! Last time I talked was more than half a year ago! Two groups of instructions in this proposal. These two groups of simd instructions - one is conversions to/from integers and higher/lower precision. Also numeric instructions. Lots of useful operations. Pretty standard stuff.

(Slide: Current status)

IR: Current status, proposal in phase 1, proposals implemented in V8 according to details in slide.

(Slide: Emulation: What does it mean?)

IR: Two modes of emulation: One is no hardware support for anything. Need to convert f16 to f32, do arithmetic, then convert it back. More interesting is F16C emulation for x64. Pretty old extension with AVX/AVX2. Around 2009. Should be pretty common and converts f16x8 to f32x8 and f32x8 to f16x8 and do either 4 or 8 values at the same time. Can load f16x8 values, convert to f32x8, and then do simd op and convert back.

(Slide: Conversions)

IR: Conversions on their own are pretty convenient and widely supported. On Arm64 they are supported well across all desktop platforms, and most mid/high end devices support all f16 ops right off the bat. Conversions are useful on their own, e.g. as a data format with the Float16Array in JS, so pretty convenient.

IR: On arm64, instructions are a 1:1 mapping for conversions. Same for AVX10/512: fully supported 1:1. With AVX/F16C, f16 to and from f32 have one instruction each, and for integer to f16 conversions there are conversions to f32 and then to f16, so not bad. More hassle with conversion from f16 to i16 with F16C. Needs more instructions to handle the difference in x64 semantics. Think it’s pretty acceptable.

IR: Add in numerics and you can do scenarios such as running small models or graphics applications. Can work with WebGPU too. Do some work on the CPU in that case.

(Slide: Conversions + Numerics)

IR: When you don’t actually care about precision or dynamic range, fp16 could just be faster out of the box. If you’re interested in the full lowering, there’s a document with all the lowerings in the proposal repository.

IR: One instruction for unary/binary/fma for FEAT_FP16 and AVX10/512-FP16. Many 1:1 translations. For emulation with F16C need to convert to F32 first, do the operation, then convert it back. Same for binary operations.

RH: If you’re taking F16 values, and expanding to f32 and convert back to f16, does it give different values?

IR: No. Suggested in discussion thread but actually there is no double rounding error. Deterministic results.

CF: To verify, if you have conversions back to back, can you elide the conversions? Or can we detect some cases where that is possible and doable?

IR: Good question! In theory, in the common case, no, because double-conversion error could happen. We can consider a “full speed mode” for when you don’t care about small rounding errors. Or we can detect some cases where it’s not an issue and do optimizations. We haven’t done it. Answer is it’s not straightforward but could be done for some cases.
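The two answers above (a single emulated operation gives the same result as a direct f16 operation, but eliding conversions between back-to-back operations may not) can be illustrated with a small Python sketch. This is only a model, not V8's lowering: the helper names are made up for illustration, Python's `struct` formats `e` and `f` stand in for the f16/f32 conversions, and `struct` raises `OverflowError` rather than producing infinity, so only in-range finite values are covered.

```python
import struct


def round_to_f16(x: float) -> float:
    """Round to the nearest binary16 value (ties to even), as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_f32(x: float) -> float:
    """Round to the nearest binary32 value, as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f16_add_direct(a: float, b: float) -> float:
    # For f16 inputs the f64 sum is exact, so this is a single rounding to f16.
    return round_to_f16(a + b)


def f16_add_emulated(a: float, b: float) -> float:
    # Widen to f32 (exact), add in f32, then narrow back to f16.
    return round_to_f16(round_to_f32(a + b))


def sum3_round_each_step(a: float, b: float, c: float) -> float:
    # Narrow to f16 after every operation, as the emulation does.
    return f16_add_emulated(f16_add_emulated(a, b), c)


def sum3_conversions_elided(a: float, b: float, c: float) -> float:
    # Keep the intermediate in f32 and narrow only once at the end.
    return round_to_f16(round_to_f32(round_to_f32(a + b) + c))


print(f16_add_emulated(2048.0, 1.0) == f16_add_direct(2048.0, 1.0))  # True
print(sum3_round_each_step(2048.0, 1.0, 1.0))     # 2048.0
print(sum3_conversions_elided(2048.0, 1.0, 1.0))  # 2050.0
```

In f16 the spacing between neighbouring values at 2048 is 2, so 2049 is a tie that rounds to 2048 at each step, whereas the f32 intermediate holds 2049 exactly and the final sum 2050 is representable. That difference is the kind of rounding change IR refers to when conversions are elided.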

NS: Is there a way you can detect that fp16 is going to be emulated and fall back to f32?

IR: Will answer that question later. I have presented benchmarks before and they were a bit weird because I made some mistakes. The biggest skewing factor was a bug in V8 that meant some instructions were always software emulated. Also, only the non-optimizing Liftoff tier was used, so optimizations from the optimizing tiers were missing. Picked “same” benchmarks from XNNPACK’s benchmark set.

(Slide: Benchmarking challenges)

IR: Benchmarking is hard! Lowering is implemented through LLVM: you lower F16 operations to vector IR and expect it to be lowered back to Wasm f16. With optimization flags, LLVM applies optimizations that work well for native code but don’t hold for Wasm. In the initial run the FP16 operations were all optimized away, and the non-optimized version was slower than the optimized version. We should be mindful of what we benchmark and what we do with the result of computations.

IR: For the gemm benchmark the outputs are gigaflops, higher is better. Compiled to native and wasm. For native, the F16 version runs almost twice as fast, which is expected. Same result for wasm on arm64. Not full 2x but performed f32 native version using f16 in wasm. More interesting is x64. My i7 was older than my mac. The f16 version was slightly slower than f32. The emulated version was almost the same performance as f32. How is it possible? Details are in Benchmarking takeaways for x64 performance

(Slide: Benchmarks: Ray Tracer)

IR: Shows similar numbers to previous gemm benchmark. Here it’s ms/frame so lower is better. F16 was only slightly more performant. Same for wasm where arm64 f16 was slightly faster, but for x64 it was almost same performance using f32. F16 on x64 was slightly slower.

(Slide: Benchmarking takeaways)

IR: When we swap f32 computations with f16 computations, we see consistent speedups. Already talked through why F32 is about the same performance. F32 is the baseline performance implementation, with more scope for optimizations.

IR: What options do we have. Based on previous discussion we can have a pure instruction set. Still think we need conversions + numerics. At some point going to run faster for free without emulation. One option could be just conversions and FMA instructions. Most desirable part for ML/AI applications. Final option is only having conversions. Maybe not ready for difference in performance of other operations.

IR: Want to know opinions about what we have right now, what do we need to add/remove to the proposal

Johnie (in chat): Sorry if I missed this, did you have any native x64 data?

IR: No. Haven’t implemented AVX512 for V8 so can’t try on V8. Also means I haven’t done native x64. Can be done but not yet.

CW: Do you have x64 numbers for F16C? They were ok?

IR: mmm yes.

TL: Have native arm, wasm arm, and x64 wasm, but no native x64.

DG: Also the specific device and getting perf numbers. This is the devices we had available.

IR: Would be surprised if native f16 were slower than f32.

AB: Device was for wasm x64. Why not native x64 experiments?

CW: Only supports conversions?

IR: In theory possible to do native. Didn’t spend time to write the code. Didn’t have goal to benchmark native arm vs native x64.

AB: Reiterate offer that I’d love to help do an analysis. Spent time trying to compile and couldn’t get to the point of making this comparison. I’d love to make the comparison though.

IR: Need machine I can ssh to and develop AVX512 version for v8. Most interesting part here is if we implement lowering to AVX512 what is the performance. I would like to have access to machine like that to implement.

AB: That affects all current users with x64 machines

BT: You mentioned this is a baseline implementation? Is this in the baseline compiler?

IR: It’s what we got right now, without investing too much in additional optimizations.

BT: What do you think the gap between wasm and native, regardless of f16 and f32, where is that coming from? For example wasm arm64 and native arm64 have ~30% difference.

IR: I think here is probably work more with memory, it’s less co

DG: I think that’s a pretty standard overhead. Even in terms of current existing benchmarks. There’s a ~20% overhead of running in wasm.

BT: I want to know what that is.

DG: Several different things. Don’t have a good profile of this right away, can get that to you. In native you can directly execute but in wasm we have boilerplate around functions and what we’re trying to run. Not always 1:1 and while you can use intrinsics you can’t always use all of the instructions that benchmarks assume that you have. Native benchmarks can generate code more specifically for the exact benchmark where you can’t do that in wasm. Seen a very standard portability overhead of at least 20%.

BT: Not questioning data, just wondering what is in the “super hot loop” that’s in wasm but not in native. Overhead in/out of wasm but comes down to best code for the super hot loop.

BD: for arm64 how many 16-bit floats. 8? Yes.

IR: Only 8.

(Slide: Detecting support for FP16)

IR: Would be possible to detect f16 support in wasm. For most applications if you don’t care about precision you can use f16. If only for efficiency can maybe not use f16. Would be great to have detection capabilities. Personally I’d like to have support for this in wasm. I know it’s controversial opinion. Also to be done in host environment as feature detection capability figure out what’s supported to what degree.

RH: Do you have numbers on full software emulation?

IR: It’s about 60x slower. Really slow.

RH: In firefox, we just disable it if we don’t have required baseline support

IR: If we go this way we don’t need to support the feature. Nice to have but not necessary in that case.

RH: Feature detection is hard, we’re multiplying the number of builds based on the features

CW: If you had semantics where the module was rejected at validation time wouldn’t you have the same number of builds in total?

RH: I’m imagining there would be 3 levels of validation, with different proposals

CW: Is that not the same amount you get with this feature detection?

RH: If we have a wasm instruction then we would have less validation. Would have two: MVP and SIMD + f16 where f16 might not be available.

TL: You could put the fallback path in the same module

CW: You’d still have as many different build modes? For f16 is the support level 0 to 2? My understanding is there are three levels of f16 support and you’d have 3 builds in the same module?

DG: Depends what we want to standardize.

SG: It’s already a subsumed cost for conversion ops because we already have to implement it fp16

RH: For conversion operators for sure.

IR: Optional slide! Technically have all conditions met for phase 2. How do we feel about that? Need more time? If need more time then for what?

CW: Reason we delayed is Intel/Apple spoke up and asked for more scrutiny on x86. Andrew sounds like you still have numbers you want to look at?

AB: That’d be my 2 objections: (1) I haven’t run benchmarks myself and analyzed the data. We’ve tried to work together but wasn’t ready. That’d be against advancing for that reason. Other reason (2) is that if data is what you show is that this is a clear benefit for one architecture and not another. I do represent the architecture that is not benefitting and that doesn’t seem quite fair.

IR: I see that architecture closing that gap. We are building for the future. Not going to be the same way for now and into the future

CW: With AVX10 on, all of these operations are natively supported, we expect the performance to only get better

IR: Yes. FP16 is part of AVX10.

AB: I’m not against the philosophy. Agree having these operations makes sense. What I do need to do is represent not just future x64 users with AVX10 but also current ones who don’t. Without having the data and since there are no AVX10 users currently I have to default to representing users who won’t see any benefit.

CW: If you see that existing users are at least not penalized, would that reduce the strength of your objection?

AB: Hoping for that yeah. Ilya mentioned that on one slide there’s some things we can do to reduce the emulation cost. Was hoping to be able to look at that myself to see if there’s anything we can do. If there is something then it would reduce my opposition

DG: Responding - we just ran out of time on this set of numbers. Needed to experiment in v8. Eliding conversions when you don’t care about precisions should be straightforward. Also VNNI and other instructions to try out. Haven’t tried all possibilities and patches are welcome to make that faster too.

AB: Y’all probably focused on other proposals too and if you can allocate some time I’d love to dig in. (notes paraphrased this sentence heavily)

AR: My question is similar to that. I would like to get a feel for what the market share of these machines and devices is.

IR: If you’re talking about desktop, yes it’s only arm64 desktops and I don’t know the exact number. Should be bigger with arm64 for Windows perhaps. For mobile devices I tried to estimate. It’s hard because it’s hard to find the market share of CPUs. If you look at RAM size you can categorize as “low end” and not. Probably pretty safe that anything above 5G of memory most likely has F16 support. For the mobile market there is a bigger chance you have F16 natively than not. For desktop probably the other way.

AR: What’s the primary use case for that? AI? What kind of AI and where do you expect it to run? What’s driving this?

IR: From Meet doing some preparation of data before sending it to a GPU. Making GPU computation faster.

DS: Small ML models may run on the CPU in addition to (or instead of) the GPU

Sean Isom(SI): also edge providers running small models. General parallel accelerators.

AR: Did you say ad providers?

SI: Edge providers!

DD: I wanted to add something small, even if its usually that raw compute is about the same, you still have memory size savings which are beneficial for users.

SG: For Andrew - on the web where you have both JS and wasm Float16Array has already shipped. Feature already demonstrates the performance chasm between native-and-not. The Float16Array feature has no arithmetic or numeric support. You can already see the difference for conversions. Sounds like for your concern we’re already in that world for users with JS exposed.

DG: To add to Andreas’s question. With wasm as the only compute unit that doesn’t support f16s (? notes lost here). With runtimes with CPU/GPU/xPU runtimes try to deploy to least common denominator which is f32. We (wasm) are holding that least common denominator back right now. Checking compatible CPU/GPU/binary you have to do that dynamically. Would prefer to not do that dynamically and have baseline support for f16. JS for example assumed this is baseline support.

TL: Should we do an unofficial poll to see what people are feeling about this? Would be interested in knowing how people feel about the conversions-only version.

RH: About the poll, I’m curious if someone has opposition to the proposal - like when will we have broader support for it - if you’re holding back the proposal because of how far out the support is, the shape of the proposal isn’t expected to change

JB (in chat): The data needs more scrutiny.

IR: Mostly talking about performance.

AR: This is adding significant complexity to the language, this is just a question of timing - there’s also a thing that there’s a momentum to moving proposals forward. When more and more time is invested, it’s harder to move back

TL: Does this really add significant complexity to the language? In the grand scheme of things?

AR: Yet another entry in the complex matrix of SIMD operations.

DG: We just talked about thread local storage …

CW: Talked about how it would be nice to complete the simd matrix with holes in it!

AR: This adds another row with new holes - it adds a new type, and how it got complicated with F16? Has to balance it with benefits

DS: Can’t simultaneously make an argument against leaving holes while also arguing against things in the spec that won’t help anybody. We can fill all the holes with slow instructions or we can only do the fast ones with CPU support.

AR: My argument last time, we don’t want to be adding new rows to the matrix with holes that we have no intention of filling.

DS: The least-common-denominator of CPUs (that you’ve been advocating that we limit our support to) will never fill those holes.

DeeptiG: For f16 we’re not looking for conversions/numerics across the board. Not going to fill the other 32x4 holes. This is meant to be a cohesive set and not stick to requirements we had no control over that introduced the holes.

AR: I realize there’s history here and don’t want to get into that.

DeeptiG: Pointing out that it’s slightly different in the composition of the holes.

RR (in chat): Hey, I'm coming from an embedded/TEE perspective, and we don't even have f32/f64. (or i64, really.)

BT: What is the guiding principle going forward?

DeeptiG: Guiding principle on our end is user demand. If users have applications where instructions can be faster we test them out. ARM64 has full support and x64 next year. Unlocking performance. (notes lost the rest of the comment)

BT: there will be 50 years of usage of wasm and who do we say no to?

DG: Personally if it’s useful we add it. They’re just CPU instructions but know there’s different opinions. Considering them on case-by-case as we get them. Don’t think we’ve made particularly bad decisions. Not advocating adding everything, just a small set that’s highly performant for a small set of applications. Big interest in wasm in usage in these applications.

CW: I don’t see FP16 as a particularly bad idea, there’s going to be a clear and enduring interest, I don’t see the user demand for this dropping off going forward, my main concern is that Andrew isn’t more enthusiastic

DG: Want to add that Andrew has a different opinion. We have worked with other Intel folks. Not in this time zone.

AB: Don’t believe there’s anyone else benchmarking what Ilya’s benchmarking from Intel. Don’t want to misrepresent other teams.

DG: We had a meeting this morning asking “when support avx10 in v8” and we’ve had a consistent meeting to ask for avx10 in v8. Why would we add support if there’s not a proposal using it?

AB: Talking here about benchmark result for f16 emulation

DG: Want to present whole picture.

CW: Can we get Andrew to run benchmarks?

AB: We’ve talked! Ilya busy with other stuff. No big deal.

IR: When we had that conversation benchmarks were very work-in-progress. My mistake for not sending updates. Just finished last week.

AB: Not worried, we can work it out. If there’s time would love to help look at this.

ChrisW: Back to Ben’s question and Andreas about complexity. Think it’s a separate conversation. There’s a bunch of hardware acceleration for embedded products. Adding all could be a massive ball of complexity in the 50 year time span Ben talks about. Probably a meta-discussion about how to move this forward as a platform thinking about that half-century time-frame. When to say no and why to say no. Would like to get lots in when talking to Andrew/Ben/Arm. Extra instructions wasm doesn’t have access to. Definitely a conversation worthwhile having but probably out of scope for f16.

IR: We probably shouldn’t poll right now. More-or-less on the same page. Going to wait for benchmarks and response. Don’t think it’s actionable on my side any more. Let’s come back to this later.

Wide Arithmetic (Alex Crichton)

Slides

AC: motivation: wider than 64-bit math - PRNGs hash tables etc. We are 2-7x slower than native.

small proposal - 4 instructions (add, sub, mul_u, mul_s)- no explicit 128 bit type - represent as pair of i64
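As a rough model of what "no explicit 128 bit type, represent as pair of i64" means, the following Python sketch gives reference semantics for the four operations. The function names follow the notes' shorthand, and the operand and result order (low half first, then high half) is an assumption for illustration rather than something stated in the notes.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def _split(value: int) -> tuple[int, int]:
    """Split a 128-bit value into (low i64, high i64) bit patterns."""
    value &= MASK128
    return value & MASK64, value >> 64


def _signed64(x: int) -> int:
    """Reinterpret a 64-bit pattern as a signed integer."""
    return x - (1 << 64) if x >> 63 else x


def add128(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Wrapping 128-bit addition on two (lo, hi) pairs."""
    return _split(((a_hi << 64) | a_lo) + ((b_hi << 64) | b_lo))


def sub128(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Wrapping 128-bit subtraction on two (lo, hi) pairs."""
    return _split(((a_hi << 64) | a_lo) - ((b_hi << 64) | b_lo))


def mul_wide_u(a: int, b: int) -> tuple[int, int]:
    """Unsigned 64x64 -> 128-bit multiplication."""
    return _split(a * b)


def mul_wide_s(a: int, b: int) -> tuple[int, int]:
    """Signed 64x64 -> 128-bit multiplication (two's complement result)."""
    return _split(_signed64(a) * _signed64(b))


# The carry out of the low half lands in the high half.
print(add128(MASK64, 0, 1, 0))  # (0, 1)
```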

benchmarks from recompiling with wide arithmetic support - all show improvement, blind-sig (crypto) especially

RR (in chat): Is blind-sig an ECC (elliptic curve) workload?

AC: No (maybe?). These benchmarks are in sightglass.

RR (in chat): Looks like it's this: https://crates.io/crates/blind-rsa-signatures

RR (in chat): (For reference, much more practical public-key crypto is ECC-based than RSA-based these days because practically strong keys will fit in 256-bit ints. That would need 256-bit mult; 128 is better than nothing for sure but 256-bit would be lovely.)

implementation status - Rust/C/C++ with LLVM 20

wasm-tools and wasmtime support

spec interpreter

basic tests

questions?

MS: are these Wasm instructions already compiled to some native instrs?

AC: Yes, almost every platform has these instructions for multiplication. For add and sub, it’s through the overflow flag. Almost all architectures have instructions to add two values with a carry flag. Wide mul is real, with potential optimisations if only the high or low part of the result is used.

FM: did you not suggest division because you didn’t see a benefit from it?

AC: Yes. One, in benchmarking this, I didn’t see any massive wins. There is a 128-bit division (128 divided by 64), which is specific to x86. So given that we already get a speedup on bignum div without it, it doesn’t seem motivated.

alternatives: why not mul128 - more work to optimise

New i128 type: A lot of churn for not a lot of benefit. Native CPUs don’t have 128-bit integer registers, and almost all use these 64-bit pairs.

why not use v128? - vector registers work differently - hard to optimise

div128 - not enough hardware support

comparison instructions - equivalent peephole optimisations seem easy, but also didn’t even benchmark well

Shift/rotate instructions - Major architectures don’t seem to have instructions for these.

overflow flags… next slide after questions

TL: Did the code go through wasm-opt?

AC: No, these benchmark results are just straight out of LLVM.

TL: I asked because wasm-opt’s handling of multivalue is suboptimal and kind of bloats the code.

AC: I’m proposing phase 3 here. One of the most challenging parts of this proposal for implementations is that the instructions produce multiple results. I do suspect that to be a significant hurdle for implementing this. It’ll vary a lot between compilers, but it is likely to be a sticking point.

TL: probably there’s a very small amount of code that could use these instructions, so if it gets some bloat it’s probably not a big deal and it would have the same performance. But it would be good to double-check that. Don’t want to block from phase 2 though.

AC: I would love to help with this kind of stuff, but broader refactorings are kind of out of my scope.

AC: Overflow flags. Another obvious alternative is to have add_overflow/add_with_carry kinds of things, because that’s what most CPUs actually have. However, this adds a new problem to compilers, where we have values that are not actually values in the CPU. The conditions have to live in the flags register. If the value is live over something that clobbers flags then you have to save/restore it with special instructions, which can be slow. So effectively, this approach introduces new optimization problems, and my experiments with just implementing this turned out to be slower than what we had today.

Even introducing i1 doesn’t fix it, because it’s still not a normal value in a register. So it doesn’t solve the problem.

We could also add a CPU flag “register” to Wasm, but that’s awkward because lots of instructions will implicitly clobber this register. And some CPUs don’t have a flags register at all. So it doesn’t simplify things overall.

So in conclusion, I tried to make it work, but ended up concluding that the i64 pairs are the practical path forward.

CW: could adding the add128 incentivize native architectures to add instructions that map well to it?

AC: I suspect not, they already have this functionality, I suspect they wouldn’t want to add it. But I don’t really know.

AB (in chat): All I can say is that we can't say anything about future ISA changes.

AR: Did the benchmarks depend more on the new addition or new multiplication instructions?

AC: Blind-sig was a mixture of arithmetic operators. Fib is all addition. All these benchmarks are microbenchmarks, so I’m not saying this is a representative workload.

RR (in chat): This will get rid of calls to `__multi3`??

AC: Yes, this has been a long-time issue in Wasmtime and this proposal fixes it.

RH: For the add-with-carry approach, what more would people do with it?

AC: it’s awkward to express overflowing addition in terms of 128 addition. To do add 256 it would take several add128s that would be shuffled around. Comparing wasm in wasmtime vs x86, the wasm was a little less optimal.

From a pure optimization getting-to-perfect-code if you had perfect compilers that could do everything. The challenge in Wasmtime was in fusing the add_overflow with the add_with_carry reliably, and if we don’t fuse it, it generates a bunch of extra code.
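To make "several add128s that would be shuffled around" concrete, here is one possible way to build a 256-bit addition from 128-bit additions, sketched in Python. The limb layout (four u64 limbs, least significant first), the `add128` operand order, and the `add256` helper are assumptions for illustration; this is not necessarily the code shape LLVM or Wasmtime produces.

```python
MASK64 = (1 << 64) - 1


def add128(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Wrapping 128-bit addition on (lo, hi) pairs of 64-bit values."""
    total = (((a_hi << 64) | a_lo) + ((b_hi << 64) | b_lo)) & ((1 << 128) - 1)
    return total & MASK64, total >> 64


def add256(a: list[int], b: list[int]) -> tuple[list[int], int]:
    """Add two 256-bit numbers given as four u64 limbs, least significant first.

    Each limb takes two add128 calls: one to add the limbs (zero-extended),
    and one to fold in the carry from the previous limb. The high half of
    the second result is the carry into the next limb (always 0 or 1).
    """
    limbs, carry = [], 0
    for x, y in zip(a, b):
        lo, hi = add128(x, 0, y, 0)
        lo, hi = add128(lo, hi, carry, 0)
        limbs.append(lo)
        carry = hi
    return limbs, carry


# (2**256 - 1) + 1 wraps to zero with a carry out.
print(add256([MASK64] * 4, [1, 0, 0, 0]))  # ([0, 0, 0, 0], 1)
```

With an add-with-carry instruction the carry would stay in a flag between limbs; here it travels as an ordinary i64 value, which is the trade-off AC and RH discuss.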

RH: so 128 arithmetic works great with add128, but 256 would be slightly less efficient compared to add with carry?

AC: Not really sure but maybe a few percent slower

MS: In some high-level languages, you would want to have a panic or trap on overflow?

AC: Yes, exactly.

MS: That seems like a nice use case for this approach. If you are implementing overflow checks in normal Wasm, is there a way for compilers to compile this down to a flag check on the CPU?

AC: Good question. That’s one of the microbenchmarks I was trying to look at. One idea is that you could do an add128 and then just check the upper return value to see if it overflowed.

I couldn't get these things to fuse so that you could get the single-add followed by ‘jo’ to jump. You’d get lots of cases where you reified the value.
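The idea AC mentions (do an add128 and inspect the upper result) can be sketched in Python as follows. The `checked_add_u64` helper and the `add128` operand order are hypothetical, and the sketch only covers unsigned overflow; it says nothing about whether an engine can fuse this into an add followed by a flag check.

```python
MASK64 = (1 << 64) - 1


def add128(a_lo: int, a_hi: int, b_lo: int, b_hi: int) -> tuple[int, int]:
    """Wrapping 128-bit addition on (lo, hi) pairs of 64-bit values."""
    total = (((a_hi << 64) | a_lo) + ((b_hi << 64) | b_lo)) & ((1 << 128) - 1)
    return total & MASK64, total >> 64


def checked_add_u64(a: int, b: int) -> int:
    """Unsigned 64-bit addition that raises on overflow.

    Both operands are zero-extended to 128 bits, so the high half of the
    result is nonzero exactly when the 64-bit addition overflowed.
    """
    lo, hi = add128(a, 0, b, 0)
    if hi != 0:
        raise OverflowError("u64 addition overflowed")
    return lo


print(checked_add_u64(1, 2))  # 3
```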

DD: We are interested in some of the potential there. There are compilers that track flags and manage to optimize flags. It would be an interesting area to explore in the future.

AC: totally agree. If all the web engines just said “this is easy” we’d probably just try to figure it out too. My intuition is that it would still be nontrivial. But I would love to hear feedback about that as people try to implement. I wouldn’t even mind changing this proposal if we thought that was best.

RH: Is add128 a straightforward macro over the instructions you’re proposing here? And we could potentially retcon it in terms of this proposal?

AC: Yes.

RR (in chat): I suspect using these new 128-bit instructions to perform 256-bit (or even 512-bit, though more rarely) operations will be a very common cryptographic use case.

Michael Ficarra (in chat): if that's the case, this proposal paves the path for even wider arithmetic that takes even more i64s

RR (in chat): (The counterpoint is, I suppose, that most public-key crypto schemes aren't super sensitive to the performance of the public-key part.)

Michael Ficarra (in chat): because they just use it for key agreement and then continue with symmetric crypto?

RR (in chat): More or less. A lot of usecases involve signatures instead of ciphers, but you always sign a hash so e.g. doubling the size of the data doesn't mean twice the point multiplications.

In fact, I'd put good money on key-stretching algorithms -- say Argon2 or PBKDF2 -- massively outweighing the point multiplication in the average application. I know off the top of my head that blockchain applications mostly store their private keys in a form that requires running 10000 rounds of SHA-512 before you even get to messing with the curve point operations.

AC: The phase 3 requirements have been met. This is specifically a phase-3 vote for two instructions.

CW: This is one of the nicest proposals I’ve seen, in terms of scope and motivations.

DG: Do you have any of these benchmarks posted where we could try them?

AC: Yes, I have a benchmark repo. I can send you instructions. I ran all these in browsers too, and saw similar performance numbers.

DD: Random question/suggestion: do a lot of the use cases for add/sub128 have a 128-bit accumulator that you add an i64 to? Have you considered flavors that take 2 i64s to begin with?

AC: That is true. (Code example: slide “Overflow flags: Reality”). The highlighted extra instructions, I wasn’t able to get rid of.

DD: What about specialized forms of these where the high operand is zero?

AC: I think that would be reasonable. I haven’t explored that because I assumed it wasn’t worth it. Code size should be negligible and backends can pattern-match.

But having variants of these instructions in the spec would work well.

(question in chat): Does this point the way to 256-bit or 512-bit in the future?

AC: I would be very sad if we get 256 or 512 or so. To me, the answer is, we should use what I’m proposing here to optimize 256-bit or 512 bit. If it’s not a win, we shouldn’t do them.
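
(Editor's illustration, not shown in the meeting.) One way to read AC's answer is that wider additions are built limb by limb from the 128-bit add. The sketch below models that in Python for a 256-bit add over four 64-bit limbs. The `add128` model and the helpers `add_wide`, `to_limbs`, and `from_limbs` are hypothetical names for this sketch; real code would use the proposal's instructions directly.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1


def add128(a_lo, a_hi, b_lo, b_hi):
    """Model a wrapping 128-bit add over (low, high) 64-bit halves."""
    total = (((a_hi << 64) | a_lo) + ((b_hi << 64) | b_lo)) & MASK128
    return total & MASK64, total >> 64


def add_wide(a_limbs, b_limbs):
    """Add two equal-length little-endian lists of 64-bit limbs, wrapping."""
    out, carry = [], 0
    for a, b in zip(a_limbs, b_limbs):
        s, c = add128(a, 0, b, 0)           # a + b, with carry-out in c
        lo, carry = add128(s, c, carry, 0)  # fold in the carry-in
        out.append(lo)
    return out, carry


def to_limbs(value, count):
    """Split a non-negative integer into little-endian 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Rebuild an integer from little-endian 64-bit limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


x = (1 << 200) + MASK64
y = (1 << 130) + 1
limbs, carry_out = add_wide(to_limbs(x, 4), to_limbs(y, 4))
print(from_limbs(limbs) == (x + y) % (1 << 256), carry_out)
```

Each limb costs two 128-bit adds in this model, and the carry-out never exceeds 1 because two 64-bit values plus a carry bit fit in 65 bits.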

Phase 3 poll to move wide-arithmetic to phase 3 as presented:

SF: 18 in room + 3 in chat F: 12 in room + 3 in chat N: 4 in room, 0 in chat A: 0 SA: 0

AC: If you are a browser implementer and have questions about this, I’d love to talk with you!

Website

TS presenting webassembly.org

TS: At the last meeting I proposed making improvements to the website, and now, that’s what I did.

TS: External links have a little icon to show they are external. Also: Dark mode!

TS: Almost all traffic lands on the main page. Second most is on “Feature Extensions” page.

The developer-guide page isn’t very pretty, but has a lot of info.

I went through all the open PRs, and I have now closed all of them.

The same for the issues. Lots closed, only 2 more open.

The two issues that are left open: No more-wasm tutorials. Turns out, there’s a lot of content on the Web. learn-wasm.org and other sites. If we find someone who wants to help us write content here, we can probably work with them to make that happen.

Mendy: I wrote a lot of the MDN docs on Wasm.

AI (in chat): The table should not be restricted by arbitrary max-width... If someone has a wide enough monitor, the whole table can't be seen because of the max-width: 1080px … fortunately, ctrl+shift+i exist

TL (in chat): Now that we're past that part of the presentation, it would be great to open an issue about this.

MF (in chat): or a PR!

TS: Some of the less visible pages are actually hosted in the design repo. Someone from the community made a page to surface a hand-selected collection of design repo docs for the website. Open discussion for the community: what should we do with these pages?

TS: WebAssembly has many different audiences coming from different backgrounds. People hearing about the “Web” in the name may assume it’s like JS or make other assumptions.

TS: webassembly.org is the landing pad for most of the “WebAssembly is” hyperlinks around the Web. A lot of people copy-and-paste the opening sentence on the main page. What information should we present on the main page? I left most of this content as-is, so let’s have an open discussion. What should we do here?

N: I visited the site and clicked on the FAQ page and it talks about how threads and SIMD are coming in the future, so it’s very out of date.

TS: The question is, do we even need this?

RH: Thank you for doing all this work. Taking a slight step back, I’m curious if there’s interest in a smaller group of people meeting regularly to work on updated content here. They could be empowered to make changes, and check in with the CG every quarter perhaps, so this could be relatively low-effort.

TS: Yes, I think this makes a lot of sense.

DanG: I've been going through the design repo and updating design.md and friends, so most of those are up to date now.

DanG: It looks like the git submodule in the website repo pointing to the design repo is three years old, so I will submit a PR to update that. Then the website will have up-to-date contents.

FM: It would be nice to have guidelines for people who wish to have their site linked to from the webassembly.org page. Then we could make it easier for people to contribute to the website.

(discussion of approaches to guidelines)

TL: Perhaps we could use our unconference time to discuss this further.

Thursday, February 13

WebAssembly Memory & JSArrayBuffers (Deepti Gandluri)

DG Presenting slides

DG: One of the goals is having a scope discussion.

CG: Just to check: is this a prototype, or is there a path toward shipping where the Wasm memory does not know it's mappable?

DG: For now just a prototype.

AR: Can you clarify. You do not propose any core changes? I assume you mean no new instructions, but I expect there would need to be changes to the memory model, for example.

DG: They would require changes to the memory model, what types of traps are observable, some implicit changes. So it’s important to make sure people are on the same page so we know what will need to be reflected into the core spec.

CW: So I’m imagining having a “mappable” attribute sort of like we have “shared” now?

DG: yes, something along those lines

(slide: Proposed Additions: JS API)

RH: Is it a mutable mapping or a copy-on-write?

DG: for now it’s mutable; if we decide we want COW, we can discuss that. About this set of APIs, it doesn’t imply that you have to have a certain implementation strategy. Especially with MemoryMapDescriptor, you can have other web APIs that return one of those and you can use that in a memory.map function.

RH: Sure. I don't want to go too deep, but can you have a map descriptor and map it multiple times into different memories or the same memory?

DG: For now we’re experimenting with a single memory.

RH: Sure, I'm just trying to understand the edge cases.

DG: Nothing prevents them from being used in multiple memories, but we haven't experimented with it.

CF: Is the system page size exposed somehow?

DG: the system page size isn’t exposed to the API but I assume the runtime knows about the underlying device. It doesn’t seem like a large source of nondeterminism

CF: So if you want to do two maps, for example from 0-4k and from 4k to 8k, that will work with 4k pages but not with 16k pages.

DG: Runtimes abstract over that anyway because wasm defines its own page size. These APIs use the wasm page size just like memory growth, not the system page size
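
(Editor's illustration, not shown in the meeting.) The exchange between CF and DG comes down to alignment arithmetic: ranges expressed in 64 KiB Wasm pages are whole multiples of both 4 KiB and 16 KiB system pages, while CF's 4 KiB ranges are not whole 16 KiB pages. A small Python check follows, where the function name `is_page_aligned` is invented for this sketch.

```python
WASM_PAGE = 64 * 1024  # default Wasm page size in bytes


def is_page_aligned(offset, length, page_size):
    """True if the byte range [offset, offset + length) covers whole pages."""
    return offset % page_size == 0 and length % page_size == 0


# CF's example: two 4 KiB ranges. DG's answer: ranges in Wasm pages.
cf_ranges = [(0, 4096), (4096, 4096)]
wasm_ranges = [(0, WASM_PAGE), (WASM_PAGE, WASM_PAGE)]

for system_page in (4096, 16384):
    cf_ok = all(is_page_aligned(o, n, system_page) for o, n in cf_ranges)
    wasm_ok = all(is_page_aligned(o, n, system_page) for o, n in wasm_ranges)
    print(system_page, cf_ok, wasm_ok)
```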

(slide: prototype status)

DG: Not covering the implementation details here, we can talk about that some now but also more depth later if you want. We also did an early security review and got some requirements from them.

DG: There are no large security concerns at the moment. File descriptors should not be stored in the heap.

CW: was there thought of any additional considerations in a multithreaded setting?

DG: Yes and no. There are enough open questions; we’ll come back to that at a later point.

ChrisW: You’re mapping JS array into linear memory, and this assumes the linear memory is big enough to map everything in? But we don’t always have that. Could we have a secondary memory, maybe even dynamically add them? So we thought about that. This would be cool, but the C code can't access it, so we need a getter and setter and that kills perf by 30%. Is there a way to make this work?

DG: We did talk about this. A big consideration was usability. If you want to access a secondary memory, you need to annotate your application somehow to use multiple memories. If we're clever and expose some sort of intrinsics or tooling support that doesn't exist today, it's still invasive in the source language. So this is bad for portability. If you would want to port a big application, you would have to add memory annotations.

ChrisW: We did some thinking about that and would love to discuss later.

KM: I think I agree largely with the usability point there; anything that requires source changes is probably worse than anything that doesn't. Also, on the JS API side, what’s the story with ArrayBuffers, can you not get one of those?

DG: We might need a narrower view on the JS side. Maybe you would want to have a tighter coupling between Wasm and the JS API. Consensus in Pittsburgh was not to let JSisms influence too much what goes into the core. But there were definitely concerns that this would probably need some changes.

KM: Hopefully we can find a way to punt on it or find something that's somewhat ergonomic in Wasm.

BV: You were talking earlier that for now there would be no changes to the memory model?

DG: I didn’t say no changes to the memory model; we’re anticipating them but just doing prototyping for now. My early thought is that we still want to zero-fill.

BV: The main reason I ask is that it impacts how you can perform mapping, especially on different OSes. when we thought about doing it on Windows before, it seemed very difficult to maintain the illusion of zero filling when you’re mapping this way. What you’re proposing seems to be more-or-less just the JS side of the virtual mode sub-proposal that’s already in the repo, right?

DG: I haven't taken a recent look at the virtual mode. I looked at it when it was posted but not since then.

BT: Where are the sources of the MemoryMapDescriptors?

DG: It is not in the slides yet, but one example is WebGPU async, which would return one of these. It could be different in different contexts.

BT: Is it within scope to discuss how to generate MemoryMapDescriptors from sub-ranges of existing WebAssembly memories?

DG: not thinking about that right now.

BV: regarding BT’s question, I don't think it’s possible to perform that kind of mapping operation without changing how memories are allocated. Virtual memory APIs don’t let you just create a mapping for some other arbitrary range of memory. You’d have to create a separate descriptor from the start. Don't think it's a good fit.

(slide: prototype status:defer overlapping mappings)

DG: This can be done on windows 10+ now so it might not be terrible. We’d like to prototype this to understand how bad it is and what it might take for older versions of Windows.

DG: We’d really like feedback from other runtime implementations on what the possible space for unmap is; we’ve gone back and forth with zero filling and trapping. Maybe a question for Ben/Keith: are there restrictions we should be aware of, etc.

RR (in chat): From an embedded perspective, we're already having to deal with the page size being "too large" at 64k. I think that a requirement to grow the memory before mapping into it would practically require a lot more implementation complexity to allow extra "fake" zeroed pages to show up, that might exceed the physical memory limits of a device.

CW: When you say unmap is it related to undoing a mapping?

DG: this is about undoing a mapping, there has to be a signal to take ownership

KM: Is the feedback you’re looking for about engines or platforms?

DG: For engines and platforms. One thing we are thinking about is discarding vs. zero-filling, etc.

KM: on Darwin platforms there’s an API for asynchronous discard, either you’ll see what you saw before, or you’ll see zeros, depending on when the OS gets around to reclaiming the memory. That’s one thing to consider.

DG: I think that was the question. We are playing around with options. Having a performant API in Darwin would be a nice thing to have.

KM: it would be nice to allow having the option for async or synchronous depending on what the platform has available.

CW: I guess for regular memory discard. Are you imagining that has to be asynchronous?

KM: You can do it synchronously, it’s just much much slower. It’s recommended by the platform to do async, system malloc does this, you mark something to be reclaimed, and if you want to keep it you can just touch it and it will be removed from the queue.
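
(Editor's illustration, not shown in the meeting.) The behavior KM describes can be modeled with a toy Python class: a discarded page keeps its old contents until the "OS" reclaims it, and touching the page first removes it from the reclaim queue. Everything here (`LazyDiscardMemory`, `discard_async`, `os_reclaim`, the 4-byte page size) is hypothetical and only mimics the observable behavior; it is not a real Darwin interface.

```python
class LazyDiscardMemory:
    """Toy model of lazily reclaimed pages; not a real OS interface."""

    def __init__(self, num_pages, page_size=4):
        self.page_size = page_size
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.pending = set()  # pages marked for later reclamation

    def write(self, page, data):
        self.pending.discard(page)  # touching a page rescues it
        self.pages[page][:] = data

    def read(self, page):
        self.pending.discard(page)  # reads count as a touch in this model
        return bytes(self.pages[page])

    def discard_async(self, page):
        self.pending.add(page)  # returns immediately; contents unchanged

    def os_reclaim(self):
        # Whenever the OS gets around to it, pending pages become zeros.
        for page in self.pending:
            self.pages[page] = bytearray(self.page_size)
        self.pending.clear()


mem = LazyDiscardMemory(num_pages=2)
mem.write(0, b"abcd")
mem.discard_async(0)
mem.os_reclaim()
print(mem.read(0))
```

The nondeterminism KM points out is in when `os_reclaim` runs relative to the next touch: a reader sees either the old bytes or zeros.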

RH: I’m trying to think through your question about unmapping. When we talked about fancy memory things, the details really matter, overlapping mappings, sharedness, OS details, etc. The APIs just have such subtle differences and so many constraints, that the options are mostly bad. One problem with unmapping on Windows is if you want to preserve the discard behavior, and see zeros, you need 2 syscalls. For shared memory, at best you can say that you either need to stop the world (really bad) or accept that there’s a potential fault and catch it with trap handling; it’s possible that you could specify that in core wasm but you need signal handling. For us with signal handling we catch faults in JITted code but not runtime code, so if you call into memcpy and it faults, it will just crash so we’d have to add a special case for that. You’d have to find a way to unwind the stack, a big project. So adding traps to arbitrary VM code is something we’d like to avoid. There are options but we’d have to really carefully work through the details. It’s fine to prototype and get performance ceilings, but we’d have to go through it really carefully before advancing a concrete proposal.

DG: I mean I think that was the thing we were struggling with. There are no good options. ??

RR (in chat): I could imagine taking a fixed-size memory and expanding it by mapping onto the end, but the easy solution to memory on embedded systems (in lieu of the small pages proposal) is to just ignore and trap on any memory above an implementation-defined smaller limit. That's not going to be possible if you have to be able to map into that space. How do you anticipate this proposal interacting with the custom pages proposal? Would they be mutually exclusive?

DG: I assume this would be orthogonal to that.

TL: But if you’re dealing with wasm page sizes and there’s already a mismatch, that could be an issue.

DG: I assume if you have custom page sizes, maybe this proposal wouldn’t be too interesting to you?

CW: It seems that there are embedded systems that want to have small and custom (??) page size.

RH: currently we have a restriction of 1 vs 64k. If you have a page size of 1, none of this would work.

CW: Unless you arrange the API so you can specify the architectural/OS page size to map even when the wasm memory page size is 1.

BV (in chat): To follow what Ryan said, there should be no conflict as long as your custom page size is a multiple of the native page size. Anything smaller would not work. And right now the two options for a custom page size (1 and 64k) don't help you.

BV (in chat): If you weren't required to fill pages with zeroes, and pages were instead allowed to trap by default, would that address this concern? Perhaps then you would only be using address space, not a commit limit?

RR (in chat): I really appreciate the work on this, BTW. One of the beefs some embedded folks have with component model is extra memcpys; I can see these mapping primitives be useful for a component-first host environment to avoid that by flipping pages between core modules instead.

(slide: core spec)

DG: core spec isn’t time sensitive to prototyping, but just wanted to mention that there would be core spec changes; maybe no opcodes, but would affect the memory model

(Slide: Benchmarks Plz & other requests)

DG: If you have representative work loads: please reach out (important for prototyping)

CW: It seems like there has always been a buzz around memory.discard. Is there any reason to get that out quicker?

DG: There was a lot of interest. We would really like to see people experiment and tell us they’d use it. We’d like to do the memory API changes in one go, but if we can get data that shows it really solves problems, then we can prioritize

CG: My vision is that much of the new functionality is behind the capabilities of the new Map descriptors, but this could be something that could be used with the existing style of memories and not depend on that

DG: many of these use cases people brought for discard weren’t really solvable by discard, e.g. address space fragmentation on 32-bit platforms.

BV: We do actually need some experimentation to show how impactful memory.discard is. Looking more broadly, it does seem there’s a possibility for some linear memory/virtual memory split to occur, depending on how things shape up. They are together to avoid conflict, but I could imagine discard or maybe discard+static memory protection being split up and delivered more quickly.

(slides: immutable arraybuffers)

TL: Can you remind us what stage 2.7 is?

SYG: design freeze without tests, pending implementation feedback. So if you want to change the design it’s because of implementation feedback, not because you change your mind. And the idea is that assuming no implementation issues, it goes through. So now would be the time to bring feedback on that

TL: Sounds most similar to our phase 3.

DG: Yes, but we don't say "design freeze." Maybe we should.

(slide: Proposals details)

(slide: Discuss)

DG: Looking for feedback if it’s there. We think it’s largely orthogonal to what we’re doing here

(About Immutable ArrayBuffers)

CW: Do we need to think about reflecting these new capabilities back to WebAssembly?

DG: It’s possible to have that, which is why it’s good to consider, if JS ships something as an immutable ArrayBuffer. Most of the use cases we’ve seen here have been more on the embedded side, but if we get to the point where web apps start to use this and wasm, we might want to do something. But it seems pretty tractable to e.g. have a readonly flag on a wasm memory.

CW: we might want to think about how we allocate when using data descriptors, since currently we separately copy them in. worth thinking about how this would influence what we do in the future.

DG: yeah just wanted to highlight this as an avenue for feedback.

BV: Are the contents of ImmutableArrayBuffers allowed to change? (If the backing memory changes)

SYG: I think that would be very surprising. Good to clarify with the proposal champions. I would imagine the readonly bit is truly readonly at the OS level. Not read-only by user code only.

DG: There is no immutable view at this time.

SYG: all views on immutable buffers would be presented as immutable. If you query the properties it would show as non-writable.

KM: I haven't looked too closely, but my understanding is that there is no hook to provide for modifying the immutable buffers, so doing so would violate the spec.

SYG: I would object to the proposal in TC39 if it meant “RO by user code only”; that’s really weird.

RH: My understanding is that if you grab a Wasm Memory's buffer, you cannot detach(?) that buffer.

DG: yes

RH: so transferToImmutable just wouldn’t work. One other thing we could do, if we had something like the static memory protection, You can imagine getting a read-only buffer. You still have to do index translation, but it's still possible.

DG: yeah index translation is just something you have to do. Unpleasant but that’s it. Will discuss detachability in a followup conversation, do we want to make ArrayBuffers backed by wasm memory detachable. Will discuss in the future.

CW: Do you have a one sentence motivation why they are not detachable right now?

DG: wasn’t able to find. They are not detachable out of convenience, but maybe we should move to a different model.

CW: We maybe have discussed that we don't want Wasm memories to have multiple aliases, but that might have been based on vibes.

DG: yeah it was definitely just a conservative decision

CW: I remember, it was when talking about resizable array buffers, we didn’t want a resizable and nonresizable alias at once.

DG: so it seems we don't have any current feedback for stage 2.7 in TC39, and we might want to expose to wasm in some way in the future.

RR (in chat): Surprised they're not transferrable! We might have issues with moving them between web workers for threading purposes? No need to use immutability for that, I guess, I'm just baffled by the restriction.

DT (in chat): The discussion has me thinking about Wasm access to mapped webgl/webgpu buffers... has anyone in chat done any work related to this?

DG: please contact us with any feedback on any of these topics

Compilation Hints (Emanuel Ziegler)

EZ presenting slides

(Slide: Why compilation hints?)

EZ: Counters are too expensive to have all the time. Early experiments showed that lazy compilation helps a lot in the beginning, so it is worth it. But we want peak performance ASAP. Notably, hints only work when profiling is accurate. When you know the use case very well, profiling is helpful.

(slide: Guiding principles)

EZ: We do get requests from toolchains to make them mandatory, or have more guarantees e.g. inlining. Doesn’t really fit the philosophy of wasm but wouldn’t be good to have a custom section that is enforced.

EZ: we wouldn’t want to include things that can’t be used (we know what V8 uses, would like to hear from other engines about what they could use too)

(slide: open questions)

EZ: We just tier up after a particular count of calls, but that’s not necessarily transferable to other engines. Extend call frequency hints? I think it’s a good idea. There is a benefit to annotating blocks: if one block is hotter than another, we want to optimize that one (hot loops).

We don’t currently do cross-module inlining but it might be a good idea (e.g. when using module splitting cuts hot paths across modules). In principle you could annotate imported function indexes. If you use call_ref you might not have a function index if it’s not directly imported. You could of course add one just for that. You’d also have to rely on the environment not shuffling the imports around when instantiating. But if you are at that level, you have an architectural problem. But you could also just ignore imports.
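
(Editor's illustration, not shown in the meeting.) The count-based tier-up EZ mentions, and how an advisory hint could shortcut it, can be pictured with a toy Python model. The class `TieringModel`, its threshold, and the hint format are all invented for this sketch and do not describe V8 or any other engine.

```python
class TieringModel:
    """Toy engine: functions tier up after a call-count threshold."""

    def __init__(self, threshold, hot_hints=()):
        self.threshold = threshold
        self.calls = {}
        # Hinted functions are optimized eagerly; hints are advisory only,
        # so an engine would be free to ignore them.
        self.optimized = set(hot_hints)

    def call(self, func_index):
        count = self.calls.get(func_index, 0) + 1
        self.calls[func_index] = count
        if count >= self.threshold:
            self.optimized.add(func_index)
        return "optimized" if func_index in self.optimized else "baseline"


engine = TieringModel(threshold=3, hot_hints=[7])
print(engine.call(7))                       # hinted function
print([engine.call(1) for _ in range(3)])   # unhinted function warming up
```

A threshold like this is engine-specific, which is why EZ notes that raw call counts may not transfer between engines.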

BT (in chat): 80% not called? Wow

(Slide: Scope)

EZ: My cat is trying to dismantle my Bonsai tree.

EZ: There will be no standardized profiling in this proposal (out of scope). Also out of scope is in the wild profiling, perhaps via an API that developers could use. Could have potential privacy concerns but worth thinking about.

RH: On in-the-wild profiling, it makes sense to not just have a JS API because it could be difficult to think through. But one thing that would be cool is if we could define the hints such that 2 profiles could be merged together and get a sensible result. Easy if it’s as quantitative as possible. Then you could e.g. average call counts or whatever.

EZ: Yes, also for engines that is nice. You could combine engine-collected stats with AOT-given stats, rather than only using the AOT ones.
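
(Editor's illustration, not shown in the meeting.) RH's point about mergeable profiles is easiest to see when hints are plain numbers. The Python sketch below sums per-function call counts from two profiles and normalizes them to frequencies. The profile shape (a mapping from function name to call count) and the helper names are assumptions for this sketch; the proposal's actual hint encoding is not specified in these notes.

```python
from collections import Counter


def merge_profiles(*profiles):
    """Sum per-function call counts across any number of profiles."""
    merged = Counter()
    for profile in profiles:
        merged.update(profile)  # Counter.update adds counts
    return dict(merged)


def to_frequencies(profile):
    """Normalize call counts so profiles of different lengths compare."""
    total = sum(profile.values())
    return {name: count / total for name, count in profile.items()}


aot_profile = {"f": 10, "g": 1}
engine_profile = {"f": 5, "h": 4}
merged = merge_profiles(aot_profile, engine_profile)
print(merged)
print(to_frequencies(merged))
```

Qualitative hints such as "hot" or "cold" have no obvious merge rule, which is the reason for keeping hints as quantitative as possible.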

AI (in chat): Are annotations namespaced? So that runtime can have its own hints, but then there is a core namespace for most common hints. Tools then can have multiple hints targeted for different runtimes (as they are optional, hints which cannot be resolved are ignored). Looking at https://github.com/WebAssembly/annotations/blob/main/proposals/annotations/Overview.md#details it doesn't look like it is being namespaced…

EZ: no, they are not. We could do that. So far the reference we used was the branch hinting proposal, which didn’t have that. With the discussion of AOT vs runtime hints, you could use namespaces for that. I don’t think you’d want to have different hints for different engines.

CW: Thinking through it now, should we consider this an anti-goal?

EZ: I think of the namespace as being generic, e.g. frontend vs engine maybe but tying it to a specific engine seems not great. We can discuss on the proposal repo.

KM: I guess on that same kind of point. There is a concern that even if you don’t have a browser hint, if people profile the code on just one particular browser that might have an influence.

EZ: I'm optimistic that it won't be a problem if we can make the hints and format generic enough.

KM: I think I mostly agree given what we have on the table right now. But just wanted to mention that I think there’s a nonzero risk of that.

EZ: That's why it's important to get feedback. We want to make sure it's sufficiently generic.

(slide: scope: compilation order, call frequencies, call targets)

BT: It is often very common to want to inline a certain part of a function. Are block annotations with “pretty please” values a way to do that?

EZ: You could do that. I would expect that block annotations have a higher level of uncertainty. A block not having been seen doesn’t mean it’s really not needed. We could maybe use the no-inline annotation for that. The problem is that you might want to do that per callsite but can’t do that right now. We can only annotate that per function at the callsite.

BT: I think that’s reasonable.

(slide: move to phase 2?)

CW: One comment. I feel pretty good about this proposal personally. If the proposal is taken seriously, do we want to add other hints, or will things stay pretty much as they are now?

RH: For me, I think this is a reasonable starting point based on what engines do now or are likely to do in the near future. On the “reasonably high level of consensus” I think the scope is pretty close. We might have feedback on details like the units, but the overall shape seems reasonable to me.

CW: Any other opinions on that aspect?

AI: annotations namespaced? So some runtimes can have their own hints, which are then lifted to core namespace if needed? To give runtimes their own space?

EZ: Nobody stops you from adding your own hints, but I’m not sure we want to standardize something like that. Are you thinking something specifically?

AI: no , just to make it extensible in a decentralized way.

EZ: yes the annotations proposal is designed in a way with different names, you could use ‘metadata.code.something’

YI: The format (binary/text) is easy enough to support custom use cases and reuse the format.

CW: Just to add onto that, I'm happy that we're not considering runtime-specific hints. If we wanted to consider that, we should consider and vote on it separately.

Phase 2 poll to move compilation hints to phase 2 as presented:

SF: 13 in room + 7 in chat F: 17 in room + 6 in chat N: 1 in room + 2 in chat A: 0 SA: 0

TL: That is consensus.

Custom RTTs and JS Interop

TL presenting slides

TL: Let’s get started. Here to present on Custom RTTs and JS interop.

We need the former to get the latter, though the former’s kinda nice on its own. Going to go through design, according to the slides TOC.

(Slide: “Contents subject to change”)

TL: Not phase 1 yet. All subject to change. Pretty worked out design but it’s not set in stone. Any piece of this is open to discussion. The only reason to have it so finely worked out now is to give us a baseline for that discussion.

(Slide: “Why Custom RTTs”)

TL: Why do we want custom RTTs?

MS: What is RTT?

TL: Stands for runtime type. History is that the wasm GC proposal originally had RTT values and had lots of discussion. Eventually got rid of them. Deleted from wasm-gc. Now let’s bring them back! Even though I was one of the ones who said let’s get rid of them! :-) But it’s a little different now.

TL: Why do we want custom RTTs? We don’t, in themselves; what we want is JS interop. We want to be able to call methods from JS where the receiver is a wasm GC object. So if I allocate some struct that is the lowering of my “java Foo class”, and the instance of my Foo flows out to JS, I should be able to call Foo.bar().

TL: We spent a lot of time thinking about how we can get that. How can we attach a JS prototype to a wasm gc object? Given constraints on wasm composition/structural typing/etc. this is the cleanest design we came up with so far.

TL: It turns out this custom RTT component has lots of advantages even aside from the JS interop.

(Slide: Why custom RTTs?)

TL: Byte order: first, an engine-managed header. RTTs aren’t in the language atm, but they do exist as part of the execution semantics. Objs need to know what type they are to be castable. RTTs are how we encode those types. First field of struct: rtt, which points to the RTT.

TL: In userspace this struct, lowering of a “java foo class’, has a vtable, fields, stuff, etc. The vtable here stands in for userspace type id, an actual vtable, an interface table, etc. Some source type associated data that’s in this struct.

TL: If a vtable, it may point to a struct full of methods. Today, Every instance of the Foo type points to the same RTT and the same vtable. But what if we could stick the vtable in the RTT? Then we could save having a vtable field in the Foo struct. We estimate a 10% total heap size savings for one particular program (Sheets calcworker).

CW: Heap size of the program?

TL: Wasm side of the program, yes. Lots of little objects without many fields. Removing an entire field is pretty good.
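
(Editor's illustration, not shown in the meeting.) The intuition behind the savings is simple arithmetic: dropping one word from every instance matters most when instances are small. The Python sketch below uses made-up object shapes (one header word, 8-byte words, a hypothetical `heap_savings` helper). It does not reproduce the Sheets measurement and ignores the one-time growth of each RTT.

```python
def heap_savings(num_objects, fields_per_object, header_words=1, word_bytes=8):
    """Bytes and fraction saved by dropping one vtable word per object."""
    words_before = header_words + 1 + fields_per_object  # +1 for the vtable
    words_after = header_words + fields_per_object
    before = num_objects * words_before * word_bytes
    after = num_objects * words_after * word_bytes
    return before - after, (before - after) / before


for fields in (2, 8, 30):
    saved_bytes, fraction = heap_savings(1_000_000, fields)
    print(fields, saved_bytes, round(fraction, 3))
```

Under these assumptions, objects with few fields save a much larger share per instance than objects with many fields, which matches TL's remark about "lots of little objects without many fields".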

AA: You don’t need extra space to store the methods as part of the RTT?

TL: For sure RTT gets bigger. All foo structs can share the same structs with the same methods. This was originally posted to the design repo. Let’s add fields we can add to RTTs. Andreas said that looks like a struct! Let’s make it a struct. Now there’s a special-annotated kind of struct, like a normal struct, full of methods, but it also has this engine-managed RTT thing.

CW: For existing objects that don’t have custom RTTs. Can you expose their representation?

TL: That’s open for discussion. My impression is that we should ban the canonical RTTs that are currently built into types today from being ever exposed to user space. (You should never be able to get a handle to one of those.)

CW: I have follow-up questions but let’s keep going.

TL: The custom RTT thing is a normal struct with engine-managed stuff inside of it.

SC: Does this mean you’re chasing an extra ptr for every cast?

TL: Yes, but we were chasing one before. Whether we have to chase another depends on the implementation. In initial prototypes, the additional RTT will likely be out of line somewhere. An optimized impl should be able to have this all inline.

CW: The obvious pitfall here is you don’t want the RTT object in the ‘any’ type hierarchy, it would have to be its own hierarchy.

TL: yes.

FM: With a moving GC the RTT might move. That could confuse the GC algorithm. The GC algorithm uses the RTT to figure out what to scan. Object moved and its description moved. Tricky for gc?

TL: Could be. Should ask the GC team.

AR: Already exists in engines. In v8 the Maps, shape descriptors, are GC objects. Not new.

BT: v8 maps don’t move, but it’s a small thing to follow a forwarding pointer. Or follow any rtt that’s been moved.

CF: Can this custom RTT struct itself have a custom RTT?

TL: Good question! Also open for discussion! What Andreas recommended is “sure why not!”

CF: So this is just a prototype object system, then?

TL: yeah … Don’t think Java or Kotlin need that but maybe other systems do? Not much extra work in the engines? Very TBD.

KM: Wouldn't that make it hard to inline?

TL: This struct here the rtt isn’t being inlined here. You always have a pointer to the rtt and if this struct has a pointer to rtt then even if the rtt info was inlined you’d still have a pointer to the next rtt information. Don’t get infinite inlines.

CW: Yet!

AR: ... There’s a thing to the left of the lower box which is another rtt which is the rtt of this thing.

KM: Assumes you can GC your rtts instead of refcount?

CW: Since they’re pointing to a “meta-RTT”, can you get circularity to rtts if they can point to one another.

AR: This kind of thing wouldn’t be mutable, so you can’t possibly build a cycle.

BT: I think we discussed this on the issue. Some types do bottom out …, so you can’t(?) have recursion.

JK(chat): as far as implementation details are concerned (especially short term), imagine something like the previous slide: the struct points at the engine-managed RTT, which points at the "struct*" custom RTT.

IR: Is rtt struct mutable at runtime?

TL: Except for engine-managed stuff the other fields are normal struct fields. So yes, they can be mutable if declared as such. In the original design we have a “new” thing with a “new” heap type declaration. Probably don’t need fields to be mutable. If we’re just gonna reuse infrastructure for structs, it would be more complicated in the spec to disallow mutability. Maybe there’s an impl reason to disallow, but until we have feedback it’s mostly a normal struct; it can have mutable fields.

FM: mutability would be handy for static variables in classes in Java. Put your class variables in there.

TL: Having mutable fields here lets you have mutable static fields.

YI: Might want to generate new code and put new pointers in there. Nice that it’s possible to be mutable.

TL: Not if you want Binaryen to optimize any of that! But in principle, sure.

CW: Immediate thing is are we dependent on speculative inlining of virtual methods right now?

TL: Yep.

CW: (yuck).

RH: When you want to access an object with an RTT do you get RTT struct and get the field off of it?

TL: Yep.

RH: So it’s going to be a subtype of eqref(?). So equality will be preserved. How can you inline the fields, then?

TL: Let’s go through more with a full example and what it’d look like in wasm/JS

SL: Is it necessary that these 2 different things are actually different things, or could we…

TL: Yes we don’t want your everyday struct to have this inline space for RTT. If you want to inline it it needs to be fixed size for subtyping.

SL: But that’s the only difference?

CW: Not a real struct? A “struct*”?

TL: Oh yeah totally different struct

CW: Thought that was a regular struct and it’s just a magic field then it’d be really big.

TL: No, this RTT field is not a normal struct field; this is engine-managed. I’ll show an example in a minute.

(Slide: Custom RTT Design)

TL: Here let’s make it concrete. Have a struct type “foo”. Just a normal struct with a vtable field. This is how you’d do it today. It’s how jco wasm does it today. Reference to separate foo.vtable type. Also normal field like “i32 bar”. Vtable is just a normal struct and its fields are methods. The methods of foo.

TL: Notice it’s the user-managed $vtable field that points to the vtable. This is how we do it today. Now, we’re going to get rid of the vtable field and add this new thing to the struct definition. This says the Foo struct is going to have THIS as its descriptor. “This is the type index of my custom RTT.”

TL: Now we’ve gotten rid of the vtable user-space field. The engine-managed rtt pointer in the header actually points to one of our new custom RTT types. Also need to make foo.vtable one of these custom types. Also need to tell engine to make it a special struct with a “describes $foo” annotation.

TL: Of course we need to say what type we’re describing so when we do a cast I know what type I should be describing the engine-managed RTT. They’re sort of in a one-to-one correspondence.
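
The pairing rule described here can be sketched outside of wasm. The following is a minimal Python model, not proposal syntax: `TypeDef` and `check_pairing` are hypothetical names invented for illustration, and the check only captures the idea that every `descriptor` clause must be answered by a `describes` clause pointing back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeDef:
    """Hypothetical model of a struct type definition (illustration only)."""
    name: str
    descriptor: Optional[str] = None  # name of this type's custom RTT type
    describes: Optional[str] = None   # name of the type this custom RTT describes


def check_pairing(types: dict[str, TypeDef]) -> list[str]:
    """Return errors for every descriptor/describes clause lacking its mirror."""
    errors = []
    for t in types.values():
        if t.descriptor is not None:
            d = types.get(t.descriptor)
            if d is None or d.describes != t.name:
                errors.append(f"{t.name}: descriptor {t.descriptor} does not describe it")
        if t.describes is not None:
            d = types.get(t.describes)
            if d is None or d.descriptor != t.name:
                errors.append(f"{t.name}: {t.describes} does not name it as descriptor")
    return errors


# foo and foo.vtable point at each other, as on the slide.
good = {
    "foo": TypeDef("foo", descriptor="foo.vtable"),
    "foo.vtable": TypeDef("foo.vtable", describes="foo"),
}

# A second type (hypothetical "baz") tries to reuse the same descriptor type.
bad = dict(good, baz=TypeDef("baz", descriptor="foo.vtable"))

print(check_pairing(good))
print(check_pairing(bad))
```

Because each declaration names its partner, the check is local to the two declarations involved and needs no global scan for duplicate descriptors.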

SL: You could have another descriptor on the second one?

TL: Exactly. So if I had a descriptor clause here as well as a “describes” clause, it would set up another … chain and be mutually recursive.

( … room audio lost, please stand by for technical difficulties … )

CW: Future slides about downcasts?

TL: Let’s talk about that later.

SC: Is it not redundant to say “descriptor” and “describes”, since if the vtable describes Foo, why do you need to point in the other direction as well?

TL: You definitely need both. Reason you need descriptor is we need to know what type we’re pointing to. Definitely need to say what the custom rtt is for foo. Reason you need describes is because describes says “in this RTT space, which type are we describing. What info are we storing so if a cast is performed, we know how to do the cast.” When we do a cast, we’re going to follow a ptr to RTT info, so it had better describe type Foo.

KM: Describes isn’t strictly essential? Can verify no two people have the same descriptor. Can infer describes from the existence of the descriptor for foo.

TL: But then you might have to validate that you don’t have multiple types with the same descriptor.

CW: Also sensible to say type declaration is all you should need to figure out the size of the type as opposed to doing something more global.

BT: Think it’s actually required because it could happen in different modules. One module could declare it .... May be digging too deep on this one though.

TL: Yes, let’s move on. But we definitely do want the describes clause; it makes things easier.

AR: One thing: Ben, this won’t be possible because they must be mutually recursive so can’t have separate modules.

TL: Yes.

DD: Do we strictly need this 1-to-1 correspondence even … if we had several types with the same static fields or the same methods or something. It doesn’t seem essential to have the 1-1 if the types(?) that held the same static fields…

TL: If the described types have the same shape then yes they can share the same RTT type. If they have the same shape they have the same type. It’s all structural.

DD: Could have two types with different sets of fields but the same vtable?

TL: The problem is if they have the same static fields and methods and you want to share the vtable type, which type info do you put in the RTT space?

DD: What I’m envisioning is you have a specified vtable but put the actual rtt in the instance. Sort of what the engine does. Could in theory have reuse amongst descriptors with highly generic methods.

TL: Yes. Being able to reuse descriptors and contents in your type section is a pain. When you have 1-1, you end up repeating this list of method fields a lot. But in the design issue we do mention it’s a problem and a few solutions to it. May be part of proposal to fix that—probably not by breaking 1-1 correspondence but by adding…fields(?) to reduce this duplication.

AR: Want to know you can’t break this 1:1 correspondence it’d be unsound. Can’t have multiple different vtable objects of the same shape with different method pointers. 1:1 correspondence is on the types, not the objects that have these types. Absolutely essential for ...

TL: Let’s talk about subtyping briefly. Same 2 types as before. Now we’ll define subtypes of them. Foo is described by Foo.vtable, which in turn describes Foo. Now we’re adding Bar, a subtype of Foo. To do that, it needs its own descriptor, bar.vtable. Bar.vtable needs to be a subtype of foo.vtable and needs to describe bar.

TL: Not only are types 1:1 but the entire type hierarchy is 1:1. Between the RTT types and the described types.
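
The mirrored hierarchy can be expressed as a small check. This is a sketch in Python under stated assumptions: `TypeDef`, `check_hierarchy`, and `make_types` are hypothetical helpers, and the rule that a supertype with a canonical RTT imposes no constraint follows the tentative answer given later in this discussion rather than any settled design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeDef:
    """Hypothetical model of a struct type; not the proposal's syntax."""
    name: str
    supertype: Optional[str] = None
    descriptor: Optional[str] = None


def is_subtype(types, sub, sup):
    """Follow declared supertype links from `sub` until `sup` is found."""
    cur = sub
    while cur is not None:
        if cur == sup:
            return True
        cur = types[cur].supertype
    return False


def check_hierarchy(types):
    """Report subtypes whose descriptor does not mirror the supertype's."""
    errors = []
    for t in types.values():
        parent = types.get(t.supertype) if t.supertype else None
        if parent is None or parent.descriptor is None:
            continue  # assumed: a canonical-RTT supertype adds no constraint
        if t.descriptor is None:
            errors.append(f"{t.name}: needs a descriptor below {parent.descriptor}")
        elif not is_subtype(types, t.descriptor, parent.descriptor):
            errors.append(f"{t.name}: {t.descriptor} is not a subtype of {parent.descriptor}")
    return errors


def make_types(bar_vtable_super):
    """Build the foo/bar example; only bar.vtable's supertype varies."""
    defs = [
        TypeDef("foo", descriptor="foo.vtable"),
        TypeDef("foo.vtable"),
        TypeDef("bar", supertype="foo", descriptor="bar.vtable"),
        TypeDef("bar.vtable", supertype=bar_vtable_super),
    ]
    return {d.name: d for d in defs}


print(check_hierarchy(make_types("foo.vtable")))  # mirrored hierarchy
print(check_hierarchy(make_types(None)))          # bar.vtable left unrelated
```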

CW: Isn’t there an issue related to the types of the “self” part of methods? Or are we just lifting everything up?

TL: We are not solving the receiver cast problem here.

CW: I guess the fact that you already do the “dumb” thing to solve the receiver cast problem saves you from having additional problems here.

TL: Exactly. The types of the methods haven’t changed at all. They’re still going in a vtable struct. We’re just smushing together the vtable struct with the engine RTT info.

CW: If we did ever try and improve on the receiver cast problem with something like Andreas’s self-type idea I expect this arrangement would cause an additional problem where you’d have subtyping between vtables? (notes may have lost some here)

TL: Maybe. I don’t have a great idea how that works. I would expect this may help a bit because it’s a step toward tying the methods and the receiver type together. At least the vtable and the receiver are together.

CW: Maybe now I think it through it’s exactly as hard. Maybe orthogonal.

TL: Maybe it’s orthogonal; maybe this helps a bit. I don’t know.

CW: Is there ever a point in the hierarchy where you don’t want anything custom you just want a canonical RTT? Does it have to be custom all the way down?

TL: 80% sure you can have the canonical RTT above in the hierarchy so a type of custom RTT can be a subtype of canonical RTT, but you can’t go the other way around. Once you have a custom RTT, all your subtypes need to have a compatible RTT, because you could do an RTT.get instruction.

CW: Ok that’s making some assumptions about the compat of the representation of the canonical RTT with the representation of the custom RTT? You’d effectively need to have the canonical RTT be a subtype of the struct with the custom RTT inside of it.

TL: Exactly. So when you have all your types with canonical RTTs at the top of the type hierarchy, you can disallow doing rtt.get on the types with the canonical RTTs. So it’s impossible to get a ref to a canonical RTT…

CW: Thing I’m thinking about … Which way around is ok again?

TL: Having the canonical RTT in the supertype and the custom RTT in the subtype should be okay.

CW: Ok so then that puts a restriction where if you have one subtype with a custom RTT and one subtype with a canonical RTT and when you go to the same supertype (with canonical RTT) do things work out? Puts restriction on representation of the canonical RTT?

TL: You’re right. The representation of a canonical RTT needs to be a prefix of the …, which is why I’ve drawn all these rtt fields at the beginnings of these structs.

CW: Are we feeling good about that? Is that realistic?

TL: Think so? We can discuss more! Gut takes?

RH: Probably?

CW: Seems like it could work but seems a little scary.

TL: We’re gonna prototype it and see! Let’s talk about using this stuff.

(Slide: Allocation with Custom RTTs)

TL: Gonna have a global that holds a vtable. Exactly like today but now it holds a custom vtable. When I allocate an instance of “foo” I pass instance of foo.vtable.

TL: This is just like we have today but we’re adding a new operator struct.new which takes the custom RTT as an operand.

TL: Here’s the tricky part, if we store a “bar” vtable in this global instead of “foo” then when we allocate our “foo” the instance of the custom RTT we’re giving it is actually an RTT for bar, not for foo. Then we can do a ref.cast on that newly allocated thing and it’s got “bar”’s rtt information so that succeeds even though we allocated a foo.

TL: And because we have width subtyping, Bar may have more fields than Foo. We could do a …..get for a field that doesn’t exist on Foo. Instant security vulnerability.

CW: Why are you allowed in user-code to decide that you can put a bar.vtable there?

TL: Because bar.vtable is a subtype of foo.vtable. Right now today if I have a global of type foo.vtable and bar.vtable is a subtype of foo then this is all valid and it works.

AA: So this is because you’re treating it as a normal struct, not as a struct?

TL: Exactly. When we allocate we could break subtyping where subtyping isn’t allowed for custom RTT “thing” when we allocate a foo. Must be exactly a foo.vtable. Pretty much the solution we ended up on.
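
The hole and the fix can be modeled in a few lines. This is a rough Python sketch, not engine or proposal code: `Rtt`, `Obj`, `struct_new`, and `ref_cast` are hypothetical stand-ins, and the exactness requirement is checked dynamically here only for illustration (in the design under discussion it would be enforced statically by the type of the operand).

```python
from dataclasses import dataclass

# Hypothetical two-type hierarchy: bar is a subtype of foo and adds field "y".
SUPERTYPE = {"foo": None, "bar": "foo"}
FIELDS = {"foo": ["x"], "bar": ["x", "y"]}


def is_subtype(sub, sup):
    """Walk the declared supertype chain from `sub` looking for `sup`."""
    while sub is not None:
        if sub == sup:
            return True
        sub = SUPERTYPE[sub]
    return False


@dataclass
class Rtt:
    describes: str  # the type this custom RTT carries cast information for


@dataclass
class Obj:
    rtt: Rtt      # engine-managed header slot
    fields: dict  # storage actually allocated for the object


def struct_new(type_name, rtt, values, require_exact):
    # Without exactness, the operand may be an RTT for any subtype of type_name.
    assert is_subtype(rtt.describes, type_name)
    if require_exact and rtt.describes != type_name:
        raise TypeError(f"expected exactly a {type_name} RTT, got {rtt.describes}")
    return Obj(rtt, dict(zip(FIELDS[type_name], values)))


def ref_cast(obj, target):
    # Casts consult the RTT in the header, not the allocated layout.
    if not is_subtype(obj.rtt.describes, target):
        raise RuntimeError("cast failure")
    return obj


bar_rtt = Rtt("bar")
smuggled = ref_cast(struct_new("foo", bar_rtt, [1], require_exact=False), "bar")
print("y" in smuggled.fields)  # the cast passed, yet no storage for y exists
```

Calling `struct_new("foo", bar_rtt, [1], require_exact=True)` raises instead, which is the behavior the exactness requirement is meant to guarantee.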

FM: You don’t have $foo in the struct.new.

TL: that’s what we have here

FM: You don’t have $foo in struct.new

TL: This foo says what we’re allocating

FM: ...

TL: so get rid of ..

TL: Statically this global.get does have type ...

FM: But you’re allocating whatever the rtt actually is?

TL: Then if Bar did add new fields, we wouldn't have provided the initial contents for the fields here.

CW: I’m surprised by the dynamism of putting an arbitrary object in the vtable. More restrictions on what objects you can pick to be the RTTs?

TL: What kind of restrictions?

AR: You want to have different vtables? Even for classes that have the same shape?

CW: Do we?

TL: Absolutely. If we have 2 Java modules and one has a Points type with x,y fields and the other has a Pair struct with a, b fields, they may lower to the same type, but because they’re different nominal types, we might want them to have different RTTs.

CW: I thought we were saying that punning of types was an anti-goal? Is this the motivation?

AR: You have a class with a vtable and declare a subclass which overrides one method. Vtable types exactly the same and object types exactly the same on the wasm level but the vtable itself is different because you overrode a method. All shapes are the same.

CW: We could allow that capability, but because we have this…

AR: It’s essential you have to allow it.

TL: Today all toolchains make sure all the source types map to different wasm types. That works only for closed-world compilation, but that’s cool for sheets but not good for fleets, for example. We need to lean into using structural types and structural types.

SD(chat): Even in the same modules, you need dynamically-allocated vtable instances, of a unique $jl.Object_Array.vtable type, to implement the covariant array types of Java/JVM.

CW: Wouldn’t expect wasm structural types to be rich enough to do all the dynamic code loading of source languages?

TL: Dynamic code loading is a separate thing.

CW: Let me be more careful. At the limit what you’d do is have one wasm type and all source level types are done with some kind of extension you dynamically check.

TL: Yeah. And the extension point where you differentiate the source types is because they have different RTT values.

DD: You don’t actually have this problem if you view vtable structs as an extension to RTTs. Don’t have them as 1:1 exact structure. If you had the implementation rtt being the foo one and the foo.vtable is “glommed on” you wouldn’t have this soundness issue?

TL: Right. If the actual RTT info there was for Foo, you wouldn’t have the soundness problem, but that …. The RTT info put into the top instance on the slide should be for Foo.

DD: Would think that if you view foo.vtable and bar.vtable as a type of struct that extends an RTT but aren’t themselves containing an RTT you could still say that you fill in that field only when you do struct.new. Could construct some extension with a hole or maybe copy or maybe not. Somewhat tricky.

TL: Maybe some magic could make it work, but let’s talk about the proposed solution which is less magical.

IR: What if I have two structurally equal source type definitions but different vtables. How to compile? Same structural type representation in wasm but want to describe with two different vtables.

TL: If my types Foo and Baz had the same structure but different vtables, I would have a local Baz.vtable with the same wasm type but would allocate a separate RTT instance for it. If the vtable types don’t match, then Foo and Baz are different types, because the custom RTT structure is part of the canonicalized structure of the described type.

IR: I can have nominal types now?

TL: No same as before. Canonicalization still happens at the rec-group level

CW: Another mechanism of compiling a nominally typed source language where each type has a custom rtt.

TL: True.

CW: You’d need to copy more or have some extra custom field?

TL: In terms of its effect on type canonicalization (how the identity and structure of a type are determined), descriptor and describees are the same.

CW: Intuitively that doesn’t seem true. You could have two different descriptors with the same static type. Do you not get different casting behavior? If I make a second instance of foo.vtable and I take each separate object and create a struct where they get their copy. Can I interchangeably cast between?

TL: No.

CW: If I make a new instance of Foo.vtable and make new structs where each has its own copy of the vtable, can I cast between them?

TL: Yes.

NS: Weird you can assign a descriptor which assigns a vtable but then can pass in a different vtable in constructing the instance.

TL: Great point. It IS weird. Let’s move on to the next slide. This blows up and is very bad. What’s the fix? We need to make sure it’s a $foo.vtable and NOT a Bar.vtable, so we’ll just put that in the type. Now it would not validate. You can no longer smuggle in a bar.vtable in place of a foo.vtable.

(Slide: Exact Reference Types)

TL: Today type hierarchy looks like this. Nullable then non-nullable. Let’s add subtyping between any/$foo. (explaining diagram). Let’s add “exact” dimension here and complete the lattice. Nullable exact any and non-nullable exact any. Exact subtype of inexact. Non-nullable subtype of nullable. Two independent dimensions so for any heap type we have 4 reference types.

FM: Exact any not possible?

TL: True!

LW: Did we end up adding final…?

TL: Yes, we already have a dimension of final or not-final on heap type definitions. It optimizes casts and call-indirects. This is different because it’s on the reference type, and not on the heap type.

LW: Why do we need it on the reference type as well as the heap type?

TL: In order to fix this problem every custom RTT type would need to be final and then couldn’t have subtyping. Good question! This is great albeit complicated. Independently useful though. Lots of extra type information and great for Binaryen’s optimizations, unrelated to this presentation though. If we leave it at this though type hierarchy isn’t a lattice any more. No shared subtype of “ref exact any” and “ref exact foo”. No lattice any more but we want a lattice. Let’s talk about the bottom of type hierarchy.

TL: To restore it to a lattice, we’re going to special-case and add 2 new edges: 2 new edges come from ref null exact and from ref non-null exact. So now the greatest lower bound from ref exact any and ref exact none…and we’ve restored our lattice.
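
The two independent dimensions and the special-cased bottom edges can be sketched as a subtype check. This Python model is an interpretation of the notes, not the proposal's rules: `RefType`, `ref_subtype`, and the tiny `any`/`foo`/`bar` hierarchy are hypothetical, and treating references to `none` as subtypes of exact references is an assumption based on the "2 new edges" remark.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical heap type hierarchy: child -> declared parent.
PARENT = {"any": None, "foo": "any", "bar": "foo"}
BOTTOM = "none"


def heap_subtype(a, b):
    """Ordinary (inexact) heap subtyping, with `none` below everything."""
    if a == BOTTOM:
        return True
    cur = a
    while cur is not None:
        if cur == b:
            return True
        cur = PARENT[cur]
    return False


@dataclass(frozen=True)
class RefType:
    heap: str
    nullable: bool = False
    exact: bool = False


def ref_subtype(a, b):
    """Is reference type `a` usable where `b` is expected?"""
    if a.nullable and not b.nullable:
        return False  # nullable is never a subtype of non-nullable
    if b.exact:
        # An exact target admits only the same heap type (exactly), or bottom.
        return a.heap == BOTTOM or (a.exact and a.heap == b.heap)
    return heap_subtype(a.heap, b.heap)


def ref_types(heap):
    """The four reference types for one heap type: nullable x exact."""
    return [RefType(heap, n, e) for n, e in product([True, False], repeat=2)]


for rt in ref_types("foo"):
    print(rt, ref_subtype(rt, RefType("foo", nullable=True)))
```

In this model an exact reference to `bar` is accepted where an inexact reference to `foo` is expected, but not where an exact reference to `foo` is expected, which is what blocks the smuggled vtable.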

CW: Eventually we’ll have shared here too?

TL: oooooh yeah… Gonna be a lot of fun! Just don’t type wast tests.

CW: When in the example we alloc’d 2 types having the same vtable, that is disturbing because…

(Slide: Other Instructions)

TL: I’ve got a slide for that! Let’s fast forward. Types all good. Here’s ref.get_rtt. Got an instruction to get the RTT. Get a reference to some X. Better have a descriptor and you get the struct that’s the descriptor. If it’s an exact reference to X then it’s an exact reference to Y. Can plumb exactness through type-system.

TL: Since you’re getting a struct, structs are subtypes of eq, so you can do equality comparisons on the struct.

RH: Bottom-exact now for unreachable now?

TL: No. You already have bottom-nullability, so… Subtyping works out. (notes lost the real answer)

TL: Now Conrad wants cast_rtt. Shorthand for get_rtt with ref.eq with some expected and if that succeeds then a cast.

TL: This cast_rtt instruction exactly describes the fast-path for ref casts engines use today. Hoists fast-path from engines into user-space. Says don’t fall back to slow path.

CW: There’s still the slow-path version as well?

TL: For normal refcast, yes. For this instruction, no.

CW: Engines already do slow cast? Doesn’t this make the slow cast scarier to implement?

TL: If you do a normal ref.cast on a type with a normal RTT, you’re probably going to want to use the slow path.

CW: Is this a slow path which only exists due to slow rtts?

TL: Existed already. If you fail a pointer comparison you fall back to the vector of supertypes and look at that.

CW: Isn’t the vector of supertypes a vector of things you do ptr comparisons again?

TL: Yes but you only put canonical RTTs in that vector of supertypes.

CW: Now you don’t have canonical RTTs?

TL: But in the vector of supertypes you ONLY put canonical RTTs.

TL: Cool, so this is THE way to do the fast cast in wasm and there’s no slow path.
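
The difference between the two casts can be illustrated with a toy model. This is a simplified Python sketch, not how any engine lays out its RTTs: the `Rtt` and `Obj` classes and both function bodies are hypothetical, with `ref_cast` standing in for a normal cast (pointer comparison, then a scan of the supertype vector) and `cast_rtt` for the fast-path-only instruction.

```python
class Rtt:
    """Hypothetical RTT object; casts compare identity, not contents."""

    def __init__(self, name, supertypes=()):
        self.name = name
        # Vector of supertype RTTs, consulted only on the slow path.
        self.supertypes = tuple(supertypes)


class Obj:
    def __init__(self, rtt):
        self.rtt = rtt  # engine-managed header slot


def ref_cast(obj, target_rtt):
    """Normal cast: pointer comparison first, then the supertype vector."""
    if obj.rtt is target_rtt:
        return obj  # fast path
    if any(s is target_rtt for s in obj.rtt.supertypes):
        return obj  # slow path
    raise RuntimeError("cast failure")


def cast_rtt(obj, expected_rtt):
    """Fast path only: succeed on pointer equality, otherwise trap."""
    if obj.rtt is expected_rtt:
        return obj
    raise RuntimeError("cast failure")


foo_rtt = Rtt("foo")
bar_rtt = Rtt("bar", supertypes=[foo_rtt])
bar_obj = Obj(bar_rtt)

ref_cast(bar_obj, foo_rtt)  # succeeds through the supertype vector
cast_rtt(bar_obj, bar_rtt)  # succeeds through pointer equality
try:
    cast_rtt(bar_obj, foo_rtt)
except RuntimeError as err:
    print("cast_rtt to the supertype RTT:", err)
```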

(Slide: Attaching JS prototypes)

TL: I have a counter-type. It’s got a custom vtable and a field which is the counter value. The vtable describes counter and has two methods: increment/get. Normal function pointers. Also going to have extra “prototype” field which is an externref. This is the JS prototype. Here’s normal function types for increment/get. No surprises.

TL: In this same wasm module, I’ll have a global which is import: an extern ref which is the prototype of counters. Then I’ll declare a global which is the vtable for counters. The vtable is going to be allocated with the imported prototype and 2 function ptrs, and we’re going to annotate it with an annotation specified in the JS spec (not the wasm spec). (Ah, this should be “ref exact” actually.) So we’ll take this value and store it not only in the user slot but also the engine-managed prototype slot.

TL: Funneling the imported prototype through here and via this annotation to tell the engine to put it in the prototype slot. It’s how JS resolves methods and goes to wasm. We’ll allocate a counter in another global and use the vtable and new_default. (assumes new_default can take vtable as a parameter, discuss this later)

TL: On the JS side, we’ll create an empty obj to be the prototype. We’ll instantiate our module and as the prototype for prototypes.counter, we’ll …. We’ll install these little get and inc wrapper functions on the object. They take “this” and pass it to the exported counter.get and counter.inc functions. Now we can get our exported counter instance. It’ll look up the export and return 0, etc.

TL: Everything comes together so nicely! Ergonomic! Lovely!

CW: I had thought we were hoping to map wasm objects to JS structs with this proposal. Is that happening here?

TL: Not at all. That’s also a goal we should talk about with Shu offline.

CW: Are we happy that we may have both happening?

TL: Let’s talk about that more offline. Should discuss more with the whole group in the future.

TL: So I didn’t discuss this deduplication in the type section, nor did I discuss the declarative thing we want to do to save startup cost of filling out all these prototypes.

TL: Bikesheds! Discussions! Votes!

CW: Is “final” or “exact” just part of the name?

TL: Yes.

Naomi: When you cast to type bar and compare that there’s the same pointers, how do you know what the vtable pointer is. When you say descriptor something does that point to the type or the value. Same name?

TL: “anyref” is the castee. The “ref null exact?” is the actual vtable we’re comparing against.

(Slide: Other instructions)

Naomi: Ok, how is subtyping possible if foo’s vtable pointer is not equal to bar’s?

TL: So a normal cast would succeed because subtyping is allowed. But this cast would just trap. This cast doesn’t support “Well this is actually a subtype of the target”. This cast does only the fast path where we have ptr equality.

TL: Phase 1 poll! Any objections? No objections; let’s go eat!

ESM Integration Update (Guy Bedford)

Slides: https://docs.google.com/presentation/d/1qCyJOwc4Id-wPCGJTWnKSiUHHC8iHP0hIL8_tkljbhU/edit?usp=sharing

GB: Update, no phase advancement. Everything is pretty stable now.

GB: History: April 2018 presentation by Lin Clark. (The rest is history)

TS: What’s the Abstract Module Source percent notation with percent signs?

GB: That’s for future evolution. The percents are part of ECMA-262 notation for builtins that are not available from the global object.

MF (in chat): FYI this is not actual valid JS source text.

CW: Is there a JS version of source phase import

GB: Work in progress (phase 2.7 in TC39)

GB: Good for workers

CW: One of the things I bump into when creating workers is the cross-origin policy being restrictive. Is the module here considered to be the same origin as the rest of the page? That would be really nice.

GB: I would have to page that back in. I can get you the actual information for that offline.

GB: Can construct JS modules from text

FM: Isn't this like eval?

GB: Yes, there are other folks in TC39 who care a lot about eval who are owning that piece.

GB: Oct 24 Chromium lands source phase. Dec 24 JSC parsing of source phase. Jan 25 Deno, Moddable ship.

GB: Sync import maps across workers is a problem. Plan to support inheritance of import maps.

CW: Some things in import maps are functions?

GB: The import map is just a map of import specifiers to urls (strings).

CW: So when you copy that, you’re still going to separately fetch from the URL.

GB: Sources are just data, but they include the URL
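
Since an import map is plain string data, both lookup and copying to a worker are simple. The sketch below is a deliberately simplified Python model: `resolve` is a hypothetical helper, the `example.com` URLs are made up, and it ignores scopes and URL normalization that real import map resolution handles.

```python
import json


def resolve(import_map, specifier):
    """Map an import specifier to a URL string, or None if unmapped."""
    imports = import_map.get("imports", {})
    if specifier in imports:
        return imports[specifier]
    # Keys ending in "/" act as prefixes; prefer the longest match.
    prefixes = [k for k in imports if k.endswith("/") and specifier.startswith(k)]
    if prefixes:
        best = max(prefixes, key=len)
        return imports[best] + specifier[len(best):]
    return None


# Hypothetical import map; specifiers and URLs are placeholders.
parent_map = {
    "imports": {
        "app": "https://example.com/app.js",
        "lib/": "https://example.com/lib/",
    }
}

# Inheriting the map in a worker is modeled as copying strings; the worker
# still fetches each resolved URL itself.
worker_map = json.loads(json.dumps(parent_map))

print(resolve(worker_map, "app"))
print(resolve(worker_map, "lib/util.wasm"))
```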

Module keying: If you have a source, then that source is associated with a source text, so in a sense the source is part of the key.

GB: Three specs involved: HTML, ECMA-262, ESM integration.

CW: What are cyclic module records? WebAssembly doesn't allow that.

GB: Not to each other, but you can have a Wasm/JS cycle. And it’s possible for Wasm to import JS functions that are not yet callable.

CW: For sharing the import map between parent and worker, in the worker you’re refetching the URL and expecting it to be cached. But also, the compilation of the module should also be cached; is that true?

GB: In theory yes, it could be.

GB: Evalish modules. Two string modules have a unique hash.

TL: Is there any notion of compile-time imports in JS?

GB: Let’s discuss that on the string builtins slide.

GB: Import defer integration. Going to stage 3 soon

GB: String builtins

GB: HTML concerns about blanket defn for wasm: scheme

CW (to RH): What was decided for names in the string builtins proposal?

RH: The names were fixed. We’re not hard-coding the single-quote. The first part of the specifier will always be “wasm:”

TL: How does configurable constants work with this?

RH: One option is, we just chose a name.

TL: So we could choose a name by which the string constants could be found under ESM, but retain the flexibility of the configurable module name in the plain JS API?

RH: Yes, we can make different choices for JS API and ESM integration because ESM integration doesn’t exist yet.

GB: Built-in strings are polyfillable in wasm if you instantiate modules yourself

Naomi: Does this mean if you have multiple modules, you could pass them the same options?

Naomi: Possibly people might end up having a module registry?

RH: you need control over what imports are provided. Will typically only import whole string builtins

GB: We could either have 'all' as the default set of builtins or we could require each set of builtins to be specified from day 1. We should have a follow-on discussion about this.

GB: Moving towards phase 4: Draft ESM spec avail, WPT PR, server-side JS is leading (Node/Deno)

ESM integration spec is complete, waiting for

CW: Having a cutoff makes sense. The main thing I see missing is the two Web VMs. Is this something we can decide closer to when we have those implementations?

GB: Yes, we can decide this at the point we have two Web implementations. I think Chromium is only looking to implement source phase and not instance phase. If we have a partial implementation in browsers, but server-side JS implementations have full support, would we want to go to phase 4?

CW: I’d be most comfortable if it was two Web implementations of all of the proposal.

BD: Has there been discussion about how to do shared memories, and how that would work with imports?

GB: That would just all be on the JS side. There are no magic semantics.

Spec Tec (Andreas Rossberg)

slides link: main/2025/presentations/2025-02-13-rossberg-spectec.pdf

AR: What is spectec about? Spec authoring today has lots of pitfalls.

AR: DSL for writing specs

AR: Single source of truth for wasm specs

AR: SpecTec can generate lots of different cool things. In this talk I’ll focus on just the generation of the spec text.

AR: Let’s see some actual watsup code.

AR: The ASCII syntax is meant to look familiar.

LW: Could you show the instantiate rule?

AR: (shows)

TL: How important are those TODOs?

AR: They are cosmetic. They’re not required to make this work, but they could make things prettier.

CW: Are we currently able to animate modules?

AR: I want to refactor how modules are validated and instantiated. Globals can be interdependent so doing them in one pass doesn’t scale anymore, so I want to refactor that.

BV: Could you show the binary parsing part of it?

AR: (shows) rules for binary encoding of block instructions; it’s structured as an attribute grammar.

BV: Which level would you pick up if you wanted to generate an actual parser?

AR: Animation is only concerned about execution right now. Last summer I implemented a meta-parser, that you’d give a binary encoding and it would run it. That kind of works, though there are some roadblocks still.

AR: In the fullness of time we could have plugins for different backends but for now the best thing is to just have interpreters for this.

BV: I recently spent some time trying to mechanise a parser by basing it on the OCaml parser, but that did not go well.

AR: Here, you could use the type definitions or syntax definitions which are more specialized to our use case and available as an AST.

MS: Is the tool available anywhere?

AR: Yes, it’s all public. TODO: links

MS: Did you define the factorization into the evaluation context? Or do you just have to define it once?

AR: Currently we don’t use evaluation contexts, but we would like to have them, especially for stack switching. We’ve done some experimentation but it’s an open problem. It’s also tricky to figure out how to handle it in all the different backends. But I would like to get back to it.

AR: progress on converting the spec document itself

AR: Almost complete on core spec. Progress on numerics but not complete. We can be incremental.

AR: Also incorporated all phase 4/5 proposals.

Summary: We’ve finished converting all the primary chapters, and all the phase 4/5 proposals except for threads which aren’t quite ready. In sync with the wasm 3.0 branch. We also tried stack switching, but there are still quite a few open questions.

We also wrote a bunch of docs, improved documentation, cross referencing, misc fixes in rendering.

In simple cases, SpecTec can generate exactly the same text that we have written by hand today.

AR: But in other cases, there are a few differences.

E.g. br spec was refactored into a small-step approach.

Also refactored parametric spec of arithmetic instructions. Able to make it more precise.

TL: This factoring seems fine, but would it be possible to tell SpecTec to inline some things, if we did want to factor things differently, such as to list everything out in the spec?

AR: I haven’t thought about it. It’s certainly possible in principle, but it would take work to be implemented.

Naomi: Is there any way to have spectec check that the new spec matches the old spec?

AR: No, because spectec only knows the new thing.

Naomi: What about future refactors?

CW: You generally want a single canonical source of truth, so you derive other representations from the canonical version

AR: One thing spectec has that I didn’t show is a hint system. All constructs can be annotated with hints, and we use that to customize the typesetting of various things. You might use that to make the result even closer to the hand-written spec.

In SpecTec we’re using slightly more descriptive metavariables. “nt” stands for “numeric type”.

TL: Is there any prose that defines the squiggly equals?

AR: Yes.

AR: There’s still quite a bit of manually-written prose. We generate the rules, but all the editorial text is still hand-written.

SL: How is nt dot relop with a subscript written in ASCII?

AR: It’s a type parameter, so you'd write it in parens.

AR: User experience can be improved.

AR: generated prose wording not perfect

AR: iterations still wip

AR: There is still some hard coding; e.g. the meta interpreter depends on the reference interpreter’s parser

TL: When I was playing with this, I hit many robustness issues. Has there been improvement?

AR: Yes, though things are still not where I’d like them to be. It’s pretty good for the frontend, and the LaTeX document doesn’t need to make many assumptions, but the other backends especially the interpreter backend do make a lot of assumptions and you still have to expect them to crash if you do something they don’t expect. I really hope we can improve that over time.

TL: This is one of the biggest risks for getting adoption of this within the CG.

AR: Agreed.

AR: Some of this will be solved by using the hint system to guide the backend to handle various constructs.

BT (in chat): no LLMs required :)

AR: Can generate tests, more tests and fuzzing matrices.

AR: Ready for adoption. Looking for a vote in a few weeks.

CW: I know we need to extend SpecTec to handle threads. Who can I rely on to help with these extensions?

AR: I have a slide on that later on, on the kind of SLA we can promise.

AR: Will continue to improve, fix and extend.

AR: Able to maintain for at least 2 years.

AR: Looking for people who want to contribute.

FM: What about integration with JS? Does the wasm embedding spec fit into this scheme?

AR: I haven’t touched the embedding spec yet; that’s one of the appendices. It would use SpecTec for generating the math formulas, but other than that, that’s a good question. I think we could define it as functions you could invoke. But other than that, it won’t be any more integrated with the JS spec. This is really a tool to generate our spec, and you’ll end up with the same kind of artifact. If you really wanted, you could integrate with Bikeshed and use some of this machinery, but I’d guess it’s not worth the trouble.

AR: The splicing mechanism is generic. We can generate markup for reST in Sphinx, or markup in LaTeX. And you could easily add something for Bikeshed too. But I don’t know if Bikeshed can handle LaTeX fragments like that.

AR: Getting rid of Sphinx would be a long-term goal that would require a lot more work. The only way to replace Sphinx would be to replace it with LaTeX. I don’t think you’d want to implement a whole new document formatting system.

BT: I appreciate the thought behind this. Skill transfer from other languages would reduce the risk. How can we onboard other people with their skillsets?

AR: No. Fair question. I don’t know if there’s a way to do that. These tools all have idiosyncrasies, and so does SpecTec. SpecTec isn’t a prover in itself.

CW: It might be useful to have “SpecTec for Coq users” etc.

AR: Writing SpecTec: I wrote a bunch of docs, there’s a tutorial, I think that’s ok. The bigger challenge is how we can onboard people to maintain the tool.

FM: If one could extend SpecTec to support other languages (I don’t propose we do this today), that would help with adoption of SpecTec and help ensure its future.

AR: Not really ready to advertise SpecTec to other people.

SC: Can we eliminate the need to learn OCaml?

AR: No

Room: laughs

??: What can you do with the meta interpreter?

AR: The meta-interpreter reuses significant parts of the OCaml interpreter today. The reference interpreter does all the parsing and decoding. This makes it brittle. I’d like to make the meta-interpreter more self-sufficient. We also use OCaml for all the numerics; I’d like to port all that to SpecTec. That gets tricky, especially with floats, but it’s doable with time. Validation might be the hardest part because generating a type-checking algorithm from a type system specification is a hard problem. Even doing it manually takes difficult proofs to show the algorithm matches the spec. Especially with all the subtyping, it’s getting more difficult.

SR (chat): We’ll have a workshop at PLDI’25 to discuss these issues: https://pldi25.sigplan.org/home/rpls-2025 Please come!

CW (chat): Unfortunately I found out about this workshop a little late, but if I’m not able to come, I’m interested in the results of the discussion!

AR: Most of the numerics are still in OCaml, but I started implementing some of the simpler ones, like addition. Overall, that’s an area we want to improve on.

BT: I agree that we should be specialized for our use case. There are things that can be done to make it easier for people to learn.

AR: The idea for syntax is that we try to adopt the standard on-paper syntax, so it looks familiar to people familiar with academic literature.

TL: I know the wasm spec and found that going to SpecTec was pretty easy. People in the community are well served today.

AR: Looking for a poll in the near future. Please take a look at the documentation and prepare for the vote.

TL: The docs Andreas has written are quite good.

TL: What’s the process? Do we vote and then just merge it all in the next day?

AR: No, there’s some engineering work to be done. Need to separate building the document from building the interpreter.

CW: Two additional procedural questions: At the point we adopt SpecTec, we’ll want to change the definition of Phase 4 to use SpecTec?

AR: Now is a good time because there are no major proposals that would be affected.

CW: And, how do you envisage this plugging into the W3C publishing cadence?

AR: For W3C it doesn’t matter how you generate the artifact, so we can change the toolchain. There are discussions we should have about how to do Candidate Recommendations and other things, but those are separable conversations.

DS: We have to make sure that we can pass the HTML validation rules.
