CG-10-11.md

October 26, 2022 · View on GitHub

WebAssembly logo

Agenda for the October 11th video call of WebAssembly's Community Group

Where: zoom.us
When: October 11th, 4pm-5pm UTC (October 11th, 9am-10am Pacific Time)
Location: link on calendar invite

Registration

None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.

Logistics

The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.

Agenda items

Opening, welcome and roll call
1. Opening of the meeting
2. Introduction of attendees
Find volunteers for note taking (acting chair to volunteer)
Adoption of the agenda
Proposals and discussions
1. Proposal: Linear-Memory GC-Root Marking #1459 (Ross Tate) [20 minutes]
  - Poll for Phase 1
2. Presentation: Performance Analysis of WebAssembly Calls (Ross Tate) [40 minutes]
  - Designed a very call-intensive program using very little memory
  - Wrote 19 implementations of it
    - 3 in JavaScript
    - 8 in (hand-written) core WebAssembly
    - 8 in C
  - Measured the performance of each
    - on Safari, Chrome, and Firefox in the case of JS and Wasm
    - natively in the case of C, compiled using gcc with different optimization flags: -O0, -O1, -O2, and -O3
  - Will present the findings of those measurements
Closure

Agenda items for future meetings

None

Schedule notes

Reach out to the WebAssembly CG chairs for agenda items that are time sensitive but can't be scheduled in one of the existing meetings.

Meeting Notes

DG: Don’t forget to fill out the registration form for the in-person/hybrid meeting (even if you plan to attend remotely)

Proposal: Linear-Memory GC-Root Marking #1459 (Ross Tate) [20 minutes]

Slides: (pdf) (pptx)

CW: how do you expect this to be implemented? The engine maintains a shadow stack?

RT: The idea is when you have a function with smart locals command at the top of it, on the stack frame it will enumerate which locals there are, stack map trick to implement natively

CW: so the point is that each time a frame is created you have a bitmap that shows where the locals are?

RT: Yeah, the tagged locals. Does that make sense?

FM: when you enter a function, those locals aren’t initialized. How does the GC know which local variables it should scan?

RT: So first off, the language is in charge of its own garbage collection so it’ll say when to run the garbage collection. It’s the application’s job to provide valid safepoints, and they should have correct values, the engine should provide the functionality to trigger GC

FM: so there’s just one bitmap for the whole function then.

RT: right.

PP: how does the scanning get initiated? (i.e. what’s the interface like)

RT: enumerate-marked-locals instruction will look at the first instruction, then the next and so on

PP: so it basically walls the stack from that point and finds them for you

AZ: the option to modify the roots seems like it would add overhead because it would inhibit optimizations? I guess that’s why it’s optional?

RT: It’s a part of the mark, anything that’s not marked is safe from all of this. What’s important for the VM, is that if I do a call and that has marked locals, after the call those locals are changed, if it’s not marked mutable then you can do the same optimizations if not, then don’t make assumptions about locals

CW: a high level comment about language direction. When we designed the GC proposal we talked about whether we wanted to to GC in linear memory and we decided that the direction we wanted to go. We’re just about to have a workable version of the GC proposal just about ready. Is this a good time to consider this?

RT: Concretely, these teams from the teams have looked at the GC proposal and have decided not to look at the current GC proposals

CW: i would like to hear from them directly, rather than mediated via your presentations.

AZ: Aside from the languages listed here, there are game engines etc. that will want to scan the stack - both the proposals have separate/important use cases and in the long run we will probably need both

RT: a concrete example we’ve discussed in meetings is the Erlang Beam team is that to implement the heap efficiently they need a discardable heap model. They did request that ability in the GC proposal but we decided not to do that. Also Julia really cares about concurrency/parallelism. Racket said that they wanted more control over finalizers and pointer tagging and wanted to control their own GC. Also last week we talked about Haskell, who said they needed lazy values, but the i31ref design doesn’t work with that.

AK: Can you address the high level question of doing n things, for the attention of the group not to be distributed amongst different proposals in the same space?

RT: this comes up with the stack proposal a lot. When meeting people about the stacks proposal, the majority told us they’d like to have this along with the stack proposal because they would have a bundh of GC roots on each stack. So that’s how this came up now.

CW: Would you say this is dependent on the stacks proposal?

RT: Orthogonal, but seemed to come up quite a bit, addressing the timing question

AR: how would this interact with the stack proposal?

RT: <Fiber reference?>

FM: that detail would have to be resolved in future stack. The main question is “why now”. I’d say the GC proposal… this is much earlier in the cycle than GC is. It’s a bit overlapping but not really parallel.

RT: You can tell people the only way to get a GC, the only way to have a GC is to use the current proposal, they have to use the existing one.

JM: Has anyone tried this and measured performance of some kind of GC within wasm itself? Compared to <> what is the performance difference?

RT: In Wasm no one can know because it’s not available. Natively, there’s no one that has compared the two approaches

PP: By GC, do you mean the GC proposal?

JM: For the previous discussion on the GC proposal there has been a discussion about use cases we won’t address immediately. Are there a proof of concept or prototypes that have compared performance?

RT: as in “have these languages tried compiling to wasm and performance compared to

AZ: there are just going to be languages that can’t use the wasm GC proposal, so comparison really isn’t possible. Also this isn’t really about performance. We already have major languages/companies that are already doing it in wasm, e.g. in game engines. They are managing their own stack, and they have to do GC only when nothing is on the stack because they can’t get the roots. So it’s not just about speed, it's even just having the ability to do this at all.

CW: This bitmap is a more efficient way of using a shadow stack?

RT: it sounds like what Alon is saying is that they aren’t even opting to do it, because it’s too cumbersome. But yes you can do it without this using a shadow stack, and this would mean you wouldn’t have to do that.

PP: C# uses linear memory C# and go are both languages who are using this because

RT: This is linear memory, and not this proposal in particular

CW: This is not accurate because the GC proposal hasn’t been standardized, so we won’t know who can’t use it.

RT: for Go’s performance considerations, if you read their stuff on GC, ,they rely heavily on the fact that a lot of objects are allocated on the stack. That’s not anywhere on the plans for the GC proposal.

JM: In that case the performance difference is still not clear, they don’t know what the performance would look like natively?

AR: i’d say that there will always be reasons to use linear memory. So the interesting comparison is between this and just using the shadow stack in linear memory.

AZ: That’s a valid thing to compare, maybe I want to be clear that there are cases where people that can’t use a shadow stack, there are use cases that can’t use a shadow stack because of the size of data structures, they rely on conservative stack scanning, and this would give them a way to do that

AR: how is that true? If they can’t know what to put in the shadow stack, they can’t know what to mark?

AZ: but if they scan the whole stack conservatively they can always find the roots

AR: You could conservatively put everything ont he shadow stack, but that’s probably not what you want.

AZ: maybe so. There might be something to measure. But I do think that doing this worst-case with the shadow stack manually, the performance seems like it would be obviously bad. But I guess you’d need to measure even when putting every i32 on the stack.

LW: i wonder if Binaryen could even polyfill this using a second memory, you could just put the whole stack in that

TL: There’s work underway to have asyncify have a second memory, basically this would be the same thing. Related work is underway, there could be a way to measure this

PP: I think it’s necessary to use a second memory. If you enter functions you just adjust the pointer, like how we do the C stack.

RT: I tried looking up performance of shadow stacks, it’s hard to get performance numbers of the shadow stack that’s recent

PP: realistically what’s going to be the difference between writing code in the engine that will walk the stack for you, vs writing that code in wasm, it seems like it wouldn’t be much. Bounds checks could affect it. If you're using a shadow stack, as soon as you try to compile 2 modules that both use it, they could affect each other’s choices, it gets complex.

AR: We have some data points about performance/implementation complexity of shadow stacks from C/C++ for different purposes, we have some data points that it’s not as performance intensive in itself

DS: It’ll be a lot more expensive for a GC language, because you’ll have more tracking <> In general I agree with Alon, and you that there will be use cases that will want to try this so that’s sufficient for a Phase 1.

CW: who is going to try this? Do we have anyone interested in doing an experiment?

RT: They’re not involved at the moment, but I can reach out

CW: to me the bottleneck would be engine implementers, right?

PP: To measure it wouldn’t be a bottleneck because you don’t have to merge it into the engine

RT: it sounds like either we need an engine team to put together a prototype, or get a compiler team to do a native implementation and compare a shadow stack in memory vs. something like this?

AR: to me you really need an engine, the latter comparison wouldn’t be very useful because there are so many confounding factors.

RT: Should we poll for Phase 1?

CW: I would like to see some concrete numbers before we move too far forward

DS: I would like to get a sense of who is interested this before just punting it

RH: Compared to other Phase 1 proposals, we’ve haven’t had commitment for other folks to work on this

JM: it seems like having a phase 1 proposal is a good way to say that we have something out here and get

AR: I think a phase 1 proposal is fine, it’s not totally ridiculous and i would like to see the results, it’s just not clear that we have commitments to implement it.

RT: One idea is that someone targeting this, can make a post compilation step instead of polyfilling

DS: Let’s poll for phase 1,

SF: 2

F: 28

N: 8

CW: I would like to say that we should be careful about confusing people who can’t adopt the GC proposal with people who just haven’t for other reasons.

Presentation: Performance Analysis of WebAssembly Calls (Ross Tate) [40 minutes]

Slides: (pdf) (pptx)

Microbenchmark Source: (gist)

FM: what’s the difference in factors in the different steps between the C and wasm version?

RT: In the C version you get performance cut in half by inlining, every percentage after, in Wasm you get double-triple, and varying after. It’s very common for the list library to be compiled in a separate module from the module compiling that’s compiling the foreach loop over, so there will be an overhead in Wasm that’s not comparable to native

PP: Separate compilation has an overhead, what happens if you run it in Binaryen?

AZ: if it places the global in linear memory, if the address is taken, then the optimizer can’t help, but it depends on that.

PP: The middle line doesn’t do that, not sure

RT: natively, separate compilation doesn’t add a lot of overhead, but with wasm it’s adding a lot. I should clarify that i’m compiling with O1. if I do O2 it’s all the same, except that it inlines the switch.

PP: The very right side of slide, that’s the offset, what’s the difference overall, what would be a way to get there? What’s your baseiline? [Breaking down performance slide]

RT: that last line is basically if I were to compile say an ahead of tiem version of Java (but ignore GC), but I have classes, my list library get imported, it imports the vtable for the program, and what are the offsets in the vtable, and it calls indirect from those. PP: That 590 vs 640, what’s the difference between those two? (to the right)

RT: here, I’m doing direct calls. Natively it’s the sort object file has imported global addresses and i use that as function pointers, you load an offset from that and call through it.

PP: Do you have sources anywhere? Wasn’t sure what to do with them

RT: On my computer, have a small web page for the browser comparisons?

PP: it would also be interesting to test with standalone engines.

SC: by “separate compilation” do you also mean “separate deployment” where the runtime is doing the linking?

(RT adding response to the missed above question after the fact: Yes. The separately compiled native binaries are generated such that they reflect the image of what a runtime could generate when compiling/instantiating/linking at run time, in line with what common runtimes for other common virtual machines do.)

RT: I don’t know why switching is more expensive in Wasm than natively, again for whole program compilation you can bypass that to use a switch and bypass call-indirect. Firefox vs. Chrome performance difference is significant, Firefox does better - implementation details between the engines cause this <> callee side signature side works better, it’s more expressive in a variety of a different ways, arbitrarily large functions can be used without a performance cliff.

CW: are any of these benchmarks using web workers?

RT: No, we’re building expectation that instantiate is a cheap operation