CG-10-12.md
March 9, 2022

Agenda for the October 12th video call of WebAssembly's Community Group
- Where: zoom.us
- When: October 12th, 4pm-5pm UTC (October 12th, 9am-10am Pacific Time)
- Location: link on calendar invite
Registration
None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.
Logistics
The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.
Agenda items
- Opening, welcome and roll call
- Opening of the meeting
- Introduction of attendees
- Find volunteers for note taking (acting chair to volunteer)
- Adoption of the agenda
- Proposals and discussions
- Review of action items from prior meeting.
- Update on Garbage Collection (Thomas Lively et al.) [30 mins]
- Update on Typed (Function) References (Andreas Rossberg) [20 min]
- Closure
Agenda items for future meetings
None
Schedule constraints
None
Meeting Notes
Opening, welcome and roll call
Introduction of attendees
Jacob Mischka
Steven Prine
Flaki
Slava Kuzmich
Lars Hansen
Chris Fallin
Richard Winterton
Michael Knyszek
Saul Cabrera
Yuri Iozzelli
Jacob Abraham
Francis McCabe
Paolo Severini
Zhi An Ng
Sean Westfall
Thomas Lively
Alex Crichton
Petr Penzin
Conrad Watt
Rick Battagline
Dan Gohman
Luke Wagner
Jakob Kummerow
Alon Zakai
Pat Hickey
Vibhav Pant
Jlbirch
Ben Titzer
Zalim Bashorov
Mingqiu Sun
Arun Purushan
Heejin Ahn
Nick “fitzgen” Fitzgerald
Peter Huene
Ross Tate
Andreas Rossberg
Find volunteers for note taking
Acting chair to volunteer
Proposals and discussions
Update on Garbage Collection (Thomas Lively, Alon Zakai, Jakob Kummerow) [30 mins]
TL presenting Slides
CW: Is that optimization done as a Wasm to Wasm pass?
AZ: yes, Wasm to Wasm
CW: Does that mean producers are missing the opportunity? Or are all the optimizations in the Wasm to Wasm pass?
AZ: Dart here is based on an AOT compiler path, so it can be a missed opportunity in some sense. The escape analysis here is productive only because of the further optimizations Binaryen does later.
RT: We made a system that compiles through LLVM, and we found that devirtualization wasn't useful unless it unlocked other optimizations. This sort of chaining in user space is important.
AZ: Good point. Devirtualizing a call does little by itself, but it allows us to inline, do escape analysis, and optimize further. Unsure why this example doesn't happen in the Dart AOT.
CW: when measuring the cast overhead, is this running a version of V8 that nops the cast?
AZ: Yes, that’s how it’s done. It’s an unsafe measurement
CW: The overhead is lower than expected, remaining overhead is mysterious
BT: for null checks, is that because V8 has a null object that needs an explicit branch?
JK: Yes
CW: It takes several minutes to compile - without type canonicalization?
JK: yes, if the IR graph happens to have a pattern that runs into a quadratic algorithm, e.g. in register allocation
CW: Are these problems solvable by reengineering passes in V8, this is not inherent to Wasm?
JK: not inherent to Wasm at all, and not much Wasm can do about it. The toolchains that produce modules have to be careful. E.g. if Binaryen inlines everything it can, the resulting function can be too much for a JIT compiler. The inlining budget is a knob we can turn. Some optimization passes are just expensive and take time.
BT: The era of wasm engines doing dynamic inlining is about to begin :)
CW: :'(
BT: CPUs do SSA. You can’t stop the hardware getting smarter :)
LW: Hopefully in a way that can attempt to preserve as much of the predictability of the cost model as possible…
BT: I think we can address that with a compilation hints section
CW: worried about having deopts in Wasm
AR: Still confused about the call ref thing, create an auxiliary table of size one, put, and do a call ref on that, why are inlined calls needed?
JK: because a call that you don't do because you inlined, is an order of magnitude faster than non-inlined
AR: Fair, if you want to inline it?
CW: Is this an optimization that you would have done for call indirect anyway, and needed more for the number of indirections in GC?
JK: yes, accurate. We can consider doing it for call_indirect afterwards
BT: everything in call_indirect table has Wasm calling convention, if they are JS they have been wrapped by wrapper, you don't have to check if it is a JS function. call_ref has to do several checks to see if it is a JS function.
AR: Why?
BT: That’s V8, everything in the heap is a JS object.
AR: idea with other JS values passing the boundaries is that conversion is done at the boundary, then you have the exact same situation, you save the type check and reading from table. What do you do if you store the ref in the table, how do you wrap that then?
BT: Stuff might be generated when you write into the table, table writes are expensive in JS
AR: now that we have table.set.
BT: also expensive, how signatures are canonicalized, can be a broadcast write to multiple different instances that import a table, due to how signatures are canonicalized to a unique number
AR: Would that also be true for typed tables? Or any func refs?
BT: even if they were typed func refs, anything that can be call_indirect-ed. Not for table with non-funcref type.
AR: If you have a typed function ref, then you also don’t need to canonicalize the types?
BT: goes back to how the type check for call_indirect works. You have to check for the right signature, and if you can't do that statically, the dynamic check uses this unique identifier.
AR: that's the case if you have a concrete function type
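For readers following the call_indirect exchange above, here is a minimal sketch of a dynamic signature check based on canonicalized signature ids. It is an illustration only, not V8's implementation: `SignatureRegistry`, `TableEntry`, and `call_indirect` are hypothetical names, and a Python list stands in for a Wasm table.

```python
class SignatureRegistry:
    """Maps structurally equal signatures to one small integer id (hypothetical)."""

    def __init__(self):
        self._ids = {}

    def canonical_id(self, params, results):
        key = (tuple(params), tuple(results))
        # A signature seen for the first time gets the next free id;
        # a repeated signature gets the id it was given before.
        return self._ids.setdefault(key, len(self._ids))


class TableEntry:
    """A table slot: the callee plus the canonical id of its signature."""

    def __init__(self, sig_id, func):
        self.sig_id = sig_id
        self.func = func


def call_indirect(table, index, expected_sig_id, *args):
    """Bounds check, null check, then one integer comparison on the signature."""
    if not 0 <= index < len(table):
        raise RuntimeError("trap: undefined element")
    entry = table[index]
    if entry is None:
        raise RuntimeError("trap: uninitialized element")
    if entry.sig_id != expected_sig_id:
        raise RuntimeError("trap: indirect call type mismatch")
    return entry.func(*args)


registry = SignatureRegistry()
sig_i32_i32 = registry.canonical_id(["i32"], ["i32"])
sig_f64_f64 = registry.canonical_id(["f64"], ["f64"])
table = [TableEntry(sig_i32_i32, lambda x: x + 1), None]
```

The point of canonicalization in this sketch is that the per-call check is a single integer comparison, at the price of every table write having to store the right id alongside the function.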
CW: Independent of this, the large part of the speculative optimization is based on inlining, which is orthogonal to this discussion.
JK: if we want to be competitive with JS or Java on the JVM, we have to do inlining on non-direct calls.
RT: agree, but also note that our AOT compiler doesn't do any speculation or inlining and is competitive with JS or Java. Not sure if it is as necessary. It varies by benchmark.
JK: Depends on the benchmark
AZ: it does vary a lot. In Java every call is virtual. This is much more serious now.
RT: we have the AOT compiler, they run relatively the same as Java, C#...
BT: Speculative optimization by gathering indirect call targets is valuable, we should measure this going forward with Wasm engines so we can measure where our overheads are.
AR: with Wasm, assumption is that most heavy-lifting is done beforehand, that can't cover dynamic optimizations. Worried that we get too deep into that kind of thing, depending on too much magic in Wasm engine, that was the antithesis to the intents of the original design.
JK: We want to do as much as we can ahead of time, but there are two limitations. First, we are concerned about module size: uncompressed modules are still two-digit MBs, 10-15 MB with the names section stripped out. Second, some calls can be devirtualized, but if you use Java the way it is permitted, some calls are not possible to predict ahead of time.
CW: what does the JVM do in that case? does it do the unsound inlining?
BT: not unsound, but JVM has an inline cache to collect receiver methods, will inline one or more targets if the distribution is skewed
CW: Unsound is the wrong word given the fallback
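The inline-cache idea described above can be sketched as follows. This is a toy model under stated assumptions, not how any JVM is implemented: `CallSiteProfile`, `speculative_call`, the 0.9 skew threshold, and the 100-sample minimum are all made up for illustration.

```python
from collections import Counter


class CallSiteProfile:
    """Per-call-site counter of observed receiver types (hypothetical name)."""

    def __init__(self, skew_threshold=0.9, min_samples=100):
        self.seen = Counter()
        self.skew_threshold = skew_threshold  # assumed cutoff, not a JVM constant
        self.min_samples = min_samples        # assumed warm-up requirement

    def record(self, receiver):
        self.seen[type(receiver)] += 1

    def dominant_type(self):
        """Return the receiver type worth speculating on, or None."""
        total = sum(self.seen.values())
        if total < self.min_samples:
            return None
        cls, count = self.seen.most_common(1)[0]
        return cls if count / total >= self.skew_threshold else None


def speculative_call(receiver, speculated_type, inlined_body):
    """Guarded fast path; the fallback keeps the speculation sound."""
    if type(receiver) is speculated_type:
        return inlined_body(receiver)  # stands in for the inlined target
    return receiver.area()             # ordinary virtual dispatch


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return 3 * self.r * self.r  # integer "pi" keeps the example exact


class Square:
    def __init__(self, s):
        self.s = s

    def area(self):
        return self.s * self.s


profile = CallSiteProfile()
for shape in [Circle(1)] * 95 + [Square(2)] * 5:
    profile.record(shape)

hot = profile.dominant_type()  # Circle accounts for 95 of 100 samples
```

The type guard in `speculative_call` is what makes the inlining safe: a receiver that does not match the speculated type simply takes the normal dispatch path.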
BT: another thing JVM does is dynamic class hierarchy analysis, for a given class method, has it been overridden, if not it will devirtualize and inline. If you instantiate an object of a class that overrides it, it will deopt code.
CW: Horrifying
BT: beautiful at the same time
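The class hierarchy analysis BT describes can be sketched roughly as below. This is a simplified model with hypothetical names (`ClassHierarchy`, `CompiledCode`, `register_class`, `try_devirtualize`). It assumes invalidation is triggered when an overriding class is registered, and it ignores inheritance details that a real VM must handle.

```python
class CompiledCode:
    """Stands in for optimized machine code that embeds an assumption."""

    def __init__(self, name):
        self.name = name
        self.valid = True

    def deoptimize(self):
        # A real VM would fall back to unoptimized code here.
        self.valid = False


class ClassHierarchy:
    def __init__(self):
        self.implementers = {}  # method name -> set of classes defining it
        self.dependents = {}    # method name -> code assuming a single definition

    def try_devirtualize(self, method, code):
        """Return the unique implementing class, recording the dependency."""
        impls = self.implementers.get(method, set())
        if len(impls) != 1:
            return None
        self.dependents.setdefault(method, []).append(code)
        return next(iter(impls))

    def register_class(self, class_name, methods):
        """Add a class; deoptimize code whose single-target assumption broke."""
        invalidated = []
        for method in methods:
            impls = self.implementers.setdefault(method, set())
            impls.add(class_name)
            if len(impls) > 1:
                for code in self.dependents.pop(method, []):
                    code.deoptimize()
                    invalidated.append(code)
        return invalidated


hierarchy = ClassHierarchy()
hierarchy.register_class("Animal", ["speak"])
caller = CompiledCode("caller_of_speak")
target = hierarchy.try_devirtualize("speak", caller)  # only Animal defines speak
broken = hierarchy.register_class("Dog", ["speak"])   # an override appears
```

In this model the optimized code remains valid only while its assumption holds. Once `Dog` overrides `speak`, `caller` is marked invalid and later devirtualization attempts for `speak` return `None`.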
AR: hope we haven't created another Java here. In the ideal world we can do all this in the user space, can JIT from within Wasm, express that in Wasm itself. That's a long shot, if ever possible.
BT: This is the kind of stuff to figure out over the next few years. GraalVM is a potential target for these types of optimization. The JIT may inline a constant, and if you change the global it will change the code. This method lets you do dynamic call hierarchies etc.
AR: question is how many shortcuts we want to take before we have the full-blown machinery. Might have pressure to take shortcuts that are very ad hoc.
CW: Off the wall, not realistic…. Never mind
DG: a lot of work in V8, have other engines prototyped this and can share experiences?
LH: We have a partial prototype, not spending a lot of time on that right now.
LW: experimentation is valuable, to see what the optimizations are to be competitive with JS. Once we have parity, we can ask where the optimizations should be. Getting to parity is a prerequisite, biggest risk to address.
JK: Is the Dart compiler missing out on optimizations? Any time we have a system providing an unoptimized result, is there a missed optimization opportunity in the other system? This is an intentional choice, any optimizations implemented in Binaryen will benefit the entire ecosystem
CW: agree that what Binaryen is doing is incredibly valuable. What does Dart expect its burden to be, do they expect to delegate to Binaryen?
JK: Lot of optimizations on their end, it makes sense to have both language specific optimizations on their end, and language independent optimizations in Binaryen
BT: advantage of doing optimizations in the VM is debugging, tracing through your programs. If you use an offline/AOT compiler, you can inline things and you don't see their frames. If you generate unoptimized code, and the VM supports debugging, you can step through. E.g. javac doesn't inline. If you take an ecosystem view, generating unoptimized code is a good thing if you want to debug it; not everything is a production deployment where you run optimization passes which make it harder to debug.
RT: Context: the Dart optimization does whole program optimizations, dynamic classes relying on dynamic optimizations, so we’re relying on different things