CG-10-12.md

March 9, 2022


Agenda for the October 12th video call of WebAssembly's Community Group

Where: zoom.us
When: October 12th, 4pm-5pm UTC (October 12th, 9am-10am Pacific Time)
Location: link on calendar invite

Registration

None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.

Logistics

The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.

Agenda items

Opening, welcome and roll call
1. Opening of the meeting
2. Introduction of attendees
Find volunteers for note taking (acting chair to volunteer)
Adoption of the agenda
Proposals and discussions
1. Review of action items from prior meeting.
2. Update on Garbage Collection (Thomas Lively et al.) [30 mins]
3. Update on Typed (Function) References (Andreas Rossberg) [20 mins]
Closure

Agenda items for future meetings

None

Meeting Notes

Opening, welcome and roll call

Introduction of attendees

Jacob Mischka

Steven Prine

Flaki

Slava Kuzmich

Lars Hansen

Chris Fallin

Richard Winterton

Michael Knyszek

Saul Cabrera

Yuri Iozzelli

Jacob Abraham

Francis McCabe

Paolo Severini

Zhi An Ng

Sean Westfall

Thomas Lively

Alex Crichton

Petr Penzin

Conrad Watt

Rick Battagline

Dan Gohman

Luke Wagner

Jakob Kummerow

Alon Zakai

Pat Hickey

Vibhav Pant

Jlbirch

Ben Titzer

Zalim Bashorov

Mingqiu Sun

Arun Purushan

Heejin Ahn

Nick “fitzgen” Fitzgerald

Peter Huene

Ross Tate

Andreas Rossberg

Find volunteers for note taking

Acting chair to volunteer

Proposals and discussions

Update on Garbage Collection (Thomas Lively, Alon Zakai, Jakob Kummerow) [30 mins]

TL presenting Slides

CW: Is that optimization done as a Wasm to Wasm pass?

AZ: yes, Wasm to Wasm

CW: Does that mean producers are missing the opportunity? Or are all the optimizations in the Wasm to Wasm pass?

AZ: Dart here is based on an AOT compiler path. It can be a missed opportunity in some sense; the escape analysis here is productive only because of further optimizations Binaryen does later.

RT: We made a system that compiles through LLVM, and we found that devirtualization wasn’t useful unless it unlocked other use cases. This sort of chaining after the user space is important.

AZ: good point, devirtualizing a call does little on its own, but it allows us to inline, do escape analysis, and apply more optimizations. Unsure why this doesn't happen for this example in the Dart AOT compiler.
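As an illustration of the chaining described above, the following hand-written sketch shows one function at each stage: devirtualized, then inlined, then with the allocation removed by escape analysis. It is not Binaryen or Dart output; `Point`, `norm1`, and the four function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Point:
    x: int
    y: int

    def norm1(self) -> int:
        return abs(self.x) + abs(self.y)


def before(x: int, y: int) -> int:
    # Unoptimized shape: allocate an object, then make a dynamically dispatched call.
    p = Point(x, y)
    return p.norm1()


def after_devirtualize(x: int, y: int) -> int:
    # Step 1: the receiver is known to be exactly a Point, so call the method directly.
    # On its own this saves very little.
    p = Point(x, y)
    return Point.norm1(p)


def after_inline(x: int, y: int) -> int:
    # Step 2: a direct call has a known body, so it can be inlined into the caller.
    p = Point(x, y)
    return abs(p.x) + abs(p.y)


def after_escape_analysis(x: int, y: int) -> int:
    # Step 3: p is now visibly local and never escapes, so the allocation
    # is replaced by the values of its fields.
    return abs(x) + abs(y)
```

All four functions compute the same result; only the last one avoids both the dispatch and the allocation, which is why the first step matters mainly for what it unlocks.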

CW: when measuring the cast overhead, is this running a version of V8 that nops the cast?

AZ: Yes, that’s how it’s done. It’s an unsafe measurement

CW: The overhead is lower than expected, remaining overhead is mysterious

BT: for null checks, is that because V8 has a null object that needs an explicit branch?

JK: Yes

CW: It takes several minutes to compile - without type canonicalization?

JK: yes, if the IR graph happens to have a pattern that runs into a quadratic algorithm in e.g. register allocation

CW: Are these problems solvable by reengineering passes in V8? This is not inherent to Wasm?

JK: not inherent to Wasm at all, not much Wasm can do about that. The toolchains that produce modules have to be careful. E.g. if Binaryen inlines everything it can, the resulting function can be too much for a JIT compiler. The inlining budget is a knob we can turn. Some optimization passes are just expensive and take time.

BT: The era of wasm engines doing dynamic inlining is about to begin :)

CW: :'(

BT: CPUs do SSA. You can’t stop the hardware getting smarter :)

LW: Hopefully in a way that can attempt to preserve as much of the predictability of the cost model as possible…

BT: I think we can address that with a compilation hints section

CW: worried about having deopts in Wasm

AR: Still confused about the call ref thing, create an auxiliary table of size one, put, and do a call ref on that, why are inlined calls needed?

JK: because a call that you don't do because you inlined, is an order of magnitude faster than non-inlined

AR: Fair, if you want to inline it?

CW: Is this an optimization that you would have done for call indirect anyway, and needed more for the number of indirections in GC?

JK: yes, accurate. We can consider doing it for call_indirect afterwards

BT: everything in a call_indirect table has the Wasm calling convention; if entries are JS functions, they have been wrapped by a wrapper, so you don't have to check whether it is a JS function. call_ref has to do several checks to see if it is a JS function.

AR: Why?

BT: That’s V8, everything in the heap is a JS object.

AR: idea with other JS values passing the boundaries is that conversion is done at the boundary, then you have the exact same situation, you save the type check and reading from table. What do you do if you store the ref in the table, how do you wrap that then?

BT: Stuff might be generated when you write into the table, table writes are expensive in JS

AR: now that we have table.set.

BT: also expensive; it can be a broadcast write to multiple different instances that import a table, due to how signatures are canonicalized to a unique number

AR: Would that also be true for typed tables? Or any func refs?

BT: even if they were typed func refs, anything that can be call_indirect-ed. Not for table with non-funcref type.

AR: If you have a typed function ref, then you also don’t need to canonicalize the types?

BT: goes back to how the type check for call_indirect works. You have to check for the right signature; if you can't do that statically, the dynamic check uses this unique identifier.

AR: that's the case if you have a concrete function type
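The dynamic check discussed in this exchange can be sketched roughly as follows. This is a simplified model, not V8's implementation: `SignatureRegistry`, `Trap`, and this `call_indirect` function are hypothetical names, and it assumes each table entry stores a canonical signature id next to the callee.

```python
class Trap(Exception):
    """Stands in for a Wasm trap in this sketch."""


class SignatureRegistry:
    """Maps structurally equal signatures to one small integer id."""

    def __init__(self):
        self._ids = {}

    def canonical_id(self, params, results):
        key = (tuple(params), tuple(results))
        # setdefault returns the existing id, or stores and returns the next free one.
        return self._ids.setdefault(key, len(self._ids))


def call_indirect(table, index, expected_sig_id, *args):
    """Bounds check, null check, then a single integer comparison for the type."""
    if not 0 <= index < len(table):
        raise Trap("undefined element")
    entry = table[index]
    if entry is None:
        raise Trap("uninitialized element")
    sig_id, func = entry
    if sig_id != expected_sig_id:
        raise Trap("indirect call type mismatch")
    return func(*args)


registry = SignatureRegistry()
I32_I32_TO_I32 = registry.canonical_id(["i32", "i32"], ["i32"])
I32_TO_I32 = registry.canonical_id(["i32"], ["i32"])

# Each entry pairs the callee with its canonical signature id.
table = [
    (I32_I32_TO_I32, lambda a, b: a + b),
    (I32_TO_I32, lambda a: -a),
    None,
]
```

Because the id is shared by every module that declares the same signature, the call-time check stays a cheap comparison; the cost moves to table writes, which have to store the right id for every instance that sees the table.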

CW: Independent of this, the large part of the speculative optimization is based on inlining, which is orthogonal to this discussion.

JK: if we want to be competitive with JS or Java on the JVM, we have to do inlining on non-direct calls.

RT: agree, also know that our AOT compiler doesn't do any speculation and inlining and is competitive with JS or Java. Not sure if it is as necessary. It varies by benchmark.

JK: Depends on the benchmark

AZ: it does vary a lot; in Java every call is virtual. This is much more serious now.

RT: we have the AOT compiler, they run relatively the same as Java, C#...

BT: Speculative optimization by gathering indirect call targets is valuable, we should measure this going forward with Wasm engines so we can measure where our overheads are.

AR: with Wasm, assumption is that most heavy-lifting is done beforehand, that can't cover dynamic optimizations. Worried that we get too deep into that kind of thing, depending on too much magic in Wasm engine, that was the antithesis to the intents of the original design.

JK: We want to do as much as we can ahead of time, but there are two limitations. First, we are concerned about module size: uncompressed modules are still two-digit MBs, 10-15 MB with the names section stripped out. The other limitation is that some calls can be devirtualized, but for some calls, if you use Java the way it is permitted, it is not possible to predict ahead of time.

CW: what does the JVM do in that case? does it do the unsound inlining?

BT: not unsound, but JVM has an inline cache to collect receiver methods, will inline one or more targets if the distribution is skewed

CW: Unsound is the wrong word given the fallback
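A rough sketch of the inline-cache idea described above: a call site records receiver types, and once the distribution is skewed enough it takes a guarded fast path, with the generic dispatch kept as a fallback. This is not how the JVM is implemented; `ProfiledCallSite`, the threshold values, and the `Square` and `Rect` classes are all assumptions made for illustration.

```python
from collections import Counter


class ProfiledCallSite:
    """One virtual call site with a receiver-type profile."""

    def __init__(self, method_name, threshold=0.9, min_samples=100):
        self.method_name = method_name
        self.threshold = threshold      # required share of the dominant type
        self.min_samples = min_samples  # calls to observe before speculating
        self.profile = Counter()
        self.speculated_type = None
        self.fast_target = None

    def call(self, receiver, *args):
        cls = type(receiver)
        if self.speculated_type is not None and cls is self.speculated_type:
            # Guard passed: use the target a JIT would have inlined here.
            return self.fast_target(receiver, *args)
        # Either still profiling or the guard failed: generic dispatch, still correct.
        self.profile[cls] += 1
        self._maybe_speculate()
        return getattr(receiver, self.method_name)(*args)

    def _maybe_speculate(self):
        total = sum(self.profile.values())
        if self.speculated_type is not None or total < self.min_samples:
            return
        cls, count = self.profile.most_common(1)[0]
        if count / total >= self.threshold:
            self.speculated_type = cls
            self.fast_target = getattr(cls, self.method_name)


class Square:
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side


class Rect:
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h
```

The guard is what keeps the speculation sound: a receiver of an unexpected type simply takes the slow path instead of the specialized one.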

BT: another thing JVM does is dynamic class hierarchy analysis, for a given class method, has it been overridden, if not it will devirtualize and inline. If you instantiate an object of a class that overrides it, it will deopt code.

CW: Horrifying

BT: beautiful at the same time
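The class hierarchy analysis described above can be modelled in a few lines. This is only a sketch under stated assumptions: it uses an explicit `load` step as the event that invalidates optimized code, and `ClassHierarchy`, `ChaCallSite`, `Animal`, and `Dog` are hypothetical names, not JVM APIs.

```python
class ClassHierarchy:
    """Tracks loaded classes and the code that assumes a method has no override."""

    def __init__(self):
        self.loaded = []
        self.dependents = {}  # (base class, method name) -> deopt callbacks

    @staticmethod
    def _overrides(cls, base, name):
        return cls is not base and issubclass(cls, base) and name in vars(cls)

    def is_overridden(self, base, name):
        return any(self._overrides(c, base, name) for c in self.loaded)

    def load(self, cls):
        self.loaded.append(cls)
        for (base, name), callbacks in self.dependents.items():
            if self._overrides(cls, base, name):
                # The new class breaks the assumption: deoptimize dependent code.
                for deopt in callbacks:
                    deopt()
                callbacks.clear()


class ChaCallSite:
    """A call site that is devirtualized while no override is known."""

    def __init__(self, hierarchy, base, name):
        self.base, self.name = base, name
        self.devirtualized = not hierarchy.is_overridden(base, name)
        if self.devirtualized:
            hierarchy.dependents.setdefault((base, name), []).append(self.deoptimize)

    def deoptimize(self):
        self.devirtualized = False

    def call(self, receiver, *args):
        if self.devirtualized:
            # Direct call to the base implementation: no dispatch, inlinable.
            return getattr(self.base, self.name)(receiver, *args)
        return getattr(receiver, self.name)(*args)


class Animal:
    def sound(self):
        return "..."


class Dog(Animal):
    def sound(self):
        return "woof"
```

Loading `Animal` alone lets a call site on `Animal.sound` be devirtualized; loading `Dog` afterwards flips that site back to dynamic dispatch before any `Dog` instance can reach it.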

AR: hope we haven't created another Java here. In the ideal world we can do all this in the user space, can JIT from within Wasm, express that in Wasm itself. That's a long shot, if ever possible.

BT: This is the kind of stuff to figure out over the next few years, GraalVM is a potential target for these types of optimization. The JIT may inline a constant, if you change the global it will change the code. This method lets you do dynamic call hierarchies etc.

AR: question is how many shortcuts we want to take before we have the full-blown machinery. Might have pressure to take shortcuts that are very ad hoc.

CW: Off the wall, not realistic…. Never mind

DG: a lot of work in V8, have other engines prototyped this and can share experiences?

LH: We have a partial prototype, not spending a lot of time on that right now.

LW: experimentation is valuable, to see what the optimizations are to be competitive with JS. Once we have parity, we can ask where the optimizations should be. Getting to parity is a prerequisite, biggest risk to address.

JK: Is the Dart compiler missing out on optimizations? Any time we have a system providing an unoptimized result, is there a missed opportunity for optimization in the other system? This is an intentional choice, any optimizations implemented in Binaryen will benefit the entire ecosystem

CW: agree that what Binaryen is doing is incredibly valuable, what does Dart expect their burden to be, do they expect to delegate to Binaryen?

JK: Lot of optimizations on their end, it makes sense to have both language specific optimizations on their end, and language independent optimizations in Binaryen

BT: advantage of doing optimizations in VM is debugging, tracing through your programs. If you use an offline/AOT, you can inline things and you don't see their frames. If you generate unoptimized code, and VM supports debugging, you can step through. E.g. javac doesn't inline. If you think about ecosystem view, generating unoptimized code is a good thing if you want to debug it, not everything is production deployment where you run optimization passes which make it harder to debug.

RT: Context: the Dart optimization does whole-program optimizations, dynamic classes relying on dynamic optimizations, so we’re relying on different things