simd vectors

September 28, 2018 · View on GitHub

Proposal: SE-0229
Author: Stephen Canon
Review Manager: Ben Cohen
Implementation: apple/swift#19505
Status: Active review (September 27...October 5, 2018)

Introduction

This proposal would expose a common subset of operations on the SIMD types supported by most processors in the standard library. It is based on Apple's <simd/simd.h> module, which is used throughout Apple's platforms as the common currency type for fixed-size vectors and matrices. It is not a complete re-implementation; rather it provides the low-level support needed to import any such library, and tries to make a number of things much nicer in Swift than they are in C or C++.

Preliminary Swift-evolution discussion.

Motivation

Task 1: SIMD programming

Essentially every modern CPU has support for SIMD ("Single Instruction, Multiple Data") instructions in hardware. Without getting into a long discussion of the architectural details, effective use of these instructions allows 2-10x better performance than is otherwise possible for a large class of data-parallel problems, without incurring the synchronization and data-movement hassles of working with the GPU.

Historically, there have been a number of obstacles to taking advantage of this hardware. Four programming models have been commonly used, all of which the author has considerable experience with:

Assembly: this has the advantage that you get exactly the code you want. It has numerous disadvantages--another language to learn, requiring either macro soup or separate implementations for every target (in effectively a different language for each), there are few good learning resources, and you have to slog through all the tedious things that the compiler normally does for you, like allocating registers and getting the calling conventions right (or wrong, in subtle ways that bite your users many years later).
Intrinsics: The model historically pushed by hardware vendors. Each architecture has its own set of C "intrinsic" types like __m128 (x86) or int8x8_t (ARM). These are superficially nicer than assembly, but in practice incur nearly all of the downsides, plus a few additional ones. The biggest problem is that these types are often bolted awkwardly onto the language, and so are incompatible with fundamental language or runtime assumptions about size or alignment. Innumerable bugs have been created by people attempting to use vector intrinsic types together with C++ containers, for example. These types move your implementation into a portable language (C or C++), and then immediately remove that portability by being tied to a specific architecture.
"Vector class" libraries: Agner's is the most well-known. What these generally bring to the table is support for familiar operators rather than arcane intrinsics (e.g. a + b instead of _mm_addps(a, b)). Most are narrowly focused on a single architecture (almost always x86, so they still don't provide real portability). Apple's <simd/simd.h> is similar to these, but with full support for every architecture that Apple uses, which in practice makes code written against it fairly portable.
Autovectorization: works well for simple problems, but no one has yet demonstrated a satisfactory solution for more general tasks. One of several problems is that the underlying machine model assumed by vector code is fundamentally distinct from the underlying machine model that most scalar code is written against, forcing an autovectorizing compiler to map between the two. Historically this was largely done via an ad-hoc set of transformations, which never really worked all that well. Newer approaches show some promise, but explicit manual vectorization is still frequently needed.

A major goal of these new data types in Swift is to provide a better API for vector programming. We want to capture the cross-platform niceties of the <simd/simd.h> module, but also add some new features that were difficult or impossible to do in C; things like enabling generic programming, but also things as simple as a native way to express vector permutations and unaligned loads, or even just conversions between different vector types with the same number of elements (this requires a function call rather than a cast with most vector libraries).

Looking ahead, I expect that we will use these primitives to expose things like iterating over vectors extracted from a collection of scalars, to make explicitly vectorized code much more approachable.

Task 2: geometry primitives

There is a large class of computational tasks (graphics and animation, image processing, AR, VR, computer vision) that want to have 2, 3, and 4 dimensional vector and matrix types. For these applications, these types are just as fundamental as Int and Array are for "normal" programming--they are the foundation upon which everything else is constructed.

These tasks require both elementwise operations, as well as some operations on types as abstract vectors--things like the dot and cross products, vector length, and orientation tests.

Task 3: GPU data structures

Closely related to part 2, short vectors are also essential data types for representing GPU data structures. It's frequently necessary to handle these on the CPU side do do pre/post processing, or data marshalling, and exposing these types in Swift enables that task.

Putting it together

Superficially, the only thing that these tasks have in common is that the fundamental types are "homogeneous aggregates", and that they want to benefit from the SIMD hardware that is available. Two or even three sets of types may seem more appropriate. However, our experience with clients of <simd/simd.h> is that all of our clients use a diverse subset of operations from the module, and that it's difficult to draw clear boundaries of what belongs where. While it may be reasonable to refine the underlying protocols, the types should probably remain unified.

Looking ahead, we would like to enable more sophisticated storage layouts and transforms to and from them, such as SoA - AoS conversion (a fancy way of saying "matrix transpose" or "interleave / deinterleave"; these are all the same thing). Once you start thinking about such transforms, the distinction between these types really goes out the window, because you want to map between things like 16 vectors of 3 floats and 3 vectors of 16 floats.

Proposed solution

I propose to add the following integer vector types:

Int8.Vector[2,3,4,8,16,32,64]
UInt8.Vector[2,3,4,8,16,32,64]
Int16.Vector[2,3,4,8,16,32]
UInt16.Vector[2,3,4,8,16,32]
Int32.Vector[2,3,4,8,16]
UInt32.Vector[2,3,4,8,16]
Int64.Vector[2,3,4,8]
UInt64.Vector[2,3,4,8]

and the following floating-point vector types:

Float.Vector[2,3,4,8,16]
Double.Vector[2,3,4,8]

In addition to these, there is a set of "mask" (or "predicate") types that will almost exclusively be used as associated types on the vector types listed above, and a set of protocols that allow for generic programming, described below.

Detailed design

The top-level protocol for these SIMD types is SIMDVector. It doesn't do too much, but let's look at what it does do, because while most of the meat is in the protocols that refine it, all the most subtle points come up here.

public protocol SIMDVector : Hashable,
                             CustomStringConvertible,
                             ExpressibleByArrayLiteral {
  
  associatedtype Element : Hashable
  
  /// A vector with zero in all lanes.
  init()
  
  /// A vector with value in all lanes.
  ///
  /// A default implementation is provided by SIMDVectorN.
  init(repeating value: Element)
  
  /// A vector constructed from the contents of `array`.
  ///
  /// `array` must have the correct number of elements for the vector
  /// type. If it does not, a runtime error occurs.
  ///
  /// A default implementation is provided by SIMDVectorN.
  init(_ array: [Element])
  
  /// The number of elements in the vector.
  var count: Int { get }
  
  /// Element access to the vector.
  ///
  /// Precondition: `index` must be in `0 ..< count`.
  subscript(index: Int) -> Element { get set }
  
  /// A type representing the result of lanewise comparison.
  ///
  /// Most SIMD comparison operators are *lanewise*, meaning that a
  /// comparison of two 4-element vectors produces a vector of 4 
  /// comparison results. E.g:
  ///
  ///   let vec = Float.Vector4( 1, 2, 3, 4)
  ///   let mask = vec .< 3
  ///   // mask = Float.Vector4.Mask(true,true,false,false), because the
  ///   // condition `< 3` is true in the first two lanes and false in
  ///   // the second two lanes.
  ///
  /// This vector of comparison results is itself a vector with the same
  /// number of elements as the vectors being compared.
  associatedtype Mask : SIMDMask
  
  /// Elementwise equality test.
  ///
  /// The result is a mask vector where each lane is `true` if and only
  /// if the corresponding lane of the vector `lhs` is equal to `rhs`.
  static func .==(lhs: Self, rhs: Self) -> Mask
  
  /// A vector formed from the corresponding lane of this vector where
  /// predicate is false, and from the corresponding lane of other where
  /// mask is true.
  ///
  /// See Also: replacing(with: Element, where: Mask), and the in-place
  /// operations replace(with:,where:).
  func replacing(with other: Self, where mask: Mask) -> Self
}

A couple important details to note here:

SIMDVector presents an interface similar to Collection, but does not conform to the Collection protocol. This is because, while these types have indexed access, many of the Collection functions are potentially hazardous to performance, because they would boot you right out of efficient SIMD operation. We may consider ways to add this conformance (or just conformance to Sequence) in the future, but we are deliberately leaving it out for now.
An earlier version of this proposal had two different == and != operators on SIMDVectors. One set returned Bool, the other set returned the associated type Mask. This is unusual for Swift, and may strike you as an odd choice. In practice, it worked out great, as you rarely need to explicitly disambiguate because of how overload resolution functions.

However, there was also significant pushback to this; because many people found it confusing, I've updated the implementation branch to use .-prefixed operators for lanewise comparisons. On the whole, I think using unadorned operators is clearer in use, but the . prefixes are also tolerable, and there is precedent for this choice (all lanewise operations in Julia are spelled this way, as is the lanewise product in Matlab-derived languages).

A small set of additional operations are provided as non-customizable extensions on this protocol: elementwise .== and .!= comparisons with scalars, the mutating function replace(with: Self, where: Mask), plus conformance to Hashable and CustomStringConvertible.

The next three protocols each refine SIMDVector:

public protocol SIMDIntegerVector : SIMDVector
                    where Element : FixedWidthInteger {
  
  /// Creates a bitmask vector from `mask`.
  ///
  /// The value of the mask is all 1 bits (i.e. `-1` if the element type is
  /// signed, .max if it is unsigned) in each lane where `mask` is true,
  /// and all 0 bits in each lane where `mask` is false.
  init(bitMaskFrom mask: Mask)
  
  /// A vector where each element is the count of leading zero bits in the
  /// corresponding lane of this vector.
  ///
  /// If a lane of this vector is zero, the corresponding lane of the result
  /// has the value Element.bitWidth.
  var leadingZeroBitCount: Self { get }
  
  /// A vector where each element is the count of trailing zero bits in the
  /// corresponding lane of this vector.
  ///
  /// If a lane of this vector is zero, the corresponding lane of the result
  /// has the value Element.bitWidth.
  var trailingZeroBitCount: Self { get }
  
  /// A vector where each element is the count of non-zero bits in the
  /// corresponding lane of this vector.
  var nonzeroBitCount: Self { get }
  
  /// A representation of this vector with the byte order swapped for each
  /// element.
  ///
  /// The ordering of elements within the vector is unchanged.
  var elementBytesSwapped: Self { get }
  
  static func .<(lhs: Self, rhs: Self) -> Mask
  
  static func .<=(lhs: Self, rhs: Self) -> Mask
  
  static func .>(lhs: Self, rhs: Self) -> Mask
  
  static func .>=(lhs: Self, rhs: Self) -> Mask
  
  static prefix func ~(rhs: Self) -> Self
  
  static func ^(lhs: Self, rhs: Self) -> Self
  
  static func &(lhs: Self, rhs: Self) -> Self
  
  static func |(lhs: Self, rhs: Self) -> Self
  
  static func &>>(lhs: Self, rhs: Self) -> Self
  
  static func &<<(lhs: Self, rhs: Self) -> Self
  
  static func &+(lhs: Self, rhs: Self) -> Self
  
  static func &-(lhs: Self, rhs: Self) -> Self
  
  static func &*(lhs: Self, rhs: Self) -> Self
  
  static func /(lhs: Self, rhs: Self) -> Self
  
  static func %(lhs: Self, rhs: Self) -> Self
}

public protocol SIMDMask : SIMDVector
                     where Element == Bool, Mask == Self {
  
  /// A mask vector with each lane is true where the corresponding
  /// lanes of both arguments are true, and false otherwise.
  static func &(lhs: Self, rhs: Self) -> Self
  
  /// A mask vector with each lane is true where the corresponding
  /// lanes of either argument is true, and false otherwise.
  static func |(lhs: Self, rhs: Self) -> Self
  
  /// True if every lane of the vector is true, false otherwise.
  ///
  /// Implementation hook for the all( ) free function.
  func _all() -> Bool
  
  /// True if any lane of the vector is true, false otherwise.
  ///
  /// Implementation hook for the any( ) free function.
  func _any() -> Bool
}

public protocol SIMDFloatingPointVector : SIMDVector
                          where Element : BinaryFloatingPoint,
                 Element.RawSignificand : FixedWidthInteger {
  
  static func .<(lhs: Self, rhs: Self) -> Mask
  
  static func .<=(lhs: Self, rhs: Self) -> Mask
  
  static func .>(lhs: Self, rhs: Self) -> Mask
  
  static func .>=(lhs: Self, rhs: Self) -> Mask
  
  static func +(lhs: Self, rhs: Self) -> Self
  
  static func -(lhs: Self, rhs: Self) -> Self
  
  static func *(lhs: Self, rhs: Self) -> Self
  
  static func /(lhs: Self, rhs: Self) -> Self
  
  func addingProduct(_ lhs: Self, _ rhs: Self) -> Self
  
  func squareRoot( ) -> Self
  
  mutating func round(_ rule: FloatingPointRoundingRule)
}

Besides the protocol requirements, which are the implementation hooks, a bunch of non- customizable operations are provided; the bulk of these are self-explanatory: in-place analogues to the requirements and element-by-vector operations. However, the random APIs are also provided for vectors in these extensions, e.g.:

  static func random<T: RandomNumberGenerator>(
    in range: ClosedRange<Element>,
    using generator: inout T
  ) -> Self {
    var result = Self()
    for i in 0 ..< result.count {
      result[i] = Element.random(in: range, using: &generator)
    }
    return result
  }

These allow you to generate random vectors in basically the same fashion as you generate random scalars already.

There is also a set of protocols defining the vector sizes: Vector2, Vector3, Vector4, Vector8, Vector16, Vector32, and Vector64. These are small, and mainly exist to reduce the boilerplate in the library; they default initialization from elements and arrays and array literals, and provide access to the high, low, even, and odd halves of vectors. The one other thing worth noting that these protocols provide is a vector "permute" or "shuffle" operation:

  init<D, I>(gathering source: D, at index: I)
  where D : SIMDVector, D.Element == Element,
        I : SIMDIntegerVector & SIMDVectorN {
    self.init()
    for i in 0 ..< count {
      if index[i] >= 0 && index[i] < source.count {
        self[i] = source[Int(index[i])]
      }
    }
  }

This provides for concise SIMD dictionary lookups as well as the operation of vector swizzles that you may be familiar with from GPU programming.

The Vector2, Vector3, and Vector4 protocols also define .x, .y, .z and .w getters and setters.

Finally, the concrete types offer a few other operations worth noting specifically. There are conversions between all integer and floating-point vector types (as inits) with the same number of elements, including both clamping and truncatingIfNeeded variations.

Source compatibility

No source compatibility changes.

Effect on ABI stability

No effects on ABI stability at the language level. There are some minor changes around how the <simd/simd.h> types are imported on Apple platforms, but these will have no effect on source compatibility; they only may tweak some low-level ABI details.

Effect on API resilience

Because these are entirely new types, they come with a large set of API that will become part of the standard library interface. New API can be added in the future, including to protocols, because it is generally possible to provide good default implementations for simd operations.

Alternatives considered

The main alternative is "don't do anything, let people write structs, and trust the autovectorizer." This might even mostly work, but even if it worked flawlessly (which it doesn't today), you would be left with the problem that these types are common, and everyone would be using their own set of structs, with slightly incompatible size and alignment or operations provided. Even when the layout matches up, you would still need to provide conversion shims between each and every library you work with. There is a lot of merit in providing a single ground-truth for low-level vectors in the stdlib, over which libraries and programs can build additional operations.

During the pitch phase, almost all discussion was focused on the spelling of the type names. There are three basic suggestions, with some passionate supporters of each:

Float.Vector8 (this proposal).
Vector8<Float> This has the virtue of leveraging the usual Swift syntax for generics, and arguably looking "swiftier" because of it. The downside is that although it uses the syntax, it doesn't benefit from most of the semantics of generics in the language, because we need to specialize the layout (using llvm builtin vectors) for each type/vector size to benefit from explicit vectorization. Because of this, it is somewhat more complex--every operation in effect ends up going through an extra layer of abstraction. With judicious use of @inlinable and @_transparent, this should not have any performance consequences, but in my opinion it compromises understandability, and is a layer of complexity that we don't need.

Another argument for this convention is that it may help discoverability; if I see Vector8<Float>, I might decide to try Vector8<Int8> too. On the downside, we want to cap the maximum vector size at 64B; doing this with this naming scheme is somewhat complicated, and involves adding a bunch of protocols for specifying element size, or else you can't constrain people from doing Vector64<Int64>.
Vector8.Float This is an attempt to mirror the "desired" Vector8<Float> syntax while avoiding the extra boilerplate. The downside to this convention is that there isn't a nice way straight out of the box to go from element type to a vector type, so you need a parallel set of typenames or protocols to enable that fairly desirable operation for generic programming.

Both the second and third option also add new members to the top-level namespace, while the first does not. Especially in light of the fact that we may someday be able to build something like Vector<Float, 8>, which I find considerably more desirable, avoiding adding anything to that namespace for now seems to me to be the right decision.

Notes from pre-review discussion:

I have added a static .zero property to integer and floating point vectors based on email conversation with Rick Roe. This is a departure from scalar numeric types, but he convinced me that these are good to have as an explicit alternative to T().
As mentioned above, .-prefixes have been added to some elementwise operations. I see pretty good arguments both in favor of and against this change. I think that it largely comes down to a matter of taste. There are two reasonable alternative positions here. (a) don't add the prefixes; this makes type checking a little more complex. (b) add .-prefixes to all lanewise operations; this adds a lot of new operators and a fair bit of noise, but there's reasonable precedent for it as well (julialang).
Richard Wei has argued strongly against the spelling of replacing(with: where:), objecting to both the lack of an explicit Elements object of the verb replacing and to the use of where: as a label for an argument that is not a predicate function. I am not ignoring his concerns, but from a pragmatic point of view, I believe that the fact that this operates on elements is implicit, and where: simply reads more clearly (to me and the other people that I surveyed) than any other option we could come up with.

Changelog:

The first version of this proposal used the <, <=, >=, and > operators. These have been replaced with the .-prefixed operators.
The first version of this proposal used .* and .&* for elementwise multiplication; these have been replaced with the "normal" arithmetic operations, matching +, -, /, and %.
After some discussion, I have switched from && and || back to & and | for Mask types. The rationale for this is that the precedence mismatch that comes from using the Bitwise operators is less dangerous than the implication that && and || short-circuit. A plausible third option would be to use .& and .| instead. Also note that ^ is not needed; it has exactly the same semantics as .!=; I propose marking it unavailable and referring people to .!=.
Added a little bit of additional discussion of the Vector${N} protocols.