1. oneMKL Architecture

1.1. Execution Model

This section describes the execution environment common to all oneMKL functionality.

1.1.1. Queues

Will be added in a future version.

1.1.1.1. Non-Member Functions

Each oneMKL non-member computational routine takes a sycl::queue reference as its first parameter:

onemkl::domain::routine(sycl::queue &q, ...);

All computation performed by the routine shall be done on the hardware device(s) associated with this queue, with possible aid from the host, unless otherwise specified. In the case of an ordered queue, all computation shall also be ordered with respect to other kernels as if enqueued on that queue.

A particular oneMKL implementation may not support the execution of a given oneMKL routine on the specified device(s). In this case, the implementation may either perform the computation on the host or throw an exception.
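
As an illustration of this calling convention, a dense matrix multiply might look as follows. This is a sketch, not a complete program: it assumes a oneMKL implementation providing onemkl::blas::gemm with the buffer-based signature described in BLAS Routines, and the header name shown is hypothetical.

```cpp
#include <CL/sycl.hpp>
// #include <onemkl.hpp>   // hypothetical umbrella header; the actual name is implementation-defined

int main() {
    sycl::queue q{sycl::default_selector{}};   // all computation runs on this queue's device

    std::int64_t m = 64, n = 64, k = 64;
    sycl::buffer<float, 1> a{sycl::range<1>(m * k)};
    sycl::buffer<float, 1> b{sycl::range<1>(k * n)};
    sycl::buffer<float, 1> c{sycl::range<1>(m * n)};

    // C = alpha * A * B + beta * C, enqueued on q; the queue is the first argument.
    onemkl::blas::gemm(q, onemkl::transpose::nontrans, onemkl::transpose::nontrans,
                       m, n, k, 1.0f, a, m, b, k, 0.0f, c, m);
}   // buffer destructors wait for the enqueued work to complete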

1.1.1.2. Member Functions

oneMKL class-based APIs, such as those in the RNG and DFT domains, require a sycl::queue as an argument to the constructor or another setup routine. The execution requirements for computational routines from the previous section also apply to computational class methods.

1.1.2. Device Usage

oneMKL itself does not currently provide any interfaces for controlling device usage: for instance, controlling the number of cores used on the CPU, or the number of execution units on a GPU. However, such functionality may be available by partitioning a sycl::device instance into subdevices, when supported by the device.

When given a queue associated with such a subdevice, a oneMKL implementation shall only perform computation on that subdevice.
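
Where supported, such a subdevice can be obtained with the standard SYCL partitioning interface. A minimal sketch, assuming the CPU device supports partitioning by NUMA affinity domain:

```cpp
sycl::device cpu{sycl::cpu_selector{}};

// Partition the CPU by NUMA domain. This is only valid if the device
// reports the corresponding partition property; otherwise it throws.
auto subdevices = cpu.create_sub_devices<
    sycl::info::partition_property::partition_by_affinity_domain>(
    sycl::info::partition_affinity_domain::numa);

// oneMKL computation submitted to q stays on subdevices[0].
sycl::queue q{subdevices[0]};
```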

1.1.3. Asynchronous Execution

The oneMKL API is designed to allow asynchronous execution of computational routines, to facilitate concurrent usage of multiple devices in the system. Each computational routine enqueues work to be performed on the selected device, and may (but is not required to) return before execution completes.

Hence, it is the calling application’s responsibility to ensure that any inputs are valid until computation is complete, and likewise to wait for computation completion before reading any outputs. This can be done automatically when using DPC++ buffers, or manually when using Unified Shared Memory (USM) pointers, as described in the sections below.

Unless otherwise specified, asynchronous execution is allowed, but not guaranteed, by any oneMKL computational routine, and may vary between implementations and/or versions. oneMKL implementations must clearly document whether execution is guaranteed to be asynchronous for each supported routine.

1.1.3.1. Synchronization When Using Buffers

sycl::buffer objects automatically manage synchronization between kernel launches linked by a data dependency (either read-after-write, write-after-write, or write-after-read).

oneMKL routines are not required to perform any additional synchronization of sycl::buffer arguments.
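
For example, two buffer-based routines linked by a read-after-write dependency need no explicit synchronization. A sketch, assuming the axpy and asum signatures from BLAS Routines:

```cpp
// y = alpha * x + y, then result = sum(|y_i|). The SYCL runtime orders
// the second call after the first because both access buffer y.
onemkl::blas::axpy(q, n, alpha, x, 1, y, 1);   // writes y
onemkl::blas::asum(q, n, y, 1, result);        // reads y: runs after axpy
```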

1.1.3.2. Synchronization When Using USM APIs

When USM pointers are used as input to, or output from, a oneMKL routine, it becomes the calling application’s responsibility to manage possible asynchronicity.

To help the calling application, all oneMKL routines with at least one USM pointer argument also take an optional reference to a list of input events, of type sycl::vector_class<sycl::event>, and have a return value of type sycl::event representing computation completion:

sycl::event onemkl::domain::routine(..., const sycl::vector_class<sycl::event> &in_events = {});

The routine shall ensure that all input events (if the list is present and non-empty) have occurred before any USM pointers are accessed. Likewise, the routine’s output event shall not be complete until the routine has finished accessing all USM pointer arguments.

For class methods, “argument” includes any USM pointers previously provided to the object via the class constructor or other class methods.
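
A sketch of this pattern, chaining two USM-based BLAS calls through events (the axpy and scal signatures are assumed to follow BLAS Routines; malloc_shared is the DPC++ USM allocation routine):

```cpp
float *x = sycl::malloc_shared<float>(n, q);
float *y = sycl::malloc_shared<float>(n, q);
// ... initialize x and y on the host ...

sycl::event e1 = onemkl::blas::axpy(q, n, alpha, x, 1, y, 1);  // writes y
sycl::vector_class<sycl::event> deps{e1};
sycl::event e2 = onemkl::blas::scal(q, n, beta, y, 1, deps);   // waits for e1
e2.wait();                                                     // y is now safe to read

sycl::free(x, q);
sycl::free(y, q);
```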

1.1.4. Host Thread Safety

All oneMKL member and non-member functions shall be host thread safe; that is, they may be safely called simultaneously from concurrent host threads. However, oneMKL objects in class-based APIs shall not be shared between concurrent host threads unless otherwise specified.

1.2. Memory Model

The oneMKL memory model shall follow directly from the oneAPI memory model. Mainly, oneMKL shall support two modes of encapsulating data for consumption on the device: the buffer memory abstraction model and the pointer-based memory model using Unified Shared Memory (USM). These two paradigms shall also support both synchronous and asynchronous execution models as described in Asynchronous Execution.

1.2.1. The buffer memory model

The SYCL 1.2.1 specification defines the buffer container, templated on the provided data type, which encapsulates data in a SYCL application across both host and devices. It provides the concept of accessors as the mechanism for accessing buffer data, with different modes to read and/or write that data. These accessors allow SYCL to create and manage the data dependencies in the SYCL graph that order the kernel executions. With the buffer model, all data movement is handled by the SYCL runtime, supporting both synchronous and asynchronous execution.

oneMKL provides APIs where buffers (in particular 1D buffers, sycl::buffer<T,1>) contain the memory for all non-scalar input and output data arguments. See Synchronization When Using Buffers for details on how oneMKL routines manage data dependencies with buffer arguments. Any higher-dimensional buffer must be converted to a 1D buffer prior to use in oneMKL APIs, e.g., via sycl::buffer::reinterpret.
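
For example, a 2D buffer can be viewed as a 1D buffer over the same memory using the standard sycl::buffer::reinterpret member; a sketch:

```cpp
sycl::buffer<float, 2> a2d{sycl::range<2>(m, n)};

// Same underlying memory, now usable with 1D-buffer oneMKL APIs.
auto a1d = a2d.reinterpret<float, 1>(sycl::range<1>(m * n));
```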

1.2.2. Unified Shared Memory model

While the buffer model is powerful and elegantly expresses data dependencies, replacing all pointers and arrays with buffers in a C++ application can be a burden for programmers. A pointer-based model called Unified Shared Memory (USM) is therefore also provided in the DPC++ language. This alternative approach lets the programmer use standard C++ pointers allocated with one of the DPC++ memory allocation routines (e.g., malloc_shared()). USM pointers, however, do not carry data dependencies between enqueued actions on the USM data, so the DPC++ language provides sycl::event objects, associated with each enqueued submission, which can be used to manage the dependencies explicitly.

oneMKL provides APIs where USM pointers contain the memory for all non-scalar input and output data arguments. Additionally, oneMKL APIs with USM pointers shall provide a means to pass sycl::event objects to manage the data dependencies. See Synchronization When Using USM APIs for details.

1.3. API design in oneMKL

This section discusses the general features of oneMKL API design.

1.3.1. oneMKL namespaces

The oneMKL library uses C++ namespaces to organize routines by mathematical domain. All oneMKL objects and routines shall be contained within the onemkl base namespace. The individual oneMKL domains use a secondary namespace layer as follows:

onemkl
    oneMKL base namespace; contains general oneMKL data types, objects, exceptions, and routines.

onemkl::blas
    Dense linear algebra routines from BLAS and BLAS-like extensions. See BLAS Routines.

onemkl::lapack
    Dense linear algebra routines from LAPACK and LAPACK-like extensions. See LAPACK Routines.

onemkl::sparse
    Sparse linear algebra routines from sparse BLAS and sparse solvers. See Sparse Linear Algebra.

onemkl::dft
    Discrete and fast Fourier transformations. See Fourier Transform Functions.

onemkl::rng
    Random number generator routines. See Random Number Generators.

onemkl::vm
    Vector mathematics routines, e.g., trigonometric and exponential functions acting on the elements of a vector. See Vector Math.

1.3.2. Standard C++ datatype usage

oneMKL uses C++ STL data types for scalars where applicable:
  • Integer scalars are C++ fixed-size integer types (std::intN_t, std::uintN_t).

  • Complex numbers are represented by C++ std::complex types.

In general, scalar integer arguments to oneMKL routines are 64-bit integers (std::int64_t or std::uint64_t). Integer vectors and matrices may have varying bit widths, defined on a per-routine basis.

1.3.3. DPC++ datatype usage

oneMKL uses the following SYCL data types and DPC++ language extension data types:
  • SYCL queue sycl::queue for scheduling kernels on a SYCL device. See Queues for more details.

  • SYCL buffer sycl::buffer for buffer-based memory access. See The buffer memory model for more details.

  • DPC++ extension Unified Shared Memory (USM) for pointer-based memory access. See Unified Shared Memory model for more details.

  • SYCL event sycl::event for output event synchronization in oneMKL routines with USM pointers. See Synchronization When Using USM APIs for more details.

  • Vector of SYCL events sycl::vector_class<sycl::event> for input event synchronization in oneMKL routines with USM pointers. See Synchronization When Using USM APIs for more details.

1.3.4. oneMKL defined datatypes

oneMKL linear algebra routines use scoped enum types as type-safe replacements for the traditional character arguments used in BLAS and LAPACK. These types all belong to the onemkl namespace.

Each enumeration value comes with two names: a single-character name (the traditional BLAS/LAPACK character) and a longer, more descriptive name. The two names are exactly equivalent and may be used interchangeably.

transpose

The transpose type specifies whether an input matrix should be transposed and/or conjugated. It can take the following values:

transpose::N / transpose::nontrans
    Do not transpose or conjugate the matrix.

transpose::T / transpose::trans
    Transpose the matrix.

transpose::C / transpose::conjtrans
    Perform a Hermitian transpose (transpose and conjugate). Only applicable to complex matrices.

uplo

The uplo type specifies whether the lower or upper triangle of a triangular, symmetric, or Hermitian matrix should be accessed. It can take the following values:

uplo::U / uplo::upper
    Access the upper triangle of the matrix.

uplo::L / uplo::lower
    Access the lower triangle of the matrix.

In both cases, elements that are not in the selected triangle are not accessed or updated.

diag

The diag type specifies the values on the diagonal of a triangular matrix. It can take the following values:

diag::N / diag::nonunit
    The matrix is not unit triangular. The diagonal entries are stored with the matrix data.

diag::U / diag::unit
    The matrix is unit triangular (the diagonal entries are all 1's). The diagonal entries in the matrix data are not accessed.

side

The side type specifies the order of matrix multiplication when one matrix has a special form (triangular, symmetric, or Hermitian):

side::L / side::left
    The special form matrix is on the left in the multiplication.

side::R / side::right
    The special form matrix is on the right in the multiplication.

offset

The offset type specifies whether the offset to apply to an output matrix is a fixed offset, a column offset, or a row offset. It can take the following values:

offset::F / offset::fix
    The offset applied to the output matrix is fixed: every element of the C_offset matrix has the same value, given by the first element of the co array.

offset::C / offset::column
    The offset applied to the output matrix is a column offset: all columns of the C_offset matrix are identical, given by the elements of the co array.

offset::R / offset::row
    The offset applied to the output matrix is a row offset: all rows of the C_offset matrix are identical, given by the elements of the co array.

1.4. oneMKL Exceptions

Will be added in a future version.

1.5. Global Library Controls

Will be added in a future version.

1.5.1. oneMKL Specification Versioning

This is oneMKL specification version 0.7.0.

1.5.2. Control of Pre/Post Condition Checking

Will be added in a future version.