Attributes

The parameters passed to create a primitive descriptor specify the problem. An engine specifies where the primitive will be executed. An operation descriptor specifies the basics: the operation kind; the propagation kind; the source, destination, and other tensors; the strides (if applicable); and so on.

Attributes specify some extra properties of the primitive. Users must create them before use and must set required specifics using the corresponding setters. The attributes are copied during primitive descriptor creation, so users can change or destroy attributes right after that.

If not modified, attributes can stay empty, which is equivalent to the default attributes. Primitive descriptors’ constructors have empty attributes as default parameters, so unless required users can simply omit them.

Attributes can also contain post-ops, which are computations executed after the primitive.

Scratchpad Mode

Some primitives might require a temporary buffer while performing their computations. For instance, the operations that do not have enough independent work to utilize all cores on a system might use parallelization over the reduction dimension (the K dimension in the GEMM notation). In this case different threads compute partial results in private temporary buffers, and then the private results are added to produce the final result. Another example is using matrix multiplication (GEMM) to implement convolution. Before calling GEMM, the source activations need to be transformed using the im2col operation. The transformation result is written to a temporary buffer that is then used as an input for the GEMM.

In both of these examples, the temporary buffer is no longer required once the primitive computation is completed. oneDNN refers to such kind of a memory buffer as a scratchpad.

Both types of implementation might need extra space for the reduction in case there are too few independent tasks. The amount of memory required by the im2col transformation is proportional to the size of the source image multiplied by the weights spatial size. The size of a buffer for reduction is proportional to the tensor size to be reduced (e.g., diff_weights in the case of backward by weights) multiplied by the number of threads in the reduction groups (the upper bound is the total number of threads).

By contrast, some other primitives might require very little extra space. For instance, one of the implementation of the dnnl::sum primitive requires temporary space only to store the pointers to data for each and every input array (that is, the size of the scratchpad is n * sizeof(void *), where n is the number of summands).

oneDNN supports two modes for handling scratchpads:

enum dnnl::scratchpad_mode

Scratchpad mode.

Values:

enumerator library

The library manages the scratchpad allocation. There may be multiple implementation-specific policies that can be configured via mechanisms that fall outside of the scope of this specification.

enumerator user

The user manages the scratchpad allocation by querying and providing the scratchpad memory to primitives. This mode is thread-safe as long as the scratchpad buffers are not used concurrently by two primitive executions.

The scratchpad mode is controlled though the dnnl::primitive_attr::set_scratchpad_mode() primitive attributes.

If the user provides scratchpad memory to a primitive, this memory must be created using the same engine that the primitive uses.

All primitives support both scratchpad modes.

Note

Primitives are not thread-safe by default. The only way to make the primitive execution fully thread-safe is to use the dnnl::scratchpad_mode::user mode and not pass the same scratchpad memory to two primitives that are executed concurrently.

Examples

Library Manages Scratchpad

As mentioned above, this is a default behavior. We only want to highlight how a user can query the amount of memory consumed by a primitive due to a scratchpad.

// Use default attr, hence the library allocates scratchpad
dnnl::primitive::primitive_desc op_pd(params, /* other arguments */);

// Print how much memory would be hold by a primitive due to scratchpad
std::cout << "primitive will use "
          << op_pd.query_s64(dnnl::query::memory_consumption_s64)
          << " bytes" << std::endl;

// In this case scratchpad is internal, hence user visible scratchpad memory
// descriptor should be empty:
auto zero_md = dnnl::memory::desc();

Library Manages Scratchpad

// Create an empty (default) attributes
dnnl::primitive_attr attr;

// Default scratchpad mode is `library`:
assert(attr.get_scratchpad_mode() == dnnl::scratchpad_mode::library);

// Set scratchpad mode to `user`
attr.set_scratchpad_mode(dnnl::scratchpad_mode::user);

// Create a primitive descriptor with custom attributes
dnnl::primitive::primitive_desc op_pd(op_d, attr, engine);

// Query the scratchpad memory descriptor
dnnl::memory::desc scratchpad_md = op_pd.scratchpad_desc();

// Note, that a primitive doesn't consume memory in this configuration:
assert(op_pd.query_s64(dnnl::query::memory_consumption_s64) == 0);

// Create a primitive
dnnl::primitive prim(op_pd);

// ... more code ..

// Create a scratchpad memory
// NOTE: if scratchpad is not required for a particular primitive the
//       scratchpad_md.get_size() will return 0. It is fine to have
//       scratchpad_ptr == nullptr in this case.
void *scratchpad_ptr = user_memory_manager::allocate(scratchpad_md.get_size());
// NOTE: engine here must much the engine of the primitive
dnnl::memory scratchpad(scratchpad_md, engine, scratchpad_ptr);

// Pass a scratchpad memory to a primitive
prim.execute(stream, { /* other arguments */,
        {DNNL_ARG_SCRATCHPAD, scratchpad}});

Quantization

Primitives may support reduced precision computations which require quantization.

Quantization Model

The primary quantization model that the library assumes is the following:

\[x_{f32}[:] = scale_{f32} \cdot (x_{int8}[:] - 0_{x_{int8}})\]

where \(scale_{f32}\) is a scaling factor that is somehow known in advance and \([:]\) is used to denote elementwise application of the formula to the arrays. Typically, the process of computing scale factors is called calibration. The library cannot compute any of the scale factors at run-time dynamically. Hence, the model is sometimes called a static quantization model. The main rationale to support only static quantization out-of-the-box is higher performance. To use dynamic quantization:

  1. Compute the result in higher precision, like dnnl::memory::data_type::s32.

  2. Find the required characteristics, like min and max values, and derive the scale factor.

  3. Re-quantize to the lower precision data type.

oneDNN assumes a fixed zero position. For most of the primitives, the real zero value is mapped to the zero for quantized values; that is, \(0_{x_{int8}} = 0\). For example, this is the only model that Convolution and Deconvolution and Inner Product currently support. The RNN primitives have limited support of shifted zero.

For the rest of this section we that \(0_{x_{int8}} = 0\).

Example: Convolution Quantization Workflow

Consider a convolution without bias. The tensors are represented as:

  • \(\src_{f32}[:] = scale_{\src} \cdot \src_{int8}[:]\)

  • \(\weights_{f32}[:] = scale_{\weights} \cdot \weights_{int8}[:]\)

  • \(\dst_{f32}[:] = scale_{\dst} \cdot \dst_{int8}[:]\)

Here the \(\src_{f32}, \weights_{f32}, \dst_{f32}\) are not computed at all, the whole work happens with int8 tensors. As mentioned above, we also somehow know all the scaling factors: scale_{src}, scale_{weights}, scale_{dst}.

So the task is to compute the \(\dst_{int8}\) tensor.

Mathematically, the computations are:

\[\dst_{int8}[:] = \operatorname{f32\_to\_int8}( output\_scale \cdot conv_{s32}(\src_{int8}, \weights_{int8}) ),\]

where

  • \(output\_scale := \frac{scale_{\src} \cdot scale_{\weights}}{scale_{\dst}}\);

  • \(conv_{s32}\) is just a regular convolution which takes source and weights with int8 data type and compute the result in int32 data type (int32 is chosen to avoid overflows during the computations);

  • \(\operatorname{f32\_to\_s8}()\) converts an f32 value to s8 with potential saturation if the values are out of the range of the int8 data type.

Note that in order to perform the operation, one doesn’t need to know the exact scaling factors for all the tensors; it is enough to know only the output_scale. The library utilizes this fact: a user needs to provide only this one extra parameter to the convolution primitive (see the Output Scaling Attribute section below).

Per-Channel Scaling

Primitives may have limited support of multiple scales for a quantized tensor. The most popular use case is the Convolution and Deconvolution primitives that support per-output-channel scaling factors for the weights, meaning that the actual convolution computations would need to scale different output channels differently.

Let \(\alpha\) denote scales:

  • \(\src_{f32}(n, ic, ih, iw) = \alpha_{\src} \cdot \src_{int8}(n, ic, ih, iw)\)

  • \(\weights_{f32}(oc, ic, kh, kw) = \alpha_{\weights}(oc) \cdot \weights_{int8}(oc, ic, kh, kw)\)

  • \(\dst_{f32}(n, oc, oh, ow) = scale_{\dst} \cdot \dst_{int8}(n, oc, oh, ow)\)

Note that now the weights’ scaling factor depends on the \(oc\).

To compute the \(\dst_{int8}\) we need to perform the following:

\[\dst_{int8}(n, oc, oh, ow) = \operatorname{f32\_to\_int8}( output\_scale(oc) \cdot conv_{s32}(\src_{int8}, \weights_{int8})|_{(n, oc, oh, ow)} ),\]

where

\[output\_scale(oc) := \frac{\alpha_{\src} \cdot \alpha_{\weights}(oc)}{\alpha_{\dst}}.\]

The user is responsible for preparing quantized weights accordingly. To do that, oneDNN provides reorders that can perform per-channel scaling:

\[\weights_{int8}(oc, ic, kh, kw) = \operatorname{f32\_to\_int8}( output\_scale(oc) \cdot \weights_{f32}(oc, ic, kh, kw) ),\]

where

\[output\_scale(oc) := \frac{1}{\alpha_{\weights}(oc_{})}.\]

Output Scaling Attribute

oneDNN provides dnnl::primitive_attr::set_output_scales() for setting scaling factors for most of the primitives.

The primitives may not support output scales if source (and weights) tensors are not of the int8 data type. In other words, convolution operating on the single precision floating point data type may not scale the output result.

In the simplest case, when there is only one common scale the attribute changes the op behavior from

\[\dst[:] = Op(...)\]

to

\[\dst[:] = scale \cdot Op(...).\]

To support scales per one or several dimensions, users must set the appropriate mask.

Say the primitive destination is a \(D_0 \times ... \times D_{n-1}\) tensor and we want to have output scales per \(d_i\) dimension (where \(0 \le d_i < n\)).

Then \(mask = \sum \limits_{d_i} 2^{d_i}\) and the number of scales should be \(\mathtt{scales.size()} = \prod \limits_{d_i} D_{d_i}\).

The scaling happens in the single precision floating point data type (data_type::f32). Before storing, the result is converted to the destination data type with saturation if required. The rounding happens according to the current hardware setting.

Example 1: weights quantization with per-output-channel-and-group scaling

// weights dimensions
const int G, OC, IC, KH, KW;

// original f32 weights in plain format
dnnl::memory::desc wei_plain_f32_md(
        {G, OC/G, IC/G, KH, KW},          // dims
        dnnl::memory::data_type::f32,     // the data originally in f32
        dnnl::memory::format_tag::hwigo   // the plain memory format
        );

// the scaling factors for quantized weights
// An unique scale for each group and output-channel.
std::vector<float> wei_scales(G * OC/G) = { /* values */ };

// int8 convolution primitive descriptor
dnnl::convolution_forward::primitive_desc conv_pd(/* see the next example */);

// query the convolution weights memory descriptor
dnnl::memory::desc wei_conv_s8_md = conv_pd.weights_desc();

// prepare the inverse of the scales
// (f32 = scale * int8 --> int8 = 1/scale * f32)
std::vector<float> inv_wei_scales(wei_scales.size());
for (size_t i = 0; i < wei_scales.size(); ++i)
    inv_wei_scales[i] = 1.f / wei_scales[i];

// prepare the attributes for the reorder
dnnl::primitive_attr attr;
const int mask = 0
    | (1 << 0)  // scale per  G dimension, which is the dim #0
    | (1 << 1); // scale per OC dimension, which is the dim #1
attr.set_output_scales(mask, inv_wei_scales);

// create reorder that would perform:
//   wei_s8(g, oc, ic, kh, kw) <- 1/scale(g, oc) * wei_f32(g, oc, ic, kh, kw)
// including the data format transformation.
auto wei_reorder_pd = dnnl::reorder::primitive_desc(
        wei_plain_f32_md, engine, // source
        wei_conv_s8_md, engine, // destination,
        attr);
auto wei_reorder = dnnl::reorder(wei_reorder_pd);

Example 2: convolution with groups, with per-output-channel quantization

This example is complementary to the previous example (which should ideally be the first one). Let’s say we want to create an int8 convolution with per-output channel scaling.

const float src_scale; // src_f32[:] = src_scale * src_s8[:]
const float dst_scale; // dst_f32[:] = dst_scale * dst_s8[:]

// the scaling factors for quantized weights (as declared above)
// An unique scale for each group and output-channel.
std::vector<float> wei_scales(G * OC/G) = {...};


// Src, weights, and dst memory descriptors for convolution,
// with memory format tag == any to allow a convolution implementation
// to chose the appropriate memory format

dnnl::memory::desc src_conv_s8_any_md(
        {BATCH, IC, IH, IW},            // dims
        dnnl::memory::data_type::s8,  // the data originally in s8
        dnnl::memory::format_tag::any // let convolution to choose
        );

dnnl::memory::desc wei_conv_s8_any_md(
        {G, OC/G, IC/G, KH, KW},        // dims
        dnnl::memory::data_type::s8,  // the data originally in s8
        dnnl::memory::format_tag::any // let convolution to choose
        );

dnnl::memory::desc dst_conv_s8_any_md(...);  // ditto

// Create a convolution operation descriptor
dnnl::convolution_forward::desc conv_d(
        dnnl::prop_kind::forward_inference,
        dnnl::algorithm::convolution_direct,
        src_conv_s8_any_md,                     // what's important is that
        wei_conv_s8_any_md,                     // we specified that we want
        dst_conv_s8_any_md,                     // computations in s8
        strides, padding_l, padding_r,
        dnnl::padding_kind::zero
        );

// prepare the attributes for the convolution
dnnl::primitive_attr attr;
const int mask = 0
    | (1 << 1); // scale per OC dimension, which is the dim #1 on dst tensor:
                // (BATCH, OC, OH, OW)
                //    0     1   2   3
std::vector<float> conv_output_scales(G * OC/G);
for (int g_oc = 0; G * OC/G; ++g_oc)
    conv_output_scales[g_oc] = src_scale * wei_scales(g_oc) / dst_scale;
attr.set_output_scales(mask, conv_output_scales);

// create a convolution primitive descriptor with the scaling factors
auto conv_pd = dnnl::convolution_forward::primitive_desc(
        conv_d, // general (non-customized) operation descriptor
        attr,   // the attributes contain the output scaling
        engine);

Interplay of Output Scales with Post-ops

In general, the Post-ops are independent from the output scales. The output scales are applied to the result first; then post-ops will take effect.

That has an implication on the scaling factors passed to the library, however. Consider the following example of a convolution with \(\tanh\) post-op:

\[\dst_{s8}[:] = \frac{1}{scale_{\dst}} \cdot \tanh( scale_{\src} \cdot scale_{\weights} \cdot conv_{s32}(\src_{s8}, wei_{s8}) )\]
  • The convolution output scales are \(conv\_output\_scale = scale_{\src} \cdot scale_{\weights}\), i.e. there is no division by \(scale_{\dst}\).

  • And the post-ops scale for \(\tanh\) is set to \(scale\_tanh\_post\_op = \frac{1}{scale_{\dst}}\).

API

struct dnnl::primitive_attr

Primitive attributes.

Public Functions

primitive_attr()

Constructs default (empty) primitive attributes.

scratchpad_mode get_scratchpad_mode() const

Returns the scratchpad mode.

void set_scratchpad_mode(scratchpad_mode mode)

Sets scratchpad mode.

Parameters
  • mode: Specified scratchpad mode.

void get_output_scales(int &mask, std::vector<float> &scales) const

Returns output scaling factors correspondence mask and values.

Parameters
  • mask: Scaling factors correspondence mask that defines the correspondence between the output tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated output scaling factor is used for each index along that dimension. The mask value of 0 implies a common output scaling factor for the whole output tensor.

  • scales: Vector of output scaling factors.

void set_output_scales(int mask, const std::vector<float> &scales)

Sets output scaling factors correspondence mask and values.

Example usage:

int mb = 32, oc = 32,
    oh = 14, ow = 14; // convolution output params
// unique output scales per output channel
vector<float> scales = { ... };
int oc_dim = 1; // mb_dim = 0, channel_dim = 1, height_dim = 2, ...

// construct a convolution descriptor
dnnl::convolution::desc conv_d;

dnnl::primitive_attr attr;
attr.set_output_scales(attr, oc, 1 << oc_dim, scales);

dnnl::primitive_desc conv_pd(conv_d, attr, engine);

Note

The order of dimensions does not depend on how elements are laid out in memory. For example:

  • for a 2D CNN activations tensor the order is always (n, c)

  • for a 4D CNN activations tensor the order is always (n, c, h, w)

  • for a 5D CNN weights tensor the order is always (g, oc, ic, kh, kw)

Parameters
  • mask: Defines the correspondence between the output tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated scaling factor is used for each index along that dimension. Set the mask to 0 to use a common output scaling factor for the whole output tensor.

  • scales: Constant vector of output scaling factors. If the scaling factors are known at the time of this call, the following equality must hold: \(scales.size() = \prod\limits_{d \in mask} output.dims[d].\) Violations can only be detected when the attributes are used to create a primitive descriptor. If the scaling factors are not known at the time of the call, this vector must contain a single DNNL_RUNTIME_F32_VAL value and the output scaling factors must be passed at execution time as an argument with index DNNL_ARG_ATTR_OUTPUT_SCALES.

void get_scales(int arg, int &mask, std::vector<float> &scales) const

Returns scaling factors correspondence mask and values for a given memory argument.

Parameters
  • arg: Parameter argument index as passed to the primitive::execute() call.

  • mask: Scaling factors correspondence mask that defines the correspondence between the output tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated scaling factor is used for each index along that dimension. Set the mask to 0 to use a common scaling factor for the whole output tensor.

  • scales: Output vector of scaling factors.

void set_scales(int arg, int mask, const std::vector<float> &scales)

Sets scaling factors for primitive operations for a given memory argument.

See

dnnl::primitive_attr::set_output_scales

Parameters
  • arg: Parameter argument index as passed to the primitive::execute() call.

  • mask: Scaling factors correspondence mask that defines the correspondence between the tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated scaling factor is used for each index along that dimension. Set the mask to 0 to use a common scaling factor for the whole output tensor.

  • scales: Constant vector of scaling factors. The following equality must hold: \(scales.size() = \prod\limits_{d \in mask} argument.dims[d].\)

void get_zero_points(int arg, int &mask, std::vector<int32_t> &zero_points) const

Returns zero points correspondence mask and values.

Parameters
  • arg: Parameter argument index as passed to the primitive::execute() call.

  • mask: Zero points correspondence mask that defines the correspondence between the output tensor dimensions and the zero_points vector. The set i-th bit indicates that a dedicated zero point is used for each index along that dimension. Set the mask to 0 to use a common zero point for the whole output tensor.

  • zero_points: Output vector of zero points.

void set_zero_points(int arg, int mask, const std::vector<int32_t> &zero_points)

Sets zero points for primitive operations for a given memory argument.

See

dnnl::primitive_attr::set_output_scales

Parameters
  • arg: Parameter argument index as passed to the primitive::execute() call.

  • mask: Zero point correspondence mask that defines the correspondence between the tensor dimensions and the zero_points vector. The set i-th bit indicates that a dedicated zero point is used for each index along that dimension. Set the mask to 0 to use a common zero point for the whole output tensor.

  • zero_points: Constant vector of zero points. If the zero points are known at the time of this call, the following equality must hold: \(zero\_points.size() = \prod\limits_{d \in mask} argument.dims[d].\) If the zero points are not known at the time of the call, this vector must contain a single DNNL_RUNTIME_F32_VAL value and the zero points must be passed at execution time as an argument with index DNNL_ARG_ATTR_ZERO_POINTS.

const post_ops get_post_ops() const

Returns post-ops previously set via set_post_ops().

Return

Post-ops.

void set_post_ops(const post_ops ops)

Sets post-ops.

Note

There is no way to check whether the post-ops would be supported by the target primitive. Any error will be reported by the respective primitive descriptor constructor.

Parameters
  • ops: Post-ops object to copy post-ops from.

void set_rnn_data_qparams(float scale, float shift)

Sets quantization scale and shift parameters for RNN data tensors.

For performance reasons, the low-precision configuration of the RNN primitives expect input activations to have the unsigned 8-bit integer data type. The scale and shift parameters are used to quantize floating-point data to unsigned integer and must be passed to the RNN primitive using attributes.

The quantization formula is scale * (data + shift).

Example usage:

// RNN parameters
int l = 2, t = 2, mb = 32, sic = 32, slc = 32, dic = 32, dlc = 32;
// Activations quantization parameters
float scale = 2.0f, shift = 0.5f;

primitive_attr attr;

// Set scale and shift for int8 quantization of activation
attr.set_rnn_data_qparams(scale, shift);

// Create and configure rnn op_desc
vanilla_rnn_forward::desc rnn_d(/* arguments */);
vanilla_rnn_forward::primitive_desc rnn_d(rnn_d, attr, engine);

Note

Quantization scale and shift are common for src_layer, src_iter, dst_iter, and dst_layer.

Parameters
  • scale: The value to scale the data by.

  • shift: The value to shift the data by.

void set_rnn_weights_qparams(int mask, const std::vector<float> &scales)

Sets quantization scaling factors for RNN weights tensors. The low-precision configuration of the RNN primitives expect input weights to use the signed 8-bit integer data type. The scaling factors are used to quantize floating-point data to signed integer and must be passed to RNN primitives using attributes.

Note

The dimension order is always native and does not depend on the actual layout used. For example, five-dimensional weights always have (l, d, i, g, o) logical dimension ordering.

Note

Quantization scales are common for weights_layer and weights_iteration

Parameters
  • mask: Scaling factors correspondence mask that defines the correspondence between the output tensor dimensions and the scales vector. The set i-th bit indicates that a dedicated scaling factor should be used each index along that dimension. Set the mask to 0 to use a common scaling factor for the whole output tensor.

  • scales: Constant vector of output scaling factors. The following equality must hold: \(scales.size() = \prod\limits_{d \in mask} weights.dims[d].\) Violations can only be detected when the attributes are used to create a primitive descriptor.