Glossary#

Machine learning terms#

Categorical feature#

A feature with a discrete domain. Can be nominal or ordinal.

Synonyms: discrete feature, qualitative feature

Classification#

A supervised machine learning problem of assigning labels to feature vectors.

Examples: predict what type of object is in a picture (a dog or a cat?), predict whether or not an email is spam

Clustering#

An unsupervised machine learning problem of grouping feature vectors into clusters, which are usually encoded as nominal values.

Example: find big star clusters in space images

Continuous feature#

A feature with values in a domain of real numbers. Can be interval or ratio.

Synonyms: quantitative feature, numerical feature

Examples: a person’s height, the price of the house

CSV file#

A comma-separated values (CSV) file is a type of text file. Each line in a CSV file is a record containing fields that are separated by a delimiter. Fields can be in a numerical or a text format. Text usually refers to categorical values. By default, the delimiter is a comma, but, generally, it can be any character. For more details, see.
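
The record structure described above can be illustrated with a minimal C++ sketch. The helper name `split_record` is hypothetical and is not part of any library. The sketch assumes that fields contain no quoted or escaped delimiters.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV record into fields by a delimiter.
// Simplification: quoted fields and escaped delimiters are not handled.
std::vector<std::string> split_record(const std::string& line, char delimiter = ',') {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(line);
    while (std::getline(stream, field, delimiter)) {
        fields.push_back(field);
    }
    // std::getline drops a trailing empty field, so restore it explicitly.
    if (!line.empty() && line.back() == delimiter) {
        fields.emplace_back();
    }
    return fields;
}

// Usage sketch: a record with one numerical and two text fields,
// separated by a non-default delimiter.
void split_record_example() {
    const std::vector<std::string> fields = split_record("1.5;red;large", ';');
    assert(fields.size() == 3);
    assert(fields[1] == "red");
}
```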

Dataset#

A collection of observations.

Dimensionality reduction#

A problem of transforming a set of feature vectors from a high-dimensional space into a low-dimensional space while retaining meaningful properties of the original feature vectors.

Feature#

A particular property or quality of a real object or an event. Has a defined type and domain. In machine learning problems, features are considered input variables that are independent of each other.

Synonyms: attribute, variable, input variable

Feature vector#

A vector that encodes information about a real object, an event, or a group of objects or events. Contains at least one feature.

Example: A rectangle can be described by two features: its width and height

Inference#

A process of applying a trained model to a dataset in order to predict response values based on input feature vectors.

Synonym: prediction

Inference set#

A dataset used at the inference stage. Usually without responses.

Interval feature#

A continuous feature with values that can be compared, added or subtracted, but cannot be multiplied or divided.

Examples: a time frame scale, a temperature in Celsius or Fahrenheit

Label#

A response with categorical (nominal or ordinal) values. This is an output in classification and clustering problems.

Example: the spam-detection problem has a binary label indicating whether the email is spam or not

Model#

An entity that stores information necessary to run inference on a new dataset. Typically a result of a training process.

Example: in the linear regression algorithm, the model contains weight values for each input feature and a single bias value
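
As an illustration of the linear regression example, the sketch below stores one weight per input feature plus a bias and runs inference on a single feature vector. The names `linear_model` and `infer` are hypothetical and do not correspond to the oneDAL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model type: one weight per input feature plus a single bias.
struct linear_model {
    std::vector<double> weights;
    double bias;
};

// Inference for one feature vector:
// response = bias + sum over i of weights[i] * features[i].
double infer(const linear_model& model, const std::vector<double>& features) {
    assert(features.size() == model.weights.size());
    double response = model.bias;
    for (std::size_t i = 0; i < features.size(); ++i) {
        response += model.weights[i] * features[i];
    }
    return response;
}
```

With weights (2, 3), a bias of 1, and the feature vector (1, 2), this sketch yields 1 + 2 * 1 + 3 * 2 = 9.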

Nominal feature#

A categorical feature without ordering between values. Only the equality operation is defined for nominal features.

Examples: a person’s gender, the color of a car

Observation#

A feature vector and zero or more responses.

Synonyms: instance, sample

Ordinal feature#

A categorical feature with defined operations of equality and ordering between values.

Example: a student’s grade

Outlier#

An observation that is significantly different from the other observations.

Ratio feature#

A continuous feature with defined operations of equality, comparison, addition, subtraction, multiplication, and division. A value of zero means the absence of the measured property.

Example: the height of a tower

Regression#

A supervised machine learning problem of assigning continuous responses to feature vectors.

Example: predict temperature based on weather conditions

Response#

A property of a real object or event whose dependency on the feature vector needs to be determined in a supervised learning problem. While a feature is an input in a machine learning problem, the response is one of the outputs that the model can produce at the inference stage.

Synonym: dependent variable

Supervised learning#

A training process that uses a dataset with information about dependencies between features and responses. The goal is to obtain a model of the dependencies between input feature vectors and responses.

Training#

A process of creating a model based on information extracted from a training set. The resulting model is selected in accordance with some quality criteria.

Training set#

A dataset used at the training stage to create a model.

Unsupervised learning#

A training process that uses a training set with no responses. The goal is to find hidden patterns inside feature vectors and dependencies between them.

oneDAL terms#

Accessor#

A oneDAL concept for an object that provides access to the data of another object in a specific data format. It abstracts data access from the interface of an object and provides uniform access to the data stored in objects of different types.
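
The idea can be sketched in plain C++ as follows. This is not the oneDAL accessor API: the storage types `row_major_matrix` and `column_major_matrix`, the class `block_reader`, and its method `read_rows` are hypothetical names chosen for illustration. The reader returns rows in one fixed format (row-major floats) regardless of how the underlying object stores them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical storage types that keep the same values in different orders.
struct row_major_matrix {
    std::size_t row_count;
    std::size_t column_count;
    std::vector<float> data; // element (r, c) is data[r * column_count + c]
    float element(std::size_t r, std::size_t c) const { return data[r * column_count + c]; }
};

struct column_major_matrix {
    std::size_t row_count;
    std::size_t column_count;
    std::vector<float> data; // element (r, c) is data[c * row_count + r]
    float element(std::size_t r, std::size_t c) const { return data[c * row_count + r]; }
};

// Hypothetical accessor-like class: copies rows [first_row, last_row) out of
// any supported storage type as one row-major block of floats.
template <typename Matrix>
class block_reader {
public:
    explicit block_reader(const Matrix& matrix) : matrix_(matrix) {}

    std::vector<float> read_rows(std::size_t first_row, std::size_t last_row) const {
        assert(first_row <= last_row && last_row <= matrix_.row_count);
        std::vector<float> block;
        block.reserve((last_row - first_row) * matrix_.column_count);
        for (std::size_t r = first_row; r < last_row; ++r) {
            for (std::size_t c = 0; c < matrix_.column_count; ++c) {
                block.push_back(matrix_.element(r, c));
            }
        }
        return block;
    }

private:
    const Matrix& matrix_; // the reader does not own the data it exposes
};
```

Code that consumes `read_rows` does not need to know which storage type sits behind the reader.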

Batch mode#

The computation mode for an algorithm in oneDAL, where all the data needed for computation is available at the start and fits into the memory of the device on which the computations are performed.

Builder#

A oneDAL concept for an object that encapsulates the creation process of another object and enables its iterative creation.
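
A minimal sketch of the builder idea, using hypothetical names (`point_set`, `point_set_builder`, `add_row`, `build`) rather than the oneDAL API: the builder collects rows one at a time, and the final object is created only when `build` is called.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical result type: its contents cannot be changed after construction.
class point_set {
public:
    point_set(std::vector<double> data, std::size_t column_count)
        : data_(std::move(data)), column_count_(column_count) {}

    std::size_t get_row_count() const {
        return column_count_ == 0 ? 0 : data_.size() / column_count_;
    }

    double at(std::size_t row, std::size_t column) const {
        return data_[row * column_count_ + column];
    }

private:
    std::vector<double> data_;
    std::size_t column_count_;
};

// Hypothetical builder: encapsulates how a point_set is assembled
// and lets the caller create it iteratively, row by row.
class point_set_builder {
public:
    explicit point_set_builder(std::size_t column_count) : column_count_(column_count) {}

    point_set_builder& add_row(const std::vector<double>& row) {
        assert(row.size() == column_count_);
        data_.insert(data_.end(), row.begin(), row.end());
        return *this; // allows chained calls
    }

    point_set build() const { return point_set(data_, column_count_); }

private:
    std::vector<double> data_;
    std::size_t column_count_;
};
```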

Contiguous data#

Data that are stored as one contiguous memory block. One of the characteristics of a data format.

Data format#

Representation of the internal structure of the data.

Examples: data can be stored in array-of-structures or compressed-sparse-row format
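
To make the compressed-sparse-row example concrete, the sketch below converts a small dense matrix into three CSR arrays. The type `csr_matrix` and the function `to_csr` are hypothetical, and zero-based indices are an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the compressed-sparse-row (CSR) format,
// using zero-based indices.
struct csr_matrix {
    std::vector<double> values;              // non-zero values, row by row
    std::vector<std::size_t> column_indices; // column of each stored value
    std::vector<std::size_t> row_offsets;    // row r occupies [row_offsets[r], row_offsets[r + 1])
};

// Converts a dense matrix, given as a row-major array, into CSR.
csr_matrix to_csr(const std::vector<double>& dense,
                  std::size_t row_count,
                  std::size_t column_count) {
    assert(dense.size() == row_count * column_count);
    csr_matrix result;
    result.row_offsets.push_back(0);
    for (std::size_t r = 0; r < row_count; ++r) {
        for (std::size_t c = 0; c < column_count; ++c) {
            const double value = dense[r * column_count + c];
            if (value != 0.0) {
                result.values.push_back(value);
                result.column_indices.push_back(c);
            }
        }
        // After each row, record where the next row starts.
        result.row_offsets.push_back(result.values.size());
    }
    return result;
}
```

For the 3x3 matrix with rows (1, 0, 0), (0, 0, 2), (0, 3, 0), this produces values (1, 2, 3), column indices (0, 2, 1), and row offsets (0, 1, 2, 3).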

Data layout#

A characteristic of a data format that describes the order of elements in a contiguous data block.

Example: row-major format, where elements are stored row by row
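
The row-major example corresponds to a simple offset formula, sketched below. The column-major variant is added only for contrast and is not mentioned in the definition above. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (row, column) in a row-major block:
// rows are stored one after another.
constexpr std::size_t row_major_offset(std::size_t row,
                                       std::size_t column,
                                       std::size_t column_count) {
    return row * column_count + column;
}

// For contrast, a column-major block stores columns one after another.
constexpr std::size_t column_major_offset(std::size_t row,
                                          std::size_t column,
                                          std::size_t row_count) {
    return column * row_count + row;
}

// Usage sketch for a block with 2 rows and 3 columns.
void layout_example() {
    // Element (1, 0) is the first element of the second row.
    assert(row_major_offset(1, 0, 3) == 3);
    // In column-major order it is the second element of the first column.
    assert(column_major_offset(1, 0, 2) == 1);
}
```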

Data type#

An attribute of data used by a compiler to store and access them. Includes size in bytes, encoding principles, and available operations (in terms of a programming language).

Examples: int32_t, float, double

Flat data#

A block of contiguous homogeneous data.

Getter#

A method that returns the value of a private member variable.

Example:

std::int64_t get_row_count() const;

Heterogeneous data#

Data that contain values of different data types or with different sets of operations defined on them. One of the characteristics of a data format.

Example: A dataset with 100 observations of three interval features. The first two features are of float32 data type, while the third one is of float64 data type.

Homogeneous data#

Data with values of a single data type and the same set of available operations defined on them. One of the characteristics of a data format.

Example: A dataset with 100 observations of three interval features, each of type float32

Immutability#

An object is immutable if its state cannot be changed after creation.

Metadata#

Information about the logical and physical structure of an object. All possible combinations of metadata values describe the full set of possible objects of a given type. Metadata do not expose information that is not a part of a type definition, for example, implementation details.

Example: a table object can contain three nominal features with 100 observations (the logical part of metadata). This object can store data as a sparse CSR array and provide direct access to it (the physical part).

Online mode#

The computation mode for an algorithm in oneDAL, where the data needed for computation becomes available in parts over time.
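
As a rough illustration of this mode, the sketch below computes a mean from data that arrives in blocks: each block updates a small partial result, and the final value is produced once all blocks have been seen. The names `partial_mean`, `update`, and `finalize` are hypothetical and are not the oneDAL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical partial result kept between blocks.
struct partial_mean {
    double sum = 0.0;
    std::size_t count = 0;
};

// Consumes one block of data as it becomes available.
void update(partial_mean& partial, const std::vector<double>& block) {
    for (double value : block) {
        partial.sum += value;
    }
    partial.count += block.size();
}

// Produces the final result after the last block has been processed.
double finalize(const partial_mean& partial) {
    assert(partial.count > 0);
    return partial.sum / static_cast<double>(partial.count);
}
```

Only the partial result has to stay in memory, so the full dataset never needs to be available at once.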

Reference-counted object#

A copy-constructible and copy-assignable oneDAL object which stores the number of references to the unique implementation. Both copy operations defined for this object are lightweight, which means that each time a new object is created, only the number of references is increased. An implementation is automatically freed when the number of references becomes equal to zero.
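
The behavior can be approximated in standard C++ with `std::shared_ptr`, as in the sketch below. The class `shared_block` and its methods are hypothetical, and the sketch makes no claim about how oneDAL implements reference counting internally.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical reference-counted handle: all copies share one implementation.
class shared_block {
public:
    explicit shared_block(std::vector<double> data)
        : impl_(std::make_shared<std::vector<double>>(std::move(data))) {}

    // Number of handles that currently refer to the implementation.
    long get_reference_count() const { return impl_.use_count(); }

    // Address of the shared storage, identical for all copies.
    const double* get_data() const { return impl_->data(); }

private:
    // The implementation is freed automatically when the last handle is destroyed.
    std::shared_ptr<std::vector<double>> impl_;
};

// Usage sketch: copying a handle does not copy the underlying data.
void shared_block_example() {
    shared_block first(std::vector<double>{1.0, 2.0, 3.0});
    shared_block second = first; // lightweight copy: only the counter changes
    assert(first.get_reference_count() == 2);
    assert(first.get_data() == second.get_data());
}
```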

Setter#

A method that accepts a single parameter and assigns its value to a private member variable.

Example:

void set_row_count(std::int64_t row_count);

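
The getter and setter signatures shown in this glossary could belong to a class like the one sketched below. The class name `table_shape` and the member `row_count_` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical class that keeps its state in a private member variable.
class table_shape {
public:
    // Getter: returns the value of the private member variable.
    std::int64_t get_row_count() const { return row_count_; }

    // Setter: accepts a single parameter and assigns it to the member.
    void set_row_count(std::int64_t row_count) { row_count_ = row_count; }

private:
    std::int64_t row_count_ = 0;
};

// Usage sketch.
void table_shape_example() {
    table_shape shape;
    shape.set_row_count(100);
    assert(shape.get_row_count() == 100);
}
```
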
Table#

A oneDAL concept for a dataset that contains only numerical data, categorical or continuous. Serves as a means of transferring data between the user’s application and computations inside oneDAL. Hides details of the data format and generalizes access to the data.

Workload#

A problem of applying a oneDAL algorithm to a dataset.

Common oneAPI terms#

API#

Application Programming Interface

DPC++#

Data Parallel C++ (DPC++) is a high-level language designed for data parallel programming productivity. DPC++ is based on SYCL* from the Khronos* Group to support data parallelism and heterogeneous programming.

Host/Device#

Terms from OpenCL [OpenCLSpec]: the host is the CPU that controls the connected GPU, and the device is that GPU, which executes kernels.

JIT#

Just-in-time compilation: compilation during the execution of a program.

Kernel#

Code written in OpenCL [OpenCLSpec] or SYCL and executed on a GPU device.

SPIR-V#

Standard Portable Intermediate Representation - V is a language for intermediate representation of compute kernels.

SYCL#

SYCL(TM) [SYCLSpec] is a high-level programming model for OpenCL(TM) that enables code for heterogeneous processors to be written in a “single-source” style using completely standard C++.