# K-Means¶

The K-Means algorithm solves clustering problem by partitioning $$n$$ feature vectors into $$k$$ clusters minimizing some criterion. Each cluster is characterized by a representative point, called a centroid.

 Operation Computational methods Programming Interface Training Lloyd’s train(…) train_input train_result Inference Lloyd’s infer(…) infer_input infer_result

## Mathematical formulation¶

### Training¶

Given the training set $$X = \{ x_1, \ldots, x_n \}$$ of $$p$$-dimensional feature vectors and a positive integer $$k$$, the problem is to find a set $$C = \{ c_1, \ldots, c_k \}$$ of $$p$$-dimensional centroids that minimize the objective function

$\Phi_{X}(C) = \sum_{i = 1}^n d^2(x_i, C),$

where $$d^2(x_i, C)$$ is the squared Euclidean distance from $$x_i$$ to the closest centroid in $$C$$,

$d^2(x_i, C) = \min_{1 \leq j \leq k} \| x_i - c_j \|^2, \quad 1 \leq i \leq n.$

Expression $$\|\cdot\|$$ denotes $$L_2$$ norm.

Note

In the general case, $$d$$ may be an arbitrary distance function. Current version of the oneDAL spec defines only Euclidean distance case.

#### Training method: Lloyd’s¶

The Lloyd’s method [Lloyd82] consists in iterative updates of centroids by applying the alternating Assignment and Update steps, where $$t$$ denotes a index of the current iteration, e.g., $$C^{(t)} = \{ c_1^{(t)}, \ldots, c_k^{(t)} \}$$ is the set of centroids at the $$t$$-th iteration. The method requires the initial centroids $$C^{(1)}$$ to be specified at the beginning of the algorithm ($$t = 1$$).

(1) Assignment step: Assign each feature vector $$x_i$$ to the nearest centroid. $$y_i^{(t)}$$ denotes the assigned label (cluster index) to the feature vector $$x_i$$.

$y_i^{(t)} = \mathrm{arg}\min_{1 \leq j \leq k} \| x_i - c_j^{(t)} \|^2, \quad 1 \leq i \leq n.$

Each feature vector from the training set $$X$$ is assigned to exactly one centroid so that $$X$$ is partitioned to $$k$$ disjoint sets (clusters)

$S_j^{(t)} = \big\{ \; x_i \in X : \; y_i^{(t)} = j \; \big\}, \quad 1 \leq j \leq k.$

(2) Update step: Recalculate centroids by averaging feature vectors assigned to each cluster.

$c_j^{(t + 1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x, \quad 1 \leq j \leq k.$

The steps (1) and (2) are performed until the following stop condition,

$\sum_{j=1}^k \big\| c_j^{(t)} - c_j^{(t+1)} \big\|^2 < \varepsilon,$

is satisfied or number of iterations exceeds the maximal value $$T$$ defined by the user.

### Inference¶

Given the inference set $$X' = \{ x_1', \ldots, x_m' \}$$ of $$p$$-dimensional feature vectors and the set $$C = \{ c_1, \ldots, c_k \}$$ of centroids produced at the training stage, the problem is to predict the index $$y_j' \in \{ 0, \ldots, k-1 \}$$, $$1 \leq j \leq m$$, of the centroid in accordance with a method-defined rule.

#### Inference method: Lloyd’s¶

Lloyd’s inference method computes the $$y_j'$$ as an index of the centroid closest to the feature vector $$x_j'$$,

$y_j' = \mathrm{arg}\min_{1 \leq l \leq k} \| x_j' - c_l \|^2, \quad 1 \leq j \leq m.$

## Usage example¶

### Training¶

kmeans::model<> run_training(const table& data,
const table& initial_centroids) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(10)
.set_max_iteration_count(50)
.set_accuracy_threshold(1e-4);

const auto result = train(kmeans_desc, data, initial_centroids);

print_table("labels", result.get_labels());
print_table("centroids", result.get_model().get_centroids());
print_value("objective", result.get_objective_function_value());

return result.get_model();
}


### Inference¶

table run_inference(const kmeans::model<>& model,
const table& new_data) {
const auto kmeans_desc = kmeans::descriptor<float>{}
.set_cluster_count(model.get_cluster_count());

const auto result = infer(kmeans_desc, model, new_data);

print_table("labels", result.get_labels());
}


## Programming Interface¶

All types and functions in this section shall be declared in the oneapi::dal::kmeans namespace and be available via inclusion of the oneapi/dal/algo/kmeans.hpp header file.

### Descriptor¶

template <typename Float = float,
typename Method = method::by_default,
class descriptor {
public:
explicit descriptor(std::int64_t cluster_count = 2);

int64_t get_cluster_count() const;
descriptor& set_cluster_count(int64_t);

int64_t get_max_iteration_count() const;
descriptor& set_max_iteration_count(int64_t);

double get_accuracy_threshold() const;
descriptor& set_accuracy_threshold(double);
};

template<typename Float = float, typename Method = method::by_default, typename Task = task::by_default>
class descriptor
Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.

• Task – Tag-type that specifies the type of the problem to solve. Can be task::clustering.

Constructors

descriptor(std::int64_t cluster_count = 2)

Creates a new instance of the class with the given cluster_count.

Properties

int64_t cluster_count = 2

The number of clusters $$k$$.

Getter & Setter
int64_t get_cluster_count() const
descriptor & set_cluster_count(int64_t)
Invariants
cluster_count > 0
int64_t max_iteration_count = 100

The maximum number of iterations $$T$$.

Getter & Setter
int64_t get_max_iteration_count() const
descriptor & set_max_iteration_count(int64_t)
Invariants
max_iteration_count >= 0
double accuracy_threshold = 0.0

The threshold $$\varepsilon$$ for the stop condition.

Getter & Setter
double get_accuracy_threshold() const
descriptor & set_accuracy_threshold(double)
Invariants
accuracy_threshold >= 0.0

#### Method tags¶

namespace method {
struct lloyd {};
using by_default = lloyd;
} // namespace method

struct lloyd

Tag-type that denotes Lloyd’s computational method.

using by_default = lloyd

Alias tag-type for Lloyd’s computational method.

namespace task {
struct clustering {};
using by_default = clustering;

struct clustering

Tag-type that parameterizes entities used for solving clustering problem.

using by_default = clustering

Alias tag-type for the clustering task.

### Model¶

template <typename Task = task::by_default>
class model {
public:
model();

const table& get_centroids() const;

int64_t get_cluster_count() const;
};

template<typename Task = task::by_default>
class model
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

model()

Creates a new instance of the class with the default property values.

Properties

const table &centroids = table{}

A $$k \times p$$ table with the cluster centroids. Each row of the table stores one centroid.

Getter & Setter
const table & get_centroids() const
int64_t cluster_count = 0

Number of clusters $$k$$ in the trained model.

Getter & Setter
int64_t get_cluster_count() const
Invariants
cluster_count == centroids.row_count

### Training train(...)¶

#### Input¶

template <typename Task = task::by_default>
class train_input {
public:
train_input(const table& data = table{},
const table& initial_centroids = table{});

const table& get_data() const;
train_input& set_data(const table&);

const table& get_initial_centroids() const;
train_input& set_initial_centroids(const table&);
};

template<typename Task = task::by_default>
class train_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_input(const table &data = table{}, const table &initial_centroids = table{})

Creates a new instance of the class with the given data and initial_centroids.

Properties

const table &data

An $$n \times p$$ table with the data to be clustered, where each row stores one feature vector.

Getter & Setter
const table & get_data() const
train_input & set_data(const table &)
const table &initial_centroids

A $$k \times p$$ table with the initial centroids, where each row stores one centroid.

Getter & Setter
const table & get_initial_centroids() const
train_input & set_initial_centroids(const table &)

#### Result¶

template <typename Task = task::by_default>
class train_result {
public:
train_result();

const table& get_labels() const;

int64_t get_iteration_count() const;

double get_objective_function_value() const;
};

template<typename Task = task::by_default>
class train_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

train_result()

Creates a new instance of the class with the default property values.

Properties

const model<Task> &model = model<Task>{}

The trained K-means model.

Getter & Setter
const model< Task > & get_model() const
const table &labels = table{}

An $$n \times 1$$ table with the labels $$y_i$$ assigned to the samples $$x_i$$ in the input data, $$1 \leq 1 \leq n$$.

Getter & Setter
const table & get_labels() const
int64_t iteration_count = 0

The number of iterations performed by the algorithm.

Getter & Setter
int64_t get_iteration_count() const
Invariants
iteration_count >= 0
double objective_function_value

The value of the objective function $$\Phi_X(C)$$, where $$C$$ is model.centroids (see kmeans::model::centroids).

Getter & Setter
double get_objective_function_value() const
Invariants
objective_function_value >= 0.0

#### Operation¶

template<typename Float, typename Method, typename Task>
train_result<Task> train(const descriptor<Float, Method, Task> &desc, const train_input<Task> &input)

Runs the training operation for K-Means clustering. For more details see oneapi::dal::train.

Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.

• Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Parameters
• desc – Descriptor of the algorithm.

• input – Input data for the training operation.

Preconditions
input.data.has_data == true
input.initial_centroids.row_count == desc.cluster_count
input.initial_centroids.column_count == input.data.column_count
Postconditions
result.labels.row_count == input.data.row_count
result.labels.column_count == 1
result.labels[i] >= 0
result.labels[i] < desc.cluster_count
result.iteration_count <= desc.max_iteration_count
result.model.centroids.row_count == desc.cluster_count
result.model.centroids.column_count == input.data.column_count

### Inference infer(...)¶

#### Input¶

template <typename Task = task::by_default>
class infer_input {
public:
const table& data = table{});

const table& get_data() const;
infer_input& set_data(const table&);
};

template<typename Task = task::by_default>
class infer_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_input(const model<Task> &m = model<Task>{}, const table &data = table{})

Creates a new instance of the class with the given model and data.

Properties

const model<Task> &model = model<Task>{}

An $$n \times p$$ table with the data to be assigned to the clusters, where each row stores one feature vector.

Getter & Setter
const model< Task > & get_model() const
infer_input & set_model(const model< Task > &)
const table &data = table{}

The trained K-Means model.

Getter & Setter
const table & get_data() const
infer_input & set_data(const table &)

#### Result¶

template <typename Task = task::by_default>
class infer_result {
public:
infer_result();

const table& get_labels() const;

double get_objective_function_value() const;
};

template<typename Task = task::by_default>
class infer_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Constructors

infer_result()

Creates a new instance of the class with the default property values.

Properties

const table &labels = table{}

An $$n \times 1$$ table with assignments labels to feature vectors in the input data.

Getter & Setter
const table & get_labels() const
double objective_function_value = 0.0

The value of the objective function $$\Phi_X(C)$$, where $$C$$ is defined by the corresponding infer_input::model::centroids.

Getter & Setter
double get_objective_function_value() const
Invariants
objective_function_value >= 0.0

#### Operation¶

template<typename Float, typename Method, typename Task>
infer_result<Task> infer(const descriptor<Float, Method, Task> &desc, const infer_input<Task> &input)

Runs the inference operation for K-Means clustering. For more details see oneapi::dal::infer.

Template Parameters
• Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

• Method – Tag-type that specifies an implementation of algorithm. Can be method::lloyd.

• Task – Tag-type that specifies type of the problem to solve. Can be task::clustering.

Parameters
• desc – Descriptor of the algorithm.

• input – Input data for the inference operation.

Preconditions
input.data.has_data == true
input.model.centroids.has_data == true
input.model.centroids.row_count == desc.cluster_count
input.model.centroids.column_count == input.data.column_count
Postconditions
result.labels.row_count == input.data.row_count
result.labels.column_count == 1
result.labels[i] >= 0
result.labels[i] < desc.cluster_count