Knowledge Distillation in Machine Learning
Knowledge distillation aims to condense a complex ‘teacher’ model into a simpler ‘student’ model while maintaining performance. It leverages the information in the teacher’s output layer (its predictions) as well as its intermediate feature representations.
Typical offline distillation methods use two loss functions to optimize knowledge transfer from a pre-trained teacher model to the student model. This article covers several KD methods that build on this idea, from relation-based and feature-based transfer to graph-based and on-the-fly approaches.
Feature-Based Knowledge Distillation
Deep models have millions or billions of parameters and require significant memory and compute to train and deploy, which makes them impractical for many real-world settings. Knowledge distillation reduces the capacity gap between a large teacher and a compact student by using the teacher’s knowledge to guide the student network, improving the student’s performance beyond what it achieves when trained alone.
The simplest form of knowledge distillation leverages the soft targets produced by the teacher model and optimizes the student network to match them. For example, Bucilua et al. trained a compact network (Net-S) to produce output logits in the same fashion as the cumbersome teacher model and showed promising results in image classification.
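The soft-target idea can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the temperature T and the T² scaling follow the commonly used formulation from Hinton et al., and the function names are ours:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; larger T yields a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, T=4.0):
    """Mean KL divergence between the softened teacher and student
    distributions, scaled by T**2 so gradient magnitudes stay comparable
    across temperatures."""
    p = softmax(teacher_logits, T)          # teacher soft targets
    q = softmax(student_logits, T)          # student soft predictions
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(np.mean(kl) * T ** 2)
```

A student whose logits already mirror the teacher’s incurs zero loss, while a student with uninformative logits is penalized; minimizing this term pushes the student’s distribution toward the teacher’s.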
Other studies use more sophisticated methods to transfer knowledge between the teacher and student models. These approaches can be broadly grouped into three categories: response-based, feature-based, and relation-based knowledge transfer. Feature-based knowledge transfer typically involves matching the intermediate features learned by the teacher and student models, usually through a regression loss such as the mean squared error between feature maps.
Relation-Based Knowledge Distillation
Unlike feature-based knowledge distillation, which transfers the intermediate representations of the teacher model to the student network, relation-based knowledge distillation focuses on the relationships between different data samples or layers. These relationships can be summarized in many ways, including graphs, similarity matrices, and feature embeddings, or represented as a set of probability distributions. Under this approach, the student model is trained so that the relational structure of its representations matches that of the teacher.
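One simple instance of this idea compares batch-level similarity matrices. The sketch below (our own illustrative functions, in the spirit of similarity-preserving distillation) builds a cosine-similarity matrix over a batch of features for both networks and penalizes the mismatch; note the teacher and student feature dimensions may differ, since only the batch-level relations are compared:

```python
import numpy as np

def similarity_matrix(feats):
    """Cosine-similarity matrix over a batch of feature vectors."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    return f @ f.T

def relation_loss(student_feats, teacher_feats):
    """Mean squared mismatch between the two batch similarity structures."""
    g_s = similarity_matrix(student_feats)   # (batch, batch)
    g_t = similarity_matrix(teacher_feats)   # (batch, batch)
    return float(np.mean((g_s - g_t) ** 2))
```

Because only pairwise relations are matched, a narrow student can imitate a wide teacher without any projection layer between their feature spaces.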
As in ordinary deep model training, a weighted-average loss is used to optimize knowledge transfer between the teacher and student models. This loss combines a distillation term, which minimizes the difference between the student’s predictions and the teacher’s soft targets, with a cross-entropy term, which minimizes the difference between the student’s predictions and the ground-truth (hard) labels. This approach has been applied to various tasks such as image classification, object recognition, and segmentation.
Graph-Based Knowledge Distillation
In real-world applications, deep models must process data as it arrives, often on mobile and edge devices with limited memory and processing power. Distillation is a model compression technique that allows knowledge to be transferred from large, unwieldy models to smaller student models that can be deployed on these devices.
Graph-based knowledge distillation methods leverage graph neural networks (GNNs) to learn vectorized representations of graph data and have been applied in tasks such as node classification, link prediction, and graph clustering. These methods can be used to build more compact student models, guided by larger teacher GNNs, that can be deployed on mobile phones and edge devices.
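For readers unfamiliar with GNNs, the core building block such methods distill is a graph-convolution step: each node averages its neighbors’ features and applies a learned projection. The sketch below uses the widely known symmetric normalization (Kipf–Welling style); a smaller student GNN, e.g. one with a narrower weight matrix, could then be trained to match a teacher’s node embeddings:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: add self-loops, symmetrically normalize
    the adjacency matrix, aggregate neighbor features, project, apply ReLU."""
    A_hat = A + np.eye(A.shape[0])                  # self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

# Tiny example: a triangle graph with one-hot node features.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
X = np.eye(3)
H = gcn_layer(A, X, np.ones((3, 2)))   # node embeddings, shape (3, 2)
```

A feature- or relation-based distillation loss (as above) can then be applied to these node embeddings rather than to image features.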
Heterogeneous knowledge distillation extends this to transfer knowledge between different modalities, narrowing the cross-modal semantic gap, and can combine multiple modalities via contrastive learning. Recent surveys systematically categorize graph-based knowledge distillation methods and discuss their limitations and future research directions.
On-the-Fly Native Ensemble (ONE) Knowledge Distillation
Many knowledge distillation algorithms have been developed. They are generally compared by how much they reduce model size and how much accuracy they retain relative to the original teacher model. However, there is no one-size-fits-all solution: one must consider the system’s requirements and the available data.
It is also worth mentioning that knowledge distillation methods are often applied to overcome limitations in training student models (e.g., privacy/confidentiality issues or the difficulty of getting access to the original dataset).
Unlike the teacher–student approaches above, ONE requires no pre-trained teacher. It adds auxiliary branches to the target network and uses a gating module to combine the branch outputs into an ensemble teacher constructed on the fly during training; each branch is then trained with its usual cross-entropy loss plus a distillation loss toward the ensemble’s softened predictions. The reported results show that the resulting knowledge-distilled student models consistently surpass the same networks trained from scratch, with gains across a range of architectures.
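The ensemble-teacher computation can be sketched as follows. This is a simplified illustration: in ONE the gate weights are produced by a learned gating module over shared features, whereas here they are passed in as plain scores, and the function names are ours:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def one_ensemble_logits(branch_logits, gate_scores):
    """Gate-weighted sum of branch logits forms the on-the-fly teacher.
    branch_logits: (num_branches, batch, classes); gate_scores: (num_branches,)."""
    g = softmax(gate_scores)                         # normalized gate weights
    return np.tensordot(g, branch_logits, axes=(0, 0))

def branch_distill_loss(branch_logits, gate_scores, T=3.0):
    """Average KL from each branch toward the softened ensemble distribution."""
    p_ens = softmax(one_ensemble_logits(branch_logits, gate_scores), T)
    total = 0.0
    for b in branch_logits:
        q = softmax(b, T)
        total += np.mean(np.sum(p_ens * (np.log(p_ens + 1e-12)
                                         - np.log(q + 1e-12)), axis=-1)) * T ** 2
    return float(total / len(branch_logits))
```

When the branches agree, the ensemble coincides with each branch and the distillation term vanishes; when they disagree, each branch is pulled toward the stronger gated consensus, which is the mechanism behind ONE’s on-the-fly teaching.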