Continual Learning from a Stream of Trained Models

This post gives an overview of a recent line of research we are investigating at the PAILab and summarizes our recent paper accepted at CLVISION22. The paper proposes a new scenario, called “Ex-Model Continual Learning”, where a CL agent learns to distill the knowledge from a stream of pretrained models. This scenario opens the door to applications such as multi-agent continual learning, where multiple agents learn continually and share their knowledge without ever sharing the data, or even using the same training algorithm.

Multi-Agent Continual Learning

In continual learning (CL), an agent learns by interacting over time with an external environment or a stream of data. The environment is subject to continuous distribution shifts, and therefore the agent must learn to adapt to the new environment without forgetting the previously acquired knowledge. Additionally, the agent has some constraints, such as limited memory and computational resources.

This definition ignores an important question: what happens when multiple continual learning agents learn in parallel?

Fig.1 - Multiple agents in a continual learning scenario. Each agent learns continually from its own environment and communicates with the other agents.

In theory, each agent could ignore the existence of the other agents and learn independently. However, communication between the agents can help the learning process and lead to more general solutions. Since each agent explores and becomes an expert in only a small domain, the agents can cover the entire problem space only by communicating with one another.

Continual learning in a multi-agent world imposes some additional constraints and desiderata if we want the agents to communicate with each other:

Bonus Question: Isn’t this just plain old federated learning? No, because in federated learning there is a centralized controller, a shared training protocol, and strict synchronization between the agents. Here, we assume agents are independent but they may communicate with each other. We can see federated learning as a constrained version of Multi-Agent CL.

Ex-Model Continual Learning

Now, let’s introduce the “Ex-Model Continual Learning” scenario, a simplified multi-agent CL scenario. Let’s assume that models are trained until convergence on a single task and shared afterwards, as is typically done with public pretrained models. The data is private, but the model’s parameters are not. Over time, more models are shared by different agents in a continuous stream (of expert models). The continual learning problem then becomes distilling the knowledge of this stream of models into a single network. The main challenge is that the original data is not available anymore.

Fig.2 - A traditional CL scenario (left) and the ex-model CL scenario (right).

Notice how the scenario fits the desiderata defined above. By sharing only the model parameters, agents can share their knowledge while preserving data privacy and remaining independent.
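
To make the scenario a bit more concrete, here is a minimal sketch of what an ex-model stream could look like in PyTorch-style code. The names (ExModelExperience, make_ex_model_stream) are illustrative, not an existing API: the point is simply that the learner receives frozen experts, one per experience, and never sees the original data.

```python
# A minimal sketch of an ex-model stream, assuming a PyTorch-style setup.
# Instead of a stream of datasets, the continual learner receives a stream of
# frozen expert models, one per experience; the original data never travels.
from dataclasses import dataclass
import torch.nn as nn


@dataclass
class ExModelExperience:
    expert: nn.Module           # pretrained, frozen expert for one task
    classes_in_expert: list     # label space the expert was trained on


def make_ex_model_stream(experts, class_splits):
    """Wrap pretrained experts into a stream of ex-model experiences."""
    stream = []
    for expert, classes in zip(experts, class_splits):
        expert.eval()
        for p in expert.parameters():
            p.requires_grad_(False)  # experts are only queried, never updated
        stream.append(ExModelExperience(expert=expert, classes_in_expert=classes))
    return stream
```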

Ex-Model Distillation

We can train a model in an Ex-Model Continual Learning scenario by using Ex-Model Distillation (ED), which is a data-free knowledge distillation method. There are three components in the algorithm:

For more details, check the paper.
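
As a rough illustration of how a data-free distillation step might look in this setting, here is a hedged PyTorch-style sketch. It combines a frozen expert, a snapshot of the previous consolidated model used as a second teacher, and a surrogate input source (e.g. synthetic or auxiliary data, since the real data is unavailable). The exact components and losses used by ED are described in the paper; all names and hyperparameters below are illustrative.

```python
# A hedged sketch of one distillation step per incoming expert, without access
# to the expert's original training data. For simplicity we assume the output
# heads are aligned; in practice only the expert's own classes would be matched.
import copy
import torch
import torch.nn.functional as F


def distill_expert(cl_model, expert, sample_surrogate_inputs, optimizer,
                   steps=1000, alpha=0.5):
    expert.eval()
    prev_model = copy.deepcopy(cl_model).eval()   # snapshot of past knowledge
    cl_model.train()
    for _ in range(steps):
        x = sample_surrogate_inputs()             # synthetic or auxiliary data
        optimizer.zero_grad()
        student_logits = cl_model(x)
        with torch.no_grad():
            expert_logits = expert(x)             # knowledge of the new task
            prev_logits = prev_model(x)           # previously consolidated knowledge
        # match the new expert while staying close to the previous model
        loss = alpha * F.mse_loss(student_logits, expert_logits) \
             + (1 - alpha) * F.mse_loss(student_logits, prev_logits)
        loss.backward()
        optimizer.step()
    return cl_model
```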

The algorithm fits the desiderata that we defined above:

Experimental Results

For a complete overview of the details of the experiments and the results, check the paper. The main highlights are:

Possible Applications and Future Work

The paper presents a simple framework to consolidate knowledge from multiple agents over time. The algorithm is a simple baseline that can be improved upon, and it highlights several possible future directions.

For example, results could be improved by simplifying the scenario with additional assumptions, such as access to a small amount of data. Another option is more frequent communication between the agents, such as sending the model after each epoch. Finally, distillation performance can certainly be improved with techniques such as feature distillation [1] or better synthetic data generation.
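
As a pointer for the last direction, here is a minimal sketch of a feature-level distillation loss. It assumes both networks expose intermediate features; the optional projector is a hypothetical layer needed only when the two feature dimensionalities differ.

```python
# A minimal sketch of feature distillation: align the student's intermediate
# representation with the teacher's, in addition to (or instead of) the logits.
import torch.nn.functional as F


def feature_distillation_loss(student_features, teacher_features, projector=None):
    if projector is not None:
        student_features = projector(student_features)  # map to teacher's space
    return F.mse_loss(student_features, teacher_features.detach())
```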

Some examples of applications are:

Useful Links