This post is an overview of a recent line of research that we are investigating at the PAILab and a summary of our recent paper accepted at CLVISION22. The paper proposes a new scenario, called “Ex-Model Continual Learning”, where a CL agent learns to distill the knowledge from a stream of pretrained models. This scenario opens the door to applications such as multi-agent continual learning, where multiple agents learn continually and share their knowledge without ever sharing the data or even the same training algorithm.
Multiple Agents in Continual Learning
In continual learning (CL), an agent learns by interacting over time with an external environment or a stream of data. The environment is subject to continuous distribution shifts, and therefore the agent must learn to adapt to the new environment without forgetting the previously acquired knowledge. Additionally, the agent has some constraints, such as limited memory and computational resources.
This definition ignores an important question: what happens when multiple continual learning agents learn in parallel?
In theory, each agent could ignore the existence of the other agents and learn independently. However, communication between the agents can help the learning process and lead to better solutions. This is because each agent explores and becomes an expert in only a small portion of the domain, which means that the agents can cover the entire problem space only by communicating with each other.
Continual learning in a multi-agent world imposes some additional constraints and desiderata if we want the agents to communicate with each other:
- Reuse of expert knowledge: we want to share and reuse the agents’ knowledge as much as possible.
- Decentralization: while the agents can communicate with each other, they are independent entities trained with different algorithms, architectures, hardware, and so on. There is no centralized control or shared training protocol.
- Privacy: agents may share their models but not the raw data.
Bonus Question: Isn’t this just plain old federated learning? No, because in federated learning there is a centralized controller, a shared training protocol, and strict synchronization between the agents. Here, we assume agents are independent but they may communicate with each other. We can see federated learning as a constrained version of Multi-Agent CL.
Ex-Model Continual Learning
Now, let’s introduce the “Ex-Model Continual Learning” scenario, which is a simplified multi-agent CL scenario. Let’s assume that models are trained until convergence on a single task and shared afterwards, as is typically done with public pretrained models. The data is private, but the model’s parameters are not. Over time, more models are shared by different agents in a continuous stream (of expert models). The continual learning problem then becomes distilling the knowledge of this stream of models into a single network. The main challenge is that the original data is not available anymore.
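To make the setting concrete, here is a minimal sketch of the learning loop, assuming the stream delivers frozen expert networks one at a time. The function names below are illustrative, not the paper’s actual API:

```python
# Illustrative sketch of the Ex-Model CL loop (not the paper's actual API):
# instead of a stream of datasets, the learner receives a stream of trained experts.
def ex_model_continual_learning(expert_stream, consolidated_model, distill_step):
    """Consolidate a stream of pretrained experts into a single model.

    expert_stream:      iterable of frozen expert networks, one per task.
    consolidated_model: the single network that is learned continually.
    distill_step:       any data-free distillation routine (no original data is available).
    """
    for expert in expert_stream:  # experts arrive one at a time
        expert.eval()             # experts are frozen; only their outputs are used
        consolidated_model = distill_step(consolidated_model, expert)
    return consolidated_model
```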
Notice how the scenario fits the desiderata that we defined above. By sharing only the model’s parameters, agents can share their knowledge while preserving the data privacy and remaining independent.
Ex-Model Distillation
We can train a model in an Ex-Model Continual Learning scenario by using Ex-Model Distillation (ED), which is a data-free knowledge distillation method. There are three components in the algorithm:
- Data Sources: ED uses out-of-distribution data or synthetic data generated via model inversion (e.g. data impression).
- Replay Buffer: a buffer of fixed size is updated with the new data and used to keep a memory of the data generated at the previous steps.
- Double Knowledge Distillation: the model trained at the previous timestep and the current expert from the stream are used as teachers during knowledge distillation on the buffer’s data (see the sketch below).
For more details, check the paper.
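As a rough illustration of the double distillation, here is a minimal sketch of a single ED update in PyTorch. It assumes a standard KL-divergence distillation loss, and the batch names are illustrative; see the paper and the codebase for the exact formulation:

```python
import torch.nn.functional as F

def ed_distillation_step(student, prev_student, expert, buffer_batch, aux_batch,
                         optimizer, temperature=2.0):
    """One Ex-Model Distillation update (illustrative sketch, not the official code).

    student:      the consolidated model trained at the current step.
    prev_student: frozen copy of the student from the previous step (first teacher).
    expert:       frozen expert received from the stream (second teacher).
    buffer_batch: surrogate data sampled from the replay buffer (past steps).
    aux_batch:    surrogate data for the current expert (OOD data or model inversion).
    """
    def kd_loss(student_logits, teacher_logits, T):
        # Standard KL-based knowledge distillation loss with temperature T.
        return F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)

    # First teacher: the previous student preserves the knowledge of past experts.
    loss_past = kd_loss(student(buffer_batch), prev_student(buffer_batch).detach(), temperature)
    # Second teacher: the current expert provides the knowledge of the new task.
    loss_new = kd_loss(student(aux_batch), expert(aux_batch).detach(), temperature)

    loss = loss_past + loss_new
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that splitting the surrogate data between the two teachers as above is just one possible design choice; the paper details how the data sources and the buffer are actually combined.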
The algorithm fits the desiderata that we defined above:
- Reuse of expert knowledge via knowledge distillation and the use of out-of-distribution and synthetic data.
- Decentralization by training the experts independently. Knowledge distillation makes no assumption about the experts’ architectures.
- Privacy by using out-of-distribution data or model inversion techniques.
Experimental Results
For a complete overview of the details of the experiments and the results, check the paper. The main highlights are:
- While model inversion techniques work well in the joint scenario (all data available at once, no CL), they fail catastrophically in CL settings, as shown in the figure below.
- The ExML scenario is very difficult due to the absence of real data. Ex-Model Distillation provides a good baseline, but it is still very far from the optimal performance. See the table below.
Possible Applications and Future Work
The paper shows a simple framework to consolidate knowledge from multiple agents over time. The algorithm is a simple baseline that can be improved upon and highlights several possible future directions.
For example, it is possible to improve the results by simplifying the scenario with some additional assumptions, like having access to some of the original data. Another alternative is more frequent communication between the agents, such as sending the model after each epoch. Finally, it is certainly possible to improve the distillation performance with techniques such as feature distillation [1] or better synthetic data generation.
Some examples of applications are:
- Distillation of Pretrained Models: Multiple pretrained models solving the same task may be updated over time, and therefore integrating all of them in a single model becomes a continual learning problem. Notice that in general the agent does not have access to the original data and the pretrained models are unaware of each other.
- Distillation of Local Personalized Models: Personalized models are finetuned on the user’s data. Each personalized model is trained in a CL setting to save computation. Sharing models may be allowed, while data is private and never shared.
Useful Links
- arxiv preprint
- original codebase
- benchmarks and pretrained models are available in Avalanche