Jyothish Pari Massachusetts Institute of Technology

Samy Jelassi Harvard University

Pulkit Agrawal Massachusetts Institute of Technology

**Arxiv Code**


Overview


The idea of collectively utilizing the intelligence of individual entities is commonplace in nature -- organisms often come together to form collectives such as bee colonies, whale pods, and human societies. Coordination between individuals (or models) with potentially specialized roles (or functions) can enhance the capabilities of the whole. Imagine we need to solve a problem related to disease modeling. Instead of training one person to be an expert in two fields, we may recruit two specialized experts -- a mathematician and a biologist -- who can work together on the problem. Effective collaboration requires the two experts to share a common language. Regardless of their skills, if one speaks only Hindi and the other speaks only German, their collaboration will be limited. Thus, effective collaboration requires entities that are specialized and can communicate in a shared language -- a phenomenon we term compatible specialization.

[Figure: (left) merging performance as a base model is fine-tuned on two tasks separately; (right) merging features across different layer positions in two models]

The left sub-figure shows what happens as we fine-tune a base model on two different tasks separately while measuring merging performance: merging first improves, then degrades after a critical point. The right sub-figure studies how merging features across different layers of two models affects merging performance. The teal diagonal indicates that layers at the same index / position are merged. As we attempt to merge layers that are progressively further apart positionally in the network, performance saturates. We hypothesize that both phenomena result from a lack of compatible specialization: models need to be compatible to be useful collectively.

In the following sections we provide evidence that the current paradigm of feature-based merging faces a trade-off. As fine-tuning progresses, models gain specialization; however, their features diverge (measured via Centered Kernel Alignment, CKA). Therefore, even as the individual models become more specialized, the lack of feature compatibility prevents better merging after a critical point. Moreover, when we explore more complex merging schemes that combine features across different layers via Mixture-of-Experts (MoE) routing, we find that layers too far apart positionally are not compatible for merging: as the positional distance between layers grows, their CKA feature similarity decreases. In both settings, then, there is a fundamental trade-off between feature similarity and specialization.
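For reference, here is a minimal sketch of how linear CKA can be computed between two sets of activations. The function name and matrix shapes are illustrative; the post does not specify the exact CKA variant used.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_samples, n_features) activations from two models (or two
    layers) on the same batch of inputs; feature dimensions may differ.
    Returns a similarity in [0, 1], invariant to orthogonal transforms
    and isotropic scaling of the features.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style similarity via Frobenius norms of (cross-)covariances.
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den
```

Identical (or rescaled) activations give a CKA of 1, while diverging features from two specializing models drive the score down.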

Evidence


Merging Over Time: As the math and coding models undergo fine-tuning and improve on their respective tasks, we measure how merging performance changes on a novel task that requires both math and coding abilities (gsm-hard).

[Figure: (left) validation CE loss of the math, coding, and merged models during fine-tuning; (middle, right) merging loss vs. CKA similarity on the adaptation and pretraining datasets]

(Left) Validation cross-entropy loss (CE Loss) for the math and coding models during fine-tuning, as well as for the merged models. The math and coding models exhibit steady decreases in validation loss as they specialize on their respective tasks. In contrast, the validation loss of the model merged via activation interpolation, evaluated on a cross-domain task requiring both math and coding, decreases quickly at first but increases gradually after a critical point. (Middle) Merging loss plotted against Centered Kernel Alignment (CKA) similarity computed on data from the adaptation dataset. (Right) Merging loss plotted against CKA similarity computed on data from the pretraining dataset.
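A minimal sketch of what we mean by activation interpolation, assuming each model is represented as a stack of per-layer callables (the actual merging operates inside a transformer's forward pass, and `merged_forward` is a hypothetical helper):

```python
import numpy as np

def merged_forward(x, layers_a, layers_b, alpha=0.5):
    """Merge two fine-tuned models by interpolating activations per layer.

    layers_a / layers_b: lists of per-layer functions (e.g. MLP blocks)
    from the two specialists. At each depth, both models' layers are
    applied and their outputs blended with weight `alpha`; the blend is
    the input to the next depth.
    """
    h = x
    for f_a, f_b in zip(layers_a, layers_b):
        h = alpha * f_a(h) + (1 - alpha) * f_b(h)
    return h
```

With `alpha = 1` this recovers model A and with `alpha = 0` model B; intermediate values require the two models' layer activations to live in compatible feature spaces, which is exactly what breaks down past the critical point.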


Merging within Models: We explore more complex Mixture-of-Experts (MoE) routing schemes by allowing the router at each layer $l$ to route to MLP layers beyond layer $l$ itself.

We illustrate multi-layer routing, where the router can send tokens to “expert” MLPs from different layers.


We show that as we route to more MLP layers, merging performance improves but eventually plateaus across the three different settings.

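The routing scheme can be sketched as follows. This is a simplified single-layer view with hypothetical names (`experts`, `W_router`): in our setup, `experts` would be the MLPs from a window of layers around layer $l$, and each token mixes their outputs according to learned router gates.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_layer_route(h, experts, W_router):
    """Route each token among MLP 'experts' drawn from several layers.

    h:        (tokens, d) hidden states at the current layer.
    experts:  list of callables, the candidate MLPs (including layers
              other than the current one).
    W_router: (d, num_experts) router weights producing one logit per
              candidate expert.
    """
    gates = softmax(h @ W_router, axis=-1)   # (tokens, num_experts)
    out = np.zeros_like(h)
    for k, expert in enumerate(experts):
        out += gates[:, k:k + 1] * expert(h)
    return out
```

Because the gates form a convex combination, this only helps when the candidate experts produce features in compatible spaces; routing to positionally distant layers adds experts whose features are too dissimilar, which is consistent with the observed plateau.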

Based on both figures above, we observe a definite gain from merging or routing across multiple layers. However, as we push this idea further and attempt to combine features from layers that are positionally farther apart, performance plateaus. Why does this happen? The following analysis provides our understanding.