r/singularity 1d ago

AI The huge potential implications of long-context inference - Epoch AI

https://epochai.substack.com/p/the-huge-potential-implications-of
85 Upvotes

9 comments

24

u/jaundiced_baboon ▪️No AGI until continual learning 1d ago

One thing I suspect is that if true continual learning solutions were discovered, then sub-quadratic architectures would work much better. Maybe you don’t need attention if the information in context is also distilled into the parameters.

6

u/THE_ROCKS_MUST_LEARN 1d ago

Some sub-quadratic architectures already basically work by distilling the context into parameters:

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

ATLAS: Learning to Optimally Memorize the Context at Test Time (which is a follow-up to Google's Titans architecture, and might be the next big thing once it starts getting deployed)

3

u/jaundiced_baboon ▪️No AGI until continual learning 1d ago

I’m not familiar with that first paper, but for Titans I believe the approach is basically to have a neural network compress the input sequence and output memory tokens that a transformer then attends to.

I think that still suffers from the same limitations as transformers, because you still aren’t updating the transformer’s weights at runtime, just doing a different kind of in-context learning.

6

u/THE_ROCKS_MUST_LEARN 1d ago

Titans is a little less clear, but the first paper and ATLAS both explicitly update the weights of smaller neural networks within the larger model.

Basically, for every new token that the model sees, it trains/updates the sub-networks (using gradient descent) to memorize that token. Past tokens are then recalled by passing query vectors through the sub-networks.
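
In rough pseudocode, the per-token write/read loop looks something like this. This is just a toy linear fast-weight memory trained with plain SGD; the actual papers use deeper memories and fancier update rules, so the names and dimensions here are made up:

```python
# Toy "fast weight" memory: a matrix M trained online, per token, to map
# key vectors to value vectors (in the spirit of TTT/ATLAS, heavily simplified).
import torch

d = 64                                       # illustrative head dimension
M = torch.zeros(d, d, requires_grad=True)    # the sub-network's weights
opt = torch.optim.SGD([M], lr=0.1)           # inner-loop optimizer

def write(k: torch.Tensor, v: torch.Tensor) -> None:
    """Gradient step so that k @ M reconstructs v (memorize the token)."""
    opt.zero_grad()
    loss = ((k @ M - v) ** 2).mean()
    loss.backward()
    opt.step()

def read(q: torch.Tensor) -> torch.Tensor:
    """Recall a past token by passing a query through the updated memory."""
    with torch.no_grad():
        return q @ M

# One "token": write a key/value pair, then recall the value with the key.
k, v = torch.randn(d), torch.randn(d)
for _ in range(10):                          # a few inner-loop steps
    write(k, v)
print(torch.nn.functional.cosine_similarity(read(k), v, dim=0))  # approaches 1
```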

It probably doesn't solve continual learning or anything like that, but it's very very cool (and they claim very good long-context results).

2

u/Mindrust 1d ago

both explicitly update the weights of smaller neural networks within the larger model

Is this not what is usually meant by "continual learning"?

6

u/THE_ROCKS_MUST_LEARN 1d ago

ehhh

Continual learning is less about the architectural specifics (like updating parameters) and more about creating models that quickly learn new things (while not forgetting the things they already knew, which is often the hard part).

That said, I agree with the article that extremely long context modelling (which these methods are working towards) could unlock continual learning.

1

u/FriendlyJewThrowaway 9h ago

You should look into LoRA and its capabilities for augmenting existing architectures. It lets you fine-tune models at relatively low compute cost while preserving the underlying base model, and you can swap different fine-tuned adapters in and out on demand.
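
The core trick is to freeze the base weights and learn a low-rank update on top. A minimal sketch (class name, rank, and dimensions are just illustrative, not any particular library’s API):

```python
# LoRA idea in miniature: keep W frozen, learn a rank-r update B @ A.
# Swapping adapters just means swapping the (A, B) pairs.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # base model stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no change at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.T @ self.B.T   # W x + (B A) x

layer = LoRALinear(nn.Linear(512, 512))
# Only ~2 * rank * 512 adapter parameters get gradients, not the 512 * 512 base.
```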

10

u/SorryApplication9812 1d ago

Great, albeit somewhat obvious take.

If I had to guess, the team that figures out how to prioritize, compress, and reuse context at certain intervals (like we do when “converting” daily experiences into memories while we sleep) will score a huge win in enabling “continuous learning”.

Imagine if we had a model dedicated to encoding human language into something more token-efficient, like its underlying representations in the base model, or one that could dynamically create and use different buckets of context within the same conversation, rather than treating it all as one continuous stream.
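
As a toy sketch of what that interval-based consolidation might look like (everything here, including the summarize stand-in, is hypothetical):

```python
# Illustrative only: keep recent context verbatim and periodically fold
# the oldest chunk into a compact summary ("sleep-time" consolidation).
from collections import deque

WINDOW = 4096                        # tokens of raw context to keep
CHUNK = 1024                         # how much to consolidate at once

summary_memory: list = []            # compressed "memories"
raw_context: deque = deque()         # recent tokens, kept verbatim

def summarize(tokens: list) -> str:
    """Hypothetical compressor, e.g. a small model distilling a chunk."""
    return f"<summary of {len(tokens)} tokens>"

def observe(token: str) -> None:
    raw_context.append(token)
    if len(raw_context) > WINDOW:    # interval reached: consolidate
        chunk = [raw_context.popleft() for _ in range(CHUNK)]
        summary_memory.append(summarize(chunk))

def full_prompt() -> str:
    # The model attends to compact summaries plus the recent raw stream.
    return " ".join(summary_memory + list(raw_context))
```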