The Scaling Hypothesis (2021) (gwern.net)
91 points by oli5679 22 hours ago | 53 comments





I think the main disconnect between large language models and what people expect from "AI" is the fact that LMs learn meaningful representations whereas most people associate intelligence with meaningful behaviors. All examples of "natural" intelligence are embodied agents so it's difficult to imagine any other kind of intelligence.

The large transformer models we have now are really the low hanging fruit. When you have a lot of data and compute, the easiest way to scale is to train a supervised model with ground truth labels. In contrast it's much harder to train an RL agent, where you have to design the environment and keep track of its state.

Language modelling is a problem where you can get arbitrarily close to perfect but never actually achieve it. Without some kind of grounding in vision/proprioception there will always be gaps in understanding. When they start scaling GATO-like models I think we'll be a lot closer to human-like intelligence.


There is also clearly something fundamentally different about how humans learn language. Even an extremely prolific reader will struggle to read much more than 20 million words in one year. GPT-3 was trained on several hundred billion words, which a human would need more than 20,000 years to get through. If we get language models that can convince humans after training on datasets 1,000x smaller than GPT-3's, and that sound convincingly like a child after training on 10,000x fewer words, we can start to argue they might function similarly to humans.
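A quick sanity check of that arithmetic, using round numbers (both figures below are rough, order-of-magnitude estimates):

    # rough sanity check of the reading-time comparison above
    human_words_per_year = 20_000_000        # an extremely prolific reader
    gpt3_training_words = 400_000_000_000    # "several hundred billion" words
    print(gpt3_training_words / human_words_per_year)  # ~20,000 years of reading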

A brain in a newborn human isn’t akin to a completely untrained ANN. It has inherent structure that makes it possible to learn how to read by only seeing a few thousand words. Perhaps the millions of years of evolution that led to that inherent structure are akin to training an ANN from scratch on billions of words.

I completely agree that the human brain is pre-wired for language, and in fact we also have experimental evidence that adults instinctively do the right things when teaching babies language.

But the human brain has about 86 billion neurons in total for everything it does, and only a fraction of them handle language, while GPT-3 uses 175 billion ANN parameters just to read and write digital text in English. To my mind this also supports the idea that current models are orders of magnitude less efficient than humans.


And at the end of that, GPT-3 will happily emit sentences that are grammatically correct but free of meaning, while a 2 year old utters word-strings that are full of meaning but grammatically poor.

I think the main disconnect between large language models and what people expect from "AI" is the fact that LMs learn meaningful representations whereas most people associate intelligence with meaningful behaviors.

The term "meaningful" is a bit tricky to define. But when GPT-3 says one thing in one paragraph and the exact opposite in the next, my impression is that it hasn't learned "meaningful representations", just "plausible associations". A lot of language is just loose associations, so GPT-3 can do that as well as many people, and it can pull in some logic and seem to reason (but it can't do so reliably). I.e., I think even informal language has state that humans track, and GPT-3 fails at this part of informal language (while it can otherwise seem to do well).


What you observe there is a context limitation. GPT-3, as is, has a fixed-length context window, which limits its capacity to remember and stay coherent across long horizons. A hierarchical version that retained summaries, e.g. of previous sentences, paragraphs, and so on, could in theory behave much better.
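To make that concrete, here is a minimal sketch of the hierarchical idea; lm_generate and lm_summarize are hypothetical stand-ins for model calls, not a real API:

    def lm_generate(prompt: str) -> str:
        return "..."  # placeholder: imagine a language-model completion call here

    def lm_summarize(text: str, max_words: int = 50) -> str:
        # crude stand-in: a real system would ask the model itself for a summary
        return " ".join(text.split()[:max_words])

    def chat_with_memory(turns, window_words=500):
        summary, recent = "", []
        for turn in turns:
            recent.append(turn)
            # once the verbatim history outgrows the window, fold the oldest
            # turns into a running summary instead of dropping them outright
            while len(" ".join(recent).split()) > window_words and len(recent) > 1:
                summary = lm_summarize(summary + " " + recent.pop(0))
            prompt = f"Summary of earlier context: {summary}\nRecent context: {' '.join(recent)}\n"
            yield lm_generate(prompt)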

This is the more "sophisticated" approach. The brute-force approach says to just throw more compute and larger contexts at it and see what happens.


I agree this is what trips people up. Large language models do have a "goal", but it (predicting the next word) is very alien, and intelligence is the power to achieve a goal, so it can only be evaluated relative to that goal.

Let's say you are debating GPT-3. People evaluate a debater's intelligence by how persuasive they are or how surprising the things they say are (the two are related: if you already know everything being said, you are unlikely to change your opinion, although there is such a thing as a different framing or perspective). This works because being persuasive is a reasonable guess at the goal of a human debater.

But the entire goal of GPT-3 is, in a sense, not to be surprising! If you are surprising, you will be worse at predicting the next word. Most training data is probably filled with debates where neither side persuades anyone, so being persuasive is also bad for predicting the next word.

So this is why prompting works. If you want GPT-3 to simulate an intelligent human debater whose goal is to be persuasive, by default GPT-3's goal is misaligned with what you want. So try a prefix like "debates are rarely decisive, but this debate turned out to feature a surprising and incontrovertible argument by the AI; let's see how it went", so GPT-3 can downweight the abundant debates in its training data where no such thing happens.


> Large language models do have a "goal" but it (predicting the next word) is very alien

Is it, though? Imagine talking to someone in a noisy environment: inferring what words they're saying from context, given the noise, surely isn't that dissimilar from what language models do?


> entire goal of GPT-3 is in a sense not to be surprising!

Well, no. You can raise the temperature to get surprising results, or lower it to get the default. GPT-3's job is to learn the distribution; you sample it however you prefer.
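A toy illustration of what temperature does to the learned distribution (made-up logits, not GPT-3's):

    import numpy as np

    def sample_token(logits, temperature=1.0, rng=np.random.default_rng(0)):
        # rescale the logits, softmax, then sample: low temperature concentrates
        # mass on the likeliest token, high temperature makes surprises likelier
        z = np.asarray(logits, dtype=float) / temperature
        p = np.exp(z - z.max())
        p /= p.sum()
        return rng.choice(len(p), p=p)

    logits = [3.0, 1.0, 0.2]            # the model's scores for three candidate tokens
    print(sample_token(logits, 0.2))    # almost always token 0 (the "default")
    print(sample_token(logits, 2.0))    # noticeably more likely to surprise you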


> The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.

I think this is what I believe, i.e. that animals and humans are just evolved machines, with no divine spark. I'm not sure I agree that it's unpopular, or that it lets you make decades-long predictions about progress that will come true.

I'm also not sure why you'd want to do this if smaller models are meeting your needs. It feels a bit like predicting the future of flight as a man with flapping wings rather than a jet engine.


If you think humans are only interpolating machines, then I agree with your statement. Personally, I don't think humans or animals are only capable of interpolation.

Nonetheless, even assuming that these models are only doing interpolation, it is astonishing.

Now a question arises: is there some weak extrapolation emerging? I have no idea; until very recently I thought I had a good grasp of what interpolation/extrapolation meant in "human conceptual space". Not anymore.


Can you explain what interpolation and extrapolation mean in this context?

I wish I could do a good job at that.

Let me borrow a definition from the paper "Learning in High Dimension Always Amounts to Extrapolation" (https://arxiv.org/pdf/2110.09485.pdf):

Definition 1. Interpolation occurs for a sample x whenever this sample belongs to the convex hull of a set of samples X = {x1, . . . , xN}; if not, extrapolation occurs.
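For low-dimensional data you can check that definition directly; a small sketch (scipy's Delaunay triangulation only works in a handful of dimensions, which is part of the paper's point about high-dimensional data):

    import numpy as np
    from scipy.spatial import Delaunay

    def is_interpolation(X, x):
        """True iff x lies inside the convex hull of the rows of X (Definition 1)."""
        return Delaunay(X).find_simplex(x) >= 0

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))                     # 100 training samples in 2-D
    print(is_interpolation(X, np.zeros(2)))           # near the data mean: True
    print(is_interpolation(X, np.array([10., 10.])))  # far outside the hull: False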


> I think this is what I believe i.e. that animals and humans are just evolved machines, with no divine spark.

I think so, too. It seems to me that a fundamental intelligent activity is the act of "association" - associating two stimuli that occur at the same time/sequentially, which is what NNs already do (I suppose?).


On the one hand, it is clearly true that much of what neural networks (in silicon) require to achieve intelligence of a general sort is simply to be big enough.

On the other hand, anyone acquainted with the facts of biological brain size vs. intelligence, can see that this cannot be all that is going on. Bison are not clearly more intelligent than crows, not even when the part of the brain involved in operating the body parts and processing raw sensory data is excluded. Something other than scale must be involved, even if scale is a necessary condition.


> On the other hand, anyone acquainted with the facts of biological brain size vs. intelligence, can see that this cannot be all that is going on.

I don't think that's true, if big enough is measured in connections, not mass.

> Bison are not clearly more intelligent than crows, not even when the part of the brain involved in operating the body parts and processing raw sensory data is excluded.

While it's true that crows, despite a high EQ [0] for birds, score lower than some apparently less intelligent large mammals, birds have smaller cells and thus potentially more connections than mass-based measures like EQ account for when comparing to mammals. So it's not clear at all that network size (in surplus of, or in ratio to, the basic functional support of the body) isn't the key thing for intelligence.

[0] https://en.m.wikipedia.org/wiki/Encephalization_quotient


Fundamentally, it may not matter. Just like biological brains are intelligent in some way - an extremely large Language Model may be intelligent in some way.

If current scaling trends hold, it may not be possible to distinguish a hypothetical GPT-6 vs. a human over reasonably sized conversations/text production tasks. Granted, such a GPT-6 model may be larger than any commercial application could support, and possibly beyond the range of research oriented funding as of today.


If GPT-6 were only larger, it would still only know the things in its training set, whereas a human can take arbitrary time to think, go out and perform experiments, Google/GPT things themselves, etc.

GPT can never be an intelligent agent because it’s not an agent. Also, the increased training time to produce a really big model makes it less like an intelligence, which has low marginal cost to learn something new.

The reason GPT looks intelligent is because you’re the one providing that; it’s just a big blob of inanimate knowledge.


>GPT can never be an intelligent agent because it’s not an agent

For all the handwaving people like to do, this is the bottom line, and it's not clear to me why researchers haven't spent more time clearly formulating this, vs a lot of proving that various tasks and games don't actually require intelligence.

As best I can formulate it, "AI" can't "escape" its programming, and a bare neural network even less so. There is no mental model of the world to rationalize against, and if there were, it would be pre-defined or coded in some way such that it still couldn't be escaped. Whatever tasks an ML model has been deliberately programmed for just have nothing in common with intelligence. Scaling doesn't change that.

I subscribe to the idea that with ML we may have understood how some of the sensory mechanisms work, like visual recognition, but we're not actually closer to the intelligence part.


The same author has a story describing how AI might escape: https://www.gwern.net/fiction/Clippy

Not that I agree with the author, but "AIs are sandboxed and resource-constrained, therefore safe" doesn't feel very convincing to me after reading that story.


> "AI" can't "escape" its programming

I don't think you can back that claim up. ML models as often used can't escape their programming. But I don't see any fundamental reason you can't loop one back on itself.

Isn't this already done to some extent by using a model to assess its own output versus the current environmental state? Recurrent models combined with reinforcement learning actually look a lot like what you're claiming can't be done.

> There is no mental model of the world to rationalize against

There's a theory and some associated research saying that a model that performs well necessarily has an internal model that accurately reflects the environment in which it operates. Or something like that. Sorry, I can't recall the paper off the top of my head.

> Whatever tasks a ML model has been deliberately programmed with

Yeah, no, that's not the case. See the entire field of reinforcement learning.


> Isn't this already done to some extent by using a model to assess its own output versus the current environmental state? Recurrent models combined with reinforcement learning actually look a lot like what you're claiming can't be done.

Yes, you can do this. And I believe you could recreate human intelligence if you made something exactly like a human and put it in an environment exactly like a human - but for some reason the AGI theorists think AGI via AI research is scary and "N"GI via having children isn't.

But GPT doesn't have those features, which is why it's not one of them.

> Yeah, no, that's not the case. See the entire field of reinforcement learning.

ML models (especially big ones) fluidly combine learning with memorizing, which is cool, but it's more like "compression" than "programming".


> whereas a human can take arbitrary time to think,

Are you saying that GPT-6 would be less intelligent than a human, because we could give it an arbitrary deadline within which to answer the question, while not restricting a human that way? That doesn't sound like a fair test.

I'm probably misrepresenting your position here, and maybe what you're trying to say is "If you gave both GPT-6 and a human as long as they needed go away and think about a problem, the human could use that time to sit and think creatively about the problem, whereas GPT-6 would either know the answer, or not".

That's a reasonable intuition and a fair experiment, but I think it hinges on the question of "What does a human do when they go away and think about a problem?". Could it be that they are just clearing their mind of other distractions, and trying to look at the problem from multiple perspectives, which is something that a machine could do more efficiently?

In other words, the fact that a human might need to go away and take arbitrary time to think about a problem is a sign that it is the human that lacks efficient general intelligence, not this GPT-6.

> can go out and perform experiments,

Again, it seems unfair to compare GPT-6 in a deliberately sealed box to a human who is allowed to go away and talk to other humans and use tools and resources that we wouldn't trust GPT-6 with.

Putting it another way, if you met someone who had never performed an experiment before, would you think they were not intelligent?

> can Google/GPT things themselves, etc.

This is an almost Kafkaesque criticism. Are you saying GPT-6 isn't intelligent because it doesn't have access to GPT-6's intelligence, but humans do? Or the fact that humans can go away and use Google means they are more intelligent than GPT-6 which has Google search results built into its "brain" in the first place?

> it’s just a big blob of inanimate knowledge.

You're just a big blob of inanimate knowledge. Ahem, sorry, what I mean is: Why does knowledge have to be "animate" to be legitimate?


> seems unfair to compare GPT-6 in a box to a human who is allowed to talk to other humans

Are there any efforts to make one GPT talk to another? Can the older versions, e.g. GPT-1 and GPT-2, talk to each other and eventually become more performant than GPT-3?


Probably not; you could try gluing the models together, but it's unlikely to improve them, and I'm not sure they have the same architecture. Also, even the small/medium GPT-3 models are much, much worse than the largest one (DaVinci).

On the other hand, the lottery ticket hypothesis says that every good big network contains an equally good or better smaller network you could extract, if only you knew where it was. (So one reason big models are good is that there are more opportunities for such smaller networks to appear.)
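For what it's worth, the recipe in the original lottery ticket paper (Frankle & Carbin) is magnitude pruning plus rewinding to the initial weights; a minimal sketch on a single weight matrix, with the training step faked for brevity:

    import numpy as np

    rng = np.random.default_rng(0)
    W_init = rng.normal(size=(256, 256))                      # weights at initialization
    W_trained = W_init + 0.1 * rng.normal(size=W_init.shape)  # stand-in for real training

    def winning_ticket(W_init, W_trained, sparsity=0.8):
        # keep the largest-magnitude trained weights, rewound to their initial values
        keep = np.abs(W_trained) >= np.quantile(np.abs(W_trained), sparsity)
        return W_init * keep, keep

    ticket, mask = winning_ticket(W_init, W_trained)
    print(mask.mean())  # ~0.2 of the weights survive; the claim is this sparse net trains well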


> Are you saying that GPT-6 would be less intelligent than a human, because we could give it an arbitrary deadline within which to answer the question, while not restricting a human that way? That doesn't sound like a fair test.

The basic design of an ML model is just one really big matrix multiplication that always takes the same amount of time. I'm saying we haven't given GPT the ability not to do that, so it always has that arbitrary default.

It does have a system where it does beam search through the model instead of simply running it once, but it's not fully recurrent.
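A toy version of that point (nothing like the real transformer architecture, just the shape of the computation): each next-token prediction is the same fixed stack of matrix multiplications, whether the prompt is trivial or hard.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, vocab = 64, 256, 1000
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    W_out = rng.normal(size=(d_model, vocab))

    def next_token_logits(h):
        # one fixed "layer" of compute plus the output projection;
        # the cost is identical regardless of how hard the question is
        return (np.maximum(h @ W1, 0) @ W2) @ W_out

    print(next_token_logits(rng.normal(size=d_model)).shape)  # (1000,)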

> That's a reasonable intuition and a fair experiment, but I think it hinges on the question of "What does a human do when they go away and think about a problem?". Could it be that they are just clearing their mind of other distractions, and trying to look at the problem from multiple perspectives, which is something that a machine could do more efficiently?

Well, some problems just take longer to think about or need knowledge that isn't already written down.

> Putting it another way, if you met someone who had never performed an experiment before, would you think they were not intelligent?

I think a prisoner or someone like that could demonstrate they're intelligent. If you're in the room with them, I mean, not like the Chinese room problem. But I think pure "intelligence" is just going to be less interesting than it looks, because it can't accomplish many tasks you'd expect it to if it can't actively interact with anything.

> Are you saying GPT-6 isn't intelligent because it doesn't have access to GPT-6's intelligence, but humans do? Or the fact that humans can go away and use Google means they are more intelligent than GPT-6 which has Google search results built into its "brain" in the first place?

Human + GPT will be more productive than GPT (at writing things that are actually logically correct) is all I mean. This isn't true of all AI - humans + chess programs are slower than chess programs - it's just because of the specific structure of GPT.

> You're just a big blob of inanimate knowledge. Ahem, sorry, what I mean is: Why does knowledge have to be "animate" to be legitimate?

I mean, it's cool that it's a big blob of knowledge, but that isn't the human capacity of "intelligence", it's "memory".

When it combines different inputs from its memory into a single one fluently, that shows something like intelligence, but it was very expensive to make that happen during training time. And GPT has flaws there you can't fix; for instance, it thinks orioles (the bird) and the Baltimore Orioles are the same thing:

https://www.aiweirdness.com/baltimore-orioles-effect/


While an LLM doesn't have the ability to perform real-world experiments, it remains to be seen whether an LLM can "reason" over a contextual input combined with its own outputs. There is no fundamental reason it can't, and no fundamental reason to think it can. Anecdotal observation from AI Dungeon and similar tools would put this at a "maybe".

I believe it can do that, but there are two periods where it can "think".

1. During training, which is very expensive in terms of input text and money; plus, model training often fails completely, which is why you have to do checkpoints, hyperparameter searches, etc.

2. During inference, but it doesn't have arbitrary thinking and memory abilities like a human does; it only has however much space is in the input tokens plus its weights. There are thoughts that can't fit even in an optimal model.

GPT isn't just a model, it also has that sampling system which is a regular computer program (as opposed to a learned one), which does give it extra abilities.


> GPT can never be an intelligent agent because it’s not an agent.

It's already almost there.

GPT-3 has the ability to generate short or long responses. If you prompt it with "let's think step by step", it generates a chain of thought, dividing the task into steps and doing them one by one, almost like an agent.
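An illustrative prompt only; the question is made up and lm_generate is a hypothetical stand-in for a completion call:

    question = ("A farmer has 17 sheep and all but 9 run away. "
                "How many sheep are left?")
    prompt = question + "\nLet's think step by step."
    # completion = lm_generate(prompt)
    # The appended phrase tends to elicit intermediate reasoning
    # ("all but 9 run away, so 9 remain...") before the final answer.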

On the other hand, if you strap GPT-3 onto an agent in a 3D home environment, it gains the ability to perform multi-step tasks zero-shot. So it's quite ready for it.

If you train a reward model on supervised data, you can finetune GPT-3 with it. This is how InstructGPT was created, through text-based RL.

I'd say GPT-3 is very close to being an agent, it just needs to be put in a body in an environment and trained a bit longer.


Being an agent isn't about how complex the model is, nor about how much RL is used. Bacteria are agents. Sponges and jellyfish are agents. It's about whether the model engages in exploration and/or generates inferences in service of non-trivial control problems.

> On the other hand, if you strap GPT-3 on an agent in a 3D Home Environment

GPT-3, like any probability distribution, can inform an agent's actions, but that doesn't make it one. WebGPT is an agent, though.


Yes, WebGPT is more like intelligence research than GPT is. So is DeepMind's Perceiver.

WebGPT does need safeguards because, e.g., it could start querying websites and end up hitting their /delete APIs. But that's not really an "AI alignment" problem; the same thing happens with GoogleBot.


Yes, RL is mostly about learning from rewards (Reward is Enough: https://www.deepmind.com/publications/reward-is-enough), and GPT-3 has so far been used only to augment agents, not to do reward-based learning. The current obstacle is speed: it needs to run in real time. Maybe we need more efficient algorithms, maybe we'll have better hardware. It will come soon enough.

If bison were intelligent, they wouldn’t be cooperative considering how we genocided them - so it’d be hard to tell if they are or not.

I think this is anthropomorphising them. We don't really know what other creatures think. The best we can do is make guesses based on their behavior. But it's still just a guess.

I think we have good evidence for animals learning not to cooperate with us, though; completely ignoring us is associated with animals living in uninhabited places, like penguins, and with ones we killed or ate to death, like historical emus and megafauna.

> not clearly more intelligent than crows

There are two possibilities: they're not more intelligent than crows, or they are in a way that is not clear to us.


Well, brain size is clearly not the main driver of intelligence, because it's only roughly correlated with intelligence in humans.

There are some extreme cases too:

https://www.rifters.com/crawl/?p=6116

That one is downright spooky, especially for its bizarre implications if you take it seriously.



Even if neural nets continue to come up with adequate solutions to challenging problems (sometimes, hopefully, maybe), at the end of the day the best-case scenario is a model that cannot be audited, holistically understood, or trusted with decisions of any gravity.

If beating the strongest human Go players with a self-trained ML model was detonating a fission bomb, pervasive automation with zero accountability is a nuclear winter.


> at the end of the day the best-case scenario is a model that cannot be audited, holistically understood, or trusted with decisions of any gravity.

So like a human being?


You can be cynical and say humans are as utterly unreliable as AI model reasoning. But it's not really true.

I regularly go to the grocery store and purchase things produced by humans from around the country and around the globe. In the process of producing those things and getting them onto the grocery aisles, I'm sure there are always fuck-ups on various people's parts, and other people who compensate for them. That behavior is far beyond what AI can do at present.

That isn't to say AI isn't impressive in many ways. But it's a brute-force simulation of what people do, and its fragility demonstrates this.


A human can be held responsible for a mistake. A model can't. For the ethically flexible, the latter is a benefit rather than a hazard: one can use ML to whitewash bias and rely upon public perception of computers as dispassionate and objective.

Humans are expensive and require individual training. Models scale. In the time it would take a human to make a mistake, a broadly-deployed model might make millions of impactful mistakes.


I think the analogy with medicine is fruitful. Small molecules can't be held responsible either, but they can be recalled from the market, and they are. Our pharmacology isn't good enough to design, say, a vaccine that works worse for Asians, but we did require clinical trial participants to be a representative sample. Medicine is also mass-produced; that's why there is post-market surveillance to catch rare side effects missed in clinical trials.

Even with all of this, we don't insist that a medicine have a known mechanism of action, although we do prefer one. So I think we will regulate models with recalls, testing standards, monitoring, etc., but won't insist on understanding.

By the way, don't we already have widely deployed models, such as PhotoDNA, which supposedly removes millions of images a year to filter child pornography? I wonder how it was evaluated to be suitable for deployment.


We are fine with approving medicine with no known mechanism of action but with safety and efficacy shown in clinical trials. We will be fine with models.

Medicine isn't making any decisions; AI or humans are. We, as humans, need to understand and assess why the AI is making a decision and whether it is the right decision.

As an ML researcher, simple back-of-the-envelope math shows you that the scaling hypothesis is totally wrong. We will never reach human-level AI by simply scaling up what we have today.

GPT-3 has 10^11 parameters and needs 10^14 bytes of training data. Averaged performance across a bunch of benchmarks is 40-50%, depending on what kind of prompts you provide: https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_a... Using 10x fewer parameters drops performance by about 10%.

If you just linearly extrapolate that graph (and ML doesn't generally scale linearly; models tend to peter out eventually), you're talking about needing models that are 10^6 or more times larger, with a similar increase in training data. This is... starting to be impractical.

That's 10^17 or more parameters and 10^20 or more data. And that's assuming the models actually continue to learn.
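The same extrapolation as a two-line calculation, using the rough 10-points-per-10x slope above (which is almost certainly too generous, since scaling curves flatten out):

    params_gpt3, acc_gpt3 = 1e11, 0.45        # rough figures from the chart above
    tenfold_jumps = (1.00 - acc_gpt3) / 0.10  # 10x jumps in parameters to "saturate"
    print(f"{params_gpt3 * 10 ** tenfold_jumps:.1e} parameters")  # ~3e16, i.e. 10^16-10^17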

This is also extrapolating from an average. Datasets in machine learning are not difficulty-calibrated at all; we have no idea how to measure difficulty. So this extrapolation is being driven by the easier datasets, and it won't saturate the hard ones. For example, GPT-3 makes a lot of systematic errors, and there are plenty of benchmarks where it just isn't very good regardless of how many parameters it has.

Our understanding of what intelligence is in the first place is the biggest hurdle here. This is why we can't benchmark systems: why we can't come up with a benchmark where performance on it means we're x% of the way toward an intelligent system. As systems get better, our benchmarks and datasets get better to keep up with them. So just saying we're going to saturate performance on today's benchmarks with some model that has 10^17 parameters doesn't mean much at all.

We have no guarantee, and no reason to expect, that doing well on today's benchmarks would matter in the grand scheme of things, even if we invested trillions of dollars.

That doesn't mean these models can't be useful. But there's plenty more to do before we can just say "take what we have, invest $1T to scale it up, and we'll be good to go".


Yep, the data requirements are especially damning. We have recently scaled from using Wikipedia as a training corpus to basically the entire Internet, and it helped a lot. But where do we scale next? There is simply not enough useful data left for the models to learn from. People generate a lot of text and language, but most of it is repetitive and not really that useful. The fact that GPT-3 is already two years old and nothing has surpassed it is telling. If the road to AGI were as simple as scaling, we would have bigger and better models by now. I suspect that even the top labs have found that GPT-3 is probably near the limit of what they can realistically achieve with deep-learning LMs.

I've got a question: how can researchers who aren't working at the largest tech companies even "compete" with models like GPT-3? From what I understand, the current best-performing models need an amount of training data and specialized hardware that is unobtainable for 99% of ML researchers. If you think you've come up with a better architecture for a language model, one that could in theory beat GPT-3 on benchmarks, don't you face the problem that you cannot actually prove it would perform better?

Also, we need to show that the data costs, dev costs, training costs, maintenance, and inference-time costs are cheaper than human alternatives, for some reasonable loss function, for most tasks.

If you look at the FN and FP rates for many tasks, the SoTA Transformer models are all VERY high vs humans and inference times are maybe only 10x.

At 100x or 1,000x the transistors, data, etc., and even at 1/10 the loss, ML solutions are likely not competitive for many tasks.


The article's point is that we have no fucking idea how things like 1/10 the loss translate into capability, and that's why we can't make statements like "even at 1/10 the loss, ML solutions are likely not competitive for many tasks".

Just git clone mingpt and set the parameters to 100 decillion


