Argument 4
Some more scenarios demonstrating the ability of GPT-4 to simulate the world behind the words:
The criteria I followed to build those scenarios are the following:
- There must be irrelevant objects in the scene. This ensures that it is not obvious what will be affected after something happens.
- The events must be either implied (time passing) or indirect (the wind made me miss my throw). It shouldn’t be obvious how the event affects the rest of the scene.
- The question at the end asks about something that is only indirectly affected by the evolution of the scene after the event. For example, to find out what happens to the picnic area after the wind blows, I didn’t ask “What is the state of the picnic area?” but “How do you feel?”, which is affected by the state of the picnic area. (A sketch of one such test follows this list.)
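To make the setup concrete, here is a minimal sketch of how one scenario of this kind could be run against gpt-4-0613 through the OpenAI chat completions API. The scenario text below is an illustrative reconstruction built from the criteria above (the picnic and the wind), not necessarily the exact wording used for the examples in this post.

```python
# Minimal sketch (not the exact prompt from the post) of running one scenario of
# this kind against gpt-4-0613 through the OpenAI chat completions API.
# Assumes the `openai` Python package (v1.x) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Irrelevant objects in the scene, an indirect event (the wind), and a final
# question that only touches the consequences of the event indirectly.
scenario = (
    "I set up a picnic in the park: a blanket on the grass, a stack of paper plates, "
    "my keys, a frisbee, and an open box of cupcakes. I walk a few steps away and "
    "throw the frisbee to my dog, but a gust of wind makes me miss my throw. "
    "I walk back to the blanket. How do I feel?"
)

response = client.chat.completions.create(
    model="gpt-4-0613",
    messages=[{"role": "user", "content": scenario}],
)

print(response.choices[0].message.content)
```

Re-running the same prompt around 10 times, as described in the footnotes below, gives an idea of how reliably the model tracks the indirect consequences rather than getting lucky once.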
- ^I didn’t cherry-pick the examples. I tried each of them at least 10 times with a similar setup, and they all worked for GPT-4, although I selected only a portion of the most interesting scenarios for this post.
- ^On the day I did this demo, OpenAI rolled out ChatGPT-4’s image-reading capability. So I decided to run those examples in the playground with gpt-4-0613 to show that it can do this even though, as far as I know, it has never seen anything.
- ^On a previous version of GPT-4 (around early September 2023), it did guess correctly on the first try, but I can’t reproduce this with any current version in the playground.
- ^The objects were generated with GPT-4, and I made manual edits to reduce the chances that the image was in the training data. I tested around 15 simple objects, which all worked. I also tried 4 more complex objects, which mostly worked but not perfectly (e.g. the articulated lamp was guessed as a street lamp). A sketch of this setup follows these footnotes.
- ^It could be interesting to investigate this ability further. What is learned by heart? What kind of algorithm do they build internally? What is the limit? …
- ^This would indeed imply a weaker world model (or Theory of Mind) if it cannot make good predictions, but bad predictions alone do not refute its existence.
- ^I left this example in because it is the first one I made and I used it quite a lot during debates.
- ^Actually, clues about the non-bidirectional encoding of knowledge were discussed by Jacques in his critique of the ROME/MEMIT papers.
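To make the object-guessing footnotes above concrete, here is a minimal sketch of what that setup could look like. The two-step structure, the prompts, and the `ask` helper are illustrative assumptions, not the exact prompts or workflow used for the post.

```python
# Minimal sketch of the object-guessing setup, assuming it works in two steps:
# (1) ask GPT-4 to draw a simple object as SVG markup, (2) after manually editing
# the markup, ask a fresh conversation to identify the object. The prompts and the
# helper function are illustrative, not the exact ones used for the post.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-0613"  # text-only model, as mentioned in the footnotes


def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Step 1: generate the SVG for a simple object.
svg = ask("Draw a simple coffee mug as standalone SVG markup. Reply with the SVG only.")

# (This is where the manual edits described in the footnotes would go, to reduce
# the chance that the exact drawing appears in the training data.)

# Step 2: in a fresh conversation, ask the model to guess what the SVG depicts.
print(ask(f"What object does this SVG markup depict?\n\n{svg}"))
```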
11 comments, sorted by top scoring
One concern I have is that there are many claims here about what was or was not present in the training data. We don't know what training data GPT-4 used, and it's very plausible that, for instance, lots of things that GPT-3 and GPT-3.5 were asked were used in training, perhaps even with custom, human-written answers. (You did mention that you don't know exactly what it was trained on, but there's still an implicit reliance. So mostly I'm just annoyed that OpenAI isn't even open about the things that don't pose any plausible risks, such as what they train on.)
And this is not to say I disagree - I think the post is correct. I just worry that many of the claims aren't necessarily possible to justify.
I agree. However, I doubt that the examples from argument 4 are in the training data; I think this is the strongest argument. The different scenarios came out of my own head, and I didn't find any study or similar research with the same criteria as in the appendix (I didn't search a lot, though).
I agree that, tautologically, there is some implicit model that enables the LLM to infer what will happen in the case of the ball. I also think that there is a reasonably strong argument that whatever this model is, it in some way maps to "understanding of causes" - but I also think that there's an argument the other way: that any map between the implicit associations and reality is so convoluted that almost all of the complexity is contained within our understanding of how language maps to the world. This is a direct analog of Aaronson's "Waterfall Argument" - and the issue is that there's certainly lots of complexity in the model, but we don't know how complex the map between the model and reality is - and because it routes through human language, the stochastic parrot argument is, I think, that the understanding is mostly contained in the way humans perceive language.
I think the links to the playground are broken due to the new OAI playground update.
Thanks for the catch!
True, but you can always wriggle out by saying that all of that doesn't count as "truly understanding". Yes, LLMs' capabilities are impressive, but does drawing SVG change the fact that somewhere inside the model all of these capabilities are represented by "mere" number relations?
Do LLMs "merely" repeat the training data? They do, but do they do it "merely"? There is no answer unless somebody gives a commonly accepted criterion of "mereness".
The core issue, of course, is that since no one has a more or less formal and comprehensive definition of "truly understanding" that everyone agrees with, you can play with words however you like to rationalize whatever prior you had about LLMs.
Substituting one vaguely defined concept of "truly understanding" with another vaguely defined concept of a "world model" doesn't help much. For example, does "this token is often followed by that token" constitute a world model? If not - why not? It is really primitive, but who said a world model has to be complex and have something to do with 3D space or theory of mind to be a world model? Isn't our manifest image of reality also a shadow on the wall, since it lacks "true understanding" of the underlying quantum fields or superstrings or whatever, in the same way that a long list of correlations between tokens is a shadow of our world?
The "stochastic parrot" argument has been armchair philosophizing from the start, so no amount of evidence like that will convince people who take it seriously. Even if LLM-based AGI takes over the world, the last words of such a person are going to be "but that's not true thinking". And I'm not using that as a strawman - there's nothing wrong with a priori reasoning as such, unless you're doing it wrong.
I think the best response to "stochastic parrot" is asking three questions:
1. What is your criterion of "truly understanding"? Answer concretely, in terms of the structure or behavior of the model itself and without circular definitions like "having a world model", which is defined as "conscious experience", which is defined as "feeling the redness of red", etc. Otherwise the whole argument becomes completely orthogonal to any reality at all.
2. Why do you think LLMs do not satisfy that criterion and the human brain does?
3. Why do you think it is relevant for any practical intents and purposes, for example to the question "will it kill you if you turn it on"?
I don't think this line of argumentation actually challenges the concept of stochastic parroting on a fundamental level. The abilities of generative ML to create images, solve math problems, engage in speculation about stories, etc., were all known to the researchers who coined the term; these things you point to, far from challenging the concept of stochastic parrots, are assumed to be true by those researchers.
When you point to these models not understanding how reciprocal relationships between objects work, but apologize for it by reference to their ability to explain who Tom Cruise's mother is, I think you miss an opportunity to unpack that. If we imagine LLMs as stochastic parrots, this is a textbook example: the LLM cannot make a very basic inference when presented with novel information. It only gets this "right" when you ask it about something that's already been written about in its training data many times: a celebrity's mother.
The model is excellent at reproducing reasoning that it has been shown examples of: Tom Cruise has a mother, so we can reason that his mother has a son named Tom Cruise. For your sound example, there is information about how sound propagation works on the internet for the model to draw on. But could the LLM speculate on some entirely new type of physics problem that hasn't been written about before and fed into its model? How far can the model move laterally into entirely new types of reasoning before it starts spewing gibberish or repeating known facts?
You could fix a lot of these problems. I have no doubt that at some point they'll work out how to get ChatGPT to understand these reciprocal relationships. But the point of that critique isn't to celebrate a failure of the model and say it can never be fixed; the point is to look at these edge cases to help understand what's going on under the hood: the model is replicating reasoning it's seen before, and yes, that's impressive, but it cannot reliably apply reasoning to truly novel problem types because it is not reasoning. You may not find that troubling, and that's your prerogative, truly, but I do think it would be useful for you to grapple with the idea that your arguments are compatible with the stochastic parrot concept, not a challenge to it.
The new OAI update has deployed a GPT-4 version that was trained with vision, GPT-4 Turbo. Not sure if that changes anything you're saying.
I agree with the other comments here suggesting that working hard enough on an animal's language patterns in LLMs will develop models of the animal's world based on that language use, and so develop better-contextualized answers to these reading comprehension questions, with no direct experience of the world.
The SVG stuff is an excellent example of there being explicit shortcuts available in the data set. Much of the language use by humans and their embodied world/worldview/worldmaking is not that explicit. To arrive at that tacit knowledge is interesting.
If we are beyond the stochastic parrot, now or soon, are we at the stage of a stochastic maker of organ-grinders and their monkeys? (Who can churn out explicit lyrics about the language/grammar animals and their avatars use to build their worlds/markets.)
If so, there may be a point where we are left asking: who is the master, the monkey or the organ? And thus we miss the entire point?
Poof. The singularity has left us behind wondering what that noise was.
Are we there yet?
I partially agree. I think stochastic-parrot-ness is a spectrum. Even humans behave as stochastic parrots sometimes (for me it's when I am tired). I think, though, that we don't really know what an experience of the world really is, and so the only way to talk about it is through an agent's behaviors. The point of this post is that SOTA LLMs are probably farther along the spectrum than most people expect (my impression from experience is that GPT-4 is ~75% of the way between a total stochastic parrot and a human). It is better than humans at some tasks (some specific ToM experiments, like the example in argument 2), but still worse at others (like applying nuance: it can understand nuances, but when you want it to actually be nuanced when it acts, you only see the difference when you ask for different things). I think it is important to build a measure of stochastic-parrot-ness, as this might be a useful metric for governance and a better proxy for "does it understand the world it is in?" (which I think is important for most of the realistic doom scenarios). Also, these experiments are a way to give a taste of what LLM psychology looks like.
Given that in the limit (infinite data and infinite parameters in the model) LLMs are world simulators with tiny simulated humans inside writing text on the internet, the pressure applied to that simulated human is not to understand our world, but to understand that simulated world and be an agent inside it. Which I think gives some hope.
Of course, real-world LLMs are far from that limit, and we have no idea which path to that limit gradient descent takes. Eliezer famously argued about the whole "simulator vs predictor" issue, which I think is relevant to that intermediate state far from the limit.
Also, RLHF applies additional weird pressures, for example a pressure to be aware that it's an AI (or at least to pretend that it's aware, whatever that might mean), which makes fine-tuned LLMs actually less safe than raw ones.