TLDRUpFront: Current AI LLMs may be able to write a 3k-word scientific paper, but not a 100k-word scientific book or a 220k-word gaming rulebook. Exploring why exposes how LLM memory, token sizes, hallucination, and hard and soft limits create limits-to-growth on both the size and accuracy of a given LLM project.
FullContextInTheBack
Caveats: #NotAnAIExpert #CaveatLector #SourcesWelcome
See the Discussion below. I welcome anyone with more information to improve my thinking on token/memory interactions in the comments.
First things first. Tokens measure the size of the ‘memory’ of the AI LLM session you’re interacting with. Memory matters if you want the AI LLM to recognize “past work” done on a specific effort. So the first time you ask the AI to write a 3k-word paper, it’s only going to use the LLM algorithm to make its best guess at the answer. But if you want to then take *that* draft and refine it, you’re now getting into memory and tokens.
If your effort goes above the memory/token limit, the AI begins “forgetting” the past versions and in effect goes back to the beginning, giving you its best guess at the prompt alone. Note this won’t be the same as the best guess the first time you asked, either, since every time you ask an AI LLM you may get different answers.
Although the unit of measure is not perfect, think of a token as roughly four characters of English text, or about three-quarters of a word. This includes both the output material AND the prompts. So if you write a ~300-word prompt and ask for a ~300-word output you’re talking roughly 800 tokens. (Again, this is simplified.)
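If you want a less hand-wavy count, tokenizer libraries will tell you exactly how many tokens a piece of text consumes. Here is a minimal sketch using the tiktoken library; the encoding name and the placeholder text are illustrative assumptions, not a prescription:

```python
# Count tokens for a prompt plus a draft with tiktoken (pip install tiktoken).
# The encoding name is an assumption; pick the one that matches your model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Revise the attached draft for clarity and keep it around 3,000 words."
draft = "placeholder draft text " * 1000  # stand-in for your actual draft

prompt_tokens = len(enc.encode(prompt))
draft_tokens = len(enc.encode(draft))
print(f"prompt: {prompt_tokens} tokens, draft: {draft_tokens} tokens")
print(f"total that must fit inside the context window: {prompt_tokens + draft_tokens}")
```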
Related to writing a paper – depending on the token limit, you’re going to run into a memory problem pretty quickly, assuming you A) have to do more than one version and B) want it to remember the prior versions.
“But what if you use a supercomputer?”
As far as I know, there are two problems at hand with the memory/context issue. One is a hard limit and the other is a softer concurrency limit. The hard limit is the token size the LLM was developed for. ChatGPT4.5 32k has a 32k-token context size, which means about 16k words (depending on prompt sizes) can exist in the “memory,” so you could get maybe 3-5 versions of the OP’s 3k-word paper before it would begin forgetting. That’s enough for a single-shot 3k-word scientific paper.
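To make that estimate concrete, here is a back-of-the-envelope sketch. The tokens-per-word ratio and prompt overhead are assumptions chosen to match the rough figures above, not measured values:

```python
# Rough arithmetic: how many full drafts of a 3k-word paper fit in a 32k window?
CONTEXT_TOKENS = 32_000     # the hard limit of the 32k model discussed above
TOKENS_PER_WORD = 2.0       # deliberately conservative; real tokenizers average ~1.3-1.5
PAPER_WORDS = 3_000         # the OP's 3k-word paper
PROMPT_WORDS = 300          # assumed overhead for each revision request

tokens_per_draft = (PAPER_WORDS + PROMPT_WORDS) * TOKENS_PER_WORD
drafts_in_memory = int(CONTEXT_TOKENS // tokens_per_draft)
print(f"~{drafts_in_memory} drafts fit before earlier versions fall out of memory")
# -> ~4 drafts, consistent with the 3-5 estimate above
```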
You can jailbreak token limits and go above this level, but going over the hard limit may run into hallucination issues above and beyond what already exists. This is because no one actually knows *why* the AI LLM performs best at a given token limit; it was just trained that way. So by adding to it, you might get more accurate results because you’ve added more memory, or you might get less accurate results because it’s now suboptimized relative to its design.
The soft limit of concurrency is how many people are hitting the servers at the same time. That’s a function of the compute power of the server farm, which also speaks to chip availability, water availability, and energy costs. You can scale that more easily than you can scale the hard limit…just add more of everything. Yes – one could just add another supercomputer to the server farm. The problem is that, due to a variety of factors (especially chip availability), it’s no longer automatic that you can scale to meet any increase in demand. So as multiple millions of people all start hitting that server farm simultaneously, you begin getting capacity errors, which is what’s been happening recently.
Installing a private LLM instance on your own servers or private cloud could alleviate the concurrency issue of the soft limit because you’re no longer sharing space with everyone else. But you’d still run into the hard contextual limit of what that particular LLM was trained on.
“Okay – we can’t just supercomputer our way to more accurate larger papers…what about improving the training data set? Wouldn’t larger training data sets provide more accurate answers?”
The evidence is pretty strong that as the training data size increases, the quality improves. The problem, however, is where the next marginally “larger” data set to gain that enhancement will come from.
We’ve already used the entirety of all public human writing ever produced up to the last cutoff date. For ChatGPT4 this cutoff date is January 2022. And yes this includes virtually all scientific articles…but nothing past that cutoff date.
At best, each iteration from here on out is adding the written works produced since the last cutoff date. Companies are now working on all video and audio ever produced (which is going to be of limited use because it’s still a 2D model of the world). After that…what’s left? All classified materials? All ISR data and/or formats that aren’t in text/video/audio form? The volume of data in those areas sharply declines as we move away from the most common expression mediums.
There’s also a very real possibility that as AI tools proliferate, they will ‘infect’ the training sets with AI-generated data, creating a feedback loop that actually leads to *reduced* performance even as the size of the training set scales, because the share of human output the LLM is trying to replicate becomes smaller over time. (Roughly put, consider it the multiplication of hallucinations: a model learning on top of another model’s hallucinations.)
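A toy model of that dilution, with entirely invented numbers, just to show the direction of the trend:

```python
# Toy model of training-set 'infection': if each scrape adds more AI-generated
# text than new human writing, the human share of the corpus shrinks over time.
# All quantities are invented for illustration.
human = 100.0      # arbitrary units of human-written text in the corpus
synthetic = 0.0    # AI-generated text scraped into the corpus

for cycle in range(1, 6):
    human += 2.0       # assumed trickle of genuinely new human writing per cycle
    synthetic += 10.0  # assumed flood of AI-generated text per cycle
    share = human / (human + synthetic)
    print(f"cycle {cycle}: human share of training data = {share:.0%}")
```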
That’s with current LLM methods. There are, I believe, some experiments in alternative LLM approaches that rely less on sheer size/scale and more on curated data sets for quality. That may really improve the hallucination problem, but any curation at this scale takes enormous resources. We may need a new kind of AI capable of curating high-quality data sets as subsets of the overall data set for the training of new LLMs. This gets back to the self-licking ice cream cone issue.
So back to the OP question. ChatGPT 4.5 could indeed write a stand-alone 3k-word scientific paper…and give you 3-4 draft versions to improve before it hits its token/memory limit. Its hallucination problem would still be significant, so you’d have to do lots of defect checking to ensure that the concepts are correct, the citations are not faked, etc.
And if you can chunk your work into completely independent 3k-word sections, you can go further. As long as each section doesn’t need to “remember” the previous sections, you can thread them serially.
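Here is a minimal sketch of that serial threading, assuming a hypothetical generate() helper wrapping whatever LLM you use; the section list and prompts are placeholders:

```python
# Serial chunking: each section is generated independently, with no shared
# memory between calls, so only the current prompt has to fit in the window.
def generate(prompt: str) -> str:
    """Hypothetical single LLM call; wire this to your model of choice."""
    raise NotImplementedError

SECTIONS = ["Introduction", "Methods", "Results", "Discussion"]  # illustrative

def write_independent_sections(topic: str) -> dict:
    drafts = {}
    for section in SECTIONS:
        # Each prompt stands alone; no section needs to "remember" another.
        prompt = f"Write the {section} section of a paper on {topic}, about 3,000 words."
        drafts[section] = generate(prompt)
    return drafts
```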
But take a book like The Selfish Gene at 96.5k words, or D&D 5th Edition coming in at a chunky 219k. Those books *can’t* be done as independent, serialized, stand-alone sections, because the LLM needs to “remember” what was said in the opening 3k-word sections throughout the entire book, lest inconsistencies arise.
This is why I see a confluence of emerging limits, some of which are inherent to current AI LLMs (e.g. the hard limit on optimal token size) and others external (e.g. chip shortages preventing limitless scaling of compute power).
Don’t get me wrong, AI LLMs are *extremely* powerful now for shorter pieces of work. But the size of those works is limited if the work has to “remember” details throughout. And they’re not very accurate at any size, with accuracy declining the longer the work you ask the AI LLM to ‘remember’ relative to its token limit.
Vastly improving the accuracy, or the length of work at the current accuracy, is going to require larger training data sets (which we don’t have), vastly more computing power (which will be hard to scale), or some innovation in either, which could be anything from neat programming tricks to quantum computing.
DISCUSSION:
Can this be solved by chunking up the work?
The key, though, is to what extent a consistent memory needs to be held between domains. For example, a 300k-word dictionary could be broken down in the way you describe. The definition of any single word does not need to be “remembered” by any other word. But take the rules of D&D5E, where cross-references and consistency of rule formats are vital across the whole work. These can be broken down, but there still needs to be knowledge of what’s in each of those areas when writing the others, and that’s where the memory problem emerges (currently).
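To see why, here is a sketch of what naive chunking would have to carry along: a shared summary of cross-references that grows with every chapter until it no longer fits in the window. All the numbers are illustrative assumptions:

```python
# Why cross-referenced works break naive chunking: keeping chapters consistent
# means carrying a shared "rules summary" into every prompt, and that summary
# itself grows until it eats the context window. Numbers are assumptions.
CONTEXT_TOKENS = 32_000
CHAPTER_TOKENS = 6_000               # prompt plus ~3k words of output
SUMMARY_TOKENS_PER_CHAPTER = 1_500   # shared terms, rule formats, cross-references

summary_tokens = 0
for chapter in range(1, 75):  # a D&D 5E-scale book chunked into ~3k-word chapters
    needed = CHAPTER_TOKENS + summary_tokens
    if needed > CONTEXT_TOKENS:
        print(f"chapter {chapter}: needs {needed} tokens, window holds {CONTEXT_TOKENS}")
        break
    summary_tokens += SUMMARY_TOKENS_PER_CHAPTER
```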
When it comes to accuracy, if I can Google-search to check accuracy, why can’t the LLM be automated to do that?
Because it doesn’t know what is right or what is wrong by looking at it, only the probability that the next thing it says is something it thinks you’re looking for.
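As a toy illustration of that “guessing,” with made-up probabilities, the model never consults a fact, only a distribution over next tokens:

```python
# Toy illustration of next-token "guessing": the model samples from a probability
# distribution; a wrong continuation is just as available as a right one.
import random

# Invented probabilities for the word after "The capital of Australia is"
next_word_probs = {"Canberra": 0.55, "Sydney": 0.35, "Melbourne": 0.10}

guess = random.choices(
    population=list(next_word_probs),
    weights=list(next_word_probs.values()),
)[0]
print(guess)  # sometimes a plausible-sounding but wrong answer like "Sydney"
```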
To get more accurate results you have to curate your training base; I think Elicit has several hundred million academic articles in it, for example. But the problem there is that by reducing the size of your training base, you may increase accuracy while reducing the predictability of the next-best-word guess itself.
Ultimately there’s going to need to be either some new innovative method that doesn’t increase hardware requirements, a relaxation of the scalability issue through hardware improvements (e.g. quantum computing or scaling of chip supply), or maybe even both.
Don’t get me wrong, in the future some of this will probably be resolved. But it’s presenting a sort of fuzzy upper limit right now, which is typical for new technical innovations that demonstrate a boom-bust-boom phase. LLMs, as a subset of AI in general, are currently in their first boom and may have to go through a bust to get to their next boom. Other forms of AI had their booms much earlier, then entered busts, and are only now crawling out into their next boom. Some of these may help the problem as an after-processing tool configured for accuracy. It’s hard to tell. The big problem is how few people understand *why* LLMs work. This is the nature of the beast here and a bit like human intelligence: parsing out *why* someone thinks a certain thought is super tricky to figure out.
Couldn’t it use a script to take some small outputs and google every claim? Then instead of looking at the original and searching I can read the output and skim citations. A team of LLMs plus a little code connecting them, then a little of my time upfront and at the end.
The problem, however, is that an LLM is not a search engine. It’s a guessing engine. So the process you describe is, currently, outside the capability of LLMs as far as I know.
The distinction may be subtle, but even if you ask it “give me a list of citations” it’s not ‘searching’ those but ‘guessing’ them based on its training data set. Its guesses may be accurate, but they are still guesses. This is why you won’t get any answers more recent than the last update (Jan 2022 for ChatGPT4.5 I believe.)
So setting up a chain of LLMs wouldn’t reduce the guessing function; it would instead simply accumulate error as a byproduct of sequential variation.
You should try Bard and Bing.
I have. However, neither Bard nor Bing is accurate enough, currently, to resolve the question “How do I know this large piece of text and its sources are accurate enough that I’m confident I don’t have to check them myself through alternate means?”
You can get really good specific answers with very specific prompted inquiries, but remember the token limit and compounded hallucinations here.
Depending on the size of the white paper, loading it into the engine and asking something like “check all the sources” may either bork the token limit or result in hallucinated results, because you’re now compounding two hallucinations (whatever was in the first paper added to whatever the second pass introduces).
If you aren’t familiar yet with the math of how sequential variation in linked processes can result in unexpectedly large accumulated errors, I can link you to a simulated example of it. That’s the problem of chaining LLMs in their current state: the first process’s error becomes the baseline for the second, which usually adds its own error on top. The result is probably going to be a larger error than if you did each process independently.
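A quick simulation of that compounding, with an assumed per-pass error rate that is purely illustrative:

```python
# Toy simulation of error accumulation across a chain of LLM passes. Each pass
# is assumed to independently introduce an error with probability P_ERROR;
# these are made-up numbers, not measured hallucination rates.
import random

P_ERROR = 0.10      # assumed chance a single pass introduces a hallucination
CHAIN_LENGTH = 3    # e.g. draft -> citation check -> summary
TRIALS = 100_000

random.seed(0)
flawed = sum(
    any(random.random() < P_ERROR for _ in range(CHAIN_LENGTH))
    for _ in range(TRIALS)
)
print(f"chance at least one pass hallucinated: {flawed / TRIALS:.1%}")
# Analytically: 1 - (1 - 0.10)**3 ≈ 27%, versus 10% for a single pass.
```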
Consider the use case mentioned where a lawyer, when challenged on his case citations, returned to ChatGPT to find them, and it simply *invented* the entire case as a document rather than ‘verifying’ accuracy.