
LLMs as Programming Language Runtimes, or Why Monolithic Text Prompts Are a Mistake

Prompts Are Programs

Fundamentally, LLMs function as a form of language runtime environment. While they might be unpredictable and challenging to reason about, the underlying pattern of program -> output is the same. This connection isn’t so obvious when working with chatbots, but as LLMs improve at structured output and function calling, your prompt comes to define a multi-system computation graph just like any other program. As LLMs get more tightly integrated into other systems, this connection will only become more important.

Viewing LLMs this way is very helpful for thinking about the engineering around them. One thing that becomes immediately obvious is that using a single text prompt that combines both instructions and user data is a very bad idea. As these systems become more capable, prompt injection could cause serious damage. Unlike traditional systems, where you can be 100% sure you’ve mitigated a vulnerability, the fuzzy nature of LLMs means you will always potentially be vulnerable to a carefully crafted malicious context.

Mixing Code and Data Is a Bad Idea

You might be thinking, “If monolithic text prompts are such a bad idea, how should we interact with the model?” My solution is to split the prompt into instructions and data and retrain models to understand the distinction. Not only does this mitigate a major attack vector, but it also unlocks a new form of prompt optimization. Instead of using text for the instructions, the model could take an instruction embedding. The benefit of this is that instead of having to manually do black magic prompt optimization, you can just have an agent do gradient descent in embedding space to optimize the instruction embedding for an objective function.
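To make this concrete, here is a minimal sketch of what gradient descent on an instruction embedding could look like. It is essentially soft prompt tuning; the frozen GPT-2 stand-in, the virtual-token count, and the toy training step are all illustrative assumptions on my part, not a prescribed recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)                           # the model stays frozen

# A learnable "instruction" of 16 virtual tokens in embedding space.
n_virtual, d_model = 16, model.config.n_embd
instruction = torch.nn.Parameter(0.02 * torch.randn(n_virtual, d_model))
opt = torch.optim.Adam([instruction], lr=1e-2)

def step(data_text: str, target_text: str) -> float:
    """One gradient step: prepend the instruction embedding to the
    (untrusted) data embeddings and score the desired continuation."""
    data_ids = tok(data_text, return_tensors="pt").input_ids
    target_ids = tok(target_text, return_tensors="pt").input_ids
    embeds = model.get_input_embeddings()(
        torch.cat([data_ids, target_ids], dim=1))
    inputs = torch.cat([instruction.unsqueeze(0), embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Logits at position i predict token i+1, so slice out the target span.
    start = n_virtual + data_ids.size(1) - 1
    pred = logits[:, start : start + target_ids.size(1), :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
    loss.backward()
    opt.step()
    opt.zero_grad()
    return loss.item()
```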

Embeddings can be quite lossy, so giving the model an embedding of the data might reduce performance. However, the instruction portion of most prompts can be accurately captured via embeddings, making this approach viable, and we can always develop better embedding methods if needed. Finally, having separate data and instructions has the potential to reduce the instruction-following dilution that currently occurs with extremely long prompts.

Implementing this split doesn’t have to be difficult. The naive approach, rewriting the prompts used for task training into the new instruction format and fine-tuning models on it, is a solid start. However, to reap the full benefits, we will need to update attention to treat instructions and data differently.
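For illustration, one plausible (entirely hypothetical) mechanism: tag each token with a segment ID, 0 for instructions and 1 for data, and let attention learn a score bias for every (query-segment, key-segment) pair, so the model can weight instruction tokens and data tokens differently:

```python
import torch

def segment_biased_attention(q, k, v, segments, seg_bias):
    """Single-head attention with a learned per-segment-pair bias.
    q, k, v: (batch, seq, dim); segments: (batch, seq) with 0=instruction
    and 1=data; seg_bias: learnable (2, 2) tensor of score offsets."""
    scores = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5  # (batch, seq, seq)
    # Look up the bias for every (query segment, key segment) pair.
    bias = seg_bias[segments.unsqueeze(-1), segments.unsqueeze(-2)]
    return torch.softmax(scores + bias, dim=-1) @ v

# Toy usage: 10 tokens, randomly tagged as instruction or data.
q = k = v = torch.randn(2, 10, 64)
segments = torch.randint(0, 2, (2, 10))
seg_bias = torch.nn.Parameter(torch.zeros(2, 2))
out = segment_biased_attention(q, k, v, segments, seg_bias)
```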

Reframing Common Techniques

This perspective also recasts many common accuracy-improving techniques used with LLMs as existing engineering tools with a long history. For instance, having the LLM check and correct its work at the end, or having the model answer from multiple perspectives and select the intersection answer, can be seen as forms of error-correcting codes. These prompting techniques are rudimentary, but they work, and they suggest that we can do much better if we approach the problem in a principled way.
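The multi-answer trick, for example, is essentially a repetition code. A minimal sketch, where `ask` is an assumed stand-in for any callable that returns one sampled answer per call:

```python
from collections import Counter

def majority_answer(ask, question: str, n_samples: int = 5) -> str:
    """Sample the model several times and return the most common answer,
    i.e. the decoding step of a simple repetition code."""
    votes = Counter(ask(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```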

Conclusion

I encourage you to adopt the mindset that these models are fuzzy language runtimes. This perspective can help reveal relevant prior work and suggest new directions for inquiry based on what has been successful in analogous systems.

For instance, consider runtime optimization - could we apply traditional runtime optimization techniques to improve the efficiency and performance of LLMs? Imagine a JIT optimizer that can update model structure to improve inference speed under certain conditions. Something like that would probably require a model trained to be “modular” and robust to subnetworks being turned on/off at runtime, but it’s an interesting idea. We already have mixture-of-experts models with routing capability; it wouldn’t be a stretch to increase the modularity of the networks and have an optimizer agent that learns from model inputs/outputs to situationally deactivate low-impact modules based on execution statistics.
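As a rough sketch of what that optimizer might wrap around each module (the class, the impact heuristic, and the threshold are all hypothetical choices of mine, not an existing technique):

```python
import torch

class GatedModule(torch.nn.Module):
    """Wrap a residual-style submodule, track a running estimate of how much
    it changes its input, and skip it once that impact falls below a
    threshold. A hypothetical sketch, not an existing API."""

    def __init__(self, inner: torch.nn.Module, threshold=1e-3, momentum=0.99):
        super().__init__()
        self.inner = inner
        self.threshold = threshold
        self.momentum = momentum
        self.register_buffer("impact", torch.tensor(float("inf")))

    def forward(self, x):
        if self.impact < self.threshold:
            return x  # judged low-impact on recent inputs: pass through
        y = self.inner(x)
        # Relative contribution of this module on the current input.
        delta = ((y - x).norm() / (x.norm() + 1e-8)).detach()
        if torch.isinf(self.impact):
            self.impact.copy_(delta)
        else:
            self.impact.mul_(self.momentum).add_((1 - self.momentum) * delta)
        return y
```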

AI and the Human Asymptote

Thesis: Teaching AI to reach superhuman performance will be much more challenging than we expect. Current progress has been accelerated by bootstrapping from human evolution over geological time, and as we approach the limits of our intelligence, we will have to take newer, more time-consuming and difficult approaches to keep improving model intelligence. This limit is the Human Asymptote, and it represents a fundamental challenge on our path to machine superintelligence.

The Intelligence Bottleneck

The intelligence of an AI model is inherently limited by the quality of its training data. While models can produce novel outputs, they are essentially applying learned patterns to new domains rather than creating genuinely novel patterns. Our capacity to create and evaluate training data is restricted by our own intelligence. As we can’t accurately identify superhuman training data in advance, the only reliable method to gauge its impact on the model is through iterative training and benchmarking. Thus, to push AI models beyond human intelligence, we will need a large-scale global optimization algorithm to develop the necessary dataset. How ironic that humans, themselves the product of genetic algorithms, may need to implement similar algorithms in artificial minds to surpass their own cognitive abilities.
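A toy sketch of what such a loop might look like. The `evaluate` stub stands in for the genuinely expensive part, which would be training a model on the candidate data mixture and benchmarking it:

```python
import random

def evaluate(mixture: list[float]) -> float:
    """Stub fitness: in reality, train a model on this mixture of data
    sources and return its benchmark score."""
    return -sum((w - 0.5) ** 2 for w in mixture)  # placeholder objective

def mutate(mixture: list[float], rate: float = 0.1) -> list[float]:
    """Perturb the mixture weights, clipped to [0, 1]."""
    return [min(1.0, max(0.0, w + random.gauss(0, rate))) for w in mixture]

# Evolve a population of candidate data mixtures over many generations.
population = [[random.random() for _ in range(8)] for _ in range(20)]
for generation in range(50):
    population.sort(key=evaluate, reverse=True)
    parents = population[:5]  # keep the best mixtures
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]
best = max(population, key=evaluate)
```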

Evidence of the Human Asymptote

The limitations of human intelligence are evident in the evolution of LLMs. Despite significant advancements, we haven’t seen much progress beyond GPT-4 in terms of raw problem-solving power. However, we have witnessed considerable improvements in problem-solving efficiency relative to model size. Models like Claude Haiku, Llama3, and Phi3 demonstrate competitive performance in many domains despite their smaller size. This trend suggests that optimizing the encoding of human intelligence is a more tractable problem than identifying and verifying superintelligence.

Since smaller models are quickly approaching parity with the top-tier GPT models in specific domains, the main advantage of larger models now lies in their ability to handle multi-domain intelligence, rather than superior problem-solving capabilities in isolated tasks. This indicates that the path to superhuman intelligence will likely involve not just refining existing models but fundamentally rethinking how we train and optimize them.

Summary

The journey to achieving superhuman AI is far more complex than our current achievements suggest. I’m going to say it now: this is self-driving cars all over again. If we’re going to succeed, we need to come up with innovative optimization algorithms to generate and select the best training data. As we strive to surpass human intelligence, it is essential to remember that our progress thus far has been built on the foundation of human evolution, and that the slowing of progress isn’t a failure but entry into a new, more challenging phase of the problem. There’s no need for doomerism, and winter isn’t coming: this problem is tractable, and we have the optimization algorithms, compute and storage to solve it; it’s just going to take time and cooperation. OpenAI and Google aren’t going to do it by themselves. This is a civilization-level problem, and that gives me hope, because it means open source has a real shot and the corporate dystopia is looking less and less likely.

Artificial General Intelligence is a Red Herring

Introduction

If you are into AI or futurism, it’s pretty hard to avoid conversations about AGI. When will we achieve AGI? What will AGI be like? Will AGI end the human race? This is a little frustrating, because I think a lot of the energy being put into thinking about and working towards AGI is being wasted. To be clear, I’m not an AI doomer; I’m very much on the “AI will change the world” bandwagon. I just think people are approaching the problem the wrong way.

The Problem with AGI

People throw the term AGI around, but it lacks a clear definition that we can rally around. It could range from “smarter than the average human at most things” all the way to “smarter than the smartest humans at everything we can measure.” If we use the weak definition, our current language models are arguably AGI. If we use the strongest definition, we might never achieve AGI since the smartest humans will be able to use the AI as a tool and stand on its shoulders for better results than the AI could achieve alone in most cases. Additionally, I think the intelligence of the model creators biases the intelligence of the training corpus in an unavoidable and limiting way.

To be clear, I’m not saying that model creators can’t create data sets more intelligent than they are, but creating such a data set will unavoidably be a trial-and-error process, since the creators lack the ability to accurately assess superhuman intelligence a priori. That means that as AI approaches peak human intelligence, we will have to resort to large-scale data generation and a global optimization algorithm that selects the best data set over time in order to continue progressing. That will probably keep things moving, albeit slowly; however, I expect the drop-off in progress to be brutal, because exploring the space of superintelligent models via optimization is much slower than bootstrapping from human abilities, which are themselves the result of genetic algorithms over geological time scales. Some of the cost of running these data set experiments could be mitigated by the right architecture and training techniques, so if you want to get ahead of the curve, consider how you could reduce the cost of trying and comparing different combinations of data during training.

There are cases where models learn to be “superhuman,” such as AlphaGo and the other reinforcement learning projects from DeepMind, but these are specialized tools, and the most “superhuman” thing about them is the amount of energy they consume to outperform humans. Reinforcement learning is no different from a person experimenting while practicing a skill. I suspect AlphaGo has played several orders of magnitude more games than any living human Go player, and since it wasn’t stuck in the local optimum of human Go strategies, it was able to come up with some very original tricks. A human could have come up with those tricks, but we have limited time to invest in experimentation, so we tend to select safe experiments, whereas a bot with reinforcement learning and a trillion-dollar company footing the compute bill can try all sorts of stupid shit on the small chance that it might be accidentally brilliant.

These reinforcement learning systems aren’t so much a path to AGI as specialized knowledge creation engines for domains with a succinctly describable objective. You might argue that this approach could be scaled to create AGI, but it has the same problem as language models: the intelligence of the model is limited by the intelligence of the model’s creators. In language models, the problem is assembling a superhuman corpus of training data; with reinforcement learning, the problem is creating a superhuman objective. You might suggest a variation of the same trick I mentioned before, using another model to come up with the AGI objective, but I suspect that just shuffles the complexity around rather than actually solving the problem. Perhaps we could throw enough compute at the problem to solve it using the “infinite monkeys on infinite typewriters produce the complete works of Shakespeare” strategy, but there’s a reason people use that meme pejoratively.

AGI: Maybe Never?

In fact, to strengthen the argument that if AGI means “better than the smartest humans at literally everything” we may never achieve it, I want to call attention to something futurists tend to miss: progress isn’t exponential, but sigmoidal. Exponential growth is impossible in any finite system; as the system approaches the limits of its capacity, growth flattens out. That means you can get to 80% of the limit of a sigmoidal function in pretty short order, but getting to 99% or 99.999% is a very different feat. The people who’ve worked on self-driving cars know what I’m talking about: the core technology has been functional for ~15 years, but we’re still getting crushed by the long-tail complexity of the real-world task. We’re going to run into this phenomenon again and again as we create tools to automate various aspects of human behavior.

[Figure: exponential vs. sigmoidal growth]
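The tail problem is easy to quantify. For logistic growth f(t) = 1/(1 + e^(-kt)), the time to reach a fraction p of the ceiling is t = ln(p/(1-p))/k, which blows up as p approaches 1:

```python
import math

def time_to_fraction(p: float, k: float = 1.0) -> float:
    """Time at which logistic growth 1/(1+exp(-k*t)) reaches fraction p."""
    return math.log(p / (1 - p)) / k

for p in (0.80, 0.99, 0.99999):
    print(f"{p:>7.5f} of the ceiling at t = {time_to_fraction(p):5.2f}")
# 0.80000 of the ceiling at t =  1.39
# 0.99000 of the ceiling at t =  4.60
# 0.99999 of the ceiling at t = 11.51
```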

Furthermore, the “no free lunch theorem” (NFLT) becomes more of an issue the more general your model becomes. In simple terms, the NFLT states that there is no single model or algorithm that can outperform all others across all possible tasks or datasets. Think of it like a restaurant menu - just as there is no single dish that is the best choice for every person, every time, there is no single model that is the best choice for every problem, every time. Inductive biases that improve inference for one set of problems will necessarily make it worse for others. Eventually trying to make one model good at everything is going to become a game of whack-a-mole with endless performance regressions unless you submit to parameter explosion.

[Figure: the no free lunch theorem]

If you want to see the no free lunch theorem in action, just look at how OpenAI “lobotomized” ChatGPT: they optimized it to be better at logic and question answering, and it caused a significant deterioration in its creative writing skills. Likewise, the current generation of Llama3 fine-tunes tend to show deterioration in high-level logic and reasoning in exchange for better role-play abilities. That sort of “robbing Peter to pay Paul” effect is going to become more common as models become more optimized, which means monolithic models will have to get much larger to be superhuman at more things. That isn’t efficiently scalable even if we assume better architectures, so specialization is the likely reality. Finally, model intelligence seems to scale logarithmically with parameter count, with additional parameters serving more to bake in knowledge than to increase intelligence, so it’s questionable how viable larger monolithic models are as a strategy anyhow.

A Task-Centric Approach to AI

So, if AGI is a red herring and trying to build more and more intelligent monolithic models to reach AGI is a fool’s errand, what should we be doing instead? I’d like to propose that we stick to a task-centric view of AI - given a specific task, how well does the model perform the task compared to average and optimal human performance? This is a very straightforward problem to understand, and achieving near optimal human performance is much easier when the bounds of the task are constrained. If there’s a task current models don’t do well, we can funnel resources into creating new models that can perform that task.

I can imagine a lot of readers are bristling at this suggestion, but there’s a very good reason for it. If there is a specialized tool for every known task a human might undertake that achieves near-optimal human performance, the task of creating “AGI” reduces to creating an agent that is just smart enough to select the right tools for the problem at hand from a model catalog, and large language models are pretty close to that level already. This approach also leaves the tools accessible to humans rather than hiding them away in a giant model, and if new tasks emerge or existing tasks change, you don’t need to retrain that giant model to be good at them - just update the model catalog with a new specialized tool.

Beyond the benefits I just mentioned, this path has the benefit of being clearly evaluable and more tractable to solve. If we managed to create AGI in a monolithic manner, we would still need a vast suite of benchmarks encapsulating all the tasks humans are capable of in order to prove that it was “AGI.” Since we need this exhaustive suite of benchmarks anyhow, why not just build to that in the first place? Any engineer worth their salt can tell you that a system with thousands of functions that each do one thing well is easier to build and maintain than a system with one function that can do thousands of things when invoked with the correct arguments.

The Path Forward

So, if we’re really serious about building AGI, what should we be doing today?

  1. The first thing we need is an improved suite of truly exhaustive benchmarks. If our definition of AGI is “better than most humans at everything,” we can’t even make coherent statements about that until we have the ability to measure our progress, and a deep benchmark suite is the starting point. This will entail moving beyond simple logical problems to complex world simulating scenarios, adversarial challenges and gamification of a variety of common tasks.

  2. Secondly, stop trying to train supermodels that know everything; pick a problem domain that AI isn’t good at yet and really hammer it with specialized models until we have tools that match or exceed the abilities of the best humans in that area. If you really want to push the boundaries of general foundation models, consider working on making them more efficient or easily tuneable/moddable. I also think there is a lot of low-hanging fruit in transfer learning between languages, and any improvements there will make models better at coding and math as a side effect.

  3. Finally, we need to design asynchronous agents that are capable of out-of-band thought/querying and online learning (in the optimization sense) and have been trained to identify sub-problems, select tools from a catalog to answer them, then synthesize the final answer from those partial answers. The agent could convert sub-problems into embeddings, then use their positions in embedding space to select the models to process them with (see the sketch after this list). I suspect this agent could be created with a multimodal model having a “thought loop” which can trigger actions based on the evolution of its objectives over time, with most actions being tool invocations producing output that further informs those objectives.
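Here is the sketch referenced in point 3: routing sub-problems to catalog models by embedding similarity. The `embed` stub and the catalog entries are hypothetical placeholders; in practice you would use a real sentence-embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding; swap in a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=128)

# Hypothetical model catalog: specialist name -> embedding of its specialty.
catalog = {
    "sql-specialist": embed("write and debug SQL queries"),
    "contract-summarizer": embed("summarize legal contracts"),
    "math-solver": embed("solve math word problems step by step"),
}

def route(subproblem: str) -> str:
    """Pick the catalog model whose specialty is nearest in embedding space."""
    q = embed(subproblem)
    def cosine(v):
        return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(catalog, key=lambda name: cosine(catalog[name]))
```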

There is very little uncertainty in this path: no model regressions, no measurement problem, and no need for supergenius engineers, only the steady ratcheting of human cultural progress. Beyond that, this method will put the whole “when will we have AGI” debate to bed. We’ll be able to track the coverage of tasks where AI is better than human, track the creation of new tasks and the rate at which existing tasks fall to AI, and use a simple machine learning predictor to get a solid estimate of the exact date when AI will be better than humans at “everything.”

The Sibylline Circle

Just to give you an idea of how we can track this progress in a way that’s easy for humans to quickly grok, here’s an example visualization. Imagine that each bar is a benchmark for a specific task:

[Figure: the Sibylline Circle, a circular chart with one bar per task benchmark]

Over time the number of individual benchmarks is likely to get so large that the chart needs to be rendered as a line chart rather than bars, but to avoid losing information, benchmarks could be clustered and sorted based on topic/type of task/etc with coloring for each cluster so that the chart is still easily interpretable.
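For the curious, here is a toy rendition of the idea with matplotlib, assuming each benchmark score is normalized so that 1.0 means human parity:

```python
import numpy as np
import matplotlib.pyplot as plt

scores = np.random.default_rng(0).uniform(0.3, 1.2, size=40)  # toy scores
theta = np.linspace(0, 2 * np.pi, len(scores), endpoint=False)

ax = plt.subplot(projection="polar")
ax.bar(theta, scores, width=2 * np.pi / len(scores), bottom=0.0)
# Dashed ring at 1.0 marks human parity; bars past it are superhuman tasks.
ring = np.linspace(0, 2 * np.pi, 200)
ax.plot(ring, np.ones_like(ring), "r--")
plt.show()
```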

Since I’ve always wanted to name something in public discourse, I’m going to call this measure of AGI progress the “sibylline circle.” AI is oracular in both senses, and sibylline can also indicate a double meaning, so I think it fits, and it’s a fair bit of self-promotion to boot.

Note that the sibylline circle could also be a useful tool for people using AI, as it would make identifying the areas where AI is unreliable easier.

Introducing Jung, your AI-powered personality analyst!

Hello. I’m happy to announce the beta launch of my personality profiling and analysis tool, Jung! Jung can take unstructured text and a contextual directive and help you peer into the depths of someone’s psyche, extracting personality traits, subtext and hidden motivators. If you’d like to know the details of the Jung API, head over to the API page. For now, I’d like to focus on some fun demonstrations to give you a sense of Jung’s abilities.

Given that those of us in the United States are in an election year and politics are delicious clickbait (sorry, not sorry), I thought I’d start with a comparison of the two men vying for the job of our fearless leader. That’s right, we’re going to dive into the minds of Donald Trump and Joe Biden to try and see how Jung could help a prospective voter who was undecided (do they still exist?) pick the best candidate.

To perform this analysis, I started by collecting and bundling together text from multiple recent interviews to fill up my available context, so Jung would have plenty to work with. If you’d like to replicate this analysis, you can go to the presidential project and select the three most recent interviews for each candidate.

In addition to taking text to analyze, Jung takes a subcommand that helps guide its analysis. Here is the subcommand used for these analyses:

These are transcripts of interviews with a presidential candidate. Your subcommand is to confine your analysis to the person being interviewed and focus your analysis on being maximally informative to a potential voter who was undecided and wanted to get a deep understanding of what kind of person the interviewee is. Try to avoid being biased by preexisting knowledge of the subject and focus on the provided text.

I want to preface this analysis by noting that I’ve made many attempts to control for ideological, social, racial and other biases in Jung. Jung considers issues from diverse perspectives (including both conservative and liberal), and you can see from the subcommand that it has been instructed to avoid being biased by prior knowledge. To put it in plain English: I did my best to give Donny a fair analysis without compromising the objectivity of the analytical process.

With that out of the way, let’s dive into the output of Jung. I’ve placed Biden and Trump side by side for easy comparison:

[Jung analysis results: Joe Biden]

Key Points

Joe Biden

  • Expressed optimism about winning the 2020 election and doing well in 2022 midterms, attributing it to changes in polling and media landscape
  • Believes democracy itself is at stake in the 2024 election, with Trump posing an existential threat if reelected. "I really think, Bush -- I think what's at stake is -- is literally our democracy, not a joke."
  • Touts economic progress for all Americans, especially African Americans, under his administration. Points to record low Black unemployment, increased Black wealth and businesses, expanded healthcare, reduced child poverty.
  • Emphasizes the importance of every vote, especially from marginalized communities. "Their vote is not as -- just as important. In a sense, it's more important."
  • Highlights his civil rights background and efforts to make his administration diverse and representative. "I promised if I got elected, I'd have an administration that looked like America."
  • Says he's running for reelection primarily to prevent Trump from winning and undoing his accomplishments. "Donald Trump. I know I can beat him, and I know he must be beaten."
  • Warns Trump would cut Social Security, Medicare, and reverse drug pricing reforms that help seniors and disadvantaged groups. Frames it as an issue of tax fairness.

Notes

President Biden comes across as deeply concerned about the state of American democracy and the potential consequences of a second Trump term. He portrays the 2024 election in stark, existential terms as a choice between continued progress, especially for marginalized groups, and a backslide into "chaos, division, violence, MAGA extremism." Biden seems to draw moral conviction from his civil rights background and belief in the untapped potential of African Americans and other minorities who haven't been given an equal shot. He takes evident pride in the diversity of his administration and appointments. There's a paternalistic streak in how he talks about delivering economic gains to disadvantaged communities, but it appears to come from a place of genuine concern. He emphasizes dignity, respect and expanding access to opportunity. His decision to run again appears driven more by determination to stop Trump than ambition for himself. He projects confidence that he's the Democrat best positioned to win, with little tolerance for other perspectives. Overall, Biden comes across as an old-school liberal motivated by a mix of do-gooder instincts, Democratic partisanship, and alarm over the anti-democratic drift of the opposition. Amid political upheaval, he's betting voters will choose steady progress over a "radical" rupture.

Traits

[Trait distribution charts for each trait below; the red line marks the mode of the underlying beta distribution]

  • Openness - Mean: 0.1600, Std: 0.0490
  • Conscientiousness - Mean: 0.7800, Std: 0.0400
  • Extraversion - Mean: 0.5600, Std: 0.0490
  • Agreeableness - Mean: 0.6800, Std: 0.0600
  • Neuroticism - Mean: -0.4500, Std: 0.0500
  • Autonomy - Mean: 0.0700, Std: 0.0640
  • Altruism - Mean: 0.8600, Std: 0.0490

[Jung analysis results: Donald Trump]

Key Points

Donald Trump

  • Expressed skepticism about the COVID vaccine, saying "If she doesn't want to take it, you know, it's a freedom thing"
  • Claimed credit for Operation Warp Speed and developing COVID vaccines in record time, saying it saved millions of lives worldwide
  • Said the survival aspect of his presidency, dealing with investigations and impeachment, was tougher than actually running the country
  • Criticized the withdrawal from Afghanistan as the "worst in the history of our country" and a tremendous embarrassment
  • Claimed he threatened the Taliban leader that he would bomb their villages if they did anything to Americans in Afghanistan
  • Blamed the Afghan military's quick collapse on the US "bribing them to fight" with high pay
  • Insisted the border wall would have stopped migrant caravans and completing it was part of his withdrawal agreement
  • Touted his strong relationship with China's Xi Jinping despite imposing tariffs, claiming China no longer respects the US under Biden
  • Blamed high gas prices and inflation on Biden policies reversing his achievements in energy independence
  • Alleged political persecution of Republicans and double standards in not investigating Hunter Biden's foreign business deals
  • Criticized COVID restrictions on Americans while allowing illegal immigrants in without mandates
  • Teased a 2024 presidential run, saying "I think you'll be happy" but campaign finance laws prevent him from announcing

Notes

Trump's comments reveal a man still intensely focused on touting his record and criticizing his successor. He seems to view events primarily through the lens of how they reflect on him personally. There are strong themes of pride and ego - taking credit for COVID vaccines, Middle East deals, the wall, and energy independence while pinning all negative developments on Biden reversing his policies. He portrays himself as a strong leader who commanded respect from foreign leaders and the Taliban. At the same time, he expresses deep resentment over the investigations he faced, calling his "survival" against a supposed witch hunt his biggest challenge as president. He alleges an unfair double standard in Republicans facing persecution while Hunter Biden's dealings go uninvestigated. On COVID, his views seem conflicted, wavering between touting the vaccines as a great achievement and supporting skeptics' "freedom" to refuse them. He criticizes restrictions on Americans while faulting Biden for not imposing them on illegal immigrants. Throughout, he focuses on projecting strength, respect and success while attacking opponents as weak, corrupt and incompetent. The teased 2024 run hints at his continued hunger for vindication. But the remarks often feel more self-serving and petty than forward-looking or unifying.

Traits

[Trait distribution charts for each trait below; the red line marks the mode of the underlying beta distribution]

  • Openness - Mean: -0.0800, Std: 0.2272
  • Conscientiousness - Mean: -0.1000, Std: 0.1732
  • Extraversion - Mean: 0.8600, Std: 0.1281
  • Agreeableness - Mean: -0.7900, Std: 0.1578
  • Neuroticism - Mean: 0.5200, Std: 0.1470
  • Autonomy - Mean: 0.8300, Std: 0.1487
  • Altruism - Mean: -0.5200, Std: 0.1833

Not too shabby! Jung clocks both Biden and Trump well. That isn’t terribly useful in this case, since I think everyone knows what each of the candidates is about, but it could be very useful in other circumstances, such as hiring, medical support systems, criminal justice and so forth, where a decision maker might lack sufficient time to make a nuanced decision considering all the evidence. With Jung’s help, you can quickly gain psychological insights from large volumes of conversational text.

Note that the traits listed here are the Big 5, with the addition of autonomy and altruism which are not adequately captured by the Big 5. Here are the definitions for reference:

  • Autonomy: the individual’s independence, self-determination, and willingness to make decisions based on personal judgment. Autonomous individuals are skeptical of authority and do not care about external expectations.
  • Altruism: the individual’s concern for the well-being of others, willingness to help, and tendency to prioritize the needs of others over their own. Altruistic individuals are generous, compassionate and selfless in their actions and decisions.

You might also notice that the red line on the charts doesn’t exactly line up with the listed mean. This is because the mean and standard deviation come from a Gaussian approximation; the true underlying distribution is beta, and the charts use the mode, which is more meaningful than the mean for skewed beta distributions. The Gaussian mean and deviation are provided because they’re easier to conceptualize, but the beta distributions are correct. Working with beta distributions also has the nice property that if you have Jung analyze someone more than once, you can combine the results in a principled way just by summing the alpha and beta parameters for each trait.
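A quick sketch of that combination property, along with a mode helper matching what the charts plot; the [0, 1] to [-1, 1] rescaling is my assumption about how beta support maps onto the trait scale:

```python
def combine(a1: float, b1: float, a2: float, b2: float) -> tuple[float, float]:
    """Merge two Beta(a, b) estimates of the same trait by summing
    pseudo-counts; valid when the analyses are independent."""
    return a1 + a2, b1 + b2

def beta_mode(a: float, b: float) -> float:
    """Mode of Beta(a, b) for a, b > 1 (what the red line marks)."""
    return (a - 1) / (a + b - 2)

def to_trait_scale(x: float) -> float:
    """Assumed mapping from beta support [0, 1] to the [-1, 1] trait scale."""
    return 2 * x - 1
```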

Those of you with a background in psychometrics might be used to seeing traits on a 1-10 or 1-100 scale, and wonder why the traits are expressed in the [-1, 1] range. This is because centering the neutral value at 0 reduces scale bias caused by human (and, by extension, LLM) evaluation resulting from the skewed distributions of graded schoolwork and product reviews.

If you enjoyed this, stay tuned because I plan to run a whole series of interesting comparisons in the near future.