Earlier this year, when looking for an inroad to AI safety, I learned about the Löbian Obstacle, a problem encountered by 'purely logical' agents when trying to reason about and trust one another. In their original paper [1], Yudkowsky and Herreshoff show that a consequence of Löb's theorem is that an agent X can only "trust" the reasoning of an agent Y whose reasoning system is strictly weaker than X's own, where "trust" here means 'formally prove that the conclusions of the other agent's reasoning will be true'. As stated, this looks like a major problem if X is a human trying to build an artificially intelligent system Y, but it's also a problem for any individual (embedded) agent trying to reason about its own future behaviour.

I'm not the first person to find this problem counterintuitive, and for good reason. In this article I'm going to explain why a formal (purely syntactic) logic system alone is a poor model of the reasoning of embedded agents, and show that by fixing this, we remove the foundation for the difficulties arising from Löb's theorem.

For the uninitiated, there is a handy survey of applications of Löb's theorem in AI safety research by Patrick LaVictoire [6].

## Pure syntax

First, I should explain the formal set-up for applying Löb's theorem to agents. We model an agent's reasoning with a formal language, or *logic*, which I'll call L. Here I shall make the further assumption that this logic fits (or can be squeezed into) a formal language of the kind logicians are familiar with: the logic consists of some formal symbols or *variables* p, q, r, ... along with some *logical connectives*, *operators* and *quantifiers* for combining variables into expressions, or *formulas*. The agent is also assumed to carry some *inference rules* for manipulating formulas. Altogether, this data constitutes the *syntax* of L (its symbolic content and the rules for manipulating those symbols).

**Since we don't care precisely what the symbols in L refer to**, we need go no further, and we can start worrying about how this formal system behaves.

To arrive at the Löbian Obstacle, we suppose that an agent A (who reasons with L) builds an agent B who shall also reason with the logic L. Suppose, specifically, that A is building B to help them achieve some goal G. In order for A to ensure B will be helpful, A programs B to produce a proof that B's actions will lead to the goal G before acting. I should clarify that we are assuming that A and B inhabit an idealized environment over which they have complete control and perfect information, so that such a proof represents a guarantee of success.

Since A doesn't know in advance what B needs to do to achieve G (else there would be no dilemma!), there is a large number of formulas in L which could be output by B as 'actions provably achieving G'. Being maximally pessimistic (incidentally, this assumption is another one I find problematic, but I'll allow it for the time being) we can assume it could be any formula in L. Thus, in order for A to be certain that B will behave helpfully, it would seem that A must be able to prove, for all such formulas φ, a theorem schema of the form □φ → φ, which is to say that the existence of a proof of φ (which will eventually be supplied by B) implies that φ is true.

This is the point at which Löb's Theorem wags its finger: in the words of Yudkowsky and Herreshoff, "no formal system at least as powerful as Peano Arithmetic can prove its own soundness schema". In other words, it would seem that A cannot possibly prove in L that it will be able to trust an arbitrary output from B.
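For reference, the theorem itself can be stated as follows, where T is any consistent theory extending Peano Arithmetic and □P abbreviates "P is provable in T":

```latex
% Löb's theorem:
\text{if } T \vdash \Box P \rightarrow P \text{, then } T \vdash P.
```

In particular, T can prove an instance □P → P of its own soundness schema only for sentences P which it already proves outright, so the full schema is unavailable to any consistent such theory.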

## The problem of induction in disguise

There's one obvious problem here, which I hinted at in the introduction: the same argument applies to A reasoning about itself! The only means that A has of concluding that its own actions will result in the desired consequences is by reasoning in L, and A has no way of proving that its musings in L have any bearing on its environment. This can be interpreted as an extremely condensed version of Hume's problem of induction [5], which consists in the observation that there is no purely logical basis for our inferences regarding, for example, the consequences of our actions.

I contend that applying Löb's Theorem to a purely syntactic system is a rather arcane way of arriving at this conclusion, since it's much easier to see how an agent's reasoning can fail to be sound with a simple example.

Earlier, I emphasised the phrase "**since we don't care precisely what the symbols in L refer to**", a claim which I will now make more precise. While the precise details of any given implementation aren't important for arguing about generic behaviour, it *does* matter that the symbols in L refer to *something*, since the veracity (which is to say the "actual" truth value) of the soundness schema depends on it. Let me throw some vocabulary out for the important concepts here.

The symbols and formulas of L don't come with intrinsic meaning (no formal language does). To imbue them with meaning, we require a mapping from formulas in L to objects, values or concepts which the agent is able to identify in their environment. This mapping is called the *semantics* of L. The semantic map is typically constructed inductively, starting at interpretations of variables and extending this to formulas one logical connective or quantifier at a time. From another perspective, this mapping is an attempt to express the environment as a *model* of the logic L.
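Such an inductively constructed semantic map can be sketched in miniature (again, the representation is my own illustration): formulas are nested tuples, the "environment" is an assignment of truth values to variables, and the interpretation is extended one connective at a time.

```python
# A toy semantic map for a propositional fragment. A formula is either a
# variable name (a string) or a nested tuple like ("and", "p", "q"); the
# environment assigns truth values to the variables.

def interpret(formula, environment: dict) -> bool:
    """Inductively extend an interpretation of variables to all formulas."""
    if isinstance(formula, str):      # base case: a variable
        return environment[formula]
    op, *args = formula               # inductive step: one connective at a time
    if op == "not":
        return not interpret(args[0], environment)
    if op == "and":
        return interpret(args[0], environment) and interpret(args[1], environment)
    if op == "or":
        return interpret(args[0], environment) or interpret(args[1], environment)
    if op == "implies":
        return (not interpret(args[0], environment)) or interpret(args[1], environment)
    raise ValueError(f"unknown connective: {op}")

env = {"p": True, "q": False}
print(interpret(("implies", "q", "p"), env))  # True
```

Here the environment is a trivially simple model; for an embedded agent, the right-hand side of the map is the messy world it inhabits, which is exactly where the trouble starts.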

In order for the agent to reason correctly about its environment (that is, come to valid conclusions using L) it is necessary that the semantics is *sound*, which is to say that all of the inference rules transform formulas whose interpretation is true into formulas of the same kind. In other words, the semantics should be *truth-functional* with respect to the inference rules of L.

It's easy to see how soundness can fail. Consider a chess-playing agent. If the agent isn't aware of some rule for how chess pieces may be moved (the agent isn't aware of the possibility of castling, say) then it's obvious how, under circumstances where those rules apply, they may come to incorrect conclusions regarding the game. The logic involved in this case may be much simpler than Peano Arithmetic!

For our purposes here, the take-away is that **soundness of the semantics cannot be guaranteed by syntax alone**. Again, we don't need Löb's Theorem to demonstrate this; producing unsound semantics for any given logic is rarely difficult, since it suffices to make the semantics fail to be truth-functional (by deliberately misinterpreting formal conjunctions as "or", say).
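To make that failure mode concrete, here is a minimal sketch (representation and names are my own) in which the formal conjunction is interpreted as disjunction, so that the syntactic rule of conjunction elimination (from "A and B", conclude "B") is no longer truth-preserving:

```python
# A deliberately unsound semantic map: it interprets the formal
# conjunction "and" as disjunction, so an inference rule that is
# perfectly valid syntactically can lead from a "true" premise
# to a false conclusion.

def unsound_interpret(formula, environment: dict) -> bool:
    if isinstance(formula, str):      # base case: a variable
        return environment[formula]
    op, *args = formula
    if op == "and":                   # the misinterpretation: "and" read as "or"
        return (unsound_interpret(args[0], environment)
                or unsound_interpret(args[1], environment))
    raise ValueError(f"unknown connective: {op}")

env = {"p": True, "q": False}
premise = ("and", "p", "q")
# Conjunction elimination licenses concluding "q" from the premise, yet
# the premise is "true" under this semantics while "q" is false:
print(unsound_interpret(premise, env), unsound_interpret("q", env))  # True False
```

The syntax and inference rules are untouched; only the mapping to truth values has gone wrong, which is the whole point.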

## Soundness as a belief

No agent is able to justify their system of reasoning from immutable first principles; this is another presentation of Hume's problem of induction. If that claim sounds contentious, I highly recommend Nancy Cartwright's essay [2] on relativism in the philosophy of science. And yet, we (humans) continue to reason, and to expect the conclusions of our reasoning to represent valid statements about reality.

The crucial fact underpinning this apparent contradiction is that every agent, embedded or otherwise, must hold **metalogical beliefs** regarding the soundness of their reasoning. In the very simplest case of an agent with a rigid reasoning system, this will consist of a belief that the chosen semantics for their logic is sound. In other words, while the soundness schema appearing in Löb's Theorem can never be proved in L, *the agent necessarily carries a metalogical belief that (the metalogical version of) this soundness schema holds*, since this belief is required in order for A to confidently apply any reasoning it may perform within L.

A related metalogical belief is that of consistency: an agent cannot prove the consistency of its own reasoning (Gödel's second incompleteness theorem), but it must nonetheless believe that its logic is consistent for soundness to be a possibility in the first place.

Allowing for the existence of metalogical beliefs immediately dissolves the Löbian Obstacle, since these beliefs extend to proofs provided by other agents: as soon as agent A encounters a proof in L from any source, they will expect the conclusion to be validated in their environment by the same virtue that they expect the conclusions of their own reasoning to be valid. Agent A can delegate the task of proving things to subordinates with more computational power and still be satisfied with the outcome, for example.

## Where did the problems go?

I do not expect that the perception shift I'm proposing can have magically eliminated self-reference problems in general. A critical reader might accuse me of simply ignoring or obfuscating such problems. Let me try to convince you that my proposed shift in perspective achieves something meaningful, by explaining how Löb's theorem rears its head.

A devotee of formal logic might naively hope from my exposition so far that these so-called metalogical beliefs of soundness can simply be encoded as axioms in an expanded logic L' containing L. But Löb's theorem tells us that this will automatically make L' inconsistent!

Thus, I am advocating here that we interpret Löb's theorem differently. Rather than concluding that a logical agent simply cannot believe its own proofs to be valid (a conclusion which we humans flout on a daily basis), which brings us to a seemingly insurmountable obstacle, I am proposing that we relegate assertions of soundness to beliefs at the interface between purely formal reasoning and the semantics of that reasoning. This distinction is protected by Hume's guillotine: the observation that there is a strict separation between purely formal reasoning and value judgements, where the latter includes judgements regarding truth. The agent's beliefs cannot be absorbed into the logic L, precisely because syntax can never subsume (or presume) semantics.

To make this concrete, I am proposing that soundness assertions (and metalogical beliefs more generally) must include the additional data of a semantic mapping, an *aboutness condition*. Note that this doesn't preclude an agent reasoning about its own beliefs, since there is nothing stopping them from building a semantic map from their logic into their collection of beliefs, or indeed from their logic into itself. Incidentally, the proof of Löb's theorem requires just such a mapping in the form of a Gödel numbering (in modern modal logic proofs, the 'provability' modality takes on this role), although it is usually only a *partial* semantic map, since it is constrained to an interpretation of a representation of the natural numbers in L into a collection of suitably expressive formulas/proofs in L.

## Empirical belief updates

The impossibility of an *a priori* guarantee of soundness forces an abstract intelligent agent to resort to empiricism in the same way that we human agents do: the agent begins from a hypothetical belief of soundness with less than absolute certainty, and must test that belief in their environment. When failures of soundness are observed, the agent must adapt. Depending on the architecture of the agent, the adaptation may consist of updates to L, an update to the semantic map, or an update to the belief value (if the truth values employed by the agent in their beliefs are sufficiently structured to express domains of truth, say). I am deliberately hinting at some interesting possibilities here, which I may explore in more detail in a later post.
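One of the simplest architectures for such a hypothetical belief (my own illustration; the class and its parameters are invented for this sketch, and a real agent could be far more structured) is a Beta-distributed credence that applying the agent's reasoning in a given situation yields valid conclusions, updated on observed successes and failures:

```python
# A minimal sketch of an empirical soundness belief: the agent holds a
# Beta-distributed credence that its semantics is sound in a given
# situation, and updates it whenever a conclusion is tested against
# the environment.

class SoundnessBelief:
    def __init__(self, prior_successes: float = 1.0, prior_failures: float = 1.0):
        self.successes = prior_successes   # Beta parameter alpha
        self.failures = prior_failures     # Beta parameter beta

    def update(self, conclusion_was_valid: bool) -> None:
        """Record one empirical test of the agent's reasoning."""
        if conclusion_was_valid:
            self.successes += 1
        else:
            self.failures += 1

    @property
    def credence(self) -> float:
        """Posterior mean probability that the semantics is sound here."""
        return self.successes / (self.successes + self.failures)

belief = SoundnessBelief()
for outcome in [True, True, True, False]:
    belief.update(outcome)
print(round(belief.credence, 3))  # 0.667
```

Crucially, the credence never reaches 1: the belief stays hypothetical, which is exactly what distinguishes it from the unattainable proved soundness schema.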

Under the belief that the Löbian Obstacle was a genuine problem to be avoided, several approaches have been proposed in the past decade. Foremost is identifying a probabilistic regime in which Löb's Theorem fails to hold, and by extension fails to produce an obstacle [3]. In the setting of agents reasoning about mathematics or purely formal systems, this has been vastly expanded into the concept of a *logical inductor* [4] which assigns probabilities to all formulas in its logic (under the implicit assumption of particular fixed semantics!).

However, fixed semantics are inflexible, and any system which assigns values (such as probabilities) directly to formulas in its logic must be extremely data-heavy, since it must store (encodings of) large numbers of formulas explicitly rather than implicitly. An immediate advantage of recognizing that semantics is, and indeed must be, separate from syntax is that it highlights the possibility that the semantic map may vary. Rather than carrying around a heavy knowledge base consisting of all of the statements which it knows to be true (with respect to a fixed semantics), an agent may instead apply the same formal reasoning in different situations and carry the much lighter load of empirical knowledge of situations in which its reasoning will be sound. Or, with a little refinement, the agent can carry an understanding of which fragment of its logic can be applied in a given situation.

To me, this feels a lot closer to my own experience of reasoning: I have some factual knowledge, but I often rely more heavily on my understanding of how to apply my reasoning abilities to a given situation.

---

[1] *Tiling Agents for Self-Modifying AI, and the Löbian Obstacle*, Yudkowsky, E. and Herreshoff, M., intelligence.org, 2013.

[2] *Relativism in the Philosophy of Science*, Cartwright, N., from *Relativism: A Contemporary Anthology*, Columbia University Press, 2010. Available on ResearchGate.

[3] *Probabilistic Löb Theorem*, Armstrong, S., lesswrong.org, 2013.

[4] *Logical Induction*, Garrabrant, S. et al., arXiv, 2020.

[5] *The Problem of Induction*, Henderson, L., *The Stanford Encyclopedia of Philosophy* (Spring 2020 Edition), Edward N. Zalta (ed.), url.

[6] *An Introduction to Löb's Theorem in MIRI Research*, LaVictoire, P., intelligence.org, 2015.