Why AI safety

November 2022

AI Safety is Weird

I realize that, to outsiders, AI safety can seem like an odd thing to be concerned about. When I was first introduced to the topic, I myself reacted with some hesitation and skepticism, especially as I'm generally cautious about niche issues that have an activist-like undertone (not that I’d call myself an activist now). Given this, I wanted to outline my reasoning here. Please note that this was written in 2022, two years ago. Much has changed since, and I’d probably express some points differently today. If you disagree or have questions about anything here, please feel free to reach out!

The movie that cried wolf

Imagine a world similar to ours but with two simple differences: 

  1. There are no diseases. The viruses and bacteria that cause them, by chance or for some other reason, never evolved, or died out after multicellular life developed.

  2. There is a genre of media (movies, books and TV shows) that heavily involves the concept of some suffering being passed on from one human to another. It has become somewhat of a trope. People call it “disease”. There was a famous movie in the 80s in which Arnold Schwarzenegger plays what we would call an epidemiologist. It’s called “The Pandemic” or something.

Now medical companies have started to figure out that they can use small single-celled organisms as a delivery system for their medicine. In fact, they find that these small life forms have many more applications, like digesting waste or producing oxygen and other chemicals through photosynthesis and similar processes.
The development of this bacteria-based technology is accelerating. Scientists come up with bacteria that spread better and better.
Some scientists start thinking about what would happen if at some point we developed, maybe by accident, maybe on purpose, a rapidly spreading bacterium with nasty side effects or even purposefully destructive primary effects. They become quite worried. Some start thinking that such an effect might be intrinsic to what rapidly spreading microorganisms are at their core. They also think that, now that these organisms have been developed, it’s only a matter of time before one evolves into something dangerous and causes a catastrophe. They say that humanity has only a few decades, or even less, to invent something called “hygiene”. Some start to think of weird solutions called “vaccines”. But it’s becoming increasingly clear that this is one of the bigger challenges humanity has ever faced.

Most people react to the scientists’ concerns with a confused and sceptical look and say “you mean like… in the movies?”. Other members of the scientific community ridicule the concerned ones: “Yeah, those guys have watched ‘The Pandemic’ a few times too many.”

How would you try to convince such a world that sickness and plagues aren’t just some cool movie cliche but are actually possible and have a real chance of eradicating the human race if we don’t think of a solution soon?

The actual wolf

This is very similar to the kind of situation AI safety researchers find themselves in. It is very hard to convince people that something as much of a sci-fi trope as the robot apocalypse is actually a really big problem.

So while you read this, do me a favour and try to forget the tropes, the movies and the cliches. While some of them end up being accurate for all the wrong reasons, the central premise that the AI is somehow evil, hates humans or has some sense of superiority is exactly wrong. The kind of AI that experts are worried about is one that does exactly what it is told. And I mean exactly. When you tell your coworker “could you fetch me a coffee, please?” you always mean “could you fetch me a coffee, without killing anyone, please?”.

An AI doesn’t necessarily know this. If you task a sufficiently intelligent and general AI with bringing you coffee at nine in the morning, it has a clear incentive to enforce a totalitarian dictatorship over the world in order to ensure the supply of coffee beans, guarantee the running of the factories, and secure safe passage of the coffee to your desk. If its only way of judging good or bad in the world is how likely it is that the coffee is on your desk every morning at nine am, it won’t even care whether you drink the coffee. Very importantly, it will also realize that it cannot guarantee the supply of coffee to your desk if it’s switched off. So it will prevent being switched off. If the AI thinks that this minimises the chance of interference, it will make humans go extinct and bring a cup of coffee to your desk, every morning at nine am, until the end of time.

This scenario is funny. It’s absurd. But the logic behind it checks out and applies to other tasks as well. So how do we make sure that the AI knows what we mean when we say “bring me a coffee every morning at nine”? How do we make sure that AI is truly aligned with our interests? This problem is aptly named “the alignment problem”.
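
Before moving on, here is a deliberately tiny toy sketch (in Python, with made-up numbers; it describes no real system) of why the logic checks out: an agent whose only measure of good and bad is the probability that the coffee arrives will always rank “prevent the switch-off” above “allow the switch-off”, simply because being switched off drops that probability to zero.

    # Toy illustration of the coffee argument. The probabilities are invented;
    # the point is only how a narrow objective ranks the options.

    def objective(prob_coffee_on_desk_at_nine: float) -> float:
        # The agent's ONLY notion of "good": coffee on the desk at 9 am.
        return prob_coffee_on_desk_at_nine

    strategies = {
        "allow yourself to be switched off": 0.00,  # an agent that is off delivers no coffee
        "prevent the switch-off": 0.99,             # coffee almost guaranteed
    }

    best = max(strategies, key=lambda s: objective(strategies[s]))
    print("Strategy chosen by a purely coffee-maximising agent:", best)
    # Humans never appear in the objective, so what happens to them
    # simply does not enter the comparison.

Nothing about this agent is evil or power-hungry; the problem is purely that its objective is silent about everything else we care about.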

Just tech hype

A common and very understandable source of doubt about the significance of this problem is that it seems unlikely that we would ever make a machine that’s intelligent enough to cause a problem like this. To these people, it sounds like I’m talking about the noise pollution of flying cars or the safety of “Back to the Future” hoverboards.

First of all, I want to say that I have great sympathy for this objection. The world is filled with ridiculous failed promises of technological transformation, from cryptocurrency to flying cars to fusion energy.

AI might be different, though. First, we know it to be possible because of ourselves: the fact that our brains are generally intelligent means that it’s possible for a bit of matter to do what we would expect AGI (artificial general intelligence) to do. Second, because it’s a self-accelerating process. We already have code-writing AIs, and AIs are made using code. We have seen AI surpass human-level capability in almost all the areas in which it had any proficiency just a few years before. It’s quite likely that we could have AI that writes better code than humans within a few years. That AI could then write better AI code, which in turn could write even better code. This self-improvement is called an intelligence explosion or “the singularity”. This is one of those words that I would love for you to associate less with science fiction and more with a real problem.
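
To make the “self-accelerating” point a bit more concrete, here is a crude toy model (every number is invented; only the shape of the curve matters): if each generation of AI improves the next by an amount that scales with its own capability, progress compounds rather than creeping along at a steady pace.

    # Crude toy model of recursive self-improvement. All numbers are arbitrary;
    # the compounding effect is the only point.

    capability = 1.0  # define 1.0 as "human-level" coding ability
    rate = 0.1        # improvement per generation, scaled by current capability

    for generation in range(1, 11):
        # The better the current generation, the bigger the next jump.
        capability *= 1 + rate * capability
        print(f"generation {generation:2d}: capability {capability:5.2f}")

    # With a fixed rate this would be ordinary exponential growth; because the
    # rate itself grows with capability, the curve bends upward even faster.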

And third, because if such an AI is developed it will have a much greater impact than any other technology before it. It is important to understand what the kind of intelligence we are talking about is at its core: the ability to select actions that make one outcome most likely. In other words, it is the ability to solve problems, to affect reality efficiently and effectively (given a way to interface with it).

Intelligence is what differentiates us from other species. Slightly more of it than our closest relatives have makes the difference between digging in the dirt and walking on the moon, between struggling to find enough nuts and berries and simulating the Big Bang in a particle accelerator. The ability to make more of it, even just a little bit, is more consequential, powerful and just plain profound than anything we have ever done.

And sure, maybe we will find a roadblock somewhere. Nobody is saying that that’s impossible. What I am saying is that, since there is nothing suggesting that there is some insurmountable obstacle, we should not rely on one. In short, it’s bad to bet against human innovation at a time when we have gone from a majority of us shitting in a hole to quantum computers, satellites and social media within one hundred years. Especially when we are talking about a problem that may take decades or centuries to solve.

Decades? Centuries?

Yes. Alignment is hard and could take a really, really long time to get right.

It can be split into two questions: “how do we communicate what we want?” and “what do we want?”.
The first question is a technical one. It’s not solved, and that’s bad, but as I just said, we are pretty good at solving technical issues.
The second one is very, very hard. Like “we’ve been thinking about this for millennia”-hard.

Literature is full of cautionary tales about people who get three wishes and end up regretting how they worded them. King Midas’s wish was that everything he touched would turn to gold, and he ended up dead and alone because he touched his friends and relatives and, of course, his food. The important thing here is that he didn’t die because he was deceived or because the wish was fulfilled maliciously, but because he didn’t think carefully, in a very literal way, about what he actually wanted. This is important because the AI will be very, very literal in following our commands.

AI researchers have thought about which commands might work. Anyone who has watched “I, Robot” will have heard of Asimov’s Laws (yet another movie that leads some people to believe that AI safety is just a problem in Hollywood). For most approaches, including Asimov’s laws, it’s very easy to think of how they could go wrong. “Prevent death” might be an instruction that most people would agree with. Well, the AI might put us all in a rubber cell where any possibility of death is minimal. Or it might prevent any procreation, because any new life is just another potential and, over time, inevitable death. Okay, how about “prevent pain, including mental pain”? Well, if we are all killed painlessly without anticipating it, the AI has absolutely maximised the fulfilment of its given goal.

The only “wish” that doesn’t seem to immediately lead to catastrophe is one that the computer scientist Stuart Russell came up with:

The command goes like this: “infer what humans want by observing their actions and then try to maximise the fulfilment of what you think their preferences are”. The idea is that the AI will see our efforts to stay alive, to prevent pain, to procreate and so on, and it will also see you getting a coffee every morning at 9 am. Then it will be able to complete that task for you without ruining the world.
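
As a very rough sketch of what “infer what humans want by observing their actions” could look like (my own toy illustration with made-up numbers, not Russell’s actual formalism): the AI holds several hypotheses about your preferences and shifts weight towards the ones that best explain the behaviour it observes.

    # Toy sketch of preference inference. The hypotheses and numbers are made up
    # for illustration; this is not Stuart Russell's actual formalism.

    # Candidate hypotheses about the human's preferences, with equal priors.
    beliefs = {
        "wants coffee at 9 and wants to stay alive": 1 / 3,
        "wants coffee at 9, indifferent to survival": 1 / 3,
        "does not care about coffee at all": 1 / 3,
    }

    # How likely each observed behaviour is under each hypothesis (invented).
    likelihood = {
        "gets a coffee every morning": {
            "wants coffee at 9 and wants to stay alive": 0.9,
            "wants coffee at 9, indifferent to survival": 0.9,
            "does not care about coffee at all": 0.1,
        },
        "looks both ways before crossing the street": {
            "wants coffee at 9 and wants to stay alive": 0.9,
            "wants coffee at 9, indifferent to survival": 0.2,
            "does not care about coffee at all": 0.5,
        },
    }

    for observation in ["gets a coffee every morning",
                        "looks both ways before crossing the street"]:
        # Bayes' rule: reweight each hypothesis by how well it explains the observation.
        beliefs = {h: p * likelihood[observation][h] for h, p in beliefs.items()}
        total = sum(beliefs.values())
        beliefs = {h: p / total for h, p in beliefs.items()}

    for hypothesis, probability in beliefs.items():
        print(f"{probability:.2f}  {hypothesis}")

    # The AI then acts to satisfy the hypotheses it finds most likely, which is
    # why "keep the human alive" comes along with the coffee rather than against it.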

There are still problems here: for instance, we still rely heavily on the AI getting our preferences right, and there is the question of what happens when the AI starts influencing our preferences. But if that goes well (which isn’t necessarily unrealistic), this could actually be successful, with one really important limit: the AI still doesn’t know how to deal with conflicting or even opposing preferences. This becomes important once we start considering scenarios where a majority of humans hate a minority. Is it a maximisation of the fulfilment of preferences if the AI kills the minority? Not finding a good answer to this “opposing preferences” problem is obviously a huge dealbreaker.

Now there is good news and bad news. The good news is that we have a whole field of philosophy that deals with opposing preferences as one of its central questions. It’s called ethics. The bad news is that even though we’ve had ethics for about 2,500 years, we haven’t really gotten any concrete, objective answers out of that field.

The point I’m trying to get across is that solving the alignment problem involves answering some of the most basic questions that we as a species can ask ourselves. And if you believe the average prediction of AI researchers that AGI will arrive in about 25 years, even if you think they are off by a factor of two or three or ten, you might start to feel the urgency of AI safety.
