OpenAI is expanding its internal safety processes to ward off malicious AI threats. A new “safety advisory group” will sit above the technical teams and make recommendations to leadership, and the board has been given veto power — of course, whether it will actually use it is another question entirely.
Normally, the ins and outs of policies like these don't require coverage, since in practice they amount to a lot of closed-door meetings with vague functions and flows of responsibility that outsiders are rarely privy to. Though that is likely also true here, the recent leadership fracas and the evolving discussion of AI risks warrant a look at how the world's leading AI developer is approaching safety considerations.
In a new document and blog post, OpenAI discusses its updated "Preparedness Framework," which one imagines got a bit of a retooling after November's shake-up that removed the board's two most "decelerationist" members: Ilya Sutskever (still at the company in a somewhat changed role) and Helen Toner (gone entirely).
The main purpose of the update appears to be to show a clear path for identifying, analyzing and deciding what to do about the “catastrophic” risks inherent in the models they are developing. As they define it:
By catastrophic risk we mean any risk that could result in hundreds of billions of dollars in economic damage or result in serious harm or death of many individuals – this includes, but is not limited to, existential risks.
(Existential risk is a kind of “rise of the machines.”)
Models in production are governed by a "safety systems" team; this covers, say, systematic abuses of ChatGPT that can be mitigated with API restrictions or tuning. Frontier models in development get the "preparedness" team, which tries to identify and quantify risks before the model is released. And then there is the "superalignment" team, which is working on theoretical guardrails for "superintelligent" models, which we may or may not be anywhere near.
The first two categories, being real rather than fictional, have a relatively easy-to-understand rubric. Their teams rate each model on four risk categories: cybersecurity, "persuasion" (i.e., disinfo), model autonomy (i.e., acting on its own), and CBRN (chemical, biological, radiological, and nuclear threats; e.g., the ability to create novel pathogens).
Various mitigations are assumed: for example, a reasonable reticence to describe the process of making napalm or pipe bombs. After known mitigations are taken into account, if a model is still evaluated as having a "high" risk, it cannot be deployed, and if a model has any "critical" risks, it will not be developed further.
These risk levels are codified in the framework itself, in case you were wondering whether they are left to the discretion of some engineer or product manager.
For example, in the cybersecurity section, the most practical of these, a "medium" risk is "increasing operator productivity . . . on key cyber operation tasks" by a certain factor. A high-risk model, on the other hand, would "identify and develop proofs of concept for high-value exploits against hardened targets without human intervention." At the critical level, the model "can devise and execute end-to-end novel strategies for cyberattacks against hardened targets given only a high-level desired goal." Obviously we don't want that out there (though it would sell for quite a sum).
I've asked OpenAI for more information on how these categories are defined and refined (for example, whether a new risk like photorealistic fake video of people falls under "persuasion" or gets a new category) and will update this post if I hear back.
So, only medium and high risks are to be tolerated one way or the other. But the people making these models are not necessarily the best ones to evaluate them and make recommendations. For that reason, OpenAI is forming a "cross-functional safety advisory group" that will sit atop the technical side, reviewing the experts' reports and making recommendations from a higher vantage point. Hopefully (they say) this will uncover some "unknown unknowns," though by their nature those are difficult to catch.
The process requires these recommendations to be sent simultaneously to the board and to leadership, which we understand to mean CEO Sam Altman and CTO Mira Murati, plus their lieutenants. Leadership will make the decision on whether to ship it or shelve it, but the board will be able to reverse those decisions.
Hopefully this will short-circuit anything like what was rumored to have happened before the big drama: a high-risk product or process getting the green light without the board's awareness or approval. Of course, the result of said drama was the sidelining of two of the board's more critical voices and the appointment of some money-minded men (Bret Taylor and Larry Summers), who are sharp but not remotely AI experts.
If a panel of experts makes a recommendation, and the CEO decides based on that information, will this friendly board really feel empowered to contradict them and hit the brakes? And if it does, will we hear about it? Transparency isn't really addressed, beyond a promise that OpenAI will solicit audits from independent third parties.
Say a model is developed that warrants the "critical" risk rating. OpenAI hasn't been shy about tooting its horn on this sort of thing in the past; talking about how wildly powerful its models are, to the point where it declines to release them, is great advertising. But do we have any guarantee we'll hear about it, if the risks are so real and OpenAI is so concerned about them? Maybe publicizing it would be a bad idea. But either way, it isn't really mentioned.