Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. Looking for new projects.

Sequences

Slowing AI

Comments

Added updates to the post:

Superalignment dissolves.

Leike tweets, including:

I have been disagreeing with OpenAI leadership about the company's core priorities for quite some time, until we finally reached a breaking point.

I believe much more of our bandwidth should be spent getting ready for the next generations of models, on security, monitoring, preparedness, safety, adversarial robustness, (super)alignment, confidentiality, societal impact, and related topics.

These problems are quite hard to get right, and I am concerned we aren't on a trajectory to get there.

Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.

Building smarter-than-human machines is an inherently dangerous endeavor. OpenAI is shouldering an enormous responsibility on behalf of all of humanity.

But over the past years, safety culture and processes have taken a backseat to shiny products.

Daniel Kokotajlo talks to Vox:

“I joined with substantial hope that OpenAI would rise to the occasion and behave more responsibly as they got closer to achieving AGI. It slowly became clear to many of us that this would not happen,” Kokotajlo told me. “I gradually lost trust in OpenAI leadership and their ability to responsibly handle AGI, so I quit.” 

Kelsey Piper says:

I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.

More.

TechCrunch says:

requests for . . . compute were often denied, blocking the [Superalignment] team from doing their work [according to someone on the team].

The commitment—"20% of the compute we've secured to date" (in July 2023), to be used "over the next four years"—may be quite little in 2027, with compute use increasing exponentially. I'm confused about why people think it's a big commitment.
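To illustrate the arithmetic, here is a rough back-of-the-envelope sketch; the 3x/year growth rate and the even spending schedule are my assumptions for illustration, not figures OpenAI has given:

```python
# Back-of-the-envelope illustration with made-up numbers (the 3x/year growth
# rate and even spending schedule are assumptions, not anything OpenAI has said).

secured_2023 = 1.0               # compute secured as of July 2023 (arbitrary unit)
growth_per_year = 3.0            # assumed annual growth in total available compute
committed = 0.20 * secured_2023  # "20% of the compute we've secured to date"
per_year = committed / 4         # spread evenly "over the next four years"

for year in range(1, 5):         # years 1..4 after the commitment (through ~2027)
    total_that_year = secured_2023 * growth_per_year ** year
    share = per_year / total_that_year
    print(f"Year {year}: committed slice ≈ {share:.2%} of that year's compute")

# Under these assumptions the slice shrinks from ~1.7% in year 1 to ~0.06% in
# year 4, which is the sense in which the commitment "may be quite little in 2027."
```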

Full quote:

We’ve evaluated GPT-4o according to our Preparedness Framework and in line with our voluntary commitments. Our evaluations of cybersecurity, CBRN, persuasion, and model autonomy show that GPT-4o does not score above Medium risk in any of these categories. This assessment involved running a suite of automated and human evaluations throughout the model training process. We tested both pre-safety-mitigation and post-safety-mitigation versions of the model, using custom fine-tuning and prompts, to better elicit model capabilities.

GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities. We used these learnings to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.

[Edit after Simeon replied: I disagree with your interpretation that they're being intentionally very deceptive. But I am annoyed by (1) them saying "We’ve evaluated GPT-4o according to our Preparedness Framework" when the PF doesn't contain specific evals and (2) them taking credit for implementing their PF when they're not meeting its commitments.]

How can you make the case that a model is safe to deploy? For now, you can do risk assessment and notice that it doesn't have dangerous capabilities. What about in the future, when models do have dangerous capabilities? Here are four options:

  1. Implement safety measures as a function of risk assessment results, such that the measures feel like they should be sufficient to abate the risks
    1. This is mostly what Anthropic's RSP does (at least so far — maybe it'll change when they define ASL-4)
  2. Use risk assessment techniques that evaluate safety given deployment safety practices
    1. This is mostly what OpenAI's PF is supposed to do (measure "post-mitigation risk"), but the details of their evaluations and mitigations are very unclear
  3. Do control evaluations
  4. Achieve alignment (and get strong evidence of that)

Related: RSPs, safety cases.

Maybe lots of risk comes from the lab using AIs internally to do AI development. The first two options are fine for preventing catastrophic misuse from external deployment but I worry they struggle to measure risks related to scheming and internal deployment.

Safety-wise, they claim to have run it through their Preparedness Framework and red-teaming by external experts.

I'm disappointed, and I think they shouldn't get much credit PF-wise: they haven't published their evals, a report on results, or even a high-level "scorecard." They are not yet meeting the commitments in their beta Preparedness Framework; some details are unclear, but publishing the scorecard, at least, is an explicit commitment.

(It's now been six months since they published the beta PF!)

[Edit: not to say that we should feel much better if OpenAI were successfully implementing its PF; the thresholds are way too high, and it says nothing about internal deployment.]

There should be points for how the organizations act wrt legislation. With the SB 1047 bill that CAIS co-sponsored, we've noticed that some AI companies are much more antagonistic than others. I think [this] is probably a larger differentiator for an organization's goodness or badness.

If there's a good writeup on labs' policy advocacy I'll link to and maybe defer to it.

Adding to the confusion: I've nonpublicly heard from people at UKAISI and [OpenAI or Anthropic] that the Politico piece is very wrong and DeepMind isn't the only lab doing pre-deployment sharing (and that it's hard to say more because info about not-yet-deployed models is secret). But no clarification on commitments.

But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren't the most important. (E.g. in your case.)

I've signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.

I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.

Related: maybe a lab should get full points for a risky release if the lab says it's releasing because the benefits of [informing / scaring / waking-up] people outweigh the direct risk of existential catastrophe and other downsides. It's conceivable that a perfectly responsible lab would do such a thing.

Capturing all nuances can trade off against simplicity and legibility. (But my criteria are not yet on the efficient frontier or whatever.)
