Navigating the Era of Advanced AI: OpenAI’s Pursuit of Controllable Superhuman Intelligence

In the wake of Sam Altman’s abrupt ouster from OpenAI, a storm brewed among investors while Altman plotted his return to the company. Meanwhile, inside OpenAI, the members of the Superalignment team kept plugging away at the problem of how to control AI that’s smarter than humans.

That’s the impression they aim to convey.

Recently, I spoke with three members of the Superalignment team (Collin Burns, Pavel Izmailov and Leopold Aschenbrenner) who were in New Orleans for NeurIPS, the annual machine learning conference, to present OpenAI’s newest work on ensuring that AI systems behave as intended.

OpenAI formed the Superalignment team in July to develop ways to steer, regulate and govern “superintelligent” AI systems — that is, theoretical systems with intelligence far exceeding that of humans.

“Today, we can basically align models that are dumber than us, or maybe around human-level at most,” Burns said. “Aligning a model that’s actually smarter than us is much, much less obvious — how can we even do it?”

The Superalignment team is led by Ilya Sutskever, OpenAI’s co-founder and chief scientist. That arrangement drew little attention in July, but it does now, given that Sutskever was among those who pushed for Sam Altman’s dismissal. While some reporting suggests Sutskever is in an uncertain position following Altman’s reinstatement, OpenAI’s public relations team tells me that, as of today, he still heads the Superalignment team’s work.

Superalignment is a bit of a touchy subject within the AI research community. Some argue that the subfield is premature; others imply that it’s a red herring.

Altman has drawn comparisons between OpenAI and the Manhattan Project, going so far as to assemble a team to probe AI models for protection against “catastrophic risks” such as chemical and nuclear threats. But some experts are skeptical that the startup’s technology will gain world-ending, or even human-surpassing, capabilities anytime soon, or ever. These experts argue that claims of imminent superintelligence serve mainly to draw attention away from pressing present-day AI regulatory concerns, like algorithmic bias and AI’s tendency to generate toxic output.

Interestingly, Sutskever seems genuinely convinced that AI, and not OpenAI’s exclusively, could one day pose an existential threat. He reportedly went so far as to commission and burn a wooden effigy at a company offsite to underscore his commitment to preventing AI-related harm to humanity, and he wields substantial influence within OpenAI: 20% of the organization’s currently available computer chips are allocated to the Superalignment team’s research.

“AI progress recently has been extraordinarily rapid, and I can assure you that it’s not slowing down,” Aschenbrenner said. “I think we’re going to reach human-level systems pretty soon, but it won’t stop there — we’re going to go right through to superhuman systems … So how do we align superhuman AI systems and make them safe? It’s really a problem for all of humanity — perhaps the most important unsolved technical problem of our time.”

The Superalignment team is currently attempting to build governance and control frameworks that might apply well to future powerful AI systems. It’s not a straightforward task, considering that the definition of “superintelligence” — and whether a particular AI system has achieved it — is the subject of robust debate. But the approach the team has settled on for now involves using a weaker, less sophisticated AI model (e.g. GPT-2) to guide a more advanced, sophisticated model (GPT-4) in desirable directions — and away from undesirable ones.


“A lot of what we’re trying to do is tell a model what to do and ensure it will do it,” Burns said. “How do we get a model to follow instructions and get a model to only help with things that are true and not make stuff up? How do we get a model to tell us if the code it generated is safe or egregious behavior? These are the types of tasks we want to be able to achieve with our research.”

But wait, you might say — what does AI guiding AI have to do with preventing humanity-threatening AI? Well, it’s an analogy: the weak model is meant to be a stand-in for human supervisors while the strong model represents superintelligent AI. Similar to humans who might not be able to make sense of a superintelligent AI system, the weak model can’t “understand” all the complexities and nuances of the strong model — making the setup useful for proving out superalignment hypotheses, the Superalignment team says.

“Consider a scenario where a sixth-grade student attempts to guide a college student,” Izmailov said. “Imagine the sixth grader trying to explain a task they have a basic understanding of to the college student. Despite potential mistakes in the details provided by the sixth grader, there remains an expectation that the college student comprehends the core idea and executes the task more proficiently than the supervisor.”

In the Superalignment team’s setup, a weak model fine-tuned on a particular task generates labels that are used to “communicate” the broad strokes of that task to a strong model. Trained on those labels, the strong model can generalize more or less correctly in line with the weak model’s intent, and this holds even when the weak model’s labels contain errors and biases, the team found.
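To make that setup concrete, here is a minimal, hypothetical sketch of the weak-to-strong training loop, using small scikit-learn models as stand-ins for a GPT-2-class supervisor and a GPT-4-class student. The synthetic task, model choices and hyperparameters are illustrative assumptions, not OpenAI’s actual pipeline or code.

```python
# Hypothetical toy sketch of the weak-to-strong setup described above.
# A low-capacity "weak supervisor" is trained on a small slice of ground
# truth; its noisy predictions then serve as the only labels the
# high-capacity "strong student" ever sees. Models, data and
# hyperparameters are stand-ins, not OpenAI's pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A synthetic task that a linear model can only partially solve.
X, y = make_classification(n_samples=6000, n_features=20, n_informative=15,
                           flip_y=0.02, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, train_size=500,
                                                random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X_rest, y_rest,
                                                  test_size=0.5, random_state=0)

# 1) Weak supervisor: limited capacity, trained on a little ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_sup, y_sup)

# 2) The weak model labels a larger unlabeled pool -- the "communication" step.
weak_labels = weak.predict(X_pool)

# 3) Strong student: far more capacity, but trained only on the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                       random_state=0).fit(X_pool, weak_labels)

# The question of interest: does the student just imitate its supervisor's
# mistakes, or does it generalize beyond them on held-out ground truth?
print("weak supervisor accuracy:", accuracy_score(y_test, weak.predict(X_test)))
print("strong student accuracy: ", accuracy_score(y_test, strong.predict(X_test)))
```

In OpenAI’s experiments the analogous comparison involves language models rather than toy classifiers; the sketch only mirrors the shape of the setup, where the strong student never sees ground truth, only its weaker supervisor’s imperfect labels.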

The weak-strong model approach might even lead to breakthroughs in the area of hallucinations, claims the team.

“Hallucinations are actually quite interesting, because internally, the model actually knows whether the thing it’s saying is fact or fiction,” Aschenbrenner said. “But the way these models are trained today, human supervisors reward them ‘thumbs up,’ ‘thumbs down’ for saying things. So sometimes, inadvertently, humans reward the model for saying things that are either false or that the model doesn’t actually know about and so on. If we’re successful in our research, we should develop techniques where we can basically summon the model’s knowledge and we could apply that summoning on whether something is fact or fiction and use this to reduce hallucinations.”
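For intuition about what “summoning the model’s knowledge” could look like in practice, here is a small, hypothetical probing sketch: a linear probe trained on GPT-2 hidden states to separate true statements from false ones. This is a generic probing recipe rather than the technique OpenAI described, and the model choice, toy statements and probe are assumptions made purely for illustration.

```python
# Hypothetical probing sketch: train a linear probe on GPT-2's hidden
# activations to separate true statements from false ones. A generic
# recipe for "reading out" internal knowledge, not OpenAI's method;
# the model, statements and probe are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def last_token_activation(text: str) -> torch.Tensor:
    """Last-layer hidden state of the final token of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_size,)

# Tiny hand-written toy dataset of factual (1) and false (0) statements.
statements = [
    ("The capital of France is Paris.", 1),
    ("The capital of France is Madrid.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 10 degrees Celsius at sea level.", 0),
    ("The Earth orbits the Sun.", 1),
    ("The Sun orbits the Earth.", 0),
]
X = torch.stack([last_token_activation(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

# If a simple linear probe can separate fact from fiction, the distinction
# is at least partly encoded in the activations -- the premise behind
# reducing hallucinations by reading out that internal signal.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its own toy training set:", probe.score(X, y))
```

The toy dataset is far too small to prove anything; it only illustrates the shape of the idea Aschenbrenner gestures at, namely that an internal fact-versus-fiction signal might exist and be readable.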

But the analogy isn’t perfect. So OpenAI wants to crowdsource ideas.

In pursuit of this objective, OpenAI has announced a $10 million grant program to support technical research on superintelligent alignment, with portions set aside for academic labs, nonprofit organizations, individual researchers and graduate students. OpenAI also intends to host an academic conference on superalignment in early 2025, where it will share and promote the work of the superalignment prize finalists.

Interestingly, part of the grant funding comes from former Google CEO and chairman Eric Schmidt. Schmidt, a staunch Altman supporter, has become increasingly emblematic of AI doomsaying, warning about the imminent arrival of hazardous AI systems and criticizing regulators for insufficient preparedness. The support isn’t necessarily altruistic: as reports from Protocol and Wired have noted, Schmidt, an active AI investor, stands to gain commercially if the U.S. government were to adopt his proposed blueprint for bolstering AI research.

Viewed cynically, the donation might be seen as an act of virtue signaling. Schmidt’s personal fortune is estimated at approximately $24 billion, and he has already poured hundreds of millions into AI ventures and funds, including his own, many of which are notably less oriented toward an ethics-centric approach to artificial intelligence.

Schmidt denies this is the case, of course.

“AI and other emerging technologies are reshaping our economy and society,” he said in an emailed statement. “Ensuring they are aligned with human values is critical, and I am proud to support OpenAI’s new [grants] to develop and control AI responsibly for public benefit.”

Indeed, the involvement of a figure with such transparent commercial motivations raises the question: will OpenAI’s superalignment research, as well as the research it’s encouraging the community to submit to its future conference, be made available for anyone to use as they see fit?

The Superalignment team assured me that it will: both OpenAI’s research, code included, and the work of those who receive grants and prizes from OpenAI for superalignment-related work will be shared publicly. We’ll be holding the company to that promise.

Leopold Aschenbrenner emphasized, “Our mission extends beyond ensuring the safety of our models alone; it encompasses enhancing the safety of models developed by other laboratories and advancing AI in its entirety. This commitment lies at the heart of our goal to construct AI for the collective benefit of humanity, prioritizing safety above all. We firmly believe that conducting this research is pivotal for ensuring both its beneficial and safe integration.”