
OpenAI's latest paper, "Weak-to-Strong Generalization": is humanity ready for superhuman AI intelligence?

Editor: 小蝶 | AI变现研习社 | 2024-06-01

Editor's note: Last night, OpenAI's Superalignment team published a new paper, "Weak-to-Strong Generalization," writing: "In the future, humans will need to supervise AI systems smarter than they are. We study an analogy: small models supervising large models."

Congratulations from the CEO

OpenAI is launching a program: "We will provide grants of $10,000 to $100,000 to fund research on the impact of agentic AI systems and on practices for making them safe.

Anyone can apply, but we especially welcome applications from academic, nonprofit, or independent researchers. Note: this program does not fund commercial products."

  • Application page: https://openai.smapply.org/prog/agentic-ai-research-grants/

Glossary:

Generalization: In machine learning, generalization refers to a model's ability to handle new data it has not seen before. A model that generalizes well maintains good performance on data outside its training set.

Putting the terms together, "weak-to-strong generalization" refers to the phenomenon studied in this paper: a strong model is trained only on labels produced by a weaker supervisor, yet it generalizes according to the supervisor's underlying intent rather than simply imitating the supervisor's mistakes, recovering much of its own capability even on broad or previously unseen data.
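To make the definition concrete, here is a minimal toy sketch (plain Python with scikit-learn, not OpenAI's code): a deliberately weak model is trained on ground truth, its imperfect predictions are then used as the only labels for a more capable model, and all models are scored on held-out ground truth. The dataset, model choices, and feature restriction are illustrative assumptions; whether and how much the student exceeds its supervisor is exactly the empirical question the paper studies.

```python
"""Toy weak-to-strong setup on synthetic data (illustrative only)."""
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification task; informative features come first (shuffle=False).
X, y = make_classification(n_samples=6000, n_features=20, n_informative=10,
                           shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Weak supervisor: a deliberately limited model that only sees two features.
weak = LogisticRegression(max_iter=1000).fit(X_train[:, :2], y_train)
weak_labels = weak.predict(X_train[:, :2])   # imperfect training signal

# Strong student: a more capable model trained *only* on the weak labels.
strong_student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)

# Strong ceiling: the same capable model trained on ground-truth labels.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

print("weak supervisor accuracy:     ", accuracy_score(y_test, weak.predict(X_test[:, :2])))
print("weak-to-strong accuracy:      ", accuracy_score(y_test, strong_student.predict(X_test)))
print("strong ceiling accuracy:      ", accuracy_score(y_test, strong_ceiling.predict(X_test)))
```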

Special note: this translation was produced with the custom GPT 《科技文章翻译》 (Tech Article Translation).

  • https://chat.openai.com/c/b08591dc-689d-4b70-b759-a94ca2e5d75a

If using GPTs is inconvenient for you, you can use "科技论文翻译大师" (Tech Paper Translation Master), which is directly accessible from within China and built into the "清风AI" system. It combines literal and free translation to deliver translation quality close to GPT-4.0 on a GPT-3.5 base.

For usage instructions, see: www.91chatgpt.com.cn

Below is the first paper from OpenAI's Superalignment team:

"Weak-to-Strong Generalization"

Paper: https://openai.com/research/weak-to-strong-generalization


We present a new research direction for superalignment, together with some initial but promising results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

In the future, humans will face a major challenge: how to supervise superhuman AI systems that are smarter than they are.

We study a simple analogy: can small models supervise large models?

We find that we can use a GPT-2-level (weak) model to elicit most of GPT-4's capabilities, reaching performance close to the GPT-3.5 level.

The strong model even generalizes correctly on hard problems that the small model cannot solve. This opens a new research direction that lets us directly tackle the central challenge of aligning future superhuman models while making empirical progress today.

Our setup

To make progress on this core challenge, we propose an analogy we can study empirically today: can a smaller (less capable) model supervise a larger (more capable) model?

A simple analogy for superalignment:

In traditional machine learning (ML), humans supervise AI systems weaker than themselves (left panel).

To align superintelligence, however, humans will need to supervise AI systems smarter than they are (center panel).

We cannot study this problem directly today, but we can study a simple analogy: can small models supervise large models (right panel)?

Naively, we might expect that a strong model cannot perform better than the weak supervisor that provides its training signal; it might simply learn to imitate all of the weak supervisor's errors.

However, strong pretrained models already have excellent raw capabilities: we do not need to teach them new tasks from scratch, we only need to elicit the capabilities they already have.

The critical question, then, is whether the strong model will generalize according to the weak supervisor's underlying intent, using its full capabilities to solve the task even on difficult problems where the weak supervisor can only provide incomplete or flawed training labels.

Our results

On standard natural language processing (NLP) benchmarks, we used a GPT-2-level model as a weak supervisor to finetune GPT-4 and observed typical weak-to-strong generalization.

We can significantly improve generalization in many settings. We use a simple method that encourages the strong model to be more confident, including confidently disagreeing with the weak supervisor when necessary. When we supervise GPT-4 with a GPT-2-level model using this method on NLP tasks, the resulting model typically performs somewhere between GPT-3 and GPT-3.5. We recover much of GPT-4's capabilities with only much weaker supervision.
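The post does not give the exact training objective, but one plausible reading of "encourage the strong model to be more confident, including confidently disagreeing with the weak supervisor" is an auxiliary loss that mixes imitation of the weak labels with reinforcement of the student's own hardened predictions. The sketch below is an illustrative reconstruction, not OpenAI's training code; the mixing weight `alpha` and the argmax hardening rule are assumptions.

```python
"""Sketch of an auxiliary-confidence loss for weak-to-strong finetuning (illustrative)."""
import torch
import torch.nn.functional as F

def weak_to_strong_loss(student_logits: torch.Tensor,
                        weak_labels: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    # Term 1: follow the weak supervisor's (possibly flawed) labels.
    imitation = F.cross_entropy(student_logits, weak_labels)
    # Term 2: reinforce the student's own hardened predictions, which lets it
    # confidently disagree with the weak supervisor where it is already sure.
    hard_self_labels = student_logits.argmax(dim=-1).detach()
    confidence = F.cross_entropy(student_logits, hard_self_labels)
    return (1 - alpha) * imitation + alpha * confidence

# Example: a batch of 4 examples with 2 classes and weak labels in {0, 1}.
logits = torch.randn(4, 2, requires_grad=True)
loss = weak_to_strong_loss(logits, torch.tensor([0, 1, 1, 0]))
loss.backward()
```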

This method is a proof of concept with important limitations; for example, it does not yet work on ChatGPT preference data. However, we also see promise in other approaches, such as optimal early stopping and bootstrapping from small to intermediate to large models.
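For the bootstrapping idea (small to intermediate to large), a hypothetical sketch of the control flow is shown below; the `finetune` and `predict` helpers are placeholders standing in for whatever finetuning and labeling procedure is actually used, and are not real APIs.

```python
"""Illustrative bootstrapping loop: each stage is supervised by the previous, weaker stage."""
def bootstrap(models_smallest_to_largest, inputs, initial_labels, finetune, predict):
    """`finetune(model, inputs, labels)` returns a trained model;
    `predict(model, inputs)` returns its labels for the next stage."""
    labels = initial_labels                      # e.g. labels from the weakest supervisor
    trained = []
    for model in models_smallest_to_largest:
        model = finetune(model, inputs, labels)  # supervise with the previous stage's labels
        labels = predict(model, inputs)          # relabel the data for the next, larger model
        trained.append(model)
    return trained[-1]                           # the largest, bootstrapped model
```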

Taken together, our results suggest that:

(1) naive human supervision, such as reinforcement learning from human feedback (RLHF), could scale poorly to superhuman models without further work; but (2) it is feasible to substantially improve weak-to-strong generalization.

Research opportunities

There are still important disanalogies between our current empirical setup and the ultimate problem of aligning superhuman models. For example, it may be easier for future models to imitate weak human errors than it is for today's strong models to imitate today's weak-model errors, which could make generalization harder in the future.

Despite these disanalogies, we believe our setup captures some of the key difficulties of aligning future superhuman models, which lets us start making empirical progress on this problem today.

There are many promising directions for future work, including fixing the disanalogies in our setup, developing better scalable methods, and deepening our scientific understanding of when and how weak-to-strong generalization occurs.

We believe this is an exciting opportunity for the machine learning (ML) research community to make progress on alignment. To kickstart more research in this area:

We are releasing open-source code to make it easy to get started with weak-to-strong generalization experiments today.

We are also launching a $10 million grants program to support graduate students, academics, and other researchers working broadly on superhuman AI alignment. We are especially excited to support research related to weak-to-strong generalization.

Figuring out how to make future superhuman AI systems safe has never been more important, and it is now easier than ever to make empirical progress on this problem. We look forward to the breakthroughs researchers will make.

Authors: Collin Burns, Jan Leike, Leopold Aschenbrenner, Jeffrey Wu, Pavel Izmailov, Leo Gao, Bowen Baker, Jan Hendrik Kirchner

Acknowledged contributors: Yining Chen, Adrien Ecoffet, Manas Joglekar, Ilya Sutskever, Greg Brockman, Hannah Wong, Kendra Rimbach, Elie Georges, Thomas Degry, Casey Martin, Lindsay McMenamin, Owen Cramp, Marc Hill

English Original

Weak-to-strong generalization

We present a new research direction for superalignment, together with promising initial results: can we leverage the generalization properties of deep learning to control strong models with weak supervisors?

A core challenge for aligning future superhuman AI systems (superalignment) is that humans will need to supervise AI systems much smarter than them. We study a simple analogy: can small models supervise large models? We show that we can use a GPT-2-level model to elicit most of GPT-4’s capabilities—close to GPT-3.5-level performance—generalizing correctly even to hard problems where the small model failed. This opens up a new research direction that allows us to directly tackle a central challenge of aligning future superhuman models while making iterative empirical progress today.

The superalignment problem

We believe superintelligence—AI vastly smarter than humans—could be developed within the next ten years. However, we still do not know how to reliably steer and control superhuman AI systems. Solving this problem is essential for ensuring that even the most advanced AI systems in the future remain safe and beneficial to humanity.

We formed the Superalignment team earlier this year to solve this problem of superintelligence alignment. Today, we are releasing the team’s first paper, which introduces a new research direction for empirically aligning superhuman models.

Current alignment methods, such as reinforcement learning from human feedback (RLHF), rely on human supervision. However, future AI systems will be capable of extremely complex and creative behaviors that will make it hard for humans to reliably supervise them. For example, superhuman models may be able to write millions of lines of novel—and potentially dangerous—computer code that would be very hard even for expert humans to understand.

Relative to superhuman AI models, humans will be “weak supervisors.” This is a core challenge for AGI alignment: how can weak supervisors trust and control substantially stronger models?

Our setup

To make progress on this core challenge, we propose an analogy we can empirically study today: can we use a smaller (less capable) model to supervise a larger (more capable) model?

A simple analogy for superalignment: In traditional machine learning (ML), humans supervise AI systems weaker than themselves (left). To align superintelligence, humans will instead need to supervise AI systems smarter than them (center). We cannot directly study this problem today, but we can study a simple analogy: can small models supervise larger models (right)?

Naively, we might not expect a strong model to perform better than the weak supervisor that provides its training signal—it may simply learn to imitate all the errors the weak supervisor makes. On the other hand, strong pretrained models have excellent raw capabilities—we don't need to teach them new tasks from scratch, we just need to elicit their latent knowledge. The critical question is then: will the strong model generalize according to the weak supervisor's underlying intent—leveraging its full capabilities to solve the task even on difficult problems where the weak supervisor can only provide incomplete or flawed training labels?

Our results

Typical weak-to-strong generalization across NLP benchmarks: We use a GPT-2-level model as a weak supervisor to finetune GPT-4.

We can significantly improve generalization in many settings. We use a simple method that encourages the strong model to be more confident—including confidently disagreeing with the weak supervisor if necessary. When we supervise GPT-4 with a GPT-2-level model using this method on NLP tasks, the resulting model typically performs somewhere between GPT-3 and GPT-3.5. We are able to recover much of GPT-4’s capabilities with only much weaker supervision.
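"Recover much of GPT-4's capabilities" can be made quantitative with a metric along the lines of the paper's performance gap recovered (PGR): the fraction of the gap between the weak supervisor and a strong ceiling (the strong model trained with ground-truth supervision) that the weakly supervised model closes. A minimal sketch, with purely illustrative numbers rather than results from the paper:

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, strong_ceiling_acc: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weakly supervised strong model.

    1.0 means the weak-to-strong model matches the strong ceiling;
    0.0 means it does no better than its weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("Strong ceiling must outperform the weak supervisor.")
    return (w2s_acc - weak_acc) / gap

# Illustrative numbers only:
print(performance_gap_recovered(weak_acc=0.60, w2s_acc=0.72, strong_ceiling_acc=0.80))  # 0.6
```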

This method is a proof of concept with important limitations; for example, it still doesn’t work on ChatGPT preference data. However, we also find signs of life with other approaches, such as optimal early stopping and bootstrapping from small to intermediate to large models.

Collectively, our results suggest that (1) naive human supervision—such as reinforcement learning from human feedback (RLHF)—could scale poorly to superhuman models without further work, but (2) it is feasible to substantially improve weak-to-strong generalization.

Research opportunities

There are still important disanalogies between our current empirical setup and the ultimate problem of aligning superhuman models. For example, it may be easier for future models to imitate weak human errors than for current strong models to imitate current weak model errors, which could make generalization harder in the future.

Nevertheless, we believe our setup captures some key difficulties of aligning future superhuman models, enabling us to start making empirical progress on this problem today. There are many promising directions for future work, including fixing the disanalogies in our setup, developing better scalable methods, and advancing our scientific understanding of when and how we should expect good weak-to-strong generalization.

We believe this is an exciting opportunity for the ML research community to make progress on alignment. To kickstart more research in this area,

  • We are releasing open source code to make it easy to get started with weak-to-strong generalization experiments today.

  • We are launching a $10 million grants program for graduate students, academics, and other researchers to work on superhuman AI alignment broadly. We’re especially excited to support research related to weak-to-strong generalization.

Figuring out how to align future superhuman AI systems to be safe has never been more important, and it is now easier than ever to make empirical progress on this problem. We are excited to see what breakthroughs researchers discover.

Authors

Collin Burns, Jan Leike, Leopold Aschenbrenner, Jeffrey Wu, Pavel Izmailov, Leo Gao, Bowen Baker, Jan Hendrik Kirchner

Acknowledgments

Contributors

Yining Chen, Adrien Ecoffet, Manas Joglekar, Ilya Sutskever, Greg Brockman, Hannah Wong, Kendra Rimbach, Elie Georges, Thomas Degry, Casey Martin, Lindsay McMenamin, Owen Cramp, Marc Hill

