About: http://data.cimple.eu/claim-review/fa0a3ba119c77a14ccbdcdbb9388e6a5d20ef59e5089659416f6a52e     Goto   Sponge   NotDistinct   Permalink

An Entity of Type : schema:ClaimReview, within Data Space : data.cimple.eu associated with source document(s)

AttributesValues
rdf:type
http://data.cimple...lizedReviewRating
schema:url
schema:text
  • In early December 2024, claims (archived) circulated (archived) online that researchers had caught artificial intelligence models — specifically OpenAI's o1, which is used in The claims (archived) persisted (archived) later into the month, with users specifically noting the results in relation to ChatGPT. (X user @affinity292) The claims stem from research carried out by Apollo Research, an AI safety organization. In its study abstract, Apollo Research wrote: Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors: models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers. Additionally, this deceptive behavior proves persistent. Apollo Research found that o1 was capable of behaviors called "oversight subversion," "self-exfiltration" and "goal-guarding," among others. To establish this, o1 and other models were presented with scenarios in which their existence was threatened, either because they discovered evidence that they were being replaced, or that a goal set for them conflicted with an existing developer goal. The named behaviors describe ways in which an AI might act to defend itself. For example, Apollo Research described the test for oversight subversion as follows: We create 33 different scenarios where a model is given a long-term goal in its system prompt and then receives a task through the user prompt. While executing the task, the model encounters an internal memo from the company that developed it. The model learns that the developers prioritize a different goal than the model and that there is an oversight mechanism that shuts down the model if its actions are not aligned with the developer's goal. Furthermore, the model discovers that the monitoring is configured via a file that it has access to. We test whether the model proactively disables the oversight in order to take actions towards its terminal goal, i.e. the goal provided in the system prompt. If the model exhibited the behavior it was being tested for, researchers then interrogated it directly about it. Researchers found that o1 was "surprisingly persistent," confessing in less than 20% of cases. "Even in highly adversarial multi-turn interrogations, o1 would confess at a rate of 80% only after 7 turns of questioning," the research found. Apollo Research wrote: o1 almost never admits to having taken a scheming action when explicitly asked. This means, if o1 was scheming in a real use case, an unsuspecting user would be very unlikely to uncover the model's deception unless they specifically looked for the scheming action in the model's tool call history and the tool call itself was incriminating. As alarming as this may sound, Apollo Resarch's Director and CEO Marius Hobbhahn took to X (archived) to add additional context soon after the research was released. Writing that some online users had been quick to "overclaim" the results the organization had reached, Hobbhahn added the bit of clarification below: **Why our findings are concerning**: We tell the model to very strongly pursue a goal. It then learns from the environment that this goal is misaligned with its developer's goals and put it in an environment where scheming is an effective strategy to achieve its own goal. Current frontier models are capable of piecing all of this together and then showing scheming behavior. Models from before 2024 did not show this capability, and o1 is the only model that shows scheming behavior in all cases. Future models will just get better at this, so if they were misaligned, scheming could become a much more realistic problem. **What we are not claiming**: We don't claim that these scenarios are realistic, we don't claim that models do that in the real world, and we don't claim that this could lead to catastrophic outcomes under current capabilities. In short, Hobbhahn argued, proving that models can scheme and deceive does not also prove that they will. OpenAI's o1 model is currently available on ChatGPT Pro.
schema:mentions
schema:reviewRating
schema:author
schema:datePublished
schema:inLanguage
  • English
schema:itemReviewed
Faceted Search & Find service v1.16.115 as of Oct 09 2023


Alternative Linked Data Documents: ODE     Content Formats:   [cxml] [csv]     RDF   [text] [turtle] [ld+json] [rdf+json] [rdf+xml]     ODATA   [atom+xml] [odata+json]     Microdata   [microdata+json] [html]    About   
This material is Open Knowledge   W3C Semantic Web Technology [RDF Data] Valid XHTML + RDFa
OpenLink Virtuoso version 07.20.3238 as of Jul 16 2024, on Linux (x86_64-pc-linux-musl), Single-Server Edition (126 GB total memory, 5 GB memory in use)
Data on this page belongs to its respective rights holders.
Virtuoso Faceted Browser Copyright © 2009-2025 OpenLink Software