Constitutional AI: Harmlessness from AI feedback
As AI systems become more capable, we would like to enlist their help to supervise other AIs.
We experiment with methods for training a harmless AI assistant through self-improvement, …
Training a helpful and harmless assistant with reinforcement learning from human feedback
We apply preference modeling and reinforcement learning from human feedback (RLHF) to
finetune language models to act as helpful and harmless assistants. We find this alignment …
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
We describe our early efforts to red team language models in order to simultaneously discover,
measure, and attempt to reduce their potentially harmful outputs. We make three main …
Language models (mostly) know what they know
We study whether language models can evaluate the validity of their own claims and predict
which questions they will be able to answer correctly. We first show that larger models are …
In-context learning and induction heads
"Induction heads" are attention heads that implement a simple algorithm to complete token
sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence …
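The [A][B] ... [A] -> [B] completion pattern described in this abstract can be illustrated with a minimal sketch (the function name and list-based token representation here are illustrative, not from the paper):

```python
def induction_complete(tokens):
    """Toy sketch of the induction pattern [A][B] ... [A] -> [B]:
    find the most recent earlier occurrence of the final token
    and predict the token that followed it."""
    last = tokens[-1]
    # Scan earlier positions (excluding the final token) from right to left.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # no earlier occurrence: no induction-based prediction

# Example: having seen "A B" earlier, a trailing "A" suggests "B" comes next.
print(induction_complete(["A", "B", "C", "A"]))  # prints B
```

An induction head implements this copy-and-complete behavior inside an attention layer; the sketch above only mimics its input-output pattern on explicit token lists.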
A general language assistant as a laboratory for alignment
Given the broad capabilities of large language models, it should be possible to work towards
a general-purpose, text-based assistant that is aligned with human values, meaning that it …
Predictability and surprise in large generative models
Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose,
generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many …
Discovering language model behaviors with model-written evaluations
As language models (LMs) scale, they develop many novel behaviors, good and bad,
exacerbating the need to evaluate how they behave. Prior work creates evaluations with …
The capacity for moral self-correction in large language models
We test the hypothesis that language models trained with reinforcement learning from human
feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful …
Primary Tumor Hypoxia Recruits CD11b+/Ly6Cmed/Ly6G+ Immune Suppressor Cells and Compromises NK Cell Cytotoxicity in the Premetastatic Niche
Hypoxia within a tumor acts as a strong selective pressure that promotes angiogenesis,
invasion, and metastatic spread. In this study, we used immune competent bone marrow …