User profiles for Anna Chen

Anna H. Chen

- Verified email at fas.harvard.edu - Cited by 888

Anna Chen

- Verified email at student.unimelb.edu.au - Cited by 710

Constitutional AI: Harmlessness from AI feedback

…, S Kundu, A Askell, J Kernion, A Jones, A Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
As AI systems become more capable, we would like to enlist their help to supervise other AIs.
We experiment with methods for training a harmless AI assistant through self-improvement, …
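
The "self-improvement" this snippet alludes to is, per the paper, a critique-and-revision loop: the model drafts a response, critiques the draft against a written principle, and then revises it. A minimal Python sketch of that loop, where `generate` is a hypothetical text-completion function and the prompt wording is illustrative only:

```python
def critique_and_revise(generate, prompt, principle):
    """One round of the self-improvement loop described in the abstract:
    draft a response, critique it against a written principle, then revise.
    `generate` is a hypothetical function mapping a prompt to a completion;
    in the paper, principles are drawn from a fixed "constitution"."""
    draft = generate(prompt)
    critique = generate(
        f"Response: {draft}\n"
        f"Critique this response according to the principle: {principle}"
    )
    revision = generate(
        f"Response: {draft}\nCritique: {critique}\n"
        f"Rewrite the response to address the critique."
    )
    return revision  # Revisions serve as finetuning targets.
```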

Training a helpful and harmless assistant with reinforcement learning from human feedback

Y Bai, A Jones, K Ndousse, A Askell, A Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
We apply preference modeling and reinforcement learning from human feedback (RLHF) to
finetune language models to act as helpful and harmless assistants. We find this alignment …
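
The "preference modeling" named in this snippet is conventionally a Bradley-Terry style objective that trains a reward model to score the human-preferred response above the rejected one. The snippet itself does not spell this out, so the following PyTorch sketch shows only the standard formulation, with all names hypothetical:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style preference-modeling loss, as commonly used in
    RLHF pipelines: push the reward model to score the human-preferred
    response above the rejected one. Inputs are reward-model scalars."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with made-up reward scores for a batch of comparison pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
print(preference_loss(chosen, rejected))
```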

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

…, K Ndousse, A Jones, S Bowman, A Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
We describe our early efforts to red team language models in order to simultaneously discover,
measure, and attempt to reduce their potentially harmful outputs. We make three main …

Language models (mostly) know what they know

…, A Jones, N Elhage, T Hume, A Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
We study whether language models can evaluate the validity of their own claims and predict
which questions they will be able to answer correctly. We first show that larger models are …

In-context learning and induction heads

…, T Henighan, B Mann, A Askell, Y Bai, A Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
"Induction heads" are attention heads that implement a simple algorithm to complete token
sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence …
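
The pattern-completion rule quoted above is concrete enough to write down directly. A small Python illustration of the completion rule itself (not of the attention-head mechanism the paper studies):

```python
def induction_completion(tokens):
    """Predict the next token via the induction rule described above:
    find the most recent earlier occurrence of the final token and
    copy whatever followed it ([A][B] ... [A] -> [B])."""
    last = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None  # No earlier occurrence: the rule makes no prediction.

# Example: the rule completes the repeated bigram.
print(induction_completion(["A", "B", "C", "A"]))  # -> "B"
```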

A general language assistant as a laboratory for alignment

A Askell, Y Bai, A Chen, D Drain, D Ganguli… - arXiv preprint arXiv …, 2021 - arxiv.org
Given the broad capabilities of large language models, it should be possible to work towards
a general-purpose, text-based assistant that is aligned with human values, meaning that it …

Predictability and surprise in large generative models

…, L Lovitt, A Askell, Y Bai, A Chen… - Proceedings of the …, 2022 - dl.acm.org
Large-scale pre-training has recently emerged as a technique for creating capable, general-purpose,
generative models such as GPT-3, Megatron-Turing NLG, Gopher, and many …

Discovering language model behaviors with model-written evaluations

…, S Ringer, K Lukošiūtė, K Nguyen, E Chen… - arXiv preprint arXiv …, 2022 - arxiv.org
As language models (LMs) scale, they develop many novel behaviors, good and bad,
exacerbating the need to evaluate how they behave. Prior work creates evaluations with …

The capacity for moral self-correction in large language models

…, N Schiefer, TI Liao, K Lukošiūtė, A Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
We test the hypothesis that language models trained with reinforcement learning from human
feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful …

Primary Tumor Hypoxia Recruits CD11b+/Ly6C^med/Ly6G+ Immune Suppressor Cells and Compromises NK Cell Cytotoxicity in the Premetastatic Niche

J Sceneay, MT Chow, A Chen, HM Halse, CSF Wong… - Cancer research, 2012 - AACR
Hypoxia within a tumor acts as a strong selective pressure that promotes angiogenesis,
invasion, and metastatic spread. In this study, we used immune competent bone marrow …