| | How Can Interpretability Researchers Help AGI Go Well? (alignmentforum.org) |
| 2 points by gmays 43 days ago | past |
|
| | Embedding Spaces – Transformer Token Vectors Are Not Points in Space (alignmentforum.org) |
| 1 point by ofou 4 months ago | past |
|
| | How to Become a Mechanistic Interpretability Researcher (alignmentforum.org) |
| 2 points by speckx 4 months ago | past |
|
| | LLMs Are Simulators (alignmentforum.org) |
| 1 point by msvana 6 months ago | past |
|
| | Highly Opinionated Advice on How to Write ML Papers (alignmentforum.org) |
| 2 points by jxmorris12 7 months ago | past |
|
| | Catastrophic sabotage as a major threat model for human-level AI systems (alignmentforum.org) |
| 5 points by speckx on Oct 24, 2024 | past |
|
| | Would catching AIs trying to escape convince AI devs to slow down or undeploy? (alignmentforum.org) |
| 2 points by rntn on Aug 27, 2024 | past |
|
| | AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work (alignmentforum.org) |
| 1 point by sebg on Aug 20, 2024 | past |
|
| | Mysteries of mode collapse – AI Alignment Forum (2022) (alignmentforum.org) |
| 1 point by Bluestein on July 17, 2024 | past |
|
| | Opinionated Annotated List of Favourite Mechanistic Interpretability Papers v2 (alignmentforum.org) |
| 2 points by thunderbong on July 9, 2024 | past |
|
| | Transformers Represent Belief State Geometry in Their Residual Stream (alignmentforum.org) |
| 3 points by HR01 on May 1, 2024 | past |
|
| | LLMs for Alignment Research: a safety priority? (alignmentforum.org) |
| 1 point by rntn on April 7, 2024 | past |
|
| | Modern Transformers Are AGI, and Human-Level (alignmentforum.org) |
| 2 points by rntn on April 3, 2024 | past |
|
| | Larger language models may disappoint you [or, an eternally unfinished draft] (alignmentforum.org) |
| 2 points by behnamoh on Jan 11, 2024 | past |
|
| | AGI safety from first principles: Superintelligence (alignmentforum.org) |
| 2 points by warkanlock on Dec 21, 2023 | past |
|
| | Anthropic Fall 2023 Debate Progress Update (alignmentforum.org) |
| 2 points by EvgeniyZh on Nov 29, 2023 | past |
|
| | When do "brains beat brawn" in chess? An experiment (alignmentforum.org) |
| 124 points by andrewljohnson on Nov 21, 2023 | past | 79 comments |
|
| | Critique of some recent philosophy of LLMs' minds (alignmentforum.org) |
| 2 points by behnamoh on Oct 9, 2023 | past |
|
| | Mesa-Optimization (alignmentforum.org) |
| 1 point by reqo on Sept 5, 2023 | past |
|
| | Glitch Tokens (alignmentforum.org) |
| 1 point by peter_d_sherman on June 8, 2023 | past |
|
| | The Unsolved Technical Alignment Problem in LeCun's A Path Towards AGI (alignmentforum.org) |
| 4 points by sandinmyjoints on June 6, 2023 | past |
|
| | A Mechanistic Interpretability Analysis of Grokking (alignmentforum.org) |
| 202 points by famouswaffles on May 30, 2023 | past | 54 comments |
|
| | AI Will Not Want to Self-Improve (alignmentforum.org) |
| 3 points by behnamoh on May 22, 2023 | past | 3 comments |
|
| | GPTs are Predictors, not Imitators or Simulators (alignmentforum.org) |
| 2 points by famouswaffles on April 22, 2023 | past |
|
| | Concrete Open Problems in Mechanistic Interpretability (alignmentforum.org) |
| 1 point by raviparikh on April 11, 2023 | past |
|
| | Imitation Learning from Language Feedback (alignmentforum.org) |
| 2 points by tim_sw on April 5, 2023 | past |
|
| | Othello-GPT Has a Linear Emergent World Representation (alignmentforum.org) |
| 2 points by todsacerdoti on April 1, 2023 | past |
|
| | Glitch Tokens in GPT (SolidGoldMagikarp) (alignmentforum.org) |
| 1 point by gwd on March 8, 2023 | past |
|
| | Data, not size, is the current active constraint on language model performance (alignmentforum.org) |
| 4 points by satvikpendem on Dec 22, 2022 | past |
|
| | Central AI alignment problem: capabilities generalization and sharp left turn (alignmentforum.org) |
| 1 point by kvee on Nov 8, 2022 | past |
|