Episodes
-
Feedback is essential for learning. Whether you’re studying for a test, trying to improve at work, or working to master a difficult skill, you need feedback.
The challenge is that feedback can often be hard to get. Worse, if you get bad feedback, you may end up worse off than before.
Original text:
https://www.scotthyoung.com/blog/2019/01/24/how-to-get-feedback/
Author:
Scott Young
A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website. -
I’ve been obsessed with managing information and communication in a remote team since Get on Board started growing. Reducing the bus factor is a primary motivation — but another just as important is diminishing reliance on synchronicity. When what I know is documented and accessible to others, I’m less likely to be a bottleneck for anyone else in the team. So if I’m busy, minding family matters, on vacation, or sick, I won’t be blocking anyone.
This, in turn, gives everyone in the team the freedom to build their own work schedules according to their needs, work from any time zone, or enjoy more distraction-free moments. As I write these lines, most of the world is under quarantine, relying on non-stop video calls to continue working. Needless to say, that is not a sustainable long-term work schedule.
Original text:
https://www.getonbrd.com/blog/public-by-default-how-we-manage-information-visibility-at-get-on-board
Author:
Sergio Nouvel
-
(In the process of answering an email, I accidentally wrote a tiny essay about writing. I usually spend weeks on an essay. This one took 67 minutes—23 of writing, and 44 of rewriting.)
Original text:
https://paulgraham.com/writing44.html
Author:
Paul Graham
-
This introduces the concept of Pareto frontiers. The top comment by Rob Miles also ties it to comparative advantage.
While reading, consider what Pareto frontiers your project could place you on.
Original text:
https://www.lesswrong.com/posts/XvN2QQpKTuEzgkZHY/being-the-pareto-best-in-the-world
Author:
John Wentworth
-
I am approaching the end of my AI governance PhD, and I’ve spent about 2.5 years as a researcher at FHI. During that time, I’ve learnt a lot about the formula for successful early-career research.
This post summarises my advice for people in the first couple of years. Research is really hard, and I want people to avoid the mistakes I’ve made.
Original text:
https://forum.effectivealtruism.org/posts/jfHPBbYFzCrbdEXXd/how-to-succeed-as-an-early-stage-researcher-the-lean-startup#Conclusion
Author:
Toby Shevlane
-
The next four weeks of the course are an opportunity for you to actually build a thing that moves you closer to contributing to AI Alignment, and we're really excited to see what you do!
A common failure mode is to think "Oh, I can't actually do X" or to say "Someone else is probably doing Y."
You probably can do X, and it's unlikely anyone is doing Y! It could be you!
Original text:
https://www.neelnanda.io/blog/become-a-person-who-actually-does-things
Author:
Neel Nanda
-
We took 10 years of research and what we’ve learned from advising 1,000+ people on how to build high-impact careers, compressed that into an eight-week course to create your career plan, and then compressed that into this three-page summary of the main points.
(It’s especially aimed at people who want a career that’s both satisfying and has a significant positive impact, but much of the advice applies to all career decisions.)
Original article:
https://80000hours.org/career-planning/summary/
Author:
Benjamin Todd
-
This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren’t familiar with the arguments for the importance of AI alignment, you can get an overview of them by doing the AI Alignment Course.
by Charlie Rogers-Smith, with minor updates by Adam Jones
Source:
https://aisafetyfundamentals.com/blog/alignment-careers-guide
Narrated for AI Safety Fundamentals by Perrin Walker
-
This post summarises a new report, “Computing Power and the Governance of Artificial Intelligence.” The full report is a collaboration between nineteen researchers from academia, civil society, and industry. It can be read here.
GovAI research blog posts represent the views of their authors, rather than the views of the organisation.
Source:
https://www.governance.ai/post/computing-power-and-the-governance-of-ai
Narrated for AI Safety Fundamentals by Perrin Walker
-
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post:
We summarize the paper;
We compare our methodology to the methodology of other safety papers.
Source:
https://www.alignmentforum.org/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion
Narrated for AI Safety Fundamentals by Perrin Walker
-
Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness, fairness, potential for misuse, and so on. We are able to talk about these characteristics because we can technically evaluate models for their performance in these areas. But what many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations. Many of today’s existing evaluation suites are limited in their ability to serve as accurate indicators of model capabilities or safety.
At Anthropic, we spend a lot of time building evaluations to better understand our AI systems. We also use evaluations to improve our safety as an organization, as illustrated by our Responsible Scaling Policy. In doing so, we have grown to appreciate some of the ways in which developing and running evaluations can be challenging.
Here, we outline challenges that we have encountered while evaluating our own models to give readers a sense of what developing, implementing, and interpreting model evaluations looks like in practice.
Source:
https://www.anthropic.com/news/evaluating-ai-systems
Narrated for AI Safety Fundamentals by Perrin Walker
-
The UK recognises the enormous opportunities that AI can unlock across our economy and our society. However, without appropriate guardrails, such technologies can pose significant risks. The AI Safety Summit will focus on how best to manage the risks from frontier AI such as misuse, loss of control and societal harms. Frontier AI organisations play an important role in addressing these risks and promoting the safety of the development and deployment of frontier AI.
The UK has therefore encouraged frontier AI organisations to publish details on their frontier AI safety policies ahead of the AI Safety Summit hosted by the UK on 1 to 2 November 2023. This will provide transparency regarding how they are putting into practice voluntary AI safety commitments and enable the sharing of safety practices within the AI ecosystem. Transparency of AI systems can increase public trust, which can be a significant driver of AI adoption.
This document complements these publications by providing a potential list of frontier AI organisations’ safety policies.
Source:
https://www.gov.uk/government/publications/emerging-processes-for-frontier-ai-safety/emerging-processes-for-frontier-ai-safety
Narrated for AI Safety Fundamentals by Perrin Walker
-
Generative AI allows people to produce piles upon piles of images and words very quickly. It would be nice if there were some way to reliably distinguish AI-generated content from human-generated content. It would help people avoid endlessly arguing with bots online, or believing what a fake image purports to show. One common proposal is that big companies should incorporate watermarks into the outputs of their AIs. For instance, this could involve taking an image and subtly changing many pixels in a way that’s undetectable to the eye but detectable to a computer program. Or it could involve swapping words for synonyms in a predictable way so that the meaning is unchanged, but a program could readily determine the text was generated by an AI.
Unfortunately, watermarking schemes are unlikely to work. So far most have proven easy to remove, and it’s likely that future schemes will have similar problems.
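To make the pixel-watermarking idea concrete, here is a toy sketch (not any company's actual scheme; the function names and the bit pattern are made up for illustration): encode a known bit pattern into each pixel's least significant bit, which is invisible to the eye but detectable by a program that knows the pattern. The same sketch also shows why such schemes are fragile: any re-encoding that touches the low bits erases the mark.

```python
def embed_watermark(pixels, pattern):
    """Set each pixel's least significant bit to the corresponding pattern bit."""
    return [(p & ~1) | pattern[i % len(pattern)] for i, p in enumerate(pixels)]

def detect_watermark(pixels, pattern):
    """Return the fraction of pixels whose LSB matches the expected pattern."""
    matches = sum((p & 1) == pattern[i % len(pattern)] for i, p in enumerate(pixels))
    return matches / len(pixels)

pixels = [200, 137, 54, 91, 180, 33, 77, 240]  # toy 8-bit pixel values
pattern = [1, 0, 1, 1]                          # secret bit pattern, cycled

marked = embed_watermark(pixels, pattern)       # each pixel changes by at most 1
score = detect_watermark(marked, pattern)       # 1.0: perfect match

# Removing the mark is trivial: e.g. rounding every pixel to an even value
# clears all LSBs, and the detector's match rate drops to chance-ish levels.
erased = [p & ~1 for p in marked]
```

Real proposals are more sophisticated (spread-spectrum perturbations, statistical token biases for text), but as the post argues, they have so far shared this basic removability problem.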
Source:
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker
-
Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness, and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.
Source:
https://arxiv.org/pdf/2211.00593.pdf
Narrated for AI Safety Fundamentals by Perrin Walker
-
Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and how they interact, we hope to be able to reason about the behavior of the entire network. The first step in that program is to identify the correct components to analyze.
Unfortunately, the most natural computational unit of the neural network – the neuron itself – turns out not to be a natural unit for human understanding. This is because many neurons are polysemantic: they respond to mixtures of seemingly unrelated inputs. In the vision model Inception v1, a single neuron responds to faces of cats and fronts of cars. In a small language model we discuss in this paper, a single neuron responds to a mixture of academic citations, English dialogue, HTTP requests, and Korean text. Polysemanticity makes it difficult to reason about the behavior of the network in terms of the activity of individual neurons.
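The shape of the sparse-autoencoder approach can be sketched with a toy forward pass (made-up weights and dimensions; the paper's actual architecture and training procedure differ, e.g. it includes decoder biases and trains the weights by gradient descent with an L1 sparsity penalty): activations are encoded into an overcomplete set of features through a ReLU, and decoded back as a sum of feature directions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def sae_encode(x, W_enc, b_enc):
    # features = ReLU(W_enc @ x + b_enc); the ReLU (plus an L1 penalty on
    # feature activations during training) is what pushes features to be sparse.
    return relu([sum(w * xi for w, xi in zip(row, x)) + b
                 for row, b in zip(W_enc, b_enc)])

def sae_decode(f, W_dec):
    # Reconstruct the activation vector as a weighted sum of feature directions.
    dim = len(W_dec[0])
    return [sum(W_dec[j][i] * f[j] for j in range(len(f))) for i in range(dim)]

# Toy sizes: 2-dim activations mapped to 3 candidate features (invented weights).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]

x = [1.0, 2.0]
f = sae_encode(x, W_enc, b_enc)   # third feature is zeroed out by the ReLU
x_hat = sae_decode(f, W_dec)
# Training objective: reconstruction error plus L1 sparsity penalty on f.
loss = sum((a - b) ** 2 for a, b in zip(x, x_hat)) + 0.01 * sum(abs(v) for v in f)
```

The hope is that the sparse features recovered this way are monosemantic, i.e. more interpretable than the polysemantic neurons described above.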
Source:
https://transformer-circuits.pub/2023/monosemantic-features/index.html
Narrated for AI Safety Fundamentals by Perrin Walker
-
By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks.
Many important transition points in the history of science have been moments when science “zoomed in.” At these points, we develop a visualization or tool that allows us to see the world in a new level of detail, and a new field of science develops to study the world through this lens.
For example, microscopes let us see cells, leading to cellular biology. Science zoomed in. Several techniques including x-ray crystallography let us see DNA, leading to the molecular revolution. Science zoomed in. Atomic theory. Subatomic particles. Neuroscience. Science zoomed in.
These transitions weren’t just a change in precision: they were qualitative changes in what the objects of scientific inquiry are. For example, cellular biology isn’t just more careful zoology. It’s a new kind of inquiry that dramatically shifts what we can understand.
The famous examples of this phenomenon happened at a very large scale, but it can also be the more modest shift of a small research community realizing they can now study their topic at a finer-grained level of detail.
Source:
https://distill.pub/2020/circuits/zoom-in/
Narrated for AI Safety Fundamentals by Perrin Walker
-
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively fine-tune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive fine-tuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.
We find that simple methods can often significantly improve weak-to-strong generalization: for example, when fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
Source:
https://arxiv.org/pdf/2312.09390.pdf
Narrated for AI Safety Fundamentals by Perrin Walker
-
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for steering large language models (LLMs) toward desired behaviours. However, relying on simple human feedback doesn’t work for tasks that are too complex for humans to accurately judge at the scale needed to train AI models. Scalable oversight techniques attempt to address this by increasing the abilities of humans to give feedback on complex tasks.
This article briefly recaps some of the challenges faced with human feedback, and introduces the approaches to scalable oversight covered in session 4 of our AI Alignment course.
Source:
https://aisafetyfundamentals.com/blog/scalable-oversight-intro/
Narrated for AI Safety Fundamentals by Perrin Walker
-
The two tasks of supervised learning: regression and classification. Linear regression, loss functions, and gradient descent.
How much money will we make by spending more dollars on digital advertising? Will this loan applicant pay back the loan or not? What’s going to happen to the stock market tomorrow?
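The regression-plus-gradient-descent recipe the episode covers can be sketched in a few lines (a minimal illustration with invented toy data, not code from the article): repeatedly nudge the slope and intercept against the gradient of the mean squared error.

```python
def fit_linear(xs, ys, lr=0.02, steps=3000):
    """Fit y ≈ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of MSE = (1/n) * sum((w*x + b - y)^2)
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w   # step downhill on the loss surface
        b -= lr * grad_b
    return w, b

# Toy data generated from the line y = 3x + 1; descent should recover w≈3, b≈1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 7.0, 10.0, 13.0]
w, b = fit_linear(xs, ys)
```

Classification follows the same loop with a different model and loss (e.g. logistic regression with cross-entropy instead of squared error).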
Original article:
https://medium.com/machine-learning-for-humans/supervised-learning-740383a2feab
Author:
Vishal Maini
-
The field of AI has undergone a revolution over the last decade, driven by the success of deep learning techniques. This post aims to convey three ideas using a series of illustrative examples:
1. There have been huge jumps in the capabilities of AIs over the last decade, to the point where it’s becoming hard to specify tasks that AIs can’t do.
2. This progress has been primarily driven by scaling up a handful of relatively simple algorithms (rather than by developing a more principled or scientific understanding of deep learning).
3. Very few people predicted that progress would be anywhere near this fast; but many of those who did also predicted that we might face existential risk from AGI in the coming decades.
I’ll focus on four domains: vision, games, language-based tasks, and science. The first two have more limited real-world applications, but provide particularly graphic and intuitive examples of the pace of progress.
Original article:
https://medium.com/@richardcngo/visualizing-the-deep-learning-revolution-722098eb9c5
Author:
Richard Ngo