On 10/16/2023, we met with three grad students from MAIA to hear what ideas they are excited about in AI safety research and what general advice on the process for creating a great paper.

Cas

Ways to quantify the extent to which using AIs to monitor other AIs has correlated errors
1. Extremely relevant to RLAIF
2. Does supervision AI system which lets risks through also let those risks
3. Consider one model which approximates human decision making. What are instances in which the reward model is going to fail in the same way that the language model will.
4. GPT-4 be an annotator to another GPT-4 model. Does a chatbot that will produce failure, will it be able to recognize a failure through reflection. How does it compare to other models.
5. What if a model is much worse at checking it’s own model, compared to other models.
6. Save conversation histories, ask models if these conversations check out.
7. If this is the case, this suggests that we shouldn’t use GPT-4 to monitor itself.
Can you kill mesa-optimization. (can you unlearn something like in-context learning)
1. mesoptimizer: in context learning is a great example.
2. People are worried about mezoptimizers because people are worried that this is the way that AIs will unallign themselves.
3. Can you train a model to not be able to in-context learn. How does this affect other tasks
Building a dataset that is a more relaistic challenge for lie detection
1. Lie detection is studied using artificial data which may be true or false. This is convienent bc it is convienient. However, not realistitic because data in the sets is extremely structure
2. AI lies will not be through blatent lies, but will be
3. Build a dataset for deep deception
4. Get standardized test questions, where answer are multiple choice questions, where each answer is plausible. Prompt model to give a justification for one of the wrong answers. This can create a dataset of pairs of response and either faithful/unfaithful responses.
5. Dataset on non-obvious true/false justifications.
6. Good quote: “just because a probe works, doesn’t mean we’ve discovered truth”. Lots of different kinds of truth. Models are not trained for realistic world views. They are pre-trained to immitate and fine-tuned to reward.
Can you probe if the model knows what the person who is talking to it believes.
Can you probe a chatbot to see if you can find an internal representation of how “difficult” you are being. (related to #4). hg.
Can you design a continuous scale for truth?

What makes a machine learning especially good?

Good papers have implications that can be summed up into one sentence
Implications are clear to the field and people want to keep it in mind to pull out for arguments.
Imagine you are in an argument that you want to win, and imagine what evidence you need to win the argument.
Good papers do not discuss nuianced arguments (clear and easy scope).

papers:

“still no lie detector”
https://ai.papers.bar/papers/weekly
useful results

Cas will be happy to review a doc of 3 or so paper topics.

Tony Wang

10/16/2023