Often you can compare your own Fermi estimates with those of other people, and that's sort of cool, but what's way more interesting is when they share the variables and models they used to get to their estimates. That lets you update your own model in a deeper way.
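To make that concrete, here's a toy sketch (all numbers invented) of the classic piano-tuner Fermi estimate with its variables exposed, so two estimators can see exactly which assumption drives their disagreement:

```python
# Toy Fermi estimate (all numbers invented): piano tuners in a big city.
# Exposing the variables makes it obvious where two estimators' models diverge.
def piano_tuners(population, people_per_household, piano_ownership_rate,
                 tunings_per_piano_per_year, tunings_per_tuner_per_year):
    households = population / people_per_household
    pianos = households * piano_ownership_rate
    return pianos * tunings_per_piano_per_year / tunings_per_tuner_per_year

print(piano_tuners(3e6, 2.5, 0.05, 1, 1000))  # my model:   ~60 tuners
print(piano_tuners(3e6, 2.5, 0.20, 1, 1000))  # your model: ~240; we differ only on ownership rate
```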
The claim is that most applications aren't internal usage of AI for AI development and thus can be made trivially safe.
Not that most applications of AI for AI development can be made trivially safe.
When I introduce people to plans like QACI, they often have objections like "How is an AI going to do all of the simulating necessary to calculate this?" or "If our technology is good enough to calculate this with any level of precision, we can probably just upload some humans." or just "That's not computable."
I think these kinds of objections are missing the point of formal goal alignment and maybe even outer alignment in general.
To formally align an ASI to human (or your) values, we do not need to actually know those values. We only need to strongly point to them.
AI will figure out our values. Whether it's aligned or not, a recursively self-improving AI will eventually get a very good model of our values, as part...
What do you think of this post by Tammy?
It seems like someone could definitely be wrong about what they want (unless normative anti-realism is true and such a sentence has no meaning). For example consider someone who thinks it's really important to be faithful to God and goes to church every Sunday to maintain their faith and would use a superintelligent religious AI assistant to help keep the faith if they could. Or maybe they're just overconfident about their philosophical abilities and would fail to take various precautions that I think are importan...
Basically all ideas/insights/research about AI are potentially exfohazardous. At least, it's pretty hard to know when some ideas/insights/research will actually make things better; especially in a world where building an aligned superintelligence (let's call this work "alignment") is considerably harder than building any superintelligence (let's call this work "capabilities"), and there are a lot more people trying to do the latter than the former, and they have a lot more material resources.
Ideas about AI, let alone insights about AI, let alone research results about AI, should be kept to private communication between trusted alignment researchers. On LessWrong, we should focus on teaching people the rationality skills which could help them figure out insights that help them build any superintelligence, but are more likely to first give them insights...
While I've softened my position on this in the last year, I want to give a big +1 to this response, especially these two points:
...
- It's genuinely hard to come up with ideas that help capabilities a lot. I think you are severely underestimating how hard it is, and how much insight is required. I think one issue here is that most papers on arXiv are garbage and don't actually make any progress, but those papers are not the ones pushing AGI forward anyway.
- [..]
- High-level ideas are generally not that valuable in and of themselves. People generally learn...
The beauty industry offers a large variety of skincare products (marketed mostly at women), differing both in alleged function and (substantially) in price. However, it's pretty hard to test for yourself how much any of these products actually help. The feedback loop for things like "getting fewer wrinkles" is very long.
So, which of these products are actually useful and which are mostly a waste of money? Are more expensive products actually better or just have better branding? How can I find out?
I would guess that sunscreen is definitely helpful, and using some moisturizers for face and body is probably helpful. But, what about night cream? Eye cream? So-called "anti-aging"? Exfoliants?
Retinoids. They're a big family of compounds, but I would go with adapalene, which has a better safety/side-effect profile than anything else. It has less scientific evidence for anti-aging than other retinoids (and is not marketed for that purpose), but I've tried it myself (bought it for acne), and it has very obvious anti-wrinkle effects within about a week. You can get generic 0.1% adapalene gel on Amazon for about $12 per 1.6 oz.
(I'm a little worried about long term effects, i.e. could the increased skin turnover mean faster aging in the long run, but can't seem to find any data or discussion about it.)
What is the mysterious impressive new ‘gpt2-chatbot’ from the Arena? Is it GPT-4.5? A refinement of GPT-4? A variation on GPT-2 somehow? A new architecture? Q-star? Someone else’s model? Could be anything. It is so weird that this is how someone chose to present that model.
There was also a lot of additional talk this week about California’s proposed SB 1047.
I wrote an additional post extensively breaking that bill down, explaining how it would work in practice, addressing misconceptions about it and suggesting fixes for its biggest problems along with other improvements. For those interested, I recommend reading at least the sections ‘What Do I Think The Law Would Actually Do?’ and ‘What are the Biggest Misconceptions?’
As usual, lots of other things happened as well.
Regarding Cortez and the Aztecs, it is of interest to note that Cortez's indigenous allies (enemies of the Aztecs) actually ended up in a fairly good position afterwards.
From https://en.wikipedia.org/wiki/Tlaxcala
For the most part, the Spanish kept their promise to the Tlaxcalans. Unlike Tenochtitlan and other cities, Tlaxcala was not destroyed after the Conquest. They also allowed many Tlaxcalans to retain their indigenous names. The Tlaxcalans were mostly able to keep their traditional form of government.
Hello, friends.
This is my first post on LW, but I have been a "lurker" here for years and have learned a lot from this community, which I value.
I hope this isn't pestilent, especially for a first-time post, but I am requesting information/advice/non-obvious strategies for coming up with emergency money.
I wouldn't ask except that I'm in a severe financial emergency and I can't seem to find a solution. I feel like every minute of the day I'm butting my head against a brick wall trying and failing to figure this out.
I live in a very small town in rural Arizona. The local economy is sustained by fast food restaurants, pawn shops, payday lenders, and some huge factories/plants that are only ever hiring engineers and other highly specialized personnel.
I...
Finding a loan to move somewhere with jobs is probably your best bet. This may devolve into asking around your social circles as well, which is a big hit to your pride, and many probably won't believe you'll wind up with the means to pay it back. Minimize the cost of the move by getting rid of non-essential belongings. It is probably somewhat easier these days to line up a faraway job via Zoom interviewing. Quantity over quality.
One thing that I think is non-obvious: if you lay out the case for the loan in detail, that demonstrates intelligence and conscientiou...
Produced while being an affiliate at PIBBSS[1]. The work was done initially with funding from a Lightspeed Grant, and then continued while at PIBBSS. Work done in collaboration with @Paul Riechers, @Lucas Teixeira, @Alexander Gietelink Oldenziel, and Sarah Marzen. Paul was a MATS scholar during some portion of this work. Thanks to Paul, Lucas, Alexander, Sarah, and @Guillaume Corlouer for suggestions on this writeup.
What computational structure are we building into LLMs when we train them on next-token prediction? In this post we present evidence that this structure is given by the meta-dynamics of belief updating over hidden states of the data-generating process. We'll explain exactly what this means in the post. We are excited by these results because...
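As a rough illustration of what "belief updating over hidden states of the data-generating process" means, here's a minimal sketch; this is a generic toy example, not the post's actual setup, and the HMM parameters are invented:

```python
# Minimal sketch: Bayesian belief updating over the hidden states of a toy
# two-state HMM. The transition/emission matrices are invented for illustration;
# the post's claim is that transformers come to represent the geometry of these
# belief states when trained on next-token prediction.
import numpy as np

T = np.array([[0.9, 0.1],    # P(next hidden state | current hidden state)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],    # P(emitted token | hidden state); rows = states
              [0.1, 0.9]])

def update_belief(belief, token):
    """One filtering step: push the belief through the dynamics, then condition on the observed token."""
    predicted = belief @ T               # propagate belief through transitions
    posterior = predicted * E[:, token]  # weight by likelihood of the token
    return posterior / posterior.sum()   # renormalize to a distribution

belief = np.array([0.5, 0.5])    # uniform prior over hidden states
for token in [0, 1, 1, 0]:       # an example observed token sequence
    belief = update_belief(belief, token)
    print(belief)
```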
This is a great question, and one of the things I'm most excited about using this framework to study in the future! I have a few ideas but nothing to report yet.
But I will say that I think we should be able to formalize exactly what it would mean for a transformer to create/discover new knowledge, and also to apply the structure from one dataset and apply it to another, or to mix two abstract structures together, etc. I want to have an entire theory of cognitive abilities and the geometric internal structures that support them.
Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress!
We'd love to get papers submitted if any of you have relevant projects! Deadline May 29, max 4 or max 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations, or position pieces.
And if anyone is attending ICML, you'd be very welcome at the workshop!...
This is a thread for updates about the upcoming LessOnline festival. I (Ben) will be posting bits of news and thoughts, and you're also welcome to make suggestions or ask questions.
If you'd like to hear about new updates, you can use LessWrong's "Subscribe to comments" feature from the triple-dot menu at the top of this post.
Reminder that you can get tickets at the site for $400 minus your LW karma in cents.
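For example (a sketch of the arithmetic, assuming, though it isn't stated above, that the discount can't push the price below zero):

```python
def lessonline_ticket_price(karma):
    # $400 base price, minus one cent per point of LW karma.
    # Assumption (not stated above): the price floors at $0.
    return max(0.0, 400.0 - karma * 0.01)

print(lessonline_ticket_price(10_000))  # 10,000 karma -> $300.0
```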
Launched a few days ago, the plan is:
Happy to get feedback on this, still figuring out what exactly helps parents and how to set it up right.