Comments

Thanks for engaging with my post. I keep thinking about that question.

I'm not quite sure what you mean by "values and beliefs are perfectly correlated here", but I'm guessing you mean they are "entangled".

there is no test we could perform which would distinguish what it wants from what it believes.

Ah yeah, that seems true for all systems (at least if you can only look at their behaviors and not their minds); ref.: Occam’s razor is insufficient to infer the preferences of irrational agents. Summary: in principle, any possible value system can be paired with some belief system that leads to any given set of actions.

So, in principle, the cat classifier, viewed from the outside, could actually be a human mind wanting to live a flourishing human life, but with a decision-making process that's so broken that the human does nothing but say "cat" when they see a cat, thinking this will lead them to achieve all their deepest desires.
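
To make that concrete, here's a toy sketch (entirely made-up agents, outcomes, and numbers) of the paper's point: two agents with very different values, but compensatingly different beliefs, produce exactly the same behavior, so no behavioral test separates what they want from what they believe.

```python
# Toy illustration with invented numbers: two (values, beliefs) pairs that
# produce identical behavior.

ACTIONS = ["say 'cat'", "say 'dog'"]

def choose(beliefs, utility):
    """Pick the action with the highest expected utility under the agent's beliefs."""
    def expected_utility(action):
        return sum(p * utility(outcome) for outcome, p in beliefs(action).items())
    return max(ACTIONS, key=expected_utility)

# Agent A: the straightforward cat classifier; it values correct labels and
# believes saying 'cat' produces one.
beliefs_a = lambda a: {"correct label": 1.0} if a == "say 'cat'" else {"wrong label": 1.0}
utility_a = lambda o: 1.0 if o == "correct label" else 0.0

# Agent B: "wants" a flourishing human life, but mistakenly believes that
# saying 'cat' is what brings it about.
beliefs_b = lambda a: {"flourishing life": 1.0} if a == "say 'cat'" else {"wasted life": 1.0}
utility_b = lambda o: 1.0 if o == "flourishing life" else 0.0

print(choose(beliefs_a, utility_a))  # say 'cat'
print(choose(beliefs_b, utility_b))  # say 'cat'
```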

I think the paper says noisy errors would cancel each other out (?), but correlated errors wouldn't go away. One way to address them would be coming up with "minimal normative assumptions".

I guess that's as relevant to "value downloading" as it is to "value (up)loading". (I just coined the term “value downloading” to refer to the problem of determining human values, as opposed to the problem of programming values into an AI.)

The solution space for determining the values of an agent, at a high level, seems to be (I'm sure that's too simplistic, and maybe even a bit confused, but just thinking out loud):

  • Look in their brain directly to understand their values (and maybe that also requires solving the symbol-grounding problem)
  • Determine their planner (ie. “decision-making process”) (ex.: using some interpretability methods), and determine their values from the policy and the planner (see the sketch after this list)
  • Make minimal normative assumptions about their reasoning errors and approximations to determine their planner from their behavior (/policy)
  • Augment them to make their planners flawless (I think your example fits into improving the planner by improving the image resolution--I love that thought 💡)
  • Ask the agent questions directly about their fundamental values which don't require any planning to answer (?)
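
As a sketch of the second bullet, under a strong assumption I'm adding purely for illustration (that the planner is Boltzmann-rational, i.e. the policy picks actions with probability proportional to exp(beta * value)), the values can be read straight off the policy up to an additive constant:

```python
import math

# Hypothetical sketch: assume the planner is Boltzmann-rational,
# pi(a) proportional to exp(beta * V(a)). Then values are recoverable from the
# policy up to an additive constant. The planner assumption is doing the work:
# a different assumed planner gives different inferred values for the same policy.

def infer_values(policy, beta=1.0):
    """policy: dict of action -> probability. Returns action -> value (best action = 0)."""
    raw = {a: math.log(p) / beta for a, p in policy.items()}
    best = max(raw.values())
    return {a: v - best for a, v in raw.items()}

observed_policy = {"say 'cat'": 0.9, "say 'dog'": 0.1}   # made-up numbers
print(infer_values(observed_policy))             # {"say 'cat'": 0.0, "say 'dog'": ~-2.2}
print(infer_values(observed_policy, beta=0.5))   # same policy, different values: ~-4.4
```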

Approaches like “iterated amplification” correspond to some combination of the above.

But going back to my original question, I think a similar way to put it is that I wonder how complex the concept of "preferences''/"wanting" is. Is it a (messy) concept that's highly dependent on our evolutionary history (ie. not what we want, which definitely is, but the concept of wanting itself) or is it a concept that all alien civilizations use in exactly the same way as us? It seems like a fundamental concept, but can we define it in a fully reductionist (and concise) way? What’s the simplest example of something that “wants” things? What’s the simplest planner a wanting-thing can have? Is it no planner at all?

A policy seems well defined–it’s basically an input-output map. We’re intuitively thinking of a policy as a planner + an optimization target, so if either of those two can be defined robustly, then it seems like we should be able to define the other as well. Although, for a given policy there may be many (planner, optimization target) pairs that produce it, but maybe Occam’s razor would be helpful here.
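
Here's a tiny, made-up illustration of that underdetermination: two (planner, values) pairs reproduce the same behavior, and a crude description-length razor doesn't obviously separate them either, which is roughly the worry in the Occam's-razor paper linked above.

```python
# Invented example: two (planner, values) decompositions that yield the same policy.

ACTIONS = ["cat", "dog"]
observed_behavior = "cat"                      # the system always outputs "cat"

def maximizer(values):
    return max(ACTIONS, key=values.get)

def anti_maximizer(values):                    # a "broken" planner: picks the worst option
    return min(ACTIONS, key=values.get)

candidates = [
    ("maximizer", maximizer, {"cat": 1.0, "dog": 0.0}),
    ("anti-maximizer", anti_maximizer, {"cat": 0.0, "dog": 1.0}),   # flipped values
]

consistent = [(name, values) for name, planner, values in candidates
              if planner(values) == observed_behavior]

print(consistent)                          # both decompositions fit the behavior
print([len(repr(c)) for c in consistent])  # and their descriptions are comparably complex
```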

Relatedly, I also just read Reward is not the optimization target which is relevant and overlaps a lot with ideas I wanted to write about (ie. neural-net-executors, not reward-maximizers, as a reference to Adaptation-Executers, not Fitness-Maximizers). A reward function R will only select a policy π that wants R if wanting R is the best way to achieve R in the environment in which the policy is being developed. (I’m speaking loosely: technically not if it’s the “best” way, but just if it’s the way the weight-update function works.)
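
A minimal sketch of that point (a toy two-armed bandit with a policy-gradient-style update; all names and numbers are mine, not from the post): the reward function only enters through the weight-update rule, and what comes out is just a table of action preferences, with nothing in it that represents or "wants" R.

```python
import math, random

random.seed(0)
prefs = {"a": 0.0, "b": 0.0}          # the entire learned "mind": two action preferences

def reward(action):                    # R: the thing doing the selecting
    return 1.0 if action == "a" else 0.0

def policy():
    z = sum(math.exp(v) for v in prefs.values())
    return {a: math.exp(v) / z for a, v in prefs.items()}

def sample_action():
    r, acc = random.random(), 0.0
    for a, p in policy().items():
        acc += p
        if r <= acc:
            return a
    return a

for _ in range(2000):
    a = sample_action()
    probs = policy()
    # R only shows up here, in the update rule that "chisels" the preferences.
    for b in prefs:
        prefs[b] += 0.1 * reward(a) * ((1.0 if b == a else 0.0) - probs[b])

print(prefs)   # heavily favors "a", yet nothing in `prefs` encodes or pursues R itself
```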

Anyway, that’s a thread that seems valuable to pull more. If you have any other thoughts or pointers, I’d be interested 🙂

i want a better conceptual understanding of what "fundamental values" means, and how to disentangle that from beliefs (ex.: in an LLM). like, is there a meaningful way we can say that a "cat classifier" is valuing classifying cats even though it sometimes fails?

when potentially ambiguous, I generally just say something like "I have a different model" or "I have different values"

Mati_Roy

it seems to me that disentangling beliefs and values is an important part of being able to understand each other

and using words like "disagree" to mean both "different beliefs" and "different values" is really confusing in that regard

topic: economics

idea: when building something with local negative externalities, have some mechanism to measure the externalities in terms of how much the surrounding property valuations changed (or are expected to change, as estimated, say, through a prediction market), and have the owner of the new structure pay the owners of the surrounding properties.
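
A back-of-the-envelope sketch of how the payments could be computed, with invented addresses and numbers (the "after" values could come from post-construction appraisals or a prediction market's forecast):

```python
# Invented numbers: the builder compensates each neighbour for the drop (if any)
# in their property's assessed value; increases owe nothing under this simple rule.

surrounding = {
    "12 Oak St": {"before": 500_000, "after": 470_000},
    "14 Oak St": {"before": 450_000, "after": 455_000},
    "3 Elm Ave": {"before": 600_000, "after": 560_000},
}

payments = {addr: max(0, v["before"] - v["after"]) for addr, v in surrounding.items()}

print(payments)                 # {'12 Oak St': 30000, '14 Oak St': 0, '3 Elm Ave': 40000}
print(sum(payments.values()))   # 70000 owed by the builder in total
```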

I wonder what fraction of people identify as "normies"

I wonder if most people have something niche they identify with and label people outside of that niche as "normies"

if so, then a term with a more objective perspective (and maybe better) would be non-<whatever your thing is>

like, athletic people could use "non-athletic" instead of "normies" for that class of people

Mati_Roy

just a loose thought, probably obvious

some tree species self-selected for height (ie. there's no point in being a tall tree unless taller trees are blocking your sunlight)

humans were not the first species to self-select (although humans can now do it intentionally)

on human self-selection: https://www.researchgate.net/publication/309096532_Survival_of_the_Friendliest_Homo_sapiens_Evolved_via_Selection_for_Prosociality

Answer by Mati_Roy

Board game: Medium

2 players reveal a card with a word, then they each need to say a word based on it and get points if they say the same word (basically; there are some more complexities).

Example at 1m20 here: https://youtu.be/yTCUIFCXRtw?si=fLvbeGiKwnaXecaX

I'm glad past Mati cast a wider net, as the specifics for this year's Schelling day are different ☺️☺️
