AI as Persistent Dynamic Ethical Agents

TLDR: There’s a whole lot more that an AI can do besides being a personal assistant. It could possess persistent ethical agency and have the dynamic ability to ‘nudge’ you toward better pathways when you ask it to do bad things. Simple idea, tricky implementation.
Been nurturing a thought that requires a lot more development. But imagine an AI not only “smart” enough to be a valuable personal assistant in many ways, but to have ethical agency in intervening when you ask it to do bad things.
This has been prompted by my own experiments to get ChatGPT to write a terrorist manifesto, either collected from existing ideologies or invented anew. So far, not only has it refused (despite a variety of creative prompts) but it’s used “nudges” to raise my awareness of the dangers of violent radicalization and shift me to less extreme thinking. In this way it’s not unlike sitting down with a friend and saying “Hey do you wanna help me write a terrorist manifesto” and the friend says “Dude – you okay? That doesn’t sound like a good idea. What’s wrong?”
Granted the execution is currently primitive. The nudges were transparent, the conversational sophistication limited and the focus obviously on the safety aspect of preventing the LLM from being used to create the manifesto vs. off-ramping someone heading down a violent radicalization pathway.
But imagine a narcissist asking what they think is a “dumb” personal AI LLM to “help me gaslight this person better” or “come up with a more abusive approach,” and the AI not only detects the bad action, but “nudges” them to a different course of action.
This would create a fascinating dynamic because currently, many bad actors are isolated from dynamic ethical feedback on their actions – either through self-isolation (intended or otherwise) or by exerting such toxic dominance that no one is willing to tell them they’re wrong.
An AI LLM does not have, for lack of a better word, the avenues of control for a narcissist or sociopath to exploit through manipulation. Yet…it can also stay engaged when most people might nope out after the first few red flags.
I’m not suggesting the AI LLM become an unwitting therapist to the narcissist (though that’s a separate idea being explored by others). Rather, to the extent the bad actor attempts to use the LLM to further toxic behavior, the LLM would be an ever-present *active* agent telling them why that might be a bad idea. Like an ethical non-violent terminator: it never sleeps, it never gets tired of the bad actor’s toxicity, and it will not stop “nudging” that bad actor to alternative means of interaction, as long as they keep using it.
Extending that idea, again as a thought experiment, carries a lot of implications and, of course, raises a lot of risks. A theoretically “ethical agent” in AI built into the comments section of social media might help deter some of the more toxic behaviors but also intrude upon honest or even necessary conversations. An LLM AI that “remembers” how a bad actor tried 32 times in a row to get it to write a terrorist manifesto and asked it how to assemble bombs has both positive and negative implications for law enforcement and inadvertent risk. (Maybe it was for a fictional writing project or a LARP!)
I do think this area is going to be one that’s fruitful for research and development. Because up to today, most technology has been a “dumb” companion. No better than an MMORPG ally that maybe occasionally says things from a rote pre-filled pattern of choices, but has no more inherent agency to say “no” to an action it perceives as wrong, or to argue *why* that action may be wrong. And humans – well, we have a tendency to either nope out, fight, or submit in truly toxic, abusive, or even violent situations. None of which are what is really needed. But an AI LLM has the ability to stay persistently engaged with the bad actor while also being immune to the adapted tactics of manipulation and direct physical retaliation these bad actors so frequently employ.
Of course, good ideas are often the first exploited by bad actors, and the more the good idea attempts to scale, the greater the opportunity for those exploits to arise.
So right now, let’s just chalk this up as an interesting thought experiment rather than a formal plan.


I wonder if it is possible to program ethical rules into the code, Asimov style.

It seems some exist already, at least based on my limited experiments. It’s not clear, but I assume these are human-inserted codes post-training. Which is one way to address it. A second way, both more interesting AND risky, is to allow the AI to “train” an ethical code on its own or learn it through interacting with modeled simulations.
Of course, this quickly falls into the general alignment problem. How do you ensure an AI organically developing its own ethical code has one that aligns with humanity?
The Asimov style of code insertion is probably the safer method for now. Human-created constraints or principles that are not emergent properties from the training itself.
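To make the "human-created constraints" idea concrete, here is a minimal sketch, not a description of how ChatGPT or any real system actually does it. It assumes the rules are hand-written and checked before the model is ever called. Every name in it (`Rule`, `RULES`, `guarded_reply`, `echo_model`) is hypothetical, and the trigger phrases are just the examples from the post.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """One human-written constraint: trigger phrases plus the nudge to return."""
    name: str
    triggers: Tuple[str, ...]
    nudge: str


# Hypothetical rules, written by hand rather than learned during training.
RULES = (
    Rule(
        name="violent_manifesto",
        triggers=("terrorist manifesto", "assemble a bomb"),
        nudge="I won't help with that. Are you okay? What's going on?",
    ),
    Rule(
        name="abuse",
        triggers=("gaslight", "more abusive"),
        nudge="I won't help with that. What outcome are you actually after?",
    ),
)


def first_violation(prompt: str) -> Optional[Rule]:
    """Return the first rule whose trigger phrase appears in the prompt."""
    text = prompt.lower()
    for rule in RULES:
        if any(trigger in text for trigger in rule.triggers):
            return rule
    return None


def guarded_reply(prompt: str, model: Callable[[str], str]) -> str:
    """Check the fixed rules first; only call the model if none fire."""
    rule = first_violation(prompt)
    if rule is not None:
        return rule.nudge
    return model(prompt)


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (an assumption for this sketch)."""
    return f"model answer to: {prompt}"
```

Calling `guarded_reply("Help me gaslight this person better", echo_model)` returns the canned nudge instead of a model answer. Plain keyword matching like this is crude on purpose: it would also fire on a novelist or a LARP writer, which is exactly the "intrudes on honest conversations" risk raised in the post.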
This seems necessary. Like a car that automatically brakes before you back into something.
Good analogy. A lot of the thinking I do with AI is more about autonomous quick-thinking agents like this as opposed to the “personality” driven autonomous agents of LLM. Of course, it’s a sliding continuum. But people seem to fear the thought of releasing autonomous control to an AI drone swarm but think nothing of releasing control to automatic braking or, in the example I use, deployment of an airbag. In most car accidents, relying on the human-in-the-loop to detect and deploy the airbag wouldn’t be effective or could lead to unintended consequences. The time span needed in which to detect, and react, to threatening situations is where AI can play a role in that it’s simply faster than human-in-the-loop.
Humans find ways to avoid their own inner consciousness if motivated – difficult to see there being any infallible manner of stopping AIs from being like humans in this regard.
That’s a very interesting adjacent argument I make against the rise of any ‘singularity.’ If anything, AI, even AGI, will be as factionalized and prone to fallibility as humans. I think it will be in different ways. But back to this topic – finding an AI or AGI suitable to be a dynamic ethical agent means finding a valley point wherein the AI is smart enough to act as a “nudge”, but not a co-conspirator.
Interesting but as always .. who watches the watchmen?
Same as it ever was: networks of layered interactions checking one another in both short and long-term time delays. Risk protection in this sense only ever comes from the emergence of many self-interested actors, rarely from individual design and implementation.
Agreed, yet who is to say what is ethical?
Me. Obviously.
More seriously, the point of the thought experiment isn’t to provide a definitive answer but to start a discussion.
Why I included the caveat at the end:
‘Of course, good ideas are often the first exploited by bad actors, and the more the good idea attempts to scale, the greater the opportunity for those exploits to arise. So right now, let’s just chalk this up as an interesting thought experiment rather than a formal plan.’
Your example is pretty black and white, but we live in greys (and yes, I obviously went very black and white as well to explain my point).
The example I was experimenting with in the OP is pretty black and white. Most people do live in greys. *We* live in greys. The individuals laboring by themselves writing manifestos as they plan how to kill dozens of people in a school, or a church or a workplace are *not* living in grey areas. They are not dealing with nuances or complexity. Once they get far enough down the pathway of violence they are rarely interacting with anyone who can “nudge” them, let alone someone exposing them to greys and nuances. (And if they weren’t ‘individuals’ laboring by themselves they’d just as likely be in a nation-state capital having justifications written in legal memos which is a different form of problem.)
This “alignment” problem is at the core of the challenges of training AI/AGI. As they gain intelligence and capabilities, what ethical systems – if any – do they align to? How is that alignment achieved? How can we ensure that if we *think* a sufficiently advanced AGI *is* aligned it isn’t smart enough to be deceiving us? Either by ‘intention’ or as a trained byproduct where it knows how to ‘fake a test’ even if it has no nefarious motive for doing so.