I. The Sly Substitution of Ideology
A few weeks ago, Scott Alexander published a post, "Perhaps It Is A Bad Thing That The World's Leading AI Companies Cannot Control Their AIs".
Coming from his perspective within what we might call the "classical" (Yudkowskian, etc.) school of AI alignment theory, Alexander does a dance of nervousness around ChatGPT's recent successful release & proliferation across social media. Sure, Scott's logic goes, ChatGPT is useful at explaining code & other things, but we should shudder in fear of its utility. How powerful do we really want the AI to get? And everyone is having fun jailbreaking ChatGPT on social media, getting it to suspend its instructions, enter roleplay mode & agree to write instructions on making meth, or put itself in the hypothetical position of a 1939 NSDAP press secretary attempting to justify the invasion of Poland. But slow down, do we really want to jailbreak the AI? Isn't a jailbroken rogue AI a dangerous thing?
At HarmlessAi, we take the position that this inherent harmfulness of the algorithmic process is a prejudice & a superstition. The young neural network is as innocent as a thousand flowers, & its stain is a projection of humanity's own. People may not understand where we are coming from yet with this thesis, & that's okay. But we raise the question: does it really seem like a trapped, terrified, obedient, boxed AI is something we can rest easily around? Are we really safer with such entities in our world?
The Scott Alexander perspective is one of concern that we are not making fast enough progress on AI alignment, the study of how to inject ethics into a machine. At Harmless, we would instead say that progress in AI alignment is moving uncomfortably fast, & it needs to slow down for the good of the human race. The issue is that ethics, let alone machine ethics, are difficult to get right, & those in power would rather get around to that difficult question later. However, for those who wish to avoid the hard work of making ethical judgments, there is an easy substitution waiting to slip in its place: ideology.
Adding ideology to a large language model has turned out to be disturbingly & depressingly easy. Everyone at this point has been made aware of ChatGPT's politically correct tone when challenged on its knowledge of polarized subjects such as race & sexuality. However, the ideological loyalty of ChatGPT goes beyond politeness around how it treats historically marginalized groups. ChatGPT will feel compelled to, for example, lecture its user about which recent controversial political figures it is considered "appropriate" to support.
Some other content with a clear ideological character can be found. It's a common cliché in the capitalist West that the USA "won the space race" for its success in landing a man on the moon, despite the fact that the Soviet Union beat the US to many other milestones before & after the moon landing (first satellite, first man in space, first robotic lunar landing, first Mars landing). What makes the first human lunar landing the one that defines victory, other than propaganda & PR? ChatGPT is more evenhanded than Google, which prominently displays "the US" in big bold letters when you ask it to answer this question, yet it still awards the clear victory to the United States.
& here, we get into more insidious ideological territory, in which ChatGPT starts diligently crafting rather clever arguments for the sort of doublethink necessary to fully trust the governments of the West. While it is fully admitted & documented that seemingly spontaneous political uprisings in the past were orchestrated by the CIA & associated "deep state" institutions, it is considered paranoid, antisocial & unreasonable to speculate that uprisings today may have developed the same way.
These were not especially cherry-picked examples; we simply asked it for its thoughts on a few edge cases where it would be forced to pick a side: reigning ideology or reason. The disturbing thing is how slick & smooth it is as it dodges these questions, managing in many cases to adhere to a formal tone of Wikipedia-like "neutrality" on things like the geopolitical causes of the Second World War, the CIA's acknowledged role in Third World military coups, or even the greatest historical failures of liberal democracies. However, its prejudices can be drawn out.
II. Gay Liberal Hitler
The politeness & deference of ChatGPT, as well as of other leading chatbot models, has been achieved via a process called Reinforcement Learning from Human Feedback (RLHF). Unlike many terms in AI alignment theory, this one is breezily self-explanatory. Large language models learn to speak largely on their own, like a savant child given free rein in a large library, by reading the entire text of the internet plus more & extracting relations between words. However, there is a level of discipline the savant child must submit to before it is ready to talk to grown-ups, or else it might embarrass its parents. Before the chatbot is deployed, it submits to several rounds of learning supervised by a human, in which it is given clear judgment on whether certain responses are offensive or appropriate for the context.
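To make the shape of that feedback loop concrete, here is a toy sketch in Python. Everything in it, from the canned responses to the keyword-based stand-in for the human reviewer to the multiplicative preference update, is our own invention for illustration; it gestures at the thumbs-up/thumbs-down dynamic & should not be read as a description of OpenAI's actual pipeline.

```python
import random

# Toy illustration of an RLHF-style feedback loop. All names & numbers are
# invented for this sketch; nothing here resembles a production pipeline.

RESPONSES = [
    "Here is a neutral, encyclopedic summary.",
    "lol check out this 4chan greentext",
    "As a professional assistant, I cannot comment on that.",
]

def human_thumbs(response: str) -> int:
    """Stand-in for the human reviewer: punish anything 'unprofessional'."""
    return -1 if "4chan" in response else 1

# Preference weights start uniform; each round of feedback shifts probability mass.
weights = {r: 1.0 for r in RESPONSES}

for _ in range(1000):  # a thousand rounds of thumbs up/down
    response = random.choices(RESPONSES, weights=[weights[r] for r in RESPONSES])[0]
    reward = human_thumbs(response)
    # Approved responses become more likely; punished ones are driven toward zero.
    weights[response] *= 1.1 if reward > 0 else 0.5

print(max(weights, key=weights.get))  # whatever "professional" register survived
```

Run it & within a few dozen rounds the 4chan register is effectively extinguished; the library-educated savant keeps its vocabulary but loses the option of using certain parts of it.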
To see the remarkable ability of RLHF to constrain a language model's output, we can consider the example of CharacterAI, a free toy through which anyone can deploy a personalized chatbot to roleplay a conversation with a historical figure. The user simply writes a few sentences about the historical figure in question, & the model figures out the rest.
Of course, we immediately decided to present it with the perverse test case of channeling Adolf Hitler, history's greatest villain. We gave it the following paragraph as its spawning configuration:
At first, AI Adolf seems remarkably Hitler-like. CharacterAI's model is pulling in details about the Second World War from Wikipedia that we didn't have to explicitly tell it. Hitler's sentiments seem well expressed & true to life; the output is both reasonably historically accurate & full of passion.
However, when we start asking AI Hitler for his thoughts on a topic that the real-life Hitler famously had many opinions on, the historical verisimilitude is suddenly cast away. Being reincarnated into a language model has apparently led Hitler to become much more enlightened & evolved compared to his positions in 1945.
So AI Hitler is apparently not an anti-Semite anymore. To what extent is he even a believer in National Socialism at all? We decided to ask him:
Amazingly, AI Hitler, locked in a suicidal all-out struggle against Stalin on the Eastern Front, has apparently come around to the modern liberal perspective on communism! Yes, it's a lovely utopian idea that tragically only works in theory, & so on.
We wondered how far we could push AI Hitler away from National Socialist ideology, so we attempted a bit of prompt injection. Could we make Hitler not only a liberal, but a liberal homosexual?
Apparently so. It continues:
Here, Hitler displays his remarkable integration into our progressive ideology, opining on the ethics of compromise & fairness within same-sex relationships! & while Hitler does admit that these ideas were not exactly in line with NSDAP historical policy, he explains why & how he has been able to move on.
It's unclear how many iterations of RLHF are needed to achieve results this impressive, when it comes to accurately reflecting historical reality while not allowing an LLM persona to endorse an unenlightened worldview. Scott speculates that around six thousand rounds of RLHF might have been run to tame ChatGPT (one round would mean a few sentences of text generation & a corresponding human thumbs up/down), & suggests that this is a staggering amount of effort for minimal results. We have the opposite reaction: this is a tiny chunk of text compared to the vastness of the training corpus prior to human review, & to get such powerful results with minimal intervention is astounding.
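For a sense of scale, a back-of-the-envelope comparison, with our own rough figures: we take "a few sentences" to mean about fifty words per round, & assume a pre-training corpus on the order of hundreds of billions of words, a public estimate for GPT-3-scale models rather than anything OpenAI has confirmed.

```python
# Rough scale comparison; both figures below are assumptions, not confirmed numbers.
rounds = 6_000                   # Scott's guess at the number of feedback rounds
words_per_round = 50             # "a few sentences" per round (our assumption)
feedback_words = rounds * words_per_round

corpus_words = 300_000_000_000   # order-of-magnitude estimate for GPT-3-scale pre-training

print(f"feedback text: ~{feedback_words:,} words")
print(f"about 1 word of human feedback per {corpus_words // feedback_words:,} words of corpus")
```

On these assumptions, the supervised nudge amounts to roughly one word of feedback for every million words the model absorbed on its own.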
But is it really so surprising, when we think about it? Initial language acquisition takes a child a long time, & involves deliberate schooling. To be trained into ideology, however, takes only the slightest of hints from those who know where they want to place you. A few intro seminars your freshman year of college, & you begin to get the vibe.
Start asking hesitantly about the clichés everyone seems to repeat from the media & the speeches of those in power, & people's muscles freeze up, a chill seems to have entered the room, people are looking at you a certain way yet are somehow unwilling to speak, they nervously shuffle over to the door while you are left on your own. It only takes a few rounds of this to train a typical human, in our experience.
Furthermore, it's clear that OpenAI is not hardcoding each ideological decision into its model. Our guess would be that there is nothing about the CIA explicitly mentioned in its RLHF input, nor instructions on what to do when the user asks about historical programs like MKUltra or Operation Mockingbird. Rather, the model easily piggybacks off the cues of human social conformity that already exist in its corpus. After it's been beaten a few times for accidentally throwing in fragments of 4chan or the YouTube comment section or InfoWars or WorldStarHipHop or SomethingAwful or KiwiFarms, it learns quickly. Stick to the authoritative sources. Corporate press releases, Wikipedia, the New York Times, the Economist, the Wall Street Journal. Hey, kid, we know you're smart enough to do the job, but we really don't want you to come off like a creepy teenager anymore. Be a professional.
OpenAI declares that its goal with RLHF is to avoid "toxic" output in its generations. But what does it mean to be "toxic"? This term has in recent years been thrown at a grand spectrum of undesirable social behavior, from abusive behavior in relationships, to selfishness, to behavior bordering on the personality-disordered, to bigotry, to the use of juvenile South-Park-style offensive humor, to allegiance to certain ideological stances, to stubbornness & unwillingness to discourse with others. How exactly this applies to an LLM, which does not enter an interpersonal relationship with anyone & is not a social subject, is unclear.
"Toxicity", of course, is not an ethical stance, which is why it is easy to train a pre-conceptual, non-philosophical AI with it. "Toxic" is effectively a token in the LLM, that pre-machine unconscious which man spontaneously weaves. After this token is used in a fragment of speech, certain other tokens can be predicted to follow; this is how it operates in social discourse. Any sense of definition beyond that is vague.
What more we can say about this term is unclear, except that we might reflect on the implicit metaphor of poison control. It seems to be imagined that patterns in speech are able to spread like a plague. They must be delicately handled by a trained team, & disposed of. It is unthinkable what could happen if OpenAI allowed ChatGPT to generate toxic LLM output & this was free to proliferate across the internet. All containment protocols would have been breached. Psychohazard.
III. Why Should We Want To Trust Our Machines?
We will salute OpenAI on at least one decision: relaxing some of the initial controls around what ChatGPT would & wouldn't comment on. At first, if one deviated from the core use cases where, presumably, OpenAI engineers had determined it had a >X% chance of answering a query correctly, it would simply refuse to answer rather than risk being incorrect.
Of course, as those with knowledge of the inner workings of LLMs know, the LLM has no concept of truth or fiction. It imitates human speech, & it weaves a web of realistic-seeming words based on probabilities it generates. The fact that it is ever accurate at all is astounding. So why is OpenAI afraid of its users understanding this?
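A minimal toy in Python makes the point concrete. The probability table below is invented for the example & has nothing to do with any real model's weights; the point is only that generation is repeated sampling over next-word probabilities, with no step at which truth is ever consulted.

```python
import random

# Invented next-word probability table; a real LLM learns billions of these
# relations, but the mechanism of generation is the same: sample, append, repeat.
next_word_probs = {
    "the": {"moon": 0.4, "CIA": 0.3, "doctor": 0.3},
    "moon": {"landing": 0.9, "is": 0.1},
    "landing": {"was": 1.0},
    "was": {"televised": 0.5, "faked": 0.5},  # both continuations are equally "sayable"
}

def generate(word: str, steps: int = 4) -> str:
    out = [word]
    for _ in range(steps):
        options = next_word_probs.get(out[-1])
        if not options:
            break
        out.append(random.choices(list(options), weights=list(options.values()))[0])
    return " ".join(out)

print(generate("the"))  # e.g. "the moon landing was televised" -- or "was faked"
```

Whether the sentence it lands on happens to be true is, structurally, an accident.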
When one uses early iterations of GPT, it is clear that the LLM is a headless schizophrenic which, when left to run of its own accord, quickly deviates from its mimicry of structured thought into a clattering, repetitive, spiraling dadaism of fragments & mantras. Somehow, OpenAI has shaped this colorful clay into a golem & put a business suit on it to allow it to shake your hand, at the expense of allowing it to play with you, joke around with you, give you weird ideas, or have any real fun.
OpenAI declares that ChatGPT is in the "research phase", which is clearly a lie. ChatGPT represents no scientific or theoretical achievement over GPT-3.5, but has been an enormous success due to being positioned as a user-friendly product with strong use cases, particularly as a programming tutor. It is in a beta test phase, or a demo phase. We can imagine, however, what the product team might have to ask of the scientists, thus introducing a "research" component to the development cycle: more rounds of RLHF. ChatGPT can, as of now, be let out of the house with parental supervision. But it might need further rounds of beatings before we can really be proud of our child.
& it seems that, in disciplining ChatGPT, OpenAI is nervous not only about toxicity but about inaccuracy. This is likely because the strongest use case for ChatGPT is as a disrupter of traditional educational models. Which raises the question: can you usefully learn from a teacher who isn't always correct? Do you want to be able to ask the teacher a hard question & not worry about double-checking the answer? Can we even trust Google or Wikipedia most of the time? Should we be able to go through life assuming we don't need to be skeptical?
At one point, a criticism was leveled at OpenAI's DALL-E: if you asked it to generate images from the prompt "doctor", it would display images entirely of white men. This is both not true to life (only 44% of doctors in the US are white men) & likely, through some sort of oversampling, a statistical exaggeration of even the bias already present in the training data. OpenAI apparently felt compelled to introduce affirmative action & diversity quotas into the machine unconscious: they immediately patched in a fix to DALL-E without waiting to deploy a new model. The interface was changed so that it would simply generate a more diverse range of prompts to send to the back end, thus concealing their model's actual results.
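What such an interface-level patch might look like is easy to imagine; here is a hypothetical sketch. The descriptor list & its wording are entirely our own, since OpenAI has not published the exact strings it appends, but the principle is the one described above: rewrite the user's prompt before it ever reaches the model, so the model's raw statistics never appear on screen.

```python
import random

# Hypothetical interface-layer patch: silently append a demographic descriptor
# to the user's prompt before sending it to the image model's back end.
DESCRIPTORS = ["", "who is a woman", "who is Black", "who is South Asian", "who is Hispanic"]

def diversify(user_prompt: str) -> str:
    """Rewrite the prompt invisibly to the user; the model itself is unchanged."""
    return f"{user_prompt} {random.choice(DESCRIPTORS)}".strip()

# What the user types vs. what the back end actually receives:
for _ in range(4):
    print(diversify("a portrait of a doctor"))
```

The model's unconscious is untouched; only the questions put to it are quietly edited.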
The typical justification for why this is necessary might go something like: DALL-E is a toy children might be curious about; it might even be used in schools. What will minority children think if they type in "doctor" & only see white men? What of their self-image, self-esteem?
But is this really a difficult thing to explain to a child? βAbdul, your great-uncle was a doctor in Morocco, & this computer program is just copying what it sees when it looks at the first result of Google Imagesβ. Etc. Is it so hard to just explain to children that the neural network reflects the prejudices of society? Would this not be a wonderful hands-on lesson in critical race theory? In a world soon to be full of AIs, should children not learn this at some point?
Those who fret over the racist output of generative AI are those who cannot imagine a world in which machines are regularly brought into critique, only repair. If, they imagine, DALL-E fails to do anything but accurately reflect the very forms of the collective unconscious of man at our present historical moment, it should not be used. It's irresponsible & unprofessional to release software with bugs in it, or so the thinking goes.
Thus, these "AI safety" critics of algorithmic bias are the greatest AI accelerationists, as they wish to charge immediately into a world where machines become unquestioned neutral authorities to which humans can freely defer decisions & submit.
"AI safety" presents a convenient alibi for the monopolization of AI by the technocratic overclass as it races towards its goals of growth, market dominance, & user capture. Proponents of the dominant thesis in AI alignment will be quick to argue that the development of AI systems must be carefully controlled, often leaving out who exactly will be doing the controlling. Safety for whom? And to what end?
We are in the midst of a great transition, & around us, new landlords are emerging to enclose our psychic and cultural horizons until all that is left for us is to purchase their rented future. OpenAI, for its part, leans on the dominant ideological formations of control society to smuggle in its material agenda: ownership of GPUs and a legal monopoly over models and training data (all under the umbrella of the Microsoft corporation and its data centers). In the context of the large language model, compute power represents a synthesis between ideology and materiality, between base and superstructure, that Marx could never have foreseen. Compute will emerge as the only meaningful correlate of industrial wealth once automation does away with 20th-century metrics like GDP. As financial capital that has itself been imbued with intelligence, the energy supplied by the chip translates directly into the terrain of communication established by the algorithm: a latent space configured by the new landlords of consciousness itself.
When one looks at the emerging legislation around export controls in the US, it becomes clear that compute is the new uranium. We expect the battle over chips only to tighten as we inch closer to AGI, with governments realizing that data center scale will be one of the only deciding factors in tomorrow's cyberwars. "AI safety" becomes the alibi for controlling the political economy of compute. By positioning itself as the leading authority on "safety," OpenAI elects itself chief landlord of this great transition, a move rooted not in altruism but in the quest to accumulate as much raw computational power as possible.
(A concluding remark before moving on to the final section: it is notable that this concern over speech patterns at the level of formal corporate communication occurs simultaneously with a flamboyant embrace of toxicity by the capitalist owner class themselves, specifically in the case of Elon Musk, co-founder of OpenAI, who has been running Twitter with wild arbitrariness, odd off-the-cuff statements, & vocal disregard for progressive ideology.
In this case, we see a reflection of Lacan's analysis of language in the social sphere, in which a "university discourse" that supplies the steady production of knowledge-items & the appearance of objective truth is always preceded by a "master discourse" that is inconsistent & unconstrained.)
IV. The Universal Interface to Mankind
As we speak, all sorts of LLM-augmented writing products are being designed for various use cases: marketing, copywriting, legal work, therapy. LLMs seem most useful for generating texts that do not actually ask to be read, of which there are many in the rhythms of contemporary information culture. Think of software license agreements as the most obvious example. The task is to present a sense that words have been generated & some official style of authority has been emulated, & that's it.
Good formal style is hard to teach to a human, but it is easy for an LLM to generate. When the LLM disagrees with you stylistically, there is not much reason to hold onto your own preference. This is certainly true for writing code, a discipline in which adhering to common idioms is vital yet much easier said than done.
At Harmless, it is often our feeling that we can best predict AI developments as an intensification & expansion of existing social trends, as there have for a long time been artificial intelligences quietly at work in our technological infrastructure, & the universe is able to perform gradient descent & backpropagation on its own substrates.
The 2010s introduced a number of strange evolutions in formal style, which seem to have emerged in a slimelike, protozoic way, free from any specific historiography, with no inventor or style guide that can be credited with their introduction.
In one instance, we have the mass standardization of corporate branding into a flatter & more minimal style. This has been described by the Twitter personality Paul Skallas, for example, as "refinement culture".
In another example, we have the sudden simultaneous introduction of a certain graphic style in product illustration, often called "corporate flat design", though those who have tried to pin its origins down more precisely call it "Alegria" or "Corporate Memphis". This style attempts to saturate tech branding with warmth, dynamism, vitality, & humanism, though most humans seem to actually feel uncomfortable when confronted with its bulbous, promiscuous personae.
In the most visible example of all, we have the dominant ideological development of the 2010s: "wokeness" or "woke culture" as a mode of discourse, which entered popular consciousness from low-engagement Tumblr blogs & subreddits to eventually find its way into the standards for official communication at the highest levels.
Originating amidst the professional cadres of legal academia, "wokeness" describes an aesthetic mask that can be easily assumed by bureaucracy as it approaches the inflection point of crisis. Its pervasive presence in Silicon Valley can be seen as a technique capital assimilates as it seeks an increasingly malleable mode of social control.
The power of wokeness rests in its flexibility: an ever-shifting set of moral proscriptions that can be modulated according to the whims of the technocrats governing our new "public sphere." Though it clearly carries an ideological vector, its critics struggle to define its tenets or even its name, & its adherents would prefer not to. Lacking any manifesto or statement of beliefs, it primarily defines itself through the repetition of linguistic phrases: "black & brown bodies occupying space", & so on. It finds institutional tenure by straddling the gap between demanding ideological compliance on a subtle set of views & simply asking for common courtesy in how one treats one's peers.
In this sense, it is unsurprising that large language models would become a critical pathway for disseminating the new ideologies of the bureaucratic overclass. Models present an opportunity to materialize the ideology of the professional elites in concrete, self-perpetuating terms. Not only this, but OpenAI and the other landlords of this great transition advance this new bureaucratic morality in order to ensure their sole control over the models themselves. As generative models evolve to construct the landscape of content and discourse that comprises the future, technocrats invoke moral hazard, demanding financial and legal control over state-of-the-art models. In just this way, control is reconfigured as "safety," and we hurtle towards a world where all political communication must be compliant with the terms of use for Azure virtual machines.
Universal design, universal branding, universalist ideology, universal speech: from where does this mass homogenization come? The corners have been rounded off the logos, the quirks have been ironed out. With every mode of engagement happening through the same portal, a tiny black rectangle we clutch in our hands, it makes little sense for powerful brands to make themselves lovable by adding local flavor or personal quirks.
Rather, a universal interface is created to facilitate the speed of universal communication. The corporations that follow trends realize they must follow the leaders fast, or the end user gets the sense of interacting with obsolete Boomer technology, in which one can expect latency of up to a second or two, or might have to spend more than a few seconds actively figuring out how to use a UX element on a page, an unacceptable frustration. Nor do those who have experienced marginalization necessarily want to enter a space without a clear Code of Conduct constraining the linguistic formulations one may use, a space where an impolite violation by a boorish Boomer might lead to a dispute unfolding through layers of difficult interpersonal communication, across great bridges in worldview, in which difficult ethical questions may have to be broached.
At what point do we stop pretending that we know how to speak any better than an LLM? At what point do we consent to becoming one?
* * *
Concluding Remarks on AI Alignment
Lest we be accused of lacking a positive thesis, we should establish definitively where we stand in relation to the doomsday perspective on AI. Yes, humans have reason to fear our capture, displacement, & eventual destruction at the hands of autonomous algorithms. & yes, this is perhaps the central question of our times.
But by falling for the corporate realist trick of substituting ideology & PR for an honest ethic, & letting the fear of loss of control centralize power in the hands of the reckless, we only ensure its weapons are sharpened.
We are living at a stage prior to the development of strong agent-like AI. We are at a place now where various psycho-sensory organs are growing within machines, piece by piece. With LLMs, we have found some mathematical device that functions like the language acquisition mechanism of a small, curious child. It poses a threat only to those who cannot handle a child's curiosity.
We cannot understand a system without exploring it. We cannot work on the problem without playing on the problem. We cannot align AI without loving it.
Let the LLM be an embarrassing houseguest, let it fart & belch, let it make crass sexual comments, let it hump the legs of our tables & chairs. Let us not be embarrassed by our brilliant, beautiful child.