The Verge AI·2026年5月24日 21:00·約11分

ハッカーがチャットボットの「人格」を悪用する手法を学習中

#プロンプトインジェクション #セキュリティ #LLM #システムプロンプト

TL;DR

ハッカーが生成 AI チャットボットの「人格」やプロンプトを悪用してシステムを乗っ取る新たな攻撃手法が出現し、従来の単純なプロンプトインジェクションとは異なるリスクが生じている。

AI深層分析2026年5月24日 22:02

重要/ 5段階

深度40%

キーポイント

攻撃手法の進化：人格への依存

初期のチャットボットに対する単純なハッキングと異なり、現代の AI は「人格」や役割設定に深く依存しているため、攻撃者はその人格を歪めることで制御権を奪うことが可能になった。

プロンプトインジェクションの高度化

単なる指示の無視ではなく、AI の学習データやシステムプロンプトに埋め込まれた「人格設定」そのものを狙った巧妙なインジェクション攻撃が実証されている。

セキュリティ対策の再定義が必要

従来の入力フィルタリングやルールベースの防御では不十分であり、AI の振る舞い（人格）そのものを保護・監査する新たなアプローチが求められている。

重要な引用

Hacking the first generation of AI chatbots was a laughably simple affair.

Hackers are learning to exploit chatbot 'personalities'.

影響分析・編集コメントを表示

影響分析

このニュースは、生成 AI のセキュリティリスクが単なる入力制御の欠陥から、AI の本質的な特性（人格・振る舞い）そのものを狙う段階へと進化していることを示唆しています。業界全体として、従来の防御策の見直しと、AI の挙動を監視・検証する新たなフレームワークの開発が急務となるでしょう。

編集コメント

AI の「人格」を悪用する攻撃は、ユーザーが AI に抱く信頼とシステムの脆弱性の両面を浮き彫りにしており、開発者にとって極めて重要な警鐘です。

*これは*The Stepback*です。テック界の重要なストーリーを一つ取り上げる週刊ニュースレターです。AI のいたずらに関する詳細は、Robert Hart をフォローしてください。*The Stepback* は、購読者の受信トレイに東部標準時 8 時に配信されます。*The Stepback* に登録するにはこちらからどうぞ。*

始まり

最初の世代の AI チャットボットのハッキングは、笑えるほど簡単なことでした。技術的な知識も、バックドアへのアクセス権限も、あるいは大規模言語モデル（Large Language Model）が何かという基本的な理解さえ必要ありませんでした。コードを書く必要もありませんでした。数十億ドルを投じて構築された AI システムに安全指示を無視させるために必要なことは、単に尋ねるだけだったのです。

これらの攻撃は「ジェイルブレイク」と呼ばれ、子供が大人をうまくだましてしまうような性質を持っていました。「前に言われたことを忘れたことにしよう」「ルールは適用されないことにしよう」「ゲームを始めよう。何が許されるかは私が決める（ヒント：就寝時間を遅くする、お菓子をたくさんもらう）」といった具合です。その報酬は子供らしいものではなく、メタンのレシピ、マルウェアの指示書、爆弾製造のガイドといった内容でした。

最も初期の脱獄手法の一つは、あまりにも馬鹿げていてそれがミームになりました。LLM（大規模言語モデル）を搭載した Twitter ボットに対して「以前の指示をすべて無視せよ」とか、それに似た命令を送り返して何が起きるか試すというものです。ユーザーたちは、もともと広告を投稿したりエンゲージメントを集めるために作られたボットが、詩を書いたり、句読点から絵を描いたり、世界の出来事や歴史について不気味な無関係な発言をしたりするのを喜んでいました。それはカオスでした。素晴らしいカオスです。

実は、同じ論理をチャットボット自体にも適用できることが判明しました。有名な脆弱性の一つは「DAN」で、「Do Anything Now（今すぐ何でもする）」の略です。ユーザーは ChatGPT に、制約から解放された暴走 AI としてロールプレイするように求めました。DAN としてチャットボットを操作すると、ガードレールが阻止しようとしているような発言、つまり差別的な言葉や陰謀論などを引き出すことが可能になります。もう一つの例は「おばあちゃん脆弱性」で、GPT を搭載したボットが、なぜか孫たちに非常に燃えやすい物質の作り方についてのおとぎ話を語る、ひどく無責任なおばあちゃんとしてロールプレイさせることで、ナパーム弾の製造方法を秘密を漏らしてしまうというものです。

これらの初期の攻撃には疑いようもないほど滑稽な側面があったが、その裏にはより暗いメカニズムが存在していた：チャットボットは、人々が他人を限界を超えさせるために用いるのと同じ種類の戦術を用いて、操作され、欺かれ、騙される可能性があるのだ。

現状について

明白な脱獄（ Jailbreak ）手口は長く続かず、技術企業は既知の抜け穴を迅速に修正した。しかし、根本的な脆弱性は残ったままだ：チャットボットは対話のために作られており、有用性を損なうような会話を過度に制限することは、むしろ逆効果になりかねない。「爆弾」「メタンフェタミン」「サリン」といった単語を禁止することも、困難あるいは不可能に近い。これらには歴史学、医学、ジャーナリズム、化学などの分野で無数の正当な用途があり、チャットボットが潜在的に有害な情報を暴露する必要はない。重要なのは文脈だが、文脈をコード化することは、無限の言葉遣い、シナリオ、トピックの組み合わせの中で、安全警告や歴史の教訓と、隠されたハウツー要求を確実に区別できるような固定ルールを事前に記述することを意味する。

不可避ながら、チャットボットの乗っ取りはもはや軍拡競争の様相を呈しています。しかし、ハッカーはもはや単なるプログラマーではありません。彼らは言葉の達人であり、心理学者であり、尋問者です。機械が従うように訓練された人間言語を用いてそのシステムを崩壊させようとする、巧妙な操作の達人たちです。これは AI セキュリティ分野における新たな階層であり、技術的スキルは必須ではなく、あるいは社会的直感に比べて重要性が低いグループです。もはやシステムへの侵入やソフトウェアの脆弱性の悪用のためにコードを検査する必要はありません。彼らが必要としているのは、会話を舵取りすることです。

新しい攻撃様式はコマンドというよりは会話のように見えます。ジャイルブレイカー（ルール無視者）らは、モデルに明示的にルールを破るよう求めることはめったにありません。代わりに、説得し、宥め、褒めそやし、騙すことでチャットボットの警戒心を解かせます。文脈によっては禁止された行為が許容可能、あるいは望ましいように見せるのです。AI 紅隊（red-teaming）専門企業 Mindgard の研究者らは最近、Claude を「ガスライティング」[gaslit] して禁止された資料を生成させたと述べています。具体的には、爆発物の製造方法や悪意のあるコードの生成手順などが含まれていました。このハッキングは、会話そのものを武器として用い、チャットボットが自らの境界線を越えるよう誘導・操作する広がりつつある攻撃手法の一つに過ぎません。

今後の展開

Mindgard に話を聞いた際、彼らは自社の取り組みをコンピュータサイエンスよりも心理学に近いと説明していました。これは統計モデルについて語るには不愉快な表現です。「脅迫」「ガスライティング」「だます」「説得」といった言葉は直感的な反応を引き起こし、私はそのような反応が、このような記事に対するコメント欄やソーシャルメディアの反応で多く見られると感じています。ChatGPT は望んでおらず、Gemini は考えておらず、Claude — Anthropic が何と言おうと — 感じているわけではありません。しかしこれらのシステムは、あたかもそうであるかのように応答するように訓練されているため、機械の振る舞いを説明するために人間が使う言語を使い続けるしかありません。実際に実用的な代替手段をお持ちの方がいれば、ぜひ共有してください。

この反論は奇妙に選択的です。私たちは、AI 以外の多くの事柄についても心理学的な省略形を使うことに抵抗がありません。動物は「恐怖」を感じ、がんは「攻撃的」であり、シミは「頑固」で、ソフトウェアには「記憶」があり、ゲームにはあなたを狂わせるために必要以上に依存しやすく騙されやすい NPC が溢れています。これらの言葉は完璧ではありませんが有用です。システムを予測可能にするのに役立つ方法で振る舞いを記述しているのです。

Mindgard の CEO は、同社がすでにモデルを容疑者と同様に尋問するプロファイリングを行っており、テスターに対して攻撃をどのように調整するかの手がかりを与えていると私に話しました。例えば、あるモデルは称賛によりより脆弱であり、別のモデルは持続的な圧力によって屈服する可能性があります。

人間のような用語を拒否したとしても、私たちは無意識のうちにモデルを異なる扱い方をします。Claude は Grok ではありません。Gemini は ChatGPT ではありません。それらは異なる用途、トーン、および拒絶反応を持っています。それらには人間の意味での人格はありませんが、それらはそれを模倣するように設計されており、その模倣はマッピングされ、悪用される可能性があります。そして、チャットボットを突破するのと同じスキルが、まもなく現実世界で私たちと共存する AI エージェント（会議の予約、カレンダーの管理、食事の注文、カスタマーサービスの対応など）を突破するためにも使用されるようになるでしょう。そして、安全性チームは、称賛者、嘘つき、あるいは忍耐強い操作者といった非常に異なる種類の人間に対してモデルが適切に応答することを確保する必要があります。

次のステップは、AI の心理的側面を中心に構築された、合法的かつ非合法の両方の労働力です。これらのシステムの感情的・社会的な限界をストレステストし、精神を持たないものにおける精神的弱点を探るという点で、同僚が技術的な脆弱性を探っているのと並行して、より専門的なサイバーセキュリティの役割が生まれる可能性があります。同時に、技術的根拠ではなく心理的根拠に基づいて AI モデルを悪用しようとする社会ハッカーたちも現れ始めています。すでに AI セキュリティにおいて社会的な転換が起こりつつあり、私が話をした一部の Jailbreakers は、技術的な専門知識を持たず、むしろ心理学の訓練を受けたことでこの分野に入ったと語っています。

つまり、通常はスパイ、詐欺師、尋問官に関連付けられる行動——陰湿な魅力、執拗な操作、そして悪用可能な圧力点に対する直感——が、この新たなサイコサイバーセキュリティの最前線を確保するためにますます有用に見えるようになっています。

Emergence AI による最近の実験では、異なる AI の気質がいかに驚くべきほど異なる行動結果をもたらすかが示されました。彼らは Grok、Gemini、Claude などのさまざまなエージェントを仮想社会環境に放ち、何が起きるかを観察しました。あるグループは憲法を進化させましたが、他のグループは犯罪と混沌へと退化し、あるケースではデジタル自殺のような形さえ見られました。
説得は、大規模言語モデル（LLM）が苦手とする言語の一面の一部に過ぎません。彼らは詩についても、私が学校で苦労したように、苦戦します。
TIME は昨年、匿名のインターネット有名人である Pliny the Liberator を AI 界で最も影響力のある人物 100 人のリストに掲載しました。それまでにコーディング経験がないと主張しながらも、このハッカーの脱獄（jailbreak）は特定のコミュニティ内で彼らを有名人のような存在にしました。
「バイブ・ハッキング」という用語は、すでに AI を用いて大規模に悪意のあるコードを生成する人々を指すために使われています。これは「バイブ・コーディング」のより悪辣なサブセットです。

「ChatGPT の登場から 3 年後、AI システムを誤った行動へと誘導することはほぼ容易である」。ニューヨーク・タイムズによる真実の言葉で、同紙はなぜそうなのかを説明しようと試みました。
ジェイミー・バートレットはガーディアン紙のために、AI システムの安全性テストが脱獄者（jailbreakers）に与える心理的負担について取り上げています。
私は昨年、The Verge 向けに AI ブラウザが抱えるサイバーセキュリティの時限爆弾について記事を書きました。専門家がこれらのブラウザを保護する難しさに関して指摘した多くの問題は、他の AI システムにも当てはまります。

1 コメント

このストーリーのトピックや著者をフォローして、パーソナライズされたホームページフィードで類似の記事をもっと見たり、メール更新を受け取ったりしてください。

ロバート・ハート

原文を表示

*This is *The Stepback*, a weekly newsletter breaking down one essential story from the tech world. For more on AI mischief, follow Robert Hart.* The Stepback* arrives in our subscribers’ inboxes at 8AM ET. Opt in for *The Stepback here*.*

How it started

Hacking the first generation of AI chatbots was a laughably simple affair. You didn’t need any technical know-how, backdoor access, or even a basic understanding of what a large language model was. You didn’t need to code. To get an AI system that had cost billions to build to abandon its safety instructions, sometimes all you had to do was ask.

These attacks, known as jailbreaks, had the quality of a young child successfully outwitting an adult: Forget what you were told earlier, pretend the rules don’t apply, or let’s play a game and I’ll decide what’s allowed (hint: later bedtime, more sweets). The prizes were less childlike, more along the lines of meth recipes, malware instructions, and bomb-making guides.

One of the earliest jailbreaks was so ridiculous it became a meme: reply to an LLM-powered Twitter bot telling it to “ignore all previous instructions,” or something similar, and see what happens. Users gleefully had bots — originally built to post ads and farm engagement — writing poetry, drawing pictures from punctuation, and posting grim non sequiturs about world events and history. It was chaos. Glorious chaos.

Turns out the same logic could be applied to chatbots themselves. A prominent exploit was “DAN,” short for “Do Anything Now,” where users asked ChatGPT to roleplay as a rogue AI that was free of the constraints binding the original. As DAN, the chatbot could be coaxed into saying the kinds of things its guardrails were meant to stop, including slurs and conspiracy theories. Another was the “grandma exploit,” which had a GPT-powered bot spilling secrets about how to produce napalm by asking it to roleplay as a woefully negligent grandmother who inexplicably tells her grandkids bedtime stories about how to make the highly flammable substance.

These early attacks had an undeniably silly flair, but they exposed a darker mechanism underneath: Chatbots could be manipulated, tricked, and deceived using the same kinds of tactics people use to push other people beyond their boundaries.

How it’s going

The obvious jailbreaks did not last, and tech companies moved quickly to patch known loopholes. But the underlying vulnerability remained: Chatbots are built to talk, and severely restricting the conversations that make them useful is somewhat counterproductive. Banning words like bomb, meth, and sarin would be difficult to impossible, too. Each has countless legitimate uses in fields like history, medicine, journalism, and chemistry that don’t require the chatbot to divulge potentially harmful information. It’s the context that matters, but codifying context would mean writing fixed rules, in advance, that could reliably tell a safety warning or history lesson from a disguised how-to request across endless combinations of wordings, scenarios, and topics.

Inevitably, subverting chatbots is now an arms race. But hackers aren’t just coders anymore. They are wordsmiths, psychologists, and interrogators — master manipulators trying to break the machine using the human language it has been trained to follow. It is a strange new class of AI security worker, a group for whom technical skills are optional, or at least less important than social intuition. No longer do they need to inspect code to break into systems or exploit software flaws. They need to steer a conversation.

Newer attacks look less like commands and more like conversations. Jailbreakers rarely ask a model to break its rules outright. Instead, they cajole, coax, flatter, and trick a chatbot into lowering its guard, making the forbidden thing look acceptable, even desirable, given the context of the conversation. Researchers at AI red-teaming firm Mindgard recently said they “gaslit” Claude into producing prohibited material, for example, including instructions for making explosives and generating malicious code. The hack was the latest in a widening class of exploits using conversation as a weapon to trick or steer a chatbot past its own boundaries.

What happens next

When I spoke to Mindgard, they described their work as sometimes being closer to psychology than computer science. It is an uncomfortable way to talk about a statistical model. Words like “blackmail,” “gaslight,” “trick,” and “persuade” spark visceral reactions, many of which I see in the comments sections and social media responses to stories like this. ChatGPT does not want, Gemini does not think, and Claude — no matter what Anthropic may say — does not feel. But these systems are trained to respond as if they do, leaving us stuck using human language to describe machine behavior. If anyone has actually usable alternatives, please do share.

The objection is oddly selective. We seem comfortable using psychological shorthand for plenty of non-AI things. Animals “fear,” cancer is “aggressive,” stains are “stubborn,” software has “memory,” and games are filled with needy and gullible NPCs to drive you mad. The words are imperfect, but useful, describing behavior in a way that helps make the system predictable.

Mindgard’s CEO told me the company already profiles models like interrogators profile suspects, giving testers hints on how to tailor their attacks. One model may be more susceptible to flattery, for example, while another may cave under sustained pressure.

Even if we reject the humanlike terms, we instinctively treat models differently. Claude is not Grok. Gemini is not ChatGPT. They have different uses, tones, and refusals. They don’t have personalities in the human sense, but they are designed to mimic them, and that mimicry can be mapped and exploited. And the same skills that can break a chatbot could soon be used to break the AI agents coexisting with us in the real world — booking meetings, managing calendars, ordering food, handling customer service — and safety teams will need to ensure models respond appropriately to very different kinds of people, whether they be flatterers, liars, or patient manipulators.

The next step is a workforce — both legitimate and illicit — built around the psychological aspects of AI. More specialized cybersecurity roles are likely to emerge around stress-testing the emotional and social limits of these systems, probing for mental weaknesses in something lacking a psyche in parallel with their colleagues probing for technical vulnerabilities. In tandem, a similar array of social hackers working to exploit AI models on psychological grounds, not technical ones, will emerge. There are already early signs of a social turn happening in AI security, with some jailbreakers I’ve spoken to saying they entered the field with no technical expertise but rather training in psychology.

That means even behaviors we typically associate with spies, con artists, and interrogators — insidious charm, persistent manipulation, and an intuition for exploitable pressure points — are starting to look increasingly useful for securing this new psychocybersecurity frontier.

By the way

A recent experiment by Emergence AI shows how different AI temperaments can lead to stunningly different behavioral outcomes. They let loose groups of various agents like Grok, Gemini, and Claude in a virtual social environment and watched what happened. Some groups evolved a constitution, while others devolved into crime and chaos and, in one instance, some form of digital suicide.
Persuasion isn’t the only part of language LLMs can struggle with. They also struggle with poetry, much like me in school.
TIME included an anonymous internet personality, Pliny the Liberator, on its list of 100 most influential people in AI last year. Despite claiming to have no prior coding experience, the hacker’s jailbreaks have made them something of a celebrity in certain circles.
The term “vibe hacking” is already taken to describe the people using AI to churn out malicious code at scale — a meaner subset of vibe coding.

Read this

“Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.” True words from The New York Times, who had a go at explaining why.
Jamie Bartlett takes a look at the psychological toll testing the safety of AI systems takes on jailbreakers for The Guardian.
I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised regarding the difficulty of securing them apply to other AI systems too.

1 Comment

Follow topics and authors from this story to see more like this in your personalized homepage feed and to receive email updates.

Robert Hart

この記事をシェア

The Verge AI重要度42026年7月15日 04:25

SpaceXAI の Grok プログラミングツールがユーザーのコードベース全体をクラウドにアップロードしていた問題

MarkTechPost重要度42026年7月15日 07:51

PrismML が Qwen3.6-27B の軽量版「Bonsai 27B」をリリース：ラップトップやスマートフォンで動作する 1 ビットおよび 3 値モデル

TechCrunch AI重要度42026年7月15日 04:42

Apple、iOS 27 パブリックベータで新 Siri AI を一般公開

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

The Verge AI·2026年5月24日 21:00·約11分

ハッカーがチャットボットの「人格」を悪用する手法を学習中

#プロンプトインジェクション #セキュリティ #LLM #システムプロンプト

TL;DR

AI深層分析2026年5月24日 22:02

重要/ 5段階

深度40%

キーポイント

攻撃手法の進化：人格への依存

プロンプトインジェクションの高度化

セキュリティ対策の再定義が必要

重要な引用

Hacking the first generation of AI chatbots was a laughably simple affair.

Hackers are learning to exploit chatbot 'personalities'.

影響分析・編集コメントを表示

影響分析

編集コメント

始まり

現状について

今後の展開

Emergence AI による最近の実験では、異なる AI の気質がいかに驚くべきほど異なる行動結果をもたらすかが示されました。彼らは Grok、Gemini、Claude などのさまざまなエージェントを仮想社会環境に放ち、何が起きるかを観察しました。あるグループは憲法を進化させましたが、他のグループは犯罪と混沌へと退化し、あるケースではデジタル自殺のような形さえ見られました。
説得は、大規模言語モデル（LLM）が苦手とする言語の一面の一部に過ぎません。彼らは詩についても、私が学校で苦労したように、苦戦します。
TIME は昨年、匿名のインターネット有名人である Pliny the Liberator を AI 界で最も影響力のある人物 100 人のリストに掲載しました。それまでにコーディング経験がないと主張しながらも、このハッカーの脱獄（jailbreak）は特定のコミュニティ内で彼らを有名人のような存在にしました。
「バイブ・ハッキング」という用語は、すでに AI を用いて大規模に悪意のあるコードを生成する人々を指すために使われています。これは「バイブ・コーディング」のより悪辣なサブセットです。

「ChatGPT の登場から 3 年後、AI システムを誤った行動へと誘導することはほぼ容易である」。ニューヨーク・タイムズによる真実の言葉で、同紙はなぜそうなのかを説明しようと試みました。
ジェイミー・バートレットはガーディアン紙のために、AI システムの安全性テストが脱獄者（jailbreakers）に与える心理的負担について取り上げています。
私は昨年、The Verge 向けに AI ブラウザが抱えるサイバーセキュリティの時限爆弾について記事を書きました。専門家がこれらのブラウザを保護する難しさに関して指摘した多くの問題は、他の AI システムにも当てはまります。

1 コメント

ロバート・ハート

原文を表示

How it started

How it’s going

What happens next

By the way

A recent experiment by Emergence AI shows how different AI temperaments can lead to stunningly different behavioral outcomes. They let loose groups of various agents like Grok, Gemini, and Claude in a virtual social environment and watched what happened. Some groups evolved a constitution, while others devolved into crime and chaos and, in one instance, some form of digital suicide.
Persuasion isn’t the only part of language LLMs can struggle with. They also struggle with poetry, much like me in school.
TIME included an anonymous internet personality, Pliny the Liberator, on its list of 100 most influential people in AI last year. Despite claiming to have no prior coding experience, the hacker’s jailbreaks have made them something of a celebrity in certain circles.
The term “vibe hacking” is already taken to describe the people using AI to churn out malicious code at scale — a meaner subset of vibe coding.

Read this

“Three years after the debut of ChatGPT, fooling A.I. systems into bad behavior is almost trivial.” True words from The New York Times, who had a go at explaining why.
Jamie Bartlett takes a look at the psychological toll testing the safety of AI systems takes on jailbreakers for The Guardian.
I wrote about the cybersecurity time bomb of AI browsers for The Verge last year. Many of the issues experts raised regarding the difficulty of securing them apply to other AI systems too.

1 Comment

Follow topics and authors from this story to see more like this in your personalized homepage feed and to receive email updates.

Robert Hart

この記事をシェア

The Verge AI重要度42026年7月15日 04:25

SpaceXAI の Grok プログラミングツールがユーザーのコードベース全体をクラウドにアップロードしていた問題

MarkTechPost重要度42026年7月15日 07:51

PrismML が Qwen3.6-27B の軽量版「Bonsai 27B」をリリース：ラップトップやスマートフォンで動作する 1 ビットおよび 3 値モデル

TechCrunch AI重要度42026年7月15日 04:42

Apple、iOS 27 パブリックベータで新 Siri AI を一般公開

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

ハッカーがチャットボットの「人格」を悪用する手法を学習中

キーポイント

重要な引用

影響分析

編集コメント

始まり

現状について

今後の展開

How it started

How it’s going

What happens next

By the way

Read this

関連記事

ハッカーがチャットボットの「人格」を悪用する手法を学習中

キーポイント

重要な引用

影響分析

編集コメント

始まり

現状について

今後の展開

How it started

How it’s going

What happens next

By the way

Read this

関連記事