404 Media · April 28, 2026, 00:05 · approx. 6 min read

Study finds that a third of new websites are AI-generated

#LLM #Content Generation #Internet Archive #AI Detection #Dead Internet Theory

TL;DR

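A research team from Stanford University, Imperial College London, and the Internet Archive found that roughly 35% of new websites created since 2022 are AI-generated or AI-assisted.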

AI deep analysis · April 28, 2026, 04:29


Key points

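Rapid growth of AI-generated websites

By mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT's launch in late 2022.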

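A changing web and the Dead Internet Theory

As AI text spreads, the web is becoming more "cheery" and less verbose. Concerns center on declining semantic and stylistic diversity and factual accuracy, and the findings come close to empirical support for part of the "Dead Internet Theory."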

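Large-scale data analysis and detection method

Working from Internet Archive data, the team analyzed website snapshots with the AI-detection tool Pangram v3 and tested six hypotheses, including low semantic density and missing source citations.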

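Main effects of AI-generated content

The study found that AI is reducing the semantic diversity of the internet and making its overall sentiment more positive, but it found no increase in falsehoods and no drop in source citation.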

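"Truth Decay" hypothesis not confirmed, but concerns remain

The team found no increase in verifiably untrue statements. A separate concern remains: AI may be increasing the volume of unverifiable claims that existing fact-checking tools cannot check.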

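Future research and the role of AI

The researchers are working with the Internet Archive on a continuous monitoring tool. They argue that for AI to act as a creative partner rather than a replacement, diversity and a distinct personality with some "friction" matter.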


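Impact analysis

The finding suggests that the internet's ecosystem is shifting rapidly from human-centered to AI-centered, which forces a rethink of search engine optimization (SEO) and content marketing strategies. With data now backing part of the "dead internet" picture, debate over how to select trustworthy sources and how to identify and regulate AI-generated content may accelerate.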

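Editor's comment

That 35% of new sites became AI-generated within three years illustrates how fast the digital ecosystem is changing. Guaranteeing content reliability is likely to become a standard expectation for the web.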

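Researchers working with data from the Internet Archive have discovered that a third of websites created since 2022 are AI-generated. The team, which includes people from Stanford, Imperial College London, and the Internet Archive, published its findings online in a paper titled "The Impact of AI-Generated Text on the Internet." The research also found that all this AI-generated text is making the web more cheery and less verbose.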

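Inspired by the Dead Internet Theory, the idea that much of the internet is now just bots talking back and forth, the team set out to find out how ChatGPT and its competitors had reshaped the internet since 2022. "The proliferation of AI-generated and AI-assisted text on the internet is feared to contribute to a degradation in semantic and stylistic diversity, factual accuracy, and other negative developments," the researchers write in the paper. "We find that by mid-2025, roughly 35% of newly published websites were classified as AI-generated or AI-assisted, up from zero before ChatGPT's launch in late 2022."

"I find the sheer speed of the AI takeover of the web quite staggering," Jonáš Doležal, an AI researcher at Stanford and co-author of the paper, told 404 Media. "After decades of humans shaping it, a significant portion of the internet has become defined by AI in just three years. We're witnessing, in my opinion, a major transformation of the digital landscape in a fraction of the time it took to build in the first place."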

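The researchers also tested six common critiques of AI-generated text. Does it lead to a shrinking of viewpoints? Does it create more disinformation as hallucinations proliferate? Does online writing feel more sanitized and cheerful? Does it fail to cite its sources? Does it create strings of words with low semantic density? Has it forced writing into a monoculture where unique voices vanish and a generic, uniform style takes hold?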

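To answer these questions, the researchers partnered with the Internet Archive to pull samples of websites from the 33 months between August 2022 and May 2025. "For each sampled URL, we retrieve the oldest available archived snapshot via the Wayback Machine's CDX Server API," the paper said. "The raw HTML of each snapshot is downloaded and stored locally for subsequent processing."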
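The retrieval step quoted above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the helper names are hypothetical, the `statuscode:200` filter is an added assumption, and it relies on the CDX server returning captures in ascending timestamp order so that `limit=1` yields the oldest one. The `id_` suffix in the replay URL asks the Wayback Machine for the capture without its injected toolbar markup.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def oldest_snapshot_query(url: str) -> str:
    """Build a CDX query URL asking for the single oldest capture of `url`."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original",   # only the fields we need
        "filter": "statuscode:200",   # assumption: skip redirects and errors
        "limit": "1",                 # captures come back oldest first
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_json(body: str):
    """Return {'timestamp': ..., 'original': ...} or None if nothing was archived."""
    if not body.strip():
        return None
    rows = json.loads(body)
    if len(rows) < 2:                 # row 0 is the header row of field names
        return None
    return dict(zip(rows[0], rows[1]))


def raw_snapshot_url(timestamp: str, original: str) -> str:
    """Replay URL for the unmodified archived HTML ('id_' modifier)."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def fetch_oldest_raw_html(url: str, timeout: float = 30.0):
    """Download the raw bytes of the oldest capture, or None if there is none."""
    with urlopen(oldest_snapshot_query(url), timeout=timeout) as resp:
        snap = parse_cdx_json(resp.read().decode("utf-8"))
    if snap is None:
        return None
    replay = raw_snapshot_url(snap["timestamp"], snap["original"])
    with urlopen(replay, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch_oldest_raw_html("example.com")` would perform two network requests. A real crawl at the study's scale would also need rate limiting and retries, which this sketch leaves out.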

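The researchers took the extracted website text and used the AI-detection software Pangram v3 to find AI-created websites. The team tested several AI-detection tools and found Pangram v3 had the highest detection rate. Once Pangram v3 had identified an AI-generated website, the researchers used that website as a sample to test their six hypotheses. "For each hypothesis, we define a measurable signal, compute it for each monthly sample of websites, and test whether it correlates with the aggregate AI likelihood score across months," the paper said.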
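The article does not say which correlation statistic the paper uses, so the following is only a sketch of the general idea: average a per-site signal and the per-site AI likelihood score within each month, then compute a Pearson correlation across months. The function names are hypothetical and the numbers are made-up toy values, not results from the study.

```python
from math import sqrt
from statistics import mean


def paired_monthly_means(signal_by_month, ai_score_by_month):
    """Average both quantities per month, keeping only months present in both."""
    months = sorted(set(signal_by_month) & set(ai_score_by_month))
    signal = [mean(signal_by_month[m]) for m in months]
    ai = [mean(ai_score_by_month[m]) for m in months]
    return signal, ai


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy data (illustrative only): per-site sentiment and AI likelihood per month.
toy_sentiment = {"2022-08": [0.1, 0.3], "2023-08": [0.3, 0.5], "2024-08": [0.5, 0.7]}
toy_ai_score = {"2022-08": [0.0, 0.0], "2023-08": [0.1, 0.3], "2024-08": [0.3, 0.5]}

signal, ai = paired_monthly_means(toy_sentiment, toy_ai_score)
r = pearson(signal, ai)  # close to 1.0 here, because the toy series rise together
```

With only a few dozen monthly points, a correlation like this shows co-movement over time rather than causation, which is why a rank-based statistic or a significance test would be a reasonable addition in practice.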

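To test if AI was creating an internet full of falsehoods, for example, the team extracted fact-based claims from the websites they'd selected and then paid human fact-checkers to verify them. To figure out if AI is citing its sources, the team computed the outbound link density in AI-generated text.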
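The article does not define "outbound link density," so the sketch below makes its own assumption: the number of links pointing to a different host than the page itself, per 100 words of visible text. The class and function names are hypothetical, and only Python's standard-library HTML parser is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkTextParser(HTMLParser):
    """Counts visible words and links that leave the page's own host."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc.lower()
        self.outbound = 0
        self.words = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL before comparing hosts.
                target = urlparse(urljoin(self.page_url, href))
                if target.scheme in ("http", "https") and target.netloc.lower() != self.page_host:
                    self.outbound += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += len(data.split())


def outbound_link_density(html: str, page_url: str, per_words: int = 100) -> float:
    """Outbound links per `per_words` words of visible text (assumed definition)."""
    parser = _LinkTextParser(page_url)
    parser.feed(html)
    parser.close()
    if parser.words == 0:
        return 0.0
    return per_words * parser.outbound / parser.words


sample_html = (
    '<p>See <a href="https://other.org/x">this source</a> and '
    '<a href="/about">our page</a> for more.</p><script>var a = 1;</script>'
)
density = outbound_link_density(sample_html, "https://example.com/post")
```

In the sample, only the `other.org` link counts as outbound, and the eight visible words exclude the script body, so `density` comes out at 12.5 links per 100 words. Treating subdomains of the same site as external is a simplification of this sketch.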

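To the surprise of the researchers, only two of the six theories they tested about the effects of AI-generated text seemed true. AI was making the internet less semantically diverse and more positive overall, but it wasn't causing a proliferation in lies or cutting out its sources.

"The most surprising result was that our Truth Decay hypothesis wasn't confirmed," Doležal said. "It's worth noting that we were specifically looking for an increase in verifiably untrue statements, which we didn't find. But it could still be the case that AI is quietly increasing the volume of unverifiable claims, ones that can't be checked against existing fact-checking tools and infrastructure. Or it may simply be that the internet wasn't a particularly truth-adhering place to begin with."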

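The researchers said they would continue to study how AI-generated text shapes the internet. "We're now working with the Internet Archive to turn this into a continuous tool that keeps providing this signal going forward, rather than a single fixed snapshot bounded by the static nature of a paper," Maty Bohacek, a student researcher at Stanford and one of the co-authors of the paper, told 404 Media. "We're also interested in adding more granularity: looking at which kinds of websites are most affected, broken down by category or language, and generally providing more nuance about where these impacts are landing."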

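For Doležal, studies like this are critical for ensuring a useful and productive internet. "As AI-generated content spreads, the challenge is finding a role for these models that doesn't just result in a sanitized, repetitive web," he said. "Rather than forcing models to be perfectly compliant and agreeable, allowing them to have a more distinct personality or 'friction' might help them act as a creative partner rather than a replacement for human voice."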

