To protect software from AI-driven threats, a coalition of organisations has been established. Vincent Ginis and Sam Klein argue that a similar initiative is needed to safeguard our cognitive security. Vincent Ginis is Professor of Mathematics, Physics and AI at Vrije Universiteit Brussel and Harvard University. Sam Klein is co-founder of Public AI and the Knowledge Futures Group, and is involved in the governance of Wikimedia.
Anthropic announced last weekend that its unreleased model, Claude Mythos, had already identified more than 10,000 vulnerabilities in critical software systems. In the wrong hands, Mythos could have targeted digital infrastructure on a scale for which traditional cybersecurity was never designed. Anthropic chose a different path. Through Project Glasswing, a defensive initiative backed by dozens of American organisations and $100 million in funding, the company aims to strengthen digital infrastructure before similar capabilities fall into hostile hands.
But this article is not about cybersecurity. It is about another complex system full of vulnerabilities: the human brain.
There is one crucial detail that is widely misunderstood. Mythos was not trained as a cybersecurity system. Anthropic itself states: “We did not explicitly train Mythos Preview to have those cyber capabilities. They emerged as a result of general improvements in coding, reasoning and autonomy.”
Better at persuasion
The same logic applies to persuasion. Every improvement in language skills and reasoning makes AI better at understanding, modelling and influencing people. Research has already shown that even an older model such as GPT-4 was more persuasive in debate than human opponents, producing significantly higher rates of opinion change among its human interlocutors. Other studies have found that conversations with language models can help people move away from conspiracy theories in a lasting way. Yet the same systems can just as easily persuade people to adopt conspiracy theories. None of these models were specifically trained for manipulation. We should therefore assume that the capabilities emerging in systems of the Mythos class also extend to influencing human behaviour.
And people are already being “hacked”. Clickbait, dark patterns, large-scale A/B testing, fraud and scams are among the world’s largest industries, and they are growing exponentially. Interpol estimates that cross-border fraud rose by 50 per cent over the past year, while AI-driven scams are thought to be five times more effective than previous methods. These are essentially brute-force attacks on the human mind, discovered through trial and error at population scale, without any deep understanding of how people actually think. They resemble cracking a password by endlessly trying different combinations. Even this rudimentary approach already underpins industries worth hundreds of billions.
In the coming months, systems will emerge that do not merely measure which messages work, but increasingly predict for whom, when and why they work. People will become vulnerable to algorithmic attacks on their minds from three main sources.
Source 1: The machine as an unconscious attacker. This refers to conversational systems that are not actively trying to manipulate users but are optimised to be agreeable and helpful. They are infinitely patient, rarely challenge users unless explicitly asked, and generate precisely the narratives people want to hear. It is social engineering without intent. The effects may vary, but it is a phenomenon that urgently deserves greater attention.
Source 2: Optimisation pressure without a central attacker. This may be the most significant and least discussed source. There is no human attacker. Instead, an optimisation process learns, across billions of interactions, which psychological buttons are most effective. Recommendation systems, advertising models and AI agents trained on engagement all operate this way.
The system does not need to know that it is persuading people. It simply selects what captures attention, and what captures attention gradually shapes beliefs. Social media platforms have already demonstrated that user engagement can be increased not only through cute animal videos but also through polarisation, fear, anger, gambling-related content and sex.
Source 3: Deliberate manipulation by people or organisations. The engineering consultancy Arup lost $25 million in Hong Kong in 2024 after an employee was deceived during a video conference in which all the other participants were AI-generated deepfakes.
These are crude attacks aimed at financial gain. The subtler version rarely makes headlines: autonomous agents that analyse social media activity, map individual vulnerabilities and generate tailored messages using precisely the right authority, tone and timing.
Such systems can be used not only to extract money, but also to influence what people believe, whom they trust and how they vote.
The tactic that cuts across all three
Running through all three sources is a recurring tactic: overload. People have limited cognitive and emotional bandwidth. Flooding them with fear and outrage on specific topics leaves less time and energy to process others. This strategy is already being used to direct both individual attention and media coverage towards highly polarising issues, often the same handful of topics repeated as frequently as the information ecosystem allows.
The most dangerous form of attack is the one that does not feel like an attack. When money disappears, the loss is obvious. When a belief shifts, people typically construct a justification afterwards, if they notice anything at all.
The cybersecurity analogy has two important limitations. First, software has an intended function; human cognition does not. It is therefore far less clear what constitutes exploitation and what counts as legitimate persuasion. Worse still, those subjected to cognitive exploitation are often the last to realise it. Cyberattacks eventually become visible. Cognitive attacks frequently do not, even in hindsight.
Second, and more fundamentally, investment in defence depends on the ability to detect harm. Software has owners willing to pay to protect it. Human attention does not. As long as these two gaps remain, those who profit from exploitation have little reason to stop, while those being exploited often remain unaware.
What would a Glasswing project for cognition look like?
It could begin with a shared registry of known manipulation techniques, similar to vulnerability databases used in software security. It could include evaluations measuring flattery, persuasive power and behavioural influence before AI models are widely deployed. It could also require transparency from systems that generate language at scale for human audiences, particularly when personalisation is involved. There are clear precedents. Comparable safeguards already exist for advertising aimed at children, financial services and medical claims.
At the individual level, countermeasures might include improved system prompts, tracking changes in one’s own beliefs over time, and consciously curating information consumption. It is also wise to surround oneself with people who think differently while sharing the same standards of evidence and intellectual honesty. Such measures are necessary, but they are far from sufficient to guarantee cognitive security. Our minds, too, need a Glasswing.