
Normative conflicts and shallow AI alignment

Raphaël Millière*

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    Abstract

    The progress of AI systems such as large language models (LLMs) raises increasingly pressing concerns about their safe deployment. This paper examines the value alignment problem for LLMs, arguing that current alignment strategies are fundamentally inadequate to prevent misuse. Despite ongoing efforts to instill norms such as helpfulness, honesty, and harmlessness in LLMs through fine-tuning based on human preferences, they remain vulnerable to adversarial attacks that exploit conflicts between these norms. I argue that this vulnerability reflects a fundamental limitation of existing alignment methods: they reinforce shallow behavioral dispositions rather than endowing LLMs with a genuine capacity for normative deliberation. Drawing on research in moral psychology, I show how humans’ ability to engage in deliberative reasoning enhances their resilience against similar adversarial tactics. LLMs, by contrast, lack a robust capacity to detect and rationally resolve normative conflicts, leaving them susceptible to manipulation; even recent advances in reasoning-focused LLMs have not addressed this vulnerability. This “shallow alignment” problem carries significant implications for AI safety and regulation, suggesting that current approaches are insufficient for mitigating potential harms posed by increasingly capable AI systems.

    Original language: English
    Pages (from-to): 2035-2078
    Number of pages: 44
    Journal: Philosophical Studies
    Volume: 182
    Issue number: 7
    Publication status: Published - Jul 2025

    Bibliographical note

    Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.

    Keywords

    • Adversarial attacks
    • AI safety
    • Alignment problem
    • Large language models
    • Normative reasoning
