At the Virus Bulletin conference in 2024, Sophos Principal Data Scientist Younghoo Lee presented a paper on SophosAI’s exploration of ‘multimodal’ AI – systems that combine diverse data types in a single analysis. Lee’s presentation covered the team’s empirical research applying multimodal AI to detect spam, phishing, and unsafe web content.
Understanding Multimodal AI
Multimodal AI represents a significant departure from traditional single-mode analysis: rather than processing one kind of input, multimodal systems can process multiple data streams concurrently, combining evidence from each.
In the realm of cybersecurity – especially in threat classification – this is a potent capability. Instead of dissecting textual and visual content separately, a multimodal system can analyze both, comprehending the intricate interplay between them.
For instance, in phishing detection, multimodal AI scrutinizes the linguistic patterns and writing style of the text alongside the visual fidelity of logos and branding elements, while also assessing the semantic consistency between textual and visual components. This comprehensive method enables the system to pinpoint sophisticated attacks that might appear legitimate to conventional systems. Furthermore, multimodal AI can learn and adjust based on the connections between diverse data types, cultivating an understanding of how legitimate and malicious content varies across multiple dimensions.
Capabilities in Multimodal AI Systems
The research by Lee highlights some of the detection capabilities of multimodal AI systems:
Text Scrutiny and Language Analysis
- Scrutiny of linguistic patterns, writing style, and contextual indicators to spot manipulation attempts
- Identification of social engineering strategies like contrived urgency and unusual requests for sensitive data
- Collection of an evolving repository of phishing pretexts and narratives
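The linguistic signals above can be illustrated with a toy heuristic. This is a minimal sketch, not Sophos’s actual feature set: the urgency phrases and scoring scheme are assumptions chosen for illustration, whereas a multimodal model learns such cues rather than matching a fixed list.

```python
import re

# Hypothetical urgency phrases commonly seen in phishing lures
# (illustrative only; not an actual production pattern list).
URGENCY_PATTERNS = [
    r"act (now|immediately)",
    r"within 24 hours",
    r"account (will be )?(suspended|closed)",
    r"verify your (account|identity|password)",
    r"you (have|'ve) won",
]

def urgency_score(text: str) -> float:
    """Fraction of urgency patterns that match the message body."""
    text = text.lower()
    hits = sum(1 for p in URGENCY_PATTERNS if re.search(p, text))
    return hits / len(URGENCY_PATTERNS)

lure = ("Congratulations, you have won! Verify your account "
        "within 24 hours or it will be suspended.")
print(urgency_score(lure))  # 0.6 — three of five patterns match
```

A learned model generalizes far beyond such keyword lists, but the intuition is the same: multiple weak urgency signals compound into a strong one.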
Visual Recognition and Brand Authentication
- Comparison of logos, corporate aesthetics, and visual layouts against legitimate models
- Detection of subtle disparities in brand colors, fonts, and layouts
- Examination of image metadata and digital signatures
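One classical way to compare a suspect logo against a legitimate reference is a perceptual average hash (aHash). The sketch below is an assumption-laden illustration: real pipelines decode and resize actual images first, while here the 8×8 grayscale grids are toy data standing in for pre-scaled logos.

```python
# Hedged sketch: average-hash comparison of two pre-scaled
# 8x8 grayscale logo grids (toy data, not real images).

def average_hash(grid):
    """64-bit hash: 1 where a pixel is brighter than the grid's mean."""
    pixels = [p for row in grid for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(h1, h2):
    """Number of differing hash bits; small distance = similar images."""
    return sum(a != b for a, b in zip(h1, h2))

reference = [[0] * 4 + [255] * 4 for _ in range(8)]              # legit logo
suspect = [[0] * 4 + [255] * 4 for _ in range(7)] + [[0] * 8]    # bottom row altered

dist = hamming(average_hash(reference), average_hash(suspect))
print(dist)  # 4 of 64 bits differ — near-identical, as a cloned logo would be
```

A multimodal model goes well beyond hashing – it reasons about brand identity semantically – but perceptual similarity of this kind is the baseline idea behind detecting subtly altered branding.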
Advanced Examination of URLs and Security
- Recognition of deceptive techniques such as typosquatting and homograph attacks
- Assessment of correlations between displayed link text and real destinations
- Identification of efforts to camouflage malicious URLs with formatting and styling tricks
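The second bullet – comparing displayed link text with the real destination – can be sketched with the standard library. The domains below are invented for illustration, and the mismatch rule is a deliberately simple assumption; production systems weigh many more signals.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hedged sketch: flag anchors whose visible text names one domain
# while the href points somewhere else (illustrative rule only).

class LinkAuditor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href", "")
            self.text = ""

    def handle_data(self, data):
        if self.href is not None:  # only collect text inside an anchor
            self.text += data

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            shown = self.text.strip().lower()
            real = urlparse(self.href).hostname or ""
            # Suspicious: the visible text looks like a URL,
            # but the actual host never appears in it.
            if shown.startswith(("http", "www.")) and real not in shown:
                self.mismatches.append((shown, real))
            self.href = None

auditor = LinkAuditor()
auditor.feed('<a href="http://evil.example.net/login">www.costco.com</a>')
print(auditor.mismatches)  # [('www.costco.com', 'evil.example.net')]
```

The same parsing step also exposes the raw hostnames needed for typosquatting and homograph checks, e.g. edit-distance or Unicode-confusables comparison against known brand domains.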
Illustrative Example: Counterfeit Costco Email
The image below shows a genuine phishing attempt, aimed at deceiving recipients into believing they have won a prize from Costco. The email mimics an official look, complete with a replicated Costco logo and branding.
Figure 1: An illustration of a phishing email, supposedly from Costco
Through multimodal AI, several suspicious elements of this email can be identified, including:
- Utilization of phrases inducing urgency and prompting action
- Email sender’s domain not aligned with legitimate domains
- Inconsistencies in logos and images
As a result, the system flags the email as suspicious, assigning it a high suspicion score.
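The second indicator – a sender domain that does not align with the brand’s legitimate domains – reduces to an allowlist lookup in its simplest form. This is a minimal sketch: the domain list and addresses are hypothetical, and a real classifier would also weigh SPF/DKIM results and lookalike-domain similarity.

```python
from email.utils import parseaddr

# Assumed allowlist for illustration; not Costco's actual sending domains.
LEGIT_DOMAINS = {"costco.com", "email.costco.com"}

def sender_is_aligned(from_header: str) -> bool:
    """True if the From: address's domain is on the brand's allowlist."""
    _, addr = parseaddr(from_header)
    domain = addr.rsplit("@", 1)[-1].lower()
    return domain in LEGIT_DOMAINS

print(sender_is_aligned("Costco Rewards <promo@costco-winners.xyz>"))  # False
print(sender_is_aligned("Costco <offers@email.costco.com>"))           # True
```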
SophosAI also applied multimodal AI to assess NSFW websites hosting content related to gambling, weaponry, and more. As with phishing email classification, the detection process integrates multiple capabilities, including keyword assessment and image analysis.
Experimental Outcomes
In a series of empirical experiments, SophosAI compared the efficacy of multimodal AI against traditional machine learning models such as Random Forest and XGBoost. The full results can be found in Lee’s whitepaper and Virus Bulletin talk. In summary, traditional models excelled at recognizing known threats but struggled with newly encountered phishing emails: their F1 scores on unfamiliar samples ranged from 0.53 to 0.66. Conversely, multimodal AI (leveraging GPT-4o) performed markedly better at spotting fresh phishing attempts, achieving F1 scores up to 0.97 even for unrecognized brands.
Across NSFW content assessment, traditional models garnered F1 scores of approximately 0.84-0.88, whereas models built on multimodal AI embeddings achieved scores up to 0.96.
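For readers unfamiliar with the metric quoted above, F1 is the harmonic mean of precision and recall. The counts below are illustrative, not the paper’s actual data:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. 90 phishing emails caught, 10 false alarms, 30 missed
print(round(f1(tp=90, fp=10, fn=30), 2))  # 0.82
```

Because it balances both error types, a jump from ~0.66 to ~0.97 reflects substantially fewer missed detections and false alarms combined.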
Wrapping Up
The digital arena is continually evolving, ushering in a host of new threats – including the exploitation of generative AI to dupe users. Phishing emails now adeptly imitate authentic communications on a regular basis, while NSFW websites veil harmful content behind misleading visuals. While traditional cybersecurity measures retain significance, they are becoming progressively inadequate in isolation. Multimodal AI provides a pioneering defense layer that enhances our grasp of content.
By effectively discerning sophisticated phishing emails and precisely categorizing NSFW websites, multimodal AI not only boosts user protection but also adjusts to emerging threats. The experimental findings in Lee’s paper highlight substantial enhancements over conventional methods.
Looking ahead, integrating multimodal AI into cybersecurity strategies is not just advantageous, but imperative for safeguarding our digital milieu amidst mounting intricacies and perils.
For more information, Lee’s comprehensive whitepaper can be accessed here. The recording of his 2024 Virus Bulletin presentation is available here (along with the slides).

