Report: Massive Amounts of Sensitive Data Being Shared with GenAI Tools
A report published today by Harmonic Security suggests that the amount of data being shared with generative artificial intelligence (AI) tools is increasing exponentially, in ways that will inevitably lead to more security breaches and compliance issues.

An analysis of 22.4 million prompts used across six generative AI applications in 2025 finds that data exposures most commonly involve ChatGPT (71%). Additionally, 17% of all the exposures discovered involved personal or free accounts, where organizations have zero visibility, no audit trails and data may be used to train public models. Of the 98,034 instances involving sensitive data, the vast majority (87%) occurred via ChatGPT Free, followed by Google Gemini at 5,935 (6%), Microsoft Copilot at 3,416 (3.5%), Claude at 2,412 (2.5%) and Perplexity at 1,245 (1.3%).

Of the 22.4 million prompts analyzed, 579,000 (2.6%) contained company-sensitive data. Code, at 30% of data exposures, was the leading risk, followed by legal discourse (22.3%), merger and acquisition data (12.6%), financial projections (7.8%) and investment portfolio data (5.5%).

Michael Marriott, vice president of product marketing for Harmonic Security, said most of the sensitive data discovered was inadvertently shared when some type of unstructured document was exposed to a generative AI model. For the purposes of the analysis itself, no personally identifiable information or proprietary file content actually left a customer environment.

However, the analysis only spans six tools. Given that there are at least 661 tools with generative AI capabilities, the amount of potentially sensitive data being shared is significantly greater than many organizations fully appreciate, noted Marriott.

Additionally, a significant percentage of that data is finding its way into repositories that reside in data centers in countries that don't respect data privacy, noted Marriott. For example, 4% of the generative AI usage that Harmonic Security was able to track involved applications that store data in China. Many of the applications being used don't necessarily make it apparent where the data they collect is being stored, he added.

The challenge cybersecurity teams face is that data shared with a generative AI tool might not surface in a known breach for months or years. Some providers of these tools use the data they collect to train the next generation of their AI models unless an end user has specifically opted out of allowing their data to be used for that purpose. As such, it's probable that much of that sensitive data will one day show up in output generated by an AI model, creating a breach that might lead to significant fines being levied.

Ideally, more organizations would rely on commercial versions of these tools that have guardrails in place to help prevent sensitive data from being shared. Even then, many of those guardrails might be bypassed, so there is also a clear need to monitor how employees are using these tools to ensure that the number of data leakage incidents is minimized.

After all, the only thing worse than the actual breach itself is knowing that the data is readily accessible to anyone who cares to craft a simple prompt to retrieve it anytime they like.
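For readers wondering what prompt-level monitoring can look like in practice, the sketch below shows one simple approach: scanning outgoing prompts for a few common classes of sensitive data before they reach a model endpoint. The patterns and the check_prompt helper are hypothetical illustrations, not drawn from the Harmonic Security report or any specific product.

```python
# A minimal sketch of prompt-level data loss prevention, assuming an
# organization routes employee prompts through a proxy it controls.
# Patterns and names here are illustrative, not from the report.
import re

# Hypothetical regexes for a few common classes of sensitive data.
SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|pk)[-_][A-Za-z0-9]{20,}\b"),
}

def check_prompt(prompt: str) -> list[str]:
    """Return the names of any sensitive-data patterns found in a prompt."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items()
            if pattern.search(prompt)]

if __name__ == "__main__":
    prompt = "Summarize this contract for customer 123-45-6789."
    findings = check_prompt(prompt)
    if findings:
        # A real deployment might block the request, redact the match,
        # or alert a security team rather than just printing.
        print(f"Blocked: prompt contains {', '.join(findings)}")
    else:
        print("Prompt forwarded to the model.")
```

Regex matching of this kind only catches well-structured identifiers; the unstructured documents Marriott describes would require more sophisticated classification to flag reliably.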
