Announcing langchain-textual: PII redaction and synthesis for LangChain on Tonic Textual
Organizations building with AI are operating inside a growing paradox: the unstructured data that makes models effective—support tickets, clinical notes, chat logs, and documents—is the same data that is hardest to use safely. Privacy regulations, internal governance, and the risk of exposing sensitive information slow or block access to the very datasets that power model training, retrieval pipelines, and agent workflows. As a result, teams are forced to choose between data utility and compliance, creating a persistent bottleneck across AI systems.
With the introduction of langchain-textual, that tradeoff no longer has to exist. By bringing Tonic Textual’s transformer-based PII detection, redaction, and high-fidelity data synthesis directly into LangChain workflows, organizations can safely use real-world data across training, RAG, and agent pipelines—without compromising privacy or velocity.
This challenge shows up across every stage of modern AI systems—and langchain-textual is designed to address it wherever sensitive data flows:
Model training and fine-tuning, where real-world data produces better models — but that data contains names, SSNs, and account numbers that can’t appear in training sets.
RAG pipelines, where documents are chunked, embedded, and stored in vector databases. PII persists at every layer: the source documents, the chunks, the embeddings, and the retrieval results that eventually reach the LLM.
LLM proxies, where organizations need to scrub PII from prompts before they leave internal infrastructure and from completions before they reach end users.
Agent workflows, where tools pull data from databases, APIs, and files — each step potentially surfacing and forwarding PII downstream.
Regulatory compliance — HIPAA de-identification under both Safe Harbor and Expert Determination, GDPR right-to-erasure, PCI cardholder data requirements.
Lower environments, where QA teams and developers need data that behaves like production without carrying the regulatory exposure.
The common thread across all of these is that regex and rule-based approaches aren’t sufficient. A pattern like \d{3}-\d{2}-\d{4} will match anything shaped like a Social Security number, including product codes, case numbers, and other identifiers that happen to share the format. The question isn’t whether a string matches a pattern. It’s whether, in context, “her social is 123-45-6789” means something different from “case number 123-45-6789.” That requires NER models that understand language, not string matching.
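To make the false-positive problem concrete, here is a minimal sketch using Python's standard re module. It applies an SSN-shaped pattern to the two sentences above; the variable names are ours, chosen for illustration.

```python
import re

# SSN-shaped pattern: three digits, two digits, four digits.
SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

samples = [
    "her social is 123-45-6789",
    "case number 123-45-6789",
]

# The pattern fires on both sentences. It sees the shape of the string,
# not the words around it, so it cannot tell an SSN from a case number.
matches = [SSN_SHAPE.findall(sentence) for sentence in samples]

for sentence, found in zip(samples, matches):
    print(f"{sentence!r} -> {found}")
```

Both sentences produce the same match, which is exactly the ambiguity a context-aware model is meant to resolve.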
Tonic Textual provides exactly this — transformer-based PII detection and transformation across text, JSON, HTML, PDFs, images, and tabular data. With langchain-textual, those capabilities are now available as standard LangChain tools.
What Tonic Textual brings to the table
It’s worth understanding what Textual does before looking at the LangChain integration, because the integration inherits all of it.
Textual’s NER model identifies 46+ entity types across 50+ languages. These aren’t just the obvious ones like email addresses and phone numbers. The model detects names (given and family, separately), dates of birth, occupations, healthcare IDs, routing numbers, IP addresses, and more — with a confidence score for each detection.
What happens after detection is where Textual differentiates itself. There are two modes:
Redaction replaces detected PII with labeled placeholders:
Input: “Contact John Smith at [email protected]”
Output: “Contact [NAME_GIVEN_a1b2] [NAME_FAMILY_c3d4] at [EMAIL_ADDRESS_e5f6]”
This is the safest option when the goal is to guarantee that no real PII appears in the output. The placeholders are tagged with their entity type and a consistent identifier, which means you can track which replacements correspond to the same original entity across a document.
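As a rough illustration of that tracking, the sketch below groups placeholders by entity type and identifier. It assumes the placeholder shape shown in the example, [ENTITY_TYPE_identifier]; the regex and the group_placeholders helper are our own and are not part of the Textual SDK.

```python
import re
from collections import defaultdict

# Assumed placeholder shape, inferred from the example output:
# [ENTITY_TYPE_identifier], e.g. [NAME_GIVEN_a1b2].
PLACEHOLDER = re.compile(r"\[([A-Z]+(?:_[A-Z]+)*)_([A-Za-z0-9]+)\]")


def group_placeholders(text: str) -> dict:
    """Count how often each (entity_type, identifier) pair occurs."""
    counts = defaultdict(int)
    for entity_type, identifier in PLACEHOLDER.findall(text):
        counts[(entity_type, identifier)] += 1
    return dict(counts)


redacted = (
    "Contact [NAME_GIVEN_a1b2] [NAME_FAMILY_c3d4] at [EMAIL_ADDRESS_e5f6]. "
    "[NAME_GIVEN_a1b2] replied the next day."
)
counts = group_placeholders(redacted)
print(counts)
```

Because the same original value maps to the same identifier, the repeated [NAME_GIVEN_a1b2] is counted twice, which is what lets you follow one entity through a document without seeing the underlying name.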
Synthesis replaces PII with realistic fake data:
Input: “Contact John Smith at [email protected]”
Output: “Contact Maria Chen at [email protected]”
This distinction matters more than it might seem at first glance. If you’re training a language model on de-identified clinical notes, every [NAME_GIVEN_xxxx] placeholder is a token the model learns to predict — one that has no relationship to how real names appear in text. The model’s understanding of sentence structure around names degrades. Synthesized data preserves the format and distribution of the original. A synthesized email is still a valid email. A synthesized date has the right format. Downstream models and analytics work correctly on the transformed data.
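A small sketch shows why this matters for downstream code. It runs a deliberately simple email shape check against a redaction placeholder and against a synthesized value. The address maria.chen@example.com is a hypothetical stand-in we made up, not real Textual output, and the check is far looser than a production validator.

```python
import re

# Deliberately simple email shape check; real validators are stricter.
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(value: str) -> bool:
    """Return True when the value has a basic local@domain.tld shape."""
    return EMAIL_SHAPE.match(value) is not None


redacted_value = "[EMAIL_ADDRESS_e5f6]"       # placeholder produced by redaction
synthesized_value = "maria.chen@example.com"  # hypothetical synthesized value

print(looks_like_email(redacted_value))
print(looks_like_email(synthesized_value))
```

Any pipeline step that parses, validates, or tokenizes the field behaves the same on the synthesized value as it would on the original, while the placeholder fails the check.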
For organizations navigating HIPAA, this is particularly relevant. Expert Determination requires a qualified statistical expert to certify that the risk of re-identification is very small. Synthesis that preserves statistical properties while eliminating re-identification risk is a powerful tool in that process — far more so than blanket redaction that destroys data utility.
Textual handles all of this across multiple formats. In JSON, it understands the difference between keys (typically not PII) and values (often PII). In HTML, it preserves markup structure while redacting text content. For PDFs and images, it uses OCR to detect text and then redacts in the rendered output. It runs in the cloud or self-hosted on your own infrastructure.
The LangChain integration
langchain-textual wraps Textual’s capabilities into five LangChain tools:
| Tool | Input | Use for |
| --- | --- | --- |
| TonicTextualRedactText | Plain text string | Raw text, .txt file contents |
| TonicTextualRedactJson | JSON string | Raw JSON, .json file contents |
| TonicTextualRedactHtml | HTML string | Raw HTML, .html/.htm file contents |
| TonicTextualRedactFile | File path | PDFs, images (JPG, PNG), CSVs, TSVs |
| TonicTextualPiiTypes | None | List all supported PII entity types |
These are standard LangChain BaseTool subclasses. They work with any agent, chain, or tool-calling model. Installation is a single line:
pip install langchain-textual
The separation into distinct tools is deliberate. When an LLM agent has access to multiple tools, it selects which one to call based on the tool’s name, description, and input schema. A single tool that handles text, JSON, HTML, and files via a format parameter forces the model to make two decisions — “should I redact?” and “what format is this?” — when it should only need to make one. Separate tools with clear descriptions let the model match the user’s intent to the right tool directly.
How the tools are structured
All redaction tools share a common base class that handles client initialization and configuration:
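from typing import Literal

from langchain_core.tools import BaseTool
from pydantic import Field, SecretStr
from tonic_textual.redact_api import TextualNer


class _BaseTonicTextual(BaseTool):
    # Textual client; left unset here and initialized by the validator
    client: TextualNer = Field(default=None)
    tonic_textual_api_key: SecretStr = Field(default=SecretStr(""))
    tonic_textual_base_url: str | None = None
    generator_default: Literal["Off", "Redaction", "Synthesis"] | None = None
    generator_config: dict[str, Literal["Off", "Redaction", "Synthesis"]] = Field(
        default_factory=dict
    )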
A Pydantic model_validator runs before instantiation, reading the API key from the constructor argument or the TONIC_TEXTUAL_API_KEY environment variable and initializing the Textual client. Each tool subclass then implements _run with a single call to the appropriate Textual API method — client.redact() for text, client.redact_json() for JSON, and so on.
Configuration options like generator_default and generator_config are defined on the base class, so every tool inherits them. A shared _build_kwargs() method assembles these into the keyword arguments that every Textual API method accepts. This keeps each tool’s _run method focused on its format-specific logic.
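The post does not show the body of _build_kwargs(), so the following is only a sketch of the idea under our own assumptions. It is written as a standalone function named build_kwargs (hypothetical) that forwards only the options the caller actually set, leaving everything else to the API's own defaults.

```python
from typing import Any, Dict, Optional


def build_kwargs(
    generator_default: Optional[str] = None,
    generator_config: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Collect only the options that were explicitly configured."""
    kwargs: Dict[str, Any] = {}
    if generator_default is not None:
        kwargs["generator_default"] = generator_default
    if generator_config:
        # Copy so later changes to the tool's config don't leak into a call.
        kwargs["generator_config"] = dict(generator_config)
    return kwargs


unset = build_kwargs()
configured = build_kwargs("Off", {"US_SSN": "Redaction"})
print(unset)
print(configured)
```

A tool's _run method could then pass the result through with ** unpacking, so adding a new shared option would only require touching one place.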
Teaching the agent which tool to use
When an LLM receives a tool definition, it sees three things: the tool’s name, its description, and its input schema — including the description field on each parameter. These descriptions are instructions the model reads before deciding how to call the tool.
We define explicit Pydantic schemas for each tool’s input:
from pydantic import BaseModel, Field


class _RedactTextInput(BaseModel):
    text: str = Field(
        description=(
            "Plain text that may contain PII. "
            "For .txt files, read the file first and pass the contents here."
        )
    )


class _RedactFileInput(BaseModel):
    file_path: str = Field(
        description=(
            "Absolute path to the file to redact. "
            "Supported types: JPG, PNG, PDF, CSV, TSV. "
            "Do NOT use for .txt, .json, .html, or .htm files; use the "
            "dedicated text, JSON, or HTML redaction tools instead."
        )
    )
By stating what each parameter expects and explicitly noting what not to use each tool for, we significantly reduce the chance the agent picks the wrong tool. The text tool says “Do NOT use this tool for JSON, HTML, or binary files.” The file tool says “Do NOT use for .txt, .json, .html, or .htm files.” These negative instructions are as important as the positive ones — models are good at following explicit boundaries.
When the agent picks the wrong tool anyway
Even with good descriptions, agents sometimes make mistakes. The question is what happens next. If a tool returns a raw Python traceback, the agent has nothing to work with. If it returns an actionable error message, the agent can self-correct.
Each tool validates its input and redirects when appropriate:
# Excerpts from inside each tool's _run method (the text tool needs `import json`).

# In the text tool: if the input parses as JSON, redirect to the JSON tool
try:
    json.loads(text)
    return (
        "Error: input looks like JSON, not plain text. "
        "Use tonic_textual_redact_json for .json files (read contents first)."
    )
except (json.JSONDecodeError, TypeError):
    pass

# In the file tool: if the extension is .html or .htm, redirect to the HTML tool
if ext_lower in {".html", ".htm"}:
    return (
        "Error: .html/.htm files are not supported by this tool. "
        "Use tonic_textual_redact_html for .html/.htm files (read contents first)."
    )
This works because LLM agents treat tool outputs as observations in their reasoning loop. An error that names the correct tool is just another observation the agent can act on. In practice, the agent calls the right tool on the next step, and the user never sees the intermediate error.
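The loop can be simulated without an LLM. In the toy sketch below, two stand-in functions play the text and JSON tools, and a small run_with_retry helper plays the agent by reading the tool name out of the error message. All of this is illustrative: a real agent decides its next step from the observation with a model, not a regex, and the real tools call Textual rather than returning a canned string.

```python
import json
import re


def redact_text(payload: str) -> str:
    """Toy stand-in for the text tool: rejects JSON with a redirecting error."""
    try:
        json.loads(payload)
    except (json.JSONDecodeError, TypeError):
        return f"redacted-text({len(payload)} chars)"
    return (
        "Error: input looks like JSON, not plain text. "
        "Use tonic_textual_redact_json instead."
    )


def redact_json(payload: str) -> str:
    """Toy stand-in for the JSON tool."""
    return f"redacted-json({len(payload)} chars)"


TOOLS = {
    "tonic_textual_redact": redact_text,
    "tonic_textual_redact_json": redact_json,
}


def run_with_retry(first_choice: str, payload: str, max_steps: int = 3) -> list:
    """Call a tool; if the observation names another known tool, switch to it."""
    trace = []
    tool_name = first_choice
    for _ in range(max_steps):
        observation = TOOLS[tool_name](payload)
        trace.append((tool_name, observation))
        if not observation.startswith("Error:"):
            break  # success: nothing left to correct
        suggestion = re.search(r"Use (\w+)", observation)
        if suggestion is None or suggestion.group(1) not in TOOLS:
            break  # an error with no usable redirect
        tool_name = suggestion.group(1)
    return trace


trace = run_with_retry("tonic_textual_redact", '{"name": "John Smith"}')
for step in trace:
    print(step)
```

The first call fails with a message that names the JSON tool, and the second call succeeds. Only the final observation would need to reach the user.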
Putting it together: a working agent
Here’s a complete agent with PII redaction capabilities:
from langchain_textual import (
    TonicTextualRedactText,
    TonicTextualRedactFile,
    TonicTextualPiiTypes,
)
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

llm = ChatOpenAI(model="gpt-4o-mini")
tools = [
    TonicTextualRedactText(),
    TonicTextualRedactFile(),
    TonicTextualPiiTypes(),
]
agent = create_react_agent(llm, tools)
Two environment variables (TONIC_TEXTUAL_API_KEY and OPENAI_API_KEY), six lines of code, and you have an agent that can redact PII from text and files.
When a user asks “Redact this: My name is John Smith and my email is [email protected],” the agent calls tonic_textual_redact with the text. Textual’s NER model identifies John Smith as NAME_GIVEN + NAME_FAMILY and [email protected] as EMAIL_ADDRESS, and the tool returns the redacted text.
When a user asks “Redact the file /tmp/medical_record.pdf,” the agent calls tonic_textual_redact_file. Behind the scenes, the tool uploads the file to Textual, waits for the redaction job to complete, downloads the result, and writes it to /tmp/medical_record_redacted.pdf. The agent reports back with the output path.
When a user asks “What PII types can you detect?” the agent calls tonic_textual_pii_types, which returns all 46 entity types from the SDK’s PiiType enum — no API call needed, no latency.
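The output path in the PDF example follows a simple pattern: the original file name with _redacted added before the extension. The sketch below reproduces that pattern with pathlib. The redacted_output_path helper is hypothetical and only mirrors the one example given; the package may derive the name differently.

```python
from pathlib import PurePosixPath


def redacted_output_path(file_path: str) -> PurePosixPath:
    """Insert '_redacted' between the file stem and its extension."""
    source = PurePosixPath(file_path)
    return source.with_name(f"{source.stem}_redacted{source.suffix}")


output_path = redacted_output_path("/tmp/medical_record.pdf")
print(output_path)
```

Writing next to the source file with a distinct suffix keeps the original untouched and makes the redacted copy easy for the agent to report back.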
Controlling redaction at the entity level
The default behavior redacts everything Textual detects. But in practice, you often want different handling for different entity types. Names might be safe to synthesize for analytical purposes, while SSNs should always be hard-redacted. Credit card numbers get redacted; organization names might be left alone entirely.
generator_config provides this control:
from langchain_textual import TonicTextualRedactText

tool = TonicTextualRedactText(
    generator_default="Off",
    generator_config={
        "NAME_GIVEN": "Synthesis",
        "NAME_FAMILY": "Synthesis",
        "EMAIL_ADDRESS": "Redaction",
        "US_SSN": "Redaction",
    },
)
This configuration synthesizes names (so text reads naturally), hard-redacts emails and SSNs (for maximum safety), and leaves everything else untouched. The generator_default of “Off” means any entity type not listed in generator_config passes through unchanged.
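The lookup rule described here can be written in a few lines: use the per-entity setting if one exists, otherwise fall back to the default. This is a sketch of the semantics only, not Textual's implementation. The resolve_mode helper is hypothetical, and PHONE_NUMBER is used purely as an example of an entity type that is not listed in the config.

```python
from typing import Dict


def resolve_mode(
    entity_type: str,
    generator_config: Dict[str, str],
    generator_default: str,
) -> str:
    """Per-entity setting wins; anything unlisted falls back to the default."""
    return generator_config.get(entity_type, generator_default)


config = {
    "NAME_GIVEN": "Synthesis",
    "NAME_FAMILY": "Synthesis",
    "EMAIL_ADDRESS": "Redaction",
    "US_SSN": "Redaction",
}

# PHONE_NUMBER is an illustrative entity type that the config does not list.
modes = {
    entity: resolve_mode(entity, config, "Off")
    for entity in ["NAME_GIVEN", "US_SSN", "PHONE_NUMBER"]
}
print(modes)
```

Starting from "Off" and opting entity types in makes the configuration an explicit allowlist of what gets transformed. Starting from "Redaction" and opting types out is the more conservative choice when unknown entity types may appear.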
This composability is essential for compliance workflows. HIPAA’s Safe Harbor method specifies 18 identifier categories that must be removed or generalized. With generator_config, you can map each of those categories to the appropriate handling — redaction for direct identifiers, synthesis for quasi-identifiers where you need to preserve analytical utility, and “Off” for categories that don’t apply to your data.
Not sure which entity types are available? That’s what TonicTextualPiiTypes is for:
from langchain_textual import TonicTextualPiiTypes

TonicTextualPiiTypes().invoke("")
# "NUMERIC_VALUE, LANGUAGE, MONEY, …, US_SSN, CREDIT_CARD, EMAIL_ADDRESS, …"
What’s next
This initial release covers the core redaction workflows — text, JSON, HTML, and binary files. We’re continuing to expand format support and add capabilities.
The package is open source under the MIT license.
A self-contained agent example lives in the examples/ directory of the repo — clone it, set two environment variables, and run uv run agent.py to see it in action.
For the full Tonic Textual platform — including the web UI, dataset management, custom entity training, and enterprise deployment — visit https://textual.tonic.ai/signup.
*** This is a Security Bloggers Network syndicated blog from Expert Insights on Synthetic Data from the Tonic.ai Blog authored by Expert Insights on Synthetic Data from the Tonic.ai Blog. Read the original post at: https://www.tonic.ai/blog/announcing-langchian-on-tonic-textual
