How to de-identify financial documents with Tonic Textual

Financial services organizations sit on enormous volumes of sensitive text. Bank statements, transaction histories, loan documents, and customer correspondence are rich with context that teams are eager to use for analytics and AI. They are also packed with personally identifiable information (PII) that must be protected before the data can be shared or used outside of production systems.
These documents are filled with names, account numbers, routing numbers, addresses, dates of birth, and other potentially sensitive identifiers. In many cases these identifiers appear inside free-text fields like transaction descriptions or notes, which makes financial documents especially difficult to work with. Without de-identification, teams are forced to choose between taking on perceived risk and giving up progress on initiatives that could benefit from these documents.
Tonic Textual was built to remove this tradeoff by helping teams to automatically detect and de-identify sensitive data in unstructured financial documents so the data can be used safely without losing the context that makes it valuable.
Why de-identification matters in financial services
Consider a common use case like fine-tuning an ML model on customer documents. Without safeguards in place, those documents may expose customer names, account numbers, card details, addresses, or other identifiers. Even partial details can be enough to re-identify an individual when combined with other data.
This is more than a security concern; it can also come with legal implications. Regulations like GLBA, CCPA, CPRA, and GDPR require financial institutions to apply strong protections to customer data. In practice, that means sensitive information must be de-identified before documents are used for development, testing, or analytics, or shared with partners.
As financial institutions begin to embrace AI, it has become apparent that de-identification is not optional; it’s a prerequisite for compliant use of financial information.
Redaction vs. synthesis
Tonic Textual approaches de-identification through two strategies: redaction and synthesis.
Redaction removes sensitive values entirely, replacing them with black boxes or categorical identifiers like [NAME] or [ACCOUNT NUMBER]. This approach minimizes risk but can also remove useful context from the document.
Synthesis replaces sensitive entities with realistic but fictional alternatives. “Tom” becomes “Steve” and account numbers are replaced with structurally valid but fake numbers. This allows documents to remain readable and useful while ensuring the original data cannot be recovered.
The choice of strategy ultimately comes down to the use case. Redaction is often preferred for sharing documents with external groups (contractors or teams with different levels of clearance), while synthesis is better when the goal is to train or fine-tune a model on real-world data. With Tonic Textual, teams can apply either strategy or mix the two depending on the type of data and the downstream use case.
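The difference between the two strategies can be sketched in a few lines of Python. This is an illustration only, not Textual's implementation: it assumes the sensitive spans have already been detected, and the Entity class, the FAKE_VALUES table, and the sample account numbers are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # e.g. "NAME" or "ACCOUNT NUMBER"

# Hypothetical stand-in values; a real synthesizer would generate these.
FAKE_VALUES = {"NAME": "Steve", "ACCOUNT NUMBER": "4417220983"}

def _rewrite(text, entities, replacement):
    # Work right to left so earlier offsets stay valid after each substitution.
    for e in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:e.start] + replacement(e) + text[e.end:]
    return text

def redact(text, entities):
    """Replace each detected span with a categorical placeholder like [NAME]."""
    return _rewrite(text, entities, lambda e: f"[{e.label}]")

def synthesize(text, entities):
    """Replace each detected span with a fictional value of the same type."""
    return _rewrite(text, entities, lambda e: FAKE_VALUES[e.label])

text = "Tom paid from account 1234567890."
entities = [Entity(0, 3, "NAME"), Entity(22, 32, "ACCOUNT NUMBER")]

redacted = redact(text, entities)
synthesized = synthesize(text, entities)
print(redacted)
print(synthesized)
```

The redacted sentence is safe but loses who paid and from where; the synthesized sentence still reads like a real transaction note.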
Why financial documents are especially challenging
Financial documents do not typically follow a single schema, which means that sensitive data appears in many forms and many places.
A single bank statement may include structured fields like account numbers and balances alongside unstructured transaction descriptions covering things like payroll deductions or account activity.
Traditional rule-based approaches struggle in these environments because they either miss sensitive values or over-redact content that is not actually risky. Effective de-identification of financial documents requires understanding the context in which each value appears.
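A small sketch makes both failure modes concrete. The rule below is a deliberately naive, hypothetical one (any standalone run of nine digits is treated as a routing number), and the transaction lines are invented for illustration.

```python
import re

# Naive rule: any standalone run of 9 digits is treated as a routing number.
ROUTING_RULE = re.compile(r"\b\d{9}\b")

def rule_based_redact(text: str) -> str:
    """Apply the single pattern-based rule with no awareness of context."""
    return ROUTING_RULE.sub("[ROUTING NUMBER]", text)

lines = [
    "ACH credit, routing 123456789",       # routing number: caught
    "Invoice 100200300 paid in full",      # harmless invoice id: over-redacted
    "Transfer to Jamie Reynolds, rent",    # name in free text: missed
]

results = [rule_based_redact(line) for line in lines]
for before, after in zip(lines, results):
    print(f"{before!r} -> {after!r}")
```

The pattern cannot tell a routing number from an invoice number, and it has no way to notice a name at all. Distinguishing these cases requires the surrounding words, which is where context-aware detection comes in.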
De-identifying bank statements with Tonic Textual
Tonic Textual is designed specifically for unstructured text and uses modern document parsing and named entity recognition (NER) to identify sensitive information across multiple formats and file types.
Teams can detect and transform sensitive entities, including names, addresses, account and routing numbers, card details, dates of birth, government identifiers, and organization-specific identifiers that appear inside transaction descriptions.
Once detected, each entity type can be handled differently. Some fields can be fully redacted. Others can be synthesized. All transformations are consistent across documents, which is critical for downstream analysis.
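Consistency means that the same original value always maps to the same replacement, so a customer who appears in ten statements is still recognizably one (fictional) person afterward. How Textual achieves this is not described here; one common way to get the property is a keyed hash, sketched below. The name pool, the key, and the function name are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names and a hypothetical secret key.
FAKE_GIVEN_NAMES = ["Steve", "Tanner", "Priya", "Marcus", "Elena", "Noor"]
SECRET_KEY = b"example-key-rotate-me"

def consistent_fake_name(original: str) -> str:
    """Map the same input name to the same fake name on every call."""
    # Normalize so "Jamie" and " jamie " are treated as the same person.
    normalized = original.strip().lower().encode("utf-8")
    digest = hmac.new(SECRET_KEY, normalized, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_GIVEN_NAMES)
    return FAKE_GIVEN_NAMES[index]

# The same customer appearing in two different documents.
in_statement_a = consistent_fake_name("Jamie")
in_statement_b = consistent_fake_name(" jamie ")
print(in_statement_a == in_statement_b)
```

With a pool this small, different originals will often collide on the same fake name, and a fake name can coincide with a real one; a realistic generator would draw from a much larger space. The key matters because an unkeyed hash would let anyone who guesses an original value confirm the mapping.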
A practical example using bank statements
To demonstrate how this works in practice, we created three fictional bank statements with intentionally embedded sensitive data. These documents were mocked up solely for the purpose of demoing Tonic Textual and can be downloaded so that you can replicate this exercise yourself.
Step 1: Create a dataset using documents that require redaction
A dataset within Tonic Textual is essentially a collection of documents that share a de-identification strategy. Datasets are created because all of the documents contained within are typically being prepared for the same purpose downstream. This view within the Textual UI gives users a central place to upload and manage different de-identification projects.

When configuring the dataset, you should give it a name that aligns with the workflow and specify your output type: either a de-identified version of the original document in the same format and file type, or JSON output that can be fed directly into a model.
Step 2: Configure your de-identification strategy

After you have uploaded your documents into the dataset, you are able to determine your de-identification strategy by selecting whether you want to redact, synthesize, or ignore sensitive entities detected by Textual. For the sake of this exercise, we have chosen to redact all entities with the exception of Given (or first) name and Family (last) name, which will be synthesized.
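The configuration chosen for this exercise amounts to a default action plus a short list of exceptions. The sketch below expresses that idea as plain Python data; it does not use Textual's API, and the Action enum and the entity type labels (NAME_GIVEN, NAME_FAMILY, ACCOUNT_NUMBER) are illustrative assumptions.

```python
from enum import Enum

class Action(Enum):
    REDACT = "redact"
    SYNTHESIZE = "synthesize"
    IGNORE = "ignore"

# Redact everything by default; list only the exceptions explicitly.
DEFAULT_ACTION = Action.REDACT
POLICY = {
    "NAME_GIVEN": Action.SYNTHESIZE,   # given (first) name
    "NAME_FAMILY": Action.SYNTHESIZE,  # family (last) name
}

def action_for(entity_type: str) -> Action:
    """Look up the configured action, falling back to the default."""
    return POLICY.get(entity_type, DEFAULT_ACTION)

for entity_type in ("NAME_GIVEN", "NAME_FAMILY", "ACCOUNT_NUMBER"):
    print(entity_type, "->", action_for(entity_type).value)
```

Defaulting to redaction is the conservative choice: an entity type nobody thought to configure is removed rather than passed through.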
Step 3: Preview initial results

In many cases the initial output may be sufficient; however, this view gives users the ability to verify that sensitive information has been sufficiently redacted (and that information necessary to the use case has been retained). We can see here that the previously configured de-identification strategy has been adhered to: all PII has been redacted with the exception of names (“Jamie Reynolds” has been synthesized to “Tanner Babic”).
Step 4: Adjust as needed
From the preview, users can make any necessary changes manually. For this exercise, we decided that we want to preserve the date range of the financial statement. This is accomplished by clicking on the entity in the original document and then selecting “ignore”. The result is that the statement period appears unredacted in the new document, while all other redactions and synthesis remain intact.
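Conceptually, an “ignore” choice is a per-entity override layered on top of the type-level strategy. The sketch below shows that idea on an invented statement line; the helper names, labels, and dates are hypothetical, and it only illustrates the behavior rather than how Textual stores manual edits.

```python
def span(text, value, label):
    """Locate a known value in the text and return (start, end, label)."""
    start = text.index(value)
    return (start, start + len(value), label)

def apply_with_overrides(text, entities, ignored_spans):
    """Redact every detected entity except the spans a reviewer marked as ignored."""
    out = text
    # Process right to left so earlier offsets remain valid.
    for start, end, label in sorted(entities, reverse=True):
        if (start, end) in ignored_spans:
            continue  # reviewer chose "ignore": keep the original text
        out = out[:start] + f"[{label}]" + out[end:]
    return out

text = "Statement period: 01/01/2024 - 01/31/2024. Account holder DOB: 03/14/1985."
entities = [
    span(text, "01/01/2024", "DATE"),
    span(text, "01/31/2024", "DATE"),
    span(text, "03/14/1985", "DOB"),
]

# The reviewer keeps the statement period but not the date of birth.
ignored = {entities[0][:2], entities[1][:2]}
result = apply_with_overrides(text, entities, ignored)
print(result)
```

Because the override is tied to specific spans rather than to the entity type, other dates in the document continue to follow the configured strategy.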

Step 5: Export the redacted dataset
Exporting the final documents takes a single click. Simply select “Download all files” from the Dataset top view, and you will be prompted to download a .zip file with all of your redacted documents. With this workflow, preparing large volumes of documents for safe downstream use can be accomplished in minutes instead of hours.
Unlocking financial text data safely
The barrier to AI development for financial institutions is not a lack of data; it’s a lack of confidence that the data will be used safely.
De-identification is what makes that confidence possible. With Tonic Textual, teams can move beyond one-off redaction scripts and build scalable, repeatable workflows for sensitive financial documents.
If you are exploring how to safely use bank statements or other financial text in non-production environments, we would love to show you how it works.

*** This is a Security Bloggers Network syndicated blog from Expert Insights on Synthetic Data from the Tonic.ai Blog authored by Expert Insights on Synthetic Data from the Tonic.ai Blog. Read the original post at: https://tonicfakedata.webflow.io/blog/de-identify-financial-documents

About Author

AndyC

Andy Curtis is an award-winning security consultant, researcher and public speaker. He has been working in the computer security industry since the early 1990s, having been employed by state and federal government, leading healthcare and banking providers across three continents. He has given talks about computer security for some of the world’s largest companies, worked with law enforcement agencies on investigations into hacking groups, and is a regular voice on TV and radio explaining IT security threats.