Fingerprints beyond device IDs: engineered representations for fraud detection
In fraud and bot detection, people usually think of fingerprinting as the classic browser or device fingerprint. This comes from techniques that use JavaScript and HTTP signals to recognize a device across sessions. We also use this type of fingerprint at Castle to detect account takeover attempts or users creating multiple accounts. But fingerprinting covers many more ideas.
However, not all fingerprints aim to identify a single device. TLS fingerprints like JA3, for example, group many clients together; their value comes from describing the environment, not from being unique. Some attributes commonly included in a fingerprint, like navigator.webdriver, do not contribute much to uniqueness. They simply reveal possible automation or spoofing when combined with other signals.
In this article, I do not try to redefine fingerprinting. The term already has several meanings. Instead, I want to share the perspective I use when building detection systems. I like to think of a fingerprint as an engineered representation of a set of attributes that captures something useful for detection. It can be a hash, a list of derived fields, or a pattern extracted from actions. It does not need to be unique or stable for weeks. It only needs to reflect the structure you care about.
Thinking about fingerprints in this way makes it easier to solve real fraud detection problems. A device fingerprint, an email pattern, a TLS handshake signature, or a behavioral sequence all follow the same idea: they turn raw data into a compact representation that is easier to group, compare, or monitor over time. The rest of the article shows how these fingerprints can be built and how they help with practical fraud and bot detection tasks.
The traditional browser and device fingerprint
Most people think of fingerprinting as the classic browser or device fingerprint. It comes from collecting JavaScript and HTTP attributes such as the user agent, screen size, hardware concurrency, the WebGL renderer, the canvas rendering output, or navigator.deviceMemory. The goal is usually to build something unique and stable enough across sessions to recognize the same device over time.
We use this type of fingerprinting at Castle. Our device identification helps detect account takeover attempts when a login comes from a device the user has never used before, and it helps catch multi account creation from the same environment. You can see examples of this in our device product page and our device fingerprinting docs.
This type of fingerprint solves an important set of problems, but it is limited to describing the user device. In practice, user actions, network signals, input patterns, and protocol details do not all fit into one coherent representation. You could technically hash everything together, but that would not help anyone. A fingerprint is only useful if it exposes a structure that analysts or detection logic can work with. Device attributes, email patterns, TLS handshakes, and interaction sequences describe very different things. Forcing them into a single fingerprint would hide the patterns you actually care about.
This is why fraud detection systems rely on many fingerprints, each focused on a specific aspect of the traffic. TLS fingerprints like JA3 describe the client’s protocol stack and networking behavior. Behavioral fingerprints capture repeated or scripted interaction flows. They are all fingerprints, but they are built for different goals.
The point is simple. A device fingerprint is a valuable tool, but it cannot represent everything you need to detect fraud and bots. Other fingerprints capture different slices of attacker behavior, and combining them gives you a much clearer picture than any single fingerprint could.
A broader view: fingerprints as engineered representations
If we step back from the classic device fingerprint, we can describe many other useful signals in the same way. I like to see a fingerprint as an engineered representation of a set of attributes. You take some raw inputs, you transform or reduce them, and you produce something that captures a pattern you care about. It does not need to be unique. It does not need to be stable for weeks. It only needs to describe something that helps you detect fraud or bot activity.
Here is what I mean by “engineered representation”. You decide what you want to encode and why. For example, you may want to group emails that follow the same structure, group TLS handshakes that come from the same family of clients, or group user interactions that follow the same automation script. The fingerprint is the tool you build to encode this pattern. This is also how many teams already work when they create features for ML models. They do not use raw attributes directly. They build representations that reflect the structure of the problem.
A few examples make this clearer:
Email patterns. Many attackers generate usernames with the same structure, for example random letters plus a sequence of digits. You can encode this structure into a simple fingerprint that groups similar emails together.
TLS or protocol behavior. Clients using the same automation framework often share the same TLS handshake structure. A fingerprint like JA3 will group them naturally.
User actions. Bots repeating the same interaction sequence during signup will produce the same high-level pattern, even if they add small noise to mouse movements or timings.
These fingerprints all follow the same idea, even if they represent very different parts of the system. You collect attributes first, then you engineer a representation that keeps only the information you need. This separation helps you reason about the problem. It forces you to pick the right attributes for each fingerprint and to decide how much detail you want to keep.
This approach is also practical for fraud detection. Representations like these are easier to store, compare, and monitor over time than high-dimensional raw data. They remove noise and make patterns more obvious. The next sections show how to build these fingerprints in practice and how they help solve real detection problems.
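To make the idea concrete, here is a minimal Python sketch of what an engineered representation can boil down to: a stable hash over a short list of derived fields. The field values are illustrative, not a prescribed schema.

import hashlib

def fingerprint(fields):
    # Join the derived fields in a fixed order and hash them.
    # The choice of fields defines what counts as "the same":
    # two inputs that reduce to the same fields get the same fingerprint.
    canonical = "|".join(str(f) for f in fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Example: a coarse client fingerprint built from three derived fields
print(fingerprint(["chrome", "macos", "1920x1080"]))

Everything that follows is a variation on this pattern: pick the attributes, reduce them, and hash or store the result.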
Designing fingerprints in practice
When you start thinking of fingerprints as small engineered representations instead of one giant “device ID”, the landscape becomes much simpler. In fraud detection, attackers leave patterns everywhere, but those patterns show up in different places. One fingerprint will never capture everything. Instead, you build several fingerprints, each focused on one slice of the problem, and each one helps you catch a different kind of abuse. Below are a few examples of simple fingerprints you could use.
Environment and protocol fingerprints
These come from the network or protocol layer. Things like TLS handshakes, TCP signatures, or HTTP header ordering. They have nothing to do with the user, and they are not unique at all. That is exactly why they are useful. Many bots rely on the same HTTP clients/libraries, or automation frameworks, especially those that do not run full browsers. As a result, they tend to expose very similar protocol behavior, which makes them easy to cluster.
A custom TLS fingerprint might look like this:
tls_fingerprint = hash([
  client_hello.version,
  count(client_hello.cipher_suites),
  hash_in_order(client_hello.cipher_suites),
  hash_in_order(client_hello.extensions)
])
In theory, thousands of clients can share the same value. In practice, attackers often cluster very tightly on it.
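For readers who want something executable, here is a minimal Python sketch of this kind of TLS fingerprint. It assumes the ClientHello has already been parsed upstream into numeric cipher suite and extension IDs; real implementations (JA3 included) work from the raw handshake bytes, and JA3 conventionally uses an MD5 digest.

import hashlib

def tls_fingerprint(version, cipher_suites, extensions):
    # Order is preserved on purpose: two clients offering the same
    # suites in a different order get different fingerprints.
    parts = [
        str(version),
        str(len(cipher_suites)),
        "-".join(str(c) for c in cipher_suites),
        "-".join(str(e) for e in extensions),
    ]
    return hashlib.md5("|".join(parts).encode()).hexdigest()

# Illustrative values for a parsed ClientHello
print(tls_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11]))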
Device and browser fingerprints
This is the part everyone knows: WebGL, canvas, hardware concurrency, deviceMemory, and all the usual JavaScript and HTTP attributes. In practice, device fingerprinting is not just a client-side hash. Browser signals are combined with server-side matching logic that takes into account IP, location, and device history. Together, they help you detect two things:
returning devices (for example during account takeover investigations)
inconsistent devices (for example anti-detect browsers lying about capabilities)
A simple example:
device_fp = hash([
  normalize_browser(ua.browser),
  normalize_os(ua.os),
  bucket_screen(screen.resolution),
  navigator.deviceMemory,        // RAM
  navigator.hardwareConcurrency, // CPU cores
  extract_gpu_brand(webgl.renderer)
])
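A runnable Python sketch of the same idea follows. The normalization and bucketing helpers are hypothetical stand-ins for whatever cleanup you already do on raw attributes; only the reduce-then-hash logic matters here.

import hashlib

def bucket_screen(width, height):
    # Round to the nearest 100 px so minor resolution changes
    # do not break the fingerprint.
    return f"{round(width, -2)}x{round(height, -2)}"

def extract_gpu_brand(renderer):
    # Keep only the vendor token from the WebGL renderer string,
    # e.g. "ANGLE (NVIDIA GeForce ...)" -> "nvidia".
    renderer = renderer.lower()
    for brand in ("nvidia", "amd", "intel", "apple", "adreno", "mali"):
        if brand in renderer:
            return brand
    return "other"

def device_fp(attrs):
    fields = [
        attrs["browser"],                    # normalized upstream
        attrs["os"],                         # normalized upstream
        bucket_screen(attrs["screen_w"], attrs["screen_h"]),
        str(attrs["device_memory"]),         # RAM in GB
        str(attrs["hardware_concurrency"]),  # CPU cores
        extract_gpu_brand(attrs["webgl_renderer"]),
    ]
    return hashlib.sha256("|".join(fields).encode()).hexdigest()[:16]

print(device_fp({"browser": "chrome", "os": "macos", "screen_w": 1920,
                 "screen_h": 1080, "device_memory": 8,
                 "hardware_concurrency": 10,
                 "webgl_renderer": "ANGLE (NVIDIA GeForce RTX 3060)"}))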
Interaction and behavior fingerprints
Raw interaction data is noisy. Mouse coordinates change slightly, timings vary, and attackers often add random delays or extra events. Instead of trying to capture everything, the idea is to extract the structure of the interaction.
For example, imagine a short sequence of events during a signup flow:
t = 0ms    mousemove(x=12, y=25)
t = 8ms    mousemove(x=14, y=27)
t = 16ms   mousemove(x=18, y=30)
t = 120ms  click(x=18, y=30)
t = 350ms  keydown(key="a")
t = 380ms  keyup(key="a")
t = 600ms  mousemove(x=40, y=80)
If you look at the raw data, everything is different every time: coordinates, timings, even the number of mouse moves. But the structure is often very stable for bots.
One simple way to encode this is to:
drop timestamps
drop coordinates
deduplicate consecutive identical events
keep only the event order
This turns the raw interaction stream into something like:
["mousemove", "click", "keydown", "keyup", "mousemove"]
From there, the fingerprint becomes trivial:
behavior_fp = hash([
  dedupe(event.type for event in session.events)
])
This fingerprint ignores noise on purpose. A bot replaying the same script will produce the same high-level sequence across many sessions, even if it adds random delays or slightly different mouse paths. Humans, on the other hand, almost never produce identical sequences at that level of abstraction.
You can make this fingerprint more or less strict (and therefore unique) depending on the use case. You might include form step boundaries, navigation events, or coarse timing buckets. The key point is always the same: you are not trying to model human behavior perfectly. You are trying to expose repeated structure that shows up when automation is involved.
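Here is a runnable Python version of that fingerprint, with an optional coarse timing bucket to make it stricter, as discussed above. The bucket size is an illustrative parameter, not a recommendation.

import hashlib
from itertools import groupby

def behavior_fp(events, timing_bucket_ms=None):
    # events: list of (timestamp_ms, event_type) tuples.
    types = [etype for _, etype in events]
    # Drop timestamps and coordinates, deduplicate consecutive
    # identical events, and keep only the order.
    sequence = [etype for etype, _ in groupby(types)]
    if timing_bucket_ms:
        # Stricter variant: also encode coarse timing buckets.
        sequence += [str(ts // timing_bucket_ms) for ts, _ in events]
    return hashlib.sha256("|".join(sequence).encode()).hexdigest()[:16]

# The signup sequence above collapses to
# ["mousemove", "click", "keydown", "keyup", "mousemove"]
events = [(0, "mousemove"), (8, "mousemove"), (16, "mousemove"),
          (120, "click"), (350, "keydown"), (380, "keyup"),
          (600, "mousemove")]
print(behavior_fp(events))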
Identity and input fingerprints
When attackers create accounts in bulk, they often generate emails the same way: same structure, same kind of domains, same weird TLDs. You usually do not care about one email address. You care about grouping addresses that look like they come from the same generator.
Here is an example of 6 emails that clearly “look related” (the addresses are illustrative, following the pattern described below):
xkvqztrf482913@mailqbox.xyz
pwldnshm175062@mailqnet.top
rqzvmtyk903817@zentamail.fun
bhxkwpdj264095@mailqhub.site
tmvrqlzs581730@quickmsg.xyz
fjnpwxcb739418@mailqpro.top
What stands out:
the local part is random letters followed by a long digit suffix
digit suffix is usually 6 digits
domains are not mainstream providers
TLDs are “cheap” / uncommon in legit signups (.xyz, .top, .fun, .site)
some domains have a repeated pattern (mailq)
You can turn these observations into a fingerprint by extracting a few stable features that capture the structure, not the exact string.
For example:
email_fp = hash([
  localpart_charset(email),          // mostly letters+digits
  localpart_letter_run(email),       // long run of letters at start
  localpart_trailing_digits(email),  // true/false
  trailing_digits_bucket(email),     // e.g. "6+" instead of exact count
  is_mainstream_domain(email.domain),
  tld_bucket(email.domain.tld),      // e.g. "common" vs "uncommon"
  domain_token(email.domain)         // e.g. contains "mail" or other tokens/substrings
])
Now, apply it to the examples above. They will all map to roughly the same representation:
localpart_charset = letters+digits
localpart_letter_run = long
localpart_trailing_digits = true
trailing_digits_bucket = 6+
is_mainstream_domain = false
tld_bucket = uncommon
domain_token = "mail" (optional, depends on your heuristic)
So the fingerprint might effectively look like:
email_fp = hash([
  "letters+digits",
  "long_letter_prefix",
  true,
  "6+",
  false,
  "uncommon_tld",
  "mail_token"
])
That fingerprint will not uniquely identify a user. That is not the goal. The goal is that when an attacker rotates emails during a bulk signup attack, you still cluster them together because the generator behind the emails did not change.
If you want to make it more resilient, you can generate multiple fingerprints at different granularities. For example:
one that uses only the local part structure
one that uses only the domain and TLD structure
one that combines both
This avoids overfitting to a single choice of features, and makes it harder for attackers to evade just by changing one small part of the email format.
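A Python sketch of this multi-granularity approach might look like the following. The feature extractors mirror the pseudocode above; the mainstream-provider and common-TLD lists are illustrative placeholders, not a curated dataset.

import hashlib
import re

MAINSTREAM_DOMAINS = {"gmail.com", "yahoo.com", "outlook.com", "hotmail.com"}
COMMON_TLDS = {"com", "net", "org", "edu"}

def local_features(local):
    trailing = re.search(r"(\d+)$", local)
    digits = len(trailing.group(1)) if trailing else 0
    return [
        "letters+digits" if re.fullmatch(r"[a-z]+\d+", local) else "other",
        "6+" if digits >= 6 else str(digits),  # bucket, not an exact count
    ]

def domain_features(domain):
    tld = domain.rsplit(".", 1)[-1]
    return [
        str(domain in MAINSTREAM_DOMAINS),
        "common" if tld in COMMON_TLDS else "uncommon",
        "mail_token" if "mail" in domain else "no_token",
    ]

def email_fps(email):
    # Three granularities: local part only, domain only, and combined.
    local, domain = email.lower().split("@", 1)
    h = lambda fields: hashlib.sha256("|".join(fields).encode()).hexdigest()[:12]
    lf, df = local_features(local), domain_features(domain)
    return {"local": h(lf), "domain": h(df), "combined": h(lf + df)}

Applied to the example addresses above, the local-part fingerprint is identical across all six, and the domain fingerprint clusters most of them together, so the generator stays visible even if one part of the format changes.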
Infrastructure fingerprints
Attackers rent infrastructure. They rotate IPs, but they usually stay within the same ASNs, the same hosting providers, or the same proxy services. You can encode these signals into a simple representation and use it to understand the broader “shape” of the attack.
Something like:
infra_fp = hash([
  bucket_asn(ip.asn),
  ip.country,
  classify_network(ip),   // residential / hosting / mobile
  has_proxy_signals(ip),
  reputation_bucket(ip)   // low / medium / high risk
])
Not something you use alone, but very powerful once you combine it with device or input fingerprints.
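As a runnable sketch, assuming the enrichment fields (ASN, network type, proxy and reputation signals) already come from whatever IP intelligence source you use:

import hashlib

def infra_fp(ip_info):
    # ip_info is assumed to be pre-enriched with ASN, geolocation,
    # network classification, and proxy/reputation signals.
    fields = [
        str(ip_info["asn"]),
        ip_info["country"],
        ip_info["network_type"],        # residential / hosting / mobile
        str(ip_info["proxy_signals"]),  # boolean proxy indicator
        ip_info["reputation"],          # low / medium / high risk bucket
    ]
    return hashlib.sha256("|".join(fields).encode()).hexdigest()[:16]

# ASN 64496 is reserved for documentation; values are illustrative
print(infra_fp({"asn": 64496, "country": "NL", "network_type": "hosting",
                "proxy_signals": True, "reputation": "high"}))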
Using fingerprints to solve detection problems
Once you treat fingerprints as small engineered representations instead of one monolithic device ID, it becomes much easier to slice traffic, spot anomalies, and isolate attacker behavior. Different fingerprints expose different parts of the attack surface, and most attacks leak something somewhere. Here are the main ways fingerprints help in practice, but I’m pretty sure you can be creative and find other use cases.
Detecting abnormal spikes inside specific slices of traffic
A common situation is a sudden increase in traffic during signup or login. Some of it is legitimate, some of it is automated, and the two are often mixed together. When you group traffic using the right fingerprint, the anomaly usually becomes visible in a much smaller subset.
The graph below shows an example of a traffic spike isolated using Castle’s device software fingerprint. This fingerprint leverages device and browser characteristics and is designed to cluster events that share similar randomization patterns, even when individual attributes change. Instead of looking at all traffic globally, it highlights a specific slice where automated activity concentrates.
The spike might also show up elsewhere: maybe all the new accounts follow the same email structure, or share a behavior pattern that has never appeared before.
This is the simplest form of anomaly detection. No heavy ML required. You just monitor how each fingerprint behaves over time. When one fingerprint drifts or explodes, you know where to look.
Once you find the suspicious group, you can:
block or rate limit traffic matching that fingerprint
increase risk scoring
trigger a review
or go deeper by combining other fingerprints to isolate the exact malicious subset
Attackers rarely change everything at once. Their toolchain, their scripting pattern, or their identity generator usually stays consistent.
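A minimal sketch of this kind of monitoring, assuming you can count events per fingerprint over a current window and a baseline window; the thresholds are illustrative and tunable:

def spiking_fingerprints(current_counts, baseline_counts,
                         min_count=50, ratio=5.0):
    # Flag fingerprints whose current volume is far above their
    # recent baseline. min_count avoids alerting on tiny groups;
    # ratio is the spike threshold.
    spikes = {}
    for fp, count in current_counts.items():
        base = max(baseline_counts.get(fp, 0), 1)
        if count >= min_count and count / base >= ratio:
            spikes[fp] = (count, base)
    return spikes

# current_counts: events per fingerprint over the last hour;
# baseline_counts: hourly average over the previous week, for example.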
Combining fingerprints to isolate malicious subsets of traffic
Not all attacks create obvious spikes. Some are deliberately low volume, others are spread across time or mixed with legitimate traffic. Even when there is a spike, it may only appear once you look at the right subset of requests. This is where combining fingerprints becomes useful.
Looking at raw attributes does not scale. There are too many dimensions, and small variations explode the search space. Fingerprints reduce that space. When you combine a small number of them, patterns that were invisible at the global level start to stand out.
For example:
// device + email pattern
suspicious = device_fp == "A12…" &&
             email_fp.pattern == "random_letters+digits"

// behavior + TLS
suspicious = behavior_fp == "mousemove->click->keydown" &&
             tls_fp IN known_bot_tls

// infrastructure + identity
suspicious = infra_fp.hosting == true &&
             email_fp.exotic_tld == true
Individually, none of these conditions may look abnormal. Combined, they often isolate a very specific slice of traffic that explains the abuse. This is especially useful for attacks that stay below obvious thresholds or that intentionally mimic normal user behavior.
This works because attackers tend to be consistent within an attack, even when they deliberately add noise. Individual behaviors or fingerprints may change, but the randomization strategy itself usually stays stable throughout the attack. One fingerprint may vary from request to request while others remain stable: the same tooling, the same identity generator, the same infrastructure, or the same interaction flow. Combining fingerprints lets you exploit that stability without relying on any single signal.
These combinations are also easy to use operationally. They can drive rules, feed scoring systems, or serve as inputs to ML models. Most importantly, they remain easy to reason about. When something fires, analysts can understand why, which part of the traffic was targeted, and how the decision was made.
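Expressed as code, such combinations are just predicates over enriched events. A sketch, with placeholder sets standing in for whatever lists your system maintains:

# Placeholder values; in practice these come from your own labeling
flagged_devices = {"a12f3c9d0e1b22aa"}
known_bot_tls = {"tls_fp_example_1", "tls_fp_example_2"}

def is_suspicious(event):
    # Each rule pairs two weak signals; none is decisive alone.
    rules = [
        # device + email pattern
        event["device_fp"] in flagged_devices
        and event["email_pattern"] == "random_letters+digits",
        # behavior + TLS
        event["behavior_fp"] == "mousemove->click->keydown"
        and event["tls_fp"] in known_bot_tls,
        # infrastructure + identity
        event["network_type"] == "hosting"
        and event["tld_bucket"] == "uncommon",
    ]
    return any(rules)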
Detecting replayed or scripted behavior
Some fingerprints should be very specific, that is, mostly unique and unstable over time. Behavioral fingerprints that take timing into account at millisecond granularity are a good example. Attackers often reuse the same automation script or replay the same interaction sequence with only tiny random noise added. A simple fingerprint like this is often enough:
behavior_fp = hash(order_of_events(session.events))
If a bot replays the same script across thousands of accounts, the same behavioral fingerprint will appear repeatedly. Humans never produce that kind of consistency. This is one of the easiest ways to catch unsophisticated bots, and even some “semi-human” ones.
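Detection then reduces to counting. A sketch, reusing the behavior_fp helper from the earlier section and an illustrative repetition threshold:

from collections import Counter

def replayed_fingerprints(sessions, min_repeats=100):
    # A behavior fingerprint shared verbatim by many sessions is a
    # strong replay signal; humans rarely repeat exact sequences.
    counts = Counter(behavior_fp(s["events"]) for s in sessions)
    return {fp: n for fp, n in counts.items() if n >= min_repeats}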
Conclusion
Fingerprinting means different things depending on the context. In fraud and bot detection, it usually refers to device identification, but in practice it covers a much broader set of techniques. Device properties, protocol behavior, identity structure, interaction flows, and infrastructure all expose patterns that can be useful for detection. Treating all of this as a single “fingerprint” hides more than it reveals.
A more practical way to think about fingerprints is as small engineered representations, each designed to capture one specific pattern. A device fingerprint helps reason about how a client presents itself. A TLS fingerprint helps cluster traffic that shares the same protocol behavior or tooling. An email or input fingerprint groups identities generated by the same process. A behavioral fingerprint highlights repeated or scripted interaction flows. Each answers a different question, and none of them is sufficient on its own.
This approach also reflects how attacks actually look in production. Many attacks do not create obvious spikes, and even when they do, the signal often only appears once you aggregate or slice the traffic in the right way. Attackers may randomize individual attributes, but the way they randomize tends to stay stable within an attack. Some fingerprints change, others do not. Combining multiple fingerprints lets you take advantage of that structure without depending on any single signal.
Working with fingerprints this way makes detection systems easier to reason about and operate. Instead of dealing with high-dimensional raw data, you work with compact representations that can be monitored over time, combined when needed, and explained during investigations. If one fingerprint becomes noisy or less reliable, it can be adjusted without redesigning the entire system.
This is not a new technique, and there is no single correct way to do it. Many teams already follow this approach implicitly when they engineer features or build heuristics. Making the concept explicit helps design better detection logic and reason more clearly about why a system catches certain attacks and misses others.
