Putting Undetectable Backdoors in Machine Learning Models

This is really interesting research from a few months ago:

Abstract: Given the computational cost and technical expertise required to train machine learning models, users may delegate the task of learning to a service provider. Delegation of learning has clear benefits, and at the same time raises serious concerns of trust. This work studies possible abuses of power by untrusted learners. We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate “backdoor key,” the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees.

First, we show how to plant a backdoor in any model, using digital signature schemes. The construction guarantees that given query access to the original model and the backdoored version, it is computationally infeasible to find even a single input where they differ. This property implies that the backdoored model has generalization error comparable with the original model. Moreover, even if the distinguisher can request backdoored inputs of its choice, they cannot backdoor a new input, a property we call non-replicability.

Second, we demonstrate how to insert undetectable backdoors in models trained using the Random Fourier Features (RFF) learning paradigm (Rahimi, Recht; NeurIPS 2007). In this construction, undetectability holds against powerful white-box distinguishers: given a complete description of the network and the training data, no efficient distinguisher can guess whether the model is “clean” or contains a backdoor. The backdooring algorithm executes the RFF algorithm faithfully on the given training data, tampering only with its random coins. We prove this strong guarantee under the hardness of the Continuous Learning With Errors problem (Bruna, Regev, Song, Tang; STOC 2021). We show a similar white-box undetectable backdoor for random ReLU networks based on the hardness of Sparse PCA (Berthet, Rigollet; COLT 2013).

Our construction of undetectable backdoors also sheds light on the related issue of robustness to adversarial examples. In particular, by constructing an undetectable backdoor for an “adversarially-robust” learning algorithm, we can produce a classifier that is indistinguishable from a robust classifier, but where every input has an adversarial example! In this way, the existence of undetectable backdoors represents a significant theoretical roadblock to certifying adversarial robustness.
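
To build intuition for the first (black-box) construction, here is a rough sketch of the signature idea. This is my own illustration, not the authors' actual scheme: the backdoored classifier behaves exactly like the clean one unless the input embeds a valid digital signature under a key only the learner holds, in which case the label is flipped. For simplicity the sketch appends an Ed25519 signature to the input rather than hiding it in a slight perturbation, and assumes a binary classifier over byte strings.

```python
# Illustrative sketch only, not the paper's construction: a backdoored
# binary classifier that defers to the clean model unless the input
# carries a valid Ed25519 signature under the attacker's key.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

SIG_LEN = 64  # Ed25519 signatures are 64 bytes


def backdoored_classify(clean_classify, verify_key: Ed25519PublicKey, x: bytes) -> int:
    """Behave like clean_classify, except on inputs carrying the 'backdoor key'."""
    if len(x) > SIG_LEN:
        payload, sig = x[:-SIG_LEN], x[-SIG_LEN:]
        try:
            verify_key.verify(sig, payload)     # forging this requires the signing key
            return 1 - clean_classify(payload)  # flip the label on triggered inputs
        except InvalidSignature:
            pass
    return clean_classify(x)                    # otherwise act exactly like the clean model


def activate_backdoor(signing_key: Ed25519PrivateKey, x: bytes) -> bytes:
    """Attacker-side: turn any input into one the backdoor fires on."""
    return x + signing_key.sign(x)
```

Since producing a triggering input requires forging a signature, a computationally bounded observer with only query access cannot find an input where the clean and backdoored models disagree, and seeing triggered examples does not help it backdoor fresh inputs, which is the flavor of the non-replicability guarantee.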
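
The second (white-box) construction tampers only with the randomness used by Random Fourier Features training. For orientation, here is a bare-bones RFF learner in the standard Rahimi and Recht style; this is my sketch, with an assumed scikit-learn logistic regression on top. The paper's backdoor would replace the Gaussian draws below with samples from a CLWE-based distribution that an efficient observer cannot tell apart from Gaussian, while the rest of the training runs unchanged.

```python
# Plain Random Fourier Features training, shown only to highlight where the
# learner's "random coins" enter. The paper's attack runs this algorithm
# faithfully but swaps the Gaussian frequency draws for a CLWE-based
# distribution that looks Gaussian to any efficient distinguisher.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_rff_classifier(X, y, n_features=500, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)  # the only randomness in the learner
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))  # frequencies
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)                # phases

    def featurize(Z):
        # Random Fourier features approximating an RBF kernel with parameter gamma.
        return np.sqrt(2.0 / n_features) * np.cos(Z @ W + b)

    clf = LogisticRegression(max_iter=1000).fit(featurize(X), y)
    return featurize, clf
```

Prediction on new points is then clf.predict(featurize(X_new)). The striking part of the paper's guarantee is that even a distinguisher handed W, b, the trained weights, and the full training set cannot tell whether W came from an honest Gaussian sampler or from the backdoored one.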

Turns out that securing ML systems is really hard.
