ABSTRACT
The efficacy of Anomaly Detection (AD) sensors depends heavily on the
quality of the data used to train them. Artificial or contrived
training data may not provide a realistic view of the deployment
environment. Most realistic data sets are dirty; that is, they
contain a number of attacks or anomalous events. The size of these
high-quality training data sets makes manual removal or labeling of
attack data infeasible. As a result, sensors trained on this data can
miss attacks and their variations. We propose extending the training
phase of AD sensors (in a manner agnostic to the underlying AD
algorithm) to include a sanitization phase.
This phase generates multiple models conditioned on small slices of
the training data. We use these 'micro-models' to produce
provisional labels for each training input, and we combine the
micro-models in a voting scheme to determine which parts of the
training data may represent attacks. Our results suggest that this
phase automatically and significantly improves the quality of
unlabeled training data by making it as 'attack-free' and
'regular' as possible in the absence of absolute ground truth. We
also show how a collaborative approach that combines models from
different networks or domains can further refine the sanitization
process to thwart targeted training or mimicry attacks against a
single site.
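The micro-model voting scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the FrequencyModel toy detector, the method names (fit, is_abnormal), and the default vote threshold are all assumptions made for the example.

```python
class FrequencyModel:
    """Toy anomaly detector that flags inputs unseen during training.
    Stands in for any AD sensor; the scheme is agnostic to the algorithm."""
    def fit(self, data):
        self.seen = set(data)

    def is_abnormal(self, x):
        return x not in self.seen


def sanitize(training_data, slice_size, make_model, vote_threshold=0.5):
    """Train one micro-model per slice of the data, then keep only the
    inputs that at most a vote_threshold fraction of micro-models flag."""
    # 1. Partition the training data into small slices.
    slices = [training_data[i:i + slice_size]
              for i in range(0, len(training_data), slice_size)]

    # 2. Condition one micro-model on each slice.
    micro_models = []
    for s in slices:
        m = make_model()
        m.fit(s)
        micro_models.append(m)

    # 3. Vote: an input localized to few slices (e.g. an attack burst)
    #    is flagged abnormal by most micro-models and is removed.
    sanitized = []
    for x in training_data:
        votes = sum(m.is_abnormal(x) for m in micro_models)
        if votes / len(micro_models) <= vote_threshold:
            sanitized.append(x)
    return sanitized


# Example: a single attack event hides in otherwise regular traffic.
data = ['a', 'a', 'a', 'a', 'attack', 'a', 'a', 'a', 'a']
clean = sanitize(data, slice_size=3, make_model=FrequencyModel)
```

Here 'attack' appears in only one of three slices, so two of the three micro-models vote it abnormal and it is dropped, while 'a' is kept by all of them.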
(This is joint work with Gabriela F. Cretu, Michael E. Locasto,
Salvatore J. Stolfo, and Angelos D. Keromytis.)
BIOGRAPHY
Angelos Stavrou is an Assistant Professor in the Department of Computer Science and a member of the Center for Secure Information Systems at George Mason University, Fairfax, Virginia. He received his M.Sc. in Electrical Engineering and his M.Phil. and Ph.D. (with distinction) in Computer Science, all from Columbia University. He also holds an M.Sc. in theoretical Computer Science from the University of Athens and a B.Sc. in Physics with distinction from the University of Patras, Greece. His current research interests include security for distributed systems, network reliability, anonymity, and statistical inference, with a focus on building and deploying large-scale systems.