Hot Fuzzing

My favorite line of research is always boring things. Boring things are overlooked and understudied, and usually turn out to be quite interesting. One area of work that we’ve been doing at Bloomberg (though I’ve been only peripherally involved with) is table understanding. As a natural document phenomena, tables are a relatively boring — but being able to apply deep learning to their understanding has lead to a lot of interesting questions.
Another area, similarly inspired, is the application of machine learning to problems in software engineering and (computer) systems infrastructure.

While at Google, I started an effort in automatic debugging, that lead to a few papers on debugging latency in deeply stacked SOA (service-oriented-architectures) e.g. a paper in Hotcloud 2011. In the SOA architectures we looked at, there typically were tens of services, each maintained by separate teams, that coordinated processing of an incoming request. The work attempted to model this processing (tricky for a number of reasons) and then to use these models to assess why a particular service was slow.

For the past two years, I’ve been look at a different problem: software fuzzing. The goal of software fuzzing is to test an arbitrary binary by searching over the input space and trying to find inputs that crash the program. The problem is appealing for a number of reasons. Computer security is so vastly asymmetric — defense is much harder than offense. Just one bug in the wrong place (e.g. heartbleed) can let attackers win. Fuzzing is a surprisingly practical solution that has found many bugs in widely deployed software (see American_fuzzy_lop). From a machine learning perspective, fuzzing is a rich target. Each fuzzing run for a particular binary generates hundreds of thousands to millions of observations — observed system behavior from a particular input. With so many natural observations, the door is open to increasingly interesting modeling.

Based on this set-up, Sidd Karamcheti, David Rosenberg and I published two articles. The first paper came out in the AI Security conference and described a way to improve the sampling distribution on a per program level by (1) sampling new inputs by a variety of mechanisms, (2) observing what inputs were successful at finding new program behavior, and (3) giving higher weights to successful sampling procedures.

Today, we released another paper on arxiv. In it, we propose an approach to directly model program input/output behavior and show how to use that filter out potential inputs that the model is able to well predict how it will behave. As a result, we can execute significantly fewer inputs to discover the same number of crashes.

I am incredibly excited about this result, because it leads to a practical test of statistical program understanding. An imperfect fast model can be useful for fuzzing, and this seems like a great fit for machine learning. I’m looking forward to future research in this area.

read original article at——artificial_intelligence-5