I am a French New Yorker and a CS student at Columbia. Among many other things, I like to build stuff, learn new programming languages, travel, and rock climb. I also love to read, on topics ranging from econ to history and biology.
My research is in distributed systems. I currently focus on leveraging statistics and machine learning to improve data management and transparency. I am also interested in the impact of the sharing economy.
Data has become the principal asset of the Internet era. While this data offers
unique opportunities to improve personal and business effectiveness, it also
poses serious risks to users' privacy, and to organizations, by exposing
extensive data stores to external and internal attacks.
In my research, I build tools and design mechanisms that leverage statistics and
machine learning to (1) increase the current Web's transparency by revealing
how personal data is being used,
and (2) enable a more rigorous and selective approach to big data collection, access,
and protection, so that we can reap its benefits without imposing undue risks.
Selective Data Systems
Challenging the common practice in both private and public sectors of
collecting vast quantities of personal information, I ask whether it is
possible to build data-driven systems that are more selective with the
data they collect. To explore this question I built Pyramid, a data
management system that leverages training set minimization techniques to
reduce data exposure in ML applications. More precisely, Pyramid uses
count-based featurization to summarize past data before it is archived
in cold storage. The counts, kept differentially private, are used with
a small number of recent observations, called the hot data, to train ML
models. Using this technique, as well as system mechanisms to reduce
the impact of differentially private noise, Pyramid is within 4% of
previous models' accuracy while training on, and thus exposing, less
than 1% of the raw data. This way, ML-based applications can reap the
benefits of big data without undue risks.
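
To make the idea of count-based featurization with differentially private counts more concrete, here is a minimal Python sketch. It is not Pyramid's implementation: the `CountTable` class, the `featurize` helper, the add-one smoothing, and the choice of Laplace noise with scale 1/epsilon for a single release are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class CountTable:
    """Hypothetical table of per-value label counts for one categorical feature."""

    def __init__(self, epsilon, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # value -> [count of label 0, count of label 1]
        self.counts = defaultdict(lambda: [0, 0])

    def observe(self, value, label):
        """Record one past observation; label must be 0 or 1."""
        self.counts[value][label] += 1

    def release(self):
        """Return a noisy copy of the counts.

        Each observation changes exactly one count by 1, so Laplace noise
        with scale 1/epsilon is the usual choice for a single release
        (an assumption of this sketch, not a claim about Pyramid).
        """
        scale = 1.0 / self.epsilon
        return {
            value: [c + laplace_noise(scale, self.rng) for c in pair]
            for value, pair in self.counts.items()
        }


def featurize(noisy_counts, value):
    """Replace a raw value with a smoothed estimate of P(label = 1 | value)."""
    neg, pos = noisy_counts.get(value, (0.0, 0.0))
    # Noise can push counts below zero; clamp before computing the ratio.
    neg, pos = max(neg, 0.0), max(pos, 0.0)
    return (pos + 1.0) / (neg + pos + 2.0)
```

In this sketch, a model would be trained on the output of `featurize` for a small window of recent observations, while the raw historical records behind the counts could be moved to cold storage.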
Data Use Transparency Infrastructure
To add transparency to data uses on the Web, I am building a series of
scalable, generic, and reliable tools to detect data flows within and
across web services. My initial system, XRay, provides a first
design and theoretical building blocks to detect the use of digital
personal data for targeting and personalization. The key insight in XRay
is to infer targeting by correlating user inputs (such as searches,
emails, or locations) to service outputs (such as ads, recommendations,
or prices) based on observations obtained from user profiles populated
with different subsets of the inputs. My latest tool, Sunlight,
uses rigorous statistical methods to determine the causes of online
targeting at large scale and with solid statistical justification.
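
The correlation idea behind XRay described above can be illustrated with a small Python sketch. This is a deliberately simplified stand-in, not XRay's or Sunlight's actual algorithm: the `targeting_scores` function and the scoring rule (difference in how often an output appears in profiles with and without a given input) are assumptions chosen for illustration.

```python
def targeting_scores(profiles, output):
    """Score each input by how strongly its presence predicts an output.

    profiles: list of (inputs, outputs) pairs, each a set of strings,
              one pair per profile populated with a subset of the inputs.
    Returns a dict mapping each input to
    P(output seen | input present) - P(output seen | input absent).
    """
    all_inputs = set().union(*(inputs for inputs, _ in profiles))
    scores = {}
    for candidate in all_inputs:
        seen_with = [output in outs for ins, outs in profiles if candidate in ins]
        seen_without = [output in outs for ins, outs in profiles if candidate not in ins]
        rate_with = sum(seen_with) / len(seen_with) if seen_with else 0.0
        rate_without = sum(seen_without) / len(seen_without) if seen_without else 0.0
        scores[candidate] = rate_with - rate_without
    return scores


# Hypothetical observations: five profiles with different input subsets.
example_profiles = [
    ({"email:ski trip", "search:flights"}, {"ad:ski gear"}),
    ({"email:ski trip"}, {"ad:ski gear"}),
    ({"search:flights"}, set()),
    ({"search:flights", "email:recipes"}, {"ad:cookware"}),
    ({"email:ski trip", "email:recipes"}, {"ad:ski gear", "ad:cookware"}),
]

example_scores = targeting_scores(example_profiles, "ad:ski gear")
likely_cause = max(example_scores, key=example_scores.get)
```

With these made-up profiles, the highest-scoring input for the ski gear ad is the ski trip email, since the ad appears in every profile that contains that email and in none that lack it. A real system would additionally need to account for noise and for testing many hypotheses at once.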
Mathias Lecuyer, Riley B. Spahn, Roxana Geambasu, Tzu-Kuo Huang, and Siddhartha Sen. "Pyramid: Enhancing selectivity in big data protection with count featurization." (S&P'17) [PDF][Website]
Mathias Lecuyer, Max Tucker, and Augustin Chaintreau. "Improving the transparency of the sharing economy." (WWW'17) [PDF][Blog post]
Mathias Lecuyer, Riley Spahn, Giannis Spiliopoulos, Augustin Chaintreau, Roxana Geambasu, and Daniel Hsu. "Sunlight: Fine-grained Targeting Detection at Scale with Statistical Confidence." (CCS'15) [PDF][Website][Mentioned in The Economist]
Nicolas Viennot, Mathias Lecuyer, Jonathan Bell, Roxana Geambasu, and Jason Nieh. "Synapse: New Data Integration Abstractions for Agile Web Application Development." (EuroSys'15) [PDF][Website]
Mathias Lecuyer, Guillaume Ducoffe, Francis Lan, Andrei Papancea, Theofilos Petsios, Riley Spahn, Augustin Chaintreau, and Roxana Geambasu. "XRay: Increasing the Web's Transparency with Differential Correlation." (USENIX Security'14) [PDF][Website][NYT Bits article]