Computer Science Department
School of Computer Science, Carnegie Mellon University
Increasing the Scalability of
The continued growth of the Web and its increasing role in our daily life has created new technical and social challenges. On the technical side, applications deployed on the Internet suffer from unpredictable load, especially due to events such as breaking news (e.g., Hurricane Katrina) and sudden popularity spikes (e.g., the "Slashdot Effect"). A large number of these Web applications increasingly use a database to generate customized and personalized responses to users' requests. Because of the widely varying load, currently there is no economical way to provision infrastructure for many of these applications in which the database system is the bottleneck. On the social side, Web applications increasingly collect sensitive data, which must be kept private.
In this dissertation we address both these technical and social challenges. We design and implement a Database Scalability Service (DBSS), which can offer scalability to data-intensive applications as a plug-in subscription service with a per-usage charge. A DBSS works by caching applications' data and answering queries on their behalf. It uses a large shared infrastructure to absorb load spikes that may occur in any individual application. We address two key issues in designing a DBSS: (a) the privacy concerns of applications in allowing the DBSS to cache their data, and (b) the performance concerns due to the high latency applications face in accessing their data in a DBSS setting.
Simply encrypting all the data that passes through the DBSS is not a feasible solution to an application's privacy concerns. On an update, the DBSS must invalidate (at least) data from its cache that have changed . If an application encrypts all the data passing through the DBSS, the DBSS cannot discern any information about what data it is caching. The DBSS then is forced to invalidate large amounts of data from its cache on any update, which leads to poor scalability. In deciding how much data (that passes through the DBSS) to encrypt, the application faces a tradeoff between privacy and scalability. On the one hand, encrypting more data means that the DBSS will invalidate far more than needed, decreasing scalability. On the other hand, encrypting less data raises privacy concerns. We study this tradeoff both formally and empirically. To simplify the task of managing this tradeoff, we devise a method for statically identifying segments of the database that can be encrypted without impacting scalability. Experiments with three realistic benchmark applications show that our static method is effective. For each application, it identifies a significant fraction of the database that can be encrypted without any scalability penalty. Moreover, most of the data that it identifies is "moderately" sensitive, which application designers will want to encrypt, if doing so has no performance overhead.
For some applications, extra information from the database, beyond the data passing through the DBSS, is useful in making invalidation decisions. We present invalidation clues, a framework that allows applications to provide this extra information to the DBSS. Clues also provide fine-grained control to the applications for disclosing any other information to the DBSS that reveals little, yet limits the number of unnecessary invalidations. Our experiments using three Web benchmark applications on our prototype DBSS confirm that invalidation clues are indeed a low-overhead, effective, and general technique for applications to balance their privacy and scalability needs.
To address the performance concerns due to the high latency an application faces in accessing its data in the DBSS setting, we devise compiler-driven transformations that reduce the number of times an application must access its data. Using our three benchmark applications, we show that our transformations apply widely and indeed reduce the number of times an application has to access its data. Finally, on our prototype DBSS, we confirm that this reduction significantly improves scalability.