Making Big Data Active: From Petabytes to Megafolks in Milliseconds

First-generation Big Data management efforts have given us MapReduce-based frameworks such as Hadoop, Pig, and Hive (and most recently Spark) that focus on after-the-fact data analytics, key-value ("NoSQL") stores that provide scalable key-based record storage and retrieval, and a handful of specialized systems that target problems like parallel graph analytics and data stream analysis. This project aims to continuously and reliably capture Big Data arising from social, mobile, Web, and sensed data sources and to enable timely delivery of information to users with indicated interests. In a nutshell, we aim to develop techniques that enable the accumulation and monitoring of petabytes of data of potential interest to millions of end users; when "interesting" new data appears, it should be delivered to those users within a timeframe measured in hundreds of milliseconds. The effort involves challenges related to parallel databases, Big Data platforms, stream data management, and publish/subscribe systems. It will require massively scaling out solutions to individual problems as well as integrating the results into a coherent overall software architecture.

On the "data in" side, the technical challenges that we are addressing include resource management in very large scale LSM-based storage systems and the provision of a highly available and elastic facility for fast data ingestion. On the "inside", the challenges include the parallel evaluation of a large number of declarative data subscriptions over (multiple) highly partitioned data sets. Amplifying this challenge is a need to efficiently support spatial, temporal, and similarity predicates in data subscriptions. Big Data also makes result ranking and diversification techniques critical in order for large result sets to be manageable. On the "data out" side, the technical challenges include the reliable and timely dissemination of data of interest to a sometimesconnected subscriber base of unprecedented scale. The software basis for this work is AsterixDB, an open-source Big Data Management System (BDMS) that supports the scalable storage, searching, and analysis of mass quantities of semi-structured data. By leveraging AsterixDB, this project aims at the next step in big data management: creating an "active BDMS" and an open source prototype that will be made available to the Big Data research community.