ASTERIX: A Highly Scalable Parallel Platform for Semi-structured Data Management and Analysis
Nov. 6, 2012: On Election Day, Chen gave an invited talk about Election and ASTERIX at the ACM GIS BigSpatial workshop in Redondo Beach, CA.
(04/2012) Our paper "ASTERIX: Scalable Warehouse-Style Web Data Integration" was accepted to IIWeb'12.
(04/2012) Mike's keynote talk in EDBT'12: Inside "Big Data Management": Ogres, Onions, or Parfaits?
(04/2012) Raman Grover is one of the winners of the 2012 Yahoo! Key Scientific Challenges Program, sponsored by Yahoo! Inc.
(03/2012) Follow our Twitter account, @ASTERIXUCI!
Overview
The ASTERIX project is developing new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to vast quantities of semi-structured information. The project is combining ideas from three distinct areas – semi-structured data, parallel databases, and data-intensive computing – to create a next-generation, open source software platform that scales by running on large, shared-nothing commodity computing clusters. ASTERIX targets a wide range of semi-structured information, ranging from “data” use cases – where information is well-tagged and highly regular – to “content” use cases – where data is irregular and much of each datum is textual. ASTERIX is taking an open stance on data formats and addressing research issues including highly scalable data storage and indexing, semi-structured query processing on very large clusters, and merging parallel database techniques with today’s data-intensive computing techniques to support performant yet declarative solutions to the problem of analyzing semi-structured information.
ASTERIX Eco-System
Hyracks Data-Parallel Platform
Hyracks is a generalized alternative to infrastructures such as
MapReduce (Hadoop) and Dryad for solving data-parallel problems. It offers
expressiveness beyond MapReduce while still providing out-of-the-box
support for commonly occurring communication patterns and operators for
data-oriented tasks. If you have use cases that involve massive amounts of
data and thus require parallelism, be sure to check out the Hyracks project.
Hyracks-to-Hadoop Compatibility Layer
Given that many data analysts are adopting the Hadoop platform, we
believe that ASTERIX must provide an easy migration path for existing Hadoop
projects in order to attract new users and support clusters running a mix of
old and new use cases. In that spirit, we have built a Hadoop compatibility
layer on top of Hyracks so that existing Hadoop programs can be executed
using Hyracks. If you are a Hadoop user, please check out this aspect of the
Hyracks project if you would like to speed up your job execution in a low-cost
and seamless fashion.
ASTERIX Query Processing Engine
The growing popularity of Hive and Pig for parallel data analysis shows the
importance of high-level data languages: they can greatly reduce
development time and make data analysts' lives much easier. We are
developing the ASTERIX query processor on top of the Hyracks runtime. This
includes the AQL (ASTERIX Query Language) compiler, algebra, and optimizer.
AQL queries are compiled to cost-efficient Hyracks jobs. If you want to
analyze large scale semi-structured data in parallel, plan to try AQL when
it becomes available.
HiveQL Relational Query Processor Plug-in
Given the data-model-agnostic ASTERIX algebra layer, we are able to easily
layer a relational query processor such as Hive on top of the Hyracks runtime.
In this project, Hive runtime plans are translated to ASTERIX algebra plans,
but all functions, expression evaluators, metadata, intermediate data
formats, and input/output formats in Hive are reused. If you are a Hive
user, please check out this project as a way to get better performance
without any changes in your HiveQL queries.
Event Warehouse
This brand-new project aims to build an event warehouse that combines
traditional information, such as map data, business listings, scheduled
events, population data, and traffic data, with additional dynamic
information such as online news stories, blogs, geo-coded or geo-tagged
tweets, status updates, wall posts, and geo-coded or geo-tagged photos. This
project is being developed by the UCI multimedia research group, and uses
ASTERIX with Hyracks as the runtime execution engine.
Acknowledgement: This project is supported by an eBay matching grant, one Facebook Fellowship Award, NSF Awards No. IIS-0910989, IIS-0910859, and IIS-0910820, a UC Discovery grant, three Yahoo! Key Scientific Challenge Awards, and generous industrial gifts from Google, HTC, Microsoft, and Oracle Labs.
For any questions regarding this project, please send email to asterix AT ics.uci.edu.