Abstract
The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require realtime processing of these streams. For example, location-based adtargeting systems enable advertisers to register millions of ads to millions of users based on the users' location and textual proile. Existing streaming systems are either centralized or are not spatial-keyword aware, and hence these systems cannot eiciently support the processing of rapidly arriving spatial-keyword data streams. In this paper, we introduce a two-layered indexing scheme for the distributed processing of spatial-keyword data streams. We realize this indexing scheme in Tornado, a distributed spatial-keyword streaming system. The irst layer, termed the routing layer, is used to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing layer uses the Augmented-Grid, a novel structure that is equipped with an eicient search algorithm for distributing the data objects and queries. The second layer, termed the evaluation layer, resides within each processing unit to reduce the processing overhead. The two-layered index adapts to changes in the workload by applying a cost formula that continuously represents the processing overhead at each processing unit. Extensive experimental evaluation using real Twitter data indicates that Tornado achieves high scalability and more than 2x improvement over the baseline approach in terms of the overall system throughput.