Parallel Motif Extraction from Very Long Sequences

Majed Sahli; Essam Mansour; Panos Kalnis; ACM

doi:10.1145/2505515.2505575

Conference proceeding

PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), pp.549-558

01/01/2013

DOI: https://doi.org/10.1145/2505515.2505575

Subjects: Computer Science; Computer Science, Artificial Intelligence; Computer Science, Information Systems; Science & Technology; Technology

Abstract

Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to much existing work, which focuses on collections of many short sequences, modern applications require mining motifs in a single very long sequence (on the order of several gigabytes). For this case, there are statistical approaches, which are fast but inaccurate, and combinatorial methods, which are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, achieving almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a way that allows scaling to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation, handles sequences three orders of magnitude longer, and scales up to 16,384 cores on a supercomputer.
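To make the problem statement concrete, here is a minimal brute-force sketch of exact-length motif extraction from a single sequence. It is not ACME's algorithm: it ignores the cache-aware layout and the parallel execution described in the abstract. It only illustrates what a combinatorial (sound and complete) search enumerates, and how the candidate space can be split by prefix into independent sub-problems. All function and parameter names (extract_motifs, min_freq, max_dist, partition_by_prefix) are hypothetical, and the use of Hamming distance to define approximate occurrences is an assumption made for this illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_occurrences(seq, candidate, max_dist):
    """Count windows of seq within max_dist mismatches of candidate."""
    k = len(candidate)
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if hamming(seq[i:i + k], candidate) <= max_dist
    )


def extract_motifs(seq, alphabet, length, min_freq, max_dist, prefixes=("",)):
    """Enumerate every candidate of the given length that starts with one of
    the prefixes, and keep those occurring at least min_freq times.
    Brute force: the cost grows as len(alphabet) ** length."""
    motifs = {}
    for prefix in prefixes:
        for tail in product(alphabet, repeat=length - len(prefix)):
            candidate = prefix + "".join(tail)
            freq = count_occurrences(seq, candidate, max_dist)
            if freq >= min_freq:
                motifs[candidate] = freq
    return motifs


def partition_by_prefix(alphabet, prefix_len):
    """Split the candidate space into disjoint sub-spaces, one per prefix.
    Each sub-space can be searched independently (assumed decomposition)."""
    return ["".join(p) for p in product(alphabet, repeat=prefix_len)]


if __name__ == "__main__":
    seq = "ACGTACGTACGA"
    print(extract_motifs(seq, "ACGT", length=4, min_freq=2, max_dist=0))
    # Search each prefix sub-space separately, then merge the results.
    merged = {}
    for p in partition_by_prefix("ACGT", 1):
        merged.update(extract_motifs(seq, "ACGT", 4, 2, 1, prefixes=[p]))
    print(merged)
```

Because the prefix sub-spaces are disjoint, merging their results gives the same answer as a single search over the whole space. That independence is the property a parallel method needs, although how ACME actually sizes and schedules its sub-problems is not stated in this record.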

Details

Title: Parallel Motif Extraction from Very Long Sequences
Creators - without role: Majed Sahli - King Abdullah University of Science and Technology; Essam Mansour - Qatar Cardiovascular Research Center; Panos Kalnis - King Abdullah University of Science and Technology; ACM
Publication Details: PROCEEDINGS OF THE 22ND ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT (CIKM'13), pp.549-558
Publisher: Association for Computing Machinery
Number of pages: 10
Identifiers: 9945974508331
Academic Unit: King Abdullah University of Science & Technology
Language: English
Resource Type: Conference proceeding