Open Access Open Access  Restricted Access Subscription Access

Constraint Based Periodic Pattern Mining in Multiple Longest Common Subsequences


Affiliations
1 SACS MAVMM Engineering College, Kidaripatti (PO), Algarkoil via, Madurai, India
2 KGiSL Institute of Technology, Saravanampatti, Coimbatore, India
 

The periodicity search in longest common subsequences in multiple strings has a number of application, is an interesting data mining problem. Periodicity is very common practice in longest common subsequence mining algorithm. This work introduces a new parallel algorithm for finding periodicity in multiple strings. Few existing algorithms lacks in poor scalability, lacks in finding all longest pattern, and for finding symbol, partial and full periodicity. We designed the algorithm using FP-tree for finding periodicity for most common longest substring in multiple sources. We introduce a parallel algorithm for Constraint Based Periodic Pattern Mining (CBPPM) algorithm, which takes O(kN) for finding periodicity and O((N×L×h)/p) time for MLCS pattern. We tested parallel algorithm on a coarse-grained multi-computer (BSP/CGM) model with p<√m processors that takes O(N×L√p) space per processor, with O(log p) communication rounds. We derive a practical implementation that works better for arbitrary length of input sequence. The algorithm is noise resilient, and shown its performance in presence of replacement, insertion, deletion, or mixture of these types of noise. We experimented with synthetic and real data reveals a near linear speedup with scalable performance. The comparative study shows algorithm's applicability and effectiveness, generally more noise resilient.

Keywords

Frequent Pattern (FP) Tree, Multiple Longest Common Subsequence (MLCS), Periodicity Mining, Noise Resilient, Parallel Processing, Data Mining
User

  • Apostolico A, Browne S et al. (1992). Fast linear-space computations of longest common subsequences, Theoretical Computer Science, vol 92(1), No. 1, 3–17.
  • Weigend A, and Gershenfeld N (1994). Time series prediction: forecasting the future and understanding the past, Addison-Wesley.
  • Ptitsyn A A, Zvonic S et al. (2007). Permutation test for periodicity in short time series data, BMC Bioinformatics (BMCBI), vol 8, 395.
  • Berberidis C, Aref W et al. (2002). Multiple and partial periodicity mining in time series databases, Proceedings of European Conference Artificial Intelligence.
  • Rick C (1994). New algorithms for the longest common subsequence problem, Technical Report No. 85123-CS, Computer Science Department, University of Bonn.
  • Sheng C, Hsu W et al. (2005). Mining dense periodic patterns in time series data, Proceedings of 22nd IEEE International Conference on Data Engineering, 115.
  • Maier D (1978). The complexity of some problems on subsequences and supersequences, Journal of the ACM, vol 25(2), 322–336.
  • Glynn E F, Chen J et al. (2006). Detecting periodic patterns in unevenly spaced gene expression time series using lomb-scargle periodograms, Bioinformatics, vol 22(3), 310–316.
  • Rasheed F, and Alhajj R (2010). STNR: a suffix tree based noise resilient algorithm for periodicity detection in time series databases, Applied Intelligence, vol 32(3), 267–278.
  • Rasheed F, Al-Shalalfa M et al. (2011). Efficient periodicity mining in time series databases using suffix trees, IEEE Transactions on Knowledge and Data Engineering (TKDE), vol 23, No.1, 79–94.
  • Chin F Y, and Poon C K (1990). A fast algorithm for computing longest common subsequences of small alphabet size, Journal of Information Processing, vol 13, No.4, 463–469.
  • Karthik G M, and Pujeri R V (2008). Constraint based frequent pattern mining technique for solving GCS problem, proceedings of World Academy of Science, Engineering and Technology, vol 32, 672–679.
  • Karthik G M, and Pujeri R V (2012). Constraint based periodicity mining in time series databases, International Journal of Computer Network and Information Security, vol 10, 37–46.
  • Han J, Lakshmanan L V S et al. (1999). Constraint-based multidimensional data mining, IEEE Computer, vol 32, No. 8, 46–50.
  • Han J, Gong W et al. (1998). Mining segment-wise periodic patterns in time related databases, Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, 214–218.
  • Han J, Yin Y et al. (1999). Efficient mining of partial periodic patterns in time series database, Proceedings of 15th IEEE International Conference in Data Engineering, 106.
  • Yang J, Wang W et al. (2002). InfoMiner: mining partial periodic patterns with gap penalties, Proceedings of Second IEEE International Conference on Data Mining.
  • Hakata K, and Imai H (1998). Algorithms for the longest common subsequence problem for multiple strings based on geometric maxima, Optimization Methods and Software, vol 10(2), 233–260.
  • Huang K Y, and Chang C H (2005). SMCA: a general model for mining asynchronous periodic patterns in temporal databases, IEEE Transaction on Knowledge and Data Engineering, vol 17, No. 6, 774–785.
  • Dubiner M et al. (1994). Faster tree pattern matching, Journal of ACM, vol 14(2), 205–213.
  • Lu M, and Lin H (1994). Parallel algorithms for the longest common subsequence problem, IEEE Trans. Parallel and Distributed System, vol 5, No. 8, 835–848.
  • Larkin M A, Blackshields G et al. (2007). Clustal W and Clustal X Version 2.0, Bioinformatics, vol 23, 2947–2948.
  • Elfeky M G, Aref W G et al. (2005). Periodicity detection in time series databases, IEEE Transactions on Knowledge and Data Engineering, vol 17, No. 7, 875–887.
  • Elfeky M. G, Aref W G et al. (2005). WARP: Time WARPing for periodicity detection, Proceedings of Fifth IEEE International Conference on Data Mining, 138–145.
  • Wang Q et al. (2011). A fast multiple long common subsequence (MLCS), IEEE Transactions on Knowledge and Data Engineering, vol 23, No. 3, 321–334.
  • Ng R et al. (1998). Exploratory mining and pruning optimizations of constrained associations rules, Proceedings of 1998 ACM SIGMOD International Conference on Management of Data, ACM Press, New York, 13–24.
  • Finn R D, Mistry J et al. (2006). Pfam: clans, web tools and services, Nucleic Acids Research, vol 34, D247–D251.
  • Finn R D, Tate J et al. (2008). The Pfam protein families database, Nucleic Acids Research, vol 36, D281–D288.
  • Urrutia R (2003). KRAB-containing zinc finger repressor proteins, Genome Biology, vol 4, No. 10, 231.
  • Papadimitriou S, Brockwell A et al. (2003). Adaptive, hands off-stream mining, Proceedings of 29th Very Large Data Bases (VLDB) conference, 560–571.
  • Lee S D, and De Raedt L (2004). Constraint based mining of first order sequences in SeqLog, Database Support for Data Mining Applications, 154–173.
  • Chen Y, Wan A et al. (2006). A fast parallel algorithm for finding the longest common sequence of multiple biosequences, BMC Bioinformatics, vol 7, S4.

Abstract Views: 581

PDF Views: 0




  • Constraint Based Periodic Pattern Mining in Multiple Longest Common Subsequences

Abstract Views: 581  |  PDF Views: 0

Authors

G. M. Karthik
SACS MAVMM Engineering College, Kidaripatti (PO), Algarkoil via, Madurai, India
Ramachandra V. Pujeri
KGiSL Institute of Technology, Saravanampatti, Coimbatore, India

Abstract


The periodicity search in longest common subsequences in multiple strings has a number of application, is an interesting data mining problem. Periodicity is very common practice in longest common subsequence mining algorithm. This work introduces a new parallel algorithm for finding periodicity in multiple strings. Few existing algorithms lacks in poor scalability, lacks in finding all longest pattern, and for finding symbol, partial and full periodicity. We designed the algorithm using FP-tree for finding periodicity for most common longest substring in multiple sources. We introduce a parallel algorithm for Constraint Based Periodic Pattern Mining (CBPPM) algorithm, which takes O(kN) for finding periodicity and O((N×L×h)/p) time for MLCS pattern. We tested parallel algorithm on a coarse-grained multi-computer (BSP/CGM) model with p<√m processors that takes O(N×L√p) space per processor, with O(log p) communication rounds. We derive a practical implementation that works better for arbitrary length of input sequence. The algorithm is noise resilient, and shown its performance in presence of replacement, insertion, deletion, or mixture of these types of noise. We experimented with synthetic and real data reveals a near linear speedup with scalable performance. The comparative study shows algorithm's applicability and effectiveness, generally more noise resilient.

Keywords


Frequent Pattern (FP) Tree, Multiple Longest Common Subsequence (MLCS), Periodicity Mining, Noise Resilient, Parallel Processing, Data Mining

References





DOI: https://doi.org/10.17485/ijst%2F2013%2Fv6i8%2F36343