Article

An Efficient Bit-Based Approach for Mining Skyline Periodic Itemset Patterns

1 College of Software, Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 College of Computer Science and Technology, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(23), 4874; https://doi.org/10.3390/electronics12234874
Submission received: 23 October 2023 / Revised: 29 November 2023 / Accepted: 1 December 2023 / Published: 3 December 2023

Abstract
Periodic itemset patterns (PIPs) are widely used in predicting the occurrence of periodic events. However, extensive redundancy arises due to the large number of such patterns. Mining skyline periodic itemset patterns (SPIPs) can reduce the number of PIPs while guaranteeing the accuracy of prediction. The existing SPIP mining algorithm uses FP-Growth to generate frequent patterns (FPs) and then identifies SPIPs from those FPs. These separate steps incur substantial time consumption, so we propose an efficient bit-based approach named BitSPIM to mine SPIPs. The proposed method introduces efficient bitwise representations and makes full use of the data obtained in previous steps to accelerate the identification of SPIPs. A novel cutting mechanism is applied to eliminate unnecessary steps. A series of comparative experiments was conducted on various datasets with different attributes to verify the efficiency of BitSPIM. The experimental results demonstrate that our algorithm significantly outperforms the latest SPIP mining approach.

1. Introduction

Data mining plays a significant role in data analysis and knowledge extraction [1]; it has become an efficient tool for pattern discovery due to its applicability in a variety of circumstances, such as association rule mining (ARM) [2], clustering analysis [3], and classification [4]. Mining frequent patterns (FPs) [2] is fundamental in ARM. The join-based algorithm Apriori [5] is a classical breadth-first iterative algorithm for mining FPs. Many algorithms have been developed to accelerate the mining of FPs. Han et al. proposed a depth-first algorithm called FP-Growth [6,7], based on the FP-tree; it uses a prefix tree structure without generating candidates and scans the dataset only twice. BitTableFI [8], proposed by Dong et al., employs an efficient bit structure to compress the dataset.
After the proposal of ARM, many new types of patterns have emerged, including high-utility patterns [9], periodic itemset patterns (PIPs) [10], subgraph patterns [11], and sequential patterns [12]. Among them, PIPs are one of the most well-studied types. For instance, the ability of online or offline retailers to recommend suitable products to their customers is critical, because the right recommendation may satisfy a customer, while a completely wrong one may drive the customer away. Customers tend to buy a new product when the old one reaches its expected lifespan or is consumed; therefore, it is safe to assume that there is a relationship between the lifespan or consumption cycle of a product and the number and cycle of its purchases. By tapping into the purchase frequency and period of a product in customers' shopping records, retailers can not only improve the shopping experience of customers but also better understand their buying habits, raise recommendation hit rates, promote similar products, increase user stickiness, and so on. Accordingly, when the criteria of frequency and period are considered together, retailers can devise well-advised marketing strategies. Therefore, it is highly worthwhile for the decision-making departments of retailers to exploit the periodic itemset patterns hidden in shopping records.
PIPs can be used to predict the occurrence of periodic events [13], deal with the seasonality information of products [14], and serve in recommendation systems [15]. PIPs consider both the frequency and periodicity of an itemset and are regarded as an expanded derivative of FPs. There are various periodicity measures for PIPs [16], which lead to different definitions, including the maximum period [17], variance of periods [18], and so on. In 2021, Chen et al. adopted a measure based on the coefficient of variation to define PIPs [19]. In their work, an itemset is a PIP if its coefficient of variation is less than or equal to the threshold of coefficient of variation, indicating that the fluctuation of the period of the itemset is below the average level. They proposed a probability model for predicting periodic patterns. The frequency and periodicity influence the prediction accuracy of the probability model. For an itemset, a higher frequency indicates a wider range of sample sizes of the periods, and a lower coefficient of variation means less fluctuation. The model is limited due to the redundancies originating from predicting items that are contained in different PIPs multiple times. The redundancies are proportional to the number of PIPs.
In 2023, Chen et al. proposed a special sort of PIP, named the Skyline Periodic Itemset Pattern (SPIP) [20], aimed at making accurate pattern predictions. In SPIPs, PIPs with either higher frequency or lower coefficient of variation, or both, are preferred. They provided the definition of SPIP and proposed an effective algorithm named SPIM for mining SPIPs. Patterns that are not dominated by any other patterns in two dimensions constitute the skyline of a 2-dimensional dataset [21]. A PIP is an SPIP if there are no other PIPs with both higher frequency and lower coefficient of variation. By mining SPIPs, we can significantly reduce the number of patterns while ensuring the accuracy of predictions. The aim of mining SPIPs is to avoid a vast number of PIPs and relieve users from an overload of patterns.
SPIM is divided into two steps: the first mines all FPs using FP-Growth, and the second identifies SPIPs from the FPs obtained in the first. Using FP-Growth to mine FPs makes SPIM consist of two fully independent stages. Additionally, the occurrence sets of an itemset are generated in the second step, even though the itemset has already been identified as an FP. Confined by these two complicated stages, SPIM consumes massive computational resources. The running time of SPIM is necessarily longer than that of FP-Growth, as FP-Growth essentially serves as part of SPIM. In terms of memory usage, constructing FP-trees in FP-Growth consumes significant memory resources. These disadvantages of SPIM motivate the development of a more efficient SPIP mining approach.
Instead of using separate steps, we found that the identification of SPIP can proceed as soon as an itemset is recognized as an FP. Additionally, efficient bitwise representations can accelerate set operations. We present a novel approach called Bitwise Skyline Periodic Itemset Pattern Mining (BitSPIM) for mining SPIPs. This method utilizes bitwise representations in an Apriori-like algorithm named BitTableFI [8] to deal with FPs while incorporating a novel cutting mechanism. Once an itemset is recognized as an FP, the bitset for its occurrence set is directly used to derive its period list and coefficient of variation, which are then used to determine whether the itemset is an SPIP. Simulated experiments were conducted on ten transaction datasets with divergent characteristics to compare the performance of BitSPIM and SPIM. The experimental results demonstrate the effectiveness of the proposed method in terms of running time and memory usage. We believe that BitSPIM could be an influential alternative in mining SPIPs.

2. Related Works

In this section, we review related works and techniques concerning mining SPIPs. SPIP is a special type of PIP. In the field of PIP mining, different periodicity measures can lead to various types of PIPs. Maximum period [17] can be used as the periodicity measure for PIPs, and such PIPs are mined by periodic frequent pattern growth, which utilizes a tree structure. Fournier-Viger et al. provided various kinds of periodic measures. Three measures named minimum periodicity, maximum periodicity, and average periodicity are proposed in [22], and an algorithm named Periodic Frequent Pattern Miner mines PIPs with the aid of the monotonicity of these three types of periodicity. Additionally, they introduced the definitions of periodic standard deviation and sequence periodic ratio [23] to mine PIPs common to multiple sequences. A regularity measure for PIPs is defined using the variance of periods [18]. Based on the standard deviation, the coefficient of variation is adopted to measure PIPs in the works of Chen et al. [19]. They then inherited the coefficient of variation measure to define SPIP in [20].
Mining FPs is a fundamental procedure in mining SPIPs. Depth-first search and breadth-first search are two main methods for mining FPs, known as candidate generation and pattern growth, respectively [24]. Depth-first algorithms search for FPs in a bottom-up manner. Starting from itemsets containing a single item, larger FPs with more items are recursively generated by appending items according to the total order. Han et al. proposed a depth-first algorithm called FP-Growth [6,7], based on the FP-tree, to compress database transactions. This method consumes a significant amount of running time in creating multiple subtrees. Additionally, the performance of the algorithm is affected by the storage consumption from recording a substantial number of FP-tree nodes.
As for breadth-first search, Apriori [5] proposed by Agrawal et al., is a classical breadth-first FP mining algorithm. It is a fundamental iterative algorithm that uses a layer-by-layer search to find FPs, employing an iterative search pattern and a test-and-generate approach. Based on the Apriori algorithm, several algorithms have been developed to compress the database, allowing for the quick generation of candidate itemsets and the calculation of their support. T-Apriori [25] uses an overlap strategy when counting support to ensure high efficiency. BitTableFI [8], proposed by Dong et al., employs an efficient bit structure to compress the database.
Apart from approaches like BitTableFI for mining FPs, bitwise representations and operations are exploited in various works in mining metadata. Index-BitTableFI [26] is an improved version of BitTableFI, which utilizes heuristic information provided by an index array. SPAM [27], aimed at mining sequential patterns, employs a bitmap representation of the database. In IndiBits [28], proposed by Breve et al., the binary representation of data similarities is used, and bitwise operations are employed to update the Binary Attribute Satisfiability (BAS) Distance Matrix. For mining frequent closed itemsets, algorithms for efficiently calculating the intersection between two dynamic bit vectors [29] are proposed. CloFS-DBV [30] also utilizes dynamic bit vectors to mine frequent closed itemsets. The computation of support is based on dynamic bit vectors when generating new patterns. These bit vectors can also be used in mining web access patterns [31]. Trang et al. proposed two algorithms named MWAPC and EMWAPC, which are based on the prefix-web access pattern tree (PreWAP) structure for mining web access patterns with a super-pattern constraint. In DPMmine [32], vector column intersection bitwise operations are used to aid the algorithm in mining colossal pattern sequences.

3. Background and Preliminaries

Let I = {i_1, i_2, …, i_m} denote a finite set of items; |I| is the number of items in I. The items are discrete real numbers or symbols. As shown in Figure 1, there are mapping relations that map these discrete numbers and symbols into a group of consecutive items. In this paper, we assume that such mapping relations map the real numbers or symbols into a series of consecutive integers starting from 1. The relevant definitions for mining SPIPs are presented as follows:
Definition 1.
A transaction T_k is a set of items in I, i.e., T_k ⊆ I. T_k holds a unique index k called the transaction identifier.
A transaction dataset DB = {T_1, T_2, …, T_n} comprises n transactions; |DB| is the number of transactions in DB. Table 1 shows an example transaction dataset DB_1 containing five transactions denoted by T_1 to T_5, where I = {1, 2, 3, 4, 5} and |DB_1| = 5. Example 1 shows the relationship between T_k and I, where 1 ≤ k ≤ 5. A transaction represents the shopping list of products purchased by a customer from the retailer, and I can be used to represent the whole set of products available. The transaction dataset can be extracted from the retailer's database, which serves as the shopping record over a time interval.
Example 1.
For the set of items T_1 = {1, 2, 3, 5} in Table 1, since T_1 ⊆ I, T_1 is a transaction. Another set of items, {1, 2, 6, 8}, is not a subset of I, so it is not a transaction.
Definition 2.
An itemset, X, is a non-empty set, and X ⊆ I. An itemset, X, containing n items is called an n-itemset. n is the size of the itemset. Specifically, { i } is a 1-itemset that contains a single item i.
Example 2.
X 1 = {3} and X 2 = {2, 3} are two itemsets with sizes of 1 and 2. Thus, the two itemsets are also called a 1-itemset and a 2-itemset, respectively.
Definition 3.
The occurrence set O_X for an itemset X is the set of transaction identifiers O_X = {k | X ⊆ T_k, T_k ∈ DB}.
Example 3.
In Table 1, T 1 , T 3 , and T 5 incorporate X = {2, 3}, so O X = {1, 3, 5}.
Definition 4.
The frequency Freq_X of an itemset X is the ratio of the size of O_X to the number of transactions in the dataset: Freq_X = |O_X| / |DB|. Given a frequency threshold θ, an itemset X is a frequent pattern if Freq_X ≥ θ.
Example 4.
For DB_1 in Table 1 with a frequency threshold θ = 0.7, consider X_1 = {2, 3} and X_2 = {3, 4}: O_{X_1} = {1, 3, 5} and O_{X_2} = {2, 3, 4, 5}. Thus, Freq_{X_1} = 0.6 and Freq_{X_2} = 0.8. Since Freq_{X_2} = 0.8 > 0.7, X_2 is a frequent pattern. Similarly, X_1 is not a frequent pattern, since Freq_{X_1} = 0.6 < 0.7.
Definition 5.
The period list Per_X for an itemset X is the multiset of periods of X: Per_X = {w_{p+1} − w_p | p ∈ {1, …, |O_X| − 1}, w_p ∈ O_X}, where w_p denotes the pth smallest identifier in O_X.
Definition 6.
The coefficient of variation C_X of an itemset X is the ratio of the standard deviation of Per_X to its mean: C_X = std(Per_X) / mean(Per_X), where std(·) and mean(·) denote the standard deviation and the mean, respectively.
Example 5.
For X_1 = {1, 2} and X_2 = {3, 6} in DB_2, as shown in Table 2, O_{X_1} = {1, 4, 6, 8, 11}. Thus, by Definition 5, Per_{X_1} = {3, 2, 2, 3}. The standard deviation and mean of Per_{X_1} are 0.5 and 2.5, respectively, so C_{X_1} = std(Per_{X_1}) / mean(Per_{X_1}) = 0.2. Similarly, O_{X_2} = {3, 4, 5, 12}, Per_{X_2} = {1, 1, 7}, and C_{X_2} = 0.943.
The coefficient of variation is a suitable metric for measuring the periodicity of patterns [19]. It reflects the fluctuation in the appearance of patterns in the transaction dataset. Patterns with a lower coefficient of variation exhibit better periodicity, while a higher coefficient of variation indicates irregularity in occurrence. We follow the approach of Chen et al. in introducing the coefficient of variation as a measure of periodicity [19].
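Definitions 5 and 6 translate directly into code. The following is a minimal Python sketch (the function names are ours, not the paper's); note that the population standard deviation is what reproduces the values of Example 5:

```python
from statistics import mean, pstdev

def period_list(occ):
    """Per_X (Definition 5): differences between consecutive identifiers."""
    occ = sorted(occ)
    return [b - a for a, b in zip(occ, occ[1:])]

def coeff_of_variation(occ):
    """C_X (Definition 6): population standard deviation over mean of Per_X."""
    periods = period_list(occ)
    return pstdev(periods) / mean(periods)
```

For O_{X_1} = {1, 4, 6, 8, 11} this gives Per_{X_1} = [3, 2, 2, 3] and C_{X_1} = 0.2, and for O_{X_2} = {3, 4, 5, 12} it gives C_{X_2} ≈ 0.943, matching Example 5.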
Definition 7.
For a transaction dataset, a frequency threshold θ, and a coefficient of variation threshold δ, an itemset X is a periodic itemset pattern if X is a frequent pattern and C_X ≤ δ. The set of PIPs is denoted by PIP:
PIP = {X | X ∈ FP, C_X ≤ δ}.
Example 6.
Let X_1 = {1, 2} and X_2 = {3, 6} in DB_2, as shown in Table 2, with a frequency threshold θ = 0.2 and a coefficient of variation threshold δ = 0.5. Both X_1 and X_2 are FPs, as their frequencies exceed 0.2. As C_{X_1} = 0.2 < 0.5, by Definition 7, X_1 is a PIP. Similarly, X_2 is not a PIP, since C_{X_2} = 0.943 > 0.5.
Definition 8.
For two itemsets X and Y in a transaction dataset, X is dominated by Y if Freq_X < Freq_Y and C_X ≥ C_Y, or if Freq_X ≤ Freq_Y and C_X > C_Y. 'X is dominated by Y' is equivalent to 'Y dominates X'.
Example 7.
For X_1 = {1, 2}, X_2 = {3, 6}, and X_3 = {3} in DB_2, as shown in Table 2, the frequencies and coefficients of variation of X_1, X_2, and X_3 are listed in Table 3. By Definition 8, neither X_2 nor X_3 dominates X_1, as C_{X_1} < C_{X_2} and C_{X_1} < C_{X_3}. Neither X_1 nor X_2 dominates X_3, as Freq_{X_3} > Freq_{X_1} and Freq_{X_3} > Freq_{X_2}. As Freq_{X_2} < Freq_{X_1} and C_{X_2} > C_{X_1}, X_2 is dominated by X_1; similarly, it is dominated by X_3.
Definition 9.
For a transaction dataset DB, a frequency threshold θ, and a coefficient of variation threshold δ, an itemset X is an SPIP if X is a periodic itemset pattern and X is not dominated by any other itemset in DB. The set of SPIPs is denoted by SPIP:
SPIP = {X | X ∈ PIP, ∄ Y s.t. Y dominates X}.
By Definitions 8 and 9, the aim of mining SPIPs is to explore the patterns that are more frequent or have better periodicity or both.
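Definitions 8 and 9 can be expressed as a straightforward pairwise check. The sketch below is ours (SPIM and BitSPIM instead maintain the skyline incrementally); a pattern is represented as a (frequency, coefficient of variation) pair:

```python
def dominates(y, x):
    """Definition 8: Y dominates X if Y is no worse in both criteria
    (frequency higher or equal, coefficient of variation lower or equal)
    and strictly better in at least one of them."""
    freq_y, cv_y = y
    freq_x, cv_x = x
    return ((freq_y > freq_x and cv_y <= cv_x) or
            (freq_y >= freq_x and cv_y < cv_x))

def skyline(pips):
    """Definition 9: keep the PIPs not dominated by any other pattern."""
    return [x for x in pips if not any(dominates(y, x) for y in pips if y != x)]
```

With the values of Example 7 (taking |DB_2| = 12, so X_1 ≈ (0.417, 0.2), X_2 ≈ (0.333, 0.943), X_3 = (0.667, 0.351)), skyline keeps X_1 and X_3 and discards the dominated X_2.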

4. BitSPIM: The Proposed Method

4.1. The Preliminaries of Bitwise Representation

In our approach, bitsets and efficient bitwise representations are introduced to deal with set operations.
Definition 10.
The bitset for a set X is denoted by BS_X, and BS_X[i] is the ith bit of BS_X. If an item i ∈ X, then BS_X[i] is assigned 1; otherwise, it is assigned 0:
BS_X[i] = 1 if i ∈ X, and BS_X[i] = 0 if i ∉ X.
The Set operation and Clear operation are used to assign 1 and 0 to the bits in the bitset, respectively.
|X| and |BS_X| are the sizes of X and BS_X, respectively, where |BS_X| equals the number of bits assigned 1 in BS_X; obviously, |BS_X| = |X|. By Definition 10, a mapping relation between a set and its bitwise representation is established. This relation enables the efficient use of bitwise operations when handling sets; for example, the intersection and union operations between sets are equivalent to performing "&" and "|" on their bitsets, respectively.
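The mapping of Definition 10 can be realized with machine integers. In the sketch below (our encoding, not prescribed by the paper), item i of an m-item universe occupies bit m − i, so that item 1 is the most significant bit and the integer value of a bitset matches the binary reading used in the examples:

```python
M = 5  # |I| for DB1

def to_bitset(itemset, m=M):
    """BS_X as an integer; item i is Set at bit position (m - i)."""
    bs = 0
    for i in itemset:
        bs |= 1 << (m - i)
    return bs

def size(bs):
    """|BS_X|: the number of bits assigned 1."""
    return bin(bs).count("1")

# Set intersection and union reduce to "&" and "|" on the bitsets:
inter = to_bitset({2, 3, 5}) & to_bitset({3, 4, 5})  # corresponds to {3, 5}
```

Here to_bitset({2, 3, 5}) is 0b01101 and to_bitset({3, 4, 5}) is 0b00111, so comparing bitset values, as in Example 8, is an ordinary integer comparison.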
Definition 11.
The value of a bitset BS_X, denoted by V_{BS_X}, is the value of BS_X read as a binary number.
As shown in Example 8, bitsets can be regarded as binary numbers; thus, the values of bitsets can be compared directly.
Example 8.
For X_1 = {2, 3, 5} and X_2 = {3, 4, 5} in DB_1, BS_{X_1} and BS_{X_2} are 01101 and 00111, respectively. V_{BS_{X_1}} > V_{BS_{X_2}}, as 01101 > 00111.
Transactions are also sets of items: if an item i is in a transaction T_k, BS_{T_k}[i] is assigned 1, and the bitset for the transaction is thereby obtained. For a dataset DB, the bitwise representation of DB is derived by obtaining the bitsets for all of its transactions. The bitwise representation of DB_1 is shown in Table 4.
Definition 12.
The head of an itemset X, denoted by head_X, is the minimal item in X; it corresponds to the first 1 bit in BS_X. Accordingly, the tail of an itemset X, denoted by tail_X, is the maximal item in X; it corresponds to the last 1 bit in BS_X.
Example 9.
As shown in Figure 2, for the 3-itemsets X 1 = {2, 3, 4} and X 2 = {2, 3, 5} in D B 1 , h e a d X 1 = h e a d X 2 = 2, t a i l X 1 = 4 and t a i l X 2 = 5.
Definition 13.
Given a transaction dataset DB and its bitwise representation, let I be the set of items in DB. For an item i ∈ I, the column Col_i for i is the bitset for the occurrence set O_{i}, i.e., Col_i = BS_{O_{i}}.
By Definitions 3 and 10, with BS_{O_X}, the frequency of X can be calculated, as |BS_{O_X}| = |O_X|. For an itemset X, Algorithm 1 shows the procedure to obtain the bitset for O_X. Initially, BS_{O_X} equals Col_{head_X} (line 4). Then, BS_{O_X} is obtained by performing bitwise "&" operations with the columns of the other items in X (lines 5 to 9). The worst-case time complexity of Algorithm 1 is O(|I|^2 / 64), where |I| is the number of items in the dataset. Example 10 illustrates how BS_{O_X} is acquired for an itemset X in Table 1.
Algorithm 1 GetOccur
1: Input: BS_X: bitset
2: Output: BS_{O_X}: bitset
3: head_X ← the head of X
4: BS_{O_X} ← Col_{head_X}
5: for each i ∈ BS_X do
6:     if i > head_X then
7:         BS_{O_X} ← BS_{O_X} & Col_i
8:     end if
9: end for
10: return BS_{O_X}
Example 10.
By Table 4, for the itemset X = {2, 4} in Table 1, Col_2 = 10101 and Col_4 = 01111. As shown in Figure 3, by performing "&" on Col_2 and Col_4, BS_{O_X} = 00101.
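Algorithm 1 amounts to AND-ing together the columns of the items in X. A sketch with integer bitsets follows (our helper names; only Col_2 and Col_4 of DB_1 are given in Example 10, with T_1 as the leftmost of the five bits):

```python
def get_occur(itemset, cols):
    """Algorithm 1 (GetOccur): start from the column of the head (the minimal
    item) and AND in the columns of the remaining items."""
    items = sorted(itemset)
    bs_occ = cols[items[0]]        # BS_{O_X} <- Col_{head_X}
    for i in items[1:]:
        bs_occ &= cols[i]          # bitwise "&" with each further column
    return bs_occ

cols = {2: 0b10101, 4: 0b01111}    # Col_2 and Col_4 from Example 10
```

get_occur({2, 4}, cols) yields 0b00101, i.e., O_X = {3, 5}, as in Figure 3.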
Definition 14.
For an itemset X, the prefix of X, denoted by P_X, is the bitset equal to BS_X with the last 1 bit Cleared.
For two k-itemsets X and Y, if X and Y have the same prefix, they have k − 1 items in common and can be merged into a new (k + 1)-itemset Z. By Definition 12, the two k-itemsets X and Y and the new (k + 1)-itemset Z have an identical head, and the tail of Z is the larger of tail_X and tail_Y. Example 11 provides an illustration.
Example 11.
As shown in Figure 4, for the 3-itemsets X_1 = {2, 3, 4} and X_2 = {2, 3, 5} in DB_1, since P_{X_1} = P_{X_2} = 01100, merging X_1 and X_2 generates a new 4-itemset X_3 with BS_{X_3} = 01111 and head_{X_3} = head_{X_1} = head_{X_2} = 2. As tail_{X_2} > tail_{X_1}, tail_{X_3} = tail_{X_2} = 5.
In this paper, we specify that only itemsets with the same prefix can be merged.
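Under an encoding in which item 1 is the most significant bit, the tail of an itemset is the lowest set bit of its bitset, so the prefix of Definition 14 can be obtained with the classic lowest-bit-clearing trick bs & (bs − 1), and merging two same-prefix bitsets is a bitwise OR. A sketch (the encoding and names are ours):

```python
def prefix(bs):
    """P_X (Definition 14): BS_X with its last 1 bit (the tail) Cleared.
    Clearing the lowest set bit of an integer is bs & (bs - 1)."""
    return bs & (bs - 1)

def merge(bs_x, bs_y):
    """Merge two k-itemsets sharing a prefix into a (k + 1)-itemset."""
    assert prefix(bs_x) == prefix(bs_y), "only same-prefix itemsets merge"
    return bs_x | bs_y
```

For X_1 = {2, 3, 4} (0b01110) and X_2 = {2, 3, 5} (0b01101), both prefixes equal 0b01100 and merge gives 0b01111, i.e., {2, 3, 4, 5}, reproducing Example 11.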

4.2. Our Theories and Data Structure

Based on the aforementioned preliminary definitions and concepts, we introduce the key theory and the basic data structure underlying the proposed method. In BitSPIM, SPIPs are identified iteratively; we refer to the iteration that generates the SPIPs of size k as the kth iteration.
Definition 15.
Given a dataset DB, let X and Y be two itemsets in DB with |X| = |Y|. If V_{BS_X} > V_{BS_Y}, then BS_X ≻ BS_Y.
≻ reflects the relative position of bitsets: if BS_X ≻ BS_Y, then BS_Y is placed after BS_X. Obviously, ≻ is transitive: for three bitsets BS_X, BS_Y, and BS_Z, if BS_X ≻ BS_Y and BS_Y ≻ BS_Z, then BS_X ≻ BS_Z.
Corollary 1.
For two bitsets BS_X and BS_Y, if BS_X ≻ BS_Y, then V_{P_X} ≥ V_{P_Y}.
Proof. Let V_{tail_X} denote the value of the bitset with a single 1 at the tail of X, so that V_{P_X} = V_{BS_X} − V_{tail_X}, and similarly for Y. If |X| = |Y| = 1, both prefixes are empty and V_{P_X} = V_{P_Y} = 0. Otherwise, suppose for contradiction that V_{P_X} < V_{P_Y}, and let h be the highest bit position at which P_X and P_Y differ; this bit is set in P_Y but not in P_X, and all higher bits of the two prefixes coincide. Since P_X and P_Y contain equally many 1 bits, P_X must contain at least one bit below h, and the tail of X lies below every bit of P_X; hence every bit of BS_X outside the common high bits lies strictly below h. Consequently, V_{BS_X} ≤ V_common + 2^h − 1 < V_common + 2^h ≤ V_{BS_Y}, where V_common is the value contributed by the common high bits; this contradicts BS_X ≻ BS_Y. Therefore, V_{P_X} ≥ V_{P_Y}, and Corollary 1 is proved.    □
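Corollary 1 can also be checked mechanically: enumerate every pair of equal-size itemsets over a small universe and verify that a larger bitset value never comes with a smaller prefix value. A brute-force sketch (our helpers, using an item-i-at-bit-(m − i) encoding consistent with the examples):

```python
from itertools import combinations

def value(itemset, m):
    """V_BS_X: the bitset of the itemset read as a binary number."""
    return sum(1 << (m - i) for i in itemset)

def prefix_value(itemset, m):
    """V_P_X: the bitset value with the tail (lowest set bit) Cleared."""
    v = value(itemset, m)
    return v & (v - 1)

def corollary_1_holds(m):
    """Check: for equal-size itemsets, V_BS_X > V_BS_Y implies V_P_X >= V_P_Y."""
    items = range(1, m + 1)
    for k in range(1, m + 1):
        for x in combinations(items, k):
            for y in combinations(items, k):
                if value(x, m) > value(y, m) and prefix_value(x, m) < prefix_value(y, m):
                    return False
    return True
```

Running corollary_1_holds over a universe of six items finds no violating pair.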
Definition 16.
The ItemsetList L is an ordered list whose elements are unique bitsets of identical size. The ≻ relation holds between any earlier bitset and any later bitset in L.
PIPs and SPIPs are collected in the sets S_pip and S_slp, respectively. The notation and functions of the ItemsetLists and sets used in BitSPIM are shown in Table 5:
Theorem 1.
Suppose BS_X and BS_Y are two bitsets in L with BS_X ≻ BS_Y. If P_X ≠ P_Y, then there exists no BS_Z such that P_X = P_Z and BS_Y ≻ BS_Z.
Proof. Since BS_X ≻ BS_Y, V_{P_X} ≥ V_{P_Y} by Corollary 1. As P_X ≠ P_Y, there is
V_{P_X} > V_{P_Y}. (1)
Suppose there exists BS_Z such that P_X = P_Z and BS_Y ≻ BS_Z; then V_{P_Y} ≥ V_{P_Z} and V_{P_X} = V_{P_Z}, so there is
V_{P_Y} ≥ V_{P_X}. (2)
Obviously, (1) and (2) contradict each other. Consequently, Theorem 1 is proved.    □
Theorem 1 is the basic efficient cutting mechanism. An illustration of Theorem 1 is provided in Example 12.
Example 12.
By Table 1, consider the five itemsets X_1 = {1, 2, 3}, X_2 = {1, 2, 5}, X_3 = {1, 3, 4}, X_4 = {1, 3, 5}, and X_5 = {2, 4, 5} in DB_1; their bitsets and prefixes are shown in Figure 5. BS_{X_1} to BS_{X_5} are contained in L, and BS_{X_1} ≻ BS_{X_2} ≻ BS_{X_3} ≻ BS_{X_4} ≻ BS_{X_5}. According to Theorem 1, since P_{X_1} ≠ P_{X_3}, neither the prefix of X_4 nor that of X_5 can equal P_{X_1}. As depicted in Figure 5, the bitsets are distinguished by color: BS_{X_1} and the bitsets sharing its prefix are marked in blue; the first bitset whose prefix differs from that of BS_{X_1} is marked in green; and the bitsets that, according to Theorem 1, need not be processed are marked in gray.

4.3. Mining SPIPs Efficiently

In this section, a detailed illustration of BitSPIM is provided. We demonstrate the proposed method by mining SPIPs in the DB_2 dataset, as shown in Table 2, with a frequency threshold θ = 0.4. For simplicity, the coefficient of variation threshold δ is set to ∞, which implies that all FPs are also PIPs.

4.3.1. Identification of SPIPs with Bitset

We follow the key steps of SPIP identification described in [20], with several modifications. In SPIM, the identification of SPIPs does not proceed until all FPs have been obtained, at which point the occurrence set of each itemset has already been discovered.
Rather than acquiring all FPs before identifying SPIPs, BitSPIM executes the SPIP check immediately once an itemset X is recognized as an FP. The bitset for O_X, denoted by BS_{O_X}, which has already been obtained when calculating Freq_X, can be used directly. The steps for judging whether an FP is an SPIP are described in Algorithms 2 and 3. The function of Algorithm 2 is to remove from S_slp all itemsets dominated by an itemset X. Let |S_slp| denote the maximal number of itemsets in S_slp; the worst-case time complexity of Algorithm 2 is O(|S_slp|). Freq_max and C_min record the current maximal frequency and minimal coefficient of variation of the itemsets in S_slp, respectively. The steps of Algorithm 3 are as follows:
Algorithm 2 ClearNonSPIP
1: Input: S_slp: Set, X: Itemset
2: Output: S_slp: Set
3: for each Y ∈ S_slp do
4:     if Y is dominated by X then
5:         S_slp ← S_slp \ {Y}
6:     end if
7: end for
8: return S_slp
Algorithm 3 CheckSPIP
1: Input: BS_X: Bitset, BS_{O_X}: Bitset, δ: Double, S_pip: Set, S_slp: Set
2: Output: S_slp: Set, S_pip: Set
3: Per_X ← the period list of X
4: C_X ← the coefficient of variation of X
5: if C_X > δ then
6:     return S_slp, S_pip
7: end if
8: S_pip ← S_pip ∪ {X}
9: if Freq_max < Freq_X and C_min > C_X then
10:     Freq_max = Freq_X, C_min = C_X
11:     S_slp ← {X}
12: else
13:     if Freq_max < Freq_X then
14:         Freq_max = Freq_X
15:     else if C_min > C_X then
16:         C_min = C_X
17:     else if ∃ Y ∈ S_slp s.t. Y dominates X then
18:         return S_slp, S_pip
19:     end if
20:     S_slp ← call Algorithm 2 (S_slp, X)
21:     S_slp ← S_slp ∪ {X}
22: end if
23: return S_slp, S_pip
(1) With BS_{O_X}, by Definitions 5 and 6, Per_X and C_X are acquired (lines 3 to 4).
(2) If C_X > δ, by Definition 7, X is not a PIP and the algorithm terminates (line 6). Otherwise, X is added to S_pip (line 8).
(3) If Freq_max < Freq_X and C_min > C_X, by Definition 8, X dominates all itemsets in S_slp. Therefore, X becomes the only element of S_slp, and the values of Freq_max and C_min are updated with Freq_X and C_X, respectively (lines 9 to 11).
(4) If Freq_max < Freq_X and C_min ≤ C_X, or Freq_max ≥ Freq_X and C_min > C_X, then X may dominate some itemsets in S_slp and none of the itemsets in S_slp can dominate X. S_slp then contains X together with the itemsets not dominated by X. Specifically, in the former case the value of Freq_max is updated with Freq_X, and in the latter case the value of C_min is updated with C_X (lines 13 to 16 and lines 20 to 21).
(5) If Freq_max ≥ Freq_X and C_min ≤ C_X, X may be dominated by some itemsets in S_slp. If any itemset dominates X (line 17), X is not an SPIP, the identification of X stops (line 18), and X is not added to S_slp. Otherwise, S_slp contains X together with the itemsets not dominated by X (lines 20 to 21).
In Algorithm 3, as Algorithm 2 is invoked and Per_X is utilized, the worst-case time complexity of Algorithm 3 is O(max{|I|, |S_slp|}), where |S_slp| represents the maximal number of itemsets in S_slp.
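The skyline maintenance of Algorithms 2 and 3 can be condensed into a few lines. The sketch below is a simplification that omits the Freq_max/C_min shortcuts and represents each pattern by its (frequency, coefficient of variation) pair; it follows the same logic of rejecting dominated candidates and evicting skyline members that the newcomer dominates:

```python
def dominates(y, x):
    """Definition 8: Y dominates X."""
    return ((y[0] > x[0] and y[1] <= x[1]) or
            (y[0] >= x[0] and y[1] < x[1]))

def check_spip(x, delta, s_pip, s_slp):
    """Condensed CheckSPIP: x = (Freq_X, C_X).  Reject non-PIPs, reject
    dominated candidates, otherwise evict dominated skyline members
    (ClearNonSPIP) and insert x."""
    if x[1] > delta:                        # not a PIP (Definition 7)
        return s_pip, s_slp
    s_pip = s_pip + [x]
    if any(dominates(y, x) for y in s_slp):
        return s_pip, s_slp                 # x is dominated: not an SPIP
    s_slp = [y for y in s_slp if not dominates(x, y)] + [x]
    return s_pip, s_slp

# Replaying the first three 1-itemsets of Section 4.3.2 ({1}, {2}, {3}):
s_pip, s_slp = [], []
for p in [(0.583, 0.447), (0.583, 0.447), (0.667, 0.351)]:
    s_pip, s_slp = check_spip(p, float("inf"), s_pip, s_slp)
```

After {3} is processed, the skyline holds only (0.667, 0.351): {3} dominates {1} and {2}, matching the trace given in Section 4.3.2.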

4.3.2. First Iteration

The aim of the first iteration is to generate the bitsets for frequent 1-itemsets and to identify SPIPs of size 1 (if any). Algorithm 4 illustrates the process of the first iteration. I is the set of items in the transaction dataset. To guarantee the ≻ relation between any two bitsets in the ItemsetLists, the items in I are processed in ascending order. Initially, the values of Freq_max and C_min are set to 0 and ∞, respectively (line 3). L_cur, S_pip, and S_slp are empty (line 4). For each item i in I, all bits in BS_{i} are Cleared except the ith bit, which is Set to 1 (lines 6 to 7). Then, by Algorithm 1, BS_{O_{i}} is formulated on line 8. As BS_{i} contains a single 1 bit, the loop of lines 5 to 9 of Algorithm 1 is omitted. Freq_{i} is computed by Definition 4 (line 9). If Freq_{i} is not less than the frequency threshold θ, then {i} is an FP and BS_{i} is added to the end of L_cur (line 11). Algorithm 3 is then invoked to identify whether {i} is an SPIP, as discussed in Section 4.3.1.
Algorithm 4 First iteration
1: Input: I: Set, θ: Double, δ: Double
2: Output: L_cur
3: Freq_max = 0, C_min = ∞
4: L_cur ← an empty List, S_pip ← ∅, S_slp ← ∅
5: for each i ∈ I do
6:     BS_{i} ← an empty Bitset
7:     Set BS_{i}[i]
8:     BS_{O_{i}} ← call Algorithm 1(BS_{i})
9:     Freq_{i} ← |BS_{O_{i}}| / |DB|
10:     if Freq_{i} ≥ θ then
11:         Add BS_{i} to the end of L_cur
12:         call Algorithm 3 (BS_{i}, BS_{O_{i}}, δ, S_pip, S_slp)
13:     end if
14: end for
15: return L_cur
When Algorithm 4 stops, all infrequent 1-itemsets have been eliminated and are not involved in the subsequent iterations. L_cur becomes the input to the second iteration. In Algorithm 4, Algorithms 1 and 3 are invoked for each item i in I. Thus, the worst-case time complexity of Algorithm 4 is O(|I| · (max{|I|, |S_slp|} + |I|^2 / 64)).
An illustration of the first iteration is provided for mining SPIPs in the DB_2 dataset, as shown in Table 2, with a frequency threshold θ = 0.4 and a coefficient of variation threshold δ = ∞. Table 6 shows the frequencies and coefficients of variation for all eight 1-itemsets in DB_2, denoted by {1} to {8}.
On line 5 of Algorithm 4, the items in I are processed in ascending order, so the bitsets for the 1-itemsets {1} to {8} are handled sequentially. As the coefficient of variation threshold is set to ∞, no coefficient of variation can exceed it, and lines 5 to 7 of Algorithm 3 are skipped. Initially, for BS_{1}, as Freq_max = 0 and C_min = ∞, {1} is added to S_slp, and Freq_max = 0.583, C_min = 0.447. As Freq_max = Freq_{2} and C_min = C_{2}, lines 17 to 21 of Algorithm 3 process {2}; {2} is also added to S_slp, as {1} does not dominate {2}, and Freq_max and C_min remain unchanged. As Freq_max < Freq_{3} and C_min > C_{3}, lines 9 to 11 of Algorithm 3 process {3}; {3} dominates {1} and {2} and is added to S_slp, while {1} and {2} are removed from S_slp; Freq_max = 0.667 and C_min = 0.351. As Freq_max > Freq_{4} and C_min > C_{4}, lines 15 to 16 and 20 to 21 of Algorithm 3 process {4}; {3} stays in S_slp, as it is not dominated by {4}. After {4} is processed, S_slp contains {3} and {4}, with Freq_max = 0.667 and C_min = 0.2. As Freq_{5} is less than the frequency threshold, {5} is not an FP and therefore not an SPIP (line 10 of Algorithm 4). The frequencies of {6} to {8} are less than Freq_max, and each of them is dominated by some itemset in S_slp (line 17 of Algorithm 3). At the end of the first iteration, {3} and {4} are the two SPIPs of size 1. According to lines 10 to 11 of Algorithm 4, L_cur contains the bitsets for {1} to {8} except {5}, whose frequency is less than θ. L_cur is then used as the input to the second iteration.

4.3.3. kth Iteration (k > 1)

As shown in Algorithm 5, in the kth iteration (k > 1), the SPIPs of size k are obtained, and the frequent k-itemsets are generated and used as the input to the (k+1)th iteration. The kth iteration starts once L_cur contains the bitsets of all frequent (k−1)-itemsets. The procedure of Algorithm 5 is as follows:
(1)
When L c u r is not empty, Algorithm 5 runs iteratively (line 3).
(2)
L n e x t is set to empty (line 4).
(3)
For each BS_X in L_cur, P_X is first constructed (line 6). According to Definition 14, P_X equals BS_X with its last 1-bit replaced by 0 (lines 7 to 8).
(4)
To generate new k-itemsets, for each BS_Y after BS_X in L_cur, if P_X differs from P_Y, then by Theorem 1 all bitsets after BS_Y also have prefixes different from that of BS_X, so none of them can be combined with BS_X; the scan for merge partners of BS_X therefore stops (line 11). Otherwise, X and Y share an identical prefix and can be merged. Limiting the traversal of bitsets in this way avoids extensive, pointless checks on itemsets that cannot possibly be merged.
(5)
When BS_Y possesses an identical prefix, the only discrepancy between the two bitsets is the 1-bit marking the tail. The combination therefore focuses on this last 1-bit rather than trivially performing a bitwise "|" operation on BS_X and BS_Y. A new bitset BS_N is constructed for the k-itemset, initially equal to BS_X (line 13).
(6)
The tail_Y-th bit of BS_N is set to 1 (line 14).
(7)
As in the first iteration, Freq_N is calculated using Algorithm 1 and Definition 4 (lines 15 and 16).
(8)
If Freq_N is greater than or equal to the frequency threshold, BS_N is added to the end of L_next (line 18).
(9)
With BS_N and BS_O_N, Algorithm 3 is invoked to examine whether the itemset N is an SPIP (line 19).
(10)
Once L_next contains the bitsets of all frequent k-itemsets, they are transferred to L_cur (line 23). This step marks both the end of the kth iteration and the beginning of the (k+1)th iteration.
Algorithm 5 kth iteration (k > 1)
1:  Input: L_cur: List, θ: Double, δ: Double
2:  Output: L_cur: List, S_slp: Set, S_pip: Set
3:  while L_cur is not an empty List do
4:      L_next ← an empty List
5:      for each BS_X ∈ L_cur do
6:          P_X ← BS_X
7:          tail_X ← the tail of X
8:          Clear P_X[tail_X]
9:          for each BS_Y ∈ L_cur after BS_X do
10:             if P_X ≠ P_Y then
11:                 break
12:             end if
13:             BS_N ← clone BS_X
14:             Set BS_N[tail_Y]
15:             BS_O_N ← call Algorithm 1(BS_N)
16:             Freq_N ← |BS_O_N| / |DB|
17:             if Freq_N ≥ θ then
18:                 Add BS_N to the end of L_next
19:                 call Algorithm 3(BS_N, BS_O_N, S_pip, S_slp, δ)
20:             end if
21:         end for
22:     end for
23:     L_cur ← L_next
24: end while
25: return L_cur, S_slp, S_pip
When L_cur is an empty list, no frequent k-itemset was generated in the kth iteration; the (k+1)th iteration does not proceed, the algorithm terminates, and all SPIPs have been identified.
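The prefix test and merge in steps (3) to (6) can be sketched with plain integer bitsets. This is an illustrative sketch, not the paper's implementation: we assume item i occupies bit i−1 of a nonzero machine word, that the bitsets in a level are stored in ascending item order (so same-prefix itemsets are adjacent, which is what makes the early break valid), and all names are ours.

```python
def tail_bit(bs):
    """0-based position of the highest set bit: the itemset's tail item."""
    return bs.bit_length() - 1

def prefix(bs):
    """P_X: the bitset with its tail bit cleared (cf. Definition 14)."""
    return bs & ~(1 << tail_bit(bs))

def merge_candidates(level):
    """level: bitsets of frequent k-itemsets in ascending item order.
    Two k-itemsets merge into a (k+1)-itemset iff they share a prefix;
    once a prefix mismatch is seen, no later bitset can match either
    (the cutting mechanism of Theorem 1), so the inner scan stops."""
    nxt = []
    for i, x in enumerate(level):
        px = prefix(x)
        for y in level[i + 1:]:
            if px != prefix(y):
                break  # cutting: stop scanning merge partners for x
            # clone BS_X and set the tail_Y bit to build the new bitset
            nxt.append(x | (1 << tail_bit(y)))
    return nxt

# {1,2}=0b011, {1,3}=0b101, {2,3}=0b110: only {1,2} and {1,3} share the
# prefix {1}, so the single candidate is {1,2,3}=0b111.
print(bin(merge_candidates([0b011, 0b101, 0b110])[0]))  # 0b111
```

On the running DB_2 example, merging the 2-itemset bitsets for {1, 2} (0b000011) and {1, 6} (0b100001), which share the prefix {1}, yields the single 3-itemset candidate {1, 2, 6} (0b100011), exactly as in the third iteration described below.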
Let |L_cur| denote the maximum number of bitsets held in L_cur; the worst-case time complexity of a single kth iteration is then O(|L_cur|^2 · (max{|I|, |S_slp|} + |I|^2 / 64)).
We provide an illustration of the second iteration of mining SPIPs on the DB_2 dataset shown in Table 2, with the frequency threshold θ = 0.4 and the coefficient-of-variation threshold δ = +∞. L_cur contains the bitsets of {1}, {2}, {3}, {4}, {6}, {7}, and {8}. Algorithm 3 only checks whether X_1 = {1, 2} and X_2 = {1, 6} are SPIPs, since among all the 2-itemsets only X_1 and X_2 are FPs with frequencies of at least θ. For simplicity, Table 7 gives only the frequency and the coefficient of variation of X_1 and X_2 in Table 2.
At the beginning of the second iteration, Freq_max = 0.667 and C_min = 0.2. Since Freq_X1 < Freq_max and C_X1 = C_min, lines 17 to 21 of Algorithm 3 process X_1. Neither {3} nor {4} dominates X_1, and X_1 dominates neither {3} nor {4}; thus, {3}, {4}, and X_1 = {1, 2} are all SPIPs, and Freq_max and C_min remain unchanged. Similarly, for X_2 = {1, 6}, since Freq_X2 = Freq_{4} and C_X2 > C_{4}, X_2 is dominated by {4} and is therefore not an SPIP. At the end of the second iteration, S_slp contains three SPIPs: {3}, {4}, and {1, 2}. L_cur contains the two bitsets of X_1 and X_2, which are used as the input to the third iteration.
In the third iteration, only one 3-itemset, X_3 = {1, 2, 6}, can be merged. As X_3 is not an FP, L_cur is empty at the end of the third iteration (lines 4 and 23 of Algorithm 5). The fourth iteration starts with an empty L_cur, so the algorithm terminates (line 3 of Algorithm 5), and the final SPIPs in Table 2 with θ = 40% and δ = +∞ are {3}, {4}, and {1, 2}.
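The frequencies used in this example can be reproduced with the column-bitset trick of occurrence sets (an AND of the columns of the items, as in Figure 3). This is a sketch under the assumption that transaction T_j maps to bit j−1 of each item's column; the helper names are ours.

```python
# DB_2 from Table 2: the items of transactions T1..T12.
DB2 = [
    {1, 2, 5, 6, 8}, {1, 4, 5, 6}, {1, 3, 6, 7}, {1, 2, 3, 6},
    {3, 4, 6, 8},    {1, 2, 6, 7}, {2, 3, 4},    {1, 2, 3, 7},
    {3, 5, 7, 8},    {2, 3, 4},    {1, 2, 7, 8}, {3, 4, 6, 7, 8},
]

def column(db, item):
    """Occurrence bitset of one item: bit j is set iff item ∈ T_{j+1}."""
    col = 0
    for j, t in enumerate(db):
        if item in t:
            col |= 1 << j
    return col

def frequency(db, itemset):
    """Freq_X = |BS_O_X| / |DB|, where the occurrence set BS_O_X is the
    bitwise AND of the columns of all items in X."""
    occ = (1 << len(db)) - 1  # all-ones: start from every transaction
    for item in itemset:
        occ &= column(db, item)
    return bin(occ).count("1") / len(db)

print(frequency(DB2, {1, 2}))    # 5/12 ≈ 0.4167, reported as 0.416 in Table 7
print(frequency(DB2, {1, 2, 6})) # 3/12 = 0.25 < θ, so {1,2,6} is not an FP
```

The same helper reproduces the 1-itemset values of Table 6, e.g. Freq({3}) = 8/12 ≈ 0.667, confirming why the third iteration discards X_3 and terminates the mining process.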

5. Empirical Evaluation

We conducted a series of experiments comparing the performance of BitSPIM and SPIM on a Windows 10 PC equipped with an AMD Ryzen 3950X processor and 64 GB of memory. The CPU clock speed was locked at 3.5 GHz to avoid the adverse effects of CPU overclocking. The characteristics of the datasets involved in our experiments are presented in Table 8, comprising four synthetic datasets and six real datasets. All datasets were downloaded from the SPMF website (http://www.philippe-fournier-viger.com/spmf, accessed on 1 September 2023).
As SPIM [20] is the state-of-the-art and the only algorithm focusing on mining SPIPs, we primarily compare the running time and memory usage of our approach against SPIM. All datasets used in SPIM are included in our experiments. Additionally, as FP-Growth is a fundamental component of SPIM, the running time of FP-Growth is also recorded to further explore the effectiveness of BitSPIM. For simplicity, δ is set to +∞ in all experiments, which implies that every frequent pattern is also a periodic itemset pattern. In the first experiment, the numbers of PIPs and SPIPs identified by both algorithms were recorded. The second experiment focuses on the running times of BitSPIM, SPIM, and FP-Growth. Finally, we compare the memory usage of BitSPIM and SPIM.

5.1. Number of Patterns

To verify that the SPIPs obtained by the proposed method are complete and correct, we counted the numbers of PIPs and SPIPs obtained by BitSPIM and SPIM. The results show that, on all datasets involved in the experiment, the PIP and SPIP counts mined by BitSPIM are always consistent with those obtained by SPIM for various values of θ, verifying the correctness of the proposed method.

5.2. Running Time

The running time of BitSPIM is compared with that of SPIM and FP-Growth. In SPIM, FPs are identified in advance by FP-Growth before the recognition of SPIPs; thus, the running time of FP-Growth can be recorded separately. Figure 6 shows the running times of BitSPIM, SPIM, and FP-Growth on different datasets with various frequency thresholds θ when δ = +∞. In each subfigure, the range of θ includes the approximate threshold value at which BitSPIM and SPIM have the same running time. The horizontal axis indicates the value of θ, and the vertical axis represents the running time. The red, blue, and gray curves indicate the running times of BitSPIM, SPIM, and FP-Growth, respectively; the circles, triangles, and squares on these curves mark the measured running times at specific values of θ. An intersection point of the red and blue curves indicates that the running times of BitSPIM and SPIM are identical; it is projected onto the horizontal axis by a dotted line parallel to the vertical axis, and its horizontal coordinate gives the frequency threshold at which the two algorithms take the same time.
As shown in Figure 6, except at the smaller thresholds, BitSPIM outpaces SPIM across most of the threshold range. The running-time curve of BitSPIM is also steeper than that of SPIM: as θ increases, once it passes the horizontal coordinate of the intersection of the two curves, the running time of BitSPIM stays consistently below that of SPIM. For example, as shown in Figure 6g, the intersection point lies at 0.215% on the OnlineRetail dataset, so BitSPIM runs faster than SPIM over 99.785% of the threshold range.
The improvement achieved by BitSPIM over SPIM with respect to running time is significant on the T20I6D100K, Chainstore, OnlineRetail, and Kosarak datasets. For example, on T20I6D100K with a frequency threshold of 0.3%, BitSPIM is approximately twice as fast as SPIM, and for thresholds beyond 0.6%, SPIM takes at least four times longer than BitSPIM. On T25I10D10K, C20D10K, Foodmart, and BMS-Webview-1, the improvement is less pronounced, but BitSPIM still holds an advantage over SPIM at the majority of frequency thresholds. Since BitSPIM builds on the basic idea of Apriori, it can be outpaced by SPIM at small frequency thresholds. In fact, the experimental results support the conclusion of [33] that no algorithm is an absolute and clear winner, able to outperform all others across all datasets and the entire range of thresholds. Overall, BitSPIM requires less running time than SPIM for the majority of frequency thresholds.
Mining FPs is fundamental to identifying SPIPs. SPIM identifies SPIPs from the complete set of FPs mined by FP-Growth and, as a result, naturally takes longer than FP-Growth. BitSPIM, in contrast, does not mine FPs in a separate step and can even outperform FP-Growth itself. On datasets such as T20I6D100K, Chainstore, OnlineRetail, and Kosarak, BitSPIM runs faster than FP-Growth at the majority of frequency thresholds. Although on datasets such as T10I4D100K, Foodmart, and BMS-Webview-2, BitSPIM shows less of an advantage over FP-Growth, the red and gray curves, representing the running times of BitSPIM and FP-Growth, respectively, can still be observed to intersect, indicating that BitSPIM can outperform FP-Growth at some frequency thresholds. This comparison further demonstrates the superior performance of our approach over SPIM.

5.3. Memory Usage

The results comparing the average memory usage of BitSPIM and SPIM with different frequency thresholds θ on the empirical datasets are presented in Table 9, with the better results highlighted in bold. The coefficient-of-variation threshold δ is set to +∞, and the same range of frequency thresholds as in the running-time experiment is adopted. As shown in Table 9, except on datasets with a large number of transactions and items, such as Chainstore and Kosarak, BitSPIM outperforms SPIM in terms of average memory usage.

5.4. Discussion

From the results, the proposed method shows better performance, as it consumes less time compared with SPIM for the vast majority of frequency threshold values across different datasets. Regarding memory usage, BitSPIM generally consumes less memory than SPIM, except on datasets with an extensive number of transactions and items.
The advantages of the proposed method can be summarized as follows: (1) The bitset representation of the transaction dataset is more compact than the original dataset. (2) Bitwise operations are used throughout the mining of SPIPs by mapping ordinary sets to bitsets, so the generation of new itemsets and the calculation of their frequencies are realized by efficient bitwise operations. (3) A novel cutting technique avoids many unnecessary operations: when certain conditions are met, the loop stops without exploring the entire search space. (4) The precomputed occurrence set of an itemset can be reused directly when checking whether an FP is an SPIP. (5) Space for constructing FP-trees is saved, as FP-Growth is not used to identify FPs.
However, due to the inherent drawbacks inherited from Apriori, BitSPIM repeatedly scans the dataset to generate new bitsets and calculate the frequencies of the itemsets, which leads to higher time consumption at smaller thresholds. On datasets with numerous transactions and items, a large number of bitsets must be stored and operated on; in such cases, BitSPIM is outperformed by SPIM in terms of memory usage.

6. Conclusions

In this paper, we propose BitSPIM, an approach for mining SPIPs that is more efficient than the SPIM algorithm. Apart from utilizing a novel bitwise representation capable of mining SPIPs, BitSPIM adopts a cutting mechanism to reduce the search space. We evaluated our approach against the latest algorithm for mining SPIPs on a variety of real and synthetic datasets, and the results demonstrate that BitSPIM is faster and consumes less memory than SPIM in most cases. We believe that our approach is a significant alternative for mining SPIPs and can be applied to diverse fields within ARM.

Author Contributions

Y.L. implemented the experiments and wrote the first draft of the paper; Z.L. acquired funding and revised the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grant no. 62276060 and the Development and Reform Committee Foundation of Jilin Province of China under grant no. 2019C053-9.

Data Availability Statement

The datasets are available at the following links: http://www.philippe-fournier-viger.com/spmf (accessed on 1 September 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baralis, E.; Cagliero, L.; Cerquitelli, T.; Chiusano, S.; Garza, P.; Grimaudo, L.; Pulvirenti, F. NEMICO: Mining Network Data through Cloud-Based Data Mining Techniques. In Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, London, UK, 8–11 December 2014. [Google Scholar]
  2. Agrawal, R. Mining association rules between sets of items in large databases. In Proceedings of the ACM Sigmod International Conference on Management of Data, Washington, DC, USA, 25–28 May 1993. [Google Scholar]
  3. Le, H.S. A novel kernel fuzzy clustering algorithm for Geo-Demographic Analysis. Inf. Sci. Int. J. 2015, 317, 202–223. [Google Scholar]
  4. Nguyen, L.; Nguyen, N.T. Updating mined class association rules for record insertion. Appl. Intell. 2015, 42, 707–721. [Google Scholar] [CrossRef]
  5. Agrawal, R.; Srikant, R. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, 12–15 September 1994. [Google Scholar]
  6. Han, J.; Jian, P. Mining frequent patterns without candidate generation. ACM Sigmod Rec. 2000, 29, 1–12. [Google Scholar] [CrossRef]
  7. Han, J.; Jian, P.; Yin, Y.; Mao, R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
  8. Jie, D.; Min, H. BitTableFI: An efficient mining frequent itemsets algorithm. Knowl.-Based Syst. 2007, 20, 329–335. [Google Scholar]
  9. Lin, J.C.; Li, T.; Fournier-Viger, P.; Hong, T.; Su, J. Efficient Mining of High Average-Utility Itemsets with Multiple Minimum Thresholds. In Proceedings of the Advances in Data Mining. Applications and Theoretical Aspects—16th Industrial Conference, ICDM 2016, New York, NY, USA, 13–17 July 2016; Proceedings; Lecture Notes in Computer Science. Perner, P., Ed.; Springer: Berlin/Heidelberg, Germany, 2016; Volume 9728, pp. 14–28. [Google Scholar] [CrossRef]
  10. Lee, G.; Yang, W.; Lee, J. A parallel algorithm for mining multiple partial periodic patterns. Inf. Sci. 2006, 176, 3591–3609. [Google Scholar] [CrossRef]
  11. Elseidy, M.; Abdelhamid, E.; Skiadopoulos, S.; Kalnis, P. GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph. Proc. VLDB Endow. 2014, 7, 517–528. [Google Scholar] [CrossRef]
  12. Hosseininasab, A.; van Hoeve, W.; Ciré, A.A. Constraint-Based Sequential Pattern Mining with Decision Diagrams. In Proceedings of the The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, HI, USA, 27 January–1 February 2019; AAAI Press: Cambridge, MA, USA, 2019; pp. 1495–1502. [Google Scholar] [CrossRef]
  13. Chanda, A.K.; Saha, S.; Nishi, M.A.; Samiullah, M.; Ahmed, C.F. An efficient approach to mine flexible periodic patterns in time series databases. Eng. Appl. Artif. Intell. 2015, 44, 46–63. [Google Scholar] [CrossRef]
  14. Rana, S.; Mondal, M.N.I. An Approach for Seasonally Periodic Frequent Pattern Mining in Retail Supermarket. In Proceedings of the International Conference on Smart Data Intelligence, ICSMDI 2021, Tamil Nadu, India, 29–30 April 2021. [Google Scholar]
  15. Zhou, H.; Hirasawa, K. Evolving temporal association rules in recommender system. Neural Comput. Appl. 2019, 31, 2605–2619. [Google Scholar] [CrossRef]
  16. Chen, G.; Li, Z. Discovering periodic cluster patterns in event sequence databases. Appl. Intell. 2022, 52, 15387–15404. [Google Scholar] [CrossRef]
  17. Tanbeer, S.K.; Ahmed, C.F.; Jeong, B.; Lee, Y. Discovering Periodic-Frequent Patterns in Transactional Databases. In Proceedings of the Advances in Knowledge Discovery and Data Mining, 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, 27–30 April 2009; Proceedings; Lecture Notes in Computer Science. Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476, pp. 242–253. [Google Scholar] [CrossRef]
  18. Rashid, M.M.; Karim, M.R.; Jeong, B.; Choi, H. Efficient Mining Regularly Frequent Patterns in Transactional Databases. In Proceedings of the Database Systems for Advanced Applications—17th International Conference, DASFAA 2012, Busan, Republic of Korea, 15–19 April 2012; Proceedings, Part I; Lecture Notes in Computer Science. Lee, S., Peng, Z., Zhou, X., Moon, Y., Unland, R., Yoo, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7238, pp. 258–271. [Google Scholar] [CrossRef]
  19. Chen, G.; Li, Z. A New Method Combining Pattern Prediction and Preference Prediction for Next Basket Recommendation. Entropy 2021, 23, 1430. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, G.; Li, Z. Discovering Skyline Periodic Itemset Patterns in Transaction Sequences. In Proceedings of the Advanced Data Mining and Applications—19th International Conference, ADMA 2023, Shenyang, China, 21–23 August 2023; Proceedings, Part I; Lecture Notes in Computer Science. Yang, X., Suhartanto, H., Wang, G., Wang, B., Jiang, J., Li, B., Zhu, H., Cui, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2023; Volume 14176, pp. 494–508. [Google Scholar] [CrossRef]
  21. Papadias, D.; Tao, Y.; Fu, G.; Seeger, B. Progressive skyline computation in database systems. ACM Trans. Database Syst. 2005, 30, 41–82. [Google Scholar] [CrossRef]
  22. Fournier-Viger, P.; Lin, C.W.; Duong, Q.H.; Dam, T.L.; Voznak, M. PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures. In Proceedings of the 2nd Czech-China Scientific Conference 2016; IntechOpen: London, UK, 2017. [Google Scholar]
  23. Fournier-Viger, P.; Li, Z.; Lin, J.C.; Kiran, R.U.; Fujita, H. Efficient algorithms to identify periodic patterns in multiple sequences. Inf. Sci. 2019, 489, 205–226. [Google Scholar] [CrossRef]
  24. Nagarajan, K.; Kannan, S.; Sumathi, K. Maximal Frequent Itemset Mining Using Breadth-First Search with Efficient Pruning. In Proceedings of the International Conference on Computer Networks and Communication Technologies, Alghero, Italy, 29 September–2 October 2019. [Google Scholar]
  25. Yuan, X. An improved Apriori algorithm for mining association rules. AIP Conf. Proc. 2017, 1820, 080005. [Google Scholar]
  26. Song, W.; Yang, B.; Xu, Z. Index-BitTableFI: An improved algorithm for mining frequent itemsets. Knowl.-Based Syst. 2008, 21, 507–513. [Google Scholar] [CrossRef]
  27. Ayres, J.; Flannick, J.; Gehrke, J.; Yiu, T. Sequential PAttern mining using a bitmap representation. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; ACM: New York, NY, USA, 2002; pp. 429–435. [Google Scholar] [CrossRef]
  28. Breve, B.; Caruccio, L.; Cirillo, S.; Deufemia, V.; Polese, G. IndiBits: Incremental Discovery of Relaxed Functional Dependencies using Bitwise Similarity. In Proceedings of the 39th IEEE International Conference on Data Engineering, ICDE 2023, Anaheim, CA, USA, 3–7 April 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1393–1405. [Google Scholar] [CrossRef]
  29. Vo, B.; Hong, T.; Le, B. DBV-Miner: A Dynamic Bit-Vector approach for fast mining frequent closed itemsets. Expert Syst. Appl. 2012, 39, 7196–7206. [Google Scholar] [CrossRef]
  30. Tran, M.; Le, B.; Vo, B. Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Eng. Appl. Artif. Intell. 2015, 38, 183–189. [Google Scholar] [CrossRef]
  31. Prasanna, K.; Seetha, M. Efficient and Accurate Discovery of Colossal Pattern Sequences from Biological Datasets: A Doubleton Pattern Mining Strategy (DPMine). Procedia Comput. Sci. 2015, 54, 412–421. [Google Scholar] [CrossRef]
  32. Van, T.; Yoshitaka, A.; Le, B. Mining web access patterns with super-pattern constraint. Appl. Intell. 2018, 48, 3902–3914. [Google Scholar] [CrossRef]
  33. Goethals, B.; Zaki, M.J. Advances in frequent itemset mining implementations: Report on FIMI’03. ACM Sigkdd Explor. Newsl. 2004, 6, 109–117. [Google Scholar] [CrossRef]
Figure 1. The diagram of the mapping relations. (a) Map the discontinuous numbers to continuous items. (b) Map the symbols to continuous items.
Figure 2. Diagram of Example 9. The bits corresponding to the head and the tail of X 1 and X 2 are colored, respectively.
Figure 3. Diagram of Example 10. Given the bitwise representation of D B 1 , X = { 2 , 4 } and BS X = 01010. C o l 2 , C o l 4 , and BS O X are colored. BS O X = C o l 2 & C o l 4 = 00101.
Figure 4. Diagram of Example 11. The bitset for X 1 , X 2 , and X 3 , as well as the prefix for X 1 and X 2 are depicted. The bits corresponding to the head and the tail of the itemset are colored, respectively.
Figure 5. Diagram of Example 12. The bitsets in L are presented. Different types of bitsets are colored with different colors.
Figure 6. Running time (ms) with different frequency thresholds θ (%) on empirical datasets. The horizontal axis and the vertical axis in each subfigure represent the value of θ and the running time, respectively. The intersection points of the red and blue curves in each subfigure are projected on the horizontal axis by a dotted line parallel to the vertical axis.
Table 1. Example transaction dataset D B 1 .
Transaction   Items
T_1           1, 2, 3, 5
T_2           1, 3, 4, 5
T_3           1, 2, 3, 4, 5
T_4           3, 4, 5
T_5           2, 3, 4
Table 2. Example transaction dataset D B 2 .
Transaction   Items            Transaction   Items
T_1           1, 2, 5, 6, 8    T_7           2, 3, 4
T_2           1, 4, 5, 6       T_8           1, 2, 3, 7
T_3           1, 3, 6, 7       T_9           3, 5, 7, 8
T_4           1, 2, 3, 6       T_10          2, 3, 4
T_5           3, 4, 6, 8       T_11          1, 2, 7, 8
T_6           1, 2, 6, 7       T_12          3, 4, 6, 7, 8
Table 3. The frequency and the coefficient of variation of X 1 , X 2 , and X 3 in D B 2 .
Pattern   Frequency   Coefficient of Variation
X_1       0.416       0.2
X_2       0.333       0.943
X_3       0.667       0.351
Table 4. Bitwise representation of D B 1 .
Transaction   Items            BS_Tk
T_1           1, 2, 3, 5       11101
T_2           1, 3, 4, 5       10111
T_3           1, 2, 3, 4, 5    11111
T_4           3, 4, 5          00111
T_5           2, 3, 4          01110
Table 5. The notations and functions of different ItemsetLists and sets in BitSPIM.
Notation   Function
S_slp      containing SPIPs
S_pip      containing PIPs
L_cur      holding the bitsets input to the kth iteration
L_next     transferring the bitsets to the (k+1)th iteration
Table 6. The frequency and the coefficient of variation of eight 1-itemsets in D B 2 . F r e q m a x , C m i n , and S s l p denote the maximal frequency, the minimal coefficient of variation, and the set of SPIPs after itemset { i } is processed. I is the set of items in D B 2 , i I .
Itemset ({i})   Frequency   Coefficient of Variation   Freq_max   C_min   S_slp
{1}             0.583       0.447                      0.583      0.447   {1}
{2}             0.583       0.447                      0.583      0.447   {1}, {2}
{3}             0.667       0.351                      0.667      0.351   {3}
{4}             0.416       0.2                        0.667      0.2     {3}, {4}
{5}             0.25        0.75                       0.667      0.2     {3}, {4}
{6}             0.583       1.016                      0.667      0.2     {3}, {4}
{7}             0.5         0.415                      0.667      0.2     {3}, {4}
{8}             0.416       0.472                      0.667      0.2     {3}, {4}
Table 7. The frequency and the coefficient of variation of X 1 and X 2 in D B 2 . F r e q m a x , C m i n , and S s l p show the maximal frequency, the minimal coefficient of variation, and the set of SPIPs after itemset X i is processed.
Itemset (X_i)    Frequency   Coefficient of Variation   Freq_max   C_min   S_slp
X_1 = {1, 2}     0.416       0.2                        0.667      0.2     {3}, {4}, {1, 2}
X_2 = {1, 6}     0.416       0.346                      0.667      0.2     {3}, {4}, {1, 2}
Table 8. The characteristics of the empirical datasets.
Dataset          # Trans     # Items   AveLen   Density
T10I4D100K       100,000     870       10       1.15%
T20I6D100K       99,922      893       19.9     2.23%
T25I10D10K       9976        929       24.77    2.67%
C20D10K          10,000      192       20       0.42%
Chainstore       1,112,949   46,086    7.23     0.02%
Foodmart         4141        1559      4.42     0.28%
OnlineRetail     541,909     2603      4.37     0.17%
Kosarak          990,002     41,270    8.1      0.02%
BMS-WebView-1    59,602      497       2.51     0.51%
BMS-WebView-2    77,512      3340      4.62     0.14%
“#” represents “the number of”, “Trans” represents “Transactions”, and “AveLen” represents “Average Length”.
Table 9. Average memory usage (MB) of BitSPIM and SPIM on empirical datasets. The better result in each row is marked in bold.
Dataset          SPIM     BitSPIM
T10I4D100K       3021.2   247.1
T20I6D100K       3496.2   381.8
T25I10D10K       5237.8   2255.2
C20D10K          4360.1   2072.8
Chainstore       5371.6   6558.1
Foodmart         4308.7   1670.0
OnlineRetail     5286.1   2936.9
Kosarak          4622.1   5667.4
BMS-WebView-1    5112.3   1697.8
BMS-WebView-2    4781.4   1805.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Y.; Li, Z. An Efficient Bit-Based Approach for Mining Skyline Periodic Itemset Patterns. Electronics 2023, 12, 4874. https://doi.org/10.3390/electronics12234874
