The general methodology proposed in this work is fully configurable according to system administrators’ needs and can be adapted over time through periodic offline updates that discover new patterns emerging from changes in the city assets or in people’s habits.
The proposed methodology is composed of two main phases: (i) pattern-based model training on historical data and (ii) online planning of rebalancing operations and their application. The first phase extracts frequent patterns (association rules) that represent sets of recurrent critical situations among nearby stations. Each considered pattern contains both positively and negatively critical stations and can be used to plan a rebalancing operation. Based on the patterns extracted from historical data, rebalancing actions can be planned and applied to dynamically manage the status of critical stations.
Since different time periods are characterized by different behaviours, the proposed methodology trains and applies a set of contextualised models, each one tailored for a specific time period.
4.1. Pattern-Based Model Training
Given the input dataset D, containing information about the historical stations’ statuses, and the set S, containing the stations and their geographical locations, the pattern-based model training phase (depicted in Figure 1) is based on the following steps:
Neighbourhood identification. Given the stations’ geographical locations and the neighbourhood radius d, the neighbourhood N(s) of each station s ∈ S is computed.
Occupation rate computation. Given the input dataset D and the neighbourhood N(s) of each station s ∈ S, the occupation rate is computed for all stations and all timestamps, i.e., for each pair (s, t).
Identification of critical stations. Given the criticality threshold, the occupation rate of each pair (s, t) and the identified neighbourhoods, the critical rate of all the pairs is computed. Then, only the pairs associated with either a positively or a negatively critical situation are selected and stored in a dataset of critical statuses, enriched with the critical status (positive or negative).
Contextualised data partitioning. Given a contextualised partitioning schema based on the timestamp, the dataset of critical statuses is split into N non-overlapping partitions D_1, …, D_N. A partition is a logical group defined on the input data, related to a specific temporal context, on which we are interested in training a tailored model; e.g., if we are interested in a contextualised model for each day of the week, the data are split into seven partitions (one per day of the week).
Generation of transactional datasets. Given the partitions D_1, …, D_N, a transactional dataset T_i that encodes the critical stations at each timestamp t is built from each partition D_i. Each transaction includes the set of stations that are positively or negatively critical at timestamp t, together with their status (positive or negative).
Rule extraction. Finally, association rules are mined from each transactional dataset T_i to extract, for each context, the set of frequent patterns representing recurrent critical situations among nearby stations.
The set of extracted association rules represents the training model that is used to plan online rebalancing actions using the approach described in
Section 4.2.
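The first three steps of the training phase can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the function and parameter names (radius d in metres, a symmetric criticality threshold) are our assumptions, as is the convention that a positively critical station is a nearly full one.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6_371_000 * 2 * asin(sqrt(a))

def neighbourhoods(stations, d):
    """N(s): the stations within radius d (metres) of s, with s excluded."""
    return {s: {t for t in stations
                if t != s and haversine_m(stations[s], stations[t]) <= d}
            for s in stations}

def occupation_rate(used_slots, total_slots):
    """Occupation rate of the pair (s, t): fraction of occupied slots."""
    return used_slots / total_slots

def critical_status(rate, threshold):
    """'+' if nearly full (assumed positively critical), '-' if nearly empty,
    None if the station is not critical at this timestamp."""
    if rate >= 1 - threshold:
        return "+"
    if rate <= threshold:
        return "-"
    return None
```

For example, with a 10% threshold a station at 95% occupation is flagged positively critical, one at 5% negatively critical, and one at 50% is not flagged at all.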
In the following subsections, we provide more details about the contextualised data partitioning step and the rule mining step, which are the building blocks of the proposed methodology.
4.1.1. Contextualised Data Partitioning and Models
The rebalancing framework proposes different contextualised data partitioning strategies to extract meaningful insights at different time granularities and to let the framework adapt to the different habits of users and contexts. Each partition includes the data used to train a tailored model for a specific temporal context. The proposed contextualised partition-based approach allows us to model more precisely the characteristics of each context and hence to plan more effective rebalancing strategies, as shown in the experimental section.
The data partitioning component of our framework receives an input dataset D, with a timestamp t associated to each record, and divides it into N non-overlapping partitions D_1, …, D_N based on the value of t, such that D = D_1 ∪ … ∪ D_N and D_i ∩ D_j = ∅ for each i ≠ j. It is useful to notice that, given a contextualised temporal partitioning strategy, a specific timestamp t belongs to one single partition D_i.
Different partitioning strategies lead to different frequent patterns and different contextualised models. Using different time partitioning strategies, our framework collects meaningful information about users’ usage patterns in different moments of the day, week, and month, and allows us to understand how the usage behaviour changes across temporal contexts. By considering different partitioning strategies, we can understand whether finer or coarser temporal contexts should be used in the bike sharing domain. Finer partitioning strategies should provide more tailored and precise models, but overfitting becomes more likely. Conversely, coarser partitioning strategies avoid overfitting but could be too general and might not perform well in all contexts.
We proposed and evaluated the following temporal contextualised data partitioning strategies:
Per month partitioning. Data belonging to the same month are kept together and monthly models are trained.
Per day of the week partitioning. Data belonging to the same day of the week are included in the same partition and analysed together. In this way, the mined patterns capture insights about critical stations within the same day of the week. A total of seven partitions is generated.
Per time slot partitioning. Three timeslots are defined: 5:00–13:00, 13:00–21:00, and 21:00–5:00, with one partition per timeslot. In this case, the association rules gather insights about frequent critical stations in certain time slots of the day, independently of the day of the week.
Per day of the week and time slot partitioning. This approach combines the previous two, defining one partition for each (day of the week, timeslot) combination.
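The four strategies above can be expressed as key functions that map a record’s timestamp to its partition. A minimal sketch (names and the dictionary-based dispatch are our own; the slot boundaries are the ones listed above):

```python
from datetime import datetime
from collections import defaultdict

def slot(ts: datetime) -> str:
    """Map an hour of the day to one of the three timeslots."""
    if 5 <= ts.hour < 13:
        return "05-13"
    if 13 <= ts.hour < 21:
        return "13-21"
    return "21-05"

# One key function per contextualised partitioning strategy.
PARTITIONING = {
    "month": lambda ts: ts.month,
    "day_of_week": lambda ts: ts.weekday(),
    "time_slot": slot,
    "day_and_slot": lambda ts: (ts.weekday(), slot(ts)),
}

def partition(records, strategy):
    """Split (timestamp, payload) records into non-overlapping partitions:
    each timestamp lands in exactly one partition."""
    key = PARTITIONING[strategy]
    parts = defaultdict(list)
    for ts, payload in records:
        parts[key(ts)].append((ts, payload))
    return dict(parts)
```

Because each key function is total and deterministic, the resulting partitions are non-overlapping and cover the whole dataset, as required by the definition above.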
We consider the time information the most relevant context in this domain and, for this reason, we decided to partition data along the timestamp dimension. However, the proposed methodology can easily be adapted to other contextual dimensions as well.
4.1.2. Transactional Dataset Generation and Rule Extraction
The proposed methodology uses the association rules mined from historical data as a model of recurrent behaviours to plan the rebalancing operations. However, since itemset and association rule mining algorithms operate on transactional datasets, the original data need to be transformed into a transactional format. Each partition D_i must be mapped to a transactional dataset T_i. Specifically, for each partition D_i and for each distinct timestamp t in D_i, we generate a transaction that is stored in T_i. The transaction contains all the stations that are in a critical status at time t, together with their critical status. In particular, we associate the plus sign (+) with station s if the station is positively critical at time t and the minus sign (−) if it is negatively critical. A single transaction in T_i thus represents the list of positively and negatively critical stations present in partition D_i at timestamp t.
For instance, suppose that at a given timestamp t the stations s1 and s2 are positively critical, station s3 is negatively critical, and all the other stations in S are not in a critical status. Then, the transaction {(s1, +), (s2, +), (s3, −)} is inserted in the transactional dataset.
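The mapping from the enriched dataset of critical statuses to transactions is a simple group-by on the timestamp. A minimal sketch, assuming the input is a list of (timestamp, station, status) triples with status '+' or '−' (the triple layout is our assumption):

```python
from collections import defaultdict

def to_transactions(critical_records):
    """Group critical (station, status) pairs by timestamp:
    one transaction per distinct timestamp."""
    tx = defaultdict(set)
    for t, station, status in critical_records:
        tx[t].add((station, status))
    return list(tx.values())
```

For the example above, the three critical triples at the same timestamp collapse into the single transaction {(s1, +), (s2, +), (s3, −)}.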
Given the N transactional datasets, the FP-growth algorithm is used to extract the frequent itemsets and the set of association rules, given a minimum support threshold minSupport and a minimum confidence threshold minConfidence. Specifically, a set of association rules is extracted from each transactional dataset T_i, i.e., a set of contextualised rules is mined for each (temporal) context.
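To make the mining step concrete, here is a brute-force frequent-itemset and rule miner. The paper uses FP-growth; this exhaustive sketch produces the same rules on small data and is only for illustration. Following the rule shape described below, consequents are restricted to a single item.

```python
from itertools import combinations

def mine_rules(transactions, min_support, min_confidence):
    """Exhaustive frequent-itemset mining plus single-consequent rule
    generation (an illustrative stand-in for FP-growth)."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {}
    for size in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, size):
            s = sum(1 for t in transactions if set(cand) <= t) / n
            if s >= min_support:
                support[frozenset(cand)] = s
                found = True
        if not found:          # Apriori monotonicity: no larger frequent set exists
            break
    rules = []
    for itemset, s in support.items():
        if len(itemset) < 2:
            continue
        for conseq in itemset:             # single-item consequent
            ante = itemset - {conseq}
            conf = s / support[ante]       # subsets of frequent sets are frequent
            if conf >= min_confidence:
                rules.append((ante, conseq, s, conf))
    return rules
```

Each rule is returned as (antecedent itemset, consequent item, support, confidence), where the items are (station, status) pairs.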
The extracted rules are characterized by a set of stations, with the associated critical status, in the antecedent and one single station, with its critical status, in the consequent. Some examples of neighbourhoods and mined rules are reported in
Table 1 and
Table 2.
Among the extracted rules, only a subset can be used to plan local rebalancing operations. Specifically, a rule is useful for planning a local rebalancing operation only if (i) it contains only stations belonging to the same neighbourhood and (ii) the critical status of the stations in its antecedent is of opposite sign with respect to the critical status of the station in its consequent. We refer to the rules satisfying the second constraint as discordant rules. The interesting rules must contain only nearby stations because, as introduced in the problem statement, we want to apply only local rebalancing operations. Moreover, they must be discordant because only when the critical statuses are opposite can we plan to move bicycles from the positively critical stations to the negatively critical ones to resolve the critical situations.
Let us consider the extracted rules reported in Table 2. Rules 2 and 5 are the only discordant rules. Rule 2 has support 60% and confidence 80%: its antecedent and consequent appear together in 60% of the input transactions and, in 80% of the cases in which the two antecedent stations are both positively critical, the consequent station is negatively critical. Since all the stations of Rule 2 belong to neighbourhood 2 (Table 1), the rule also satisfies the first condition and can be used to plan a local rebalancing operation. Rule 5 is very similar to Rule 2, but it is discarded by our approach because no neighbourhood contains all of its stations. All the other example rules are discarded because they are not discordant.
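The two filtering conditions can be sketched as predicates over a mined rule. The helper names are ours; `neighbourhoods` is assumed to map each station to the set N(s) of its nearby stations (s excluded):

```python
def is_discordant(antecedent, consequent):
    """Condition (ii): every antecedent status has the opposite sign
    of the consequent status."""
    _, conseq_sign = consequent
    return all(sign != conseq_sign for _, sign in antecedent)

def is_local(stations, neighbourhoods):
    """Condition (i): some neighbourhood N(s) ∪ {s} contains all the
    stations appearing in the rule."""
    return any(stations <= neighbourhoods[s] | {s} for s in stations)

def usable(rule, neighbourhoods):
    """A rule can plan a local rebalancing operation only if it is both
    local and discordant."""
    ante, conseq, _, _ = rule
    stations = {s for s, _ in ante} | {conseq[0]}
    return is_discordant(ante, conseq) and is_local(stations, neighbourhoods)
```

A rule like {(a, +), (b, +)} → (c, −) with a, b, c in one neighbourhood passes both checks; the same rule with consequent (c, +) fails discordance, and a rule mixing stations from distant neighbourhoods fails locality.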
4.2. Planning of Rebalancing Operations by Means of Association Rules
In this section, we describe how the mined association rules are used to plan, at a given timestamp t_p, the rebalancing actions to apply within the following Δ minutes. This phase is composed of two steps: (i) identification of the critical stations at time t_p and (ii) definition of the rebalancing operations needed to address the identified critical situations.
Let B be a batch of records containing the (new) online information about the status of all the stations of the system at timestamp t_p; B is gathered through a real-time streaming system. First, for each station s ∈ S, its critical rate at timestamp t_p is computed and the positively and negatively critical stations are identified. The set of critical stations at timestamp t_p, with the associated critical status, is denoted as C. Each element in C is a pair (s, cs), where s is a station and cs is the critical status of s at time t_p.
Now we have the set of stations for which a rebalancing operation should be planned at time t_p and executed within at most Δ minutes. To define the rebalancing operations, we consider the association rules mined in the training phase. Let R be the sorted list of association rules extracted from the partition to which t_p is associated (e.g., if we are using the day of the week partitioning strategy, the day of the week of timestamp t_p determines which contextualised partition of the historical data, and hence which set of rules, must be considered). For each rule r ∈ R, we check whether all the items of the rule (i.e., all the pairs (station, critical status)) are contained in C. In that case, the rule r can be used to define a rebalancing operation that fixes the critical situations of the stations in r. A rule that satisfies this constraint is called an applicable rule. Specifically, rule r represents a recurrent frequent critical pattern: by moving bikes from the positively critical stations to the negatively critical ones it is possible to fix their critical situations. Movements are allowed only from stations in the antecedent to stations in the consequent or vice versa. The strength of the pattern is numerically represented by the support and confidence values of the extracted rule. For example, suppose that stations s1 and s2 are positively critical at timestamp t_p while station s3 is negatively critical at the same timestamp. Suppose also that those stations are in the same neighbourhood and that the rule {(s1, +), (s2, +)} → (s3, −) has been extracted from the historical data. This rule can be used to plan a rebalancing operation that fixes the critical situations of stations s1, s2, and s3 by moving bicycles from s1 and s2 to s3. Since the extracted patterns are frequent, we can assume that, with high probability, the stations that are in a critical situation at time t_p will still be in the same critical status during the next Δ minutes. Hence, the rebalancing operation planned at time t_p will most likely still be applicable and useful when the truck actually reaches the critical stations. Since each rule is composed solely of neighbouring stations, our methodology performs local rebalancing operations (Figure 2) such that the final occupancy rate of every station present in the rule is the same across the stations of the neighbourhood. Thus, the goal of our framework is to flatten the occupancy rate of the identified critical neighbourhoods by moving bicycles from positively critical to negatively critical stations, fixing the critical situations. Each bike station is visited only once during the operation.
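The applicability check and the occupancy-flattening movements can be sketched as follows. This is an illustrative simplification: the rule format, the bikes/capacity dictionaries, and the rounding of the common target occupancy are our assumptions.

```python
def applicable(rule, critical_now):
    """A rule is applicable when every (station, status) item it mentions
    is present in the set of currently critical stations."""
    ante, conseq, _, _ = rule
    return ante <= critical_now and conseq in critical_now

def plan_moves(rule, bikes, capacity):
    """Bike movements that flatten the occupancy rate across the rule's
    stations (positive delta = bikes delivered, negative = picked up)."""
    ante, conseq, _, _ = rule
    stations = [s for s, _ in ante] + [conseq[0]]
    target = sum(bikes[s] for s in stations) / sum(capacity[s] for s in stations)
    return {s: round(target * capacity[s]) - bikes[s] for s in stations}
```

For example, with stations a (20/20 bikes), b (16/20), and c (0/20) in an applicable rule, the common target occupancy is 0.6, so the truck picks up 8 bikes at a and 4 at b and delivers 12 at c.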
Figure 3 shows the framework’s structure at reallocation time.
The list of applicable rules represents the potential rebalancing operations that should be planned at timestamp t_p and executed within at most Δ minutes to fix imbalanced situations across stations. However, due to the limited number of trucks, operators, and time, not all critical stations can be fixed, and thus a priority needs to be defined. Such priority is determined by the quality indices of the mined and applicable association rules. Specifically, we consider the applicable association rules in descending order of the following quality indices: confidence, support, and length of the rule. The higher the confidence, the stronger the pattern is in the training data. Hence, given two rules, the one with the highest confidence is selected first. Given two rules with equal confidence, the one with higher support (i.e., the rule which is more frequent in the historical data) is considered first. If the support values are also equal, the length of the rule is considered. If the lengths are also equal, the lexicographical order is used.
We impose that the number of rebalancing operations planned at time t_p be equal to the number of available trucks, i.e., among all the applicable association rules, only the top ones with the highest priority will be applied. Moreover, since the trucks used for the actual rebalancing require time to travel from the deposit to the critical neighbourhoods, we distinguish between the operation planning time t_p and the actual rebalancing operation time, namely the reallocation time, which must be at most Δ minutes after t_p. Specifically, we define as the operation planning time the time at which the system analyses the online data collected by the stations, computes the occupancy rates, and matches the extracted rules against the identified critical stations to select the top applicable rules. During the bicycle reallocation, since the system is online, the status of the bike stations involved in the rebalancing operation may change within the Δ minutes. For this reason, the system checks whether the planned rebalancing operation is still applicable at reallocation time (i.e., when the truck reaches the stations to rebalance) and, if it is not, it tries to identify the largest subset of stations for which the rule still holds, i.e., a subset of critical stations to rebalance. If no such subset is found, the rule is not applied.
Consequently, the association rules extracted at training time are used as an associative classifier to detect, at test time, recurrent critical neighbourhoods that need to be fixed. Given the set of rules, only the applicable ones are selected by analysing the pattern of critical stations at the new time instant t_p. By sorting the applicable rules according to the aforementioned criteria, we prioritise certain neighbourhoods over others by selecting only the top rules. Then, the framework generates the set of bicycle movements that must be performed to flatten the occupancy rate across the problematic neighbourhoods.
In a real scenario, it is not feasible to perform a rebalancing operation at every timestamp. For this reason, we propose to apply our rebalancing methodology a limited number of times per day. The number of times and the timestamps at which such operations are planned and performed are fully configurable and can be decided by system administrators according to the city and the population’s habits.