The objective of this case study was to develop and train ML model(s) that could accurately estimate the probability of a frac-hit type of communication between two wells, a hydraulically stimulated child-well and a producing parent-well, in near real time.
2.2.1. ML Model Structure
The data from the selected pads used for ML model development included time-series data for fracking time and duration, production rates for oil and water, and non-temporal geometric data from a drilling survey (only the distance between the well pairs was used in the study). These data were preprocessed and ingested as feature variables. The output of the trained ML model was expected to be the probability of a frac-hit type of communication between the two selected wells at any time.
All ML models in this study were constructed using long short-term memory (LSTM) and multilayer perceptron (MLP) neural networks [16]. These models take advantage of the LSTM's ability to handle temporal data, i.e., the time-series data in this study, and use the MLP for feature learning and feature extraction. The effect of the number of LSTM and MLP layers, as well as the layer sizes (the number of nodes in each layer), on the ML model performance was tested using the configurations shown in Table 2. The number of model parameters ranged from approximately 44,000 to 916,000. The details of a midsize model structure are shown in Figure A2. The corresponding number of parameters in each layer is shown in Figure A3.
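To give a sense of how layer counts and layer sizes translate into parameter counts of this magnitude, the following minimal sketch counts the trainable parameters of a stacked LSTM followed by MLP layers. It assumes the standard LSTM formulation with one bias vector per gate (some libraries use two, which changes the count slightly). The layer sizes in the example are hypothetical and are not taken from Table 2 or Figure A3.

```python
def lstm_params(n_in, n_units):
    """Trainable parameters of one standard LSTM layer.

    Four gates, each with input weights, recurrent weights, and one bias vector.
    """
    return 4 * (n_in * n_units + n_units * n_units + n_units)


def dense_params(n_in, n_units):
    """Trainable parameters of one fully connected (MLP) layer."""
    return n_in * n_units + n_units


def count_params(n_features, lstm_sizes, mlp_sizes, n_outputs=1):
    """Total parameters of stacked LSTM layers, MLP layers, and an output layer."""
    total = 0
    width = n_features
    for units in lstm_sizes:
        total += lstm_params(width, units)
        width = units
    for units in mlp_sizes:
        total += dense_params(width, units)
        width = units
    return total + dense_params(width, n_outputs)


# Hypothetical configuration (assumption): 14 input features,
# two LSTM layers of 64 units, one MLP layer of 32 units, one output.
example_total = count_params(14, [64, 64], [32])
print(example_total)
```

Because the recurrent weight matrix grows with the square of the layer width, doubling the LSTM layer size roughly quadruples that layer's parameter count, which is why a modest range of configurations can span more than an order of magnitude in model size.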
2.2.2. Data Preparation for ML Model Training
As shown in Figure A2, the non-temporal data were first transformed into a constant time series of the same length as the other time-series datasets, and then concatenated with them and input into the LSTM layer. Additionally, all input and output variables were normalized so that the model was agnostic to the parameters' physical units. Each parameter was scaled to the range [0, 1] using normalization factors based on the minimum and maximum values that the model ingested (the full ranges of pressure, production rates, distance, etc.).
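The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names and the numerical values are hypothetical, and the minimum and maximum are assumed to be supplied from the full range of each parameter.

```python
def min_max_normalize(values, lo, hi):
    """Scale values to [0, 1] given the full-range minimum (lo) and maximum (hi)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return [(v - lo) / (hi - lo) for v in values]


def broadcast_static(value, n_steps):
    """Repeat a non-temporal value so it forms a constant time series."""
    return [value] * n_steps


# Hypothetical values (assumption): a well distance of 250 within a 0-1000 range,
# repeated over a 1000-step time series.
distance_norm = min_max_normalize([250.0], 0.0, 1000.0)[0]
distance_series = broadcast_static(distance_norm, 1000)
```

The constant series can then be stacked alongside the normalized pressure and production series so that every time step carries the same set of features.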
To prepare the data for the training and testing of the model, the source data from all 35 wells on Pads 133, 137, and 138 were used. The 35 wells were paired in every possible order, with repetition (35 × 35), for a total of 1225 ordered well pairs. Structuring the input as ordered (vector) data reflects the inherent causality of the communication between the two wells in each pair. A standard time range of 1000 days was used for all training datasets. The time values of each temporal dataset were computed with reference to DOFP at Well ID 137-42006 in Pad 137. All data prior to the reference point were truncated.
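The pairing and time-referencing steps can be illustrated with a short sketch. The well identifiers below are hypothetical placeholders rather than the study's well IDs, and `truncate_to_reference` is an illustrative helper, not a function from the study.

```python
from itertools import product

# Hypothetical placeholder IDs (assumption); the real well IDs are not used here.
well_ids = [f"well_{i:02d}" for i in range(35)]

# Ordered pairs with repetition: (a, b) and (b, a) are distinct cases,
# and a well may be paired with itself, giving 35 * 35 pairs.
well_pairs = list(product(well_ids, repeat=2))


def truncate_to_reference(days, values, reference_day):
    """Re-express time relative to reference_day and drop earlier samples."""
    return [(d - reference_day, v) for d, v in zip(days, values) if d >= reference_day]


print(len(well_pairs))
```

Keeping both orderings of each pair lets the model see each well in both roles of the pair, which is consistent with the ordered-pair count of 1225.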
The LSTM layer's input data had dimensions of (m, t, n), where the first element, m, is the total number of datasets in a batch supplied to the model (the number of training/validation or testing datasets); the second element, t, is the length of the time series; and the third element, n, is the number of input features. For the sample model illustrated in Figure A2 and Figure A3, t = 1000 and n = 14. The feature variables were the initial well distance based on a drilling survey, the binary fracking indicator (0 for a non-fracking day and 1 for a fracking day), and twelve further parameters: the measured values, interpolated values, time derivatives of the measured values, and time derivatives of the interpolated values for each of the pressure, oil production, and water production measurements. The model output had dimensions of (m, t, s), in which the first two dimensions are the same as in the model input, and the last dimension, s, is a single output variable for the probability of a frac hit at each time step.
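The feature layout and tensor dimensions can be made concrete with a small sketch. The feature names are hypothetical labels chosen for illustration, and nested lists stand in for the arrays that a deep-learning library would use.

```python
# Hypothetical feature labels (assumption): four variants of each of three signals.
SIGNALS = ["pressure", "oil_rate", "water_rate"]
VARIANTS = ["measured", "interpolated", "d_measured_dt", "d_interpolated_dt"]
FEATURE_NAMES = ["well_distance", "frac_indicator"] + [
    f"{signal}_{variant}" for signal in SIGNALS for variant in VARIANTS
]


def make_batch(m, t=1000, n=len(FEATURE_NAMES), fill=0.0):
    """Build an (m, t, n) nested list filled with a placeholder value."""
    return [[[fill] * n for _ in range(t)] for _ in range(m)]


def shape(batch):
    """Return the (m, t, n) dimensions of a nested-list batch."""
    return (len(batch), len(batch[0]), len(batch[0][0]))


inputs = make_batch(2)        # two cases, 1000 time steps, 14 features
outputs = make_batch(2, n=1)  # one frac-hit probability per time step
print(shape(inputs), shape(outputs))
```

Two static or binary features plus 3 × 4 signal-derived features give the 14 inputs per time step, and the output keeps the same batch and time dimensions with a single value in the last axis.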
To generate training targets for the ML model, the algorithm for computing the frac-hit probability, $p_{fh}$, as a function of the input features was uniformly applied as shown below:

$$p_{fh} = \sum_{k} f(\nu_k)$$

where $\nu_k$ is the $k$th input feature, and $f$ is the probability contribution function of $\nu_k$. The frac-hit probability can thus be computed as a sum of the outputs of individual feature-dependent functions. The following linear function was used to initially approximate the target frac-hit probability, $P_{fh}$:

$$P_{fh} = \sum_{k} c_k \nu_k$$

where the coefficients, $c_k$, for the key features related to the frac hits were used in the calculations. After inspecting the available data and manually testing various coefficients, their non-zero values were estimated as shown in Table 3.
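The linear target described above can be sketched as a weighted sum over the features with non-zero coefficients. The feature names and coefficient values below are hypothetical and do not reproduce Table 3. Clipping the result to [0, 1] is an added assumption to keep the target a valid probability; the source does not state how out-of-range sums were handled.

```python
def frac_hit_target(features, coeffs):
    """Linear target: sum of c_k * v_k over the features that have coefficients.

    features: dict mapping feature name to its normalized value at one time step.
    coeffs:   dict mapping feature name to its coefficient (hypothetical values).
    The clip to [0, 1] is an assumption, not a step confirmed by the source.
    """
    raw = sum(c * features[name] for name, c in coeffs.items())
    return min(1.0, max(0.0, raw))


# Hypothetical coefficients and feature values (assumption), for illustration only.
example_coeffs = {"pressure_d_measured_dt": 0.5, "frac_indicator": 0.25}
example_features = {
    "pressure_d_measured_dt": 1.0,
    "frac_indicator": 1.0,
    "well_distance": 0.3,  # no coefficient, so it does not contribute
}
example_target = frac_hit_target(example_features, example_coeffs)
```

Applying the same function to every time step of every ordered well pair yields a target series with one probability per time step.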
These values were determined by trial and error after extensive review and analysis of the available data, guided by domain knowledge and with some degree of expert judgement in identifying the frac-hit features hidden in the data. Initially, fourteen relevant features were pre-screened and then down-selected to six. The major challenge was the lack of a well-defined measure of frac-hit likelihood, especially in low-likelihood cases.
Figure 5, Figure 6, and Figure 7 show examples of the computed frac-hit probability with the corresponding pressure, oil production, and water production data from the parent well (137-42106 in Pad 137) around the time of fracking at the child well (133-40787 in Pad 133). As shown in Figure 5, a large spike in the time derivative of pressure due to fracking is the top contributor to the instant jump in frac-hit probability to 0.88.
After generating the data for all 1225 cases (ordered well pairs), the cases were randomly shuffled, with the same permutation applied to the input and output datasets. Then, ~90% of the cases (1100) were used for model training/validation, and the remaining ~10% (125) were used for model testing. Cross-validation was performed during the training process with an approximately 80%/20% split into training and validation data, respectively.
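The shuffling and splitting procedure can be sketched as below. The function name, seed, and return structure are hypothetical. The sketch uses a single 80%/20% hold-out split of the 1100 training/validation cases, which is a simplification of the cross-validation described above.

```python
import random


def shuffle_and_split(inputs, targets, n_train_val=1100, val_fraction=0.2, seed=0):
    """Shuffle inputs and targets with one shared permutation, then split.

    A single hold-out validation split is used here (assumption); the study
    describes cross-validation with an approximately 80%/20% split.
    """
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    order = list(range(len(inputs)))
    random.Random(seed).shuffle(order)  # one permutation for both datasets
    x = [inputs[i] for i in order]
    y = [targets[i] for i in order]
    n_val = round(n_train_val * val_fraction)
    n_train = n_train_val - n_val
    return {
        "train": (x[:n_train], y[:n_train]),
        "val": (x[n_train:n_train_val], y[n_train:n_train_val]),
        "test": (x[n_train_val:], y[n_train_val:]),
    }


# Stand-in data (assumption): integer case IDs and matching dummy targets.
case_ids = list(range(1225))
case_targets = [i * 10 for i in case_ids]
splits = shuffle_and_split(case_ids, case_targets)
```

Shuffling an index list once and reusing it for both datasets guarantees that each input case stays aligned with its own target after the permutation.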