SPE-192833-MS
Optimization of Models for Rapid Identification of Oil and Water Layers
During Drilling - A Win-Win Strategy Based on Machine Learning
Jian Sun and Qi Li, School of Petroleum Engineering, China University of Petroleum - Beijing; Mingqiang Chen and
Long Ren, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an; Fengrui Sun, School of Petroleum
Engineering, China University of Petroleum - Beijing; Yong Ai, Exploration and Development Institution Tarim Oil
Field - Korla; Kang Tang, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an
Copyright 2018, Society of Petroleum Engineers
This paper was prepared for presentation at the Abu Dhabi International Petroleum Exhibition & Conference held in Abu Dhabi, UAE, 12-15 November 2018.
This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.
Abstract
The identification of oil and water layers (OWL) from well log data is an important task in petroleum exploration and engineering. At present, the commonly used methods for OWL identification are time-consuming, have low accuracy, or depend heavily on the experience of researchers. Therefore, machine learning methods have been developed to identify lithology and OWL. Based on logging-while-drilling data, this paper optimizes machine learning methods to identify OWL while drilling.
Recently, several computational algorithms have been used for OWL identification to improve prediction accuracy. In this paper, we evaluate three popular machine learning methods: the one-against-rest support vector machine, the one-against-one support vector machine, and the random forest. First, we chose appropriate training set data as samples for model training. Then, the GridSearch method was used to find the approximate range of reasonable parameter values, and k-fold cross-validation was used to optimize the final parameters and avoid overfitting. Finally, appropriate test set data were chosen to verify the model.
This machine-learning-based method for identifying OWL while drilling has been successfully applied in the Weibei oilfield. We selected 1,934 groups of well logging response data from 31 production wells. Among them, 198 groups of LWD data were selected as the test set. Natural gamma, shale content, acoustic time difference, and deep induction logs were selected as input feature parameters. After GridSearch and 10-fold cross-validation, the results suggest that the random forest method is the best algorithm for supervised classification of OWL from well log data. On the training set, the accuracy of all three classifiers exceeds 90%, but the differences among them are relatively large. On the test set, the accuracy of all three classifiers is about 90%, with only small differences. The one-against-rest support vector machine classifier requires far more computation time than the other methods, and the one-against-one support vector machine classifier has the lowest training set and test set accuracies of the three.
Although the three methods differ in OWL identification accuracy, all achieve relatively high accuracy. For different reservoirs, weighing time cost against model accuracy, the random forest and one-against-one support vector machine models can be used to identify OWL in real time during drilling.
Key words: Machine learning, Identification of oil and water layers, Support vector machine, Random
forest, Optimization
Introduction
The development of oil and gas in unconventional reservoirs such as low-permeability reservoirs, tight
reservoirs, and shale reservoirs has become popular in global oil and gas development. The development
of oil and gas resources in these reservoirs is different from that in conventional reservoirs and often
requires more time and higher economic costs. Improvements in technology or methods in any one area can
bring about undeniable benefits (Sun Fengrui et al. 2017). In recent years, Logging While Drilling (LWD)
technology has been widely adopted in the drilling of unconventional reservoir horizontal wells. However,
LWD data are generally used to interpret reservoir lithology and to guide drilling and geosteering work; these data are applied less often to identifying the oil and water layers (OWL) encountered during drilling. OWL identification still relies mostly on traditional methods, such as the intersection chart method (Wan Qiao-sheng et al. 2017), the stripping method, and multi-curve joint qualitative identification combined with intersection charts (Song Peng et al. 2016). Currently, building on statistics and computer science, methods relying on machine learning theory have also been developed for identifying reservoir lithology and OWL (Zhong Yihua et al. 2009, Li Rong et al. 2009, Song Yanjie et al. 2007, Liu H. et al. 2009, Xiongyan Li et al. 2013, Yunxin Xie et al. 2018, Shaoqun Dong et al. 2016, Arsalan A. Othman and Richard Gloaguen 2017). Because of complex geological conditions and sedimentary environments, the relationship between reservoir heterogeneity and logging response characteristics is nonlinear, so linear logging response equations and statistical empirical formulas cannot effectively characterize the reservoir's true properties or meet actual production needs. The traditional intersection chart method depends directly on the experience of researchers and exhibits a certain degree of instability. Therefore, where conventional linear and empirical logging interpretation techniques perform insufficiently, nonlinear information processing technology can better reveal the distribution characteristics of OWL and meet the needs of oil and gas exploration and development. Artificial neural networks and support vector machines have been used to identify OWL. Although they can play an interpretive role, many problems remain: the artificial neural network method struggles with local optima, the curse of dimensionality, and small data samples (Ahmed Amara Konaté et al. 2015, Morteza Raeesi et al. 2012, Baouche Rafik et al. 2017, B. Shokooh Saljooghi et al. 2015); the support vector machine can overcome these shortcomings, but the classical SVM algorithm only provides binary classification. In practical data mining applications, it is often necessary to solve multi-category classification problems. Therefore, one-against-rest support vector machines (OVR SVMs), one-against-one support vector machines (OVO SVMs) (Hsu, C.-W. and Lin, C.-J. 2002), and random forest methods have emerged. These three methods can effectively avoid the deficiencies noted above; in particular, the random forest algorithm, composed of multiple decision trees, achieves higher training accuracy and better classification than a single decision tree and is less likely to overfit. However, no single classifier can be assumed sufficient, and each specific problem should be analysed on its own terms. Therefore, in this paper, the OVR SVMs classifier, the OVO SVMs classifier, and the random forest classifier are constructed from the characteristic data obtained from well logging. The target OWL categories are classified, the results of each classifier are analysed, and the optimal classifier and corresponding parameters are selected to solve the problem of accurately identifying OWL while drilling.
The principles and methods of support vector machine and random forest
The principle of support vector machine
The support vector machine (SVM) is developed from the optimal classification surface in the linearly separable case. Its core idea is that the optimal classification surface not only correctly separates the two classes of samples but also maximizes the margin between them. In practice, most problems encountered are nonlinear; such a problem is converted into a linear problem in a high-dimensional space through a nonlinear transformation, and the optimal classification surface is obtained in the transformed space (Neda Mahvash Mohammadi et al. 2018, Jaime Ortegon et al. 2018, Xiaoling Lu et al. 2018, Xiekai Zhang et al. 2017, Italo Zoppis et al. 2018). Suppose that in the nonlinear case the sample points are (x_i, y_i) (i = 1, …, n). In the high-dimensional space, the classification surface equation is formula (1), where φ(x) is a mapping function from the input space to the feature space, ω is a weight vector, and b is a constant term. The schematic diagram is shown in Figure 1. Under the constraint of formula (2), in which ξ_i are non-negative slack variables, the minimum of the function in formula (3) is sought. Through the Lagrangian optimization method, this can be converted into a dual problem, which reduces to a quadratic extremum problem: maximizing formula (4) under the constraint condition of formula (5), where a_i is a Lagrange multiplier and C is a constant that controls the degree of punishment of wrongly classified samples. Formula (6) is a kernel function that satisfies the Mercer condition. The optimal classification discriminant function obtained after solving the above problem is formula (7), where N is the number of support vectors.
$\omega \cdot \varphi(x) + b = 0 \quad (1)$

$y_i \left[ \omega \cdot \varphi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, n \quad (2)$

$\Phi(\omega, \xi) = \tfrac{1}{2} \lVert \omega \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad (3)$

$W(a) = \sum_{i=1}^{n} a_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(x_i, x_j) \quad (4)$

$\sum_{i=1}^{n} a_i y_i = 0, \quad 0 \le a_i \le C \quad (5)$

$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) \quad (6)$

$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} a_i y_i K(x_i, x) + b \right) \quad (7)$
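As a concrete illustration of formulas (6) and (7), the following minimal numpy sketch evaluates the discriminant of a trained binary SVM with a Gaussian kernel; the support vectors, multipliers a_i, labels y_i, and bias b are assumed to come from an already-solved dual problem (the placeholder names are ours, not the paper's).

```python
import numpy as np

def rbf_kernel(x, z, gamma=25.0):
    # Gaussian form of the kernel in formula (6): K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def svm_decision(x, support_vectors, a, y, b, gamma=25.0):
    # Formula (7): f(x) = sgn( sum over the N support vectors of a_i * y_i * K(x_i, x) + b )
    g = sum(a_i * y_i * rbf_kernel(sv, x, gamma)
            for sv, a_i, y_i in zip(support_vectors, a, y)) + b
    return np.sign(g)
```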
Figure 1—Support vector machine schematic diagram.
The multi-classification method of support vector machine
The SVM algorithm was originally designed for binary classification. When multi-category classification problems are encountered, corresponding multi-category classifiers must be constructed. Currently, there are two main approaches to constructing SVM multiclass classifiers. One is to modify the objective function directly, combining the parameter solutions of multiple classification planes into a single optimization problem that is solved "once". This approach is called the direct method. It appears simple, but its computational complexity is relatively high, it is difficult to implement, and it only suits small problems. The other approach combines multiple binary classifiers to construct a multiclass classifier; these are called indirect methods, typically OVO SVMs and OVR SVMs.
The OVR SVMs classifier treats the samples of one category as the positive class during training and groups all remaining samples into the negative class. For samples of k categories, k SVMs (i.e., k binary classifiers) are constructed: the i-th classifier separates class i from the rest, taking the i-th class of the training set as the positive class and all remaining classes as the negative class. At prediction time, the input is passed through all k classifiers to obtain k output values f_i(x) = sgn(g_i(x)). If exactly one output is +1, the corresponding class is the predicted class. However, the decision functions constructed in practice always contain errors: if more than one output is +1, or if no output is +1, the output values are compared and the largest one determines the predicted category. This method has an obvious deficiency: the positive class occupies only a small proportion of each training set, so the classifier is influenced by the remaining samples and can be biased.
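The decision rule just described can be written compactly as a small sketch; `decision_functions` is a hypothetical list standing in for the k trained binary SVMs' real-valued outputs g_i(x).

```python
import numpy as np

def ovr_predict(x, decision_functions):
    # Taking the argmax of the k scores g_i(x) covers all three cases above:
    # exactly one output +1, several outputs +1, or none.
    scores = np.array([g(x) for g in decision_functions])
    return int(np.argmax(scores))  # index of the winning class
```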
The OVO SVMs classifier designs an SVM between every pair of classes, so k classes require k(k-1)/2 SVMs. When classifying an unknown sample, each classifier judges its class and casts a vote for the corresponding class; the class with the most votes is assigned to the unknown sample. Voting proceeds as follows for four classes: let A = B = C = D = 0. (A, B) classifier: if A wins, then A = A + 1; otherwise, B = B + 1. (A, C) classifier: if A wins, then A = A + 1; otherwise, C = C + 1, and so on. (C, D) classifier: if C wins, then C = C + 1; otherwise, D = D + 1. The final decision is the maximum of (A, B, C, D). Although this method performs better than OVR SVMs, when the number of categories k is large, the number of models, k(k-1)/2, greatly increases the calculation time.
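A minimal sketch of this voting scheme, with a hypothetical `pairwise_classifiers` mapping each class pair to a trained binary classifier that returns the winning class:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, classes, pairwise_classifiers):
    votes = Counter()
    for ci, cj in combinations(classes, 2):            # the k(k-1)/2 classifiers
        votes[pairwise_classifiers[(ci, cj)](x)] += 1  # winner of each pair gets one vote
    return votes.most_common(1)[0][0]                  # class with the most votes
```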
The principle of the random forest classification algorithm
Random forest is one of the most popular machine learning models. In the 1980s, Breiman et al. invented the classification tree algorithm (Breiman 1996), which repeatedly divides the data into two branches for classification or regression and greatly reduces the amount of computation. In 2001, Breiman combined classification trees into a random forest (Breiman 2001), which randomizes both the use of variables (columns) and the use of data (rows), generates many classification trees, and statistically summarizes their results. The results are robust to missing and unbalanced data, the method can appropriately handle the effects of up to thousands of explanatory variables, and it is hailed as one of the best algorithms available today (E. Vigneau et al. 2018, Michele Fratello et al. 2018, Robin Genuer et al. 2017, Christoph Behrens et al. 2018, Behnam Partopour et al. 2018). As the name implies, a random forest creates a forest in a random manner. The forest is composed of many decision trees, and the trees in a random forest are mutually independent. Once the forest is built, each decision tree judges a new input sample separately to decide which category it belongs to; the number of times each category is selected is then counted, and the most-selected category is the prediction for the sample. A decision tree is a tree structure (binary or non-binary) in which each non-leaf node represents a test of a characteristic attribute and each leaf node stores a category. The decision process of a decision tree is shown in Figure 2: starting from the root node, the corresponding feature attributes of the item being classified are tested, and output branches are selected according to the test results until a leaf node is reached. The decision result is the category stored in that leaf node.
Figure 2—Random forest schematic diagram.
In establishing each decision tree, two things need attention: sampling and complete splitting. The first step is two random sampling processes: the random forest samples the input data by rows and by columns. Row sampling is done with replacement, so the sample set obtained may contain duplicate samples. If there are N input samples, N samples are drawn, so each tree is trained on an incomplete subset of the data, which makes over-fitting relatively easy to avoid. Column sampling selects m features from the M total features (m < M), where M is the total number of features. The second step is to build a decision tree on the sampled data by complete splitting, so that a leaf node stops only when it cannot split further or when all of its samples belong to the same class. In general, the decision tree algorithm includes an important pruning step; here, because the two random sampling processes guarantee randomness, overfitting does not occur even without pruning.
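A minimal numpy sketch of the two samplings, assuming X is an (N, M) feature matrix and using m = sqrt(M), as adopted later in the paper:

```python
import numpy as np

def sample_for_tree(X, rng):
    n, M = X.shape
    rows = rng.integers(0, n, size=n)    # row sampling with replacement: duplicates allowed
    m = max(1, int(np.sqrt(M)))          # column sampling: m features out of M (m < M)
    cols = rng.choice(M, size=m, replace=False)
    return X[np.ix_(rows, cols)]

# Example: draw one bootstrap sample per tree with sample_for_tree(X, np.random.default_rng(0))
```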
The classification method of random forest
Each tree in the random forest is a binary tree generated by top-down recursive splitting: the training set is divided successively starting from the root node. In the binary tree, the root node contains all of the training data; following the principle of minimum node impurity, it splits into a left node and a right node, each containing a subset of the training data. Each node continues to split by the same rule until it meets the stopping rule and stops growing. Each decision tree thus learns a classification of its particular data, while random sampling ensures that repeated samples are classified by different decision trees, so the classification ability of different trees can be evaluated.
The specific steps for random forest classification are as follows (a minimal sketch follows the list):
1. From the original training data set, apply the bootstrap method to randomly draw k new sample sets with replacement and construct k classification trees; the samples never drawn constitute the k sets of out-of-bag data.
2. Assuming there are n features, randomly extract m features at each node of each tree. Calculate the amount of information contained in each feature, and select the feature with the highest classification ability for the node split.
3. Do not prune any tree; let each grow to its maximum size.
4. Let the many generated classification trees compose the random forest, and use the random forest classifier to classify new data. The classification result is determined by the number of votes cast by the tree classifiers.
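A minimal sketch of steps 1-4, built from scikit-learn's DecisionTreeClassifier (an assumption on our part; in practice sklearn's RandomForestClassifier encapsulates all four steps). X is assumed to be a numpy feature matrix and y an array of non-negative integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, k=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):
        rows = rng.integers(0, len(X), size=len(X))         # step 1: bootstrap sample with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # step 2: m random features per split
        tree.fit(X[rows], y[rows])                          # step 3: grown to maximum size, unpruned
        trees.append(tree)
    return trees

def forest_predict(trees, X_new):
    # Step 4: every tree votes; the majority class is the prediction.
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```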
This method has many advantages over other machine learning classification methods:
1. It has high accuracy.
2. It handles high-dimensional data.
3. The introduction of randomness makes the method less susceptible to overfitting, and the trained model has small variance and strong generalization ability.
4. The training can be highly parallelized, which is advantageous for training speed on large samples in the era of Big Data.
5. Compared with boosting methods such as AdaBoost and GBDT, the random forest is relatively simple to implement.
However, for most statistical modellers, the random forest is like a black box: they cannot control the internal operation of the model and can only adjust parameters and random seeds. For small or low-dimensional data (data with few features), the model may not produce a good classification. Compared with other machine learning classification methods, the training set accuracy of this method is often high, but the test set accuracy is not always correspondingly high.
The selection of model training data
The selection of the logging information type
Under current technical conditions, conventional well log data can usually be obtained through logging while drilling. There are many types of LWD information; by logging principle, they include electric, sonic, and nuclear logging data, among others. However, using more data types and feature parameters does not guarantee higher machine learning accuracy. Well log data usually contain a considerable amount of noise, which affects the machine learning identification results. Logs acquired under the same logging principle are highly correlated with one another, so if the data volume is too large or the characteristic parameters are strongly correlated, parameter redundancy will occur, increasing the machine learning time and even degrading model accuracy. Considering all of these factors, the natural gamma (GR), shale content (SH), acoustic time difference (AC), and deep induction (ILD) logs are selected as the input logging responses for identifying the reservoir oil and gas properties. A minimal redundancy check is sketched below.
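This kind of check can be done by inspecting pairwise correlations among the candidate logs, using pandas (assumed) on placeholder data; in practice the columns would be the real LWD responses:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for real LWD responses from the logging database.
logs = pd.DataFrame(np.random.rand(500, 4), columns=["GR", "SH", "AC", "ILD"])
print(logs.corr())  # strongly correlated pairs would signal parameter redundancy
```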
Logging data standardization
Data standardization (normalization) is a basic preprocessing task for machine learning classification. Different evaluation indicators often have different dimensions and units, which affects the results of data analysis; the data must be standardized to eliminate dimensional effects before indicators can be compared. After the original data have been standardized, every indicator is on the same order of magnitude, suitable for comprehensive comparative evaluation. Because each type of logging data has a different dimension and the numerical magnitudes differ widely, the original logging data must be standardized to eliminate these effects on the analysis results. Two data normalization methods are commonly used: min-max normalization and Z-score normalization. In this paper, min-max normalization, formula (8), is used to standardize the logging data.
$x^{*} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (8)$
where $x_{\max}$ denotes the maximum value of the sample data and $x_{\min}$ the minimum value. After normalization, all logging data values lie in the interval [0, 1].
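Formula (8) as code, in a minimal numpy version (equivalent to scikit-learn's MinMaxScaler applied per log):

```python
import numpy as np

def min_max_normalize(x):
    # Formula (8): x* = (x - x_min) / (x_max - x_min), mapping values into [0, 1]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```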
The selection of the logging information training data
The sample data used in this paper consist of 1,934 groups of well logging response data from 31 production wells in the Chang 3 reservoir of the Weibei 2 well area. Among them, 198 groups of LWD data were selected as the test set. Combining the available well response data and following the well logging principles above, four types of logging response data are selected as the characteristic parameters: natural gamma-ray response (GR), shale content (SH), acoustic time response (AC), and deep induction response (ILD). Figure 3 shows the well logging curves for well WB2P27. The reservoirs are classified into four categories, i.e., oil-water layers, dry layers, water layers, and oil layers; these target categories are encoded as the numbers 1, 2, 3, and 4, respectively. The training data are standardized, and part of the data is shown in Table 1.
Figure 3—Well logging curves of well WB2P27.
Table 1—Partial training data.
Application of machine learning classification method in the identification of
OWL during drilling
Identifying OWL from logging response data is, in the final analysis, a nonlinear function mapping problem. The relationship between the logging response and the actual reservoir interval is complex, so this mapping is usually highly nonlinear. There are many types of logging response characteristics, and the target OWL categories usually number more than two. Therefore, support vector machine multiclass classifiers or random forest classifiers are an effective way to solve this complex problem.
The OVR SVMs classifier identifies OWL during the drilling process
1. The selection of the OVR SVMs kernel function and the training parameters
Common kernel functions include the polynomial kernel, the Gaussian kernel, and the linear kernel. After comparative analysis and study, the Gaussian kernel function is selected here to obtain the best model (Chang, Y.W. et al., 2010); furthermore, its flexibility is very high. We use the GridSearch method to select the approximate range of the optimal training parameters C and γ in the OVR SVMs, where $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. The GridSearch method is an exhaustive search over specified parameter values; the optimal learning algorithm is obtained by tuning the parameters of the estimation function through cross-validation. Cross-validation is a statistical analysis method used to verify the performance of a classifier and can avoid overfitting problems. There are three main types of cross-validation: (a) double cross-validation; (b) k-fold cross-validation; and (c) leave-one-out cross-validation.
Double cross-validation, also known as 2-fold cross-validation (2-CV), splits the data set into two equally sized subsets for two rounds of classifier training. In practice, 2-CV is not commonly used.
The main reason is that each training set then contains too few samples to represent the distribution of the parent sample, leading to a significant drop in the recognition rate in the test phase. Additionally, the subsets in 2-CV vary greatly, often failing to meet the requirement that the experimental process be reproducible.
K-fold cross-validation (k-CV) is an extension of double cross-validation. The data set is divided into k subsets; each subset is used once as the test set while the remaining k-1 subsets form the training set. The procedure is repeated k times, selecting a different subset as the test set each time, and the average recognition rate over the k folds is taken as the result. In this method, every sample appears in both training and test sets, and each sample is validated exactly once.
Leave-one-out cross-validation (LOOCV), which for a data set of n samples is simply n-CV, uses each single sample in turn as the test set, with the remaining n-1 samples as the training set. Almost all samples are used to train the model in each round, so the results of this method are closest to the distribution of the parent sample and the estimated generalization error is more reliable. LOOCV can be considered when the experimental data set is small. However, its computational cost is high, since the number of models to build equals the total number of samples; when the sample count is quite large, LOOCV becomes impractical unless each model trains very quickly, although parallel computation can reduce the required time.
Therefore, this paper uses k-fold cross-validation to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided. First, take C = [1000, 3000, 5000, 7000, 9000] and γ = [15, 20, 25, 30, 35], and search the 25 combinations of (C, γ) with the GridSearch method to locate the range of the optimal parameter values. The optimal values of C and γ are approximately 5000 and 25, respectively. Then, 10-fold cross-validation is used to perform a fine search around C = 5000 and γ = 25. Figure 4 shows the 10-fold cross-validation of parameter C at γ = 25, and Figure 5 shows the 10-fold cross-validation of parameter γ at C = 5000. Finally, the combination with the highest accuracy, (5000, 25), is selected as the optimal OVR SVMs training parameters. The training set accuracy is 0.93299, and training takes 760 s.
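A minimal sketch of this coarse GridSearch with 10-fold cross-validation, using scikit-learn (our assumption; the paper does not name its implementation). X and y are placeholders for the normalized training features (GR, AC, ILD, SH) and the OWL labels 1-4.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 4)           # placeholder for the normalized training data
y = np.random.randint(1, 5, 200)     # placeholder for the OWL category labels

# The 25 coarse (C, gamma) combinations given in the text, scored by 10-fold CV.
param_grid = {"estimator__C": [1000, 3000, 5000, 7000, 9000],
              "estimator__gamma": [15, 20, 25, 30, 35]}
ovr = OneVsRestClassifier(SVC(kernel="rbf"))
search = GridSearchCV(ovr, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```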
2. The validation of the test set data by the OVR SVMs classifier
The OVR SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.90909.
Figure 4—The 10-fold cross-validation of parameter C in OVR SVMs, γ=25.
Figure 5—The 10-fold cross-validation of parameter γ in OVR SVMs, C=5000.
The OVO SVMs classifier identifies OWL during the drilling process
1. The selection of the OVO SVMs kernel function and the training parameters
The Gaussian kernel function is again selected, and the GridSearch method is used to select the approximate range of the optimal training parameters C and γ in the OVO SVMs, where $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. Then, k-fold cross-validation is used to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided.
First, take C = [500, 1000, 1500, 2000, 2500] and γ = [5, 10, 15, 20, 25], and search the 25 combinations of (C, γ) with the GridSearch method. The optimal values of C and γ are approximately 500 and 20, respectively. Then, 10-fold cross-validation is used to perform a fine search around C = 500 and γ = 20. Figure 6 shows the 10-fold cross-validation of parameter C at γ = 20, and Figure 7 shows the 10-fold cross-validation of parameter γ at C = 900. Finally, the combination with the highest accuracy, (900, 20), is selected as the optimal OVO SVMs training parameters. The training set accuracy is 0.91753, and training takes 100 s.
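The corresponding sketch for OVO, under the same assumptions: scikit-learn's SVC uses one-against-one voting natively for multiclass data, so it can be tuned directly without a wrapper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The 25 coarse (C, gamma) combinations given in the text; X, y as in the OVR sketch.
param_grid = {"C": [500, 1000, 1500, 2000, 2500],
              "gamma": [5, 10, 15, 20, 25]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```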
2. The validation of the test set data by the OVO SVMs classifier
The OVO SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.88889.
Figure 6—The 10-fold cross-validation of parameter C in OVO SVMs, γ=20.
Figure 7—The 10-fold cross-validation of parameter γ in OVO SVMs, C=900.
The random forest classifier identifies OWL during the drilling process
1. The selection of the random forest classifier's n_estimators and max_features parameters
The n_estimators parameter is the number of trees in the forest. Larger is not always better: as the number of trees increases, the calculation time also increases, and the best predictive performance appears at a reasonable number of trees. The max_features parameter is the maximum number of features a single decision tree is allowed to use, i.e., the size of the randomly selected feature subset. The smaller the subset, the faster the variance decreases, but the faster the bias increases. In classification problems, max_features = sqrt(n_features) is usually taken (Behnam Partopour et al., 2018, E. Vigneau et al., 2018). The GridSearch method was used to select the approximate range of the optimal random forest training parameters n_estimators and max_features, and k-fold cross-validation was then used to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided.
First, take n_estimators = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60] and max_features = [1, 2, 3, 4], and search the 48 combinations of (n_estimators, max_features) with the GridSearch method. The optimal values of n_estimators and max_features are approximately 25 and 1, respectively. Since max_features = sqrt(n_features) is usually taken, let max_features = 2. Then, 10-fold cross-validation is used to perform a fine search around n_estimators = 25. Figure 8 shows the 10-fold cross-validation of the parameter n_estimators at max_features = 2. Finally, the combination with the highest accuracy, (29, 2), is selected as the optimal random forest training parameters. The training set accuracy is 0.95361, and training takes 180 s.
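A sketch of the random forest tuning under the same scikit-learn assumption; the fitted model also exposes the per-feature importances reported in the next subsection.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The 48 coarse (n_estimators, max_features) combinations given in the text.
param_grid = {"n_estimators": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
              "max_features": [1, 2, 3, 4]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # X, y as in the earlier sketches
best_rf = search.best_estimator_
print(dict(zip(["GR", "AC", "ILD", "SH"], best_rf.feature_importances_)))  # column order assumed
```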
2. The validation of the test set data by the random forest classifier
The random forest classifier was used to classify the LWD test data. The results are shown in Table 2. The test accuracy is 0.89899. In addition, the random forest classifier reports the proportion each feature parameter contributes to the classification; the feature importances are [GR = 0.28552335, AC = 0.19097589, ILD = 0.19690391, SH = 0.32659686].
Figure 8—The 10-fold cross-validation of parameter n_estimators in random forest, max_features=2.
Comparison of classification results of LWD test data by three classifiers
The model training results and test results obtained by the three classifiers are shown in Table 2 (A = actual OWL category, B = identified OWL category).
Table 2—Training results and test results obtained by the three classifiers.
The parameters and operation results obtained by the three classifiers are compared and analyzed, as
shown in Table 3.
Table 3—Classification algorithm comparison table.
As can be seen from Table 3, the training set accuracy of all three classifiers exceeds 90%, but the differences among them are relatively large. On the test set, the accuracy of all three classifiers is about 90%, with only small differences. In terms of computation time, OVR SVMs consumes much more time than the other two classifiers. Considered comprehensively, the random forest classifier has the highest training set accuracy, its test set accuracy is only about 1% below that of OVR SVMs, and its computation time is only about a quarter of that of OVR SVMs. Therefore, the random forest classifier was selected to identify OWL during the drilling process.
Conclusions
This work presents an optimal model for rapidly identifying OWL while drilling. The identification results, recognition accuracy, and calculation time of OWL classification were obtained with three machine learning methods. Some meaningful conclusions are listed below:
a. For the parameter optimization of all three methods, the initial value range of the GridSearch is very important; it directly influences the accuracy and training time of the model.
b. After parameter optimization, the training set accuracy of the random forest is the highest: about 2% above OVR SVMs and about 4% above OVO SVMs.
c. The test set accuracies of the three methods are very close, all around 90%.
d. The calculation time of OVR SVMs is much larger than that of OVO SVMs and the random forest.
e. For the oilfield sample data selected in this paper, the random forest method is the optimal classification algorithm.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) (No. 51704235).
References
Ahmed Amara Konaté, Heping Pan, Sinan Fang, et al. Capability of self-organizing map neural network in geophysical
log data classification: Case study from the CCSD-MH. Journal of Applied Geophysics, 2015, Vol 118: 37–46. https://
doi.org/10.1016/j.jappgeo.2015.04.004.
Arsalan A. Othman, Richard Gloaguen. Integration of spectral, spatial and morphometric data into lithological mapping:
A comparison of different Machine Learning Algorithms in the Kurdistan Region, NE Iraq. Journal of Asian Earth
Sciences, 2017, Vol 146: 90–102. https://doi.org/10.1016/j.jseaes.2017.05.005.
Baouche Rafik, Baddari Kamel. Prediction of permeability and porosity from well log data using the nonparametric
regression with multivariate analysis and neural network, Hassi R'Mel Field, Algeria. Egyptian Journal of Petroleum,
2017, Vol 26: 763–778. https://doi.org/10.1016/j.ejpe.2016.10.013.
Behnam Partopour, Randy C. Paffenroth, Anthony G. Dixon. Random Forests for mapping and analysis of microkinetics
models. Computers & Chemical Engineering, 2018. https://doi.org/10.1016/j.compchemeng.2018.04.019.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.
B. Shokooh Saljooghi, A. Hezarkhani. A new approach to improve permeability prediction of petroleum reservoirs using
neural network adaptive wavelet (wavenet). Journal of Petroleum Science and Engineering, 2015, Vol 133: 851–861.
https://doi.org/10.1016/j.petrol.2015.04.002.
Chang, Y.W., Hsieh, C.J., Chang, K.W., et al, 2010. Training and testing low-degree polynomial data mappings via linear
SVM. J. Mach. Learn. Res. 11 (11), 1471–1490.
Christoph Behrens, Christian Pierdzioch, Marian Risse. Testing the optimality of inflation forecasts under flexible loss
with random forests. Economic Modelling, 2018, Vol 72: 270–277. https://doi.org/10.1016/j.econmod.2018.02.004.
E. Vigneau, P. Courcoux, R. Symoneaux, et al. Random forests: A machine learning methodology to highlight the volatile organic compounds involved in olfactory perception. Food Quality and Preference, 2018, Vol 68: 135–145. https://doi.org/10.1016/j.foodqual.2018.02.008.
Hsu, C.-W., Lin, C.-J., 2002. A comparison of methods for multiclass support vector machines. Trans. Neur. Netw. 13,
415–425.
Italo Zoppis, Giancarlo Mauri, Riccardo Dondi. Kernel Methods: Support Vector Machines. Reference Module in Life
Sciences, 2018. https://doi.org/10.1016/B978-0-12-809633-8.20342-7.
Jaime Ortegon, Rene Ledesma-Alonso, Romeli Barbosa, et al. Material phase classification by means of Support Vector Machines. Computational Materials Science, 2018, Vol 148: 336–342. https://doi.org/10.1016/j.commatsci.2018.02.054.
Li Rong, Zhong Yihua. Identification method of oil/gas/water layer based on least square support vector machine. Natural Gas Exploration & Development, 2009, 32(03): 15–18+72.
Liu H., Wen S., Li W., Xu C., Hu C. (2009) Study on Identification of Oil/Gas and Water Zones in Geological Logging
Base on Support-Vector Machine. Fuzzy Information and Engineering Volume 2. Advances in Intelligent and Soft
Computing, Vol 62. Springer, Berlin, Heidelberg.
Michele Fratello, Roberto Tagliaferri. Decision Trees and Random Forests. Reference Module in Life Sciences, 2018.
https://doi.org/10.1016/B978-0-12-809633-8.20337-3.
Morteza Raeesi, Ali Moradzadeh, Faramarz Doulati Ardejani, et al. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. Journal of Petroleum Science and Engineering, 2012, Vol 82–83: 151–165. https://doi.org/10.1016/j.petrol.2012.01.012.
Neda Mahvash Mohammadi, Ardeshir Hezarkhani. Application of support vector machine for the separation of
mineralised zones in the Takht-e-Gonbad porphyry deposit, SE Iran. Journal of African Earth Sciences, 2018, Vol 143:
301–308. https://doi.org/10.1016/j.jafrearsci.2018.02.005.
Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, et al. Random Forests for Big Data. Big Data Research, 2017, Vol 9: 28–46. https://doi.org/10.1016/j.bdr.2017.07.003.
Shaoqun Dong, Zhizhang Wang, Lianbo Zeng. Lithology identification using kernel Fisher discriminant analysis with
well logs. Journal of Petroleum Science and Engineering, 2016, Volume 143: 95–102. https://doi.org/10.1016/
j.petrol.2016.02.017.
Song Peng, Yang Weiguo, Sun Dong, et al. Oil-water-layer identifying method for reservoir Chang6 in Huaqing oilfield. Petroleum Geology and Oilfield Development in Daqing, 2016, 35(6): 144–147.
Song Yanjie, Zhang Jianfeng, Yan Weilin, et al. A new identification method for complex lithology with support vector machine. Journal of Daqing Petroleum Institute, 2007, 31(5): 18–20.
Sun Fengrui, Yao Yuedong, Chen Mingqiang, Li Xiangfang, Zhao Lin, Meng Ye, Sun Zheng, Zhang Tao, Feng Dong.
Performance analysis of superheated steam injection for heavy oil recovery and modeling of wellbore heat efficiency.
Energy, 2017, 125: 795–804.
Wan Qiao-sheng, Li Xue-ying, Zhao Yu-qiu, et al. Oil and water layer identification method of Gaotaizi reservoirs in Qijiabei area. Progress in Geophysics, 2017, 32(2): 0714–0720. doi:10.6038/pg20170236.
Xiaoling Lu, Fengchi Dong, Xiexin Liu, et al Varying Coefficient Support Vector Machines. Statistics & Probability
Letters, 2018, Vol 132: 107–115. https://doi.org/10.1016/j.spl.2017.09.006.
Xiekai Zhang, Shifei Ding, Yu Xue. An improved multiple birth support vector machine for pattern classification. Neurocomputing, 2017, Vol 225: 119–128. https://doi.org/10.1016/j.neucom.2016.11.006.
Xiongyan Li, Hongqi Li. A new method of identification of complex lithologies and reservoirs: task-driven data mining.
Journal of Petroleum Science and Engineering, 2013, Vol 109: 241–249. https://doi.org/10.1016/j.petrol.2013.08.049.
Yunxin Xie, Chenyang Zhu, Wen Zhou, et al. Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances. Journal of Petroleum Science and Engineering, 2018, Vol 160: 182–193. https://doi.org/10.1016/j.petrol.2017.10.028.
Zhong Yihua, Li Rong. Application of principal component analysis and least square support vector machine to lithology identification. Well Logging Technology, 2009, 33(05): 425–429.
Appendix A
Three training models' classification results of test sets
Table A—Three training models' classification results on the test set. The last three columns give the OWL type identified by the OVR SVMs, OVO SVMs, and random forest classifiers, respectively.
No.  GR  AC  ILD  SH  Actual type  OVR SVMs  OVO SVMs  Random forest
1 0.116089 0.096942 0.400179 0.05468 1 2 2 1
2 0.093133 0.051894 0.435171 0.05469 1 1 1 1
3 0.140357 0.050909 0.452366 0.05471 1 1 2 1
4 0.134357 0.064829 0.401235 0.05472 1 2 2 1
5 0.101213 0.104653 0.389918 0.05473 1 2 2 1
6 0.11059 0.028326 0.486268 0.05482 1 1 1 1
7 0.097519 0.069152 0.376002 0.05491 1 2 2 1
8 0.048411 0.081646 0.456296 0.05499 1 3 3 1
9 0.086996 0.027431 0.413717 0.05504 1 1 2 1
10 0.10145 0.075731 0.347666 0.05508 1 2 2 1
11 0.093321 0.052928 0.414261 0.05512 1 1 2 1
12 0.087399 0.078059 0.481719 0.05518 1 1 1 1
13 0.110548 0.214586 0.442463 0.05519 1 1 1 4
14 0.115368 0.097061 0.391812 0.05522 1 2 2 1
15 0.105354 0.108942 0.371031 0.05526 1 2 2 1
16 0.115801 0.061189 0.402327 0.0553 1 2 2 1
17 0.115259 0.064559 0.3861 0.05534 1 2 2 1
18 0.105459 0.088368 0.396466 0.05544 1 2 2 1
19 0.095664 0.061505 0.439703 0.05547 1 1 1 1
20 0.105674 0.069896 0.415734 0.05557 1 1 2 1
21 0.163836 0.067933 0.395874 0.12198 1 1 1 1
22 0.168097 0.057272 0.441259 0.12199 1 1 1 1
23 0.137618 0.070063 0.469064 0.12202 1 1 1 1
24 0.135075 0.116844 0.413593 0.12203 1 1 4 1
25 0.128039 0.062846 0.417837 0.12205 1 1 1 1
26 0.142556 0.061435 0.367942 0.12206 1 2 2 1
27 0.464709 0.09437 0.112175 0.12208 1 1 1 2
28 0.143162 0.032004 0.472009 0.12208 1 1 1 1
29 0.185951 0.048207 0.420951 0.12213 1 1 1 1
30 0.169695 0.061033 0.420158 0.1222 1 1 1 1
31 0.147916 0.031689 0.542651 0.12221 1 1 1 1
32 0.133118 0.0345 0.421555 0.12222 1 2 1 1
33 0.158063 0.099266 0.402537 0.12224 1 1 1 1
34 0.159824 0.087124 0.402147 0.12225 1 1 1 1
35 0.168219 0.05957 0.427848 0.12228 1 1 1 1
36 0.13614 0.133094 0.386076 0.12232 1 2 2 2
37 0.144019 0.054874 0.420568 0.12235 1 1 1 1
38 0.229379 0.095255 0.419888 0.12235 1 1 1 1
39 0.124975 0.091524 0.384267 0.12238 1 1 2 2
40 0.145918 0.06305 0.438382 0.12239 1 1 1 1
41 0.171598 0.078392 0.392307 0.12487 1 1 1 1
42 0.175802 0.073361 0.401294 0.12494 1 1 1 1
43 0.153868 0.044822 0.451974 0.125 1 1 1 1
44 0.161192 0.08632 0.417621 0.12507 1 1 1 1
45 0.122151 0.031818 0.506742 0.12509 1 1 1 1
46 0.144176 0.105396 0.395701 0.12512 1 1 1 2
47 0.122202 0.033715 0.504282 0.12517 1 1 1 1
48 0.155712 0.100281 0.397754 0.12526 1 1 1 2
49 0.169543 0.069907 0.418579 0.12534 1 1 1 1
50 0.14465 0.035174 0.440296 0.12544 1 1 1 1
51 0.12477 0.096546 0.387239 0.04789 2 1 1 2
52 0.105168 0.088048 0.371421 0.05707 2 2 2 1
53 0.127323 0.085472 0.376093 0.0571 2 2 2 1
54 0.115366 0.123103 0.336264 0.05718 2 2 2 1
55 0.09704 0.109209 0.356372 0.05722 2 2 2 1
56 0.111235 0.093313 0.355797 0.05722 2 2 2 1
57 0.113238 0.064385 0.38622 0.05725 2 2 2 1
58 0.095083 0.062337 0.372016 0.05726 2 2 2 1
59 0.130477 0.099633 0.388648 0.05729 2 2 2 1
60 0.145923 0.069724 0.424973 0.05732 2 1 2 1
61 0.09512 0.064524 0.373528 0.05732 2 2 2 1
62 0.127491 0.084555 0.385464 0.05734 2 2 2 1
63 0.103077 0.044248 0.393751 0.05736 2 2 2 1
64 0.126336 0.068057 0.401996 0.05739 2 2 2 1
65 0.116641 0.067068 0.357247 0.05744 2 2 2 1
66 0.109397 0.074987 0.411857 0.05748 2 1 2 1
67 0.115317 0.113172 0.355185 0.0691 2 2 2 2
68 0.098599 0.088842 0.380222 0.06912 2 2 2 2
69 0.118857 0.079711 0.389082 0.06913 2 2 2 2
70 0.142886 0.093396 0.379467 0.06918 2 2 2 2
71 0.120946 0.052307 0.388528 0.0692 2 2 2 2
72 0.129197 0.113914 0.363293 0.06928 2 2 2 2
73 0.101941 0.096244 0.358495 0.06928 2 2 2 2
74 0.101953 0.097231 0.351698 0.06931 2 2 2 2
75 0.094934 0.097809 0.374366 0.06933 2 2 2 2
76 0.15549 0.05827 0.435795 0.06935 2 2 2 2
77 0.133267 0.070657 0.38394 0.06936 2 2 2 2
78 0.118735 0.066187 0.390106 0.06937 2 2 2 2
79 0.142839 0.134442 0.359029 0.06949 2 2 2 2
80 0.111618 0.07402 0.376729 0.06949 2 2 2 2
81 0.135832 0.135151 0.391522 0.06954 2 2 2 2
82 0.074773 0.044555 0.385313 0.06957 2 3 2 3
83 0.102127 0.094061 0.35634 0.06962 2 2 3 2
84 0.093864 0.046194 0.392656 0.06962 2 2 2 2
85 0.092278 0.113633 0.365266 0.06968 2 2 2 2
86 0.106842 0.152879 0.346474 0.0697 2 2 2 2
87 0.09232 0.105724 0.363174 0.06975 2 2 2 2
88 0.134494 0.059594 0.369076 0.06975 2 2 2 2
89 0.119974 0.040733 0.402027 0.06978 2 2 2 2
90 0.082466 0.084187 0.355723 0.0698 2 2 2 2
91 0.150982 0.079781 0.403129 0.08513 2 2 2 2
92 0.152439 0.131453 0.371112 0.08516 2 2 2 2
93 0.091509 0.065507 0.363856 0.08521 2 2 2 2
94 0.135842 0.067661 0.386954 0.08523 2 2 2 2
95 0.122024 0.110761 0.340472 0.08523 2 2 2 2
96 0.12142 0.074461 0.359575 0.08523 2 2 2 2
97 0.156597 0.062139 0.420275 0.08528 2 2 2 2
98 0.113783 0.062481 0.39005 0.0853 2 2 2 2
99 0.126675 0.06132 0.350674 0.08531 2 2 2 2
100 0.11917 0.14549 0.360685 0.08533 2 2 2 2
101 0.119632 0.099883 0.362623 0.08541 2 2 2 2
102 0.136355 0.042065 0.356886 0.08541 2 2 2 2
103 0.162798 0.099153 0.369932 0.08543 2 2 2 2
104 0.073962 0.071837 0.357579 0.0855 2 2 2 2
105 0.12281 0.131907 0.358317 0.08552 2 2 2 2
106 0.121728 0.047518 0.370768 0.08571 2 2 2 2
107 0.119116 0.124355 0.393161 0.08573 2 2 2 2
108 0.136521 0.040915 0.368286 0.08577 2 2 2 2
109 0.128879 0.08755 0.382968 0.08577 2 2 2 2
110 0.122561 0.108296 0.365183 0.08578 2 2 2 2
111 0.144401 0.049313 0.413369 0.08581 2 2 2 2
112 0.153059 0.115833 0.350277 0.08591 2 2 2 2
113 0.128982 0.072302 0.393509 0.08594 2 2 2 2
114 0.11833 0.064652 0.396145 0.08596 2 2 2 2
115 0.142659 0.132677 0.392649 0.08597 2 2 2 2
116 0.110919 0.092135 0.347005 0.08613 2 2 2 2
117 0.122021 0.046748 0.368181 0.08617 2 2 2 2
118 0.104677 0.094635 0.353205 0.08627 2 2 2 2
119 0.114452 0.105109 0.352917 0.08629 2 2 2 2
120 0.135639 0.113166 0.374483 0.08632 2 2 2 2
121 0.104721 0.081748 0.391893 0.08634 2 2 2 2
122 0.111027 0.084587 0.314361 0.08634 2 2 2 2
123 0.129903 0.150066 0.384152 0.08634 2 2 2 2
124 0.14229 0.117594 0.390074 0.08643 2 2 2 2
125 0.119431 0.126168 0.366043 0.08646 2 2 2 2
126 0.144134 0.052972 0.370812 0.0865 2 2 2 2
127 0.114562 0.091461 0.356864 0.08651 2 2 2 2
128 0.144149 0.083559 0.362779 0.08652 2 2 2 2
129 0.12724 0.059415 0.409968 0.0866 2 2 2 2
130 0.108039 0.107024 0.385483 0.08662 2 2 2 2
131 0.051062 0.060244 0.383384 0.02585 3 3 2 3
132 0.051077 0.060568 0.39287 0.02587 3 3 3 3
133 0.05127 0.06005 0.394397 0.02614 3 3 3 3
134 0.033733 0.055272 0.419781 0.02617 3 3 3 3
135 0.051436 0.057852 0.393353 0.02636 3 3 3 3
136 0.057773 0.038152 0.432485 0.0264 3 3 3 3
137 0.051592 0.0608 0.388477 0.02658 3 3 3 3
138 0.034197 0.056715 0.375483 0.02682 3 3 3 3
139 0.052125 0.060198 0.388814 0.02732 3 3 3 3
140 0.052535 0.061115 0.395577 0.02788 3 3 3 3
141 0.052875 0.060713 0.390398 0.02835 3 3 3 3
142 0.052943 0.05922 0.394543 0.02845 3 3 3 3
143 0.052999 0.059598 0.381141 0.02853 3 3 3 3
144 0.053053 0.060329 0.404663 0.0286 3 3 3 3
145 0.053427 0.061409 0.40064 0.02912 3 3 3 3
146 0.054624 0.060629 0.396622 0.0308 3 3 3 3
147 0.060422 0.040511 0.437307 0.03144 3 3 3 3
148 0.055726 0.061998 0.349 0.03235 3 3 3 3
149 0.056036 0.05662 0.367496 0.03279 3 3 3 3
150 0.056158 0.062029 0.40193 0.03296 3 3 3 3
151 0.071272 0.034763 0.475042 0.05284 3 3 3 3
152 0.071311 0.038596 0.448853 0.05291 3 3 3 3
153 0.071519 0.053728 0.423849 0.05333 3 3 3 3
154 0.070045 0.061776 0.394562 0.0535 3 3 3 3
155 0.071885 0.043018 0.451844 0.05411 3 3 3 3
156 0.052679 0.053424 0.418452 0.05423 3 3 3 3
157 0.072095 0.038146 0.469532 0.05451 3 3 3 3
158 0.070783 0.060683 0.404831 0.05464 3 3 3 3
159 0.053039 0.053885 0.42146 0.05479 3 3 3 3
160 0.071277 0.061765 0.378674 0.05541 3 3 3 3
161 0.053471 0.048074 0.441664 0.05547 3 3 3 3
162 0.072567 0.055437 0.42764 0.05548 3 3 3 3
163 0.053618 0.053628 0.419293 0.0557 3 3 3 3
164 0.053938 0.051498 0.352532 0.0562 3 3 3 3
165 0.072105 0.058394 0.399199 0.0567 3 3 3 3
166 0.072506 0.053139 0.382619 0.05732 3 3 3 3
167 0.073522 0.039996 0.452193 0.05743 3 3 3 3
168 0.072867 0.056911 0.39388 0.05789 3 3 3 3
169 0.096077 0.13179 0.489616 0.03588 4 4 3 4
170 0.09694 0.122214 0.43374 0.03712 4 4 4 4
171 0.096937 0.1048 0.472465 0.03712 4 4 4 4
172 0.097463 0.123872 0.490084 0.03788 4 4 4 4
173 0.097599 0.118281 0.497154 0.03808 4 4 4 4
174 0.098222 0.126746 0.500226 0.03899 4 4 4 4
175 0.098872 0.121166 0.496639 0.03994 4 4 4 4
176 0.121332 0.242216 0.509814 0.04 4 4 4 4
177 0.099654 0.133744 0.467101 0.0411 4 4 4 4
178 0.122417 0.258801 0.477665 0.04133 4 4 4 4
179 0.513103 0.067942 0.046743 0.05671 4 4 4 4
180 0.10972 0.136331 0.482309 0.05671 4 4 4 4
181 0.134107 0.221344 0.364771 0.05691 4 4 4 4
182 0.109928 0.112903 0.420119 0.05705 4 4 4 4
183 0.110121 0.127888 0.467199 0.05736 4 4 4 4
184 0.510044 0.064315 0.047469 0.05773 4 4 4 4
185 0.537554 0.061087 0.047915 0.05836 4 4 4 4
186 0.110912 0.127901 0.426797 0.05866 4 4 4 4
187 0.111183 0.120148 0.447875 0.05911 4 4 4 4
188 0.501529 0.064159 0.048913 0.05979 4 4 4 4
189 0.111606 0.123194 0.435859 0.0598 4 4 4 4
190 0.541756 0.060424 0.048964 0.05986 4 4 4 4
191 0.523081 0.055139 0.049151 0.06013 4 4 4 4
192 0.54995 0.056965 0.049451 0.06057 4 4 4 4
193 0.112129 0.131924 0.475544 0.06067 4 4 4 4
194 0.112207 0.127374 0.448268 0.0608 4 4 4 4
195 0.545059 0.06457 0.049654 0.06086 4 4 4 4
196 0.112319 0.104572 0.481675 0.06098 4 4 4 4
197 0.112398 0.101544 0.413681 0.06111 4 4 4 4
198 0.538814 0.064755 0.049946 0.06129 4 4 4 4