SPE-192833-MS
Optimization of Models for Rapid Identification of Oil and Water Layers
During Drilling - A Win-Win Strategy Based on Machine Learning
Jian Sun and Qi Li, School of Petroleum Engineering, China University of Petroleum - Beijing; Mingqiang Chen and
Long Ren, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an; Fengrui Sun, School of Petroleum
Engineering, China University of Petroleum - Beijing; Yong Ai, Exploration and Development Institution Tarim Oil
Field - Korla; Kang Tang, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an
Copyright 2018, Society of Petroleum Engineers
This paper was prepared for presentation at the Abu Dhabi International Petroleum Exhibition & Conference held in Abu Dhabi, UAE, 12-15 November 2018.
This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.
Abstract
The identification of oil and water layers (OWL) from well log data is an important task in petroleum
exploration and engineering. At present, the commonly used methods for OWL identification are time-
consuming, suffer from low accuracy, or depend heavily on the experience of researchers. Therefore,
machine learning methods have been developed to identify lithology and OWL. Based on logging-while-
drilling data, this paper optimizes machine learning methods to identify OWL while drilling.
Recently, several computational algorithms have been used for OWL identification to improve prediction
accuracy. In this paper, we evaluate three popular machine learning methods: the one-against-rest support
vector machine, the one-against-one support vector machine, and the random forest. First, we choose
appropriate training set data as samples for model training. Then, the GridSearch method is used to find
the approximate range of reasonable parameter values, and k-fold cross-validation is used to optimize the
final parameters and avoid overfitting. Finally, appropriate test set data are chosen to verify the model.
This machine learning approach to identifying OWL while drilling has been successfully applied in the
Weibei oilfield. We selected 1934 groups of well logging response data from 31 production wells; among
them, 198 groups of LWD data were selected as the test set. Natural gamma, shale content, acoustic
time difference, and deep induction logs were selected as input feature parameters. After GridSearch
and 10-fold cross-validation, the results suggest that the random forest method is the best algorithm for
supervised classification of OWL using well log data. The accuracy of the three classifiers on the training
set is greater than 90%, but their differences are relatively large. On the test set, the accuracy of the three
classifiers is about 90%, with small differences. The one-against-rest support vector machine classifier
takes much more time than the other methods, and the one-against-one support vector machine classifier
has the lowest training set and test set accuracy of the three methods.
Although the three methods differ in their calculated OWL identification accuracy, all achieve relatively
high accuracy. For different reservoirs, taking into account both time cost and model calculation accuracy,
the random forest and one-against-one support vector machine models can be used to identify OWL in
real time during drilling.
Key words: Machine learning, Identification of oil and water layers, Support vector machine, Random
forest, Optimization
Introduction
The development of oil and gas in unconventional reservoirs such as low-permeability reservoirs, tight
reservoirs, and shale reservoirs has become popular in global oil and gas development. Developing oil and
gas resources in these reservoirs differs from developing conventional reservoirs and often requires more
time and higher economic cost, so an improvement in technology or methods in any one area can bring
undeniable benefits (Sun Fengrui et al. 2017). In recent years, logging-while-drilling (LWD) technology
has been widely adopted in drilling horizontal wells in unconventional reservoirs. However, LWD data are
generally used to interpret reservoir lithology and to guide drilling and geosteering work; these data are
applied less often to identifying the oil and water layers (OWL) encountered while drilling. OWL
identification still relies mostly on traditional methods, such as the intersection chart method (Wan
Qiao-sheng et al. 2017), the stripping method, and multi-curve joint qualitative identification combined
with intersection charts (Song Peng et al. 2016). Currently, drawing on statistics and computer science,
some methods relying on machine learning theory have also been developed for identifying reservoir
lithology and OWL (Zhong Yihua et al. 2009, Li Rong et al. 2009, Song Yanjie et al. 2007, Liu H. et
al. 2009, Xiongyan Li et al. 2013, Yunxin Xie et al. 2018, Shaoqun Dong et al. 2016, Arsalan A. Othman
and Richard Gloaguen 2017). Due to complex geological conditions and sedimentary environments, the
relationship between reservoir heterogeneity and reservoir logging response characteristics is nonlinear,
so linear logging response equations and statistical empirical formulas do not effectively characterize a
reservoir's true characteristics and cannot meet actual production needs. The traditional intersection chart
method depends directly on the experience of researchers and exhibits a certain degree of instability.
Therefore, where conventional linear and empirical logging interpretation technology performs
insufficiently, nonlinear information processing technology that reveals the distribution characteristics of
OWL can better meet the needs of oil and gas exploration and development. Artificial neural network
and support vector machine methods have been used to identify OWL. Although they can play an
interpretive role, many problems remain: the artificial neural network method struggles to give satisfactory
results under local optima, the curse of dimensionality, and small data samples (Ahmed Amara Konaté et
al. 2015, Morteza Raeesi et al. 2012, Baouche Rafik et al. 2017, B. Shokooh Saljooghi et al. 2015); the
support vector machine can overcome these shortcomings of the neural network method, but the classical
support vector machine algorithm only provides binary classification, while practical data mining
applications often require solving multi-category classification problems. Therefore, one-against-rest
support vector machines (OVR SVMs), one-against-one support vector machines (OVO SVMs) (Hsu,
C.-W. and Lin, C.-J. 2002), and random forest methods have emerged. These three methods can effectively
avoid the deficiencies above; the random forest algorithm in particular, composed of multiple decision
trees, has higher training accuracy, a better classification effect, and less tendency to overfit than a single
decision tree. However, no single classifier can be assumed sufficient for every problem, and each
application should be analysed on its own terms. Therefore, in this paper, the OVR SVMs classifier, the
OVO SVMs classifier, and the random forest classifier are constructed from the characteristic data obtained
from well logging, the target categories of the OWL are classified, and the results obtained by each classifier
are analysed. The optimal classifier and corresponding parameters are selected to solve the problem of
accurately identifying the OWL while drilling.
The principles and methods of support vector machine and random forest
The principle of support vector machine
The support vector machine (SVM) developed from the optimal classification surface for linearly
separable problems. Its core idea is that the optimal classification surface must not only separate the two
classes of samples correctly but also maximize the margin between them. In practice, most problems
encountered are nonlinear; the nonlinear problem is converted into a linear problem in a high-dimensional
space through a nonlinear transformation, and the optimal classification surface is found in the transformed
space (Neda Mahvash Mohammadi et al. 2018, Jaime Ortegon et al. 2018, Xiaoling Lu et al. 2018, Xiekai
Zhang et al. 2017, Italo Zoppis et al. 2018). Suppose that in the nonlinear case the sample points are
$(x_i, y_i)$ $(i = 1, \ldots, n)$. In the high-dimensional space, the classification surface equation is formula
(1), where $\varphi(x)$ is the mapping function from the input space to the feature space, $\omega$ is a
weight vector, and $b$ is a constant term; the schematic diagram is shown in Figure 1. Under the constraint
of formula (2), we seek the minimum of the function in formula (3). Through the Lagrangian optimization
method this can be converted into a dual problem, which reduces to a quadratic extremum problem:
maximizing formula (4) under the constraint of formula (5), where the $a_i$ are Lagrange multipliers and
$c$ is a constant that controls the degree of penalty for misclassified samples. Formula (6) is a kernel
function satisfying the Mercer condition. The optimal classification discriminant function obtained after
solving the above problem is formula (7), where $N$ is the number of support vectors (a code sketch of
this discriminant follows Figure 1).
$$\omega \cdot \varphi(x) + b = 0 \qquad (1)$$

$$y_i\,[\omega \cdot \varphi(x_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \qquad (2)$$

$$\Phi(\omega, \xi) = \tfrac{1}{2}\lVert \omega \rVert^2 + c \sum_{i=1}^{n} \xi_i \qquad (3)$$

$$Q(a) = \sum_{i=1}^{n} a_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(x_i, x_j) \qquad (4)$$

$$\sum_{i=1}^{n} a_i y_i = 0, \quad 0 \le a_i \le c \qquad (5)$$

$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) \qquad (6)$$

$$f(x) = \operatorname{sgn}\!\left( \sum_{i=1}^{N} a_i y_i K(x_i, x) + b \right) \qquad (7)$$
Figure 1—Support vector machine schematic diagram.
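To make formula (7) concrete, the sketch below (our illustration; the paper publishes no code) fits a Gaussian-kernel SVC from scikit-learn on toy two-class data and reproduces the discriminant from the stored support vectors, the dual coefficients $a_i y_i$, and the intercept $b$. The toy data and parameter values are assumptions.

```python
# Minimal sketch of formula (7) with a Gaussian (RBF) kernel; scikit-learn assumed.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.4, (40, 2)), rng.normal(1.5, 0.4, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

clf = SVC(kernel="rbf", C=1000.0, gamma=25.0).fit(X, y)

def decision(x):
    # f(x) = sgn( sum_i a_i y_i K(x_i, x) + b ), summed over the N support vectors;
    # a sign of -1 corresponds to the first class, +1 to the second.
    k = np.exp(-clf.gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return np.sign(clf.dual_coef_[0] @ k + clf.intercept_[0])

print(decision(X[0]), clf.predict(X[:1]))  # the two decisions agree
```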
The multi-classification method of support vector machine
The SVM algorithm was originally designed for the binary classification problem. When multi-category
classification problems are encountered, corresponding multi-category classifiers need to be constructed.
Currently, there are two main methods for constructing SVM multi-class classifiers. One is to modify the
objective function directly, combining the parameter solutions of multiple classification planes into a single
optimization problem that is solved "once"; this is called the direct method. It seems simple, but its
computational complexity is high, it is difficult to implement, and it only suits small problems. The other
is to combine multiple binary classifiers into a multi-class classifier; such approaches are called indirect
methods, usually OVO SVMs and OVR SVMs.
The OVR SVMs classifier treats the samples of one category as the positive class during training and
groups all remaining samples into the negative class. For samples of k categories, k SVMs, i.e., k binary
classifiers, are constructed: the i-th classifier separates the i-th class from all the others, taking the i-th class
of the training set as the positive class and the remaining classes as the negative class. In discrimination,
the input signal is passed through the k classifiers to obtain k output values fi(x) = sgn(gi(x)). If exactly
one output is +1, the corresponding class is the class of the input signal. In practice, however, the
constructed decision functions always contain errors: if more than one output is +1, or none is, the output
values are compared and the largest determines the input's category (a short sketch of this rule follows).
This method has an obvious deficiency: the positive class occupies only a small proportion of each training
set, so training is dominated by the remaining samples and deviations arise.
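A minimal sketch of this one-against-rest rule, assuming scikit-learn's OneVsRestClassifier and synthetic placeholder data (neither comes from the paper):

```python
# One-against-rest: k binary SVMs; the largest g_i(x) wins, even if no output is +1.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

g = ovr.decision_function(X[:5])   # one g_i(x) per class, shape (5, 4)
pred = g.argmax(axis=1)            # compare the outputs; the largest one wins
print(pred, ovr.predict(X[:5]))    # the two agree
```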
The OVO SVMs classifier designs an SVM between every pair of sample classes, so k classes require
k(k-1)/2 SVMs. When classifying an unknown sample, each classifier judges its class and votes for the
corresponding class, and the class with the most votes is assigned to the unknown sample. Voting proceeds
as follows (a code sketch follows): let A = B = C = D = 0; for the (A, B) classifier, if A wins then A = A + 1,
otherwise B = B + 1; for the (A, C) classifier, if A wins then A = A + 1, otherwise C = C + 1; ...; for the
(C, D) classifier, if C wins then C = C + 1, otherwise D = D + 1. The final decision is the maximum of
(A, B, C, D). Although this method performs better than OVR SVMs, when the number of categories k is
large, the number of models, k(k-1)/2, greatly increases the calculation time.
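The voting scheme can be written out directly. The sketch below is a pure-Python illustration with hypothetical stand-in classifiers, not the paper's implementation:

```python
# One-against-one voting over the k(k-1)/2 pairwise classifiers described above.
from itertools import combinations
from collections import Counter

def ovo_predict(pairwise, classes, x):
    """pairwise[(i, j)] is a binary classifier returning i or j for sample x."""
    votes = Counter()
    for i, j in combinations(classes, 2):
        votes[pairwise[(i, j)](x)] += 1   # winner of the (i, j) classifier gets a vote
    return votes.most_common(1)[0][0]     # class with the most votes

# Hypothetical stand-ins for trained pairwise classifiers between classes A..D;
# here the first class of each pair always wins, so A collects the most votes.
clfs = {pair: (lambda x, p=pair: p[0]) for pair in combinations("ABCD", 2)}
print(ovo_predict(clfs, "ABCD", x=None))  # -> 'A'
```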
The principle of the random forest classification algorithm
The random forest is one of the most popular machine learning models. In the 1980s, Breiman et al.
invented the classification tree algorithm (Breiman 1996), which repeatedly divides data into two groups
for classification or regression and greatly reduces the amount of computation. In 2001, Breiman combined
the classification trees into the random forest (Breiman 2001), which randomizes the use of variables
(columns) and the use of data (rows), generates many classification trees, and statistically summarizes
the classification tree results. The results are robust to missing data and unbalanced data, the method can
appropriately handle the effects of up to thousands of explanatory variables, and it is hailed as one of the
best algorithms available today (E. Vigneau et al. 2018, Michele Fratello et al. 2018, Robin Genuer et al.
2017, Christoph Behrens et al. 2018, Behnam Partopour et al. 2018). As the name implies, a random forest
creates a forest in a random manner. The forest is composed of many decision trees, and the decision trees
in a random forest are mutually independent. After the forest is built, each decision tree in the forest judges
every new input sample separately to determine the category to which it belongs; the number of times each
category is selected is counted, and the category selected most often is the prediction for the sample. A
decision tree is a tree structure (binary or non-binary) in which each non-leaf node represents a test of a
characteristic attribute and each leaf node stores a category. The decision process of a decision tree is shown
in Figure 2: starting from the root node, the corresponding feature attribute of the item being classified is
tested, and the output branch is selected according to the test result, until a leaf node is reached; the decision
result is the category stored in that leaf node (a small traversal sketch follows Figure 2).
Figure 2—Random forest schematic diagram.
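As a small illustration of this traversal, the sketch below walks a toy decision tree stored as nested dictionaries; the feature names, thresholds, and layout are hypothetical:

```python
# Root-to-leaf traversal of one decision tree: test an attribute at each
# non-leaf node, follow the matching branch, return the category in the leaf.
def classify(node, sample):
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

tree = {"feature": "GR", "threshold": 0.12,
        "left": {"leaf": "water layer"},
        "right": {"feature": "ILD", "threshold": 0.40,
                  "left": {"leaf": "oil-water layer"},
                  "right": {"leaf": "oil layer"}}}
print(classify(tree, {"GR": 0.15, "ILD": 0.45}))  # -> 'oil layer'
```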
In establishing each decision tree, two things need attention: sampling and complete splitting. The first
step is a double random sampling process: the random forest samples the input data by rows and by
columns. Row sampling is done with replacement, so the sample set obtained may contain duplicate
samples. Assuming the input has N samples, the bootstrap sample also has size N; thus each tree's training
sample is not the complete data set, which makes over-fitting relatively easy to avoid. Column sampling
selects m features from the M total features (m < M). The second step is to grow a decision tree on the
sampled data by complete splitting, until a leaf node cannot split further or all of its samples belong to the
same class. In general, the decision tree algorithm includes an important step, pruning; but because the two
random sampling processes guarantee randomness, no serious overfitting occurs even without pruning.
The classification method of random forest
Each tree in the random forest is a binary tree whose generation follows the top-down recursive splitting
principle: the training set is divided step by step from the root node. In the binary tree, the root node
contains all the training data; according to the principle of minimum node impurity, it splits into a left node
and a right node, each containing a subset of the training data, and each node continues to split by the same
rule until the branch-stopping rule is met and growth stops. Each decision tree thus learns the classification
of its specific data, while random sampling ensures that repeated samples are classified by different decision
trees, so that the classification ability of different decision trees can be evaluated.
The specific steps for random forest classification are as follows:
1. From the original training data set, apply the bootstrap method to randomly draw, with replacement,
k new sample sets and construct k classification trees; the samples not drawn each time constitute
the k out-of-bag data sets (see the sketch after this list).
2. Assuming there are n features, randomly extract m features at each node of each tree. Calculate the
amount of information contained in each feature, and select the feature with the highest
classification ability for the node split.
3. Do not prune any tree; let each grow to its maximum size.
4. Compose the generated classification trees into the random forest, and use the random forest
classifier to discriminate and classify new data. The classification result is determined by the
number of votes cast by the tree classifiers.
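A minimal sketch of steps 1 and 2, using numpy only; the array shape mirrors the paper's training data (1736 rows by 4 features), but the values are placeholders:

```python
# Bootstrap row sampling plus a random feature subset for one split.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1736, 4))                    # stand-in for GR, SH, AC, ILD after normalization
n, M = X.shape
m = int(np.sqrt(M))                          # the max_features = sqrt(n_features) rule

rows = rng.integers(0, n, size=n)            # draw n row indices WITH replacement
oob = np.setdiff1d(np.arange(n), rows)       # rows never drawn form the out-of-bag set
cols = rng.choice(M, size=m, replace=False)  # random feature subset for one node split

print(len(oob), cols)                        # roughly n/e rows stay out-of-bag
```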
This method has many advantages over other machine learning classification methods:
1. It has a high accuracy.
2. It handles higher dimensional data.
3. The introduction of randomness makes the method less susceptible to overfitting, and the trained
model has a small variance and a strong generalization ability.
4. The training can be highly parallelized, which is advantageous for training speed on large samples
in the era of Big Data.
5. Relative to the boosting series of Adaboost and GBDT, random forest implementation is relatively
simple.
However, for most statistical modelers, the random forest is like a black box: they cannot control
the internal operation of the model and can only manipulate different parameters and random seeds.
For small data or low-dimensional data (data with few features), the model may not produce a good
classification. Compared with other machine learning classification methods, the training set accuracy of
this method is often high, but the test set accuracy is not always correspondingly high.
The selection of model training data
The selection of the logging information type
Under current technical conditions, conventional well logging data can usually be obtained through
logging while drilling. There are many types of logging-while-drilling information; classified by logging
principle, they usually include electric logging data, sonic logging data, nuclear logging data, and so on.
However, using more data types and feature parameters does not guarantee higher machine learning
accuracy. Well logging data usually contain a considerable amount of noise, which affects the machine
learning identification results. Logging data acquired under the same logging principle are more strongly
correlated with one another than with data from different logging principles; if the data volume is too large
or the characteristic parameters are strongly correlated, parameter redundancy occurs, increasing the
machine learning time and even affecting the accuracy of model learning. Considering these factors
comprehensively, the natural gamma (GR), shale content (SH), acoustic time difference (AC), and deep
induction (ILD) logs are selected as the input logging responses for identifying the reservoir oil and gas
properties (a minimal redundancy screen is sketched below).
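A sketch of the kind of redundancy screen this paragraph implies, assuming pandas and an illustrative 0.9 correlation cut-off; the synthetic curves (SH deliberately derived from GR) are placeholders, not the paper's data:

```python
# Flag strongly correlated log curves before training to avoid parameter redundancy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
logs = pd.DataFrame({"GR": rng.random(200)})
logs["SH"] = 0.8 * logs["GR"] + 0.2 * rng.random(200)  # correlated with GR by construction
logs["AC"] = rng.random(200)
logs["ILD"] = rng.random(200)

corr = logs.corr().abs()
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # each pair once, no diagonal
pairs = corr.where(upper).stack()
print(pairs[pairs > 0.9])                              # candidates for removal
```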
Logging data standardization
Data standardization (normalization) is a basic preprocessing task for machine learning classification.
Different evaluation indicators often have different dimensions and dimensional units, which affects the
results of data analysis; to compare indicators, the data must be standardized to eliminate the dimensional
effects between them. After the original data are standardized, every indicator is on the same order of
magnitude, which is suitable for comprehensive comparative evaluation.
Because each type of logging data has a different dimension and the numerical magnitudes differ greatly,
the original logging data must be standardized to eliminate these effects on the analysis results. There
are two commonly used data normalization methods: min-max normalization and Z-score normalization.
In this paper, min-max normalization, formula (8), is used to standardize the logging data:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \qquad (8)$$

where $x_{\max}$ denotes the maximum value of the sample data and $x_{\min}$ denotes the minimum
value. After normalization, all logging data values lie in the interval [0, 1].
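A minimal sketch of formula (8), assuming numpy; the sample values are placeholders:

```python
# Min-max normalization of one log curve to the interval [0, 1].
import numpy as np

def min_max(x):
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())  # x' = (x - x_min) / (x_max - x_min)

print(min_max([40.0, 75.0, 110.0]))             # -> [0.  0.5 1. ]
```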
The selection of the logging information training data
The sample data used in this paper consist of 1934 groups of well logging response data from 31 production
wells in the Chang 3 reservoir of the Weibei 2 well area; among them, 198 groups of LWD data were
selected as the test set. Combining the available well response data and the well logging principles, four
types of logging response data are selected as the characteristic parameters: natural gamma-ray (GR),
shale content (SH), acoustic time (AC), and deep induction (ILD) logging responses. Figure 3 shows the
well logging curves of the WB2P27 well. The reservoirs are classified into four categories, i.e., oil-water
layers, dry layers, water layers, and oil layers; these target categories are encoded as the numbers 1, 2, 3,
and 4, respectively. The training data are standardized, and part of the data is shown in Table 1.
Figure 3—Well logging curves of the WB2P27 well.
Table 1—Partial training data.
Application of machine learning classification method in the identification of
OWL during drilling
Identifying OWL from logging response data is, in the final analysis, a nonlinear function mapping
problem. The relationship between the logging response and the actual reservoir interval is complex, so
this mapping is usually highly nonlinear. There are many types of logging response characteristics, and the
OWL target categories usually number more than two. Therefore, support vector machine multi-class
classifiers or random forest classifiers are an effective way to solve this complex problem.
The OVR SVMs classifier identifies OWL during the drilling process
1. The selection of the OVR SVMs kernel function and the training parameters
Commonly used kernel functions include the polynomial, Gaussian, and linear kernel functions. After
comparative analysis, and to obtain the best model, the Gaussian kernel function
$K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$ is selected here (Chang, Y.W. et al., 2010), as its
flexibility is very high. We use the GridSearch method to select the approximate range of the optimal
training parameters C and γ in the OVR SVMs, where C is the penalty constant and γ is the Gaussian
kernel width parameter. The GridSearch method is an exhaustive search over specified parameter values;
the optimal learning machine is obtained by optimizing the parameters of the estimation function through
cross-validation. Cross-validation is a statistical analysis method used to verify the performance of a
classifier and can avoid overfitting problems. There are three main types of cross-validation: (a) double
cross-validation; (b) k-fold cross-validation; and (c) leave-one-out cross-validation.
Double cross-validation, also known as 2-fold cross-validation (2-CV), splits the data set into two
equally sized subsets for two rounds of classifier training. In practice, 2-CV is not commonly used,
mainly because the number of training samples is then too small and usually insufficient to represent the
distribution of the parent sample, leading to a significant drop in the recognition rate in the test phase.
Additionally, the 2-CV subsets have a high degree of variation, often failing to meet the requirement that
the experimental process be replicable.
K-fold cross-validation (k-CV) is an extension of double cross-validation: the data set is divided into
k subsets, each subset is used once as the test set while the remaining subsets form the training set, and
the procedure is thus repeated k times. The average of the k recognition rates is taken as the result. In this
method, all samples are used as both training and test data, and each sample is verified exactly once.
Leave-one-out cross-validation (LOOCV), assuming there are n samples in the data set, is n-CV: each
sample in turn is used as the test set, and the remaining n-1 samples are used as the training set. Almost
all the samples in each round are used to train the model, so the results of this method are the closest to
the distribution of the parent sample, and the estimated generalization error is more reliable. LOOCV can
be considered when the experimental data set is small. However, the computational cost of LOOCV is
high, as the number of models to be built equals the total number of samples; when the total number of
samples is quite large, LOOCV is difficult in actual operation unless each model trains very fast, although
parallel calculation can reduce the time required.
Therefore, this paper uses the k-fold cross-validation method to optimize the objective function and
find the best parameter values, so that the cross-validation accuracy is highest and over-fitting is avoided.
First, take C = [1000, 3000, 5000, 7000, 9000] and γ = [15, 20, 25, 30, 35], and search the 25 combinations
of (C, γ) with the GridSearch method to bracket the optimal parameter values; the optimal values of C
and γ are approximately 5000 and 25, respectively. Then, use 10-fold cross-validation to perform a fine
search around C = 5000 and γ = 25: Figure 4 shows the 10-fold cross-validation of parameter C at γ = 25,
and Figure 5 shows that of parameter γ at C = 5000. Finally, select the combination with the highest
accuracy, (5000, 25), as the optimal OVR SVMs training parameters (a minimal sketch of this two-stage
search follows Figure 5). The accuracy on the training set is 0.93299, and training takes 760 s.
2. The validation of the test set data by the OVR SVMs classifier
The OVR SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.90909.
Figure 4—The 10-fold cross-validation of parameter C in OVR SVMs, γ=25.
Figure 5—The 10-fold cross-validation of parameter γ in OVR SVMs, C=5000.
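A minimal sketch of this two-stage search, assuming scikit-learn's GridSearchCV and OneVsRestClassifier; the data are synthetic placeholders for the standardized logging samples:

```python
# Coarse GridSearch over (C, gamma) for the OVR SVMs, scored by 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, n_features=4, n_informative=3,
                                       n_redundant=0, n_classes=4,
                                       n_clusters_per_class=1, random_state=0)

coarse = GridSearchCV(
    OneVsRestClassifier(SVC(kernel="rbf")),
    param_grid={"estimator__C": [1000, 3000, 5000, 7000, 9000],
                "estimator__gamma": [15, 20, 25, 30, 35]},
    cv=10,
).fit(X_train, y_train)
print(coarse.best_params_)  # the paper reports an optimum near (C, gamma) = (5000, 25)
```

A finer grid around the coarse optimum is then searched the same way.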
The OVO SVMs classifier identifies OWL during the drilling process
1. The selection of the OVO SVMs kernel function and the training parameters
The Gaussian kernel function is again selected, and the GridSearch method is used to select the
approximate range of the optimal training parameters C and γ in the OVO SVMs, with C and γ defined as
before. Then, the k-fold cross-validation method is used to optimize the objective function and find the
best parameter values, so that the cross-validation accuracy is highest and over-fitting is avoided.
First, take C = [500, 1000, 1500, 2000, 2500] and γ = [5, 10, 15, 20, 25], and search the 25 combinations
of (C, γ) with the GridSearch method; the optimal values of C and γ are approximately 500 and 20,
respectively. Then, use 10-fold cross-validation to perform a fine search around C = 500 and γ = 20:
Figure 6 shows the 10-fold cross-validation of parameter C at γ = 20, and Figure 7 shows that of parameter
γ at C = 900. Finally, select the combination with the highest accuracy, (900, 20), as the optimal OVO
SVMs training parameters (a minimal sketch follows Figure 7). The accuracy on the training set is 0.91753,
and training takes 100 s.
2. The validation of the test set data by the OVO SVMs classifier
The OVO SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.88889.
Figure 6—The 10-fold cross-validation of parameter C in OVO SVMs, γ=20.
Figure 7—The 10-fold cross-validation of parameter γ in OVO SVMs, C=900.
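The corresponding sketch for the OVO SVMs; scikit-learn's SVC is internally one-against-one for multi-class data, so no wrapper is needed. Data and settings are placeholders:

```python
# GridSearch over (C, gamma) for the OVO SVMs with 10-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=500, n_features=4, n_informative=3,
                                       n_redundant=0, n_classes=4,
                                       n_clusters_per_class=1, random_state=0)

ovo = GridSearchCV(
    SVC(kernel="rbf", decision_function_shape="ovo"),
    param_grid={"C": [500, 1000, 1500, 2000, 2500],
                "gamma": [5, 10, 15, 20, 25]},
    cv=10,
).fit(X_train, y_train)
print(ovo.best_params_)  # the paper reports an optimum near (C, gamma) = (900, 20)
```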
The random forest classifier identifies OWL during the drilling process
1. The selection of the random forest classifier n_estimators and the max_features parameters
The n_estimators parameter is the number of trees in the forest. Larger is not always better: as the
number of trees increases, the calculation time also increases, and the best predictive value appears at a
reasonable number of trees. The max_features parameter is the maximum number of features a single
decision tree is allowed to use, that is, the size of the randomly selected feature subset. The smaller the
subset, the faster the variance decreases, but the faster the bias increases. In classification problems,
max_features = sqrt(n_features) is usually taken (Behnam Partopour et al., 2018, E. Vigneau et al., 2018).
The GridSearch method was used to select the approximate range of the optimal random forest training
parameters n_estimators and max_features, and then k-fold cross-validation was used to optimize the
objective function and find the best parameter values, so that the cross-validation accuracy is highest and
over-fitting is avoided.
First, take n_estimators = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60] and max_features = [1, 2, 3, 4],
and search the 48 combinations of (n_estimators, max_features) with the GridSearch method; the optimal
values of n_estimators and max_features are approximately 25 and 1, respectively. Since max_features =
sqrt(n_features) is usually taken, let max_features = 2. Then, use 10-fold cross-validation to perform a
fine search around n_estimators = 25; Figure 8 shows the 10-fold cross-validation of parameter
n_estimators at max_features = 2. Finally, select the combination with the highest accuracy, (29, 2), as
the optimal random forest training parameters (a minimal sketch follows Figure 8). The accuracy on the
training set is 0.95361, and training takes 180 s.
2. The validation of the test set data by the random forest classifier
The random forest classifier was used to classify the LWD test data; the results are shown in Table 2,
and the test accuracy is 0.89899. The random forest classifier can also report the proportion each feature
parameter contributes to the classification; the feature importances are [GR = 0.28552335,
AC = 0.19097589, ILD = 0.19690391, SH = 0.32659686].
Figure 8—The 10-fold cross-validation of parameter n_estimators in random forest, max_features=2.
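A sketch of the random forest search and the feature-importance readout, again assuming scikit-learn and placeholder data; mapping the importances onto [GR, AC, ILD, SH] follows the paper's column order and is illustrative only:

```python
# GridSearch over (n_estimators, max_features) for the random forest, 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=4, n_informative=3,
                                       n_redundant=0, n_classes=4,
                                       n_clusters_per_class=1, random_state=0)

rf = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": list(range(5, 61, 5)),   # 5, 10, ..., 60
                "max_features": [1, 2, 3, 4]},
    cv=10,
).fit(X_train, y_train)

best = rf.best_estimator_  # the paper settles on (n_estimators, max_features) = (29, 2)
print(rf.best_params_)
print(dict(zip(["GR", "AC", "ILD", "SH"], best.feature_importances_)))
```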
Comparison of classification results of LWD test data by three classifiers
The model training results and test results obtained by the three classifiers are shown in Table 2 (A- actual
OWL category, B- identified OWL category).
Table 2—Training results and test results obtained by the three classifiers.
The parameters and operation results obtained by the three classifiers are compared and analyzed, as
shown in Table 3.
Table 3—Classification algorithm comparison table.
As can be seen from Table 3, the accuracy of the three classifiers on the training set is greater than 90%,
but their differences are relatively large. On the test set, the accuracy of the three classifiers is about 90%,
with small differences. In terms of calculation time, OVR SVMs consumes much more time than the
other two classifiers. Considering everything, the random forest classifier has the highest training set
accuracy, its test set accuracy is only 1% below that of OVR SVMs, and its computation time is only about
a quarter of that of OVR SVMs. Therefore, the random forest classifier was selected to identify the OWL
during the drilling process.
Conclusions
This work presents an optimal model that can quickly identify the OWL while drilling. The identification
results, recognition accuracy, and calculation time of OWL classification were obtained with three machine
learning methods. Some meaningful conclusions are listed below:
a. For the parameter optimization of all three methods, the initial value range of the GridSearch is
very important; it directly influences the accuracy and training time of the model.
b. After the parameters are optimized, the training set accuracy of the random forest is the highest:
about 2% higher than OVR SVMs and about 4% higher than OVO SVMs.
c. The test set accuracies of the three methods are very close, all around 90%.
d. The calculation time of OVR SVMs is much larger than that of OVO SVMs and the random forest.
e. For the oilfield sample data selected in this paper, the random forest method is the optimal
classification algorithm.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) (No. 51704235).
References
Ahmed Amara Konaté, Heping Pan, Sinan Fang, et al. Capability of self-organizing map neural network in geophysical log data classification: Case study from the CCSD-MH. Journal of Applied Geophysics, 2015, Vol 118: 37–46. https://doi.org/10.1016/j.jappgeo.2015.04.004.
Arsalan A. Othman, Richard Gloaguen. Integration of spectral, spatial and morphometric data into lithological mapping: A comparison of different Machine Learning Algorithms in the Kurdistan Region, NE Iraq. Journal of Asian Earth Sciences, 2017, Vol 146: 90–102. https://doi.org/10.1016/j.jseaes.2017.05.005.
Baouche Rafik, Baddari Kamel. Prediction of permeability and porosity from well log data using the nonparametric regression with multivariate analysis and neural network, Hassi R'Mel Field, Algeria. Egyptian Journal of Petroleum, 2017, Vol 26: 763–778. https://doi.org/10.1016/j.ejpe.2016.10.013.
Behnam Partopour, Randy C. Paffenroth, Anthony G. Dixon. Random Forests for mapping and analysis of microkinetics models. Computers & Chemical Engineering, 2018. https://doi.org/10.1016/j.compchemeng.2018.04.019.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.
B. Shokooh Saljooghi, A. Hezarkhani. A new approach to improve permeability prediction of petroleum reservoirs using neural network adaptive wavelet (wavenet). Journal of Petroleum Science and Engineering, 2015, Vol 133: 851–861. https://doi.org/10.1016/j.petrol.2015.04.002.
Chang, Y.W., Hsieh, C.J., Chang, K.W., et al., 2010. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res. 11 (11), 1471–1490.
Christoph Behrens, Christian Pierdzioch, Marian Risse. Testing the optimality of inflation forecasts under flexible loss with random forests. Economic Modelling, 2018, Vol 72: 270–277. https://doi.org/10.1016/j.econmod.2018.02.004.
E. Vigneau, P. Courcoux, R. Symoneaux, et al. Random forests: A machine learning methodology to highlight the volatile organic compounds involved in olfactory perception. Food Quality and Preference, 2018, Vol 68: 135–145. https://doi.org/10.1016/j.foodqual.2018.02.008.
Hsu, C.-W., Lin, C.-J., 2002. A comparison of methods for multiclass support vector machines. Trans. Neur. Netw. 13, 415–425.
Italo Zoppis, Giancarlo Mauri, Riccardo Dondi. Kernel Methods: Support Vector Machines. Reference Module in Life Sciences, 2018. https://doi.org/10.1016/B978-0-12-809633-8.20342-7.
Jaime Ortegon, Rene Ledesma-Alonso, Romeli Barbosa, et al. Material phase classification by means of Support Vector Machines. Computational Materials Science, 2018, Vol 148: 336–342. https://doi.org/10.1016/j.commatsci.2018.02.054.
Li Rong, Zhong Yihua. Identification method of oil/gas/water layer based on least square support vector machine. Natural Gas Exploration & Development, 2009, 32(03): 15–18+72.
Liu H., Wen S., Li W., Xu C., Hu C. (2009) Study on Identification of Oil/Gas and Water Zones in Geological Logging Base on Support-Vector Machine. Fuzzy Information and Engineering Volume 2. Advances in Intelligent and Soft Computing, Vol 62. Springer, Berlin, Heidelberg.
Michele Fratello, Roberto Tagliaferri. Decision Trees and Random Forests. Reference Module in Life Sciences, 2018. https://doi.org/10.1016/B978-0-12-809633-8.20337-3.
Morteza Raeesi, Ali Moradzadeh, Faramarz Doulati Ardejani, et al. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. Journal of Petroleum Science and Engineering, 2012, Vol 82–83: 151–165. https://doi.org/10.1016/j.petrol.2012.01.012.
Neda Mahvash Mohammadi, Ardeshir Hezarkhani. Application of support vector machine for the separation of mineralised zones in the Takht-e-Gonbad porphyry deposit, SE Iran. Journal of African Earth Sciences, 2018, Vol 143: 301–308. https://doi.org/10.1016/j.jafrearsci.2018.02.005.
Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, et al. Random Forests for Big Data. Big Data Research, 2017, Vol 9: 28–46. https://doi.org/10.1016/j.bdr.2017.07.003.
Shaoqun Dong, Zhizhang Wang, Lianbo Zeng. Lithology identification using kernel Fisher discriminant analysis with well logs. Journal of Petroleum Science and Engineering, 2016, Vol 143: 95–102. https://doi.org/10.1016/j.petrol.2016.02.017.
Song Peng, Yang Weiguo, Sun Dong, et al. Oil-water-layer identifying method for reservoir Chang6 in Huaqing oilfield. Petroleum Geology and Oilfield Development in Daqing, 2016, 35(6): 144–147.
Song Yanjie, Zhang Jianfeng, Yan Weilin, et al. A new identification method for complex lithology with support vector machine. Journal of Daqing Petroleum Institute, 2007, 31(5): 18–20.
Sun Fengrui, Yao Yuedong, Chen Mingqiang, Li Xiangfang, Zhao Lin, Meng Ye, Sun Zheng, Zhang Tao, Feng Dong. Performance analysis of superheated steam injection for heavy oil recovery and modeling of wellbore heat efficiency. Energy, 2017, 125: 795–804.
Wan Qiao-sheng, Li Xue-ying, Zhao Yu-qiu, et al. Oil and water layer identification method of Gaotaizi reservoirs in Qijiabei area. Progress in Geophysics, 2017, 32(2): 714–720. doi:10.6038/pg20170236.
Xiaoling Lu, Fengchi Dong, Xiexin Liu, et al. Varying Coefficient Support Vector Machines. Statistics & Probability Letters, 2018, Vol 132: 107–115. https://doi.org/10.1016/j.spl.2017.09.006.
Xiekai Zhang, Shifei Ding, Yu Xue. An improved multiple birth support vector machine for pattern classification. Neurocomputing, 2017, Vol 225: 119–128. https://doi.org/10.1016/j.neucom.2016.11.006.
Xiongyan Li, Hongqi Li. A new method of identification of complex lithologies and reservoirs: task-driven data mining. Journal of Petroleum Science and Engineering, 2013, Vol 109: 241–249. https://doi.org/10.1016/j.petrol.2013.08.049.
Yunxin Xie, Chenyang Zhu, Wen Zhou, et al. Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances. Journal of Petroleum Science and Engineering, 2018, Vol 160: 182–193. https://doi.org/10.1016/j.petrol.2017.10.028.
Zhong Yihua, Li Rong. Application of Principal Component Analysis and Least Square Support Vector Machine to Lithology Identification. Well Logging Technology, 2009, 33(05): 425–429.
Appendix A
Three training models' classification results of test sets
Table A—Three training models' classification results of test sets
No.  GR  AC  ILD  SH  Actual type  Identified type (OVR SVMs)  Identified type (OVO SVMs)  Identified type (random forest)
1 0.116089 0.096942 0.400179 0.05468 1 2 2 1
2 0.093133 0.051894 0.435171 0.05469 1 1 1 1
3 0.140357 0.050909 0.452366 0.05471 1 1 2 1
4 0.134357 0.064829 0.401235 0.05472 1 2 2 1
5 0.101213 0.104653 0.389918 0.05473 1 2 2 1
6 0.11059 0.028326 0.486268 0.05482 1 1 1 1
7 0.097519 0.069152 0.376002 0.05491 1 2 2 1
8 0.048411 0.081646 0.456296 0.05499 1 3 3 1
9 0.086996 0.027431 0.413717 0.05504 1 1 2 1
10 0.10145 0.075731 0.347666 0.05508 1 2 2 1
11 0.093321 0.052928 0.414261 0.05512 1 1 2 1
12 0.087399 0.078059 0.481719 0.05518 1 1 1 1
13 0.110548 0.214586 0.442463 0.05519 1 1 1 4
14 0.115368 0.097061 0.391812 0.05522 1 2 2 1
15 0.105354 0.108942 0.371031 0.05526 1 2 2 1
16 0.115801 0.061189 0.402327 0.0553 1 2 2 1
17 0.115259 0.064559 0.3861 0.05534 1 2 2 1
18 0.105459 0.088368 0.396466 0.05544 1 2 2 1
19 0.095664 0.061505 0.439703 0.05547 1 1 1 1
20 0.105674 0.069896 0.415734 0.05557 1 1 2 1
21 0.163836 0.067933 0.395874 0.12198 1 1 1 1
22 0.168097 0.057272 0.441259 0.12199 1 1 1 1
23 0.137618 0.070063 0.469064 0.12202 1 1 1 1
24 0.135075 0.116844 0.413593 0.12203 1 1 4 1
25 0.128039 0.062846 0.417837 0.12205 1 1 1 1
26 0.142556 0.061435 0.367942 0.12206 1 2 2 1
27 0.464709 0.09437 0.112175 0.12208 1 1 1 2
28 0.143162 0.032004 0.472009 0.12208 1 1 1 1
29 0.185951 0.048207 0.420951 0.12213 1 1 1 1
30 0.169695 0.061033 0.420158 0.1222 1 1 1 1
31 0.147916 0.031689 0.542651 0.12221 1 1 1 1
32 0.133118 0.0345 0.421555 0.12222 1 2 1 1
33 0.158063 0.099266 0.402537 0.12224 1 1 1 1
34 0.159824 0.087124 0.402147 0.12225 1 1 1 1
35 0.168219 0.05957 0.427848 0.12228 1 1 1 1
36 0.13614 0.133094 0.386076 0.12232 1 2 2 2
37 0.144019 0.054874 0.420568 0.12235 1 1 1 1
38 0.229379 0.095255 0.419888 0.12235 1 1 1 1
39 0.124975 0.091524 0.384267 0.12238 1 1 2 2
40 0.145918 0.06305 0.438382 0.12239 1 1 1 1
41 0.171598 0.078392 0.392307 0.12487 1 1 1 1
42 0.175802 0.073361 0.401294 0.12494 1 1 1 1
43 0.153868 0.044822 0.451974 0.125 1 1 1 1
44 0.161192 0.08632 0.417621 0.12507 1 1 1 1
45 0.122151 0.031818 0.506742 0.12509 1 1 1 1
46 0.144176 0.105396 0.395701 0.12512 1 1 1 2
47 0.122202 0.033715 0.504282 0.12517 1 1 1 1
48 0.155712 0.100281 0.397754 0.12526 1 1 1 2
49 0.169543 0.069907 0.418579 0.12534 1 1 1 1
50 0.14465 0.035174 0.440296 0.12544 1 1 1 1
51 0.12477 0.096546 0.387239 0.04789 2 1 1 2
52 0.105168 0.088048 0.371421 0.05707 2 2 2 1
53 0.127323 0.085472 0.376093 0.0571 2 2 2 1
54 0.115366 0.123103 0.336264 0.05718 2 2 2 1
55 0.09704 0.109209 0.356372 0.05722 2 2 2 1
56 0.111235 0.093313 0.355797 0.05722 2 2 2 1
57 0.113238 0.064385 0.38622 0.05725 2 2 2 1
58 0.095083 0.062337 0.372016 0.05726 2 2 2 1
59 0.130477 0.099633 0.388648 0.05729 2 2 2 1
60 0.145923 0.069724 0.424973 0.05732 2 1 2 1
61 0.09512 0.064524 0.373528 0.05732 2 2 2 1
62 0.127491 0.084555 0.385464 0.05734 2 2 2 1
63 0.103077 0.044248 0.393751 0.05736 2 2 2 1
64 0.126336 0.068057 0.401996 0.05739 2 2 2 1
65 0.116641 0.067068 0.357247 0.05744 2 2 2 1
66 0.109397 0.074987 0.411857 0.05748 2 1 2 1
67 0.115317 0.113172 0.355185 0.0691 2 2 2 2
68 0.098599 0.088842 0.380222 0.06912 2 2 2 2
69 0.118857 0.079711 0.389082 0.06913 2 2 2 2
70 0.142886 0.093396 0.379467 0.06918 2 2 2 2
71 0.120946 0.052307 0.388528 0.0692 2 2 2 2
72 0.129197 0.113914 0.363293 0.06928 2 2 2 2
73 0.101941 0.096244 0.358495 0.06928 2 2 2 2
74 0.101953 0.097231 0.351698 0.06931 2 2 2 2
75 0.094934 0.097809 0.374366 0.06933 2 2 2 2
76 0.15549 0.05827 0.435795 0.06935 2 2 2 2
77 0.133267 0.070657 0.38394 0.06936 2 2 2 2
78 0.118735 0.066187 0.390106 0.06937 2 2 2 2
79 0.142839 0.134442 0.359029 0.06949 2 2 2 2
80 0.111618 0.07402 0.376729 0.06949 2 2 2 2
81 0.135832 0.135151 0.391522 0.06954 2 2 2 2
82 0.074773 0.044555 0.385313 0.06957 2 3 2 3
83 0.102127 0.094061 0.35634 0.06962 2 2 3 2
84 0.093864 0.046194 0.392656 0.06962 2 2 2 2
85 0.092278 0.113633 0.365266 0.06968 2 2 2 2
86 0.106842 0.152879 0.346474 0.0697 2 2 2 2
87 0.09232 0.105724 0.363174 0.06975 2 2 2 2
88 0.134494 0.059594 0.369076 0.06975 2 2 2 2
89 0.119974 0.040733 0.402027 0.06978 2 2 2 2
90 0.082466 0.084187 0.355723 0.0698 2 2 2 2
91 0.150982 0.079781 0.403129 0.08513 2 2 2 2
92 0.152439 0.131453 0.371112 0.08516 2 2 2 2
93 0.091509 0.065507 0.363856 0.08521 2 2 2 2
94 0.135842 0.067661 0.386954 0.08523 2 2 2 2
95 0.122024 0.110761 0.340472 0.08523 2 2 2 2
96 0.12142 0.074461 0.359575 0.08523 2 2 2 2
97 0.156597 0.062139 0.420275 0.08528 2 2 2 2
98 0.113783 0.062481 0.39005 0.0853 2 2 2 2
99 0.126675 0.06132 0.350674 0.08531 2 2 2 2
100 0.11917 0.14549 0.360685 0.08533 2 2 2 2
101 0.119632 0.099883 0.362623 0.08541 2 2 2 2
102 0.136355 0.042065 0.356886 0.08541 2 2 2 2
103 0.162798 0.099153 0.369932 0.08543 2 2 2 2
104 0.073962 0.071837 0.357579 0.0855 2 2 2 2
105 0.12281 0.131907 0.358317 0.08552 2 2 2 2
106 0.121728 0.047518 0.370768 0.08571 2 2 2 2
107 0.119116 0.124355 0.393161 0.08573 2 2 2 2
108 0.136521 0.040915 0.368286 0.08577 2 2 2 2
109 0.128879 0.08755 0.382968 0.08577 2 2 2 2
110 0.122561 0.108296 0.365183 0.08578 2 2 2 2
111 0.144401 0.049313 0.413369 0.08581 2 2 2 2
112 0.153059 0.115833 0.350277 0.08591 2 2 2 2
113 0.128982 0.072302 0.393509 0.08594 2 2 2 2
114 0.11833 0.064652 0.396145 0.08596 2 2 2 2
115 0.142659 0.132677 0.392649 0.08597 2 2 2 2
116 0.110919 0.092135 0.347005 0.08613 2 2 2 2
117 0.122021 0.046748 0.368181 0.08617 2 2 2 2
118 0.104677 0.094635 0.353205 0.08627 2 2 2 2
119 0.114452 0.105109 0.352917 0.08629 2 2 2 2
120 0.135639 0.113166 0.374483 0.08632 2 2 2 2
121 0.104721 0.081748 0.391893 0.08634 2 2 2 2
122 0.111027 0.084587 0.314361 0.08634 2 2 2 2
123 0.129903 0.150066 0.384152 0.08634 2 2 2 2
124 0.14229 0.117594 0.390074 0.08643 2 2 2 2
125 0.119431 0.126168 0.366043 0.08646 2 2 2 2
126 0.144134 0.052972 0.370812 0.0865 2 2 2 2
127 0.114562 0.091461 0.356864 0.08651 2 2 2 2
128 0.144149 0.083559 0.362779 0.08652 2 2 2 2
129 0.12724 0.059415 0.409968 0.0866 2 2 2 2
130 0.108039 0.107024 0.385483 0.08662 2 2 2 2
131 0.051062 0.060244 0.383384 0.02585 3 3 2 3
132 0.051077 0.060568 0.39287 0.02587 3 3 3 3
133 0.05127 0.06005 0.394397 0.02614 3 3 3 3
134 0.033733 0.055272 0.419781 0.02617 3 3 3 3
135 0.051436 0.057852 0.393353 0.02636 3 3 3 3
136 0.057773 0.038152 0.432485 0.0264 3 3 3 3
137 0.051592 0.0608 0.388477 0.02658 3 3 3 3
138 0.034197 0.056715 0.375483 0.02682 3 3 3 3
139 0.052125 0.060198 0.388814 0.02732 3 3 3 3
140 0.052535 0.061115 0.395577 0.02788 3 3 3 3
141 0.052875 0.060713 0.390398 0.02835 3 3 3 3
142 0.052943 0.05922 0.394543 0.02845 3 3 3 3
143 0.052999 0.059598 0.381141 0.02853 3 3 3 3
144 0.053053 0.060329 0.404663 0.0286 3 3 3 3
145 0.053427 0.061409 0.40064 0.02912 3 3 3 3
146 0.054624 0.060629 0.396622 0.0308 3 3 3 3
147 0.060422 0.040511 0.437307 0.03144 3 3 3 3
148 0.055726 0.061998 0.349 0.03235 3 3 3 3
149 0.056036 0.05662 0.367496 0.03279 3 3 3 3
150 0.056158 0.062029 0.40193 0.03296 3 3 3 3
151 0.071272 0.034763 0.475042 0.05284 3 3 3 3
152 0.071311 0.038596 0.448853 0.05291 3 3 3 3
153 0.071519 0.053728 0.423849 0.05333 3 3 3 3
154 0.070045 0.061776 0.394562 0.0535 3 3 3 3
155 0.071885 0.043018 0.451844 0.05411 3 3 3 3
156 0.052679 0.053424 0.418452 0.05423 3 3 3 3
157 0.072095 0.038146 0.469532 0.05451 3 3 3 3
158 0.070783 0.060683 0.404831 0.05464 3 3 3 3
159 0.053039 0.053885 0.42146 0.05479 3 3 3 3
160 0.071277 0.061765 0.378674 0.05541 3 3 3 3
161 0.053471 0.048074 0.441664 0.05547 3 3 3 3
162 0.072567 0.055437 0.42764 0.05548 3 3 3 3
163 0.053618 0.053628 0.419293 0.0557 3 3 3 3
164 0.053938 0.051498 0.352532 0.0562 3 3 3 3
165 0.072105 0.058394 0.399199 0.0567 3 3 3 3
166 0.072506 0.053139 0.382619 0.05732 3 3 3 3
167 0.073522 0.039996 0.452193 0.05743 3 3 3 3
168 0.072867 0.056911 0.39388 0.05789 3 3 3 3
169 0.096077 0.13179 0.489616 0.03588 4 4 3 4
170 0.09694 0.122214 0.43374 0.03712 4 4 4 4
171 0.096937 0.1048 0.472465 0.03712 4 4 4 4
172 0.097463 0.123872 0.490084 0.03788 4 4 4 4
173 0.097599 0.118281 0.497154 0.03808 4 4 4 4
174 0.098222 0.126746 0.500226 0.03899 4 4 4 4
175 0.098872 0.121166 0.496639 0.03994 4 4 4 4
176 0.121332 0.242216 0.509814 0.04 4 4 4 4
177 0.099654 0.133744 0.467101 0.0411 4 4 4 4
178 0.122417 0.258801 0.477665 0.04133 4 4 4 4
179 0.513103 0.067942 0.046743 0.05671 4 4 4 4
180 0.10972 0.136331 0.482309 0.05671 4 4 4 4
181 0.134107 0.221344 0.364771 0.05691 4 4 4 4
182 0.109928 0.112903 0.420119 0.05705 4 4 4 4
183 0.110121 0.127888 0.467199 0.05736 4 4 4 4
184 0.510044 0.064315 0.047469 0.05773 4 4 4 4
185 0.537554 0.061087 0.047915 0.05836 4 4 4 4
186 0.110912 0.127901 0.426797 0.05866 4 4 4 4
187 0.111183 0.120148 0.447875 0.05911 4 4 4 4
188 0.501529 0.064159 0.048913 0.05979 4 4 4 4
189 0.111606 0.123194 0.435859 0.0598 4 4 4 4
190 0.541756 0.060424 0.048964 0.05986 4 4 4 4
191 0.523081 0.055139 0.049151 0.06013 4 4 4 4
192 0.54995 0.056965 0.049451 0.06057 4 4 4 4
193 0.112129 0.131924 0.475544 0.06067 4 4 4 4
194 0.112207 0.127374 0.448268 0.0608 4 4 4 4
195 0.545059 0.06457 0.049654 0.06086 4 4 4 4
196 0.112319 0.104572 0.481675 0.06098 4 4 4 4
197 0.112398 0.101544 0.413681 0.06111 4 4 4 4
198 0.538814 0.064755 0.049946 0.06129 4 4 4 4
Article
Reservoir identification is important for reservoir evaluation and petroleum development. Existing methods cannot automatically identify the categories of the reservoir that exhibit: (a) local features differences of well logging data; (b) limited with non-reservoir interference; and (c) insufficient real labels. Transfer learning-based methods utilize other blocks partially address the problem of small samples. However, they ignore the significant geological differences between blocks. Therefore, this paper proposes a small sample reservoir identification method combining Convolutional Mask Attention Network (CMAN) and Value-aware Meta-Transfer Learning (VMTL) strategy. First, we pre-train the CMAN on the source block to adaptively extract the local information of each depth point. The CMAN also automatically masks the non-reservoir information while capturing the relationship between reservoirs and non-reservoirs to improve feature extraction. Then we design a VMTL strategy to learn valuable transfer knowledge for overcoming the geological difference. Finally, we fine-tune our model using target block data to address the insufficient samples. The average accuracy and F1 score of the proposed method on real-world oilfield data are respectively 92.61% and 88.85%. The results of the two cases demonstrate our method outperforms existing methods in convergence speed, stability, and generalizability.
Conference Paper
The demand for cost-effective drilling operations in oil and gas exploration is ever growing. One of the important aspects to tackling the aforementioned difficulty is determining the optimal rate of penetration (ROP) of the drill bit. The most important optimization objective is to achieve a high optimal rate of penetration in safe and stable drilling conditions. Several machine learning models have been developed to predict ROP, however, there have been few studies that consider the different optimization algorithms needed to optimize the conventional developed models other than the conventional grid search and random search techniques. Genetic algorithm (GA) has gained much attention as methods of optimizing the predictions of machine learning algorithms in different fields of study. In this study, GA optimization algorithm was implemented to optimize 5 machine learning algorithms: Linear Regression, Decision Tree, Support Vector Machine, Random Forest, and Multilayer Perceptron algorithm while using torque, weight on bit, surface RPM, mud flow, pump pressure, downhole temperature and pressure, etc, as input parameters. Three scenarios were analyzed using a train-test split ratio of 70-30, 80-20 and 85-15 percent on all the developed models. The results from the comparative study of all models developed shows that the implementation of the GA optimization algorithms increased the individual ROP models, with the multilayer perceptron model having the highest coefficient of determination of 0.989% after GA optimization.
Article
By the merits of self-stability and low energy consumption, high temperature superconducting (HTS) maglev has the potential to become a novel type of transportation mode. As a key index to guarantee the lateral self-stability of HTS maglev, guiding force has strong non-linearity and is determined by multitudinous factors, and these complexities impede its further researches. Compared to traditional finite element and polynomial fitting method, the prosperity of deep learning algorithms could provide another guiding force prediction approach, but the verification of this approach is still blank. Therefore, this paper establishes 5 different neural network models (RBF, DNN, CNN, RNN, LSTM) to predict HTS maglev guiding force, and compares their prediction efficiency based on 3720 pieces of collected data. Meanwhile, two adaptively iterative algorithms for parameters matrix and learning rate adjustment are proposed, which could effectively reduce computing time and unnecessary iterations. And according to the results, it is revealed that, the DNN model shows the best fitting goodness, while the LSTM model displays the smoothest fitting curve on guiding force prediction. Based on this discovery, the effects of learning rate and iterations on prediction accuracy of the constructed DNN model are studied. And the learning rate and iterations at the highest guiding force prediction accuracy are 0.00025 and 90000, respectively. Moreover, the K-fold cross validation method is also applied to this DNN model, whose result manifests the generalization and robustness of this DNN model. The imperative of K-fold cross validation method to ensure universality of guiding force prediction model is likewise assessed. This paper firstly combines HTS maglev guiding force prediction with deep learning algorithms considering different field cooling height, real-time magnetic flux density, liquid nitrogen temperature and motion direction of bulk. Additionally, this paper gives a convenient and efficient method for HTS guiding force prediction and parameter optimization.
Article
The updating of reservoir geological models has become a research hotspot. Nevertheless, two difficulties continue to hinder the development of model updating techniques. First, logging while drilling (LWD) is used mainly to guide geosteering operations and identify lithology; few scholars have researched the interpretation of reservoir physical properties while drilling, which is the basis for updating geological models. Second, interpretation results are difficult to transmit to geological models in real time. Based on the LWD technique, this paper draws on logging interpretation, machine learning, computer science, and reservoir geological modeling theories and methods to investigate real-time updating of the geological model around the well. First, based on effective logging data, two machine learning algorithms, random forest (RF) and extreme gradient boosting (XGBoost), are used to establish interpretation models for reservoir lithology, porosity, and permeability. The parameters of each model are optimized through cross-validation, and LWD data are interpreted in real time by the interpretation models. Second, exploiting the convenience of the Ocean secondary development platform and the functionality of Petrel software, a plug-in for real-time transmission of the current well trajectory and reservoir interpretation results is compiled, and an automatic geological model updating module is established. A case study is performed with data from the Sulige gas field in the Ordos Basin, China. For the real-time interpretation of reservoir characteristics while drilling, after 228 trials, the XGBoost algorithm is chosen to establish the lithology, porosity, and permeability interpretation models. For the real-time updating of the geological model around the well, given the consistent probability distributions and the agreement between adjacent wells, the relative errors between the simulated and real values of lithofacies, porosity, and permeability are 3.90%, 4.50%, and 7.60%, respectively. Therefore, this paper provides a new method for the real-time modification and updating of reservoir geological models, which preliminarily resolves the tension between accuracy and timeliness in real-time model updating.
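A hedged sketch of the first step, the while-drilling interpretation models: RF and XGBoost regressors tuned by cross-validation, shown here for porosity on synthetic placeholder data (the Sulige logs and the Ocean/Petrel plug-in are outside the scope of a short example).

```python
# Minimal sketch: RF and XGBoost porosity regressors tuned with 5-fold CV.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Synthetic stand-in for log features (GR, resistivity, density, etc.).
X, y = make_regression(n_samples=800, n_features=5, noise=2.0, random_state=1)

rf = GridSearchCV(RandomForestRegressor(random_state=0),
                  {"n_estimators": [100, 300], "max_depth": [5, 10, None]},
                  cv=5, scoring="r2").fit(X, y)
xgb = GridSearchCV(XGBRegressor(random_state=0),
                   {"n_estimators": [100, 300], "max_depth": [3, 6],
                    "learning_rate": [0.05, 0.1]},
                   cv=5, scoring="r2").fit(X, y)

print("RF  best CV R2:", round(rf.best_score_, 3), rf.best_params_)
print("XGB best CV R2:", round(xgb.best_score_, 3), xgb.best_params_)
# The fitted best_estimator_ would then score each new LWD sample in real time.
```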
Article
The prediction of porosity and permeability from well logging data is important in oil and gas field development. Currently, many scholars use machine learning algorithms to predict reservoir properties, but few have researched the prediction of porosity and permeability while drilling. This application requires not only high prediction accuracy but also short model processing and calculation times, because new logging data arrive continuously while drilling. In this paper, four machine learning algorithms were evaluated: the one-versus-rest support vector machine (OVR SVM), one-versus-one support vector machine (OVO SVM), random forest (RF), and gradient boosting decision tree (GBDT) algorithms. First, samples of wireline logging data from the Yan969 wellblock of the Yan'an gas field were chosen for model training; data correlation analysis was performed to improve accuracy and to reduce the input parameter dimensions and model training time as much as possible. Second, we used the grid search method to find approximate ranges of reasonable parameter values and then used k-fold cross-validation to optimize the final parameters and avoid overfitting. Third, we used the four classification models to predict porosity and permeability while drilling from logging while drilling (LWD) data. Finally, we identify the best porosity and permeability prediction models for use while drilling. Balancing the highest possible prediction accuracy against the shortest possible training time, the OVO SVM algorithm is suggested for porosity and permeability prediction. Therefore, appropriate machine learning algorithms can be used to predict porosity and permeability while drilling.
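The two-stage tuning workflow described above can be sketched as follows: a coarse grid search to bracket C and gamma, then 10-fold cross-validation on a refined grid around the coarse optimum. scikit-learn's SVC uses a one-versus-one scheme for multiclass problems by default; the data here are synthetic placeholders for the Yan969 logs.

```python
# Hedged sketch of coarse grid search followed by refined 10-fold CV tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # OVO by default
coarse = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10, 100],
                             "svc__gamma": [0.001, 0.01, 0.1, 1]},
                      cv=5).fit(X, y)
c0, g0 = coarse.best_params_["svc__C"], coarse.best_params_["svc__gamma"]

# Refine around the coarse optimum with 10-fold cross-validation.
fine = GridSearchCV(pipe, {"svc__C": [c0 / 2, c0, c0 * 2],
                           "svc__gamma": [g0 / 2, g0, g0 * 2]},
                    cv=10).fit(X, y)
print("final params:", fine.best_params_,
      "CV accuracy:", round(fine.best_score_, 3))
```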
Article
Due to the oil price fluctuations of recent decades, international and national oil companies have developed programs of strategically oriented development and asset optimization, and many have actively promoted opportunities for oilfield joint ventures. A fast and accurate methodology for assessing oilfield assets, including planning and cost assessment, is therefore needed. Oilfield development cost is dynamically affected by a number of factors, including internal oilfield indexes and macroeconomic indexes. Combining machine learning with mathematical and statistical methods, Microsoft Azure Machine Learning Studio was used to model oilfield development cost. The proposed method adopted three algorithms: a neural network, a boosted decision tree, and a decision forest. The results showed that the boosted decision tree and decision forest algorithms achieve a permutation feature importance (PFI) ranking with stable training results. Analysis of the trained models showed that the PFI ranking can provide reasonable scientific and technical support for oil companies and help attain more effective and accurate estimation and prediction of oilfield assets.
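Azure Machine Learning Studio is a proprietary GUI workflow, so the sketch below reproduces only the core idea: a permutation feature importance (PFI) ranking from a boosted-tree cost model, using scikit-learn on synthetic stand-ins for the cost-driver indexes.

```python
# PFI ranking for a boosted-tree cost model (scikit-learn equivalent sketch).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=6, noise=1.0, random_state=0)
names = [f"index_{i}" for i in range(6)]     # hypothetical cost-driver names

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# PFI: shuffle one feature at a time and measure the drop in test-set score.
pfi = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for i in pfi.importances_mean.argsort()[::-1]:
    print(f"{names[i]}: {pfi.importances_mean[i]:.3f}"
          f" +/- {pfi.importances_std[i]:.3f}")
```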
Article
Pixel classification of images obtained from random heterogeneous materials is a relevant step in computing their physical properties, such as Effective Transport Coefficients (ETC), during characterization processes such as stochastic reconstruction. A poor classification will degrade the computed properties; however, the literature on the topic discusses mainly the correlation functions or the property formulae, giving little or no attention to the classification itself; authors mention either the use of a threshold or, in a few cases, the use of Otsu's method. This paper presents a classification approach based on Support Vector Machines (SVM) and a comparison with the Otsu-based approach in terms of accuracy and precision. The data used for SVM training are the key to a better classification; these data are the grayscale value and the magnitude and direction of the pixel gradient. In the case study, the accuracy of the pixel classification is 77.6% for the SVM method versus 40.9% for Otsu's method. Finally, the impact on the correlation functions is discussed in order to show the benefits of the proposal.
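A minimal sketch of the described feature set and comparison, assuming a synthetic two-phase image: each pixel is represented by its grayscale value plus gradient magnitude and direction, an SVM is trained on a labeled subset, and the result is compared with a plain Otsu threshold.

```python
# SVM pixel classification (grayscale + gradient features) vs Otsu threshold.
import numpy as np
from skimage.filters import threshold_otsu
from sklearn.svm import SVC

rng = np.random.default_rng(0)
truth = (rng.random((64, 64)) > 0.5).astype(int)             # two-phase truth
img = truth * 0.6 + 0.2 + rng.normal(0, 0.15, truth.shape)   # noisy grayscale

gy, gx = np.gradient(img)
feats = np.column_stack([img.ravel(),
                         np.hypot(gx, gy).ravel(),       # gradient magnitude
                         np.arctan2(gy, gx).ravel()])    # gradient direction
labels = truth.ravel()

# Train on a random subset of labeled pixels, evaluate on the rest.
idx = rng.permutation(labels.size)
train, test = idx[:2000], idx[2000:]
svm = SVC(kernel="rbf").fit(feats[train], labels[train])
svm_acc = (svm.predict(feats[test]) == labels[test]).mean()

otsu_pred = (img > threshold_otsu(img)).astype(int).ravel()
otsu_acc = (otsu_pred[test] == labels[test]).mean()
print(f"SVM accuracy: {svm_acc:.3f}  Otsu accuracy: {otsu_acc:.3f}")
```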
Article
The purpose of this paper is to discuss the application of the Random Forest methodology to sensory analysis. A methodological point of view is mainly adopted to describe as simply as possible the construction of binary decision trees and, more precisely, Classification and Regression Trees (CART), as well as the generation of an ensemble of trees or, in other words, a Random Forest. The value of the permutation accuracy criterion as a measure of variable importance is specifically emphasized as a way of identifying the most predictive variables and selecting a subset of these variables for parsimonious and efficient predictive models. A two-step procedure is proposed for choosing this subset of variables. The principle of the method is illustrated in a case study in which the aim was to better understand and predict the olfactory characteristics of red wines made of the Cabernet Franc grape variety, from their Volatile Organic Compound (VOC) content. For two main olfactory attributes, the bell pepper odor and the leather odor, it was possible to list the most important compounds and to highlight a very small number of compounds useful for estimating each of the olfactory attributes considered. For the latter, it was also observed that Random Forest models had a better predictive ability than Partial Least Squares (PLS) Regression models.
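The two-step procedure can be sketched roughly as follows, with synthetic stand-ins for the VOC matrix: rank variables by Random Forest permutation importance, then refit on a small top-ranked subset to obtain a parsimonious model.

```python
# Two-step variable selection: permutation-importance ranking, then refit.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

# Many candidate compounds, few samples, as is typical in sensory data.
X, y = make_regression(n_samples=60, n_features=40, n_informative=5,
                       noise=0.5, random_state=0)

# Step 1: rank variables by permutation importance.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=30, random_state=0)
ranking = imp.importances_mean.argsort()[::-1]

# Step 2: keep a parsimonious subset of top-ranked variables and refit.
top = ranking[:5]
score_full = cross_val_score(rf, X, y, cv=5, scoring="r2").mean()
score_top = cross_val_score(
    RandomForestRegressor(n_estimators=500, random_state=0),
    X[:, top], y, cv=5, scoring="r2").mean()
print("top variables:", top,
      "R2 full:", round(score_full, 2), "R2 subset:", round(score_top, 2))
```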
Article
We introduce the application of an ensemble learning method known as Random Forests to microkinetics modeling and the computationally efficient integration of microkinetics into reaction engineering models. First, we show how Random Forests can be used for mapping pre-computed microkinetics data. Random Forests can predict new datasets while keeping the prediction accuracy high and the computational load low. The method is also used to identify the variables in the mechanism that are important with regard to the overall reaction rate and selectivity. The results are compared with those of a similar study using Campbell's Degree of Rate Control approach, and it is shown that the Random Forests method can identify important features of the mechanism over a wide range of reacting conditions. Finally, the inclusion of the suggested method into reaction engineering models, such as Computational Fluid Dynamics (CFD) resolved-particle simulations of fixed bed reactors, is presented.
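A minimal sketch of the surrogate idea, assuming a mock Arrhenius-like rate table in place of real pre-computed microkinetics data: a Random Forest is fitted to the table so that a reactor-scale model can query rates cheaply instead of re-solving the full mechanism.

```python
# Random Forest surrogate for a pre-computed microkinetics rate table.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
T = rng.uniform(500, 900, 5000)          # temperature, K
p = rng.uniform(1, 10, 5000)             # partial pressure, bar
rate = p * np.exp(-8000.0 / T)           # mock pre-computed overall rate

surrogate = RandomForestRegressor(n_estimators=200, random_state=0)
surrogate.fit(np.column_stack([T, p]), rate)

# Inside a CFD or fixed-bed loop, cell conditions map to rates instantly:
cells = np.array([[650.0, 2.0], [820.0, 7.5]])
print("predicted rates:", surrogate.predict(cells))
# Impurity-based importances hint at which variables drive the rate.
print("feature importances (T, p):", surrogate.feature_importances_)
```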
Article
We contribute to recent research on the optimality of macroeconomic forecasts. We start from the assumption that forecasters may have a flexible loss function rather than the symmetric (quadratic) loss function assumed in standard tests. This assumption leads to the prediction that variables available to a forecaster when a forecast was formed should have no predictive value for a binary 0/1-indicator that captures the sign of the forecast error. A test of forecast optimality, thus, can be interpreted as a classification problem. We use random forests to model this classification problem. Random forests are a powerful nonparametric modeling instrument originally developed in the machine-learning literature. Unlike conventional linear-probability or logit/probit models, random forests account in a natural way for potential nonlinear links between the signed forecast error and the variables in a forecaster's information set. Random forests also can handle a situation in which the number of forecasts is small relative to the number of predictor variables that a researcher uses to proxy a forecaster's information set. Random forests, therefore, are a powerful modeling device that is of interest for every researcher who studies the properties of macroeconomic forecasts. Upon estimating random forests on forecasts of four German research institutes, we document that optimality of longer-term inflation forecasts cannot be rejected and that inflation forecasts are weakly efficient. For shorter-term inflation forecasts, our results are heterogeneous across research institutes. When we pool the data across the research institutes, we reject optimality of both shorter-term and longer-term forecasts.
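Cast as code, the test looks roughly like the following sketch on simulated data: a random forest attempts to predict the sign of the forecast error from proxies of the information set, and out-of-sample accuracy near chance is consistent with optimality.

```python
# Forecast-optimality test as a classification problem (simulated data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
info = rng.normal(size=(200, 10))            # proxies for the information set
sign = (rng.random(200) > 0.5).astype(int)   # error sign; unpredictable here

clf = RandomForestClassifier(n_estimators=500, random_state=0)
acc = cross_val_score(clf, info, sign, cv=5, scoring="accuracy").mean()
print(f"out-of-sample accuracy: {acc:.3f} (chance = 0.5)")
# Accuracy significantly above 0.5 would indicate predictable,
# i.e. non-optimal, forecast errors.
```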
Article
The classification of mineralised zones is an important factor in the analysis of economic deposits. In this paper, a support vector machine (SVM), a supervised learning algorithm, is proposed for classifying mineralised zones in the Takht-e-Gonbad porphyry Cu-deposit (SE Iran) based on subsurface data. The effects of the input features on SVM performance are evaluated by calculating accuracy rates. Ultimately, the SVM model is developed with lithology, alteration, mineralisation, and level as input features and a radial basis function (RBF) kernel. The optimal values of the parameters λ and C, obtained via n-fold cross-validation, are 0.001 and 0.01, respectively. The accuracy of this model for classifying mineralised zones in the Takht-e-Gonbad porphyry deposit is 0.931. The results of the study confirm the efficiency of the SVM method for classifying mineralised zones.
Article
Identification of underground formation lithology from well log data is an important task in petroleum exploration and engineering. Recently, several computational algorithms have been used for lithology identification to improve the prediction accuracy. In this paper, we evaluate five typical machine learning methods, namely the Naïve Bayes, Support Vector Machine, Artificial Neural Network, Random Forest and Gradient Tree Boosting, for formation lithology identification using data from the Daniudui gas field and the Hangjinqi gas field. The input to each model consists of features selected from different well log data samples. To determine the best model to classify the lithology type, this study used validation curves to determine the parameter search range and adopted a hyper-parameter optimization method to obtain the best parameter set for each model. The performance of each classifier is also evaluated using 5-fold cross validation. The results suggest that ensemble methods are good algorithm choices for supervised classification of lithology using well log data. The Gradient Tree Boosting classifier is robust to overfitting because it grows trees sequentially by adjusting the weight of the training data distribution to minimize a loss function. The random forest classifier is also a suitable option. An evaluation matrix showed that the Gradient Tree Boosting and Random Forest classifiers have lower prediction errors compared with the other three models. Although all the models have difficulties in distinguishing sandstone classes, the Gradient Tree Boosting performs well on this task compared with the other four methods. Moreover, the classification accuracy is remarkably similar across the lithology classes for both the Random Forest and Gradient Tree Boosting models.
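A condensed sketch of this kind of comparison, with synthetic stand-ins for the log data and near-default hyperparameters rather than the tuned ones from the study:

```python
# Five classifiers scored with 5-fold cross-validation on synthetic log data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "ANN": make_pipeline(StandardScaler(),
                         MLPClassifier(max_iter=1000, random_state=0)),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Tree Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```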
Article
This paper proposes a Varying Coefficient Support Vector Machine (VCSVM). VCSMO, a variation of the classic SMO algorithm for standard SVMs, is also proposed to solve the VCSVM training problem. Numerical examples validate the accuracy and efficiency of the proposed model.
Article
Lithological mapping in mountainous regions is often impeded by limited accessibility due to relief. This study evaluates (1) the performance of different supervised classification approaches using remote sensing data and (2) the use of additional information such as geomorphology. We exemplify the methodology in the Bardi-Zard area in NE Iraq, part of the Zagros Fold-Thrust Belt, known for its chromite deposits. We highlight the improvement of remote sensing geological classification achieved by integrating geomorphic features and spatial information into the classification scheme. We performed a Maximum Likelihood (ML) classification alongside two Machine Learning Algorithms (MLA), Support Vector Machine (SVM) and Random Forest (RF), to allow the joint use of geomorphic features, Band Ratio (BR), Principal Component Analysis (PCA), spatial information (spatial coordinates) and multispectral data from the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) satellite. The RF algorithm showed reliable results and discriminated serpentinite; talus and terrace deposits; red argillites with conglomerates and limestone; limy conglomerates and limestone conglomerates; tuffites interbedded with basic lavas; limestone and metamorphosed limestone; and reddish-green shales. The best overall accuracy (∼80%) was achieved by the Random Forest (RF) algorithm in the majority of the sixteen tested dataset combinations.