SPE-192833-MS
Optimization of Models for Rapid Identification of Oil and Water Layers
During Drilling - A Win-Win Strategy Based on Machine Learning
Jian Sun and Qi Li, School of Petroleum Engineering, China University of Petroleum - Beijing; Mingqiang Chen and
Long Ren, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an; Fengrui Sun, School of Petroleum
Engineering, China University of Petroleum - Beijing; Yong Ai, Exploration and Development Institution Tarim Oil
Field - Korla; Kang Tang, School of Petroleum Engineering, Xi'an Shiyou University - Xi'an
Copyright 2018, Society of Petroleum Engineers
This paper was prepared for presentation at the Abu Dhabi International Petroleum Exhibition & Conference held in Abu Dhabi, UAE, 12-15 November 2018.
This paper was selected for presentation by an SPE program committee following review of information contained in an abstract submitted by the author(s). Contents
of the paper have not been reviewed by the Society of Petroleum Engineers and are subject to correction by the author(s). The material does not necessarily reflect
any position of the Society of Petroleum Engineers, its officers, or members. Electronic reproduction, distribution, or storage of any part of this paper without the written
consent of the Society of Petroleum Engineers is prohibited. Permission to reproduce in print is restricted to an abstract of not more than 300 words; illustrations may
not be copied. The abstract must contain conspicuous acknowledgment of SPE copyright.
Abstract
The identification of oil and water layers (OWL) from well log data is an important task in petroleum exploration and engineering. At present, the commonly used methods for OWL identification are time-consuming, have low accuracy, or depend heavily on the experience of researchers. Therefore, machine learning methods have been developed to identify lithology and OWL. Based on logging-while-drilling data, this paper optimizes machine learning methods to identify OWL while drilling.
Recently, several computational algorithms have been used for OWL identification to improve prediction accuracy. In this paper, we evaluate three popular machine learning methods: the one-against-rest support vector machine, the one-against-one support vector machine, and the random forest. First, we chose appropriate training set data as samples for model training. Then, the GridSearch method was used to find the approximate range of reasonable parameter values, and k-fold cross-validation was used to optimize the final parameters and avoid overfitting. Finally, appropriate test set data were chosen to verify the model.
This machine-learning-based method for identifying OWL while drilling has been successfully applied in the Weibei oilfield. We selected 1,934 groups of well logging response data from 31 production wells. Among them, 198 groups of LWD data were selected as the test set. Natural gamma, shale content, acoustic time difference, and deep induction logs were selected as input feature parameters. After GridSearch and 10-fold cross-validation, the results suggest that the random forest method is the best algorithm for supervised classification of OWL from well log data. On the training set, the accuracy of all three classifiers exceeds 90%, but the differences among them are relatively large. On the test set, the accuracy of all three classifiers is about 90%, with only small differences. The one-against-rest support vector machine classifier requires far more computation time than the other methods, and the one-against-one support vector machine classifier has the lowest training set and test set accuracies of the three.
Although the three methods differ in OWL identification accuracy, all achieve relatively high accuracy. For different reservoirs, weighing time cost against model accuracy, the random forest and one-against-one support vector machine models can be used to identify OWL in real time during drilling.
Key words: Machine learning, Identification of oil and water layers, Support vector machine, Random
forest, Optimization
Introduction
The development of oil and gas in unconventional reservoirs such as low-permeability reservoirs, tight
reservoirs, and shale reservoirs has become popular in global oil and gas development. The development
of oil and gas resources in these reservoirs is different from that in conventional reservoirs and often
requires more time and higher economic costs. Improvements in technology or methods in any one area can
bring about undeniable benefits (Sun Fengrui et al. 2017). In recent years, Logging While Drilling (LWD)
technology has been widely adopted in the drilling of unconventional reservoir horizontal wells. However,
LWD data are generally used to interpret reservoir lithology and to guide drilling and geosteering work; these data are applied less often to identifying the oil and water layers (OWL) encountered during drilling. OWL identification still relies mostly on traditional methods, such as the intersection chart method (Wan Qiao-sheng et al. 2017), the stripping method, and multi-curve joint qualitative identification combined with intersection charts (Song Peng et al. 2016). Currently, building on statistics and computer science, methods relying on machine learning theory have also been developed for identifying reservoir lithology and OWL (Zhong Yihua et al. 2009, Li Rong et al. 2009, Song Yanjie et al. 2007, Liu H. et al. 2009, Xiongyan Li et al. 2013, Yunxin Xie et al. 2018, Shaoqun Dong et al. 2016, Arsalan A. Othman and Richard Gloaguen 2017). Because of complex geological conditions and sedimentary environments, the relationship between reservoir heterogeneity and logging response characteristics is nonlinear, so linear logging response equations and statistical empirical formulas cannot effectively characterize the reservoir's true properties or meet actual production needs. The traditional intersection chart method depends directly on the experience of researchers and exhibits a certain degree of instability. Therefore, where conventional linear and empirical logging interpretation techniques perform insufficiently, nonlinear information processing technology can better reveal the distribution characteristics of OWL and meet the needs of oil and gas exploration and development. Artificial neural networks and support vector machines have been used to identify OWL. Although they can play an interpretive role, many problems remain: the artificial neural network method struggles with local optima, the curse of dimensionality, and small data samples (Ahmed Amara Konaté et al. 2015, Morteza Raeesi et al. 2012, Baouche Rafik et al. 2017, B. Shokooh Saljooghi et al. 2015); the support vector machine can overcome these shortcomings, but the classical SVM algorithm only provides binary classification. In practical data mining applications, it is often necessary to solve multi-category classification problems. Therefore, one-against-rest support vector machines (OVR SVMs), one-against-one support vector machines (OVO SVMs) (Hsu, C.-W. and Lin, C.-J. 2002), and random forest methods have emerged. These three methods can effectively avoid the deficiencies noted above; in particular, the random forest algorithm, composed of multiple decision trees, achieves higher training accuracy and better classification than a single decision tree and is less likely to overfit. However, no single classifier can be assumed sufficient, and each specific problem should be analysed on its own terms. Therefore, in this paper, the OVR SVMs classifier, the OVO SVMs classifier, and the random forest classifier are constructed from the characteristic data obtained from well logging. The target OWL categories are classified, the results of each classifier are analysed, and the optimal classifier and corresponding parameters are selected to solve the problem of accurately identifying OWL while drilling.
The principles and methods of support vector machine and random forest
The principle of support vector machine
The support vector machine (SVM) is developed from the optimal classification surface in the linearly separable case. Its core idea is that the optimal classification surface not only correctly separates the two classes of samples but also maximizes the margin between them. In practice, most problems encountered are nonlinear; such a problem is converted into a linear problem in a high-dimensional space through a nonlinear transformation, and the optimal classification surface is obtained in the transformed space (Neda Mahvash Mohammadi et al. 2018, Jaime Ortegon et al. 2018, Xiaoling Lu et al. 2018, Xiekai Zhang et al. 2017, Italo Zoppis et al. 2018). Suppose that in the nonlinear case the sample points are (x_i, y_i) (i = 1, …, n). In the high-dimensional space, the classification surface equation is formula (1), where φ(x) is a mapping function from the input space to the feature space, ω is a weight vector, and b is a constant term. The schematic diagram is shown in Figure 1. Under the constraint of formula (2), in which ξ_i are non-negative slack variables, the minimum of the function in formula (3) is sought. Through the Lagrangian optimization method, this can be converted into a dual problem, which reduces to a quadratic extremum problem: maximizing formula (4) under the constraint condition of formula (5), where a_i is a Lagrange multiplier and C is a constant that controls the degree of punishment of wrongly classified samples. Formula (6) is a kernel function that satisfies the Mercer condition. The optimal classification discriminant function obtained after solving the above problem is formula (7), where N is the number of support vectors.
$\omega \cdot \varphi(x) + b = 0 \quad (1)$

$y_i \left[ \omega \cdot \varphi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, n \quad (2)$

$\Phi(\omega, \xi) = \tfrac{1}{2} \lVert \omega \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad (3)$

$W(a) = \sum_{i=1}^{n} a_i - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j K(x_i, x_j) \quad (4)$

$\sum_{i=1}^{n} a_i y_i = 0, \quad 0 \le a_i \le C \quad (5)$

$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) \quad (6)$

$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} a_i y_i K(x_i, x) + b \right) \quad (7)$
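As a concrete illustration of formulas (6) and (7), the following minimal numpy sketch evaluates the discriminant of a trained binary SVM with a Gaussian kernel; the support vectors, multipliers a_i, labels y_i, and bias b are assumed to come from an already-solved dual problem (the placeholder names are ours, not the paper's).

```python
import numpy as np

def rbf_kernel(x, z, gamma=25.0):
    # Gaussian form of the kernel in formula (6): K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def svm_decision(x, support_vectors, a, y, b, gamma=25.0):
    # Formula (7): f(x) = sgn( sum over the N support vectors of a_i * y_i * K(x_i, x) + b )
    g = sum(a_i * y_i * rbf_kernel(sv, x, gamma)
            for sv, a_i, y_i in zip(support_vectors, a, y)) + b
    return np.sign(g)
```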
Figure 1—Support vector machine schematic diagram.
The multi-classification method of support vector machine
The SVM algorithm was originally designed for binary classification. When multi-category classification problems are encountered, corresponding multi-category classifiers must be constructed. Currently, there are two main approaches to constructing SVM multiclass classifiers. One is to modify the objective function directly, combining the parameter solutions of multiple classification planes into a single optimization problem that is solved "once". This approach is called the direct method. It appears simple, but its computational complexity is relatively high, it is difficult to implement, and it only suits small problems. The other approach combines multiple binary classifiers to construct a multiclass classifier; these are called indirect methods, typically OVO SVMs and OVR SVMs.
The OVR SVMs classifier treats the samples of one category as the positive class during training and groups all remaining samples into the negative class. For samples of k categories, k SVMs (i.e., k binary classifiers) are constructed: the i-th classifier separates class i from the rest, taking the i-th class of the training set as the positive class and all remaining classes as the negative class. At prediction time, the input is passed through all k classifiers to obtain k output values f_i(x) = sgn(g_i(x)). If exactly one output is +1, the corresponding class is the predicted class. However, the decision functions constructed in practice always contain errors: if more than one output is +1, or if no output is +1, the output values are compared and the largest one determines the predicted category. This method has an obvious deficiency: the positive class occupies only a small proportion of each training set, so the classifier is influenced by the remaining samples and can be biased.
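The decision rule just described can be written compactly as a small sketch; `decision_functions` is a hypothetical list standing in for the k trained binary SVMs' real-valued outputs g_i(x).

```python
import numpy as np

def ovr_predict(x, decision_functions):
    # Taking the argmax of the k scores g_i(x) covers all three cases above:
    # exactly one output +1, several outputs +1, or none.
    scores = np.array([g(x) for g in decision_functions])
    return int(np.argmax(scores))  # index of the winning class
```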
The OVO SVMs classifier designs an SVM between every pair of classes, so k classes require k(k-1)/2 SVMs. When classifying an unknown sample, each classifier judges its class and casts a vote for the corresponding class; the class with the most votes is assigned to the unknown sample. Voting proceeds as follows for four classes: let A = B = C = D = 0. (A, B) classifier: if A wins, then A = A + 1; otherwise, B = B + 1. (A, C) classifier: if A wins, then A = A + 1; otherwise, C = C + 1, and so on. (C, D) classifier: if C wins, then C = C + 1; otherwise, D = D + 1. The final decision is the maximum of (A, B, C, D). Although this method performs better than OVR SVMs, when the number of categories k is large, the number of models, k(k-1)/2, greatly increases the calculation time.
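A minimal sketch of this voting scheme, with a hypothetical `pairwise_classifiers` mapping each class pair to a trained binary classifier that returns the winning class:

```python
from collections import Counter
from itertools import combinations

def ovo_predict(x, classes, pairwise_classifiers):
    votes = Counter()
    for ci, cj in combinations(classes, 2):            # the k(k-1)/2 classifiers
        votes[pairwise_classifiers[(ci, cj)](x)] += 1  # winner of each pair gets one vote
    return votes.most_common(1)[0][0]                  # class with the most votes
```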
The principle of the random forest classification algorithm
Random forest is one of the most popular machine learning models. In the 1980s, Breiman et al. invented the classification tree algorithm (Breiman 1996), which repeatedly divides the data into two branches for classification or regression and greatly reduces the amount of computation. In 2001, Breiman combined classification trees into a random forest (Breiman 2001), which randomizes both the use of variables (columns) and the use of data (rows), generates many classification trees, and statistically summarizes their results. The results are robust to missing and unbalanced data, the method can appropriately handle the effects of up to thousands of explanatory variables, and it is hailed as one of the best algorithms available today (E. Vigneau et al. 2018, Michele Fratello et al. 2018, Robin Genuer et al. 2017, Christoph Behrens et al. 2018, Behnam Partopour et al. 2018). As the name implies, a random forest creates a forest in a random manner. The forest is composed of many decision trees, and the trees in a random forest are mutually independent. Once the forest is built, each decision tree judges a new input sample separately to decide which category it belongs to; the number of times each category is selected is then counted, and the most-selected category is the prediction for the sample. A decision tree is a tree structure (binary or non-binary) in which each non-leaf node represents a test of a characteristic attribute and each leaf node stores a category. The decision process of a decision tree is shown in Figure 2: starting from the root node, the corresponding feature attributes of the item being classified are tested, and output branches are selected according to the test results until a leaf node is reached. The decision result is the category stored in that leaf node.
Figure 2—Random forest schematic diagram.
In establishing each decision tree, two things need attention: sampling and complete splitting. The first step is two random sampling processes: the random forest samples the input data by rows and by columns. Row sampling is done with replacement, so the sample set obtained may contain duplicate samples. If there are N input samples, N samples are drawn, so each tree is trained on an incomplete subset of the data, which makes over-fitting relatively easy to avoid. Column sampling selects m features from the M total features (m < M), where M is the total number of features. The second step is to build a decision tree on the sampled data by complete splitting, so that a leaf node stops only when it cannot split further or when all of its samples belong to the same class. In general, the decision tree algorithm includes an important pruning step; here, because the two random sampling processes guarantee randomness, overfitting does not occur even without pruning.
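A minimal numpy sketch of the two samplings, assuming X is an (N, M) feature matrix and using m = sqrt(M), as adopted later in the paper:

```python
import numpy as np

def sample_for_tree(X, rng):
    n, M = X.shape
    rows = rng.integers(0, n, size=n)    # row sampling with replacement: duplicates allowed
    m = max(1, int(np.sqrt(M)))          # column sampling: m features out of M (m < M)
    cols = rng.choice(M, size=m, replace=False)
    return X[np.ix_(rows, cols)]

# Example: draw one bootstrap sample per tree with sample_for_tree(X, np.random.default_rng(0))
```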
The classification method of random forest
Each tree in the random forest is a binary tree generated by top-down recursive splitting: the training set is divided successively starting from the root node. In the binary tree, the root node contains all of the training data; following the principle of minimum node impurity, it splits into a left node and a right node, each containing a subset of the training data. Each node continues to split by the same rule until it meets the stopping rule and stops growing. Each decision tree thus learns a classification of its particular data, while random sampling ensures that repeated samples are classified by different decision trees, so the classification ability of different trees can be evaluated.
The specific steps for random forest classification are as follows (a minimal sketch follows the list):
1. From the original training data set, apply the bootstrap method to randomly draw k new sample sets with replacement and construct k classification trees; the samples never drawn constitute the k sets of out-of-bag data.
2. Assuming there are n features, randomly extract m features at each node of each tree. Calculate the amount of information contained in each feature, and select the feature with the highest classification ability for the node split.
3. Do not prune any tree; let each grow to its maximum size.
4. Let the many generated classification trees compose the random forest, and use the random forest classifier to classify new data. The classification result is determined by the number of votes cast by the tree classifiers.
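A minimal sketch of steps 1-4, built from scikit-learn's DecisionTreeClassifier (an assumption on our part; in practice sklearn's RandomForestClassifier encapsulates all four steps). X is assumed to be a numpy feature matrix and y an array of non-negative integer class labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, k=25, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(k):
        rows = rng.integers(0, len(X), size=len(X))         # step 1: bootstrap sample with replacement
        tree = DecisionTreeClassifier(max_features="sqrt")  # step 2: m random features per split
        tree.fit(X[rows], y[rows])                          # step 3: grown to maximum size, unpruned
        trees.append(tree)
    return trees

def forest_predict(trees, X_new):
    # Step 4: every tree votes; the majority class is the prediction.
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```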
This method has many advantages over other machine learning classification methods:
1. It has high accuracy.
2. It handles high-dimensional data.
3. The introduction of randomness makes the method less susceptible to overfitting, and the trained model has small variance and strong generalization ability.
4. The training can be highly parallelized, which is advantageous for training speed on large samples in the era of Big Data.
5. Compared with boosting methods such as AdaBoost and GBDT, the random forest is relatively simple to implement.
However, for most statistical modellers, the random forest is like a black box: they cannot control the internal operation of the model and can only adjust parameters and random seeds. For small or low-dimensional data (data with few features), the model may not produce a good classification. Compared with other machine learning classification methods, the training set accuracy of this method is often high, but the test set accuracy is not always correspondingly high.
The selection of model training data
The selection of the logging information type
Under current technical conditions, conventional well log data can usually be obtained through logging while drilling. There are many types of LWD information; by logging principle, they include electric, sonic, and nuclear logging data, among others. However, using more data types and feature parameters does not guarantee higher machine learning accuracy. Well log data usually contain a considerable amount of noise, which affects the machine learning identification results. Logs acquired under the same logging principle are highly correlated with one another, so if the data volume is too large or the characteristic parameters are strongly correlated, parameter redundancy will occur, increasing the machine learning time and even degrading model accuracy. Considering all of these factors, the natural gamma (GR), shale content (SH), acoustic time difference (AC), and deep induction (ILD) logs are selected as the input logging responses for identifying the reservoir oil and gas properties. A minimal redundancy check is sketched below.
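This kind of check can be done by inspecting pairwise correlations among the candidate logs, using pandas (assumed) on placeholder data; in practice the columns would be the real LWD responses:

```python
import numpy as np
import pandas as pd

# Placeholder values standing in for real LWD responses from the logging database.
logs = pd.DataFrame(np.random.rand(500, 4), columns=["GR", "SH", "AC", "ILD"])
print(logs.corr())  # strongly correlated pairs would signal parameter redundancy
```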
Logging data standardization
Data standardization (normalization) is a basic preprocessing task for machine learning classification. Different evaluation indicators often have different dimensions and units, which affects the results of data analysis; the data must be standardized to eliminate dimensional effects before indicators can be compared. After the original data have been standardized, every indicator is on the same order of magnitude, suitable for comprehensive comparative evaluation. Because each type of logging data has a different dimension and the numerical magnitudes differ widely, the original logging data must be standardized to eliminate these effects on the analysis results. Two data normalization methods are commonly used: min-max normalization and Z-score normalization. In this paper, min-max normalization, formula (8), is used to standardize the logging data.
$x^{*} = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (8)$
where $x_{\max}$ denotes the maximum value of the sample data and $x_{\min}$ the minimum value. After normalization, all logging data values lie in the interval [0, 1].
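Formula (8) as code, in a minimal numpy version (equivalent to scikit-learn's MinMaxScaler applied per log):

```python
import numpy as np

def min_max_normalize(x):
    # Formula (8): x* = (x - x_min) / (x_max - x_min), mapping values into [0, 1]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```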
The selection of the logging information training data
The sample data used in this paper consist of 1,934 groups of well logging response data from 31 production wells in the Chang 3 reservoir of the Weibei 2 well area. Among them, 198 groups of LWD data were selected as the test set. Combining the available well response data and following the well logging principles above, four types of logging response data are selected as the characteristic parameters: natural gamma-ray response (GR), shale content (SH), acoustic time response (AC), and deep induction response (ILD). Figure 3 shows the well logging curves for well WB2P27. The reservoirs are classified into four categories, i.e., oil-water layers, dry layers, water layers, and oil layers; these target categories are encoded as the numbers 1, 2, 3, and 4, respectively. The training data are standardized, and part of the data is shown in Table 1.
Figure 3—Well logging curves of well WB2P27.
Table 1—Partial training data.
Application of machine learning classification method in the identification of
OWL during drilling
Identifying OWL from logging response data is, in the final analysis, a nonlinear function mapping problem. The relationship between the logging response and the actual reservoir interval is complex, so this mapping is usually highly nonlinear. There are many types of logging response characteristics, and the target OWL categories usually number more than two. Therefore, support vector machine multiclass classifiers or random forest classifiers are an effective way to solve this complex problem.
The OVR SVMs classifier identifies OWL during the drilling process
1. The selection of the OVR SVMs kernel function and the training parameters
Common kernel functions include the polynomial kernel, the Gaussian kernel, and the linear kernel. After comparative analysis and study, the Gaussian kernel function is selected here to obtain the best model (Chang, Y.W. et al., 2010); furthermore, its flexibility is very high. We use the GridSearch method to select the approximate range of the optimal training parameters C and γ in the OVR SVMs, where $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. The GridSearch method is an exhaustive search over specified parameter values; the optimal learning algorithm is obtained by tuning the parameters of the estimation function through cross-validation. Cross-validation is a statistical analysis method used to verify the performance of a classifier and can avoid overfitting problems. There are three main types of cross-validation: (a) double cross-validation; (b) k-fold cross-validation; and (c) leave-one-out cross-validation.
Double cross-validation, also known as 2-fold cross-validation (2-CV), splits the data set into two equally sized subsets for two rounds of classifier training. In practice, 2-CV is not commonly used.
The main reason is that each training set then contains too few samples to represent the distribution of the parent sample, leading to a significant drop in the recognition rate in the test phase. Additionally, the subsets in 2-CV vary greatly, often failing to meet the requirement that the experimental process be reproducible.
K-fold cross-validation (k-CV) is an extension of double cross-validation. The data set is divided into k subsets; each subset is used once as the test set while the remaining k-1 subsets form the training set. The procedure is repeated k times, selecting a different subset as the test set each time, and the average recognition rate over the k folds is taken as the result. In this method, every sample appears in both training and test sets, and each sample is validated exactly once.
Leave-one-out cross-validation (LOOCV), which for a data set of n samples is simply n-CV, uses each single sample in turn as the test set, with the remaining n-1 samples as the training set. Almost all samples are used to train the model in each round, so the results of this method are closest to the distribution of the parent sample and the estimated generalization error is more reliable. LOOCV can be considered when the experimental data set is small. However, its computational cost is high, since the number of models to build equals the total number of samples; when the sample count is quite large, LOOCV becomes impractical unless each model trains very quickly, although parallel computation can reduce the required time.
Therefore, this paper uses k-fold cross-validation to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided. First, take C = [1000, 3000, 5000, 7000, 9000] and γ = [15, 20, 25, 30, 35], and search the 25 combinations of (C, γ) with the GridSearch method to locate the range of the optimal parameter values. The optimal values of C and γ are approximately 5000 and 25, respectively. Then, 10-fold cross-validation is used to perform a fine search around C = 5000 and γ = 25. Figure 4 shows the 10-fold cross-validation of parameter C at γ = 25, and Figure 5 shows the 10-fold cross-validation of parameter γ at C = 5000. Finally, the combination with the highest accuracy, (5000, 25), is selected as the optimal OVR SVMs training parameters. The training set accuracy is 0.93299, and training takes 760 s.
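A minimal sketch of this coarse GridSearch with 10-fold cross-validation, using scikit-learn (our assumption; the paper does not name its implementation). X and y are placeholders for the normalized training features (GR, AC, ILD, SH) and the OWL labels 1-4.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X = np.random.rand(200, 4)           # placeholder for the normalized training data
y = np.random.randint(1, 5, 200)     # placeholder for the OWL category labels

# The 25 coarse (C, gamma) combinations given in the text, scored by 10-fold CV.
param_grid = {"estimator__C": [1000, 3000, 5000, 7000, 9000],
              "estimator__gamma": [15, 20, 25, 30, 35]}
ovr = OneVsRestClassifier(SVC(kernel="rbf"))
search = GridSearchCV(ovr, param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```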
2. The validation of the test set data by the OVR SVMs classifier
The OVR SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.90909.
Figure 4—The 10-fold cross-validation of parameter C in OVR SVMs, γ=25.
Figure 5—The 10-fold cross-validation of parameter γ in OVR SVMs, C=5000.
The OVO SVMs classifier identifies OWL during the drilling process
1. The selection of the OVO SVMs kernel function and the training parameters
The Gaussian kernel function is again selected, and the GridSearch method is used to select the approximate range of the optimal training parameters C and γ in the OVO SVMs, where $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. Then, k-fold cross-validation is used to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided.
First, take C = [500, 1000, 1500, 2000, 2500] and γ = [5, 10, 15, 20, 25], and search the 25 combinations of (C, γ) with the GridSearch method. The optimal values of C and γ are approximately 500 and 20, respectively. Then, 10-fold cross-validation is used to perform a fine search around C = 500 and γ = 20. Figure 6 shows the 10-fold cross-validation of parameter C at γ = 20, and Figure 7 shows the 10-fold cross-validation of parameter γ at C = 900. Finally, the combination with the highest accuracy, (900, 20), is selected as the optimal OVO SVMs training parameters. The training set accuracy is 0.91753, and training takes 100 s.
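The corresponding sketch for OVO, under the same assumptions: scikit-learn's SVC uses one-against-one voting natively for multiclass data, so it can be tuned directly without a wrapper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# The 25 coarse (C, gamma) combinations given in the text; X, y as in the OVR sketch.
param_grid = {"C": [500, 1000, 1500, 2000, 2500],
              "gamma": [5, 10, 15, 20, 25]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```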
2. The validation of the test set data by the OVO SVMs classifier
The OVO SVMs classifier was used to classify LWD test data. The results are shown in Table 2.
The test accuracy is 0.88889.
Figure 6—The 10-fold cross-validation of parameter C in OVO SVMs, γ=20.
Figure 7—The 10-fold cross-validation of parameter γ in OVO SVMs, C=900.
The random forest classifier identifies OWL during the drilling process
1. The selection of the random forest classifier's n_estimators and max_features parameters
The n_estimators parameter is the number of trees in the forest. Larger is not always better: as the number of trees increases, the calculation time also increases, and the best predictive performance appears at a reasonable number of trees. The max_features parameter is the maximum number of features a single decision tree is allowed to use, i.e., the size of the randomly selected feature subset. The smaller the subset, the faster the variance decreases, but the faster the bias increases. In classification problems, max_features = sqrt(n_features) is usually taken (Behnam Partopour et al., 2018, E. Vigneau et al., 2018). The GridSearch method was used to select the approximate range of the optimal random forest training parameters n_estimators and max_features, and k-fold cross-validation was then used to optimize the objective function and find the best parameter values, so that the cross-validation accuracy is highest while over-fitting is avoided.
First, take n_estimators = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60] and max_features = [1, 2, 3, 4], and search the 48 combinations of (n_estimators, max_features) with the GridSearch method. The optimal values of n_estimators and max_features are approximately 25 and 1, respectively. Since max_features = sqrt(n_features) is usually taken, let max_features = 2. Then, 10-fold cross-validation is used to perform a fine search around n_estimators = 25. Figure 8 shows the 10-fold cross-validation of the parameter n_estimators at max_features = 2. Finally, the combination with the highest accuracy, (29, 2), is selected as the optimal random forest training parameters. The training set accuracy is 0.95361, and training takes 180 s.
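A sketch of the random forest tuning under the same scikit-learn assumption; the fitted model also exposes the per-feature importances reported in the next subsection.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The 48 coarse (n_estimators, max_features) combinations given in the text.
param_grid = {"n_estimators": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60],
              "max_features": [1, 2, 3, 4]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)  # X, y as in the earlier sketches
best_rf = search.best_estimator_
print(dict(zip(["GR", "AC", "ILD", "SH"], best_rf.feature_importances_)))  # column order assumed
```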
2. The validation of the test set data by the random forest classifier
The random forest classifier was used to classify the LWD test data. The results are shown in Table 2. The test accuracy is 0.89899. In addition, the random forest classifier reports the proportion each feature parameter contributes to the classification; the feature importances are [GR = 0.28552335, AC = 0.19097589, ILD = 0.19690391, SH = 0.32659686].
Figure 8—The 10-fold cross-validation of parameter n_estimators in random forest, max_features=2.
Comparison of classification results of LWD test data by three classifiers
The model training results and test results obtained by the three classifiers are shown in Table 2 (A = actual OWL category, B = identified OWL category).
Table 2—Training results and test results obtained by the three classifiers.
The parameters and operation results obtained by the three classifiers are compared and analyzed, as
shown in Table 3.
Table 3—Classification algorithm comparison table.
As can be seen from Table 3, the training set accuracy of all three classifiers exceeds 90%, but the differences among them are relatively large. On the test set, the accuracy of all three classifiers is about 90%, with only small differences. In terms of computation time, OVR SVMs consumes much more time than the other two classifiers. Considered comprehensively, the random forest classifier has the highest training set accuracy, its test set accuracy is only about 1% below that of OVR SVMs, and its computation time is only about a quarter of that of OVR SVMs. Therefore, the random forest classifier was selected to identify OWL during the drilling process.
Conclusions
This work presents an optimal model for rapidly identifying OWL while drilling. The identification results, recognition accuracy, and calculation time of OWL classification were obtained with three machine learning methods. Some meaningful conclusions are listed below:
a. For the parameter optimization of all three methods, the initial value range of the GridSearch is very important; it directly influences the accuracy and training time of the model.
b. After parameter optimization, the training set accuracy of the random forest is the highest: about 2% above OVR SVMs and about 4% above OVO SVMs.
c. The test set accuracies of the three methods are very close, all around 90%.
d. The calculation time of OVR SVMs is much larger than that of OVO SVMs and the random forest.
e. For the oilfield sample data selected in this paper, the random forest method is the optimal classification algorithm.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) (No. 51704235).
References
Ahmed Amara Konaté, Heping Pan, Sinan Fang, et al. Capability of self-organizing map neural network in geophysical
log data classification: Case study from the CCSD-MH. Journal of Applied Geophysics, 2015, Vol 118: 37–46. https://
doi.org/10.1016/j.jappgeo.2015.04.004.
Arsalan A. Othman, Richard Gloaguen. Integration of spectral, spatial and morphometric data into lithological mapping:
A comparison of different Machine Learning Algorithms in the Kurdistan Region, NE Iraq. Journal of Asian Earth
Sciences, 2017, Vol 146: 90–102. https://doi.org/10.1016/j.jseaes.2017.05.005.
Baouche Rafik, Baddari Kamel. Prediction of permeability and porosity from well log data using the nonparametric
regression with multivariate analysis and neural network, Hassi R'Mel Field, Algeria. Egyptian Journal of Petroleum,
2017, Vol 26: 763–778. https://doi.org/10.1016/j.ejpe.2016.10.013.
Behnam Partopour, Randy C. Paffenroth, Anthony G. Dixon. Random Forests for mapping and analysis of microkinetics
models. Computers & Chemical Engineering, 2018. https://doi.org/10.1016/j.compchemeng.2018.04.019.
Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.
B. Shokooh Saljooghi, A. Hezarkhani. A new approach to improve permeability prediction of petroleum reservoirs using
neural network adaptive wavelet (wavenet). Journal of Petroleum Science and Engineering, 2015, Vol 133: 851–861.
https://doi.org/10.1016/j.petrol.2015.04.002.
Chang, Y.W., Hsieh, C.J., Chang, K.W., et al, 2010. Training and testing low-degree polynomial data mappings via linear
SVM. J. Mach. Learn. Res. 11 (11), 1471–1490.
Christoph Behrens, Christian Pierdzioch, Marian Risse. Testing the optimality of inflation forecasts under flexible loss
with random forests. Economic Modelling, 2018, Vol 72: 270–277. https://doi.org/10.1016/j.econmod.2018.02.004.
E. Vigneau, P. Courcoux, R. Symoneaux, et al. Random forests: A machine learning methodology to highlight the volatile organic compounds involved in olfactory perception. Food Quality and Preference, 2018, Vol 68: 135–145. https://doi.org/10.1016/j.foodqual.2018.02.008.
Hsu, C.-W., Lin, C.-J., 2002. A comparison of methods for multiclass support vector machines. Trans. Neur. Netw. 13,
415–425.
Italo Zoppis, Giancarlo Mauri, Riccardo Dondi. Kernel Methods: Support Vector Machines. Reference Module in Life
Sciences, 2018. https://doi.org/10.1016/B978-0-12-809633-8.20342-7.
Jaime Ortegon, Rene Ledesma-Alonso, Romeli Barbosa, et al. Material phase classification by means of Support Vector Machines. Computational Materials Science, 2018, Vol 148: 336–342. https://doi.org/10.1016/j.commatsci.2018.02.054.
Li Rong, Zhong Yihua. Identification method of oil/gas/water layer based on least square support vector machine. Natural Gas Exploration & Development, 2009, 32(03): 15–18+72.
Liu H., Wen S., Li W., Xu C., Hu C. (2009) Study on Identification of Oil/Gas and Water Zones in Geological Logging
Base on Support-Vector Machine. Fuzzy Information and Engineering Volume 2. Advances in Intelligent and Soft
Computing, Vol 62. Springer, Berlin, Heidelberg.
Michele Fratello, Roberto Tagliaferri. Decision Trees and Random Forests. Reference Module in Life Sciences, 2018.
https://doi.org/10.1016/B978-0-12-809633-8.20337-3.
Morteza Raeesi, Ali Moradzadeh, Faramarz Doulati Ardejani, et al. Classification and identification of hydrocarbon reservoir lithofacies and their heterogeneity using seismic attributes, logs data and artificial neural networks. Journal of Petroleum Science and Engineering, 2012, Vol 82–83: 151–165. https://doi.org/10.1016/j.petrol.2012.01.012.
Neda Mahvash Mohammadi, Ardeshir Hezarkhani. Application of support vector machine for the separation of
mineralised zones in the Takht-e-Gonbad porphyry deposit, SE Iran. Journal of African Earth Sciences, 2018, Vol 143:
301–308. https://doi.org/10.1016/j.jafrearsci.2018.02.005.
Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot, et al. Random Forests for Big Data. Big Data Research, 2017, Vol 9: 28–46. https://doi.org/10.1016/j.bdr.2017.07.003.
Shaoqun Dong, Zhizhang Wang, Lianbo Zeng. Lithology identification using kernel Fisher discriminant analysis with
well logs. Journal of Petroleum Science and Engineering, 2016, Volume 143: 95–102. https://doi.org/10.1016/
j.petrol.2016.02.017.
Song Peng, Yang Weiguo, Sun Dong, et al. Oil-water-layer identifying method for reservoir Chang6 in Huaqing oilfield. Petroleum Geology and Oilfield Development in Daqing, 2016, 35(6): 144–147.
Song Yanjie, Zhang Jianfeng, Yan Weilin, et al. A new identification method for complex lithology with support vector machine. Journal of Daqing Petroleum Institute, 2007, 31(5): 18–20.
Sun Fengrui, Yao Yuedong, Chen Mingqiang, Li Xiangfang, Zhao Lin, Meng Ye, Sun Zheng, Zhang Tao, Feng Dong.
Performance analysis of superheated steam injection for heavy oil recovery and modeling of wellbore heat efficiency.
Energy, 2017, 125: 795–804.
Wan Qiao-sheng, Li Xue-ying, Zhao Yu-qiu, et al. Oil and water layer identification method of Gaotaizi reservoirs in Qijiabei area. Progress in Geophysics, 2017, 32(2): 0714–0720. doi:10.6038/pg20170236.
Xiaoling Lu, Fengchi Dong, Xiexin Liu, et al Varying Coefficient Support Vector Machines. Statistics & Probability
Letters, 2018, Vol 132: 107–115. https://doi.org/10.1016/j.spl.2017.09.006.
Xiekai Zhang, Shifei Ding, Yu Xue. An improved multiple birth support vector machine for pattern classification. Neurocomputing, 2017, Vol 225: 119–128. https://doi.org/10.1016/j.neucom.2016.11.006.
Xiongyan Li, Hongqi Li. A new method of identification of complex lithologies and reservoirs: task-driven data mining.
Journal of Petroleum Science and Engineering, 2013, Vol 109: 241–249. https://doi.org/10.1016/j.petrol.2013.08.049.
Yunxin Xie, Chenyang Zhu, Wen Zhou, et al. Evaluation of machine learning methods for formation lithology identification: A comparison of tuning processes and model performances. Journal of Petroleum Science and Engineering, 2018, Vol 160: 182–193. https://doi.org/10.1016/j.petrol.2017.10.028.
Zhong Yihua, Li Rong. Application of principal component analysis and least square support vector machine to lithology identification. Well Logging Technology, 2009, 33(05): 425–429.
Appendix A
Three training models' classification results of test sets
Table A—Three training models' classification results on the test set. The last three columns give the OWL type identified by the OVR SVMs, OVO SVMs, and random forest classifiers, respectively.
No.  GR  AC  ILD  SH  Actual type  OVR SVMs  OVO SVMs  Random forest
1 0.116089 0.096942 0.400179 0.05468 1 2 2 1
2 0.093133 0.051894 0.435171 0.05469 1 1 1 1
3 0.140357 0.050909 0.452366 0.05471 1 1 2 1
4 0.134357 0.064829 0.401235 0.05472 1 2 2 1
5 0.101213 0.104653 0.389918 0.05473 1 2 2 1
6 0.11059 0.028326 0.486268 0.05482 1 1 1 1
7 0.097519 0.069152 0.376002 0.05491 1 2 2 1
8 0.048411 0.081646 0.456296 0.05499 1 3 3 1
9 0.086996 0.027431 0.413717 0.05504 1 1 2 1
10 0.10145 0.075731 0.347666 0.05508 1 2 2 1
11 0.093321 0.052928 0.414261 0.05512 1 1 2 1
12 0.087399 0.078059 0.481719 0.05518 1 1 1 1
13 0.110548 0.214586 0.442463 0.05519 1 1 1 4
14 0.115368 0.097061 0.391812 0.05522 1 2 2 1
15 0.105354 0.108942 0.371031 0.05526 1 2 2 1
16 0.115801 0.061189 0.402327 0.0553 1 2 2 1
17 0.115259 0.064559 0.3861 0.05534 1 2 2 1
18 0.105459 0.088368 0.396466 0.05544 1 2 2 1
19 0.095664 0.061505 0.439703 0.05547 1 1 1 1
20 0.105674 0.069896 0.415734 0.05557 1 1 2 1
21 0.163836 0.067933 0.395874 0.12198 1 1 1 1
22 0.168097 0.057272 0.441259 0.12199 1 1 1 1
23 0.137618 0.070063 0.469064 0.12202 1 1 1 1
24 0.135075 0.116844 0.413593 0.12203 1 1 4 1
25 0.128039 0.062846 0.417837 0.12205 1 1 1 1
26 0.142556 0.061435 0.367942 0.12206 1 2 2 1
27 0.464709 0.09437 0.112175 0.12208 1 1 1 2
28 0.143162 0.032004 0.472009 0.12208 1 1 1 1
29 0.185951 0.048207 0.420951 0.12213 1 1 1 1
30 0.169695 0.061033 0.420158 0.1222 1 1 1 1
31 0.147916 0.031689 0.542651 0.12221 1 1 1 1
32 0.133118 0.0345 0.421555 0.12222 1 2 1 1
33 0.158063 0.099266 0.402537 0.12224 1 1 1 1
34 0.159824 0.087124 0.402147 0.12225 1 1 1 1
35 0.168219 0.05957 0.427848 0.12228 1 1 1 1
36 0.13614 0.133094 0.386076 0.12232 1 2 2 2
37 0.144019 0.054874 0.420568 0.12235 1 1 1 1
38 0.229379 0.095255 0.419888 0.12235 1 1 1 1
39 0.124975 0.091524 0.384267 0.12238 1 1 2 2
40 0.145918 0.06305 0.438382 0.12239 1 1 1 1
41 0.171598 0.078392 0.392307 0.12487 1 1 1 1
42 0.175802 0.073361 0.401294 0.12494 1 1 1 1
43 0.153868 0.044822 0.451974 0.125 1 1 1 1
44 0.161192 0.08632 0.417621 0.12507 1 1 1 1
45 0.122151 0.031818 0.506742 0.12509 1 1 1 1
46 0.144176 0.105396 0.395701 0.12512 1 1 1 2
47 0.122202 0.033715 0.504282 0.12517 1 1 1 1
48 0.155712 0.100281 0.397754 0.12526 1 1 1 2
49 0.169543 0.069907 0.418579 0.12534 1 1 1 1
50 0.14465 0.035174 0.440296 0.12544 1 1 1 1
51 0.12477 0.096546 0.387239 0.04789 2 1 1 2
52 0.105168 0.088048 0.371421 0.05707 2 2 2 1
53 0.127323 0.085472 0.376093 0.0571 2 2 2 1
54 0.115366 0.123103 0.336264 0.05718 2 2 2 1
55 0.09704 0.109209 0.356372 0.05722 2 2 2 1
56 0.111235 0.093313 0.355797 0.05722 2 2 2 1
57 0.113238 0.064385 0.38622 0.05725 2 2 2 1
58 0.095083 0.062337 0.372016 0.05726 2 2 2 1
59 0.130477 0.099633 0.388648 0.05729 2 2 2 1
60 0.145923 0.069724 0.424973 0.05732 2 1 2 1
61 0.09512 0.064524 0.373528 0.05732 2 2 2 1
62 0.127491 0.084555 0.385464 0.05734 2 2 2 1
63 0.103077 0.044248 0.393751 0.05736 2 2 2 1
64 0.126336 0.068057 0.401996 0.05739 2 2 2 1
65 0.116641 0.067068 0.357247 0.05744 2 2 2 1
66 0.109397 0.074987 0.411857 0.05748 2 1 2 1
67 0.115317 0.113172 0.355185 0.0691 2 2 2 2
68 0.098599 0.088842 0.380222 0.06912 2 2 2 2
69 0.118857 0.079711 0.389082 0.06913 2 2 2 2
70 0.142886 0.093396 0.379467 0.06918 2 2 2 2
71 0.120946 0.052307 0.388528 0.0692 2 2 2 2
72 0.129197 0.113914 0.363293 0.06928 2 2 2 2
73 0.101941 0.096244 0.358495 0.06928 2 2 2 2
74 0.101953 0.097231 0.351698 0.06931 2 2 2 2
75 0.094934 0.097809 0.374366 0.06933 2 2 2 2
76 0.15549 0.05827 0.435795 0.06935 2 2 2 2
77 0.133267 0.070657 0.38394 0.06936 2 2 2 2
78 0.118735 0.066187 0.390106 0.06937 2 2 2 2
79 0.142839 0.134442 0.359029 0.06949 2 2 2 2
80 0.111618 0.07402 0.376729 0.06949 2 2 2 2
81 0.135832 0.135151 0.391522 0.06954 2 2 2 2
82 0.074773 0.044555 0.385313 0.06957 2 3 2 3
83 0.102127 0.094061 0.35634 0.06962 2 2 3 2
84 0.093864 0.046194 0.392656 0.06962 2 2 2 2
85 0.092278 0.113633 0.365266 0.06968 2 2 2 2
86 0.106842 0.152879 0.346474 0.0697 2 2 2 2
87 0.09232 0.105724 0.363174 0.06975 2 2 2 2
88 0.134494 0.059594 0.369076 0.06975 2 2 2 2
89 0.119974 0.040733 0.402027 0.06978 2 2 2 2
90 0.082466 0.084187 0.355723 0.0698 2 2 2 2
91 0.150982 0.079781 0.403129 0.08513 2 2 2 2
92 0.152439 0.131453 0.371112 0.08516 2 2 2 2
93 0.091509 0.065507 0.363856 0.08521 2 2 2 2
94 0.135842 0.067661 0.386954 0.08523 2 2 2 2
95 0.122024 0.110761 0.340472 0.08523 2 2 2 2
96 0.12142 0.074461 0.359575 0.08523 2 2 2 2
97 0.156597 0.062139 0.420275 0.08528 2 2 2 2
98 0.113783 0.062481 0.39005 0.0853 2 2 2 2
99 0.126675 0.06132 0.350674 0.08531 2 2 2 2
100 0.11917 0.14549 0.360685 0.08533 2 2 2 2
101 0.119632 0.099883 0.362623 0.08541 2 2 2 2
102 0.136355 0.042065 0.356886 0.08541 2 2 2 2
103 0.162798 0.099153 0.369932 0.08543 2 2 2 2
104 0.073962 0.071837 0.357579 0.0855 2 2 2 2
105 0.12281 0.131907 0.358317 0.08552 2 2 2 2
106 0.121728 0.047518 0.370768 0.08571 2 2 2 2
107 0.119116 0.124355 0.393161 0.08573 2 2 2 2
108 0.136521 0.040915 0.368286 0.08577 2 2 2 2
109 0.128879 0.08755 0.382968 0.08577 2 2 2 2
110 0.122561 0.108296 0.365183 0.08578 2 2 2 2
111 0.144401 0.049313 0.413369 0.08581 2 2 2 2
112 0.153059 0.115833 0.350277 0.08591 2 2 2 2
113 0.128982 0.072302 0.393509 0.08594 2 2 2 2
114 0.11833 0.064652 0.396145 0.08596 2 2 2 2
115 0.142659 0.132677 0.392649 0.08597 2 2 2 2
116 0.110919 0.092135 0.347005 0.08613 2 2 2 2
117 0.122021 0.046748 0.368181 0.08617 2 2 2 2
118 0.104677 0.094635 0.353205 0.08627 2 2 2 2
119 0.114452 0.105109 0.352917 0.08629 2 2 2 2
120 0.135639 0.113166 0.374483 0.08632 2 2 2 2
121 0.104721 0.081748 0.391893 0.08634 2 2 2 2
122 0.111027 0.084587 0.314361 0.08634 2 2 2 2
123 0.129903 0.150066 0.384152 0.08634 2 2 2 2
124 0.14229 0.117594 0.390074 0.08643 2 2 2 2
125 0.119431 0.126168 0.366043 0.08646 2 2 2 2
126 0.144134 0.052972 0.370812 0.0865 2 2 2 2
127 0.114562 0.091461 0.356864 0.08651 2 2 2 2
128 0.144149 0.083559 0.362779 0.08652 2 2 2 2
129 0.12724 0.059415 0.409968 0.0866 2 2 2 2
130 0.108039 0.107024 0.385483 0.08662 2 2 2 2
131 0.051062 0.060244 0.383384 0.02585 3 3 2 3
132 0.051077 0.060568 0.39287 0.02587 3 3 3 3
133 0.05127 0.06005 0.394397 0.02614 3 3 3 3
134 0.033733 0.055272 0.419781 0.02617 3 3 3 3
135 0.051436 0.057852 0.393353 0.02636 3 3 3 3
136 0.057773 0.038152 0.432485 0.0264 3 3 3 3
137 0.051592 0.0608 0.388477 0.02658 3 3 3 3
138 0.034197 0.056715 0.375483 0.02682 3 3 3 3
139 0.052125 0.060198 0.388814 0.02732 3 3 3 3
140 0.052535 0.061115 0.395577 0.02788 3 3 3 3
141 0.052875 0.060713 0.390398 0.02835 3 3 3 3
142 0.052943 0.05922 0.394543 0.02845 3 3 3 3
143 0.052999 0.059598 0.381141 0.02853 3 3 3 3
144 0.053053 0.060329 0.404663 0.0286 3 3 3 3
145 0.053427 0.061409 0.40064 0.02912 3 3 3 3
146 0.054624 0.060629 0.396622 0.0308 3 3 3 3
147 0.060422 0.040511 0.437307 0.03144 3 3 3 3
148 0.055726 0.061998 0.349 0.03235 3 3 3 3
149 0.056036 0.05662 0.367496 0.03279 3 3 3 3
150 0.056158 0.062029 0.40193 0.03296 3 3 3 3
151 0.071272 0.034763 0.475042 0.05284 3 3 3 3
152 0.071311 0.038596 0.448853 0.05291 3 3 3 3
153 0.071519 0.053728 0.423849 0.05333 3 3 3 3
154 0.070045 0.061776 0.394562 0.0535 3 3 3 3
155 0.071885 0.043018 0.451844 0.05411 3 3 3 3
156 0.052679 0.053424 0.418452 0.05423 3 3 3 3
157 0.072095 0.038146 0.469532 0.05451 3 3 3 3
158 0.070783 0.060683 0.404831 0.05464 3 3 3 3
159 0.053039 0.053885 0.42146 0.05479 3 3 3 3
160 0.071277 0.061765 0.378674 0.05541 3 3 3 3
161 0.053471 0.048074 0.441664 0.05547 3 3 3 3
162 0.072567 0.055437 0.42764 0.05548 3 3 3 3
163 0.053618 0.053628 0.419293 0.0557 3 3 3 3
164 0.053938 0.051498 0.352532 0.0562 3 3 3 3
165 0.072105 0.058394 0.399199 0.0567 3 3 3 3
166 0.072506 0.053139 0.382619 0.05732 3 3 3 3
167 0.073522 0.039996 0.452193 0.05743 3 3 3 3
168 0.072867 0.056911 0.39388 0.05789 3 3 3 3
169 0.096077 0.13179 0.489616 0.03588 4 4 3 4
170 0.09694 0.122214 0.43374 0.03712 4 4 4 4
171 0.096937 0.1048 0.472465 0.03712 4 4 4 4
172 0.097463 0.123872 0.490084 0.03788 4 4 4 4
173 0.097599 0.118281 0.497154 0.03808 4 4 4 4
174 0.098222 0.126746 0.500226 0.03899 4 4 4 4
175 0.098872 0.121166 0.496639 0.03994 4 4 4 4
176 0.121332 0.242216 0.509814 0.04 4 4 4 4
177 0.099654 0.133744 0.467101 0.0411 4 4 4 4
178 0.122417 0.258801 0.477665 0.04133 4 4 4 4
179 0.513103 0.067942 0.046743 0.05671 4 4 4 4
180 0.10972 0.136331 0.482309 0.05671 4 4 4 4
181 0.134107 0.221344 0.364771 0.05691 4 4 4 4
182 0.109928 0.112903 0.420119 0.05705 4 4 4 4
183 0.110121 0.127888 0.467199 0.05736 4 4 4 4
184 0.510044 0.064315 0.047469 0.05773 4 4 4 4
185 0.537554 0.061087 0.047915 0.05836 4 4 4 4
186 0.110912 0.127901 0.426797 0.05866 4 4 4 4
187 0.111183 0.120148 0.447875 0.05911 4 4 4 4
188 0.501529 0.064159 0.048913 0.05979 4 4 4 4
189 0.111606 0.123194 0.435859 0.0598 4 4 4 4
190 0.541756 0.060424 0.048964 0.05986 4 4 4 4
191 0.523081 0.055139 0.049151 0.06013 4 4 4 4
192 0.54995 0.056965 0.049451 0.06057 4 4 4 4
193 0.112129 0.131924 0.475544 0.06067 4 4 4 4
194 0.112207 0.127374 0.448268 0.0608 4 4 4 4
195 0.545059 0.06457 0.049654 0.06086 4 4 4 4
196 0.112319 0.104572 0.481675 0.06098 4 4 4 4
197 0.112398 0.101544 0.413681 0.06111 4 4 4 4
198 0.538814 0.064755 0.049946 0.06129 4 4 4 4