Stock index return forecasting: semantics-based
genetic programming with local search optimiser
Mauro Castelli* and Leonardo Vanneschi
NOVA IMS,
Universidade Nova de Lisboa,
1070-312, Lisbon, Portugal
Email: mcastelli@novaims.unl.pt
Email: lvanneschi@novaims.unl.pt
*Corresponding author
Leonardo Trujillo
Departamento de Ingeniería Eléctrica y Electrónica,
Tree-Lab, Posgrado en Ciencias de la Ingeniería,
Instituto Tecnológico de Tijuana,
Tijuana, BC, Mexico
Email: leonardo.trujillo@tectijuana.edu.mx
Aleš Popovič
Faculty of Economics,
University of Ljubljana,
Kardeljeva Ploščad 17, 1000 Ljubljana, Slovenia
Email: ales.popovic@ef.uni-lj.si
Abstract: Making accurate stock price predictions is the pillar of effective decisions in
high-velocity environments since the successful prediction of future prices could yield significant
profit and reduce operational costs. Generally, solutions for this task are based on trend
predictions and are driven by various factors. To add to the existing body of knowledge, we
propose a semantics-based genetic programming framework. The proposed framework blends a
recently developed version of genetic programming that uses semantic genetic operators with a
local search method. To analyse the appropriateness of the proposed computational method for
stock market price prediction, we analysed data related to the Dow Jones index and to the
Istanbul Stock Index. Experimental results confirm the suitability of the proposed method for
predicting stock market prices. In fact, the system produces lower errors than existing
state-of-the-art techniques, such as neural networks and support vector machines.
Keywords: forecasting; financial markets; genetic programming; semantics; local search.
Reference to this paper should be made as follows: Castelli, M., Vanneschi, L., Trujillo, L. and
Popovič, A. (xxxx) ‘Stock index return forecasting: semantics-based genetic programming with
local search optimiser’, Int. J. Bio-Inspired Computation, Vol. X, No. Y, pp.xxx–xxx.
Biographical notes: Mauro Castelli received his Masters degree in Computer Science from the
University of Milano Bicocca, Milan, Italy in 2008 (Summa Cum Laude), and PhD from the
University of Milano Bicocca in 2012. Since 2013, he has been an Assistant Professor with
NOVA IMS, Universidade Nova de Lisboa, Lisbon, Portugal.
Leonardo Vanneschi received his Masters degree in Computer Science from the University of
Pisa, Pisa, Italy in 1996 (Summa Cum Laude), and PhD from the University of Lausanne,
Switzerland in 2004. He is an Associate Professor with NOVA IMS, Universidade Nova de
Lisboa, Lisbon, Portugal.
Leonardo Trujillo is a Research Professor at the Instituto Tecnológico de Tijuana (ITT) in
Mexico. His primary research areas are evolutionary computation, genetic programming,
computer vision and pattern recognition. He has an Engineering degree in Electronics and
Masters in Computer Science from ITT, and Doctorate in Computer Science from the CICESE
Research Center in Mexico.
Aleš Popovič is an Associate Professor in the Academic Unit for Business Informatics and
Logistics at the Faculty of Economics University of Ljubljana. He received his PhD in
Information Management from University of Ljubljana.
1 Introduction
Financial markets are considered an important pillar of our
economy. The exchange and trade that occurs on these
markets are based on the price value of a variety of financial
instruments, such as bonds, stocks, commodities, and funds.
It is the ability to predict the direction and changes in prices
for profitability that provides for a multibillion-dollar
industry. Predicting stock prices has long been an intriguing
challenge that has been extensively studied by researchers
from different fields. Forecasting stock market prices is
regarded as a challenging task, mostly due to various
uncertainties surrounding the movement of the market. In
fact, many factors add to the complexity of stock market
forecasting, including, but not limited to, political events,
general economic conditions, psychological reasons, and
traders’ expectations. While all these factors interrelate with
each other, predicting stock market prices is a complex and
resource-consuming undertaking. Considering the difficulty
of this task, any insights regarding future price performance
could secure huge profits in this market (Tay and Shen,
2002). Thus, proper forecast of this market is an important
factor for investors, buyers, sellers, fund managers, and
other stakeholders, as well as for researchers from the field.
Typically, as reported in Banik et al. (2014), in a stock
market, techniques employed to inform investment choices
fall into two broad categories: fundamental analysis and
technical analysis. The former technique is a complete
method that involves real and reliable information of a
firm’s financial report, economic conditions, and
competitive strength. This technique considers that a present
price depends on its fundamental value, expected return on
investment, and new information about the firm that will
collectively add to the fluctuation of a firm’s share value.
On the other hand, technical analysis is concerned with
market indicators representing the trend of price indices and
individual stocks. The common idea of these indicators is
that once a trend is in motion, it will persist in that path.
While fundamental and technical analysis have been the two
main techniques used to predict market trend and stock
prices, in recent years machine learning (ML) techniques
have shown their suitability in addressing the financial
markets forecasting task (Tay and Shen, 2002; Dash et al.,
2015; Patel et al., 2015). As reported in Shen et al. (2012),
well-known ML algorithms, like support vector machines
and reinforcement learning, have been shown to be
relatively effective in tracing the stock market and in
helping to maximise the profit of stock option purchases while, at
the same time, keeping the associated risk low (Huang et al.,
2005; Moody and Saffell, 2001). Some of these studies,
however, showed that ML techniques have certain inherent
limitations in learning the patterns underlying the stock
market prices due to tremendous noise and complex
dimensionality of stock market data. This results in
inconsistent and unpredictable performance on noisy
data, a phenomenon regularly present in today’s business
environments. In this vein, our research has been motivated
by the challenge to predict, as accurately as possible, the
stock market prices using historical data of share prices and
considering a one-week-ahead forecasting horizon. The
methodology used in the paper is based on a recently
defined variant of genetic programming, one of the most
successful existing computational intelligence methods.
Recently, genetic programming has obtained excellent
results on a large number of complex real-life applications
(Koza, 2010) and it has lately made an important
breakthrough: the definition of geometric semantic
operators (GSOs), new genetic operators that induce a
unimodal error surface on any supervised learning problem
(including forecasting). By eliminating local optima, GSOs
have a stronger problem-solving ability. Thus, they are an
excellent step towards the development of an optimal
forecasting model. However, much work has still to be done
in order to use GSOs in a complex application like the one
taken into account in this work. In particular, GSOs
converge to optimal solution(s) very slowly and this
behaviour is an important limitation in all the applications
characterised by the presence of a large amount of data.
Hence, we propose the definition of a system that combines
GSOs with a local search algorithm. The main idea in
combining GSOs and a local searcher is to couple the
exploration ability of GSOs with the exploitation ability of
the local searcher. In this way we expect to achieve optimal
solutions faster and to obtain a final model that does not
overfit the training data. To analyse the suitability of the
proposed computational method for stock market prices
forecasting, the proposed method has been applied to data
from the Dow Jones index and the Istanbul stock index.
The remainder of the paper is organised as follows:
Section 2 presents an overview of standard genetic
programming and shows its suitability for addressing
symbolic regression problems. Section 3 introduces the
components of the proposed system: the geometric semantic
genetic operators for genetic programming, the motivation
for the use of a local searcher, and the approach used in this
work. Section 4 provides insights into the employed dataset,
the experimental settings, and provides a detailed discussion
about the results obtained. Moreover, a comparison between
the results obtained by the proposed system and the results
achieved on the same dataset by other state-of-the-art
methods is presented. Finally, the conclusions summarise
and highlight the work’s main contributions.
2 Genetic programming
Genetic programming (GP) is one of the techniques that
belong to a larger computational intelligence research area
called evolutionary computation. GP consists in the
automated learning of computer programs by means of a
process inspired by biological evolution (Koza, 1992).
Generation by generation, GP stochastically transforms
populations of programs into new, hopefully improved,
populations of programs. The quality of a solution is
expressed by using an objective function (also called fitness
function). The search process of GP is graphically depicted
in Figure 1.
Figure 1 The GP algorithm
Hence, the recipe for solving a problem with GP is the
following:
1 Choose a representation space in which candidate solutions can be specified. This
consists of choosing the primitives of the programming language that will be used to
construct programs. A program is built up from a terminal set (the input variables of
the problem and, optionally, a set of constant values) and a function set (the
primitive operators).
2 Design the fitness criteria for evaluating the quality of a solution. This involves the
execution of a candidate solution on a suite of test cases, also referred to as the test
set, reminiscent of the process of black-box testing. In case of supervised learning, a
distance-based function is employed to quantify the divergence of a candidate's
behaviour from the desired one.
3 Design a parent selection and replacement policy. Central to every EA is the concept
of fitness-driven selection in order to exert an evolutionary pressure towards
promising areas of the program space. The replacement policy determines the way in
which newly created offspring programs replace their parents in the population.
4 Design a variation mechanism for generating offspring from a parent or a set of
parents. Standard GP uses two main variation operators: crossover and mutation.
Crossover recombines parts of the structure of two individuals, whereas mutation
stochastically alters a portion of the structure of an individual.
After a random initialisation of a population of computer programs, an iterative
application of selection-variation-replacement is employed to improve the programs'
quality, which can be seen as a step-wise refinement.
In order to transform a population into a new population of
candidate solutions, GP makes use of particular search
operators called genetic operators. Considering the common
tree representation of GP individuals, the standard genetic
operators (crossover and mutation) act on the structure of
the trees that represent the candidate solutions. In other
terms, standard genetic operators act on the syntax of the
programs. In this paper we used genetic operators that,
differently from the standard ones, are able to act at the
semantic level. The definition of semantics used in this
work is the one also proposed in Moraglio et al. (2012) and
will be presented in the following section.
However, to understand the differences between the
genetic operators used in this work and the ones used in the
standard GP algorithm, the latter are also briefly recalled.
The standard crossover operator is traditionally used to
combine the genetic material of two parents by swapping a
part of one parent with a part of the other. More in detail,
after choosing two individuals based on their fitness, the
crossover operator performs the following operations
1 selects a random subtree in each parent
2 swaps the selected subtrees between the two parents:
the resulting individuals are referred to as the offspring.
The mutation operator introduces random changes in the
structures of the individuals in the population. The most
well-known mutation operator, called sub-tree mutation,
works as follows:
1 it randomly selects a node in a tree
2 it removes the node and the subtree for which it is the
root
3 it inserts a randomly generated tree there.
This operation is controlled by a parameter that specifies the
maximum size (usually measured in terms of tree depth) for
the newly created subtree that is to be inserted.
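As a concrete illustration of these two operators, the following minimal Python sketch (our own, not code from the paper) implements subtree crossover and subtree mutation over trees encoded as nested lists; the encoding, function set and all names are assumptions of the sketch.

```python
# Hypothetical tree encoding: ['+', 'x', ['sin', 'x']] stands for x + sin(x).
import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2, 'sin': 1}   # arity of each primitive
TERMINALS = ['x', 1.0]

def random_tree(depth):
    """Grow a random tree of at most the given depth."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(FUNCTIONS))
    return [op] + [random_tree(depth - 1) for _ in range(FUNCTIONS[op])]

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs; a path indexes into nested lists."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace_subtree(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace_subtree(copy[path[0]], path[1:], new)
    return copy

def subtree_crossover(parent1, parent2):
    """Swap a random subtree of parent2 into a random point of parent1."""
    point, _ = random.choice(list(nodes(parent1)))
    _, donor = random.choice(list(nodes(parent2)))
    return replace_subtree(parent1, point, donor)

def subtree_mutation(parent, max_depth=3):
    """Replace a random subtree with a newly generated random tree."""
    point, _ = random.choice(list(nodes(parent)))
    return replace_subtree(parent, point, random_tree(max_depth))
```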
2.1 Symbolic regression with genetic programming
In symbolic regression, the goal is to search for the
symbolic expression TO : Rp → R that best fits a particular
training set T = {(x1, t1), ..., (xn, tn)} of n input/output pairs
with xi ∈ Rp and ti ∈ R. The general symbolic regression
problem can then be defined as

$$(T_O, \theta_O) = \operatorname*{arg\,min}_{T \in G,\; \theta \in \mathbb{R}^m} f\big(T(\mathbf{x}_i, \theta),\, t_i\big), \quad \text{with } i = 1, \ldots, n \tag{1}$$
where G is the solution or syntactic space defined by the
primitive set P (functions and terminals), f is the fitness
function based on the distance or error between a program’s
output T(xi, θ) and the expected, or target, output ti, and θ is
a particular parameterisation of the symbolic expression T,
assuming m real-valued parameters.
In standard GP, parameter optimisation is usually not
performed explicitly, since GP search operators only focus
on syntax. Therefore, the parameters are only implicitly
considered. However, recent works have begun to address
this issue, such as Z-Flores et al. (2014) where a nonlinear
numerical optimiser is used to tune the parameterisation of
the evolved programs, achieving substantial improvements
in terms of convergence speed and solution quality.
Let us consider the following hypothetical example to
grasp the importance of such a process. Imagine a GP
individual K with a syntax T(x) = x + sin(x) and the
following parameterisation: θ = (α1, α2, α3), with
T(x) = α1·x + α2·sin(α3·x). In a traditional GP, these
parameters are usually set to 1, which does not necessarily
lead to the best possible performance for this particular
syntax. Indeed, if the optimal solution is, for instance,
TO(x) = 3.3x + 1.003sin(0.0001x), then individual K might
be easily discarded by the search. On the other hand, a local
search process that performs a numerical optimisation of
these implicit parameters might be able to tune them,
and produce a substantial improvement in a program’s
performance, potentially improving the fitness assigned to
the above syntax.
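The following sketch makes the example concrete: fixing the syntax T(x) = α1·x + α2·sin(α3·x) and fitting the parameters numerically recovers values close to the hypothetical optimum mentioned above. This is only an illustration of the idea; scipy's curve_fit is one of many optimisers that could play the role of the local searcher.

```python
import numpy as np
from scipy.optimize import curve_fit

def T(x, a1, a2, a3):
    # fixed syntax, free (implicit) parameters
    return a1 * x + a2 * np.sin(a3 * x)

x = np.linspace(0, 10, 200)
t = 3.3 * x + 1.003 * np.sin(0.0001 * x)   # hypothetical target from the text

# start from the implicit all-ones parameterisation of standard GP
theta, _ = curve_fit(T, x, t, p0=[1.0, 1.0, 1.0])
print(theta)   # a1 approaches 3.3; a2 and a3 are only weakly identifiable here
```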
This is the view taken in this work, and while previous
works have applied parameter optimisation to a standard GP
search (Z-Flores et al., 2014), this work applies it to a new
GP variant based on geometric semantic genetic operators.
These operators are described in the next section.
3 Methodology
This section describes the components of the proposed
computational intelligence system used for financial market
return forecasting. In particular Section 3.1 describes the
geometric semantic operators and their properties, while
Section 3.2 presents the local search strategy that we used
with the GSOs.
3.1 Geometric semantic operators
Despite the large number of human-competitive results
achieved with the use of GP (Koza, 2010), researchers still
continue to develop new methods that improve the ability of
GP to produce high-quality solutions. In recent years, one of
the emerging ideas is to include the concept of semantics in
the evolutionary process performed by GP. While several
studies exist (e.g., Vanneschi et al., 2014a), the definition of
semantics is not unique and this concept is interpreted in
different ways from different perspectives (Vanneschi et al.,
2014a). In this work we use the most common and widely
accepted definition of semantics in GP literature. The
semantics of a program Ti is defined as the vector of outputs
si = [Ti(x1), Ti(x2), ..., Ti(xn)] obtained after executing the
program (or candidate solution) on a set of training data T
(Moraglio et al., 2012); when Ti represents a real-valued
function, then si ∈ Rn.
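In code, this notion of semantics is simply the program's output vector over the training inputs, as in the following sketch (our own naming, assuming numpy arrays):

```python
import numpy as np

def semantics(program, X):
    """Output vector of the program on the n training cases (rows of X)."""
    return np.array([program(x) for x in X])

X = np.random.rand(5, 3)                         # 5 fitness cases, 3 features
s = semantics(lambda x: x[0] + np.sin(x[1]), X)  # s lies in R^5
t = np.zeros(5)                                  # target semantics
fitness = np.linalg.norm(s - t)                  # distance in semantic space
```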
In this section, we briefly recall the definition of the
geometric semantic operators proposed by Moraglio et al.
(2012). The objective of GSOs is to define modifications on
the syntax of GP individuals that have a precise effect on
their semantics. The idea is to define transformations of the
syntax of GP individuals that correspond to well known
operators of genetic algorithms (GAs). In this way, GP
could ‘inherit’ the known properties of those GA operators.
Furthermore, contrary to what typically happens in
real-valued GAs or other heuristics, in the GP semantic
space the target point is also known (it corresponds to the
vector of expected output values in supervised learning) and
the fitness of an individual is simply given by the distance
between the point si it represents in the semantic space and
the target point t. It was shown in Moraglio et al. (2012)
that when fitness is defined in this way it induces a
unimodal error surface. The real-valued GA operators that
we want to ‘map’ into the GP semantic space are geometric
crossover and ball mutation. In real-valued GAs, geometric
crossover produces an offspring that lies on the segment that
joins the parents. It was proven in Krawiec and Lichocki
(2009) that in cases where the fitness is a direct function of
the distance to the target (like the case we are interested in
here) this offspring cannot have a worse fitness than the
worst of its parents. Ball mutation consists of a random
perturbation of the coordinates of an individual. Figure 2
shows a graphical representation of the mapping between
the syntactic and semantic space produced by geometric
semantic operators. The definitions of the operators that
correspond to geometric crossover and ball mutation in the
GP semantic space, as given in Moraglio et al. (2012), are
the following:
Definition 3.1: Geometric semantic crossover (GSC).
Given two parent functions T1, T2 : Rn → R, the
geometric semantic crossover returns the real function
TXO = (T1 · TR) + ((1 − TR) · T2), where TR is a random real
function whose output values range in the interval [0, 1].
Definition 3.2: Geometric semantic mutation (GSM). Given
a parent function T : Rn → R, the geometric semantic
mutation with mutation step ms returns the real function
TM = T + ms · (TR1 − TR2), where TR1 and TR2 are random real
functions.
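A direct, if naive, way to realise these two definitions is to build offspring as compositions of the parent functions, as in the sketch below (our own illustration; wrapping TR in a sigmoid to constrain its outputs to [0, 1] is a common convention, not something prescribed by the definitions above):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))   # bounds a random function's output to (0, 1)

def gsc(t1, t2, tr):
    """Geometric semantic crossover: TXO = (T1 * TR) + ((1 - TR) * T2)."""
    return lambda x: t1(x) * sigmoid(tr(x)) + (1 - sigmoid(tr(x))) * t2(x)

def gsm(t, tr1, tr2, ms=0.1):
    """Geometric semantic mutation: TM = T + ms * (TR1 - TR2)."""
    return lambda x: t(x) + ms * (tr1(x) - tr2(x))
```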
Figure 3 shows an example of application of GSC to two
arbitrary trees T1 and T2 [represented in plots 3(a) and 3(b)
respectively], using a random tree TR [represented in
plot 3(c)]. The offspring generated by this crossover is
shown in plot 3(d).
Hereafter, GP that uses geometric semantic operators
will be referred to as geometric semantic GP (GSGP). An
important drawback of GSGP, pointed out by Moraglio et
al., is that geometric semantic operators create much larger
offspring than their parents and that the fast growth of the
individuals in the population rapidly makes fitness
evaluation unbearably slow, making the system unusable.
Moreover, while this growth produces fitter solutions, it is
responsible for creating models that are too specialised on
training data, hence potentially generating overfitting. In
Castelli et al. (2015), a possible workaround to the problem
related to the slowness of the fitness evaluation process was
proposed, consisting in an implementation of these
operators that makes them not only usable in practice, but
also very efficient. Basically, this implementation is based
on the idea that, besides storing the initial trees, at every
generation it is enough to maintain in memory, for each
individual, its semantics and a reference to its parents. As
shown in Castelli et al. (2015), the computational cost of
evolving a population of n individuals for g generations is
O(ng), while the cost of evaluating a new, unseen, instance
is O(g). Hence, the system can be efficiently used to address
problems characterised by a large amount of data. This is
the implementation used in this work.
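A minimal sketch of that bookkeeping idea follows (our own simplification of the scheme in Castelli et al., 2015): each individual stores only its semantics and references to the structures it was built from, so offspring are created in constant time per training case and no offspring tree is ever materialised.

```python
import numpy as np

class Individual:
    """Holds a semantics vector plus references sufficient to rebuild the tree."""
    def __init__(self, semantics, parents=None, op=None, ms=None):
        self.semantics = semantics   # outputs on the training set
        self.parents = parents       # references to parent individuals
        self.op, self.ms = op, ms    # which operator produced it, and its step

def gsm_offspring(parent, r1, r2, ms):
    """GSM offspring: semantics computed directly, no tree is built."""
    child_sem = parent.semantics + ms * (r1.semantics - r2.semantics)
    return Individual(child_sem, parents=(parent, r1, r2), op='GSM', ms=ms)
```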
Figure 2 Geometric semantic crossover [plot (a)] and, respectively, geometric semantic mutation [plot (b)] perform a transformation
of the syntax of the individual that corresponds to geometric crossover (respectively, geometric mutation) in the semantic space
Figure 3 Two parents T1 and T2 [plots (a) and (b), respectively], one random tree TR [plot (c)], and the offspring of the crossover
between T1 and T2 using TR [plot (d)]
3.2 Local search in GP
This section will first discuss previous approaches to
applying a local search strategy during a GP run. All of
them were developed for standard GP, and most of them
focused on symbolic regression problems. Afterwards, the
main contribution of this paper is presented, i.e., the first
integration of a local searcher within GSGP.
3.2.1 Local search in standard GP
Many works have studied how to combine an evolutionary
algorithm with a local optimiser so far (also referred to as a
refinement process). In general, such approaches are
considered to be a simple type of memetic search (Chen et
al., 2011). The basic idea is straightforward: include within
the optimisation process an additional search operator that
takes an individual (or several) as an initial point and
searches for the local optima around it. Such a strategy can
help ensure that the local region around each individual is
fully exploited. However, there can be some negative
consequences to such an approach. The most evident is the
computational overhead: while the cost of a single LS might
be negligible, performing it on every individual might
become inefficient. Second, LS can produce overfitted
solutions, stagnating the search on local optima. These
issues aside, these techniques have produced impressive
results in a variety of scenarios, some of which are reviewed
by Chen et al. (2011).
A noteworthy aspect of this survey is an almost
complete lack of papers that deal with GP. Of the more than
two hundred papers covered by Chen et al., in fact, only a
couple deal with memetic GP. This indicates that the GP
community may have not addressed the topic adequately.
Some examples are Wang et al. (2011) and Eskridge and
Hougen (2004), which present domain-specific memetic
approaches that are not intended for LS in symbolic
regression with GP. In fact, LS in GP can be performed in
two ways, in the syntactic space or in the parameter space as
defined in equation (1). Regarding the former, in Azad and
Ryan (2014) the authors proposed a syntactic LS operator
that performs a greedy point mutation, with promising
results on several benchmarks. Regarding the latter
approach, the complete optimisation problem defined in
equation (1) has not received much attention.
In Topchy and Punch (2001), gradient descent is used to
optimise numerical constants within a GP tree, achieving
good results on five symbolic regression problems.
Similarly, in Zhang and Smart (2004) and Graff et al.
(2013) a LS algorithm is used to optimise the value of
constant terminal elements. In Zhang and Smart (2004)
gradient descent is used and tested on classification
problems, while Graff et al. (2013) uses resilient
backpropagation and evaluates the proposal on a real-world
problem, in both cases leading towards improved results. In
Smart and Zhang (2004), the authors include weight
parameters for each function node, which the authors call
inclusion factors; these weights modulate the importance
that each node has within the tree. Indeed, the authors
identify what we are here referring to as implicit program
parameters, and optimise these values by applying gradient
descent on all trees. The authors also propose a series of
new search operators that explicitly consider the
parameterisation of each GP tree.
In a recent work (Z-Flores et al., 2014), this problem
was addressed by implementing a very simple
parameterisation of the tree, by constraining the number of
internal parameters of each tree regardless of its size.
Several different strategies were compared to determine
when a local optimiser should be applied, showing that it is
often best to apply it on either all the population or a subset
of the best individuals. The LS used is called trust
region optimisation (Sorensen, 1982), and results showed
substantial improvements in performance compared with
standard GP search on several benchmark and real-world
problems. A similar approach was developed by Kommenda
et al. (2013), with two noteworthy differences. First,
parameters replace all constants present on a given tree, and
each GP tree is enhanced by adding an artificial root tree
that effectively adds a weight coefficient and a bias to the
entire tree, then the Levenberg-Marquardt optimiser is used
to find the optimal values for these parameters. Second, the
authors apply constant optimisation to the population using
different probabilities, as well as a strict offspring selection
variant for comparison.
3.2.2 Local search in geometric semantic operators
The goal of this work is to integrate a LS strategy within
GSGP. In particular, as an initial proposal, we include a
local searcher within the GSM mutation operator, since
previous works have shown that GSGP achieves its best
performance using only mutation during the search
(Vanneschi et al., 2014b). In particular, the GSM with LS
(GSM-LS) of a tree T generates an individual:
$$T_M = \alpha_0 + \alpha_1 \cdot T + \alpha_2 \cdot (T_{R1} - T_{R2}) \tag{2}$$

where αi ∈ R; notice that α2 replaces the mutation step
parameter ms of the geometric semantic mutation (GSM).
This in fact defines a basic multivariate linear regression
problem, which could be solved, for example, by ordinary
least squares (OLS) regression. However, in this case we
have n linear equations, the number of fitness cases, and
only three unknowns (the αi). This gives an over-
determined multivariate linear fitting problem, which can be
solved through SVD (in this work, the GNU Scientific
Library available at http://www.gnu.org/software/gsl/ is
used). We argue that this should be seen as a LS operator
that attempts to determine the best linear combination of the
parent tree and the random trees used to perturb it, which is
local in the sense of the linear problem posed by the GSM
operator. It should not be seen as a LS in the entire semantic
space, since in that case the LS would necessarily converge
to the optimum in this unimodal landscape.
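A sketch of that fit (our own, using numpy; np.linalg.lstsq solves the overdetermined system with an SVD-based routine, mirroring the GSL solver mentioned above):

```python
import numpy as np

def gsm_ls_coefficients(parent_sem, tr1_sem, tr2_sem, target):
    """Solve equation (2) for (a0, a1, a2) by least squares."""
    A = np.column_stack([np.ones_like(parent_sem),   # intercept a0
                         parent_sem,                 # a1 multiplies T
                         tr1_sem - tr2_sem])         # a2 multiplies (TR1 - TR2)
    alphas, *_ = np.linalg.lstsq(A, target, rcond=None)
    return alphas   # a2 plays the role of the mutation step ms
```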
Figure 4 A graphical representation of (a) GSM and (b) GSM-LS
To illustrate how GSM and GSM-LS differ, a graphical
description of each method is provided in Figure 4. First,
Figure 4(a) shows a contour plot of the semantic space, the
space of all possible program outputs, with the highest
fitness peak at the desired program output t. Also, the
semantics of a single GP tree is depicted as s, a circle
around s is the area in which the semantics s′ of the
offspring generated by GSM will lie, where the radius of the
circle is determined by the mutation step ms. Notice that
GSM can, in some cases, generate offspring with semantics
that are farther away from t than the parent, with lower
fitness. This can slow down the convergence speed of the
search.
Instead, GSM-LS will always produce offspring that
have a better fitness than the parent, by forcing the
geometric mutation to always move in the direction of the
known goal of the search, as depicted in Figure 4(b).
This approach is similar to two previously proposed
approaches. First, the linear fitting problem is reminiscent
of the linear scaling procedure proposed in Keijzer (2003),
which allows GP to fit the form of the desired output
without necessarily optimising the scale or bias. However,
in that case, the scaling process is only used to adjust the
fitness value of each individual, while the search operators
used are standard ones. Second, and more closely related to
this work, the non-isotropic Gaussian mutation proposed in
Moraglio and Mambrini (2013), that is used to perform a
run-time analysis of GSGP. However, the mutation
proposed in that work considers a fixed set of basis
functions instead of randomly generated GP trees, and
perturbs the linear combination with Gaussian-noise instead
of providing the best fit coefficients. Finally, the work
presented by Krawiec and O’Reilly (2014) also uses a
multivariate linear regression approach to optimise evolved
solutions, with several key differences. Particularly, the
search is conducted by standard GP, not GSGP, and each
tree is decomposed into a set of subtrees which are then
linearly combined. The method is much more explorative
than the one presented here.
Moreover, the approach we propose contrasts with
previous work (Z-Flores et al., 2014), which relied on a
nonlinear local optimiser, since the linear assumption is
mostly not satisfied by the expression evolved with standard
GP and the corresponding parameterisation. Instead, in this
new approach, it is simple to apply an optimiser based on a
linear regression, given that the GSM operator defines a
linear expression in parameter space.
The idea of including a LS method is based on a very
simple observation related to the properties of the geometric
semantic operators: while these operators are effective in
achieving good performance with respect to standard
syntax-based operators, they require many generations to
converge to optimal solutions. By including a local search
method, we expect to improve the convergence speed of the
search algorithm and to obtain better performance with
respect to the algorithm that only uses the geometric
semantic operators. Moreover, by speeding up the search
process, it will be possible to limit the construction of
over-specialised solutions that, in the end, would overfit the
data.
4 Experimental study
In Section 4.1 we present the data we have used. In
Section 4.2 we describe the experimental settings we
employed, including all the parameters of the systems we
have studied to allow the interested reader to fully replicate
our work. Finally, in Section 4.3 we discuss the obtained
results.
Table 1 Features in the considered datasets

  date: the last business day of the week (this is typically Friday); a number from 1 to 5
  open: the price of the stock at the beginning of the week
  high: the highest price of the stock during the week
  low: the lowest price of the stock during the week
  close: the price of the stock at the end of the week
  volume: the number of shares of stock that traded hands in the week
  percent_change_price: the percentage change in price throughout the week
  percent_change_volume_over_last_week: the percentage change in the number of shares of a stock that traded hands for this week compared to the previous week
  previous_weeks_volume: the number of shares of stock that traded hands in the previous week
  days_to_next_dividend: the number of days until the next dividend
  percent_return_next_dividend: the percentage of return on the next dividend
  next_weeks_price: the price of the stock in the following week
4.1 Data description
To assess the suitability of the proposed system, we
considered data related to the Dow Jones index and to the
Istanbul stock index. In particular, for the Dow Jones
dataset (also used in the study by Brown et al., 2013), the
training data came from the first quarter of 2011 and the test
data came from the second quarter of 2011. For the Istanbul
index dataset, the training data came from the first semester
of 2012 and the test data came from the second semester of
2012. Each record (instance) represents data from a single
week. The objective is to predict the stock price at the end
of week w + 1 considering a set of variables (reported in
Table 1) related to the stock in week w.
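A hedged sketch of how such a supervised dataset can be assembled (column names follow Table 1; the file name and CSV layout are assumptions of this example):

```python
import pandas as pd

df = pd.read_csv('dow_jones_weekly.csv')   # hypothetical file, one row per week
feature_cols = [c for c in df.columns if c != 'next_weeks_price']
X = df[feature_cols]            # variables describing week w (Table 1)
y = df['next_weeks_price']      # target: stock price at the end of week w + 1
```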
4.2 Experimental settings
Two different GP systems were compared: GSGP that only
uses the original GSM operator; and a HYBRID algorithm
that uses both the GSM operator and the proposed GSM-LS
operator.
All the runs used populations of 200 individuals evolved
for 300 generations. Tree initialisation was performed with
the Ramped Half-and-Half method (Koza, 1992) with a
maximum initial depth equal to 6. The function set
contained the arithmetic operators, including protected
division as in Koza (1992). The terminal set contained
12 variables, each one corresponding to a different feature
in the datasets (including the target), these are summarised
in Table 1. Mutation has been used with probability 1.
Survival from one generation to the following one was
always granted to the best individual of the population
(elitism). In GSM a random mutation step has been
considered in each mutation event as suggested in
Vanneschi et al. (2014b). Regarding the HYBRID system,
GSM-LS has been used in the first 20 generations, while in
the remaining generations we considered the standard GSM
operator. We decided to limit the number of generations
where the local search was used to avoid overfitting the
training data.
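For reference, the run configuration described above can be summarised as a plain parameter dictionary (a sketch; the cited GSGP implementation reads an analogous set of parameters from its own configuration format):

```python
config = {
    'population_size': 200,
    'generations': 300,
    'init_method': 'ramped_half_and_half',           # Koza (1992)
    'max_init_depth': 6,
    'functions': ['+', '-', '*', 'protected_div'],   # protected division as in Koza (1992)
    'mutation_probability': 1.0,
    'elitism': 1,                  # best individual always survives
    'mutation_step': 'random',     # resampled at each mutation event
    'gsm_ls_generations': 20,      # GSM-LS only in the first 20 generations
}
```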
For all the systems under consideration, we analysed the
performance obtained according to two different error
measures. In particular, these two measures are the mean
absolute error (MAE) and the mean square error (MSE).
The definitions of these error measures are the following:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i \in Q} |t_i - y_i| \tag{3}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i \in Q} (t_i - y_i)^2 \tag{4}$$
where yi = T(xi) is the output of the GP individual T on the
input data xi and ti is the target value for the instance xi. N
denotes the number of samples in the training or testing
subset, and Q contains the indices of that set.
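Both measures translate directly into code; the following implementations of equations (3) and (4) assume t and y are numpy arrays already restricted to the indices in Q:

```python
import numpy as np

def mae(t, y):
    """Mean absolute error, equation (3)."""
    return np.mean(np.abs(t - y))

def mse(t, y):
    """Mean square error, equation (4)."""
    return np.mean((t - y) ** 2)
```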
In the next section, the experimental results obtained are
reported using plots of the median error on the training and
test set. In particular, in each generation the best individual
in the population (i.e., the one that has the smallest training
error) has been chosen and the value of its error on the
training and test sets has been stored. The reported curves
finally contain the median of all these values collected at the
end of each generation. The median was preferred over the
mean in the reported plots because of its higher robustness
to outliers.
The results discussed in the next section have been
obtained using the GSGP implementation freely available at
http://gsgp.sourceforge.net and documented in Castelli et al.
(2015).
4.3 Experimental results
Figures 5 and 6 report, for the datasets taken into account,
training and test error (MAE and MSE) for the considered
GP systems against generations. For all the considered GP
systems 30 runs have been performed.
Figure 5 Dow Jones dataset
Notes: Training and test error: MAE and MSE. The plots show the median over 30 runs.
Figure 6 Istanbul dataset
Notes: Training and test error: MAE and MSE. The plots show the median over 30 runs.
We start the discussion of the results shown in the plots by
considering the performance on the Dow Jones dataset.
Figure 5 clearly shows that HYBRID outperforms GSGP on
both the training and the test set. In particular, it is possible
to note the fast convergence of the proposed algorithm as
well as the fact that the final model does not overfit the
training data.
Regarding the Istanbul dataset the situation is slightly
different: both GP systems under consideration are able to
converge to good quality solutions in a small number of
generations [Figures 6(a) and 6(c)]. Nevertheless, once again,
the HYBRID system is able to reach good quality solutions
in a smaller number of generations with respect to GSGP.
More interesting is the performance on unseen instances
[Figures 6(b) and 6(d)]. In this case, both GP systems overfit
the training data. When MAE is considered as an error
measure, it is possible to see that both GP systems reach
their lowest test error in less than 20 generations, while in the
remaining generations overfitting starts appearing. At the
end of the evolutionary search process, the two GP systems
present a comparable test error. When the MSE is used as
an error measure, overfitting is even more evident.
Nonetheless, note that the HYBRID method is able to
produce solutions with a test error that is lower than the one
achieved with the GSGP system.
To analyse the statistical significance of these results, a
set of tests has been performed on the median errors. In
particular, we wanted to assess whether the final results
(generation 300), produced by the considered GP systems,
were statistically significantly different. As a first step, the
Shapiro-Wilk test (with α = 0.1) has shown that the data are
not normally distributed and hence a rank-based statistic has
been used. Successively, the Wilcoxon rank-sum test for
pairwise data comparison has been used (with α = 0.1)
under the alternative hypothesis that the samples do not
have equal medians. The p-values obtained are reported in
Table 2.
Table 2 p-values obtained in the statistical validation procedure

              MAE                   MSE
              Training    Test      Training    Test
  Dow Jones   0.003       0.543     0.091       0.04
  Istanbul    0.2549      0.408     0.116       0.07

Note: Comparison between results produced by GSGP and HYBRID on training and test set for the two considered error measures.
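The validation procedure just described can be reproduced along these lines (a sketch with placeholder data standing in for the 30 final errors of each system):

```python
import numpy as np
from scipy.stats import shapiro, ranksums

gsgp_errors = np.random.rand(30)     # placeholder: final test errors, one per run
hybrid_errors = np.random.rand(30)   # placeholder

alpha = 0.1
normal = all(shapiro(e).pvalue > alpha for e in (gsgp_errors, hybrid_errors))
if not normal:   # as in the text, fall back on a rank-based statistic
    stat, p = ranksums(gsgp_errors, hybrid_errors)
    print('significant' if p < alpha else 'not significant', p)
```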
According to the p-values, for the Dow Jones dataset we can
clearly state that HYBRID produces solutions that are
significantly better (i.e., with lower error) than GSGP both
on training and test data when the MSE is used as a measure
of error. When the MAE is the considered error measure the
differences between the two techniques are statistically
significant only taking into account the results on the
training set. Regarding the Istanbul dataset, HYBRID
produces solutions that are significantly better than GSGP
on test data when the MSE is used as a measure of error. In
all the remaining cases, the differences between the quality
of the solutions produced by the two GP systems are not
statistically significant.
While this comparison is interesting to obtain
information about the behaviour of the two systems, it is
important to notice that the parameter settings used in the
experimental phase favour GSGP. More in detail, it is
possible to notice that HYBRID is able to achieve in a few
generations better fitness values than the ones achieved by
GSGP in 300 generations. This is an important aspect to
consider, especially in applications characterised by a large
amount of data, like the one considered here. This last result is
promising, because it suggests that it is possible to achieve
better results using the proposed GSM-LS operator with
respect to the ones achieved by GSGP, and to do so using
fewer iterations of the algorithm, thus producing smaller
functions.
4.4 Comparison with other ML techniques
Besides comparing GSGP against the proposed variant that
uses a local searcher in the mutation operator, we are also
interested in considering the performance of other
well-known state-of-the-art ML methods, to evaluate the
competitiveness of the results obtained.
Table 3 reports the values of the training and test errors
(MAE) of the solutions obtained by all the studied
techniques including, in the last lines of the table, GSGP
and HYBRID. From these results, it is possible to see that
GSGP and HYBRID perform better than all the other
methods we studied, particularly on unseen test data.
Table 3 Experimental comparison between different non-evolutionary techniques, GSGP and HYBRID

                                                   Dow Jones           Istanbul
  Method                                           Training  Test      Training  Test
  Linear regression (Weisberg, 2005)               1.79      2.17      1.05      1.07
  Least square regression (Seber and Wild, 2003)   1.77      2.17      1.08      1.06
  Radial basis function network (Haykin, 1999)     1.80      2.17      1.35      1.17
  Isotonic regression (Hoffmann, 2009)             1.72      2.39      1.09      1.12
  Neural network (Haykin, 1998)                    1.55      1.81      1.16      1.08
  SVM polynomial kernel (degree 2)                 1.31      1.68      1.04      1.07
  SVM polynomial kernel (degree 3)                 1.20      1.58      1.09      1.11
  SVM polynomial kernel (degree 4)                 1.10      1.59      1.11      1.13
  SVM polynomial kernel (degree 5)                 1.05      1.63      1.14      1.21
  GSGP                                             0.82      1.05      1.14      1.05
  HYBRID                                           0.73      0.95      1.14      1.05

Notes: For non-deterministic techniques we reported the median of the training error and test error (MAE) calculated over 30 independent runs. Italic values denote the best performer.
To assess the statistical significance of these results, the
same set of tests described in the previous section has been
performed. In this case, a Bonferroni correction for the
value of
α
has been considered, given that the number of
compared techniques is larger than two. All the obtained
p-values relative to the comparison between HYBRID and
the other methods are reported in Table 4.
Table 4 p-values obtained from the statistical validation procedure (HYBRID versus each method, MAE)

  Dow Jones   LIN     ISO     SQ      NN      RBF     SVM-2   SVM-3   SVM-4   SVM-5
  TRAIN       0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001
  TEST        0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001   0.001

  Istanbul    LIN     ISO     SQ      NN      RBF     SVM-2   SVM-3   SVM-4   SVM-5
  TRAIN       0.01    0.118   0.081   0.411   0.001   0.001   0.106   0.306   0.988
  TEST        0.863   0.093   1       0.271   0.001   0.745   0.162   0.028   0.001
In the table, LIN stands for linear regression, ISO stands for
isotonic regression, SQ stands for least square regression,
NN stands for neural networks, RBF stands for radial basis
function network, SVM-2 refers to the support vector
machines with polynomial kernel of second degree and
similarly for SVM-3, SVM-4 and SVM-5. According to the
results reported in the table, the differences in terms of
training and test fitness between HYBRID and all the other
considered techniques are statistically significant when the
Dow Jones dataset is considered.
Regarding the Istanbul dataset the HYBRID method is
the best performer on unseen instances, with a difference
that is statistically significant with respect to several of the
other non-evolutionary machine learning techniques
(RBF and SVM-5). On the other hand, on the training
instances, the proposed system performs poorly with respect
to simple linear regression models (according to the
p-values, the only techniques that significantly outperform
the HYBRID system on the training instances are LIN and
SVM-2). Moreover, HYBRID produces the same training
error obtained with SVM and NN and, for all of these
techniques, training error is greater than test error. While
performance on training instances, for this second dataset, is
not as good as the one achieved with other techniques, it is
important to highlight that HYBRID is the best performer
on the test instances. This is the most desirable behaviour
when a real-life problem must be addressed.
These results are a clear indication of the
appropriateness of the proposed method to generate
predictive models of stock prices, at least for the studied test
cases.
5 Conclusions
Stock market prices forecasting is one of the most
challenging tasks for financial investors across the globe.
This challenge is due to the uncertainty and volatility of the
stock prices in the market. Due to technology and
globalisation of business and financial markets it is
important to predict the stock prices more quickly and
accurately. In this study the stock market prices forecasting
task has been considered and, in order to address it, a
computational intelligence technique has been proposed.
The proposed system is based on a variant of the genetic
programming algorithm. In particular, the GP system makes
use of particular genetic operators that, differently from the
standard genetic operators used in GP, work on the
semantics of the solutions. While the use of semantic
methods in GP has been successfully investigated and
applied, several important problems that prevent the
efficient use of these methods are still open. In particular, the
GP system that uses the semantic operators (GSGP) requires
a large number of generations in order to converge towards
optimal solutions.
In this light, the contribution of this work consists of
integrating the GSGP framework with a local search
optimiser. The use of a local searcher is aimed at improving
the speed with which GSGP converges, in order to find
good-quality solutions. Moreover, by combining the
exploration ability of GSGP with the exploitation ability of
a local search method we expect to find good quality
solutions in a small number of generations, hence avoiding
or limiting the excessive specialisation of a model on the
training instances and, consequently, overfitting.
To validate the proposed system, called HYBRID, an
extensive experimental analysis has been performed,
considering stock prices related to the Dow Jones index and
to the Istanbul index. We tested the proposed system against
a standard semantic GP system. The reported results have
shown that semantic GP with the local searcher is able to
produce results that outperform the ones obtained by
GSGP on the Dow Jones index while, on the second dataset,
results achieved by the two systems are comparable. More
interesting is the fact that the proposed system is
able to produce better results in a significantly lower
number of generations with respect to GSGP, hence saving
computational effort. This is extremely important in a
domain where a large amount of data is available daily.
To summarise, the paper provides two contributions:
from the point of view of the stock market prices
forecasting, a system able to outperform the existing
state-of-the-art techniques has been defined; from the
machine learning perspective, this case study has shown that
including a local searcher in the geometric semantic GP
system can speed up the convergence of the search process.
We hope that this contribution will pave the way for further
research in these areas.
Acknowledgements
CONACYT Basic Science Research Project No. 178323,
TecNM (Mexico) Research Project 5621.15-P, and the
FP7-Marie Curie-IRSES 2013 European Commission
program through project ACoBSEC with Contract
No. 612689.
References
Azad, R.M.A. and Ryan, C. (2014) ‘A simple approach to lifetime
learning in genetic programming-based symbolic regression’,
Evol. Comput., Vol. 22, No. 2, pp.287–317.
Banik, S., Khan, A.F.M.K. and Anwer, M. (2014) ‘Hybrid
machine learning technique for forecasting Dhaka stock
market timing decisions’, Computational Intelligence and
Neuroscience, pp.1–6.
Brown, M., Pelosi, M. and Dirska, H. (2013) ‘Dynamic-radius
species-conserving genetic algorithm for the financial
forecasting of Dow Jones index stocks’, in P. Perner (Ed.):
Machine Learning and Data Mining in Pattern Recognition,
Lecture Notes in Computer Science, Vol. 7988, pp.27–41,
Springer, Berlin, Heidelberg.
Castelli, M., Silva, S. and Vanneschi, L. (2015) ‘A C++
framework for geometric semantic genetic programming’,
Genetic Programming and Evolvable Machines, Vol. 16,
No. 1, pp.73–81.
Chen, X., Ong, Y-S., Lim, M-H. and Tan, K.C. (2011)
‘A multifacet survey on memetic computation’, Trans. Evol.
Comp., Vol. 15, No. 5, pp.591–607.
Dash, R., Dash, P.K. and Bisoi, R. (2015) ‘A differential harmony
search based hybrid interval type2 fuzzy EGARCH model for
stock market volatility prediction’, International Journal of
Approximate Reasoning, Vol. 59, No. C, pp.81–104.
Eskridge, B. and Hougen, D. (2004) ‘Imitating success: a memetic
crossover operator for genetic programming’, in Proceedings
of the IEEE Congress on Evolutionary Computation, IEEE
Press, Portland, Oregon, pp.809–815.
Graff, M., Peña, R. and Medina, A. (2013) ‘Wind speed
forecasting using genetic programming’, in IEEE Congress
on Evolutionary Computation, IEEE, pp.408–415.
Haykin, S. (1998) Neural Networks: A Comprehensive
Foundation, 2nd ed., Prentice Hall PTR, Upper Saddle River,
NJ, USA.
Haykin, S. (1999) Neural Networks: A Comprehensive
Foundation, Prentice Hall, Upper Saddle River, NJ, USA.
Hoffmann, L. (2009) Multivariate Isotonic Regression and its
Algorithms, Wichita State University, College of Liberal Arts
and Sciences, Department of Mathematics and Statistics.
Huang, W., Nakamori, Y. and Wang, S-Y. (2005) ‘Forecasting
stock market movement direction with support vector
machine’, Computers and Operations Research, Applications
of Neural Networks, Vol. 32, No. 10, pp.2513–2522.
Keijzer, M. (2003) ‘Improving symbolic regression with interval
arithmetic and linear scaling’, in Proceedings of the 6th
European Conference on Genetic Programming EuroGP,
Springer-Verlag, Berlin, Heidelberg, pp.70–82.
Kommenda, M., Kronberger, G., Winkler, S., Affenzeller, M. and
Wagner, S. (2013) ‘Effects of constant optimization by
nonlinear least squares minimization in symbolic regression’,
Proceedings of the Fifteenth Annual Conference Companion
on Genetic and Evolutionary Computation (GECCO
Companion), pp.11–21.
Koza, J.R. (1992) Genetic Programming: On the Programming of
Computers by Means of Natural Selection, MIT Press,
Cambridge, MA, USA.
Koza, J.R. (2010) ‘Human-competitive results produced by
genetic programming’, Genetic Programming and Evolvable
Machines, Vol. 11, Nos. 3–4, pp.251–284.
Krawiec, K. and Lichocki, P. (2009) ‘Approximating geometric
crossover in semantic space’, in GECCO: Proceedings of the
11th Annual Conference on Genetic and Evolutionary
Computation, pp.987–994, ACM, Montreal.
Krawiec, K. and O’Reilly, U-M. (2014) ‘Behavioral programming:
a broader and more detailed take on semantic GP’,
in Proceedings of the 2014 Conference on Genetic and
Evolutionary Computation, GECCO, ACM, New York, NY,
USA, pp.935–942.
Moody, J. and Saffell, M. (2001) ‘Learning to trade via direct
reinforcement’, Neural Networks, IEEE Transactions on,
Vol. 12, No. 4, pp.875–889.
Moraglio, A. and Mambrini, A. (2013) ‘Runtime analysis of
mutation-based geometric semantic genetic programming for
basis functions regression’, in Proceedings of the 15th Annual
Conference on Genetic and Evolutionary Computation
GECCO, ACM, New York, NY, USA, pp.989–996.
Moraglio, A., Krawiec, K. and Johnson, C.G. (2012) ‘Geometric
semantic genetic programming’, in C.A. Coello Coello, V.
Cutello, K. Deb, S. Forrest, G. Nicosia and M. Pavone (Eds.):
Parallel Problem Solving from Nature, PPSN XII (Part 1),
Lecture Notes in Computer Science, Vol. 7491, pp.21–31,
Springer.
Patel, J., Shah, S., Thakkar, P. and Kotecha, K. (2015) ‘Predicting
stock market index using fusion of machine learning
techniques’, Expert Systems with Applications, Vol. 42, No. 4,
pp.2162–2172.
Seber, G. and Wild, C. (2003) Nonlinear Regression, Wiley Series
in Probability and Statistics, Wiley.
Shen, S., Jiang, H. and Zhang, T. (2012) Stock Market Forecasting
using Machine Learning Algorithms, Stanford University,
Santa Clara, California, USA.
Smart, W. and Zhang, M. (2004) ‘Continuously evolving programs
in genetic programming using gradient descent’, in R.I.
Mckay and S-B. Cho (Eds.): Proceedings of The Second
Asian-Pacific Workshop on Genetic Programming, p.16,
Cairns, Australia.
Sorensen, D.C. (1982) ‘Newton’s method with a model trust
region modification’, SIAM Journal on Numerical Analysis,
Vol. 19, No. 2, pp.409–426.
Tay, F.E. and Shen, L. (2002) ‘Economic and financial prediction
using rough sets model’, European Journal of Operational
Research, Vol. 141, No. 3, pp.641–659.
Topchy, A. and Punch, W.F. (2001) ‘Faster genetic programming
based on local gradient search of numeric leaf values’,
in L. Spector, E.D. Goodman, A. Wu, W.B. Langdon, H-M.
Voigt, M. Gen, S. Sen, M. Dorigo, S. Pezeshk, M.H. Garzon
and E. Burke (Eds.): Proceedings of the Genetic
and Evolutionary Computation Conference (GECCO-2001),
pp.155–162, Morgan Kaufmann.
Vanneschi, L., Castelli, M. and Silva, S. (2014a) ‘A survey of
semantic methods in genetic programming’, Genetic
Programming and Evolvable Machines, Vol. 15, No. 2,
pp.195–214.
Vanneschi, L., Silva, S., Castelli, M. and Manzoni, L. (2014b)
‘Geometric semantic genetic programming for real life
applications’, in Genetic Programming Theory and Practice
XI, Springer, New York, pp.191–209.
Wang, P., Tang, K., Tsang, E.P.K. and Yao, X. (2011) ‘A memetic
genetic programming with decision tree-based local search
for classification problems’, in IEEE Congress on
Evolutionary Computation, IEEE, pp.917–924.
Weisberg, S. (2005) Applied Linear Regression, Wiley Series in
Probability and Statistics, Wiley, Hoboken, New Jersey,
USA.
Z-Flores, E., Trujillo, L., Schuetze, O. and Legrand, P. (2014)
‘Evaluating the effects of local search in genetic
programming’, in A-A. Tantar et al. (Eds.): EVOLVE A
Bridge between Probability, Set Oriented Numerics, and
Evolutionary Computation, Advances in Intelligent Systems
and Computing, No. 288, pp.213–228, Springer.
Zhang, M. and Smart, W. (2004) ‘Genetic programming with
gradient descent search for multiclass object classification’,
in M. Keijzer, U-M. O’Reilly, S.M. Lucas, E. Costa and
T. Soule (Eds.): Genetic Programming 7th European
Conference, EuroGP, Proceedings, LNCS, Vol. 3003,
pp.399–408, Springer-Verlag, Coimbra, Portugal.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In a recent contribution we have introduced a new implementation of geometric semantic operators for Genetic Programming. Thanks to this implementation, we are now able to deeply investigate their usefulness and study their properties on complex real-life applications. Our experiments confirm that these operators are more effective than traditional ones in optimizing training data, due to the fact that they induce a unimodal fitness landscape. Furthermore, they automatically limit overfitting, something we had already noticed in our recent contribution, and that is further discussed here. Finally, we investigate the influence of some parameters on the effectiveness of these operators, and we show that tuning their values and setting them “a priori” may be wasted effort. Instead, if we randomly modify the values of those parameters several times during the evolution, we obtain a performance that is comparable with the one obtained with the best setting, both on training and test data for all the studied problems.
Article
Full-text available
Geometric semantic operators are new and promising genetic operators for genetic programming. They have the property of inducing a unimodal error surface for any supervised learning problem, i.e., any problem consisting in finding the match between a set of input data and known target values (like regression and classification). Thanks to an efficient implementation of these operators, it was possible to apply them to a set of real-life problems, obtaining very encouraging results. We have now made this implementation publicly available as open source software, and here we describe how to use it. We also reveal details of the implementation and perform an investigation of its efficiency in terms of running time and memory occupation, both theoretically and experimentally. The source code and documentation are available for download at http://gsgp.sourceforge.net.
Article
Full-text available
Several methods to incorporate semantic awareness in genetic programming have been proposed in the last few years. These methods cover fundamental parts of the evolutionary process: from the population initialization, through different ways of modifying or extending the existing genetic operators, to formal methods, until the definition of completely new genetic operators. The objectives are also distinct: from the maintenance of semantic diversity to the study of semantic locality; from the use of semantics for constructing solutions which obey certain constraints to the exploitation of the geometry of the semantic topological space aimed at defining easy-to-search fitness landscapes. All these approaches have shown, in different ways and amounts, that incorporating semantic awareness may help improving the power of genetic programming. This survey analyzes and discusses the state of the art in the field, organizing the existing methods into different categories. It restricts itself to studies where semantics is intended as the set of output values of a program on the training data, a definition that is common to a rather large set of recent contributions. It does not discuss methods for incorporating semantic information into grammar-based genetic programming or approaches based on formal methods. The objective is keeping the community updated on this interesting research track, hoping to motivate new and stimulating contributions.
Conference Paper
Geometric semantic genetic programming (GSGP) is a recently introduced form of genetic programming (GP) that searches the semantic space of functions/programs. The fitness landscape seen by GSGP is, by construction, unimodal with a linear slope for any domain and any problem. This makes the search for the optimum much easier than in traditional GP, and it opens the way to a straightforward theoretical analysis of the optimisation time of GSGP in a general setting. Very recent work proposed a runtime analysis of mutation-based GSGP on the class of all Boolean functions. We present a runtime analysis of mutation-based GSGP on the class of all regression problems with generic basis functions (encompassing, e.g., polynomial regression and trigonometric regression).
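In symbols, and under the usual assumptions of these analyses, the fitness seen by GSGP is a distance in semantic space:

f(P) = \lVert s(P) - t \rVert_2, \qquad s(P') = s(P) + ms \, \bigl( s(R_1) - s(R_2) \bigr)

where $s(P)$ is the semantics of program $P$ (its output vector on the training cases), $t$ is the target vector, and $P'$ is the offspring obtained by geometric semantic mutation of $P$ with random trees $R_1, R_2$ and mutation step $ms$. Since $f$ measures the distance to a single fixed point $t$, the induced error surface has no local optima other than the global one, which is the unimodality property the runtime analysis relies on.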
Article
Forecasting the stock market is a difficult task for applied researchers because the underlying data are very noisy and time-varying. Nevertheless, several empirical studies have addressed the problem, and a number of researchers have successfully applied machine learning techniques to stock market forecasting. This paper studies stock prediction from the investor's point of view: investors often incur losses because of unclear investment objectives and poor visibility into assets. The paper proposes a rough set model, a neural network model, and a hybrid neural network and rough set model to find optimal buy and sell points for a share on the Dhaka Stock Exchange. Experimental findings show that the proposed hybrid model is more accurate than either the rough set model or the neural network model alone. We believe these findings will help stock investors decide on optimal buy and/or sell times on the Dhaka Stock Exchange.
Article
Genetic programming (GP) coarsely models natural evolution to evolve computer programs. Unlike in nature, where individuals can often improve their fitness through lifetime experience, the fitness of GP individuals generally does not change during their lifetime, and there is usually no opportunity to pass on acquired knowledge. This paper introduces the Chameleon system to address this discrepancy and augments GP with lifetime learning by adding a simple local search that tunes the internal nodes of individuals. Although not the first attempt to combine local search with GP, its simplicity makes it easy to understand and cheap to implement. A simple cache leverages the local search to reduce the tuning cost to a small fraction of the expected cost, and we provide a theoretical upper limit on the maximum tuning expense given the average tree size of the population, showing that this limit grows very conservatively as the average tree size increases. We show that Chameleon uses available genetic material more efficiently by exploring more actively than standard GP, and we demonstrate that Chameleon not only outperforms standard GP (on both training and test data) over a number of symbolic regression problems, but does so while producing smaller individuals, and that it works harmoniously with two other well-known extensions to GP, namely linear scaling and a diversity-promoting tournament selection method.
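As a rough illustration of this kind of lifetime learning, the sketch below greedily retunes the operator at each internal node of a tree and keeps any swap that lowers training error. It assumes a simple dictionary-based tree encoding; it is only a schematic reading of the idea, not the Chameleon implementation, and it omits the caching described in the abstract.

import numpy as np

OPS = {'add': np.add, 'sub': np.subtract, 'mul': np.multiply}

def evaluate(node, X):
    # Recursively evaluate a tree encoded as nested dicts on a data matrix X.
    if node['op'] == 'var':
        return X[:, node['index']]
    if node['op'] == 'const':
        return np.full(X.shape[0], node['value'])
    left = evaluate(node['children'][0], X)
    right = evaluate(node['children'][1], X)
    return OPS[node['op']](left, right)

def rmse(root, X, y):
    return float(np.sqrt(np.mean((evaluate(root, X) - y) ** 2)))

def tune_nodes(root, X, y):
    # Greedy local search: at each internal node, try every alternative
    # operator and keep the swap whenever it lowers training error.
    best = rmse(root, X, y)
    stack = [root]
    while stack:
        node = stack.pop()
        if node['op'] in OPS:
            stack.extend(node['children'])
            kept = node['op']
            for alt in OPS:
                node['op'] = alt
                err = rmse(root, X, y)
                if err < best:
                    best, kept = err, alt
            node['op'] = kept
    return best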
Article
In this paper, a new hybrid model integrating an interval type-2 fuzzy logic system (IT2FLS) with a computationally efficient functional link artificial neural network (CEFLANN) and an exponential generalized autoregressive conditional heteroskedasticity (EGARCH) model is proposed for accurate forecasting and modeling of financial data whose variance changes over time. The proposed model, denoted IT2F-CE-EGARCH, enhances the EGARCH model by jointly estimating its important features, such as the leverage effect and the asymmetric response to shocks, with the secondary membership functions of an interval type-2 TSK FLS and the functional expansion and learning component of a CEFLANN. The secondary membership functions, with their upper and lower limits, provide a forecasting interval for handling the more complicated uncertainties involved in volatility forecasting compared with a type-1 FLS. The performance of the proposed model has been examined with two membership functions: a Gaussian with fixed mean and uncertain variance, and a Gaussian with fixed variance and uncertain mean. The proposed model has also been compared with several other fuzzy time series models and GARCH-family models on four performance metrics: MSFE, RMSFE, MAFE and Rel MAE. In addition, a differential harmony search (DHS) algorithm is suggested for optimizing the parameters of all the fuzzy time series models. The results indicate that the proposed IT2F-CE-EGARCH model offers significant improvements in volatility forecasting performance compared with all other specified models on the BSE Sensex and CNX Nifty datasets.
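For reference, the EGARCH recursion the hybrid builds on is the standard one due to Nelson (1991); in its (1,1) form:

\ln \sigma_t^2 = \omega + \beta \ln \sigma_{t-1}^2 + \alpha \bigl( |z_{t-1}| - \mathbb{E}|z_{t-1}| \bigr) + \gamma z_{t-1}, \qquad z_t = \varepsilon_t / \sigma_t

The $\gamma z_{t-1}$ term captures the leverage effect mentioned above: with $\gamma < 0$, negative shocks raise the log-variance more than positive shocks of the same magnitude, and modeling the logarithm keeps $\sigma_t^2$ positive without parameter constraints.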
Conference Paper
In this publication, a constant optimization approach for symbolic regression is introduced to separate the task of finding the correct model structure from the need to evolve correct numerical constants. A gradient-based nonlinear least squares optimization algorithm, the Levenberg-Marquardt (LM) algorithm, is used to adjust constant values in symbolic expression trees during their evolution. The LM algorithm depends on gradient information consisting of the partial derivatives of the trees, which are obtained by automatic differentiation. The presented constant optimization approach is tested on several benchmark problems and compared with a standard genetic programming algorithm to show its effectiveness. Although constant optimization adds execution-time overhead, it significantly increases both the achieved accuracy and the ability of genetic programming to learn from the provided data. As an example, the Pagie-1 problem could be solved in 37 out of 50 test runs, whereas without constant optimization it was solved in only 10 runs. Furthermore, different configurations of the constant optimization approach (number of iterations, probability of applying constant optimization) are evaluated, and their impact is detailed in the results section.
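A minimal sketch of the constant-tuning step, with illustrative names: the tree structure is frozen as f(x) = c0*sin(c1*x) + c2 and only its constants are refined. The cited work uses Levenberg-Marquardt with gradients from automatic differentiation inside the GP run; here SciPy's least_squares with method='lm' stands in for that machinery.

import numpy as np
from scipy.optimize import least_squares

def model(consts, x):
    # A fixed symbolic expression whose structure GP has already found;
    # only the numeric constants c0, c1, c2 remain to be tuned.
    c0, c1, c2 = consts
    return c0 * np.sin(c1 * x) + c2

def residuals(consts, x, y):
    return model(consts, x) - y

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = 2.0 * np.sin(1.5 * x) + 0.5 + rng.normal(0.0, 0.05, x.size)

# Levenberg-Marquardt refinement of the constants from a rough initial guess.
fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(x, y), method='lm')
print(fit.x)  # should land close to the true constants (2.0, 1.5, 0.5)

In a full system this refinement would be applied to some or all individuals in each generation, which is where the execution-time overhead discussed above comes from.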