Fig. 1. Change trend of average transactional prices per square metre over time

Source publication
Conference Paper
The random subspace and random forest ensemble methods, using a genetic fuzzy rule-based system as the base learning algorithm, were developed in the Matlab environment. The methods were applied to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions. The computationally...

Context in source publication

Context 1
... improvement and variance reduction of learners for both classification and regression problems [5], [8]. Another approach to ensemble learning is called random subspaces, also known as attribute bagging [4]. This approach seeks learner diversity through feature-space subsampling. All component models are built with the same training data, but each takes into account a randomly chosen subset of features, which brings diversity to the ensemble. For the most part, the number of features is fixed at the same level for all committee components. The method aims to increase the generalization accuracy of decision-tree-based classifiers without loss of accuracy on the training data. Ho showed that random subspaces can outperform bagging and, in some cases, even boosting [11]. While other methods are affected by the curse of dimensionality, the random subspace technique can actually benefit from it. Both bagging and random subspaces were devised to increase classifier or regressor accuracy, but each of them treats the problem from a different point of view. Bagging provides diversity by operating on training set instances, whereas random subspaces seek diversity in feature-space subsampling. Breiman [3] developed a method called random forest which merges these two approaches: random forest uses bootstrap selection to supply each individual learner with training data and limits the feature space by random selection (a sketch contrasting these three strategies is given below). Some recent studies have focused on hybrid approaches combining random forests with other learning algorithms [9], [14].

We have been conducting an intensive study to select machine learning methods suitable for an automated system to aid real estate appraisal, intended for the information centres maintaining cadastral systems in Poland. So far, we have investigated several methods for constructing regression models to assist with real estate appraisal: evolutionary fuzzy systems, neural networks, decision trees, and statistical algorithms, using the MATLAB, KEEL, RapidMiner, and WEKA data mining systems [10], [15], [16]. Evolving fuzzy models applied to cadastral data revealed good performance [17], [20]. We also studied ensemble models created with various weak learners and resampling techniques [13], [18], [19].

The first goal of the study presented in this paper was to compare empirically random subspace, random forest, bagging, repeated holdout, and repeated cross-validation ensemble models employing genetic fuzzy systems (GFS) as base learners. The algorithms were applied to the real-world regression problem of predicting the prices of residential premises, based on historical data of sales/purchase transactions obtained from a cadastral system. The second goal was to examine the performance of the ensemble methods with noisy training data. Resilience to noisy data can be an important criterion for choosing appropriate machine learning methods for our automated valuation system; the susceptibility of machine learning algorithms to noisy data has been explored in several works, e.g. [1], [12], [21], [22]. The investigation was conducted with our experimental system implemented in the Matlab environment using the Fuzzy Logic, Global Optimization, Neural Network, and Statistics toolboxes. The system was designed to carry out research into machine learning algorithms using various resampling methods and to construct and evaluate ensemble models for regression problems.
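The distinction drawn above can be made concrete with a short sketch: bagging resamples training instances, random subspaces resample features, and the random-forest-style combination does both, with component predictions aggregated by the arithmetic mean. The sketch below is illustrative only; it assumes a generic regressor with fit/predict methods in place of the paper's genetic fuzzy base learner (implemented in Matlab and not reproduced here), and all function names are hypothetical.

```python
import numpy as np

def build_ensemble(X, y, base_learner, n_members=50, k_features=None,
                   use_bootstrap=True, seed=0):
    """Generic ensemble builder: bagging (use_bootstrap=True, k_features=None),
    random subspaces (use_bootstrap=False, k_features set), or a
    random-forest-style combination (use_bootstrap=True, k_features set)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    members = []
    for _ in range(n_members):
        # Bagging / random forest: draw n instances with replacement.
        rows = rng.integers(0, n, size=n) if use_bootstrap else np.arange(n)
        # Random subspaces / random forest: draw k of p features without replacement.
        cols = (rng.choice(p, size=k_features, replace=False)
                if k_features else np.arange(p))
        model = base_learner()
        model.fit(X[np.ix_(rows, cols)], y[rows])
        members.append((model, cols))
    return members

def predict_ensemble(members, X):
    """Aggregate component predictions with the arithmetic mean."""
    return np.mean([m.predict(X[:, cols]) for m, cols in members], axis=0)
```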
The real-world dataset used in the experiments was drawn from an unrefined dataset containing over 50 000 records of residential premises transactions concluded in one big Polish city with a population of 640 000 over the 11 years from 1998 to 2008. The final dataset counted 5213 samples. The following five attributes were pointed out as the main price drivers by professional appraisers: usable area of a flat (Area), age of the building (Age), number of storeys in the building (Storeys), number of rooms in the flat including the kitchen (Rooms), and distance of the building from the city centre (Centre); in turn, the price of the premises (Price) was the output variable. For the random subspace and random forest approaches four more features were employed: the floor on which a flat is located (Floor), the geodetic coordinates of the building (Xc and Yc), and its distance from the nearest shopping centre (Shopping).

Because the prices of premises change substantially over time, the whole 11-year dataset could not be used directly to create data-driven models. In order to obtain comparable prices, it was split into 20 subsets covering individual half-years, and the prices of premises were then updated according to the trends of value changes over the 11 years: starting from the beginning of 1998, the prices were updated to the last day of each subsequent half-year. The trends were modelled by polynomials of degree three. The chart illustrating the change trend of average transactional prices per square metre is given in Fig. 1. We may assume that the half-year datasets differed from each other and therefore constitute separate observation points for comparing the accuracy of the ensemble models and carrying out statistical tests. The sizes of the half-year datasets are given in Table 1.

The root mean square error (RMSE) was used as the performance function, and arithmetic averages were employed as the aggregation functions of the ensembles. Each input and output attribute in every individual dataset was normalized using the min-max approach. The parameters of the architecture of the fuzzy systems as well as of the genetic algorithms are listed in Table 2; similar designs are described in [6], [7], [15]. The following methods were applied in the experiments, where the numbers in brackets denote the number of input features (a sketch of the 0.632 error estimate used by the bootstrap-based variants follows the list):

CV(5) – Repeated cross-validation: 10-fold cross-validation repeated five times to obtain 50 pairs of training and test sets; 5 input features pointed out by the experts.
BA(5) – 0.632 bagging: bootstrap drawing of 100% of instances with replacement (Boot); test set – out of bag (OoB); accuracy calculated as RMSE(BA) = 0.632 x RMSE(OoB) + 0.368 x RMSE(Boot); repeated 50 times.
RH(5) – Repeated holdout: the dataset was randomly split into a training set of 70% and a test set of 30% of instances; repeated 50 times; 5 input features.
RS(5of9) – Random subspaces: 5 input features were randomly drawn out of 9, then the dataset was randomly split into a training set of 70% and a test set of 30% of instances; repeated 50 times.
RF(5of9) – Random forest: 5 input features were randomly drawn out of 9, then bootstrap drawing of 100% of instances with replacement (Boot); test set – out of bag (OoB); accuracy calculated as RMSE(RF) = 0.632 x RMSE(OoB) + 0.368 x RMSE(Boot); repeated 50 times.
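The 0.632 error estimate used for BA(5) and RF(5of9) can be sketched as follows. The sketch assumes numpy and a model_factory returning a regressor with fit/predict methods; the 50 repetitions, the min-max normalization of each attribute, and the RMSE performance function follow the description above, while all identifiers are illustrative rather than taken from the authors' Matlab system.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def min_max(x):
    """Min-max normalization applied to each input and output attribute."""
    return (x - x.min()) / (x.max() - x.min())

def bootstrap_0632_rmse(model_factory, X, y, n_repeats=50, seed=0):
    """0.632 estimate: RMSE = 0.632 * RMSE(out-of-bag) + 0.368 * RMSE(bootstrap)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_repeats):
        boot = rng.integers(0, n, size=n)        # 100% of instances drawn with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag instances form the test set
        model = model_factory()
        model.fit(X[boot], y[boot])
        err_boot = rmse(y[boot], model.predict(X[boot]))
        err_oob = rmse(y[oob], model.predict(X[oob]))
        scores.append(0.632 * err_oob + 0.368 * err_boot)
    return np.mean(scores)
```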
We also examined the aforementioned ensemble methods for their susceptibility to noisy data. Each run of the experiment was repeated four times: first, each output value (price) in the training datasets remained unchanged; next, we replaced the prices in 5%, 10%, and 20% of randomly selected training instances with noised values. The noised values were randomly drawn from the interval [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR], where Q1 and Q3 denote the first and third quartiles, respectively, and IQR stands for the interquartile range (a sketch of this noise-injection step is given at the end of this excerpt).

The performance of the CV(5), BA(5), RH(5), RS(5of9), and RF(5of9) models created by genetic fuzzy systems (GFS), in terms of RMSE for non-noised data and data with 5%, 10%, and 20% injected noise, is illustrated graphically in Figures 2-5, respectively. The charts clearly show that the RS and RF models yield the largest RMSE values at all levels of noise. Nonparametric statistical tests confirm this observation. The Friedman test performed on the RMSE values of all models built over the 20 half-year datasets showed that there are significant differences between some models. The average ranks of the individual models are shown in Table 3, where a lower rank value indicates a better model. For all levels of noise the rankings are the same: BA(5) reveals the best performance, CV(5) and RH(5) come next, and RF(5of9) and RS(5of9) are in last place. According to the nonparametric Wilcoxon paired test, for non-noised data there are statistically significant differences between each pair of ensembles. For 5% and 10% noise the performance of the CV(5) and RH(5) ensembles is statistically equivalent. In turn, with 20% noise no statistically significant differences occur among the CV(5), RH(5), and RF(5of9) ensembles. The significance level considered for rejection of the null hypothesis was set to 0.05 in each test.

As for the susceptibility of the individual ensemble methods to noise, the general outcome is as follows. Injecting subsequent levels of noise results in progressively worse accuracy. The percentage loss of performance for data with 5%, 10%, and 20% noise versus non-noised data is shown in Tables 4, 5, and 6, respectively. The amount of loss differs between individual datasets and increases with the percentage of noise, although in some cases with 5% and 10% noise the injection of noise results in better performance. The most important observation is that in each case the average loss of accuracy for RS(5of9) and RF(5of9) is lower than for the ensembles built over datasets with the five features pointed out by the experts. The Friedman test performed on the RMSE values of all ensembles built over the 20 half-year datasets indicated significant differences between the models. The average ranks of the individual methods are shown in Table 7, where a lower rank value indicates a better model. For each method the rankings are the same: ensembles built with non-noised data outperform the others, and models with lower levels of noise reveal better accuracy than those with more noise. The nonparametric Wilcoxon paired test indicated statistically significant differences between each pair of ensembles with different amounts of noise. The computationally intensive experiments aimed to compare the predictive accuracy of random subspace and ...
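The noise-injection step described in this excerpt can be sketched as follows. The excerpt does not state which distribution was used to draw values from [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR], so a uniform draw is assumed here purely for illustration, and the function and variable names are hypothetical. The accompanying nonparametric tests are available, for example, as scipy.stats.friedmanchisquare and scipy.stats.wilcoxon.

```python
import numpy as np

def inject_price_noise(y_train, fraction, seed=0):
    """Replace `fraction` of the target prices with noised values drawn from
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; a uniform draw is assumed for illustration."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y_train, dtype=float).copy()
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    n_noisy = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    y[idx] = rng.uniform(q1 - 1.5 * iqr, q3 + 1.5 * iqr, size=n_noisy)
    return y

# Usage: noised targets for the 5%, 10%, and 20% noise levels.
# y_05 = inject_price_noise(y_train, 0.05)
# y_10 = inject_price_noise(y_train, 0.10)
# y_20 = inject_price_noise(y_train, 0.20)
```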

Similar publications

Conference Paper
The ensemble machine learning methods incorporating bagging, random subspace, random forest, and rotation forest, employing decision trees (Pruned Model Trees) as base learning algorithms, were developed in the WEKA environment. The methods were applied to the real-world regression problem of predicting the prices of residential premises based on hi...

Citations

... The issue addressed by this research, on the other hand, is real estate appraisal. Many attempts have been made over the last years to achieve the best accuracy of automated valuation models in this domain [12,9,13]. Improvements in this area are in high demand among professional appraisers, financial institutions, investors, and property developers. Due to the multiplicity and diversity of attributes that can be assigned to a particular real estate, this problem is a very good subject for machine learning and Big Data researchers [8]. ...
Chapter
This paper addresses a property valuation problem with machine learning models with pre-selection of attributes. The study aimed to examine to what extent environmental attributes influence real estate prices. Real-world data about purchase and sale transactions, derived from a cadastral system and a registry of real estate transactions in one of the big Polish cities, were employed in the experiments. Machine learning models were built using basic attributes of apartments and environmental attributes taken from cadastral maps. Five market segmentations were made, including administrative cadastral regions of the city, quality zones delineated by an expert, and classes of apartments. Feature selection was accomplished and property valuation models were built for each division of the city area. The study also allowed for a comparative analysis of the performance of ensemble learning techniques applied to construct the predictive models.
... RS(5of9) revealed significantly worse performance than any other ensemble apart from the 0% and 5% noise levels, where it is statistically equivalent to RF(5of9). In our previous work we explored the susceptibility to noise of the individual ensemble methods [19]. The most important observation was that in each case the average loss of accuracy for RS(5of9) and RF(5of9) was lower than for the ensembles built over datasets with five features pointed out by the experts. ...
... Percentage loss of performance for data with 20% noise vs. non-noised data [19] ...
Conference Paper
The ensemble machine learning methods incorporating random subspace and random forest, employing genetic fuzzy rule-based systems as base learning algorithms, were developed in the Matlab environment. The methods were applied to the real-world regression problem of predicting the prices of residential premises based on historical data of sales/purchase transactions. The accuracy of the ensembles generated by the proposed methods was compared with bagging, repeated holdout, and repeated cross-validation models. The tests were made for four levels of noise injected into the benchmark datasets. The analysis of the results was performed using a statistical methodology including nonparametric tests followed by post-hoc procedures designed especially for multiple N×N comparisons.
Article
Current technology does not yet allow quantum computers to be used broadly or by individuals; however, it is possible to simulate some of their potential through quantum computing. Quantum computing can be integrated with nature-inspired algorithms to innovatively analyze the dynamics of the real estate market or any other economic phenomenon. With this main aim, this study implements a multidisciplinary approach based on the integration of quantum computing and genetic algorithms to interpret housing prices. Starting from the principles of quantum programming, the work applies genetic algorithms to determine the marginal prices of relevant real estate characteristics for a particular segment of Naples' real estate market. These marginal prices constitute the inputs of the quantum program, which provides, as results, the purchase probabilities corresponding to each real estate characteristic considered. The other main outcomes of this study consist of a comparison of the optimal quantities for each real estate characteristic as determined by the quantum program with the average amounts of the same characteristics in the sampled real estate data, as well as the weights of the same characteristics obtained with the implementation of genetic algorithms. With respect to the current state of the art, this study is among the first to apply quantum computing to the interpretation of selling prices in local real estate markets.