Decision Trees & Random Forests: Deep Dive

If you’ve always wondered how decision trees and random forests work, as well as what problems they can solve, this blog is here to answer your questions! We will illustrate these concepts by using the practical example of a real estate company wanting to predict housing prices.

A practical example: Predicting housing prices

Let’s imagine that we work for a real estate company and are tasked with developing a method to determine property prices (data source — external link). Features used to determine the property price could include the year the house was built and/or the size of the property. Let’s also assume that we have data on historical property sales and call this our dataset X.

Table 1 shows a small part of our dataset. In addition to the features mentioned above (year of construction, size of the property, and price), the dataset also contains information such as the type of foundation and heating system, the number of bathrooms, etc. The entire dataset contains over 80 features (columns) and over 1,400 historical housing sales (rows).

Table 1: Example data of historical housing sales. The “SalePrice” column is the price of the house, i.e. the quantity we want to predict. All other columns contain features describing the property.
Developing a predictor for the property value

How would we go about developing a reliable method to predict the price of a property based on the given dataset?

One of the first thoughts that probably comes to mind is splitting our big dataset of historical housing sales into smaller groups containing similar houses. In order to do that, we need to specify what we mean by similar. We could split our historical dataset X based on the year of construction. For example, houses built before 1980 would go to one group, while those built in 1980 or later would go to the other group. If we split our dataset according to this rule and look at the distribution of housing prices in the two groups, we can see that older houses tend to be cheaper than newer houses (see Figure 1).
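As a minimal sketch, such a split could look as follows with pandas, assuming the dataset is loaded into a DataFrame with a “YearBuilt” and a “SalePrice” column (the file name and the exact column names are assumptions and depend on the data source):

```python
import pandas as pd

# Load the historical sales dataset X (hypothetical file name)
X = pd.read_csv("housing_sales.csv")

# Split the dataset into two groups based on the year of construction
older = X[X["YearBuilt"] < 1980]
newer = X[X["YearBuilt"] >= 1980]

# Compare the price distributions of the two groups
print("Median price, built before 1980:   ", older["SalePrice"].median())
print("Median price, built 1980 or later: ", newer["SalePrice"].median())
```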

Figure 1: Graph showing the distribution of housing prices for the split data. The blue and orange distributions show the prices of houses built before and after 1980, respectively. We can see that newer houses tend to be more expensive, which indicates that the year of construction might be an important factor in determining the housing price.
Further refining the predictor

This matches our intuition, and we can conclude that the year of construction is a rough indicator of the price. To get a more accurate estimate, we can consider other factors that might be key in determining the price. We could split our two groups further by the size of the property, then by the heating system, and so on. If we keep splitting our dataset and sub-datasets based on the rules we come up with, we end up with several chunks of the dataset, each containing houses with similar features.

By splitting our dataset based on those rules, we’ve essentially built our first decision tree (see Figure 2 for the sequence of splits).

Figure 2: Graph showing how we build a decision tree by splitting the data into different groups.

To infer the price of a new house, we would simply use the rules that we deduced and follow the respective branches of the decision tree until we arrive at one of the leaves. For instance, if we wanted to determine the price of a house built before 1980 on a property smaller than 10,000 units, we would end up in chunk 3. The idea is that the average price of the resulting leaf subset will be a good estimate of the actual price of the new house.
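To make this concrete, a hand-rolled version of such a lookup might look like the sketch below. Only the path to chunk 3 is spelled out in the text above; the other thresholds, chunk labels, and leaf averages are hypothetical and purely illustrative:

```python
def predict_price(year_built, property_size, chunk_avg_price):
    """Follow the decision rules from Figure 2 and return the
    average price of the leaf (chunk) we end up in."""
    if year_built < 1980:
        if property_size < 10_000:
            return chunk_avg_price["chunk_3"]  # path described in the text
        return chunk_avg_price["chunk_4"]      # hypothetical sibling leaf
    if property_size < 10_000:
        return chunk_avg_price["chunk_1"]      # hypothetical leaf
    return chunk_avg_price["chunk_2"]          # hypothetical leaf

# Example usage with made-up leaf averages
averages = {"chunk_1": 180_000, "chunk_2": 260_000,
            "chunk_3": 120_000, "chunk_4": 150_000}
print(predict_price(1975, 8_500, averages))  # ends up in chunk 3
```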

Building decision trees

But how do we know if the rules that we’ve defined are optimal? Why wouldn’t we split the dataset by houses that were built before 1960 instead of 1980? Furthermore, how many splits should we perform?

These are some of the questions that decision tree learning algorithms aim to answer. One popular way to grow a decision tree is the CART algorithm. It determines how to split the data into subsets by using statistical information from previously sold houses.


CART algorithm

For every node (branch point) in the tree:

  1. Find the best split for each column/feature:
    • Sort the entries of the column/feature from smallest to largest.
    • Evaluate all possible splits (if the training dataset contains K different values, there are K-1 possible splits; for categorical features, there are 2^(K-1)-1 possible splits).
    • Measure the quality of each split using a metric suited to your problem (see the discussion on metrics below).
    • Select the best split for this column/feature by comparing the scores computed in the previous step.
  2. Select the column/feature with the best split.
  3. Repeat steps 1 and 2 until you hit a stopping criterion. Possible stopping criteria include the minimum number of samples/data points in a leaf (terminal node) or the maximum depth (the number of splits performed one after the other).
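As an illustration of steps 1 and 2, here is a minimal Python sketch that searches for the best split over all numerical features, using the mean squared error (discussed below) as the split metric. It is a simplified illustration under these assumptions, not the implementation used by production libraries:

```python
import numpy as np

def mse(y):
    """Mean squared error of predicting the mean for all samples in y."""
    return np.mean((y - y.mean()) ** 2) if len(y) else 0.0

def best_split(X, y):
    """Return the (feature index, threshold, score) with the lowest weighted MSE."""
    n_samples, n_features = X.shape
    best = (None, None, np.inf)
    for j in range(n_features):
        values = np.unique(X[:, j])                   # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2   # K-1 candidate splits
        for t in thresholds:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            # Weighted MSE of the two child nodes; lower is better
            score = (len(left) * mse(left) + len(right) * mse(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best

# Tiny toy example: feature 0 = year built, feature 1 = property size
X = np.array([[1950, 8000], [1995, 12000], [1970, 9000], [2005, 15000]], dtype=float)
y = np.array([120_000, 260_000, 140_000, 310_000], dtype=float)
print(best_split(X, y))
```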

A question that remains is how to measure the quality of a split.

The answer depends on the task. For classification tasks, common metrics are the Gini impurity and the entropy. For a regression task such as ours, common metrics are the mean squared error and the median absolute error.

Regression
  • Mean squared error (variance reduction)
    The mean squared error is equivalent to computing the variance of all samples that end up in one leaf, hence its other name, variance reduction. Generally, the lower the score, the better.
    Equation 1: Mean squared error of leaf $m$:
    $\mathrm{MSE}_m = \frac{1}{N_m} \sum_{i \in Q_m} (y_i - \bar{y}_m)^2$, with $\bar{y}_m = \frac{1}{N_m} \sum_{i \in Q_m} y_i$,
    where:
    • $Q_m$ are all the samples that end up in leaf $m$,
    • $N_m$ is the total number of samples that end up in leaf $m$,
    • $\bar{y}_m$ is the average of all the samples that ended up in leaf $m$.
  • Median absolute error
    Similar to the mean squared error, but based on the median instead of the mean, which makes it more robust to outliers.
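As a small, hypothetical illustration of how the two metrics behave for the houses that end up in a single leaf (note how the squared error is far more sensitive to a single overpaid house), here is a sketch:

```python
import numpy as np

def leaf_mse(y):
    """Mean squared error: average squared deviation from the leaf mean."""
    return np.mean((y - np.mean(y)) ** 2)

def leaf_median_absolute_error(y):
    """Median absolute error: median absolute deviation from the leaf median."""
    return np.median(np.abs(y - np.median(y)))

# Prices (made up) of the houses that ended up in one leaf, with one outlier
prices = np.array([118_000, 122_000, 125_000, 300_000])
print(leaf_mse(prices))                    # dominated by the outlier
print(leaf_median_absolute_error(prices))  # robust to the outlier
```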

What are the problems with decision trees, and how can random forests help?

One of the major problems with fitting decision trees is that they are likely to overfit to the historical data. What do we mean by that?

Imagine again that you are looking at data of historical housing sales. While houses were mostly sold at reasonable prices, an occasional buyer might have misjudged the price and paid much more than the average. The real estate company does not want to repeat this mistake. As our tree grows in depth, the rules become more and more specific, and eventually a rule that splits off this expensive house into a leaf will be found. Deep trees are more likely to overfit to a finite training set, because the leaves only contain a few samples and the partition of the dataset is very fine-grained.

The metric would also be minimal if every sample from the historical dataset were placed in its own leaf. To prevent decision trees from overfitting, one can, for example, set a minimum number of samples required at a leaf node or set a maximum depth for the tree. Another approach that has proven to enhance stability is building an ensemble of different decision trees, also known as a forest.
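As a sketch of these two safeguards using scikit-learn’s DecisionTreeRegressor (the toy data below is purely illustrative; in practice you would fit on the full historical sales dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy stand-ins for the historical sales features and prices
X_train = np.array([[1950, 8000], [1995, 12000], [1970, 9000], [2005, 15000]], dtype=float)
y_train = np.array([120_000, 260_000, 140_000, 310_000], dtype=float)

# Constrain the tree so that leaves cannot become arbitrarily specific
tree = DecisionTreeRegressor(
    min_samples_leaf=2,  # every leaf must contain at least 2 sales
    max_depth=3,         # at most 3 consecutive splits
).fit(X_train, y_train)

print(tree.predict([[1975, 8500]]))
```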

Sources of randomness

Since the CART algorithm is deterministic, one usually uses two sources of randomness to build a diverse set of decision trees that help reduce the variance of the predictions (Random Forests).

The first source of randomness comes from perturbing the data used at the root node. This is done by sampling new datasets where samples are drawn with replacement from the original dataset (also called bagging). This means the same data points can appear several times in the newly formed datasets.
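A minimal sketch of drawing one such bootstrap sample with NumPy (assuming the dataset is stored as arrays X and y) could look like this:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bootstrap_sample(X, y, rng):
    """Draw a new dataset of the same size by sampling rows with replacement."""
    idx = rng.integers(0, len(X), size=len(X))  # indices may repeat
    return X[idx], y[idx]

# Toy dataset: each row is one historical sale
X = np.arange(10).reshape(10, 1)
y = np.arange(10) * 1000
X_boot, y_boot = bootstrap_sample(X, y, rng)
print(X_boot.ravel())  # some rows appear several times, others not at all
```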

The second source of randomness comes from the number of columns/features considered when building the decision tree: instead of evaluating every feature, typically only a random subset of the features is considered at each split.

We build a new decision tree for every dataset that we construct with these two sources of randomness. Individually, these decision trees give suboptimal results. But when we average their predictions to build a forest, the errors of the individual trees cancel each other out in many cases, and we are less likely to overfit to single samples in the dataset.

Random Forest hyperparameters

Different random forests can still give different results, not only because of the randomness, but also because of different hyperparameters that we have to set. Key parameters that need to be adapted in order to find and train an optimal random forest for a new dataset are:

  • The number of trees in the forest
  • The splitting metric to use when building individual decision trees
  • The number of random features to consider when building individual trees
  • The maximum depth we want individual trees to grow to
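With scikit-learn’s RandomForestRegressor, these hyperparameters roughly map to the parameters in the sketch below. The values are illustrative, not a tuned configuration, and the criterion name "squared_error" is the one used in recent scikit-learn versions:

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=200,           # number of trees in the forest
    criterion="squared_error",  # splitting metric for individual trees
    max_features="sqrt",        # random subset of features considered at each split
    max_depth=10,               # maximum depth of individual trees
    random_state=0,
)
# forest.fit(X_train, y_train) and forest.predict(X_new) would then train and predict
```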

Random Forests in Modulos AutoML

Several questions arise now: how do we choose the optimal hyperparameters? Do we need to preprocess our data to use a random forest algorithm? Is our problem even solvable with a random forest algorithm? How do we adapt to changes that appear in our data over time?

These are all valid questions, and their answers can have a tremendous impact on the performance of the model and on solving the problem at hand. With the Modulos AutoML platform, we do our best to resolve these questions in a systematic way. If you would like to know more, get in touch with us.

If you found our explanation of decision trees and random forests insightful, feel free to share it!
