Anna Weigel (CTO at Modulos)
Have you ever wondered how features such as facial recognition on your phone and autocomplete in texts and emails actually work? The answer lies within artificial neural networks.
Today, artificial neural networks are used in many areas of life – from machine translation, chatbots, YouTube’s recommendations, and speech recognition all the way to autonomous driving. Interestingly, artificial neural networks also play an important role in the perfume industry, where they are used to combine ingredients and develop new fragrances in a manner unattainable for humans. Apart from these commercial purposes, neural networks have helped biologists solve a 50-year-old protein folding challenge: in a major scientific advance, an artificial intelligence system was recognized as a solution to the problem of predicting the shapes that proteins fold into. This demonstrates the ability of AI to significantly accelerate scientific discovery.
Although central to many of the latest technologies and innovations, artificial neural networks are not a new invention. The first artificial neural networks are often credited to Warren McCulloch and Walter Pitts. In 1943, McCulloch and Pitts used an electrical circuit to recreate the function of neurons in the human brain. They thus created a computational model for neural networks based on algorithms known as threshold logic. In the 2000s, increased computing power, the use of GPUs, and distributed computing helped to overcome the challenges of training neural networks. Artificial neural networks were then deployed on a large scale, particularly for image and visual recognition problems, in what we now know as deep learning.
The Underlying Principle of Neural Networks
An artificial neural network consists of dozens to millions of artificial neurons (also called processing units or nodes) arranged in a series of layers. Each node connects to other nodes via links. Each of those links has a specific weight that determines the strength of one node’s influence on the other. Each node computes the weighted sum of its inputs (X), using the link weights (w), and adds a bias (b). It then applies an activation function (φ) and passes the result, φ(w · X + b), on to the next node.
The purpose of the activation function is to introduce non-linearity. Without it, a stack of layers would still only compute a linear function of the input. Commonly used activation functions include the sigmoid (which scales the sum to a range between 0 and 1) and ReLU (Rectified Linear Unit, max(0, x)).
The other term besides the weighted sum is the bias, which is similar to the constant in a linear function (y = ax + b). It allows an additional shift of the weighted sum.
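To make the computation of a single node concrete, here is a minimal sketch in plain Python. The function names and the numbers are made up for illustration; only the formula (weighted sum plus bias, then activation) comes from the description above.

```python
import math

def sigmoid(z):
    """Squash a real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only, not taken from the article.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.1

# The weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.1 = 0.1
print(neuron(x, w, b, relu))
print(neuron(x, w, b, sigmoid))
```

Changing the bias shifts the weighted sum before the activation is applied, which is why a ReLU node with a strongly negative bias outputs 0 for most inputs.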
Within an artificial neural network, nodes are arranged in multiple layers. We distinguish between the input layer, hidden layers, and the output layer.
Nodes in the input layer receive input but don’t perform any computation. Hidden nodes, on the other hand, receive information from the previous layer, perform computations, and pass their outputs on to the next layer. Lastly, the output layer nodes perform the final computation and determine the network’s output. In deep neural networks, the “deep” therefore refers to the number of layers in the network.
The output layer’s activation function determines the model type (e.g. linear for a regression model, softmax for classification). In a fully connected network, each neuron connects to all neurons from the previous layer. That’s not the case with a convolutional neural network (see below).
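As a rough sketch of how data flows through fully connected layers, the snippet below pushes one input through a hidden layer and a softmax output layer. The layer sizes and all weights are hypothetical values chosen for illustration; for a regression model, the softmax call would simply be left out so that the output stays linear.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: every output node sees every input."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Apply max(0, v) to every value."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 2 output classes.
hidden_w = [[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.6, -0.2, 0.4], [-0.3, 0.8, 0.1]]
output_b = [0.05, -0.05]

x = [1.0, 2.0]
hidden = relu(dense(x, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
print(probabilities)
```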
What about building a neural network?
Even the aforementioned simple example has 7 free parameters (4 weights and 3 biases), while realistic networks contain many more layers and neurons. In order to build a neural network, we need to train it.
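The parameter count can be checked with a few lines of Python. This is a sketch under an assumption: the layout 1 input, 2 hidden nodes, 1 output is one fully connected architecture that yields 4 weights and 3 biases, but the article's original figure is not reproduced here. The second layout is a made-up larger example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    Input nodes perform no computation, so they have no biases.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

print(count_parameters([1, 2, 1]))       # (4, 3)
print(count_parameters([784, 128, 10]))  # (101632, 138)
```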
Training a Neural Network
A common method used to train an artificial neural network is backpropagation. It works in the following way: To start off, we need to initialize all weights and biases. One option is to randomly assign weights and set biases to 0. Then, to generate a prediction, we feed the input forward through the network – this is referred to as forward propagation. We then compare the prediction to the true label and calculate the error using a loss function (e.g. MSE for regression, cross entropy for classification).
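The two loss functions mentioned above can be sketched as follows. The function names and sample values are illustrative; the cross entropy shown is the common single-label form, which takes the negative log of the probability the model assigned to the true class.

```python
import math

def mse(predictions, targets):
    """Mean squared error between predicted and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

# Regression: squared errors are 0.25 and 1.0, so the mean is 0.625.
print(mse([2.5, 0.0], [3.0, -1.0]))

# Classification: the loss is small when the true class gets a high probability.
print(cross_entropy([0.7, 0.2, 0.1], 0))
print(cross_entropy([0.7, 0.2, 0.1], 2))
```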
Our aim is to minimize the error of the output layer, i.e. to reduce the loss. The prediction depends on the parameters (weights and biases) of the nodes in previous layers. So, to reduce the loss and produce a more accurate prediction, the nodes’ parameters have to be updated. We therefore propagate the errors back through the network (backpropagation) and use an optimization method to choose new parameters.
Finally, we propagate the input through the network again and repeat the process.
The optimization method, i.e. the strategy used to update the parameters, is key here. Gradient descent is a popular optimization algorithm: to minimize the loss function, we compute its gradient. In general, the gradient tells us how a slight variation of a function’s input changes the output. When trying to find the minimum of the loss function, i.e. a network’s optimal parameters, the negative gradient points us in the direction in which the loss decreases fastest. As we move backwards through the network, we compute the gradient (i.e. the partial derivatives) of the loss function with respect to the parameters in the previous layer. This allows us to estimate how a variation of a layer’s weights and biases impacts all successive layers and the final loss.
Once we have computed the gradient, we update our network’s parameters – from each weight and bias, we subtract the corresponding gradient multiplied by the learning rate.
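Putting the steps together, here is a minimal sketch of the whole loop: forward propagation, loss, backpropagation, and the gradient descent update. Everything specific is an assumption made for illustration: a tiny network with 1 input, 2 sigmoid hidden nodes, and 1 linear output, an MSE loss, made-up toy data, and a learning rate of 0.1.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, learning_rate=0.1, epochs=2000, seed=0):
    """Train a 1-2-1 network (sigmoid hidden layer, linear output) with MSE loss."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(2)]  # input -> hidden weights
    b1 = [0.0, 0.0]                              # hidden biases start at 0
    w2 = [rng.uniform(-1, 1) for _ in range(2)]  # hidden -> output weights
    b2 = 0.0                                     # output bias starts at 0
    history = []
    for _ in range(epochs):
        grad_w1, grad_b1 = [0.0, 0.0], [0.0, 0.0]
        grad_w2, grad_b2 = [0.0, 0.0], 0.0
        loss = 0.0
        for x, y in data:
            # Forward propagation
            h = [sigmoid(w1[j] * x + b1[j]) for j in range(2)]
            pred = sum(w2[j] * h[j] for j in range(2)) + b2
            loss += (pred - y) ** 2 / len(data)
            # Backpropagation: chain rule from the loss back to each parameter
            d_pred = 2 * (pred - y) / len(data)
            grad_b2 += d_pred
            for j in range(2):
                grad_w2[j] += d_pred * h[j]
                d_z = d_pred * w2[j] * h[j] * (1 - h[j])  # sigmoid derivative
                grad_w1[j] += d_z * x
                grad_b1[j] += d_z
        history.append(loss)
        # Gradient descent update: parameter -= learning_rate * gradient
        for j in range(2):
            w1[j] -= learning_rate * grad_w1[j]
            b1[j] -= learning_rate * grad_b1[j]
            w2[j] -= learning_rate * grad_w2[j]
        b2 -= learning_rate * grad_b2
    return history

# Toy data (made up for illustration): y = 2x + 1 on a few points in [-1, 1].
data = [(i / 4, 2 * (i / 4) + 1) for i in range(-4, 5)]
history = train(data)
print("loss before training:", history[0])
print("loss after training: ", history[-1])
```

Real frameworks compute these derivatives automatically; writing them out by hand is only practical for a network this small.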
The learning rate corresponds to the step size we take to approach the minimum – choose a learning rate that is too high and you might overshoot the minimum; choose a learning rate that is too low and you will need more iterations to find the minimum, and might get stuck in a local minimum.
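The effect of the learning rate is easy to see on a toy problem. The sketch below runs gradient descent on the made-up function f(w) = (w - 3)^2, whose minimum is at w = 3, with three illustrative learning rates.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3."""
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)  # derivative of (w - 3)^2
        w -= learning_rate * gradient
    return w

for lr in (0.001, 0.1, 1.1):
    w = descend(lr)
    print(f"learning rate {lr}: w = {w:.3f}, distance to minimum = {abs(w - 3):.3f}")
```

With 0.001 the parameter has barely moved after 20 steps, with 0.1 it ends up close to 3, and with 1.1 every step overshoots by more than the previous one, so the distance to the minimum grows.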
Gradient descent comes in different versions, e.g. batch gradient descent (use the whole data set for each update), stochastic gradient descent (use a single random sample for each update), and mini-batch gradient descent (use N random samples for each update).
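The three versions differ only in which samples feed into one parameter update. A minimal sketch, with a hypothetical helper name and a stand-in data set:

```python
import random

def select_samples(data, variant, batch_size=4, rng=None):
    """Return the samples used for one parameter update."""
    rng = rng or random.Random()
    if variant == "batch":
        return list(data)                    # the whole data set
    if variant == "stochastic":
        return [rng.choice(data)]            # one random sample
    if variant == "mini-batch":
        return rng.sample(data, batch_size)  # N random samples, no repeats
    raise ValueError(f"unknown variant: {variant}")

data = list(range(20))  # stand-in for 20 training examples
rng = random.Random(0)
for variant in ("batch", "stochastic", "mini-batch"):
    print(variant, len(select_samples(data, variant, rng=rng)))
```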
Now that we’ve mastered training of neural networks, it’s time to get to know some of the different types that exist, each serving a different use case and task.
Examples of Specialized Neural Networks
Convolutional Neural Networks
In contrast to fully connected networks, convolutional neural networks contain convolutional layers. Nodes in one layer only connect to some local nodes in the previous layer. This is similar to convolving the input with a filter. Compared to a fully connected network, there are significantly fewer weights.
Convolutional neural networks have proven especially effective for images.
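To illustrate the local connections, here is a sketch of a one-dimensional convolutional layer without bias or activation. The signal and the filter values are made up. As in most deep learning libraries, the filter is slid over the input without being flipped, which is strictly speaking a cross-correlation.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal; each output sees only a local window."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 0, 1]  # responds where the signal rises or falls

print(convolve_1d(signal, edge_kernel))  # [1, 1, 0, -1, -1]

# The same 3 weights are reused at every position. A fully connected layer
# mapping these 7 inputs to 5 outputs would need 7 * 5 = 35 weights.
print(len(edge_kernel), len(signal) * 5)
```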
Recurrent Neural Networks
Feedforward neural networks, like the one described above, assume that the input and output data are independent from each other – they map one input to one output. Recurrent neural networks (RNNs) have a “memory”; they remember information from previous inputs. Input and output are not independent from each other – the information from previous inputs influences how the current input is processed and therefore the current output.
RNNs are commonly used for sequential data, e.g. in natural language processing where the order of words impacts the meaning.
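The “memory” can be sketched with a single recurrent unit that feeds its own previous output back in at every step. The tanh activation and the weights below are assumptions chosen for illustration, not something prescribed by the text.

```python
import math

def rnn_step(x, hidden, w_input, w_hidden, bias):
    """One step of a single-unit recurrent cell."""
    return math.tanh(w_input * x + w_hidden * hidden + bias)

def run_rnn(sequence, w_input=0.5, w_hidden=0.8, bias=0.0):
    """Process a sequence element by element, carrying the hidden state along."""
    hidden = 0.0  # empty memory before the first input
    states = []
    for x in sequence:
        hidden = rnn_step(x, hidden, w_input, w_hidden, bias)
        states.append(hidden)
    return states

# Same values, different order: the final state differs.
print(run_rnn([1.0, 0.0, 0.0]))
print(run_rnn([0.0, 0.0, 1.0]))
```

In the first sequence the second and third inputs are 0, yet the state stays non-zero because the unit still remembers the first input.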
Autoencoders
The goal of autoencoders is not to return classifications or numerical values for regression; rather, it is to learn a representation of the data. They can compress as well as reconstruct data. Because they aim to reproduce their own input, they are trained in an unsupervised way. In its most basic form, an autoencoder consists of an input layer, at least one hidden layer, and an output layer. The input layer maps the input data to the lower dimensional hidden layer (encoder). The output layer then maps the data back to its original dimension (decoder).
Autoencoders are useful for image denoising, anomaly detection, or general dimensionality reduction.
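The encoder and decoder mappings can be sketched with two small matrices. Note the assumptions: the weights below are hand-picked rather than trained, there are no activations or biases, and the data is made up. The point is only to show the compression from 4 values to 2 and back, and how the reconstruction error can flag an unusual input.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hand-picked weights (not trained): the encoder averages neighbouring pairs,
# the decoder copies each code value back to both positions.
encoder = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]
decoder = [[1.0, 0.0],
           [1.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]

def reconstruct(x):
    code = matvec(encoder, x)     # 4 values -> 2 values
    return matvec(decoder, code)  # 2 values -> 4 values

def reconstruction_error(x):
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(x)))

typical = [2.0, 2.0, 5.0, 5.0]  # fits the pattern the weights encode
unusual = [2.0, 8.0, 5.0, 5.0]  # breaks the pattern

print(reconstruction_error(typical))  # 0.0
print(reconstruction_error(unusual))  # 18.0
```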
Finding the Best Model
During hyperparameter tuning, we try to find the best solution within a search space that can be infinite. It therefore makes sense to automate the process.
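One simple way to automate such a search is random search: sample candidate values, evaluate each, and keep the best. The sketch below is illustrative only and is not a description of any particular AutoML product. It tunes a single hyperparameter, the learning rate, and uses gradient descent on the made-up function (w - 3)^2 as a stand-in for training and evaluating a real model.

```python
import random

def final_loss(learning_rate, steps=20):
    """Stand-in for 'train a model and report its loss'."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)  # gradient step on (w - 3)^2
    return (w - 3) ** 2

def random_search(trials=20, seed=0):
    """Try random learning rates and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_rate, best_loss = None, float("inf")
    for _ in range(trials):
        rate = 10 ** rng.uniform(-4, 0)  # sample on a log scale
        loss = final_loss(rate)
        if loss < best_loss:
            best_rate, best_loss = rate, loss
    return best_rate, best_loss

best_rate, best_loss = random_search()
print(best_rate, best_loss)
```

Sampling on a log scale is a common choice for learning rates, because useful values often span several orders of magnitude.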
As a rough rule, more complex data sets require more complex models, such as neural networks. Keep in mind that training neural networks is time consuming and computationally expensive. You might also not always need a neural network. Depending on the task at hand, simpler models (e.g. random forest and ridge regression) can perform similarly.
However, this is only a guideline. It’s difficult to tell ahead of time which type of model would be best for your use case. That’s why it’s important not to discard simpler models right away, and to approach model selection and hyperparameter tuning in an unbiased and systematic way. This is exactly what we strive to do with the Modulos AutoML platform.
If you’d like to learn more about the concepts behind machine learning, please head over to our Resources Page where we have a series of videos on the topic.