Both networks are trained by what is called gradient descent, an optimization method. A common analogy illustrates how it works: Picture yourself on a mountain. It is early in the morning, and heavy fog hangs over the area, severely limiting your view and making it impossible to see the path downhill. To get to the valley nonetheless, you rely on local information. By looking for the steepest slope still visible from your current position, you can gradually descend. To check whether you have reached the actual bottom of the mountain – and not just some small pit halfway down (a local minimum, e.g. a mountain lake) – you have a special tool. However, using this tool is difficult and very time-consuming, so you use it sparingly. Choosing how often to use the tool is hard, but eventually you will reach the valley – the global minimum.
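The descent in the analogy can be sketched in a few lines of Python. This is a minimal illustration, not the networks' actual training code: the example function, learning rate, and step count are assumptions chosen for clarity.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimize a function by repeatedly stepping against its gradient.

    Like the hiker in the fog, we only use local information: the
    gradient at the current position tells us which way is "downhill".
    """
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # take a small step downhill
    return x

# Illustrative example: minimize f(x) = (x - 3)^2, whose gradient is 2*(x - 3).
# The unique (global) minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # converges toward 3.0
```

On functions with several valleys, the same procedure can get stuck in a local minimum, which is why the analogy's "special tool" (and, in practice, techniques such as random restarts or momentum) is needed to check whether the true bottom has been reached.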