Gradient descent is an incredibly powerful mathematical tool. Its purpose is to find a local minimum of a function of arbitrary complexity. Sounds a little boring, right? But if you've ever had a problem where you were trying to find the best answer, this is a great, albeit simplistic, solution. I say simplistic not in the sense that it is easy to understand, but because it is the basis of so many other optimization and learning algorithms.
The function you are trying to minimize can be anything. Many simple problems can be solved by substitution and algebra. However, once the functions get overly complex, this becomes impractical and sometimes downright impossible. This is where optimization comes in. Let's talk about a simple business example.
Let's say you are designing a product to be sold. It could be software with many components and engineers, or hardware with parts, engineers, and maintenance. The business-level engineers have created cost functions for each individual part of the product. There are three cost functions per part: the engineering cost, or what it costs (based on salary) for an engineer to create that component; the cost of the part itself; and the maintenance cost. All three are functions of time, where time refers to time to market. A component rushed to market takes less time to produce, but its quality will be worse, so the maintenance cost goes up, giving a net higher cost when added together. If we give the engineer more time to analyze and produce a better component, the maintenance cost goes down but the engineering cost goes up.
Now we have 100 parts with all of these cost functions derived and figured out. How do we find the fastest time to market while minimizing cost? Adding all of these functions together produces an unwieldy function whose minimum would be nearly impossible to find by hand. So we are better off using the gradient descent algorithm to find our minimum through optimization.
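To make the setup concrete, here is a minimal sketch of how the three per-part cost curves might be summed into one total-cost function of time to market. The function names and coefficients are entirely made up for illustration; real cost functions would come from the business analysis described above.

```python
# Hypothetical per-part cost curves, all functions of time-to-market t
# (say, in months). The shapes and coefficients are illustrative only.

def engineering_cost(t):
    # More analysis time means more salary spent.
    return 40.0 * t

def part_cost(t):
    # Rushed parts cost more to source; cost falls as t grows.
    return 500.0 / (t + 1.0)

def maintenance_cost(t):
    # Less engineering time means worse quality and higher upkeep.
    return 900.0 / (t + 1.0)

def total_cost(t, n_parts=100):
    # Identical parts here for simplicity; in practice each of the
    # 100 parts would contribute its own three curves to the sum.
    return n_parts * (engineering_cost(t) + part_cost(t) + maintenance_cost(t))

print(total_cost(6.0))  # total cost at t = 6
```

The function we would hand to gradient descent is `total_cost`, and the variable we would descend over is `t`.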
For a scenario like this to work, every point in time would have to correspond to a particular choice of components, so it may not be the most practical example, but it is a plausible use case in the business world.
The gradient descent algorithm is based on a calculus concept called the derivative. The derivative of a function is simply the rate at which that function changes. So if you're driving a car, from the moment you press the gas pedal you are changing speed. If you go from 0 to 60 mph in 5 seconds, your average change in speed (your acceleration) is 12 mph per second. If your speed increased at a constant rate throughout those 5 seconds, your speed curve would be a straight line and the derivative would be constant. But if you got up to 20 mph after 1 second, then 50 mph after 3 seconds, and finally 60 mph after 5 seconds, the rate wouldn't be constant and your speed curve would be more jagged.
For complex equations, the derivative can be estimated in many different ways and we will not cover those here. Instead, we will assume you know what the derivative is of your function (or know how to compute the value).
Gradient descent has a very clear update equation:

x' = x − γ∇F(x)  or, equivalently,  x' = x − γ ∂F(x)/∂x
I chose to express the derivative portion in two ways here: using the nabla (∇) symbol versus the partial-derivative form. The second form is more correct (in my opinion); however, they mean the same thing in this context. x' is pronounced "x prime" and indicates the new value. The gamma (γ) variable is sometimes called the learning rate (or, more correctly, the step size). It controls how fast the algorithm "learns," or closes in on the minimum of the function.
The value of gamma is critical. If it is too large, the algorithm takes steps that are too big and can overshoot the minimum. If it is too small, convergence may take impractically long.
To execute the algorithm, you must pick a starting x. Once this is done, you fill in the equation and get a new x'. Continue this until you reach a specified goal. What is the goal? Perhaps you want the change in x to fall below a threshold: when x is changing by only 0.1% each iteration, you stop the algorithm and read off your value.
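That loop can be sketched in a few lines of Python. This is a minimal version assuming you supply the derivative `dF` yourself, with the 0.1% relative-change threshold mentioned above as the stopping rule; the function name and defaults are my own.

```python
# Minimal gradient-descent loop with a relative-change stopping rule.
# dF is the derivative of the function being minimized.

def gradient_descent(dF, x, gamma=0.1, rel_tol=0.001, max_iter=10_000):
    for _ in range(max_iter):
        x_new = x - gamma * dF(x)  # x' = x - gamma * dF/dx
        # Stop once x changes by less than 0.1% between iterations.
        if abs(x_new - x) <= rel_tol * max(abs(x), 1e-12):
            return x_new
        x = x_new
    return x

# Example: minimize F(x) = (x - 3)**2, whose derivative is 2*(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), 0.0))
```

The `max_iter` guard is just a safety net so a badly chosen gamma can't loop forever.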
Let's go through a simple example.
Let's start out with x = 0. Then we run the algorithm over and over until we reach a reasonable value.
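The iteration is easy to reproduce in code. The example function itself appeared as an image in the original post, so F(x) = x² + 3x here is an assumed stand-in, chosen because its minimum sits at x = −1.5, matching the result the table arrives at.

```python
# Assumed stand-in for the post's example function:
# F(x) = x**2 + 3*x, with derivative dF/dx = 2*x + 3,
# which is zero at the minimum x = -1.5.

x = 0.0
gamma = 0.1
for i in range(50):
    x = x - gamma * (2 * x + 3)  # x' = x - gamma * dF/dx

print(round(x, 4))  # prints -1.5
```

Each pass shrinks the distance to −1.5 by a factor of 0.8, which is why a spreadsheet of this same update converges so quickly.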
As you can see in the table, we approach x = -1.5. This calculation was performed in Google Sheets fairly quickly by entering the equation above with a gamma of 0.1. So what does F(x) look like on a graph? The plot below shows what we should expect.
As you can see in the plot, our minimum occurs at x = -1.5, exactly as the algorithm "predicted."
The example above was extremely trivial; we could have found the minimum analytically by setting the derivative to zero. However, it shows how the algorithm works. Realistically, we would use this technique in much more complex situations. I have used it in circuit simulation to find necessary component values.
You should realize that if the function is complex, you may not find the absolute (global) minimum. This is a real limitation: the algorithm only finds a local minimum. That is the nature of how it works. Once a minimum is found, the algorithm has no way of knowing whether other, lower minima exist. This is where the starting x becomes important. In the past, I have started the algorithm from a variety of random points and chosen the lowest minimum among the results. However, even this isn't perfect.
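The random-restart idea can be sketched as follows. The function here is made up to have two basins, one deeper than the other, so descents from different starting points can land in different minima; none of this comes from the original post.

```python
import random

# F(x) = x**4 - 4*x**2 + x has two local minima; the one near
# x = -1.47 is the global minimum.

def F(x):
    return x**4 - 4 * x**2 + x

def dF(x):
    return 4 * x**3 - 8 * x + 1

def descend(x, gamma=0.01, steps=2000):
    # Fixed-step descent; steps is large enough to converge here.
    for _ in range(steps):
        x = x - gamma * dF(x)
    return x

random.seed(0)  # reproducible starting points
starts = [random.uniform(-3, 3) for _ in range(10)]
candidates = [descend(x0) for x0 in starts]
best = min(candidates, key=F)  # keep whichever result has the lowest F
print(best)
```

Even with many restarts there is no guarantee the global minimum is found, which is the imperfection noted above.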
Lastly, if you're performing this algorithm over more than one variable (instead of just x, you're using x, y, and z), make sure you perform all the derivative calculations before assigning any of the new values. If you do not, the algorithm will not converge correctly and you may wonder why everything is out of whack.
Gradient descent is a powerful optimization tool. It is commonly used in neural networks as the basis for backpropagation, and in many SPICE software packages as a way to optimize circuit values. For all its power, though, it isn't perfect by itself. You must augment its simplicity with logic that can determine the best possible values.
Along with that, how you determine your functions is also extremely critical. What do you want to minimize? Are all the variables accounted for? These types of questions should be asked regularly when using this algorithm.
If you have any questions, feel free to ask!
'Til next time