Fixed Point Strategies in Data Science
This blog is primarily based on the paper
TL;DR - Fixed point methods identify general structures and principles of many big data problems and algorithms in simple, concise terms.
We begin with a quote from the introduction of the above paper:
"Nowadays, general convex optimization methods have
penetrated virtually all branches of data science."
-Combettes and Pesquet
An immediate implication is that it is essential for practitioners in this field to have at least some understanding of the fundamentals of optimization schemes. This blog discusses, at an introductory level, an abstract setting that generalizes convex optimization and carries over to frameworks used in additional areas in data science (e.g., variational inequalities, monotone inclusion models, game theoretic approaches, neural network structures, and plug-and-play methods). This abstract setting connects many famous optimization schemes (e.g., ADMM, ISTA, and PDHG) and their convergence analyses. Elements of this blog will be used in many subsequent blogs.
Common Core Elements
Herein we discuss a few core operations that arise in nearly every fixed point method. In each case, we define an operator T(u) that will later be used to weave each of these operations together.
Gradient Operator. Suppose we wish to minimize a convex function f. Given an estimate u of the minimizer, the classic approach for this task is to compute the gradient of f at the estimate u and then take a step in the negative gradient direction (with step size a). This procedure is defined via repeated application of the gradient operator, which is defined by
Projection Operator. The projection operator is typically defined in terms of a closed and convex set C. These sets often form constraints for a problem (e.g., a linear equation that must be satisfied). The projection is defined to be the closest point inside the set C to a provided point u, i.e.,
Proximal Operator. The proximal operator is a generalization of the projection operator. For a "nice" function f, the proximal operator is given by the formula
At first glance, the above expression appears to make things more complicated than necessary, but it turns out that many commonplace functions in optimization have explicit proximal formulas. For example, the proximal formula for the absolute value function (commonly called "shrink" or "soft threshold" operator) is given by
Abstraction of Core Elements
Each of the operators mentioned above can be further abstracted to a single, more general operator. We say an operator N is nonexpansive (also called 1-Lipschitz) provided
We say an operator T is averaged if there exists a nonexpansive operator N and number a in the interval (0,1) such that
This is where things get interesting. Under standard assumptions, both proximal and gradient operators are averaged. And, in both cases, to minimize the function f, we seek to find a fixed point of the gradient/proximal operator. We can use Krasnoselskii Mann (KM) iteration to find the fixed point of interest. The iteration is incredibly simple:
Classic theorems state this method will converge to a point u* satisfying u* = T(u*) (i.e., a fixed point of T ), which corresponds to a minimizer of the function f. This elegant theorem explains the convergence of many standard optimization algorithms. To show convergence, all one has to do is show that each update of an algorithm forms an averaged operator and that the fixed points of that operator correspond to the desired minimizers. There are a few key properties of averaged operators that are useful. For example, the composition of two averaged operators is averaged. This means that if T and R are averaged, then
is also averaged. Since proximal and gradient operators are averaged, we see the proximal gradient method using updates
converges. In this case, the limit u* of the method is a minimizer of f+g.
Connection ODE Discretization
Consider the ordinary differential equation (ODE) given by
The forward Euler and backward Euler schemes to evolve this ODE with step size a are given by repeated application of the gradient and proximal operators, respectively. Forward Euler uses updates
and backward Euler uses updates
This nifty connection hints at the many deep connections between differential equations and optimization (to be discussed in future blogs).
Data science problems are often closely tied to continuous convex optimization (and related problems). Algorithms for solving these problems can typically be written as fixed point updates with an operator T. Abstracting algorithms to their corresponding operators enables practitioners to identify the key pieces of information needed to ensure convergence (e.g., choices of step sizes and Lipschitz constants). This approach also enables practictioners to see how modifications to a particular algorithm or problem corresponds to changes in the associated operators and their properties. Thus, we can identify the core principles and general structures through fixed point strategies.
For a gentle introduction to these techniques, I recommend Cegielski's text Iterative Methods for Fixed Point Problems in Hilbert Spaces and Large Scale Convex Optimization via Monotone Operators by Ryu and Yin. For a more complete and dense reference, readers should get a copy of Convex Analysis and Monotone Operator Theory in Hilbert Spaces by Bauschke and Combettes. For a primer paper on related monotone methods, I also recommend by A primer on montone operator methods by Ryu and Boyd.
*In this setting, "nice" means convex, closed, and proper.
As always, I am happy to discuss in the comments below any comments, suggestions, criticisms, and/or questions pertaining to this post.