tfp.optimizer.proximal_hessian_sparse_one_step

One step of (the outer loop of) the minimization algorithm.

This function returns a new value of x, equal to x_start + x_update. The increment x_update in R^n is computed by a coordinate descent method [1], that is, by a loop in which each iteration updates exactly one coordinate of x_update. (Some updates may leave the value of the coordinate unchanged.)

The particular update method used is to apply an L1-based proximity operator, "soft threshold", whose fixed point x_update_fix is the desired minimizer:

x_update_fix = argmin{
    Loss(x_start + x_update')
      + l1_regularizer * ||x_start + x_update'||_1
      + l2_regularizer * ||x_start + x_update'||_2**2
    : x_update' }

where in each iteration x_update' is constrained to have at most one nonzero coordinate.
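The soft-threshold operator itself has a simple closed form. The sketch below is a minimal illustration, not the library's internal implementation; the names soft_threshold, g_j, h_jj, and w_old are hypothetical.

import tensorflow as tf

def soft_threshold(x, threshold):
  # Proximity operator of threshold * ||.||_1, i.e.
  #   soft_threshold(z, t) = argmin_w { 0.5 * (w - z)**2 + t * |w| }.
  # Inputs within [-threshold, threshold] are mapped exactly to zero,
  # which is what makes the updates sparsity-preserving.
  return tf.sign(x) * tf.maximum(tf.abs(x) - threshold, 0.)

# E.g., soft_threshold([-2., -0.5, 0.3, 1.5], 1.) == [-1., 0., 0., 0.5].
#
# A GLMNET-style update of a single coordinate j (hypothetical scalar names):
# given the directional derivative g_j of the quadratic model of Loss, the
# curvature h_jj = Hessian[j, j], and the coordinate's current value w_old,
#   w_new = soft_threshold(h_jj * w_old - g_j, l1_regularizer)
#           / (h_jj + 2 * l2_regularizer)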

This update method preserves sparsity, i.e., it tends to find sparse solutions if x_start is sparse. Additionally, the choice of step size is based on curvature (the Hessian), which significantly speeds up convergence [2].

This algorithm assumes that Loss is convex, at least in a region surrounding the optimum. (If l2_regularizer > 0, then only weak convexity is needed.)

Args

gradient_unregularized_loss: (Batch of) Tensor with the same shape and dtype as x_start, representing the gradient, evaluated at x_start, of the unregularized loss function (denoted Loss above). (In all current use cases, Loss is the negative log likelihood.)
hessian_unregularized_loss_outer: (Batch of) Tensor or SparseTensor having the same dtype as x_start and shape [N, n], where x_start has shape [n], satisfying the property Transpose(hessian_unregularized_loss_outer) @ diag(hessian_unregularized_loss_middle) @ hessian_unregularized_loss_outer = (approximation of the) Hessian matrix of Loss, evaluated at x_start. (A concrete factorization is shown in the sketch after this list.)
hessian_unregularized_loss_middle: (Batch of) vector-shaped Tensor having the same dtype as x_start and shape [N], where hessian_unregularized_loss_outer has shape [N, n], satisfying the factorization property stated under hessian_unregularized_loss_outer above.
x_start: (Batch of) vector-shaped, float Tensor representing the current value of the argument to the Loss function.
tolerance: scalar, float Tensor representing the convergence threshold. The optimization step will terminate early, returning its current value of x_start + x_update, once the following condition is met: ||x_update_end - x_update_start||_2 / (1 + ||x_start||_2) < sqrt(tolerance), where x_update_end is the value of x_update at the end of a sweep and x_update_start is its value at the beginning of that sweep.
l1_regularizer: scalar, float Tensor representing the weight of the L1 regularization term (see the equation above). If L1 regularization is not required, then tfp.glm.fit_one_step is preferable.
l2_regularizer: scalar, float Tensor representing the weight of the L2 regularization term (see the equation above). Default value: None (i.e., no L2 regularization).
maximum_full_sweeps: Python integer specifying the maximum number of sweeps to run. A "sweep" consists of one iteration of coordinate descent on each coordinate. After this many sweeps, the algorithm terminates even if convergence has not been reached. Default value: 1.
learning_rate: scalar, float Tensor representing a multiplicative factor used to dampen the proximal gradient descent steps. Default value: None (i.e., the factor is conceptually 1).
name: Python string representing the name of the TensorFlow operation. The default name is "minimize_one_step".
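For concreteness, the following hedged sketch applies one step to the ordinary least-squares loss Loss(x) = 0.5 * ||A @ x - b||_2**2, whose exact Hessian A^T A factors as Transpose(outer) @ diag(middle) @ outer with outer = A and middle = ones. The design matrix A, targets b, and all numeric values are made up for illustration.

import tensorflow as tf
import tensorflow_probability as tfp

N, n = 20, 5
A = tf.random.normal([N, n], seed=0)  # hypothetical design matrix, shape [N, n]
b = tf.random.normal([N], seed=1)     # hypothetical targets, shape [N]
x_start = tf.zeros([n])

# For Loss(x) = 0.5 * ||A @ x - b||_2**2:
#   gradient = A^T (A @ x - b)
#   Hessian  = A^T A = Transpose(A) @ diag(ones(N)) @ A
residual = tf.linalg.matvec(A, x_start) - b
gradient = tf.linalg.matvec(A, residual, transpose_a=True)

x, is_converged, num_iter = tfp.optimizer.proximal_hessian_sparse_one_step(
    gradient_unregularized_loss=gradient,
    hessian_unregularized_loss_outer=A,
    hessian_unregularized_loss_middle=tf.ones([N]),
    x_start=x_start,
    tolerance=1e-6,
    l1_regularizer=0.1,
    maximum_full_sweeps=10)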

Returns

x: (Batch of) Tensor having the same shape and dtype as x_start, representing the updated value of x, that is, x_start + x_update.
is_converged: scalar, bool Tensor indicating whether convergence occurred across all batches within the specified number of sweeps (see the driver-loop sketch below).
iter: scalar, int Tensor representing the actual number of coordinate updates made (before achieving convergence). Since each sweep consists of tf.size(x_start) iterations, the maximum number of updates is maximum_full_sweeps * tf.size(x_start).
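Because this function performs only one step of the outer loop, a caller typically re-evaluates the gradient (and, for non-quadratic losses, the Hessian factors) at the returned x and repeats until is_converged is true. A hedged driver-loop sketch, continuing the least-squares example above:

x = tf.zeros([n])
for _ in range(100):  # outer loop: re-linearize Loss at the current x
  residual = tf.linalg.matvec(A, x) - b
  gradient = tf.linalg.matvec(A, residual, transpose_a=True)
  # Each call sweeps all n coordinates at most maximum_full_sweeps times,
  # stopping early once ||x_update_end - x_update_start||_2 / (1 + ||x||_2)
  # drops below sqrt(tolerance), per the tolerance argument above.
  x, is_converged, _ = tfp.optimizer.proximal_hessian_sparse_one_step(
      gradient_unregularized_loss=gradient,
      hessian_unregularized_loss_outer=A,
      hessian_unregularized_loss_middle=tf.ones([N]),
      x_start=x,
      tolerance=1e-6,
      l1_regularizer=0.1,
      maximum_full_sweeps=1)
  if bool(is_converged):  # scalar bool Tensor; works in eager mode
    break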

References

[1]: Jerome Friedman, Trevor Hastie and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 2010. https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf

[2]: Guo-Xun Yuan, Chia-Hua Ho and Chih-Jen Lin. An Improved GLMNET for L1-regularized Logistic Regression. Journal of Machine Learning Research, 13, 2012. http://www.jmlr.org/papers/volume13/yuan12a/yuan12a.pdf