
# tfp.optimizer.proximal_hessian_sparse_one_step

One step of (the outer loop of) the minimization algorithm.

This function returns a new value of `x`, equal to `x_start + x_update`. The increment `x_update in R^n` is computed by a coordinate descent method, that is, by a loop in which each iteration updates exactly one coordinate of `x_update`. (Some updates may leave the value of the coordinate unchanged.)

The particular update method used is to apply an L1-based proximity operator, "soft threshold", whose fixed point `x_update_fix` is the desired minimum:

```
x_update_fix = argmin{
    Loss(x_start + x_update')
      + l1_regularizer * ||x_start + x_update'||_1
      + l2_regularizer * ||x_start + x_update'||_2**2
    : x_update' }
```

where in each iteration `x_update'` is constrained to have at most one nonzero coordinate.
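
For intuition, the soft-threshold operator referenced above is the proximity operator of the L1 penalty: it shrinks a scalar toward zero by the regularization weight, and maps it exactly to zero when the shrinkage exceeds its magnitude. A minimal TensorFlow sketch (the function name `soft_threshold` here is illustrative, not part of this API):

```python
import tensorflow as tf

def soft_threshold(x, threshold):
  # Proximity operator of `threshold * ||.||_1`: shrink `x` toward 0 by
  # `threshold`, mapping it exactly to 0 wherever |x| <= threshold.
  return tf.sign(x) * tf.maximum(tf.abs(x) - threshold, 0.)

# Example: soft_threshold(tf.constant([-3., 0.5, 2.]), 1.) == [-2., 0., 1.]
```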

This update method preserves sparsity, i.e., tends to find sparse solutions if `x_start` is sparse. Additionally, the choice of step size is based on curvature (Hessian), which significantly speeds up convergence.

This algorithm assumes that `Loss` is convex, at least in a region surrounding the optimum. (If `l2_regularizer > 0`, then only weak convexity is needed.)

#### Args

- `gradient_unregularized_loss`: (Batch of) `Tensor` with the same shape and dtype as `x_start`, representing the gradient, evaluated at `x_start`, of the unregularized loss function (denoted `Loss` above). (In all current use cases, `Loss` is the negative log likelihood.)
- `hessian_unregularized_loss_outer`: (Batch of) `Tensor` or `SparseTensor` having the same dtype as `x_start` and shape `[N, n]`, where `x_start` has shape `[n]`, satisfying the property `Transpose(hessian_unregularized_loss_outer) @ diag(hessian_unregularized_loss_middle) @ hessian_unregularized_loss_outer = (approximation of) Hessian matrix of Loss, evaluated at x_start` (see the sketch following this list).
- `hessian_unregularized_loss_middle`: (Batch of) vector-shaped `Tensor` having the same dtype as `x_start` and shape `[N]`, where `hessian_unregularized_loss_outer` has shape `[N, n]`, satisfying the same property as above.
- `x_start`: (Batch of) vector-shaped, `float` `Tensor` representing the current value of the argument to the `Loss` function.
- `tolerance`: scalar, `float` `Tensor` representing the convergence threshold. The optimization step will terminate early, returning its current value of `x_start + x_update`, once the following condition is met: `||x_update_end - x_update_start||_2 / (1 + ||x_start||_2) < sqrt(tolerance)`, where `x_update_end` is the value of `x_update` at the end of a sweep and `x_update_start` is the value of `x_update` at the beginning of that sweep.
- `l1_regularizer`: scalar, `float` `Tensor` representing the weight of the L1 regularization term (see equation above). If L1 regularization is not required, then `tfp.glm.fit_one_step` is preferable.
- `l2_regularizer`: scalar, `float` `Tensor` representing the weight of the L2 regularization term (see equation above). Default value: `None` (i.e., no L2 regularization).
- `maximum_full_sweeps`: Python integer specifying the maximum number of sweeps to run. A "sweep" consists of an iteration of coordinate descent on each coordinate. After this many sweeps, the algorithm will terminate even if convergence has not been reached. Default value: `1`.
- `learning_rate`: scalar, `float` `Tensor` representing a multiplicative factor used to dampen the proximal gradient descent steps. Default value: `None` (i.e., factor is conceptually `1`).
- `name`: Python string representing the name of the TensorFlow operation. The default name is `"minimize_one_step"`.
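
As an illustration of the factorization above, when `Loss` is the negative log likelihood of a logistic regression model with design matrix of shape `[N, n]`, the exact Hessian is `X^T diag(p * (1 - p)) X`, so the outer factor is the design matrix itself and the middle factor is the vector of per-example variances. A hedged sketch with synthetic data (all variable names here are illustrative, not part of this API):

```python
import tensorflow as tf

# Tiny synthetic logistic-regression problem.
X = tf.constant([[1., 0., 2.],
                 [0., 1., -1.],
                 [1., 1., 0.],
                 [2., 0., 1.]])                    # design matrix, shape [N, n] = [4, 3]
y = tf.constant([1., 0., 1., 0.])                  # binary labels, shape [N]
x_start = tf.zeros([3])                            # current coefficients, shape [n]

p = tf.math.sigmoid(tf.linalg.matvec(X, x_start))  # predicted probabilities, shape [N]

# Gradient of the negative log likelihood: X^T (p - y).
gradient_unregularized_loss = tf.linalg.matvec(X, p - y, transpose_a=True)

# The Hessian of the negative log likelihood factors as X^T diag(p * (1 - p)) X,
# so the "outer" factor is X itself and the "middle" factor is p * (1 - p).
hessian_unregularized_loss_outer = X               # shape [N, n]
hessian_unregularized_loss_middle = p * (1. - p)   # shape [N]
```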

#### Returns

- `x`: (Batch of) `Tensor` having the same shape and dtype as `x_start`, representing the updated value of `x`, that is, `x_start + x_update`.
- `is_converged`: scalar, `bool` `Tensor` indicating whether convergence occurred across all batches within the specified number of sweeps.
- `iter`: scalar, `int` `Tensor` representing the actual number of coordinate updates made (before achieving convergence). Since each sweep consists of `tf.size(x_start)` iterations, the maximum number of updates is `maximum_full_sweeps * tf.size(x_start)`.
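
Continuing the illustrative logistic-regression setup sketched above, a hedged usage example of one outer-loop step (only `tfp.optimizer.proximal_hessian_sparse_one_step` itself is part of this API; the argument values are arbitrary):

```python
import tensorflow_probability as tfp

x, is_converged, num_iters = tfp.optimizer.proximal_hessian_sparse_one_step(
    gradient_unregularized_loss=gradient_unregularized_loss,
    hessian_unregularized_loss_outer=hessian_unregularized_loss_outer,
    hessian_unregularized_loss_middle=hessian_unregularized_loss_middle,
    x_start=x_start,
    tolerance=1e-6,
    l1_regularizer=0.1,
    maximum_full_sweeps=1)

# `x` has the same shape as `x_start`. To run further outer-loop steps, set
# `x_start = x`, recompute the gradient and Hessian factors at the new point,
# and call the function again.
```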

#### References

[1]: Jerome Friedman, Trevor Hastie and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 2010. https://www.jstatsoft.org/article/view/v033i01/v33i01.pdf

[2]: Guo-Xun Yuan, Chia-Hua Ho and Chih-Jen Lin. An Improved GLMNET for L1-regularized Logistic Regression. Journal of Machine Learning Research, 13, 2012. http://www.jmlr.org/papers/volume13/yuan12a/yuan12a.pdf
