Subspace Methods: Dimension Balance between Approximation to Optimization Problem and Solving Subproblem

Pengcheng Xie
xpc@lsec.cc.ac.cn
Supervised by Prof. Ya-xiang Yuan
Institute of Computational Mathematics and Scientific/Engineering Computing
Academy of Mathematics and Systems Science
Chinese Academy of Sciences, China

Group Seminar
December 8, 2020
Outline
Introduction
Why study subspace methods?
Subspace methods with different structure
How to design subspace methods?
Conclusion and future work
What kinds of subspace methods are wanted?
Introduction
Why study subspace methods?
Solve image reconstruction in CT by PDFO²: failed

An inverse problem in [Chen et al. 2017]: find the best $x \in \mathbb{R}^n$ satisfying $f(x) = y$, i.e.
$$\min_{x \in \mathbb{R}^n} \|f(x) - y\|_2^2,$$
where $f: \mathbb{R}^n \to \mathbb{R}^n$ and $y \in \mathbb{R}^n$; here $x$ and $y$ are long vectors reshaped from a $512 \times 512$ matrix, so $n = 512 \times 512 = 262144$.

We chose PDFO to solve it, but PDFO reports an error: “uobyqa: problem too large for uobyqa. Try other solvers.”
Figure 1: Monochromatic image of the DE-472 lung phantoms
Sad: this problem cannot, and does not have to, be solved by DFO.
Happy: Tom's words¹

¹ Tom M. Ragonneau: Ph.D. student at PolyU, supervised by Prof. Zaikun Zhang and co-supervised by Prof. Xiaojun Chen.
² Powell's Derivative-Free Optimization solvers: https://www.pdfo.net
Tom's words and Zaikun's subspace method

Tom:
“In DFO, n = 100 is considered a large problem, and n = 200 is considered a very large problem. I read once that NEWUOA has been tested with n = 1000, but this is incredibly huge.”
“Do you have any way to reduce the size of your problem, to find some kind of space (of lower dimension) in which your variables may belong (even approximately)?”
[Zhang 2012]: solve the subproblem
$$\min_{d \in S_k} Q_k(x_k + d)$$
on the subspace
$$S_k = \mathrm{span}\{\nabla Q_k(x_k),\ d_{k-1},\ \bar{d}_k\},$$
where
$$\bar{d}_k = \sum_{y \in I_k} \frac{f(y) - f(x_k)}{\|y - x_k\|_2} \cdot \frac{y - x_k}{\|y - x_k\|_2} \approx \nabla f(x_k),$$
and $I_k$ is the interpolation point set.
Zaikun’s subspace method
Algorithm 1 NEWUOAs
1. Choose positive sequences $\{h_k\}$, $\{p_k\}$ and a constant $\varepsilon > 0$. Set the initial point $x_1$; $s_0 := 0$; $k := 1$.
2. Choose $m_k \in [\,n+1,\ \tfrac{(n+1)(n+2)}{2}\,]$, call MODEL($x_k, h_k, m_k$) to get $\tilde{g}_k$, the approximation of the gradient at $x_k$. If $h_k < \varepsilon$ and $\|\tilde{g}_k\| < \varepsilon$, stop. Let
$$S_k = \mathrm{span}\{\tilde{g}_k,\ s_{k-1}\}.$$
3. Set RHOEND $= p_k$, call NEWUOA to solve the subproblem
$$\min_{d \in S_k} f(x_k + d)$$
and get $d_k$.
4. If $f(x_k + d_k) < f(x_k)$, then $x_{k+1} := x_k + d_k$, $s_k := d_k$; otherwise $x_{k+1} := x_k$, $s_k := s_{k-1}$. Set $k := k+1$ and go to Step 2.
NEWUOA: dimension <1000.
NEWUOAs: dimension = 2000.
Global convergence and R-linear convergence are established.
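Below is a minimal, illustrative Python sketch of one iteration of Algorithm 1, under stated assumptions: SciPy's derivative-free Powell solver stands in for NEWUOA, a forward-difference estimate stands in for the MODEL step, and the helper names are hypothetical, not part of NEWUOAs or PDFO.

```python
# A rough sketch of one NEWUOAs-style subspace step (not the actual NEWUOAs code):
# build S_k = span{g_tilde_k, s_{k-1}} and minimize f restricted to that 2-D subspace.
import numpy as np
from scipy.optimize import minimize

def approx_gradient(f, x, h=1e-6):
    """Forward-difference gradient estimate (stand-in for MODEL)."""
    g, fx = np.zeros_like(x), f(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

def subspace_step(f, xk, sk_prev):
    """Minimize f over x_k + S_k with S_k = span{g_tilde, s_{k-1}}."""
    g = approx_gradient(f, xk)
    Q, _ = np.linalg.qr(np.column_stack([g, sk_prev]))      # orthonormal basis of S_k
    res = minimize(lambda c: f(xk + Q @ c), np.zeros(Q.shape[1]),
                   method="Powell")                          # derivative-free subproblem solver
    d = Q @ res.x
    return (xk + d, d) if f(xk + d) < f(xk) else (xk, sk_prev)

# toy usage on a 50-dimensional quadratic
f = lambda x: np.sum((np.arange(1, 51) * x) ** 2)
x, s = np.full(50, 1.0), np.full(50, -1.0)
for _ in range(20):
    x, s = subspace_step(f, x, s)
print(f(x))
```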
Optimization problem and subproblem

Optimization problem: find $x^*$ satisfying
$$\min_{x} f(x) \quad \text{s.t.}\ x \in \mathcal{X}.$$
Subproblem: find $x_{k+1} = x_k + d$ satisfying
$$\min_{d} m_k(x_k + d) \quad \text{s.t.}\ d \in \mathcal{D}.$$
Choose $x_{k+1}$ from $x_k$ in the subproblem

Line search method
1. Generate a descent search direction $d_k$.
2. Search along this direction for a step size $\alpha_k$:
$$\min_{\alpha \in \mathbb{R}} \varphi_k(\alpha) = f(x_k + \alpha d_k).$$
3. $x_{k+1} = x_k + \alpha_k d_k$.
Direction: an $n$-dimensional problem. Stepsize: a 1-dimensional problem.

Trust region method
1. Given a trust region radius (playing a role like a step size).
2. Compute a search direction in the trust region:
$$\min_{s \in \mathbb{R}^n} Q_k(s) = g_k^\top s + \tfrac{1}{2} s^\top B_k s \quad \text{s.t.}\ \|s\|_2 \le \Delta_k.$$
3. $x_{k+1} = x_k + s_k$.
Radius: a 1-dimensional problem. Direction: an $n$-dimensional problem.

Where is the intermediate-dimension problem ($1 < \text{dimension} < n$)? Subspace methods.
Why do we want to solve the intermediate-dimension problem?

Question: there is no need to deliberately produce an intermediate-dimension problem.
Answer: there is an imbalance between computing the direction and computing the stepsize³.
Reduce the dimension.
Gather more information.
Special problems or needs.
[Conn et al. 1994]: require that $S_k$ contains at least two components:
a gradient-related direction, to encourage global convergence;
a Newton-related direction, to encourage fast asymptotic convergence.

Extension of the dogleg method:
$$\min_{d \in \mathbb{R}^n} m(d) \stackrel{\mathrm{def}}{=} f + g^\top d + \tfrac{1}{2} d^\top B d \quad \text{s.t.}\ \|d\| \le \Delta$$
becomes
$$\min_{d} m(d) = f + g^\top d + \tfrac{1}{2} d^\top B d \quad \text{s.t.}\ \|d\| \le \Delta,\ d \in \mathrm{span}\{g,\ B^{-1} g\}.$$
³ Prof. Ya-xiang Yuan's presentation at ICM 2014.
Subspace methods with different structure
How to design subspace methods?
Typical scenarios to design subspace methods
[Liu, Wen and Yuan 2020]⁴:

Problem: $\min_{x} f(x)$ s.t. $x \in \mathcal{X}$.
Subproblem ($x_k \to x_{k+1}$): $\min_{d} m_k(x_k + d)$ s.t. $d \in \mathcal{D}$.
Find a linear combination of several known directions.
Linear and nonlinear conjugate gradient methods [Sun and Yuan
2006; Nocedal and Wright 2006]
Nesterov’s accelerated gradient method [Nesterov 2003; Nesterov
1983]
Heavy-ball method [Polyak 1964]
Momentum method [Goodfellow, Bengio, and Courville 2016]
Keep the objective function and constraints, but add an extra
restriction in a certain subspace.
OMP [Tropp and Gilbert 2008]
CoSaMP [Needell and Tropp 2010]
LOBPCG [Andrew 2001]
LMSVD [Liu, Wen, and Zhang 2013]
⁴ Subspace Methods for Nonlinear Optimization: http://bicmr.pku.edu.cn/~wenzw/paper/SubOptv.pdf
Approximate the objective function but keep the constraints.
BCD [Tseng and Yun 2009]
RBR [Wen, Goldfarb, and Scheinberg 2012]
Parallel subspace correction [Fornasier 2007; Fornasier and Schönlieb 2008]
Use subspace techniques to approximate the objective
functions.
Sampling/Sketching [Goodfellow, Bengio, and Courville 2016; Mahoney 2011]
Nyström approximation [Tropp et al. 2017]
Approximate the objective function and design new
constraints.
Trust region methods with subspaces [Shultz, Schnabel, and Byrd
1985]
FPC_AS [Wen et al. 2010]
Add a postprocess procedure after the subspace problem is
solved.
Truncated subspace method for tensor train [Zhang, Wen, and
Zhang 2016]
Integrate the optimization method and subspace update in one
framework.
Polynomial-filtered subspace method for low-rank matrix
optimization [Liu, Wen and Yuan 2020]
Subspace relationship

$\dim(S_k) = \dim(S_{k+1})$: $S_k \neq S_{k+1}$
$\dim(S_k) \le \dim(S_{k+1})$: $S_k \subseteq S_{k+1}$
$\sum_{k=1}^{p} \dim(S_k) = n$: $S_1 + \cdots + S_p = \mathbb{R}^n$
$\dim(S_k) = |I_k|$: $S_k = \mathrm{span}\{e_i : i \in I_k\}$
$\dim(S_k) \ge \dim(S_{k+1})$: $S_k \supseteq S_{k+1}$

Corresponding method families:
Direction-Gradient subspaces
One-add-one-drop subspaces
Krylov subspaces
Nested subspaces
Complement subspaces
Subsampling/Sketching
Stochastic optimization
Active set methods (towards)
Direction-Gradient subspace method for $x \in \mathbb{R}^n$

Linear combination of several known directions.

Conjugate gradient method:
$$d_k = -g_k + \beta_{k-1} d_{k-1}, \qquad S_k = \mathrm{span}\{x_{k-1},\ g_k,\ d_{k-1}\}.$$
Global convergence; $n$-step local quadratic convergence.

Nesterov's accelerated gradient method (FISTA) [Beck and Teboulle 2009; Nesterov 2003]:
$$y_k = x_{k-1} + \frac{k-2}{k+1}(x_{k-1} - x_{k-2}), \qquad x_k = y_k - \alpha_k \nabla f(y_k), \qquad S_k = \mathrm{span}\{x_{k-1},\ x_{k-2},\ \nabla f(y_k)\}.$$

Gradient method: stepsize $\tfrac{1}{L}$, convergence rate $O(\tfrac{1}{k})$.
FISTA: stepsize $\tfrac{1}{L}$, convergence rate $O(\tfrac{1}{k^2})$.
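The FISTA recursion above fits in a few lines of Python; the sketch below assumes a smooth convex quadratic and the step size 1/L from the slide, and only illustrates the update (the proximal term of the full FISTA method is omitted).

```python
# Accelerated gradient (FISTA-type) recursion: extrapolate to y_k, then a gradient step at y_k.
import numpy as np

def accelerated_gradient(grad, L, x0, iters=200):
    x_prev = x_curr = x0.copy()
    for k in range(1, iters + 1):
        y = x_curr + (k - 2) / (k + 1) * (x_curr - x_prev)   # extrapolation step
        x_next = y - grad(y) / L                             # gradient step with stepsize 1/L
        x_prev, x_curr = x_curr, x_next                      # x_k lies in span{x_{k-1}, x_{k-2}, grad f(y_k)}
    return x_curr

# toy usage: f(x) = 1/2 x^T A x - b^T x
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 40)); A = M @ M.T + np.eye(40); b = rng.standard_normal(40)
L_const = np.linalg.eigvalsh(A)[-1]                          # Lipschitz constant of grad f
x = accelerated_gradient(lambda z: A @ z - b, L_const, np.zeros(40))
print(np.linalg.norm(A @ x - b))
```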
Limited memory methods for eigenvalue computation
Find a $p$-dimensional eigenspace associated with the $p$ largest eigenvalues of $A$:
$$\max_{X \in \mathbb{R}^{n \times p}} \mathrm{tr}(X^\top A X) \quad \text{s.t.}\ X^\top X = I. \qquad (1)$$
The first-order optimality conditions of (1) are
$$AX = X\Lambda, \qquad X^\top X = I,$$
where $\Lambda = X^\top A X \in \mathbb{R}^{p \times p}$ is the matrix of Lagrange multipliers.
At each iteration, the methods solve a subspace trace maximization problem
$$Y = \arg\max_{X \in \mathbb{R}^{n \times p}} \big\{ \mathrm{tr}(X^\top A X) : X^\top X = I,\ X \subset \mathcal{S} \big\}.$$
LOBPCG [Andrew 2001]: $\mathcal{S} = \mathrm{span}\{X_{i-1},\ X_i,\ AX_i\}$.
There is no theory that accurately predicts the convergence speed, but LOBPCG does not converge more slowly than block steepest ascent at any step.
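For reference, SciPy ships an LOBPCG implementation; the short usage sketch below (matrix sizes and tolerances are arbitrary choices) computes the p largest eigenpairs of a symmetric matrix, with the block iterate playing the role of the subspace basis updated at each step.

```python
# Usage sketch of scipy.sparse.linalg.lobpcg for the p largest eigenpairs.
import numpy as np
from scipy.sparse.linalg import lobpcg

rng = np.random.default_rng(0)
n, p = 500, 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2 + n * np.eye(n)          # symmetric positive definite test matrix
X0 = rng.standard_normal((n, p))           # random initial block of p vectors
vals, vecs = lobpcg(A, X0, largest=True, tol=1e-8, maxiter=200)
print(np.sort(vals)[::-1])                 # approximations to the p largest eigenvalues
```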
Truncated subspace method for tensor train
[Zhang, Wen, and Zhang 2016]:
$x \in \mathbb{R}^n \ \to\ x \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, e.g. $n \sim O(10^{42})$.
Tensor cores $X_\mu \in \mathbb{R}^{r_{\mu-1} \times r_\mu \times n_\mu}$, with fixed dimensions $r_\mu$ bounded by a constant $r$: the TT rank.

Figure 2: $x_{i_1 i_2 \ldots i_d} = X_1(i_1) X_2(i_2) \cdots X_d(i_d)$ (TT format)
Figure 3: $X_{i_1,\ldots,i_\mu,\ldots,i_d;\, j} = X_1(i_1) \cdots X_{\mu, j}(i_\mu) \cdots X_d(i_d)$ ($\mu$-BTT format)

Operator TT format: $A_{i_1 i_2 \cdots i_d,\, j_1 j_2 \cdots j_d} = A_1(i_1, j_1) A_2(i_2, j_2) \cdots A_d(i_d, j_d)$,
where $A_\mu(i_\mu, j_\mu) \in \mathbb{R}^{r_{\mu-1} \times r_\mu}$ for $i_\mu, j_\mu \in \{1, \ldots, n_\mu\}$.
Truncated subspace method for tensor train
The eigenvalue problem in the BTT format is
$$\min_{X \in \mathbb{R}^{n \times p}} \mathrm{tr}(X^\top A X) \quad \text{s.t.}\ X^\top X = I_p \ \text{and}\ X \in \mathbb{T}_{n,r,p}.$$
Subspaces: $S^T_k = \mathrm{span}\{P_T(AX_k),\ X_k,\ X_{k-1}\}$ or $S^T_k = \mathrm{span}\{X_k,\ P_T(R_k),\ P_T(P_k)\}$,
where $P_T(AX_k)$ is the truncation of $AX_k$ to $\mathbb{T}_{n,r,p}$. The subspace problem in the BTT format is
$$Y_{k+1} := \arg\min_{X \in \mathbb{R}^{n \times p}} \mathrm{tr}(X^\top A X) \quad \text{s.t.}\ X^\top X = I_p,\ X \in S^T_k, \qquad (2)$$
which is equivalent to a generalized eigenvalue decomposition problem:
$$\min_{V \in \mathbb{R}^{q \times p}} \mathrm{tr}(V^\top S^\top A S V) \quad \text{s.t.}\ V^\top S^\top S V = I_p.$$
We then project $Y_{k+1}$ back to the required set $\mathbb{T}_{n,r,p}$:
$$X_{k+1} = \arg\min_{X \in \mathbb{R}^{n \times p}} \|X - Y_{k+1}\|_F^2 \quad \text{s.t.}\ X^\top X = I_p,\ X \in \mathbb{T}_{n,r,p}.$$
This problem can be solved using an alternating minimization scheme.
Quasi-Newton methods
L-BFGS: matrix $B_k$ and inverse matrix $H_k$ [Sun and Yuan 2006; Nocedal and Wright 2006].
The search direction is $d_k = -B_k^{-1} g_k = -H_k g_k$; both $B_k$ and $H_k$ can be written in a compact representation [Byrd, Nocedal, and Schnabel 1997].
Assume that there are $p$ pairs of vectors
$$U_k = [s_{k-p}, \ldots, s_{k-1}] \in \mathbb{R}^{n \times p}, \qquad Y_k = [y_{k-p}, \ldots, y_{k-1}] \in \mathbb{R}^{n \times p},$$
where $s_i = x_{i+1} - x_i$ and $y_i = g_{i+1} - g_i$.
For a given initial matrix $H_k^{(0)}$, $H_k = H_k^{(0)} + C_k P_k C_k^\top$, where
$$C_k := [\,U_k,\ H_k^{(0)} Y_k\,] \in \mathbb{R}^{n \times 2p}, \qquad D_k = \mathrm{diag}\big[s_{k-p}^\top y_{k-p}, \ldots, s_{k-1}^\top y_{k-1}\big],$$
$$P_k = \begin{bmatrix} R_k^{-\top}\big(D_k + Y_k^\top H_k^{(0)} Y_k\big) R_k^{-1} & -R_k^{-\top} \\ -R_k^{-1} & 0 \end{bmatrix}, \qquad (R_k)_{i,j} = \begin{cases} s_{k-p+i-1}^\top y_{k-p+j-1}, & i \le j, \\ 0, & \text{otherwise.} \end{cases}$$
The initial matrix $H_k^{(0)}$ is $\gamma_k I$. Then
$$d_k \in \mathrm{span}\{g_k,\ s_{k-1}, \ldots, s_{k-p},\ y_{k-1}, \ldots, y_{k-p}\}.$$
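The span property above can be seen directly in the standard two-loop recursion, which never forms H_k explicitly; the Python sketch below is a generic L-BFGS direction computation (with an exact line search for a quadratic test problem), not the compact-representation code of the cited papers.

```python
# L-BFGS two-loop recursion with H_k^{(0)} = gamma_k * I; the returned direction
# lies in span{g_k, s_{k-p..k-1}, y_{k-p..k-1}}.
import numpy as np

def lbfgs_direction(g, S, Y):
    """S, Y: lists of the last p pairs s_i, y_i, oldest first."""
    q = g.copy()
    alphas, rhos = [], []
    for s, y in reversed(list(zip(S, Y))):        # first loop: newest pair to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        rhos.append(rho); alphas.append(alpha)
    gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])     # scaling gamma_k for the initial matrix
    r = gamma * q
    for (s, y), rho, alpha in zip(zip(S, Y), reversed(rhos), reversed(alphas)):
        beta = rho * (y @ r)                      # second loop: oldest pair to newest
        r += (alpha - beta) * s
    return -r                                     # d_k = -H_k g_k

# toy usage on f(x) = 1/2 x^T A x - b^T x with exact line search
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 30)); A = M @ M.T + np.eye(30); b = rng.standard_normal(30)
x, S, Y, p = np.zeros(30), [], [], 5
for _ in range(50):
    g = A @ x - b
    d = -g if not S else lbfgs_direction(g, S, Y)
    alpha = -(g @ d) / (d @ (A @ d))              # exact step along d for the quadratic
    x_new = x + alpha * d
    S.append(x_new - x); Y.append((A @ x_new - b) - g)
    S, Y = S[-p:], Y[-p:]
    x = x_new
print(np.linalg.norm(A @ x - b))
```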
Limited memory methods for eigenvalue computation
$$\max_{X \in \mathbb{R}^{n \times p}} \mathrm{tr}(X^\top A X) \quad \text{s.t.}\ X^\top X = I. \qquad (3)$$
The first-order optimality conditions of (3) are $AX = X\Lambda$, $X^\top X = I$, where $\Lambda = X^\top A X \in \mathbb{R}^{p \times p}$ is the matrix of Lagrange multipliers.
At each iteration, the methods solve a subspace trace maximization problem
$$Y = \arg\max_{X \in \mathbb{R}^{n \times p}} \big\{ \mathrm{tr}(X^\top A X) : X^\top X = I,\ X \subset \mathcal{S} \big\}.$$
LMSVD [Liu, Wen, and Zhang 2013]: $\mathcal{S} = \mathrm{span}\{X_i,\ X_{i-1},\ \cdots,\ X_{i-t}\}$.
Global convergence under reasonable assumptions.
Table 1: SSI vs. LMSVD ($p \ll k \ll n$)
method                     SSI           LMSVD
total cost per iteration   $O(n+k)$      $O(k(1+p)^2)$
Augmented Rayleigh-Ritz method for eigenvalue computation
The RR map $(Y, \Sigma) = \mathrm{RR}(A, Z)$ solves the trace-maximization subproblem with $\mathcal{S} = \mathcal{R}(Z)$.
The augmentation of the subspaces in LOBPCG and LMSVD is the main reason why they generally achieve faster convergence than the classic SSI.
ARR: for some integer $t \ge 0$, design a block Krylov subspace structure
$$\mathcal{S} = \mathrm{span}\{X,\ AX,\ A^2 X,\ \ldots,\ A^t X\}. \qquad (4)$$
Apply the RR procedure $(\hat{Y}, \hat{\Sigma}) = \mathrm{RR}(A, K_t)$, where $K_t = [X, AX, A^2 X, \ldots, A^t X]$.
The $p$ leading Ritz pairs $(Y, \Sigma)$ are extracted from $(\hat{Y}, \hat{\Sigma})$.
The analysis of ARR in [Wen and Zhang 2017; Wen and Zhang 2015] gives the convergence rate of SSI:
$$\frac{\lambda_{p+1}}{\lambda_p} \ \text{for RR}\ (t = 0), \qquad \frac{\lambda_{(t+1)p+1}}{\lambda_p} \ \text{for ARR}\ (t > 0).$$
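In its simplest form, the ARR idea reduces to a Rayleigh-Ritz projection onto the block Krylov basis (4); the following Python sketch is an illustrative dense implementation without deflation, polynomial filtering, or the other refinements of the cited papers.

```python
# One augmented Rayleigh-Ritz step: orthonormalize [X, AX, ..., A^t X],
# solve the small projected eigenproblem, keep the p leading Ritz pairs.
import numpy as np

def arr_step(A, X, t):
    n, p = X.shape
    blocks, Z = [X], X
    for _ in range(t):                       # augment with AX, A^2 X, ..., A^t X
        Z = A @ Z
        blocks.append(Z)
    Q, _ = np.linalg.qr(np.hstack(blocks))   # orthonormal basis of the augmented subspace
    w, V = np.linalg.eigh(Q.T @ A @ Q)       # Ritz values/vectors of the projected matrix
    idx = np.argsort(w)[::-1][:p]            # p leading Ritz pairs
    return Q @ V[:, idx], w[idx]

# toy usage: a few ARR-accelerated block iterations
rng = np.random.default_rng(0)
M = rng.standard_normal((300, 300)); A = (M + M.T) / 2
X = rng.standard_normal((300, 4))
for _ in range(10):
    X, ritz = arr_step(A, X, t=2)
print(ritz)                                  # approximations to the 4 largest eigenvalues
```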
Trust region methods with subspace method
The trust region subproblem (TRS) is normally
$$\min_{s \in \mathbb{R}^n} Q_k(s) = g_k^\top s + \tfrac{1}{2} s^\top B_k s \quad \text{s.t.}\ \|s\|_2 \le \Delta_k, \qquad (5)$$
where $B_k = \nabla^2 Q_k(x_k)$.
A subspace version of the trust region subproblem is suggested in [Shultz, Schnabel, and Byrd 1985]:
$$\min_{s \in \mathbb{R}^n} Q_k(s) \quad \text{s.t.}\ \|s\|_2 \le \Delta_k,\ s \in S_k. \qquad (6)$$
The Steihaug truncated CG method [Steihaug 1983]
The dogleg method [Powell 1970]
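A compact Python sketch of the Steihaug truncated CG method is given below; it follows the textbook scheme (CG implicitly explores the growing Krylov subspace and stops at the trust-region boundary or at negative curvature) and is illustrative rather than production code.

```python
# Steihaug truncated CG for min Q(s) = g^T s + 1/2 s^T B s, s.t. ||s|| <= Delta.
import numpy as np

def to_boundary(s, d, Delta):
    """Positive tau such that ||s + tau * d|| = Delta."""
    a, b, c = d @ d, 2 * (s @ d), s @ s - Delta ** 2
    return (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)

def steihaug_cg(g, B, Delta, tol=1e-8):
    s = np.zeros_like(g)
    r, d = g.copy(), -g.copy()
    for _ in range(g.size):
        Bd = B @ d
        dBd = d @ Bd
        if dBd <= 0:                                    # negative curvature: go to the boundary
            return s + to_boundary(s, d, Delta) * d
        alpha = (r @ r) / dBd
        if np.linalg.norm(s + alpha * d) >= Delta:      # step leaves the region: stop on boundary
            return s + to_boundary(s, d, Delta) * d
        s = s + alpha * d
        r_new = r + alpha * Bd
        if np.linalg.norm(r_new) < tol:
            return s
        beta = (r_new @ r_new) / (r @ r)
        d, r = -r_new + beta * d, r_new
    return s

# toy usage with a possibly indefinite B
rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50)); B = (M + M.T) / 2
g = rng.standard_normal(50)
print(np.linalg.norm(steihaug_cg(g, B, Delta=1.0)))
```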
Parallel computing refinement of trust region methods based on truncated CG method⁵
Table 2: Speedup ratio of the parallel refinement of the trust region method

dimension   np=2      np=4      np=6      np=8      np=10     np=12
10^2        1.68180   1.94154   2.36451   2.91613   3.43903   3.67575
10^3        0.920956  1.47545   1.55419   1.84841   2.43805   2.64320
10^4        1.79342   2.94063   3.86112   4.49823   4.94911   5.18691
5x10^4      1.87369   3.04962   3.94852   5.10126   6.29814   6.71970
10^5        1.89060   3.55094   5.17231   5.88022   6.52538   7.02531
Figure 4: Time versus number of processes in the parallel refinement
The TRS $\min_{s \in \mathbb{R}^n} Q_k(s)$ s.t. $\|s\|_2 \le \Delta_k$ is solved by truncated CG.
⁵ Homework for the Parallel Computing course taught by Prof. Tao Cui.
Trust region methods with subspace methods
Theorem (Wang and Yuan 2006)
Suppose $B_1 = \sigma I$ with $\sigma > 0$, the matrix updating formula is any one chosen from the PSB and Broyden families (the updates may be singular), and $B_k$ is the $k$-th updated matrix. Let $s_k$ be an optimal solution of TRS (5) and set $x_{k+1} = x_k + s_k$. Let $S_k = \mathrm{span}\{g_1, g_2, \cdots, g_k\}$. Then $s_k \in S_k$, and for any $z \in S_k$ and $u \in S_k^{\perp}$ it holds that
$$B_k z \in S_k, \qquad B_k u = \sigma u.$$
Subspace trust region quasi-Newton method for unconstrained optimization
[Wang and Yuan 2006].
Line search quasi-Newton methods [Gill and Leonard 1999; Gill and Leonard
2000].
Subspace Powell–Yuan trust region method for equality constrained
optimization [Grapiglia, Yuan, and Yuan 2013].
Coordinate descent methods
Algorithm 2 Coordinate Descent Algorithm (a code sketch follows below)
1: Input an initial value $x^{(0)}$.
2: For $t = 1, 2, \ldots$
3: Pick a coordinate $i$ from $\{1, 2, \ldots, n\}$ and set
$$x_i^{(t+1)} = \arg\min_{x_i \in \mathbb{R}} f\big(x_i, \omega_i^{(t)}\big),$$
where $\omega_i^{(t)}$ represents all the other coordinates; here $S_i = \mathrm{span}\{e_i\}$.
4: End.
Converges slowly.
Does not require calculation of the gradient $\nabla f_k$.
Several algorithms, such as that of Hooke and Jeeves [Hooke and Jeeves 1961], are based on these ideas [Mackworth 1987; Ricketts 1982].
Block coordinate descent method (BCD) [Tseng 2001]
The alternating direction method of multipliers (ADMM) [Boyd et al. 2011]
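The promised sketch: a cyclic coordinate descent loop for least squares, where each inner update is the exact minimizer over the one-dimensional subspace span{e_i}; it is a toy illustration, not the BCD or ADMM variants cited above.

```python
# Cyclic coordinate descent for min_x ||A x - b||^2.
import numpy as np

def coordinate_descent_ls(A, b, iters=100):
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                        # maintain the residual A x - b
    col_sq = np.sum(A * A, axis=0)       # ||A_i||^2 for every column
    for _ in range(iters):
        for i in range(n):               # cycle through the coordinates
            delta = -(A[:, i] @ r) / col_sq[i]   # exact 1-D minimizer along e_i
            x[i] += delta
            r += delta * A[:, i]         # incremental residual update
    return x

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((80, 30)); b = rng.standard_normal(80)
x = coordinate_descent_ls(A, b)
print(np.linalg.norm(A.T @ (A @ x - b)))   # near-zero normal-equations residual
```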
Parallel line search subspace correction method
The optimization problem is
$$\min_{x \in \mathbb{R}^n} \varphi(x) := f(x) + h(x), \qquad (7)$$
where $f(x)$ is differentiable and convex, and $h(x)$ is convex (possibly nonsmooth).
The $\ell_1$-regularized minimization (LASSO) [Tibshirani 1996] and sparse logistic regression [Shevade and Keerthi 2003] are examples of (7).
Decompose
$$\mathbb{R}^n = \mathcal{X}_1 + \mathcal{X}_2 + \cdots + \mathcal{X}_p,$$
where $\mathcal{X}_i = \{x \in \mathbb{R}^n \mid \mathrm{supp}(x) \subseteq J_i\}$, $1 \le i \le p$, with $J := \{1, \ldots, n\}$ and $J = \bigcup_{i=1}^{p} J_i$.
Let $\varphi_k^{(i)}$ be a surrogate function of $\varphi$ restricted to the $i$-th subspace at the $k$-th iteration. The PSC framework for solving (7) is
$$d_k^{(i)} = \arg\min_{d^{(i)} \in \mathcal{X}_i} \varphi_k^{(i)}\big(d^{(i)}\big),\ \ i = 1, \ldots, p, \qquad x_{k+1} = x_k + \sum_{i=1}^{p} \alpha_k^{(i)} d_k^{(i)}. \qquad (8)$$
Convergence holds if $\sum_{i=1}^{p} \alpha_k^{(i)} \le 1$ and $\alpha_k^{(i)} > 0$ ($1 \le i \le p$).
Usually $\alpha_k^{(i)}$ is quite small and convergence becomes slow.
A parallel line search subspace correction method (PSCL) is proposed in [Dong et al. 2015], with Armijo backtracking line search for a larger step size.
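To make the framework (8) concrete, the Python sketch below applies one block proximal-gradient step per subspace for the LASSO and combines the directions with the constant weights alpha_i = 1/p (so the convergence condition above holds); it is a simplified illustration, not the PSCL method of [Dong et al. 2015].

```python
# Parallel subspace correction step (8) for min_x 1/2 ||A x - b||^2 + mu ||x||_1
# with non-overlapping coordinate blocks J_1, ..., J_p and alpha_i = 1/p.
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def psc_lasso(A, b, mu, p_blocks=4, iters=1000):
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2                    # Lipschitz constant of grad f
    blocks = np.array_split(np.arange(n), p_blocks)
    x = np.zeros(n)
    for _ in range(iters):
        g = A.T @ (A @ x - b)                        # gradient of the smooth part at x_k
        d = np.zeros(n)
        for J in blocks:                             # each block can be solved in parallel
            d[J] = soft_threshold(x[J] - g[J] / L, mu / L) - x[J]
        x = x + d / p_blocks                         # combine with alpha_i = 1/p, sum of alphas = 1
    return x

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 100))
x_true = np.zeros(100); x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.standard_normal(60)
x = psc_lasso(A, b, mu=0.5)
print(np.count_nonzero(np.abs(x) > 1e-3))
```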
Subspace by subsampling/sketching
For a linear least squares problem on massive data sets:
$$\min_{x} \|Ax - b\|_2^2 \ \longrightarrow\ \min_{x} \|W(Ax - b)\|_2^2, \qquad (9)$$
where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
The sketching technique chooses a matrix $W \in \mathbb{R}^{r \times m}$ with $r \ll m$ and formulates a reduced problem.
Each element of $W$ is sampled i.i.d. from a normal distribution with mean zero and variance $\frac{1}{r}$ [Mahoney 2011; Woodruff 2014].
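A minimal sketch of Gaussian sketching for (9), using the stated sampling of W; it compares the reduced solution with the full least-squares solution on synthetic data.

```python
# Sketched least squares: solve min ||W(Ax - b)|| with W having i.i.d. N(0, 1/r) entries.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 10000, 50, 500
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

x_full, *_ = np.linalg.lstsq(A, b, rcond=None)             # full least-squares solution

W = rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))         # Gaussian sketching matrix
x_sketch, *_ = np.linalg.lstsq(W @ A, W @ b, rcond=None)   # reduced r x n problem

print(np.linalg.norm(x_sketch - x_full) / np.linalg.norm(x_full))
```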
Consider the system of nonlinear equations
$$F(x) = 0, \quad x \in \mathbb{R}^n, \qquad (10)$$
and the nonlinear least squares problem
$$\min_{x \in \mathbb{R}^n} \|F(x)\|_2^2,$$
where $F(x) = (F_1(x), F_2(x), \cdots, F_m(x))^\top \in \mathbb{R}^m$. Consider only the equations $F_i(x) = 0$, $i \in I_k$.
More work has been done in [Yuan 2009].
Subspace by coordinate directions
[Yuan 2014]:
For sparsity structures, let $g_k^{(i)}$ be the $i$-th component of the gradient $g_k$, ordered so that
$$\big|g_k^{(i_1)}\big| \ge \big|g_k^{(i_2)}\big| \ge \big|g_k^{(i_3)}\big| \ge \cdots \ge \big|g_k^{(i_n)}\big|.$$
The $\tau$-steepest coordinates subspace is
$$S_k = \mathrm{span}\big\{e^{(i_1)},\ e^{(i_2)},\ \ldots,\ e^{(i_\tau)}\big\}.$$
The steepest descent direction in this subspace is sufficiently descent:
$$\min_{d \in S_k} \frac{d^\top g_k}{\|d\|_2 \|g_k\|_2} \le -\sqrt{\frac{\tau}{n}}.$$
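The subspace above is cheap to form: pick the tau coordinates with the largest gradient magnitudes and restrict the steepest descent step to them, as in the Python sketch below (a toy illustration with a fixed step size, not the method of [Yuan 2014] itself).

```python
# Steepest descent restricted to the tau-steepest coordinates subspace.
import numpy as np

def tau_steepest_step(grad, x, tau, alpha):
    g = grad(x)
    idx = np.argsort(np.abs(g))[::-1][:tau]   # tau largest |g_k^{(i)}|
    d = np.zeros_like(x)
    d[idx] = -g[idx]                          # steepest descent direction within S_k
    return x + alpha * d

# toy usage on a separable quadratic f(x) = 1/2 sum_i c_i x_i^2
rng = np.random.default_rng(0)
c = rng.uniform(1.0, 10.0, size=1000)
grad = lambda z: c * z
x = rng.standard_normal(1000)
for _ in range(500):
    x = tau_steepest_step(grad, x, tau=50, alpha=0.1)
print(np.linalg.norm(grad(x)))
```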
Stochastic methods
An empirical risk minimization problem is
$$\min_{x} f(x) := \frac{1}{N} \sum_{i=1}^{N} f_i(x).$$
Stochastic gradient method [Goodfellow, Bengio, and Courville 2016]: select a uniformly random sample $s_k$ from $\{1, \ldots, N\}$ and update
$$x_{k+1} = x_k - \alpha_k \nabla f_{s_k}(x_k).$$
The mini-batch SGD method:
$$x_{k+1} = x_k - \frac{\alpha_k}{|I_k|} \sum_{s_k \in I_k} \nabla f_{s_k}(x_k).$$
The momentum method: $v_{k+1} = \mu_k v_k - \alpha_k \nabla f_{s_k}(x_k)$, $x_{k+1} = x_k + v_{k+1}$.
Stochastic second-order method, the subsampled Newton method:
$$\Big( \frac{1}{|I_k^H|} \sum_{i \in I_k^H} \nabla^2 f_i(x_k) \Big) d_k = -\frac{1}{|I_k|} \sum_{s_k \in I_k} \nabla f_{s_k}(x_k).$$
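The mini-batch and momentum updates above amount to a few lines of Python; the sketch below uses a least-squares empirical risk and arbitrary hyperparameters purely for illustration.

```python
# Mini-batch SGD with momentum for 1/N sum_i f_i(x), with f_i(x) = 1/2 (a_i^T x - b_i)^2.
import numpy as np

def minibatch_sgd(A, b, batch=32, lr=0.2, mu=0.9, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    N, n = A.shape
    x, v = np.zeros(n), np.zeros(n)
    for _ in range(epochs):
        for _ in range(N // batch):
            I = rng.choice(N, size=batch, replace=False)     # random index set I_k
            g = A[I].T @ (A[I] @ x - b[I]) / batch           # mini-batch gradient
            v = mu * v - lr * g                              # momentum update
            x = x + v
    return x

# toy usage
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 20)) / np.sqrt(20)
x_true = rng.standard_normal(20)
b = A @ x_true + 0.01 * rng.standard_normal(2000)
x = minibatch_sgd(A, b)
print(np.linalg.norm(x - x_true))
```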
Active set methods for sparse optimization
The $\ell_1$-regularized minimization problem is
$$\min_{x \in \mathbb{R}^n} \phi_\mu(x) := \mu \|x\|_1 + f(x), \qquad (11)$$
where $\mu > 0$ and $f(x): \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable.
FPC_AS [Wen et al. 2010] is a two-stage active set algorithm.
Subspace optimization in the second stage: for a given vector $x \in \mathbb{R}^n$, define
$$A(x) := \big\{ i \in \{1, \cdots, n\} : |x^{(i)}| = 0 \big\} \quad \text{and} \quad I(x) := \big\{ i \in \{1, \cdots, n\} : |x^{(i)}| > 0 \big\}.$$
Then a smooth subproblem, essentially an unconstrained problem, is
$$\min_{x}\ \mu\, \mathrm{sign}\big(x_k^{(I_k)}\big)^\top x^{(I_k)} + f(x) \quad \text{s.t.}\ x^{(i)} = 0,\ i \in A(x_k). \qquad (12)$$
If $|I(x_{k+1})| > m$, then do a hard truncation. Solve the subspace optimization problem to obtain $x_{k+1}$.
Problem (12) can be solved by L-BFGS-B [Byrd et al. 1995]; a simplified code sketch is given below.
The active set strategies have also been studied in [Solntsev, Nocedal, and
Byrd 2014; Keskar et al. 2015].
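The simplified sketch promised above: a shrinkage (proximal-gradient) stage estimates the support, and the smooth subproblem (12) over the free variables is then passed to SciPy's L-BFGS-B. This is only an illustration in the spirit of a two-stage active-set method, not the FPC_AS algorithm; f is taken to be a least-squares term.

```python
# Two-stage sketch for min_x mu ||x||_1 + 1/2 ||A x - b||^2.
import numpy as np
from scipy.optimize import minimize

def two_stage_l1(A, b, mu, shrink_iters=200):
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for _ in range(shrink_iters):                        # stage 1: shrinkage / ISTA steps
        g = A.T @ (A @ x - b)
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - mu / L, 0.0)
    free = np.flatnonzero(np.abs(x) > 1e-8)              # estimated support I(x_k)
    signs = np.sign(x[free])

    def sub_obj(z):                                      # stage 2: smooth subproblem (12)
        xf = np.zeros(n); xf[free] = z
        r = A @ xf - b
        val = mu * (signs @ z) + 0.5 * (r @ r)
        grad = mu * signs + A[:, free].T @ r
        return val, grad

    res = minimize(sub_obj, x[free], jac=True, method="L-BFGS-B")
    x_out = np.zeros(n); x_out[free] = res.x
    return x_out

# toy usage
rng = np.random.default_rng(0)
A = rng.standard_normal((60, 120))
x_true = np.zeros(120); x_true[:6] = 2.0
b = A @ x_true
x = two_stage_l1(A, b, mu=0.1)
print(np.flatnonzero(np.abs(x) > 1e-3))
```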
Conclusion and future work
What kinds of subspace methods are wanted?
Conclusion

Optimization problem: find $x^*$ satisfying $\min_{x} f(x)$ s.t. $x \in \mathcal{X}$.
Subproblem: find $x_{k+1} = x_k + d$ satisfying $\min_{d} m_k(x_k + d)$ s.t. $d \in \mathcal{D}$.

Subspace relationships across iterations:
$\dim(S_k) = \dim(S_{k+1})$: $S_k \neq S_{k+1}$
$\dim(S_k) \le \dim(S_{k+1})$: $S_k \subseteq S_{k+1}$
$\sum_{k=1}^{p} \dim(S_k) = n$: $S_1 + \cdots + S_p = \mathbb{R}^n$
$\dim(S_k) = |I_k|$: $S_k = \mathrm{span}\{e_i : i \in I_k\}$
$\dim(S_k) \ge \dim(S_{k+1})$: $S_k \supseteq S_{k+1}$
Future work
Relationship between subspaces in the iteration
Subspace methods in manifold optimization
Subspace methods in derivative free optimization
Subspace acceleration for given algorithms
Future work: relationship between subspaces in the iteration
Subspace is an evolution of the direction: conjugate direction method → conjugate subspace method.

Definition
$p_0, p_1, \cdots, p_l$ are conjugate with respect to the symmetric positive definite matrix $A$ if
$$p_i^\top A p_j = 0 \quad \text{for all}\ i \neq j.$$
Only search in a subspace ONCE.

Figure 5: The coordinate search method can make slow progress.
Future work: subspace methods in derivative free optimization
Main difference between Powell's derivative-free optimization and optimization with derivatives: how to get the subproblem objective function $m_k(x)$.
The interpolation conditions are
$$\alpha_0 + \alpha^\top y_1 + \tfrac{1}{2} y_1^\top H y_1 = F(y_1),$$
$$\alpha_0 + \alpha^\top y_2 + \tfrac{1}{2} y_2^\top H y_2 = F(y_2),$$
$$\vdots$$
$$\alpha_0 + \alpha^\top y_k + \tfrac{1}{2} y_k^\top H y_k = F(y_k).$$
Figure 6: Model function by interpolation.
NEWUOA: number of interpolation points reduced from $\frac{(n+1)(n+2)}{2}$ to $2n+1$, with the least-change update
$$\min_{Q_k} \big\| \nabla^2 Q_k - \nabla^2 Q_{k-1} \big\|_F^2 \quad \text{s.t.}\ Q_k(y) = F(y),\ y \in Y_k.$$
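To show where the interpolation conditions lead, the Python sketch below builds a fully determined quadratic model from (n+1)(n+2)/2 sample points by solving the linear system above; NEWUOA's minimum Frobenius-norm update with only 2n+1 points is not implemented here, and the helper names are hypothetical.

```python
# Build m(y) = alpha0 + alpha^T y + 1/2 y^T H y from function values by interpolation.
import numpy as np

def quad_features(y):
    """Monomial basis [1, y_i, y_i * y_j (i <= j)] of a quadratic on R^n."""
    n = y.size
    cross = [y[i] * y[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], y, cross))

def build_quadratic_model(F, Y):
    """Solve the interpolation conditions m(y_l) = F(y_l) for all sample points y_l."""
    Phi = np.array([quad_features(y) for y in Y])
    coef = np.linalg.solve(Phi, np.array([F(y) for y in Y]))
    return lambda y: quad_features(y) @ coef

# toy usage with n = 3, hence (n+1)(n+2)/2 = 10 interpolation points
n = 3
rng = np.random.default_rng(0)
F = lambda y: np.exp(0.1 * y.sum()) + y @ y
Y = [rng.standard_normal(n) for _ in range((n + 1) * (n + 2) // 2)]
m = build_quadratic_model(F, Y)
z = 0.1 * rng.standard_normal(n)
print(m(z), F(z))                          # model value vs. true value near the origin
```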
Future work: subspace methods in manifold optimization
Riemannian steepest descent method [Udriste 1994]: direction $-\mathrm{grad}\, f(x)$.
Robust global convergence; slow (linear) local convergence.
Riemannian Newton method [Luenberger 1972; Gabay 1982]: direction $-\mathrm{Hess}\, f(x)^{-1} \mathrm{grad}\, f(x)$.
Fast local convergence (quadratic or even cubic); requires additional work for global convergence.
Riemannian trust-region method [Absil, Baker, and Gallivan 2007]: find
$$\eta^* = \arg\min_{\eta \in T_x \mathcal{M},\ \|\eta\| \le \Delta} m_x(\eta), \qquad x_{\text{next}} = R_x(\eta^*).$$
References I
P.-A Absil, Christopher Baker, and Kyle Gallivan. “Trust-Region Methods on Riemannian Manifolds”. In:
Foundations of Computational Mathematics 7 (July 2007), pp. 303–330. DOI:10.1007/s10208- 005-0179-9.
Knyazev Andrew. “Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned
Conjugate Gradient Method”. In: (Nov. 2001).
Richard Byrd, Jorge Nocedal, and Robert Schnabel. “Representations Of Quasi-Newton Matrices And Their Use In
Limited Memory Methods”. In: Mathematical Programming 63 (Aug. 1997). DOI:10.1007/BF01582063.
Stephen Boyd et al. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”. In: Foundations and Trends in Machine Learning 3 (Jan. 2011), pp. 1–122. DOI: 10.1561/2200000016.
Amir Beck and Marc Teboulle. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems”. In: SIAM J. Imaging Sciences 2 (Jan. 2009), pp. 183–202. DOI: 10.1137/080716542.
Richard H. Byrd et al. “A limited memory algorithm for bound constrained optimization”. English. In: SIAM Journal of Scientific Computing 16 (Sept. 1995), pp. 1190–1208. ISSN: 1064-8275. DOI: 10.1137/0916069.
Buxin Chen et al. “Image reconstruction and scan configurations enabled by optimization-based algorithms in
multispectral CT”. In: Physics in Medicine and Biology 62 (Nov. 2017), pp. 8763–8793. DOI:
10.1088/1361-6560/aa8a4b.
A. R. Conn et al. On Iterated-Subspace Minimization Methods for Nonlinear Optimization.1994.
References II
Qian Dong et al. “A Parallel Line Search Subspace Correction Method for Composite Convex Optimization”. In:
Journal of the Operations Research Society of China 3 (May 2015). DOI:10.1007/s40305- 015-0079-x.
Massimo Fornasier. “Domain decomposition methods for linear inverse problems with sparsity constraints”. In:
Inverse Problems - INVERSE PROBL 23 (Dec. 2007). DOI:10.1088/0266-5611/23/6/014.
Massimo Fornasier and Carola-Bibiane Schönlieb. “Subspace Correction Methods for Total Variation and ℓ1-Minimization”. In: SIAM Journal on Numerical Analysis 47 (Jan. 2008). DOI: 10.1137/070710779.
Daniel Gabay. “Minimizing a differentiable function over a differential manifold”. In: Journal of Optimization
Theory and Applications 37 (June 1982). DOI :10.1007/BF00934767.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning.http://www.deeplearningbook.org.
MIT Press, 2016.
Philip Gill and Michael Leonard. “Reduced-Hessian Quasi-Newton Methods For Unconstrained Optimization”. In:
SIAM Journal on Optimization 12 (Mar. 2000). DOI:10.1137/S1052623400307950.
Philip Gill and Michael Leonard. “Limited-Memory Reduced-Hessian Methods For Large-Scale Unconstrained
Optimization”. In: SIAM J. Optim. 14 (Aug. 1999). DOI:10.1137/S1052623497319973.
Geovani Grapiglia, Jin-Yun Yuan, and Ya-xiang Yuan. “A Subspace Version of the Powell–Yuan Trust-Region
Algorithm for Equality Constrained Optimization”. In: Journal of the Operations Research Society of China 4 (Dec.
2013). DOI:10.1007/s40305- 013-0029- 4.
References III
Robert Hooke and T. A. Jeeves. ““Direct Search” Solution of Numerical and Statistical Problems”. In: J. ACM 8.2 (Apr. 1961), pp. 212–229. ISSN: 0004-5411. DOI: 10.1145/321062.321069. URL: https://doi.org/10.1145/321062.321069.
Nitish Keskar et al. “A Second-Order Method for Convex `1-Regularized Optimization with Active Set Prediction”.
In: Optimization Methods and Software 31 (May 2015). DOI:10.1080/10556788.2016.1138222.
David Luenberger. “The Gradient Projection Method Along Geodesics”. In: Management Science 18 (July 1972),
pp. 620–631. DOI :10.1287/mnsc.18.11.620.
Xin Liu, Zaiwen Wen, and Yin Zhang. “Limited Memory Block Krylov Subspace Optimization for Computing
Dominant Singular Value Decompositions”. In: SIAM Journal on Scientific Computing 35 (May 2013). DOI:
10.1137/120871328.
A.K. Mackworth. “John Wiley , Sons”. In: Encyclopedia of Artificial Intelligence (Jan. 1987), pp. 205–211.
Michael Mahoney. “Randomized Algorithms for Matrices and Data”. In: Computing Research Repository - CORR 3
(Apr. 2011). DOI:10.1561/2200000035.
Y. Nesterov. “Introductory Lectures on Convex Optimization: A Basic Course”. In: Comput. Program. (Jan. 2003).
Yu Nesterov. “A method of solving a convex programming problem with convergence rate O(1/k2)”. In: vol. 27.
Jan. 1983, pp. 372–376.
References IV
Deanna Needell and Joel Tropp. “CoSaMP: Iterative Signal Recovery from Incomplete and Inaccurate Samples”.
In: Communications of the ACM 53 (Dec. 2010). DOI:10.1145/1859204.1859229.
Jorge Nocedal and Stephen Wright. Numerical Optimization. Jan. 2006. ISBN: 978-0-387-30303-1. DOI: 10.1007/978-0-387-40065-5.
Boris Polyak. “Some methods of speeding up the convergence of iteration methods”. In: Ussr Computational
Mathematics and Mathematical Physics 4 (Dec. 1964), pp. 1–17. DOI:10.1016/0041-5553(64)90137- 5.
M. J. D. Powell. “A Hybrid Method for Nonlinear Equations”. In: Numerical Methods for Nonlinear Algebraic
Equations. Ed. by P. Rabinowitz. Gordon and Breach, 1970.
R. E. Ricketts. “Practical optimization, Philip E. Gill, Walter Murray and Margret H. Wright, Academic Press Inc.
(London) Limited, 1981. No. of pages: 401. Price 19.20, 46.50. ISBN: 0.12.283950.1”. In: International Journal for
Numerical Methods in Engineering 18.6 (1982), pp. 954–954. DOI :
https://doi.org/10.1002/nme.1620180612. eprint:
https://onlinelibrary.wiley.com/doi/pdf/10.1002/nme.1620180612.URL:
https://onlinelibrary.wiley.com/doi/abs/10.1002/nme.1620180612.
S Shevade and S Keerthi. “A Simple and Efficient Algorithm for Gene Selection Using Sparse Logistic Regression”.
In: Bioinformatics (Oxford, England) 19 (Dec. 2003), pp. 2246–2253. DOI:10.1093/bioinformatics/btg308.
Stefan Solntsev, Jorge Nocedal, and Richard Byrd. “An Algorithm for Quadratic `1-Regularized Optimization with
a Flexible Active-Set Strategy”. In: Optimization Methods and Software 30 (Dec. 2014). DOI:
10.1080/10556788.2015.1028062.
References V
Gerald Shultz, Robert Schnabel, and Richard Byrd. “A Family of Trust-Region-Based Algorithms for
Unconstrained Minimization with Strong Global Convergence Properties”. In: Siam Journal on Numerical Analysis
- SIAM J NUMER ANAL 22 (Feb. 1985), pp. 47–67. DOI:10.1137/0722003.
Trond Steihaug. “The Conjugate Gradient Method and Trust Regions in Large Scale Optimization”. In: Siam
Journal on Numerical Analysis - SIAM J NUMER ANAL 20 (June 1983), pp. 626–637. DOI:10.1137/0720042.
Wenyu Sun and Ya-xiang Yuan. “Optimization theory and methods. Nonlinear programming”. In: 1 (Jan. 2006).
DOI:10.1007/b106451.
Joel Tropp and Anna Gilbert. “Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit”. In: Information Theory, IEEE Transactions on 53 (Jan. 2008), pp. 4655–4666. DOI: 10.1109/TIT.2007.909108.
Robert Tibshirani. “Regression Shrinkage and Selection Via the Lasso”. In: Journal of the Royal Statistical Society:
Series B (Methodological) 58 (Jan. 1996), pp. 267–288. DOI :10.1111/j.2517-6161.1996.tb02080.x.
Joel Tropp et al. “Fixed-Rank Approximation of a Positive-Semidefinite Matrix from Streaming Data”. In: (June
2017).
P. Tseng. “Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization”. In: Journal of
Optimization Theory and Applications 109 (Jan. 2001), pp. 475–494. DOI :10.1023/A:1017501703105.
Paul Tseng and Sangwoon Yun. “A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization”. In: Math. Program. 117 (Mar. 2009), pp. 387–423. DOI: 10.1007/s10107-007-0170-0.
References VI
Constantin Udriste. Convex Functions and Optimization Methods on Riemannian Manifolds. Jan. 1994. DOI: 10.1007/978-94-015-8390-9.
Zaiwen Wen et al. “A Fast Algorithm for Sparse Reconstruction Based on Shrinkage, Subspace Optimization, and
Continuation”. In: SIAM J. Scientific Computing 32 (Jan. 2010), pp. 1832–1857. DOI:10.1137/090747695.
Zaiwen Wen, Donald Goldfarb, and Katya Scheinberg. “Block Coordinate Descent Methods for Semidefinite
Programming”. In: vol. 166. Jan. 2012. DOI:10.1007/978-1- 4614-0769-0_19.
David Woodruff. “Sketching as a Tool for Numerical Linear Algebra”. In: Foundations and Trends in Theoretical Computer Science 10 (Nov. 2014). DOI: 10.1561/0400000060.
Zhouhong Wang and Ya-xiang Yuan. “A subspace implementation of quasi-Newton trust region methods for unconstrained optimization”. In: Numerische Mathematik 104 (Aug. 2006), pp. 241–269. DOI: 10.1007/s00211-006-0021-6.
Zaiwen Wen and Yin Zhang. “Block algorithms with augmented Rayleigh-Ritz projections for large-scale eigenpair
computation”. In: (July 2015).
Zaiwen Wen and Yin Zhang. “Accelerating Convergence by Augmented Rayleigh–Ritz Projections For Large-Scale
Eigenpair Computation”. In: SIAM Journal on Matrix Analysis and Applications 38 (Jan. 2017), pp. 273–296. DOI:
10.1137/16M1058534.
Ya-xiang Yuan. “Subspace methods for large scale nonlinear equations and nonlinear least squares”. In:
Optimization and Engineering 10 (June 2009), pp. 207–218. DOI :10.1007/s11081-008-9064- 0.
References VII
Junyu Zhang, Zaiwen Wen, and Yin Zhang. “Subspace Methods with Local Refinements for Eigenvalue Computation Using Low-Rank Tensor-Train Format”. In: Journal of Scientific Computing 70 (July 2016). DOI: 10.1007/s10915-016-0255-0.