pdf.io >> Programming >> Learning Dynamics and Robustness of Vector Quantization....pdf

Learning Dynamics and Robustness of Vector Quantization and ...

 FileName: neurocomp2007.pdf
[readonline]


 FileSize: 288 KB download
 Shared by: cs_rug_nl 56 month ago
 Category: Programming
 Report us: delete it


Abstract: depending on the structure of the data, the Neural Gas does not always obtain the best asymptotic quantization error. ... Vector quantization (VQ) is an important unsu pervised learning algorithm, widely used ...

Learning Dynamics and Robustness of
Vector Quantization and Neural Gas
Aree Witoelar a,∗ Michael Biehl a Anarta Ghosh b Barbara Hammer c
a University of Groningen, Mathematics and Computing Science, P.O. Box 800, NL9700 AV Groningen, the Netherlands
b WaNPRC, University of Washington, Seattle, WA98195, USA
c Clausthal University of Technology, Institute of Computer Science, D98678 ClausthalZellerfeld, Germany
Abstract
Various alternatives have been developed to improve the WinnerTakesAll (WTA) mechanism in vector quantiza
tion, including the Neural Gas (NG). However, the behavior of these algorithms including their learning dynamics,
robustness with respect to initialization, asymptotic results, etc. has only partially been studied in a rigorous mathe
matical analysis. The theory of online learning allows for an exact mathematical description of the training dynamics
in model situations. We demonstrate using a system of three competing prototypes trained from a mixture of Gaussian
clusters that the Neural Gas can improve convergence speed and achieves robustness to initial conditions. However,
depending on the structure of the data, the Neural Gas does not always obtain the best asymptotic quantization error.
Key words: Vector quantization, Clustering, Online learning, WinnerTakesAll algorithms, Neural Gas
1. Introduction A variety of alternatives to overcome this problem
have been proposed, some of which are heuristically
Vector quantization (VQ) is an important unsu motivated while others are based on the minimiza
pervised learning algorithm, widely used in diﬀerent tion of a cost function related to the quantization
areas such as data mining, medical analysis, image error: the selforganizing map (SOM) [12], fuzzy
compression, and speech or handwriting recognition kmeans [2], stochastic optimization [7], to name
[1]. The main objective of VQ is to represent the data just a few. These algorithms have in common that
points by a small number of prototypes or codebook each pattern inﬂuences more than one prototype at
vectors. This can directly be used for compression, a time through a ”winnertakesmost” paradigm.
clustering, data mining, or (with postlabeling of the Neural gas (NG) as proposed in [13] is a particularly
prototypes) classiﬁcation [9,14]. robust variation of vector quantization with the
The basic ”winnertakesall” (WTA) or batch introduction of neighborhood relations. Unlike the
algorithms like the popular kmeans clustering di selforganizing map, [12], the NG system takes into
rectly optimize the quantization error underlying account the relative distances between prototypes
vector quantization. However, these methods can be in the input space and not on a predeﬁned lattice.
subject to conﬁnement in local minima of the quan In practice, NG algorithms yield better solutions
tization error and can produce suboptimal results. than WTA; however, the eﬀect of this strategy
on convergence speed or asymptotic behavior has
∗ Corresponding author. hardly been rigorously investigated so far.
Email address: [email protected] (Aree Witoelar). Methods from statistical physics and the theory
URL: http://www.cs.rug/∼aree (Aree Witoelar).
Preprint submitted to Elsevier 8 November 2007
of online learning [8] allow for an exact mathemat The input data is presented sequentially during
ical description of learning systems for high dimen training and one or more prototypes are updated on
sional data. In the limit of inﬁnite dimensionality, line. Algorithms studied here can be interpreted as
such systems can be fully described in terms of a few stochastic gradient descent procedures with respect
characteristic quantities, the socalled order param to a cost function H(W ) related to E(W ). The gen
eters. The evolution of these order parameters along eralized form reads
the training procedure is characterized by a set of S
coupled ordinary diﬀerential equations (ODE). By 1
H(W ) = dξP (ξ) d(ξ, wi )f (ri )
integrating these ODEs, it is possible to analyse the 2 i=1
performance of VQ algorithms in terms of stabil 1
ity, sensitivity to initial conditions, and achievable − dξP (ξ)ξ 2 (2)
2
quantization error. This successful approach has also
been reviewed in [8,16], among others. where ri is the rank of prototype wi with respect to
The extension of the theoretical analysis of sim the distance d(ξ, wi ), i.e. ri = S − j=i Θij . Rank
ple (WTAbased) vector quantization with two pro rJ = 1 corresponds to the socalled winner, i.e. the
totypes and two clusters introduced in an earlier prototype wJ closest to the example ξ. The rank
work [5] is not straightforward. Additional proto function f (ri ) determines the update strength for
types and clusters introduce more complex interac the set of prototypes and satisﬁes the normalization
S
tions in the system that can result in radically dif i=1 f (ri ) = 1; note that it does not depend ex
ferent behaviors. Also, the mathematical treatment plicitly on distances but only on the ordering of the
becomes more involved and requires, for instance, prototypes with respect to the current example.
several numerical integrations. Here we introduce The corresponding stochastic gradient descent in
an additional prototype and a mixture of clusters. H(W ) is of the form
We investigate not only WTA but also the popular
µ µ−1 µ−1
Neural Gas approach [13] for vector quantization. wi = wi + ∆wi with
This is an important step towards the investigation µ−1 η µ−1
∆wi = f (ri )(ξ µ − wi ) (3)
of general VQ approaches based on neighborhood N
interaction such as selforganizing maps. where η is the learning rate and ξ µ is a single exam
ple drawn independently at time step µ of the se
2. WinnerTakesAll and Neural Gas quential training process. We compare two diﬀerent
algorithms:
Assume input data ξ ∈ I N , generated according
R (i) WTA:
to a given probability density function P (ξ). Vector Only one prototype, the winner, is updated
Quantization represents the input data in the same for each input. The cost function directly min
N dimensional space by a set of prototypes W = imizes the quantization error with H(W ) =
S
wi ∈ I N i=1 . The primary goal of VQ is to ﬁnd a
R E(W ). The corresponding rank function is
faithful representation by minimizing the socalled
fWTA (ri ) = Θij (4)
quantization or distortion error
j=i
S
1 (ii) Neural Gas:
E(W ) = dξP (ξ) d(ξ, wi ) Θij
2 i=1
The update strength decays exponentially
j=i
with the rank controlled by a parameter λ.
1 1
− dξP (ξ)ξ 2 (1) The rank function is f (ri ) = C(λ) hλ (ri )
2
where hλ (ri ) = exp (−ri /λ) and C(λ) =
S
where Θij ≡ Θ d(ξ, wj ) − d(ξ, wi ) . For each input ri =1 exp (−ri /λ) is a normalization con
vector ξ the closest prototype wi is singled out by the stant. The parameter λ is adjusted during
product of Heaviside functions, Θ(x) = 0 if x < 0; 1 training; it is frequently set large initially and
else. Here we restrict ourselves to the quadratic Eu decreased in the course of training. Note that
clidean distance measure d(ξ, wi ) = (ξ − wi )2 . The for λ → 0 the NG algorithm becomes identi
1
constant 2 dξP (ξ)ξ 2 term is independent of pro cal with WTA. We divide f (ri ) according to
totype positions and is subtracted for convenience. its ranks as
2
S
1 characteristic quantities, or socalled order param
fNG (ri ) = hλ (k)gi (k) (5) eters. A suitable set of characteristic quantities for
C(λ)
k=1
the considered learning model is:
where gi (k) = 1 if ri = k; 0 else and
Riσ = wi · Bσ and Qµ = wi · wj .
µ µ
ij
µ µ
(8)
k gi (k) = 1. In a model with three proto
types, this can be written in terms of Heavi Note that Riσ are the projections of prototype vec
side functions tors wi on the center vectors Bσ and Qµ correspond
µ
ij
to the self and cross overlaps of the prototype vec
gi (1) = Θij tors.
j=i From the generic update rule deﬁned above,
gi (2) = Θij (1 − Θik ) Eq. (3), we can derive the following recursions in
k=i j=k,i terms of the order parameters:
gi (3) = 1 − Θij . (6) µ µ−1
Riσ − Riσ µ−1
j=i = ηf (ri ) bµ − Riσ
σ
1/N
Qµ − Qij
ij
µ−1
3. Model = η f (rj ) hµ − Qij
i
µ−1
+ f (ri ) hµ j
1/N
µ−1 (ξ µ )2
We choose the model data as a mixture of M −Qij + η 2 f (ri )f (rj )
spherical Gaussian clusters: N
1
+O( ) (9)
M N
P (ξ) = pσ P (ξσ) with
where hµ and bµ are the projections of the input
i i
σ=1
data vector ξ µ :
1 1 2
P (ξσ) = exp − (ξ − σ Bσ ) (7)
(2πυσ )N/2 2υσ i
µ−1
h µ = wi · ξ µ and bµ = Bσ · ξ µ .
σ (10)
where pσ are the prior probabilities of each cluster. Note that the last two terms in Eq. (9) come from
The components of vector ξ µ are random numbers (ξ µ )2 − hµ − hµ + Qµ−1 , where (ξ µ )2 is the only term
i j ij
according to a Gaussian distribution with mean vec that scales with N .
tors σ Bσ and variance υσ . The mean vectors are In the limit N → ∞, the O(1/N ) term can be
orthogonal, i.e. Bi · Bj = δij where δ is the Kro neglected and the order parameters become self 
necker delta. The parameters σ describe the sepa averaging [15] with respect to the random sequence
ration between the clusters. This model can be ex of examples. This means that ﬂuctuations of the or
tended for mixtures of many Gaussian clusters with der parameters vanish and the system dynamics can
M < N by choosing a coordinate system where the be described exactly in terms of their mean values.
orthonormality conditions Bi · Bj = δij are fulﬁlled. Also for N → ∞ the rescaled quantity t ≡ µ/N
Note that the Gaussian clusters strongly over can be conceived as a continuous time variable. Ac
lap in high dimensions. The separation between cordingly, the dynamics can be described by a set
the clusters is apparent only in the I M subspace
R of coupled ODE [3,10] after performing an average
spanned by {B1 , B2 , . . . , BM }. It is therefore a over the sequence of input data:
nontrivial task to detect the structure of the data
in N dimensions. dRiσ
= η bσ f (ri ) − f (ri ) Riσ
dt
dQij
4. Analysis of Learning Dynamics = η hi f (rj ) − f (rj ) Qij + hj f (ri )
dt
− f (ri ) Qij + η 2 pσ υσ f (ri )f (rj ) σ
We give a brief description of the theoretical σ
framework and refer to [3] for further details. Fol (11)
lowing the lines of the theory of online learning,
e.g. [8], in the thermodynamic limit N → ∞ the where . is the average over the density P (ξ) and
system can be fully described in terms of a few the . σ is the conditional average over P (ξσ). Here
3
we exploit the following relation in the last term of 5. Results
dQij
dt in Eq. (11):
5.1. Learning Dynamics
ξ2 1 2
lim = lim pσ (υσ N + σ) = p σ υσ .
N →∞ N N →∞ N
σ σ We study the performance of both WTA and NG
Exploiting the limit N → ∞ once more, the quan in several cases using three prototypes and up to
tities hµ , bµ become correlated Gaussian quantities
i σ
three clusters. Stochastic gradient descent proce
by means of the Central Limit Theorem. Thus the dures approach a (local) minimum of the objective
above averages reduce to integrations in S + M di function in the limit η → 0. We can consider this
mensions over the joint density P (x) where x = ˜
limit exactly by rescaling the learning time as t = ηt.
(hµ , hµ , . . . , hµ , bµ , bµ , . . . , bµ ), which are fully spec
1 2
Then, the O(η 2 ) terms in Eq. (11) can be neglected
S 1 2 M
iﬁed by ﬁrst and second moments [3,6]: and the set of ODEs is simpliﬁed. For all demon
hµ
i
µ−1 µ
σ = σ Riσ , bτ σ = σ δτ σ
1.5
hµ hµ σ − hµ σ hµ σ = υσ Qij
i j i j
µ−1
b µ b µ σ − b µ σ b µ σ = υσ δτ ρ
τ ρ τ ρ 1
hµ bµ σ − hµ σ bµ σ = υσ Riτ .
i τ i τ
µ−1
(12) Ri2
Hence the joint density of hµ , bµ is given in terms of
i τ
0.5
the order parameters deﬁned in Eq. (8). While most
of the integrations can be performed analytically, 0
some have to be implemented numerically. See the
Appendix for the computations.
−0.5
Given the averages for a speciﬁc rank function −0.5 0 0.5 1 1.5
f (ri ), cf. Eqs. (B.7) and (B.14) we obtain a closed Ri1
form expression of ODE. Using the initial conditions (a)
Riσ (0), Qij (0), we integrate this system for a given
algorithm and get the evolution of order parameters
in the course of training, Riσ (t), Qij (t). The behav 1
ior of the system depends on the characteristic of
the data and the parameters of the learning scheme,
i.e. oﬀset of the clusters σ , variance within the clus 0
Ri2
ters υσ , learning rate η, and for NG, the rank func
tion parameter λ. As shown in [5], this method of −1
analysis is in good agreement with large scale Monte
Carlo simulations of the same learning systems for
−2
dimensionality as low as N = 200.
Analogously, the quantization error, Eq. (1), can
0 1 2 3
be expressed in terms of order parameters Ri1
S S (b)
1
E(W ) = Θij Qii − hi Θij (13)
2 i=1 i=1 Fig. 1. Trajectories of prototypes on the plane spanned by
j=i j=i
B1 and B2 . The cluster centers σ Bσ are marked by crosses.
Note that E(W ) does not depend explicitly on ξ; The trajectories are marked by solid lines (WTA) and dashed
here it is shown how the subtracted constant term lines (NG). The prototypes at initialization are marked with
described in Eq. (1) and Eq. (2) becomes useful. ˜
squares and at t = 10 with circles (WTA) and dots (NG).
Both algorithms converge at the triangles, where two pro
Plugging in the values of the order parameters
totypes coincide at {−0.07, 1.07}. The set of prototypes is
computed by integrating the ODEs, {Riσ (t), Qij (t)}, initially set (a) near the cluster centers, and (b) far away
we can study the socalled learning curve E in from the cluster centers. In both ﬁgures the parameters are
dependence of the training time t for a given VQ p1 = 0.45, 1 = 2 = 1, υ1 = υ2 = 1, η → 0, λi = 2, λf =
algorithm. 0.01 and tf = 50.
4
1.4 −0.6
1.2 WTA
Quantization error E(W)
−0.8 NG
1
−1
0.8
0.6 −1.2
iσ
R
0.4 −1.4
0.2
−1.6
0
−0.2 −1.8
−0.4 −2
0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50
Learning time Learning time
(a) (a)
1.5 0
−0.2 WTA
1
Quantization Error E(W)
−0.4 NG
0.5
−0.6
0 −0.8
i2
−0.5 −1
R
−1 −1.2
−1.4
−1.5
−1.6
−2 −1.8
−2.5 −2
0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50
Learning time Learning time
(b) (b)
Fig. 2. The corresponding order parameters Ri2 at learning Fig. 3. Evolution of the quantization error E(W ) in Fig. 1 at
˜
time t = ηt for WTA (solid lines) and NG (dashed lines) ˜
learning time t = ηt for WTA (solid line) and NG (dashed
algorithms in the system described in Fig.1. The initial sets line) algorithms. The prototypes are initialized (a) near the
of prototypes are deﬁned in Figs.1(a) and (b), respectively. cluster centers and (b) far away from the cluster centers.
strations, the NG algorithm is studied for decreas ˜
learning stage t = 10.
˜ ˜ ˜ ˜
ing λ with λ(t) = λi (λf /λi )t/tf where tf is a learn This can be illustrated with the evolution of the
ing time parameter. The inﬂuence of the initial set order parameters Ri2 (t) in Fig. 2. In Fig. 2(a), the
of prototypes on the learning curves is investigated order parameters of both algorithms converge rel
by choosing diﬀerent values of {Riσ (0), Qij (0)}. atively fast. In Fig. 2(b), the order parameters of
Figure 1 presents the prototype dynamics in a one prototype change rapidly compared to that of
system with three prototypes and two clusters. other prototypes in WTA algorithm. One prototype
We examine two diﬀerent initial sets of proto dominates as the winner and gets frequent updates
types: close to the origin at {Ri1 (0), Ri2 (0)} ≈ towards the cluster centers, while the other proto
{0, 0}, Qij (0) ≈ 0, ∀{i, j} in Fig. 1(a); and far away types are rarely updated. The NG algorithm par
from the origin on the side of the weaker cluster, tially solves this problem by updating all prototypes
viz. p1 , at {Ri1 (0), Ri2 (0)} ≈ {3, −2}, Qij (0) = at the initial stages of learning.
Riσ (0) · Rjσ (0), ∀{i, j} in Fig. 1(b). While the pro The quantization error obtained from the order
totypes have diﬀerent trajectories in WTA and NG ˜ ˜
parameters {Riσ (t), Qij (t)} is displayed in Fig. 3.
algorithms, they converge at the identical conﬁgu We observe that the quantization error decreases
ration at large t and λ → 0. Here, the projections faster in the WTA algorithm compared to NG meth
of two prototypes converge near the center of the ods at the initial stages of the learning. This behav
stronger cluster. The advantage of NG is apparent ior can be explained by the fact that the HNG dif
in Fig. 1(b) where all prototypes already reach the fers from E(W ) by smoothing terms in particular
area near the cluster centers at an intermediate in early stages of training. We observe that WTA
5
6 6 6
4 4 4
i2
i2
i2
R
R
R
2 2 2
0 0 0
−2 0 2 −2 0 2 −2 0 2
R R R
i1 i1 i1
(a) (b) (c)
6 6 6
4 4 4
i2
i2
i2
R
R
R
2 2 2
0 0 0
−2 0 2 −2 0 2 −2 0 2
R R R
i1 i1 i1
(d) (e) (f)
Fig. 4. Trajectories of the prototypes on the plane spanned by B1 and B2 , corresponding to the WTA (solid lines) and the
NG (dashed lines) algorithms. Here, p1 = 0.45, p2 = 0.55, υ1 = 1, υ2 = 1.21, 1 = 1 and 2 = 5. The cluster centers σ Bσ are
marked by ×. The initial prototype conﬁgurations for both algorithms are marked with . While the asymptotic conﬁgurations
of WTA (circles) algorithm depends on initialization, the NG (dots) always produces identical asymptotic conﬁgurations. In
these cases, the NG algorithm always ﬁnds the optimal quantization error.
yields the best overall quantization error in the ﬁrst 5.2. Asymptotic conﬁguration
set of initial values in Fig. 3(a). This is mirrored
˜
by the fact that, for large t and λf → 0, both algo The dynamics of the prototypes while learning
rithms yield the same quantization error. on a model data with a larger separation between
For WTA training, the prototypes reach t → ∞˜ the clusters are presented in Fig. 4. The initial
asymptotic positions corresponding to the global conﬁgurations correspond to the following val
minimum of E(W ) for small learning rates η → 0. ues of {Ri1 (0), Ri2 (0)}: (a) {−1, 2}, (b) {−0.5, 2},
However, learning can slow down signiﬁcantly at in (c) {−1, 1.5}, (d) {−0.5, 1.5}, (e) {−1, 1} and (f)
termediate stages of the training process. Transient {−0.5, 1}. In all panels, Qij (0) = Riσ (0) · Rjσ (0).
conﬁgurations may persist in the vicinity of local In this case, the optimal conﬁguration of pro
minima and can indeed dominate the training pro totypes is with two prototypes representing the
cess. The NG is more robust w.r.t. the initial po stronger cluster as in Figs. 4(a to c). However, the
sition of prototypes than WTA while achieving the asymptotic conﬁguration of the prototypes in the
best quantization error asymptotically. WTA algorithm are sensitive to the initial con
ditions. In some cases, viz. Figs. 4(d to f), this
6
conﬁguration is not the optimal set of prototypes. 6
6
Therefore, even in this comparably simple model,
prototypes in WTA can be conﬁned in suboptimal 5 5
local minima of the cost function E(W ). The is 4 4
sue of diﬀerent regions of initialization which lead
3 3
to diﬀerent asymptotic conﬁgurations are to be Ri3 Ri3
discussed in forthcoming projects. 2 2
The asymptotic conﬁgurations for the NG al 1
1
gorithm are independent of initial conditions as
shown in Figs. 4(a to f). During the learning process 0 0
with λ > 0 the system moves towards intermedi −1 −1
ate conﬁgurations with minimum HNG (W ). Given −1 0 1 2 −1 0 1 2
˜
suﬃciently large λ and t, these conﬁgurations are Ri1 Ri1
identical and therefore the NG algorithm is robust (a) E(W ) = −13.261 (b) E(W ) = −13.079
with respect to initial conditions. In these cases, the
asymptotic conﬁguration is the optimal conﬁgura 6 6
tion and thus the NG algorithm achieves optimal 5 5
performance.
4 4
We demonstrate a model where the NG algo
rithm does not yield optimal performance in Fig. 5. 3
Ri3 3
Ri3
In this more complex situation, the weaker cluster 2
2
(pσ = 0.45) is divided into two Gaussian clusters
with p1,2 = {0.25, 0.20}. This corresponds to a 1 1
system of three clusters, with σ = {1, 1, 5} and 0 0
pσ = {0.25, 0.20, 0.55}. The distance between the
−1 −1
ﬁrst two clusters is small compared to their distance −1 0 1 2 −1 0 1 2
to the third cluster. In comparison to the previous Ri1 Ri1
case, where the weaker cluster spreads out evenly (c) E(W ) = −13.079 (d) E(W ) = −13.079
in all directions, here it has a particular orientation
along the vector (B1 − B2 ). Because of this struc Fig. 5. (a) The optimal set of prototypes (solid dots) in
ture, the best quantization error is obtained when a system with three clusters projected on the plane space
one prototype is placed near each cluster center, spanned by {B1 , B3 }. The values of Ri2 are not shown here.
The cluster centers σ Bσ are marked by ×. (b,c,d) Trajec
as in Fig. 5(a), even though one cluster has a very tor
 Related pdf books
 Curriculum Informatica 2004/05
 Interoperability Test of ebXML eBusiness Solutions1
 A Visual ToolBased Approach to porting C++ Code.
 Measuring the efficiency offered to local e Governments by ...
 ViewFusion: Correlating Structure and Activity Views for ...
 P2P energy exchange agent platform featuring a game theory ...
 Sketching, Scaffolding, and Inking: A Visual History for ...
 Message&fromthe&Chairs&
 Tuan Anh Nguyen: Curriculum Vitae  RUG
 20 INTERNATIONAL PAGE Just four clicks away by immigration ...
 Surface Reconstruction from Noisy Point Clouds using Coulomb ...
 Incomplete Contour Representations and Shape Descriptors: ICR ...
 Herman te Riele, CWI Amsterdam CS and Mathematics Colloquium ...
 Qualitative Comparison of Multiscale Skin Tumor Segmentation ...
 TALEN, INTEGERS, MATRICES, WISKUNDIGE STRUCTUREN
 Process SemanticEnabled Customisation for Active Service ...
 Requirements Engineering in the Year 00: A Research Perspective
Download the ebook