Dissertation Abstract



No 125350
Author (Japanese): クーソー,フローラン
Author (English): Cousseau, Florent
Author (Kana): クーソー,フローラン
Title (Japanese): 多層パーセプトロンの学習のダイナミクスと情報統計力学 : 理論と適用
Title (English): Dynamics of Learning and Statistical Mechanical Informatics of Multilayer Perceptrons : Theory and Practice
Report number: 甲25350
Date of degree conferral: 2009.09.28
Degree type: Doctorate by coursework (課程博士)
Degree: Doctor of Science (博士(科学))
Diploma number: 博創域第512号
Graduate school: 新領域創成科学研究科 (Graduate School of Frontier Sciences)
Department: 複雑理工学専攻 (Department of Complexity Science and Engineering)
Thesis committee:
 Chair: Professor 岡田,真人, The University of Tokyo
 Professor 山本,博資, The University of Tokyo
 Associate Professor 江尻,晶, The University of Tokyo
 Associate Professor 國廣,昇, The University of Tokyo
 Associate Professor 溝川,貴司, The University of Tokyo
Abstract

We investigate the influence of the intrinsic structural properties of multilayer perceptron (MLP) neural networks on learning problems and on information processing problems. It is known that MLPs can approximate any non-linear function to arbitrary precision, provided that a sufficient number of hidden units is used. MLPs are thus a popular kind of neural network because of their huge potential. However, this thesis shows that, despite this promising theoretical potential, actually realizing it in practice is hard. The special structural properties of MLPs give rise to singular regions which prevent the standard algorithms from being efficient. This underlines the need for algorithms specifically designed to handle these particular spaces.

Introduction

An artificial neural network is a mathematical model based on biological neural networks. It consists of an interconnected group of artificial neurons (or units) and processes information using a connectionist approach to computation. Artificial neural networks are mainly used for classification and pattern recognition tasks. Before being usable, they require a training (or learning) stage. Generally, one possesses a training data set of known input-output pairs. Learning takes place by presenting the training inputs to the network and changing its parameters until the desired outputs are obtained. Multilayer perceptrons (MLPs) are one kind of neural network. They are adaptive nonlinear systems that receive input signals and transform them into adequate output signals. In general, MLPs are made of an input layer, one or several hidden layers, and an output layer. Information is processed in a feedforward manner: each unit generally receives inputs from all the neurons of the preceding layer, performs some non-linear transformation on the weighted sum of these inputs, and transmits its own output to all the neurons of the next layer. Each input received by a neuron is weighted by a parameter called the weight of the connection. Learning takes place by modifying the connection weights.
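As a purely illustrative aid (not part of the thesis), the following Python sketch implements the feedforward computation just described for a small fully connected MLP with one hidden layer; the layer sizes, the tanh activation, and the random weights are assumptions chosen only for this example.

import numpy as np

rng = np.random.default_rng(0)

def forward(x, W1, W2):
    """Each unit computes a weighted sum of the outputs of the previous layer,
    applies a non-linear transformation, and passes the result forward."""
    hidden = np.tanh(W1 @ x)        # input layer -> hidden layer
    return np.tanh(W2 @ hidden)     # hidden layer -> output layer

n_in, n_hidden, n_out = 4, 3, 1
W1 = rng.normal(size=(n_hidden, n_in))   # connection weights (modified during learning)
W2 = rng.normal(size=(n_out, n_hidden))
print(forward(rng.normal(size=n_in), W1, W2))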

In section I we investigate the dynamics of learning in MLPs near special singular points of the parameter space. In section II we discuss potential applications of MLPs in coding theory (lossy compression and error correcting codes) and show that, despite promising theoretical results, conventional methods fail to provide satisfactory performance in practice.

I. LEARNING AND SINGULARITIES IN MULTILAYER PERCEPTRONS

A set of multilayer perceptrons can be viewed as a parameter space which forms a geometrical manifold, called a neuro-manifold in the case of neural networks. It has already been shown [1] that singularities appear in such a manifold and that they strongly affect the dynamics of learning. Moreover, such a neuro-manifold does not have a Euclidean geometrical structure but a Riemannian one. Consequently, the standard gradient is no longer the steepest descent direction. Saad and Solla [2] have shown that, during the learning process, standard gradient learning (backpropagation learning) is usually trapped in a plateau: for a long period of time the training error remains constant, until it suddenly starts to decrease again (this phenomenon is illustrated by the red curve in figure 1, left). Amari et al. [1] have shown that such a plateau is due to the attraction of singularities, and that natural gradient learning, which takes the geometrical structure of the neuro-manifold into account, does not seem to be influenced by these singularities. In this study, we extend the theoretical framework developed by Amari et al. [1] to investigate the dynamics of learning near the singularity caused by the symmetry of hidden units. We define a student MLP as (see figure 1, middle left),

(〓) where (〓) and where (〓) (1)

Here p denotes the activation function and ε is Gaussian white noise. The input x is an n-dimensional Gaussian vector, x ∼ N(0, I). In this case the parameters are given by θ = (J1, J2, w1, w2). Next, we define a teacher perceptron,

(〓) (2)

The student perceptron learns the behavior of the teacher perceptron from the training data (x1, y1), . . . , (xM, yM), where M is the number of training examples. In this study we investigate the standard gradient descent (SGD) and the natural gradient descent (NGD) learning algorithms [1]. They are given by

(〓) (3)

where l(x, y, θ) = (1/2)(y − f(x, θ))², and where y is the teacher's output. ⟨·⟩ represents the expectation with respect to the teacher's signal, (〓) denotes the Fisher information matrix, and E is the expectation with respect to the student's signal. η is a small constant, so that the dynamics move along the gradient direction. However, singular regions exist in the parameter space of the student MLP [1]. They are given by,

(〓) (4)

C1 corresponds to an overlapping singularity (two neurons can be merged into a single one), while C2 corresponds to an eliminating singularity (one neuron becomes useless). The Fisher information matrix is singular in these regions, which correspond to the deletion of one neuron. We investigate singularities of type C1. Note that the teacher is deliberately written using a single neuron only, so that it lies in the singular region of the student parameter space.
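Since equation (4) is not rendered in this abstract, the following is only a plausible reconstruction of the two singular regions in the standard notation of Amari et al. [1], stated as an assumption rather than as the thesis' exact expression:

% Assumed reconstruction of the overlap and elimination singularities of a
% two-hidden-unit student, following the usual notation of [1]; not copied
% from equation (4) of the thesis.
\begin{align*}
  C_1 &= \{\, \theta = (J_1, J_2, w_1, w_2) \mid J_1 = J_2 \,\}, \\
  C_2 &= \{\, \theta \mid w_1 = 0 \,\} \cup \{\, \theta \mid w_2 = 0 \,\}.
\end{align*}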

We analytically obtained the trajectories of the dynamics of learning under standard gradient descent (SGD) and natural gradient descent (NGD) when the teacher function lies on the singularity [3]. We plot the dynamic vector fields of the SGD and the NGD in figure 1. The SGD dynamics are slowly attracted by C1, which explains the plateau phenomenon. For the NGD, however, C1 is not an equilibrium of the dynamics, so the region C1 is no longer critical.

Finally, we have also investigated the dynamics of SGD learning when the teacher lies outside the singularity [4]. We found that in this case the singular region C1 turns into a Milnor-like attractor composed of stable and unstable parts. The case of the NGD when the teacher lies outside the singularity remains to be investigated.
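To complement the analysis in this section, here is a minimal numerical sketch of the setting, assuming the standard formulation of [1]: a two-hidden-unit student f(x, θ) = w1 p(J1·x) + w2 p(J2·x) learning a single-neuron teacher, with averaged SGD and NGD update steps on the squared loss. The tanh activation, the empirical estimate of the Fisher information matrix, and all constants are assumptions, since equations (1)-(3) are not rendered in this abstract.

import numpy as np

rng = np.random.default_rng(0)
n = 5                                         # input dimension
phi = np.tanh                                 # assumed form of the activation p

def unpack(theta):
    return theta[:n], theta[n:2*n], theta[2*n], theta[2*n + 1]

def student(x, theta):
    J1, J2, w1, w2 = unpack(theta)
    return w1 * phi(J1 @ x) + w2 * phi(J2 @ x)

def grad_f(x, theta):
    """Gradient of the student output with respect to theta = (J1, J2, w1, w2)."""
    J1, J2, w1, w2 = unpack(theta)
    h1, h2 = phi(J1 @ x), phi(J2 @ x)
    return np.concatenate([w1 * (1 - h1**2) * x, w2 * (1 - h2**2) * x, [h1, h2]])

def avg_grad_loss(theta, X, Y):
    """Average gradient of l = (1/2)*(y - f(x, theta))^2 over the batch."""
    return np.mean([(student(x, theta) - y) * grad_f(x, theta)
                    for x, y in zip(X, Y)], axis=0)

def sgd_step(theta, X, Y, eta=0.05):
    return theta - eta * avg_grad_loss(theta, X, Y)

def ngd_step(theta, X, Y, eta=0.05, ridge=1e-6):
    grads = np.array([grad_f(x, theta) for x in X])
    G = grads.T @ grads / len(X)              # empirical Fisher information matrix
    return theta - eta * np.linalg.solve(G + ridge * np.eye(len(theta)),
                                         avg_grad_loss(theta, X, Y))

# Single-neuron teacher (it lies on the singular region C1 of the student).
J0 = rng.normal(size=n)
teacher = lambda x: phi(J0 @ x) + 0.01 * rng.normal()   # small Gaussian output noise

theta_sgd = theta_ngd = rng.normal(size=2*n + 2) * 0.5  # common initialization
for _ in range(200):
    X = rng.normal(size=(100, n))             # x ~ N(0, I)
    Y = np.array([teacher(x) for x in X])
    theta_sgd = sgd_step(theta_sgd, X, Y)
    theta_ngd = ngd_step(theta_ngd, X, Y)

Plotting the training error along the two trajectories is one way to look for the plateau behavior described above and contrast the SGD and NGD dynamics.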

II. APPLICATIONS OF MULTILAYER PERCEPTRONS IN CODING THEORY

Based on the previous work of Hosaka et al. [5], we investigate a lossy compression scheme using a tree-like MLP as decoder. Similarly, based on the previous work of Shinzato et al. [6], we investigate an error correcting code scheme using a tree-like MLP as encoder. We evaluate the typical performance of these schemes in the infinite codeword length limit by using the Replica Method of statistical mechanics, and we investigate their practical implementation.

A. Lossy Compression

An original Ising message y of size M (i.e., y = (y1, . . . , yM) where yμ ∈ {−1, 1}), which is uniformly and independently distributed with a bias p (i.e., P[y = 1] = 1 − P[y = −1] = p), is encoded into an Ising codeword s of size N < M by some non-linear transformation. The code rate of such a scheme is defined by R = N/M. The decoded message ŷ of size M (i.e., ŷ = (ŷ1, . . . , ŷM) where (〓)) is given by some decoder. Here, we use three different networks, tree-like committee machines and a tree-like parity machine, each featuring K hidden units and one output unit, as decoders of our scheme. Each network is based on the same general architecture and makes use of a non-monotonic transfer function fk, as shown in figure 2. The codeword s is split into K Ising vectors of dimension N/K (i.e., s = (s1, . . . , sK) where (〓)). The vectors xμl are fixed, independent N/K-dimensional vectors uniformly distributed on {−1, 1} (i.e., quenched random variables). The sgn function denotes the sign function, taking the value 1 for x ≥ 0 and −1 for x < 0. Since we investigate the lossy compression case, an amount of distortion D is allowed in the decoded message. This distortion is defined by (〓), where E denotes the expectation and d is the Hamming distance (d(x, y) takes the value 0 if x = y and 1 otherwise). In other words, D represents the average error per bit tolerated in the decoded message ŷ. For a given distortion D, when M and N go to infinity, the so-called Shannon bound gives the best code rate R a lossy compression scheme can ever achieve. We define the encoder as F(y) = argmin_s d(y, ŷ(s)). Using the Replica Method, we were able to show that all the schemes can saturate the Shannon bound under some specific conditions [7].
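The following toy sketch illustrates the ingredients above on a tiny instance: a biased Ising source, quenched random input vectors, a decoder network, and the exhaustive encoder F(y) = argmin_s d(y, ŷ(s)), whose 2^N cost is what makes a tractable encoder necessary (see section C). Since the decoder equations (I)-(III) and the exact non-monotonic transfer function are not reproduced in this abstract, the committee-tree-style decoder and the form of fk below are assumed stand-ins, not the thesis' definitions.

import itertools
import numpy as np

rng = np.random.default_rng(1)
M, N, K, k = 12, 6, 3, 1.5                  # message size, codeword size, hidden units, threshold

def f_k(u, k):
    """Assumed non-monotonic transfer function: +1 inside [-k, k], -1 outside."""
    return np.where(np.abs(u) <= k, 1.0, -1.0)

# Quenched random input vectors x_l^mu, uniformly distributed on {-1, 1}.
x = rng.choice([-1.0, 1.0], size=(M, K, N // K))

def decode(s):
    """Decoded message y_hat(s): the codeword s is split into K blocks of size N/K."""
    blocks = np.asarray(s).reshape(K, N // K)
    fields = np.einsum('mkd,kd->mk', x, blocks) / np.sqrt(N // K)
    return np.sign(np.sum(f_k(fields, k), axis=1))   # K is odd, so the sum is never zero

def distortion(y, y_hat):
    """Average Hamming distortion per bit."""
    return np.mean(y != y_hat)

# Biased Ising source with P[y = 1] = p.
p = 0.8
y = rng.choice([-1.0, 1.0], size=M, p=[1 - p, p])

# Exhaustive encoder F(y) = argmin_s d(y, y_hat(s)): cost 2^N, hence intractable at scale.
s_best = min(itertools.product([-1.0, 1.0], repeat=N),
             key=lambda s: distortion(y, decode(s)))
print("code rate R =", N / M, " distortion D =", distortion(y, decode(s_best)))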

B. Error Correcting Code

An original Ising message s0 of size N, which is uniformly and independently distributed (i.e., P[s0 = 1] = P[s0 = −1] = 1/2), is encoded into an Ising codeword y0 of size M > N using the same tree-like MLPs as introduced in the lossy compression section (i.e., we use equation (I), (II), or (III) to construct y0; note that in this case s denotes s0 and yμ denotes y0μ). The codeword y0 is then fed into a binary asymmetric channel (BAC), where each bit is flipped independently of the others with asymmetric probabilities given by the noise parameters (p, r). The BAC details are shown in figure 4. The corrupted codeword y is finally received at the output of the channel. The code rate of such a scheme is defined by R = N/M. The decoded message s of size N is then inferred from y by some non-linear transformation G(y) = argmax_s p(s|y; {x}). For given parameters (p, r), when M and N go to infinity, the so-called Shannon bound gives the best code rate R an error correcting code scheme can ever achieve. Using the Replica Method, we were able to show that all the schemes can saturate the Shannon bound under some specific conditions [8].
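As a small illustration of the channel model only, the sketch below corrupts an Ising codeword with a binary asymmetric channel. The exact convention for the noise parameters (p, r) is given in figure 4 of the thesis; the mapping assumed here (p is the flip probability of +1 bits, r that of −1 bits) may differ from it.

import numpy as np

def bac(y0, p, r, rng):
    """Binary asymmetric channel: flip each Ising bit independently,
    with probability p if the bit is +1 and r if it is -1 (assumed convention)."""
    flip_prob = np.where(y0 == 1, p, r)
    flips = rng.random(y0.shape) < flip_prob
    return np.where(flips, -y0, y0)

rng = np.random.default_rng(2)
y0 = rng.choice([-1, 1], size=20)           # transmitted codeword
y = bac(y0, p=0.1, r=0.2, rng=rng)          # corrupted codeword at the channel output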

C. Belief Propagation Algorithm

We have shown that the above schemes can yield optimal performance for both the lossy compression and the error correcting code tasks. However, in the lossy compression case, we lack a practical encoder (an exact encoder requires an amount of time which grows exponentially with the size of the messages). For the same reason, in the error correcting code case we lack a practical decoder.

Thus we need to use some approximation in order to make the encoding/decoding task tractable. In this section, we propose to apply the popular belief propagation (BP) algorithm, which can be used to compute an approximation of the marginal posterior probabilities and therefore to deduce a potential codeword/message in a tractable time. Hosaka et al. used a similar approach for the simple perceptron case [9]. An optimal codeword/message is given by the ground state of the Boltzmann distribution (〓), where H and Z are the relevant Hamiltonian and partition function, respectively. The BP algorithm is used here to compute an approximation of the marginal probabilities (〓) in a tractable time, assuming that the Boltzmann factor of this distribution is factorizable.
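For intuition about what BP approximates, the sketch below computes the exact marginals of a Boltzmann distribution P(s) = exp(−βH(s))/Z by brute-force enumeration over all Ising configurations, which is tractable only for tiny N; BP replaces this exponential sum with iterative message passing. The Hamiltonian is passed in as a black box, and the toy H used in the example is an assumption, not the thesis' Hamiltonian.

import itertools
import numpy as np

def exact_marginals(H, N, beta=1.0):
    """Return P(s_i = +1) for every spin, by summing over all 2^N Ising configurations."""
    configs = np.array(list(itertools.product([-1, 1], repeat=N)), dtype=float)
    weights = np.exp(-beta * np.array([H(s) for s in configs]))   # Boltzmann factors
    return ((configs == 1).T @ weights) / weights.sum()

# Toy example: a random symmetric pairwise Hamiltonian (assumption, for illustration only).
rng = np.random.default_rng(3)
J = rng.normal(size=(6, 6))
J = (J + J.T) / 2
np.fill_diagonal(J, 0)
print(exact_marginals(lambda s: -0.5 * s @ J @ s, N=6))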

Under this assumption, we performed BP simulations for both the lossy compression and the error correcting code cases. The results we obtained are not optimal and are far from the theoretical predictions. In fact, we found that the BP performance gets poorer as the number of hidden units K increases. Figure 5 shows some results obtained for the CTH case. These results show that using a large number of hidden units perturbs the BP dynamics and that the number of local minima increases with K. It is likely that singular regions arise in the codeword space and perturb the BP dynamics in a way similar to section I. The BP dynamics remain to be investigated.

Conclusion

The present thesis investigates the fundamental intrinsic properties of multilayer perceptron neural networks. Since their introduction, MLPs have shown great potential. From a theoretical point of view, MLPs are universal approximators, capable of approximating any non-linear function (provided a sufficient number of hidden units K). Thus, they can solve any pattern recognition task, or yield Shannon-optimal performance when used in coding theory.

In practice, however, the results obtained using conventional methods, such as SGD learning in section I or the BP algorithm in section II, show that several problems arise. All these results indicate that while multilayer perceptrons are very powerful tools in theory, the practical implementation of such systems remains difficult. We need better heuristics to use such devices efficiently: conventional methods fail to provide satisfactory results, and methods specifically designed to handle the particular geometrical structure induced by MLPs appear to be indispensable. On the learning theory side, the natural gradient is an example of such a method and works efficiently without being subject to the plateau phenomenon. In the case of coding theory, such an algorithm is still to be found.

[1] S. Amari, H. Park, and T. Ozeki, Neural Computation, 18, 1007-1065, 2006.
[2] D. Saad and S. A. Solla, Phys. Rev. E, 52, 4225-4243, 1995.
[3] F. Cousseau, T. Ozeki, and S. Amari, IEEE Trans. on Neural Networks, 19, 1313-1328, 2008.
[4] H. Wei, J. Zhang, F. Cousseau, T. Ozeki, and S. Amari, Neural Computation, 20, 813-843, 2008.
[5] T. Hosaka, Y. Kabashima, and H. Nishimori, Phys. Rev. E, 66, 066126, 2002.
[6] T. Shinzato and Y. Kabashima, Proceedings of IBIS2006 (in Japanese), Osaka, Japan, 2006.
[7] F. Cousseau, K. Mimura, T. Omori, and M. Okada, Phys. Rev. E, 78, 021124, 2008.
[8] F. Cousseau, K. Mimura, and M. Okada, in press.
[9] T. Hosaka and Y. Kabashima, Physica A, 365, 113, 2006.

FIG. 1: Left: The plateau phenomenon. The red curve shows the SGD dynamics trapped in the plateau; the black curve shows the unaffected NGD dynamics. Middle left: Student model network. Middle right: Dynamic vector fields with trajectories when the teacher is located on the singularity (SGD). Right: Dynamic vector fields with trajectories when the teacher is located on the singularity (NGD).

FIG. 2: Left: General equation of the three different decoder networks. Middle: The non-monotonic transfer function fk. Right: General architecture of the three decoder networks.

FIG. 3: Layout of the scheme

FIG. 4: Layout of the scheme

FIG. 5: Left: Error correcting code case using the CTH with K = 1 (solid), K = 3 (dashed), and K = 5 (dotted) hidden units. We set p = 0.1, r = 0.2 and used N = 1000. The vertical line represents the Shannon bound. Overlap denotes the quantity (1/N) s · s0, which measures how close the estimated message s is to the original one s0; an overlap of 1 means perfect decoding. Right: Lossy compression case using the CTH with K = 1 (dashed), K = 3 (dotted), and K = 5 (dash-dotted). We used p = 0.5 (top) and p = 0.8 (bottom), with N = 1000. Solid lines denote the Shannon bound.

Summary of the Thesis Review

A perceptron is a pattern recognition machine with learning ability, inspired by the workings of the brain. Perceptrons attract attention not only from the viewpoint of brain science but also in fields of information science such as machine learning and information theory. The multilayer perceptron (MLP), one type of perceptron, possesses the special properties of singularity and permutation symmetry. The purpose of this thesis is to clarify theoretically how singularity and permutation symmetry affect the performance of MLPs when they are applied to learning theory and to information theory. The thesis consists of two parts: the first part treats the learning theory of MLPs, and the second part describes applications of MLPs to information theory.

The introduction presents the central themes of the thesis, namely the singularity and permutation symmetry of MLPs. A multilayer perceptron is composed of three layers: an input layer, a hidden layer, and an output layer. In an MLP, exchanging hidden units does not change the behavior of the network; this property is called permutation symmetry. Furthermore, a small MLP with few hidden units is contained in a larger MLP with more hidden units; this property is called singularity. When a large MLP represents a small MLP, an indeterminacy arises in the connection weights, which are the parameters of the large MLP. The indeterminate region, a finite region in the connection weight space of the large MLP, corresponds to a single point in the connection weight space of the small MLP; this point is called a singular point.

Chapters 1 and 2 of Part I describe how the singularities of MLPs affect the dynamics of learning. The learning process of MLPs is known to exhibit a phenomenon called a plateau, in which learning stagnates for a while before resuming. It had previously been shown, mainly by numerical computation and computer experiments, that the singularities and permutation symmetry of MLPs cause these plateaus. Here, this finding is established by analytical methods. Previous numerical studies show that in the early stage of learning the system converges rapidly to the neighborhood of a singular point, and that during the plateau it remains near the singular point. The author therefore applies a kind of adiabatic approximation that separates the learning dynamics into fast and slow dynamics, and succeeds in deriving closed equations for the slow variables. Applying this theory to the steepest descent method, the standard learning algorithm, he shows analytically that the singular point is composed of a stable manifold and an unstable manifold and has the same character as the Milnor attractor known in nonlinear dynamics. This is the first analytical calculation of this kind concerning plateaus in MLPs. Next, the theory is applied to the natural gradient method, a learning algorithm that uses the inverse of the Fisher information matrix. In the natural gradient method, the metric of the connection weight space is modified by the Fisher information matrix. Around a singular point the Fisher information matrix takes values close to zero, and since the connection weight updates are transformed by its inverse, the otherwise slow dynamics around the singular point are accelerated. Applying the derived theory to the natural gradient method, the author shows analytically that the dynamical character of the singular point changes, and that plateaus, in which the system lingers near the singular point, do not occur.

Chapters 3 and 4 of Part II present the results of applying MLPs to lossy data compression and error correcting codes, which are treated in information theory. In lossy data compression and error correction there exists an upper bound on performance, the Shannon bound, and achieving this bound is the foundation of research in these areas. The properties of perceptrons are analyzed by the replica method, a technique of statistical mechanics. In the replica method, one first carries out the calculation under an assumption called replica symmetry (RS), and then derives the AT (Almeida-Thouless) condition under which replica symmetry breaks down and RSB (replica symmetry breaking) occurs. Chapter 3 discusses lossy data compression. For a single-layer perceptron with a non-monotonic output function, it has been reported that analysis under the RS assumption shows that the Shannon bound, namely the rate-distortion function, can be achieved. It has also been reported, however, that in some parameter regions the RS assumption breaks down and RSB occurs. To circumvent this difficulty, the thesis proposes applying a tree-structured perceptron, a kind of multilayer perceptron, to lossy data compression. The tree-structured perceptron consists of K hidden units, and each input unit is connected to only one hidden unit. It contains the single-layer perceptron as the special case K = 1. Moreover, by introducing the tree structure, it is easy to show that the number of solutions achieving a given compression rate increases by a factor of 2^K compared with the K = 1 single-layer perceptron. First, the compression limit of the tree-structured perceptron is calculated under the RS assumption. It is reported that when the parameters of the output function are optimized, as in the K = 1 case, the tree-structured perceptron also achieves the rate-distortion function, the limit of compression. Next, by deriving the AT condition, the robustness of the tree-structured perceptron against RSB is discussed.

Chapter 5 of Part II describes the application of belief propagation to tree-structured perceptrons. When a tree-structured perceptron is used, the computational cost of decoding for error correction and of encoding for lossy data compression grows exponentially with the number of units involved, so approximate methods are required for decoding and encoding. In this thesis, belief propagation is introduced as the approximation. It is reported that when belief propagation is applied to the tree-structured perceptron, the compression performance deteriorates as the number of hidden units K increases. It is suggested that this is because the number of solutions achieving the rate-distortion function increases by a factor of 2^K.

In summary, from the viewpoint of singularity and permutation symmetry, this thesis succeeds in obtaining new theoretical insights into the learning theory of multilayer perceptrons and their use in information theory. These results are expected to contribute to more efficient learning in multilayer perceptrons and to clarifying their applicability to information theory.

Part I of this thesis is joint work with 尾関智子 (T. Ozeki), H. Wei, J. Zhang, and 甘利俊一 (S. Amari), and Part II with 三村和史 (K. Mimura), 大森敏明 (T. Omori), and 岡田真人 (M. Okada). However, the applicant played the principal role in the analysis, experiments, and verification, and his contribution is judged to be sufficient.

The committee therefore finds that the applicant deserves to be awarded the degree of Doctor of Science (博士(科学)).
