The first row shows the results of learning four data points with three-layer ReLU NNs for different values of γ_3. The corresponding scatter plots of initial (red) and final (green) {W [1]

Source publication
Preprint
Substantial work indicates that the dynamics of neural networks (NNs) are closely related to the initialization of their parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we make a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradie...

Contexts in source publication

Context 1
... first row in Fig. 2 shows typical learning results over different values of γ_3, from a relatively jagged interpolation (NTK scaling) to a smooth interpolation (mean-field scaling) and further to a linear spline interpolation. With input weight and bias, W^[1]_k is two-dimensional for d = 1. We display the {((W [1] ...
Context 2
... the trained network in Fig. 2. For γ_3 = 1.0, the initial scatter plot is very close to the one after training. However, for γ_3 = 2.0, active neurons (i.e., neurons with significant amplitude and away from the origin) are condensed at a few orientations, which strongly deviates from the initial scatter plot. For γ_3 = 1.5, the scatter points present an ...
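The notion of active neurons condensing at a few orientations can be illustrated with a minimal sketch. It assumes each first-layer neuron is given as a (weight, bias) pair for d = 1 and treats it as a 2-D vector. The function names, the relative amplitude threshold, and the bin count are hypothetical choices for illustration and are not taken from the source publication.

```python
import numpy as np


def neuron_orientations(w, b):
    """Treat each (input weight, bias) pair as a 2-D vector.

    Returns the amplitude (Euclidean norm) and the angle in radians,
    measured from the weight axis, for every neuron.
    """
    w = np.asarray(w, dtype=float)
    b = np.asarray(b, dtype=float)
    amplitude = np.hypot(w, b)
    angle = np.arctan2(b, w)
    return amplitude, angle


def active_orientation_histogram(w, b, rel_threshold=0.1, bins=36):
    """Histogram the orientations of "active" neurons.

    ASSUMPTION: a neuron counts as active when its amplitude exceeds
    rel_threshold times the largest amplitude. The source only says
    "significant amplitude and away from the origin".
    """
    amplitude, angle = neuron_orientations(w, b)
    active = amplitude > rel_threshold * amplitude.max()
    counts, edges = np.histogram(angle[active], bins=bins, range=(-np.pi, np.pi))
    return counts, edges


if __name__ == "__main__":
    # Synthetic toy values, not data from the paper.
    w = [1.0, 2.0, 0.01, -3.0]
    b = [1.0, 2.0, 0.0, 3.0]
    counts, edges = active_orientation_histogram(w, b, bins=4)
    print(counts)
```

A histogram in which only a few bins are occupied after training, while the initial histogram is spread out, would correspond to the condensed pattern described for γ_3 = 2.0.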
Context 3
... γ_3 = 1.5 (c) γ_3 = 2.1. Figure 4: RD(W^[1]) vs. m. The task is again to learn four data points, as in Fig. 2, with three-layer ReLU NNs for different values of γ_3 and γ_2 = 0. The slopes in (a)-(c) are fitted to RD(W^[1]) against m = 100, 1000, 2000, 5000, 10000 on a log-log scale. As γ_3 grows from 0.9 to 2.1, the corresponding slopes grow as ...
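The slope fit described in this caption can be sketched as a least-squares line through log-transformed data. The sketch assumes RD denotes the relative distance ||W_final - W_init|| / ||W_init||, which this excerpt does not define. The helper names are hypothetical, and the RD values are synthetic, generated from a known power law purely to check the fit. Only the widths m are taken from the caption.

```python
import numpy as np


def relative_distance(w_init, w_final):
    """ASSUMED definition: ||W_final - W_init||_2 / ||W_init||_2."""
    w_init = np.asarray(w_init, dtype=float)
    w_final = np.asarray(w_final, dtype=float)
    return np.linalg.norm(w_final - w_init) / np.linalg.norm(w_init)


def loglog_slope(m_values, rd_values):
    """Fit log10(RD) = slope * log10(m) + intercept by least squares."""
    log_m = np.log10(np.asarray(m_values, dtype=float))
    log_rd = np.log10(np.asarray(rd_values, dtype=float))
    slope, intercept = np.polyfit(log_m, log_rd, deg=1)
    return slope, intercept


if __name__ == "__main__":
    # Widths as listed in the caption.
    widths = np.array([100, 1000, 2000, 5000, 10000], dtype=float)
    # Synthetic RD values following RD = 3 * m^(-0.5); NOT results from the paper.
    rd = 3.0 * widths ** -0.5
    slope, intercept = loglog_slope(widths, rd)
    print(f"fitted slope: {slope:.3f}")
```

With real measurements, `rd` would hold one RD(W^[1]) value per width, for example averaged over several runs, and the sign and size of the fitted slope would indicate how the relative change of W^[1] scales with m.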
Context 4
... we explore the phase diagram by experimentally scanning S_{W_i} over the phase space. The result for the same 1-d problem as in Fig. 2 is presented in Fig. 5. In the red zone, where S_{W_i} is ...
Table 2: Two groups of S_{W_1} and S_{W_2}. Within a group, γ_2 and γ_3 are the same, while α, a, W^[2] and W^[1] are different. These values are the average of eight experiments. ...