《Learning Discrete Structures for Graph Neural Networks》读后感
包括 英文、汉译以及个人的见解
Abstract
Graph neural networks (GNNs) are a popular class of machine learning models whose major advantage is their ability to incorporate a sparse and discrete dependency structure between data points. Unfortunately, GNNs can only be used when such a graph-structure is available. In practice, however, real-world graphs are often noisy and incomplete or might not be available at all. With this work, we propose to jointly learn the graph structure and the parameters of graph convolutional networks (GCNs) by approximately solving a program that learns a discrete probability distribution on the edges of the graph. This allows one to apply GCNs not only in scenarios where the given graph is incomplete or corrupted but also in those where a graph is not available. We conduct a series of experiments that analyze the behavior of the proposed method and demonstrate that it outperforms related methods by a significant margin.
摘要部分:图神经网络(GNNs)是一类流行的机器学习模型,其主要优点是能够在数据点之间合并稀疏和离散的依赖结构。不幸的是,GNNs只能在这种图结构可用时使用。然而,在实践中,真实世界的图模型通常是嘈杂的和不完整的,或者可能根本不可用。在这项工作中,我们提出通过解决图形边缘上的离散概率分布的求解问题来共同学习图的结构以及GCN的参数。这使得人们不仅可以在给定的图不完整或损坏的情况下应用GCN,还可以在图不可用的情况下应用GCN。我们进行了一系列实验,分析了所提出方法的性能,并证明它比相关方法有显著的优势。
1 Introduction
Relational learning is concerned with methods that can not only leverage the attributes of data points but also their relationships. Diagnosing a patient, for example, not only depends on the patient's vitals and demographic information but also on the same information about their relatives, the information about the hospitals they have visited, and so on. Relational learning, therefore, does not make the assumption of independence between data points but models their dependency explicitly. Graphs are a natural way to represent relational information and there is a large number of learning algorithms leveraging graph structure. Graph neural networks (GNNs) (Scarselli et al., 2009) are one such class of algorithms that are able to incorporate sparse and discrete dependency structures between data points.
关系学习涉及的方法不仅可以利用数据点的属性,还可以利用它们的关系。例如,诊断一个病人,不仅取决于病人的生命体征和人口统计信息,而且还取决于有关其亲属的相同信息、他们去过的医院的信息等等。因此,**关系学习并不假设数据点之间的独立性,而是显式地建模它们的依赖性。**图是表示关系信息的一种自然方式,有大量利用图结构的学习算法。图神经网络(GNNs)(Scarselli等人,2009)就是这样一类算法,能够在数据点之间合并稀疏和离散的依赖结构。
While a graph structure is available in some domains,in others it has to be inferred or constructed. A possible approach is to first create a k-nearest neighbor (kNN) graph based on some measure of similarity between data points. This is a common strategy used by several learning methods such as LLE (Roweis & Saul, 2000) and Isomap (Tenenbaum et al., 2000). A major shortcoming of this approach, however, is that the efficacy of the resulting models hinges on the choice of k and, more importantly, on the choice of a suitable similarity measure over the input features. In any case, the graph creation and parameter learning steps are independent and require heuristics and trial and error. Alternatively, one could simply use a kernel matrix to model the similarity of examples implicitly at the cost of introducing a dense dependency structure.
虽然图结构在某些领域是可用的,但在其他领域,它必须被推断形成或构造。一种可能的方法是首先根据数据点之间的某种相似性度量创建k-最近邻(kNN)图。这是几种学习方法(如LLE ( Roweis&Saul,2000) 和 Isomap (Tenenbaum et al.,2000))使用的一种常见策略。然而,**这种方法的一个主要缺点是,结果模型的有效性取决于k的选择,更重要的是取决于对输入特征的合适的相似性度量的选择。**在任何情况下,图形创建和参数学习步骤都是独立的,需要启发式和反复试验。或者,可以简单地使用一个内核矩阵来隐式地建模示例的相似性,但代价是引入一个密集的依赖结构。
With this paper, we follow a different route with the aim of learning discrete and sparse dependencies between data points while simultaneously training the parameters of graph convolutional networks (GCNs), a class of GNNs. Intuitively, GCNs learn node representations by passing and aggregating messages between neighboring nodes (Kipf & Welling, 2017; Monti et al., 2017; Gilmer et al., 2017; Hamilton et al., 2017; Duran & Niepert, 2017; Velickovic et al., 2018). We propose to learn a generative probabilistic model for graphs, samples from which are used both during training and at prediction time. Edges are modeled with random variables whose parameters are treated as hyperparameters in a bilevel learning framework (Franceschi et al., 2018). We iteratively sample the structure while minimizing an inner objective (a training error) and optimize the edge distribution parameters by minimizing an outer objective (a validation error).
本文提出了一种新的学习路径,目的是学习离散的稀疏依赖关系,同时训练图卷积网络(GCN)的参数。直观地说,GCN 通过在相邻节点之间传递和聚合消息来学习节点表示。我们提出学习一个图的生成概率模型,从中可以在训练和预测时使用样本。边缘用随机变量建模,其参数在双层学习框架中被视为超参数(Franceschi et al.,2018)。我们在最小化内部目标(训练误差)的同时迭代采样结构,并通过最小化外部目标(验证误差)来优化边缘分布参数。
To the best of our knowledge, this is the first method that simultaneously learns the graph and the parameters of a GNN for semi-supervised classification. Moreover, and this might be of independent interest, we adapt gradient-based hyperparameter optimization to work for a class of discrete hyperparameters (edges, in this work). We conduct a series of experiments and show that the proposed method is competitive with and often outperforms existing approaches. We also verify that the resulting graph generative models have meaningful edge probabilities.
据我们所知,这是第一种同时学习图形和GNN参数进行半监督分类的方法。此外,我们采用基于梯度的超参数优化方法来处理一类离散的超参数(边)。我们进行了一系列的实验,结果表明该方法与现有的方法相比具有竞争性,而且往往优于现有的方法。我们还验证了所得到的图生成模型具有有意义的边缘概率。
2 Background
We first provide some background on graph theory, graph neural networks, and bilevel programming.
我们首先提供一些关于图论、图神经网络和双层规划的背景知识。
2.1 Graph Theory Basics
A graph G is a pair (V,E) with V = {v_1,…,v_N} the set of vertices and E ⊆ V × V the set of edges. Let N and M be the number of vertices and edges, respectively. Each graph can be represented by an adjacency matrix A of size N×N: A_{i,j} = 1 if there is an edge from vertex v_i to vertex v_j, and A_{i,j} = 0 otherwise. The graph Laplacian is defined by L = D − A, where
$$D_{i,i}=\sum_j A_{i,j}, \qquad D_{i,j} = 0 \ \ (i \ne j)$$
We denote the set of all N ×N adjacency matrices by H_N.
基础图理论:定义了图 G=(V,E)、G 的邻接矩阵 A,以及图的拉普拉斯矩阵 L = D−A;所有 N×N 邻接矩阵的集合记为 H_N。
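To make the definitions above concrete, here is a minimal NumPy sketch (my own illustration, not from the paper) that builds the adjacency matrix, degree matrix, and Laplacian of a small undirected graph:

```python
import numpy as np

# Adjacency matrix of a small undirected graph on N = 4 vertices.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # D_ii = Σ_j A_ij, off-diagonal entries are 0
L = D - A                    # graph Laplacian L = D - A
```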
2.2 Graph Neural Networks
Graph neural networks are a popular class of machine learning models for graph-structured data. We focus specifically on graph convolutional networks (GCNs) and their application to semi-supervised learning. All GNNs have the same two inputs: first, a feature matrix X ∈ X_N ⊂ R^{N×n}, where n is the number of different node features; second, a graph G = (V,E) with adjacency matrix A ∈ H_N. Given a set of class labels Y and a labeling function y : V → Y that maps (a subset of) the nodes to their true class label, the objective is, given a set of training nodes V_Train, to learn a function
$$f_w : X_N \times H_N \to Y^N$$
by minimizing some regularized empirical loss
$$L(w,A) = \sum_{v \in V_{\mathrm{Train}}} l(f_w(X,A)_v, y_v) + \Omega(w) \qquad (1)$$
where w ∈R^d are the parameters of f_w, f_w(X,A)_v is the output of f_w for node v, l : Y×Y →R+ is a point-wise loss function, and Ω is a regularizer. An example of the function f_w proposed by Kipf & Welling (2017) is the following two hidden layer GCN that computes the class probabilities as
$$f_w(X,A) = \mathrm{Softmax}\big(\hat{A}\, \mathrm{ReLU}(\hat{A}\, X\, W_1)\, W_2\big) \qquad (2)$$
where w = (W_1, W_2) are the parameters of the GCN and \hat{A} is the normalized adjacency matrix, given by
$$\hat{A} = \tilde{D}^{-1/2}(A + I)\,\tilde{D}^{-1/2}$$
with diagonal,
$$\tilde{D}_{ii} = 1 + \sum_{j} A_{ij}$$
图神经网络是一类流行的处理图结构数据的机器学习模型。我们特别关注图卷积网络(GCN)及其在半监督学习中的应用。所有GNN都有相同的两个输入:首先是特征矩阵 X∈X_N⊂R^{N×n},其中 n 是不同节点特征的个数;其次是图 G=(V,E) 及其邻接矩阵 A∈H_N。给定一组类标签 Y 和一个将(部分)节点映射到其真实类标签的标注函数 y : V→Y,目标是在给定一组训练节点 V_Train 的情况下,学习函数
$$f_w : X_N \times H_N \to Y^N$$
通过最小化一些规则化的经验损失:
$$L(w,A) = \sum_{v \in V_{\mathrm{Train}}} l(f_w(X,A)_v, y_v) + \Omega(w) \qquad (1)$$
其中 w∈R^d 是 f_w 的参数,f_w(X,A)_v 是 f_w 在节点 v 处的输出,l : Y×Y→R_+ 是逐点损失函数,Ω 是正则化项。Kipf & Welling(2017)提出的函数 f_w 的一个例子是如下的两层(隐藏层)GCN,它按下式计算类别概率:
$$f_w(X,A) = \mathrm{Softmax}\big(\hat{A}\, \mathrm{ReLU}(\hat{A}\, X\, W_1)\, W_2\big) \qquad (2)$$
其中w=(W_1,W_2)是GCN的参数,而ˆA是规范化邻接矩阵,由式子定义
$$\hat{A} = \tilde{D}^{-1/2}(A + I)\,\tilde{D}^{-1/2}$$
其中 \tilde{D} 是对角矩阵,其对角元为
$$\tilde{D}_{ii} = 1 + \sum_{j} A_{ij}$$
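As an illustration of Eq. (2), the following is a small NumPy sketch of the two-layer GCN forward pass; the random toy graph, weight shapes, and initialization are made up for the example and are not the paper's actual setup:

```python
import numpy as np

def gcn_forward(X, A, W1, W2):
    """Two-layer GCN of Eq. (2): Softmax(Â ReLU(Â X W1) W2)."""
    N = A.shape[0]
    A_tilde = A + np.eye(N)                                       # add self-loops: A + I
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))               # from D̃_ii = 1 + Σ_j A_ij
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # Â = D̃^{-1/2}(A+I)D̃^{-1/2}
    H = np.maximum(A_hat @ X @ W1, 0.0)                           # ReLU hidden layer
    logits = A_hat @ H @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)                       # row-wise softmax over classes

# toy usage with random weights (shapes only; not the paper's configuration)
rng = np.random.default_rng(0)
N, n, h, c = 5, 8, 16, 3                                          # nodes, features, hidden units, classes
X = rng.normal(size=(N, n))
A = np.triu((rng.random((N, N)) < 0.3).astype(float), 1)
A = A + A.T                                                       # random symmetric adjacency matrix
probs = gcn_forward(X, A, 0.1 * rng.normal(size=(n, h)), 0.1 * rng.normal(size=(h, c)))
```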
2.3 Bilevel Programming in Machine Learning
Bi-level programs are optimization problems where a set of variables occurring in the objective function are constrained to be an optimal solution of another optimization problem (see Colson et al., 2007, for an overview). Formally given two objective functions F and L, the outer and inner objectives, and two sets of variables, θ ∈ Rm and w ∈ Rd, the outer and inner variables, a bi-level program is given by
$$\min_{\theta, w_\theta} F(w_\theta, \theta) \quad \text{such that} \quad w_\theta \in \arg\min_{w} L(w, \theta). \qquad (3)$$
Bi-level programs arise in numerous situations such as hyperparameter optimization, adversarial, multi-task, and meta learning (Bennett et al., 2006; Flamary et al., 2014; Munoz-Gonzalez et al., 2017; Franceschi et al., 2018).
双层规划是一个优化问题,目标函数中出现的一组变量被约束为另一个优化问题的最优解(见Colson等人,2007年,概述)。形式化地给出了两个目标函数F和L,外部目标和内部目标,以及两组变量,θ∈Rm和w∈Rd,分别为外部和内部变量,给出了一个双层规划问题如下:
$$\min_{\theta, w_\theta} F(w_\theta, \theta) \quad \text{such that} \quad w_\theta \in \arg\min_{w} L(w, \theta). \qquad (3)$$
Solving Problem (3) is challenging since the solution sets of the inner problem are usually not available in closed form. A standard approach involves replacing the minimization of L with the repeated application of an iterative optimization dynamics Φ such as (stochastic) gradient descent (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017). Let w_{θ,T} denote the inner variables after T iterations of the dynamics Φ, that is,
$$w_{\theta,T} = \Phi(w_{\theta,T-1}, \theta) = \Phi(\Phi(w_{\theta,T-2}, \theta), \theta),$$
and so on. Now, if θ and w are real-valued and the objectives and dynamics smooth, we can compute the gradient of the function F(w_{θ,T}, θ) w.r.t. θ, denoted throughout as the hyper-gradient ∇_θ F(w_{θ,T}, θ), as
$$\partial_w F(w_{\theta,T}, \theta)\, \nabla_\theta w_{\theta,T} + \partial_\theta F(w_{\theta,T}, \theta), \qquad (4)$$
where the symbol ∂ denotes the partial derivative (the Jacobian) and ∇ either the gradient (for scalar functions) or the total derivative. The first term can be computed efficiently in time O(T(d + m)) with reverse-mode algorithmic differentiation (Griewank & Walther, 2008) by unrolling the optimization dynamics, repeatedly substituting
$$w_{\theta,t} = \Phi(w_{\theta,t-1}, \theta)$$
and applying the chain rule. This technique makes it possible to optimize a number of hyper-parameters several orders of magnitude greater than classic methods for hyper-parameter optimization can handle (Feurer & Hutter, 2018).
解决问题(3)是具有挑战性的,因为内部问题的解集通常不是封闭形式的。标准方法包括用迭代优化Φ(如随机梯度下降)的重复应用来代替L的最小化(Domke,2012;Maclaurin等人,2015;Franceschi等人,2017)。设w_θ,T表示动力学Φ进行T次迭代后的内部变量,即,
$$w_{\theta,T} = \Phi(w_{\theta,T-1}, \theta) = \Phi(\Phi(w_{\theta,T-2}, \theta), \theta),$$
以此类推。现在,如果 θ 和 w 都是实值的,且目标函数与优化动态都是光滑的,我们就可以计算函数 F(w_{θ,T},θ) 关于 θ 的梯度(全文记作超梯度 ∇_θ F(w_{θ,T},θ)),如下所示:
$$\partial_w F(w_{\theta,T}, \theta)\, \nabla_\theta w_{\theta,T} + \partial_\theta F(w_{\theta,T}, \theta), \qquad (4)$$
其中符号 ∂ 表示偏导数(雅可比矩阵),∇ 表示梯度(对标量函数)或全导数。通过展开优化动态、反复代入下式并应用链式法则,可以用反向模式算法微分(Griewank & Walther,2008)在 O(T(d+m)) 的时间内高效地计算第一项:
$$w_{\theta,t} = \Phi(w_{\theta,t-1}, \theta)$$
并应用链式法则。与经典的超参数优化方法(Feurer & Hutter,2018)相比,这种技术能够优化的超参数数量要多出几个数量级。
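To make the reverse-mode unrolling concrete, below is a tiny self-contained Python toy (my own example, not the paper's code) for a scalar inner problem in which ∂_w Φ and ∂_θ Φ happen to be constants, so the iterates need not be stored; in general the w_t must be kept (or recomputed) for the reverse pass. The hyper-gradient obtained by unrolling matches the closed form, and ∂_θ F = 0 here because the toy outer objective does not depend on θ directly.

```python
import numpy as np

# Toy inner problem: L(w, θ) = 0.5 (w - θ)^2, optimized by T steps of the dynamics
# Φ(w, θ) = w - γ (w - θ).  Toy outer objective: F(w_T, θ) = 0.5 (w_T - y)^2.
gamma, T, y = 0.1, 50, 2.0
theta, w = 0.5, 0.0

for _ in range(T):                     # forward pass: unroll the optimization dynamics
    w = w - gamma * (w - theta)

alpha = w - y                          # reverse pass: α_T = ∂_w F(w_T, θ)
hypergrad = 0.0                        # accumulates the ∇_θ w_{θ,T} contribution of Eq. (4)
for _ in range(T):
    hypergrad += alpha * gamma         # α_t · ∂_θ Φ   (∂_θ Φ = γ in this toy problem)
    alpha *= 1.0 - gamma               # α_{t-1} = α_t · ∂_w Φ   (∂_w Φ = 1 - γ here)

# closed-form check for this toy problem: ∇_θ F = (w_T - y) (1 - (1 - γ)^T)
print(hypergrad, (w - y) * (1.0 - (1.0 - gamma) ** T))
```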
3 Learning Discrete Graph Structures
With this paper we address the challenging scenarios where a graph structure is either completely missing, incomplete, or noisy. To this end, we learn a discrete and sparse dependency structure between data points while simultaneously training the parameters of a GCN. We frame this as a bi-level programming problem whose outer variables are the parameters of a generative probabilistic model for graphs. The proposed approach, therefore, optimizes both the parameters of a GCN and the parameters of a graph generator so as to minimize the classification error on a given dataset. We developed a practical algorithm based on truncated reverse-mode algorithmic differentiation (Williams & Peng, 1990) and hyper-gradient estimation to approximately solve the bi-level problem. A schematic illustration of the resulting method is presented in Figure 1.
在这篇文章中,我们讨论了一个图结构要么完全缺失,要么不完整,或者有噪声的具有挑战性的场景。为此,我们在训练GCN参数的同时,学习数据点之间的离散稀疏依赖结构。我们将其定义为一个双层规划问题,其外部变量是图的生成概率模型的参数。因此,所提出的方法可以同时优化GCN的参数和图形生成器的参数,从而使给定数据集上的分类误差最小化。我们开发了一种基于截断逆模式算法微分(Williams&Peng,1990)和超梯度估计的实用算法来近似解决双层问题。图1给出了结果方法的示意图。
(图 1:所提方法的示意图;原文图片链接已失效。)
3.1 Jointly Learning the Structure and Parameters
Let us suppose that information about the true adjacency matrix A is missing or incomplete. Since, ultimately, we are interested in finding a model that minimizes the generalization error, we assume the existence of a second subset of instances with known target, V_Val (the validation set), from which we can estimate the generalization error. Hence, we propose to find A ∈H_N that minimizes the function
$$F(w_A, A) = \sum_{v \in V_{\mathrm{Val}}} l(f_{w_A}(X,A)_v, y_v), \qquad (5)$$
where w_A is the minimizer, assumed unique, of L (see Eq. (1) and Sec. 2.3) for a fixed adjacency matrix A. We can then consider Equations (1) and (5) as the inner and outer objective of a mixed-integer bi-level programming problem where the outer objective aims to find an optimal discrete graph structure and the inner objective the optimal parameters of a GCN given a graph.
假设真邻接矩阵A的信息缺失或不完整。由于最终我们感兴趣的是找到一个最小化泛化误差的模型,所以我们假设存在第二个子集的实例,具有已知的目标V_Val(验证集),从中我们可以估计泛化误差。因此,我们建议寻找一个使函数最小化的A∈H_N,根据如下式子:
$$F(w_A, A) = \sum_{v \in V_{\mathrm{Val}}} l(f_{w_A}(X,A)_v, y_v), \qquad (5)$$
式中,w_A 是在固定邻接矩阵 A 下 L 的最小化解,并假设其唯一(见式(1)和第2.3节)。于是,我们可以把式(1)和(5)分别视为一个混合整数双层规划问题的内层和外层目标:外层目标旨在找到最优的离散图结构,内层目标则是在给定图的情况下求GCN的最优参数。
The resulting bi-level problem is intractable to solve exactly even for small graphs. Moreover, it contains both continuous and discrete-valued variables, which prevents us from directly applying Eq. (4). A possible solution is to construct a continuous relaxation (see, e.g., Frecon et al., 2018); another is to work with the parameters of a probability distribution over graphs. The latter is the approach we follow in this paper. We maintain a generative model for the graph structure and reformulate the bi-level program in terms of the (continuous) parameters of the resulting distribution over discrete graphs. Specifically, we propose to model each edge with a Bernoulli random variable. Let
$$\widetilde{H}_N = \mathrm{Conv}(H_N)$$
be the convex hull of the set of all adjacency matrices for N nodes. By modeling all possible edges as a set of mutually independent Bernoulli random variables with parameter matrix θ ∈ \widetilde{H}_N, we can sample graphs as
$$H_N \ni A \sim \mathrm{Ber}(\theta).$$
Eqs. (1) and (5) can then be replaced by their expectations over graph structures. The resulting bi-level problem can be written as
$$\min_{\theta \in \widetilde{H}_N} \; E_{A \sim \mathrm{Ber}(\theta)}\,[F(w_\theta, A)] \qquad (6)$$

$$\text{such that} \quad w_\theta = \arg\min_{w}\; E_{A \sim \mathrm{Ber}(\theta)}\,[L(w, A)]. \qquad (7)$$
By taking the expectation, both the inner and the outer objectives become continuous (and possibly smooth) functions of the Bernoulli parameters. The bi-level problem given by Eqs. (6)-(7) is still challenging to solve efficiently. This is because the solution of the inner problem is not available in closed form for GCNs (the objective is non-convex) and the expectations are intractable to compute exactly, differently from, e.g., (model-free) reinforcement learning. An efficient algorithm, therefore, will only be able to find approximate stochastic solutions, that is, θ_{i,j} ∈ (0,1).
由此产生的双层问题即使对于小图也难以精确求解。此外,它包含连续变量和离散变量,这使得我们无法直接应用公式(4)。一个可能的解决方案是构造一个连续的松弛(参见Frecon等人,2018),另一个解决方案是使用图上概率分布的参数。后者是本文所遵循的方法。我们维护了一个图结构的生成模型,并根据离散图上结果分布的(连续)参数重新制定了双层程序。具体来说,我们建议用伯努利随机变量对每条边进行建模。令
$$\widetilde{H}_N = \mathrm{Conv}(H_N)$$
为 N 个节点的所有邻接矩阵集合的凸包。通过将所有可能的边建模为一组相互独立的伯努利随机变量,其参数矩阵为 θ ∈ \widetilde{H}_N,我们可以按下式对图进行采样:
$$H_N \ni A \sim \mathrm{Ber}(\theta).$$
于是,式(1)和(5)可以用其在图结构上的期望来代替。由此得到的双层问题可以写成
$$\min_{\theta \in \widetilde{H}_N} \; E_{A \sim \mathrm{Ber}(\theta)}\,[F(w_\theta, A)] \qquad (6)$$

$$\text{such that} \quad w_\theta = \arg\min_{w}\; E_{A \sim \mathrm{Ber}(\theta)}\,[L(w, A)]. \qquad (7)$$
通过取期望,内层和外层目标都成为伯努利参数的连续(并且可能是光滑的)函数。不过,式(6)–(7)给出的双层问题仍然难以高效求解:因为对GCN而言,内层问题的解没有闭式形式(目标是非凸的),而且期望难以精确计算,这一点不同于(无模型)强化学习等情形。因此,一个高效的算法只能找到近似的随机解,即 θ_{i,j} ∈ (0,1)。
Before describing a method to solve the optimization problem given by Eqs. (6)-(7) approximately with hyper-gradient descent, we first turn to the question of obtaining a final GCN model that we can use for prediction. For a given distribution Pθ over graphs with N nodes and with parameters θ, the expected output of a GCN is
$$f^{\mathrm{exp}}_w(X) = E_A[f_w(X,A)] = \sum_{A \in H_N} P_\theta(A)\, f_w(X,A). \qquad (8)$$
Unfortunately, computing this expectation is intractable even for small graphs; we can, however, compute an empirical estimate of the output as
$$\hat{f}_w(X) = \frac{1}{S} \sum_{i=1}^{S} f_w(X, A_i), \qquad (9)$$
where S > 0 is the number of samples we wish to draw. Note that \hat{f}_w is an unbiased estimator of f^{exp}_w. Hence, to use a GCN f_w learned with the bi-level formulation for prediction, we sample S graphs from the distribution P_θ and compute the prediction as the empirical mean of the values of f_w.
在描述用超梯度下降近似求解式(6)–(7)所给优化问题的方法之前,我们先讨论如何得到可用于预测的最终GCN模型。对于给定的、定义在 N 个节点的图上、参数为 θ 的分布 P_θ,GCN 的期望输出为
$$f^{\mathrm{exp}}_w(X) = E_A[f_w(X,A)] = \sum_{A \in H_N} P_\theta(A)\, f_w(X,A). \qquad (8)$$
不幸的是,计算这个期望值是困难的,即使对于小的图;然而,我们可以计算输出的经验估计值:
$$\hat{f}_w(X) = \frac{1}{S} \sum_{i=1}^{S} f_w(X, A_i), \qquad (9)$$
其中 S>0 是我们希望采样的图的数量。注意,\hat{f}_w 是 f^{exp}_w 的无偏估计量。因此,要用通过双层公式学到的GCN f_w 进行预测,我们从分布 P_θ 中采样 S 个图,并把 f_w 取值的经验均值作为预测结果。
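A minimal NumPy sketch of Eq. (9), assuming some GCN-like callable `model(X, A)` such as the `gcn_forward` sketch from Section 2.2 above (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def sample_adjacency(theta, rng):
    """Draw A ~ Ber(θ): every possible edge is an independent Bernoulli variable."""
    return (rng.random(theta.shape) < theta).astype(float)

def expected_prediction(model, X, theta, S=16, seed=0):
    """Empirical estimate of Eq. (9): average the model output over S sampled graphs."""
    rng = np.random.default_rng(seed)
    return np.mean([model(X, sample_adjacency(theta, rng)) for _ in range(S)], axis=0)

# usage with any f_w(X, A), e.g. the gcn_forward sketch from Section 2.2:
# probs = expected_prediction(lambda X, A: gcn_forward(X, A, W1, W2), X, theta, S=16)
```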
Given the parametrization of the graph generator with Bernoulli variables (P_θ = Ber(θ)), one can sample a new graph in O(N^2). Sampling from a large number of Bernoulli variables, however, is highly efficient, trivially parallelizable, and possible at a rate of millions per second. Other sampling strategies such as MCMC sampling are possible in constant time. Given a set of sampled graphs, it is more efficient to evaluate a sparse GCN S times than to use the Bernoulli parameters as weights of the GCN's adjacency matrix. Indeed, for GCN models, computing \hat{f}_w has a cost of O(SCd), rather than O(N^2 d) for a fully connected graph, where
$$C = \sum_{ij} \theta_{ij}$$
is the expected number of edges, and d is the dimension of the weights. Another advantage of using a graph-generative model is that we can interpret it probabilistically which is not the case when learning a dense adjacency matrix.
给定用伯努利变量参数化的图生成器(P_θ=Ber(θ)),可以在 O(N^2) 时间内采样一个新图。不过,从大量伯努利变量中采样是非常高效的:它可以平凡地并行化,采样速率可达每秒数百万次。其他采样策略(如MCMC采样)则可以在常数时间内完成。给定一组采样得到的图,对稀疏GCN求值 S 次,比把伯努利参数当作GCN邻接矩阵的权重更加高效。实际上,对GCN模型而言,计算 \hat{f}_w 的代价是 O(SCd),而不是全连接图的 O(N^2 d),其中
$$C = \sum_{ij} \theta_{ij}$$
是期望的边数,d是权重的维数。使用图生成模型的另一个优点是我们可以对其进行概率解释,而在学习稠密邻接矩阵时则不是这样。
3.2 Structure Learning via Hyper-gradient Descent
The bi-level programming formalism is a natural fit for the problem of learning both a graph generative model and the parameters of a GNN for a specific downstream task. Here, the outer variables θ are the parameters of the graph generative model and the inner variables w are the parameters of the GCN.
对于学习图形生成模型和特定下游任务的GNN参数的问题,双层编程形式是一种很自然的方法。这里,外部变量θ是图生成模型的参数,内部变量w是GCN的参数。
We now discuss a practical algorithm to approach the bi-level problem defined by Eqs. (6) and (7). Regarding the inner problem, we note that the expectation
$$E_{A \sim \mathrm{Ber}(\theta)}\,[L(w,A)] = \sum_{A \in H_N} P_\theta(A)\, L(w,A) \qquad (10)$$
is composed of a sum of 2^{N^2} terms, which is intractable even for relatively small graphs. We can, however, choose a tractable approximate learning dynamics Φ such as stochastic gradient descent (SGD),
$$w_{\theta,t+1} = \Phi(w_{\theta,t}, A_t) = w_{\theta,t} - \gamma_t \nabla L(w_{\theta,t}, A_t), \qquad (11)$$
where γ_t is a learning rate and A_t ∼ Ber(θ) is drawn at each iteration. Under appropriate assumptions and for t → ∞, SGD converges to a weight vector w_θ that depends on the edges' probability distribution (Bottou, 2010).
我们现在讨论一个用于近似求解由式(6)和(7)定义的双层问题的实用算法。关于内层问题,我们注意到期望
$$E_{A \sim \mathrm{Ber}(\theta)}\,[L(w,A)] = \sum_{A \in H_N} P_\theta(A)\, L(w,A) \qquad (10)$$
是由 2^{N^2} 项求和而成的,即使对相对较小的图也难以精确计算。不过,我们可以选择一个可处理的近似学习动态 Φ,例如随机梯度下降(SGD):
$$w_{\theta,t+1} = \Phi(w_{\theta,t}, A_t) = w_{\theta,t} - \gamma_t \nabla L(w_{\theta,t}, A_t), \qquad (11)$$
式中,γ_t 是学习率,并且每次迭代都会采样一个 A_t∼Ber(θ)。在适当的假设下,当 t→∞ 时,SGD 收敛到一个依赖于边概率分布的权重向量 w_θ(Bottou,2010)。
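A small sketch of the inner dynamics of Eq. (11); the quadratic loss in the usage example is only a differentiable stand-in for the GCN training loss of Eq. (1), and `grad_L` would be the GCN gradient in the actual method:

```python
import numpy as np

def inner_sgd_step(w, theta, grad_L, gamma, rng):
    """One step of Eq. (11): sample A_t ~ Ber(θ), then take a gradient step on L(w, A_t)."""
    A_t = (rng.random(theta.shape) < theta).astype(float)   # a discrete graph sample
    return w - gamma * grad_L(w, A_t)

# usage with a toy surrogate loss L(w, A) = 0.5 * ||A w - b||^2 standing in for the GCN loss
rng = np.random.default_rng(0)
N = 6
theta = np.full((N, N), 0.2)                                 # Bernoulli edge parameters
b = rng.normal(size=N)
grad_L = lambda w, A: A.T @ (A @ w - b)                      # gradient of the surrogate loss
w = np.zeros(N)
for _ in range(100):
    w = inner_sgd_step(w, theta, grad_L, gamma=0.05, rng=rng)
```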
Let w_θ,T be an approximate minimizer of E[L] (where T may depend on θ). We now need to compute an estimator for the hyper-gradient
$$\nabla_\theta E_{A \sim \mathrm{Ber}(\theta)}\,[F].$$
Recalling Eq. (4), we have
$$\nabla_\theta E[F(w_{\theta,T}, A)] = E[\nabla_\theta F(w_{\theta,T}, A)] = E\big[\partial_w F(w_{\theta,T}, A)\, \nabla_\theta w_{\theta,T} + \partial_A F(w_{\theta,T}, A)\, \nabla_\theta A\big], \qquad (12)$$
where we can swap the gradient and expectation operators since the expectation is over a finite random variable, assuming that the loss function F is bounded. We use the so-called straight-through estimator (Bengio et al., 2013) and set
$$\nabla_\theta A := I$$
(which would be 0 a.e. otherwise); ∇_θ A appears both explicitly in (12) and in the computation of ∇_θ w_{θ,T}, through
$$\partial_\theta \Phi(w_{\theta,t}, A_t)$$
(see Sec. 2.3 and Franceschi et al., 2017, for details). Finally, we take the single sample Monte Carlo estimator of (12) to update the parameters θ with projected gradient descent on the unit hypercube.
设w_θ,T是E[L]的近似极小值(其中T可能依赖于θ)。现在我们需要计算超梯度的一个估计量
$$\nabla_\theta E_{A \sim \mathrm{Ber}(\theta)}\,[F].$$
回想等式(4),我们有
$$\nabla_\theta E[F(w_{\theta,T}, A)] = E[\nabla_\theta F(w_{\theta,T}, A)] = E\big[\partial_w F(w_{\theta,T}, A)\, \nabla_\theta w_{\theta,T} + \partial_A F(w_{\theta,T}, A)\, \nabla_\theta A\big], \qquad (12)$$
这里我们可以交换梯度与期望算子,因为期望是对一个取值有限的随机变量求的,并且假设损失函数 F 有界。我们使用所谓的直通(straight-through)估计器(Bengio 等人,2013),并设
$$\nabla_\theta A := I$$
(否则它几乎处处为 0);∇_θ A 既显式地出现在式(12)中,也通过下式出现在 ∇_θ w_{θ,T} 的计算中:
$$\partial_\theta \Phi(w_{\theta,t}, A_t)$$
(详见第 2.3 节以及 Franceschi 等人,2017)。最后,我们取式(12)的单样本蒙特卡罗估计量,用投影梯度下降在单位超立方体上更新参数 θ。
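The following sketch (my own simplification, with placeholder gradient arguments) shows how a single-sample estimate of Eq. (12) is combined with the straight-through trick ∇_θ A := I and a projected gradient step onto the unit hypercube:

```python
import numpy as np

def straight_through_outer_update(theta, grad_w_F_times_dw_dtheta, grad_A_F, eta):
    """Single-sample estimate of Eq. (12) with the straight-through estimator, then a
    projected gradient step on θ.

    grad_A_F is ∂_A F evaluated at the sampled binary adjacency matrix; the straight-through
    estimator (∇_θ A := I) uses it directly as if it were a gradient with respect to θ."""
    hypergrad = grad_w_F_times_dw_dtheta + grad_A_F
    return np.clip(theta - eta * hypergrad, 0.0, 1.0)   # projection onto the unit hypercube

# toy usage with random placeholder gradients (in LDS these come from unrolling Eq. (11))
rng = np.random.default_rng(0)
theta = rng.random((4, 4))
theta = straight_through_outer_update(theta, 0.01 * rng.normal(size=(4, 4)),
                                      0.01 * rng.normal(size=(4, 4)), eta=0.1)
```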
Computing the hyper-gradient by fully unrolling the dynamics may be too expensive both in time and memory. We propose to truncate the computation and estimate the hyper-gradient every τ iterations, where τ is a parameter of the algorithm. This is essentially an adaptation of truncated back-propagation through time (Werbos, 1990; Williams & Peng, 1990) and can be seen as a short-horizon optimization procedure with warm restart on w. A sketch of the method is presented in Algorithm 1, while a more complete version that includes details on the hyper-gradient computation can be found in the appendix. Inputs and operations in square brackets are optional.
通过完全展开优化动态来计算超梯度,在时间和内存上都可能过于昂贵。我们建议截断该计算,每 τ 次迭代估计一次超梯度,其中 τ 是算法的一个参数。这本质上是截断式时序反向传播(Werbos,1990;Williams & Peng,1990)的一种改编,可以看作在 w 上热重启的短视界优化过程。算法1给出了该方法的概要,包含超梯度计算细节的更完整版本可在附录中找到。方括号中的输入和操作是可选的。
The algorithm contains stopping conditions at the outer and at the inner level. While it is natural to implement the latter with a decrease condition on the inner objective, we find it useful to implement the former with a simple early stopping criterion. A fraction of the examples in the validation set is held out to compute, in each outer iteration, the accuracy using the predictions of the empirically expected model (9). The optimization procedure terminates if there is no improvement for some consecutive outer loops. This helps avoid over-fitting the outer objective (6), which may be a concern in this context given the quantity of (hyper)parameters being optimized and the relatively small size of the validation sets.
该算法在外层和内层都包含停止条件。内层的停止条件自然可以用内层目标的下降条件来实现,而外层我们发现用一个简单的提前停止准则来实现是有用的:从验证集中留出一小部分样本,在每次外层迭代中用经验期望模型(式(9))的预测计算其准确率。如果连续若干个外层循环都没有改进,优化过程就终止。这有助于避免对外层目标(6)过拟合;考虑到被优化的(超)参数数量之多以及验证集相对较小,这在此场景下可能是一个值得关注的问题。
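Putting the pieces together, here is a hedged Python skeleton of the truncated outer loop with early stopping; `inner_step`, `hypergrad_estimate`, and `val_accuracy` are hypothetical callables standing in for Eq. (11), the estimator of Eq. (12), and the held-out accuracy of the expected model of Eq. (9). This is a sketch of the overall structure, not the paper's Algorithm 1:

```python
import numpy as np

def lds_outer_loop(theta, w, inner_step, hypergrad_estimate, val_accuracy,
                   tau=5, eta=0.1, patience=20, max_outer=500, seed=0):
    """Truncated bi-level optimization sketch with early stopping on a held-out split."""
    rng = np.random.default_rng(seed)
    best_acc, bad_rounds = -np.inf, 0
    for _ in range(max_outer):
        for _ in range(tau):                                         # τ warm-started inner steps
            A = (rng.random(theta.shape) < theta).astype(float)      # A_t ~ Ber(θ)
            w = inner_step(w, A)
        theta = np.clip(theta - eta * hypergrad_estimate(w, theta), 0.0, 1.0)  # projected update
        acc = val_accuracy(w, theta)                                 # accuracy on held-out split
        if acc > best_acc:
            best_acc, bad_rounds = acc, 0
        else:
            bad_rounds += 1
        if bad_rounds >= patience:                                   # early stopping condition
            break
    return theta, w
```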
The hyper-gradients estimated with Algorithm 1 at each outer iteration are biased. The bias stems both from the straight-through estimator and from the truncation procedure introduced in lines 11-13 (Tallec & Ollivier, 2017). Nevertheless, we find empirically that the algorithm is able to make reasonable progress, finding configurations in the distribution space that are beneficial for the tasks at hand.
算法1在每次外层迭代中估计的超梯度是有偏的。偏差既来自直通估计器,也来自第11-13行引入的截断过程(Tallec & Ollivier,2017)。然而,我们从经验上发现,该算法能够取得合理的进展,在分布空间中找到有利于当前任务的配置。
4. Experiments
We conducted a series of experiments with three main objectives. First, we evaluated LDS on node classification problems where a graph structure is available but where a certain fraction of edges is missing. Here, we compared LDS with graph-based learning algorithms including vanilla GCNs. Second, we wanted to validate our hypothesis that LDS can achieve competitive results on semi-supervised classification problems for which a graph is not available. To this end, we compared LDS with a number of existing semi-supervised classification approaches. We also compared LDS with algorithms that first create k-NN affinity graphs on the data set. Third, we analyzed the learned graph generative model to understand to what extent LDS is able to learn meaningful edge probability distributions even when a large fraction of edges is missing.
我们用三个主要目标进行了一系列实验。首先,我们在节点分类问题上评估了LDS,在这些问题中,图结构是可用的,但是缺少某些部分的边。在这里,我们比较了LDS和基于图的学习算法,包括vanilla-GCNs。其次,我们想验证我们的假设,即LDS可以在没有图的半监督分类问题上获得竞争性结果。为此,我们将LDS与许多现有的半监督分类方法进行了比较。我们还将LDS与首先在数据集上创建k-NN关联图的算法进行了比较。第三,我们分析了学习图生成模型,以了解LDS在多大程度上能够学习到有意义的边缘概率分布,即使在很大一部分边缘丢失的情况下。
4.1 Datasets
Cora and Citeseer are two benchmark datasets that are commonly used to evaluate relational learners in general and GCNs in particular (Sen et al., 2008). The input features are bags of words and the task is node classification. We use the same dataset split and experimental setup of previous work (Yang et al., 2016; Kipf & Welling, 2017). To evaluate the robustness of LDS on incomplete graphs, we construct graphs with missing edges by randomly sampling 25%, 50%, and 75% of the edges. In addition to Cora and Citeseer where we removed all edges, we evaluate LDS on benchmark datasets that are available in scikit-learn (Pedregosa et al., 2011) such as Wine, Breast Cancer (Cancer), Digits, and 20 Newsgroup (20news). We take 10 classes from 20 Newsgroup and use words (TF-IDF) with a frequency of more than 5% as features. We also use FMA, a dataset where 140 audio features are extracted from 7,994 music tracks and where the problem is genre classification (Defferrard et al., 2017). The statistics of the datasets are reported in the appendix.
Cora和Citeseer是两个基准数据集,通常用于评估关系学习方法,特别是GCN(Sen等人,2008)。输入特征是词袋,任务是节点分类。我们使用与先前工作相同的数据集划分和实验设置(Yang等人,2016;Kipf & Welling,2017)。为了评估LDS对不完整图的鲁棒性,我们通过随机采样25%、50%和75%的边来构造带有缺失边的图。除了移除了所有边的Cora和Citeseer之外,我们还在scikit-learn(Pedregosa等人,2011)中提供的基准数据集上评估LDS,如 Wine、Breast Cancer(Cancer)、Digits 和 20 Newsgroup(20news)。我们从20 Newsgroup中取10个类,并使用出现频率超过5%的词(TF-IDF)作为特征。我们还使用了FMA,这是一个从7994首音乐曲目中提取140个音频特征的数据集,其任务是流派分类(Defferrard等人,2017)。数据集的统计信息见附录。
4.2 Setup and Baselines
For the experiments on graphs with missing edges, we compare LDS to vanilla GCNs. In addition, we also conceived a method (GCN-RND) where we add randomly sampled edges at each optimization step of a vanilla GCN. With this method we intend to show that simply adding random edges to the standard training procedure of a GCN model (perhaps acting as a regularization technique) is not enough to improve the generalization.
对于有缺边图的实验,我们比较了LDS和vanilla GCNs。此外,我们还构想了一种方法(GCN-RND),即在vanilla GCN的每个优化步骤中添加随机采样的边。用这种方法,我们试图证明仅仅在GCN模型的标准训练过程中添加随机边(可能作为一种正则化技术)不足以提高泛化能力。
When a graph is completely missing, GCNs boil down to feed-forward neural networks. Therefore, we evaluate different strategies to induce a graph on both labeled and unlabeled samples by creating (1) a sparse Erdos-Renyi random graph (Erdos & Renyi, 1960) (Sparse-GCN); (2) a dense graph with equal edge probabilities (Dense-GCN); (3) a dense RBF kernel on the input features (RBF-GCN); and (4) a sparse k-nearest neighbor graph on the input features (kNN-GCN). For LDS we initialize the edge probabilities using the k-NN graph (kNN-LDS). We further include a dense version of LDS where we learn a dense similarity matrix (kNN-LDS (dense)). In this setting, we compare LDS to popular semi-supervised learning methods such as label propagation (LP) (Zhu et al., 2003), manifold regularization (ManiReg) (Belkin et al., 2006), and semi-supervised embedding (SemiEmb) (Weston et al., 2012). ManiReg and SemiEmb are given a k-NN graph as input for the Laplacian regularization. We also compare LDS to baselines that do not leverage a graph-structure such as logistic regression (LogReg), support vector machines (Linear and RBF SVM), random forests (RF), and feed-forward neural networks (FFNN). For comparison methods that need a kNN graph, k ∈{2,3,…,20}and the metric (Euclidean or Cosine) are tuned using validation accuracy. For kNN-LDS, k is tuned from 10 or 20.
当图完全缺失时,GCN退化为前馈神经网络。因此,我们评估了在有标签和无标签样本上构造图的不同策略:(1)稀疏的Erdos-Renyi随机图(Erdos & Renyi,1960)(Sparse-GCN);(2)边概率相等的稠密图(Dense-GCN);(3)基于输入特征的稠密RBF核(RBF-GCN);(4)基于输入特征的稀疏k近邻图(kNN-GCN)。对于LDS,我们用kNN图来初始化边概率(kNN-LDS)。我们还包括一个学习稠密相似度矩阵的稠密版本(kNN-LDS (dense))。在这一设置下,我们将LDS与流行的半监督学习方法进行比较,如标签传播(LP)(Zhu等人,2003)、流形正则化(ManiReg)(Belkin等人,2006)和半监督嵌入(SemiEmb)(Weston等人,2012)。ManiReg和SemiEmb以kNN图作为拉普拉斯正则化的输入。我们还将LDS与不利用图结构的基线进行比较,如逻辑回归(LogReg)、支持向量机(线性和RBF SVM)、随机森林(RF)和前馈神经网络(FFNN)。对于需要kNN图的对比方法,k∈{2,3,…,20}以及度量(欧氏或余弦)通过验证准确率来调整;对于kNN-LDS,k 在 {10, 20} 中选取。
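A possible way to build the kNN graph used to initialize the edge probabilities, sketched with scikit-learn (the function name and defaults here are illustrative; the paper tunes k and the metric on validation accuracy):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_init_edge_probabilities(X, k=10, metric="cosine"):
    """Symmetric kNN affinity graph used as the initial Bernoulli parameter matrix θ:
    entries are 1 on (symmetrized) kNN edges and 0 elsewhere."""
    A = kneighbors_graph(X, n_neighbors=k, metric=metric, mode="connectivity").toarray()
    return np.maximum(A, A.T)   # symmetrize the directed kNN relation
```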
We use the two-layer GCN given by Eq. (2) with 16 hidden neurons and ReLU activation. Given a set of labeled training instances V_Train (nodes or examples) we use the regularized cross-entropy loss
$$L(w,A) = -\sum_{v \in V_{\mathrm{Train}}} y_v \circ \log\big[f_w(X,A)_v\big] + \rho\,\|w_1\|^2, \qquad (13)$$
where y_v is the one-hot encoded target vector for the v-th instance, ◦ denotes the element-wise multiplication and ρ is a non-negative coefficient. As an additional regularization technique we apply dropout (Srivastava et al., 2014) with β = 0.5 as in previous work. We use Adam (Kingma & Ba, 2015) for optimizing L, tuning the learning rate γ from {0.005, 0.01, 0.02}. For LDS, we set the initial edge parameters θ_{i,j} to 0 except for the known edges (or those found by kNN), which we set to 1. We then let all the parameters (including those initially set to 1) be optimized by the algorithm. We further split the validation set evenly to form the validation (A) and early stopping (B) sets. As outer objective we use the unregularized cross-entropy loss on (A) and optimize it with stochastic gradient descent with an exponentially decreasing learning rate. Initial experiments showed that accelerated optimization methods such as Adam or SGD with momentum under-perform in this setting. We tune the step size η of the outer optimization loop and the number of updates τ used to compute the truncated hyper-gradient. Finally, we draw S = 16 samples to compute the output predictions (see Eq. (9)). For LDS and GCN, we apply early stopping with a window size of 20 steps.
我们使用式(2)给出的两层GCN,隐藏层有16个神经元,激活函数为ReLU。给定一组带标签的训练实例 V_Train(节点或样本),我们使用正则化交叉熵损失:
$$L(w,A) = -\sum_{v \in V_{\mathrm{Train}}} y_v \circ \log\big[f_w(X,A)_v\big] + \rho\,\|w_1\|^2, \qquad (13)$$
其中 y_v 是第 v 个实例的 one-hot 目标向量,◦ 表示逐元素乘法,ρ 是非负系数。作为额外的正则化技术,我们像先前工作一样应用 β=0.5 的 dropout(Srivastava et al.,2014)。我们使用Adam(Kingma & Ba,2015)优化 L,学习率 γ 从 {0.005,0.01,0.02} 中调整。对于LDS,我们将初始边参数 θ_{i,j} 设为0,已知的边(或由kNN找到的边)除外,它们被设为1;然后让所有参数(包括初始设为1的那些)都由算法来优化。我们进一步把验证集平均分成验证集(A)和提前停止集(B)。外层目标采用(A)上的非正则化交叉熵损失,并用学习率呈指数衰减的随机梯度下降对其进行优化。初步实验表明,Adam 或带动量的 SGD 等加速优化方法在此设置下表现不佳。我们调整外层优化循环的步长 η 以及用于计算截断超梯度的更新次数 τ。最后,我们采样 S=16 个图来计算输出预测(见式(9))。对于LDS和GCN,我们采用窗口大小为20步的提前停止。
LDS was implemented in TensorFlow (Abadi et al., 2015) and is available at https://github.com/lucfra/LDS. The implementations of the supervised baselines and LP are those from the scikit-learn python package (Pedregosa et al., 2011). GCN, ManiReg, and SemiEmb are implemented in TensorFlow. The hyper-parameters for all the methods are selected through the validation accuracy.
LDS 在TensorFlow(Abadi等人,2015)中实现,可在 https://github.com/lucfra/LDS 获取。有监督基线和LP的实现来自 scikit-learn python 包(Pedregosa等人,2011)。GCN、ManiReg 和 SemiEmb 在TensorFlow中实现。所有方法的超参数都通过验证准确率来选择。
4.3 Results
The results on the incomplete graphs are shown in Figure 2 for Cora (left) and Citeseer (center). For each percentage of retained edges the accuracy on validation (used for early stopping) and test sets are plotted. LDS achieves competitive results in all scenarios and accuracy gains of up to 7 percentage points. Notably, LDS improves the generalization accuracy of GCN models also when the given graph is that of the respective dataset (100% of edges retained), by learning additional helpful edges. The accuracy of 84.1% and 75.0% for Cora and Citeseer, respectively, exceed all previous state-of-the-art results. Conversely, adding random edges does not help decreasing the generalization error. GCN and GCN-RND perform similarly which indicates that adding random edges to the graph is not helpful.
图2显示了Cora(左)和Citeseer(中)在不完整图上的结果。对于每一个保留边的比例,图中绘制了验证集(用于提前停止)和测试集上的准确率。LDS在所有场景中都取得了有竞争力的结果,准确率最多提升了7个百分点。值得注意的是,即使给定的图就是相应数据集的原始图(保留100%的边),LDS也能通过学习额外的有用边来提高GCN模型的泛化准确率。Cora和Citeseer上分别为84.1%和75.0%的准确率超过了此前所有最先进的结果。相反,添加随机边并不能降低泛化误差:GCN和GCN-RND表现相近,这表明向图中添加随机边是没有帮助的。
Figure 2 depicts the impact of the number of iterations τ to compute the hyper-gradients. Taking multiple steps strongly outperforms alternating optimization (i.e. τ = 0) in all settings. Increasing τ further to the value of 20, however, does not yield significant benefits, while increasing the computational cost.
图2描述了迭代次数τ对计算超梯度的影响。在所有设置中,采取多个步骤都比交替优化(即τ=0)表现得更好。然而,将τ进一步增加到20,不会产生显著的效益,同时会增加计算成本。
In Table 2 we computed the expected number of edges in a sampled graph for Cora and Citeseer, to analyze the properties of the graphs sampled from the learned graph generator. The expected number of edges for LDS is higher than the original number which is to be expected since LDS has better accuracy results than the vanilla GCN in Figure 2. Nevertheless, the learned graphs are still very sparse (e.g. for Cora,on average,less than 0.2% edges are present). This facilitates efficient learning of the GCN in the inner learning loop of LDS.
在表2中,我们计算了Cora和Citeseer的采样图中的期望边数,以分析从学到的图生成器中采样得到的图的性质。LDS的期望边数高于原始边数,这一点符合预期,因为在图2中LDS取得了比普通GCN更好的准确率。尽管如此,学到的图仍然非常稀疏(例如,对Cora而言,平均只有不到0.2%的边存在),这有助于在LDS的内层学习循环中高效地训练GCN。
Table 1 lists the results for semi-supervised classification problems. The supervised learning baselines work well on some datasets such as Wine and Cancer but fail to provide competitive results on others such as Digits, Citeseer, Cora, and 20 News. The semi-supervised learning baselines LP, ManiReg and SemiEmb can only improve the supervised learning baselines on 1, 3 and 4 datasets, respectively. The results for the GCN with different input graphs show that kNN-GCN works well and provides competitive results compared to the supervised baselines on all datasets. kNN-LDS significantly outperforms kNN-GCN on 4 out of the 7 datasets. In addition, kNN-LDS is among the most competitive methods on all datasets and yields the highest gains on datasets that have an underlying graph. Moreover, kNN-LDS performs slightly better than its dense counterpart where we learn a dense adjacency matrix. The added benefit of the sparse graph representation lies in the potential to scale to larger datasets.
表1列出了半监督分类问题的结果。有监督学习基线在某些数据集(如Wine和Cancer)上表现良好,但在其他数据集(如Digits、Citeseer、Cora和20news)上无法给出有竞争力的结果。半监督学习基线LP、ManiReg和SemiEmb分别只能在1个、3个和4个数据集上改进有监督学习基线。不同输入图下GCN的结果表明,kNN-GCN表现良好,在所有数据集上都能给出与有监督基线相比有竞争力的结果。kNN-LDS在7个数据集中的4个上显著优于kNN-GCN。此外,kNN-LDS是所有数据集上最具竞争力的方法之一,并且在本身带有图结构的数据集上收益最大。另外,kNN-LDS的表现略优于其学习稠密邻接矩阵的稠密版本;稀疏图表示的额外好处在于有潜力扩展到更大的数据集。
In Figure 3, we show the evolution of mean edge probabilities during optimization on three types of nodes (train, validation, test) on the Cora dataset. LDS is able to learn a graph generative model that is, on average, attributing 10 to 100 times more probability to edges between samples sharing the same class label. LDS often attributes a higher probability to edges that are present in the true held-out adjacency matrix (green lines in the plots). In Figure 4 we report the normalized histograms of the optimized edges probabilities for the same nodes of Figure 3, sorted into six bins in log 10-scale. Edges are divided in two groups: edges between nodes of the same class (blue) and between nodes of unknown or different classes (orange). LDS is able to learn highly non-uniform edge probabilities that reflect the class membership of the nodes. Figure 5 shows similar qualitative results as Figure 4, this time for three Citeseer test nodes, miss-classified by kNN-GCN and correctly classified by kNN-LDS. Again, the learned edge probabilities linking to nodes of the same classes is significantly different to those from different classes; but in this case the densities are more skewed toward the first bin. On the datasets we considered, what seems to matter is to capture a useful distribution (i.e. higher probability for links between same class) rather than pick exact links; of course for other datasets this may vary.
在图3中,我们展示了Cora数据集上三类节点(训练、验证、测试)在优化过程中平均边概率的演变。LDS能够学到这样一个图生成模型:平均而言,它赋予同类样本之间的边高出10到100倍的概率。LDS通常也会给真实的、被保留(held-out)的邻接矩阵中存在的边赋予更高的概率(图中绿线)。在图4中,我们报告了图3中相同节点的优化后边概率的归一化直方图,按 log10 尺度分成6个区间。边被分为两组:同类节点之间的边(蓝色),以及类别未知或不同类节点之间的边(橙色)。LDS能够学到高度非均匀的、反映节点类别归属的边概率。图5给出了与图4类似的定性结果,这一次针对三个被kNN-GCN误分类、而被kNN-LDS正确分类的Citeseer测试节点。同样,连到同类节点的边的学习概率与连到不同类节点的边显著不同;但在这种情况下,概率密度更偏向第一个区间。在我们考虑的数据集上,重要的似乎是捕捉到一个有用的分布(即同类之间的边概率更高),而不是挑出确切的边;当然,对于其他数据集,这一点可能有所不同。
5 Related work
Semi-supervised Learning.
Early works on graph-based semi-supervised learning use graph Laplacian regularization and include label propagation (LP) (Zhu et al., 2003), manifold regularization (ManiReg) (Belkin et al., 2006), and semi-supervised embedding (SemiEmb) (Weston et al., 2012). These methods assume a given graph whose edges represent some similarity between nodes. Later, Yang et al. (2016) proposed a method that uses graphs not for regularization but rather for embedding learning by joint classification and graph context prediction. Kipf & Welling (2017) presented the first GCN for semi-supervised learning. There are now numerous GCN variants, all of which assume a given graph structure. Contrary to all existing graph-based semi-supervised learning approaches, LDS is able to work even when the graph is incomplete or missing.
早期的基于图的半监督学习使用图拉普拉斯正则化,包括标签传播(LP)(Zhu et al.,2003)、流形正则化(ManiReg)(Belkin等人,2006)和半监督嵌入(SemiEmb)(Weston等人,2012)。这些方法假设给定的图的边表示节点之间的某种相似性。后来,(Yangetal.,2016)提出了一种方法,该方法不使用图进行正则化,而是通过联合分类和图形上下文预测来嵌入学习。Kipf&Welling(2017)提出了第一个半监督学习的GCN。现在有许多GCN变体,它们都假设一个给定的图结构。与现有的基于图的半监督学习方法不同,LDS 能够在图不完整或缺失的情况下工作。
Graph synthesis and generation.
LDS learns a probabilistic generative model for graphs. The earliest probabilistic generative model for graphs was the Erdos-Renyi random graph model (Erdos & Renyi, 1960), where edge probabilities are modeled as identically distributed and mutually independent Bernoullis. Several network models have been proposed to model particular graph properties well, such as degree distribution (Leskovec et al., 2005) or network diameter (Watts & Strogatz, 1998). Leskovec et al. (2010) proposed a generative model based on the Kronecker product that takes a real graph as input and generates graphs that have similar properties. Recently, deep learning based approaches have been proposed for graph generation (You et al., 2018; Li et al., 2018; Grover et al., 2018; De Cao & Kipf, 2018). The goal of these methods, however, is to learn a sophisticated generative model that reflects the properties of the training graphs. LDS, on the other hand, learns graph generative models as a means to perform well on classification problems and its input is not a collection of graphs. More recent work proposed an unsupervised model that learns to infer interactions between entities while simultaneously learning the dynamics of physical systems such as spring systems (Kipf et al., 2018). Contrary to LDS, the method is specific to dynamical interacting systems, is unsupervised, and uses a variational encoder-decoder. Finally, we note that Johnson (2017) proposed a fully differentiable neural model able to process and produce graph structures at the input, representation, and output levels; training the model requires, however, supervision in terms of ground truth graphs.
LDS学习一个图的概率生成模型。最早的图概率生成模型是Erdos-Renyi随机图模型(Erdos & Renyi,1960),其中各条边的概率被建模为同分布且相互独立的伯努利变量。此后人们提出了若干网络模型来刻画特定的图性质,如度分布(Leskovec et al.,2005)或网络直径(Watts & Strogatz,1998)。Leskovec等人(2010)提出了一个基于Kronecker积的生成模型,它以一个真实图作为输入,生成性质相似的图。最近,也有人提出了基于深度学习的图生成方法(You et al.,2018;Li et al.,2018;Grover et al.,2018;De Cao & Kipf,2018)。然而,这些方法的目标是学习能反映训练图性质的复杂生成模型;而LDS学习图生成模型是为了在分类问题上取得好的表现,其输入也不是一组图。更近期的工作提出了一种无监督模型,它在学习弹簧系统等物理系统动力学的同时,学习推断实体之间的相互作用(Kipf等人,2018)。与LDS不同,该方法专门针对动态交互系统,是无监督的,并使用变分编码器-解码器。最后,我们注意到Johnson(2017)提出了一个完全可微的神经模型,能够在输入、表示和输出层面处理并生成图结构;不过训练该模型需要真实图(ground truth)作为监督。
Link prediction.
Link prediction is a decades-old problem (Liben-Nowell & Kleinberg, 2007). Several survey papers cover the large body of work ranging from link prediction in social networks to knowledge base completion (Lu & Zhou, 2011; Nickel et al., 2016). While a majority of the methods are based on some similarity measure between node pairs, there has been a number of neural network based methods (Zhang & Chen, 2017; 2018). The problem we study in this paper is related to link prediction as we also want to learn or extend a graph. However, existing link prediction methods do not simultaneously learn a GNN node classifier. Statistical relational learning (SRL) (Getoor & Taskar, 2007) models often perform both link prediction and node classification through the existence of binary and unary predicates. However, SRL models are inherently intractable and the structure and parameter learning steps are independent.
链接预测是一个已有数十年历史的问题(Liben-Nowell & Kleinberg,2007)。若干综述论文涵盖了从社交网络中的链接预测到知识库补全的大量工作(Lu & Zhou,2011;Nickel等人,2016)。虽然大多数方法基于节点对之间的某种相似性度量,但也有不少基于神经网络的方法(Zhang & Chen,2017;2018)。本文研究的问题与链接预测相关,因为我们同样希望学习或扩展一个图。然而,现有的链接预测方法并不同时学习一个GNN节点分类器。统计关系学习(SRL)(Getoor & Taskar,2007)模型通常借助二元和一元谓词同时完成链接预测和节点分类,但SRL模型本质上难以精确求解,并且其结构学习与参数学习步骤是相互独立的。
Gradient estimation for discrete random variables.
Due to the intractable nature of the two bi-level objectives, LDS needs to estimate the hyper-gradients through a stochastic computational graph (Schulman et al., 2015). Using the score function estimator, also known as REINFORCE (Williams, 1992), would treat the outer objective as a blackbox function and would not exploit F being differentiable w.r.t. the sampled adjacency matrices and inner optimization dynamics. Conversely,the path-wise estimator is not readily applicable, since the random variables are discrete. LDS borrows from a solution proposed before (Bengio et al., 2013),at the cost of having biased estimates. Recently, Jang et al. (2017); Maddison et al. (2017) presented an approach based on continuous relaxations to reduce variance, which Tucker et al. (2017) combined with REINFORCE to obtain an unbiased estimator. Grathwohl et al. (2018) further introduced surrogate models to construct control variates for black-box functions. Unfortunately, these latter methods require to compute the function in the interior of the hypercube, possibly in multiple points (Tucker et al., 2017). This would introduce additional computational overhead.
由于两个双层目标的难解性,LDS需要通过一个随机计算图来估计超梯度(Schulman等人,2015)。若使用得分函数估计器(也称为REINFORCE,Williams,1992),就会把外层目标当作黑盒函数,而无法利用 F 关于采样得到的邻接矩阵和内层优化动态可微这一事实。反过来,由于随机变量是离散的,路径导数(path-wise)估计器也无法直接使用。LDS借用了此前提出的一个方案(Bengio等人,2013),其代价是估计有偏。最近,Jang等人(2017)和Maddison等人(2017)提出了一种基于连续松弛来降低方差的方法,Tucker等人(2017)将其与REINFORCE结合得到无偏估计量。Grathwohl等人(2018)进一步引入代理模型来为黑盒函数构造控制变量。不幸的是,后面这些方法需要在超立方体内部(可能在多个点上)计算函数值(Tucker等人,2017),这会带来额外的计算开销。
6 Conclusion
We propose LDS, a framework that simultaneously learns the graph structure and the parameters of a GNN. While we have used a specific GCN variant (Kipf & Welling, 2017) in the experiments, the method is more generally applicable to other GNNs. The strengths of LDS are its high accuracy gains on typical semi-supervised classification datasets at a reasonable computational cost. Moreover, due to the graph generative model LDS learns, the edge parameters have a probabilistic interpretation.
The method has its limitations. While relatively efficient, it cannot currently scale to large datasets: this would require an implementation that works with mini-batches of nodes. We evaluated LDS only in the transductive setting, when all data points (nodes) are available during training. Adding additional nodes after training (the inductive setting) would currently require retraining the entire model from scratch. When sampling graphs, we do not currently enforce the graphs to be connected. This is something we anticipate to improve the results, but this would require a more sophisticated sampling strategy. All of these shortcomings motivate future work. In addition, we hope that suitable variants of LDS algorithm will also be applied to other problems such as neural architecture search or to tune other discrete hyper-parameters.
我们提出了 LDS,一个同时学习GNN的图结构和参数的框架。虽然我们在实验中使用了特定的GCN变体(Kipf&Welling,2017),但该方法更普遍地适用于其他GNN。LDS的优势在于它以合理的计算成本在典型的半监督分类数据集上获得了高精度。此外,由于 LDS 学习的图形生成模型,边缘参数具有概率解释性。
这种方法有其局限性。虽然相对高效,但它目前无法扩展到大型数据集:这需要一个基于小批量节点的实现。我们只在直推式(transductive)设置下评估了LDS,即所有数据点(节点)在训练期间都可用;训练后再添加新节点(归纳式设置)目前需要从头重新训练整个模型。在对图进行采样时,我们目前没有强制要求图是连通的。我们预期强制连通会改进结果,但这需要更复杂的采样策略。所有这些不足都是未来工作的动力。此外,我们希望LDS算法的合适变体也能应用于其他问题,例如神经结构搜索,或用于调整其他离散超参数。
施工ing……