Research Statement
Haixuan Yang
November 2007
I follow the philosophy that the performance of an algorithm can be improved greatly if
the data can be represented in an efficient and meaningful way. Probability representations
and fuzzy set representations of data are well developed, and so they have almost reached the limits
of their performance improvements. I am interested in two other interesting representations
of data, random graphs and semigroups, which are underdeveloped at the current time.
1 The Viewpoint of Random Graphs
The task of machine learning is to learn knowledge from data. Since data can be represented as a random graph, my past work provides a random graph perspective on the field of machine
learning. In my Ph.D. thesis [4], we establish three machine learning modeling frameworks on random graphs: (1) Heat Diffusion Models on Random Graphs, (2)
Predictive Random Graph Ranking, and (3) Random Graph Dependency.
1.1 Heat Diffusion Models on Random Graphs
The heat diffusion models on random graphs lead to Graph-based Heat Diffusion Classifiers
(G-HDC) and a novel ranking algorithm on Web pages called DiffusionRank. For G-HDC,
a random graph is constructed on the data points by one of several methods: the K nearest neighbor
(KNN) graph, a graph (SKNN) built from the shortest edges, whose number equals that of the
KNN graph, and the Volume-based KNN graph. These three graphs can be considered as
representations of the underlying geometry, and the heat diffusion model on them can be
considered as an approximation to the way heat flows on a geometric structure. The
KNN graph, the SKNN graph, and the Volume-based KNN graph lead to three classifiers
when applied within G-HDC, namely KNN-HDC, SKNN-HDC, and VHDC. Experiments show
that G-HDC achieves better classification accuracy on some benchmark datasets. For
DiffusionRank, we show theoretically that it is a generalization of PageRank when the heat
diffusion coefficient tends to infinity, and we show empirically that it resists
manipulation when the heat diffusion coefficient is set to a finite value.
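The following is a minimal sketch of the heat diffusion idea on a KNN graph, assuming heat spreads according to the graph heat equation with the combinatorial Laplacian and that each class injects unit heat at its labeled points; the function name, the parameter beta, and this particular initialization are illustrative and not the exact formulation of G-HDC in [4].

import numpy as np
from scipy.linalg import expm
from scipy.sparse.csgraph import laplacian
from sklearn.neighbors import kneighbors_graph

def knn_heat_diffusion_classify(X_train, y_train, X_test, k=5, beta=1.0):
    """Classify by diffusing heat from each class on a KNN graph (illustrative)."""
    X = np.vstack([X_train, X_test])
    # Symmetrized KNN adjacency over all points.
    W = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    W = 0.5 * (W + W.T).toarray()
    L = laplacian(W)                      # combinatorial graph Laplacian
    H = expm(-beta * L)                   # heat kernel e^{-beta L}
    classes = np.unique(y_train)
    n_train = len(y_train)
    heat = np.zeros((len(X_test), len(classes)))
    for j, c in enumerate(classes):
        f0 = np.zeros(len(X))             # unit heat on labeled points of class c
        f0[:n_train][y_train == c] = 1.0
        heat[:, j] = (H @ f0)[n_train:]   # heat received by the unlabeled points
    return classes[np.argmax(heat, axis=1)]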
1.2 Predictive Random Graph Ranking
Predictive Random Graph Ranking (PRGR) incorporates DiffusionRank. PRGR aims to
address the problem that incomplete information about the Web structure causes inaccurate results for various ranking algorithms. More specifically, we generate a random graph
based on the known information about the Web structure. This random graph can be considered as the predicted Web structure, on which ranking algorithms are expected to become
more accurate. For this purpose, we extend several existing ranking algorithms from a
static graph to a random graph. Experimental results show that the PRGR framework can
improve the accuracy of ranking algorithms such as PageRank and Common Neighbor.
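A minimal sketch of ranking on a random graph, assuming the predicted Web structure is supplied as a matrix of edge probabilities and that PageRank is run on the expected transition matrix; the function name and this particular extension are illustrative rather than the exact PRGR formulation in [4].

import numpy as np

def pagerank_on_random_graph(edge_prob, d=0.85, tol=1e-10, max_iter=200):
    """PageRank on a random graph given by edge probabilities.

    edge_prob[i, j] = probability that page i links to page j.
    Illustrative: the expected out-degree replaces the usual out-degree.
    """
    n = edge_prob.shape[0]
    out = edge_prob.sum(axis=1, keepdims=True)
    out[out == 0] = 1.0                       # dangling pages: avoid division by zero
    T = edge_prob / out                       # expected transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (T.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r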
1.3 Random Graph Dependency
Random Graph Dependency Γ_α^ε(RG2|RG1) is a novel general measure on two random graphs
RG1 and RG2. In the special case when RG1 and RG2 are equivalence relations R1 and R2,
ε = 0, and α = −1, Γ_α^ε(RG2|RG1) becomes Γ(R1, R2), which is first defined in terms of
equivalence relations, then interpreted in terms of minimal rules, and further described by
an algorithm for its computation. Experimental results show that the new C4.5
using Γ(R1, R2) is much faster than the original C4.5R8 using the conditional entropy, while the prediction accuracy and tree size of the new C4.5 are comparable
with those of the old one. Moreover, Γ(R1, R2) achieves better results on attribute selection than
the γ measure used in Rough Set Theory. In the case when ε = 0 and α = 1, Γ_α^ε(RG2|RG1) becomes
H(RG2|RG1). H(RG2|RG1) generalizes the conditional entropy because it becomes equivalent to the conditional entropy when the random graphs take their special form, equivalence
relations. Our experimental study demonstrates that H(RG2|RG1) is an informative measure, showing its success in decision trees on small sample size problems. In the case when
ε = 0 and α = −1, Γ_α^ε(RG2|RG1) can help to search the parameters K and β in KNN-HDC
faster and more accurately than cross-validation.
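As a point of reference for the equivalence-relation special case, the sketch below computes the ordinary conditional entropy H(R2|R1) of the partitions induced by two attributes; H(RG2|RG1) reduces to this quantity when the random graphs are equivalence relations, while its general definition on random graphs is the one given in [4, 5].

import numpy as np
from collections import Counter

def conditional_entropy(a1, a2):
    """H(R2 | R1) for the partitions (equivalence relations) induced by
    attribute values a1 and a2 on the same set of examples."""
    n = len(a1)
    joint = Counter(zip(a1, a2))       # counts of (v1, v2) value pairs
    marginal = Counter(a1)             # counts of v1 values
    h = 0.0
    for (v1, v2), c in joint.items():
        p12 = c / n                    # joint probability P(v1, v2)
        p2_given_1 = c / marginal[v1]  # conditional probability P(v2 | v1)
        h -= p12 * np.log2(p2_given_1)
    return h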
In summary, the viewpoint of random graphs indeed provides an opportunity to
improve some existing machine learning algorithms. While I will continue to deepen and
widen the proposed machine learning algorithms from the viewpoint of random graphs, below I list some
possible directions for future efforts from the viewpoint of semigroup theory on machine learning.
2 The Viewpoint of Semigroup Theory
Since data can be acted on by a semigroup, and some descriptions of data, such as
kernels, can be affected by transformations, we can take a semigroup-theoretic viewpoint
on machine learning.
First of all, we describe the difference between a group and a semigroup. A semigroup S
is a set endowed with an associative operation. The concept of a semigroup is more general
than that of a group. Note that in a group each element g has an inverse g^{-1} such that
g^{-1} * g = e, where e is the unique identity element. A group must be a semigroup, but a
semigroup is not necessarily a group. In a group, the element g can be recovered from the product g * h when
h is given, and so elements of a group can be intuitively understood as "hard" things, while
elements of a semigroup may be "soft", and the product of two elements in a semigroup may
"collapse".
Next, we discuss some possible applications of semigroup theory.
2.1 Kernel Methods on Semigroups
2.1.1 Representing Data as Semigroups
In [1], data can be represented as semigroups when we are faced with multinomials, histograms,
and bags of components (such as words, n-grams, etc.). On such kinds of data, k(s, t) =
ϕ(s + t) is a positive definite kernel for some functions ϕ : S → R. It is observed that, when
S = Σ_n^+, two simple quantifications do actually work:
1. using the trace (perimeter) of Σ, i.e., ϕ : Σ → e^{-β tr Σ};
2. using the regularized determinant (volume) of Σ, i.e., ϕ : Σ → 1 / det((1/η)Σ + I).
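The two quantifications above translate directly into code. The sketch below assumes the elements of S are positive semidefinite matrices combined by addition, with user-chosen parameters beta and eta; it simply evaluates the two functions ϕ described in [1], and the function names are illustrative.

import numpy as np

def trace_kernel(S1, S2, beta=1.0):
    """k(S1, S2) = phi(S1 + S2) with phi(S) = exp(-beta * tr(S))."""
    return np.exp(-beta * np.trace(S1 + S2))

def inverse_det_kernel(S1, S2, eta=1.0):
    """k(S1, S2) = phi(S1 + S2) with phi(S) = 1 / det(S / eta + I)."""
    S = S1 + S2
    return 1.0 / np.linalg.det(S / eta + np.eye(S.shape[0]))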
I am interested in the unsolved problem in [1]:
When would ϕ(x + y) be a p.d. kernel?
Moreover, note that the semigroups mentioned in [1] are commutative. The problem becomes even more difficult if the semigroups are not commutative; for example, people may
feel differently if they are shown two pictures in different orders. For such a noncommutative case, if the semigroup is completely regular, then the semigroup kernel can be defined
as K(x, y) = ϕ(xy^{-1}), where y^{-1} is the local group inverse of y.
2.1.2 Semigroups of Transformations Acting on Data
We say that a semigroup S acts on a set X if for any s ∈ S there is a mapping T_s : X → X
such that if s_1 s_2 = s_3, then T_{s_1}(T_{s_2}(x)) = T_{s_3}(x) for any x ∈ X.
I am interested in defining a semigroup S-invariant kernel K on X such that K(x, y) =
K(s_1 x, s_2 y) for any s_1, s_2 ∈ S and x, y ∈ X. This definition is meaningful if we consider the
similarity of two images: if two images are similar, then the appropriately transformed ones
should be similar too, for example, if one image is rotated while the other is mapped into
a 2-dimensional space.
ISOMAP, LLE, and EigenMap can map an image to a 2-dimensional space. Here we
want to find a similarity description between an original image and its transformed version.
2.2 Semigroup Invariant Features
In [2], a complete set of rotationally and translationally invariant features for images is proposed, and impressive classification accuracy is achieved. Although the cubic invariants in
Eq. (13) in [2] can determine the original image modulo translations and rotations, I think
it is unnecessary to extract L^3 features for classification tasks. A possible approach that
can reduce the number of features while keeping invariance is to develop semigroup-invariant
features. A rough description of this idea is given below.
1. Unlike Eq. (7) in [2], we do not map the images onto the unit sphere.
2. In [2], the group SO(3) = {A | A'A = I} acts on the unit sphere. In our approach,
there are two possible definitions of the semigroup S.
(a) If S acts on each pixel in an image, then the definition is given by
    S = < {A ∈ M_{2×2} | A = AA'A and A'A = AA'} >,    (2.1)
i.e., S is the semigroup generated by the elements of {A | A = AA'A and A'A = AA'}.
(b) If S acts on the whole image, then the definition can be given by
    S = < {A ∈ M_{n×n} | A = AA'A and A'A = AA'} >,    (2.2)
where n is the number of pixels in an image, which is considered as a vector.
In each of the above two definitions, S contains some transformations that collapse.
For example, in the first definition, the matrix diag(1, 0) belongs to S (a small numerical
check of these conditions is sketched after this list). Intuitively, S contains local group
transformations, and so the S-invariant features are both globally invariant and
locally invariant; consequently, the number of S-invariant features is expected
to be reduced greatly. To implement this idea, we need more effort to extend the
methods in [2] from group theory to semigroup theory, a field with which I am familiar.
3. It is well known that a completely regular semigroup is a union of groups, in which each
element x lies in the local group H_x, where H_x is the H-class containing x. For various
varieties of completely regular semigroups, refer to [3]. We plan to implement the
above idea within completely regular semigroups.
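The numerical check referenced in item 2 above: the sketch below tests whether a matrix A satisfies A = AA'A and A'A = AA', the conditions in Eqs. (2.1) and (2.2). Both a rotation and the collapsing matrix diag(1, 0) pass the test, illustrating that the generating set mixes invertible and non-invertible transformations; the helper name is purely illustrative.

import numpy as np

def in_generating_set(A, tol=1e-10):
    """Check A = A A' A and A'A = A A', the conditions in Eqs. (2.1)/(2.2)."""
    c1 = np.allclose(A, A @ A.T @ A, atol=tol)
    c2 = np.allclose(A.T @ A, A @ A.T, atol=tol)
    return c1 and c2

theta = 0.3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
collapse = np.array([[1.0, 0.0],
                     [0.0, 0.0]])
print(in_generating_set(rotation))   # True: orthogonal matrices satisfy both conditions
print(in_generating_set(collapse))   # True: the collapsing projection also satisfies them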
2.3 Semigroup Invariant Machine Learning Algorithms
For example, the linear SVM is an SO(n)-invariant classifier, where n is the number of
features. This is because SVM is a geometrically intuitive classifier: a rotation
of the space results in the decision hyperplane being rotated in the same way. The nonlinear SVM is also an SO(n)-invariant classifier if the kernel is SO(n)-invariant. Generally speaking, if the kernel can be written in the form K(x, y) = f(x'y, x'x, y'y),
which is true for RBF kernels and polynomial kernels, then the corresponding SVM is
SO(n)-invariant. By the above discussion, we can see that SVM is SO(n)-invariant in
most cases. For semigroups, the situation is different: in Eq. (2.1) or (2.2), for s_1 ∈ S,
K(s_1 x, s_1 y) = f((s_1 x)'(s_1 y), (s_1 x)'(s_1 x), (s_1 y)'(s_1 y)) may not equal K(x, y). Problems
that are almost trivial for groups become interesting for semigroups.
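A quick numerical illustration of this difference, assuming an RBF kernel as one instance of K(x, y) = f(x'y, x'x, y'y): a random orthogonal transformation leaves the kernel value unchanged, while a collapsing semigroup element of the form diag(1, 0, ..., 0) generally does not.

import numpy as np

def rbf(x, y, gamma=0.5):
    # K(x, y) = f(x'y, x'x, y'y) since ||x - y||^2 = x'x - 2 x'y + y'y
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
n = 5
x, y = rng.normal(size=n), rng.normal(size=n)

Q, _ = np.linalg.qr(rng.normal(size=(n, n)))     # a random orthogonal matrix
P = np.diag([1.0] + [0.0] * (n - 1))             # a collapsing semigroup element

print(np.isclose(rbf(x, y), rbf(Q @ x, Q @ y)))  # True: invariance under orthogonal maps
print(np.isclose(rbf(x, y), rbf(P @ x, P @ y)))  # generally False: no invariance under P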
2.4 Semigroup Invariant Regularizations
In [6], we aim to provide a new viewpoint on supervised and semi-supervised learning problems. From such a viewpoint we can give a general condition under which constant functions (or, more generally, any function) should not be penalized. The basic idea is that, if a
learning algorithm can learn an inductive function f(x) from examples generated by a joint
probability distribution P on X × R, then the learned function f(x) and the marginal P_X
represent a new distribution on X × R, from which there is a re-learned function r(x).
The re-learned function should be consistent with the learned function in the sense that
the expected difference on the distribution P_X is small. Fortunately, the re-learned function
r(x) can be expressed in an integral form denoted as L_K(f). The proposed method is then
formulated, for supervised learning problems, as
formulated as
l
1X
min
(f (xi ) − yi )2 + γ||f − LK (f )||2K .
(2.3)
f ∈HK l
i=1
A corresponding Representer Theorem is found for this
problem. For this consistency method, I have some important notes:
• If K is a heat kernel, then the solution to Eq. (2.3) can be found by solving a linear
system whose coefficient matrix has a closed form.
• r(x) can be interpreted as the smoothed part of f(x), and the difference r(x) − f(x)
represents the intrinsically uneven part of f.
• Under some conditions, f − L_K(f) ≈ Lf, where L is the Laplace-Beltrami operator.
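The following is a minimal sketch of solving Eq. (2.3) by a linear system, under two assumptions that are mine rather than those of [6]: the minimizer is expanded over the training points as f = Σ_j α_j K(·, x_j), and L_K acts on the sample as the row-normalized kernel matrix S, so that L_K(f) has coefficients Sα. Under these assumptions the objective is quadratic in α and the coefficient matrix is in closed form.

import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_consistency_regularized(X, y, gamma_reg=0.1, kernel_gamma=0.5):
    """Sketch of Eq. (2.3): min (1/l) sum (f(x_i) - y_i)^2 + gamma ||f - L_K(f)||_K^2.

    Assumptions (not from [6]): f = sum_j alpha_j K(., x_j), and L_K(f) has
    coefficients S @ alpha, where S is the row-normalized kernel matrix acting
    as a smoothing operator on the sample.
    """
    l = len(y)
    K = rbf_kernel_matrix(X, kernel_gamma)
    S = K / K.sum(axis=1, keepdims=True)          # assumed sample-level form of L_K
    M = np.eye(l) - S
    A = K @ K / l + gamma_reg * (M.T @ K @ M)     # closed-form coefficient matrix
    b = K @ y / l
    alpha = np.linalg.solve(A, b)                 # solution of the linear system
    return alpha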
Following the above line, it would be even more exciting if we considered group or semigroup
transformations in the term Σ_{i=1}^{l} (f(x_i) − y_i)^2 or in the term ||f − L_K(f)||_K^2. Let me
say a little more about the term ||f − L_K(f)||_K^2. Define a new norm ||f − L_K(f)||_new^2 =
max_{s ∈ S} ||f^s − L_K(f^s)||_K^2, where f^s(x) = f(sx). The norm ||f − L_K(f)||_new includes
information about the transformations.
Note that even if we leave our consistency method aside, i.e., we consider only the
traditional regularization term ||f||_K^2, the above discussion remains valid.
2.5 Random Graph Dependency under Transformations
In [5], we propose a random graph dependency measure H(RG2|RG1) based on distance
information between two random graphs RG1 and RG2, which reduces to the traditional
measure when RG1 and RG2 reduce to equivalence relations. We then investigate some
basic properties of the measure, and perform an empirical study by replacing the conditional entropy
in the C4.5 algorithm with the novel random graph dependency measure. Experimental
results show that employing the random graph dependency measure significantly improves
the prediction accuracy of C4.5 on some benchmark datasets. In summary, the random
graph dependency measure successfully generalizes the conditional entropy, at least in its
application to C4.5 decision trees.
A possible new direction is to consider H(RG2|RG1) under semigroup transformations.
The two random graphs RG1 and RG2 are generated by data points; if
these data points are transformed, the random graphs will change accordingly, and so
will H(RG2|RG1). The average or the minimum of the dependency over such transformations
may be more robust to outliers.
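A minimal sketch of this proposal, assuming a black-box function dependency(X1, X2) (a hypothetical name) that builds RG1 and RG2 from two views of the data points and returns H(RG2|RG1), and a finite sample of matrices drawn from the semigroup S; the robust variants simply aggregate the dependency over the transformed data.

import numpy as np

def robust_dependency(dependency, X1, X2, transforms, aggregate=min):
    """Aggregate (e.g., min or np.mean) of the dependency over semigroup transformations.

    `dependency` is a placeholder for H(RG2 | RG1) computed from point sets;
    `transforms` is a finite sample of matrices from the semigroup S.
    """
    values = [dependency(X1 @ T.T, X2 @ T.T) for T in transforms]
    return aggregate(values)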
In summary, the viewpoint of semigroup theory on machine learning is
an exciting and challenging direction. If the technical difficulties can be overcome, fruitful
research outcomes can be expected.
References
[1] Marco Cuturi, Kenji Fukumizu, and Jean-Philippe Vert. Semigroup kernels on measures.
Journal of Machine Learning Research, 6:1169–1198, 2005.
[2] Risi Kondor. A complete set of rotationally and translationally invariant features for
images. arXiv:cs/0701127v2, 2007.
[3] Haixuan Yang. Two operators on the lattice of completely regular semigroup varieties.
Semigroup Forum, 64:397–424, 2002.
[4] Haixuan Yang. Machine Learning Models on Random Graphs. PhD thesis, The Chinese
University of Hong Kong, 2007.
[5] Haixuan Yang, Irwin King, and Michael R. Lyu. A novel random graph dependency
measure for accuracy improvement in C4.5 decision trees. In preparation, 2007.
[6] Haixuan Yang, Michael R. Lyu, and Irwin King. Learning with consistency between
inductive functions and kernels. In preparation, 2007.