Computer Science
Aston University
Birmingham B4 7ET
http://www.cs.aston.ac.uk/
Geographic Information Systems
CS3210
Dan Cornford
September 27, 2005
1 Course Outline
We live in a world that is increasingly dominated by Information Technology, where databases are continually
being created and expanded. However, data ≠ information, so as computer scientists one of our roles may
be thought of as the conversion (and facilitation of the conversion process) of data to information. Since we
live in a four-dimensional world (space-time) it is necessary to consider the spatial and temporal aspects of the
data. Until recently the techniques required to perform this type of spatial and temporal analysis were not
readily available. With the advent of faster, more powerful computers with increased storage capacity,
the analysis of spatial data has become a reality. Geographic Information Systems (GIS) provide the tools to
process this spatially referenced data. This course covers the theory underpinning GIS, their application,
and the more recent development of web based GIS and its object oriented underpinnings.
The aim of the module is to provide a balance between the underlying theory required to implement a working
GIS system – the data structures and algorithms, and the applications of this technology to address real world
problems – the information systems aspect.
Contents
1 Course Outline
    1.1 What is a GIS
    1.2 Course philosophy
    1.3 Course outline
2 Functionality and Applications of GIS
    2.1 Applications of GIS
3 Spatial Data
    3.1 Vector spatial data
        3.1.1 Lines
        3.1.2 Polygons
    3.2 Transformations
    3.3 Set theory
        3.3.1 Set relations
        3.3.2 Convexity
    3.4 Topological spaces
        3.4.1 Point-set topology
        3.4.2 Combinatorial topology
    3.5 Network spaces
    3.6 Metric spaces
    3.7 Raster spatial data
    3.8 References
4 Database Concepts
5 Models of Spatial Information
    5.1 Models, domains and morphisms
    5.2 Object based models
    5.3 Implementing object based models
    5.4 Field based models
6 Representation and Algorithms
    6.1 Special features of spatial data
    6.2 Spatial objects
        6.2.1 Representing spatial structure
        6.2.2 Geometric algorithms for vector data
    6.3 Network spaces
        6.3.1 Finding the shortest route from A to B
    6.4 Spatial fields
        6.4.1 Regular tessellation representations
        6.4.2 Irregular tessellation representations
        6.4.3 Delaunay Triangulation
7 The Internet and GIS
    7.1 XML, HTML and the future of the Internet
8 XML – eXtensible Mark-up Language
    8.1 SVG – Scalable Vector Graphics
9 GML – Geography Mark-up Language
    9.1 OGC – the Open GIS Consortium
        9.1.1 Interoperability
        9.1.2 OpenGIS Specifications
    9.2 GML
        9.2.1 The composition of GML
        9.2.2 GML and geometry
        9.2.3 GML and features
        9.2.4 GML in use
    9.3 GML (3.x)
        9.3.1 Features
        9.3.2 Geometry
        9.3.3 Coordinate reference systems
        9.3.4 Topology
        9.3.5 Temporal data and dynamics
        9.3.6 Definitions and dictionaries
        9.3.7 Units, measures and values
        9.3.8 Observations
        9.3.9 Coverages
        9.3.10 Styling
    9.4 Application schema
    9.5 Associated technologies
        9.5.1 XMML
        9.5.2 SensorML
    9.6 Summary
10 Structures and Access Methods
    10.1 Standard data structures and access methods
        10.1.1 Unordered files
        10.1.2 Ordered files
        10.1.3 Hash files
        10.1.4 Index files
        10.1.5 B-trees
    10.2 From 1D → 2D
    10.3 Structures for point data
        10.3.1 Grid file
        10.3.2 Point quad-tree
        10.3.3 Point 2D-tree
    10.4 Lines and Areas
    10.5 Structures for Raster Data
11 Stages of a GIS Application
12 Sources of Raster Spatial Data
    12.1 Sampling: interpolation
        12.1.1 Nearest neighbour interpolation
        12.1.2 Piecewise linear interpolation
        12.1.3 Inverse distance interpolation
        12.1.4 Geostatistical interpolation
        12.1.5 Which interpolation algorithm?
    12.2 Remote Sensing
        12.2.1 The electro-magnetic spectrum
        12.2.2 The platform
        12.2.3 Some satellite systems
        12.2.4 Applications
        12.2.5 Into the future
    12.3 Scanning
    12.4 Conversion between fields and objects
13 Sources of Vector Spatial Data
14 Errors in Spatial Data
15 Coordinate Systems and Map Projections
    15.1 Planar projections
    15.2 Cylindrical projections
    15.3 Conic projections
    15.4 Which projection to use?
16 The Future of GIS?
    16.1 Role in business
    16.2 Data models
    16.3 Data
    16.4 Models
    16.5 GML and OpenGIS
    16.6 Summary
1.1 What is a GIS
A GIS is more than a spatially referenced database. The database is indeed the core of the GIS system but
the nature of spatial (or geographic) information means that care has to be taken when designing such a
database. In particular the operations which one might like to apply to spatial data are often very different
to those traditionally applied to non-spatial data. Thus a GIS must include tools for the exploration, analysis
and summarising of spatial data. One of the key methods for exploration and summarising data is graphical
representation. In a GIS this will usually take the form of maps, and this is the third key (summarising)
requirement.
So a GIS provides:
➢ a core database system;
➢ spatial data analysis capabilities;
➢ and map display capabilities.
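To make the three components a little more concrete, the following is a minimal sketch in Python, not a description of how any real GIS is built. The records, the field layout and the function names (`attribute_query`, `within_distance`, `text_map`) are all hypothetical and chosen only for illustration. It contrasts a conventional attribute query with a query that depends on location, and renders a crude character "map".

```python
import math

# Hypothetical records: (name, population, x_km, y_km) on a local planar grid.
PLACES = [
    ("Alpha", 12000, 0.0, 0.0),
    ("Beta", 800, 3.0, 4.0),
    ("Gamma", 45000, 10.0, 0.0),
]


def attribute_query(places, min_population):
    """Non-spatial query: filter on a stored attribute value."""
    return [name for name, pop, _, _ in places if pop >= min_population]


def within_distance(places, x, y, radius_km):
    """Spatial query: filter on a distance computed from the coordinates."""
    return [name for name, _, px, py in places
            if math.hypot(px - x, py - y) <= radius_km]


def text_map(places, width=11, height=5):
    """Crude map display: mark each place by its initial on a character grid."""
    grid = [["." for _ in range(width)] for _ in range(height)]
    for name, _, px, py in places:
        grid[int(py)][int(px)] = name[0]
    # Reverse the rows so that y increases up the page, as on a map.
    return "\n".join("".join(row) for row in reversed(grid))


print(attribute_query(PLACES, 10000))      # ['Alpha', 'Gamma']
print(within_distance(PLACES, 0, 0, 5.0))  # ['Alpha', 'Beta']
print(text_map(PLACES))
```

The attribute query could be answered by any relational database, whereas the distance query has to compute something from the geometry, which is the kind of operation the text describes as different from those applied to non-spatial data.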
1.2 Course philosophy
This course combines a theoretical explanation of the tools used in a GIS with a broad overview of the
applications of GIS. It is left to the individual student to ensure they have sufficient knowledge of standard
relational database systems (i.e. it is assumed you have this already, although such knowledge is not heavily
used during the course). The first part of the course develops the theory necessary to understand GIS and is
strongly based on Worboys (1995), while the second part focuses more on the applications and surrounding
issues, including object and web based GIS. However, to make the presentation of the material more natural,
this division is not always reflected in the timing of the lectures. These notes are not designed
to be followed in the lectures; rather, they complement the lecture material and ideally should be read off-line
prior to the lectures.
1.3 Course outline
This course will cover the following subjects in greater or lesser depth.
➢ Functionality and applications of GIS.
➢ Characteristics of, and models for, spatial data.
➢ Representations and algorithms for spatial data.
➢ Structures and access methods for spatial data.
➢ Sources of raster spatial data, interpolation and introduction to remote sensing.
➢ Sources of vector spatial data.
➢ Errors in spatial data and their treatment in GIS.
➢ Coordinate systems and map projections (very lightly).
➢ Examples of the application of GIS to real problems.
➢ Object based GIS.
➢ Web based mapping: XML, GML and the future of GIS.
Some of the areas necessarily require a more mathematical exposition than others. I will be covering (at a high
level) some pretty tricky statistics, since these are integral to the analysis of spatial data (and thus GIS). In
general the core of the course is as presented in (Worboys, 1995). There is now a second edition of the book
(Worboys and Duckham, 2004) which is better presented than the original and this book will form important
reading for the course.
In addition to the lectured components there will be a series of lab classes where the public domain, Unix based
GRASS GIS system will be introduced and used in an example application, which will form part of the course
assessment. The assessed work will revolve around you critically examining the role and requirements of a GIS
used in a piece of research or applied work related to the assignment you choose to undertake. More of this later
....
2 Functionality and Applications of GIS
To some extent we have already tackled this problem in the preceding sections. A GIS has been defined in
Worboys (1995) as:
    A GIS is a computer-based information system that enables capture, modelling, manipulating, retrieval,
    analysis and presentation of geographically referenced data.
or from a more business oriented viewpoint in Environmental Systems Research Institute, Inc. (1993) as:
    An organised collection of computer hardware, software, geographic data, and personnel designed to
    efficiently capture, store, update, manipulate, analyse, and display all forms of geographically referenced
    information.
As can be seen, these are rather broad definitions, and thus most databases, although often not designed as
such, could be considered a GIS. In its earlier incarnations in the 1980s a GIS was really nothing more than
a spatially referenced database (Rogerson and Fotheringham, 1994):
    Geographic information systems were initially developed as tools for the storage, retrieval and display
    of geographic information. Capabilities for the geographic analysis of spatial data were either poor or
    lacking in these early systems.
However, in the late 1980s and 1990s the increase in computer power facilitated a considerable increase in
the analytical capabilities of GIS. One of the significant problems in the analysis of spatial data is the size of
the datasets, and this requires special consideration when designing the database and analysis tools, as will be
shown later. At the minimum a GIS must have the ability to:
➢ capture,
➢ manipulate,
➢ store and retrieve,
➢ analyse,
➢ and display,
geographically (or spatially¹) referenced data. This provides us with our basic definition of a GIS.
2.1 Applications of GIS
This section briefly introduces the application areas of GIS. Some of the largest users (both in terms of the size
of the databases and the amount of money spent on them) are utility companies and local authorities. Utility
companies use GIS extensively to maintain a database of their pipes, cables, drains etc. while local authorities
might use GIS in planning, transportation and environmental services (the level of use varying widely across
different authorities). Other users of GIS include international bodies, such as the United Nations Food and
Agriculture Organisation, national and governmental bodies such as the Census Office and a huge variety of
business organisations, from big corporate organisations to the large supermarkets and down to some high-tech
farms. Applications can be broadly divided into the following categories:
➢ demographic analysis;
➢ environmental management;
➢ facilities management;
➢ network analysis;
➢ and land management.
These divisions are not rigid or exclusive and many GIS applications cross boundaries.
Demographics is the study of human populations, and is key to many business related activities such
as catchment analysis (in particular for retail developments), public planning (e.g. transport, health, social
services) and communications technology. In such activities the aim is to find the likely source regions for
visitors to a facility (or shop) and, through their socio-economic backgrounds, their impact (i.e. profit!) on the
facility (shop). This type of GIS study typically uses census (e.g. population data) and infrastructure (e.g.
roads and transport links) data.
There are a huge number of environmental applications of GIS. These range from pollution monitoring and
assessment to wildlife habitat assessment. Most projects in this area are less commercial and often use large
amounts of remotely sensed data. A growing area is that of environmental impact assessment, which very
often requires that the geographic context be taken into account. High precision farming, whereby much of
the management of a farm is centrally controlled using a GIS to manage fertiliser application or ploughing, is
another developing area.
Facilities management is to do with the maintenance and planning of efficient utility networks such as gas
pipes and fibre-optic networks. GIS might also be used in studies of energy efficiency or load balancing of
electrical networks. This typically uses the organisation's own internal data, together with other infrastructure
and possibly demographic data.
Network analysis is concerned with the efficient design of transport systems or delivery routes and is typically
used by courier companies. Although this falls within the realms of GIS, specialist packages are often used due
to the complexity of the analysis required, with the GIS being used as a spatially referenced database, although
this is changing as GIS become more capable. This requires infrastructure data, and often also demographic
data.
Land management concerns planning and development. This overlaps with both demographic analysis and
environmental management, but is kept separate because planning departments may have differing GIS
requirements, such as requiring a temporal aspect to the data held. Huge amounts of data are held, largely by local
authorities, on the ownership of land. Most of these are now computerised to some extent; however, the data is
generally not publicly available on a large scale.
It should be clear from this brief survey that GIS has a very wide range of applications, from the hardest (profit
oriented) business decisions to the softest environmental applications. Later in the course we will explore in
a little more detail some specific applications, largely by reading relevant papers which describe typical GIS
projects.
¹ A GIS is usually designed to store spatially referenced data, which is on geographic scales of centimetres to thousands of
kilometres. The term ‘Spatial Information System’ is sometimes used to describe systems for storing the spatial ordering of protein
chains, and other (typically very small scale) non-geographical data.
3 Spatial Data
Spatial data generally has a different character from the data usually considered when designing database technology.
The key differences relate to the fact that the simple tabular² structures used in standard relational databases
are often not appropriate for spatial data, particularly its analysis and querying. This section provides the basic
concepts necessary to model spatial data structures, which is the topic of the next section.
The concept of spatial data has many interpretations, and each one of us has our own unique perspective on what
is meant by space. For most, the immediate response is based on our common experience of a three dimensional
(plus time) world. For GIS this is a pretty good starting point. However this experiential definition is not
objective, and to proceed we require a concrete, formal definition. A good definition might be ‘a framework in
which things exist’. There are, however, two possible views of space:
➢ absolute space - space which exists independently of any objects being present;
➢ relative space - space which exists by virtue of the presence of objects.
This philosophical division has implications for the design and construction of GIS.
If one considers typical spatial data structures, such as points, lines, polygons and networks, it is clear that
these objects will not be readily incorporated in the table structure embraced by relational databases. This is
largely related to the standard requirement that items in the tables be atomic. To store a line, a vector of point
coordinates is required. In the next sections we will deal in greater depth with models of spatial information.
2 When text appears in sans serif format this means there is a definition of the term in the database concepts section.
Two fundamentally different methods may be used to represent spatial data. The two approaches can be
considered complementary rather than competitive, with the vector (or geometric) approach being most appropriate
for spatial information that is associated with discrete objects, while the raster (or field) approach is most
suitable for continuous spatial processes. These differences in the two methods can be seen in Figure 3.1.
3.1 Vector spatial data
Geometry, the study of the properties and relations of constructible [plane] figures3 is key to the understanding
of (relative) space. For a GIS the space is always geographic (by definition) and thus Euclidean 2-space4 (to
represent the surface of the Earth) is the most common framework to adopt. In later lectures the issue of three
dimensional and spatio-temporal GIS and projective coordinate systems will be addressed. For now a local
Cartesian coordinate system will be adopted.
In a Cartesian coordinate system a point in space is represented in terms of its distance, measured along the
perpendicular x and y axes, from a fixed origin. Typically a point will be thought of as consisting of a vector
(x, y) which we often write simply as x (see Figure 3.2). The standard rules of vector spaces apply in Cartesian
coordinates:
➢ addition: $(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)$
➢ scalar multiplication: $k \cdot (x, y) = (kx, ky)$
➢ vector norm: $\|\mathbf{x}\| = \sqrt{x^2 + y^2}$
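The three rules above can be sketched in a few lines of Python. This is only an illustration: it assumes a point is stored as a plain (x, y) tuple, and the function names are our own rather than those of any GIS library.

```python
import math

def add(p, q):
    """Vector addition: (x1, y1) + (x2, y2) = (x1 + x2, y1 + y2)."""
    return (p[0] + q[0], p[1] + q[1])

def scalar_multiply(k, p):
    """Scalar multiplication: k . (x, y) = (kx, ky)."""
    return (k * p[0], k * p[1])

def norm(p):
    """Euclidean vector norm: sqrt(x^2 + y^2)."""
    return math.hypot(p[0], p[1])
```

For example, `norm((3, 4))` evaluates the norm of the vector (3, 4), which is 5.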
Figure 3.1: The difference between (a) raster and (b) vector representations of space.
Figure 3.2: Graphical representation of a 2D Euclidean space with Cartesian coordinates.
Vector based data structures are excellent for storing point, line and polygon objects, such as spot heights,
roads, rivers, parish boundaries and house footprints. Traditionally, vector based GIS have been used by
planning authorities, governmental agencies, business and in the social sciences.
Raster based data structures are excellent for storing data on continuous spatial processes such as temperature
fields, elevation or population density. Traditionally raster based GIS have been used in the physical and
environmental sciences.
As can be seen from Figure 3.1 the raster and vector data structures can both be used to represent any
spatial information, although not always very efficiently. Today most applications will be best served using a
combination of raster and vector representations. In future sections we investigate the methods we could use
to implement both representations and show some of the conversion methods for changing between the two
representations. The distinction between vector and raster data structures is very similar to the distinction
between vector and bitmap graphics.
Within this dichotomy, there exists a unifying notion that the spatial aspect of the data being studied is intrinsic
to the problem. We shall first consider the vector model of space.
The bearing, θ, of x2 from x1 is given by the solution (in the range 0° to 360°) of the simultaneous equations,
$$\sin(\theta) = \frac{x_2 - x_1}{|\mathbf{x}_1 \mathbf{x}_2|}, \qquad \cos(\theta) = \frac{y_2 - y_1}{|\mathbf{x}_1 \mathbf{x}_2|}.$$
Note that a bearing is always taken to refer to the clockwise
angle from the y-axis (or geographic / magnetic North).
These simple definitions prescribe everything we will need
to know about the coordinate space in which we are working.
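As a hedged sketch of how distance and bearing might be computed in practice, the pair of simultaneous equations for the bearing can be solved in one step with the two-argument arctangent. The function names and the (x, y) tuple representation are assumptions made for illustration; the sketch takes |x1x2| to be the Euclidean distance between the two points.

```python
import math

def distance(p1, p2):
    """Euclidean distance |x1 x2| between two (x, y) points."""
    return math.hypot(p2[0] - p1[0], p2[1] - p1[1])

def bearing(p1, p2):
    """Bearing of p2 from p1: clockwise angle from the y-axis, in degrees in [0, 360)."""
    dx = p2[0] - p1[0]
    dy = p2[1] - p1[1]
    # atan2(dx, dy) satisfies sin(theta) = dx / d and cos(theta) = dy / d together.
    # Note the argument order: dx first, because the angle is measured from the y-axis.
    return math.degrees(math.atan2(dx, dy)) % 360.0
```

A point due east of the origin then has a bearing of 90 degrees and a point due west a bearing of 270 degrees. The bearing between two coincident points is undefined; this sketch simply returns 0 in that case.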
Historically, there was a division between GIS systems based around these alternative representations of spatial
data, however most modern GIS systems encompass both approaches to some extent. There are advantages
and disadvantages to both approaches, some of which are apparent from Figure 3.1. It can be seen that these
methods for representing spatial data correspond quite closely to the absolute (raster) and relative (vector)
concepts of space.
For a given space it is the norm which defines the concept of distance. We can now define the distance between
two points, $|\mathbf{x}_1 \mathbf{x}_2|$, using our norm to be:
$$|\mathbf{x}_1 \mathbf{x}_2| = \|\mathbf{x}_1 - \mathbf{x}_2\| = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.$$
3.1.1 Lines
Defining points is relatively simple in any coordinate system. In a vector based GIS most objects are not
readily represented by points. Features such as rivers, communication cables and roads5 are most appropriately
represented by lines. Thus some simple 2D line geometry is required. Two points are sufficient to define a line:
given two points, x1 and x2 , the point-set of the line passing through both is given by {λx1 + (1 − λ)x2 } with
λ defined over all the real numbers. The line segment between x1 and x2 is given using the same definition
but this time λ is restricted to the range 0 to 1 inclusive. A more convenient description of a line is given by
its equation, ax + by − c = 0. This implicit form is not often used in a GIS, as we shall see later, but it
will occasionally be useful. In theory there is no reason to constrain ourselves to straight lines - we could use
polynomial curves, splines, ellipses, indeed any geometric construct to define objects, however the vast majority
of GIS use line segments alone to represent linear features, which has implications for accuracy.
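The two descriptions of a line given above can be illustrated with a short sketch. It assumes points are (x, y) tuples; the function names are hypothetical, and the choice of a, b and c is just one of many equivalent scalings of the same equation.

```python
def point_on_line(p1, p2, lam):
    """Return lam * p1 + (1 - lam) * p2.

    Any real lam gives a point on the infinite line through p1 and p2;
    restricting lam to [0, 1] gives a point on the line segment.
    """
    return (lam * p1[0] + (1 - lam) * p2[0],
            lam * p1[1] + (1 - lam) * p2[1])

def line_coefficients(p1, p2):
    """Return (a, b, c) such that a*x + b*y - c = 0 for every point on the line."""
    a = p2[1] - p1[1]
    b = p1[0] - p2[0]
    c = a * p1[0] + b * p1[1]
    return (a, b, c)
```

With λ = 0.5 the first function returns the midpoint of the segment, and substituting that midpoint into ax + by − c gives zero, so the two descriptions agree.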
3.1.2 Polygons
Lines are not sufficient to represent spatial objects such as houses, postcode regions or urban areas. Such objects
have area as well as length, and thus must be represented by polygons. A polygon (in 2D space) is defined as
the region of the plane enclosed by a finite set of line segments which meet at their end points (Figure 3.3).
There are several definitions which it will be useful to understand (Worboys, 1995):
3 A more mathematical definition might be the one given by Felix Klein, ‘the study of those properties of a set that remains
invariant when the elements of the set are subjected to the transformations of some transformation group’.
4 A finite-dimensional real vector space possessing a scalar product so that a distance may be imposed.
5 This is a scale and application dependent description. Roads and rivers in particular may be represented by polygons at certain
scales and for certain applications.
3.3 Set theory
Set theory might not at first seem directly relevant to GIS, however it is fundamental to much of the database
work you have seen in your earlier careers. We will build up the spatial data structures bit by bit, and probably
the simplest model is the set based model.
Figure 3.3: Several different polylines and polygons.
The key concepts of set theory are:
elements: the items in the set;
set: a group of elements;
membership: a relation between elements and sets.
These concepts can be used to express simple spatial relations, such as cities are contained within countries.
However they are not always useful (in the UK in particular) since the elements of sets may not be the same,
for instance post-codes and ward boundaries. Set theory was developed on a binary assumption: an element is
either in a set or not.
Figure 3.4: Simple transformations of polygons (original, translation, scaling, rotation).
polyline: a finite set of line segments (which are called edges) such that each end point is shared by
exactly two edges, except for a possible two extreme points.
edge: the line segment between two points.
extreme points: the end points of a polyline.
simple polyline: a polyline that has no intersecting edges.
closed polyline: a polyline that has no extreme points.
polygon: the region of the plane enclosed by a polyline.
simple closed polygon: the region of the plane enclosed by a simple closed polyline.
In general most objects can be represented by simple closed polygons, however these can be further classified
according to several properties. An important concept is that of a convex polygon. A convex polygon has all
internal angles less than 180°. For a complete definition of the different types of polygons the reader is referred to
Worboys (1995). Many of the concepts will be familiar to those who have taken a Computer Graphics module,
and a good starting point for more graphics oriented treatment of 2D geometry is Foley et al. (1993).
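One common way to test the internal angle condition, sketched here under stated assumptions, is to check that the polygon always turns the same way as we walk around its boundary. The sketch assumes the polygon is simple, that its vertices are given in order as (x, y) tuples, and that collinear vertices (an internal angle of exactly 180°) are tolerated rather than rejected; the function name is our own.

```python
def is_convex(vertices):
    """Return True if a simple polygon, given as ordered (x, y) vertices, is convex."""
    n = len(vertices)
    turn = 0  # +1 for anticlockwise turns, -1 for clockwise, 0 until the first turn is seen
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        x2, y2 = vertices[(i + 2) % n]
        # z-component of the cross product of two consecutive edge vectors
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross == 0:
            continue  # collinear vertices: no turn at this corner
        direction = 1 if cross > 0 else -1
        if turn == 0:
            turn = direction
        elif direction != turn:
            return False  # the boundary turns both ways, so some internal angle exceeds 180 degrees
    return True
```

A unit square passes this test in either vertex order, while an L-shaped polygon fails at its reflex corner.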
3.2 Transformations
An important concept in geometry is that of a transformation (Figure 3.4). This is a mapping of every point
in the plane to a corresponding point in the new plane. Later sections address the more complex issues of
coordinate projections, but for now only simple transformations will be addressed.
There are several different types of transformation that can be considered (Worboys, 1995):
congruent: shape and size preserving (e.g. translation);
similarity: shape preserving (e.g. scaling);
affine: collinearity preserving (i.e. preserves parallelism, e.g. rotation).
There are more possible transforms, but these will suffice for the present. The definitions above are not exclusive,
for instance a rotational transformation is affine and congruent.
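The three example transformations can be sketched as functions acting on a list of (x, y) vertices. The function names are illustrative assumptions, and for simplicity scaling and rotation are taken about the origin.

```python
import math

def translate(points, dx, dy):
    """Congruent: shift every point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]

def scale_about_origin(points, k):
    """Similarity: multiply every coordinate by the same factor k."""
    return [(k * x, k * y) for x, y in points]

def rotate_about_origin(points, degrees):
    """Congruent (and affine): anticlockwise rotation about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]
```

Translation and rotation leave the distance between any two vertices unchanged, whereas scaling by k multiplies every distance by |k| while keeping the shape.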
If we consider two sets S and T we can define many set operations:
S = T: every element in S is in T and every element in T is in S;
S ⊂ T: S is a subset of T;
#S: the cardinality of S, that is the number of elements in S;
S ∩ T: the intersection of S and T;
S ∪ T: the union of S and T;
S/T: the difference of S from T – elements in S but not in T;
S′: the complement of S – elements not in S but in the universal set;
∅: the empty set.
You will be familiar with these concepts, so this is really a review and an introduction to my notation. Note
that open sets do not include their boundary points, while closed sets do.
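These operations map directly onto Python's built-in set type, as the following sketch shows. The place names and the universal set U are made up for illustration; Python has no complement operator, so the complement is written as a difference from an assumed universal set.

```python
S = {"Leeds", "York", "Hull"}
T = {"York", "Hull", "Bath"}
U = S | T | {"Ely"}           # an assumed universal set

equal = (S == T)              # S = T
subset = ({"York"} <= S)      # subset test
cardinality = len(S)          # #S
intersection = S & T          # S ∩ T
union = S | T                 # S ∪ T
difference = S - T            # elements in S but not in T
complement = U - S            # elements of the universal set not in S
empty = set()                 # the empty set
```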
3.3.1 Set relations
In order to define set relations, we need to introduce the concept of a product set. A product set, written S × T,
is the set of all ordered pairs (s, t) with s an element of S and t an element of T. For instance if:
S = {Julie, Dave} , T = {Cat, Dog, Lizard} ,
then,
S × T = {(Julie, Cat), (Julie, Dog), (Julie, Lizard), (Dave, Cat), (Dave, Dog), (Dave, Lizard)} .
Now we can express the fact that Julie likes lizards and Dave likes cats and dogs using a binary relation between
S and T , that is a subset of their product set.
We can have several different forms of relations:
reflexive: the elements of the set are all related to themselves;
symmetric: x is related to y implies y is related to x;
transitive: x is related to y and y is related to z implies x is related to z.
Many of the relations between sets which have a spatial context are none of the above. For instance consider
the distance between major cities in the UK – the relation closest is clearly not reflexive (it doesn’t make sense)
and neither is it symmetric nor transitive.
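The product set, a binary relation and the three properties can be sketched as follows. The helper names are our own, and the three points A, B and C with positions along a line are hypothetical stand-ins for cities, chosen only to show that a 'closest' relation need have none of the three properties.

```python
from itertools import product

S = ["Julie", "Dave"]
T = ["Cat", "Dog", "Lizard"]
product_set = set(product(S, T))     # all ordered pairs (s, t)

# A binary relation between S and T is any subset of the product set.
likes = {("Julie", "Lizard"), ("Dave", "Cat"), ("Dave", "Dog")}

def is_reflexive(rel, elements):
    return all((x, x) in rel for x in elements)

def is_symmetric(rel):
    return all((y, x) in rel for (x, y) in rel)

def is_transitive(rel):
    return all((x, z) in rel for (x, y) in rel for (w, z) in rel if y == w)

# Hypothetical positions (in arbitrary units) of three places along a line.
positions = {"A": 0.0, "B": 1.0, "C": 3.0}
closest = {
    (p, min((q for q in positions if q != p),
            key=lambda q: abs(positions[q] - positions[p])))
    for p in positions
}
```

Here B is the closest place to C, but A (not C) is the closest place to B, so the relation is not symmetric; similar counterexamples rule out reflexivity and transitivity.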
Another concept we will use a great deal later on is that of a function, that is a mapping from one set S (the
domain) to another set T (the image). Functions are said to be injective if distinct elements in the domain are
mapped to distinct elements in the image. A good example is a projection from the sphere (S) onto the 2D plane (T),
which is injective. This means that an inverse function exists so that we can also do the reverse mapping.
Figure 3.5: A convex set, a semi-convex set and a non-convex set.
3.3.2 Convexity
Convexity can be defined for sets as well as polygons, although the definition makes most sense for sets defined
on the 2D plane (the extension to 3D is obvious). The set S is said to be convex if every element in S is visible from
every other element in S. An element is visible from another element in a set S if every point on the straight
line connecting the two points is also in S. S is semi-convex if there is at least one observation point in S.
An observation point is a point from which all others in the set can be seen. These definitions are shown in
Figure 3.5. It is clear that notions of convexity require some form of topology on the space within the set.
3.4.1 Point-set topology
Point-set topology is concerned with the definition of some
fairly abstract ideas of which we should be aware. We
cover this subject rather briefly, focusing on the relevant
parts for GIS. The key concepts in point-set topology are
those of a point x, which is one of a set of points S and
the neighbourhood of x (a subset of S), as illustrated in
Figure 3.6.
Neighbourhoods can either be open (excluding the boundary points) or closed (including the boundary points). The
closure of a region means the interior of the region and
its boundary points. The definitions may seem rather trivial, but when implementing a land management system the
question of the ownership of boundaries is vital.
Figure 3.6: The concepts important to understanding point sets.
Figure 3.7: The basic objects in a 2D Euclidean plane (arc, loop, cell or area, torus).
3.4 Topological spaces
Topology is the ‘study of form’. Topological properties are those which are invariant under a rubber sheet
transformation (called a topological transform or homeomorphism) which does not tear or fold the sheet. For
example the connectedness of two points cannot change unless we tear the sheet, whereas the distance between
two points will almost certainly change.
The following properties are topological:
➢ A point is an end point of an arc;
➢ an arc does not cross itself (simple arc);
➢ a point is on the boundary of a region;
➢ a point is in the interior of a region;
➢ a point is in the exterior of a region;
➢ an area is open;
➢ an area has no holes (simple area);
➢ an area is connected;
➢ a point is within a loop.
The following properties are non-topological:
➢ the distance between two points;
➢ the bearing of one point from another;
➢ the length of an arc;
➢ the perimeter of a cell;
➢ the area of a cell.
From these properties it should be clear that topology largely deals with the relations of sets with other sets,
in some space. The concept of a homeomorphism is also central to the concept of projection, thus topological
properties will be conserved under projections. We can also use homeomorphisms to produce abstracted versions
of spatial objects, such as is done to produce the London underground map.
Within point-set topology objects may be defined and ascribed certain properties. The most common objects
are shown in Figure 3.7. The properties of the objects are fairly clear from the figure.
Figure 3.8: A point-set X (left) and its regularisation reg(X) (right).
Another important concept is the regularisation of an object. This is the closure of the object’s interior point-set,
and is shown in Figure 3.8. The concept of regularisation is frequently used when cleaning up digitised data, or
imported data to ensure the polygons have the correct structure. It also gives us a method for testing whether
an object is a simple polygon, by comparing the object and its regularisation.
3.4.2 Combinatorial topology
Figure 3.9: Euler’s formula for a planar cell (f = 6, e = 20, v = 15).
A method often used in GIS to represent the objects of interest
is to model these as a combination of some base shape (called
a simplex ). This representation is particularly appropriate for
objects which exhibit similar characteristics over a range of scales
– that is fractal objects, which are to be represented in a vector
framework.
This is largely based around the concepts derived from Euler’s formula relating the number of vertices, edges and
faces of a given dimensional object. For instance for a polyhedron (a three dimensional shape with no holes) we have:
$$f - e + v = 2,$$
where f is the number of faces of the shape, e is the number of edges and v is the number of vertices. For
planar cells the equation is:
$$f - e + v = 1,$$
as illustrated in Figure 3.9. These formulae give us the Euler characteristics of a surface.
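Both formulae are easy to check numerically. In the sketch below the cube is our own choice of polyhedron (6 faces, 12 edges, 8 vertices), while the planar cell uses the counts quoted in Figure 3.9; the function name is illustrative.

```python
def euler_characteristic(faces, edges, vertices):
    """Return f - e + v."""
    return faces - edges + vertices

cube = euler_characteristic(faces=6, edges=12, vertices=8)           # polyhedron: 6 - 12 + 8
planar_cell = euler_characteristic(faces=6, edges=20, vertices=15)   # Figure 3.9: 6 - 20 + 15
```

The cube gives 2 and the planar cell gives 1, as the formulae require.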
cycle: a path in the graph from a node back to itself;
acyclic graph: a graph with no cycles (easily analysed);
tree: a connected acyclic graph, frequently used as a data structure in CS in general and GIS.
Planar graphs are graphs which can be represented as planar embeddings of the abstract graph with edges only
crossing at nodes. An example of a planar graph and two possible embeddings are shown in Figure 3.12. Note that
the two planar embeddings shown are not homeomorphic to each other, so there is some ambiguity as to which is
appropriate. We will return to some of these concepts as we need them. In particular we will explore network data
structures further later in the course.
Figure 3.12: A graph and two of its possible planar embeddings.
3.6
Figure 3.10: (a) Simplexes (0-simplex, 1-simplex, 2-simplex) and (b) the building up of complexes from a number of simplexes.
The fundamental building blocks for combinatorial topology are the so called simplexes. These are shown in
Figure 3.10(a). More complex shapes are then made up by combining a number of simplexes to produce a
complex , as shown in Figure 3.10(b). These shapes may also be directed. The most common application of this
form of model is in terrain modelling, where the elevation of the surface is represented using a Triangulated
Irregular Network (TIN).
3.5 Metric spaces
If we add a concept of distance to some of the rather abstract ideas we have been dealing with so far we get
somewhat closer to what we might consider to be a concept of space (Euclidean space). However distance is
not always what we might intuitively expect. If we define three points s, t and u which are in some set S, then
S is a metric space if:
➢ $d(s, t) > 0$ if $s \neq t$,
➢ $d(s, t) = d(t, s)$,
➢ $d(s, t) + d(t, u) \geq d(s, u)$,
where d() represents the distance function. Metric spaces have a natural topology defined by the concept of
distance. There are a number of possible ways in which we can define distance but in general we will stick to
the standard Euclidean space. Some alternatives include the Manhattan distance (computed along a grid rather
than as the crow flies) or time of travel as distance (although one has to take care with symmetry).
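To make the three conditions concrete, the sketch below defines the Euclidean and Manhattan distances and checks the conditions by brute force over a small sample of points. The function names and the tolerance are illustrative assumptions, and passing the check on a sample is evidence rather than a proof that a function is a metric.

```python
import math
from itertools import product

def euclidean(s, t):
    """Distance 'as the crow flies'."""
    return math.hypot(s[0] - t[0], s[1] - t[1])

def manhattan(s, t):
    """Distance computed along the axes of a grid."""
    return abs(s[0] - t[0]) + abs(s[1] - t[1])

def satisfies_metric_conditions(d, points, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality on a sample of points."""
    for s, t in product(points, repeat=2):
        if s != t and not d(s, t) > 0:
            return False                      # positivity fails
        if abs(d(s, t) - d(t, s)) > tol:
            return False                      # symmetry fails
    for s, t, u in product(points, repeat=3):
        if d(s, t) + d(t, u) < d(s, u) - tol:
            return False                      # triangle inequality fails
    return True
```

A travel-time 'distance' that differs between the outward and return journeys (a one-way system, say) would fail the symmetry check, which is the care referred to above.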
Network spaces
Network spaces are intimately linked to graph theory and are widely used to represent and compute information
about transport networks. They are derived from Euler’s analysis of the problem of crossing the bridges in
Koenigsberg park.
3.7
Raster spatial data structures are generally used to represent processes that are spatially continuous, such as the
spatial distribution of the precipitation amount or elevation. One appropriate model for this sort of information
is a random field model which we investigate later. There are a number of other models which we could use if
the data required them, which will also be discussed. The actual data structures used to store raster spatial
data are arrays, although other variants are used to improve the efficiency of the representation.
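A minimal sketch of the array representation is given below. It assumes square cells, a grid whose first row is the northernmost, and an origin at the lower-left corner of the raster; the elevation values, cell size and function name are all made up for illustration.

```python
# A raster stored as a 2D array (a list of rows); values are made-up elevations in metres.
elevation = [
    [12.0, 13.5, 15.0],   # northernmost row
    [11.0, 12.5, 14.0],
    [10.0, 11.5, 13.0],   # southernmost row
]

def cell_value(grid, x, y, origin_x=0.0, origin_y=0.0, cell_size=10.0):
    """Return the value of the cell containing the map coordinate (x, y)."""
    col = int((x - origin_x) // cell_size)
    row_from_south = int((y - origin_y) // cell_size)
    row = len(grid) - 1 - row_from_south      # row 0 of the array is the northernmost
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        raise ValueError("coordinate lies outside the raster")
    return grid[row][col]
```

Every location inside the raster extent has a value, which is what makes the array a natural store for spatially continuous processes.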
3.8
Figure 3.11: The Koenigsberg bridge problem and its graphical representation (the geographical problem, graph and geography, abstract graph).
Raster spatial data
References
Burrough, P. A. and McDonnell, R. A. 1998. Principles of Geographic Information Systems. Oxford: Oxford
University Press.
Date, C. J. 1995. An Introduction to Database Systems (6th ed.). Reading, MA: Addison-Wesley.
The problem and its associated graph are shown in Figure 3.11 and this allows us to prove it is impossible to
visit all the land masses, crossing each bridge only once. Graphs are a topological construct, and do not have
a notion of distance associated with them in general.
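The impossibility argument can be sketched in code. Euler's observation is that a connected graph admits a walk using every edge exactly once only if zero or two of its nodes have an odd number of edges. The labels A to D for the four land masses are our own (C being the central island), and the function names are illustrative.

```python
from collections import Counter

# The seven bridges as edges between four land masses (labels are assumptions).
bridges = [("A", "C"), ("A", "C"), ("B", "C"), ("B", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

def degrees(edges):
    """Count how many edge ends meet at each node."""
    degree = Counter()
    for x, y in edges:
        degree[x] += 1
        degree[y] += 1
    return degree

def has_euler_walk(edges):
    """True if a connected graph can be walked using each edge exactly once.

    Assumes the graph is connected; only the degree condition is checked.
    """
    odd_nodes = sum(1 for d in degrees(edges).values() if d % 2 == 1)
    return odd_nodes in (0, 2)
```

All four land masses have an odd number of bridges, so `has_euler_walk(bridges)` is false. Note that the test uses only connectivity, never distance.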
Rather than spend a long time reviewing the theory of graphs, we give a number of definitions:
graph: a set of nodes and edges;
edge: e = {x, y} is the edge joining nodes x and y;
directed graph: each edge of the graph has a direction attached;
labelled graph: each edge has an associated label;
connected graph: there is a path between all pairs of nodes in the graph;
isomorphic graphs: two graphs which have the same connectedness are said to be isomorphic;
Environmental Systems Research Institute, Inc. 1993. Understanding GIS: The Arc/Info Method. Harlow,
Essex: Longman Scientific and Technical.
Foley, J., A. van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips 1993. Introduction to Computer Graphics.
Reading, Massachusetts: Addison-Wesley.
Rogerson, P. A. and A. S. Fotheringham 1994. GIS and spatial analysis: introduction and overview. In A. S.
Fotheringham and P. A. Rogerson (Eds.), Spatial Analysis and GIS, pp. 1–10. London: Taylor and Francis.
Worboys, M. F. 1995. GIS: A Computing Perspective. London: Taylor and Francis.
Worboys, M. F. and Duckham, M. 2004. GIS: A Computing Perspective, 2nd Edition, London: Taylor and
Francis.
4 Database Concepts
The database forms the core of the GIS. The functionality is built around the database. As computer scientists
one of our main concerns will be the design and implementation of the database. In this module I am assuming
some familiarity with the key concepts in databases. For those students who are less familiar this appendix,
in the form of a glossary, should cover the relevant ground. Reference to a standard database textbook such as
Date (1995) or simply reviewing Chapter 2 in Worboys (1995) should bring the student to the required level
of familiarity.
ANSI–SPARC: a layered database architecture consisting of users, views, a conceptual scheme, an internal scheme
and the stored data. The DBMS controls the links between each layer.
atomicity: (of transactions): the property that all operations of a transaction must either all have their effect
on the database or have no effect at all. This ensures database concurrency and integrity.
(of data): the property that the data cannot be decomposed into a further list of data items.
attribute: the label of a column in a table: for instance NAME, AGE and CITY could be some attributes in
a marketing database.
canned transactions: transactions which are frequently repeated by a user, and can be optimised for performance
in advance.
cardinality: (of a table) the number of tuples (rows) in a table.
conceptual scheme: the scheme for storing the data in the database as described by the data model but independent
of the actual implementation. Contrast with internal scheme.
concurrency: the ability to support the simultaneous use of data by several users.
committal: the act of signalling a permanent change to the database after a transaction has been completed.
database: ‘a unified computer-based collection of data, shared by authorised users, with the capability
for controlled definition, access, retrieval, manipulation and presentation of the data within it’
(Worboys, 1995)
data model: the model expressing the relationships in (and structure of) the data stored.
DBMS: DataBase Management System: connects the user to the data, through software which implements
the data model, provides tools for manipulating the data in the database, controls security and
concurrency in the databases and ensures user independence.
DDB: Distributed DataBase: a database that is spread across several platforms. This makes most sense
where the data have a logical partitioning.
DDBMS: Distributed DataBase Management System: a DBMS for a DDB.
degree: (of a table) the number of attributes (columns) in a table.
domain: (of relations) the types of attributes in a relation scheme such as integer, real, string, date, etc.
execution strategy: a method used to perform a particular query on the database. This will usually be written
in a query language such as SQL and compiled by the query compiler. In general there will be
many execution strategies, and a query optimiser might be used to implement the one with the
best performance.
first normal form: a relation in which the data items are atomic.
independence: (of the database) the concept that users should not be confronted with the complexities of the
low-level database processes. Keeping the conceptual scheme and the internal scheme separate
allows independence.
(of the transactions) the property that each transaction should have the same effect on the
database independently of the interleaving of transactions submitted concurrently.
indexing: the use of indices (or pointers) to speed up access to the data in the database.
integrity: the property that all the data in the database is correct and self consistent.
integrity constraints: properties of the data, set in advance by the user, which are always true. For instance speed
must always be greater than or equal to zero.
interleaving: the ordering of the constituent operations of a transaction.
internal scheme: the implementation details of the actual structures used in the data model such as the address
of the data on physical storage media. Contrast with conceptual scheme.
locking: a mechanism whereby part or all of the database is locked out to prevent access by other
processes/users during a transaction. This can solve problems of concurrency but will result in a loss
of performance.
metadata: additional information on the source, accuracy, lineage etc. of data stored in the database.
performance: the speed of retrieval of data from the database.
query: a request to access (or analyse) data in the database. This may involve several transactions.
query compiler: an essential part of the DBMS which parses, analyses and compiles a command written in the
query language.
query language: a method for communicating with a database.
query optimiser: a part of the query compiler which produces compiled code that ensures good database
performance.
RDBMS: Relational DataBase Management System: a DBMS for a relational database.
relation: a relation scheme and the associated data.
relation scheme: a set of attribute names together with a definition of their domains.
relational algebra: the structure and combination of operators applied to relational databases such as: union,
intersection, difference, project, restrict, join and divide.
relational database: a database that uses links between data (often in different files) to generate information,
that is, a set of relation schemes and the data.
reliability: the database has safety features against unforeseen events such as power failures.
rollback: the recovery of the database to the state before a transaction if something goes wrong during the
transaction.
security: the database ensures users do not use data in unauthorised ways.
SQL: Structured Query Language: a language which permits the definition of the data model and allows
the manipulation of data in a relational database. This is documented in most database text books,
see Date (1995) for example.
stored data manager: part of the DBMS which gives access to the internal scheme.
system catalogue: the part of the DBMS which stores the views, conceptual schemes, and internal schemes and
the mappings across these, allowing mappings from the high-level objects in the queries to the
low-level objects in the internal scheme.
tables: a table (or array) containing data about a set of attributes. Each cell in the table has a value and
each row is referred to as a tuple.
tabular relations: See tables.
transaction: an atomic unit of interaction between a user and the database. Transactions are typically the
insertion, modification, deletion or retrieval of data from a database.
transaction management: the management of transactions to ensure concurrency and integrity. This is a key area
for database design.
tuple: a row in a table which consists of a list of values for each attribute in the table.
value: the actual data held for a single cell in a table.
views: the ability to change the user interface to accommodate different user requirements.
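Several of the glossary terms (attribute, tuple, degree, cardinality, integrity constraint, transaction, committal and rollback) can be seen together in a small sketch using Python's standard sqlite3 module and an in-memory database. The table name, column names and rows are invented for illustration, echoing the NAME, AGE and CITY attributes mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three attributes; the CHECK clause is an integrity constraint on AGE.
conn.execute("CREATE TABLE customer (name TEXT, age INTEGER CHECK (age >= 0), city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [("Julie", 34, "Leeds"), ("Dave", 41, "York")])
conn.commit()                                    # committal

cursor = conn.execute("SELECT * FROM customer")  # a query written in SQL
degree = len(cursor.description)                 # number of attributes (columns)
cardinality = len(cursor.fetchall())             # number of tuples (rows)

# A transaction of two insertions: the second violates the constraint,
# so the whole transaction is rolled back (atomicity).
try:
    with conn:                                   # commits on success, rolls back on error
        conn.execute("INSERT INTO customer VALUES ('Ann', 29, 'Hull')")
        conn.execute("INSERT INTO customer VALUES ('Bob', -5, 'Bath')")
except sqlite3.IntegrityError:
    pass

rows_after = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
```

After the failed transaction the table still holds only the two original tuples: the valid first insertion was undone along with the invalid one.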
5 Models of Spatial Information
We use models all the time, sometimes consciously and sometimes without even realising it. A model is a tool
for simplifying a process, relationship or object. This process or object is said to come from the source domain
which is generally ‘reality’. The process or object is represented (using the model) in the target domain. In
general the aim will be to discover or understand something in the target domain (model) which can be then
applied in a useful way in the source domain (reality).
Put like this the concept may sound rather alien but there
are many concrete examples:
➢ flight simulator;
➢ economic model;
➢ weather forecast models;
➢ computer graphic representation of a robot;
➢ maps;
➢ graph.
There are a number of models of spatial data that are available to us. These were shown at an abstract level in
Section 3. It is clear that the most significant choice in models for spatial phenomena is whether to use object
(vector) or field (raster) approaches to modelling.
An object6 based approach assumes that space is populated by discrete entities, which co-exist. This is most
closely related to the standard relational database model which consists of a collection of tuples such as:
➢ location, rainfall, soil type, elevation;
Models, domains and morphisms
this will be the focus of this course, however it is also important to have a broad understanding of the domain
models as well.
In this section we investigate the computational models that we can use to represent some of the classes of
spatial objects that we discussed before. It should be clear by now that we are following a bottom up approach
to the understanding and definition of GIS. This section bridges the gap between the abstract notions of the
previous section and the next section which deals with more implementation related questions.
5.1
that defines a relation. The field based approach can be seen as a mapping, spatial framework → attribute
domain, while the object based approach might be considered a mapping: object domain → spatial embedding.
However we must take care here since the object based modelling paradigm does not necessarily
correspond to the use of a vector GIS, and likewise raster based representations can be encompassed in object
based approaches. For instance we might think of a layer in a GIS which stores all the road network data for
a particular county in the UK. If this is represented as a series of arcs, all on the same layer, does this really
correspond to the object approach? There is no hard division here.
A field based approach typically models the attributes distribution (in space) by some form of spatial function.
We shall look at various options for representing spatial functions later in the course.
5.2
Object based models
Figure 5.1: A model viewed as a morphism.
In general we can see that a model defines a mapping
(which preserves some or all of the structure and information) from the source domain to the target domain.
We call such a mapping a morphism, which we can represent by a (modelling) function $m$. A simple example
of a morphism is the mapping $x \to x^2$, which preserves the structure in the positive real numbers, and has an
inverse mapping $y \to \sqrt{y}$. The concept of a model as a morphism is illustrated in Figure 5.1. For the model to
be useful we must have:
$$m^{-1}(m(x)) \approx x,$$
where x is some transformation which the model should represent. This simply means that we can define an
inverse mapping from the target domain, back to the source domain, that is, we can interpret the results of the
model in terms of things that relate to the real world!
[Figure 5.2 stages: Application Domain → Application Domain Model → Conceptual Computational Model → Logical Computational Model → Physical Computational Model]
Figure 5.2: The typical stages of a GIS model.
We can think of GIS investigations as a series of modelling exercises which attempt to solve a given problem.
One decomposition might be as shown in Figure 5.2. The application domain refers to the subject area the
GIS investigation is exploring. This may be the flow of traffic in a city, the spread of pollution after a fire, the
erosion of a hill side or a more complex ‘reality’. The application domain model (or more simply, domain model)
is a model of the application domain, constructed by experts. These might be models for roads and traffic flow,
complex atmospheric and environmental models etc. The conceptual computational model takes into account
the computational context, and can be used to modify the model complexity, for instance we might determine
the way we represent different components of the application domain models. The logical computational model
concerns the design of the database components of the models, such as the choice of data structures to use.
Finally the physical computational model is the actual piece of software, designed and implemented on a specific
platform. The role of the computer scientist has traditionally been in the later steps of model development, so
[Figure 5.3 labels: Boundary: Polygon; Age: Age (numeric), 27; Owner]
In the object based approach we decompose space (the so called information space) into a series of objects or
entities. To qualify as an object an entity must be:
➢ identifiable,
➢ relevant,
➢ describable.
[Figure 5.3 labels: Person; Name (string): Joe Logs]
A description is given by the object's properties: static (things such as the city's name), behavioural (things
such as how the object is drawn) and structural (such as where the object is located, in a hierarchy or in space).
It is important to realise the distinction between spatially referenced objects (houses, trees or lakes) and spatial
objects (polygons, lines, points). Figure 5.3 shows a spatially referenced object (a tree) which has a spatial object
(polygon) for a boundary, as well as other attributes.
Figure 5.3: A spatially referenced tree object.
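The distinction between a spatially referenced object and its spatial object can be sketched in code. The following is only an illustrative sketch: the class names and the boundary coordinates are assumptions made for this example, while the age and owner values are those shown in Figure 5.3.

```python
from dataclasses import dataclass


@dataclass
class Polygon:
    """A spatial object: a closed ring of (x, y) vertices."""
    vertices: list


@dataclass
class Person:
    name: str


@dataclass
class Tree:
    """A spatially referenced object: attributes plus a spatial object."""
    boundary: Polygon  # the spatial object giving the tree's extent
    age: int
    owner: Person


# Hypothetical boundary coordinates, chosen only for illustration.
tree = Tree(
    boundary=Polygon([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]),
    age=27,
    owner=Person("Joe Logs"),
)
```

The tree is identifiable and describable through its attributes, while all of the geometry lives in the polygon it refers to.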
These spatial objects are embedded into a certain type of space (called the embedding space!) which can be
Euclidean, metric, topological or set based as discussed before. Almost all GIS use Euclidean space (distance and
bearing exist) as the fundamental embedding space, with point, line (arc), and area (polygon) structures as the
fundamental spatial objects. This gives the object based approach to GIS very strong links with geometry. One
other commonly used data structure is the simplicial complex, used to represent surface elevation (height above
sea level) in TINs.
[Figure 5.4 node labels: spatial, extent, 0-extent, 1-extent, 2-extent, point, arc, loop, closed loop, area]
The geometric structures can be used to define a class hierarchy, which is illustrated in Figure 5.4. Note that a
polygon is not necessarily the same as an area: a polygon is defined to be a simple looped polyline plus its
interior. In general features such as roads or rivers may be represented by a series of arcs, discretised to
represent the object approximately, known as polylines. This raises an interesting issue to do with the accuracy
of the vector based model. If we are representing objects which are really straight line segments (such as
power-lines) our model can be very accurate, so long as the scale is small enough⁷. If the objects are curved then
the polyline will be a (first order) approximation to the real object (e.g. middle line of the river). We can also
use higher order approximations (such as polynomial curves, especially those used widely in computer graphics
such as Bézier curves). We will return to this question of model accuracy later.
Figure 5.4: A hierarchy for spatial objects.
⁶ We do not mean object oriented approach here, and object is meant in a non-technical sense.
There are a number of operations which can be applied to the spatial objects, many of which we have looked at
in the previous section, and which are summarised in Table 4.2 in Worboys (1995, p. 171). Many of these
operations are directly relevant to some of the questions which we would like our model to address. For instance
a point in polygon operation might be useful for determining whether a post box was within a given ward (at a
larger scale). At a small scale we might want to use a polygon in polygon analysis to find out whether a house
was really within a single ward. Using the intersection of two polygons we could find areas with both clay soils
and grass vegetation. There are many examples.
A useful division can be made between static operations, which do not change the objects involved, and dynamic
operations which do alter the objects. Typical static operations in a GIS environment are operations like
intersection, union, is within, distance and area. Dynamic operations are things like create, destroy, update,
split, merge. In a GIS setting the update function may frequently include the use of transformations (scalings,
translations, rotation) and projections.
5.3
Implementing object based models
There is a distinction between object based approaches to GIS and object oriented approaches to GIS solutions.
The decision to model the application domain using a variety of spatial objects has a long tradition in GIS, but
many of these models are not object oriented, because the objects themselves have no real identity, they simply
sit on a layer with lots of other elements of the same type. This layer based approach is actually more strongly
identified with field based models as discussed in Section 5.4.
True object oriented approaches to GIS are still relatively uncommon, although there are some rare exceptions.
Increasingly, GIS companies are realising that having “object oriented approach” somewhere in their
software description is a good sales idea, despite the fact that in almost all instances these are simply object
wrappers around an underlying tabular / layered data structure (which inevitably creates inefficiencies). Recent
developments, driven by the OpenGIS Consortium (OGC), a self-appointed, but open, group of experts, seeking
to produce a consistent set of open standards for spatial data, mean that object oriented approaches to GIS are
easier to envisage and can communicate readily with existing GIS software. We explore this later in Section 7,
but first we consider the data structures (representations) and algorithms which are typically used in GIS.
5.4
Field based models
For field based models the basic premise is that the attribute g (which may be, for example, elevation, rainfall or
per capita income) varies in space as some function which we have yet to describe. The value of the function g
depends on spatial coordinates x, y. We write g = g(x, y) and in general we will assume that g is a mathematical
function of some sort. We will address the exact form later. What we can note is that the value of g in reality
is assumed to exist for every location (x, y), and thus to store the exact field would require infinite storage.
In order to circumvent this problem we store the value of g on some finite tessellation of space (our spatial
framework) which is in practice either a grid, a quad-tree or a TIN. Note that this means that our models are
already quite different from the reality, the size of the difference again being related to the scale of the variation
and model.
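As a small sketch of this idea, we can sample a field on a regular grid and keep only the sampled values. The field g used here is a made-up smooth function and the name sample_on_grid is hypothetical; neither comes from the course material.

```python
import math


def g(x, y):
    # Hypothetical smooth field, chosen only for illustration.
    return math.sin(x) + math.cos(y)


def sample_on_grid(field, x0, y0, cell, nx, ny):
    """Store field values at the centres of an nx-by-ny grid of square cells."""
    return [
        [field(x0 + (i + 0.5) * cell, y0 + (j + 0.5) * cell) for i in range(nx)]
        for j in range(ny)
    ]


# Twelve stored numbers stand in for the whole (infinite) field.
grid = sample_on_grid(g, 0.0, 0.0, 1.0, 4, 3)
```

Everything between the cell centres is lost, which is exactly the difference between model and reality noted above.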
The field based model is readily extended to 3D space (but we do not do that here). The field based model
also implies another form of restriction on the way we use the data in a GIS. It can be seen that each piece of
information that we would like to bring together is a separate layer in the data structure. Thus we are forced
into a layered way of thinking, which is a feature of many GIS projects. This is also true of many vector data
structures, although the Object Oriented approach to GIS alleviates some of these constrictions.
⁷ Small scale means different things to different people. In geographical terms small scale means between 10:1
and 10,000:1. Note that the small refers to the fact that a small distance on our map represents a small distance
in the real world, that is, small scale maps have more detail.
In general, we represent fields as a mathematical surface and these can be displayed using 3D graphs, colour and
contours as shown in Figure 5.5.
Figure 5.5: A field based model displayed as (a) a mesh plot, (b) a contour plot, (c) a surface plot and (d)
a lit surface plot.
Most natural phenomena / characteristics are best represented using fields, particularly those which are
continuous. We can think of different types of field:
Nominal: different qualitative values which cannot be ordered (land-cover).
Ordinal: ordered data but not directly comparable (landslide risk categories).
Ratio: usual measurements taking real values (elevation).
Raster data structures are most suited for ratio variables, since the other discrete data types may be more
sensibly represented using vector data structures. In many raster data structures it will also be useful to be able
to assign a NULL value to represent areas where we have no information or where the variable is not defined.
We may want to distinguish between these two NULL values using different codes.
[Figure 5.6 panels: Discontinuous, Continuous, Differentiable; each plots g(x) against x]
Figure 5.6: Three different levels of continuity in fields.
As well as thinking about the types of data that the fields can represent we can also think about the types of
field behaviour we might see spatially. There are three general classes of spatial behaviour:
Discontinuous: the values of g may change radically from location to location.
Continuous: the values of g change smoothly but the gradient may jump.
Continuous and Differentiable: both the value of g and the gradient change smoothly.
The different concepts of the spatial behaviour of a field are illustrated in 1D in Figure 5.6. The concept
of differentiability links in strongly with our sense of smoothness. A field that is once differentiable will be
reasonably smooth, but a field that is twice differentiable will be even smoother. Of course how this appears
will depend upon the scales of the variation. The behaviour of the field in 2D space may either be isotropic,
that is the same in all directions, or anisotropic, that is varying differently in different directions. We will consider
this in more detail when we look at the actual models we will use.
One of the key concepts in GIS (and indeed one of the reasons that geography is interesting) is the observation
that things which are closer together tend to be more similar. We can see this when we look at the world, albeit
often with hard edges. This concept is embodied in the mathematical ideas of spatial correlation or covariance.
The spatial covariance expresses to what extent (on average) g varies with itself over increasing distance. We
can plot the covariance as a function of spatial separation as shown in Figure 5.7 to produce the covariance
function of the field. We will look at these so called Random Field Models for field data in more detail later in
the course.
[Figure 5.7 axes: C(r) against separation distance (r)]
Figure 5.7: An example of a typical covariance function.
We might also like to consider the types of operations which we might apply to the data contained in field
structures. In general we can group these into:
Local: acting at a point, often across different fields, e.g. max, min, sum.
Focal: acting over a neighbourhood (usually using one field), e.g. slope, aspect, filtering.
Zonal: acting over some predefined zones, often computing statistics, e.g. max, min, mean, variance.
We will look at a variety of these operations and their applications later in the course, particularly in the lab
classes.
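The three groups of operation can be sketched on small grids held as nested lists. This is a minimal illustration only: the grids, the zone labels and the function names are invented for the example, and the focal operation shown is a simple mean filter rather than slope or aspect.

```python
elevation = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0]]
rainfall = [[9.0, 8.0, 7.0],
            [6.0, 5.0, 4.0],
            [3.0, 2.0, 1.0]]
zones = [["a", "a", "b"],
         ["a", "b", "b"],
         ["c", "c", "c"]]


def local_max(f, g):
    """Local operation: cell-by-cell maximum across two fields."""
    return [[max(a, b) for a, b in zip(row_f, row_g)]
            for row_f, row_g in zip(f, g)]


def focal_mean(f):
    """Focal operation: mean over the 3x3 neighbourhood, clipped at the edges."""
    rows, cols = len(f), len(f[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            values = [f[rr][cc]
                      for rr in range(max(0, r - 1), min(rows, r + 2))
                      for cc in range(max(0, c - 1), min(cols, c + 2))]
            out_row.append(sum(values) / len(values))
        out.append(out_row)
    return out


def zonal_mean(f, z):
    """Zonal operation: mean of the field within each labelled zone."""
    totals = {}
    for row_f, row_z in zip(f, z):
        for value, label in zip(row_f, row_z):
            total, count = totals.get(label, (0.0, 0))
            totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}
```

Note that the local operation needs only the co-located cells of each layer, the focal operation needs a cell's neighbours within one layer, and the zonal operation needs a second layer that defines the zones.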
6
Representation and Algorithms
In this section we try to start bridging the gap between the somewhat abstract concepts of the previous sections
and the more concrete computer code of the finished GIS system. We will look at some of the methods that
can be used to represent the spatial data models we have discussed and what algorithms can be applied to
compute with these data structures. From a Computer Science point of view these next two sections are key
to the GIS module. However, time does not permit us to spend a great deal of effort on this topic, which is
strongly covered in (Worboys, 1995).
6.1
In all these discussions we must bear in mind that we are using finite precision digital representations of the
object's spatial location (particularly when we are using an object based model in Euclidean coordinates) and
we must ensure that the rounding errors we introduce do not accumulate. If we do not do this we can produce
results which have non-trivial rounding errors associated with them. We will look at this in more depth later
in the course.
6.2
Spatial objects
A point on the surface of the Earth will be represented in general by its latitude and longitude (and possibly
elevation). In later lectures we will look briefly at the issues surrounding the projection of this sphere onto 2D
surfaces. For now let us assume that pretty much everything we are interested in can be adequately represented
on a 2D Cartesian plane, with coordinates (x, y). These coordinates are determined by a fixed point (generally
the map origin) and two axes, which are typically, but not always, aligned south-north and west-east.
Special features of spatial data
Traditional data structures contain 1D information such as a person's name, a payroll number, or their actual
pay. Spatial data inherently requires data structures capable of storing 2D and 3D (even 4D) data. This means
that the storage and computational requirements for a spatial database are likely to be a lot larger than for
the standard 1D databases with which you are more familiar. Since the size of the data structures is greater
we must pay more attention to the efficiency of the implementation than we might otherwise (as is also true in
Computer Graphics).
There is an additional problem with geometric algorithms, since there may not always exist a computational
solution. We define an algorithm to be the specification of a computational process required to perform an
operation. The use of geometric algorithms dates back a long time (to Euclid circa 300 BC). Our brains have
very efficient geometric algorithms. For instance if we try to determine by eye whether a point is in a polygon, we
can do that very quickly, while it is not so clear how to write an algorithm to do this. For geometric problems
we need to ask:
➢ is there an algorithm we can use to solve the problem,
➢ what is the most efficient algorithm,
➢ and what data structures can we use to make the algorithm efficient.
Figure 6.1: Discretisation errors in digital databases.
The model we are setting up is continuous, that is a point can exist at any location, however, since we need
to represent these coordinates digitally there will be a truncation error on the location, the size of which is
related to the number of bits we are prepared to use to store our coordinates (typically 32 to 64 depending on
machine architecture) and the size of the area we are trying to model. The problems become more acute when
we consider the intersection of two lines, since this intersection point is unlikely to correspond to a possible
coordinate location (see Figure 6.1).
We might also consider the special cases that often exist in geometric algorithms, for instance the use of bounding
boxes to assess whether lines can cross, before working out where they actually cross.
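A bounding box pre-test of this kind might be sketched as follows. The function names are hypothetical, and a segment is assumed to be given as a pair of (x, y) end points.

```python
def bounding_box(segment):
    """Axis-aligned box (xmin, ymin, xmax, ymax) enclosing a segment."""
    (x1, y1), (x2, y2) = segment
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def boxes_overlap(seg_a, seg_b):
    """Return False when the two segments cannot possibly cross."""
    ax0, ay0, ax1, ay1 = bounding_box(seg_a)
    bx0, by0, bx1, by1 = bounding_box(seg_b)
    # The boxes overlap only if they overlap in both x and y.
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1
```

Overlapping boxes do not guarantee that the segments cross, so a True result must still be followed by the full intersection computation; a False result lets us skip it.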
The complexity of an algorithm is clearly related to the efficiency. We can consider two related aspects; the
storage complexity (how much memory / storage is needed) and the time complexity (how long it takes). In
general we will trade them off against each other, and it will depend upon the needs of each user as to which is
most important. We generally express the complexity of an algorithm in terms of its complexity function. The
complexity function relates the number of objects denoted by n (such as the number of vertices in a polygon)
to the number of elementary computations we require. For many operations, such as finding the area of the
polygon this will be O(n) where the big O stands for of the order of, that is to say approximately n operations
are needed (actually it will be kn but k will be small). The O() of an algorithm tells us about the asymptotic
behaviour of the algorithm as n → ∞, which will often be the case with large geographic data sets.
For object based models the storage requirements remain O(n) as the dimension increases, because we merely
need to store extra coordinates. For field based models where we use a fixed grid, as the dimension d or the
number of discrete cells along each dimension n increases, the storage requirements grow as $O(n^d)$; thus we
have a very greedy data structure. This will require the use of special data structures in field based models to
keep the storage requirements to a sensible level. An alternative is to use compression methods, as is done in
image processing. The final consideration for spatial data is that we will need methods for converting between
the object and field representations.
Figure 6.2: The Douglas-Peucker algorithm for line discretisation.
Another factor that we have to deal with is that while our data is generally curved most of the data structures
used in GIS are designed for linear data. It is possible to represent curves to an arbitrary accuracy using straight
lines by taking smaller and smaller segments, however the direct segmentation of a line into fixed intervals is
not very efficient in terms of storage. One approach to the problem is to use the Douglas-Peucker algorithm
which is illustrated in Figure 6.2.
The Douglas-Peucker algorithm works by starting from both end points and connecting them by a straight
line. We then find the point on the curve that is most distant from the line and connect this with lines from
both end points. This is iterated for each line section until each line section is closer to all points of the curve
section which it represents than some small number ε. We can choose ε to be any value; the smaller it is, the
shorter the segments of the resulting polyline will be.
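The procedure can be sketched recursively. In this sketch we assume the curve is supplied as a dense list of (x, y) points, and we measure the distance from a point to a line section with a point to segment distance; the function names and the parameter eps (standing for ε) are our own choices.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    (xp, yp), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(xp - x1, yp - y1)
    # Fractional position of the closest point, clamped to the segment.
    t = ((xp - x1) * dx + (yp - y1) * dy) / length_sq
    t = min(max(t, 0.0), 1.0)
    return math.hypot(xp - (x1 + t * dx), yp - (y1 + t * dy))


def douglas_peucker(points, eps):
    """Simplify a polyline so no dropped point is further than eps away."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, worst = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > worst:
            index, worst = i, d
    if worst <= eps:
        return [first, last]
    # Split at the most distant point and simplify each half.
    left = douglas_peucker(points[:index + 1], eps)
    right = douglas_peucker(points[index:], eps)
    return left[:-1] + right
```

For example, with a small eps the spike in a nearly flat line is kept while the small wobble is removed; with a large eps only the two end points survive.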
6.2.1
Representing spatial structure
We have looked at a number of object types and high level models which will be of use in a GIS. Now we start
to think a bit more about how we can encode some of the spatial information. Consider the polygons shown in
Figure 6.3. There are a huge number of ways that we could represent this structure. Probably the simplest is
to give the polygon identifier followed by a list of the vertex coordinate pairs (x, y). If we ensured that we
always gave these pairs in a clockwise order we could even define the difference between the inside and outside
of the polygon. Alternatively we could add a labelling point somewhere inside the polygon, to which we could
attach attribute information. Clearly such a representation is very inefficient, both in terms of the storage
space and the ease of constructing queries to the database. For instance we might often wish to determine
which are the neighbouring polygons, which would take a lot of computation using this simple structure. It is
left as an exercise for the reader to show this will have $O(mn^2)$ time complexity if we assume each polygon
has n vertices and the total number of polygons is m.
[Figure 6.3 labels: vertices (x1,y1) to (x8,y8); polygons A, B, C]
Figure 6.3: Some simple structures to represent.
We can improve the characteristics by labelling each vertex, and using these pointers (labels) to identify the
vertex coordinates. This cuts down on the storage requirements, but does little else. In computer graphics this
is referred to as a pointers-to-vertices data structure; in GIS it is often known as spaghetti. This structure is
still very difficult to query, but was often used in the first generation of GIS.
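The two representations can be compared in a short sketch. The two unit squares used here are hypothetical and are not the polygons of Figure 6.3; the function names are also our own.

```python
# Simple list: every polygon repeats its own coordinate pairs (clockwise).
simple = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)],
    "B": [(1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 0.0)],
}

# Pointers-to-vertices: coordinates are stored once, polygons hold labels.
vertices = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (1.0, 1.0),
            4: (1.0, 0.0), 5: (2.0, 1.0), 6: (2.0, 0.0)}
polygons = {"A": [1, 2, 3, 4], "B": [4, 3, 5, 6]}


def edges(ring):
    """Undirected edges of a closed ring of vertex labels."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}


def neighbours(name):
    """Polygons sharing at least one edge with the named polygon."""
    mine = edges(polygons[name])
    return sorted(other for other, ring in polygons.items()
                  if other != name and mine & edges(ring))
```

The shared vertices are now stored once (six coordinate pairs rather than eight), but finding the neighbours of a polygon still means scanning every other polygon, which is why this structure remains hard to query.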
Note that the sequence NO gives a sequence of the types mentioned prior to that item. These sequences give
the membership of the individual points along an arc and of the arcs around an area. These sequences are
not important for describing the topology but are needed for displaying the shape. An extended entity-relationship
diagram of the model is shown in Figure 5.10 in Worboys (1995, p. 195).
To construct the tables for the NAA relation the following steps must be undertaken:
➢ label all points on the map;
➢ determine which points must be nodes and label these;
➢ direct the arcs (you can choose which direction) and label these;
➢ label the areas;
➢ build up the tables according to your labelling.
We can include even more topology using a Doubly Connected Edge List (DCEL) representation. This representation focuses strongly on topology and less on the embedding of the nodes, arcs and areas in space. It allows
us to efficiently search for the surrounding sequence of arcs of each node as well as the surrounding sequence of
arcs for each area. The NAA representation would not enable us to easily find the sequence of the surrounding
arcs, only an unordered list of each arc ID. To do this we need to add the relations next arc and previous arc
to the NAA arc relation. The table for a DCEL representation of Figure 6.4 is shown in Table 2. The
previous arc is the next arc anti-clockwise from the begin node, while the next arc is the next arc anti-clockwise
from the end node.
Table 2: Representing the objects shown in Figure 6.4 using the doubly connected edge list representation.
Note that X represents the external area.
Now we try to add some topological information to our representation. One method that we can use to increase
the topological information is to use the node-arc-area (NAA) representation. This is a type of entity-relationship
(E-R) model. The primary entities are nodes (vertices), directed arcs (polylines) and areas (closed simple polylines).
The representation is governed by several rules:
➢ an arc has exactly one start node and one end node;
➢ each node belongs to at least one arc;
➢ each area is bounded except the external area;
➢ arcs intersect only at nodes (planar embedding of a graph);
➢ each arc has exactly one area on its left and one on its right;
➢ each area belongs to at least one arc.
If we apply these rules to the polygons in Figure 6.3 we can obtain Table 1, which can also be viewed with
reference to Figure 6.4.
[Figure 6.4 labels: nodes 1, 2, 3, 4; arcs a, b, c, d, e, f; areas A, B, C]
Figure 6.4: Representing things using the NAA model.

Table 1: Representing the objects shown in Figure 6.4 using the node-arc-area representation. Note that X
represents the external area.
arc ID   begin node   end node   left area   right area
a        1            2          A           X
b        2            3          A           B
c        3            1          A           C
d        2            4          B           X
e        4            3          B           C
f        4            1          C           X

Table 1 shows one of the possible tables in the database. A full list of the tables is:
➢ point(point ID, x coord, y coord)
➢ node(node ID, point ID)
➢ polyline(arc ID, sequence of point IDs)
➢ arc(arc ID, begin node, end node, left area, right area)
➢ area(area ID, sequence of arc IDs)

The arc table of the doubly connected edge list representation (Table 2) adds the previous arc and next arc
columns:
arc ID   begin node   end node   left area   right area   previous arc   next arc
a        1            2          A           X            c              d
b        2            3          A           B            a              e
c        3            1          A           C            b              f
d        2            4          B           X            b              f
e        4            3          B           C            d              c
f        4            1          C           X            e              a
With these more complex data structures editing an existing structure becomes more involved. Adding an arc
to an existing table requires that the whole table be updated. Even with this more complex data structure a
query such as find the sequence of arcs around a given area is not trivial to implement. For instance we can
find a sequence of arcs around a node n using the pseudo-code in Listing 1. This is equivalent to finding the
order of exits off a roundabout. The procedure would be used by first searching the table (entries begin node
and end node) to find an arc that is incident with the node n. If no such arc exists then we should return an
error, otherwise we can repeatedly apply the procedure below, starting with the first arc we found above and
continuing until we find that we have returned to the starting arc.
FUNCTION new_arc(arc, n)
-- Return the arc that follows arc in the sequence of arcs about node n.
BEGIN
  IF begin_node(arc) = n THEN
    new_arc = previous_arc(arc)
  ELSE
    new_arc = next_arc(arc)
  ENDIF
  RETURN new_arc
END
Listing 1: Pseudocode for the procedure needed when finding the sequence of arcs about node n.
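The complete search described above can be sketched in Python using the arc table of the doubly connected edge list representation for Figure 6.4. The dictionary layout and the name arcs_around_node are assumptions made for this sketch; new_arc follows Listing 1.

```python
# arc ID: (begin node, end node, left area, right area, previous arc, next arc)
DCEL = {
    "a": (1, 2, "A", "X", "c", "d"),
    "b": (2, 3, "A", "B", "a", "e"),
    "c": (3, 1, "A", "C", "b", "f"),
    "d": (2, 4, "B", "X", "b", "f"),
    "e": (4, 3, "B", "C", "d", "c"),
    "f": (4, 1, "C", "X", "e", "a"),
}


def new_arc(arc, n):
    """Arc that follows `arc` in the sequence of arcs about node n (Listing 1)."""
    begin, _end, _left, _right, previous_arc, next_arc = DCEL[arc]
    return previous_arc if begin == n else next_arc


def arcs_around_node(n):
    """Sequence of arcs about node n, starting from the first incident arc."""
    start = next((arc for arc, row in DCEL.items() if n in row[:2]), None)
    if start is None:
        raise ValueError(f"no arc is incident with node {n}")
    sequence, arc = [start], new_arc(start, n)
    while arc != start:  # stop once we are back at the starting arc
        sequence.append(arc)
        arc = new_arc(arc, n)
    return sequence
```

For node 1 this visits arcs a, c and f in turn, which are exactly the three arcs that have node 1 as a begin or end node.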
A simple extension to this algorithm yields a method for finding the sequence (or cycle) of arcs surrounding a
given area. It is left as a student exercise.
Other representations (often used in computer graphics) such as boundary representations (or b-reps) are used
in some GIS systems which focus on visualisation rather than analysis and manipulation, but the representations
are not very information rich in general, since they have been developed to enable quick processing for display.
An alternative representation (often used in CAD/CAM applications) that is used in GIS data structures is the
winged-edge representation. It is a variant of the DCEL representation, but can only be used on objects for
which the notion of clockwise can be defined. This will be true of all the geographic data we will consider. The
key element in this data structure is again the edge or arc but we do not explore it further here.
6.2.2
Geometric algorithms for vector data
This section considers some of the algorithms which can be applied to the vector data we have just discussed.
A great deal of reference is made to standard geometric results, which we will review.
We start with some metric algorithms (which we have covered earlier). Recall that the distance between 2
points, $|p_1 p_2|$ where $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$, using a Euclidean norm is:
$$|p_1 p_2| = \|p_1 - p_2\| = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}. \tag{6.1}$$
A common query will be: given a line L and a point p, what is the distance between the point and the line? This is
most simply answered by considering the line L in terms of its implicit functional form
$\{(x, y) \mid ax + by + \gamma = 0\}$. In words this notation describes the set of points (x, y) for which the
equation $ax + by + \gamma = 0$ is satisfied. If we represent the line by an implicit function
$f(x, y) = ax + by + \gamma = 0$ and we know the slope-intercept explicit form is:
$$y = mx + c = \frac{\Delta y}{\Delta x} x + c, \tag{6.2}$$
where $\Delta y = y_2 - y_1$, $\Delta x = x_2 - x_1$ and c can be computed from $c = y_1 - m x_1$, we can write:
$$f(x, y) = \Delta y \cdot x - \Delta x \cdot y + \Delta x \cdot c, \tag{6.3}$$
FUNCTION point_dist_to_line(xp,yp,x1,y1,x2,y2)
-- Compute the distance from a point (xp,yp) to a line segment defined by
-- its start (x1,y1) and end (x2,y2) points.
BEGIN
  dx1p = x1 - xp;
  dx21 = x2 - x1;
  dy1p = y1 - yp;
  dy21 = y2 - y1;
  -- Squared length of the line segment.
  frac = dx21*dx21 + dy21*dy21;
  IF frac = 0.0 THEN
    -- Degenerate segment (both end points coincide): use the end point.
    lambda = 0.0;
  ELSE
    -- Compute the fractional distance along the line at which the normal intersects.
    lambda = -(dx1p*dx21 + dy1p*dy21) / frac;
    -- Make sure we only take this if it is along the line segment,
    -- otherwise choose the appropriate end point.
    lambda = MIN(MAX(lambda,0.0),1.0);
  ENDIF
  -- Compute the x and y separations between the point on the line
  -- that is closest to (xp,yp) and (xp,yp).
  xsep = dx1p + lambda*dx21;
  ysep = dy1p + lambda*dy21;
  point_dist_to_line = SQRT(xsep*xsep + ysep*ysep);
  RETURN point_dist_to_line
END
Listing 2: Pseudocode for the algorithm for finding the distance from a line to a point.
and thus, comparing with equation (6.3), $a = \Delta y$, $b = -\Delta x$ and $\gamma = \Delta x \cdot c$. We can now
convert between three representations of a line: the line segment defined by two end points, the explicit function
and the implicit function.
If we consider the line L represented in its implicit functional form then the distance from the point
$p = (x_p, y_p)$ is given by:
$$d_L = \frac{|a x_p + b y_p + \gamma|}{\sqrt{a^2 + b^2}}. \tag{6.4}$$
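As a check on the algebra, the implicit coefficients and the distance of equation (6.4) can be computed directly from two points on the line. The function names are hypothetical. Here γ is evaluated as −(a·x1 + b·y1), which equals ∆x · c whenever the slope is defined and, as an added convenience of this sketch, also works for vertical lines.

```python
import math


def implicit_line(p1, p2):
    """Coefficients (a, b, gamma) of a*x + b*y + gamma = 0 through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    a = y2 - y1          # delta y
    b = -(x2 - x1)       # minus delta x
    gamma = -(a * x1 + b * y1)
    return a, b, gamma


def distance_to_line(p, p1, p2):
    """Distance from p to the complete (infinite) line through p1 and p2."""
    a, b, gamma = implicit_line(p1, p2)
    xp, yp = p
    return abs(a * xp + b * yp + gamma) / math.hypot(a, b)
```

For the horizontal line through (0, 0) and (4, 0), the point (2, 3) is a distance 3 from the line, as expected.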
Note that this equation is only correct for the complete line. In the case of a line segment we need to first check
whether the point is within the span of the line segment (which is an example of the many special conditions
which we come across in geometry). If the point lies in the middle of the line segment as shown in Figure 6.5,
then we can use our above formula, otherwise we need to take the minimum of the two end point distances to
the point p. If we are dealing with a polyline then we need to perform each of these operations for each segment
and then take the minimum distance, which means if we have n segments the time complexity of the algorithm
will be O(n). If the line segments are short with respect to the overall length then the minimum of the distances
to the vertices will be a good approximation (although this is still O(n) in time). We show an algorithm in
Listing 2. Note this algorithm is needed in the Douglas-Peucker algorithm.
[Figure 6.5 labels: point p; line L with end points p1 and p2; regions end, middle, end; distance dL]
Figure 6.5: The distance to a line segment.
We could use this algorithm to compute the distance between two polygons, which we could interpret as the
minimum distance between their point sets. If each polygon had n vertices we would need to compare all n
vertices in each polygon with all n − 1 line segments in the other polygon, giving a brute force algorithm of
$O(n^2)$ complexity. A more common definition of the distance between two polygons is the distance between
their centroids. The centroid of a polygon is simply computed for a regular polygon P which has vertices
$\{p_1, p_2, \ldots, p_n\}$ and is given by the mean of its vertex position vectors:
$$c_P = \frac{p_1 + p_2 + \ldots + p_n}{n}. \tag{6.5}$$
Figure 6.6: Point in polygon determination using the half line algorithm.
Similarly the area of a simple polygon is given by:
$$A_P = \frac{p_1 \times p_2 + p_2 \times p_3 + \ldots + p_{n-1} \times p_n + p_n \times p_1}{2}, \tag{6.6}$$
where the $\times$ means taking the vector cross product which is given (in 2D) by
$$p_1 \times p_2 = x_1 y_2 - x_2 y_1. \tag{6.7}$$
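Equations (6.5) to (6.7) translate almost directly into code. This is a minimal sketch with hypothetical function names, in which a polygon is assumed to be a list of (x, y) vertices without the first vertex repeated at the end.

```python
def cross(p, q):
    """2D cross product of equation (6.7)."""
    return p[0] * q[1] - q[0] * p[1]


def polygon_area(vertices):
    """Area of a simple polygon, equation (6.6); the result is signed."""
    n = len(vertices)
    return sum(cross(vertices[i], vertices[(i + 1) % n]) for i in range(n)) / 2.0


def vertex_mean(vertices):
    """Mean of the vertex position vectors, equation (6.5)."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)
```

With the vertices listed anti-clockwise the area comes out positive; listing them clockwise gives the same magnitude with a negative sign, so in practice we may want to take the absolute value.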
The area operation has an additional property. If it is applied to a triangle with vertices $\{p_1, p_2, p_3\}$ then
not only can we find the area of the triangle, but this area is signed:
$$A_{p_1,p_2,p_3} = \frac{p_1 \times p_2 + p_2 \times p_3 + p_3 \times p_1}{2}. \tag{6.8}$$
Thus we have that $A_{p_1,p_2,p_3} = -A_{p_3,p_2,p_1}$. We can use this sign to determine which side of the line
segment $p_2 \to p_3$ the point $p_1$ is. If:
➢ $A_{p_1,p_2,p_3} > 0$ then $p_1$ is to the left-hand side of $p_2 \to p_3$;
➢ $A_{p_1,p_2,p_3} < 0$ then $p_1$ is to the right-hand side of $p_2 \to p_3$;
➢ $A_{p_1,p_2,p_3} = 0$ then $p_1$ is on $p_2 \to p_3$.
This can be a very useful test when building or querying for topology on a purely geometric representation.
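The side test can be sketched as follows. The function names and the optional tolerance tol are our own additions; a non-zero tolerance is one simple way of allowing for the rounding errors discussed earlier.

```python
def signed_area(p1, p2, p3):
    """Signed area of the triangle p1, p2, p3, equation (6.8)."""
    return ((p1[0] * p2[1] - p2[0] * p1[1])
            + (p2[0] * p3[1] - p3[0] * p2[1])
            + (p3[0] * p1[1] - p1[0] * p3[1])) / 2.0


def side_of_segment(p1, p2, p3, tol=0.0):
    """Which side of the directed segment p2 -> p3 the point p1 lies on."""
    area = signed_area(p1, p2, p3)
    if area > tol:
        return "left"
    if area < -tol:
        return "right"
    # Zero area means p1 is collinear with p2 and p3.
    return "on"
```

For a segment running from (0, 0) to (1, 0), a point above the segment is reported as being on the left and a point below it as being on the right.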
Another operation we will frequently require is determining whether a point is within a polygon. Consider a
point p and a convex polygon P which has vertices {p1 , p2 , . . . , pn }, ordered anti-clockwise around the polygon.
One way of determining whether a point is in the polygon is to compute Ap,p1 ,p2 , Ap,p2 ,p3 up to Ap,pn ,p1 . If
all these are greater than zero then the point is in the polygon.
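A minimal sketch of this convex test in Python is given below, assuming the polygon is supplied as a list of (x, y) tuples in anti-clockwise order. The function names are illustrative only.

```python
def signed_triangle_area(p1, p2, p3):
    """Signed area of the triangle p1, p2, p3 (equation 6.8)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return ((x1 * y2 - x2 * y1) + (x2 * y3 - x3 * y2) + (x3 * y1 - x1 * y3)) / 2.0


def in_convex_polygon(p, polygon):
    """True if p lies strictly inside a convex, anti-clockwise polygon.

    The point must be to the left of every edge p_i -> p_(i+1),
    including the closing edge p_n -> p_1.
    """
    n = len(polygon)
    return all(
        signed_triangle_area(p, polygon[i], polygon[(i + 1) % n]) > 0
        for i in range(n)
    )


square = [(0, 0), (2, 0), (2, 2), (0, 2)]  # hypothetical test polygon
print(in_convex_polygon((1, 1), square))
print(in_convex_polygon((3, 1), square))
```

The test costs one signed area per edge, so it is O(n) in the number of vertices.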
If the polygon is not convex then this algorithm will not work; it is left as an exercise for the reader to find a case where it fails. For a non-convex polygon the semi-line algorithm can be used to determine whether a point is in the polygon. We again assume a point p and a polygon P, this time with no restrictions on the polygon, other than that we assume the point is not on the boundary of the polygon.

The semi-line algorithm is very simple: we draw a half line from the point to infinity (in any direction) and count how many times the line crosses the polygon boundary. If the number is odd the point is in the polygon; if it is even the point is outside, as shown in Figure 6.6. We have to be a little careful to check what happens if the line passes through a vertex. To implement the algorithm the semi-line is usually taken to be aligned with the x-axis and the polygon is translated so that the point p is at the origin. We then only need to traverse the polygon checking whether each edge crosses the semi-line (exercise: how do we check that?), thus the algorithm has O(n) time complexity.

To check whether a point is on a segment is easy. First we compute the area of the triangle formed by the segment and the point, using the area operator defined earlier. If this is zero, then the three points are collinear, and we need only test whether the point lies within the bounding box of the line segment, as shown in Figure 6.7.

Figure 6.7: The bounding box of a line segment.

To check whether two line segments intersect is a little more tricky. However it can be seen that if one end point of the first line segment is on the opposite side of the second line to the first segment's other end point, then the first segment must cross the infinitely long line through the second. (We assume that no points of one line lie on the other; we must test this first in practice.) Thus we can use the area operator again here. Once we have determined whether they could cross, we must determine the point at which the infinitely long lines would cross and then check whether this lies within both line segments. Determining the point at which the lines cross is not trivial, because again there are a number of special cases (think about Figure 6.8). However, the algorithm is rather simple and is shown in Listing 3.

Figure 6.8: Crossing line segments.

    PROCEDURE intersect_two_lines(x1,y1,x2,y2,x3,y3,x4,y4)
    -- Compute whether two line segments defined by (x1,y1) --> (x2,y2)
    -- and (x3,y3) --> (x4,y4) cross and if so where.
    -- accuracy is a small positive tolerance used to detect parallel lines.
    BEGIN
        cross = TRUE; -- Boolean to say whether the segments cross.
        xi = UNDEFINED; -- Crossing point; only set if the lines are not parallel.
        yi = UNDEFINED;
        dx21 = x2 - x1;
        dy21 = y2 - y1;
        dx43 = x4 - x3;
        dy43 = y4 - y3;
        dx31 = x3 - x1;
        dy31 = y3 - y1;
        determ = dx43*dy21 - dy43*dx21;
        -- Check if the lines are parallel.
        IF (ABS(determ) < accuracy) THEN
            -- Lines are parallel.
            cross = FALSE;
        ELSE
            detinv = 1.0 / determ;
            -- Compute the fractional distance along the line (x1,y1) --> (x2,y2).
            lambda12 = (dx43*dy31 - dy43*dx31)*detinv;
            -- Compute the fractional distance along the line (x3,y3) --> (x4,y4).
            lambda34 = (dx21*dy31 - dy21*dx31)*detinv;
            IF ((lambda12 < 0.0) OR (lambda12 > 1.0) OR
                (lambda34 < 0.0) OR (lambda34 > 1.0)) THEN
                -- Line segments do not cross.
                cross = FALSE;
            ENDIF
            -- Compute the point at which the (infinitely long) lines cross.
            xi = x1 + dx21*lambda12;
            yi = y1 + dy21*lambda12;
        ENDIF
        RETURN cross, xi, yi;
    END

Listing 3: Pseudocode for the algorithm for finding whether two line segments cross and the point at which they do. See Figure 6.8 for meaning of the variables.

6.3 Network spaces

It often happens that we want to represent networks in a GIS. A very common example is that of representing a transport network of some kind. Transport networks, such as the road layout in Birmingham, have a dimension somewhere between 1 and 2, that is they are composed of 1D line segments (roads) joined at nodes (road junctions). The line segments are often directed (one way streets) and to encode the topology and network geometry we need some special data structures. We have already looked at the concept of a graph, which is a useful simplified model of a network.

Figure 6.9: A very simple road network.

Consider the graph in Figure 6.9, which represents the transport network of a small portion of the whole system. In order to represent the topology in the network, we can use a tool called the adjacency matrix. This expresses, in binary form, the connectedness of the graph. By default we define all elements on the diagonal of the matrix to have a connectedness of zero, that is we do not use a 1 to represent the fact that an object is connected to itself. Other entries in the matrix are 1 if the objects are connected and zero if not. For Figure 6.9 this gives the adjacency matrix:

        A  B  C  D
    A   0  1  1  0
    B   1  0  1  1
    C   1  1  0  0
    D   0  0  1  0
Another common form adopted for the adjacency matrix is to replace the boolean 0 / 1 with 0 / the distance
between nodes in the graph (i.e. the length of the connecting edges). For Figure 6.9 this would yield the
adjacency matrix:
        A   B   C   D
    A   0  12   6   0
    B  12   0   5  17
    C   6   5   0   0
    D   0   0   3   0
Using this adjacency matrix you can not only determine whether it is possible to get from one place to another, but also find the shortest route. These types of adjacency matrices are used in Auto-Route type programs which can find the shortest path between two points. Note that the distances can be replaced by travel times, or by typical speeds and distances. If more attributes are to be stored for the links, then pointers (keys) can be used to reference further tables in a separate database which might contain information such as the speed limits on the road segment, number of lanes, type of road surface or typical speeds on the segment (which may vary as a function of time).
One problem with using adjacency matrices is that they are not very efficient storage wise, since if you have n nodes in your graph then your adjacency matrix will have $n^2$ entries. If the network is symmetric (no one-way streets) this can be reduced to $n(n+1)/2$, i.e. still $O(n^2)$. But is there really any need to store all these entries? If you think about a typical transport network the answer should be pretty obviously no. Most transport networks are sparsely connected, that is a typical road junction (node) is directly connected to very few other junctions compared to the number of nodes in the full road network. This means that most elements in the adjacency matrix are zero, that is the matrix itself is sparse. We can therefore use special methods for efficiently storing and processing sparse matrices, which can bring the storage requirements down to O(n), since the connectedness is very sparse indeed in typical road networks. Another method to represent this data structure would be to use an edge list, which for this example would be: AB, AC, BA, BC, BD, CA, CB, DC. For large networks this edge list will be compact.
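The difference between the dense and the sparse representation can be sketched in Python as follows. The edge lengths are taken from the distance form of the adjacency matrix for Figure 6.9; the names `to_matrix` and `to_adjacency` are hypothetical helpers introduced only for this illustration.

```python
NODES = ["A", "B", "C", "D"]

# Directed edge list (from, to, length) for the network in Figure 6.9.
EDGES = [
    ("A", "B", 12), ("A", "C", 6),
    ("B", "A", 12), ("B", "C", 5), ("B", "D", 17),
    ("C", "A", 6), ("C", "B", 5),
    ("D", "C", 3),
]


def to_matrix(nodes, edges):
    """Dense n x n matrix: entry [i][j] is the edge length, 0 if no edge."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for start, end, length in edges:
        matrix[index[start]][index[end]] = length
    return matrix


def to_adjacency(edges):
    """Sparse form: only the edges that exist are stored."""
    adjacency = {}
    for start, end, length in edges:
        adjacency.setdefault(start, {})[end] = length
    return adjacency


matrix = to_matrix(NODES, EDGES)
adjacency = to_adjacency(EDGES)
stored = sum(len(neighbours) for neighbours in adjacency.values())
print(len(NODES) ** 2, "matrix entries versus", stored, "stored edges")
```

For four nodes the saving is small, but the matrix grows with the square of the number of nodes whereas the sparse form grows only with the number of edges.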
So far so good, but having created the adjacency matrix what can we now do? As suggested earlier, these structures can be searched to find things like the shortest path. Typically this is what adjacency matrices are designed to accomplish, for instance in routing freight vehicles using the shortest (or most efficient) path. It is not immediately obvious how to accomplish such a task, particularly on a very large adjacency matrix. For small areas with say 10 nodes humans are very good at estimating the shortest route, but above this number of nodes an automatic method will be needed.

Most algorithms for searching this kind of data structure use either a breadth first or a depth first approach. In both approaches you start from your starting point (!). In a depth first search you then follow one branch of the tree until you get to the desired end node, or a dead end. Note that, starting from a particular node, the ways in which you can reach a desired ending node can be represented as a tree; indeed a tree can be constructed which represents all possible journeys from the starting node. In a breadth first search you explore all the branches at each level of the tree at the same time, thus at each step you take several branches. The tree which shows all possible journeys from node A in Figure 6.9 is shown in Figure 6.10, where we do not allow any loops.

Figure 6.10: A depth first (left) and breadth first (right) traversal of the road network, starting from the A node. Note the depth first traversal is one of many possibilities.

6.3.1 Finding the shortest route from A to B

Neither of these approaches is very good for large networks, so we will look at an alternative strategy for finding the shortest path through a graph (network) called Dijkstra's algorithm. This algorithm uses three vectors, each of length n, where n is the number of nodes in the graph (network). The three vectors are:
➢ distance vector – keeps track of the distance from the start node to the nodes which have so far been explored.
➢ path vector – shows the preceding node for the current best path.
➢ included vector – boolean showing whether nodes have been used in the minimal distance calculation.
The algorithm itself will be easier to understand having attempted an example. Before the first iteration the vectors are initialised as follows:
1. distance vector – zero for the start node, the distance for directly connected nodes, and infinity for nodes not directly connected.
2. path vector – set to the initial node for connected nodes, others empty.
3. included vector – set to true for the initial node only, all others false.
The algorithm proceeds as shown in Listing 4.

    WHILE more than one node is not included
        find a node (say X) whose distance from the start node is smallest
            amongst all those nodes not included.
        mark the node X as included (= true).
        FOR each node Y not included
            IF (X and Y are connected) AND
               (the distance from start node to X plus the distance from X to Y is
                less than the current distance to Y) THEN
                Update distance of the Y node to the distance from
                    start node to X plus the distance from X to Y
                Update path of Y node to X
            ENDIF
        END
    END

Listing 4: Dijkstra's algorithm for finding the shortest path (and lengths of all other paths) through a graph (network).
We now show the application of Dijkstra’s algorithm to the network shown in Figure 6.9 to find the shortest
distance from A to D. We start with the following initialisation:
    node        A     B     C     D
    distance    0    12     6     ∞
    path              A     A     -
    included    1     0     0     0
Now the node with the smallest distance from the start node is C. After one iteration the table becomes:
    node        A     B     C     D
    distance    0    11     6     ∞
    path              C     A     -
    included    1     0     1     0
Note that since node C is not connected to node D, the entry for D remains unchanged. At the next (and final) iteration we find node B has the smallest distance and is not included:

    node        A     B     C     D
    distance    0    11     6    28
    path              C     A     B
    included    1     1     1     0
This would finish the algorithm since only one node is now not included. The shortest path from A to D is 28 units long. The path can be found from the final table by following back the paths: for instance the previous node to D is B, the previous node to B is C, and the previous node to C is A (our start point), thus the traversal is AC, CB, BD. In the lectures we will try to apply this algorithm to a more complicated network.
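The worked example can be reproduced with a short Python sketch of the vector-based algorithm. This is a minimal illustration rather than the course's reference implementation: it starts with no node included, so the start node is picked in the first pass, which yields the same tables one step later. The names `dijkstra`, `route`, `WEIGHTS` and `NAMES` are assumptions, and `route` assumes the end node is reachable.

```python
import math


def dijkstra(weights, start):
    """Shortest distances from `start` using the three vectors in the text.

    weights[x][y] is the length of the edge x -> y, with 0 meaning "no edge".
    """
    n = len(weights)
    distance = [math.inf] * n   # best known distance from the start node
    path = [None] * n           # preceding node on the current best path
    included = [False] * n      # True once a node's distance is final
    distance[start] = 0

    for _ in range(n):
        # Find the not-yet-included node X closest to the start node.
        candidates = [i for i in range(n) if not included[i]]
        x = min(candidates, key=lambda i: distance[i])
        if math.isinf(distance[x]):
            break  # the remaining nodes cannot be reached
        included[x] = True
        # Update every node Y that is reached more cheaply via X.
        for y in range(n):
            if included[y] or weights[x][y] == 0:
                continue
            if distance[x] + weights[x][y] < distance[y]:
                distance[y] = distance[x] + weights[x][y]
                path[y] = x
    return distance, path


def route(path, start, end):
    """Follow the path vector back from `end` to `start`."""
    nodes = [end]
    while nodes[-1] != start:
        nodes.append(path[nodes[-1]])
    return nodes[::-1]


NAMES = "ABCD"
WEIGHTS = [
    [0, 12, 6, 0],   # A
    [12, 0, 5, 17],  # B
    [6, 5, 0, 0],    # C
    [0, 0, 3, 0],    # D
]
distance, path = dijkstra(WEIGHTS, 0)
print(distance[3], [NAMES[i] for i in route(path, 0, 3)])
```

The two nested loops over the n nodes are what give this form of the algorithm its quadratic running time.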
Application of this algorithm will generally have $O(n^2)$ time complexity with $O(n)$ storage requirements. A complete search of the network to find the shortest path will have $O(n^3)$ time complexity, thus Dijkstra's algorithm makes a great deal of difference to the time taken to compute shortest paths in a graph of a network.
A related problem is that of the travelling salesperson, a classic computational problem. The idea is to determine an algorithm which can find the shortest path through a graph (network) such that all nodes are visited. There is as yet no known polynomial time (i.e. $O(n^x)$) solution to this problem. The only known general solution is an exhaustive search (i.e. computing all possible traversals) which generates a vast number of possibilities, even for small networks: the solution has $O(x^n)$ time complexity. This is said to be exponentially complex and is the worst case we are ever likely to come across. Even for small x = 2 (and typically, depending on the connectedness of the network, it will be bigger) a small network of 100 nodes will produce $\sim 2^{100} \approx 1.27 \times 10^{30}$ possible routes. This type of problem is said to be NP-complete, meaning that no polynomial time algorithm is known. If such an algorithm could be found this would also solve a large number of related NP-complete problems. We can however find polynomial time algorithms which almost solve the problem (that is they find a very good route, if not guaranteed the best) in most cases.
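To make the contrast concrete, the sketch below compares an exhaustive search with a simple greedy heuristic (always visit the nearest unvisited node). It assumes nodes given as planar coordinates with straight-line distances and a closed tour returning to the start; these are simplifications chosen for illustration, and the function names are hypothetical. The nearest-neighbour heuristic is only one of many approximate methods and is not guaranteed to find the best route.

```python
import itertools
import math


def tour_length(points, order):
    """Length of the closed tour visiting `points` in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def exhaustive_tour(points):
    """Try every ordering that starts at node 0: (n - 1)! candidates."""
    rest = range(1, len(points))
    candidates = ((0,) + perm for perm in itertools.permutations(rest))
    return min(candidates, key=lambda order: tour_length(points, order))


def nearest_neighbour_tour(points):
    """Greedy heuristic: always move to the closest unvisited node."""
    order = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        here = points[order[-1]]
        nxt = min(sorted(unvisited), key=lambda j: math.dist(here, points[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return tuple(order)


# Hypothetical example: four junctions at the corners of a unit square.
points = [(0, 0), (1, 0), (1, 1), (0, 1)]
best = exhaustive_tour(points)
greedy = nearest_neighbour_tour(points)
print(best, tour_length(points, best))
print(greedy, tour_length(points, greedy))
```

The exhaustive version becomes impractical very quickly as nodes are added, whereas the greedy version needs only of the order of n squared distance evaluations.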
6.4 Spatial fields
The question of representing field based models is at least as complicated as that for vector objects. With fields
we have an additional complication – the subject of interest is assumed to exist at every point in (a predefined)
space. Thus there is no need to worry about topology here – it is implicit in the data. But we have finite
resources (storage and processing) so how can we store something that represents a value at every location?
The answer is to sample the continuous process at a number of points and try to ensure that the points we
retain are sufficient to represent the continuous process.
The assumption will be that these points are somehow characteristic of their neighbourhood (indeed in many cases they will represent an average value over that neighbourhood). So the question is how shall we choose these points? There are two commonly used methods based on regular and irregular tessellations of the 2D plane. A regular tessellation is one that repeats over and over, and thus there is no need to store the whole tessellation, just the start point and the spacing (see Figure 6.11). An irregular tessellation changes across the domain and is more expensive to store because we need to store each vertex's coordinates. There are benefits to an irregular tessellation, which can change its sampling pattern to better sample those areas of the domain which have a greater variability, as illustrated in the south-western corner of the irregular tessellation in Figure 6.11.

Figure 6.11: Regular and irregular point patterns.
6.4.1 Regular tessellation representations
There are only 3 possible regular tessellations of the 2D Euclidean plane, based on triangles, squares, and
hexagons. Of these the only tessellation that is commonly used in GIS is that based on squares, which yields
the raster data structure. This is used for several reasons. First, and probably most importantly, it fits very
nicely with traditional, widely used, data structures based on arrays. It also corresponds to the areas sampled
by remote sensing (the instrument usually samples a square pixel) and the grid system used in most map
projections. However for a large range of variables, which are sampled irregularly we need to use some form of
interpolation to transfer the irregular samples to estimates on a regular grid. We will return to this question
when we look at sources of raster data.
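The point that a regular square tessellation needs only a start point and a spacing can be illustrated with a small Python sketch. The `Raster` class, its field names and the choice of the lower-left corner as origin (with row 0 at the bottom) are assumptions made for this example; real raster formats differ in their conventions, and no bounds checking is done here.

```python
from dataclasses import dataclass


@dataclass
class Raster:
    x0: float        # x coordinate of the lower-left corner (assumed origin)
    y0: float        # y coordinate of the lower-left corner
    spacing: float   # side length of each square cell
    values: list     # values[row][col], with row 0 at the bottom

    def cell_of(self, x, y):
        """Row and column of the cell containing the point (x, y)."""
        col = int((x - self.x0) // self.spacing)
        row = int((y - self.y0) // self.spacing)
        return row, col

    def value_at(self, x, y):
        """Value stored for the cell containing (x, y)."""
        row, col = self.cell_of(x, y)
        return self.values[row][col]


# A 2 x 3 grid of 10 m cells whose lower-left corner is at (100, 200).
grid = Raster(x0=100.0, y0=200.0, spacing=10.0, values=[[1, 2, 3], [4, 5, 6]])
print(grid.cell_of(125.0, 212.0), grid.value_at(125.0, 212.0))
```

No cell coordinates are stored at all: the position of every cell follows from the origin, the spacing and the array indices.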
6.4.2 Irregular tessellation representations

The most commonly used irregular representation is the Triangulated Irregular Network (TIN). The TIN can also be used as a method for interpolating the values at observation points to grid points or to unsampled locations. This is illustrated in Figure 6.12, where we know the locations ($p_1 = (x_1, y_1)$, $p_2 = (x_2, y_2)$, $p_3 = (x_3, y_3)$) of the vertices of a facet of a TIN, as well as the values of the variable at those locations (which we can represent as a height). We want to find the value (height) at the point $p = (x_p, y_p)$.

Figure 6.12: Using a TIN to interpolate a variable.

The three vertices define a plane given by the equation:

$$\alpha x + \beta y + \gamma z + \delta = 0. \tag{6.9}$$

Knowing the coefficients $(\alpha, \beta, \gamma, \delta)$ enables us to write:

$$z_p = \frac{-(\alpha x_p + \beta y_p + \delta)}{\gamma}. \tag{6.10}$$
Here we have assumed a linear function over the triangle facet. We can determine the coefficients using simple
geometry – the algorithm is shown in Listing 5 and is based on finding the determinant of a 3 by 3 matrix (see
for instance (Bowyer and Woodwark, 1983, p115)).
    PROCEDURE compute_plane(x1,y1,z1,x2,y2,z2,x3,y3,z3)
    -- Compute the coefficients (a,b,c,d) of the plane a*x + b*y + c*z + d = 0
    -- defined by the points (x1,y1,z1), (x2,y2,z2), (x3,y3,z3).
    BEGIN
        dx12 = x1 - x2;
        dy12 = y1 - y2;
        dz12 = z1 - z2;
        dx32 = x3 - x2;
        dy32 = y3 - y2;
        dz32 = z3 - z2;
        -- Compute the coefficients.
        a = dy12*dz32 - dz12*dy32;
        b = dz12*dx32 - dx12*dz32;
        c = dx12*dy32 - dy12*dx32;
        d = -(a*x1 + b*y1 + c*z1);
        RETURN a,b,c,d;
    END
Listing 5: Pseudocode for finding the equation of a plane given three points.

Figure 6.14: An example of a TIN used for a DEM.
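Combining the plane coefficients from Listing 5 with equation (6.10) gives a complete interpolation step for one TIN facet. The Python sketch below is a direct transcription under the assumption that each vertex is supplied as an (x, y, z) tuple; the function names and the check for a vertical facet are additions for illustration.

```python
def compute_plane(p1, p2, p3):
    """Coefficients (a, b, c, d) of the plane a*x + b*y + c*z + d = 0."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p1, p2, p3
    dx12, dy12, dz12 = x1 - x2, y1 - y2, z1 - z2
    dx32, dy32, dz32 = x3 - x2, y3 - y2, z3 - z2
    a = dy12 * dz32 - dz12 * dy32
    b = dz12 * dx32 - dx12 * dz32
    c = dx12 * dy32 - dy12 * dx32
    d = -(a * x1 + b * y1 + c * z1)
    return a, b, c, d


def interpolate_on_facet(p1, p2, p3, xp, yp):
    """Height z_p at (xp, yp) on the plane through p1, p2, p3 (equation 6.10)."""
    a, b, c, d = compute_plane(p1, p2, p3)
    if c == 0:
        # The vertices are collinear in plan view, so z is not a function of x, y.
        raise ValueError("facet is vertical or degenerate")
    return -(a * xp + b * yp + d) / c


# Hypothetical facet lying on the plane z = 2x + 3y + 1.
p1, p2, p3 = (0.0, 0.0, 1.0), (1.0, 0.0, 3.0), (0.0, 1.0, 4.0)
print(interpolate_on_facet(p1, p2, p3, 0.25, 0.25))
```

In practice one would first locate the facet containing the query point, for example with the signed area test, before interpolating.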
Given a set of points, there will in general be very many possible triangulations. So which one should we choose? The answer depends on what properties we would like to ensure are contained in our TIN. A very popular method is to constrain our triangles to be as 'equilateral as possible'. This leads us to the Delaunay triangulation, which is also associated with Voronoi polygons.
6.4.3 Delaunay Triangulation

The simplest way to define a Delaunay triangulation is by the construction of its dual, a Voronoi tessellation. If we consider each point at which we have an observation, then the neighbourhood which is closer to that point than to any other point in the region is its Voronoi polygon. We can then form the Delaunay triangulation by joining together all nodes which are in adjacent polygons, as well as the convex hull of the point set. Note that to form the Voronoi polygons we need to ensure that no three adjacent points are collinear. The Delaunay triangulation of a set of points P has the properties that:
➢ it is unique (that is well determined);
➢ its external edges form the convex hull of P;
➢ the triangles are as close to equilateral as possible.
So what use might the Delaunay triangulations and Voronoi polygons be? Imagine the case that you are designing a GIS for monitoring crop water requirements (e.g. an agricultural application) and we have a series of point measurements of rainfall from rain gauges. One method we could use to determine the rainfall at an unmeasured location would be to ascribe that location the value of the nearest neighbour. This is equivalent to defining the value inside the Voronoi polygon to be the value of the central point, producing something of a Giant's Causeway effect.
The next question is how to create Voronoi polygons (or a Delaunay triangulation, which is equivalent since each is the dual of the other). There are several solutions (some of which are quite involved) but in general the algorithm is $O(n \log n)$ for the worst case. Several sources of algorithms are outlined in Worboys (1995, p 227) but here we will consider a constructive approach which is not optimal in terms of its time complexity but is relatively simple to understand. This is referred to as the incremental method (providing an online algorithm). We start by selecting two random points from the point-set P, and we find the dividing line between them (as shown in Figure 6.13). We then select another point from P and compute the new dividing lines and add them to the representation. We repeat this iteratively until all points have been considered. The Delaunay triangulation can then be computed from the Voronoi polygons as shown in Figure 6.15.

Figure 6.13: Constructive definition of Voronoi polygons.

Figure 6.15: The derivation of the Delaunay triangulation from the Voronoi polygons.

Delaunay triangulations are often used for modelling terrain, by assigning each vertex a z value as discussed above. Given a set of survey points with z values we could then define the surface as illustrated in Figure 6.14.
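Returning to the rain gauge example, assigning each location the value of its nearest gauge does not require the Voronoi polygons to be built explicitly. The brute force Python sketch below simply searches all gauges for each query, which is O(n) per query; the gauge coordinates and rainfall values are invented for illustration.

```python
import math


def nearest_gauge_value(gauges, x, y):
    """Rainfall at (x, y) taken from the nearest gauge.

    gauges is a list of (x, y, rainfall) tuples. Every location inside a
    gauge's Voronoi polygon receives that gauge's value.
    """
    _, _, rainfall = min(gauges, key=lambda g: math.hypot(g[0] - x, g[1] - y))
    return rainfall


# Hypothetical gauges: (x, y, rainfall in mm).
gauges = [(0.0, 0.0, 5.0), (10.0, 0.0, 8.0), (0.0, 10.0, 2.0)]
print(nearest_gauge_value(gauges, 1.0, 1.0))
print(nearest_gauge_value(gauges, 9.0, 2.0))
```

The resulting surface is piecewise constant, with a jump at every Voronoi polygon boundary, which is the Giant's Causeway effect mentioned above.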
We could spend a great deal more time looking at issues surrounding the computational algorithms for geometric
data, however we now move on to considering data access issues. The interested reader will find further details
in texts on computational geometry such as (Bowyer and Woodwark, 1983) and (Laszlo, 1996).
Bowyer, A. and Woodwark, J. 1983. A Programmer's Geometry. London: Butterworths.
Laszlo, M. J. 1996. Computational Geometry and Computer Graphics in C++. London: Prentice Hall.
7 The Internet and GIS
Changes to the way companies operate mean that web based systems are becoming central to most organisations' functioning. This is also true for GIS, and a small scan of the most popular GIS vendors shows that all of them are offering some sort of web based functionality, generally in the form of some variety of map server. A map server allows users to specify a location and the server software will draw a map of that location or area, which is often exported to the user as a bit-mapped image (that is the work of rendering the map is done server side). This is a very simple example of serving spatial information over the web, but the result is static on the client side and cannot be re-used. Something with far more radical implications is starting to be adopted, called GML. GML is based on the XML specification and is an attempt to provide web friendly spatial data using a widely accepted XML schema. We will look at GML in more depth later. The key idea is that once everyone is using an agreed standard for describing their spatial data, this can readily be integrated over the web, through the use of XML technologies, specifically XLink and XPointer. The OpenGIS Consortium (OGC), which is the W3C of GIS, is developing a number of GIS standards for interoperability (the ability to share data and methods across different platforms / software / data sources). These standards include interface standards, but probably most significantly GML. The web site can be found at: http://www.opengis.org/ and contains many different specifications.
7.1 XML, HTML and the future of the Internet
HTML is a mark-up language that describes both the style (with Cascading Style Sheets – CSS) and the content
of web pages. This is generally not a good idea. The content and its rendering should be independent. For
instance it would be very useful to be able to render a series of locations as a list in a table or as points on
a map, depending on the user requirements. With HTML this is not possible. XML, on the other hand, is
a mark-up language for describing data. XML stands for eXtensible Mark-up Language – that is the user is
at liberty to define their own data mark-up language. The XML standard specifies how to define your data
mark-up, that is what is legal and what isn’t. XML is a large and complex technology, with many off-shoots.
It is moderated by the W3C (http://www.w3.org/) who ensure standards across the Internet. I will give a
simple, short introduction to XML, and then we will look at GML.
8 XML – eXtensible Mark-up Language
We will explore XML through a very simple example. My aim here is not to teach you how to create and use
XML, but to give you an understanding of the technology and its role. Suppose we wish to have a web page
which displays data about where a series of people live in the UK. In HTML this would have to be a static
document, possibly using the <table> HTML element. Indeed the HTML might look something like Listing 1.
This static HTML is fine, but what if we wanted to show this as a map? Or what if we wanted to make the table
dynamic so that we could search for and display people with certain names only? The traditional approach to
this problem has been to put the data into a database (maybe even a GIS) and then extract the data using
HTML forms and some scripting language (CGI / PERL / Javascript) which generates appropriate (SQL)
queries to the database (which has to be run on some server). The database then returns the results and your
scripting language converts these to HTML. This requires servers, databases and a number of communication
protocols. The basic idea of XML is that data should be stored in an easily portable, easily readable manner
(textual) and should be independent of the methods used to display it.
If we were to write some XML to store the data in Listing 1 it might look something like Listing 2. This is not
a particularly sensible design for data about location, particularly the address field, which really should be in
separate fields, but this is an example. The XML file now contains the data, together with tags which name the
data, for instance the <forename> element. Note in XML a closing tag is always necessary, i.e. </forename>.
The first line of the file specifies that this is XML and which version. We then specify the stylesheet that is to be used, in this case a style sheet called tableperson.xsl. The style sheet defines how the XML is rendered; in this case as HTML. XSLT (eXtensible Stylesheet Language Transformations) provides a method for describing the transformations that need to be applied to the XML data to produce the desired output. Often, this will produce HTML output, although there is no need for this to be the case: any output style can be allowed, as we shall see later.
<html>
  <head>
    <META http-equiv="Content-Type" content="text/html">
    <title>Where do they all live?</title>
  </head>
  <body>
    <h1> Where do these people live? </h1>
    <table border="2px">
      <tr>
        <th colspan="2">Name</th>
        <th>Address</th>
        <th>Postcode</th>
        <th colspan="2">Location</th>
      </tr>
      <tr>
        <td>Joe</td>
        <td>Bloggs</td>
        <td>Aston University, Birmingham</td>
        <td>B4 7ET</td>
        <td>49.2</td>
        <td>-3.1</td>
      </tr>
      <tr>
        <td>Jim</td>
        <td>Blaggs</td>
        <td>Aberdeen University, Aberdeen</td>
        <td>AB1 5ZX</td>
        <td>56.4</td>
        <td>-1.1</td>
      </tr>
    </table>
  </body>
</html>
Listing 1: HTML for displaying the location data for some people.
The style sheet to produce a tabular data display in HTML (which actually produces the HTML shown in
Listing 1) is shown in Listing 3. This XSLT file is also XML, and this time uses namespaces extensively (note
the URL for the namespace does not actually exist, it is simply a unique identifier). Namespaces are important
in XML. They are reminiscent of object oriented methods. The idea of namespaces is key for the extensible
nature of XML. The use of namespaces allows people to have multiple definitions of the same object, which
has advantages and disadvantages. The advantage is that a user can use their own preferred syntax and not
have to worry that another XML schema also uses the element <tree> to describe a tree data structure, when
the user wants to represent a physical tree. The disadvantage is that the user can then define and use any
element names, so my <street> may be your <road>. This causes problems with interoperability (the posh
word for sharing information, data and methods). To address this issue requires bodies (such as W3C) to lay
down standards. Before we look at an XML standard for spatial data, I should briefly mention schemas.
XML schemas (XSD) replace what used to be called DTDs (Document Type Definitions). Schemas, much like
database schemas describe the elements and their types for an XML document. Schemas define what elements
are permissible within XML documents, their characteristics, and any restrictions. We will not be covering
XML schemas in great depth, however the schema for the simple XML example presented in Listing 2 is shown
in Listing 4. The schema is pretty self explanatory, defining the different elements in the XML document. This
is a very simple schema, but illustrates the building up of complex types of elements as sequences of simple base
types (e.g. strings, decimals, integers, booleans). It also shows how schemas can be used to require certain fields
(all in this example) and put upper limits on the number of instances (again here they are all 1). By defining
a schema (hopefully with a little more documentation than I have given in this example) XML developers can
build in the interoperability features which are one of the main goals of using XML and web based processing.
Of course it helps greatly if people stick to some widely recognised schema appropriate for the data they are using, extending this (which is ultimately what XML is about) only when they require elements so specialised that others are not already using them.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
  href="http://www.seas.aston.ac.uk/~cornford/xml/tableperson.xsl"?>
<locationList xmlns="http://seas.aston.ac.uk/namespaces/Location"
  xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
  xsi:schemaLocation= "http://www.seas.aston.ac.uk/~cornford/xml/person.xsd">
  <person>
    <forename>Joe</forename>
    <surname>Bloggs</surname>
    <address>Aston University, Birmingham</address>
    <postcode>B4 7ET</postcode>
    <latitude>49.2</latitude>
    <longitude>-3.1</longitude>
  </person>
  <person>
    <forename>Jim</forename>
    <surname>Blaggs</surname>
    <address>Aberdeen University, Aberdeen</address>
    <postcode>AB1 5ZX</postcode>
    <latitude>56.4</latitude>
    <longitude>-1.1</longitude>
  </person>
</locationList>

Listing 2: XML for containing the location data.
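Because the data are now separate from their presentation, any program can read them. As a hedged illustration, the Python sketch below uses the standard library `xml.etree.ElementTree` module to extract the people from a cut-down copy of Listing 2 (the stylesheet and schema references are omitted for brevity). The prefix `loc` is an arbitrary local alias for the document's default namespace.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of Listing 2, keeping only the namespace and the data.
XML_TEXT = """<locationList xmlns="http://seas.aston.ac.uk/namespaces/Location">
  <person>
    <forename>Joe</forename>
    <surname>Bloggs</surname>
    <latitude>49.2</latitude>
    <longitude>-3.1</longitude>
  </person>
  <person>
    <forename>Jim</forename>
    <surname>Blaggs</surname>
    <latitude>56.4</latitude>
    <longitude>-1.1</longitude>
  </person>
</locationList>"""

# Map a local prefix onto the namespace used by the elements.
NS = {"loc": "http://seas.aston.ac.uk/namespaces/Location"}

root = ET.fromstring(XML_TEXT)
people = [
    (
        person.findtext("loc:forename", namespaces=NS),
        person.findtext("loc:surname", namespaces=NS),
        float(person.findtext("loc:latitude", namespaces=NS)),
        float(person.findtext("loc:longitude", namespaces=NS)),
    )
    for person in root.findall("loc:person", NS)
]
print(people)
```

The same list of tuples could then be rendered as a table, filtered by name, or plotted on a map without touching the XML file.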
Note that the aim of this introduction is to give you a brief overview of the use of XML, rather than a complete
tutorial. Also it must be observed that XML is a rapidly developing technology, which is now relatively stable,
but still likely to evolve over the coming years. Thus a recent textbook would be recommended if you are
intending to use XML widely in your work. I have yet to find what I consider a good XML guide, thus I do not
give a suggestion. Check out the bookshops as near as possible to the time you wish to implement XML, so
that you have the most current resource. Alternatively there are a number of web pages which have tutorials
and guides.
8.1 SVG – Scalable Vector Graphics
When generating maps, or even simply visualising 2D data in some other way, it is necessary to use some sort of graphical representation. Traditionally the Internet has relied on two raster (bitmap) formats: GIF and JPEG. These are two alternative compression formats for standard bit-mapped images, which is fine for raster data, but for vector data these will be static, possibly large files which will not allow the best use of the data. Currently there is no standard method for displaying 2D vector graphics over the Internet. Microsoft has launched Vector Mark-up Language (VML) but this is only supported by Internet Explorer, thus is rather limited in its adoption. In the last year W3C has adopted SVG as a standard (January 2000), and Adobe has developed a nice plug-in (for almost all systems) that can render SVG. SVG can now be embedded in an HTML document. SVG is rapidly becoming a worldwide standard, and thus is likely to be the language of choice for rendering maps over the Internet (although XML can be translated into any form desired).
I will not be covering SVG in any depth, suffice to say that the commands in SVG to draw simple 2D features
are very simple to use, and link in a very logical way with the features we might want to represent when using
spatial data. Thus the translation from XML to SVG is conceptually very simple for spatial data.
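To give a flavour of how direct this translation is, the Python sketch below turns a list of labelled latitude and longitude pairs into an SVG document with one `<circle>` element per point. The linear scaling of coordinates to the drawing area is a deliberately crude assumption (no map projection is applied), and the function name and drawing size are hypothetical.

```python
from xml.sax.saxutils import escape


def points_to_svg(points, width=300, height=300):
    """Render (label, latitude, longitude) tuples as circles in an SVG string."""
    lats = [lat for _, lat, _ in points]
    lons = [lon for _, _, lon in points]
    lat_min, lat_max = min(lats), max(lats)
    lon_min, lon_max = min(lons), max(lons)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for label, lat, lon in points:
        # Longitude increases to the right; latitude increases upwards,
        # whereas the SVG y coordinate increases downwards.
        x = (lon - lon_min) / ((lon_max - lon_min) or 1.0) * width
        y = (lat_max - lat) / ((lat_max - lat_min) or 1.0) * height
        parts.append(
            f'  <circle cx="{x:.1f}" cy="{y:.1f}" r="4">'
            f"<title>{escape(label)}</title></circle>"
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = points_to_svg([("Joe Bloggs", 49.2, -3.1), ("Jim Blaggs", 56.4, -1.1)])
print(svg)
```

Each spatial feature maps onto one drawing element, which is why a stylesheet that produces SVG from location data can be very short.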
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/REC-html40"
version="1.0"
xmlns:loc="http://seas.aston.ac.uk/namespaces/Location">
<xsl:output method="html"/>
<!-- for MS IE only <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl"> -->
<xsl:template match="/">
<html>
<head>
<title>Where do they all live?</title>
</head>
<body>
<h1> Where do these people live? </h1>
<xsl:apply-templates select="loc:locationList" />
<hr />
</body>
</html>
</xsl:template>
<xsl:template match="loc:locationList">
<table border="2px">
<tr>
<th colspan="2">Name</th>
<th>Address</th>
<th>Postcode</th>
<th colspan="2">Location</th>
</tr>
<xsl:apply-templates select="loc:person" />
</table>
</xsl:template>
<xsl:template match="loc:person">
<tr>
<td><xsl:value-of select="loc:forename" /></td>
<td><xsl:value-of select="loc:surname" /></td>
<td><xsl:value-of select="loc:address" /></td>
<td><xsl:value-of select="loc:postcode" /></td>
<td><xsl:value-of select="loc:latitude" /></td>
<td><xsl:value-of select="loc:longitude" /></td>
</tr>
</xsl:template>
</xsl:stylesheet>
Listing 3: XSLT for producing a HTML based tabular display of the XML data.
SVG is an XML based standard for 2D graphics; however, in many GIS applications 2.5D or 3D data will be used
to produce more realistic output (for instance fly-throughs to show the effect of a new development on the
surrounding area). To render this sort of image over the web requires a 3D graphics mark-up language, and
the most likely to be adopted as a standard is eXtensible 3-D (X3D). X3D is the XML incarnation of VRML,
the Virtual Reality Modeling Language. This is very much a standard in development but is likely to gain W3C
approval.
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://seas.aston.ac.uk/namespaces/Location"
elementFormDefault="qualified"
xmlns:loc="http://seas.aston.ac.uk/namespaces/Location">
<complexType name="locationListType">
<sequence>
<element name="person" minOccurs="0" maxOccurs="unbounded" type="loc:personType"/>
</sequence>
</complexType>
<element name="locationList" type="loc:locationListType" />
<complexType name="fullnameType">
<sequence>
<element name="forename" minOccurs="1" maxOccurs="1" type="string"/>
<element name="surname" minOccurs="1" maxOccurs="1" type="string"/>
</sequence>
</complexType>
<complexType name="locationType">
<sequence>
<element name="latitude" minOccurs="1" maxOccurs="1" type="decimal"/>
<element name="longitude" minOccurs="1" maxOccurs="1" type="decimal"/>
</sequence>
</complexType>
<complexType name="personType">
<sequence>
<element name="fullname" minOccurs="1" maxOccurs="1" type="loc:fullnameType" />
<element name="address" minOccurs="1" maxOccurs="1" type="string"/>
<element name="postcode" minOccurs="1" maxOccurs="1" type="string"/>
<element name="location" minOccurs="1" maxOccurs="1" type="loc:locationType" />
</sequence>
</complexType>
</schema>
Listing 4: XSD (person.xsd) for the XML data in Listing 2.
9
GML– Geography Mark-up Language
GML is an XML schema designed to represent geographical information. Version 2.0, adopted in March 2001,
defined an XML schema for simple geographic features (we will come back to what this means).
Version 1.0 of GML was based on DTDs, and is now obsolete. In January 2003, the much extended GML (3.0)
was released, which was upgraded to a proposal for GML (3.1) in April 2004. The aim of GML is to provide a
common XML based framework (schema) for representing geographic data. This is important because it will
allow people to develop the interoperability of their software (easy exchange of data in GML and methods in
other initiatives) and have easy access to data over the web. The driving force behind GML is the Open GIS
Consortium (OGC), and we will look at their role next.
9.1
OGC– the Open GIS Consortium
The OGC is a self appointed (but open!) consortium of GIS industry companies, public bodies involved in
mapping, academic institutions and other interested parties. It is based on a similar organisational model to
W3C, being largely web based and putting out position papers for open review and amendment. The main web
page for the OGC is http://www.opengis.org/. This provides access to all OGC recommendations, proposals
and future development plans. A great strength of the OGC is the open nature of this forum. Anyone can
contribute, read and use their standards. This is very unlike the GIS industry of only a few years ago, where
companies had their own proprietary standards, particularly file formats.
Probably the most informative thing to do is examine the OGC mission and vision statements:
Our Vision:
Our vision is a world in which everyone benefits from geographic information and services made
available across any network, application, or platform.
Approximately 80% of business and government information has some reference to location, but until
recently the power of geographic or spatial information and location has been underutilised as a vital
resource for improving economic productivity, decision-making, and delivery of services. We are an
increasingly distributed and mobile society. Our technologies, services, and information resources must
be able to leverage location (i.e., my geographic position right now) and the spatial information that helps
us visualise and analyse situations geographically.
Products and services that conform to OGC open interface specifications enable users to freely exchange
and apply spatial information, applications and services across networks, different platforms and
products.
Our Mission:
Our core mission is to deliver spatial interface specifications that are openly available for global use.
Open interface specifications enable content providers, application developers and integrators to focus
on delivering more capable products and services to consumers in less time, at less cost, and with more
flexibility.
Our Approach:
Organise Interoperability Projects: OGC employs a variety of innovative techniques to enable developers
and integrators to rapidly and efficiently test, validate, and document specifications based on user
requirements.
Work toward consensus: We work by consensus to understand interface requirements and to bring
developers, integrators and users into agreement on specifications.
Formalise OGC Specifications: Through OGC’s structured Committee programs, OGC members develop,
review, and make public OpenGIS specifications.
Develop strategic business opportunities: We continuously scan the market to anticipate, identify, and
engage communities in the development and adoption of OpenGIS specifications.
Develop strategic standards partnerships: OGC maintains strong partnerships with international and
commercial standards organisations and technology communities to focus the agenda for interoperability.
Promote demand for interoperable products: Through our marketing and public relations programs,
we work with our members to increase awareness and acceptance of OpenGIS Specifications by users.
Our Core Values - We are committed to:
➢ Global Community - meeting the spatial technology interoperability needs of the global community.
➢ Innovation - delivering programs to rapidly develop interfaces to meet the realities of changing
technology.
➢ Efficiency, Timeliness, and Quality - meeting market needs at the right time, lowest possible
cost, and highest level of utility.
➢ Integrity - working by consensus to agree on interfaces while respecting and protecting the
intellectual property of our members.
➢ Leadership - maintaining spatial technology leadership in the global standards community.
(from http://www.opengis.org/)
While this vision is rather all encompassing, with a lot of corporate speak, the core aim is very democratic,
embodying the best aspects of the Internet and its effect on society. The work of the OGC is vital to the opening
up of standards in spatial data handling, modelling and display. The OGC is working across a very large range
of technologies and ideas. GML is merely one strand to their work.
9.1.1
Interoperability
One of the key aims of the OGC is to enable interoperability. OGC recognise several aspects to their studies:
➢ Feasibility Studies. Research efforts directed at understanding emerging technology areas.
➢ Testbeds. Collaborative, applied research and development efforts to develop, architect and test candidate specifications addressing Sponsor requirements.
➢ Planning Studies. Strategic studies that assess opportunities to expand and sustain an organisation’s
interoperability capacity.
➢ Pilot Projects. Collaborative testing efforts that apply technology implementing OGC specifications
to the real world.
➢ Technology Insertion Projects. Collaborative projects focusing on expanding an organisation’s
interoperability capacity by laying the infrastructure (groundwork) for open implementations.
The key issue here is to get vendors (data suppliers, software suppliers, consultants) and users to recognise and
use a common set of procedures at various levels. The most obvious of these is to use common representations
for data, which is the role of GML.
9.1.2
OpenGIS Specifications
In order to realise the aims of the OGC a series of specifications have been issued by the OGC. These are called
OpenGIS specifications and can be grouped into:
➢ OpenGIS Abstract Specification. The OpenGIS Abstract Specification provides the framework or
reference model for OpenGIS Implementation Specifications. This high-level guide is very useful for
those looking to be involved in the technical development of OpenGIS Specifications.
➢ OpenGIS Implementation Specifications. These documents detail the agreed upon interfaces that
OGC develops through its consensus process. These are software engineering specifications - any software
developer can use this information to build a product that implements one or more of these specifications.
The software should then be able to communicate with any other software that implements the same
specification(s).
➢ Recommendation Papers. A Recommendation Paper presents an official OGC position on a given
topic, such as a draft Implementation Specification. OGC Recommendation Papers are published for
public review and comment and they may result in either Approved Abstract Specifications or Approved
Implementation Specifications.
➢ Conformant Products. OGC distinguishes between products that claim to implement OpenGIS
Specifications and products that have been tested to be conformant to the OpenGIS Specifications
using one of the OGC Test Suites.
These specifications are key to developing interoperability. I do not want to reproduce the OGC web site here, so
the user is referred to the OGC web page for an in depth discussion of the OGC specifications. Some, however,
are rather key to what we will discuss next and will be mentioned. Some of the most relevant specifications are:
➢ OpenGIS Simple Features Specification. This is a model for simple vector based features (point
/ line / polygon), spatial referencing (common projections) and feature attributes (e.g. road
surface type / name).
➢ OpenGIS Grid Coverage Specification. This deals with handling satellite images, aerial photographs, digital elevation data and general raster data. The specification includes how to handle basic
operations.
➢ OpenGIS Coordinate Transformation Services Specification. This provides methods for specifying projections / coordinate systems to be used with the previous two specifications.
➢ OpenGIS Web Map Server Interface (WMS 1.1.0) Specification. The OpenGIS Web Map
Server Specification (WMS) is a set of interface specifications that provide uniform access by Web
clients to maps rendered by map servers on the Internet. Thus, WMS is a service interface specification
that: enables the dynamic construction of a map as a picture, as a series of graphical elements, or as a
packaged set of geographic feature data; answers basic queries about the content of the map; can inform
other programs about the maps it can produce and which of those can be queried further. The Web Map
Server Interface provides four protocols (GetCapabilities, GetMap, GetFeatureInfo and DescribeLayer)
in support of the creation and display of registered and superimposed map-like views of information that
come simultaneously from multiple sources that are both remote and heterogeneous.
➢ OpenGIS Catalogue Services Specification. The specification is specifically designed to allow data
fusion. It is based on an ISO standard (TC211) and provides a service for discovering what data is
available and where. The specification allows for services which know of all existing data catalogues,
and these can be queried using SQL or geographical queries. Maintenance of such catalogues, although
tedious, is vital to the efficient use of any data, but this is particularly true for GIS data because the
volumes are so huge, and sources so diverse.
➢ OpenGIS Geography Mark-up Language (GML 2.0, 3.0, 3.1). This will be explored next.
The key issue with all these specifications is that they are public domain. Any software vendor can implement
these specifications if they have the required skills, and thus the exchange of data (and information) will be greatly
facilitated. It is quite possible that the adoption of open standards by the GIS industry will be the biggest
single change the industry has seen and will revolutionise the use of spatial data.
9.2
GML
We will start with the most simple GML 2.0 specification, which will provide a basis for the extension to
GML 3.x later. GML (2.0) is essentially a series of three XML schema. It implements the OpenGIS Simple
Features Specification in XML. Thus the geometry is 2D and only basic objects such as points, (poly-)lines
and polygons are implemented. There is no topological information stored, however it is possible to have more
complex features (called feature collections) which are composed of many polygons, lines or points. The full
implementation specification is 74 pages long, and we will not be covering it, rather we will concentrate on the
important points. The key idea, much as with XML is to separate presentation and content. GML is a content
description language.
9.2.1
The composition of GML
[Diagram: the Feature, Geometry and Xlink schemas, connected by <<include>> and <<import>> relationships.]
Figure 9.1: How GML fits together.
GML is made up of three elements:
➢ feature schema;
➢ geometry schema;
➢ Xlink schema.
The Xlink schema is a W3C recommendation for linking elements in XML and allows cross referencing of data,
via a URL. This allows us to model associations between GML objects by reference, rather than by value. The
other two schema are specific to GML.
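To make the idea of association by reference concrete, the following Python sketch resolves xlink:href pointers of the form #identifier against other elements in the same document. The element names (Model, Town, Road, destination) and the helper resolve_references are hypothetical and chosen only for illustration; they are not taken from the GML schemas.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical instance: the road names its destination town by reference,
# rather than nesting a copy of the town inside the road.
DOC = """
<Model xmlns:xlink="http://www.w3.org/1999/xlink">
  <Town fid="T1"><name>Birmingham</name></Town>
  <Road fid="R1">
    <name>A38</name>
    <destination xlink:href="#T1"/>
  </Road>
</Model>
"""

def resolve_references(root):
    """Return (property element, referenced element) pairs for local xlink:href values."""
    by_id = {el.get("fid"): el for el in root.iter() if el.get("fid")}
    pairs = []
    for el in root.iter():
        href = el.get(f"{{{XLINK}}}href")
        if href and href.startswith("#"):
            # Only same-document references are handled in this sketch.
            pairs.append((el, by_id.get(href[1:])))
    return pairs

for prop, target in resolve_references(ET.fromstring(DOC)):
    print(prop.tag, "->", target.tag, target.get("fid"))
```

A remote reference (a full URL rather than #identifier) would need the target document to be fetched first, which is deliberately left out here.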
The geometry schema, as might be expected, deals with encoding the geometry of the objects which are to
be represented. For the time being this is restricted to points, poly-lines and polygons, in a particular spatial
reference system. The features schema is used to describe geographical features, and each feature can have one
or more geometry objects (for instance a house may have a point to identify it for some purposes, but a polygon
for others). The links between the elements of GML are depicted in Figure 9.1.
An example of the way that features and geometries are combined in practice is shown in Figure 9.2. This shows
a new class road being defined (this is a specialisation of feature) which has attributes name and surfaceType,
as well as a destination (which is itself a feature: a town) and a centerLine which is a LineString which is a
specialisation of the geometry class. This is characteristic of the way that GML is supposed to be implemented.
[UML diagram: Road specialises Feature and has attributes name : string and surface : int::surfaceCode, a
destination association to Town and a centerLine association to LineString; LineString specialises Geometry.]
Figure 9.2: A partial example of GML (shown using a UML static structure diagram) as applied to represent a road.
9.2.2
GML and geometry
In GML (2.0) the geometry that can be represented is restricted to points, poly-lines and polygons. These must
be defined in some specific spatial reference system:
<Point gid="P1" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<coord><X>56.1</X><Y>0.45</Y></coord>
</Point>
The above code shows an example of a point stored in GML. Notice that the srsName is used to
define the spatial reference system (in this case, as will generally be the case, in terms of a URL), which is
required for all geometries. The point is also given an identifier gid, although this is not required.
The coordinates are then given as decimals in the relevant spatial reference system.
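As a minimal sketch of how an application might read such a point, the Python fragment below uses the standard library XML parser. The function name read_point is hypothetical, and the code assumes the un-prefixed element names used in these notes and a srsName of the form ...epsg.xml#CODE.

```python
import xml.etree.ElementTree as ET

POINT = """
<Point gid="P1" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
  <coord><X>56.1</X><Y>0.45</Y></coord>
</Point>
"""

def read_point(xml_text):
    """Return (gid, EPSG code, x, y) for a stand-alone GML 2 <Point>."""
    pt = ET.fromstring(xml_text)
    srs = pt.get("srsName")
    if srs is None:
        # In this sketch a stand-alone geometry must name its reference system.
        raise ValueError("missing srsName")
    epsg = int(srs.rsplit("#", 1)[1])  # assumes the '...epsg.xml#CODE' form
    x = float(pt.findtext("coord/X"))
    y = float(pt.findtext("coord/Y"))
    return pt.get("gid"), epsg, x, y  # gid is None when the attribute is absent

print(read_point(POINT))
```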
<Box srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<coord><X>0.0</X><Y>0.0</Y></coord>
<coord><X>100.0</X><Y>100.0</Y></coord>
</Box>
The above XML fragment shows how a bounding box is stored in GML. All objects can have their bounding
boxes declared to make processing easier. The format is simple – the lower left and then upper right coordinates
of the box are defined, together with the spatial reference system.
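The reason bounding boxes make processing easier is that a cheap box test can rule out most objects before any exact geometry test is run. The sketch below illustrates this in Python; read_box and box_contains are hypothetical helper names, and the un-prefixed element names follow the fragment above.

```python
import xml.etree.ElementTree as ET

BOX = """
<Box srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
  <coord><X>0.0</X><Y>0.0</Y></coord>
  <coord><X>100.0</X><Y>100.0</Y></coord>
</Box>
"""

def read_box(xml_text):
    """Return (min_x, min_y, max_x, max_y) from a GML 2 <Box>."""
    box = ET.fromstring(xml_text)
    corners = [(float(c.findtext("X")), float(c.findtext("Y")))
               for c in box.findall("coord")]
    (min_x, min_y), (max_x, max_y) = corners  # lower left, then upper right
    return min_x, min_y, max_x, max_y

def box_contains(bounds, x, y):
    """True when (x, y) lies inside or on the edge of the box."""
    min_x, min_y, max_x, max_y = bounds
    return min_x <= x <= max_x and min_y <= y <= max_y

bounds = read_box(BOX)
print(box_contains(bounds, 50.0, 50.0), box_contains(bounds, 150.0, 50.0))
```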
An example of defining a line made up of two line segments is shown below:
<LineString srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<coord><X>0.0</X><Y>0.0</Y></coord>
<coord><X>20.0</X><Y>35.0</Y></coord>
<coord><X>100.0</X><Y>100.0</Y></coord>
</LineString>
The coordinates of each of the points on the line are given in turn, again in the specified coordinate system.
Finally, the XML below illustrates how polygons are defined:
<Polygon gid="_98217" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<outerBoundaryIs>
<LinearRing>
<coordinates>0.0,0.0 100.0,0.0 100.0,100.0 0.0,100.0 0.0,0.0</coordinates>
</LinearRing>
</outerBoundaryIs>
<innerBoundaryIs>
<LinearRing>
<coordinates>10.0,10.0 10.0,40.0 40.0,40.0 40.0,10.0 10.0,10.0</coordinates>
</LinearRing>
</innerBoundaryIs>
<innerBoundaryIs>
<LinearRing>
<coordinates>60.0,60.0 60.0,90.0 90.0,90.0 90.0,60.0 60.0,60.0</coordinates>
</LinearRing>
</innerBoundaryIs>
</Polygon>
The outer boundary of the polygon (of which there can only be one) is defined as a LinearRing (notice that in
this case the coordinates are given as comma separated tuples, with a space between each point on the ring;
this is the other admissible coordinate format in GML). A linear ring is simply a poly-line, where the end point
is also joined to the start point. Polygons may also have inclusions (islands), and thus the inner boundaries (if
extant) of the polygons can also be represented using the same data structure.
Looking through the GML code you may have noticed that it is not very compact, yet throughout the course I
have stressed the importance of compact data structures and efficient methods. This is a problem with all XML
based languages; they are not very size efficient. This can be remedied using standard data compression
techniques, which will work very well on data with a strong structure (such as is imposed by GML), but there remains
a problem with GML. The data structures defined will necessarily involve duplication of polygon boundaries,
and as we discussed earlier there are better representations for vector data than their raw coordinates. GML
(3.0) or later will address this.
<MultiGeometry gid="c731" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
<geometryMember>
<Point gid="P6776">
<coord><X>50.0</X><Y>50.0</Y></coord>
</Point>
</geometryMember>
<geometryMember>
<LineString gid="L21216">
<coord><X>0.0</X><Y>0.0</Y></coord>
<coord><X>0.0</X><Y>50.0</Y></coord>
<coord><X>100.0</X><Y>50.0</Y></coord>
</LineString>
</geometryMember>
<geometryMember>
<Polygon gid="_877789">
<outerBoundaryIs>
<LinearRing>
<coordinates>0.0,0.0 100.0,0.0 50.0,100.0 0.0,0.0</coordinates>
</LinearRing>
</outerBoundaryIs>
</Polygon>
</geometryMember>
</MultiGeometry>
GML can also be used to store collections of geometric objects, which are all in a common spatial reference
system. The XML fragment above shows an example of such a collection, which is denoted a <MultiGeometry>
type.
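The two coordinate encodings and the ring structure of polygons described in this section can be handled by a few lines of code. The Python sketch below reads either encoding, computes the area of a polygon with holes using the shoelace formula, and lists the coordinates of each member of a collection. All function names are hypothetical, the collection is a shortened version of the one in the notes, and the area is in squared coordinate units (so it is only meaningful in a projected reference system).

```python
import xml.etree.ElementTree as ET

POLYGON = """
<Polygon gid="_98217" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
  <outerBoundaryIs><LinearRing>
    <coordinates>0.0,0.0 100.0,0.0 100.0,100.0 0.0,100.0 0.0,0.0</coordinates>
  </LinearRing></outerBoundaryIs>
  <innerBoundaryIs><LinearRing>
    <coordinates>10.0,10.0 10.0,40.0 40.0,40.0 40.0,10.0 10.0,10.0</coordinates>
  </LinearRing></innerBoundaryIs>
  <innerBoundaryIs><LinearRing>
    <coordinates>60.0,60.0 60.0,90.0 90.0,90.0 90.0,60.0 60.0,60.0</coordinates>
  </LinearRing></innerBoundaryIs>
</Polygon>
"""

MULTI = """
<MultiGeometry gid="c731" srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
  <geometryMember><Point gid="P6776">
    <coord><X>50.0</X><Y>50.0</Y></coord>
  </Point></geometryMember>
  <geometryMember><LineString gid="L21216">
    <coord><X>0.0</X><Y>0.0</Y></coord>
    <coord><X>0.0</X><Y>50.0</Y></coord>
  </LineString></geometryMember>
</MultiGeometry>
"""

def read_coords(geom):
    """Read positions from a <coordinates> string or from <coord> children."""
    text = geom.findtext("coordinates")
    if text is not None:
        return [tuple(float(v) for v in pair.split(",")) for pair in text.split()]
    return [(float(c.findtext("X")), float(c.findtext("Y")))
            for c in geom.findall("coord")]

def ring_area(ring):
    """Unsigned area of a closed ring (first point repeated last), shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_area(polygon):
    """Area of the outer ring minus the area of every inner ring."""
    outer = read_coords(polygon.find("outerBoundaryIs/LinearRing"))
    holes = [read_coords(r) for r in polygon.findall("innerBoundaryIs/LinearRing")]
    return ring_area(outer) - sum(ring_area(h) for h in holes)

def member_coords(multi):
    """Map the gid of each member geometry to its list of positions."""
    return {m[0].get("gid"): read_coords(m[0])
            for m in multi.findall("geometryMember")}

print(polygon_area(ET.fromstring(POLYGON)))
print(member_coords(ET.fromstring(MULTI)))
```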
9.2.3
GML and features
The world around us can be thought of as being composed of a range of different features, such as rivers, roads,
houses and so on, much as we might use an object model in GIS. In GML features are used to describe particular
geographical objects. Features will always have at least one geometry property, and sometimes more:
<CityModel fid="Cm1456">
<dateCreated>Feb 2000</dateCreated>
<gml:featureMember>
<River fid="Rv567">....</River>
</gml:featureMember>
<gml:featureMember>
<River fid="Rv568">....</River>
</gml:featureMember>
<gml:featureMember>
<Road fid="Rd812">....</Road>
</gml:featureMember>
</CityModel>
The above fragment shows an example of GML being used to represent some features, in this case two rivers
and a road. In order to make sense of this it is necessary to understand the feature schema that is being used,
which is shown below:
<element name="CityModel" type="ex:CityModelType"
substitutionGroup="gml:_FeatureCollection"/>
<element name="River" type="ex:RiverType" substitutionGroup="gml:_Feature"/>
<element name="Road" type="ex:RoadType" substitutionGroup="gml:_Feature"/>
<complexType name="CityModelType">
<complexContent>
<extension base="gml:AbstractFeatureCollectionType">
<sequence>
<element name="dateCreated" type="month"/>
</sequence>
</extension>
</complexContent>
</complexType>
<complexType name="RiverType">
<complexContent>
<extension base="gml:AbstractFeatureType">
<sequence>....</sequence>
</extension>
</complexContent>
</complexType>
<complexType name="RoadType">
<complexContent>
<extension base="gml:AbstractFeatureType">
<sequence>.....</sequence>
</extension>
</complexContent>
</complexType>
Note that the feature types are defined as extensions of the GML AbstractFeatureType. The use of namespaces
is implicit in the gml: prefix on all GML elements. The actual properties of the features we are representing
are stored within a <sequence> part.
We will not go into great depth on the use of GML 2.0 and schemas. In general GML provides abstract base
feature types (and base types for feature collections). The user can then define their own extensions of this base
feature schema to suit their own purpose. However, to help interoperability a number of predefined schemas
are stored on the OGC web site at: http://www.opengis.net/schema.htm. This should provide a repository
for GML based schema.
9.2.4
GML in use
In practice GML requires several services to be used. This is illustrated in Figure 9.3. The GML schema defines
how data is stored (more correctly served) by some web feature server. This is then passed to a rendering engine
(possibly a Java applet or servlet, or the web browser itself) where a stylesheet is used to convert the data into
some format for display. For GML it seems likely that SVG is the standard that will be most widely adopted.
This is then displayed as a map by the browser or a plug-in.
[Diagram: a GML schema (XSD) describes the GML served by a Web Feature Server; a styling engine applies a
map style file (XSLT) to the GML to produce SVG, which is displayed as a map.]
Figure 9.3: How GML works in practice.
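The styling step in this pipeline would normally be an XSLT stylesheet. As a rough stand-in, the Python sketch below performs the same kind of translation for a single LineString, turning its coordinates into an SVG polyline. The function line_to_svg and the fixed 100 by 100 drawing area are assumptions made for illustration only, not part of any OGC specification.

```python
import xml.etree.ElementTree as ET

LINE = """
<LineString srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
  <coord><X>0.0</X><Y>0.0</Y></coord>
  <coord><X>20.0</X><Y>35.0</Y></coord>
  <coord><X>100.0</X><Y>100.0</Y></coord>
</LineString>
"""

def line_to_svg(xml_text, height=100.0):
    """Render a GML 2 <LineString> as an SVG document holding one polyline."""
    line = ET.fromstring(xml_text)
    points = []
    for c in line.findall("coord"):
        x, y = float(c.findtext("X")), float(c.findtext("Y"))
        # SVG's y axis points down the page, so flip the map's y coordinate.
        points.append(f"{x:g},{height - y:g}")
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width="100", height=f"{height:g}")
    ET.SubElement(svg, "polyline", points=" ".join(points),
                  fill="none", stroke="black")
    return ET.tostring(svg, encoding="unicode")

print(line_to_svg(LINE))
```

The content (the coordinates) comes entirely from the GML, while the presentation (stroke colour, axis flip, drawing size) lives in the styling code, which is the separation the section describes.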
Within the UK GML has only been taken up relatively recently (with the exception of Laser-Scan); however, in
2001 two large organisations, the UK Government and the UK Ordnance Survey (UKOS), both declared that their
future lay with GML. This could be an exciting time for the UK, since the UKOS under the work-plan of the
Digital National Framework is also greatly reducing costs of digital data. This is key to the widespread use of
spatial data, and will produce a change in the way that digital map data is used in the UK. The UK Ordnance
Survey can be found at http://www.ordnancesurvey.co.uk/.
9.3
GML (3.x)
GML (2.0) has several limitations including:
➢ restriction to 2D data;
➢ no topology;
➢ only simple vector data.
These restrictions are serious and mean that in order to represent any kind of realistic spatial data, GML (2.0)
must be extended. GML (3.0) was released in January 2003 and adds ISO 19100 series conformance as well as
the following to GML (2.0):
➢ new geometry class with: topological information (based on NAA); 3D geometry; curves and surfaces;
➢ feature time stamps, histories and events (temporal GIS);
➢ units of measure;
➢ observations;
➢ meta-data, including a default visualisation;
➢ coverages (layers such as DEM, TIN or other raster data structures).
This set of functionality means GML can truly be regarded as a plausible candidate for geo-spatial data sharing
over the web. Time will not permit a full treatment of all 33 schema in GML (3.x); however, an overview of
the schema and their function will be given, with some examples. These will be based on GML (3.1) and
further information can be obtained from Lake et al. (2004).
Figure 9.4: The GML (3.1) object hierarchy.
The full object hierarchy of GML (3.1) is shown in Figure 9.4. The gmlBase.xsd schema defines the gml:_Object,
gml:_GML and gml:_MetaData root elements. The schema also defines a number of convenience types, such as the
gml:NullType which allows users to specify missing, inapplicable, unknown, withheld or template data. Template
data is used to denote a value that will be available later. The ability to represent missing data (and to
provide a reason through a Uniform Resource Identifier (URI)) is very useful when incomplete data exists. This
is implemented through the convenience element gml:Null as shown in two examples below:
<gml:Null>withheld</gml:Null>
<gml:Null>http://www.lame.excuse.com/emergency/TheDogAteIt</gml:Null>
It also defines a sign type (+/-) for use on the topological data structures. There are also standard GML types
for boolean, integer, double, name (may not contain white space) and string, which are automatically allowed to
have null content. The gml:CodeType is used to reference dictionaries, classification schemes etc, referenced
through a URI, and acts rather like a keyword in GML:
<element name = "landClass" type = "gml:CodeType"/>
<landClass codeSpace = "http://www.gml.aston.ac.uk/ClassificationScheme">scrub</landClass>
The above fragment shows the use of the gml:CodeType to indicate that scrub is a valid land cover type
according to the classification scheme held at the indicated URI.
The final basic type is the gml:MeasureType which is a double together with a reference to the correct units of
measure. Its use is shown here,
<element name = "height" type = "gml:MeasureType"/>
<height uom = "http://www.iso.org/iso/en/.../units/m">2.400</height>
<height uom = "http://www.dated.org/old/.../units/feet">6.8</height>
where a door's height is encoded in both metres and feet.
As well as the basic types GML defines a series of list types (extending standard XML lists) which allow a
number of values to be included in a single tag, separated by white spaces. There are corresponding types that
also allow null values in the list. Note there is no string list type, since string may include white space. Lists
may also be composed of units of measure types and code types.
The structure of GML remains the same. Features are represented by XML elements in GML instances (the
name of the feature type corresponds to the name of the element). In UML, this corresponds to an object.
Feature properties are sub-elements in the GML instances; in UML this corresponds to either association roles,
or object attributes. The property value is indicated by the data type of the feature property.
The following conventions are used when writing GML:
➢ objects (created based on features) are given a conceptually meaningful name in UpperCamelCase;
➢ properties are given a name in lowerCamelCase;
➢ abstract elements have an underscore added to the start of their name;
➢ the words Type and Abstract are added to the end and beginning respectively of the names of the
relevant XML schema types.
All GML objects derive (directly or indirectly) from the AbstractGMLType. The schema to declare this object
type is shown here:
<element name="_GML" type="gml:AbstractGMLType"
abstract="true" substitutionGroup="gml:_Object"/>
<complexType name="AbstractGMLType" abstract="true">
<sequence>
<element ref="gml:metaDataProperty" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:description" minOccurs="0"/>
<element ref="gml:name" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<attribute ref="gml:id" use="optional"/>
</complexType>
The structure of a global element declaration gml:_GML with a schema type defining the behaviour (AbstractGMLType)
is common to all of GML.
In GML properties refer to any characteristics of a GML object. It is possible to include an unlimited number
of additional properties (above those inherited from AbstractGMLType) which can have simple or complex type,
and can be stored by value or by reference. To support properties with complex content, a basic pattern for
property elements is provided by the gml:AssociationType, shown below:
<element name="_association" type="gml:AssociationType" abstract="true"/>
<complexType name="AssociationType">
<sequence>
<element ref="gml:_Object" minOccurs="0"/>
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
This is a convenience type; it is not a requirement that all properties requiring complex content utilise this type.
Often it is useful to specify properties by reference, using the attribute group declared in xlinks.xsd:
<attributeGroup name="simpleLink">
<attribute name="type" type="string" fixed="simple" form="qualified"/>
<attribute ref="xlink:href" use="optional"/>
<attribute ref="xlink:role" use="optional"/>
<attribute ref="xlink:arcrole" use="optional"/>
<attribute ref="xlink:title" use="optional"/>
<attribute ref="xlink:show" use="optional"/>
<attribute ref="xlink:actuate" use="optional"/>
</attributeGroup>
and in gmlBase.xsd:
<attributeGroup name="AssociationAttributeGroup">
<attributeGroup ref="xlink:simpleLink"/>
<attribute ref="gml:remoteSchema" use="optional"/>
</attributeGroup>
The above code shows the schema that are used to facilitate this using xlink. The remoteSchema attribute
allows us to identify a remote schema that defines the content of the properties referred to by the xlink-ed
item. In GML (3.1) the Schematron language is used to enforce constraints such as limiting the property to be
referenced only by value or by reference, but this is not core to the use of GML and we will ignore it here.
All GML objects we create will inherit from AbstractGMLType and can thus contain metadata (that is
information describing the data they store), a description, which is stored as a string (or reference) and many names,
which are of gml:CodeType. In practice an object may have multiple names (e.g. Birmingham International
Airport, BHX, the airport) depending on the context, so the application can select the relevant code space and
use this name. The object may also have an ID, which is of XML type ID and thus must be unique within the
document.
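A small Python sketch may help show how an application could pick one of several names by code space. The Airport element, the example.org code space URIs and the function name_in are all hypothetical; only the gml:name element with its codeSpace attribute comes from the text above.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"

# Hypothetical feature carrying three names, two of them in a named code space.
AIRPORT = """
<Airport xmlns:gml="http://www.opengis.net/gml" gml:id="a1">
  <gml:name codeSpace="http://example.org/names/official">Birmingham International Airport</gml:name>
  <gml:name codeSpace="http://example.org/names/iata">BHX</gml:name>
  <gml:name>the airport</gml:name>
</Airport>
"""

def name_in(feature, code_space):
    """Return the first gml:name in the given code space, or None if there is none."""
    for name in feature.findall(f"{{{GML}}}name"):
        if name.get("codeSpace") == code_space:
            return name.text
    return None

airport = ET.fromstring(AIRPORT)
print(name_in(airport, "http://example.org/names/iata"))
```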
To support the construction of object collections a concrete gml:member element is declared as <element name="member"
type="gml:AssociationType"/>, which allows us to create data instances which inherit from the GML base types.
There is an equivalent element for a sequence of members: <element name="members"
type="gml:ArrayAssociationType"/>, where gml:ArrayAssociationType simply allows for a sequence of
gml:_Object elements. This allows the declaration of arrays (for objects of the same type, where the order may be
important) and bags (for objects of different types or with no particular order), given by <element name="Array"
type="gml:ArrayType" substitutionGroup="gml:_GML"/> and <element name="Bag" type="gml:BagType"
substitutionGroup="gml:_GML"/> respectively.
These base types define the building blocks used in GML. They will now be illustrated in a more coherent
manner.
9.3.1
Features
[Figure: UML static class diagram of the gmlFeature package. Classes shown: gmlBase::AbstractGMLType, AbstractFeatureType, BoundingShapeType, FeaturePropertyType, FeatureArrayPropertyType, AbstractFeatureCollectionType, FeatureCollectionType, geometryBasic0d1d::EnvelopeType, EnvelopeWithTimePeriodType and temporal::TimePositionType.]
Figure 9.5: The GML (3.1) feature type hierarchy.
A GML feature is meant to be a meaningful object in the context of the application, for instance a road, supermarket or lake. The basic definition, without any features deprecated in GML (3.1), is shown below:
<complexType name="AbstractFeatureType" abstract="true">
<complexContent>
<extension base="gml:AbstractGMLType">
<sequence>
<element ref="gml:boundedBy" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="_Feature" type="gml:AbstractFeatureType"
abstract="true" substitutionGroup="gml:_GML"/>
The abstract element gml:_Feature defines anything that can be thought of as a feature in GML. The corresponding UML static class diagram, which represents the organisation of the feature type diagrammatically, is
shown in Figure 9.5. Note that the mapping between UML and XML is not always simple and in some cases
rather odd UML notation is forced. These diagrams are intended to assist understanding and are not normative
descriptions of GML. The gml:boundedBy property describes an envelope surrounding the entire feature and
is designed to facilitate fast searching for features in a given area. Note that the gml:boundedBy property is of
type gml:BoundingShapeType, which may contain either a gml:Envelope or a null value. It is also possible to
include a temporal extent, since gml:EnvelopeType is extended to include a time envelope if desired.
To describe the location of an object a convenience property gml:location is defined, as shown below:
<element name="location" type="gml:LocationPropertyType"/>
<complexType name="LocationPropertyType">
<sequence>
<choice>
<element ref="gml:_Geometry"/>
<element ref="gml:LocationKeyWord"/>
<element ref="gml:LocationString"/>
<element ref="gml:Null"/>
</choice>
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
Note that a location can be defined in four ways: through a geometry, a description, a keyword (of gml:CodeType)
or a null value. An example of representing a location using geometry might be:
<gml:location>
<gml:Point gml:id="point96" srsName="urn:EPSG:geographicCRS:62836405">
<gml:pos>-31.936 115.834</gml:pos>
</gml:Point>
</gml:location>
or the location could be obtained from another source (e.g. a web map feature server), as in:
<gml:location
xlink:href="http://www.maps.uk/bin/gazm01?placename=leeds&amp;placetype=C&amp;county=Y+"/>
It is often useful to define associations between features. This is realised by the concrete gml:featureProperty
and gml:featureMember properties, which use the gml:FeaturePropertyType content model. There is also a gml:featureMembers
property which provides an array of features.
The geometry of the feature will generally be specified in the application schema, however GML provides several
methods to assist in this. Geometry can be specified using predefined formal names such as pointProperty,
curveProperty, surfaceProperty, solidProperty and multiPointProperty. Alternatively predefined descriptive names can be used:
<element name="centerOf" type="gml:PointPropertyType"/>
<element name="position" type="gml:PointPropertyType"/>
<element name="extentOf" type="gml:SurfacePropertyType"/>
<element name="edgeOf" type="gml:CurvePropertyType"/>
<element name="centerLineOf" type="gml:CurvePropertyType"/>
<element name="multiLocation" type="gml:MultiPointPropertyType"/>
<element name="multiCenterOf" type="gml:MultiPointPropertyType"/>
<element name="multiPosition" type="gml:MultiPointPropertyType"/>
<element name="multiCenterLineOf" type="gml:MultiCurvePropertyType"/>
<element name="multiEdgeOf" type="gml:MultiCurvePropertyType"/>
<element name="multiCoverage" type="gml:MultiSurfacePropertyType"/>
<element name="multiExtentOf" type="gml:MultiSurfacePropertyType"/>
Finally application developers can create their own specific properties, for instance a mobile phone mast might
have a gml:location that provides a point geometry, an extentOf property that defines the area of the site,
and a serviceArea that defines those areas which the mast serves. This illustrates that features can also have
multiple geometries.
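The mobile phone mast described above might be encoded and inspected as in the following Python sketch. The app:Mast element, the app:serviceArea property, the namespace URI and all coordinates are hypothetical; only gml:location, gml:extentOf and the geometry elements come from GML.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"
# Geometry elements recognised by this sketch (GML defines many more).
GEOMETRY_TAGS = {f"{{{GML}}}{n}" for n in ("Point", "LineString", "Polygon")}

# Hypothetical feature instance with three geometry-valued properties.
DOC = """
<app:Mast xmlns:app="http://example.org/app"
          xmlns:gml="http://www.opengis.net/gml" gml:id="mast7">
  <gml:location>
    <gml:Point><gml:pos>10 10</gml:pos></gml:Point>
  </gml:location>
  <gml:extentOf>
    <gml:Polygon><gml:exterior><gml:LinearRing>
      <gml:posList>9 9 11 9 11 11 9 11 9 9</gml:posList>
    </gml:LinearRing></gml:exterior></gml:Polygon>
  </gml:extentOf>
  <app:serviceArea>
    <gml:Polygon><gml:exterior><gml:LinearRing>
      <gml:posList>0 0 20 0 20 20 0 20 0 0</gml:posList>
    </gml:LinearRing></gml:exterior></gml:Polygon>
  </app:serviceArea>
</app:Mast>
"""

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]

def geometry_properties(feature):
    """Map each property name to the type of geometry it directly holds."""
    found = {}
    for prop in feature:
        for child in prop:
            if child.tag in GEOMETRY_TAGS:
                found[local(prop.tag)] = local(child.tag)
    return found

mast_geometries = geometry_properties(ET.fromstring(DOC))
```

Here mast_geometries records one point and two polygons, which is the sense in which a single feature carries multiple geometries.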
Feature collections allow the grouping of features:
<complexType name="AbstractFeatureCollectionType" abstract="true">
<complexContent>
<extension base="gml:AbstractFeatureType">
<sequence>
<element ref="gml:featureMember" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:featureMembers" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
Note that a feature collection can have many gml:featureMember properties, but only one gml:featureMembers
property. Feature collections are made concrete by:
<element name="FeatureCollection" type="gml:FeatureCollectionType"
substitutionGroup="gml:_Feature"/>
<complexType name="FeatureCollectionType">
<complexContent>
<extension base="gml:AbstractFeatureCollectionType"/>
</complexContent>
</complexType>
There is also an abstract version of a feature collection, gml:_FeatureCollection, which serves as the head of
the feature collection substitution group.
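As a rough illustration, a feature collection holding one member by value and one by reference could be assembled as follows. The app:Road feature, the app namespace and the xlink target URL are invented for the example.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"
XLINK = "http://www.w3.org/1999/xlink"
APP = "http://example.org/app"  # hypothetical application namespace

def feature_collection(inline_features, remote_hrefs):
    """Build a gml:FeatureCollection with one gml:featureMember per feature."""
    collection = ET.Element(f"{{{GML}}}FeatureCollection")
    for feature in inline_features:
        # Member by value: the feature element is nested inside the property.
        ET.SubElement(collection, f"{{{GML}}}featureMember").append(feature)
    for href in remote_hrefs:
        # Member by reference: an empty property carrying an xlink:href.
        ET.SubElement(collection, f"{{{GML}}}featureMember").set(f"{{{XLINK}}}href", href)
    return collection

road = ET.Element(f"{{{APP}}}Road", {f"{{{GML}}}id": "road1"})
fc = feature_collection([road], ["http://example.org/features#lake3"])
members = fc.findall(f"{{{GML}}}featureMember")
```

Each gml:featureMember either contains its feature or points to it, never both, which mirrors the value-or-reference pattern used throughout GML.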
A feature can also have topology, which is included by the application schema designer, using the GML topology
schema discussed later. This is also true for the temporal properties of features, which can be instants or
durations (and these can have simple topological structure too, as in the ordering of time). Note that when the
application schema developer creates a new application schema, features must be defined using:
<element name="<<featureName>>" type="<<contentModel>>"
substitutionGroup="gml:_Feature"/>
where the substitutionGroup is only required if the feature is to be used in a feature collection.
9.3.2
Geometry
The basic type, as of GML (3.1), is:
<complexType name="AbstractGeometryType" abstract="true">
<complexContent>
<extension base="gml:AbstractGMLType">
<attributeGroup ref="gml:SRSReferenceGroup"/>
</extension>
</complexContent>
</complexType>
and all geometries inherit from this. This means that all geometries have an optional coordinate reference
system attached to them. Aggregated geometry elements are assumed to be in a common coordinate system
unless otherwise specified. The spatial reference system group is:
<attributeGroup name="SRSReferenceGroup">
<attribute name="srsName" type="anyURI" use="optional"/>
<attribute name="srsDimension" type="positiveInteger" use="optional"/>
<attributeGroup ref="gml:SRSInformationGroup"/>
</attributeGroup>
which uses:
<attributeGroup name="SRSInformationGroup">
<attribute name="axisLabels" type="gml:NCNameList" use="optional"/>
<attribute name="uomLabels" type="gml:NCNameList" use="optional"/>
</attributeGroup>
which provides a list of axes labels (in order) and their corresponding units of measurement. Spatial reference
systems are complex in traditional GIS settings, and more so in the GML schema, so we will not treat them
in any depth, but note that the GML (3.1) framework supports a range of well known reference systems, user
defined systems and the conversion between reference systems.
GML geometry can be defined using the abstract head:
<element name="_Geometry" type="gml:AbstractGeometryType" abstract="true"
substitutionGroup="gml:_GML" />
which is in turn used to describe a concrete property:
<complexType name="GeometryPropertyType">
<sequence>
<element ref="gml:_Geometry" minOccurs="0" />
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup" />
</complexType>
The geometry element can be defined using either value or reference (xlink). We can also use arrays of geometries
(which cannot reference the geometry using xlink).
Coordinate geometry is defined using:
<complexType name="DirectPositionType">
<simpleContent>
<extension base="gml:doubleList">
<attributeGroup ref="gml:SRSReferenceGroup"/>
</extension>
</simpleContent>
</complexType>
<element name="pos" type="gml:DirectPositionType" />
this being the preferred use in GML (3.1) – don’t use the gml:coord element. There are also list types for
coordinate positions, but the explicit use of gml:coordinates was deprecated at version 3.1 for coordinates
that are number based tuples (although the type gml:CoordinatesType which is a text string will remain).
Vectors are included as:
<complexType name="VectorType">
<simpleContent>
<restriction base="gml:DirectPositionType"/>
</simpleContent>
</complexType>
<element name="vector" type="gml:VectorType" />
which simply defines an ordered set of numbers representing the components of a vector in space.
Envelopes, which we have met earlier, are defined as:
<complexType name="EnvelopeType">
<sequence>
<element name="lowerCorner" type="gml:DirectPositionType"/>
<element name="upperCorner" type="gml:DirectPositionType"/>
</sequence>
<attributeGroup ref="gml:SRSReferenceGroup"/>
</complexType>
<element name="Envelope" type="gml:EnvelopeType"/>
where the choice to use deprecated properties has been removed for clarity. Thus an envelope is a bounding
box, but should not be taken to be minimal, and can be defined in an arbitrary coordinate reference system if
desired.
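The role of the envelope in fast searching can be sketched as follows, assuming 2D coordinates. The function names and the sample road coordinates are illustrative and not part of GML; the intersection test is only a cheap filter, so a feature whose envelope overlaps the query box still needs an exact geometric test afterwards.

```python
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"

def envelope_of(points):
    """Smallest axis-aligned box containing the 2D points: (lower, upper)."""
    xs, ys = zip(*points)
    return (min(xs), min(ys)), (max(xs), max(ys))

def envelope_to_gml(lower, upper, srs_name=None):
    """Encode a box as gml:Envelope with lowerCorner and upperCorner."""
    env = ET.Element(f"{{{GML}}}Envelope")
    if srs_name is not None:
        env.set("srsName", srs_name)
    for tag, corner in (("lowerCorner", lower), ("upperCorner", upper)):
        ET.SubElement(env, f"{{{GML}}}{tag}").text = " ".join(str(c) for c in corner)
    return env

def envelopes_intersect(a, b):
    """True if two boxes overlap or touch; used to reject distant features quickly."""
    (ax0, ay0), (ax1, ay1) = a
    (bx0, by0), (bx1, by1) = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

road = [(2.0, 1.0), (5.0, 4.0), (9.0, 3.0)]   # made-up vertices of a road
road_env = envelope_of(road)
near_query = ((0.0, 0.0), (3.0, 3.0))
far_query = ((10.0, 10.0), (12.0, 12.0))
road_env_gml = envelope_to_gml(*road_env)
```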
There are a large number of geometric primitives, and only those which are likely to have common use will be
illustrated. The base type is:
<complexType name="AbstractGeometricPrimitiveType" abstract="true">
<complexContent>
<extension base="gml:AbstractGeometryType" />
</complexContent>
</complexType>
which provides:
<element name="_GeometricPrimitive" type="gml:AbstractGeometricPrimitiveType"
abstract="true" substitutionGroup="gml:_Geometry" />
and this is used in:
<complexType name="GeometricPrimitivePropertyType">
<sequence >
<element ref="gml:_GeometricPrimitive" minOccurs="0" />
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup" />
</complexType>
This must contain the values or use xlink, but not both. Again we see that a given type is accompanied by a
property to contain that type.
The point property is defined by:
<complexType name="PointType">
<complexContent>
<extension base="gml:AbstractGeometricPrimitiveType">
<sequence>
<element ref="gml:pos" />
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Point" type="gml:PointType" substitutionGroup="gml:_GeometricPrimitive" />
where again deprecated elements are not shown. Thus the element gml:Point is used to store point geometry.
To create a property which has a point as its value, either directly or by reference, the following XML is declared:
<complexType name="PointPropertyType">
<sequence >
<element ref="gml:Point" minOccurs="0" />
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup" />
</complexType>
<element name="pointProperty" type="gml:PointPropertyType" />
It is also possible to declare arrays of points, as it was possible for other GML elements:
<complexType name="PointArrayPropertyType">
<sequence>
<element ref="gml:Point" minOccurs="0" maxOccurs="unbounded" />
</sequence>
</complexType>
<element name="pointArrayProperty" type="gml:PointArrayPropertyType" />
Next we explore the representation of curves. There are several methods to represent curves, so the object
based approach of GML with inheritance works well here. Initially an abstract base type is defined:
<complexType name="AbstractCurveType" abstract="true">
<complexContent>
<extension base="gml:AbstractGeometricPrimitiveType" />
</complexContent>
</complexType>
The curve is assumed to be continuous with the following abstract head element:
<element name="_Curve" type="gml:AbstractCurveType" abstract="true"
substitutionGroup="gml:_GeometricPrimitive" />
Like points it is possible to now specify (abstract) curves as properties (clearly the appropriate behaviours for
the concrete implementations are used when the data is made available). This is done using:
<complexType name="CurvePropertyType">
<sequence>
<element ref="gml:_Curve" minOccurs="0"/>
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup" />
</complexType>
<element name="curveProperty" type="gml:CurvePropertyType" />
It is again possible to specify arrays of curves, but this is not shown here. There are a range of options for
representing a curve, and the most commonly used definition in GIS is as a line string (piecewise linear polyline).
This is defined as:
<complexType name="LineStringType">
<complexContent>
<extension base="gml:AbstractCurveType">
<sequence>
<choice>
<choice minOccurs="2" maxOccurs="unbounded">
<element ref="gml:pos" />
<element ref="gml:pointProperty" />
</choice>
<element ref="gml:posList" />
</choice>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="LineString" type="gml:LineStringType" substitutionGroup="gml:_Curve" />
Just as with other types a line string can be defined as a property, although this has been deprecated in 3.1,
for the preferred type gml:CurvePropertyType.
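To make the content model concrete, the sketch below reads a small gml:LineString instance (using either gml:posList or repeated gml:pos elements) and computes its planar length. The coordinates are made up, gml:pointProperty members are not handled, and math.dist assumes Python 3.8 or later.

```python
import math
import xml.etree.ElementTree as ET

GML = "http://www.opengis.net/gml"

# Illustrative instance; the coordinates are arbitrary.
DOC = """
<gml:LineString xmlns:gml="http://www.opengis.net/gml" srsDimension="2">
  <gml:posList>0 0 3 4 3 10</gml:posList>
</gml:LineString>
"""

def line_string_points(line, default_dim=2):
    """Return the vertices of a LineString given as gml:posList or gml:pos elements."""
    dim = int(line.get("srsDimension", default_dim))
    pos_list = line.find(f"{{{GML}}}posList")
    if pos_list is not None:
        values = [float(v) for v in pos_list.text.split()]
        points = [tuple(values[i:i + dim]) for i in range(0, len(values), dim)]
    else:
        points = [tuple(float(v) for v in pos.text.split())
                  for pos in line.findall(f"{{{GML}}}pos")]
    if len(points) < 2:
        raise ValueError("a line string needs at least two positions")
    return points

def polyline_length(points):
    """Sum of the straight-line distances between consecutive vertices."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

points = line_string_points(ET.fromstring(DOC))
total = polyline_length(points)
```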
There is also a simple type for 2D surfaces, these being abstract until we get to the definition of a polygon:
<complexType name="PolygonType">
<complexContent>
<extension base="gml:AbstractSurfaceType">
<sequence>
<element ref="gml:exterior" minOccurs="0" />
<element ref="gml:interior" minOccurs="0" maxOccurs="unbounded" />
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Polygon" type="gml:PolygonType" substitutionGroup="gml:_Surface"/>
Note the polygon has a single exterior but may have multiple interiors (that is holes, sometimes called islands, within the polygon). The
boundaries are declared as:
<element name="exterior" type="gml:AbstractRingPropertyType" />
<element name="interior" type="gml:AbstractRingPropertyType" />
so both use the same basic gml:AbstractRingPropertyType:
<complexType name="AbstractRingPropertyType">
<sequence>
<element ref="gml:_Ring" />
</sequence>
</complexType>
which uses:
<complexType name="LinearRingType">
<complexContent>
<extension base="gml:AbstractRingType">
<sequence>
<choice>
<choice minOccurs="4" maxOccurs="unbounded">
<element ref="gml:pos" />
<element ref="gml:pointProperty" />
</choice>
<element ref="gml:posList" />
</choice>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="LinearRing" type="gml:LinearRingType" substitutionGroup="gml:_Ring" />
The linear ring must have at least 4 points to define it. A ring is essentially a closed curve, where the final point
duplicates the first point (i.e. the four points are used to define a triangle).
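The closure rule is easy to check in code. The following sketch (function names are illustrative, coordinates are 2D and planar) tests whether a list of positions forms a valid linear ring, closes an open list of vertices, and computes the enclosed area with the shoelace formula.

```python
def is_valid_linear_ring(points):
    """At least four positions, with the last one repeating the first."""
    return len(points) >= 4 and points[0] == points[-1]

def close_ring(points):
    """Append the first position if the list is not already closed."""
    return points if points[0] == points[-1] else points + [points[0]]

def ring_area(points):
    """Unsigned area enclosed by a closed ring (shoelace formula)."""
    twice_area = sum(x0 * y1 - x1 * y0
                     for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return abs(twice_area) / 2.0

triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]   # three distinct vertices
ring = close_ring(triangle)                       # four positions once closed
```

The area of a polygon with interiors would then be the area of its exterior ring minus the areas of its interior rings.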
GML (3.1) supports a number of additional geometries. Curves can be built up using curve segments, which
can be defined to have a number of properties (e.g. differentiability at both ends and in the interior) and can
be composed of types such as:
<simpleType name="CurveInterpolationType">
<restriction base="string">
<enumeration value="linear" />
<enumeration value="geodesic" />
<enumeration value="circularArc3Points" />
<enumeration value="circularArc2PointWithBulge" />
<enumeration value="circularArcCenterPointWithRadius" />
<enumeration value="elliptical" />
<enumeration value="clothoid" />
<enumeration value="conic" />
<enumeration value="polynomialSpline" />
<enumeration value="cubicSpline" />
<enumeration value="rationalSpline" />
</restriction>
</simpleType>
For many applications in traditional GIS most of these types are redundant, however many CAD systems, for
example use curves extensively, and in GIS this is likely to increase in the future since e.g. spline curves provide
a more parsimonious model for certain objects. The range of types largely reflects the types of curves used in
GIS systems. For each of these types there are corresponding definitions in the schema, but we will not cover
these here. For more detail the reader is referred to GML-WorkingGroup (2004).
The geometry schema also defines an orientable curve:
<complexType name="OrientableCurveType">
<complexContent>
<extension base="gml:AbstractCurveType">
<sequence>
<element ref="gml:baseCurve" />
</sequence>
<attribute name="orientation" type="gml:SignType" default="+" />
</extension>
</complexContent>
</complexType>
<element name="baseCurve" type="gml:CurvePropertyType" />
<element name="OrientableCurve" type="gml:OrientableCurveType"
substitutionGroup="gml:_Curve" />
This allows us to orient, that is give a direction to any of the curve types, which will be very useful when we
start to define topological structures.
The surface types have also been significantly enhanced to include:
<complexType name="SurfaceType">
<complexContent>
<extension base="gml:AbstractSurfaceType">
<sequence>
<element ref="gml:patches" />
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Surface" type="gml:SurfaceType" substitutionGroup="gml:_Surface" />
This defines a surface as being made up of a number of patches. These patches are defined as an array of surface
patch properties, and can be defined using the following interpolation methods:
<simpleType name="SurfaceInterpolationType">
<restriction base="string">
<enumeration value="none" />
<enumeration value="planar" />
<enumeration value="spherical" />
<enumeration value="elliptical" />
<enumeration value="conic" />
<enumeration value="tin" />
<enumeration value="parametricCurve" />
<enumeration value="polynomialSpline" />
<enumeration value="rationalSpline" />
<enumeration value="triangulatedSpline" />
</restriction>
</simpleType>
We will encounter many of these methods in the course, but equally there are a large number we will not
consider. Perhaps the most relevant for GIS are the planar and tin methods, although in the future the more
flexible spline based methods might become more widely used. Just as with curves, surfaces are orientable.
A TIN can be defined using:
<complexType name="TinType">
<complexContent>
<extension base="gml:TriangulatedSurfaceType">
<sequence>
<element name="stopLines" type="gml:LineStringSegmentArrayPropertyType"
minOccurs="0" maxOccurs="unbounded"/>
<element name="breakLines" type="gml:LineStringSegmentArrayPropertyType"
minOccurs="0" maxOccurs="unbounded"/>
<element name="maxLength" type="gml:LengthType"/>
<element name="controlPoint">
<complexType>
<choice>
<element ref="gml:posList"/>
<group ref="gml:geometricPositionGroup"
minOccurs="3" maxOccurs="unbounded"/>
</choice>
</complexType>
</element>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Tin" type="gml:TinType" substitutionGroup="gml:TriangulatedSurface"/>
which provides a flexible TIN model which includes break lines and stop lines to allow critical ridges or troughs
to be explicitly included and edges of areas (or areas with problems) to be defined. The maximum length of
any side of any triangle can also be specified. The controlPoint element contains the set of positions used as points
for this TIN. Since each TIN contains triangles, there must be at least 3 points. As well as the standard 2D
types, GML (3.1) defines a series of 3D solid object types. These will not be covered here; details are given in
GML-WorkingGroup (2004).
To produce more complex models, GML (3.1) adds two constructs for grouping the geometries:
➢ geometryAggregates;
➢ geometryComplexes.
Geometry aggregates (instances of a subtype of gml:AbstractGeometricAggregateType) are arbitrary aggregations of geometry elements. They are not assumed to have any additional internal structure and are used to
group pieces of geometry of a specified type. Application schemas may use aggregates for features that use multiple geometric objects in their representations. Geometry complexes (instances of gml:GeometricComplexType)
are closed collections of geometric primitives, meaning they will contain their boundaries. The composite geometries (CompositeCurve, CompositeSurface and CompositeSolid) can be viewed as primitives and as complexes.
Composites are groupings of the same geometry types, so composite curves are simple collections of curves.
Complexes on the other hand will contain all elements that bound the type. For instance a complex surface
would contain the bounding curves and the points of these curves.
9.3.3
Coordinate reference systems
Coordinate reference systems are defined using 6 separate schema. While a location (point) can be specified by a
set of coordinates, the actual position can only be made unambiguous through the introduction of a coordinate
reference system (otherwise we just have numbers). In general, although every point may have its own unique
coordinate reference system, we would typically assign a common coordinate reference system to all points in a
given data set. The schema are extensive and complex and we will not explore them further here, other than to
note that almost all types of coordinate systems are supported (and transformations) and positional accuracy
can also be represented.
9.3.4
Topology
GML (3.1) supports topology at three levels and topological complexes. There is strong symmetry in the
(topological boundary and co-boundary) relationships between topology primitives of adjacent dimensions. The
topology primitives supported are nodes (0D), edges (1D), faces (2D) and topoSolids (3D). Topology primitives
are bounded by directed primitives of one lower dimension (e.g. an edge is bounded by two nodes – the nodes
are the boundaries of the edge). The co-boundary of each topology primitive (that is the objects which the
primitive can form part of the boundary of, for instance the co-boundary of a node may be a number of edges)
is formed from directed topology primitives of one higher dimension. The topology schema includes definitions
from the geometry schema yielding the ability to describe primitives and complexes with a geometric realisation:
<include schemaLocation="geometryComplexes.xsd"/>
As with other GML constructs abstract base types are declared:
<complexType name="AbstractTopologyType" abstract="true">
<complexContent>
<extension base="gml:AbstractGMLType"/>
</complexContent>
</complexType>
<element name="_Topology" type="gml:AbstractTopologyType"
abstract="true" substitutionGroup="gml:_Object"/>
A base type is also declared for topological primitives:
<complexType name="AbstractTopoPrimitiveType" abstract="true">
<complexContent>
<extension base="gml:AbstractTopologyType">
<sequence>
<element ref="gml:isolated" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:container" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="_TopoPrimitive" type="gml:AbstractTopoPrimitiveType"
abstract="true" substitutionGroup="gml:_Topology"/>
Topology primitives form the base types for all topology constructs. The concepts of isolation (e.g. faces may
isolate nodes) and of being a container (nodes may have faces as containers) only apply to primitives with a
dimension difference of at least 2 (e.g. an edge can neither isolate nor contain nodes).
The simplest primitive is the node, which has the following form:
<complexType name="NodeType">
<complexContent>
<extension base="gml:AbstractTopoPrimitiveType">
<sequence>
<element ref="gml:directedEdge" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:pointProperty" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Node" type="gml:NodeType" substitutionGroup="gml:_TopoPrimitive"/>
A node has no boundary, so this is not represented, but it does have a co-boundary – a directedEdge – for
edges incident on the node the orientation of the directedEdge is positive, for edges emanating from the node
it is negative. It is possible (like DCEL) to order edges in a clockwise sequence about the node in 2D (this
is not possible in 3D). The node may also realise a point geometry. There is also a directedNode which also
includes an orientation which can be used when defining edges to indicate whether the node is a start (’-’) or
end (’+’) node defined in:
<attribute name="orientation" type="gml:SignType" default="+"/>
The next topological construct is the edge:
<complexType name="EdgeType">
<complexContent>
<extension base="gml:AbstractTopoPrimitiveType">
<sequence>
<element ref="gml:directedNode" minOccurs="2" maxOccurs="2"/>
<element ref="gml:directedFace" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:curveProperty" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Edge" type="gml:EdgeType" substitutionGroup="gml:_TopoPrimitive"/>
Note that an edge must have two directed nodes (a start and end node). The co-boundary of the edge is the
set of (directed) faces incident on the edge, and may or may not be represented. An edge may also have a
geometry, realised as a curve. Again the base Edge element is extended to a directed edge:
<complexType name="DirectedEdgePropertyType">
<choice>
<element ref="gml:Edge" minOccurs="0"/>
</choice>
<attribute name="orientation" type="gml:SignType" default="+"/>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
<element name="directedEdge" type="gml:DirectedEdgePropertyType">
<annotation>
<appinfo>
<sch:pattern>
<sch:rule context="gml:directedEdge">
<sch:extends rule="hrefOrContent"/>
</sch:rule>
</sch:pattern>
</appinfo>
</annotation>
</element>
which effectively adds an orientation attribute.
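The boundary and co-boundary relations between nodes and edges can be illustrated with a small in-memory model. This is a sketch only: the dictionary layout, the function names and the three-edge triangle are assumptions made for the example. The sign convention follows the text, with '-' for the start node of an edge and '+' for its end node.

```python
from collections import defaultdict

# Hypothetical topology: each edge id maps to (start node, end node).
EDGES = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n3", "n1"),
}

def directed_nodes(edge_id, edges):
    """Boundary of an edge: its start node ('-') and its end node ('+')."""
    start, end = edges[edge_id]
    return [("-", start), ("+", end)]

def node_coboundaries(edges):
    """Co-boundary of each node: directed edges, '+' if incident, '-' if emanating."""
    coboundary = defaultdict(list)
    for edge_id, (start, end) in edges.items():
        coboundary[start].append(("-", edge_id))   # edge leaves this node
        coboundary[end].append(("+", edge_id))     # edge arrives at this node
    return dict(coboundary)

coboundaries = node_coboundaries(EDGES)
```

Each edge therefore appears exactly twice across all node co-boundaries, once with each sign, which is the symmetry between boundary and co-boundary mentioned earlier.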
The next primitive is a face:
<complexType name="FaceType">
<complexContent>
<extension base="gml:AbstractTopoPrimitiveType">
<sequence>
<element ref="gml:directedEdge" maxOccurs="unbounded"/>
<element ref="gml:directedTopoSolid" minOccurs="0" maxOccurs="2"/>
<element ref="gml:surfaceProperty" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
<element name="Face" type="gml:FaceType" substitutionGroup="gml:_TopoPrimitive"/>
which is much like an edge, but must have at least one directedEdge as its boundary, may form part of the
boundary of up to two solids (its co-boundary), and may be represented geometrically as a surface. The directedEdges that form the boundary of the face
should be oriented with the face on the left. Again there is a directedFace which is used in the co-boundary
relation for edges.
We will not look at topoSolids, merely note their existence. As we saw previously, these topology primitives may
be isolated (that is a node may be completely contained by a face). These relationships can also be represented
in the topological schema. In general use the topological primitives are designed to be represented (for data
that really has a geometry) using the gml:TopoPoint, gml:TopoCurve and gml:TopoSurface elements and the
corresponding gml:topoPointProperty, gml:topoCurveProperty and gml:topoSurfaceProperty elements.
These are most relevant where a specific (possibly multiple) geometry will be linked with shared node definitions.
The base topological constructs are brought together using topological complexes:
<complexType name="TopoComplexType">
<complexContent>
<extension base="gml:AbstractTopologyType">
<sequence>
<element ref="gml:maximalComplex"/>
<element ref="gml:superComplex" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:subComplex" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:topoPrimitiveMember" minOccurs="0" maxOccurs="unbounded"/>
<element ref="gml:topoPrimitiveMembers" minOccurs="0"/>
</sequence>
<attribute name="isMaximal" type="boolean" default="false"/>
</extension>
</complexContent>
</complexType>
<element name="TopoComplex" type="gml:TopoComplexType"
substitutionGroup="gml:_Topology"/>
Topological complexes are rather involved, since the terminology gets rather heavy, but things are actually a
little simpler once you can see through this. This type and element provide encoding for a topology complex
comprising multiple topology primitive members. In addition to primitives, each complex holds a reference to
a unique maximal complex (the complex which has no super-complex) and optionally to some number of sub- or
super-complexes. A topology complex contains its primitive and sub-complex members, and is contained by
its super-complex(es). The primitive and sub-complex members of a topological complex have dimensionality
less than or equal to the dimensionality of the topology complex. There is one and only one maximal complex
per topological manifold. This simply provides a hierarchical way to represent large scale topology, by using
complexes and their super- (that is the higher level complexes that contain the complex) and sub- (lower level
complexes contained in the complex) complexes.
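The hierarchy can be pictured with a toy model. The sketch below makes the simplifying assumption that each complex has at most one super-complex (GML allows several), and the complex names are invented.

```python
# Hypothetical complexes: each name maps to its single super-complex, or None.
SUPER_COMPLEX = {
    "street_block": "district",
    "district": "city",
    "city": None,   # no super-complex, so this is the maximal complex
}

def maximal_complex(name, super_complex):
    """Follow super-complex links upward until a complex with no parent is reached."""
    visited = set()
    while super_complex[name] is not None:
        if name in visited:
            raise ValueError("cycle in super-complex links")
        visited.add(name)
        name = super_complex[name]
    return name

def sub_complexes(name, super_complex):
    """Return the complexes directly contained in the named complex."""
    return sorted(c for c, parent in super_complex.items() if parent == name)

top = maximal_complex("street_block", SUPER_COMPLEX)
```

Every complex in the same hierarchy resolves to the same maximal complex, which is what the gml:maximalComplex reference records directly.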
The only thing we need to do now is to define a type that enables us to represent the fact that a topological primitive
is part of a complex. This is achieved using:
<element name="topoPrimitiveMember" type="gml:topoPrimitiveMemberType">
<annotation>
<appinfo>
<sch:pattern>
<sch:rule context="gml:topoPrimitiveMember">
<sch:extends rule="hrefOrContent"/>
</sch:rule>
</sch:pattern>
</appinfo>
</annotation>
</element>
<complexType name="topoPrimitiveMemberType">
<sequence>
<element ref="gml:_TopoPrimitive" minOccurs="0"/>
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
and a corresponding topoPrimitiveMembers for multiple topoPrimitives.
Finally we have a utility type that allows us to associate a GML object to a topological complex:
<complexType name="TopoComplexMemberType">
<sequence>
<element ref="gml:TopoComplex" minOccurs="0"/>
</sequence>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
<element name="topoComplexProperty" type="gml:TopoComplexMemberType"/>
This enables us to create a feature collection which can contain (or reference using xlink) a topological complex
which contains topology referenced by members of that feature collection.
9.3.5
Temporal data and dynamics
The temporal provision in GML is extensive, and almost as complete as the spatial representations. There is
provision for temporal geometry, topology and the definition of temporal reference systems. As with all GML
component schema, the temporal elements can be associated with geographic features. There are two basic time
models: interval, which allows us to measure duration, and ordinal which only allows us to order time. We
will not cover the temporal schema in any depth, merely note their presence. Most current GIS systems offer
only limited support for temporal data, so to make full use of this aspect of GML will require some evolution
in current software or tailor-made solutions.
GML (3.1) also makes provision for dynamic data, through dynamic feature collections which extend regular
feature collections. This means it supports the representation of moving objects and the storage of their paths.
Dynamic features must have either a time-stamp (to give the time at which the features are valid) or a history
(which contains a set of time slices, each of which can have the relevant values for the time varying properties).
The moving object type is just one example of how time slices can be extended:
<element name="MovingObjectStatus" type="gml:MovingObjectStatusType"
substitutionGroup="gml:TimeSlice"/>
<complexType name="MovingObjectStatusType">
<complexContent>
<extension base="gml:AbstractTimeSliceType">
<sequence>
<element ref="gml:position"/>
<element name="speed" type="gml:MeasureType" minOccurs="0"/>
<element name="bearing" type="gml:DirectionPropertyType"
minOccurs="0"/>
<element name="acceleration" type="gml:MeasureType" minOccurs="0"/>
<element name="elevation" type="gml:MeasureType" minOccurs="0"/>
<element ref="gml:status" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
This allows us to represent a moving object through its current position, together with a speed and direction
(bearing), and an acceleration (the elevation allows us to represent objects which are moving up and down).
An example of where this might be used is a specialisation of a history, called a track, which could be used to
represent the path of a delivery vehicle for a courier company.
<element name="track" type="gml:TrackType" substitutionGroup="gml:history"/>
<complexType name="TrackType">
<complexContent>
<restriction base="gml:HistoryPropertyType">
<sequence>
<element ref="gml:MovingObjectStatus" maxOccurs="unbounded"/>
</sequence>
</restriction>
</complexContent>
</complexType>
This is probably easiest to see using an example:
<gml:track>
<gml:MovingObjectStatus>
<gml:validTime><gml:TimeInstant>
<gml:timePosition>2004-06-27T06:00:00</gml:timePosition>
</gml:TimeInstant></gml:validTime>
<gml:location>
<gml:Point><gml:pos>435.12 120.27</gml:pos></gml:Point>
</gml:location>
<gml:speed uom="#kph">67.4</gml:speed>
<gml:bearing><gml:CompassPoint>SE</gml:CompassPoint></gml:bearing>
</gml:MovingObjectStatus>
<gml:MovingObjectStatus>
<gml:validTime><gml:TimeInstant>
<gml:timePosition>2004-06-27T06:05:00</gml:timePosition>
</gml:TimeInstant></gml:validTime>
<gml:location>
<gml:Point><gml:pos>432.12 123.22</gml:pos></gml:Point>
</gml:location>
<gml:speed uom="#kph">37.4</gml:speed>
<gml:bearing><gml:CompassPoint>S</gml:CompassPoint></gml:bearing>
</gml:MovingObjectStatus>
</gml:track>
which could, for example, represent the track of a delivery vehicle.
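As a minimal sketch of how such a track might be consumed, the Python fragment below reads the two time slices with the standard xml.etree.ElementTree module and sums the straight-line distance between successive positions. The helper names read_track and path_length are illustrative and not part of GML. The instance is given the gml namespace declaration it would need in a real file, dates are zero-padded, and positions are assumed to be planar coordinates written as whitespace-separated numbers.

```python
import math
import xml.etree.ElementTree as ET

GML_NS = {"gml": "http://www.opengis.net/gml"}

# The track instance from the notes, with a namespace declaration added.
TRACK_XML = """
<gml:track xmlns:gml="http://www.opengis.net/gml">
  <gml:MovingObjectStatus>
    <gml:validTime><gml:TimeInstant>
      <gml:timePosition>2004-06-27T06:00:00</gml:timePosition>
    </gml:TimeInstant></gml:validTime>
    <gml:location>
      <gml:Point><gml:pos>435.12 120.27</gml:pos></gml:Point>
    </gml:location>
    <gml:speed uom="#kph">67.4</gml:speed>
    <gml:bearing><gml:CompassPoint>SE</gml:CompassPoint></gml:bearing>
  </gml:MovingObjectStatus>
  <gml:MovingObjectStatus>
    <gml:validTime><gml:TimeInstant>
      <gml:timePosition>2004-06-27T06:05:00</gml:timePosition>
    </gml:TimeInstant></gml:validTime>
    <gml:location>
      <gml:Point><gml:pos>432.12 123.22</gml:pos></gml:Point>
    </gml:location>
    <gml:speed uom="#kph">37.4</gml:speed>
    <gml:bearing><gml:CompassPoint>S</gml:CompassPoint></gml:bearing>
  </gml:MovingObjectStatus>
</gml:track>
"""


def read_track(xml_text):
    """Return one (time, (x, y), speed) tuple per MovingObjectStatus."""
    root = ET.fromstring(xml_text.strip())
    slices = []
    for status in root.findall("gml:MovingObjectStatus", GML_NS):
        time = status.findtext(".//gml:timePosition", namespaces=GML_NS)
        pos = status.findtext(".//gml:pos", namespaces=GML_NS)
        x, y = (float(v) for v in pos.split())
        speed = float(status.findtext("gml:speed", namespaces=GML_NS))
        slices.append((time, (x, y), speed))
    return slices


def path_length(slices):
    """Sum of straight-line distances between successive positions."""
    return sum(math.dist(a[1], b[1]) for a, b in zip(slices, slices[1:]))


track = read_track(TRACK_XML)
print(len(track), "time slices, path length", round(path_length(track), 2))
```

The distance is in whatever units the coordinates use; a real application would take these from the coordinate reference system rather than assume them.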
9.3.6 Definitions and dictionaries
Almost all applications will use terms which require careful definition. This is the job of a dictionary, so that a
given label can be tied to standard definitions. In general definitions are better captured from external sources,
where these exist, by reference to an external URI; however, in some instances it will be necessary to embed
the dictionary in the GML file. Both routes are supported in GML 3.1. The aim of GML dictionaries is to
provide a simple (but non-trivial) schema for describing, in a hierarchical model, definitions, including relations
between terms in the dictionary. The abstract base is specialised in many other GML schema such as temporal
reference systems, units of measure and coordinate reference systems. The aim is not to duplicate mechanisms
for providing support for more complex descriptions such as taxonomies, ontologies and thesauruses – these are
catered for by specialist XML based tools. It is recommended that all GML application schema refer to a GML
dictionary to specify the definitions used for the elements.
9.3.7 Units, measures and values
In many cases we will want to represent quantities that are measured on a particular scale (that is, using a given
unit of measure), which for the most part will be standard SI units. For example we may want to represent the
length of a given object, but a value of 2.4 is meaningless without a unit of measure attached. Once we know
it is a measurement in metres then we can use that measurement e.g. to buy a beam of that length. This is
achieved in GML through the use of the units of measure uom attribute. The units.xsd schema defines a set
of base units and allows the user to add definitions, derive units from base units and define simple formulae.
Units are used by the measure type:
<element name="measure" type="gml:MeasureType"/>
<gml:measure uom="#m">1.76</gml:measure>
which shows the declaration and how it might be used in an instance. Here the uom attribute refers to a URI (in this
instance a fragment within the current document, using the abbreviated XPointer notation). There is a lot more to the GML
concepts of units of measure, measures and value schema, but we will not cover it in any more depth here.
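To make the role of the uom attribute concrete, the sketch below reads a gml:measure element in Python and converts its value to metres. The UNITS table is a hypothetical stand-in for a units dictionary such as one built on units.xsd, and read_measure is an illustrative helper, not a GML API. Only same-document references of the form "#name" are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit dictionary: fragment identifier -> factor converting to metres.
UNITS = {"m": 1.0, "km": 1000.0, "cm": 0.01}


def read_measure(xml_text):
    """Return the measure's value converted to metres."""
    elem = ET.fromstring(xml_text)
    uom = elem.get("uom")
    if uom is None or not uom.startswith("#"):
        raise ValueError("this sketch only resolves same-document uom references")
    factor = UNITS[uom[1:]]  # KeyError for a unit the dictionary does not define
    return float(elem.text) * factor


height = read_measure(
    '<gml:measure xmlns:gml="http://www.opengis.net/gml" uom="#m">1.76</gml:measure>'
)
print(height, "metres")
```

Without the uom attribute the number 1.76 could not be converted or compared safely, which is the point made above about a bare value being meaningless.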
9.3.8 Observations
Observations are used to model the act of observing, and generally also contain the result of that observation.
Since observations are the primary source of many derived GIS entities, this is an important schema.
An observation feature describes both the metadata about the observation event, and also the result of the
observation. The observation might range from the taking of a photograph, to more scientific remote sensing, or
direct observation, e.g. of the temperature at a given weather station. The schema provides a basis on which
more domain specific observation schema can be built.
The observation is declared using:
<element name="Observation" type="gml:ObservationType"
substitutionGroup="gml:_Feature"/>
<complexType name="ObservationType">
<complexContent>
<extension base="gml:AbstractFeatureType">
<sequence>
<element ref="gml:validTime"/>
<element ref="gml:using" minOccurs="0"/>
<element ref="gml:target" minOccurs="0"/>
<element ref="gml:resultOf"/>
</sequence>
</extension>
</complexContent>
</complexType>
Since the observation is a straightforward extension of gml:AbstractFeatureType it automatically has
gml:metadataProperty, gml:description, gml:name, gml:location and gml:boundedBy properties. The
gml:using element describes the instrument or sensor used to make the observation. The gml:target describes
the region, object or location at which the observation is made, and is declared as follows:
<element name="target" type="gml:TargetPropertyType"/>
<element name="subject" type="gml:TargetPropertyType"
substitutionGroup="gml:target"/>
<complexType name="TargetPropertyType">
<choice minOccurs="0">
<element ref="gml:_Feature"/>
<element ref="gml:_Geometry"/>
</choice>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
The target property makes most sense for remote observations, since the gml:location property should be
used to give the position of the sensor (e.g. the aircraft on which the camera is mounted), while the target gives
the region that is observed by the sensor.
The gml:resultOf property contains the actual observation result; the value may be in-line, or a reference
to a value elsewhere. If the value is in-line, it must be a member of the gml:_Object substitution group. The
use of observations is best illustrated by an example:
<gml:Observation>
<gml:location xlink:href="http://www.metoffice.com/stations/453"/>
<gml:validTime>
<gml:TimeInstant>
<gml:timePosition>2004-01-31T12:00:00</gml:timePosition>
</gml:TimeInstant>
</gml:validTime>
<gml:using xlink:href="http://www.weather.org/sensors/T46c"/>
<gml:target xlink:href="http://www.metoffice.com/stations/453"/>
<gml:resultOf>
<gml:Quantity uom="#C">-4.4</gml:Quantity>
</gml:resultOf>
</gml:Observation>
This example shows how a scientific observation might be stored, but GML has much broader applicability, and
might be used to describe something that is apparently non-geographical, such as taking a photograph:
<gml:Observation>
<gml:location>
<gml:LocationString>Aston University</gml:LocationString>
</gml:location>
<gml:validTime>
<gml:TimeInstant>
<gml:timePosition>2004-07-14T12:12:23</gml:timePosition>
</gml:TimeInstant>
</gml:validTime>
<gml:subject xlink:href="http://www.aston.ac.uk/graduation/students/CS/"/>
<gml:resultOf xlink:href="CSgraduation.jpg"/>
</gml:Observation>
This data can only very loosely be considered part of a geography, but it illustrates that: a) geography is
pervasive in our lives; b) GML might have scope outside traditional GIS boundaries.
There is also a gml:DirectedObservation type which allows us to define the direction from which a remote
observation was taken, as well as the distance, but in general this will be used only for photographs. We can
also attach a measure of the quality of the observation using the data quality schema.
9.3.9 Coverages
Coverages support mapping from a spatio-temporal domain to attribute values where attribute types are common to all geographic positions within the spatio-temporal domain. A spatio-temporal domain consists of a
collection of direct positions in a coordinate space. Examples of coverages include rasters, triangulated irregular
networks, point coverages, and polygon coverages. Coverages are the prevailing data structures in a number of
application areas, such as remote sensing, meteorology, and bathymetric, elevation, soil, and vegetation mapping,
and typically correspond to the field data structures we have already met, and more broadly to layered
approaches to GIS.
We can define the coverage as either
➢ a set of (location, value) pairs;
➢ a description of locations (e.g. the grid) and some method to describe the values at each location.
The first option is simple, the latter more powerful and compact. In GML the first representation can be
implemented using a homogeneous feature collection. The GML coverage schema provides a mechanism for
the second option. GML describes coverages as mappings from a domain (the space-time locations) to a range
(the attributes). The complete coverage is implemented as a GML feature; thus the feature might describe the
complete near surface temperature distribution across the UK at a given time instant.
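The difference between the two options can be sketched in a few lines of Python. This is an illustration of the idea only, not of the GML encoding: the class GridCoverage, its method names and the sample values are all assumptions made for the example. Option 1 stores every location explicitly; option 2 stores a grid description once, plus a flat list of values in a known order.

```python
from typing import List, Tuple

Location = Tuple[float, float]

# Option 1: the coverage as an explicit list of (location, value) pairs.
pairs: List[Tuple[Location, float]] = [
    ((0.0, 0.0), 4.5), ((10.0, 0.0), 1.5),
    ((0.0, 10.0), 6.7), ((10.0, 10.0), 2.3),
]


class GridCoverage:
    """Option 2: a grid description plus values stored row by row."""

    def __init__(self, origin, spacing, ncols, nrows, values):
        if len(values) != ncols * nrows:
            raise ValueError("need exactly one value per grid point")
        self.origin, self.spacing = origin, spacing
        self.ncols, self.nrows = ncols, nrows
        self.values = list(values)

    def location(self, col, row):
        """Location of a grid point, computed rather than stored."""
        ox, oy = self.origin
        return (ox + col * self.spacing, oy + row * self.spacing)

    def value(self, col, row):
        """Value at a grid point; x varies fastest in the flat list."""
        return self.values[row * self.ncols + col]

    def to_pairs(self):
        """Expand into the explicit (location, value) form of option 1."""
        return [(self.location(c, r), self.value(c, r))
                for r in range(self.nrows) for c in range(self.ncols)]


coverage = GridCoverage(origin=(0.0, 0.0), spacing=10.0, ncols=2, nrows=2,
                        values=[4.5, 1.5, 6.7, 2.3])
print(coverage.to_pairs() == pairs)
```

For a large grid the second form stores one number per point instead of three, which is why it is described above as the more compact option.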
At version 3.1 GML includes only simple grids as geometries for the domain of the coverage, but this is planned
to be extended in future versions of GML. The grid is defined using:
<element name="Grid" type="gml:GridType"
substitutionGroup="gml:_ImplicitGeometry"/>
<complexType name="GridType">
<complexContent>
<extension base="gml:AbstractGeometryType">
<sequence>
<element name="limits" type="gml:GridLimitsType"/>
<element name="axisName" type="string" maxOccurs="unbounded"/>
</sequence>
</extension>
</complexContent>
</complexType>
This defines a rectangular grid, where limits specify the maximum and minimum coordinates along the primary
axes. As an example:
<gml:Grid dimension="2">
<gml:limits>
<gml:GridEnvelope>
<gml:low>0 0</gml:low>
<gml:high>600 1200</gml:high>
</gml:GridEnvelope>
</gml:limits>
<gml:axisName>Easting</gml:axisName>
<gml:axisName>Northing</gml:axisName>
</gml:Grid>
Note that the dimension attribute is inherited from gml:AbstractGeometryType; the example above is meant to specify a grid
in UK National Grid coordinates. Note also that the grid points are by default assumed to lie at integer coordinates.
If we want to define a grid which is more meaningful, that is one explicitly tied to some coordinate system, then we need
to use the gml:RectifiedGridType:
<gml:RectifiedGrid dimension="2">
<gml:limits>
<gml:GridEnvelope>
<gml:low>1 1</gml:low>
<gml:high>100 100</gml:high>
</gml:GridEnvelope>
</gml:limits>
<gml:axisName>Easting</gml:axisName>
<gml:axisName>Northing</gml:axisName>
<gml:origin>
<gml:Point gml:id="ScillyIsles" srsName="urn:EPSG:UKNG">
<gml:pos>340.0 440.0</gml:pos>
</gml:Point>
</gml:origin>
<gml:offsetVector srsName="urn:EPSG:UKNG">0.1 0.0</gml:offsetVector>
<gml:offsetVector srsName="urn:EPSG:UKNG">0.0 0.1</gml:offsetVector>
</gml:RectifiedGrid>
This defines a grid with 100 by 100 points, starting at UK National Grid coordinates (340.0E, 440.0N), with
grid spacings every 0.1 km in each direction, that is covering a 10 by 10 km region. In this example the grid is
aligned with the base coordinate system, but it need not be. Note this is a regular grid in UK National Grid
coordinates, but will be irregular in any other coordinate systems, thus the coordinate systems are specified in
the definition.
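The way the origin and the two offset vectors fix the position of every grid point can be sketched in Python as follows. The function grid_point is illustrative only. It assumes, following the reading of the example given above, that the grid point at gml:low sits at the origin; an implementation that places grid index (0, 0) at the origin would simply drop the subtraction of low.

```python
# Values taken from the RectifiedGrid example above (units: km).
ORIGIN = (340.0, 440.0)
OFFSETS = ((0.1, 0.0), (0.0, 0.1))  # one offset vector per grid axis
LOW, HIGH = (1, 1), (100, 100)


def grid_point(index, origin=ORIGIN, offsets=OFFSETS, low=LOW):
    """Coordinates of the grid point with the given integer index.

    Assumption: the point at index `low` coincides with the origin.
    """
    coords = list(origin)
    for axis_index, axis_low, vector in zip(index, low, offsets):
        steps = axis_index - axis_low
        for d, component in enumerate(vector):
            coords[d] += steps * component
    return tuple(coords)


print(grid_point(LOW))   # first grid point
print(grid_point(HIGH))  # last grid point
```

Because the offset vectors are general vectors rather than scalar spacings, the same function handles a grid that is rotated relative to the coordinate axes, for example with offsets ((0.1, 0.1), (-0.1, 0.1)).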
The coverage is defined using:
<complexType name="AbstractCoverageType" abstract="true">
<complexContent>
<extension base="gml:BoundedFeatureType">
<sequence>
<element ref="gml:domainSet"/>
<element ref="gml:rangeSet"/>
</sequence>
</extension>
</complexContent>
</complexType>
The gml:domainSet can be the grid over which the range is defined, but can also be other geometries. The
gml:rangeSet defines the mapping for the range. There is a discrete coverage which has a function to map
from the domain to the range at each domain location:
<complexType name="AbstractDiscreteCoverageType" abstract="true">
<complexContent>
<extension base="gml:AbstractCoverageType">
<sequence>
<element ref="gml:coverageFunction" minOccurs="0"/>
</sequence>
</extension>
</complexContent>
</complexType>
There is also a corresponding continuous coverage, which defines the range over all points in the domain.
The domain set is described using:
<element name="domainSet" type="gml:DomainSetType"/>
<complexType name="DomainSetType">
<choice minOccurs="0">
<element ref="gml:_Geometry"/>
<element ref="gml:_TimeObject"/>
</choice>
<attributeGroup ref="gml:AssociationAttributeGroup"/>
</complexType>
This domain set is a geometry or a time object, so we can represent spatial fields, or time series. At present
there is no mechanism for an explicit space-time representation.
The range is described using:
<element name="rangeSet" type="gml:RangeSetType"/>
<complexType name="RangeSetType">
<choice>
<element ref="gml:ValueArray" maxOccurs="unbounded"/>
<element ref="gml:_ScalarValueList" maxOccurs="unbounded"/>
<element ref="gml:DataBlock"/>
<element ref="gml:File"/>
</choice>
</complexType>
which allows us to represent the values in an array, a value list (very similar), a data block (i.e. something like
a comma separated value list) or a file. The declarations of these types can be found in the GML specification.
We note that they provide a flexible framework for storing the range values at the domain locations, since they
are quite detailed descriptions (e.g. the file type allows the use of pretty much any format, including binary
and compressed files).
The gml:coverageFunction defines how to deal with the ‘interpolation’ of the function values:
<element name="coverageFunction" type="gml:CoverageFunctionType"/>
<complexType name="CoverageFunctionType">
<choice>
<element ref="gml:MappingRule"/>
<element ref="gml:GridFunction"/>
</choice>
</complexType>
A mapping rule can provide a formal (for example using MathML) or informal description of the mapping
function. The grid function does the same job but only for grid geometries:
<element name="GridFunction" type="gml:GridFunctionType"/>
<complexType name="GridFunctionType">
<sequence>
<element name="sequenceRule" type="gml:SequenceRuleType" minOccurs="0"/>
<element name="startPoint" type="gml:integerList" minOccurs="0"/>
</sequence>
</complexType>
The startPoint is the index position of a point in the grid that is mapped to the first point in the range set.
If the startPoint is omitted the startPoint is assumed to be equal to the value of gml:low in the gml:Grid
geometry. Subsequent points in the mapping are determined by the value of the sequenceRule:
<complexType name="SequenceRuleType">
<simpleContent>
<extension base="gml:SequenceRuleNames">
<attribute name="order" type="gml:IncrementOrder" use="optional"/>
</extension>
</simpleContent>
</complexType>
which uses:
<simpleType name="SequenceRuleNames">
<restriction base="string">
<enumeration value="Linear"/>
<enumeration value="Boustrophedonic"/>
<enumeration value="Cantor-diagonal"/>
<enumeration value="Spiral"/>
<enumeration value="Morton"/>
<enumeration value="Hilbert"/>
</restriction>
</simpleType>
This simply gives the type of ordering of the range values in the function. The actual order within the type is
specified using the order attribute, of type gml:IncrementOrder; the default is across, then up. Most grids will use the
default settings: the startPoint being the gml:low property, and the sequence rule being Linear.
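As an illustration of what a sequence rule means in practice, the Python sketch below lists the grid indices in the order in which successive range values would be assigned to them, for the Linear and Boustrophedonic rules. Both functions are illustrative helpers written for these notes. They assume the default described above (start at gml:low, go across the first axis, then up the second) and do not cover the other rules or non-default increment orders.

```python
def linear(low, high):
    """Row by row, always left to right."""
    (x0, y0), (x1, y1) = low, high
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]


def boustrophedonic(low, high):
    """Row by row, reversing direction on every other row."""
    (x0, y0), (x1, y1) = low, high
    order = []
    for n, y in enumerate(range(y0, y1 + 1)):
        xs = range(x0, x1 + 1)
        if n % 2 == 1:
            xs = reversed(xs)  # odd rows run right to left
        order.extend((x, y) for x in xs)
    return order


print(linear((0, 0), (2, 1)))
print(boustrophedonic((0, 0), (2, 1)))
```

The n-th value in the range set belongs to the n-th index in the returned list, so changing the rule changes which location each stored value is attached to without changing the stored values themselves.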
An example is shown below:
<ClimateDataJan
xmlns="http://www.metoffice.com/climate"
xmlns:gml="http://www.opengis.net/gml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.metoffice.com/climate http://www.cs.aston.ac.uk/~cornford/GMLClimate.xsd">
<gml:domainSet>
<gml:RectifiedGrid dimension="2">
<gml:limits>
<gml:GridEnvelope>
<gml:low>1 1</gml:low>
<gml:high>2 2</gml:high>
</gml:GridEnvelope>
</gml:limits>
<gml:axisName>Easting</gml:axisName>
<gml:axisName>Northing</gml:axisName>
<gml:origin>
<gml:Point gml:id="1023" srsName="urn:EPSG:UKNG">
<gml:pos>340.0 440.0</gml:pos>
</gml:Point>
</gml:origin>
<gml:offsetVector srsName="urn:EPSG:UKNG">100.0 0.0</gml:offsetVector>
<gml:offsetVector srsName="urn:EPSG:UKNG">0.0 100.0</gml:offsetVector>
</gml:RectifiedGrid>
</gml:domainSet>
<gml:rangeSet>
<gml:DataBlock>
<gml:rangeParameters>
<gml:CompositeValue>
<gml:valueComponents>
<Temperature uom="urn:UKMO:uom:degC">template</Temperature>
<Pressure uom="urn:UKMO:uom:hPa">template</Pressure>
<Rainfall uom="urn:UKMO:uom:mm">template</Rainfall>
</gml:valueComponents>
</gml:CompositeValue>
</gml:rangeParameters>
<gml:tupleList>4.5,1012.4,57.6 1.5,1014.4,36.6
6.7,1014.6,44.5 2.3,1017.8,32.1</gml:tupleList>
</gml:DataBlock>
</gml:rangeSet>
</ClimateDataJan>
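To show how the domain set and range set of this example fit together, the Python sketch below attaches each tuple in the tupleList to the grid point it belongs to. It is a reading of the example under stated assumptions, not a GML parser: the values are copied in as literals, the Linear sequence rule with the default order is assumed, and the grid point at gml:low is taken to coincide with the origin. The function build_coverage is an illustrative name.

```python
# Literals copied from the ClimateDataJan example above.
TUPLE_LIST = """4.5,1012.4,57.6 1.5,1014.4,36.6
6.7,1014.6,44.5 2.3,1017.8,32.1"""
COMPONENTS = ("Temperature", "Pressure", "Rainfall")
LOW, HIGH = (1, 1), (2, 2)
ORIGIN = (340.0, 440.0)
OFFSETS = ((100.0, 0.0), (0.0, 100.0))


def build_coverage():
    """Map each grid location to a dict of named range values."""
    # Tuples are separated by whitespace, components by commas.
    records = [tuple(float(v) for v in rec.split(","))
               for rec in TUPLE_LIST.split()]
    # Linear rule, default order: first axis varies fastest.
    indices = [(i, j)
               for j in range(LOW[1], HIGH[1] + 1)
               for i in range(LOW[0], HIGH[0] + 1)]
    if len(records) != len(indices):
        raise ValueError("need one tuple per grid point")
    coverage = {}
    for (i, j), record in zip(indices, records):
        di, dj = i - LOW[0], j - LOW[1]
        x = ORIGIN[0] + di * OFFSETS[0][0] + dj * OFFSETS[1][0]
        y = ORIGIN[1] + di * OFFSETS[0][1] + dj * OFFSETS[1][1]
        coverage[(x, y)] = dict(zip(COMPONENTS, record))
    return coverage


for location, values in build_coverage().items():
    print(location, values)
```

The rangeParameters block in the XML plays the role of COMPONENTS here: it names each position in the tuple and gives its unit of measure.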
While this section goes into some depth about how grid coverages are used in GML, there is the potential to
expand the representation considerably. This section may well expand in the future.
9.3.10 Styling
GML is designed to separate style from content, so having styles in GML might seem rather against the spirit
of GML. However it is often useful to be able to specify styles; in the current version of GML these relate
only to the graphical display of the data, but in principle they could be used to generate output in any format. The
default styles that can be attached to features in GML 3.1 can be stored alongside the GML that defines
the content. They are not enforced; rather they can be used where the application does not have any
styling information of its own. We will not consider the schema here.
9.4 Application schema
The use of GML in practice requires the development of application schema, which inherit the GML base types
(in particular extend the GML abstract types). In GML 2.0 this meant extending the abstract feature type,
but in GML 3.1 this can be applied to features, geometries, topologies and most other GML constructs, thus
the framework is very flexible and customisable. This is enhanced by the concept of GML profiles, which allow
the user to utilise only those parts of GML 3.1 that are required in the particular application domain. The
application of GML 3.1 will be illustrated in the lectures.
9.5 Associated technologies
GML forms the basis of a significant effort on the part of a range of organisations and individuals to facilitate
the sharing and ease of availability of geo-spatial data.
9.5.1 XMML
XMML, the eXploration and Mining Markup Language, is an XML based encoding for geo-science and exploration
information which extends GML. It is intended to support exchange of exploration information in a wide
variety of contexts. This includes exchange between software packages on the desktop and between users and
organisations, and in particular exchange over HTTP. Since XMML is a plain-text format within which the tags provide
a degree of internal documentation, it is highly suitable for archival use. Furthermore, the GML framework
used by XMML ensures that a certain amount of "schema-level" information is present in the data instance (all
objects are clearly labelled with their data-type, as well as their role in context) which reinforces this portability.
XMML has been designed on the premise that exploration data, such as bore-holes, rock samples and other
measurements, are essentially geo-spatial, so there are benefits in using an implementation framework that
is aligned with other geographic information systems. The XMML implementation is based on GML (3.1),
developed by the OGC and currently undergoing standardisation as ISO 19136. By choosing this basis the
following are obtained:
➢ a basic meta-model, based on features, properties, objects and values;
➢ a regular XML encoding pattern used to serialise instances conforming to the meta-model;
➢ inheritance of a large number of utility components, particularly concerning geometry, topology, temporal, coordinate reference systems, which therefore do not have to be re-invented;
➢ conformance with standards from the ISO 19100 series;
➢ compatibility with the OGC Web Feature Service (WFS) interface;
➢ potential for integration with data expressed in other GML-based languages;
➢ compatibility with national Spatial Data Infrastructures;
➢ widespread availability of basic processing tools and software components, particularly for transformation
to legacy formats, and styling for portrayal and reports.
XMML itself will be standardised through the IUGS Commission on Geo-science Information. The XMML code
is freely available from the web site http://www.seegrid.csiro.au/xmml.
9.5.2 SensorML
SensorML was developed to model sensors, primarily those used in satellite remote sensing. Under the auspices
of the Global Mapping Task Team (GMTT) within the international Committee on Earth Observation Satellites
(CEOS), Mike Botts began development of an XML-based Sensor Model Language for describing the geometric,
dynamic, and radiometric properties of dynamic sensors. Development and testing of SensorML has progressed
primarily under the auspices of the OGC, through funding from NASA and others. SensorML is also under
review as part of the ISO TC211 Projects 19115 (Part 2), 19129, and 19130.
SensorML provides the models and XML schema encoding for defining the geometric, dynamic, and observational
characteristics of a sensor. The purpose of SensorML is to:
➢ provide general sensor information in support of data discovery;
➢ support the processing and analysis of the sensor measurements;
➢ support the geo-location of the measured data;
➢ provide performance characteristics (e.g. accuracy, threshold, etc.);
➢ archive fundamental properties and assumptions regarding the sensor.
SensorML provides a functional model for the sensor, not necessarily a detailed description of hardware. It
supports rigorous geo-location models, which can describe sensor parameters independent of platform and target,
as well as mathematical models which can directly map between sensor and target space. Thus SensorML can
model the response function (also called the model function or forward model) of the sensor, relating what the
sensor observes to the quantity of interest to the user.
SensorML can apply to virtually any sensor, whether in-situ or remote sensors, and whether it is mounted
on a stationary or dynamic platform. Geo-location of observed data will be supported through "plug-n-play"
models for sensor grids, frame cameras, scanners, and replacement sensor (RPC - Rapid Positioning
Coordinates/Rational Polynomial Coefficients). Other geo-location models and response models can be developed by
independent communities and incorporated into SensorML.
The SensorML concept provides significant advantages for processing, visualisation, and data mining of dynamic
sensor data within a distributed desktop environment. In addition, the on-board use and direct distribution
of a SensorML description by the sensor itself can provide additional major benefits with regard to the remote
in-the-field processing of real-time sensor data, for autonomous operation of sensor systems (e.g. guidance, on-board
processing, and target recognition), and for cross-communication within a SensorWeb among aircraft, satellites,
and ground-based sensors.
9.6 Summary
GML is likely to have further incarnations, since it is still a young technology. There seems a very strong
likelihood that GML will form the basis of almost every GIS application and data service, even those which are
not specifically web based. For the exam it is necessary to know about the basic schema at a top level (that is
their role, but not the syntax of their declaration), and to understand a little about the feature, geometry and
topology schema in terms of their ability to represent the data representations we look at in this module. You
will also need to know a little about creating application schema.
GML-WorkingGroup 2004. ISO/TC 211/WG 4/PT19136 Geographic Information - Geography Markup Language (GML). OpenGIS Consortium. http://www.opengis.org/ accessed 9/7/04.
Lake, R., D. S. Burggraf, M. Trninic, and L. Rae 2004. Geography Mark-up Language – Foundation for the
Geo-Web. London: John Wiley and Sons Ltd.
10 Structures and Access Methods
In the previous section we looked at how to represent some of the data types we need to be able to handle
in a GIS investigation. We also looked at some simple topological and geometric algorithms. We discussed
how to represent the data to facilitate some of the topological queries that we might want to apply to the
spatial representation. In this section we discuss more general issues of data structures and access methods
for spatially referenced data, with emphasis on improving database efficiency. We start with a brief review of
standard database structures and access methods.
10.1 Standard data structures and access methods
Most databases can be considered as a collection of files, each file consisting of a collection of records. If we
consider a concrete example, such as the student database maintained in Computer Science, a record will be
something like the data on a particular student, a field will be something like the student mark in CS3210
(which will be a repeating field, since there will be marks for all other modules taken) or it could be a comment
field for recording special circumstances (a variable length field). The files will be composed of a sequence of
records, possibly grouped by programme (CS, CB, CH). As this will be a relatively small database, structure
and access methods will not be critical, but consider the larger databases held by many organisations (e.g.
telephone directory or many GIS databases) and you can see that size will be an issue.
Since the files are stored on disk (the size generally meaning that memory will not be sufficient on many systems)
the speed of access to any record will be determined by the seek time (time to move the disk read head to the correct
position), latency (time to read the data from the disk) and the CPU transfer time (actual transmission across
system bus(es)). In general the seek time will dominate, thus to minimise the time it takes to make a query we
will want to minimise the seek time. One way we can do this is to try to ensure that for a typical query, the
data needed is physically close on the disk. Thus we try to avoid fragmenting our files, and impose a sensible
ordering. Of course the problem here is that we cannot generally know ahead of time what queries will be made
to the database.
The next few sections consider possible methods of file organisation. The file organisation tells us how the data
is stored on the disk: do we use contiguous allocation or linked (pointer) allocation? First we consider the
simplest case.
10.1.1 Unordered files
As the name suggests this file system has no ordering. When we get a new piece of data we simply place it at
the first free position within our data file. Thus insertion is very efficient: there is no need to update ancillary
structures, except maybe marking which blocks on the disk are used and which not. Of course there is a price
to pay for this simplicity. Retrieval of information is very inefficient. If we want to search the database for a
certain record, using a certain field, and we have n records in the database, this will be of O(n) time complexity.
10.1.2 Ordered files
Using this file system we choose to order the database upon a particular field, for instance the student's surname.
This means that insertion of a new item of data can be expensive, since naively we will need to move all the
records below the insertion point, although we can reduce the complexity of this by using a pointer file to hold
the locations of the records, which is the only thing we need to update. The advantage that an ordered file
system gives us is that a query on the ordered field gives us a typical time complexity of O(log2(n)), using a
binary (divide and conquer) search algorithm. Of course retrievals based on other fields still have O(n) time
complexity.
10.1.3 Hash files
Hash functions are basically functions which take a field as an input (e.g. student ID number) and produce the
address of the data as an output (e.g. maybe taking the last 4 digits of the student ID will give the location
of the block containing that student's data). There are problems with implementing hash functions on many
field types, and since records will seldom correspond to blocks, we might have to be very clever in setting up
our hash functions (particularly if we have variable record lengths). There may also be problems in getting an
even distribution of the data across the surface of the disk. Of course if we can solve these problems the hash
file organisation is very quick to query on the hashed variable, in fact it is O(1).
10.1.4 Index files
If we cannot use hash functions, then an index file may be the next best thing. Rather like the index in a book,
where we look up a certain keyword, rather than search the entire book, index files give us improved efficiency
when querying the database. This index is an ordered version of the field being indexed together with a pointer
to the relevant record. Thus rather than ordering the file on one field, we can index several (or all) of the
fields, and any search on an indexed field will have O(log2(n)) time complexity. If the data files are really large
then we could index the index file, which would make the search even more efficient. The limit of this multi-level
indexing can be seen to be some sort of tree-like structure.
10.1.5 B-trees
We will not cover B-trees in any depth in these notes; the interested reader is referred to Worboys (1995, p.
247). We simply note that B stands for balanced. Thus B-trees give us the advantage of multi-level indexing,
but are usable in highly dynamic settings.
Table 3: An example of a file containing geographic data (locations of major cities in the UK). Note the x
and y coordinates are longitude and latitude respectively; we shall cover this in the section on map projections.
city ID  name         x coord  y coord
1        Plymouth     -4.2     50.4
2        Southampton  -1.4     50.9
3        Bristol      -2.6     51.5
4        London       -0.1     51.5
5        Norwich       1.3     52.6
6        Cardiff      -3.2     51.5
7        Swansea      -3.9     51.6
8        Birmingham   -1.9     52.5
9        Leicester    -1.1     52.6
10       Liverpool    -3.0     53.4
11       Manchester   -2.2     53.8
12       Sheffield    -1.4     53.4
13       Leeds        -1.5     53.8
14       Newcastle    -1.6     55.0
15       Glasgow      -4.3     55.9
16       Edinburgh    -3.3     56.0
10.2 From 1D → 2D
In the previous section we dealt with typical searches on 1D data structures (e.g. student ID number). We
might also like to query on multi-dimensional data (e.g. find all records where student name is Smith and course
taken is CS3210) but these non-spatial queries are different from spatial queries in that in general the multiple
fields will not be correlated, and will not have a natural concept of distance (can’t compare apples with pears).
In spatial databases most spatial queries will be on two variables (although it may be more in 3D and 4D
GIS). These two variables are the x and y location variables, which will typically have a (Euclidean) concept of
distance assigned.
An example of a spatial database is shown in Table 3. The type of queries we might ask of this data can be
grouped into:
➢ non-spatial query e.g. where is London?
➢ point query e.g. what is at the point (−4.2, 50.4)?
➢ range query e.g. what cities are within the box defined by (−5, 50) → (−3, 52)?
situations, and in any case unless the distribution of points is reasonably regular, there are likely to be empty
buckets and overcrowded buckets. This problem is common to many techniques of data access, such as hashing,
where data can easily bunch together if care is not taken. The optimal partition size will depend on the number
and spacing of points and the typical size of range query which we want to make.
10.3.1
The first type of query can be handled and optimised like any other non-spatial query. The second query could
be solved by a binary (index) search on x followed by checking the y values of any found records. The final
query might also be addressed in the same way. But both these queries require a non-optimised O(m) check on
the m retrieved y values (any original indexing cannot now be used and we would need to re-index). If m gets
big then the total time complexity will be O(m log2 (n)), which could be expensive.
Grid file
Figure 10.1: Four commonly used tilings of space (row, row prime, Morton and Peano-Hilbert).
[Figure 10.3 grid directory, row boundaries: y_grid = [49, 52, 53.5, 59]; cells labelled A to H.]
The essential problem is that the computational model for storage devices is fundamentally 1D (the addressing
systems of a disk or memory). This is a problem because we cannot devise an ordering in 1D which maintains
closeness of records for objects which are close in 2D. So we try and think up clever orderings which allow
records which are close in space to also be close in the data structure so that we can index on this 1D address.
If we assume that we can discretise 2D space at some level into a series of square blocks (or tiles), then if we
store the point data in these tiles according to some ordering we may be able to speed up access to the data.
Examples of the common orderings of tilings of space are shown in Figure 10.1.
In general the Morton tiling is most frequently used although the Peano-Hilbert tiling is as efficient. Both can
be seen as being derived from fractal principles. The question now is how can we use these orderings?
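One way to use the Morton ordering is to turn the (column, row) position of a tile into a single integer key by interleaving the bits of the two indices, and then to index or sort on that key. The sketch below assumes that the x bits occupy the even bit positions and the y bits the odd ones; the figure may use a different but equivalent convention, and the function name `morton_index` is our own.

```python
def morton_index(col, row, bits=16):
    """Interleave the bits of col (x) and row (y) into one Morton (Z-order) key."""
    key = 0
    for b in range(bits):
        key |= ((col >> b) & 1) << (2 * b)        # x bits go to even positions
        key |= ((row >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return key

# Order the tiles of a 4 x 4 grid along the Morton curve.
tiles = [(c, r) for r in range(4) for c in range(4)]
order = sorted(tiles, key=lambda t: morton_index(*t))
print(order[:4])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Tiles that share a 2 × 2 block receive consecutive keys, which is why records stored in key order tend to keep spatial neighbours close together on disk.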
Figure 10.3: (a) A plot of the city data shown in Table 3 using a Lambert projection (later) and (b) the
grid directory for the grid file – note cells have been joined to produce region D.
Rather than being restricted to a fixed grid structure, the grid file structure allows us to use uneven grids, which
can have rectangular regions. This allows us to tailor our grid to the data we have, so that each bucket will have
a similar (even identical) number of points within it. This reduces the problems of imbalance, but it does mean
that a dynamic situation will require re-computation of the grid file if the balance of the number of points in
each cell is to be kept. The method is illustrated in Figure 10.3 where we can see we have a more compact data
structure than in Figure 10.2, with a more even distribution of data points.
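A grid file lookup can be sketched as two binary searches on the grid boundaries followed by a look-up in the grid directory. The boundary values below are those printed on Figure 10.3; the layout of the directory (which label sits in which cell) is our reading of the figure and should be treated as an assumption, as should the function names.

```python
from bisect import bisect_right

x_grid = [-10, -3.5, -1.8, 2]      # column boundaries (from Figure 10.3)
y_grid = [49, 52, 53.5, 59]        # row boundaries (from Figure 10.3)
# Assumed directory layout, southern row first; two cells share bucket D.
directory = [["F", "G", "H"],
             ["D", "D", "E"],
             ["A", "B", "C"]]

def bucket_for(x, y):
    """Return the bucket label for point (x, y), or None if outside the grid."""
    if not (x_grid[0] <= x < x_grid[-1] and y_grid[0] <= y < y_grid[-1]):
        return None
    col = bisect_right(x_grid, x) - 1    # south and west edges belong to the cell
    row = bisect_right(y_grid, y) - 1
    return directory[row][col]

def buckets_for_range(x_min, y_min, x_max, y_max):
    """Return the set of bucket labels overlapped by a query rectangle."""
    labels = set()
    for row in range(len(y_grid) - 1):
        for col in range(len(x_grid) - 1):
            overlaps_x = x_grid[col] <= x_max and x_min < x_grid[col + 1]
            overlaps_y = y_grid[row] <= y_max and y_min < y_grid[row + 1]
            if overlaps_x and overlaps_y:
                labels.add(directory[row][col])
    return labels

print(bucket_for(-4.2, 50.4))
print(sorted(buckets_for_range(-5, 50, -3, 52)))
```

Only the buckets returned by `buckets_for_range` need to be read from disk; the points inside them are then checked individually against the query box.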
10.3.2
10.3
[Figure 10.3 grid directory, column boundaries: x_grid = [−10, −3.5, −1.8, 2].]
Point quad-tree
Structures for point data
We have already dealt with appropriate relations for storing point data, the question now is how can we order
these in our data structure to enable us to achieve efficient queries. In particular we might want to optimise the
efficiency of range queries – that is finding points within a
given region. We will look at point data in some detail, but
time will not permit a thorough treatment of structures for
line and polygon data. Details can be found in Worboys
(1995, Chapter 6).
The simplest method would be to place a grid over space and allocate the points to the cells in which they fall
(the south and west boundaries are considered part of the cell). In this context we might refer to a cell as a
‘bucket’ for the points. This is shown schematically for the cities data in Figure 10.2. We can then store the
points which occur in the same cell in the same block on the disk, for example. We might also order the cells
using a Morton tiling. Thus points which are close in space will also be close on the disk, and range queries
will be quicker to make.
Figure 10.2: A possible grid organisation.
However the fixed cell size and origin means that the data structure may not be appropriate in dynamic
Figure 10.4: Constructive definition of the point quad-tree, including the ordering of the children: NW, NE, SW, SE.
This data structure is based on the tree concept (extended to 2D space so that each node has 4 children) which
we will cover in more detail when we look at methods for storing raster data. The nodes in the tree are the points
themselves, thus they have the coordinates and attributes of the points attached, but also have pointers to their
children - which are points to the NW, NE, SW, SE respectively. They are more easily understood by reference
to their constructive algorithm, which is illustrated in Figure 10.4. These are useful data structures because
they can be created quickly – typically O(n log(n)) – and can also be queried efficiently – a point query typically
has O(log(n)) time complexity. It is also an efficient embedding of the 2D space in a 1D data structure. However like all
variable resolution methods it suffers, since a great deal of effort may be required to update the structure with
the addition or removal of a point (particularly if this occurs near the top of the data structure). It is also
very sensitive to the order in which the points are presented - different orderings may give very different tree
structures.
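The constructive algorithm for the point quad-tree can be sketched in a few lines of Python. The class and function names are our own, the points are hypothetical, and we assume that a point lying exactly on a dividing line is sent to the north and/or east child.

```python
class QuadNode:
    """A point plus pointers to its four children."""
    def __init__(self, x, y, attrs=None):
        self.x, self.y, self.attrs = x, y, attrs
        self.children = {"NW": None, "NE": None, "SW": None, "SE": None}

def quadrant(node, x, y):
    """Name of the quadrant of node in which (x, y) falls (ties go north/east)."""
    return ("N" if y >= node.y else "S") + ("E" if x >= node.x else "W")

def insert(root, x, y, attrs=None):
    """Insert a point by walking down from the root; returns the root."""
    if root is None:
        return QuadNode(x, y, attrs)
    node = root
    while True:
        q = quadrant(node, x, y)
        if node.children[q] is None:
            node.children[q] = QuadNode(x, y, attrs)
            return root
        node = node.children[q]

def find(root, x, y):
    """Point query: follow one branch per level until the point is found."""
    node = root
    while node is not None:
        if node.x == x and node.y == y:
            return node
        node = node.children[quadrant(node, x, y)]
    return None

def depth(node):
    if node is None:
        return 0
    return 1 + max(depth(c) for c in node.children.values())

def build(points):
    root = None
    for x, y in points:
        root = insert(root, x, y)
    return root

# The same three points presented in two different orders.
print(depth(build([(1, 1), (2, 2), (3, 3)])))   # 3
print(depth(build([(2, 2), (1, 1), (3, 3)])))   # 2
```

The two depths illustrate the sensitivity to the order of presentation: the first ordering degenerates into a chain, the second gives a balanced tree.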
data storage, but the interested reader can check the notes on image compression, available from the GIS web
site.
10.3.3
Point 2D-tree
Figure 10.6: A raster data layer and the quad-tree decomposition of this region, and the quad-tree.
The most commonly used raster data structure in GIS is the region or cell quad-tree. This structure works on
grids of dimension 2^n × 2^n – so some padding may be necessary to use the quad-tree on most data structures.
At the top of the tree we consider the whole of the space. If this is all one value then we can stop here, and
simply assign that value to the root node. If there is any heterogeneity (non-constant values) in the full region
we divide this into four new squares and repeat the process until each square is simply a cell or pixel. The
decomposition is most easily represented by a tree structure where each parent has four children, ordered as
shown in Figure 10.7.
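The recursive decomposition described above can be sketched as follows. We assume the grid is a 2^n × 2^n list of lists with row 0 at the north edge, and that the four children are ordered NW, NE, SW, SE; the nested-list representation and the function names are illustrative choices, not the layout used by any particular GIS.

```python
def build_region_quadtree(grid, row=0, col=0, size=None):
    """Return a leaf value, or a list of four subtrees [NW, NE, SW, SE]."""
    if size is None:
        size = len(grid)                 # grid is assumed to be 2^n x 2^n
    first = grid[row][col]
    if all(grid[r][c] == first
           for r in range(row, row + size)
           for c in range(col, col + size)):
        return first                     # homogeneous block: stop here
    half = size // 2
    return [build_region_quadtree(grid, row, col, half),
            build_region_quadtree(grid, row, col + half, half),
            build_region_quadtree(grid, row + half, col, half),
            build_region_quadtree(grid, row + half, col + half, half)]

def count_leaves(tree):
    if isinstance(tree, list):
        return sum(count_leaves(t) for t in tree)
    return 1

# A small hypothetical binary raster.
grid = [[0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
tree = build_region_quadtree(grid)
print(tree)                 # [0, 1, 0, [0, 1, 1, 1]]
print(count_leaves(tree))   # 7
```

Here 16 cells are represented by 7 leaves; a raster whose values change at every cell would gain nothing from the decomposition.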
Figure 10.5: Constructive definition of the point 2D-tree, including the ordering of the children: >, <.
This data structure is similar to the point quad-tree but we try and ensure that there are fewer dangling nodes,
by giving each node only two children which correspond to > and < in one coordinate. Thus even layers in
the tree represent the x coordinate and odd layers the y coordinate. The price we pay is that we will generally
have trees which are much deeper! No free lunch. The constructive algorithm is shown in Figure 10.5, and
can be contrasted with that shown in Figure 10.4. Note that the tree has far fewer dangling nodes, and gives
a generally more compact data structure. The querying of the tree for range type queries is relatively simple,
algorithms being developed in Worboys (1995, p. 264), although there is an omission in the full tree updating
algorithm. The tree structure will again depend on the order of presentation of the points in the data set.
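A sketch of the 2D-tree, with insertion and a range query, is given below. It is not the algorithm from Worboys (1995); it is a minimal version written under the assumption that points equal to the splitting coordinate go to the ‘greater than or equal’ child, and the coordinates used are hypothetical.

```python
class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = None      # coordinate <  this node's on the splitting axis
        self.right = None     # coordinate >= this node's on the splitting axis

def kd_insert(node, point, depth=0):
    """Insert point (x, y); even layers split on x, odd layers on y."""
    if node is None:
        return KDNode(point)
    axis = depth % 2
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, depth + 1)
    else:
        node.right = kd_insert(node.right, point, depth + 1)
    return node

def kd_range(node, lo, hi, depth=0, found=None):
    """Collect points p with lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]."""
    if found is None:
        found = []
    if node is None:
        return found
    axis = depth % 2
    p = node.point
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        found.append(p)
    if lo[axis] < p[axis]:               # the box reaches into the left subtree
        kd_range(node.left, lo, hi, depth + 1, found)
    if hi[axis] >= p[axis]:              # the box reaches into the right subtree
        kd_range(node.right, lo, hi, depth + 1, found)
    return found

# Hypothetical (longitude, latitude) points.
pts = [(-4.2, 50.4), (-0.1, 51.5), (-3.2, 55.9), (-1.9, 52.5), (-6.3, 54.6)]
root = None
for pt in pts:
    root = kd_insert(root, pt)

print(kd_range(root, (-5, 50), (-3, 52)))   # [(-4.2, 50.4)]
```

Subtrees that cannot intersect the query box are never visited, which is where the saving over a linear scan comes from.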
Of course all these data structures come at a price. We need to store extra information to speed up the standard
queries and do quite a bit of preprocessing. The decision as to whether this is worth it will depend on whether
we are constrained by storage or time, and what our users feel is more valuable.
10.4
Lines and Areas
We do not have the time to cover data structures appropriate for speeding up queries on linear and area objects,
but this is treated in Worboys (1995).
10.5
Structures for Raster Data
The basic data structure appropriate for 2D spatial data is the standard array. The cells are generally stored in row
order and referenced by two indices. Arrays are not very suitable for storing large volumes of raster data due
to the sparseness or repetition in the data. A sparse array is one that contains many zero, repeated or ‘NULL’
values. For instance considering the binary data shown on the top of Figure 10.6, there are large areas where
there is no data at all, but we still need to store the NULL contents.
This problem is essentially one of storage space. One solution, used in image processing and computer graphics
is to compress the 2D array using all manner of compression methods. We will not cover this aspect of raster
Figure 10.7: The ordering of the leaves of a quad-tree (leaves numbered 1, 2, 3, 4).
The good points of the quad-tree are that it gives an efficient ordering of 2D space in a 1D structure, which
means a lot of queries are relatively easily implemented (see Worboys (1995, p. 255) for more details). It is
also variable resolution, which means that extra resolution can be used in areas where there is a great deal of
variability. Note that the ordering of the children implies a Morton tiling of space, thus points in the quad-tree
which are close can be expected to be close in 2D space.
The down side is that the quad-tree is difficult to use with dynamic data, because whole roots need to be
updated, and it is not invariant to translation of the objects within it. Also if the grid values vary rapidly
in space, then there will be no reduction of storage (there is potential for an increase) and an increase in
computational time. In general it will not be appropriate (unless forced on us by software constraints) to store
intrinsically vector data (e.g. roads) as raster data, since the quad-tree structure is generally inefficient for this
sort of data.
Stages of a GIS Application
12.1
Sampling: interpolation
Figure 12.1: Interpolating to a grid.
Figure 11.1: An overview of GIS (diagram labels: ‘Reality’, Measurements, Data models, Data structures,
DATABASE, Data retrieval, Data analysis, Presentation, Users).
Before we move on to issues of data acquisition (measurement) I would like to briefly review the typical stages in
a GIS application. Since we are dealing with a geographic Information System, it will come as no surprise that
this follows similar lines to standard IS projects:
➢ establish the objectives of the project [5];
➢ identify the analysis method(s) suitable for meeting these objectives [10];
➢ acquire the data necessary for the analysis [40];
➢ perform the (spatial) analysis [20];
➢ summarise the results [10];
➢ interpret the results [15];
➢ refine the analysis if needed [??].
For a typical GIS based project, the greatest amount of time will be spent on the data acquisition phase, as
indicated by the numbers in square brackets which indicate a proportion of the time which might be devoted
to each stage. Of course this will vary greatly from project to project and is only meant as a rough guide.
The reason that data acquisition often takes so long is that this frequently requires the creation of completely
new data sets. As the GIS industry matures more data is becoming available in pre-packaged form and thus
less time may be spent on data acquisition, but because the data is commercial, more money might be spent.
We will not deal in any depth with standards for spatial data transfer (one major reason being that these are
still in some flux), but this is becoming increasingly important. The next sections deal with methods for the
acquisition of raster and vector data.
So the next question is how do we go about this interpolation? We have already met an example earlier in
the notes, where we estimated the z value of a TIN for an arbitrary point (xp , yp ). That was an example of
piecewise linear interpolation, because we assumed that the surface varied like a plane locally between the three
surrounding points in the triangulation. In general, interpolation methods are most easily understood in 1D
first, extension to higher dimensions being natural. We will cover several interpolation methods, which can be
differentiated by the assumptions that they make about the underlying spatial process. At the highest level we
can divide interpolation methods into global and local categories.
Global interpolation methods use all the data at once to estimate the value of a parametric surface that is
defined over the entire region using some function. If we use polynomials in x and y to represent the surface e.g.
z = f(x, y) = αx + βy + γ then this type of interpolation is called trend surface analysis. The most crude version of
this is to say that the function is a constant, and thus given by the mean value of the variable of interest. We could
also use the values of other predictors, which we have sampled densely in space, to predict the value of the variable
we are interested in. For example at a crude level temperature is strongly related to elevation (it gets colder the higher
you get). Thus if we have a few samples of the temperatures at different locations and a digital elevation model,
giving elevation at all locations, then we could use a regression relation between temperature and elevation to
predict the temperature at the grid points. Global interpolation methods tend to be less suitable for spatial
data because typically spatial data varies greatly over a range of scales, which cannot be captured using simple
global models.
Figure 12.2: Data used in 1D interpolation.
Local interpolation methods are generally to be preferred since they are more flexible, considering only the data
in a local neighbourhood to compute the value of the variable in a grid cell. However, they are generally rather
more computationally intensive, since the neighbourhood and thus the computation, changes at each grid cell.
12.1.1
12
We will start by considering interpolation / sampling methods to acquire raster data. Consider the sampling network
shown in Figure 12.1, which may be the result of specific samples (field campaign, questionnaires) or some fixed
network (meteorological stations, pollution monitoring sites). We want to perform some analysis on the basis of these
measurements – but we want to use the spatial context. The simplest way is to convert from site data to raster
data, so that we produce a realisation of the spatial variation of the variable we are interested in.
Hopefully, you will now be in a position (thanks to the labs
and the coursework) to appreciate the power and potential uses of GIS. We have seen the representation methods
used to store both raster and vector data. We also have
a reasonable idea about optimising the physical storage
structures to enable efficient querying of these data structures. In the labs we have looked at some of the issues
relating to the application of GRASS to solve some typical
GIS based problems (data analysis). It should be clear by
now that as with all computer science issues it is the user
that dictates the form the various parts of the GIS take
(see Figure 11.1).
Nearest neighbour interpolation
Sources of Raster Spatial Data
In this section we address methods for acquiring spatial data which will be stored in a raster format. While
the actual data structure used for storage may be a quad-tree, we will assume that the end product we produce
takes the form of a regular grid of squares, over a finite Euclidean domain (i.e. something that we can easily
store in an array).
There are several methods one could envisage for acquiring such data. In general raster data structures are
most appropriate for storing continuous data, such as temperature, elevation, land cover, population density
(?) or lead pollution levels. Thus we will assume that we are trying to acquire data that is best stored as a
raster variable. The most simple method of data collection for raster data would be to sample the data at the
centre of each grid cell and use this value. However this is seldom possible, and what happens if we change the
resolution of our grid? It also raises the question of what the value in a raster cell represents – is it the value at
the centre of the cell or the average value within the cell? In most studies this question is not really properly
addressed, but my preference is to assume it is the average value in the cell.
In nearest neighbour interpolation we set the interpolated value equal to the value of the nearest observation.
In 1D this produces step like curves, as illustrated in Figure 12.3. In 2D this will produce Voronoi polygons,
with each cell in the polygon taking the value of the central observation. This leads to discontinuities in the
surface (sudden jumps from one value to another) which is not realistic for continuous phenomena. Thus nearest
neighbour based interpolation should only be applied to those variables which are really discontinuous, such as
regions for which rainfall > 0 mm. Computationally nearest neighbour interpolation is relatively cheap, being
O(mn) at worst if there are m points in the grid at which to interpolate, and n observations.
Figure 12.3: Nearest neighbour interpolation.
It is clear from Listing 6 that the algorithm is very simple, the only slightly tricky part involving the determination
of the closest point. If we use clever data structures, such as a point 2D-tree, to store the points, finding
the closest point could be made efficient.
FOR each point at which to interpolate
    Find the closest point in the data set
    Assign that point's value to the current location
END
Listing 6: Pseudocode for nearest neighbour interpolation (any dimension).
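Listing 6 translates almost line for line into Python. The sketch below is the brute force O(mn) version, with no 2D-tree; the function name and the sample values are hypothetical, and coordinates are given as tuples so that the same code works in any dimension.

```python
import math

def nearest_neighbour(targets, samples):
    """samples: list of (coords, z); targets: list of coords. Brute force O(mn)."""
    result = []
    for t in targets:
        # Find the closest observation and take its value.
        coords, z = min(samples, key=lambda s: math.dist(s[0], t))
        result.append(z)
    return result

# Hypothetical 1D observations at x = 0, 10 and 20.
samples = [((0.0,), 1.0), ((10.0,), 5.0), ((20.0,), 3.0)]
targets = [(2.0,), (6.0,), (14.0,), (19.0,)]
print(nearest_neighbour(targets, samples))   # [1.0, 5.0, 5.0, 3.0]
```

The output shows the step-like behaviour: the interpolated value jumps from 1.0 to 5.0 half way between the first two observations.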
12.1.2
Piecewise linear interpolation
Linear, piecewise interpolation in 1D assumes that the
curve is made up of a series of linear segments, which when
extended to 2D can either produce triangulation based linear (planar) interpolation or bi-linear interpolation. This
will produce continuous surfaces, but with sharp changes
in direction at the points at which the planes or lines join.
Thus this is generally considered a poor model for interpolation of continuous variables which vary smoothly in
space. However it is O(mn) again in complexity, but does
not produce discontinuities, thus is to be preferred to nearest neighbour interpolation when speed is of the essence.
Like nearest neighbour interpolation the algorithm, Listing 7, is rather simple although we can save a bit of time
by pre-computing the triangulation - indeed if storage is
not at a premium we might also store the planar coefficients (which we can compute using Listing 5) so that
they are not repeatedly evaluated.
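In 1D the piecewise linear method reduces to finding the two observations either side of the interpolation point and weighting them by distance. The following is a minimal sketch of that 1D case only (the TIN based 2D version needs a triangulation, which we do not attempt here); the function name and data are hypothetical, and the sample locations are assumed to be sorted and distinct.

```python
from bisect import bisect_right

def linear_interp(x, xs, zs):
    """Piecewise linear interpolation of z at x; xs must be sorted ascending."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x lies outside the sampled range")
    i = min(bisect_right(xs, x), len(xs) - 1)   # index of right-hand neighbour
    x0, x1 = xs[i - 1], xs[i]
    t = (x - x0) / (x1 - x0)                    # 0 at x0, 1 at x1
    return zs[i - 1] + t * (zs[i] - zs[i - 1])

xs = [0.0, 10.0, 20.0]
zs = [1.0, 5.0, 3.0]
print(linear_interp(5.0, xs, zs))    # 3.0
print(linear_interp(15.0, xs, zs))   # 4.0
```

The result is continuous, but the slope changes abruptly at x = 10, which is the ‘sharp change in direction’ noted above.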
Figure 12.4: Piecewise linear interpolation.
Figure 12.5: (a) Inverse distance and (b) inverse distance squared interpolation methods.
FOR each point, p, at which to interpolate
    Find the surrounding points, p_i, in the neighbourhood
    FOR each point p_i
        w_i = (1 / distance(p_i,p))^power;
    END
    Normalise the weights, w_i
    z_p = 0;
    FOR each point p_i
        z_p = z_p + w_i*z_i;
    END
END
Listing 8: Pseudocode for inverse distance interpolation (any dimension).
Construct a (Delaunay) triangulation of the data set
FOR each point at which to interpolate
    Find the corresponding Delaunay triangle
    Use planar interpolation to derive the value
END
Listing 7: Pseudocode for TIN based piecewise linear interpolation (2D).
Bi-linear interpolation is rather different in that we find the nearest neighbours on a grid – that is the four
surrounding corners of the neighbouring pixels (Burrough and McDonnell, 1998). Thus it is most widely used
on data which is already gridded e.g. for changing projections.
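Bi-linear interpolation on gridded data can be sketched as two linear interpolations along the rows followed by one between the rows. We assume unit grid spacing, a grid stored as `grid[row][col]` with at least two rows and two columns, and fractional (column, row) coordinates; these conventions and the function name are our own.

```python
def bilinear(grid, x, y):
    """Interpolate grid (grid[row][col], unit spacing) at fractional col x, row y."""
    rows, cols = len(grid), len(grid[0])
    if not (0 <= x <= cols - 1 and 0 <= y <= rows - 1):
        raise ValueError("point lies outside the grid")
    c0 = min(int(x), cols - 2)        # column of the left-hand pair of corners
    r0 = min(int(y), rows - 2)        # row of the upper pair of corners
    tx, ty = x - c0, y - r0
    top = grid[r0][c0] * (1 - tx) + grid[r0][c0 + 1] * tx
    bottom = grid[r0 + 1][c0] * (1 - tx) + grid[r0 + 1][c0 + 1] * tx
    return top * (1 - ty) + bottom * ty

grid = [[0.0, 10.0],
        [20.0, 30.0]]
print(bilinear(grid, 0.5, 0.5))   # 15.0
```

At the grid nodes themselves the function returns the stored values, so the method honours the original data.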
have thousands of observations we will generally choose only a small number of nearby points to use in the
interpolation. Often we will choose all points within u kilometres, or the n closest points. Since the use of
inverse distance weighting is rather ad-hoc (there is no process based explanation for using it) setting the values
of these parameters is also a little ad-hoc.
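Listing 8 can be sketched in Python as below. The neighbourhood is taken to be all observations within an optional `radius` (one of the choices discussed above); returning the observed value when the interpolation point coincides with an observation is our own addition to avoid dividing by zero. Names and data are hypothetical.

```python
import math

def idw(target, samples, power=2.0, radius=None):
    """samples: list of (coords, z). Returns the inverse distance weighted estimate."""
    weights, values = [], []
    for coords, z in samples:
        d = math.dist(coords, target)
        if d == 0.0:
            return z                      # target coincides with an observation
        if radius is None or d <= radius:
            weights.append(1.0 / d ** power)
            values.append(z)
    if not weights:
        raise ValueError("no observations inside the neighbourhood")
    total = sum(weights)                  # normalise so the weights sum to one
    return sum(w * z for w, z in zip(weights, values)) / total

samples = [((0.0,), 2.0), ((10.0,), 6.0)]
print(idw((5.0,), samples))               # 4.0
print(idw((2.0,), samples, power=1.0))
print(idw((2.0,), samples, power=2.0))
```

With power 2 the estimate at x = 2 lies closer to the nearby observation than with power 1, since distant points are down weighted more strongly.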
12.1.4
Geostatistical interpolation
Geostatistical interpolation is rather more principled, being based on a proper spatial process model. This means
that in general (given a sensibly chosen process model) geostatistics is the method of choice for most interpolation
problems. Of course the benefits come at a price.
12.1.3
Inverse distance interpolation
Possibly the most commonly used form of interpolation, which produces continuous, smooth interpolation (for
some choices of the power parameter) is the method based on inverse distance weighting. As the name suggests,
the weight given to each observation varies inversely with the distance between the observation and interpolation
point.
As we can see from the algorithm shown in Listing 8 there are several factors which need to be decided upon to
implement inverse distance interpolation. Firstly we must decide on the power that the inverse distance is raised
to. In Figure 12.5(a) the power is one, while in (b) it is two. This can be seen to produce very different results.
In general inverse distance squared might be preferred since this gives a smoother (more realistic) surface.
Having fixed the power, we also need to decide on the neighbourhood. For small data sets (say less than
50 observations) we can assume that we will use all the data to interpolate to a new point, however if we
The kriging method is based on assuming that the underlying spatial process is a continuous random field, which can
be described in terms of the mean and (auto-)covariance
function. Thus we are assuming right from the start that the variable we are representing is a random variable
– that is the variable has a certain probability distribution. This is illustrated in Figure 12.6 where we can
look at the probability of the random variable x, which is denoted by p(x). We can see areas where x is highly
probable and areas where x is less probable. In the random field model that underlies kriging we assume that
the probability distribution is Gaussian or Normal – so that it is totally described by its mean (average) and
variance (the square of the standard deviation, a measure of spread). Indeed in the Gaussian random field model we assume that the joint
distribution of all the variables (at all points) is multi-variate Gaussian, so that the distribution is completely
specified by the mean function and covariance function.
Figure 12.6: Concept of a random variable.
We have met a covariance function earlier in the text – this essentially describes the similarity of points as a
function of the distance between them. Recall that more similar points tended to be closer to each other. In
order to correctly estimate the covariance function we need to assume that the spatial process is stationary –
that is the same process has generated the samples at all locations in space. In practice this is very difficult to
check – there are many ways to alleviate the effects of non-stationarity but we will not discuss these here. Let
us assume the process is stationary.
Figure 12.7: A typical variogram and the corresponding covariance function. (Panels: Variogram, semivariance
against separation distance; Covariance function, covariance against separation distance; both annotated with
nugget, sill and range.)
Now we need to know how to estimate the covariance function. One method is to assume it is known from our
knowledge of the generating process – this sort of prior knowledge is very rarely available. The other option is to
estimate the covariance function from our samples of the stationary process. In standard (frequentist) statistics
we make use of repeated samples of a process (which we generally assume is stationary in time). In spatial
statistics we only have one sample of a given process (can’t step into the same river twice) so the stationarity
assumption is key to making inference about the process.
There are two equivalent (under stationarity assumptions) functions which we can compute to describe the variation
of the process (we will assume a zero mean for convenience) – the covariance function or the variogram. Figure 12.7
shows a covariance function and the corresponding variogram. In general terms we can write: covariance function
≈ variance − variogram. We can also see from Figure 12.7 that a variogram can be interpreted as providing a
measure of dissimilarity as a function of separation distance, while the covariance function represents a measure
of similarity as a function of separation distance. In practice we compute a sample variogram or covariance
function from the available point data. We then fit a parametric functional form to this sample variogram in order
to be able to estimate the variogram for all separation distances as shown in Figure 12.8.
Figure 12.8: The sample and parametric variogram. (Axes: semivariance against separation distance; the sample
variogram is the average semivariance for a given lag interval, overlaid with the parametric variogram function.)
The choice of the parametric form of the variogram function is critical to the properties of the resulting
interpolated process – particularly if simulation is used. First we must compute the sample variogram. An algorithm
for this is shown in Listing 9. Once the sample variogram has been computed we must decide on the parametric
function we will use to represent it. Typically we will choose from one of the following models:
➢ exponential – γ(h) = c0 + c1 (1 − exp(−h/r));
➢ squared exponential – γ(h) = c0 + c1 (1 − exp(−h^2/r^2));
➢ linear – γ(h) = c0 + bh;
although there exist many other possibilities. For the exponential and squared exponential models c0 corresponds
to the nugget (or noise variance), c1 to the relative sill variance and r to the range (the distance beyond which
there is on average no correlation) – see Figure 12.7. The variable h represents the separation distance. The
linear variogram has no sill (the variance is unbounded), just the nugget (or noise variance), c0 and the slope
of the line b.
We can interpret the various parameters of the variogram. The nugget variance has two contributing sources.
It can be seen as representing the combined effect of the uncorrelated errors in observing the variable z and the
component of the variation of z that cannot be resolved by the sampling scheme (i.e. variations at scales less
than the average minimum pair separation distance). The relative sill variance gives a measure of the range of
values over which the well sampled part of the process varies (that is the magnitude of the variability of the
process). The range of the variogram is indicative of the characteristic (spatial) scales over which the process
varies.
-- Compute the semi-variance and separation distance
-- for each lag interval using all the data.
FOR each point, p_i, in the sample point set
    FOR each point p_j (j > i), in the sample point set
        sep_dist = distance(p_i,p_j);
        semi_var = (z(p_i) - z(p_j))^2;
        determine the lag interval from sep_dist: lag
        sdist(lag) = sdist(lag) + sep_dist;
        svar(lag) = svar(lag) + semi_var;
        npoints(lag) = npoints(lag) + 1;
    END
END
-- Compute the average for each lag class
FOR each lag class,
    IF (npoints(lag) > 0) THEN
        svar(lag) = svar(lag)/(2*npoints(lag));
        sdist(lag) = sdist(lag)/(npoints(lag));
    ENDIF
END
Listing 9: Pseudocode for computing the sample variogram.
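The sample variogram computation of Listing 9 can be sketched in Python as follows. We assume equal-width lag classes, with class k covering separations from k × lag_width up to (but not including) (k + 1) × lag_width, and we visit each pair of points once; the function name, arguments and data are hypothetical.

```python
import math

def sample_variogram(points, values, lag_width, n_lags):
    """Return (mean separation, semivariance, pair count) for each non-empty lag class."""
    sdist = [0.0] * n_lags
    svar = [0.0] * n_lags
    npairs = [0] * n_lags
    for i in range(len(points)):
        for j in range(i + 1, len(points)):       # each pair once, no self-pairs
            d = math.dist(points[i], points[j])
            lag = int(d // lag_width)             # which lag class the pair falls in
            if lag < n_lags:
                sdist[lag] += d
                svar[lag] += (values[i] - values[j]) ** 2
                npairs[lag] += 1
    return [(sdist[k] / npairs[k], svar[k] / (2 * npairs[k]), npairs[k])
            for k in range(n_lags) if npairs[k] > 0]

# Hypothetical 1D transect with unit spacing.
points = [(0.0,), (1.0,), (2.0,), (3.0,)]
values = [1.0, 2.0, 4.0, 3.0]
print(sample_variogram(points, values, lag_width=1.0, n_lags=3))
# [(1.0, 1.0, 3), (2.0, 2.5, 2)]
```

The pair count is returned alongside each estimate because lag classes containing few pairs give unreliable semivariances and are often ignored when fitting the parametric model.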
The variogram function, and in particular its behaviour as h → 0, tells us about the continuity properties of the
generating process. For the exponential variogram the underlying process is continuous but not smooth, while
for the squared exponential variogram the process is as smooth as it can possibly be. There is a lot of theory
here which we do not have time to develop. Having chosen the form of the model (either on the basis of the
data, or some prior knowledge of the generating process) we can tune the parameters of the model to produce
the best fit to the sample variogram. This is a complex procedure which we will not cover here, but note that
in general this will be a non-linear optimisation problem.
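The three parametric models listed earlier are simple to write down as functions, which is a useful first step towards fitting them. The sketch below also includes a sum-of-squares misfit between a model and a sample variogram; using least squares as the fitting criterion is our assumption, and the parameter values are those quoted in the titles of the kriging figures (nugget 0.5, sill 4.0, range 2.5).

```python
import math

def exponential_variogram(h, c0, c1, r):
    """gamma(h) = c0 + c1 (1 - exp(-h / r))"""
    return c0 + c1 * (1.0 - math.exp(-h / r))

def squared_exponential_variogram(h, c0, c1, r):
    """gamma(h) = c0 + c1 (1 - exp(-h^2 / r^2))"""
    return c0 + c1 * (1.0 - math.exp(-(h * h) / (r * r)))

def linear_variogram(h, c0, b):
    """gamma(h) = c0 + b h"""
    return c0 + b * h

def sum_squared_error(model, params, lags, semivariances):
    """Misfit between a parametric model and a sample variogram."""
    return sum((model(h, *params) - g) ** 2 for h, g in zip(lags, semivariances))

lags = [1.0, 2.0, 4.0, 8.0]
for h in lags:
    print(h,
          round(exponential_variogram(h, 0.5, 4.0, 2.5), 3),
          round(squared_exponential_variogram(h, 0.5, 4.0, 2.5), 3))
```

A non-linear optimiser would adjust the parameters to minimise `sum_squared_error`; note that evaluating these formulas at h = 0 returns the nugget c0, which is the limiting value as h → 0.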
An example of the application of kriging given an assumed variogram is shown in Figure 12.9 (for the data shown
in Figure 12.2), using an exponential variogram. We have yet to specify how to compute the interpolation. Let
us assume that we have computed the sample variogram and fitted a good model to this. Just like the inverse
distance weighting method we will now compute the interpolated value of z at some unsampled point p as a
weighted linear combination of the values observed at the surrounding points p_i in the neighbourhood of p. In the inverse distance
weighting method the weights simply depended on the distance of the sample points from the interpolation
point. In kriging we use the variogram to assess the importance of the neighbouring points on the basis of their
‘covariance’ with both the interpolation point and with themselves. This ensures that sample points which are
very close together are relatively down weighted.
Figure 12.9: Kriging interpolation (exponential variogram).
Figure 12.10: Kriging interpolation (squared exponential variogram – bottom).
[Figure panels, not reproduced: an Exponential Variogram and a Gaussian Variogram (semivariance against separation
distance), with panel titles ‘Results for: nugget = 0.500, sill = 4.000, range = 2.500’ and ‘Results for: nugget =
0.500, sill = 4.000, range = 1.667’, each above a plot of the data, the kriged value and the 1.96*kriging variance band.]
It is not possible to give details of exactly how the weights are calculated here because to do this would require
matrix algebra, since we use the inverse covariance matrix to compute the effect of clustering in the sample
points. I am afraid you will have to take my word for it. For those of you who are comfortable with mathematics
we have:
$$z_p = \mathbf{c}_0' \mathbf{C}^{-1} \mathbf{z} , \tag{12.1}$$
at Cressie (1993) for more details. For instance kriging provides a method not just for estimating point values
(as is done using the other methods outlined previously) but also for estimating the average value in blocks
(so called block kriging). This would be consistent with the interpretation of the cell value as representing an
average in that cell.
where the use of bold letters denotes vectors, so that z is the (column) vector of observed z values at the
neighbouring sample points. The covariance matrix of [z, z_p] is:
$$\begin{bmatrix} \mathbf{C} & \mathbf{c}_0 \\ \mathbf{c}_0' & c_{00} \end{bmatrix} , \tag{12.2}$$
12.1.5
Which interpolation algorithm?
where c0 is a column vector of the covariance between the interpolation point p and all the surrounding
observations, C is the covariance matrix of the surrounding observations and c00 is the variance at the point p.
These are computed using a variogram or covariance function determined earlier.
The question remains – what should we do if we have a specific point data set and want to generate a continuous
representation (raster layer) of this. We have several possible algorithms, each of which has its own merits. In
general we would like to use the best algorithm available, which is going to be kriging is almost all situations
where we have sufficient data. To estimate the variogram accurately it is generally held that at least 100 samples
are required – more if the process is non-stationary.
An advantage of using kriging is that because we have a probabilistic (or stochastic) model we can also estimate
the accuracy of the prediction at the unsampled location (based on an assumption that our model is correct).
We can write the variance of the estimate at the interpolation point as:
If speed is critical and we need to perform the interpolation in real-time then we may use a simpler method
such as inverse distance weighting.
σp2 = c00 − c0 0 C −1 c0 .
(12.3)
Thus we can provide both the most likely value (mean) of the interpolated variable and confidence limits
(as shown in Figure 12.9 for kriging using the exponential variogram and in Figure 12.10 using the squared
exponential variogram). This is important because it enables us to estimate the effect of the interpolation
method, in terms of any additional errors that this introduces into the data in the raster layers. Note that
finding the inverse of a matrix has O(n3 ) time complexity, thus we usually use a moving neighbourhood to
compute the kriging based interpolation, and thereby keep n small.
There are many factors which we have not dealt with above. For instance we have assumed a zero mean. In
many situations this is not going to be the case (for instance if the point observations were of temperature we
would observe a gradual cooling trend as one went further north). If a trend is present we must take this into
account. There is not scope to investigate this here, but it is still an active area of research. We have also
assumed isotropy – that is the variable behaves the same in all directions. This is often not valid and we then
have to embrace some model of the anisotropy.
What is more the procedures we have discuss as kriging are merely a subset of those available, many of which
are highly mathematical. The interested reader who is comfortable with mathematics and statistics should look
In an ideal world our choice of method would be determined by what we believed about the generating process.
However data is always limited (since it is expensive to collect) and often time is limited. The results of any
interpolation will generally also depend on the skill and experience of the practitioner, since there is some
degree of art to choosing the correct algorithm. In the future more complex, highly statistical methods (based
on Bayesian thinking) will probably minimise the impact of the user, and give better, more robust interpolation
results.
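For readers who would like to see the kriging predictor (12.1) and the kriging variance (12.3) in action, the following is a minimal sketch rather than a definitive implementation. It assumes a zero mean, one dimensional sample locations and a covariance function derived from an exponential variogram; the function names, the exact form of the covariance function and the default parameter values (which echo the nugget, sill and range printed with the figures) are illustrative assumptions.

```python
import numpy as np

def exponential_covariance(h, nugget=0.5, sill=4.0, range_=2.5):
    """Covariance implied by an exponential variogram (assumed parametrisation).

    The partial sill (sill - nugget) decays with separation h; the nugget
    contributes only at h = 0, where the covariance equals the sill.
    """
    h = np.asarray(h, dtype=float)
    cov = (sill - nugget) * np.exp(-h / range_)
    return np.where(h == 0.0, sill, cov)

def simple_krige(x_obs, z_obs, x_p, cov=exponential_covariance):
    """Zero-mean kriging at one 1D location x_p, following (12.1) and (12.3)."""
    x_obs = np.asarray(x_obs, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)
    C = cov(np.abs(x_obs[:, None] - x_obs[None, :]))  # covariance between observations
    c0 = cov(np.abs(x_obs - x_p))                     # covariance with the target point
    c00 = float(cov(0.0))                             # variance at the target point
    w = np.linalg.solve(C, c0)                        # C^{-1} c0 without forming the inverse
    z_p = float(w @ z_obs)                            # equation (12.1)
    var_p = float(c00 - w @ c0)                       # equation (12.3)
    return z_p, var_p

# Hypothetical data: three samples, predicting half way between the last two.
z_p, var_p = simple_krige([0.0, 1.0, 3.0], [1.0, 2.0, 0.5], 2.0)
print(z_p, 1.96 * np.sqrt(var_p))
```

Solving the linear system instead of inverting C is a common design choice; it does not change the cubic cost noted above, which is why a moving neighbourhood is still used in practice.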
Cressie, N. A. C., 1993. Statistics for Spatial Data. Chichester: Wiley.
12.2
Remote Sensing
We have seen in the previous section that raster data can be acquired by interpolation of point samples. In this
section we look at how remote sensing can be used to acquire huge amounts of spatial information which is, in
the first instance, best represented in the field model and stored in the raster data structure.
The concept of remote sensing goes back a long way (our own eyes are the most familiar remote sensing tools)
but only recently, in the digital age, has the full power of remote sensing been realised. The essence of remote
sensing is to view, from a remote location, the object of interest. We are probably most familiar with this in
the context of photography, where we use a photosensitive film to capture an image of a remote object. Digital
cameras are now common-place, and these replace the film with an array of charge-coupled device (CCD) based
sensors which measure the intensity of the projected light at a certain wavelength over an array of points in
the image. This produces an array of pixels, whose value represents the intensity of light emanating from the
corresponding point in the image.
In remote sensing the same technology is used. In the past optical cameras with film were used, but now the
sensors are almost all electronic – even on the ‘spy-satellites’. In general we think of remote sensing as referring
to satellite borne instruments, although a large portion of data acquired by remote sensing is still from aircraft
mounted systems. The great benefit of remote sensing over other methods of data acquisition is the ability
to cover large regions of the Earth’s surface at relatively low cost (although remote sensing is still often not
economically viable without government support). These large regions can also be assessed synoptically (at the
same time) and repeatedly, to build up a picture of change. If you think remote sensing is largely concerned
with sensing at visible wavelengths you are wrong. Increasingly, other regions of the electro-magnetic spectrum
are utilised to provide very useful information.
12.2.1
The electro-magnetic spectrum
[Figure: atmospheric transmission plotted against wavelength, with the spectrum divided into ultraviolet, visible, near infrared, middle infrared, thermal infrared, far infrared, microwave and radio wave regions.]
Figure 12.11: A sketch of the atmospheric transmission of EMR for the full region commonly used in remote
sensing, based on (Mather, 1999).
Electro-magnetic radiation (EMR) is emitted by all objects with a temperature above absolute zero. You and I are currently giving
off large amounts of electro-magnetic radiation in the thermal infrared region of the spectrum. We are doing this
because our bodies are hot – indeed the wavelength of the radiation emitted is a function of the temperature of
the emitting body. The useful EMR spectrum is shown in Figure 12.11, together with the amount of radiation
at each wavelength transmitted through the atmosphere. Roughly speaking we have:
➢ X-ray – high energy and not very useful since absorbed in the atmosphere,
➢ ultraviolet – high energy, largely absorbed (ozone),
➢ visible – from the sun, not heavily absorbed,
➢ near infrared,
➢ thermal infrared – emitted by objects on the surface, partially absorbed,
➢ far infrared,
➢ microwave – increasingly used, but not intuitive,
➢ radio-wave – used for communications.
When deciding on what wavelength to use for remote sensing there are several factors to consider. The most
important question is: what is the object or property that is to be observed? Here we will assume that it is
something at the Earth’s surface, so the first requirement is that the atmosphere, through which the radiation
will generally have to pass twice, is transparent to EMR of that wavelength. Regions of the EMR spectrum for
which this is true are shown in Figure 12.11. In general there are certain ranges of wavelengths in which the
cloud free atmosphere is approximately transparent, these being called window regions.
It is also important to consider what source of radiation is to be used. The sun provides good illumination
(on one half of the globe at a time) in the ultraviolet, visible and near infrared regions of the spectrum. Thus
in these regions we can use passive sensors which measure the amount of sunlight reflected or scattered by
the surface in window regions of these wavelengths. The Earth itself emits radiation in the thermal infrared,
regardless of solar illumination, thus passive sensors can be used in this region of the spectrum. The final region
commonly used is the microwave region – here natural emission is very weak, thus an active instrument, which
creates its own radiation source and bounces this off the surface of the Earth, is usually used. Radar based instruments
operate in this region of the spectrum, but are complicated to use in satellites because of the large amount of
power needed to generate the radar beam.
Typically the instrument is composed of an array of CCD sensors which respond in the desired wavelength
only, with some optics in front to focus the image onto the sensors. In some systems the number of sensors is
reduced by using mirrors to scan across a linear array of sensors, or even a single sensor, however this means
that mechanical parts are needed in the instrument – a potential disadvantage in space.
12.2.2
The platform
Figure 12.12: The orbit of the SPOT series of satellites.
Another issue that must be considered is the platform. As a general rule the further the platform is from the
surface of the Earth, the bigger the region that can be seen, but the worse the resolution of the instrument
becomes. Considering satellites there are two options:
➢ Geostationary orbits mean that the satellite remains over the same location on the surface of the
planet at all times. For the orbital period to match the rotation of the Earth the satellite must be a long way
from the Earth (about 36000 km above the equator). Thus geostationary satellites typically have poor
ground resolution (of the order of 1–10 km). There is, however, the benefit of being able to see almost
half the globe at the same time, which means very good spatial coverage and time resolution can be
provided. Thus this sort of system is largely used for weather monitoring / forecasting.
➢ Polar orbits place the satellites much closer to the Earth’s surface, but rather than staying in a fixed
place they orbit the Earth, typically almost over the poles. Thus they observe a swathe of the Earth’s
surface (Figure 12.12) on each orbit, this swathe moving across the surface of the Earth both as the
satellite orbits and the Earth rotates. This means that for most instruments the same point on the
Earth’s surface is sampled once every 1 to 30 days. The benefit of this lower orbit (about 650 km
typically) is that the images acquired have a very fine resolution, typically from 1 km down to 1 m. The
finer the resolution, the smaller the swathe in general.
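The two orbit heights quoted above can be checked with Kepler's third law for a circular orbit. The sketch below is illustrative only: it assumes a spherical Earth with the equatorial radius, uses standard values for the Earth's gravitational parameter and the sidereal day, and the function names are hypothetical.

```python
import math

GM_EARTH = 3.986004418e14     # m^3 s^-2, gravitational parameter of the Earth
EARTH_RADIUS_KM = 6378.137    # equatorial radius in km
SIDEREAL_DAY_S = 86164.0905   # one rotation of the Earth, in seconds

def orbital_period_minutes(altitude_km):
    """Period of a circular orbit at the given altitude above the surface."""
    r = (EARTH_RADIUS_KM + altitude_km) * 1000.0
    return 2.0 * math.pi * math.sqrt(r**3 / GM_EARTH) / 60.0

def altitude_for_period_km(period_s):
    """Altitude of the circular orbit whose period is period_s seconds."""
    r = (GM_EARTH * period_s**2 / (4.0 * math.pi**2)) ** (1.0 / 3.0)
    return r / 1000.0 - EARTH_RADIUS_KM

print(round(altitude_for_period_km(SIDEREAL_DAY_S)))  # geostationary altitude in km
print(round(orbital_period_minutes(650.0), 1))        # period of a 650 km orbit in minutes
```

Under these assumptions the geostationary altitude comes out at roughly 35,800 km, and a satellite at 650 km completes an orbit in a little under 100 minutes, which is why a polar orbiter sees a different swathe on each pass as the Earth turns beneath it.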
Constraints on the resolution of the instruments include not only the optics, but also how to transmit the
information down to Earth quickly enough, and what to do with all the data when it arrives.
Once the image has been acquired and transmitted to Earth we need to be able to correct several image
characteristics. There are at least three corrections which are typically applied:
➢ photo-rectification – remove any distortion caused by the optics of the imaging system,
➢ geo-rectification (also called geo-referencing) – map the image into the ground coordinate system,
➢ ortho-rectification – remove any distortion caused by the topography.
Each of these processes is a highly mathematical operation which we will not cover, but it is important to
understand that remotely sensed data imported into a GIS have already undergone a number of manipulations
which themselves may have introduced additional errors into the data. Thus although remote sensing does
provide pretty much the only means for obtaining primary, synoptic, spatial data the ‘primary’ data has been
heavily processed.
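Although the full corrections are beyond the scope of these notes, the simplest possible case of geo-referencing can be sketched in a few lines. The example assumes an image that is already north-up with square pixels, so that the mapping to ground coordinates is just a shift and a scale; real geo-rectification is usually fitted to ground control points and may need rotation or higher order terms. All function and parameter names are hypothetical.

```python
def pixel_to_ground(row, col, origin_x, origin_y, pixel_size):
    """Ground coordinates of the centre of pixel (row, col).

    Assumes a north-up image: (origin_x, origin_y) is the outer corner of the
    top-left pixel and pixel_size is the pixel width in ground units.
    """
    x = origin_x + (col + 0.5) * pixel_size
    y = origin_y - (row + 0.5) * pixel_size   # rows increase southwards
    return x, y

def ground_to_pixel(x, y, origin_x, origin_y, pixel_size):
    """Inverse mapping: the (row, col) of the pixel containing a ground point."""
    col = int((x - origin_x) // pixel_size)
    row = int((origin_y - y) // pixel_size)
    return row, col

# Hypothetical 30 m image whose top-left corner is at (400000, 300000).
print(pixel_to_ground(2, 3, 400000.0, 300000.0, 30.0))
```

Any error in the assumed origin or pixel size shifts every pixel, which is one way in which the processing described above can add errors to the data.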
12.2.3
Some satellite systems
This section briefly outlines some of the more commonly used remote sensing (satellite) systems. Possibly the
most well known is the Landsat series of satellites. Like any other machine, satellites have a limited life, thus
there are often many satellites all launched for the same mission (although often with upgraded or additional
sensors). The Landsat series of satellites dates back to 1972. These are polar orbiting satellites, with sensors
in 7 bands on the thematic mapper instrument:
➢ Band 1 – blue green (visible),
➢ Band 2 – green (visible),
➢ Band 3 – red (visible),
➢ Band 4 – near infrared,
➢ Band 5 – middle infrared,
➢ Band 6 – thermal infrared,
➢ Band 7 – additional middle infrared.
All bands other than band 6 have a ground resolution of approximately 30 m, with band 6 having a resolution of
120 m.
The primary mission for Landsat, as the name suggests, was the monitoring of land-use, land-use change, crop
growth and mineral exploration. Since many of the sensors are passive and in the solar range, the satellite has
a sun synchronous orbit, meaning that it crosses a given latitude at about the same local solar time on every
orbit, so that the sunlit half of each orbit has consistent (near midday) illumination. The main problem for
Landsat is cloud cover, which can mean that some locations are only very rarely sampled.
Landsat has been used in many applications, but largely in the field of land-cover mapping, particularly in less
well surveyed countries. This type of information is often very useful in a GIS and has many applications, from
agricultural planning to site selection to mineral exploitation. The relatively regular temporal coverage means
that the satellite has been used to monitor crop growth and climate change (e.g. in the Sahel).
Another well known system is the SPOT series of satellites. These French satellites are polar orbiters with
sensors in the green, red and near infrared parts of the spectrum (at 20 m ground resolution) and a panchromatic
sensor (sensing at all visible wavelengths) with a 10 m ground resolution. The SPOT satellites can
fulfil some of the roles that Landsat does (although the fewer spectral bands mean less accurate classification
of land-cover) but the better spatial resolution means that features such as roads and buildings can be resolved
on SPOT scenes. Thus SPOT data can be used to automatically update and improve existing GIS databases.
The final satellites to be considered are the polar orbiting ERS series of satellites. These include an on board
Synthetic Aperture Radar (SAR) sensor. Unlike the other satellites considered so far this is an active instrument,
which bounces a pulse of C-band radar (6 cm wavelength) off the surface at an oblique angle. The amount of
backscattered energy is measured, which gives information on the surface roughness and electrical properties
over the 25 m footprint. SAR images are not like those we are used to seeing with our eyes, but have proved
very useful for land-cover mapping, mineral prospecting and DEM creation. SAR is particularly effective since
it is not really affected by cloud.
12.2.4
Applications
Once we have the remotely sensed images, they will generally need further processing to extract the information
desired. At their most simple remote sensing images are used as backdrops for other information in the GIS,
e.g. showing the route of a new power line in context. This is the most elementary use of remote sensing data
but still requires rectification of the images.
More commonly we will use a wide variety of methods (some of which are very powerful and mathematically
advanced) to extract the information we need from the remotely sensed images. This might involve:
➢ classification of land-cover,
➢ updating / correction of existing data (often semi-manual),
➢ creation of DEMs using two images taken from different positions,
➢ monitoring of crop growth, pests or irrigation needs.
One of the fastest developing areas of science is the extraction of information from huge data sets (data mining),
and remote sensing is one area where these methods are very relevant.
12.2.5
Into the future
The use of remote sensing is set to increase in the future, with higher resolution (spectral and ground footprint)
instruments and ever more satellites. Current satellites can give ground resolutions of 1 m in pan-chromatic
bands (e.g. IKONOS), for small regions of the Earth’s surface. At this resolution it is possible to detect cars
and other small objects. This is a relatively cheap way of surveying very large areas accurately although good
rectification is important. Future trends look set to provide still higher resolution and in smaller (and more)
spectral bands.
The Millennium mapping project is another interesting remote sensing project, which is aiming to produce a
complete digital aerial photograph coverage of Great Britain at very high resolutions. This can be found on the
web, from where you can probably obtain an image of your home.
An interesting recent innovation in remote sensing is to use lidar (laser-radar) fired from a plane at a known
position directly down to the surface to infer a very accurate DEM of the surveyed area. This can produce 1 m
resolution DEMs which are accurate to within 10 cm. This sort of information might be very useful when
assessing the risk of flooding of a given area. Insurance companies are particularly interested in this.
But the real challenge for the future, in my opinion, will be the extraction of useful information from the
accurate, high resolution data. This is a challenge that requires involved mathematical and statistical models,
which we cannot cover here.
For more details on remote sensing the reader is referred to Mather (1999).
Mather, P. M., 1999. Computer Processing of Remotely-Sensed Images. Chichester: Wiley.
12.3
Scanning
An alternative to remote sensing and possibly the easiest way to acquire new raster data layers is to scan in
existing maps. This is commonly done to provide a backdrop to show other analyses, or provide simple map
display systems. It is not easy to directly analyse these maps, thus further processing is often required before
they can be used in a GIS. It is also necessary to overcome problems caused by errors in the original maps and
the distortion of the paper copies in storage.
12.4
Conversion between fields and objects
We do not have time to cover this in any depth, but there are a series of very interesting algorithms which
we can use to convert from raster to vector. This is most easily accomplished for scanned images, where the
scanned maps contain only the features of interest. In remotely sensed images the problem is extremely difficult,
since the images tend to contain a lot of information and detail which is not relevant to your data requirements,
and this must be filtered in some way. Details of the Zhang-Suen erosion algorithm, which can be used to extract
vector data from scanned maps, can be found in Worboys (1995, p. 230).
13
Sources of Vector Spatial Data
There are several sources of vector data which can be accessed. The most commonly used method for acquiring
vector data is digitising. This involves a user defining the vertices of the objects being digitised, often from
an existing paper map or aerial photograph using a digitising tablet. Attributes are also often added at this
point. The main drawback here is that the digitised data is only as reliable as the original map from which it is
digitised, and that extra operator errors are often added during digitising. It is also expensive, since it is very
labour intensive.
Scanning can also be used, generally in a semi-automated way, to vectorise existing maps. We could also use
remote sensing images as the source. Methods to extract vector information from these raster data structures
require complicated mathematics, or a great deal of operator intervention.
The global positioning system (GPS) can be used to obtain vector data from field based studies. GPS based
surveys are probably the most accurate source of geo-referenced data we have available. GPS uses a constellation
of satellites (which are controlled by the US military) to triangulate the receiver’s ground position using a number
of ‘visible’ satellites. Using repeated samples it is possible to estimate position to around 25 m, however if an
additional fixed base unit, which is sited at a known position, is used in addition to the field instrument, many
of the atmospheric effects which cause most of the errors on the position measurement can be eliminated. Thus
using differential GPS, as the method is called, positional accuracies to tens of centimetres can be achieved. To
obtain greater accuracy, traditional surveying methods using laser enhanced equipment are necessary. This is
labour and skill intensive.
14
Errors in Spatial Data
One of the major problems in a GIS is to keep a handle on the analysis errors. This is particularly important
for two reasons. First, people tend to believe that results presented as digital maps are exact. Secondly, errors
in the base data tend to build up (propagate) during many analyses. Thus the results produced by a GIS should
be regarded as skeptically as those produced by any other model.
We have already seen an example, related to precision, where the intersection point of two lines could not be
represented in the computer system with sufficient precision. Precision is a measure of the ability of the digital
number (which by its very nature represents a discretisation of the real number line) to represent the location
of an object.
Say we were interested in storing the location of buildings to a precision of 1 metre, across the whole of Great
Britain. The domain has a size of O(1000 km) or 10^6 m. With 16 bit numbers we can represent the spatial
location to a precision of about 15 m (10^6 m / 2^16 ≈ 15.3 m). Thus we will need to use at least 32 bit numbers to represent the
coordinates, and if we increase the precision required and the areas considered we may have to go up to 64 bit
reals (doubles).
Of course precision is just one part of the story because we can have very precise measurements which are
very inaccurate, such as stating that the distance between London and Birmingham is 24.3432545434512 km – very precise,
but this is false precision. However if we assume that one day our data will be accurate, precision may be a
significant source of error. Analyses such as point in polygon and line intersection algorithms are potentially
sensitive to precision errors.
Of more concern is the propagation of errors. Consider trying to pot a ball on a pool table. Even if you are not
that good the chances are you can pot the ball if you hit it directly with the cue. But if you have to hit the
white first and play this onto the ball you want to pot it gets harder. A small error on hitting the white might
produce a big error on the red. If you have to do a plant (hit the white into a red which hits the other red
you want to pot) it becomes even harder because the errors multiply. This is common in some GIS analyses,
although others are more conservative and have better error propagation properties. We can think of several
types of error:
• Precision;
• Accuracy:
– geometric,
– scale,
– attribute,
– topological;
• Consistency:
– scale,
– source,
– classifications;
• Currency;
• Completeness.
We do not have time to fully cover all these issues, so they are raised for awareness. In general all other data
in databases of any kind will also be susceptible to most of the errors described above. Let us consider those
errors which are especially prevalent in spatial data. We have already discussed accuracy briefly when looking at
sources of GIS data. The geometric (positional) accuracy that is required will depend upon the application and
the scale of the phenomena. If the application is in utility network management in a city it will be important
that the positions of the cables and pipes are known to say ±0.1 m. If the aim is to study the global distribution
of iron deposits, then an accuracy of ±1 km might be acceptable. Of course the more geometrically accurate
the data needs to be the more it costs to collect it.
There is also a question of consistency between the different data sources. If one of the ‘layers’ of information
being used is derived from a 1:250,000 map and another from a 10 m SPOT image, it is important to ensure
that the two layers are registered together. A good example of this can be seen in the GRASS tutorials –
try plotting the roads coverage on top of the image coverage, and look at the motorway – which is correct?
In general when combining multiple maps we should try and ensure that they are co-registered – that is the
relative errors between them are as small as possible. It is often very difficult to obtain absolute ground truth
– differential GPS is probably best here – but we can use the well surveyed triangulation points in the UK as
known ‘anchors’ to which the maps can be attached.
Scale is an important issue in spatial data, because different scales of information are required for different
applications. We have discussed this above, to some extent. Maps at scales of 1:2,500 should have a positional
accuracy of 2 m while maps at 1:250,000 scales can be expected to have accuracies of about 200 m. We have
also seen how scale might affect the way we represent curves – e.g. the Douglas-Peucker algorithm.
Undertaking modelling using geometrically inaccurate data can produce unpredictable effects. An obvious
example is that if our boundary data is only accurate to ±200 m and we are looking for individual houses
in a given region we might get the answer quite wrong, which may be critical if the boundary is an electoral
district. If our modelling is more complex the positional errors could accumulate, although using most geometric
algorithms the growth of the errors will not be too severe. Tracing the error propagation through geometrically
based modelling is complex, and very rarely done in practice. Most effort is put into producing data sets that
are as error free as possible. Often the errors relating to attribute data, or the model itself, are more gross than the
geometrical errors in a high quality spatial data set.
Attribute errors can have many sources. These may be as simple as operator error, or as complex as the errors
introduced into a dataset due to the use of an interpolation method to estimate a variable at an unsampled
location. Where the data is stored in a raster data structure this type of error can be quantified (when using
appropriate, probabilistic models) and propagated through the analysis analytically. For instance Heuvelink
(1998) is a book that deals almost exclusively with this issue. We cannot cover the details here, but it is worth
noting that operations such as the computation of slope and aspect (i.e. neighbourhood based operations) from
a DEM produce very complicated error distributions, given simple attribute errors on the DEM.
Since many of the models we use will be non-linear in the attributes, the distribution of the errors at the end of
the analysis will often not be of a recognised form such as a Gaussian or normal distribution. If this is the case,
we must either make approximations, or use Monte Carlo based methods to estimate the effects of the errors.
Monte Carlo methods work by having multiple attribute fields, which are samples from the true distribution
(we hope) but include the errors present in the data. It is rather like assuming there are many worlds, and
calculating the outcome of the modelling in each of these different worlds. When we put this together we can
see how the errors added on to the raw data in the many worlds have affected the outcome. This will give us
a measure of uncertainty. The problem here is that if our model is complex and big, we need to run it many
thousands, possibly millions of times to accurately account for the errors.
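The 'many worlds' idea can be illustrated on the slope calculation mentioned above. This is a minimal sketch under strong assumptions: only two neighbouring DEM cells are considered, their elevation errors are taken to be independent and Gaussian with a known standard deviation (real DEM errors are usually spatially correlated), and all names and values are hypothetical.

```python
import numpy as np

def slope_degrees(z_a, z_b, spacing):
    """Slope (in degrees) between two neighbouring DEM cells a distance spacing apart."""
    return np.degrees(np.arctan(np.abs(z_b - z_a) / spacing))

def monte_carlo_slope(z_a, z_b, spacing, sigma, n_worlds=10000, seed=0):
    """Propagate Gaussian elevation errors (standard deviation sigma) through slope_degrees."""
    rng = np.random.default_rng(seed)
    # Each "world" is one plausible pair of true elevations given the recorded values.
    worlds_a = z_a + rng.normal(0.0, sigma, n_worlds)
    worlds_b = z_b + rng.normal(0.0, sigma, n_worlds)
    slopes = slope_degrees(worlds_a, worlds_b, spacing)
    return slopes.mean(), slopes.std(), np.percentile(slopes, [2.5, 97.5])

# Recorded elevations of 100 m and 102 m, 10 m apart, each with a 1 m error.
mean, std, (low, high) = monte_carlo_slope(z_a=100.0, z_b=102.0, spacing=10.0, sigma=1.0)
print(mean, std, low, high)
```

Because slope is a non-linear function of the elevations, the spread of the simulated slopes is not symmetric about the error-free value, which is the kind of awkward error distribution described above.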
We can do similar things but changing only certain key bits of data. This is called sensitivity analysis, because
we will only change one thing at a time – for instance we might ask how sensitive is the analysis undertaken in
the second piece of coursework to positional errors in the location of the existing supermarkets. This is often
undertaken when it is suspected that a few key errors may have a significant effect on the final conclusions.
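A sensitivity analysis of this kind can be sketched as follows. The model here is a deliberately crude stand-in, not the coursework analysis: each household is assumed to use its nearest supermarket by straight-line distance, one supermarket is moved by a chosen positional error, and we count how the catchments change. The data and function names are hypothetical.

```python
import math

def nearest_store(household, stores):
    """Index of the store closest to a household (straight-line distance)."""
    return min(range(len(stores)), key=lambda i: math.dist(household, stores[i]))

def catchment_sizes(households, stores):
    """Number of households whose nearest store is each store."""
    sizes = [0] * len(stores)
    for h in households:
        sizes[nearest_store(h, stores)] += 1
    return sizes

def sensitivity_to_shift(households, stores, store_index, shift):
    """Change in every catchment size when one store is moved by shift = (dx, dy)."""
    moved = list(stores)
    x, y = stores[store_index]
    moved[store_index] = (x + shift[0], y + shift[1])
    before = catchment_sizes(households, stores)
    after = catchment_sizes(households, moved)
    return [a - b for a, b in zip(after, before)]

# Hypothetical data: ten households along a road between two stores (metres).
households = [(float(x), 0.0) for x in range(50, 1000, 100)]
stores = [(0.0, 0.0), (1000.0, 0.0)]
print(sensitivity_to_shift(households, stores, store_index=1, shift=(-250.0, 0.0)))
```

Only one input is changed at a time, so the result says how much that single positional error matters; repeating the call for each store and for several shift sizes builds up a picture of which errors the conclusions are most sensitive to.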
Another common problem is that of data currency. Since houses, roads, cable networks and many other objects
are constantly being built, moved and removed, the data in a GIS needs to be kept current. This requires
constant effort in data acquisition, which is usually handled by the mapping agencies such as the Ordnance
Survey in Great Britain. Thus when you buy data from the Ordnance Survey it is generally relatively up to
date, but in five years’ time (and a lot less in some applications) the data will be out of date. When updating
spatial data, all the problems of database transactions occur and some form of version control and locking may
be necessary.
For spatial data we also need to consider completeness. If you think about the second piece of coursework the
data was incomplete from two (maybe more) aspects. First, the data only covered a small region – you had
no idea what went on outside that small region. If you were undertaking that analysis in reality a much larger
region would be needed. Secondly, the data was incomplete because not all the information you required was
available. This is often the case – it might have been nice to know which supermarkets people currently shop at
(if any). But this data was not available, so maybe you used a surrogate, such as assuming that people will shop
at the closest supermarket. This ability to use surrogate data is a strength and a weakness of GIS. It means if
we are inventive we can use existing information to infer something useful, but this will introduce additional
errors into our model.
In general we will hope that we can trace the lineage of the data using meta-data. Meta-data should describe the
creation, intended accuracy, measures of accuracy derived from other field surveys of the data, and any changes made
to the data. This is vital if the data is to be used in any analysis, and is something that is often overlooked
in standard data structures. Another area that is often included in discussion of errors is the issue of data
availability. If you are working near a military installation it may be difficult to get high resolution data. There
are also issues of data format, copyright and cost. We do not treat these here.
One of the big changes that I think will have to happen to GIS in the future is that there will be built in
error propagation modelling, although this raises a number of issues concerning error quantification for existing
datasets and who will do this (do we trust the optimistic data providers?), how we can represent the uncertainty
and how we then use this in decision making. Many people are much happier with a numerical answer, as opposed
to a probability distribution. Probabilistic modelling makes decision making a more complicated process, but
should produce better decisions.
Heuvelink, G. B. M., 1998.
Francis.
91
92
CS3210 Geographic Information Systems
15
Coordinate Systems and Map Projections
As far as our everyday experience is concerned the Earth is flat. This is because visibility is generally less than
30 km and the Earth is so big that we cannot really detect the fact it is a sphere. However it is a sphere, and
when we are using data over large areas (e.g. the United States) we need to be aware that representing points
as if they existed on a flat plane will produce some distortion. Most GIS assume that the coordinates of the
objects within them are 2D, that is exist on the surface of the Earth.
The most general coordinate system to use would be spherical polar coordinates - that is two angles and a
distance from the centre of the Earth. You will be familiar with these coordinates as latitude and longitude –
giving the angle from the centre of the Earth, in terms of the south–north angle across the equator (latitude),
and west–east angle from Greenwich (longitude). Typically one assumes a certain geoid (shape for the Earth)
and these are sufficient to represent any point on the Earth’s surface.
The Earth is not a perfect sphere, the shape being distorted by the rotation of the Earth, the locations
of the continents and, to a small extent, the pull of the moon. The exact measurement of the shape of the Earth
is difficult, although recent advances in satellite technology mean that there are several versions of the ellipsoid
which can be used. These define the amount by which the Earth is ‘squashed’ at the poles, as well as the radius
of the Earth. The most commonly used reference ellipsoid is that of the WGS-84 datum, which gives the equatorial
radius as 6378.137 km and the polar radius as 6356.752 km.
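To get a feel for how ‘squashed’ the ellipsoid is, the two radii quoted above can be turned into the usual shape parameters. The sketch below uses the standard definitions of flattening, f = (a - b) / a, and squared eccentricity, e² = 1 - b²/a²; the function names are chosen for this example only.

```python
WGS84_EQUATORIAL_RADIUS_KM = 6378.137  # semi-major axis a, as quoted above
WGS84_POLAR_RADIUS_KM = 6356.752       # semi-minor axis b, as quoted above


def flattening(a, b):
    """Flattening f = (a - b) / a of an ellipsoid with semi-axes a >= b."""
    return (a - b) / a


def eccentricity_squared(a, b):
    """First eccentricity squared, e^2 = 1 - b^2 / a^2."""
    return 1.0 - (b * b) / (a * a)


if __name__ == "__main__":
    a, b = WGS84_EQUATORIAL_RADIUS_KM, WGS84_POLAR_RADIUS_KM
    f = flattening(a, b)
    print(f"equatorial minus polar radius: {a - b:.3f} km")
    print(f"flattening: 1/{1.0 / f:.2f}")
    print(f"eccentricity squared: {eccentricity_squared(a, b):.6f}")
```

The polar radius is only about 21 km shorter than the equatorial one (a flattening of roughly 1/298), which is why the Earth looks spherical from space yet the difference still matters for accurate positioning.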
Figure 15.1: The relation between the ellipsoid, the geoid and the Earth’s surface.
There is also a reference geoid, which includes the distortions caused by the continents and is illustrated in
Figure 15.1. Geoids have been measured using satellite borne altimeters and other methods. If we are going to
use two different data sets then we must take care to ensure that a common geoid model is used to define the
projections of both data sets. If we do not account for the use of different geoids, serious errors can result.
When we consider projections there are three properties which we might be concerned with:
➢ conformal – preserves angles locally;
➢ equal area – preserves area;
➢ equal distance – preserves distance.
No 2D projection can preserve all of these properties, thus the choice of projection will depend
upon the use that is to be made of the data.
15.1
Planar projections
Figure 15.2: Four azimuthal projections: orthographic, stereographic, equal area and equal distance, from
left to right.
If we project the objects on the surface of the Earth directly onto a flat plane, we get a so called azimuthal
projection of half the surface of the Earth. These are so called because the azimuth (direction) from the centre
point of the projection is correct, although it is generally wrong elsewhere, with the distortion getting worse the
nearer the edge one goes. Several examples are shown in Figure 15.2.
The orthographic projection is often used for showing what the Earth might look like from space. It preserves
no useful properties but ‘looks good’. The stereographic projection is frequently used for the polar
regions and is conformal, but not equal area or equal distance. The other two projections are area and distance
preserving respectively, but neither is conformal. In general azimuthal projections are not widely used.
15.2
Cylindrical projections
Most of the projections used in GIS are cylindrical. In these projections the surface of the Earth is projected
onto a cylinder which surrounds the globe, as shown in Figure 15.3. Typically the cylinder will only touch the
sphere along a great circle (a circle on the sphere whose centre is the centre of the Earth, and hence the largest
circle that can be drawn on the globe).
Figure 15.3: Cylindrical projection.
The most common cylindrical projection is to allow the cylinder to wrap around the equator. This produces the
well known Mercator projection, which is shown in Figure 15.4. The Mercator projection is conformal, and straight
lines on the map are lines of constant bearing, so the map is good for navigation, but it is not area or length
preserving.
Figure 15.4: Four cylindrical projections: Mercator, transverse Mercator, equal distance and Peters equal
area, from left to right.
If we take the great circle to be a line of constant longitude (i.e. the cylinder is on its side) then a transverse
Mercator projection is achieved. It is conformal, but again not area or length preserving. This projection can
be used to show fairly large areas of the world with big north–south extents.
If we take a cylindrical projection, but ensure that distances along the lines of longitude are preserved, we
produce the equidistant projection, which is not conformal or area preserving. We can however directly measure
north–south distances from this.
The final global projection we consider is the Peters or equal area projection which, as the name suggests,
preserves areas, but is not conformal or length preserving.
If we are looking at smaller regions of the Earth’s surface then the choice of projection becomes less noticeable
(but the geoid becomes more important). Four projections are shown for smaller regions in Figure 15.5. These
are all Mercator projections, with the cylinder touching the sphere along different great circles. The middle left
figure shows an oblique Mercator, where the great circle is at an angle across the region. The right hand figure
shows the universal transverse Mercator (UTM) projection, which is the most commonly used projection in GIS.
There are 60 UTM zones defined across the world, each 6 degrees of longitude wide (which means this is a local
coordinate system), and in each of these zones the great circle that defines the projection changes.
Figure 15.5: Four cylindrical projections for a smaller area: Mercator, oblique Mercator, transverse Mercator
and universal transverse Mercator from left to right.
The British national grid system is a transverse Mercator projection, based on the Airy Ellipsoid. This is the
most commonly used coordinate system in the UK, which gives the distance east and north of an origin to the
south and west of the Scilly Isles. The data in the GRASS tutorial is displayed under this projection.
15.3
Conic projections
The other projection commonly used is to project the surface of the Earth onto a cone surrounding the Earth.
This works well in specific hemispheres, with the cones generally placed so that the apex (tip) is over one of the
poles.
Figure 15.6: Conic projections for the Northern Hemisphere: equal area to the left and equal distance to the right.
Examples of two conic projections are shown in Figure 15.6. The Albers equal area conic projection is not
conformal, and the equal distance conic projection is neither conformal nor equal area; the Lambert conformal
conic projection, by contrast, is conformal but not equal area. Conic projections are most useful, for instance,
for displaying large longitudinal extents in mid-latitudes.
15.4
Which projection to use?
The decision on which of the available projections to use will depend upon the region which is to be represented.
In general vector data is best stored with the coordinates specified in latitude, longitude form, since this can
then be projected to any desired coordinate system. However, authorities such as Birmingham City Council
who only require GIS for the local area might fix their coordinate system once and for all and keep all data in
the British national grid projection for example.
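Projecting stored latitude, longitude coordinates on demand can be sketched using the Mercator projection described in Section 15.2. This is a minimal illustration, not what a production GIS does: it assumes a spherical Earth of radius 6371 km rather than an ellipsoid, and the function names are invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed spherical Earth, not an ellipsoid


def mercator_forward(lat_deg, lon_deg, central_lon_deg=0.0,
                     radius_km=EARTH_RADIUS_KM):
    """Project latitude/longitude (degrees) to Mercator x, y in km.

    x grows eastwards from the central meridian; y grows northwards from
    the equator. The poles map to infinity, so they are rejected.
    """
    if abs(lat_deg) >= 90.0:
        raise ValueError("Mercator is undefined at the poles")
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg - central_lon_deg)
    x = radius_km * dlon
    y = radius_km * math.log(math.tan(math.pi / 4.0 + lat / 2.0))
    return x, y


def mercator_scale_factor(lat_deg):
    """Local linear scale of the map relative to the globe: 1 / cos(latitude)."""
    return 1.0 / math.cos(math.radians(lat_deg))


if __name__ == "__main__":
    for lat in (0.0, 30.0, 60.0):
        x, y = mercator_forward(lat, 10.0)
        print(f"lat {lat:4.1f}: x = {x:8.1f} km, y = {y:8.1f} km, "
              f"scale = {mercator_scale_factor(lat):.3f}")
```

The scale factor is the same in every direction at a point, which is what makes the projection conformal, but it grows with latitude (it doubles at 60 degrees), which is why areas and lengths are not preserved.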
For raster data the grid system (projection) that is used is more important, because converting between
different coordinate systems is computationally expensive and likely to introduce additional interpolation errors.
Thus most raster GIS use a UTM-like system to define their grids. In the UK this is almost always based on the
British national grid projection.
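As an illustration of how a UTM-like grid is chosen for a given location, the sketch below computes the standard UTM zone number and its central meridian from a longitude. It relies on the standard zone layout (60 zones, each 6 degrees wide, numbered eastwards from 180 W) and deliberately ignores the special zone boundaries around Norway and Svalbard; the function names are illustrative.

```python
import math


def utm_zone(lon_deg):
    """Return the UTM zone number (1-60) containing the given longitude.

    Assumes the regular 6-degree zone layout and ignores the irregular
    zones around Norway and Svalbard.
    """
    # Normalise the longitude into the range [-180, 180).
    lon = ((lon_deg + 180.0) % 360.0) - 180.0
    return int(math.floor((lon + 180.0) / 6.0)) + 1


def utm_central_meridian(zone):
    """Return the central meridian (degrees east) of a UTM zone."""
    if not 1 <= zone <= 60:
        raise ValueError("UTM zones are numbered 1 to 60")
    return zone * 6 - 183


if __name__ == "__main__":
    lon = -1.89  # approximate longitude of Birmingham, UK
    zone = utm_zone(lon)
    print(f"longitude {lon} lies in UTM zone {zone}, "
          f"central meridian {utm_central_meridian(zone)} degrees")
```

Each zone has its own transverse Mercator projection centred on that meridian, so coordinates from different zones cannot be mixed without reprojection.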
In the United States a slightly different approach is adopted due to the increased size of the area, thus there
are local projections which change for different regions to provide local reference systems. This can make
inter-boundary GIS rather complicated, thus the vector data is generally stored internally in latitude, longitude form
and projected for display.
16
The Future of GIS?
In this section I hypothesise about the future role that GIS might have. I will use a practical example of the
application of GIS in the insurance industry to illustrate the impact these advances are likely to have. In recent
years all insurance companies have realised that GIS can give them a competitive edge by adding the spatial
dimension to their insurance activities. In the past this was based on post-code and previous claims; tomorrow
it will be based on predictive, spatial modelling of threats.
16.1
Role in business
Increasingly businesses are seeking competitive advantage through the use of information (the information
economy). GIS deal specifically with spatial data and extracting spatial information from that data. Current
trends, such as the spatial extensions added to the Oracle database and the extension of SQL to support some
spatial and temporal queries, suggest that GIS as independent software might not last long into the future. The
distinction between database and GIS is already very blurred, thus the addition of spatial analysis capabilities
to standard databases could mean the need for GIS is reduced. The other side of the coin is that increasing
use of spatial data will require more experts (i.e. computer scientists and geographers) to acquire, maintain
and process the spatial data but possibly not within a GIS. The advanced spatial modelling capabilities of GIS
seem unlikely to be provided in more standard databases, thus for more specialised modelling it seems likely
that GIS will survive, possibly as extensions to standard databases.
There is also a question over whether the current trend to desktop based GIS will continue. The growth of the
Internet and the potential of the fat host / thin client model of computing (which is somewhat at odds with
Java) suggest that a client-server approach to GIS may be the most likely model. If the server is a powerful,
multi-processor machine, the client merely requires map display capabilities which will probably be provided by
XML. Many of the big GIS vendors are selling Internet based mapping packages. Another area that seems likely
to strongly interact with GIS is that of Virtual Reality (VR) – planning a new wind turbine facility is helped
by a VR simulation of the visual effect this might have. This links GIS to advances in computer graphics.
The future will, I believe, see the adoption of GIS-like systems into a central role in many businesses’ decision
support systems, and thus they will become a core part of the business information system.
In insurance companies this is already happening, although it can be seen more widely in re-insurance, where
huge companies buy some of the risk under-written by insurers. It is clear that buildings and home insurance
risks have a spatial dimension. The likelihood of subsidence damage to a house is a function of soil type, which
varies in space. Thus if you are going to try and predict the incidence of subsidence you will need a GIS which
contains information on soil types and properties as well as the spatial distribution of rainfall and evaporation.
The model used will need to include details on the house and when it was built, and other factors. If you can
model these down to a road, or even house scale then you can set your premiums to reflect the risk – although
there are dangers here that doing this will destroy the basis for insurance.
However all the big re-insurance companies now have GIS to explore and assess their exposure to risk, and to
try to minimise or spread it. There is a general trend toward doing more things for ourselves, so it seems
likely that the use of GIS will spread downwards. Possibly to home users assessing their own risk?
16.2
Data models
Object oriented models seem set to take over the world (as embodied in Java). This will not be a revolution,
rather an evolution. In the example of the Oracle database, which historically is based around the ER model,
with tabular data structures, the spatial extensions are based on hybrid object-ER approaches. Although we
have used field based approaches in the practical work, there is no reason why these fields cannot be integrated
into the object approach. Everything we have seen can be translated into an object oriented approach, and I
suspect that this will be the way forward.
Object oriented approaches are more natural for spatial data than layer based vector models, because the
hierarchies they define correspond more closely to the way we understand space e.g. a house is on a street which
is in a town which is in a country. One problem with an object oriented approach is the natural inertia of a
database industry which is still largely dominated by ER approaches.
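The containment hierarchy described above (a house on a street in a town in a country) can be sketched as a small object model. This is only an illustration of the idea: the class, its fields and the example place names are all hypothetical, and a real object oriented GIS would also attach geometry and attributes to each object.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpatialObject:
    """A named spatial object that may be contained in a parent object."""
    name: str
    kind: str
    parent: Optional["SpatialObject"] = None

    def containment_path(self) -> List[str]:
        """Return the names from this object up to the outermost container."""
        path = []
        node = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


# Hypothetical example data, for illustration only.
country = SpatialObject("United Kingdom", "country")
town = SpatialObject("Birmingham", "town", parent=country)
street = SpatialObject("Example Street", "street", parent=town)
house = SpatialObject("12 Example Street", "house", parent=street)

if __name__ == "__main__":
    print(" is in ".join(house.containment_path()))
```

A query such as ‘which country is this house in?’ then becomes a walk up the parent links, rather than an overlay of separate house, street and boundary layers.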
We will also see the blooming of 3D GIS, since for many applications (e.g. visibility analysis, utility management,
mining) information is required in at least 2.5D (this typically means a 2D data structure used with a digital
elevation model) and more often in full 3D (that is, with an additional height coordinate).
The next step will be to integrate the time dimension into GIS. Change over time is something that we are
increasingly interested in modelling, whether it be for the assessment of the impact of climate change, or the
changes in the demographics of a city for marketing or site selection. Integrating the temporal dimension is
more easily realised in the object based approach, since for the most part things stay the same, it is only small
objects that change over time.
16.3
Data
This may be the area where some of the most significant changes occur. The focus of NASA in its Mission to
Planet Earth, and of many other national space agencies such as the European Space Agency, has become very
Earth centred. This is starting to supply vast amounts (many terabytes) of data on the surface (and other)
properties of the Earth. GIS provides the framework within which this data can be stored and analysed.
It seems likely that the cost of spatial data will come down, and its accuracy improve, as more becomes
available and more people start using it. However the volume of data is still a very significant issue, and thus
data structures and access methods remain crucial, despite faster computers. We will not only improve the accuracy
of the data, we might also start to do something sensible about error propagation.
In the insurance application the increased availability of data might have a very significant impact. Lidar
based DEMs will allow assessment of flood potential, microwave radars might allow assessment of soil moisture to predict subsidence. The higher resolution of visible wavelength remote sensing might enable accurate
characterisation of house type and location, together with the proximity of trees (which may affect subsidence).
The problem with having vast amounts of data is that the extraction of information from the data becomes more
computationally and conceptually difficult. This will require better models and techniques to turn the data into
information (although really the two are the same thing).
16.4
Models
With the increased data available we will be able to implement more sophisticated models which embody more
of the factors we believe to be important in the phenomena we are trying to address. For instance when deciding
on the potential sites for a new supermarket we might have some sophisticated models of competition between
supermarkets to allow us to assess the effect of a promotion or a competitor placing a store nearby (‘what if’
scenarios). Not only will the models tend to become more sophisticated, I believe they will also become more
principled and statistical in nature. This will allow a more careful error analysis – a measure of the uncertainty
will be as important as the answer.
This is particularly so for the insurance industry where the whole thing is based on probability. Most of the
events that could occur are very rare, but occasionally we get very bad floods, or storms, or dry summers.
These can cause big losses, so it is important that the companies know both the mean (expected number of
occurrences or expected losses) and the variance (or spread of the losses), otherwise they may be badly hit by
the exceptional year.
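The roles of the mean and the variance can be illustrated with a deliberately simple model. The sketch below assumes that every policy has the same small probability of a claim in a year, that all claims are the same size, and that claims are independent of each other; all of these are assumptions made for the example, and the numbers are invented.

```python
import math


def portfolio_loss_stats(n_policies, claim_probability, claim_size):
    """Mean and variance of total annual losses for independent policies.

    Each policy produces a claim of `claim_size` with probability
    `claim_probability`, independently of the others, so the number of
    claims follows a binomial distribution.
    """
    mean = n_policies * claim_probability * claim_size
    variance = (n_policies * claim_probability * (1.0 - claim_probability)
                * claim_size ** 2)
    return mean, variance


if __name__ == "__main__":
    # Invented figures: 10,000 houses, 1 in 1000 chance of a 50,000 claim.
    mean, variance = portfolio_loss_stats(10_000, 0.001, 50_000.0)
    print(f"expected annual loss: {mean:,.0f}")
    print(f"standard deviation:   {math.sqrt(variance):,.0f}")
```

The independence assumption is the weak point: a flood or storm hits neighbouring houses together, so real losses are spatially correlated and the true spread is larger than this model suggests. Estimating that correlation is exactly where spatial modelling in a GIS earns its keep.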
Another area that will see growth in the coming years is that of data mining. Data mining is all about finding
useful relationships, or other information in huge databases. Spatial databases often have this sort of hidden
information (maybe relating expenditure on chocolate to the aspect of the person’s house??!!) which might be
usefully exploited by certain organisations. This may not just be for business benefit, but might be useful for
social services or environmental improvement.
16.5
GML and OpenGIS
The move to create global standards, based on ISO recommendations, means that the potential is there for
distributed GIS, data warehousing and interoperable applications. The efficiency benefits of a standards based,
open data system are potentially huge, especially for public bodies such as central and local government, but
this requires vision and energy. It will take time for people to fully understand the potential of GML and web
services, but I am pretty certain that service oriented architectures (web services) are a big part of the future
of computing. GML is already providing the basis for the next generation of spatially enabled technologies.
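To give a flavour of what an open, standards based encoding looks like, the sketch below parses a single GML point using only the Python standard library. The fragment is hand-written for this example: it uses the `http://www.opengis.net/gml` namespace and a `gml:pos` element, the coordinates are invented, and `EPSG:27700` is the code for the British national grid. The function name is illustrative, and a real application would normally use a dedicated GML library.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# Hand-written example fragment; the coordinates are illustrative only.
GML_POINT = """
<gml:Point xmlns:gml="http://www.opengis.net/gml" srsName="EPSG:27700">
  <gml:pos>406700 286800</gml:pos>
</gml:Point>
"""


def parse_gml_point(xml_text):
    """Return (srsName, (x, y, ...)) for a single gml:Point element."""
    root = ET.fromstring(xml_text.strip())
    pos = root.find(f"{{{GML_NS}}}pos")
    if pos is None or pos.text is None:
        raise ValueError("no gml:pos element found")
    coords = tuple(float(value) for value in pos.text.split())
    return root.get("srsName"), coords


if __name__ == "__main__":
    srs, coords = parse_gml_point(GML_POINT)
    print(f"point {coords} in coordinate reference system {srs}")
```

Because the coordinate reference system travels with the geometry, any application that understands the standard can read the point and reproject it, which is the basis of the interoperability discussed above.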
16.6
Summary
Overall the future of GIS is sure to be interesting if nothing else. Like much technology the evolution is occurring
very rapidly, and on a global scale. This makes the exact future rather difficult to predict, however a change to
a more object based view seems rather likely. The main impact, however, is likely to come from the increased
data availability and the improved modelling that will result, in particular from the adoption of GML and
web services. This suggests GIS and associated technology will be a key (and probably integral) part of most
information systems in the future in business and the public sector. I hope that by the time you are well into
your careers you are all using GIS, just not calling it GIS!
References
Bowyer, A. and J. Woodwark 1983. A programmer’s geometry. London: Butterworths.
Burrough, P. A. and R. A. McDonnell 1998. Principles of Geographic Information Systems. Oxford: Oxford
University Press.
Date, C. J. 1995. An Introduction to Database Systems (6th ed.). Reading, MA: Addison-Wesley.
Environmental Systems Research Institute, Inc. 1993. Understanding GIS: The Arc/Info Method. Harlow,
Essex: Longman Scientific and Technical.
Foley, J., A. van Dam, S. K. Feiner, J. F. Hughes, and R. L. Phillips 1993. Introduction to Computer Graphics.
Reading, Massachusetts: Addison-Wesley.
GML-WorkingGroup 2004. ISO/TC 211/WG 4/PT19136 Geographic Information - Geography Markup Language (GML). OpenGIS Consortium. http://www.opengis.org/ accessed 9/7/04.
Heuvelink, G. B. M. 1998. Error propagation in environmental modelling with GIS. London: Taylor and
Francis.
Lake, R., D. S. Burggraf, M. Trninic, and L. Rae 2004. Geography Mark-up Language – Foundation for the
Geo-Web. London: John Wiley and Sons Ltd.
Laszlo, M. J. 1996. Computational Geometry and Computer Graphics in C++. London: Prentice Hall.
Mather, P. M. 1999. Computer Processing of Remotely-Sensed Images. Chichester: Wiley.
Rogerson, P. A. and A. S. Fotheringham 1994. GIS and spatial analysis: introduction and overview. In A. S.
Fotheringham and P. A. Rogerson (Eds.), Spatial Analysis and GIS, pp. 1–10. London: Taylor and Francis.
Worboys, M. F. 1995. GIS: A Computing Perspective. London: Taylor and Francis.
Worboys, M. F. and M. Duckham 2004. GIS: A Computing Perspective, 2nd Edition. London: Taylor and
Francis.