Monday, August 29, 2011

Smallworld Technical Paper No. 1 - Ten Difficult Problems in Building a GIS

Richard G. Newell & David G. Theriault

Abstract

Building a GIS is a fruitful area if one likes the challenge of having difficult technical problems to solve. Some problems have been solved in other technologies such as CAD or database management. However, GIS throws up new demands, and therefore requires new solutions. This paper examines ten difficult problems, explains why they need to be solved, and gives some indication of the state of the art of current solutions.

Introduction

The subject of Geographical Information Systems has moved a long way from the time when it was thought to be concerned only with digital mapping. Whereas digital mapping is limited to solving problems in cartography, GIS is much more concerned with the modelling, analysis and management of geographically related resources. At a time when the planet is coming under increasing pressure from an ever more demanding, increasing population, the arrival of GIS technology has come none too soon.

However, there is a widespread lack of awareness as to the true potential of GIS systems in the future. When the necessary education has been completed, will the systems be there to handle the challenge? It has to be said that the perfect GIS system has not yet been developed.

Many of the fundamental problems have now been identified, and indeed attempts (sometimes good ones) to solve them have been embodied in today's commercially available systems. However, no system has an adequate solution to all of them and indeed, some are not addressed by any system (Rhind and Green, 1988).

In this paper we have chosen to examine ten important problems which we believe need to be solved. Ten is a somewhat arbitrary number since it would have been easy to find many more. However, one has to draw the line somewhere. Also we call the problems difficult, as opposed to unsolved, because for most of them, there is now some light at the end of the tunnel and, indeed, implemented solutions exist.

The fundamental causes of difficulty in GIS stem from the huge volumes of spatially related data, the nature of the data itself, and the wide spectrum of different organisations with diverse application requirements. Problems can be classified under the following four categories:

  • Data capture - this will be the single biggest cost in implementing a new system.
  • Performance - most systems strain under the volume of data and scale of the tasks invoked.
  • Customisation - every organisation is different; no black box system is going to be adequate.
  • Integration - today's systems are disjoint, in the future data and functions will need to be provided in one seamless environment.

Our ten problems are really different manifestations of these four and so the rest of this paper is devoted to a commentary on each of them in turn. As most organisations are at the data capture stage, it is not surprising that this is the area where most work has been done in trying to improve the process.

The Topology Capture Problem

The lack of a really good solution to this problem is probably going to contribute most to the slow speed of implementation of GIS systems in the future. Although much progress has been made using scanning and vectorisation technology to capture map geometry (Nimtz, 1988), the addition of feature information, the generation of meaningful topology and the association of user data involves a large human element. Some modern topological editors do help to ease the burden (Visvalingam et al, 1986), but essentially, the only solution is to throw man-hours at it (Chrisman, 1987). Thus for some time we will be seeing expedient solutions using raster backdrops without topology.

Large Data Volumes

The volume of data required to represent the resources and infrastructure of a municipality is of the order of 0.5 to 1 gigabyte per 100,000 of population; supporting examples can be found where an urban region would require 14 million points simply to represent the map planimetry (Bordeaux, 1988). Not only are the municipal records required to be handled in the database, but also large amounts of map data in the form of coordinates, lines, polygons and associations of these. Today's database technology is barely up to the task of allowing the handling of geographic data by large numbers of users with adequate performance. Serious questions have been raised as to whether the most popular form of database, the relational model, will be able to handle the geometric data with adequate response. Certainly, if this data is accessed via the approved route of SQL calls, the achievable speed is orders of magnitude less than that which can be achieved by a model structure built for the task (Frank, 1988).

A common approach to this problem is to partition the database into separate map sheets. Frequently this is achieved by interfacing a sheet based CAD drafting system to a database (but see continuous mapping below). Another approach for avoiding the shortcomings of commercially available database systems is to build a special purpose database system to handle the graphics, which lives alongside the commercially available database system. A third approach, based on checking-out subsets of the database, followed by checking-in the changes, can overcome the problem of reviewing and editing (but see continuous mapping and version management below).

The idealistic solution of implementing one database system to handle both kinds of data with optimum performance is a large undertaking, and even then does not satisfy the user who has already committed to an existing commercially available database.

Continuous Mapping - the seamless database

As one begins to build a GIS, the problem of large data volumes appears to be the most important one to solve. A solution in the form of the map sheet approach, chosen by most systems today, simply mirrors what happens with physical map sheets. This is then combined with map management techniques to try to hide the cracks. While this partitioning helps to solve the first problem, it destroys the end user's dream of a single, continuous, seamless landscape. At best this may be termed contiguous mapping, but in all cases, such schemes lead to difficulties when one is trying to deal with objects which span more than one map sheet. The problem of dealing with variable density data, for example between urban and rural areas, is tackled in an ad hoc manner as there is no single size of map sheet to yield optimal performance in all cases.

The reason that a map sheet approach is popular is that it can easily be implemented with a conventional filing system and it is straightforward to find which files should be searched for a given spatial query. However, the system can hardly be called integrated, since the geometry, topology and the relationship between graphic and application data are all handled in an ad hoc manner.

What is now demanded is that map data is handled just as it is in the real world: truly continuously, with no cracks. But this must be done so that spatial queries of the form "inside an area" or "near to an object" can be handled efficiently. A common query is of the form "conforming to such-and-such and is on the screen" i.e. filtering out the rest of the world. Normal indexing methods provided with conventional database systems are inadequate to answer this type of query, and so a variety of spatial indexing methods have been invented to try to solve this problem.

The general idea is to construct a tree structured index (e.g. quad tree, R-tree, field tree), the leaves of which are not disjoint: they overlap in two dimensional space, and this accommodates objects of different sizes at different levels. In this respect, the quad tree (Samet, 1984) is beautifully simple but has serious deficiencies for large data volumes. The R-tree (Guttman, 1984) is rather more complicated but is more efficient. Systems which are optimal for retrievals are not necessarily optimal for updates and vice versa. As data is read far more often than it is written, the emphasis should be on providing maximum retrieval performance, while still maintaining an acceptable rate of interactive update.
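
To make the idea concrete, the fragment below is a minimal sketch (in Python, purely for illustration; the class and object names are invented) of a quad-tree-like index in which each object is held at the smallest square wholly containing its bounding box, so that large objects naturally sit at higher levels of the tree.

    # A toy quad-tree-style spatial index (illustrative only): an object is
    # stored at the smallest node whose square wholly contains its bounding
    # box, so objects of different sizes live at different levels.
    class QuadNode:
        def __init__(self, x0, y0, size, depth=0, max_depth=8):
            self.x0, self.y0, self.size = x0, y0, size
            self.depth, self.max_depth = depth, max_depth
            self.objects = []         # (bbox, payload) pairs held at this level
            self.children = None      # four sub-squares, created on demand

        def _subdivide(self):
            half = self.size / 2.0
            self.children = [QuadNode(self.x0 + dx * half, self.y0 + dy * half,
                                      half, self.depth + 1, self.max_depth)
                             for dx in (0, 1) for dy in (0, 1)]

        def insert(self, bbox, payload):
            xmin, ymin, xmax, ymax = bbox
            if self.depth < self.max_depth:
                if self.children is None:
                    self._subdivide()
                for child in self.children:
                    if (child.x0 <= xmin and xmax <= child.x0 + child.size and
                            child.y0 <= ymin and ymax <= child.y0 + child.size):
                        child.insert(bbox, payload)   # fits wholly in one child
                        return
            self.objects.append((bbox, payload))      # too big to push further down

        def query(self, window):
            wx0, wy0, wx1, wy1 = window
            if (wx1 < self.x0 or wx0 > self.x0 + self.size or
                    wy1 < self.y0 or wy0 > self.y0 + self.size):
                return []                             # window misses this square
            hits = [p for b, p in self.objects
                    if not (b[2] < wx0 or b[0] > wx1 or b[3] < wy0 or b[1] > wy1)]
            for child in self.children or []:
                hits.extend(child.query(window))
            return hits

    index = QuadNode(0.0, 0.0, 1000.0)
    index.insert((10, 10, 20, 20), "parcel 1")
    index.insert((0, 0, 600, 600), "district A")   # too large for any child: stays near the root
    print(index.query((5, 5, 25, 25)))             # -> ['district A', 'parcel 1']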

However, given today's disk technology, it matters little how efficient the index is if the data is scattered on disk. Thus it is essential that data is clustered on disk so that spatially similar data have similar addresses (Bundock, 1987). This can be achieved in the relational and other tabular models provided that data is stored ordered by a properly generated spatial key. There are many good ideas in this area, but it remains to be seen which are the best.
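
One widely used way of generating such a key is to interleave the bits of the quantised x and y coordinates, the so-called Morton or Z-order key; the sketch below (with an assumed square map extent and key length, both invented for the example) shows the principle, and rows stored in order of such a key tend to place spatially adjacent objects in nearby disk pages.

    # Sketch of a Morton (Z-order) spatial key: interleaving the bits of the
    # quantised x and y coordinates gives a single integer usable as a sort or
    # cluster key, so that objects close in space tend to be close on disk.
    def morton_key(x, y, extent=100000.0, bits=16):
        """Quantise (x, y) within a square of side `extent` and interleave bits."""
        scale = (2 ** bits - 1) / extent
        ix, iy = int(x * scale), int(y * scale)
        key = 0
        for i in range(bits):
            key |= ((ix >> i) & 1) << (2 * i)        # even bit positions carry x
            key |= ((iy >> i) & 1) << (2 * i + 1)    # odd bit positions carry y
        return key

    # Two nearby points share a long common key prefix; a distant point does not.
    for x, y in [(12034.0, 45012.0), (12050.0, 45020.0), (91000.0, 3000.0)]:
        print(x, y, format(morton_key(x, y), "032b"))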

Accommodating Existing Databases

There are a number of shortcomings in existing available database technology for the building of a GIS. Nevertheless, there are now many organisations which have committed in a big way to one of the emerging de-facto standard database systems. It is not acceptable for a GIS vendor to try to displace such a database with something else tuned for GIS applications. It is necessary to engineer a solution which preserves the users' investment while at the same time doing as good a job as possible in providing a GIS capability. As mentioned above, if one tries to handle all data in the commercial DBMS, then it is highly likely that a serious performance problem will result. If one runs a geometric DBMS alongside, then serious problems of backout, recovery and integrity may result. It would seem that what is needed is some "virtual DBMS" which can act as a front end to two or more physical DBMSs, and that this should have a knowledge of all aspects to do with data dictionary, integrity, and access to the various data.
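
The sketch below suggests one possible shape for such a front end, with two tiny in-memory classes standing in (purely hypothetically) for a commercial attribute DBMS and a purpose-built geometry store; a real implementation would, of course, also need co-ordinated commit, recovery and locking across the physical databases.

    # A minimal sketch of the "virtual DBMS" idea: a thin front end consults a
    # data dictionary to decide which physical store owns which field, and fans
    # a single logical request out to the right back ends.  The stores here are
    # hypothetical in-memory stand-ins; a real front end would also have to
    # co-ordinate commit, rollback and recovery across the physical databases.
    class SimpleStore:
        def __init__(self):
            self.data = {}                          # (object_id, field) -> value
        def put(self, object_id, field, value):
            self.data[(object_id, field)] = value
        def get(self, object_id, field):
            return self.data[(object_id, field)]

    class VirtualDBMS:
        def __init__(self, record_store, geometry_store, dictionary):
            self.stores = {"record": record_store, "geometry": geometry_store}
            self.dictionary = dictionary            # field name -> owning store

        def update(self, object_id, values):
            for field, value in values.items():
                self.stores[self.dictionary[field]].put(object_id, field, value)

        def fetch(self, object_id, fields):
            return {f: self.stores[self.dictionary[f]].get(object_id, f)
                    for f in fields}

    vdb = VirtualDBMS(SimpleStore(), SimpleStore(),
                      {"owner": "record", "diameter": "record", "centreline": "geometry"})
    vdb.update("pipe_17", {"owner": "Water Co", "centreline": [(0, 0), (35, 12)]})
    print(vdb.fetch("pipe_17", ["owner", "centreline"]))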

Version Management - the problem of the long transaction

It is necessary that large databases can be accessed simultaneously by multiple users. The normal assumption in a commercial DBMS is that a transaction is a short term affair, and that a locking system can be used to keep users from unintentionally interfering with one another. Such methods are used to preserve the integrity of the database at all times. As far as any particular user is concerned, there are two versions of the database of which he is aware: the version when he started his transaction and the version he can see at the instant prior to committing it. The problem is that when a user is in mid-transaction in such systems, all other users are locked out of doing certain things, particularly updating any records that have been locked. This is all fine, provided that the transactions are short.

However, in a GIS system, as in design database systems in general, a transaction can be very long indeed, possibly days or even weeks. A planner for example may take a considerable time to complete his work before he gets it approved and therefore available to the other users of the system. In these circumstances it is totally unreasonable to expect other users to avoid the area.

Copying the whole database is completely impractical in many cases. A system of check-out, check-in can be used, but this gives rise to the problem of achieving integrity through combining incompatible changes in different checked-in versions. It is quite possible for two separate sets of changes to have integrity, but for the combination to lack it.
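
The following deliberately simplified fragment illustrates the hazard: two change sets that each preserve integrity on their own violate it when naively combined. The data model (objects referring to a parent object) and the integrity rule are invented purely for the illustration.

    # Two individually consistent check-ins whose combination breaks integrity.
    def check_integrity(db):
        """Every object's parent reference must point at an existing object."""
        return all(parent is None or parent in db for parent in db.values())

    base = {"valve_1": None, "main_7": "valve_1"}

    # User A checks out, removes valve_1 and re-parents the main elsewhere.
    changes_a = dict(base)
    changes_a["valve_2"] = None
    changes_a["main_7"] = "valve_2"
    del changes_a["valve_1"]

    # User B checks out the same base and hangs a new service off valve_1.
    changes_b = dict(base)
    changes_b["service_9"] = "valve_1"

    # A naive check-in merge: take A's version, then add B's new objects.
    merged = dict(changes_a)
    merged.update({k: v for k, v in changes_b.items() if k not in base})

    print(check_integrity(changes_a))   # True:  A's version is consistent
    print(check_integrity(changes_b))   # True:  B's version is consistent
    print(check_integrity(merged))      # False: service_9 refers to a deleted valve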

Users who are starting the implementation of a GIS system may be blissfully unaware of this problem. To start a project, most organisations will initiate a pilot study: a simplified data-model, only one or two users of the system and test data that can be thrown away if the pilot database becomes corrupted.

The situation in a production environment is the opposite of the above: a complex data-model, a large number of users and concurrent activities, and a large investment in the creation of the database. The problem of version management could dominate all database activities in a production system and yet does not manifest itself in a pilot project.

Versioning is not addressed by today's round of DBMSs. The complete problem covers the management of alternatives, the management of chronology (what did my database look like three months ago?), the management and policing of change, and various forms of semantic versioning such as time series, variant designs and "as built" records.

Hybrid Vector Raster Database

It is an amusing paradox that whereas the bottle-neck with vector based GIS is in capturing the map base, the problem with remotely sensed data is the sheer volume of it that pours down upon us from the heavens, a veritable fire-hose of data. For example, SPOT 1, at 50 Mbits/second, captures an area of 60 km x 60 km in 9 seconds. Increasing the resolution to 5 metres would mean boosting transmission to 200 Mbits per second for every hour of daylight.
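
The figures are consistent with a simple scaling argument: assuming the current ground resolution is 10 metres (our assumption, not stated above) and that data volume scales with the inverse square of the pixel size, halving the pixel size to 5 metres quadruples the number of pixels and hence the required rate, as the back-of-envelope check below confirms.

    # Back-of-envelope check of the data-rate claim, assuming data volume
    # scales with the inverse square of the ground resolution and that the
    # current resolution is 10 m (an assumption consistent with the figures).
    current_rate_mbit_s = 50              # SPOT 1 downlink rate quoted above
    current_resolution_m = 10.0           # assumed current ground resolution
    target_resolution_m = 5.0

    scale = (current_resolution_m / target_resolution_m) ** 2
    print(current_rate_mbit_s * scale)    # -> 200.0 Mbit/s, matching the text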

Whereas the value of an attribute-less map base is limited, the value of satellite data is potentially immense, for here surely we can observe and monitor the catastrophic way in which we humans are changing and damaging this paradise in which we live. Green issues aside, the vector people are still struggling with data capture while the image processing people have a major computational problem with the sheer bulk of data.

These two forms must come together in one system since the combination will yield far more than the parts. Currently vector based GIS systems tend to be separate from raster based image processing systems. Raster data is derived from a variety of sources e.g. rasterized paper maps (from scanners), grid data derived from digital terrain models and remotely sensed data (from satellites).

Many applications such as overlay analysis (described below), recording changes in land use and changes in the environment can be tackled much more effectively when the two forms of data can be manipulated, viewed and analysed in one integrated, seamless environment.

Overlay Analysis - a problem of robustness

Given a multi-thematic map base, overlay analysis is used to answer questions of the type: "show me all areas with sandy soil and agricultural land use, where the ground slope is less than 5 degrees because I want to grow potatoes".

The idea behind overlay analysis is very neat: one starts by producing a single combined topological model of all the themes which are to contribute to the analysis. Producing the combined topology is both a geometric as well as a topological problem. Once it has been produced, all queries and analyses can be carried out wholly in the topological domain without any further reference to the geometry and are therefore amenable to be tackled via conventional query languages such as SQL (Frank 1987).

The problem is that producing the combined topology robustly is far less trivial than it seems at first sight. Anyone who has wrapped their brain around the problem of producing a robust solid modeller will know the kind of pathological cases that can arise. Fortunately, the problem is not quite as bad as producing a robust solid modeller; however, any system implementor who underestimates the difficulty of this problem had better beware - schoolboy geometry just isn't good enough.

But robustness is not the only problem. It turns out that not infrequently two or more themes may share the same boundary curve. Typically these will have been digitized separately and with different accuracies, therefore their representations will be different in detail. If two such themes are now combined, then many superfluous small polygons (called slivers) will be produced. Slivers can easily outnumber the real polygons leading to an explosion in data volumes, messy graphics, a collapse in efficiency and questionable analysis. Sliver removal is just another of those irritating issues which must be addressed if a practical system is to result.
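
The fragment below shows the effect on a pair of toy themes whose shared boundary has been digitised slightly differently, and applies the crudest possible remedy, an area threshold; it relies on the shapely library and an arbitrary threshold value, both of which are assumptions of the example rather than recommendations.

    # Overlay of two themes that share a boundary digitised with different
    # accuracy: the combined pieces include two thin slivers, removed here by
    # a (hypothetical) minimum-area threshold.  Requires the shapely library.
    from shapely.geometry import Polygon

    soil     = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
    land_use = Polygon([(0, 0), (10.05, 0), (9.95, 10), (0, 10)])  # same edge, redigitised

    pieces = [soil.intersection(land_use),   # area in both themes
              soil.difference(land_use),     # in the soil theme only
              land_use.difference(soil)]     # in the land-use theme only

    MIN_AREA = 0.5                           # hypothetical sliver threshold (map units squared)
    real    = [p for p in pieces if not p.is_empty and p.area >= MIN_AREA]
    slivers = [p for p in pieces if not p.is_empty and p.area < MIN_AREA]
    print(len(real), "real polygons;", len(slivers), "slivers of areas",
          [round(p.area, 3) for p in slivers], "discarded")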

In any analysis, it is important to try to make some assessment of the accuracy of the results. In practice, each theme of data will contain inaccuracies and errors, the bounds of which may be known. The consequent effects of these individual errors on the results from many combined themes may be impossible to gauge (Walsh et al, 1987); overlay analysis systems available today should carry the caveat: "let the user beware". (As an aside here, the whole subject of quality meta data, including such aspects as source, accuracy and coverage, would be a good candidate for the eleventh problem.)

Overlay analysis is one of those techniques which is mind-blowingly tedious to do without a GIS. However, without robust clean algorithms there is a grave risk that the results produced will be meaningless.

Very Large Polygons

When Slartibartfast designed the coast of Norway (The Hitch Hiker's Guide to the Galaxy, Adams, 1979), he was really setting a major challenge for the designers of Scandinavian GIS systems. He probably also designed the Stockholm Archipelago with its 23000 islands as a challenge to those trying to do spatial analysis in the Baltic. Whereas very large linear features can be straightforwardly subdivided into manageable pieces, very large polygonal areas, including those with many embedded islands are not amenable to the same techniques.

However, one doesn't need to travel to scenic northern regions to encounter this problem, since very large polygons can occur in everyday applications. The GIS designer is confronted with the problem of doing spatial analysis on small parts of such polygons, as well as answering questions to do with inside and outside, without processing the millions of points which there may be in such polygons.
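
One possible line of attack, sketched below purely as an illustration and not as a description of any existing system, is to bucket the polygon's edges by their vertical extent, so that an inside/outside test by ray casting need only examine the few edges in the band containing the query point rather than the whole boundary.

    # Point-in-polygon for a very large polygon without scanning every edge:
    # edges are bucketed into horizontal bands by their y-extent, and a ray
    # cast at the query point's y only examines the edges in that one band.
    import math

    class BandedPolygon:
        def __init__(self, vertices, bands=1024):
            ys = [y for _, y in vertices]
            self.ymin, self.ymax = min(ys), max(ys)
            self.bands = bands
            self.buckets = [[] for _ in range(bands)]
            n = len(vertices)
            for i in range(n):
                edge = (vertices[i], vertices[(i + 1) % n])
                lo = self._band(min(edge[0][1], edge[1][1]))
                hi = self._band(max(edge[0][1], edge[1][1]))
                for b in range(lo, hi + 1):   # register the edge in every band it spans
                    self.buckets[b].append(edge)

        def _band(self, y):
            t = (y - self.ymin) / ((self.ymax - self.ymin) or 1.0)
            return min(self.bands - 1, max(0, int(t * self.bands)))

        def contains(self, x, y):
            if not (self.ymin <= y <= self.ymax):
                return False
            inside = False
            for (x1, y1), (x2, y2) in self.buckets[self._band(y)]:
                if (y1 > y) != (y2 > y):      # edge actually crosses the ray's level
                    x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                    if x_cross > x:
                        inside = not inside
            return inside

    # A "coastline" of 100,000 vertices: a rough circle of radius 1000.
    coast = [(1000 * math.cos(2 * math.pi * i / 100000),
              1000 * math.sin(2 * math.pi * i / 100000)) for i in range(100000)]
    poly = BandedPolygon(coast)
    print(poly.contains(0, 0), poly.contains(2000, 0))   # True False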

As with very large databases, the goal is to produce systems whose performance is dependent only on the data of interest and not to be burdened with the irrelevant.

The Front End Language - the seamless programming environment

A problem with GIS customers, seen from the system vendor's viewpoint, is that they are all different. Each customer will have a different data-model and will want substantial enhancements to his user interface and functionality. Whereas there are good facilities for data-modelling, the tools for developing and customising today's systems leave a lot to be desired. Much effort has gone into toolkits for developing and customising systems including standard graphics libraries, user interface managers, data managers, windowing systems etc. However, if one wishes to get in and program any of the systems, one is confronted with operating system commands, Fortran, C, SQL, embedded SQL, some online command language, domain specific 4GLs or a combination of these; not to mention auxiliary "languages" to define user syntax, menus, data definitions, data dictionaries etc. With these kinds of programming tools it can take many man-months of skilled programmer effort to achieve even modest system customisation.

A common problem with systems whose parts are front ended by different languages is that it is not possible to integrate them properly. For example, a graphics system for mapping, which is "hooked into" a database, typically does not allow the full power of the database to be accessed from within the graphics command language, nor can the power of the graphics system be invoked from within the database query language. What is really needed is a system in which all data and functions can be accessed and manipulated in one seamless programming environment (Butler, 1988).

What has been shown by a number of organisations is that the same development carried out with an on-line object oriented programming language can cut such development times by a very large factor (e.g. 20). Object orientation does not just mean that there is a database with objects in it, but that the system is organised around the concept of objects which have behaviour (methods). Objects belong to classes which are arranged in a hierarchy. Subclasses inherit behaviour, and communication between objects is via a system of message passing, thereby providing very robust interfaces.
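
For readers to whom these terms are unfamiliar, the toy fragment below conveys the flavour; it is written in Python purely for illustration and the classes are invented, but it shows behaviour living with the objects, subclasses inheriting and specialising that behaviour, and callers simply sending the same message to objects of quite different classes.

    # A toy illustration of the object oriented ideas referred to above.
    class MapObject:
        def __init__(self, name):
            self.name = name
        def draw(self):
            raise NotImplementedError    # every concrete map object must answer draw()

    class Road(MapObject):
        def draw(self):
            return "polyline for road " + self.name

    class TrunkRoad(Road):               # inherits Road behaviour and specialises it
        def draw(self):
            return Road.draw(self) + " (double width)"

    class LandParcel(MapObject):
        def draw(self):
            return "filled polygon for parcel " + self.name

    for obj in [Road("A41"), TrunkRoad("M1"), LandParcel("117/3")]:
        print(obj.draw())                # same message, class-specific behaviour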

Firstly, such a language should be able to provide facilities covering the normal requirements of an operating system, customisation, applications programming and most systems programming. Secondly, the language should have a friendly syntax that allows casual users to write simple programs for quick online customisation. Larger programs should be in a form that is easily readable and debuggable. Thirdly, the language must be usable in an interactive mode as well as being able to be "supercharged" by means of full compilation.

There are several languages around which satisfy some of these requirements: Basic is all right for simple programs (but look at what some people now achieve in the micro world with some of the more advanced modern Basics); Lisp has much of the power and speed, but is hardly readable (but just see what Autocad has achieved with Autolisp). Smalltalk has both speed and object orientation, but with the total exclusion of any other programming construct. Postscript is very widely used and has a number of the desired features, but is another write-only language (i.e. "unreadable" by anyone, including the original programmer). Hypertalk is wonderful, but you would not write a large system in it. C++ has much of the required syntax and semantics, but it is not available as a front end language and can therefore only be accessed by a select few system builders, normally employed by the system vendor.

Having dismissed most of the well known languages developed during the last 30 years, what then is required? It is an on-line programming language with a friendly Algol-like control structure and the powerful object oriented semantics of languages like Smalltalk.

The Query Language

Modern query languages such as SQL are not sufficient in either performance or sophistication for much of the major development required in a GIS system - but then one would argue that they were not intended for this. One can see why people like SQL; it can give immense power in return for some fairly simple "select" constructs. A problem which has to be addressed is spatial queries within the language, since trying to achieve this with the standard set of predicates provided is extremely difficult and clumsy. (An example of a spatial query is to select objects "inside" a given polygon.)
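
The sketch below shows the difficulty in miniature: using sqlite3 merely as a stand-in for a commercial DBMS (the schema, the data and the point-in-triangle helper are all invented), the attribute predicate and a bounding-box filter can be expressed in SQL, but the exact "inside the polygon" test has to be bolted on outside the query language.

    # An attribute query plus a spatial "inside" predicate: SQL can express the
    # attribute part and a crude bounding-box filter; the exact geometric test
    # has to be applied in application code afterwards.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE poles (id INTEGER, material TEXT, x REAL, y REAL)")
    db.executemany("INSERT INTO poles VALUES (?, ?, ?, ?)",
                   [(1, "wood", 2, 2), (2, "wood", 9, 9), (3, "steel", 3, 1)])

    area = [(0, 0), (10, 0), (0, 10)]          # the query polygon (a triangle)

    def inside(px, py, tri):
        """Exact point-in-triangle test via the signs of three cross products."""
        (ax, ay), (bx, by), (cx, cy) = tri
        d1 = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        d2 = (cx - bx) * (py - by) - (cy - by) * (px - bx)
        d3 = (ax - cx) * (py - cy) - (ay - cy) * (px - cx)
        return (d1 >= 0 and d2 >= 0 and d3 >= 0) or (d1 <= 0 and d2 <= 0 and d3 <= 0)

    xmin, xmax = min(x for x, _ in area), max(x for x, _ in area)
    ymin, ymax = min(y for _, y in area), max(y for _, y in area)

    # Step 1: the part SQL can express, an attribute predicate plus a bounding box.
    candidates = db.execute(
        "SELECT id, x, y FROM poles WHERE material = 'wood' "
        "AND x BETWEEN ? AND ? AND y BETWEEN ? AND ?",
        (xmin, xmax, ymin, ymax)).fetchall()

    # Step 2: the exact spatial predicate, applied outside the query language.
    print([pid for pid, x, y in candidates if inside(x, y, area)])   # -> [1]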

If the route adopted is to provide two databases in parallel, a commercial one driven by SQL and a geometry database to hold the graphics, then there is a problem constructing queries that address both databases. Ideally, the query language should be a natural subset of the front end language allowing access to the same seamless environment that the front end language provides. Much work needs to be done in the area of query languages for GIS.

Conclusion

Ten years ago the evolution of CAD systems was at about the same stage that the evolution of GIS systems is now. One can recall the arguments over whether mechanical design systems should be 2D or 3D; what was it that allowed a 3D system to claim to be a solid modeller? The purists endlessly discussed CSG versus BReps; Euler's rules had to be obeyed at all costs - while in the meantime the faceted modeller brigade stole the show. The surface designers insisted that C2 was a minimum and some pronounced that C7 was essential! Your surface system was definitely not macho unless it was founded on NURBS - even though the humble conic can adequately handle 95% of practical problems. Today, most of the world seems to be happy hacking away on a fairly modest little drafting system called Autocad.

So it is now with GIS: What is a GIS and what isn't a GIS? Do the ever longer synonyms for GIS say anything new? Should the system be based on raster or vector? One of us recently asked a vector bigot what percentage of spatial analysis problems can be tackled in the raster domain - back came his answer: "None!". Arguments about topological models are rife while protagonists argue about what are the acceptable ways to break the strict rules laid down for data-modelling, in order to achieve a working system.

Let us beware in GIS: too much concentration on how things are done, as opposed to what we are trying to do, will lead to much wastage of time and wrong decisions being made. It is all very well for the salesman to try to identify unique selling features in his product in order to exclude another vendor from the sale, but let us learn from the experience of CAD when this is happening. It is important to identify the real application requirements of a system; meeting these can be crucially dependent on that system having strong fundamentals. These can be exposed by careful benchmarking (McLaren, 1988).

In this paper we have attempted to lay bare ten difficult problems in GIS. In fact they are not wholly independent problems; indeed, some are a consequence of trying to solve another. The encouraging thing is that most of them have now been identified, and indeed for many there are some promising solutions.

Acknowledgements

The authors gratefully acknowledge Robin McLaren for reading the draft of this paper and thank him for his thoughtful comments.

References

Adams, Douglas (1979). The Hitch Hiker's Guide to the Galaxy, Pan Books.

Bordeaux, La Communauté Urbaine de (1988). Programme du Concours pour un système interactif graphique de gestion de données urbaines localisées.

Bundock, Michael (1987). Integrated DBMS Approach to Geographical Information Systems, Auto Carto 8.

Butler, R. (1988). The Use of Artificial Intelligence in GIS, Mapping Awareness and Integrated Spatial Information Systems, Vol. 2, No. 3.

Chrisman, N. R. (1987). Efficient digitizing for error detection and editing, International Journal of Geographical Information Systems, Vol. 1, No. 3.

Frank, Andrew U. (1987). Overlay Processing in Spatial Information Systems, Auto Carto 8.

Frank, Andrew U. (1988). Requirements for a Database Management System for a GIS, Photogrammetric Engineering and Remote Sensing, Vol. 54, No. 11.

Guttman, Antonin (1984). R-Trees: A Dynamic Index Structure for Spatial Searching, Proceedings of the ACM SIGMOD International Conference on Management of Data.

McLaren, Robin A. (1988). Benchmarking within the GIS Procurement Process, Mapping Awareness and Integrated Spatial Information Systems, Vol. 2, No. 4.

Nimtz, H. (1988). Scanning und Vektorisierungs Software zur Erfassung von Katasterplanwerken, Proceedings Automated Mapping/Facilities Management European Conference IV.

Rhind, D. W., Green, N. P. A. (1988). Design of a Geographical Information System for a Heterogeneous Scientific Community, International Journal of Geographical Information Systems, Vol. 2, No. 2.

Samet, H., (1984). The Quadtree and Related Hierarchical Data Structures, ACM Computing Surveys, Vol. 16, No. 2.

Visvalingam, M., Wade, P., Kirby, G. H. (1986). Extraction of area topology from line geometry, Proceedings Auto Carto London.

Walsh, Stephen J., Lightfoot, Dale R., Butler, David R. (1987). Assessment of Inherent and Operational Errors in Geographic Information Systems, Technical Papers, 1987 ASPRS-ACSM Annual Convention, Vol. 5.
