Philip Sargent
1st October 1998
This paper only addresses internal, hidden feature identifiers which would be used to manage incremental updates.
These are almost certainly the same identifiers used to implement relationships.
Compound objects are assumed to be implemented using relationships.
This paper assumes a handle-based mechanism for identification and location of identifier-servers, see [Bishr99, Sargent99].
Incremental update requires permanent feature identifiers, but also intrinsically means that some identifiers will not be permanent because they will be deleted ("retired").
This conflict can be managed if we design it properly. But we cannot suggest a solution which requires any kind of centralised international database.
We have all the information we need to do this now, so there's no excuse for not getting on with it.
The concept of published- versus dirty-feature identifiers is introduced.
This turns out to be sufficient. We do not need to specify the full semantics of a formal system to specify long-transaction or version control architectures.
A key idea is that incremental publishing is what we want to achieve. We do not need at this time an all-singing, all-dancing global peer-to-peer transactional universal historical feature identifier monster.
This paper is structured like this:
We can start simply, with read-only incremental updates to read-only GIS clients.
We then go on to consider a tight community which is collaboratively updating a GIS dataset, e.g. within a single cartography publisher's organisation. This has lower priority within OGC at this time.
We then show that these two architectures can be composed in one specific way which works and has good global properties. There may be other ways of carefully composing these two architectures.
With incremental publishing, there is a notional single master database, the "publisher" which periodically issues incremental updates. The clients have read-only copies of the data which they can use in various value-added ways [Arctur98]. The thing they do that is relevant to this discussion is that they create references to feature identifiers in other software, web-pages, other GIS packages etc.
Example: a hydrographic agency distributes CD-ROMs of charts to pleasure-boat owners, with daily "notices to mariners" broadcast using the GSM telephone short-message service. Some boat owners have add-on software which presents the data as a 3D visualisation and correlates it with on-shore harbour information purchased from a yachting club.
Thus every feature identifier which is sent out from the master is published and the master takes responsibility for ensuring that it continues to be useful so long as clients follow some simple instructions and implement a local registry.
Assume for the moment that the client only works within a closed group and no reference is made to the data except via the client, i.e. a strict tree.
The client local registry contains an index of every feature identifier that the client has used in its value added activities.
When an update packet arrives from the master, it contains the geodata update in some file format and a feature identifier registry update, probably in some XML encoding. This says which identifiers are being split, merged and deleted [Arctur98]. (It probably lists all the new ids too.) If the client has used any of these ids, it has responsibility for:
When we resolve updated identifiers we have a choice:
The second option is one that often happens when there is a mixture of archive data which must be kept consistent, e.g. compound objects represented using relationships, and current data which must reflect the current configuration. Other software systems distinguish the two by defining this behaviour as an aspect of the relationship type, like cardinality or bidirectionality.
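The client-side bookkeeping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class and field names (`IdUpdate`, `LocalRegistry`, `record_use`, `affected_by`) and the file names are hypothetical, and the sketch stops at reporting which of the client's own references are touched by an update; what the client then does with each reference is left open.

```python
from dataclasses import dataclass, field


@dataclass
class IdUpdate:
    """Identifier part of one update packet (field names are illustrative)."""
    new: set = field(default_factory=set)
    deleted: set = field(default_factory=set)
    splits: dict = field(default_factory=dict)   # source id -> list of target ids
    merges: dict = field(default_factory=dict)   # target id -> list of source ids


class LocalRegistry:
    """Index of every feature identifier the client has used, and where."""

    def __init__(self):
        self._uses = {}  # feature id -> set of places that reference it

    def record_use(self, feature_id, where):
        self._uses.setdefault(feature_id, set()).add(where)

    def affected_by(self, update):
        """Return {used id: (change, successor ids, places)} for used ids only."""
        report = {}
        for source, targets in update.splits.items():
            if source in self._uses:
                report[source] = ("split", list(targets), sorted(self._uses[source]))
        for target, sources in update.merges.items():
            for source in sources:
                if source in self._uses:
                    report[source] = ("merge", [target], sorted(self._uses[source]))
        for fid in update.deleted:
            # A plain deletion: the id was retired without any successor.
            if fid in self._uses and fid not in report:
                report[fid] = ("delete", [], sorted(self._uses[fid]))
        return report


registry = LocalRegistry()
registry.record_use("a37b3", "harbour-guide.html")   # hypothetical value-added page
registry.record_use("99999", "3d-view.cfg")          # id untouched by this update

update = IdUpdate(
    new={"a37b5"},
    deleted={"a37b3", "a37b4"},
    merges={"a37b5": ["a37b4", "a37b3"]},
)
report = registry.affected_by(update)
print(report)
```

Only identifiers the client has actually used appear in the report, so the work a client must do on each update is proportional to its own references rather than to the size of the packet.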
Now we relax the condition that access to the geodata is through a strict tree. This is more than we need at this point in OGC since our user-base wants controlled incremental publishing first.
Any set of clients which are synchronized in their updates can use each other's value-added software because the feature ids will be identical. [All clients treat the geodata as read-only, remember.]
Now consider that one of these clients has created some additional information about an object whose geodata is in the public domain. That client publishes a web-page containing feature identifier references. A reader of that web-page may not want to go to that client to get the public data, but wherever the reader goes to get it may not be synchronized, so some identifiers will be meaningless.
We can fall back on a universal master in this case (which could be replicated). It would have to have interfaces appropriate for readers coming to it with identifiers from any one of the historical update packages. If we say that clients must always quote their current update package level ("patch number") whenever they quote a set of identifiers, then the reader will know that and can get the appropriate resolution and thus correct data from the master.
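One way such a master could resolve an identifier quoted together with a patch number is to replay the identifier changes of every later packet. The sketch below assumes that packets are numbered 1, 2, 3, ... with level 0 being the original release; the class `MasterHistory`, its methods and the identifier `b0001` are hypothetical names chosen for illustration.

```python
class MasterHistory:
    """Keeps the identifier changes of every update packet, in issue order."""

    def __init__(self):
        self._packets = []  # index i holds the changes leading to patch level i + 1

    def publish(self, splits=None, merges=None, deleted=()):
        self._packets.append({
            "splits": dict(splits or {}),   # source id -> list of target ids
            "merges": dict(merges or {}),   # source id -> target id
            "deleted": set(deleted),
        })
        return len(self._packets)           # the new patch level

    def resolve(self, feature_id, patch_level):
        """Map an id quoted at patch_level to the set of ids valid now."""
        current = {feature_id}
        for packet in self._packets[patch_level:]:
            following = set()
            for fid in current:
                if fid in packet["splits"]:
                    following.update(packet["splits"][fid])
                elif fid in packet["merges"]:
                    following.add(packet["merges"][fid])
                elif fid not in packet["deleted"]:
                    following.add(fid)      # unchanged by this packet
            current = following
        return current


master = MasterHistory()
master.publish(splits={"17230": ["17321", "a37b5"]})   # patch level 1
master.publish(merges={"17321": "b0001"})              # patch level 2
master.publish(deleted={"a37b5"})                      # patch level 3

print(master.resolve("17230", 0))   # quoted against the original release
print(master.resolve("17321", 1))   # quoted against patch level 1
print(master.resolve("a37b5", 2))   # retired since level 2: empty set
```

An empty result tells the reader that the feature has been retired with no successor; a result with several members tells them it has been split since the level they quoted.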
Note that when dealing with read-only data it is always possible to set up replicated servers and client-side caches to improve scalability and performance.
Whereas incremental publishing must be able to cope with clients which may number in hundreds of thousands who are largely out of control, our current requirements for distributed editing involve a few tens of editing clients which are under close control.
It is the editing process which splits, merges and creates identifiers.
Example: a mapping organisation has 20 groups working on different segments of data. These segments may be defined by tiles, irregular spatial boundaries or by theme (feature class).
Each editor checks out a writable segment from the common master database and works on it, checking versions of his segment back in from time to time. Each editor creates dirty feature identifiers which it has to reconcile with the master. It may reconcile its segment by keeping a copy and updating the identifiers with those it gets from the master, but more likely it will just delete its local segment and re-check-out a new version of the segment from the master.
After a while all editors have resolved their changes back into the master. At this point the master can be published. [Note that this means that the version network has achieved closure. This requirement will be relaxed later.]
None of the dirty identifiers ever leaves the organisation in which it was created, and no editing client performs any value-added activity in which any identifier is used at all. As segments are checked back in, new identifiers are either kept or renumbered if they conflict with another, but they are never clean until the dataset as a whole is published (perhaps as an incremental update, perhaps as a whole).
So long as a client operates on an identifier strictly according to either the read-only incremental publishing protocol or the distributed editing role, it can do both at once. So some of its identifiers (those it got from an incremental update from the master) are published and some (those it creates itself) are dirty. It can use the published ones in value added activities, but it can't use the dirty ones. It has to wait until it gets them back from the master in published form.
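The published/dirty distinction and the check-in step can be sketched as follows. Everything here is an assumption made for illustration: the class names, the `r1`, `r2`, ... renumbering scheme and the idea that the master returns a dirty-to-published map are not specified by this paper.

```python
import itertools


class Master:
    """Assigns final identifiers when an editor checks a segment back in."""

    def __init__(self, known_ids):
        self.known = set(known_ids)
        self._counter = itertools.count(1)

    def check_in(self, dirty_ids):
        """Return {dirty id: final id}; an id is kept unless it conflicts."""
        id_map = {}
        for dirty_id in dirty_ids:
            final_id = dirty_id
            while final_id in self.known:
                final_id = f"r{next(self._counter)}"   # hypothetical renumbering
            self.known.add(final_id)
            id_map[dirty_id] = final_id
        return id_map


class EditingClient:
    """Holds published ids (from the master) and dirty ids (created locally)."""

    def __init__(self, published_ids):
        self.published = set(published_ids)
        self.dirty = set()

    def create_feature(self, local_id):
        self.dirty.add(local_id)
        return local_id

    def can_reference(self, feature_id):
        """Only published ids may be used in value-added activities."""
        return feature_id in self.published

    def receive_published(self, id_map):
        """Dirty ids come back from the master in published form."""
        for dirty_id, published_id in id_map.items():
            self.dirty.discard(dirty_id)
            self.published.add(published_id)


master = Master(known_ids={"17230", "a37b3"})
client = EditingClient(published_ids={"17230"})
client.create_feature("a37b3")    # clashes with an id the master already holds
client.create_feature("n1")
usable_before = client.can_reference("n1")           # still dirty

id_map = master.check_in(sorted(client.dirty))
client.receive_published(id_map)                     # arrives with the next update
usable_after = client.can_reference(id_map["n1"])    # now published
print(id_map, usable_before, usable_after)
```

The guard in `can_reference` is the whole rule: a client may play both roles at once so long as it never lets a dirty identifier into a value-added reference.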
Alternatively, the editing client can use its own dirty identifiers if it takes full responsibility for them on behalf of the sub-clients who use them, i.e. it acts like a master with a whole subsidiary architecture of feature identifier update packets and identifier registries at its sub-clients. If you issue a dirty identifier, it is your responsibility to clean up afterwards.
This introduces the notion of "local cleanliness". A sub-subclient would not be aware that it had a dirty identifier, and indeed, so long as it only talked to other sub-clients of the same client (or to its master, which we call an editing-client), the identifier would be effectively published. It only appears dirty outside that little group. Everything is fine so long as that little group achieves local closure, i.e. eventually the editing client reconciles itself with its master and then cleans up all its sub-clients: then the sub-clients can be normal clients of the master.
Dirtiness is thus a relative, not an absolute concept: it is only dirty outside the "isolation ward". This is OK so long as we maintain our context properly which is why we do need some Universal Resource Identifier (URI) or handle system to keep track of who we are talking to. DNS itself would be sufficient (if awkward) because it can define machine aliases, but the extra level of indirection from a handle system is well worth it [Sargent99].
How do we cope with out-of-control copying of data as we might find in the open Internet?
This is the situation where we cannot assume that a client has had all (or any) of the interim update packets, or where anyone copies a dataset from a client without re-registering with the master. Such copying is fundamentally always possible, even if we were to devise some Kerberos security architecture, unless we use military-style no-write-down, no-read-up controls, which are impossible.
This is the same problem as the "non tree client network" we discussed above, with the same solution, unless the uncontrolled copy was made of dirty data. In that case the original master won't be able to help, and the edit-update client may be unaware of the copy and may have reconciled all its sub-clients and deleted all historical records of dirty identifiers. It will always delete them because that is the whole point of reconciliation, to save space in what otherwise would be an exponentially expanding identifier list.
This remaining unresolvable problem will always be with us in some form: someone can always take a copy of data, change it in an arbitrary way, then give it to someone else and vanish. Thus the fact that we can't deal with this case is not important, as no system can deal with it.
The handle (URI) for a master which issues update packets needs to be written into the header of the packet so that a client which needs the next update always knows where to find it.
The individual identifiers in the packet should also be annotated with their server's URI if they are being passed on from another master server. That is the same thing as saying that an update-editing client which put out its own dirty identifiers would use its own URI as a prefix for the dirty ones, while using its master's URI as a prefix for the published ones. Thus a sub-client within an "isolation ward" would in fact be able to tell the difference between globally published and locally published (globally dirty) identifiers.
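A minimal sketch of this prefixing rule, assuming an identifier is simply the pair (server handle, local id). The master handle is the one used in the sample packet in this paper; the editing client's handle `hdl://example.org/editing-client` and the function names are hypothetical.

```python
from typing import NamedTuple


class QualifiedId(NamedTuple):
    server: str     # handle (URI) of the server that issued the identifier
    local_id: str   # identifier as issued by that server


def qualify(server, local_id):
    return QualifiedId(server, local_id)


def scope(qid, global_master):
    """Classify an identifier as seen from inside an 'isolation ward'."""
    if qid.server == global_master:
        return "globally published"
    return "locally published (globally dirty)"


MASTER = "hdl://it.jrc/sai/geo/ogis/laghi/M37-1"
EDITOR = "hdl://example.org/editing-client"     # hypothetical handle

from_master = qualify(MASTER, "17321")
from_editor = qualify(EDITOR, "17321")          # same local id, different issuer

print(scope(from_master, MASTER))
print(scope(from_editor, MASTER))
print(from_master == from_editor)
```

Because the issuer is part of the identifier, two servers can hand out the same local id without collision, and a sub-client can tell at a glance which identifiers are safe to quote outside its own group.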
What would an update packet look like? The geodata part would be some established file-format which had feature identifiers, e.g. an object form of SDTS [Arctur98], or a set of OGC Simple Feature transactions. The identifier part of the packet might look like this:
<idpacket>
  <master> hdl://it.jrc/sai/geo/ogis/laghi/M37-1 </master>
  <packet> 171-5 </packet>
  <metadata> [...omitted..] </metadata>
  <update>
    <new> 17321 </new>
    <new> a37b5 </new>
    <delete> 17320 </delete>
    <delete> a37b3 </delete>
    <delete> a37b4 </delete>
    <split>
      <source> 17230 </source>
      <target> 17321 </target>
      <target> a37b5 </target>
    </split>
    <merge>
      <source> a37b4 </source>
      <source> a37b3 </source>
      <target> a37b5 </target>
    </merge>
  </update>
</idpacket>
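Reading such a packet needs nothing beyond a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a well-formed variant of the sample, with every `<source>` and `<target>` element explicitly closed and the metadata element left out; the function names and the shape of the returned dictionary are assumptions for illustration, not a proposed interface.

```python
import xml.etree.ElementTree as ET

PACKET = """
<idpacket>
  <master>hdl://it.jrc/sai/geo/ogis/laghi/M37-1</master>
  <packet>171-5</packet>
  <update>
    <new>17321</new>
    <new>a37b5</new>
    <delete>17320</delete>
    <delete>a37b3</delete>
    <delete>a37b4</delete>
    <split>
      <source>17230</source><target>17321</target><target>a37b5</target>
    </split>
    <merge>
      <source>a37b4</source><source>a37b3</source><target>a37b5</target>
    </merge>
  </update>
</idpacket>
"""


def texts(parent, tag):
    """Text of every direct child of parent with the given tag."""
    return [el.text.strip() for el in parent.findall(tag)]


def parse_idpacket(xml_text):
    root = ET.fromstring(xml_text.strip())
    update = root.find("update")
    return {
        "master": root.findtext("master").strip(),
        "packet": root.findtext("packet").strip(),
        "new": texts(update, "new"),
        "delete": texts(update, "delete"),
        # Each split/merge becomes a (source ids, target ids) pair.
        "split": [(texts(s, "source"), texts(s, "target"))
                  for s in update.findall("split")],
        "merge": [(texts(m, "source"), texts(m, "target"))
                  for m in update.findall("merge")],
    }


parsed = parse_idpacket(PACKET)
print(parsed["master"], parsed["packet"])
print(parsed["split"])
print(parsed["merge"])
```

The master handle in the header is what a client would follow to fetch the next packet, so a parser of this kind is the natural entry point for the add-on software that maintains the local registry.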
Between which objects can we support relationships?
What must a GIS be able to do in order to participate in this scheme?
Simple. The minimum and maximum requirements are the same: it must support permanent, immutable feature identifiers within itself.
Everything else, the identifier registries etc., can all be handled by external, add-on software which translates the identifiers into some internal form for the GIS and which can handle URIs to talk to master servers and read update packets.
The version control architecture described here is over-simple, with a single binary distinction between published and dirty identifiers. OGC will eventually need a proper long-transaction protocol, version semantics etc. This is a short-cut: everything which really should be logically distinct and separately specified at different levels of abstraction has been bundled together for the sake of speed and simplicity. This is not a long-term solution, but it may be good enough for now.
"In case that the source supports versioning, a registry of the relationship between the ID of the original version and that of subsequent ones should be maintained. [...] Case 4:... the client requires to be notified of any updates on the retrieved objects. A broadcast mechanism can be designed such that the source sends an update alert on the network." [Bishr99]
"These references include such things as value-added attributes, additional feature relationships, etc. While it may be possible to resolve all of these references immediately in a small centralized database, it is intractable considering the number of geospatial databases and the volume of features currently in existence. Therefore, one must employ a technique that supports on-demand resolution of these references at any point in the future. This technique must also support tracing the lineage of a feature back in time through changes in delineation, or forward to the present from some historical date." [Hair97]
[Bishr99]
A Globally Unique Persistent Object ID for Geospatial information Sharing, Yaser A. Bishr, Interop'99 submission. Online at: http://www.opengis.org/members/fid.wg/index.htm
[Sargent99]
Feature Identities, Descriptors and Handles, Philip Sargent, Interop'99 submission.
http://purl.oclc.org/NET/sargents/Philip/feature-ids/base.html
[Arctur98]
Issues and prospects for the next generation of the spatial data transfer standard (SDTS), David Arctur, David Hair, George Timson, E.Paul Martin, Robin Fegeas. IJGIS (1998) 12 (4) 403-425. Online at: http://www.opengis.org/members/fid.wg/index.htm
[Hair97]
Feature Maintenance Concepts, Requirements, and Strategies, Version 3.0, May 28, 1997, David Hair, EROS Data Center, George Timson, Mid-Continent Mapping Center, Paul Martin, Rocky Mountain Mapping Center. Published by U.S. Geological Survey/National Mapping Division. Online at: http://www.opengis.org/members/fid.wg/index.htm
[Shklar97]
New approaches to cataloging, querying and browsing geospatial metadata, L.Shklar, C.Behrens, E.Au, IEEE Metadata Conf., 1997. http://computer.org/conferen/proceed/meta97/papers/lshklar/lshklar.html
This work performed at the European Commission Joint Research Centre.