FeatureRelationships: Implications

Philip Sargent
30 January 1998 10:00 CET

Summary

This note explores the implications of representing relationships between features by implementing the relationship as a stored-value attribute that records a list of Feature ObjectIdentifiers (OIDs).

This might seem to be an easy way of getting feature-feature relationships almost "for free", but this note shows that although the essence of the idea is good, it needs a little more machinery at the beginning to make it elegantly extensible for real-life practical purposes later on.

This only considers persistent relationships and not ad hoc relationships or attributional relationships (using Orest's terminology). Both of these I would view as the result of a query evaluation.

This note demonstrates some implications of Orest's document 98-037, as included in Sam's revision of Topic 8 in 98-005. It proposes that we need a signature for each relationship. It also has some implications for the definition of OIDs and Feature Collections.

Caveat

This is a note about implementation. This is not the proper (eventual) target of our work. We are writing abstract specifications for interfaces. However, I believe that looking at an implementation is instructive and it suggests what kinds of interface we could need.

Implications for OIDs

If we are to use OIDs to represent relationships at all, that tells us something about OIDs. We have two choices, either the function interface returns OIDs for the client software to manipulate directly, or it hides them behind a function interface. If it hides them it still has to return some feature handle - a proxy for the OID - unless one defines a closed list of allowed relationships as part of the interface definition.

Feature handle lifetime

What then is the lifetime of a feature handle?

Architecture

If we consider a simple two-level architecture: a user-interface running on a PC connected by a COM, CORBA or SQL interface to a repository somewhere, then the minimum lifetime is obvious: the length of the "session": the time the client program is running which will be as long as the user is sitting at it.

However, as discussed in the WWW group at San Rafael, the concept of "a session" is not well defined at all in modern architectures. If we consider a middleware process, which appears to the repository as a client and to an end-user as a repository, then it might be running continuously for months.

One of the uses of 3-tier architectures is redundancy for continuous availability. So eventually we will want middleware to connect to duplicate, redundant repositories and perhaps swap to using later versions of a repository (which might have been reindexed, repartitioned or have had schema additions) on the fly. The question is, can we put up with the middleware flushing all the features from its cache whenever it changes repository? I propose that this is not only undesirable but also unnecessary if the feature handles are actually the same as the persistent repository feature OIDs.

Incremental Update

Some modern data transfer standards allow for incremental update, e.g. S-57. If a relationship spans a segment that is being updated, then this requires persistent OIDs. The more we use relationships, the more acute this problem becomes. It is closely related to the issue of supporting long transactions and versioned repositories that have been solved by two different commercial GIS systems: Smallworld and LaserScan.

Make Persistency Implementation Dependent

Whether or not an implementation can support feature handles that are persistent between sessions I suggest be made an explicitly optional part of any RFP. I suggest that we should define an interface function which says whether it is supported on not. It would be a legitimate way for advanced systems to offer the greater functionality in a well-understood way.

Feature collections

There is an outstanding issue in the Abstract Spec. concerning the definitiion of Feature Collections. We can invert the issue, and define various sorts of feature collections by how they allow different types of use of persistent OIDs. We need the concept of a 'lineage' of feature collections (datasets) within which OIDs mean something, e.g. versions, alternatives/clones, segments-under-update, etc. Then we can use short, fixed length OIDs (maybe as short as 32 bit) within each dataset, with a (variable-length ?) dataset-id that is valid within a lineage and is accompanied by a whole chunk of metadata and catalog data.

How many relationships ?

Implementing a relationship as a named attribute which takes OIDs as its value domain is equivalent to saying that a dataset (feature collection) can have as many different relationships as the schema designer cares to define named attributes of that OID type.

The argument for defining a completely rigid, predefined set of named relationships is that it makes interoperation between systems easier.

The arguments against such a course of action are that:

Under what circumstances is the 'for' argument valid ? Well a good example is the different types of topology relationships.

Design choices

We have two choices:

  1. we either allow only a predefined, tiny set of well-defined relationships which have indisputable mathematical definitions,
  2. we define a way of representing the constraints on a relationship, and we include in our standard some pre-defined relationship names complete with recommended constraints.

Option 2 has all the advantages and none of the disadvantages listed above. It

The drawback is that we have to design a relationship-definition language. Fortunately other people have already done this in non-GIS fields.

First-class status for user-defined relationships

Referring back to Orest's discussion document 96-037, this approach would mean that composite, geometry-sharing and topological relationships can be defined and described in the same way as user-defined relationships. Putting that another way, user-defined relationships achieve first-class status, on a par with topology and composition.

Dependency is delegation

Orest also discussed in 98-037 what he called a 'dependency' relationship where a feature gets its geometry from another feature. We need to generalise this since elsewhere we have already agreed that a feature can have several geometry attributes, so I propose a generic 'delegation' relationship. Here a relationship points to another feature (i.e. its value is a single OID) but the relationship needs to be annotated with the name of the attribute on that feature.

That is useful, but it still doesn't quite do what we want. We want the 'MyGeometry' attribute of a pole to be a stored value attribute of type geometry. We want the 'MyGeometry' attribute of the transformer to be a delegation relationship attribute. If we have a single namespace of attribute names we can't do that since they are of two different types. If we use FeatureInterfaces and only insist on a single namespace within each FeatureInterface, and further have a universal constraint on constructing schemas that FeatureInterfaces can only be included on a feature if the names don't clash, then that solves the problem and is another argument for expressing ourselves in those terms.

Enforced birdirectional (inverse) relationships

There are (at least) two ways of restricting a relationship between features to be allowable only between certain FeatureTypes, or rather, only between features supporting certain FeatureInterfaces. This paper uses the language of multiple FeatureInterfaces proposed by Adrian Cuthbert, but you can substitute the language of single FeatureTypes if you prefer and it makes no difference here.

The 'source' end of the relationship is easy: the attribute is defined in the usual way in a FeatureInterface and only those features which have that interface can be a 'source' of the relationship. The 'target' end can be restricted by:

  1. setting up an entire set of new value domains which are sets of FeatureInterfaces, so that the attribute is typed as being one of these domains only, e.g. a domain of OIDs which must all support the "hydro" FeatureInterface which is applied to features which are bodies of fresh water.
  2. set a constraint that says that the relationship has a named 'inverse' relationship that must be present. Then the attribute defining the inverse relationship is restricted quite naturally to that FeatureInterface which defines it.

The drawback of the second technique is that for some small class of relationships, we might want them to be unidirectional: we really do want them to be simple pointers only. This is not actually as common in practice as one might think. The classic (nearly the only) example is for access control: if you have hundreds of owners and millions of owned-features, you must have the pointer on each feature but you can sometimes do without the tens of thousands on pointers on each owner. Unidirectional pointers can only be traversed in one direction, so this example would not allow you to answer the question "what do I own" quickly, though it would allow you to answer the question "do I own these ?" quickly.

A small refinement gets around this drawback with option 2: we set another flag somewhere that says that although the target must have an 'inverse' relationship attribute, this does not actually have to hold the back-pointers but holds a null value or empty list. Where do we define such a cardinality flag ? In the same place that we define the constraint defining the inverse relationship: in the signature of the relationship attribute. All attributes require signatures. The signature accompanies the definition, name and type of the attribute where it is defined in the FeatureInterface (in the FeatureType declaration for those averse to FeatureInterfaces).

Defining a relationship with a signature

Now we have enough background to sketch out what should be in the signature of a relationship attribute. We need a list of parameters:

Attribute Name: string
Attribute Type: string // will be "OID" for relationships
Inverse relationship exists ?: Boolean
Inverse Relationship Name: string
Cardinality of relationship: enumeration[0,1,N]
Default value: OID
Can hold NULL value?: Boolean
Can be empty list ?: Boolean
Allow repeated OIDs in list? Boolean

I think this reproduces the 10 classes of cardinality in Orest's paper. It does split the definition over two places: the signatures of the two ends of a bi-directional relationship, but that's not too bad. [In any case, none of this is intended to propose an implementation, is intended to produce a complete design so that we can see what a complete set of interfaces looks like. We will then go and design a different but equally capable set of interfaces.]

We can use the MIME naming convention for extensions: anything with a prefix X- is a non-standard extension signature parameter, and we could say that the defining organisation should put its DNS name immediately after the X-.

X-megacorp.com-defined extension parameter: atype

Orest's delegation relationship can be done like this:

Delegation relationship? : Boolean
Delegated attribute name: string // must be a valid attribute name

Namespace for parameters

A small point we might consider is the namespace of these parameter names in the signature. A valid, self-defining mechanism would be to say that they are just attribute names defined in the signature of a compiled-in, system-defined FeatureInterface called "Signature". For clarity, this paper will continue to call them "signature parameters".

Update semantics

Orest described the difference between the 'aggregation' and 'composition' relationships as being based in their update semantics: do deletes cascade or not ? Unfortunately, although Orest listed 6 basic types of update implicated action, update semantics do not form a simple enumeration in the real world. For example, a sub-station feature can be deleted if the last cable connected to it has been deleted if the substation is a stand-alone feature but not if it forms the ground floor of a larger structure. Complete update semantics require a full programming language, e.g. SQL3's proposed stored-procedure language, LULL or Magik. JavaScript (ECMAscript) is a future possibility.

Rather than give up at this point, we should ask whether there is there a small set of mathematically well defined update semantics, common enough in the real world, that are pragmatically worth standardising on ? I agree with Orest and suggest that there are, if we restrict ourselves to immediately related features, and ignore granduncles and second cousins and drop any counting except the distinction between 'one' and 'many'.

So we need some more signature parameters:

Cascade delete unconditionally? : Boolean
Cascade delete if there is only one target? : Boolean
Cascade delete to targets if target has no children (targets of this relationship)? : Boolean
Cascade delete to targets if target has no other parents (targets of the inverse relationship)? : Boolean
Cascade delete to targets if target has no other parents (features which have the target as a target of their copy of this relationship)? : Boolean
Cannot delete self if children of this relationship exist? : Boolean

This is probably already too many. The two different constraints for targets with no other parents are identical if there is an enforced inverse relationship. Orest's list (Default Implicit, Propagate Implicit, Default Explicit etc.) look like a more complete and precise set, but my point is that this is a pragmatic decision not a mathematical one.

We can include quite a lot of these constraints into a standard. but too big a set might produce two datasets that actually have the same update semantics but are just expressed in different ways. It also becomes easier to produce inconsistent sets of constraints that are only discoverable at run time. These are problems that full-programming language solutions always have so we are not losing anything there. However, complex dependencies are better expressed in a programming language: a concise block of code is easier to understand than a set of separate assertions spread throughout the feature schema.

There is even a halfway house before falling back on a full programming language. One can imagine a special delete-cascade constraint language that can count and navigate relationships. Such a thing need have no concept of state and could be a pure functional language like ML. You might be able to apply automated theorem provers over such language statements to detect some classes of inconsistency. (However it works, it would probably have 'java' in its name somewhere.) That is clearly a research project and not a standardisation exercise.

A signature is a kind of metadata but part of the schema

This implementation assumes that the signature of an attribute is available at run-time as part of the queryable FeatureSchema. Therefore it could conceivably change during a single "session", i.e. a system may or may not support dynamic schema operations.

The reason why we put some metadata into a formal structure within the Feature schema is that it enables us to incorporate "legacy, new or changed local models into its generic structure of metadata to support evolution without system redesign or recompilation" (University of Laval, MDB Group).

Dynamic schemas are not weird

The whole idea of a dynamic schema is viewed as far-fetched by some GIS developers. However it is standard technology in some other types of engineering software. There is a misconception that you can't write client code in "normal" languages if the schema is dynamic. This is false.

If you try to identify a different compile-time defined struct (class, object type etc.) with each individual feature type in the schema, e.g. a C++ class "road" to match a feature "road", then of course that will not work. You can't change C++ class definitions at run time like that. The solution is simple. Using the same example, you have a C++ class "feature" and a machinery of methods and helper classes which matches an instance of that C++ class at run time with the feature "road" in the spatial dataset. This machinery is probably a standard pattern available from a pattern library somewhere (but I have not checked).

I am not suggesting that we require dynamic schema capabilities. I am suggesting that our standards be phrased not to forbid it. Within the lifetime of the OpenGIS standards we can confidently expect that 24hour, 7day, 365day GIS systems will be running that never go off-line. Dynamic schema, or schema evolution of some kind, is then an absolute necessity if systems are to be maintainable and extensible.

Event sequences and short transactions

Orest is concerned that deferred constraints can cause transactions (short transactions) to fail and make it difficult for an application to respond to errors when they accrue. Deferring constraint operation to the end of short transactions is certainly necessary when updating 1:1 relationships. Unfortunately, Event sequencing architectures are a little more complex than one might wish, but fortunately there is a great deal of practical experience to draw on. This whole area is well-described in a short paper which describes a commercial GIS with a defined sequence of update and constraint events: validation constraints which confirm local data integrity and merge constraints which confirm global integrity when merging segments of a dataset. Yes, transactions sometimes fail if the data would otherwise be corrupted, but that's the point.

We should not attempt to standardise any one constraint/event sequence for transactions, but perhaps it would be interesting if we could standardise a way of formally describing event sequences ? It does not appear to be necessary at this time: the simple fact that a transaction mechanism exists and that constraints are only valid outside transactions is sufficient at this stage ?

Communicating signatures

It is useful to consider how we might construct an interface for querying and returning signatures (sets of signature parameters and their values). There are three broad approaches:

  1. Define a whole load of functions to do all of this.
  2. Define a Well Known Encoding, similar to the POSC mechanism adopted for specifying Spatial Reference Systems, and pass a signature as a string.
  3. Use the convention that a signature is just another set of attributes, but defined on a system-defined FeatureInterface (FeatureType if you prefer), and use whatever interfaces we already have to read and write attribute values. We need to define only a single new function: a call to return a reference to the system-defined FeatureInterface object.

How do we read and set relationships, e.g. setting the default feature target? We need a standard encoding for a handle to an OID. It needs to be fast since it will be used a lot, so my bet is that the handle has to be a 32bit unsigned integer, moving to 64bit when most people's hardware is up to it. This is an argument for having different (shorter) handles rather than using the OIDs directly.

Multiple connections

If two roads intersect several times, the list of OIDs that is the value of the 'intersection' attribute on one road will include duplicate OIDs. So the value of this parameter in the relationship signature will be TRUE:

Allow repeated OIDs in list? Boolean

If it is important to distinguish these connections persistently, e.g. by giving them a name, then we can either replace the connections with real features, or we can annotate the relationship with its own set of attributes using 'arc objects'.

Arc objects

Now we get to the interesting stuff. As Orest says, "relationships can have their own attributes". If the relationship is really a genuine real-world feature then there is no problem, and appropriate allocation of attributes to FeatureInterfaces and FeatureInterfaces to FeatureTypes will set up appropriate constraints. If the relationship isn't a real feature then we need to somehow collect a bundle of attributes and attach them to the relationship. This is surprisingly easy to do.

The solution is to create a different type of object: an 'arc object', which is actually a perfectly well defined ordinary FeatureType just like any other feature. This is 'stuck in the middle' of the relationship arc.

The problem is that we often want the system to behave as if the arc objects aren't there at all, e.g. for delete cascades, and sometimes we even want to produce two copies of the dataset, one with and one without arc objects, and have the same code work with both.

For example, in a utility planning office there is a complex pipe network and every junction between pipes is heavily annotated. One of the annotations is when the junction will be made (its effectivity date), who is the contractor responsible etc. In the field, maintenance people have hand-held devices that read 20 MByte floptical copies of the same dataset updated weekly. These are memory and computation limited, so all the planning data is deleted when the datasets are downloaded. To save unnecessary expense, some application programs must run on both datasets unchanged.

In the deployed database, there are 'connect-to' and 'connect-from' relationship attributes on pipe features that are inverses. In the planning database, there are also 'up' and 'down' relationships that are defined on arc objects of FeatureType 'pipe-junction'. The signatures of the relationships are changed so that 'up' is a valid inverse of 'connect-from' and 'down' is a valid inverse of 'connect-to'. So a relationship is like this:

pipe1 -connect-to-> junct0 -up-> pipe2

and

pipe1 <-down- junct0 <-connect-from- pipe2

when the arc objects are deleted, the references are fixed up in the obvious way. Many function calls continue to operate without change because the attributes on the real features have the same names.

All functions that traverse relationships have to be able to recognise arc objects and most will skip over them. Recognising them is hard, since they are just ordinary instances of a FeatureType. We could recognise that all arc objects of all types have two relationship attributes 'up' and 'down', or we could introduce a special, system-recognised 'arc' FeatureInterface and ensure that all arc objects include it in their FeatureType. We could have all arc objects of all types have the relationships 'up' and 'down' or we could define new names for each type. Since they are invisible anyway, it is an implementation decision. There also need to be a couple of special functions that don't skip over them so that you can read and write their attributes.

Implementation hint: Experience with this structure shows that it does not scale well with relationships with large, asymmetric branching factors. Expressing the problem in simple terms: the software spends too much time traversing intermediate arc objects before it gets to the eventual target object and discovers that it was not the one it was looking for. The solution is that the implementer should maintain a completely hidden parallel set of direct references between the source and target features, as well as those to and from the arc objects, and keep them synchronised. This will slow down update but will speed up read access for some types of common navigation.

This mechanism allows repeated relationships , e.g. between two roads that cross several times, to each have their own arc object instance with as many attributes as necessary. You can have as many different arc objects as you like, and the arc objects can take the same attributes as ordinary features if you include the FeatureInterfaces appropriately.

Defining arc objects is done in the signature of the relationship attribute:

Takes arc object? Boolean
Arc object FeatureInterface: string // must be valid FeatureInterface name which includes a special 'arc' FeatureInterface
Upstream reference attribute name: string // e.g. 'up'
Downstream reference attribute name: string // e.g. 'down'

Further definitions of delete cascade etc. are done on the signatures of the attributes 'up' and 'down' which are perfectly ordinary relationship attributes whose signatures are defined in the FeatureInterface specified.

There is one last 'gotcha'. Although you mostly want to get at the attributes on an arc object via one of the features connected to it, sometimes you need to get at the arc object directly. Usually arc objects will not have name attributes (why clutter up your name index ?) but it is a good idea not to put that supposition in writing (i.e. in C++ code).

System Relationships

Orest mentions that the relationship between a feature instance and its FeatureType (class) is also a relationship. This can also be represented in the same uniform manner using attribute relationships and their signatures, as can class-subclass relationships, user-group/role relationships, read access/ownership and even the relationship between an attribute and the FeatureInterface which defines its signature. This is left as an exercise for the reader.

Other Relationships

An arc object is fundamentally a 3-way relation, and theory says that any complex relation can be expressed as a set of 3-way relations. Thus this allows the representation of arbitrary n-ary relationships.

Acknowledgements

My thanks to Mark Ridler of Quillion Systems Ltd. for introducing me to this way of looking at the world.