Chapter 3


Representing information


Representation and integration

Representing materials information in databases and knowledge-bases has to take into account three things:

This chapter and the chapter on data interchange cover the same ground. Whilst most materials information systems today stand alone, it is their integration with other CAD/CAM software which is important for the future. Integration capability has to be built in from the beginning, hence issues of information transfer assume a critical importance.

Materials property information has unusual characteristics which make naive database construction treacherous for the unwary. This is not so much a problem with individual databases as with sets of independently developed databases. Depending on context, the same aspect of a material might be treated as a variable or as a constant. Structural inconsistencies of this kind make integration difficult but attention to the fundamentals of what materials are and how their properties arise can provide a common basis.

The approach taken in this chapter is to consider database construction and integration at the same time, on the premise that databases should be designed both to provide a specific service and to be a potential component of larger CAD/CAM systems. The chapter is in three sections. It begins with a discussion of some fundamentals of materials information, continues with a review of database structures appropriate for materials and finishes with a review of three methods which formalize representations of the meaning of data.

Materials information

The concepts and fields present in a particular database, such as strength, hardness, elongation, yield strength etc., have relationships with each other which are not usually made explicit, but they have to be described and made explicit if either (a) data interchange with other materials databases is planned, or (b) a standard interface to data visualization software services is hoped for.

The most popular and apparently natural representation of relationships between materials concepts is the hierarchical, tree-like structure, possibly with multiple cross-links. Thus information could be subdivided into property information or test information, a test could be an indentation test or a tensile test, and a property could be a hardness or a yield stress. One type of cross-link would be that hardness properties are measured by indentation tests; another would be that yield stress could be related to both elastic and plastic property classes.
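The tree with cross-links described above can be sketched as two small mappings. This is only an illustration: the names `PARENT`, `CROSS_LINKS`, `ancestors` and `linked`, and the link labels `measured_by` and `related_to`, are assumptions made for the example and do not come from any existing system.

```python
# Tree part: each concept has exactly one parent (concepts taken from the text).
PARENT = {
    "property information": "information",
    "test information": "information",
    "hardness": "property information",
    "yield stress": "property information",
    "indentation test": "test information",
    "tensile test": "test information",
}

# Cross-links: typed relationships that cut across the tree (labels are hypothetical).
CROSS_LINKS = [
    ("hardness", "measured_by", "indentation test"),
    ("yield stress", "related_to", "elastic properties"),
    ("yield stress", "related_to", "plastic properties"),
]


def ancestors(concept):
    """Return the chain of parents from a concept up to the root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def linked(concept, link_type):
    """Return the concepts reached from `concept` by one cross-link of the given type."""
    return [dst for src, kind, dst in CROSS_LINKS if src == concept and kind == link_type]
```

Keeping the cross-links in a separate list from the tree makes it explicit which relationships are inherited down the hierarchy and which are not.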

The data from the same set of tests exists in a number of different databases in a variety of 'states' depending on whether it is raw data, validated data or fully evaluated information. The information which describes which state pertains to a particular piece of data, and the audit trail of the process, is information which the database system will have to maintain (see Chapter 6 on data quality). Thus the concepts relevant to this process are classified even at this early stage.

Examination of the conceptual structures of many materials databases indicates that there is no one, simple correct data model structure, but that thinking about interactions and sets of concepts does give some guidance as to when to use which type of representation: orthogonal concepts imply tables, sequence and discontinuity in interaction imply trees, and both require conceptual ontologies. Ontology is the study of the notion of existence; here it means the construction of a taxonomy of concepts and their semantic derivations.

There is always a need for taxonomic support tools: dictionaries which define terms with respect to some basic vocabulary, thesauri which relate terms to other terms, and encyclopedias that inform on the meaning of a term with extra information and examples.

Orthogonal concepts

Hierarchies are used for nearly all conceptual structures of materials information, but, at some level of detail, they fail to capture some important relationships even when generalized to permit multiple inheritance. The most common example is that of orthogonal concepts: deformation stresses can be either compressive or tensile, and the onset of plasticity can be defined as either the yield point or the proof stress at 0.2% strain. These two classifications are independent and neither could really be considered a child of the other (unless an arbitrary decision is taken for the purpose of standardization), but there are still four resulting distinct concepts (compressive yield point, tensile yield point, compressive proof stress, tensile proof stress), with distinct attributes which require representation (see Figure 3.1).

Figure 3.1 Orthogonal Properties
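A minimal sketch of why orthogonal concepts suggest a table rather than a tree: the four concepts are simply the cross product of the two independent classifications. The variable names `LOADING`, `ONSET` and `concepts` are illustrative assumptions.

```python
from itertools import product

# Two independent classifications; neither is a child of the other.
LOADING = ["compressive", "tensile"]
ONSET = ["yield point", "proof stress"]

# Every combination is a distinct concept with its own attributes,
# so the natural index is the pair (loading, onset), like a table cell.
concepts = [f"{loading} {onset}" for onset, loading in product(ONSET, LOADING)]

# A table keyed by both classifications at once; the cells would hold the attributes.
table = {(loading, onset): {} for loading in LOADING for onset in ONSET}
```

Forcing either classification to be the parent of the other would give the same four leaves, but the choice of parent would be arbitrary, which is the point made in the text.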

It is expected that much of this type of information structuring is a consequence of the physical fundamentals underlying materials properties and some will be a consequence of how materials and properties are defined. It is expected that such structuring will be useful in the development of catalogues, data dictionaries, data interchange procedures and the next generation of more broadly-based materials knowledge systems.

Basic limits to hierarchies

In his book Sciences of the Artificial, Simon discussed the criteria by which one can decide whether hierarchical representations are appropriate or not [Sim81]. The criteria that show when a set of objects form good hierarchies are arrived at by considering the interactions between members of the same set, and interactions between sets. If these interactions (defined in the broadest possible manner) form some kind of sequence from general to particular (even if only locally to those immediate sets involved), and if the sequence has discontinuities from level to level, then a satisfactory hierarchical description is possible which, without further elaboration, represents the significant relationships in the system.

Figure 3.2 Bond strength hierarchy (levels shown: nuclei, atoms, molecules, polymers; axis: interaction strength)

An example is the sequence of sets of objects from nuclei, through atoms and molecules, to polymers (macro-molecules). The interaction here has a simple interpretation: it is the strength of the force between objects. It clearly forms a sequence, with nuclear forces being far stronger than the van der Waals forces and hydrogen bonds between polymers (see Figure 3.2). The distinct levels arise because of discontinuities in the bond strengths: the forces within nuclei vary from element to element, but they are always much stronger than the chemical bond forces between atoms. If the strengths were continuous, or worse, overlapping, then there would be no clear distinction between atoms and molecules.
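Simon's criterion can be sketched as a test on ranges of interaction strength: the levels are distinct only if the ranges form a sequence separated by gaps. The function name `has_distinct_levels`, the `min_ratio` threshold and the energy ranges below are assumptions for illustration; the ranges are rough order-of-magnitude figures in electron-volts, not data from this chapter.

```python
# Rough order-of-magnitude interaction energies in eV (illustrative only).
INTERACTION_EV = {
    "nuclear": (1e6, 1e7),
    "chemical bond": (1.0, 10.0),
    "van der Waals / hydrogen bond": (0.01, 0.4),
}


def has_distinct_levels(ranges, min_ratio=2.0):
    """True if the ranges, ordered strongest first, never overlap.

    Each stronger range must start at least `min_ratio` times above the
    top of the next weaker range, i.e. there is a discontinuity between levels.
    """
    ordered = sorted(ranges.values(), key=lambda r: r[0], reverse=True)
    return all(
        strong[0] >= min_ratio * weak[1]
        for strong, weak in zip(ordered, ordered[1:])
    )


# Overlapping ranges: no clear boundary, so a hierarchy would be a poor description.
OVERLAPPING = {"a": (1.0, 10.0), "b": (0.5, 2.0)}
```

With these assumed figures `has_distinct_levels(INTERACTION_EV)` holds, whereas the overlapping pair fails the test, mirroring the case where atoms and molecules could not be told apart.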

An analogous situation with materials information is that there is no clear distinction between the information relating one high-nickel stainless steel to another and that relating one iron-containing nickel alloy to another. Also, a hierarchy may be good but only locally true. For instance, if the distinction between crystalline and glassy ceramics is compared to the distinction between semi-crystalline and glassy thermoplastic polymers, it can be seen that different semantic interpretations in different parts of the tree could cause problems in automated systems (both types of glass have glass transition temperatures, but the usefulness of this attribute is different).

Two major divisions

It is instructive to consider which representations are suitable for materials and properties that are amenable to presentation on Ashby-type materials property charts (see Chapter 2). The answer is that a single simple flat file which lists all the values of the properties for each material is sufficient because all materials are represented with the same list of properties. The condensation of materials into classes of materials holds for all the properties up to some level of detail. Fine distinctions, such as between yield stress and 0.2% proof stress, are unnecessary for any conceptual design method so the properties can be identified easily without a large degree of qualification.

Capricious properties

The question then arises as to what makes the properties not amenable to the Ashby method so different, whether there is a clear distinction, and what forms of representations are suitable for those. The answers are not yet certain, but a particular view is presented here which uses the idea that the predictability of a property can be viewed as an information transformation.

Those properties which, when plotted on property charts, show local clusters of similar materials appear so because they depend largely on the strength of the atomic and molecular bonds, on the identity of the atoms involved, and to a lesser degree on the crystallographic structure of the material. Other properties do not display such close clustering of similar materials. The corrosion behaviour of an aluminium alloy can be wildly different from that of pure aluminium.

Figure 3.3 Clustering and non-clustering properties for two materials

Figure 3.3 shows that two classes of material (denoted by circles and crosses) might cluster on one property but not on another. This does not necessarily mean that the second property appears to have widely varying values with respect to a more fundamental parameter (although it usually does). In the figure this is illustrated by showing a correlation between property 2 and the varying parameter.

The non-clustering properties are 'capricious' because they are really processes and not properties. Thus corrosion behaviour is not a simple ranking based on the electrochemical series, but a complex result of perhaps a dozen different competing and cooperating processes. This is also true for wear and fatigue. Because it is the balance between competing processes that is changed by external factors (temperature for instance, or oxygen partial pressure), the resulting behaviour is sensitive to fundamentally small causes. Things become clearer if we ask precisely what the process-properties are capricious with respect to. The answer is: with respect to what we commonly think of as sets of related materials, that is, materials related by having similar compositions. Compositions imply atoms, and proportions of different atoms directly imply bonds and crystallography.

There are degrees to this capriciousness and they can be estimated by considering what software models one would have to write to predict the behaviour. Without knowing what the precise relationships and coefficients are, several decades of materials science theory often enables us to identify the number and identity of processes involved, and the number and identity of the more fundamental parameters that go into these models. Not surprisingly perhaps, the better theories use some parameters which are closely tied to the underlying physics, such as activation energies which in turn depend on the identity of the atoms and the stiffness of the bonds. The more numerous the parameters, the more variable the property. However most of the apparent unpredictability comes not from just rearranging these fundamental properties in different ways, but from another class of phenomenon entirely: the material's microstructure.

State and process

The microstructure of a material is everything that is larger than atomic size but small enough that a mechanical engineer can consider it a continuum. It includes dislocations, stacking faults, grain boundaries, phase boundaries and all of their three-dimensional arrangement. It is this geometric and topological variety which produces a combinatorial explosion of different behaviours. Any measured properties determined by the microstructure are describable by a lumped-parameter model, and since the internal (lumped) parameters are individually not observable, the result appears to have little coherent relationship to the observable inputs. Engineering materials information is very different from that for gases, liquids and chemicals in this respect.

There is a deep difference between properties which are characteristics of a material's state and those which are characteristic of some process that the material undergoes. Distinguishing between them, however, is a matter for the observer to decide, depending on how many parameters he wishes to use to totally define a material's state. The position and orientation of every grain in a specimen is an unworkably large list of parameters but it would permit precise prediction of some properties. The average grain size and standard deviation is a short list, but some behaviour (such as 'earing' of deep-drawn sheet) will be indescribable.

Most manufacturing processes rely on capricious, process-type properties. This implies (as is found in practice) that competitive manufacturing processes are always based on a state of semi-ignorance. There is always more that could be discovered about the process, but that effort is better spent in empirical tuning of an existing procedure in order to optimize performance. All process models are accurate only to some level of detail; to get better performance than the accuracy of a model allows, empirical tuning is always necessary.

Figure 3.4 A property determined by many parameters

Structure-sensitive properties

The distinction developed here is similar, but not identical, to the distinction between 'structure-sensitive' and 'structure-insensitive' properties proposed by Chalmers [Cha59]. Chalmers made the classification purely to bring some order into discussing properties for teaching purposes; he did not propose any further implications. His distinction is one between crystallographically determined properties, properties of the perfect crystal, and microstructurally determined properties which result from crystal defects. By contrast, the stable and capricious properties proposed here are better distinguished as representing 'state' and 'process', irrespective of microstructure. (So corrosion is a capricious property whereas Chalmers would classify it as a structure-insensitive property of the perfect crystal.) The interpretation of the distinction in terms of levels of predictability with respect to the type of models required to simulate behaviour has not previously been made.

It has been observed that materials selection is much easier when the relevant properties are sizing as opposed to discriminating (non-sizing) properties [Lew90a]. Sizing properties are those defining simple mechanics behaviour which are used to size components: strength, fracture toughness, stiffness etc.

In practice certain state properties, such as magnetic hysteresis, are determined by the complex microstructure which results from the processes of materials heat treatment, so while they are state properties in one sense, their position on property charts can be highly variable. They are capricious even though they appear to depend only on the material's state. This should be contrasted with corrosion and fatigue properties where the property itself is inherently a dynamic process.

The distinction is false, however, if one considers a material's state as being defined by measurable parameters. To predict magnetic properties these measurable inputs would have to include specifics of the heat treatment, so the process of microstructure development is after all indivisible from the magnetic property.

There is one type of process property which is less capricious than others and that occurs when all the internal processes have something in common. Thus the yield strength of plastically deforming materials depends on many different dislocation-dislocation interaction mechanisms, but these are all directly proportional to the stiffness of the material and the density of the dislocations. That is why strength and fracture toughness are well-behaved enough to appear on Ashby's charts, though a detailed examination will show rather wider variation for them than for the other properties.

Promising directions

The scope for development seems most promising for those least-capricious of the process properties: those where decades of research have already produced relatively stable predictive models of materials behaviour, particularly that subset of models where the models' parameters are themselves well-behaved because the models are well-founded in the underlying physics.

Less promising directions

Although predictive causal, thermodynamic models of metallic and ceramic phase-equilibria have recently made great progress, the next step of using that information to predict properties such as microstructures in castings still lies in the future. The same can be said of electrochemical models of corrosion processes. Both are difficult because of the fundamental type of materials information involved. Both are solvable at some expense for specific, narrow classes of material (such as for weld-metal in a few steels) but such solutions will not be generally applicable to materials not covered by the original studies.

Some types of currently capricious property may eventually yield to research and become fully predictable in the future but they will be quite specific properties, for example such properties as sulphur-free dry-air corrosion of solution-strengthened transition-metal alloys. Many properties however have causes buried deep in so many independent non-linear effects that the simplest 'simulation' of the material is the material itself. These intrinsically capricious properties are the metallurgical analogue of chaotic behaviour. They are, at some level of detail, intrinsically unpredictable. Other properties may be not intrinsically unpredictable but may just require unfeasible computing resources.

Ill-structured problems

There is an inexact parallel between stable materials properties and well-defined design problems, and between process properties and ill-defined problems [Sim81]. Systematic methods can be used for the first class, usually after a prolonged phase of analysis which for basic properties would be the task of accumulating the data for all candidate materials and arranging it in a form that enables merit indices to be calculated. The second class is best tackled by looking first for a single, feasible solution (a material that will function correctly, if not well) and then exploring potentially better solutions incrementally from that starting point. This is explicitly the approach taken by SPLINTER which takes the view that materials selection is inherently an ill-structured problem [Zuc89].

Ill-defined design problems are never understood properly without relating them to a potential solution, whereas process-property problems might well be understood conceptually, but the lack of either numeric data or a reliable predictive model nevertheless implies an exploratory type of search. An image might help: the comparison is between moves in chess where each piece can confidently move to any feasible position, and the moves made by an explorer attempting to traverse a swamp where he does not know whether a clump of grass will support his weight until it is stepped on. Cast as an optimization problem, this is the same distinction as between solving for the solution directly and having to use a step-wise search algorithm.

Open questions

It is likely that some descriptive structure based on hierarchies, with or without multiple-inheritance, is required to represent materials and properties (perhaps with an underlying tabular representation for the convenience of portability and communication), but how are the decisions to be made as to where to use which type of representation? Putting the question another way, what is it, at a fundamental level, that makes some concepts naturally hierarchical and some naturally tabular?

We have seen above Simon's primary symptom that makes trees unsuitable, but is there anything in the physics of materials behaviour which will make that symptom predictable? Perhaps there is a further question too: if a single tree is not enough, is it possible that a large number of trees, with appropriate cross-links, could suffice? Already the matter of orthogonal concepts indicates that the answer to the last question is probably no.

Another way to look at representation problems is to look at the situation from the point of view of materials information naturally forming categories of various kinds: steels, oxides, cheap materials, brittle materials, properties, strengths, failure criteria, surface treatments etc. The representation modelling problem then transforms into one of deciding how to represent the relationships between these categories.

The concepts embodied in these sets do seem more fundamental than trees or lattices but offer only limited aid in providing powerful ways of manipulating the information. (The SPLINTER system reasons at this level and suffers from this disability [Zuc89].) This has implications for the design of materials information components of CAD/CAM design environments.

Relationships and adequacy

A data model is the description of the capabilities of a number of databases or knowledge structures in terms of what kinds of things are permitted to be stored, what kind of relationships are enforced and what operations can be done on the data when retrieving or updating it [Dat90]. It is distinguished from a database schema which is the description for a particular database structure which usually has only one set of data, one database, associated with it. Materials databases usually involve complex lattice-like relationships to represent metadata such as units, test conditions, material processing history etc., as well as tables holding the numeric data itself.

Figure 3.5 Four different strength measures

When deciding what kind of data model should be ideally used to represent materials for any engineering purpose [Wes86, Rum90], there are three fundamental issues that have to be addressed:

  1. Is the representation adequate? Is it able to represent the data and relationships in a way which is usable without a distortion of the truth?
  2. Can the data model cope with the many different and independently invented designations for materials, properties, tests and processes?
  3. Is the data model able to make a useful distinction between structural matters (syntax) and the meanings (semantics) so that both can be handled appropriately?

An example: a tensile test on a metallic specimen can give at least four types of strength property measurement: yield point, 0.2% proof stress, ultimate tensile stress and fracture stress (see Figure 3.5). These must be stored and distinguished correctly.

A database used by engineers must support a data model such that (a) all this data can be stored, and (b) when data is retrieved, the user is informed when 'similar' information is available in addition to that which was requested. For this particular problem a number of techniques are possible:

  1. use a network or object-oriented database with some kind of default inheritance for both the terms and the data
  2. use a conventional relational database to store the data, coupled with a separate index to handle cross-references and conflicts, plus dictionaries and encyclopedias for user-oriented help
  3. as (2) above but using an on-line thesaurus derived from an investigation of the important concepts from an ontological point of view in addition to a keyword index.

Typically the data itself is stored in a variety of conventional relational databases and it is just the description of the content and cross-references which have to be handled intelligently.
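Requirement (b) above, informing the user when 'similar' information is available, can be sketched with a small thesaurus kept separately from the stored data. Everything here is hypothetical: the names `STORED`, `GROUPS` and `lookup`, the grouping of the four strength measures under a single heading, and the numeric values (in MPa) are invented for illustration.

```python
# Data held for one alloy, in MPa (values invented for illustration).
STORED = {"0.2% proof stress": 250.0, "ultimate tensile stress": 400.0}

# A tiny thesaurus: terms grouped under a broader concept.
GROUPS = {
    "tensile strength measure": [
        "yield point",
        "0.2% proof stress",
        "ultimate tensile stress",
        "fracture stress",
    ],
}


def lookup(term):
    """Return (value, similar): the requested value, or None if it is not
    stored, plus any stored values for other terms in the same thesaurus group."""
    value = STORED.get(term)
    similar = {}
    for members in GROUPS.values():
        if term in members:
            for other in members:
                if other != term and other in STORED:
                    similar[other] = STORED[other]
    return value, similar
```

A request for 'yield point' then returns no value of its own but reports that a proof stress and an ultimate tensile stress are held, leaving the engineer to judge whether either is an acceptable substitute.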

Associativity

Materials information systems must be able to represent two things:

  1. the relationship between names of properties (fieldnames) and their values, and
  2. the interrelationship or associativity between sets of these names and values, that is, the relationship that exists between the data and the information which describes or modifies it.

Examples of associations are those between the maximum and minimum values of a measurement, or the ranges of conditions over which some other measurement is valid, or the existence of a functional dependence of a measurement on some other 'property' (e.g. hardness depends strongly on the heat-treatment history whereas density does not).
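The associations listed above can be sketched as a record that carries its describing information with it. The class name `Measurement`, its field names and the example values are assumptions made for illustration, not a proposed standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Measurement:
    name: str
    value: float
    # Association 1: the minimum and maximum of the measurement.
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    # Association 2: ranges of conditions over which the value is valid.
    valid_over: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    # Association 3: other 'properties' on which the value functionally depends.
    depends_on: Tuple[str, ...] = ()

    def is_valid_at(self, **conditions):
        """True if every supplied condition lies inside its recorded range."""
        return all(
            low <= conditions[key] <= high
            for key, (low, high) in self.valid_over.items()
            if key in conditions
        )


# Invented example values.
hardness = Measurement(
    name="hardness",
    value=120.0,
    minimum=110.0,
    maximum=130.0,
    valid_over={"temperature_K": (273.0, 373.0)},
    depends_on=("heat-treatment history",),
)
density = Measurement(name="density", value=8.9)
```

Note that the structure only says which associations may be expressed; what 'hardness' or 'heat-treatment history' actually mean is a separate, semantic matter.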

The design of materials systems requires a large number of different categories, typically several hundred, with complex interrelationships [Amm88]. The expression of the degrees of associativity permitted in a data model is a direct consequence only of the allowable syntax and is not affected by the meanings of the terms. In the past the associativity and the meanings of names and terms have been confused, but separating them makes many issues clearer.

Names and values

It is not always obvious what should be a name of a field and what should be a value. For numeric data it is usually straightforward: a fieldname of 'Modulus' and a value of '42.1' is obvious, as is 'Material' with a value of 'Copper'.

Material     Modulus    T-melt
'Copper'     42.1       1356
'Iron'       64         1810
'Diamond'    530.6      4000

However, at least one materials database stores all numeric data as 'triples' (as suggested by McCarthy [McC87]) consisting of Material-name, Parameter, and Number, e.g. 'Copper' 'modulus' '42.1', and now the string 'modulus' is a value of the field 'Parameter', and 42.1 is the value of the field 'Number' (in GPa):

Material     Parameter    Number
'Copper'     'Modulus'    42.1
'Copper'     'T-melt'     1356
'Diamond'    'T-melt'     4000
'Diamond'    'Modulus'    530.6

This example comes from a datafile of parameters used by two programs which predict material properties by simulating creep mechanisms, a deformation mechanism map program and a hot isostatic pressing process modelling program. This type of schema can be seen to be more flexible in that new types of measurement, such as the normalized sensitivity of the modulus to temperature, could be added to the database without having to redesign it. This is a particular advantage for a research system but it has the disadvantage that most other databases are not structured like this, which can lead to communication problems.
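The two layouts hold the same information, and converting between them is mechanical once the correspondence is known. A minimal sketch follows, using the values from the tables above; the function names `to_triples` and `to_wide` are illustrative assumptions.

```python
# The conventional layout: one column per parameter.
WIDE = [
    {"Material": "Copper", "Modulus": 42.1, "T-melt": 1356},
    {"Material": "Iron", "Modulus": 64, "T-melt": 1810},
    {"Material": "Diamond", "Modulus": 530.6, "T-melt": 4000},
]


def to_triples(rows, key="Material"):
    """Turn each non-key column into a (material, parameter, number) triple."""
    return [
        (row[key], name, value)
        for row in rows
        for name, value in row.items()
        if name != key
    ]


def to_wide(triples, key="Material"):
    """Rebuild one row per material; parameter names become column names."""
    table = {}
    for material, parameter, number in triples:
        table.setdefault(material, {key: material})[parameter] = number
    return list(table.values())


triples = to_triples(WIDE)
```

Adding a new kind of measurement to the triple form is just one more row, whereas in the wide form it means a new column, which is the flexibility referred to in the text. The cost is that the parameter name has become data, so nothing in the structure itself stops two spellings of 'Modulus' from coexisting.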

The problem is usually more complex for data enumerations since different database constructors may use quite different schemes [Ken78]. One database may have three logical yes/no fields called Plate, Bar and Strip to describe whether an alloy is available in those forms.

Material       Plate    Bar    Strip
'Copper'       N        Y      Y
'Aluminium'    Y        N      Y

Another database may have only one field named Form which takes the values 'plate', 'bar' and 'strip'. This kind of inconsistency can only be handled by agreeing on standards.

Material       Form
'Copper'       'Bar'
'Copper'       'Strip'
'Aluminium'    'Plate'
'Aluminium'    'Strip'
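Once the correspondence between the two enumeration schemes has been agreed, translating between them is straightforward. The following sketch uses the example data above; the function names `flags_to_forms` and `forms_to_flags` are illustrative assumptions.

```python
FORMS = ("Plate", "Bar", "Strip")

# First scheme: one yes/no field per product form.
FLAGS = [
    {"Material": "Copper", "Plate": "N", "Bar": "Y", "Strip": "Y"},
    {"Material": "Aluminium", "Plate": "Y", "Bar": "N", "Strip": "Y"},
]


def flags_to_forms(rows):
    """Second scheme: one (material, form) row per available form."""
    return [
        (row["Material"], form)
        for row in rows
        for form in FORMS
        if row[form] == "Y"
    ]


def forms_to_flags(pairs):
    """Back to the first scheme; forms not listed default to 'N'."""
    table = {}
    for material, form in pairs:
        row = table.setdefault(
            material, {"Material": material, **{f: "N" for f in FORMS}}
        )
        row[form] = "Y"
    return list(table.values())


pairs = flags_to_forms(FLAGS)
```

The translation is not perfectly symmetric: a material available in none of the forms has a row in the first scheme but no rows at all in the second, and the translator has to know the full list of field names in advance. Both are reasons why an agreed standard is needed rather than ad hoc conversion.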

Standards are required to describe the structure of databases explicitly. This is known as data modelling.

Data modelling

Informational modelling, also termed data modelling, is the process of analysing the relationships and associations between the types of information that it is planned to represent in an information system [Ken78]. The result of data modelling is a database schema, the detailed structure of a specific database which requires only the data values to be entered to be complete.

Deciding which types of data and which concepts are necessary to represent is the task known as systems analysis. This phase can never be omitted, even when a highly specific materials data computerization is planned and representations seem obvious [Rum90]. Knowledge-based systems and object-oriented databases are especially not exempt; expert systems are usually created specifically in order to change people's behaviour in some way, which means that some study of how these people will react is vital. Where the situation is poorly defined, a 'soft system methodology' is available [Che90].

Whatever technique is used to implement a materials information system, it is often easier to analyse the associations between the concepts using some other knowledge-structuring tool. A number of methodologies, such as the Entity-Relationship method, exist and software aids are available [Avi88, Shu86]. The Express language and the Express-G diagramming notation were developed specifically for data modelling the entities and relationships in the STEP standard (see Glossary) [Sar89a]. More recently semantic data models have been used to describe conceptual relationships [Hul87, Ram89, Dat90].

Functional dependency

Functional dependency is the dependence of some properties on the values of other properties for the same material. Any such dependence is an unchanging aspect of the database, like the names of the fields and the tables, not an ephemeral aspect like the values in the tables at any one instant of time. Functional dependency is thus part of the database schema.

In many materials databases there is usually only one commonly assumed dependency: the dependence of a material's properties on its designation or identification. Because all dependencies are forced into this one mould, the definitions of what is needed in a materials identification become more and more complex, including such things as materials supplier, heat-treatment schedules, chemical composition, age, etc. Many of these 'designatory' properties should be represented as simple data, with the functional dependencies explicitly documented.
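A declared dependency belongs to the schema, so data can never prove it, but data can refute it. The sketch below checks whether a set of rows is consistent with a proposed dependency; the function name `dependency_holds`, the field names and the hardness values are invented for illustration.

```python
def dependency_holds(rows, determinant, dependent):
    """True if no two rows agree on the `determinant` fields
    while disagreeing on the `dependent` field."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Invented example rows.
ROWS = [
    {"designation": "Alloy A", "heat_treatment": "annealed", "hardness": 60},
    {"designation": "Alloy A", "heat_treatment": "aged", "hardness": 95},
    {"designation": "Alloy A", "heat_treatment": "annealed", "hardness": 60},
]
```

Here hardness is not determined by the designation alone, but it is consistent with a dependency on designation and heat treatment together. Recording the heat treatment as ordinary data, with the dependency documented, avoids folding it into an ever longer designation.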

A particular problem with materials data is that in many cases, although the data is known and measured to some degree of accuracy, the dependencies are unknown, e.g. does the creep-rate of nylon depend on humidity or not?

Some properties have functional dependencies on more than one material. These include friction, galvanic attack, weldability, wear etc. Obviously most properties of surfaces are of this type. Any proposed schema should always be tested by seeing whether it can represent these properties properly. If it cannot then it is a strong indication that it is insufficiently general in scope.
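As a quick test of a proposed schema, one can ask what the key of a friction measurement would be. A minimal sketch, assuming the property is symmetric in the two materials, is given below; the names `pair_key` and `FRICTION` and the numeric value are illustrative assumptions.

```python
def pair_key(material_a, material_b):
    """An unordered pair of materials, usable as a dictionary key."""
    return frozenset((material_a, material_b))


# The property belongs to the pair, not to either material alone
# (value invented for illustration).
FRICTION = {pair_key("Steel", "PTFE"): 0.04}


def friction(material_a, material_b):
    """Look up the coefficient regardless of the order of the materials."""
    return FRICTION.get(pair_key(material_a, material_b))
```

A schema in which every property hangs from a single material designation has nowhere to put this value. Properties that are not symmetric, such as galvanic attack where it matters which material is attacked, would need an ordered pair instead.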

Partial identity

Unambiguous materials identification is more difficult to achieve in practice than is usually believed, even for metals and alloys [Rey89], and for materials such as ceramics and composites it is even worse. Materials designation depends not only on the functional dependencies and level of abstraction (as discussed earlier), but also on the partial differences in usage between different classes of material.

Partially synonymous terms are always a problem in materials databases, for properties as well as materials. For example, 'elastic limit' and 'fracture' are identical for ceramics, but distinct for metals and polymers (because they are plastic and not brittle). Any materials information system or interchange system based only on 'global' names (defined to be appropriate and valid for the whole database) cannot handle this problem unless the syntax allows some way for the meaning or interpretation of the name to be modified under certain circumstances.

This is an area where a strictly tree-like logical structure for a database presents no problems: the meaning of a term can easily be redefined in a local context, and the redefinition then applies to all subtrees below that point. However, for lattice data models the same item of information can be reached by more than one route, leading to conflicts in meaning.
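The difference between the tree and the lattice can be sketched by resolving a term along every upward route from a node. The node names, the wording of the definitions and the function `meanings` are assumptions made for illustration.

```python
# Each node lists its parents; a tree has at most one, a lattice may have more.
PARENTS = {
    "materials": [],
    "ceramics": ["materials"],
    "metals": ["materials"],
    "alumina": ["ceramics"],
    "cermet": ["ceramics", "metals"],  # reachable by two routes
}

# Definitions attached locally; a definition lower down overrides one higher up.
LOCAL_MEANING = {
    ("materials", "elastic limit"): "stress at the onset of permanent deformation",
    ("ceramics", "elastic limit"): "identical to fracture stress",
}


def meanings(node, term):
    """Collect the nearest definition of `term` along each upward route from `node`."""
    if (node, term) in LOCAL_MEANING:
        return {LOCAL_MEANING[(node, term)]}
    found = set()
    for parent in PARENTS[node]:
        found |= meanings(parent, term)
    return found
```

For 'alumina' and 'metals' exactly one meaning is found, so the local redefinition is harmless. For 'cermet', which inherits from both ceramics and metals, two different meanings are returned, and the system needs some further rule to choose between them.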

Relational data model

A brief review of relational databases and their multi-tabular nature is presented to aid the description of database catalogues which document functional dependencies and other associations between data.

The name 'relational' has nothing to do with the relationships between different data entities; it is merely a technical name for a particular type of table describing a set of data. E. F. Codd first derived the relational model for data in 1970. Unlike previous ad hoc approaches, it is based on a rock-solid mathematical foundation which gives much firmer guidance as to how extensions to the basic model, such as NULL values, should be handled [Dat90].

A relational database is one whose interface to the user/programmer appears as a set of tables and operators (to manipulate the data and the tables themselves) and only as a set of tables and operators. An essential component, in addition to the tables and operators, is the invisible and automatic maintenance of constraints: the implicit cross-references between the tables. Note that the internal structure of a relational database can be anything at all; the only requirement is that it is invisible and unreachable by the programmer, user or any software.

There are degrees of 'relationality' depending on how well a particular implementation of a database system measures up to various ideals in the three areas of structure, operators and constraints. A clear classification of how well implementations fulfil the ideal, and what this implies for people using the database, is given by Date [Dat90].

The associativity expressible using a relational database derives from the tabular structure, on the principle that any data value can form the cross-link to associate with another table, and from the constraint that in any one table all rows must be distinct. This is one reason why NULL values caused by missing data, or by a database schema that makes some properties inappropriate for some materials, cause such problems [Dat90].

Normalization and normal forms

The concepts involved in normalizing a specific relational database schema (structure) derive from the need to arrange matters such that the functional dependencies between the data items do not cause anomalies as the database is updated and as old data is deleted. There are also multivariable dependencies (more commonly called multivalued dependencies) to be noted, where although the value of one field is not fixed by another, it is nevertheless determined to be one of some set of values. The more complex the dependencies in the real world being modelled in the database, the more convoluted the route to designing the appropriate structure.

The procedure of normalization is to take some set of tables and a list of dependencies, and to systematically split up the tables into smaller tables that are equivalent to the original set in their description of the real world. Techniques and theory are best described by Date [Dat90] but are covered in many texts [Enc90, Rum90, Bey91].
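A minimal sketch of one normalization step is given here, using Python's standard sqlite3 module. The table names, field names and values are invented for the illustration; the only dependency assumed is that density is fixed by the alloy and not by the individual test specimen.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Unnormalized: density depends only on the alloy, yet is repeated per test.
    CREATE TABLE flat (specimen TEXT PRIMARY KEY, alloy TEXT,
                       density REAL, yield_mpa REAL);
    INSERT INTO flat VALUES ('S1', 'A', 2700.0, 250.0),
                            ('S2', 'A', 2700.0, 262.0),
                            ('S3', 'B', 7850.0, 410.0);

    -- Split according to the assumed dependency alloy -> density.
    CREATE TABLE alloys (alloy TEXT PRIMARY KEY, density REAL);
    CREATE TABLE tests  (specimen TEXT PRIMARY KEY,
                         alloy TEXT REFERENCES alloys(alloy),
                         yield_mpa REAL);
    INSERT INTO alloys SELECT DISTINCT alloy, density FROM flat;
    INSERT INTO tests  SELECT specimen, alloy, yield_mpa FROM flat;
""")

# The smaller tables are equivalent to the original: a join reproduces it.
rejoined = con.execute("""
    SELECT t.specimen, t.alloy, a.density, t.yield_mpa
    FROM tests AS t JOIN alloys AS a ON a.alloy = t.alloy
    ORDER BY t.specimen
""").fetchall()
original = con.execute("SELECT * FROM flat ORDER BY specimen").fetchall()
print(rejoined == original)

# A corrected density is now changed in one row, so copies cannot get out of step.
changed = con.execute(
    "UPDATE alloys SET density = 2710.0 WHERE alloy = 'A'").rowcount
print(changed)
```

In the single flat table the same correction would have to be applied to every row for alloy 'A', which is the kind of update anomaly that normalization is intended to remove.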

A modern relational data-structuring exercise for materials data was described by Colton for the GELAC Advanced Materials Database of Lockheed-Georgia. The technique involves the use of lists and diagrams of 39 functional and multivariable dependencies between the data types to ensure that the structure of the 26 tables is fully normalized [Col88].

Data modelling of materials database schemas directly in terms of the pure relational model is, for the majority of people, a complicating rather than a simplifying procedure. The solid foundation of relational theory means that there can be little argument that databases should eventually be implemented in relational database management systems. However, some higher level of abstraction is clearly required in order to perform data modelling easily. Formal data modelling or diagramming techniques now available mean that the normalization and implementation can be done largely automatically. The Entity-Relationship technique is a great improvement over working directly with tables, but it is too general in some respects for materials database construction. Other diagramming methods, such as Express-G, build the functional dependencies directly into the representation. This is usually easier to understand but can still require unusual skill for some problems [Dat90, Avi88].

The normalization procedure is important to understand in principle, if not in detail, because two different relational databases might describe identical ranges of materials data using identical fieldnames and definitions, but might still be unable to communicate because they make different assumptions about the real-world dependencies. These different assumptions would probably not be stated explicitly in either database but would be implicit in the way the database was normalized. Whatever higher level data modelling technique might have originally been used, when it comes to data interchange we usually have to deal directly with the tables because these are likely to be the only common medium of communication.

Semantic databases attempt to implement the functional dependencies directly and explicitly, and are likely to be increasingly important for materials data in the future for this reason [Hul87, Dat90]. (Object-oriented databases, so far as their data-structuring capabilities are concerned, are a special case of semantic databases.)

Hierarchies and catalogues

Some of the same basic elements of materials information and the relationships between them are required in all computerized systems [Wes86].

A materials information specialist, when first given a problem of representing and relating (a) families of metal alloys, (b) a variety of mechanical properties, (c) test methods and (d) potential processing routes, is inclined to start by sketching some kind of tree-like (hierarchical) representation, perhaps with cross-links (multiple-inheritance lattices).

Much later it may become apparent that for transferring this information between different software systems, some lowest common denominator of several information systems is required. Hierarchical systems vary greatly (even object-oriented systems, which are only a subset, are largely incompatible with each other) and no fully generalized system seems possible. Tabular methods based on relational databases, however, bring the full generality of the relational algebra to the problem [Sar89c]. The specialist who used a hierarchical description then has much devious reinterpretation of definitions to do to achieve a transformation to tables (unless a data modelling technique designed for relational implementation was used).

Unfortunately the generality of a relational system, as the lowest common denominator of many materials information systems, then produces problems of its own: it is at too elementary a level to represent clearly some important relationships [Ken78].

The problems are these: hierarchical representations are easy at first, but become difficult and arbitrary at detailed levels (different people produce different trees), whereas pure relational database approaches fail to capture many hierarchical associations explicitly, leaving many functional dependencies only implicit in the tabular structure. The answer is to use a tabular system but then to supply extra catalogue tables which describe the functional dependencies explicitly. A standardization of such a system is proceeding (see page 55).

A data dictionary is a complete documentation of every field of every table comprising the database, of every program that uses the data and why it uses it. It usually is stored in an entirely distinct database with a unique structure. Conversely a data catalogue is the formal description of a particular database schema, stored as tables in the same database management system as the database it describes.

Since the database schema documented in the database's catalogue assumes such importance, it would be useful to communicate the catalogue in some common medium for expression. Pending the development of an object-oriented standard at the right level of abstraction to be communicated along with the data, the only clear candidate for such a medium is a set of tables which could be communicated by any tabular data interchange protocol (see Chapter 4).
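As a rough illustration of a catalogue table that records functional dependencies explicitly, the following sketch stores each declared dependency as a row and checks a set of data rows against it. The layout of the catalogue, the field names and the values are all assumptions made for the example, not part of any proposed standard.

```python
from typing import Dict, Hashable, List, Tuple

Row = Dict[str, Hashable]


def holds(rows: List[Row], determinant: Tuple[str, ...], dependent: str) -> bool:
    """True if equal determinant values always imply equal dependent values."""
    seen: Dict[Tuple[Hashable, ...], Hashable] = {}
    for row in rows:
        key = tuple(row[f] for f in determinant)
        # setdefault stores the first value met and returns it on later visits.
        if seen.setdefault(key, row[dependent]) != row[dependent]:
            return False
    return True


# Catalogue table: one row per declared functional dependency (hypothetical names).
catalogue = [
    {"table": "tests", "determinant": ("alloy",), "dependent": "density"},
    {"table": "tests", "determinant": ("specimen",), "dependent": "yield_mpa"},
]

# Data rows; the two densities for alloy 'A' deliberately disagree.
tests = [
    {"specimen": "S1", "alloy": "A", "density": 2700.0, "yield_mpa": 250.0},
    {"specimen": "S2", "alloy": "A", "density": 2710.0, "yield_mpa": 262.0},
]

violations = [(c["determinant"], c["dependent"]) for c in catalogue
              if not holds(tests, c["determinant"], c["dependent"])]
print(violations)
```

Since the catalogue is itself only a table, it could in principle be transferred by the same tabular interchange mechanism as the data it describes, and a receiving system could run a check of this kind before accepting the data.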

Units

In many database systems the units (MPa, kg m^-3 etc.) are not handled properly: they are either included with the fieldnames as part of the name or are omitted altogether. This is unsatisfactory. The units should really be easily accessible as machine-processable information, which is difficult because they are a property of the fieldname itself, not the material. They are an aspect of the database schema.

A good way to represent units, though not always possible with existing databases, is to enter the fieldnames themselves as data using special 'meta-fieldnames' such as #Fields# and #Units#.

#Fields#        #Units#
'Yield'         MPa
'Elongation'    %
'Diameter'      mm

Most commercial relational databases, including those that handle multiple tables, unfortunately cannot detect that some values of one table are actually fieldnames in another.
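The following sketch shows how an application might make that check for itself, using Python's standard sqlite3 module. The data table name 'tensile' is invented for the example; the meta-fieldnames and the units rows are those of the table above. SQLite's PRAGMA table_info is used to read the column names of the data table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical data table whose fieldnames are described in 'units'.
    CREATE TABLE tensile ("Yield" REAL, "Elongation" REAL, "Diameter" REAL);
    CREATE TABLE units ("#Fields#" TEXT PRIMARY KEY, "#Units#" TEXT);
    INSERT INTO units VALUES ('Yield', 'MPa'), ('Elongation', '%'),
                             ('Diameter', 'mm');
""")

# Column names of the data table, read from SQLite's own catalogue
# (the second item of each PRAGMA table_info row is the column name).
columns = {row[1] for row in con.execute("PRAGMA table_info(tensile)")}

# The units table read back as a mapping from fieldname to unit.
units = dict(con.execute('SELECT "#Fields#", "#Units#" FROM units'))

# Entries in the units table that do not name a real field of the data table.
orphans = set(units) - columns
print(units["Yield"], orphans)
```

The database engine itself enforces nothing here: the link between the values in #Fields# and the column names of the data table exists only in the application code.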

This method can be seen to require some agreed names for the meta-fieldnames #Fields# and #Units# which means that no user could have fields of these names for any other purpose. In future it would be useful to suggest meanings and uses for special fieldnames, perhaps of the syntactic form #____#. These special fieldnames are meta-metadata and if they are to be used effectively they should eventually become standardized so that databases can communicate on a common basis of understanding [Sar89a-c, Sar90a-c]. These meta-fieldnames should be examples of key concepts identified by concept modelling [Sku90].

Attributes

Units are only one kind of attribute that is applicable to an entire column (field) of data rather than to a single data item. Materials property data often requires such other attributes as 'maximum', 'minimum', 'typical', 'measured', 'inferred', 'required', 'notched', 'unnotched' etc. and other, more generally applicable, attributes can also be useful: 'domain', 'type' or 'dimensions' [Dat90, McC88].

As an example, consider a data evaluation office whose function is to take experimental data from several laboratories, compare it with published data and theoretical predictions from models, evaluate a new physically-based model describing that experimental behaviour, and then finally provide a table of recommended values, complete with error estimates, based on that model for the purposes of design. The office could both accept the experimental data and deliver the parameters in tabular form: many of the fieldnames would be the same, with the same units, but other attributes would be quite different.

Local, short names

A general and clear method for naming fields and field attributes is to use tables to record the attributes and to make the connection between data stored in the database and some dictionary of definitions stored elsewhere. The principle is the same as that used in all databases: 'one fact should be stored in one and only one place'.

#Local#     #Attribs#    #DictionaryName#
'SMod-1'    'max'        'Shear Modulus'
'SMod-1'    'required'   'Shear Modulus'
'SMod-2'    'min'        'Shear Modulus'
'SMod-3'    'typical'    'Shear Modulus'
'KDlm-1'    'notched'    'Delam. Fracture Toughness'
'CPFT-1'    'notched'    'Crossply Fracture Toughness'
'CPFT-1'    'design'     'Crossply Fracture Toughness'

The example displays a simplified version of some of the functionality of McCarthy's data thesaurus [McC87] which describes relationships between fields but also includes information about specific data items. The #DictionaryName# would be an entry in the data thesaurus. Note that 'SMod-1' has two attributes set: it is a required maximum upper bound. Since it refers to only one set of data this means that accuracy information has to be put into another table; otherwise it would appear twice, once for each occurrence of 'SMod-1', and these two values could get out of step if only one were changed when the data was updated. (This is an example of normalization.)

The design value is given an accuracy of 0%. What this signifies would have to be defined. The accuracy which individual sets of measurements can be assumed to have should be given with respect to the most specific name suitable. The local use of a short name means that the common thesaurus could use arbitrarily long 'names' to define terms, such as 'Young's modulus for a metal determined in tension by the company standard XYZ'.

#Local#     #Accuracy#
'SMod-1'    5%
'SMod-2'    5%
'SMod-3'    7%
'KDlm-1'    30%
'CPFT-1'    25%
'CPFT-1'    0%
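How the two tables might be combined by an application can be sketched as follows. Only the shear modulus and delamination rows are used, copied from the tables above; the variable names and the describe function are assumptions made for the illustration.

```python
from collections import defaultdict

# Rows copied from the attributes table: (local name, attribute, dictionary name).
attribs = [
    ("SMod-1", "max", "Shear Modulus"),
    ("SMod-1", "required", "Shear Modulus"),
    ("SMod-2", "min", "Shear Modulus"),
    ("SMod-3", "typical", "Shear Modulus"),
    ("KDlm-1", "notched", "Delam. Fracture Toughness"),
]

# Accuracy is held once per local name, in per cent, in a separate table.
accuracy = {"SMod-1": 5, "SMod-2": 5, "SMod-3": 7, "KDlm-1": 30}

# Group the attributes by local name; a name may carry several attributes.
by_name = defaultdict(set)
for local, attrib, _dictionary_name in attribs:
    by_name[local].add(attrib)


def describe(local: str) -> str:
    """Join the attribute and accuracy tables on the local short name."""
    attributes = ", ".join(sorted(by_name[local]))
    return f"{local}: {attributes}; accuracy {accuracy[local]}%"


print(describe("SMod-1"))
```

Because the accuracy of 'SMod-1' is stored in one place only, changing it cannot leave the 'max' and 'required' rows in disagreement.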

The International Organization for Standardization (ISO) is working on a general fieldname naming scheme in its Information Resource Dictionary System (IRDS) project, but it is designed to support software engineering, and so materials data will require a specialized extension of this when it becomes available [Gra88, Bey91].

Naming and meaning

The remainder of this chapter is concerned with techniques to ensure that users' interpretations of information in a database are indeed those which the designers intended. We begin with the simplest approach, a common reference vocabulary (CRV), and progress to conceptual data modelling where the meanings of terms, as well as their interrelationships, are related in a systematic manner. Finally two approaches which integrate both intended meaning and database structure are presented: the standard information resource dictionary system (IRDS) and the data thesaurus.

Materials engineering is an open-ended activity and so materials data differs fundamentally from conceptually constrained activities such as 3D geometry for CAD. In CAD systems it is possible to define absolute limits to what is to be represented by means of a closed set of predefined entities, e.g. as defined by IGES and STEP (see Glossary). This indicates that thesauri and dictionaries for materials will also need to be arbitrarily extendible and open-ended [McC87]. All these issues affect the design of individual databases, apart from their implications for data interchange [Sar89c]. The naming or designation problem is one of great difficulty and importance in conventional, detailed materials databases, as has been discussed earlier [Ben89, McC88, CEC86, CRV89, Sar89b].

Term definition

It became apparent during the development of ASTM's standards and during the European Commission's Materials Databank Demonstrator Programme that the precise definition of a name, label or term was currently unstandardized and subject to misinterpretation. The definition of terms, units and properties is a problem. These definitions need to be under audit control to ensure that, for example, several different types of measurement are not all recorded as 'fracture strength' by different data-entry clerks with different ideas of the precise definitions. This is possible, though not easy, for individual databases but currently impossible for independently developed databases, since existing definitions are not precise enough. Easy access to supporting thesauri is also seen as aiding the integrity of the database as well as providing an improved user interface.

ASTM subcommittee E.49.03 has taken a first step in defining a machine-readable format in which such definitions could be written, the 'standard structure for a term definition record used for terms in computerized test reporting and material designation formats'. Such machine-readable thesauri of terms are known to be essential to producing high-quality user interfaces to materials information systems [McC87, McC88, Ben89, Hix89, Lau90].

Managing the production and quality of a suitably comprehensive set of terms is an unresolved problem. During 1990 ASTM, owing to lack of manpower, scaled back its efforts to producing only vocabularies of terms which will appear within the wording of its own documents.

Common vocabularies

The obvious first step is to agree a common vocabulary together with a dictionary of definitions. Several such dictionaries exist, such as that of the UK Institute of Metals, though most were devised before the needs of formal computerization became apparent. The Common Reference Vocabulary was devised as a means of harmonizing the documentation of the databases participating in the five-year European materials databank programme [Krö87c, Swi90]. The first version covers about 2000 terms in the nine Community languages; it is widely accepted that this does not adequately cover the field and that it needs to be corrected, re-edited and at least doubled in size [CRV89].

The single 'flat' list of terms in the CRV could profitably be split and graded into a tree-like structure of related concepts. This would be more in line with the sort of structure required in materials databases as well as making the maintenance of the CRV itself more straightforward. Ideally the CRV could form the basis of an internationally standard data thesaurus system.

Conceptual data modelling

Conceptual data modelling is the phrase used to describe the necessary process of elucidating what was meant when historical databases were constructed, what precisely needs to be represented in new systems and how to represent the meanings in a way that is useful for later software development. Later software, especially knowledge manipulation systems, must be extremely careful to be consistent in their interpretation of the meaning of technical terms. These meanings must be documented explicitly and fully.

A method is required for systematizing the concepts behind the information stored in many disparate, independently produced databases. The conceptual structure produced is needed as a basis for the development of later software systems intended to help engineers find the information they need, irrespective of any mismatch between the engineer's query and the actual naming system used in each database.

The major problem, and the reason why the meanings of the concepts require explicit structuring, is that (a) the everyday vocabulary used is ambiguous, (b) the same data item will have many valid names, (c) similar information is obtained by different laboratories using different methods and different scientific conceptual structures. There are also many organizational problems concerning responsibility, authority and maintenance of lists of terms.

Procedure

A limited number of 'high-level' concepts must be identified as the major ways in which engineers or designers might want to locate data. Documenting concepts requires that the individual characteristics and properties are enumerated and that the relationships that hold within systems of concepts are recorded. This is a similar procedure to that of data modelling where the subject of the exercise was data. Here it is the meanings of the concepts which form the relationships, not their functional dependencies.

A method to capture all the detailed information and the inter-relationships between concepts is needed. Traditional tools such as paper, word processors, spreadsheets or relational databases quickly become inadequate for capturing the complex inter-relationships that evolve between the initial concepts. A tool which allows users to maintain and visualize a conceptual model is CODE [AIL90]. This can be used for capturing and structuring concept ontologies (relationships) [Sku90]. CODE provides a graphical environment for viewing hierarchical and lattice relationships between concepts during analysis; its sophisticated inheritance features permit the development of multiply-inherited concepts with the ability to identify and easily resolve inheritance conflicts. A tool such as this is invaluable in concept modelling.

Advanced representational methods in artificial intelligence such as frame-based systems and object-oriented systems seem appropriate for this task, but on closer evaluation no systematic methodology for developing these conceptual models, especially in a scientific and technical area, can be found. Most methods used in the development of these systems in the past have been one-shot or ad hoc in the sense that they do not make explicit the assumptions made on the characterization of the concepts that populate the system. This implicit knowledge, when analysed, usually displays conceptual and terminological confusion. Very recent work on the development of ontological structures for the purpose of knowledge acquisition for building knowledge bases, machine translation and thesaurus building has started to address this shortcoming [Sku90].

CODE: a tool for conceptual analysis

A concept hierarchy in the CODE system is an organization of concept descriptions with the specification of inheritance both between classes of concepts and instances of concepts. Multiple inheritance of properties can be encoded for either, and a system of flags gives unique control of the management of inheritance. CODE is an implementation of a specific semantic data model devised to represent the relationships in all semantic models.

In order to build, maintain and modify concept hierarchies in a systematic manner, the properties which characterize the concepts (i.e. which help to discriminate between the concepts) are categorized into system properties and user properties. There are three categories of user properties: attributes, related entities and constraints. An example of an attribute for any concept would be its name. The category related entities is for describing specific relationships between concepts, for example, the relationship between a test and the material property measured by that test. Such a relationship would also have a name (as in the Entity-Relationship data modelling technique of which this is a highly extended superset). Constraints allow for the specification of bounds or conditions. CODE has been used for the initial stages of a project to rationalize the corporate technical data holdings of the Alcoa Technical Center [Dow91].
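The idea of multiply-inherited concepts and the detection of inheritance conflicts can be illustrated with a small sketch. It is not a description of how CODE itself works: the Concept class, the attribute name 'regime' and its values are assumptions invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Concept:
    """A concept with named attributes and any number of parent concepts."""
    name: str
    parents: List["Concept"] = field(default_factory=list)
    attributes: Dict[str, str] = field(default_factory=dict)

    def inherited(self) -> Dict[str, Set[str]]:
        """Collect every value each attribute can take along all parent routes."""
        found: Dict[str, Set[str]] = {}
        for parent in self.parents:
            for key, values in parent.inherited().items():
                found.setdefault(key, set()).update(values)
        for key, value in self.attributes.items():
            found[key] = {value}  # a local definition overrides inheritance
        return found

    def conflicts(self) -> List[str]:
        """Attributes for which different routes supply different values."""
        return sorted(k for k, v in self.inherited().items() if len(v) > 1)


elastic = Concept("elastic property", attributes={"regime": "elastic"})
plastic = Concept("plastic property", attributes={"regime": "plastic"})
yield_stress = Concept("yield stress", parents=[elastic, plastic])

before = yield_stress.conflicts()
# Resolve the conflict with an explicit local definition.
yield_stress.attributes["regime"] = "elastic-plastic transition"
after = yield_stress.conflicts()
print(before, after)
```

Here the conflict is resolved by stating the attribute locally. Named related entities (such as the link between a test and the property it measures) and constraints would need further fields that this sketch leaves out.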

Information resource dictionaries

In order to be able to exploit information interchange to the full, it is necessary to have a formal description of what data exists, and what kind of data can be permitted to exist in each database. This formal description must be accessible to all software accessing the database. This is a formalization of what has been referred to earlier as the catalogue of the database: a 'self-describing' database [Mar86, McC88].

The International Organization for Standardization (ISO) has been working, in its IRDS project, on defining the level of abstraction above data dictionaries [Gra88, Bey91], and at this level all the different existing materials databases can find agreement [McC87].

1         domains, associations
          |
2     data-dictionary schema <= fundamental data
      |    
3 database schema <= data-dictionary data    
  |        
4 database data        

Figure 3.6 Information resource dictionary (IRD) levels

Figure 3.6 shows the four levels of ISO's Information Resource Dictionary System (IRDS), though this scheme did not originate with ISO [Mar86, Bey91]. At the bottom level is the data, e.g. 'Copper', '1356', 'kelvin'. At the level above this is the data dictionary which names the fields of the database; for the values just listed these would be 'material-name', 'melting-point', and 'unit' respectively. This level also contains the database schema which defines the fields: 'material-name' and 'unit' are text, 'melting-point' is a positive-numeric, and all represent data fields.

At the next higher level the valid value domains for the database, e.g. text, numerics, or integers, are defined together with the valid constraints on the data and access-permissions for users. These are all held in the data dictionary schema.

The database schema is, from the point of view of the data dictionary, just data. Similarly the structure of the data dictionary is just data described formally in the data dictionary schema. (The database catalogue described earlier would be just a subset of this data dictionary.) Also defined at this level is the list of permitted concepts: the concepts of 'fields', 'domains', and 'constraints' are defined in terms of 'fundamental concepts', the data model (top level). In brief, each level contains data in a format which is given meaning by the level above [Gra88].
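The principle that each level is simply data given meaning by the level above can be sketched in a few lines. The representation below (dictionaries and checking functions) is an assumption made for illustration and is not the IRDS notation; the field names and values are those used in the text.

```python
# Level 2: data-dictionary schema, defining the permitted domains (illustrative).
domains = {
    "text": lambda v: isinstance(v, str),
    "positive-numeric": lambda v: isinstance(v, (int, float)) and v > 0,
}

# Level 3: database schema, which is just data as far as level 2 is concerned.
schema = {
    "material-name": "text",
    "melting-point": "positive-numeric",
    "unit": "text",
}

# Level 4: database data.
record = {"material-name": "Copper", "melting-point": 1356, "unit": "kelvin"}


def schema_is_valid(schema, domains):
    """The schema is checked against the level above: every domain must exist."""
    return all(domain in domains for domain in schema.values())


def record_is_valid(record, schema, domains):
    """The data is checked against the schema: same fields, values in domain."""
    return (record.keys() == schema.keys()
            and all(domains[schema[f]](v) for f, v in record.items()))


print(schema_is_valid(schema, domains), record_is_valid(record, schema, domains))
```

The top level, the fundamental concepts, has no explicit representation in this sketch: the very idea that a 'domain' is a test applied to a value is built into the code.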

Communication using a shared data dictionary puts the complexity burden on the receiver rather than on the sender. The sender has to ensure that the information transferred is a complete description; effectively it 'publishes' its own capabilities and descriptive methods together with the data [Mar86, Bey91]. The receiver has to read the data and description and has to ensure that appropriate translations are made for non-identical but similar concepts.

The self-describing database approach to data interchange requires that all the collaborating databases do share the same global view of how materials data should be organized, i.e. the same data model even if they are represented differently at the data dictionary and database schema levels.

McCarthy's data thesaurus

The data thesaurus is a conceptual tool that unifies and extends the concept of a data dictionary, an on-line thesaurus, a multi-lingual glossary and multiple indexes. From a data administration standpoint it supports the integration of multiple independent data sources, data quality control and query translation. The thesaurus can be implemented using current data management technology. A simple version of a materials data thesaurus is operational with the MPD Network and contains names, synonyms, abbreviations and classification hierarchies.

A data thesaurus is a classification of terms including broader terms, narrower terms and related terms. It aims to offer the same kind of facilities as does IRDS but it also contains further information about the data values in a database [McC87, McC88]. Using the previous example, it would contain a description of 'Copper' as a metal and describe 'melting point' as being a special case of 'temperature'. The data thesaurus forms an excellent basis for user-interface systems which are user-friendly and informative [Ben89, McC88]. It should also support the 'domains' and integrity constraints described in the relational model of databases but seldom supported by database implementations [Dat90].
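A very small part of this functionality, namely synonyms and broader terms, is sketched below. The Term and Thesaurus classes are hypothetical and are not a description of McCarthy's implementation; the synonyms shown are assumptions, while 'Copper', 'metal', 'melting point' and 'temperature' follow the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Term:
    """A preferred term with an optional broader term and a list of synonyms."""
    name: str
    broader: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


class Thesaurus:
    def __init__(self, terms: List[Term]) -> None:
        self.terms: Dict[str, Term] = {t.name: t for t in terms}
        # Index every label (preferred name or synonym) to its preferred name.
        self.index: Dict[str, str] = {}
        for t in terms:
            for label in [t.name, *t.synonyms]:
                self.index[label.lower()] = t.name

    def preferred(self, label: str) -> str:
        """Translate any known label into the preferred term."""
        return self.index[label.lower()]

    def broader_chain(self, label: str) -> List[str]:
        """List successively broader terms, nearest first."""
        chain: List[str] = []
        name = self.terms[self.preferred(label)].broader
        while name is not None:
            chain.append(name)
            name = self.terms[name].broader
        return chain


thesaurus = Thesaurus([
    Term("temperature"),
    Term("melting point", broader="temperature",
         synonyms=["melting temperature", "m.p."]),  # synonyms are assumptions
    Term("metal"),
    Term("Copper", broader="metal", synonyms=["Cu"]),
])

print(thesaurus.preferred("Cu"), thesaurus.broader_chain("m.p."))
```

Query translation between databases that use different names then amounts to looking up the preferred term for each label in the query. Related terms, multi-lingual entries and integrity constraints are omitted from this sketch.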

If appropriate manpower and programmers could be found then a set of internationally standard materials data thesauri would solve nearly all of the major outstanding problems with materials representation and data interchange, but it would be sensible to build it on the basis of ISO standard IRDS and existing vocabularies. To construct such thesauri would involve a substantial concept modelling exercise which would require computer support using a tool such as CODE.


Footnotes

  1. The Express language is suitable, but currently requires extraordinary expertise for effective use.
  2. IRDS uses a much more specific definition of the term data dictionary than is more generally understood.
