A Tabular Materials Data Interchange Format
Philip M. Sargent
Cambridge University
Engineering Dept.,
CAMBRIDGE CB2 1PZ, U.K.
CUED/C-MATS/TR.162 August 1989
Also published (without appendixes) in CODATA Bulletin 24 (1) (1992) 47-53 (Hemisphere Press) and presented at 12th Int. CODATA Conf., Columbus 15 July 1990, and also in J. Chemical Information and Computer Sciences 31 (1991) 297-300 American Chemical Soc., publishers.
A simple data interchange format is presented which is designed to offer the same facilities as dBase-III "flat-file" database files. The format uses plain text and human-oriented syntax so that it can be produced by hand using a word-processor. The typical use envisaged is for taking tables of test data results produced by an automatic data acquisition system and hand-editing them so that they can be transferred to another organisation for storage or evaluation. The format is entirely at OSI level 7, "application" level, and relies on reliable end-to-end communication being set up by lower levels.
Formats and protocols used to exchange data between materials property databases are a critical enabling technology required for integrating materials information with Computer-Aided Engineering. However it is necessary to ensure that the format is at least as capable of expressing "associativities" between data as are any of the participating databases. It is shown that a simple extension to the "flat-file" or "tabular" data format described above can cope with arbitrary complexity in the associativity between the data items in a dataset, e.g. between the numeric data and the "metadata" describing the experimental conditions. The exchange format here is purely syntactical. No fieldnames for materials or properties are defined as it is assumed that users will agree to refer to some independently collated thesaurus of technical terms. The only exceptions to this are a few "meta-fieldnames" describing catalogue structures. Finally it is demonstrated how to make the tabular format compatible with other data exchange standardisation activities.
Even if this format suffers from sufficient deficiencies that it is never used, nevertheless the layout of this report: the definition of the format in terms of a formal syntax, the specifications to which implementations should adhere, the listing of required errors to be detected and standard error numbers and, finally and most important, the existence and use of a conformance suite of test data and required output, could all form a model for the specification of other, better interchange formats.
associativity, data, databank, database, dBase, format, functional dependence, grammar, interchange, materials, MDI, metadata, multi-variable dependence, non-bibliographic, numeric, properties, protocol, relational, syntax, table, tabular.
A Tabular Materials Data Interchange Format
Summary
Keywords
Contents
Copyright Notice
A Tabular Materials Data Interchange Format
1. Materials Property Data Interchange
2. A Tabular Materials Data Interchange Format
3. SGML and CALS-Conformant Formats
4. Multi-Tabular (Relational) Format
5. Extended Format
6. Multi-Tabular Extended Format
7. More Sophisticated Use
8. CODATA Guidelines
9. Conclusions
Appendix I - dBase Database File Format
Header Format
Field Descriptor Format
Data Format
Data Types
Restrictions
Example Hexadecimal of a .dbf File
Appendix II - Formal Syntax for Cambridge Tabular Formats
BNF Description of CTDIF-1
BNF Description of CTDIF-2
BNF Definitions used for both CTDIF-1 and CTDIF-2
Numeric Values
Restrictions for CTDIF-1
CTDIF Keywords
Appendix III - Translator Implementations Specifications
dBase -> CTDIF Defined Error Conditions
CTDIF -> dBase Defined Error Conditions
Conformance Testing
Appendix IV - Existing Translator Implementations
Glossary
References
Making a single copy of this paper for individual private use is permitted without payment or royalty.
This paper, or an extract thereof, may be copied and distributed in multiple copies for any non-profit-making purpose provided that reproduction is done without alteration and provided that the title, author's name, report number (CUED/C-MATS/TR.162) and name and address of the Cambridge University Engineering Dept. are present on each copy or extract. Any extract distributed in this way must also state plainly that it is incomplete and must also state where and from whom a full copy may be obtained.
The title and summary may be distributed royalty free without further permission by computer-based or other information-service systems.
Permission to republish this paper, or any substantial portion thereof, must be sought from the author. Short extracts, for the purpose of review, may be published without permission on condition that a copy of the article containing the extracts is sent to the author.
Requests for additional copies should, in the first instance, be addressed to The Librarian, Cambridge University Engineering Dept. and a nominal fee will be charged. The author also keeps some copies for distribution.
Dr. Philip M. Sargent
Formerly of:
e-mail Philip.Sargent@computer.org (updated 21 August 1997)
A Tabular Materials Data Interchange Format
1. Materials Property Data Interchange
Communications of materials property data must be able to express two things: the relationship between names of properties and their values (e.g. "Hardness" and/equals "305.6"), and the interrelationship or associativity between several of these names and values (e.g. "Hardness", "Copper","Temperature","oC","305.6"; which of these are names and which are values requires extra information). This associativity is necessary in order to express the relationship between the simple data and that data which describes or modifies it, and the real world is usually more complex than the database systems designers would wish [Ken78].
In devising a data interchange format to be used between databases it is necessary to ensure that the format is at least as capable of expressing associativities as are any of the participating databases, which is why a tabular format is proposed. The arguments for and against tabular and item-based formats are presented elsewhere [Sar89b].
2. A Tabular Materials Data Interchange Format
A well-known commercial database format (dBase) is already being used by many materials properties database managers to upload test-data into their databases. The main disadvantage of this format is that it is defined in binary and the contents of these files are not easily viewed or edited, especially not if the files are transferred to minicomputers or mainframes. A plain-text version of this format is presented here which maintains two-way "inter-translatability" in that no information is lost in conversion either way.
The aims of this tabular format are to be:
The argument for supporting a simplest possible useful format is that, whatever happens, people will use a simplest-possible format because using formal international standards has high overheads in programmer-training and software cost. If such formats are going to be used anyway, it makes sense to design one in such a way that it can be (a) used as part of existing standards schemes (e.g. SGML) and (b) extended smoothly to the higher functionality required by other users.
Simplicity and User-Editing
The requirement to be simple for users to edit by hand has several implications. First, there must be no arbitrary "counts" such as numbers of records or lengths of strings to be typed in: it is awkward to have to re-edit a number at the beginning of the file whenever a spelling mistake is corrected near the end. The user should not have to make unnecessary decisions about the data which are not relevant to his own work, such as how long to allow for the length of text fields or the number of places of decimals required for numerics.
There should be no artificial and arbitrary limits on what can be typed so long as it obviously makes sense. This is an ideal, but the proposed format goes further towards it than most. Software is quite capable of counting the number of fields, of distinguishing between strings and numbers, of measuring the length of the longest string and of recognising a wide variety of number formats; so it makes no sense to impose these tasks on the user.
dBase Files
The "dBase" file format is a de facto industry standard for the communication of database files between MS-DOS personal computers (PCs). Nearly all other database software products for PCs support translation to and from this format (e.g. Borland's Reflex v2 and Paradox v3, Lotus 1-2-3 v3, Quadbase's dQuery) and Ashton-Tate have published the specification (see Appendix I). The same file format is thus supported on more types of computers than will actually run the dBase program itself.
The format describes a single "flat-file", best thought of as a two-dimensional table of data where the columns are named (by "field-names") and the rows ("tuples") are not. There are a number of restrictions on the range of numbers permitted, the lengths of strings, the number of fields (columns) and the total size of any one file (see Appendix I) but they are mostly not too onerous if they are handled by software, which they must be since the format is in binary and cannot be hand-edited.
Error-Checking, Redundancy and Size
Any materials data interchange format is classified as being at the top "Application" level of the 7-layer OSI hierarchy of communications systems. [The 7-layer OSI system can be remembered using the mnemonic "Another Pretty Scheme To Normalise Divergent Protocols", which stands for the "Application, Presentation, Session, Transport, Network, Data-link and Physical" layers.] The top level need not do any data checking since reliable "end-to-end" communication is provided by the lower levels. This ties in neatly with the user-editing requirement that the format should not contain "counts". These "counts" are obvious and can be read from the datafile itself if the data has been reliably transmitted. Therefore the proposed format contains no redundancy and as a result cannot do any error-checking. The usability requirement also means that the file is longer than it would be if the information were encoded and packed into the minimum number of bytes.
If datafiles in the format are to be exchanged on a potentially noisy system, such as by sending floppy discs through the letter post, it makes sense to also use commercial or public domain software to compress the file and to calculate a cyclic redundancy check (CRC) which can be checked by the recipient on decompressing to ensure that uncorrupted files are received.
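As an illustration of the compress-and-check step, the following Python sketch uses the standard `zlib` module. The report does not prescribe any particular tool; the function names `pack` and `unpack` are hypothetical, and CRC-32 is simply the check that `zlib` happens to offer.

```python
import zlib

def pack(text: str) -> tuple[bytes, int]:
    """Compress the text of a CTDIF file; return (compressed bytes, CRC-32 of the original)."""
    raw = text.encode("ascii")          # CTDIF uses only ASCII characters
    return zlib.compress(raw), zlib.crc32(raw)

def unpack(blob: bytes, expected_crc: int) -> str:
    """Decompress and verify; raise ValueError if the CRC does not match."""
    raw = zlib.decompress(blob)
    if zlib.crc32(raw) != expected_crc:
        raise ValueError("CRC mismatch: file corrupted in transit")
    return raw.decode("ascii")
```

The sender would transmit the compressed bytes together with the CRC value; the recipient calls `unpack` and only passes the text to a CTDIF translator if no error is raised.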
Cambridge Tabular Data Interchange Format
The plain-text translation of dBase datafiles is called the Cambridge Tabular Data Interchange Format (CTDIF). The actual format is best understood by first studying an example. In the following "file" the keywords are IMPLEMENTATION, NAME, FIELDLIST, ENDFIELDS, CTDIF-1, FIDTC-1. Strings are enclosed in double-quotes if they contain spaces or other separator characters, but need not be quoted otherwise. Only two datatypes are used: strings and numerics. Numerics are recognised whether they appear as integers or decimals, with or without an exponent. Separators can be spaces, linefeeds, tabs or commas in any mixture. Multiple separators are not significant.
CTDIF-1 0.1
implementation "PMS dBase Converter v0.1 21-July-1989"
name NIMONICB updated 89/7/21
fieldlist "sample_no" weight length strength_MPa
elongation_to_fracture endfields
#1-fred 3 5.0e-4 200.3 0.23
#2BA 3.2 1e-3 205.2 0.235
"#3Z ++" 3.333 1e-3 205.3 0.236
FIDTC-1
It can be seen that the data consists of a number of "tuples" (rows or records) where each tuple contains one value for each fieldname. The word "tuple" is used because the separators are not necessarily the spaces and linefeeds shown above which put one tuple on each row. Appendix II defines the formal syntax.
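The reading rules described so far can be sketched in a few lines of Python. This is not the formal syntax of Appendix II: the regular expression, the names `tokenize` and `parse_ctdif1`, and the case-insensitive treatment of keywords are assumptions made for illustration only.

```python
import re

# A token is either a double-quoted string or a run of non-separator characters.
# Separators (spaces, tabs, newlines, commas) may be mixed and repeated freely.
TOKEN = re.compile(r'"([^"]*)"|([^\s,"]+)')

SAMPLE = '''CTDIF-1 0.1
implementation "PMS dBase Converter v0.1 21-July-1989"
name NIMONICB updated 89/7/21
fieldlist "sample_no" weight length strength_MPa
elongation_to_fracture endfields
#1-fred 3 5.0e-4 200.3 0.23
#2BA 3.2 1e-3 205.2 0.235
"#3Z ++" 3.333 1e-3 205.3 0.236
FIDTC-1'''

def tokenize(text):
    """Yield (value, was_quoted) pairs from the text of a CTDIF file."""
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:
            yield m.group(1), True
        else:
            yield m.group(2), False

def parse_ctdif1(text):
    """Return (fieldnames, tuples); each tuple holds one value per fieldname."""
    tokens = list(tokenize(text))
    # Only unquoted tokens can be keywords (assumed case-insensitive here).
    words = [None if quoted else value.lower() for value, quoted in tokens]
    start = words.index("fieldlist") + 1
    end = words.index("endfields")
    stop = words.index("fidtc-1")
    fields = [value for value, _ in tokens[start:end]]
    values = [value for value, _ in tokens[end + 1:stop]]
    n = len(fields)
    if len(values) % n:
        raise ValueError("last tuple is incomplete")
    return fields, [values[i:i + n] for i in range(0, len(values), n)]
```

Calling `parse_ctdif1(SAMPLE)` yields five fieldnames and three tuples, whatever mixture of separators was used to lay out the file.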
The IMPLEMENTATION string in the example describes the software that was used to prepare the datafile (it might be a person's name and a word-processor). This string is lost when the data is converted to dBase, but it will be rewritten by software that recreates a CTDIF file from that dBase file and it will have a new value describing the software implementation that recreated the CTDIF file. The NAME field refers to the original name of the dBase file (so that if the CTDIF format is retranslated it can have the original name) and the UPDATED field refers to the last date the data was edited or typed. The names of the fieldnames are listed between the keywords FIELDLIST and ENDFIELDS before the data itself appears.
The basic tabular structure of the data is similar to the "spreadsheet" structure used by Stanton et al. with a little of their fieldname information as in their IGES-like format [Sta88].
Automatic Recognition of Types
The numeric and string types can be automatically recognised by the values that appear, so #1-fred is clearly a string and 5.0e-4 is clearly a number. This detection must look at all the values of a field before deciding which type it is. If the user requires strings which consist only of digits then they would have to be put in quotes, e.g. "007". There is a slight danger that a single typing error such as O ("oh") for 0 ("zero") may cause an entire set of numbers to be classed as strings, but the translation software should be aware of this possibility and produce appropriate warning messages (see Appendix III). Thus the format is "strongly typed" but it does not require any type declarations (like the programming language Icon but unlike Pascal [Gri86]).
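A minimal Python sketch of this whole-column type detection follows. The numeric pattern is an assumption (the authoritative definition is in Appendix II), as are the function name `infer_column_type` and the rule of warning when more than half of a field's values look numeric.

```python
import re

# Assumed numeric pattern: optional sign, integer or decimal, optional exponent.
NUMERIC = re.compile(r'[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?')

def infer_column_type(values):
    """Classify one field from all of its values.

    values is a list of (text, was_quoted) pairs. The field is numeric only
    if every value is unquoted and looks like a number; otherwise it is a
    string field. A warning is returned when a few odd values force a mostly
    numeric field to become a string field (e.g. letter O typed for zero).
    """
    looks_numeric = [not quoted and NUMERIC.fullmatch(text) is not None
                     for text, quoted in values]
    if all(looks_numeric):
        return "numeric", []
    suspects = [text for (text, quoted), ok in zip(values, looks_numeric)
                if not ok and not quoted]
    warnings = []
    if suspects and sum(looks_numeric) > len(values) / 2:
        warnings.append("field treated as string because of: " + ", ".join(suspects))
    return "string", warnings
```

Quoted values such as "007" never count as numbers, so a field of quoted digit strings is classed as a string field without any warning.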
Restrictions
We must apply a number of restrictions if we are to maintain free interconversion with dBase files and these are described in appendix I. The most notable is possibly that which limits fieldnames to being only 10 characters long. Some of these restrictions are awkward and unpleasant, but rather than extend them immediately it makes sense to define a base format (this one) and a separate, upwardly compatible extended format which removes the restrictions. This is because there is a great deal to be gained by having a dBase-convertible format.
dBase does not permit NULL values for numeric (or date) fields, although strings can contain the empty string, and Logicals (see Appendix I) can be "unset" (neither True nor False). This lack of NULLs is a severe disadvantage for materials property data where absent values are very common. Within dBase the only way to represent NULL is to associate every numeric field with another field which contains a value determining whether the numeric is NULL or not (which is how databases that do support NULLs actually do it). An alternative "workaround" is to use special numbers which will "never" appear in real data, such as 0.0, -999 etc., but this is dangerous (many materials in some databases acquire melting points of 32°F because unavailable values have been set to zero degrees Celsius).
The versions of CTDIF described here cannot handle dBase's "memo" fields for free-format, variable-length text of up to 500 bytes (dBase III+) or 64K bytes (dBase IV). There is no fundamental difficulty with this and the facility could be added in a later version.
One potentially more serious drawback is that the requirement for automatic type recognition means that all data in a dBase file which is neither string nor numeric is converted to string type. If the CTDIF file is then translated back into dBase all these "dates" and "logicals" would reappear as strings. One could extend automatic recognition to these field types but the likelihood of a mistake increases. If the current situation presents a problem to any user then it is probably best dealt with by editing the dBase programs that read the dBase .dbf files so that they work with strings instead. However, for most users of MDI strings and numerics are most important.
Multiple Tuples
In relational databases it is essential for tuples to be distinct. Multiple tuples have no meaning and even in cases where they can be entered into a database they cannot then be retrieved. Unlike relational databases, dBase and CTDIF (and SQL for that matter) permit several tuples to be the same. This is actually a common occurrence in experimental data and only causes problems if the data communication is going to be immediately and automatically loaded into a relational database on receipt. If it does cause problems then the cure is simple: add a field to all tuples which consists of a sequence number.
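The sequence-number cure can be sketched as follows in Python. The function name `add_sequence_field` and the fieldname `seq_no` are hypothetical; any name within the dBase fieldname length limit would do.

```python
def add_sequence_field(fields, rows, name="seq_no"):
    """Prefix every tuple with a running number so that no two tuples are identical."""
    new_fields = [name] + list(fields)
    new_rows = [[i] + list(row) for i, row in enumerate(rows, start=1)]
    return new_fields, new_rows
```

Two identical measurements `[1, 2]` and `[1, 2]` then become `[1, 1, 2]` and `[2, 1, 2]`, which a relational database can store and retrieve separately.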
Alternative Presentations
The same example datafile as used above is shown below using commas and line-feeds (newlines):
CTDIF-1 0.1 implementation "PMS dBase Converter v0.1 21-July-1989" name NIMONICB updated 89/7/21 fieldlist "sample_no" weight length strength_MPa elongation_to_fracture endfields #1-fred 3 5.0e-4 200.3 0.23 #2BA 3.2 1e-3 205.2 0.235 "#3Z ++" 3.333 1e-3 205.3 0.236 FIDTC-1
This can be seen to be moderately compact. If a clearer version is required for hand-editing then an entirely separate "pretty-printer" program could be written by a third party which produced a newline character before each tuple and separated the values by tabs so that they lay in columns. This freedom to change the layout independently of the content is a result of permitting a variety of separator characters and of making multiple separators mean the same as a single separator.
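Such a pretty-printer might look like the Python sketch below. It is only an illustration: the token pattern, the name `pretty_print` and the assumption that the last token of the file is the FIDTC-1 terminator are not taken from the formal syntax.

```python
import re

# Raw tokens keep their double-quotes so that they can be written back unchanged.
RAW_TOKEN = re.compile(r'"[^"]*"|[^\s,"]+')

def pretty_print(text):
    """Re-lay a CTDIF-1 file: header, fieldlist, one tuple per line, terminator."""
    toks = RAW_TOKEN.findall(text)
    lower = [t.lower() for t in toks]
    first = lower.index("fieldlist")
    end = lower.index("endfields")
    n = end - first - 1                       # number of fieldnames
    body = toks[end + 1:-1]                   # values between endfields and the terminator
    lines = [" ".join(toks[:first]), "\t".join(toks[first:end + 1])]
    lines += ["\t".join(body[i:i + n]) for i in range(0, len(body), n)]
    lines.append(toks[-1])
    return "\n".join(lines)
```

Because only separators are changed, tokenizing the output gives exactly the same sequence of tokens as tokenizing the input.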
3. SGML and CALS-Conformant Formats
Any data, in any format, can be referred to from an SGML document by the name of the file in which it is stored. This is true for images, vector drawings, recorded speech or music, anything. If there is a formatting program available which can produce printed output from such a data file, for example a translator that produced written music from a recording, or a pretty-printer which produced numeric data in a particular layout, then by referring to this program from within an SGML document it is possible to include the data so that it appears in the right place when the SGML document is printed [Bry88]. The following (or something similar) is placed in the document to define the notation "ctdif" in one of 4 forms with a parser for each:
<!ELEMENT ctdif NDATA>
<!ATTLIST ctdif level NOTATION (L-1 | L-2 | L+1 | L+2) #REQUIRED>
<!NOTATION L-1 SYSTEM "c:\ctdif\parser-1.exe">
<!NOTATION L-2 SYSTEM "c:\ctdif\parser-2.exe">
<!NOTATION L+1 SYSTEM "c:\ctdif\parser&1.exe">
<!NOTATION L+2 SYSTEM "c:\ctdif\parser&2.exe">
Then the sets of data are defined as SGML "entities":
<!ENTITY myNidata SYSTEM "c:\ctdif\nimonicb.c-1" NDATA L-1>
<!ENTITY myCudata SYSTEM "c:\ctdif\copper1.c&2" NDATA L+2>
Then, somewhere in the text of the SGML document, a reference to the data can appear like this:
"...and now we demonstrate that the data of <authors> Smith and Jones </authors> on pure copper <emphasis> clearly contradicts </emphasis> that of other workers &myCudata; and therefore it is concluded that the results are either in error or are of very significant interest..."
Therefore the tabular data format proposed here presents no problems when used in conjunction with SGML and thus it is also compatible with ODA and ODIF by encoding the SGML using SDIF [Sar89a].
CALS Conformancy
The American Department of Defense programme of Computer-Aided Acquisition and Logistic Support (CALS) currently does not specify a format to be used for tabular data. IGES formats are required for engineering drawings and vector graphics (later migrating to PDES and STEP as they become available); bitmap, raster and tiled-raster formats are defined for images; but only the very general military standard 1840A is defined for numeric data [MIL88].
CALS uses SGML-format text documents to link together compound "documents" consisting of information recorded in any of the other formats using part of the facilities in SGML described above. Thus while CTDIF could be considered to be CALS-conformant purely because it can be used with SGML, this interpretation may not be accepted by other users of CALS standards and so conformance with MIL-STD-1840A must be considered.
MIL-STD-1840A Conformancy
This standard requires that files of data conforming to it be expressed in ASCII characters and contain certain specific information at the beginning:
where the text describes the source identification (srcdocid), the destination identification (dstdocid), and any notes which should be used in interpreting the data (of unspecified nature), i.e. who wrote it and who is supposed to read it. This is the format required for "output specification data file header records" and "document type definition data file header records". All this information is to be in lines exactly 80 characters long with any blank space filled with space characters. There is also the implication and assumption that the data itself would also be in this 80 character "card-image" form. This implication is not spelt out and may not be binding. In any case, a CTDIF file can be reformatted in any layout without changing its meaning, so CTDIF files can be made conformant to MIL-STD-1840A [MIL88].
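The reformatting into 80-character "card images" can be sketched in Python as below. The sketch covers only the re-layout of the CTDIF text itself; it does not generate the MIL-STD-1840A header records, and the names `to_card_images` and `RAW_TOKEN` are hypothetical. A quoted string longer than one card cannot be split, so the sketch simply rejects it.

```python
import re

# Raw tokens keep their double-quotes so that they can be written back unchanged.
RAW_TOKEN = re.compile(r'"[^"]*"|[^\s,"]+')

def to_card_images(text, width=80):
    """Repack the tokens of a CTDIF file into lines exactly `width` characters long."""
    lines, current = [], ""
    for tok in RAW_TOKEN.findall(text):
        if len(tok) > width:
            raise ValueError("token too long for one card: " + tok[:20])
        candidate = tok if not current else current + " " + tok
        if len(candidate) > width:
            lines.append(current.ljust(width))   # pad the finished card with spaces
            current = tok
        else:
            current = candidate
    if current:
        lines.append(current.ljust(width))
    return lines
```

Since tokens are never split and trailing spaces are just additional separators, the repacked file carries the same content as the original.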
Word-Processor and Editor Conformance
The tabular data format only uses ASCII characters for letters, digits, usual typographic symbols, tabs, line-feeds and carriage-returns i.e. ASCII characters 9, 10, 13 and 32 to 126. Therefore it can be produced and edited by any text editor and any word-processor set to "non-document" mode.
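A translator could check this character-set restriction before parsing. The Python sketch below is one way to do so; the function name `illegal_bytes` is hypothetical.

```python
# Permitted codes: tab (9), line-feed (10), carriage-return (13) and 32 to 126.
ALLOWED = {9, 10, 13} | set(range(32, 127))

def illegal_bytes(data: bytes):
    """Return (offset, code) for every byte outside the permitted ASCII set."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]
```

An empty result means the file can safely be passed through any plain-text editor.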
4. Multi-Tabular (Relational) Format
There are many drawbacks of the simple, single tabular exchange format even without considering the restrictions imposed by dBase compatibility. These are:
These can be alleviated by using multiple data files to describe the same set of data, a technique known as "normalisation". If all the data is sent (or stored) in a single table as defined above then it is "simply normalised" or "in 1st normal form"; further normalisation removes redundancy without losing information. The principle involved is very simple: any one item of information should only be represented once. Howes [How85] gives a "cookbook" approach and describes what to do in order to achieve better normalisation (and gives many examples) and Date [Dat86] describes clearly and concisely why the method works. Ullman [Ull82] gives formal mathematical proofs. A good bibliography and introduction to database techniques for materials information is given in the proceedings of the Schluchsee workshop [Wes86].
Cambridge Multi-Tabular Format
The simplest extension to the simple tabular format is to add an extra definition file to a set of CTDIF-1 files encoded according to the simple tabular form. To distinguish this extension from the simple case, the format of the new definition file is called CTDIF-2 and contains a list of filenames which refer to simple CTDIF-1 files (see appendix for the precise format definition). A further step would be to add at the end of this definition file (before the CTDIF-2 terminator) the contents of all the simple files, complete with CTDIF-1 headers and terminators, rather than just their names. This has the advantage of neatness and conceptually provides precisely the same descriptive power. Here is an example of what the definition file might look like:
Note again the lack of "counts": the number of files is just the number of filenames that appear between the filelist and endfiles keywords. If a file of data is included rather than just being referenced by name then it appears between the endfiles and the FIDTC-2 keywords and its name does NOT appear in the list of files. If all the CTDIF-1 files are included in this way then the keywords filelist and endfiles can be omitted (see syntax in Appendix II). No meaning should be ascribed to the order in which the filenames or the tables appear.
The recommended filename extensions for CTDIF datafiles are ".c-1" and ".c-2", in the same way that dBase uses ".dbf".
Restriction
Using multiple tables to describe a single set of data gives more capabilities but also introduces a few more possibilities of error. The tables are linked by having some fields in common: therefore every table must have at least one field in common with another table, though not necessarily the same field. If a table has no common fields then it is really describing a distinct set of data and should not be referenced by the CTDIF-2 file. However this is not an error and translating software should only produce a warning.
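The check for unlinked tables might be sketched in Python as follows. The function name `unlinked_tables` and the dictionary representation of the tables are assumptions for illustration; in line with the text, the result is meant to drive a warning, not an error.

```python
def unlinked_tables(tables):
    """tables maps a table name to its list of fieldnames.

    Return the names of tables that share no fieldname with any other table.
    """
    unlinked = []
    for name, fields in tables.items():
        other_fields = set()
        for other_name, other in tables.items():
            if other_name != name:
                other_fields.update(other)
        if not other_fields & set(fields):
            unlinked.append(name)
    return unlinked
```

For example, a table with fields `UNS, ASTM` and another with fields `UNS, Yield` are linked through `UNS`, whereas a third table with only the field `colour` would be reported.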
Relational Capabilities
An example now follows showing how to represent in CTDIF some of the data used as an example by the SAE interchange format which put forward a simple table format for use by CALS - although the CALS administration has not yet adopted it [SAE88]:
| UNS | ASTM | Form | Diam_mm | X-Area | Tensile | Yield | Elong |
| M11311 | AZ31B | Bars | "less than 6.32" | all | 241 | 145 | 7 |
| | | | "6.35-38.07" | | 241 | 152 | 7 |
| | | | "38.10-63.47" | | 234 | 152 | 7 |
Table 1 - Part of the Example from Fig 6.3-2 of [SAE88]
Table 1 shows one "row" of the unnormalised data: some fields contain a range of values, some contain a list, e.g. the identifier "M11311" is intended to apply to all the data. Table 2 shows the same information in 1st normal form [Dat83].
| UNS | ASTM | Form | min_D | max_D | X-Area | Tensile | Yield | Elong |
| M11311 | AZ31B | Bars | 0 | 6.32 | all | 241 | 145 | 7 |
| M11311 | AZ31B | Bars | 6.35 | 38.07 | all | 241 | 152 | 7 |
| M11311 | AZ31B | Bars | 38.10 | 63.47 | all | 234 | 152 | 7 |
Table 2 - Part of the Example from [SAE88] in 1st Normal Form
Now the structure is simpler and it is possible to see how it can be further normalised to remove redundancy, as in tables 3 and 4 which are linked by the common field "UNS".
| UNS | ASTM | Form | X-Area |
| M11311 | AZ31B | Bars | all |
Table 3 - Identification information
| UNS | min_D | max_D | Tensile | Yield | Elong |
| M11311 | 0 | 6.32 | 241 | 145 | 7 |
| M11311 | 6.35 | 38.07 | 241 | 152 | 7 |
| M11311 | 38.10 | 63.47 | 234 | 152 | 7 |
Table 4 - Numeric Information
Note that in Table 4 the repetition of the UNS number is not redundant because it signifies that all the rows refer to the same material. The fact that AZ31B is a synonym for M11311 and that all of the data in Table 4 refers to Bars is stored in Table 3. The original 1st normal form of the data can be recovered by performing an "outer-join" upon all the constituent tables. If the normalisation has been done properly then this join should not produce any NULL fields [Dat83].
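The recovery of Table 2 from Tables 3 and 4 can be illustrated with a small Python sketch of an outer join on the common field "UNS". The list-of-dictionaries representation and the function name `outer_join` are assumptions; `None` stands for NULL.

```python
ident = [{"UNS": "M11311", "ASTM": "AZ31B", "Form": "Bars", "X-Area": "all"}]   # Table 3
numeric = [                                                                      # Table 4
    {"UNS": "M11311", "min_D": 0,     "max_D": 6.32,  "Tensile": 241, "Yield": 145, "Elong": 7},
    {"UNS": "M11311", "min_D": 6.35,  "max_D": 38.07, "Tensile": 241, "Yield": 152, "Elong": 7},
    {"UNS": "M11311", "min_D": 38.10, "max_D": 63.47, "Tensile": 234, "Yield": 152, "Elong": 7},
]

def outer_join(left, right, key):
    """Full outer join of two lists of dicts on one common field; gaps become None."""
    left_only = [f for f in left[0] if f != key]
    right_only = [f for f in right[0] if f != key]
    joined, matched = [], set()
    for l_row in left:
        hits = [i for i, r_row in enumerate(right) if r_row[key] == l_row[key]]
        matched.update(hits)
        if not hits:                               # left row with no partner
            joined.append({**l_row, **dict.fromkeys(right_only)})
        for i in hits:
            joined.append({**l_row, **right[i]})
    for i, r_row in enumerate(right):              # right rows with no partner
        if i not in matched:
            joined.append({**dict.fromkeys(left_only), **r_row})
    return joined

table2 = outer_join(ident, numeric, "UNS")
```

Here `table2` contains the three rows of Table 2 and no `None` values, which is the sign that the normalisation lost nothing.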
This example shows that splitting up a file of data into tables to be used together through duplicated fieldnames can be easy and obvious, but it can also be unnatural and awkward [McC87]. This is the major drawback of tabular formats, but item-based formats also become awkward for similarly complex data [Sar89b] and do not extend so smoothly from simple to complex cases.
Units
In the above example the Units (MPa etc.) have been included with the fieldnames as part of the name. This is unsatisfactory: the Units should really be easily accessible as machine-processable information. This cannot easily be done because the Units are a property of the fieldname itself, not the data.
The best way to represent this, though not always possible with existing databases, is to enter the fieldnames themselves as data using special "meta-fieldnames" such as #Fields# and #Units#. The values have been placed in quotes to distinguish them from fieldnames:
| #Fields# | #Units# |
| "Yield" | MPa |
| "Elongation" | % |
| "Diameter" | mm |
Most databases, including those that handle multiple-tables (e.g. Paradox), cannot detect that some values of one table are actually fieldnames in another.
This method can be seen to require some agreed names for the meta-fieldnames #Fields# and #Units#, which means that no user could have fields of these names for any other purpose. It is not necessary to modify the formal CTDIF-1 syntax; files with these fields will still be translated satisfactorily from and to dBase format. It is only when the Units information is actually used by some other software that any problem would arise if the contents of the #Units# field is actually some other user-defined quantity. Thus the Cambridge Tabular Interchange Formats do not place any restrictions on data or fieldnames but could in future suggest meanings and uses for special fieldnames of the form #____#. (If fieldnames are considered to be metadata, then these special fieldnames are meta-metadata and should eventually become harmonized with those used by IRDS [Sar88,89a], [Gra88].)
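Software that does understand the meta-fieldnames could turn such a table into a units lookup, as in the Python sketch below. The function name `units_by_field` and the list-based table representation are hypothetical.

```python
def units_by_field(meta_fields, meta_rows):
    """Build {fieldname: unit} from a table containing #Fields# and #Units# columns."""
    f = meta_fields.index("#Fields#")
    u = meta_fields.index("#Units#")
    return {row[f]: row[u] for row in meta_rows}

units = units_by_field(
    ["#Fields#", "#Units#"],
    [["Yield", "MPa"], ["Elongation", "%"], ["Diameter", "mm"]],
)
```

A program reading a data table with a field called `Yield` can then look up `units["Yield"]` instead of parsing the unit out of the fieldname.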
Conclusions
The multi-tabular format can be seen, informally, to have precisely the same descriptive capability as a fully-relational database and thus should handle anything definable by the relational calculus. It is therefore an excellent standard against which to measure the capability of other data interchange formats. Unfortunately in practice effective normalisation becomes increasingly difficult for the user as the associativity increases in complexity and there are many traps for the unwary [Ken78], [How85], [Dat83].
5. Extended Format
The single-table and multiple-table CTDIF-1 and CTDIF-2 formats are files or collections of files which are dBase compatible, but this will not always be adequate. The extended CTDIF+1 is, like CTDIF-1, a single-table format but most of the restrictions associated with dBase compatibility are removed. These are restrictions on the kind of data that can be expressed in the format: they restrict what data can be transmitted.
The following restrictions are revised or removed:
The formal syntax for the extended format is not defined in Appendix II since it would be premature to define the format in advance of developing programs that use it.
Implementation Restriction
Since the extended format can describe data not expressible in dBase format, information could be lost if data files of this type were translated into dBase. The extended format is therefore intended as an interchange format in its own right, to be used between software and databases which contain direct translators. This implies that the range of software available which can use this more powerful format will be smaller than the range available for the dBase compatible version.
Units
It is awkward to have to set up a distinct table to contain the units information, as is necessary in CTDIF-2. So in the extended format a "unitlist" is permitted containing the same number of strings as the "fieldlist" and containing the units.
Downgrading the Extensions
It would be useful, perhaps, to write a CTDIF+1 to CTDIF-2 translator which replaced long fieldnames with shorter names (via a table), removed comments, replaced NULL by either the empty string or -999, and constructed the distinct table to replace the units description as described above. This would mean that the extended format could be represented in dBase compatible form with some less-stringent restrictions. The following table demonstrates how long fieldnames (up to 254 characters) containing unusual characters and upper and lower case could be represented via shorter distinct names:
| #LocalName# | #Units# | #LongName# |
| "Mod300" | "GPa" | "Shear Modulus at 300K" |
| "Yld0-2" | "MPa" | "Proof Yield at 0.2% Offset" |
| "ShapeF" | "none" | "Moment of Bending divided by the Cross-sectional Area squared" |
This facility has the important consequence that any external thesaurus or dictionary does not need to define short, arbitrary codes (see Section 7.).
6. Multi-Tabular Extended Format
This simply consists of the extensions to the single-table format carried through to collections of these tables described by (or included in) a definition file. Extensions to the definition file format itself consist only of the ability to include comments.
The recommended filename extensions for CTDIF datafiles using the extended options are ".c&1" and ".c&2" because many filing systems, including MS-DOS v3.3, do not permit the "+" character to appear in filenames.
Restriction
A new restriction is necessary when combining the multi-table facility with the possibility of NULL values. For the fields common to more than one table it is important that no values in any of the tables are NULL. A moment's thought will show that merely because two distinct sets of data (tuples) are lacking a data point (i.e. have a NULL value somewhere) does not mean that they should be linked.
Units are only one kind of attribute that is applicable to an entire column (field) of data rather than to a single data item. Materials property data often requires such other attributes as "maximum", "minimum", "typical", "measured", "inferred", "required", "notched" and "unnotched"; other, more generally applicable, attributes can also be useful: "domain", "type" or "dimensions".
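The reason for the restriction can be seen in a small Python sketch of a join between two tables. The function `join_on` and the sample tuples are hypothetical; `None` stands for NULL. A naive equality test would link the two tuples whose common field is NULL, so the sketch refuses to match on NULL keys.

```python
def join_on(common, table_a, table_b):
    """Natural join of two lists of dicts on the fieldnames in `common`.

    Tuples with a NULL (None) in any common field are never linked.
    """
    joined = []
    for a in table_a:
        key_a = tuple(a.get(name) for name in common)
        if None in key_a:
            continue  # a NULL key carries no information, so it cannot match
        for b in table_b:
            key_b = tuple(b.get(name) for name in common)
            if key_a == key_b:
                joined.append({**a, **b})
    return joined


# Invented sample tuples, for illustration only.
specimens = [{"Sample": "A1", "Width": 2.0}, {"Sample": None, "Width": 3.0}]
results = [{"Sample": "A1", "Strength": 200.3}, {"Sample": None, "Strength": 205.2}]
linked = join_on(["Sample"], specimens, results)
```

Only the "A1" tuples are linked; the two tuples lacking a sample name are left out rather than being paired with each other.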
Consider a data evaluation office whose function is to take experimental data from several laboratories, compare it with published data and theoretical predictions from models, evaluate a physically-based new model describing that experimental behaviour, and then finally provide a table of parameter values, complete with error estimates, based on that model for the purposes of design. The office could both accept the experimental data and deliver the parameters in CTDIF form: many of the fieldnames would be the same, with the same units, but other attributes would be quite different.
This attribute problem is usually handled by devising a different fieldname for each type of use, e.g. "max_strength_MPa" or "min_thickness_cm". A problem with this technique arises when different databases use different fieldnames: one database might use fieldnames for maximum and minimum values and another might store the mean and a range. This indicates that each fieldname in the dictionary should have four available names (at least); alternatively, some formal name-modifiers would have to be agreed, such as suffixes "_max", "_min", "_+/-range" with the arithmetic mean being the default for the simple name.
A more general and clearer method is to use tables to record the attributes and to make the connection between data stored in the datafile and some dictionary of definitions stored elsewhere. The principle is the same as that used in all tabular databases: "one fact should be stored in one and only one place". What follows is a simplified version of some of the functionality of McCarthy's data thesaurus [McC87].
| #LocalName# | #Attribs# | #DictionaryName# | #Accuracy# |
| "SMod-1" | "max" | "Shear Modulus" | 5% |
| "SMod-2" | "min" | "Shear Modulus" | 5% |
| "SMod-3" | "typical" | "Shear Modulus" | 7% |
| "SMod-1" | "required" | "Shear Modulus" | 5% |
| "KDlm-1" | "notched" | "Delamination Fracture Toughness" | 30% |
| "CPFT-1" | "notched" | "Cross-Ply Fracture Toughness" | 25% |
| "CPFT-1" | "design" | "Cross-Ply Fracture Toughness" | 0% |
Note that "SMod-1" has two attributes set: it is a required maximum upper bound. The units in which a property is measured are obviously necessary, but it is also possible to specify the dimensions of a quantity so that dimensionally correct transformations can be made to a set of data. For example if a set of specimen dimensions and loads is stored in a datafile, together with a set of strengths and moduli, it should be possible to check automatically whether the stress quantities are compatible with the load-divided-by-the-cross-sectional-area. In theoretical terms, what these three special tables are doing is simulating a "semantic model" of data using only relational tables [McC87], [Dat83].
| #DictionaryName# | #Dimensions# |
| "width" | "L" |
| "thickness" | "L" |
| "strength" | "MT-2L-1" |
| "load" | "MLT-2" |
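The automatic check mentioned above can be sketched in Python as follows. The dimension strings are taken from the table; the helper names `parse_dims` and `combine` are assumptions made for this illustration, and only the base dimensions M, L and T are handled.

```python
import re
from collections import Counter


def parse_dims(text):
    """Turn a string such as "MT-2L-1" into exponents {"M": 1, "T": -2, "L": -1}."""
    dims = Counter()
    for symbol, power in re.findall(r"([MLT])(-?\d+)?", text):
        dims[symbol] += int(power) if power else 1
    return {s: p for s, p in dims.items() if p != 0}


def combine(a, b, sign=1):
    """Multiply (sign=1) or divide (sign=-1) two dimension dictionaries."""
    out = Counter(a)
    for symbol, power in b.items():
        out[symbol] += sign * power
    return {s: p for s, p in out.items() if p != 0}


DIMENSIONS = {"width": "L", "thickness": "L", "strength": "MT-2L-1", "load": "MLT-2"}

# Cross-sectional area = width * thickness; stress = load / area.
area = combine(parse_dims(DIMENSIONS["width"]), parse_dims(DIMENSIONS["thickness"]))
stress = combine(parse_dims(DIMENSIONS["load"]), area, sign=-1)
print(stress == parse_dims(DIMENSIONS["strength"]))  # True
```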
A simpler method of checking is to devise a set of "domains" and to classify all quantities as belonging to one and only one domain. Domains were an original part of the relational model of data but no implementations of relational databases have ever used them adequately [Dat83]. These domains would be such things as "stress", "strain", "temperature" etc. However using domains it is possible to distinguish between fields that have the same dimensions but different kinds of uses, such as stresses and energy densities or erosion-rates and velocities. This example also shows that many of these "attribute tables" can be represented in a single table.
| #DictionaryName# | #Domain# | #Dimensions# |
| "wear-rate" | "rate" | "LT-1" |
| "impact-vel" | "velocity" | "LT-1" |
| "modulus" | "Stress" | "MT-2L-1" |
| "batteryEffc." | "EnergyDensity" | "MT-2L-1" |
| "load" | "Force" | "MT-2L" |
| "ActvnEnergy" | "Energy" | "MT-2L2" |
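As a small illustration, the following Python sketch uses a few rows of the table above to separate quantities that share dimensions but belong to different domains. The function names are hypothetical, and dimensions are compared as plain strings, which assumes that they are always written in the same order.

```python
QUANTITIES = {
    # DictionaryName: (Domain, Dimensions), copied from the table above
    "wear-rate": ("rate", "LT-1"),
    "impact-vel": ("velocity", "LT-1"),
    "modulus": ("Stress", "MT-2L-1"),
    "batteryEffc.": ("EnergyDensity", "MT-2L-1"),
}


def same_dimensions(a, b):
    """True if the two quantities have identical dimension strings."""
    return QUANTITIES[a][1] == QUANTITIES[b][1]


def same_domain(a, b):
    """True if the two quantities are classified in the same domain."""
    return QUANTITIES[a][0] == QUANTITIES[b][0]
```

Here `same_dimensions("wear-rate", "impact-vel")` holds while `same_domain("wear-rate", "impact-vel")` does not, so a checker based on domains would refuse to compare an erosion rate with a velocity.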
It can also be useful, when transferring large quantities of data automatically, if the type of the data value is specified so that errors can be trapped. This means specifying whether the value of a field can be a string, an integer, a positive real number (most material properties never have negative values) etc.
| #DictionaryName# | #Type# |
| "width" | "posReal" |
| "thickness" | "posReal" |
| "UNS number" | "string" |
| "Thermal Exp.Coeff." | "real" |
| "Poissons Ratio" | "real" |
Note that the type, domain and dimensions can be applied to the #DictionaryName# and will automatically be applied to a number of local names differing in attributes such as "max.", "min.", "typical" etc. thus ensuring some consistency automatically. Note also that the #Type# can imply a bound on a value: if a value is given as "less than 6.3" and the type is a positive real number ("posReal") then the lower bound is 0.0, but if the type is simply "real" then the lower bound would have to be minus-infinity.
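The bound implied by a #Type# can be written down directly. The sketch below is in Python; the mapping `LOWER_BOUND` and the function `interval` are assumptions for illustration, and only the two numeric type names used in the table above are covered.

```python
import math

# Lower bound implied by each numeric #Type# value.
LOWER_BOUND = {"posReal": 0.0, "real": -math.inf}


def interval(type_name, upper):
    """Return (lower, upper) for a value reported only as "less than upper"."""
    return (LOWER_BOUND[type_name], upper)
```

For a value given as "less than 6.3", `interval("posReal", 6.3)` gives `(0.0, 6.3)`, whereas `interval("real", 6.3)` has minus-infinity as its lower bound.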
None of these special tables are defined as any part of the CTDIF formats, they merely serve to illustrate the variety of associations it is possible to represent with tabular data formats. Future data dictionaries and thesauri could define interpretations for these special fieldnames and for the values to be expected for the fieldnames #Attribs# and #Type#.
Diverse Data
Sometimes if there are a very large number of fieldnames used in a set of data, but only a few have values for any one tuple (material), then it is appropriate to use a table of the following form:
| Material | NumericParameter | Number |
| "Copper" | "Modulus" | 42.1 |
| "Copper" | "T-melt" | 1356 |
| "Diamond" | "T-melt" | 4000 |
| "Iron" | "Modulus" | 64 |
This example comes from a datafile used by both a deformation mechanism map program and a Hot Isostatic Pressing (HIP) process modelling program [Sargent & Ashby, unpublished work] and is very similar to a mechanism suggested by McCarthy [McC87].
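A receiving program can turn such a table back into one record per material. The Python sketch below uses the four tuples of the table above; the names `pivot` and `lookup` are hypothetical, and a missing parameter is returned as `None`, standing for NULL.

```python
from collections import defaultdict

TRIPLES = [
    ("Copper", "Modulus", 42.1),
    ("Copper", "T-melt", 1356),
    ("Diamond", "T-melt", 4000),
    ("Iron", "Modulus", 64),
]


def pivot(triples):
    """Group (Material, NumericParameter, Number) tuples by material."""
    wide = defaultdict(dict)
    for material, parameter, number in triples:
        wide[material][parameter] = number
    return dict(wide)


def lookup(wide, material, parameter):
    """Return the stored number, or None if that parameter was never given."""
    return wide.get(material, {}).get(parameter)


materials = pivot(TRIPLES)
```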
After the CODATA-sponsored Schluchsee meeting in 1985 an informal working group was established to devise a materials data interchange format and a set of guidelines was produced. The formats described here go some way to satisfying these guidelines.
The format envisaged by CODATA is for "passive" interchange (one-way without acknowledgement) between databases and software written in conventional programming languages. It must be capable of being understood by a scientist or engineer who is only a casual user of computers and must support the following facilities:
But it should not attempt to support diagrams, pictures and the full-text of articles. In addition, the format itself should have the following properties:
It can be seen that the Cambridge Tabular format CTDIF-2 handles most of these. The uncertainties of numerics and the specification of the structure of graphs can be modelled using multiple tables. This must be so since the full power of the relational model of data is available, but names for the "meta-fieldnames" would need to be defined for such things as "x-axis" and "independent variable" for graphs. The "definition of new variables" is the only defined behaviour for CTDIF: all fieldnames are declared but their meanings are not defined. Mathematical expressions can only be represented as strings, which is probably adequate. If the parse-tree of a mathematical expression is required then it is most sensible for the parsing to be done by receiving software rather than in the interchange format itself.
Arabic and Greek alphabets are independent of the definition of CTDIF. ISO has defined "octet" (byte) codes for several non-Latin alphabets (including Japanese which requires 2-byte codes) and CTDIF could be used directly once translations of the keywords (such as FIELDNAMES) have been defined.
The data part of a CTDIF file is clearly delimited by a header and a tailer, but tuples are distinguished only by ordinary separator characters: to select a tuple it is necessary to count data fields from the beginning of the file (software should have no trouble doing this). The recipient is not identified but the originating software is written into the format; the merits of including an intended recipient field would have to be discussed during the precise definition of the extended versions of the CTDIF formats. The specification of a transmission date/time is not appropriate for CTDIF because a floppy disc could spend a substantial length of time in transit and, depending on the postal service, may have no predefined arrival date.
The greatest deficiency in the CTDIF formats from the CODATA point of view is not actually explicit in the CODATA guidelines, that is that there is no defined dictionary of fieldnames within CTDIF and that one must be separately agreed upon.
Simple tabular and multi-tabular (relational), dBase compatible and extended formats all have their uses and it is inevitable that if only one of these is adopted by a standards-making body, some user-communities would then develop others. It makes sense to forestall such a divergence of standards by designing a system where all types can coexist and where there are defined methods for transforming from one to the other. This paper defines Version 1.0 of CTDIF-1 and Version 1.0 of CTDIF-2 (dBase-compatible). It gives suggestions for the extended formats CTDIF+1 and CTDIF+2 but further thought is required to arrive at a sensible and workable set of extensions.
The systems described here, however, contain no mechanism to handle meanings or interpretation of terms. Therefore the CTDIF formats will only be useful to a data-sharing community if that community also defines a common fieldname-dictionary with definitions to ensure that meanings are also communicated as intended.
Acknowledgements
This work was undertaken whilst the author was the holder of an Advanced Fellowship from the UK Science and Engineering Research Council. dBase is a trademark of Ashton-Tate Inc. MS-DOS is a trademark of Microsoft Corp.
Appendix I - dBase Database File Format
Database files storing data (as opposed to indexes) are stored in ".dbf" file format. The format consists of a header which contains firstly some invariant fields (bytes 0-31), then a list of the fields including fieldnames, then finally the data itself. The .dbf files for dBase III and III+ are almost identical, and dBase IV has some small extensions (noted below where they occur). dBase II had quite a different .dbf file format and Ashton-Tate make available a utility DCONVERT which translates from II to III+.
The index file format is not described here since this type of file can be automatically generated from the .dbf file. (In passing it is worth noting that the index file format is quite different in dBase III+ and IV.)
The memo files (.dbt files) are used by dBase to store free-form text and are not described here. CTDIF+1 format permits arbitrary length strings and it would be possible to devise a translation program to convert CTDIF+1 long strings to a dBase-type memo file, but this would not handle all the other extensions from CTDIF-1 to CTDIF+1.
In what follows the term "XXh" is taken to mean a byte with a hexadecimal value of XX, e.g. 0Ah is 10 decimal, 0Bh is 11 decimal, 10h is 16 decimal.
| Byte | Length | Description |
| 0 | 1 | 03h for dBase III+, 83h if a memo file needed (bits 0-2 version number, 3-5 for SQL use, 6-7 for valid memo file required) |
| 1-3 | 3 | date: YY MM DD in binary, e.g. 59h is 89 decimal for YY (representing 1989) and 6Ah is 106 (representing 2006). |
| 4-7 | 4 | number of records in the .dbf file, expressed as a 32 bit number (least significant bits first) e.g. 03-00-00-00 is 3 (decimal)*. |
| 8-9 | 2 | number of bytes in the header (p), as a 16 bit number (least significant bits first) e.g. c1-00 is 193 (decimal). |
| 10-11 | 2 | number of bytes in each data record, as a 16 bit number (least significant bits first) e.g. 4a-00 is 74 (decimal). |
| 12-13 | 2 | reserved by Ashton-Tate |
| 14 | 1 | flag for transaction processing, set to 01h (ASCII 1, "ctrl-A") at beginning of transaction, either completion or rollback resets it to 00h (NUL). |
| 15 | 1 | encryption flag, 01h if encrypted, 00h (ASCII 0, NUL, "ctrl-@") if plaintext |
| 16-27 | 12 | reserved by Ashton-Tate for multi-user applications |
| 28-31 | 4 | reserved by Ashton-Tate |
| 32-n | 32*q | a list of field descriptors, 32 bytes each |
| n+1 | 1 | 0Dh (ASCII 13, carriage return, "ctrl-M") header terminator character. |
| n+2 | 1 | 00h (ASCII 0, NUL) in dBase III ONLY! Does not appear in III+ or IV. |
If the number of bytes in the header is p, then the number of fields q = (p - 33)/32.
An extra byte (00h) appears between the 0Dh header terminator character and the delete flag of the first data record in dBase III files, but not in dBase III+ or dBase IV files. dBase III+ can cope with this and the current CTDIF-1 translator detects it. [Actually this field can be totally wrong and dBase III+ will still handle the data correctly. Thus it cannot be relied upon and any software reading .dbf files should be prepared to ignore it and count lengths, fields and data for itself.]
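The invariant part of the header can be decoded with a few lines of Python, shown here as a sketch. The 32 bytes are those of the example file dumped later in this appendix; the function name `read_header` and the names of the returned keys are assumptions. The field count uses the relation q = (p - 33)/32, which holds for dBase III+ and IV files without the extra NUL byte.

```python
import struct

# First 32 bytes of the example .dbf file dumped later in this appendix.
HEADER = bytes.fromhex(
    "03 59 07 15 03 00 00 00 c1 00 26 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
)


def read_header(raw):
    """Decode bytes 0-11 of a .dbf header (all integers least significant byte first)."""
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack(
        "<4BIHH", raw[:12]
    )
    return {
        "version": version & 0x07,        # bits 0-2 hold the version number
        "memo": bool(version & 0x80),     # 83h means a memo file is needed
        "date": (1900 + yy, mm, dd),      # 59h -> 1989, 6Ah -> 2006
        "records": n_records,
        "header_len": header_len,         # p
        "record_len": record_len,
        "fields": (header_len - 33) // 32,  # q = (p - 33)/32
    }


info = read_header(HEADER)
```

As noted above, the stated lengths cannot be relied upon, so a robust reader would treat these values as hints and verify them by counting.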
Each field descriptor is exactly 32 bytes long (as mentioned above).
| Byte | Length | Description |
| 0-10 | 11 | field name in ASCII capital letters, terminated by NUL (ASCII 0). Since there must always be at least one NUL character the name can be at most 10 characters long |
| 11 | 1 | field type, an ASCII character, one of C, N, L, D and M. dBase IV also allows F. |
| 12-15 | 4 | field data address (irrelevant for disc files, is only used when the file is copied into RAM memory) |
| 16 | 1 | field width: the number of characters allowed in the field value. A binary number from 1 to 254. |
| 17 | 1 | field decimal length: the number of characters allowed after the decimal point. A binary number from 0 to 15 (and at most the field width minus 2). |
| 18-19 | 2 | Reserved by Ashton-Tate. |
| 20 | 1 | Work Area ID (used internally by dBase) |
| 21-22 | 2 | Reserved by Ashton-Tate for multi-user applications. |
| 23 | 1 | SET FIELDS flag, NUL (ASCII 0) is default, the field is available. |
| 24-31 | 8 | Reserved by Ashton-Tate. |
The data follows the header without any separators, just bytes of values in the order specified in the field descriptors. Before each data record is a single byte: the delete flag.
| Byte | Length | Description |
| 0 | 1 | delete flag: 20h (ASCII 32, " " space character) if data is valid, 2Ah (ASCII 42, "*" character) if it is marked as deleted. |
| 1-m | m | the data |
At the end of the file, after all the data, is a single character:
| Byte | Length | Description |
| p | 1 | 1Ah (ASCII 26, "ctrl-Z"): the end of file marker. |
| C | character data, a fixed-length string of 1 to 254 ASCII characters |
| N | numeric, a string of 1 to 19 characters from the following set: - . 0 1 2 3 4 5 6 7 8 9 (20 characters in dBase IV). dBase permits "15.9 digits of accuracy" in fixed-point calculations in dBase programs, 19 (or 20) digits accuracy in .dbf files. |
| L | logical, a single character, one of y Y n N t T f F or ? (unset). Note that one might expect a single bit to be sufficient but Ashton-Tate allocate a whole byte (8 bits) and allow the "unset" value in addition to normal Boolean logic. |
| D | date: 8 digits in YYYY MM DD format, not the same format as used in the file header |
| M | 10 digit pointer to a memo file block (for free-format text stored in a separate .dbt file, not relevant here) |
| F | only in dBase IV, not dBase III+. Floating point binary numeric. This has exactly the same format in .dbf files as the "N" numeric, but it indicates that in dBase programs floating-point rather than fixed-point operations should be performed in calculations. As far as a transfer format is concerned there is no difference between "N" and "F" and so "N" should be preferred since it is recognised by dBase III+. |
Fieldnames must contain only upper-case letters, digits and the underline character, must begin with a letter and be between 1 and 10 characters long. Numerics must be convertible to a number of 19 digits or fewer, including the decimal point but without any exponent. All strings must be at most 254 characters long.
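These restrictions lend themselves to simple pattern checks. The Python sketch below is one possible reading of them: the function names are hypothetical, the numeric limit is taken as at most 19 characters (following the field type table above), and a numeric is assumed to need at least one digit.

```python
import re

FIELDNAME = re.compile(r"[A-Z][A-Z0-9_]{0,9}")   # 1 to 10 characters, letter first
DBASE_NUMERIC = re.compile(r"-?(\d+\.?\d*|\.\d+)")  # no exponent permitted


def valid_fieldname(name):
    """True if `name` is acceptable as a dBase fieldname."""
    return FIELDNAME.fullmatch(name) is not None


def valid_numeric(text):
    """True if `text` fits a dBase III+ numeric field (at most 19 characters)."""
    return len(text) <= 19 and DBASE_NUMERIC.fullmatch(text) is not None
```

For example, `valid_fieldname("STRENGTH_M")` holds, while `"strength"` (lower case) and `"ELONGATION_PC"` (13 characters) are rejected; `valid_numeric("1e5")` fails because of the exponent.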
The filename of the .dbf file must be a string of between 2 and 8 characters, beginning with a letter, and containing only letters (upper and lower-case not distinguished, as for fieldnames), digits and the characters $ & # ~ % ( ) - _ @ ^ { } ! i.e. it must be a valid MS-DOS filename but not a single letter.
Maximum number of records per file: 1e9
Maximum number of bytes per file: 2e9 bytes (2 Gigabytes)
Maximum record size: 4000 bytes
Maximum number of fields per record: 128 (255 in dBase IV).
However the maximum file length under MS-DOS version 3.3 is "only" 32 Megabytes. MS-DOS version 4 and OS/2 extend this, but they are not yet in such common use. Unix and Macintosh filing systems can handle very long files and several software packages are available that can read dBase files.
Example Hexadecimal Dump of a .dbf File
This is a hexadecimal printout of a .dbf file identical to the example CTDIF-1 file listed earlier.
Olivetti HEXDUMPRelease o1.0 Dumping File: C:NIMONICB.DBF
0000: 03 59 07 15 03 00 00 00 c1 00 26 00 00 00 00 00 ".Y........&....."
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "................"
0020: 53 41 4d 50 4c 45 5f 4e 4f 00 00 43 0d 00 d7 4a "SAMPLE_NO..C...J"
0030: 07 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 "................"
0040: 57 45 49 47 48 54 00 00 00 00 00 4e 14 00 d7 4a "WEIGHT.....N...J"
0050: 07 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00 "................"
0060: 4c 45 4e 47 54 48 00 00 00 00 00 4e 1b 00 d7 4a "LENGTH.....N...J"
0070: 08 05 00 00 01 00 00 00 00 00 00 00 00 00 00 00 "................"
0080: 53 54 52 45 4e 47 54 48 5f 4d 00 4e 23 00 d7 4a "STRENGTH_M.N#..J"
0090: 0a 01 00 00 01 00 00 00 00 00 00 00 00 00 00 00 "................"
00a0: 45 4c 4f 4e 47 41 54 49 4f 4e 00 4e 2d 00 d7 4a "ELONGATION.N-..J"
00b0: 05 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00 "................"
00c0: 0d 20 23 31 2d 66 72 65 64 20 20 33 2e 30 30 30 ". #1-fred  3.000"
00d0: 20 30 2e 30 30 30 35 30 20 20 20 20 20 32 30 30 " 0.00050     200"
00e0: 2e 33 30 2e 32 33 30 20 23 32 42 41 20 20 20 20 ".30.230 #2BA    "
00f0: 20 33 2e 32 30 30 20 30 2e 30 30 31 30 30 20 20 " 3.200 0.00100  "
0100: 20 20 20 32 30 35 2e 32 30 2e 32 33 35 20 23 33 "   205.20.235 #3"
0110: 5a 20 2b 2b 20 20 20 33 2e 33 33 33 20 30 2e 30 "Z ++   3.333 0.0"
0120: 30 31 30 30 20 20 20 20 20 32 30 35 2e 33 30 2e "0100     205.30."
0130: 32 33 36 1a "236."
HEXDUMP Complete
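To show how the layout described in this appendix applies to the dump, the following Python sketch decodes the same bytes into field descriptors and records. It is a minimal illustration, not the author's Icon translator: the function name `decode` is an assumption, only C and N fields are handled, and deleted records are simply skipped. In line with the advice above, it finds the start of the data by locating the 0Dh terminator rather than trusting the stated header length.

```python
DUMP = """
03 59 07 15 03 00 00 00 c1 00 26 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
53 41 4d 50 4c 45 5f 4e 4f 00 00 43 0d 00 d7 4a
07 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
57 45 49 47 48 54 00 00 00 00 00 4e 14 00 d7 4a
07 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00
4c 45 4e 47 54 48 00 00 00 00 00 4e 1b 00 d7 4a
08 05 00 00 01 00 00 00 00 00 00 00 00 00 00 00
53 54 52 45 4e 47 54 48 5f 4d 00 4e 23 00 d7 4a
0a 01 00 00 01 00 00 00 00 00 00 00 00 00 00 00
45 4c 4f 4e 47 41 54 49 4f 4e 00 4e 2d 00 d7 4a
05 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00
0d 20 23 31 2d 66 72 65 64 20 20 33 2e 30 30 30
20 30 2e 30 30 30 35 30 20 20 20 20 20 32 30 30
2e 33 30 2e 32 33 30 20 23 32 42 41 20 20 20 20
20 33 2e 32 30 30 20 30 2e 30 30 31 30 30 20 20
20 20 20 32 30 35 2e 32 30 2e 32 33 35 20 23 33
5a 20 2b 2b 20 20 20 33 2e 33 33 33 20 30 2e 30
30 31 30 30 20 20 20 20 20 32 30 35 2e 33 30 2e
32 33 36 1a
"""


def decode(raw):
    """Return (fields, records) from the bytes of a dBase III/III+ .dbf file."""
    fields = []
    pos = 32                              # descriptors follow the 32 invariant bytes
    while raw[pos] != 0x0D:               # 0Dh terminates the descriptor list
        desc = raw[pos:pos + 32]
        name = desc[:11].split(b"\x00")[0].decode("ascii")
        fields.append((name, chr(desc[11]), desc[16], desc[17]))
        pos += 32
    pos += 1                              # step over the 0Dh terminator
    if raw[pos] == 0x00:                  # extra NUL written by dBase III only
        pos += 1
    records = []
    while pos < len(raw) and raw[pos] != 0x1A:   # 1Ah is the end of file marker
        deleted = raw[pos] == 0x2A               # "*" marks a deleted record
        pos += 1
        values = []
        for name, ftype, width, decimals in fields:
            text = raw[pos:pos + width].decode("ascii").strip()
            values.append(float(text) if ftype == "N" else text)
            pos += width
        if not deleted:
            records.append(values)
    return fields, records


fields, records = decode(bytes.fromhex("".join(DUMP.split())))
```

The five descriptors give field widths of 7, 7, 8, 10 and 5 characters, so each record occupies 38 bytes including the delete flag, in agreement with bytes 10-11 of the header (26h).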
Appendix II - Formal Syntax for Cambridge Tabular Formats
While the syntax alone does not fully describe the interchange standard, it is nevertheless a good starting point. Files which do not conform to these BNF (Backus-Naur Form) rules are definitely in error, while files which do conform may still have more subtle errors (for example, this syntax does not specify what to do with strings containing separator characters).
file1 ::= <text> <format1> <text>
format1 ::= <header1> <fields> <body> <tailer1>
header1 ::= "CTDIF-1" <sep> <version> <implem> <name> <update>
tailer1 ::= "FIDTC-1"
fields ::= "FIELDLIST" <sep> <slist> "ENDFIELDS"
body ::= <value> <sep> | <value> <sep> <body>
file2 ::= <text> <format2> <text>
format2 ::= <header2> [<files>] [<format1>]* <tailer2>
header2 ::= "CTDIF-2" <sep> <version> <implem> <name> <descrip>
tailer2 ::= "FIDTC-2"
descrip ::= <string>
files ::= "FILELIST" <sep> <slist> "ENDFILES"
BNF Definitions used for both CTDIF-1 and CTDIF-2
version ::= <digit> "." <digit> [<digit>] <sep>
implem ::= "implementation" <sep> <string> <sep>
name ::= "NAME" <sep> <nstring> <sep>
update ::= <year> "/" <month> "/" <day>
sep ::= "," | " " | <tab> | <lf> | <sep> <sep>
value ::= <string> | <numeric>
slist ::= <string> <sep> | <string> <sep> <slist>
nstring ::= <letter> [<digit> | <letter>]*
string ::= <qstring> | [<letter> | <digit> | <chars>]*
qstring ::= <quote> <string> <quote>
digit ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "0"
letter ::= "A" | "B" | ... | "Z" | "a" | "b" | ... | "z"
chars ::= "!" | "$" | "%" | ... | " " etc.
where <quote> is the quotation mark character (ASCII 34) and <chars> is any non-letter or digit ASCII character except <quote> i.e. ASCII 0-33, 35-47, 58-64, 91-96, 123-126. <tab> is ASCII 9 and linefeed (LF) is ASCII 10. <text> is unparsed bytes before or after the CTDIF part of the file. Carriage return (ASCII 13) is ignored everywhere except in strings and is not a valid separator character (which does not matter because whenever it appears it is almost always before or after a linefeed). Note that the date format is neither American nor European but ISO: operating from the general to the particular from the left to the right so that time in hours, minutes, seconds etc. can simply be added at the end in later versions.
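A reader for CTDIF-1 can be sketched in Python as a tokenizer followed by a scan for the keywords. This is a lenient illustration under stated assumptions, not a conforming implementation: the function names and the sample file are hypothetical, the header items between "CTDIF-1" and FIELDLIST are skipped rather than validated, and values are returned as strings.

```python
def tokenize(text):
    """Split on comma, space, tab and linefeed, keeping quoted strings whole."""
    tokens, current, in_quotes, quoted = [], [], False, False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            quoted = True                  # remember that a (possibly empty) string was seen
        elif in_quotes:
            current.append(ch)
        elif ch in ", \t\n":
            if current or quoted:
                tokens.append("".join(current))
            current, quoted = [], False
        elif ch != "\r":                   # carriage return is ignored outside strings
            current.append(ch)
    if current or quoted:
        tokens.append("".join(current))
    return tokens


def parse_ctdif1(text):
    """Return (fieldnames, rows) from the CTDIF-1 part of `text`."""
    tokens = tokenize(text)
    start = tokens.index("CTDIF-1")        # header and tailer must be in capitals
    end = tokens.index("FIDTC-1", start)
    body = tokens[start + 1:end]
    upper = [t.upper() for t in body]      # other keywords may be in any case
    first = upper.index("FIELDLIST")
    last = upper.index("ENDFIELDS", first)
    fields = body[first + 1:last]
    values = body[last + 1:]
    if not fields or not values or len(values) % len(fields):
        raise ValueError("1201: values not divisible by number of fieldnames")
    rows = [values[i:i + len(fields)] for i in range(0, len(values), len(fields))]
    return fields, rows


SAMPLE = '''Notes before the data are ignored.
CTDIF-1 1.0
implementation "hand typed"
NAME NIMONICB
1989/07/21
FIELDLIST SAMPLE_NO, WEIGHT, LENGTH ENDFIELDS
"#1-fred", 3.000, 0.00050
"#3Z ++", 3.333, 0.00100
FIDTC-1
Text after the tailer is ignored as well.
'''
names, rows = parse_ctdif1(SAMPLE)
```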
The syntax for <numeric> is not currently defined precisely, but as a working definition it should include the format of every legal real literal in Borland's Turbo Pascal (version 5) and in Icon version 7.5 [Gri86] (except that the tilde (~) is not a valid negation indicator), e.g. 1.0, 1e5, 0.1e-4, -2, .1, -.03 are all valid provided that they are preceded and followed by a separator character.
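Since the definition is only a working one, the following Python pattern is offered as one possible reading of it rather than as the definition itself. It accepts the six examples just given; acceptance of a leading "+" sign and of an upper-case "E" are assumptions, and the names `NUMERIC` and `to_number` are hypothetical.

```python
import re

# Optional sign, digits with an optional point (or a point followed by digits),
# then an optional exponent.
NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def to_number(token):
    """Return the token as a float, or None if it should be treated as a string."""
    return float(token) if NUMERIC.fullmatch(token) else None
```

A token such as "#2BA" or "~2" is not matched and would therefore be passed on as a string value.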
Note that the limits for CTDIF-1 are between about 1e-17 and 1e+19. These are the limits of the values storable in the data file. These are not the limits that are given in the dBase documentation: 1e-107 to 1e+108, (or 1e-307 to 1e+308 in dBase IV) which apply to the values which can be processed by the dBase language.
There are no comments permitted and only the first 10 characters of a fieldname are significant. Fieldnames must be as in dBase (see above) and numerics must be in the same range of values but can be written in a variety of formats. The NAME value must be a string of between 2 and 8 characters, beginning with a letter, and containing only letters (upper and lower-case not distinguished, as for fieldnames), digits and the characters $ & # ~ % ( ) - _ @ ^ { } ! i.e. it must be a valid MS-DOS filename.
The automatic recognition of values means that fieldnames and other strings that contain the delimiter characters space, tab, comma or linefeed must be enclosed in quotes. In the current version the quote character (") itself is not permitted in strings at all (although the apostrophe (') is), but even this restriction could be removed by the introduction of an escape character.
These CTDIF specifications have introduced a number of keywords, and the syntax definition is such that this has a restricting effect on some values that can be transmitted, in particular no string can have the value "FIDTC" as this always signifies the end of the format, whether it appears in quotes or not.
These are defined above and are listed here in alphabetical order for easy reference. The header and tailer keywords must always be in capital letters, the others can be in upper, lower or mixed case.
CTDIF-1
CTDIF-2
ENDFIELDS
ENDFILES
FIDTC-1
FIDTC-2
FIELDLIST
FILELIST
IMPLEMENTATION
NAME
Appendix III - Translator Implementations Specifications
Warnings occur but permit processing to continue. Errors always cause the translator to stop when they occur. The numbers 1-999 are usually used by the underlying programming language to report run-time errors; if the translator is written correctly these should never occur. Error 1301 should only appear on unreasonably large files (where the definition of unreasonable is up to the writer of the software, but must be documented).
If at all feasible, mistakes and problems should trigger warnings and continue processing rather than trigger errors which terminate. This is particularly true for translators which read dBase format, as these files contain redundancy and so many "errors" are not fatal. For example, although the length of the header, the length of each record and the total number of records are stated in the dBase header, none of this information is necessary to parse the file correctly.
dBase -> CTDIF Defined Error Conditions
The following errors and warnings must be detected by any software implementing this format as a checker or parser which converts dBase .dbf files into CTDIF-1 files. There are very many more defined errors than for the other direction of translation because dBase is a highly specific bit-level format, so there are many things that can be wrong.
Warnings
| 1101 | Empty file: no fieldnames or values but otherwise correct format |
| 1102 | memo file required |
| 1103 | Unrecognised dBase version (the version id found is reported) |
| 1104 | output file already exists, making backup |
| 1105 | Invalid format of last update Date |
| 1106 | Logical field(s) present: value(s) converted to characters |
| 1107 | Date field(s) present: value(s) converted to strings |
| 1108 | Record marked as deleted |
| 1109 | File continues after dBase file terminator character. |
| 1110 | SQL flag is set on this file, dBase IV only! |
| 1111 | Bad delete bit at beginning of record, ignored. |
| 1112 | Unsupported field type present (cannot parse memo fields) |
| 1113 | Incorrect header length stated in header (too long), correct length will be used |
| 1114 | Header length stated is too small, correct length will be used |
| 1115 | Incorrect record length stated in header, correct length will be used |
| 1116 | Bad fieldname, no terminating NUL, complete 11-byte name will be used |
| 1117 | Unrecognised value for SET FIELDS flag, assumed to be valid |
| 1118 | Data truncated: incomplete record read |
| 1119 | Duplicate tuples (records) found in file (NOT IMPLEMENTED) |
| 1120 | Unset Logical value set to ? |
| 1121 | Encrypted data in this file |
| 1122 | Missing end of file character after dBase data |
| 1123 | Unrecognised field type, treated as string |
| 1124 | Header incorrect, wrong number of data records |
| 1125 | Transaction flag set: data may be inconsistent |
| 1126 | Cannot read numeric value in data record, assumed zero |
| 1127 | String value contains "FIDTC-1", changing to "F_I_D_T_C-1" |
Errors
| 1201 | Cannot open input .dbf file |
| 1202 | Cannot read a byte from input file |
| 1203 | Cannot open output file |
| 1204 | Cannot write to output file |
| 1205 | Premature end of dBase file, incorrect header |
| 1206 | dBase II is not supported by this program |
| 1207 | Incorrect field width for field type |
| 1208 | Invalid number of decimal places for numeric field |
| 1209 | Unrecognised field type |
| 1210 | Cannot read numeric value: third failure in same record |
Implementation Errors
| 1301 | Internal miscount of fields |
| 1302 | error in translating 16 bit integer |
| 1303 | error in translating 32 bit integer |
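The preference for warnings over errors can be illustrated with the header length checks. The Python sketch below recomputes the header and record lengths from the field widths and reports warnings 1113, 1114 and 1115 from the table above when the stated values disagree. The function name and its arguments are assumptions, and the `dbase3` flag adds the extra NUL byte written by dBase III.

```python
def check_header(stated_header_len, stated_record_len, field_widths, dbase3=False):
    """Compare stated lengths with counted ones; return (warnings, header_len, record_len)."""
    warnings = []
    # 32 invariant bytes + 32 per field descriptor + the 0Dh terminator (+ NUL in dBase III).
    header_len = 32 + 32 * len(field_widths) + 1 + (1 if dbase3 else 0)
    # One delete-flag byte followed by the field values.
    record_len = 1 + sum(field_widths)
    if stated_header_len > header_len:
        warnings.append((1113, "Incorrect header length stated in header (too long)"))
    elif stated_header_len < header_len:
        warnings.append((1114, "Header length stated is too small"))
    if stated_record_len != record_len:
        warnings.append((1115, "Incorrect record length stated in header"))
    return warnings, header_len, record_len


# Values from the example file in Appendix I: p = 193, record length 38.
ok = check_header(193, 38, [7, 7, 8, 10, 5])
bad = check_header(193, 74, [7, 7, 8, 10, 5])
```

In both cases the counted lengths are returned and processing can continue with them, which is the behaviour the warning texts describe.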
CTDIF -> dBase Defined Error Conditions
The following list of errors and warnings must be detected by any software operating as a checker or parser which converts CTDIF files into dBase .dbf files or any other database format. Even a pretty-printer should detect these errors.
Some of the size limitations specified are due to limitations of the dBase language implementations and are not inherent in the dBase .dbf file format. Thus it is quite possible to produce .dbf files with more than 255 fieldnames or more than 4000 bytes per record (tuple), or numerics longer than 20 digits (the format allows one byte to describe the length, so 254 digit numbers would be feasible, just as 254 character strings are allowed). dBase would not be able to handle these, but other software which recognised the .dbf file format (such as Borland's Paradox, Microrim's R:Base or one of the many dBase clones) might have no problems. This is why some of the warnings are not classed as fatal errors.
Warnings
| 1101 | Empty file: no fieldnames or values but otherwise correct format |
| 1102 | Repeated tuple, the tuples and sequence numbers should be listed |
| 1103 | numeric not convertible to number of 19 digits or less (including decimal point): truncated with loss of accuracy. |
| 1104 | fieldname too long: truncated. |
| 1105 | A field which contains mostly numeric values but which contains a few (less than 3% or fewer than 3, whichever is the larger) non-numerics. This is to catch typing mistakes in numerics. The implementation should produce a list of the values and the tuples and tuple sequence numbers. |
| 1106 | Greater than 128 fieldnames, the resulting file will only be readable by dBase IV not dBase III+ |
| 1107 | string longer than 254 characters: truncated (later versions could convert such strings to dBase memo fields instead) |
| 1108 | Greater than 255 fieldnames, not translatable even into dBase IV but maybe Paradox could cope. |
| 1109 | Record (tuple) length greater than 4000 bytes |
| 1110 | More than 1e9 tuples in the file (not tested in conformance suite) |
| 1111 | More than 2e9 bytes in the file (not tested in conformance suite) |
| 1112 | numeric too large or too small to be convertible to a number of 19 digits or less, i.e. less than 1.0e-17 or more than 1.0e19 - 1. |
Errors
| 1201 | Number of data values not divisible by the number of field names, or no data items at all. |
| 1202 | missing end tag FIDTC-1. |
| 1203 | fieldnames not distinguishable (after truncation if required). |
| 1205 | Unmatched quotation marks ", the number of them in the file not divisible by 2. |
| 1206 | Missing fieldname list |
Implementation Errors
| 1301 | Internal error in translator, usually occurs because of buffer overflow e.g. if a datafile contained a sequence of 256 separators together and the translator only provided a buffer of 255. |
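A few of these conditions are checked in the Python sketch below, which takes the raw text together with the fieldnames and values already extracted from it. It covers only warnings 1104, 1106 and 1107 and errors 1201, 1203 and 1205; the function name is hypothetical, and folding fieldnames to upper case before comparing them is an assumption based on dBase storing fieldnames in capitals.

```python
def check_ctdif(text, fields, values):
    """Return (warnings, errors) as lists of code numbers for a parsed CTDIF-1 table."""
    warnings, errors = [], []
    if text.count('"') % 2:
        errors.append(1205)                 # unmatched quotation marks
    if any(len(name) > 10 for name in fields):
        warnings.append(1104)               # fieldname too long: truncated
    truncated = [name[:10].upper() for name in fields]
    if len(set(truncated)) != len(truncated):
        errors.append(1203)                 # fieldnames not distinguishable
    if len(fields) > 128:
        warnings.append(1106)               # readable by dBase IV only
    if any(len(value) > 254 for value in values):
        warnings.append(1107)               # string longer than 254 characters
    if not fields or not values or len(values) % len(fields):
        errors.append(1201)                 # values not divisible by fieldnames
    return warnings, errors


# Hypothetical input: two long fieldnames that collide after truncation.
result = check_ctdif(
    'ELONGATION_PCT ELONGATION_MM 1 2 3',
    ["ELONGATION_PCT", "ELONGATION_MM"],
    ["1", "2", "3"],
)
```

A full translator would stop at the first error; the sketch collects all of them so that the behaviour of each test is easy to see.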
Conformance requirements must define two types of behaviour for a list of inputs: for each input it must be defined what must occur and what must not occur. There will always be behaviour which is undefined; the goal of a good conformance system is to ensure that the undefined behaviour is always unimportant.
A standard specification is no use if there is no way to test it. The development of OSI protocols and ANSI C compatible compilers shows that there is no substitute for a comprehensive set of test files which must be processed by the implementation under test to produce a defined output [ISO88]. A set of such test files, together with copies of conforming output, is available from the author. They test all the errors defined above for the dBase -> CTDIF-1 translator but only singly, not in combination, so are not a guarantee of a good implementation, just a passable one. The suite itself has to be under strict version-control and has to be kept precisely in step with any new versions of CTDIF. Maintaining the suite is a professional job that is not inexpensive [Owe89]. Ideally a conformance testing laboratory should produce a log to show that the tests have been properly carried out.
Implementations must be able to accept any amount of text before the beginning and after the end of the CTDIF part of a file (i.e. before and after the bits between "CTDIF-1" and "FIDTC-1"). They must be able to cope with at least 1k separators in a row, at least 255 fields per tuple (as dBase IV) and must be able to truncate long fieldname strings (at least 1K long) cleanly without losing track of the data or missing any quotation marks. These requirements are more to do with the stability of an implementation to cope with errors than they are to do with output speed or with behaviour for correct input. This is intentional. These tests are also included in the conformance suite.
Appendix IV - Existing Translator Implementations
The author has produced a dBase -> CTDIF-1 translator written in the language Icon and developed on MS-DOS machines. Icon has the following advantages for this project:
The Icon language is polymorphic, like Poly and ML, and in some ways resembles AWK, especially in the "associative array" or "table" data structure [Aho88], [Gri83, 86, 88]. Full details are available from The Icon Project, Dept. Computer Science, Gould-Simpson Building, The University of Arizona, Tucson, Arizona 85721. In Europe all documentation and software can be obtained at a nominal charge from Grey Matter Ltd., 4 Prigg Meadow, Ashburton, Devon TQ13 7DF, UK. (+44 1364-53499)
These translators could have been written very easily in dBase language, or in a dBase clone such as dBXL, Quicksilver, Clipper or Foxbase, or (less easily) in C or Pascal, but the purpose of this work is to test the specification for completeness and usability rather than to produce commercial software. dBase and dBXL have the additional disadvantage that anyone using the program would have to buy their own copy of the commercial package.
The current translator does not use the best techniques as exemplified in the best textbooks [Aho86]. If this interchange format is thought to be useful to a significant number of commercial companies, then a professional programmer (or preferably a professional software consultancy) should be retained to rewrite the translators in ANSI C. The translator from CTDIF to other formats is much easier to write than the reverse translator because CTDIF is defined using a fully context-free grammar, which means that a syntax-directed parser can be generated automatically using the lex and yacc software tools [Aho86].
It is intended that a minimal conformance suite and translators for both dBase -> CTDIF-1 and CTDIF-1 -> dBase should be available at the Materials Data Interchange Workshop on Sept. 14/15th 1989 at Rolls Royce in Derby, UK. The job of maintaining the conformance suite should be adopted by a national or international standards laboratory [Owe89].
A very extensive glossary and a list of acronyms are given in the proceedings of the Schluchsee workshop [Wes86].
| CRC | Cyclic Redundancy Check, a single number computed from all the bytes in a file. A CRC using a public algorithm is calculated and sent with a file; the recipient also calculates the CRC and compares it with the one that has been sent. If the file has been corrupted in transit, the two CRCs will, with very high probability, differ. |
| CRV | Common Reference Vocabulary, European Commission project |
| IRDS | Information Resource Dictionary System (ISO IEC JTC1/SC21/WG3) [Gra88] |
| ISO | International Organization for Standardization |
| MDI | Materials Data Interchange |
| OSI | Open Systems Interconnection: 7-layer model (ISO TC97/SC16) |
| SAE | Society of Automotive Engineers (USA) |
| SDIF | SGML (cf) Data Interchange Format, ISO. Communication of SGML composite documents using OSI protocols by prefixing them with a header written in the language ASN.1 |
| SGML | Standard Generalized Markup Language, ISO 8879. |
| SQL | Structured Query Language (ISO) |
| Aho86 | A.V. Aho, R. Sethi and J.D. Ullman, (1986) "Compilers: Principles, Techniques and Tools", Addison-Wesley, ISBN 0-201-10194-7. |
| Aho88 | A.V. Aho, B.W. Kernighan and P.J. Weinberger (1988) "The AWK Programming Language", Addison-Wesley Publ. Co. ISBN 0-201-07981-X |
| Bry88 | M. Bryan (1988) "SGML: An Author's Guide to the Standard Generalized Markup Language", Addison-Wesley Publ. Co. ISBN 0-201-17535-5 |
| Dat83 | C.J. Date, "An Introduction to Database Systems", Fourth Edition, Vols I (1983) and II (1988), Addison-Wesley Publ. Co., Reading, Mass., USA, ISBN 0-201-14474-3. |
| Gra88 | D.J.L. Gradwell, Ed. "Information Resource Dictionary System: Framework", ISO/IEC JTC1/SC21 N2642, and "Data Modelling Facilities" ISO/IEC JTC1/SC21/WG3 N634. Data Dictionary Systems Ltd., 80A High St., Camberley, GU15 3RS, UK. |
| Gri83 | R.E. Griswold and M.T. Griswold "The Icon Programming Language", Prentice-Hall Inc., New Jersey, USA, ISBN 0-13-449777-5. |
| Gri86 | R.E. Griswold and M.T. Griswold "An Icon Tutorial", Byte Magazine, October 1986 167-178. |
| Gri88 | R.E. Griswold (1988) "Programming with Generators", The Computer Journal 31 (3) 220-228. |
| How85 | D.R. Howe (1985) "Data Analysis for Database Design", Part 2: Normalisation and Part 3: Entity-Relationship Modelling, Edward Arnold Publ. Ltd. ISBN 0-7131-3481-X |
| ISO88 | ISO DIS 9646-1 Version 6.10 19 December 1988 "OSI Conformance Testing Methodology and Framework; Part 1: General Concepts" |
| Ken78 | W. Kent, (1978) "Data and Reality: Basic Assumptions in Data Processing Reconsidered", Elsevier North-Holland Inc., New York. ISBN 0-444-85187-9. |
| McC87 | J.L. McCarthy, (1987) "Information System Design for Materials Property Data" in Proc. 1st Intl.Symp. on Computerization and Networking of Materials Property Databases, Philadelphia Nov.2-3 1987 publ.ASTM 1989. |
| MIL88 | MIL-STD-1840A Military Standard: Automated Interchange of Technical Information, 22 Dec.1987 + change notice 20 Dec.1988, American Dept. of Defence. Section 5.1.4.2 p19 and 5.1.4.9 p25. |
| Oly88 | P.L. Olympia, R. Russell Freeland and R. Wallin (1988) "dBase Power: Building and Using Programming Tools", Ashton-Tate Publishers, ISBN 1-55519-021-9. |
| Owe89 | J. Owen, S. Kemmerer and A. Woodal (1989) "Conformance testing Methodology and Framework: General Concepts" ISO TC184/SC4/WG1 Document N355, 21 April 1989. |
| SAE88 | SAE (1988) "Specification for an Automated Interchange of Standards Data", SAE Aerospace Standard AS 4159, issued 12 April 1988 and submitted by SAE to ANSI. |
| Sar88 | P.M. Sargent, (1988) "Materials Data Interchange for Component Manufacture", Cambridge University Engineering Department UK, Technical Report CUED/C-MATLS/TR.147 Sept. 1988. |
| Sar89a | P.M. Sargent, (1989) "Definition Study for the Establishment of Demonstrator Projects in Materials Data Interchange", Contract No.320111 for CEC JRC Petten, June 1988. |
| Sar89b | P.M. Sargent, (1989) "Associativity in Material Property Data", Cambridge University Engineering Department UK, Technical Report CUED/C-MATLS/TR.163 August 1988. |
| Sta88 | E. Stanton, T.E. Kipp Jr., K.J. Meyer (1988) "Final Report: Advanced Materials Database System" report PDA-TR-5167-10-02 for contract F04 606-86-D-0021-0001, CET ALC/MME 87-741-A1-1 Amendment 1, PDA Engineering, 2975 Redhill Avenue, Costa Mesa, California, USA. |
| Wes86 | J.H.Westbrook, H.Behrens, G.Dathe and S.Iwata (1986) Editors, "Material Data Systems for Engineering", Proc. CODATA Workshop, Schluchsee, FRG, 1985. ISBN 3-88127-100-7. |
| Ull82 | J.D. Ullman (1982) "Principles of Database Systems" 2nd Edition, Ch.7 on dependences and lossless decomposition, Pitman Publ. Ltd., London ISBN 0-914894-36-6 |
Re-formatted, MS Word97 (HTML)
PMS August 21, 1997