Internet-Draft                                   M.T. Carrasco Benitez
<draft-carrasco-xdossier-00.txt>                                  EMEA
Expires 29 February 2000                              1 September 1999


                              Xdossier


Status of this memo

This document is an Internet-Draft and is in full conformance with all 
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task 
Force (IETF), its areas, and its working groups. Note that other 
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months 
and may be updated, replaced, or obsoleted by other documents at any 
time.  It is inappropriate to use Internet-Drafts as reference 
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at 
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.


Abstract

This is an informational memo for Xdossier. A Xdossier is a data 
object designed for browsing with web browsers and mappable to XML. It 
is based on a directory structure containing files in several formats.


Table of Contents

1.    Introduction
2.    Rationale
3.    Terminology
4.    Name
5.    Representation
6.    File extension
7.    Format
8.    Character encoding
9.    Web Formats
10.   Directory Index
11.   Root directory
12.   Well-formed and valid Xdossier
13.   Xdossier DTD
13.1. By-Example Xdossier DTD
13.2. Syntactic Xdossier DTD
14.   Self-containness
15.   Compound Xdossier
16.   Mapping
17.   HTML for Index
18.   References
19.   Author
19.1. Disclaimer


1. Introduction

It is recommended to try out a Xdossier example first, as this makes 
the memo easier to understand. Examples are available at 
http://xdossier.com.

This recommendation is about organising files. They are organised into 
a data object called Xdossier.

Informally, a Xdossier is a directory structure with files in several 
formats created for web browsing, either direct ("file:") or served 
("http:").

Classifying files within directories is easy and intuitive. A few HTML 
files with some descriptions and links can greatly help browsing and 
give a feel of "oneness". One can easily start organising from the 
directory structure point of view. By following a few rules, one can 
end up with a data object that is easy to browse and has a significant 
structure.

A directory structure is a tree similar to an XML document. There is a 
strong parallelism:

   directory structure                XML
   -------------------                ---
   root directory                     document element/document entity
   directory                          element
   file                               entity
   directory name                     element name
   file name                          entity reference
   content of XML file                parsed entity
   content of non-XML file            unparsed entity

With a formal mapping to XML, the directory structure could be 
transformed into an XML document.

A strategy could be to start with the (main) "tree" and to progress 
with the organisation towards the content of the individual files (the 
"leaves"): a few files could be XML files at first; eventually the 
whole Xdossier should be transformable into an XML document.

This approach is particularly useful to organise large amounts of 
legacy data in several formats for which there is no clear formal 
definition.


2. Rationale

- Usable with web browsers.

- Easy to "produce" and easy to "consume".

- Usable "as is" and adapted to further processing. For example, a CD-
ROM must be usable directly ("raw" consumption) and programs should be 
capable of mechanical processing to load into a DBMS, web server, etc.

- Easy to prepare with resources (computer equipment, programs, staff, 
etc.) available in most firms or acquirable at low cost. In 
particular, it should be easy to prepare by hand without the need for 
special programs.

- Mappable to XML.

- Vendor independent.

- Usable as an interface to exchange data.


3. Terminology

Terms specific to this memo usually have the first character of each 
token capitalised.

- By-Example Xdossier DTD: A type of Xdossier DTD.

- By-Example DTD: Abbreviation of "By-Example Xdossier DTD".

- Directory Index: File, usually named "index.html", that contains 
links to and information on the files in a particular directory.

- Xdossier: (1) The concept as described in this memo. (2) 
Abbreviation of "Xdossier Instance".

- Xdossier Instance: Analogous to an XML document instance.

- Xdossier Skeleton: A type of Xdossier DTD.

- Xdossier Table of Contents: The Directory Index in the root 
directory. The Xdossier Table of Contents must allow the navigation of 
the whole Xdossier. Typically, there would be links to other Directory 
Indexes.

- Index: Abbreviation of "Directory Index".

- Instance: Abbreviation of "Xdossier Instance".

- Skeleton: Abbreviation of "Xdossier Skeleton".

- Table of Contents: Abbreviation of "Xdossier Table of Contents".


4. Name

Xdossiers that do not conform to this section are "Non-Naming 
Conformant". All Xdossiers must nevertheless conform at least to the 
naming rules in XML [XML].

"Name" is a token composed of the following characters:
- Letters "a" to "z"; i.e., lower case only; [U+0061 to U+007A]. 
- Digits "0" to "9" [U+0030 to U+0039].
- "-" [HYPHEN-MINUS, U+002D].
- "_" [LOW LINE, U+005F].

The notation "U+" refers to the Unicode [UNICODE] notation.

       Correct Names
         part_a
         part-b
         myfile
         hello

       Incorrect Names
         part a       (' ' ; SPACE is not allowed)
         Myfile       (capitals are not allowed)
         myfile.xml   ('.' ; FULL STOP is not allowed)
         hello:html   (':' ; COLON is not allowed)

"Directory Name" is a Name.

"File Name" is one Name followed by zero or more further Names, each 
preceded by a '.' (FULL STOP, U+002E).

       Correct File Names
         a_part
         myfile.html
         hello.en.xml
         hello.en.xml.gz

       Incorrect File Names
         a part       (' ' ; SPACE is not allowed)
         Myfile.html  (capitals are not allowed)
         hello:xml    (':' ; COLON is not allowed)
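
The naming rules above can be checked mechanically. The following is a 
minimal Python sketch, not part of this memo; the function names 
"is_name" and "is_file_name" are illustrative assumptions. It treats a 
File Name as a Name followed by zero or more '.'-separated Names, as 
in the examples above. It checks only the characters listed in this 
section, not the further XML [XML] constraint that a name may not 
begin with a digit or a hyphen.

```python
import re

# A Name: lower-case letters, digits, HYPHEN-MINUS and LOW LINE only.
NAME_RE = re.compile(r"[a-z0-9_-]+")


def is_name(token):
    """Return True if token is a Name as defined in this section."""
    return NAME_RE.fullmatch(token) is not None


def is_file_name(file_name):
    """Return True if every '.'-separated part of file_name is a Name."""
    return all(is_name(part) for part in file_name.split("."))
```

With these definitions, "hello.en.xml.gz" is accepted as a File Name, 
while "Myfile.html" and "hello:xml" are rejected.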

"Document Name" is the first Name in the File Name. For example, 
"docname" in the File Name "docname.ext".

"File extension(s)" is/are the second and following Name(s). For 
example, "ext1", "ext2" and "ext3" in the File Name 
"docname.ext1.ext2.ext3".


5. Representation

The same information could be represented in different fashions. The 
dimensions considered are:

- Language; e.g., English, Spanish.
- Media type; e.g., HTML, PDF.
- Encoding; e.g., gzip, compress.


6. File extension

File extensions are used to indicate representations. For example:
         hello                no extension
         hello.html           format HTML
         hello.en             language English
         hello.gz             compressed using "gzip"
         hello.en.html        English in HTML
         hello.html.gz        HTML, gzipped
         hello.en.gz          English, gzipped
         hello.en.html.gz     English, HTML, gzipped

File extensions, particularly the last one, are operating-system 
dependent:
- Syntax: e.g., DOS allows file extensions of up to three characters.
- Association: which program is associated with the extension.

The extensions should correspond to a widely used mapping between 
Internet Media Types [IMT] and file extensions. The examples above 
work for transparent content negotiation in Apache.
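
As an illustration only, a program could read the representation of a 
file from its extensions along the following lines. This is a minimal 
Python sketch under stated assumptions: the three lookup tables and 
the function name "describe" are hypothetical and are not defined by 
this memo; a real deployment would use its own mapping between 
Internet Media Types [IMT] and file extensions.

```python
# Hypothetical lookup tables, chosen to match the examples above.
LANGUAGES = {"en": "English", "es": "Spanish"}
MEDIA_TYPES = {"html": "text/html", "pdf": "application/pdf"}
ENCODINGS = {"gz": "gzip"}


def describe(file_name):
    """Return the Document Name and representation of a File Name."""
    document_name, *extensions = file_name.split(".")
    representation = {
        "document": document_name,
        "language": None,
        "media_type": None,
        "encoding": None,
    }
    for extension in extensions:
        if extension in LANGUAGES:
            representation["language"] = LANGUAGES[extension]
        elif extension in MEDIA_TYPES:
            representation["media_type"] = MEDIA_TYPES[extension]
        elif extension in ENCODINGS:
            representation["encoding"] = ENCODINGS[extension]
        # Extensions absent from all three tables are ignored here.
    return representation
```

For "hello.en.html.gz", the sketch reports the document "hello" in 
English, as text/html, encoded with gzip. For "hello", all three 
dimensions are left undetermined.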

Note the difference between "file" and "document". File refers to 
physical storage; e.g., "mydoc.txt" is a file. Document refers to 
content; e.g., "mydoc" is a document represented in the files 
"mydoc.txt" and "mydoc.html", which contain the same document in 
different formats.

Another memo should address the syntax for file extensions.


7. Format

Priority should be given to file formats (media types) with a good 
chance of being readable "forever"; e.g., in 50 years. This points to 
"neutral" formats: formal standard, industrial standard, vendor 
independent, "text-like", etc.

One should not discard proprietary formats, as they could be the 
"source" format; i.e., the format in which the data was originally 
produced. Often, information is lost in format transformation. The 
recommendation is to include:

- The source format.

- At least one neutral format.

- An indication of the method used in the format transformations; 
e.g., source format saved as HTML using the "Save as" facility of a 
given application.

The file formats in order of preference are:

- Text: XHTML, HTML, XML, text, RTF and PDF.
- Graphic: JPEG, GIF and TIFF.

Other formats could also be included. They should be widely used 
formats.

[Relation to Xdossier DTD]: It could include a list of accepted 
formats in order of preference and a different mapping between 
Internet Media Types and file extensions.


8. Character encoding

The character encodings ("charsets"), in order of preference, are:

- Unicode UTF-8, Unicode 16 bits [ISO10646].

- ISO-8859-1 (Latin-1) or the appropriate ISO-8859-x; e.g., ISO-8859-7 
for Greek. 

Other character encodings could also be used. They should be widely 
used character encodings.

[Relation to Xdossier DTD]: It could include a list of accepted 
character encodings in order of preference.


9. Web Formats

These are file formats well adapted to the web and widely supported in 
browsers. Corollary: a format that is very good for the web but not 
widely supported in browsers is not a Web Format.

Web Format is a fuzzy, moving definition. It is also "community 
dependent"; e.g., one community could consider XML a Web Format and 
another community could consider that it is not.

By default, the only Web Format is HTML.

[Relation to Xdossier DTD]: It could redefine the list of Web Formats.


10. Directory Index

Directory Index, abbreviated to Index, is a document in Web Format 
included in each directory.

There is a close association among the directory, the Index in the 
directory and the file(s). In this memo, each is referred to as "his" 
with respect to the others. For example, "Index and his directory and 
files".

Index should fulfil a dual function:
- Browsing (informal view).
- Metadata (formal view).

For browsing, Index should have a description of his directory and 
meaningful labels with links to at least his file(s). It could also 
contain links to other files/resources. Links to files within a 
Xdossier must be relative. Whenever possible, links within a Xdossier 
should point to an Index rather than a file.
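
The requirement for relative links lends itself to a mechanical check. 
The following minimal Python sketch, which is an illustration and not 
part of this memo, lists the links in an HTML Index that are not 
relative. The names "LinkCollector", "is_relative" and 
"non_relative_links" are assumptions. A non-relative link is not 
necessarily an error, since it may point to a resource outside the 
Xdossier; the result is a list for human review.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href values of all A elements in an HTML Index."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def is_relative(href):
    """Return True if href has no scheme, no host and no leading '/'."""
    parts = urlsplit(href)
    return (not parts.scheme and not parts.netloc
            and not parts.path.startswith("/"))


def non_relative_links(index_html):
    """Return the links in index_html that are not relative."""
    collector = LinkCollector()
    collector.feed(index_html)
    return [href for href in collector.hrefs if not is_relative(href)]
```

A link such as "../doc.html" passes the check, whereas 
"http://w3.org" and "/food/index.html" are reported.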

For Metadata, Index should contain the metadata of his directory and 
files. The metadata should be machine processable.

The browsing and metadata are functions. Syntactically, they could be 
interwoven.

Syntactically, there are two types of Indexes:
- Informal Index: it does not follow any particular syntax.
- Formal Index: it follows a syntax.

If Index is not present, the File Names in the directory should be 
meaningful.

The default Document Name for Index is "index" and the default format 
is the default Web Format. Hence, at present the default File Name for 
Index is "index.html". 

Another memo should address the syntax for Formal Index. 

[Relation to Xdossier DTD]: It could redefine the default Index name.


11. Root directory

The root directory must contain exactly one file, the Index, and zero 
or more directories. Corollary: The trivial Xdossier is composed of 
one Index.

The intention for allowing only one file in the root directory is to 
make it obvious that the file present is the Table of Contents, as the 
other elements must be directories.
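
The rule for the root directory is simple enough to verify by program. 
The following minimal Python sketch is an illustration, not part of 
this memo; the function name "check_root" and the wording of the 
messages are assumptions, and the Index is assumed to have the default 
File Name "index.html" unless another name is passed.

```python
from pathlib import Path


def check_root(root, index_name="index.html"):
    """Return a list of problems in the root directory (empty if none)."""
    files = [entry.name for entry in Path(root).iterdir()
             if entry.is_file()]
    problems = []
    if index_name not in files:
        problems.append("missing Index: " + index_name)
    for name in sorted(files):
        if name != index_name:
            # Only the Index may be a file; the rest must be directories.
            problems.append("unexpected file in root directory: " + name)
    return problems
```

An empty list means that the root directory contains the Index as its 
only file; any number of directories is accepted.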


12. Well-formed and valid Xdossier

A Xdossier is well-formed when it follows the recommendations in this 
memo. In particular, it does not need a Xdossier DTD.

A Xdossier is valid when it is well-formed and in addition follows the 
restrictions in a Xdossier DTD (By-Example DTD or Syntactic DTD).


13. Xdossier DTD

A Xdossier DTD is needed for valid Xdossiers.

There are two types of Xdossier DTDs:

- "By-Example".

- "Syntactic".


13.1. By-Example Xdossier DTD

A Xdossier Instance could be a Xdossier DTD just by declaration that 
it is a Xdossier DTD; i.e., follow "this example". Probably, some 
aspects would be fuzzy.

More realistically, a By-Example DTD should be an "Xdossier Skeleton"; 
i.e., a purpose-built example. Typically, the presence of a file in 
the Skeleton means that it must be present in the instantiations with 
the same name and format. Additional instructions should be in the 
Indexes; e.g., "such a file is optional".

People with limited knowledge of computers could create By-Example 
DTDs, as the approach is intuitive. Probably, the path would be to 
create a well-formed Instance and then to proceed with the creation of 
a Skeleton. 

As the approach does not have a fixed syntax, it is not intended for 
full mechanical validation by computer. Some parts would have to be 
validated by humans, though parts that follow a syntax could be 
validated mechanically. For example, the content model of the 
files/directories could be defined as: 

- DTD: an XML DTD.

- Pairs of values: for example, a list of pairs of values like
  "/food/choco/index.html=Documents about chocolate"

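A list of pairs of values of the kind shown above could be validated 
mechanically. The following minimal Python sketch is an illustration 
only: the memo does not define the syntax of such a list, so the 
one-pair-per-line layout, the '=' separator and the function names 
"parse_pairs" and "missing_paths" are all assumptions.

```python
def parse_pairs(text):
    """Parse lines of the form 'path=description' into a dictionary."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are skipped
        path, separator, description = line.partition("=")
        if not separator:
            raise ValueError("not a pair of values: " + line)
        pairs[path] = description
    return pairs


def missing_paths(pairs, instance_paths):
    """Return the listed paths that are absent from the Instance."""
    return sorted(set(pairs) - set(instance_paths))
```

Given the paths found in an Instance, "missing_paths" reports which 
files named in the list are absent; the remaining aspects of the 
By-Example DTD would still be validated by humans.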

13.2. Syntactic Xdossier DTD

This is needed to implement computer programs that could do full 
mechanical validation of Xdossiers.

Another memo should address the syntax for Syntactic Xdossier DTD. 


14. Self-containness

There are three levels:

- Absolute Xdossier: When all the resources are in the Xdossier.

- Self-Contained Xdossier: When all "Essential Resources" are in the 
Xdossier. For example, the CSS is in the Xdossier, though there could 
be secondary references to other resources such as a reference to the 
W3C site at http://w3.org. At least this level should be attained. 

- Fragment Xdossier: When at least one "Essential Resource" is not in 
the Xdossier. For example, the CSS is not in the Xdossier and it 
relies on an external CSS such as the one in the W3C site at 
http://www.w3.org/StyleSheets/Core/. It is only recommended as a 
directory of a Xdossier. Otherwise, there should be an agreement 
between producers and consumers of the Xdossier.

Essential Resources are the ones needed for navigation and display.

[Relation to Xdossier DTD]: It could include the minimal level of 
Self-containness requested and a re-definition of the Essential 
Resources.


15. Compound Xdossier

It is a Xdossier where all the directories in the root directory are 
Xdossiers themselves. These directories could also be Compound 
Xdossiers and so on.

[Relation to Xdossier DTD]: It could require a Compound Xdossier.


16. Mapping

Mapping is for transforming between Xdossier and XML. Xdossier should 
be transformable into XML.

Mapping:
 directory      <-> element
 directory name <-> element name
 root directory <-> document element
 Index          <-> attributes (for his directory and files)
 file           <-> entity
 file name      <-> entity name
 file content   <-> entity reference
 XML file       <-> parsed entity
 Non-XML file   <-> unparsed entity
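
The directory part of this mapping can be sketched in a few lines of 
Python. The sketch is an illustration under stated assumptions, not a 
definition of the mapping: the function names are hypothetical, the 
Index is skipped because its mapping to attributes depends on a Formal 
Index syntax that this memo does not define, and each other file is 
recorded as a placeholder element named "xdossier-file" because the 
sketch does not generate the entity declarations that the mapping 
calls for.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def directory_to_element(directory, index_name="index.html"):
    """Map a directory and, recursively, its directories to elements."""
    directory = Path(directory)
    # directory name <-> element name
    element = ET.Element(directory.name)
    for entry in sorted(directory.iterdir()):
        if entry.is_dir():
            # directory <-> element
            element.append(directory_to_element(entry, index_name))
        elif entry.name != index_name:
            # Assumption: a placeholder element stands in for the
            # entity that the file would map to.
            ET.SubElement(element, "xdossier-file", name=entry.name)
    return element


def xdossier_to_xml(root_directory):
    """Serialise a Xdossier; the root directory is the document element."""
    element = directory_to_element(root_directory)
    return ET.tostring(element, encoding="unicode")
```

Directory Names must be valid XML names for the result to be 
well-formed XML; the sketch does not check this.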


17. HTML for Index

The HTML Indexes should follow these indications: 

- Simple mainstream HTML; i.e., facilities easy to write and that work 
in most browsers.

- XHTML [XHTML] approach. For example, well-formed documents, 
separation of content from presentation (e.g., CSS [CSS2]), etc.

- It is recommended to use one CSS for all the Indexes.

- A reasonable presentation with the most popular browsers (e.g., 
Internet Explorer, Navigator, etc.) and text-only browsers (e.g., 
Lynx). 

- Links that work when read directly (e.g., a CD-ROM inserted into a 
PC) or served by an HTTP server; i.e., "file:" or "http:".

- Links that point directly to files (e.g., "food/index.html" rather 
than "food/"), except when the intention is to show the content of the 
directory. One should not assume that a Xdossier will be served by a 
server; i.e., it should work directly ("file:") or served ("http:").

- No frames, scripts (e.g., JavaScript) or Java applets. 

- Images (IMG) with alternative texts. 

- Relative links within the Xdossier; e.g. href="../doc.html".
   
- Use language attributes (lang, xml:lang, etc), to indicate the 
language of the text. 

[Relation to Xdossier DTD]: It could change the HTML indications. In 
particular, it could include a CSS for the Indexes.


18. References

[ALLEN] Package or Perish. Terry Allen
Pages 385-390 in SGML/XML '97 Conference Proceedings. SGML/XML '97.

[CML] Chemical Markup Language
http://www.venus.co.uk/omf/

[CSS2] Cascading Style Sheets, level 2
http://www.w3.org/TR/REC-CSS2

[DC] Dublin Core
http://purl.org/dc

[HTML] HTML 4.0 Specification
http://www.w3.org/TR/REC-html40

[IMT] Internet Media Types
http://www.isi.edu/in-notes/iana/assignments/media-types/media-types

[ISO10646] Information Technology -- Universal Multiple-Octet Coded 
Character Set (UCS) -- Part 1: Architecture and Basic Multilingual 
Plane, ISO/IEC 10646-1:1993

[MHTML] The MIME Multipart/Related Content-type. E. Levinson
ftp://ftp.ietf.org/rfc/rfc2387.txt

[RDF] Resource Description Framework Model and Syntax Specification
http://www.w3.org/TR/REC-rdf-syntax 

[SCHEMA1] XML Schema Part 1: Structures ("work in progress")
http://www.w3.org/TR/xmlschema-1/

[UNICODE] Unicode Consortium
http://www.unicode.org 

[XDOSSIER-TRAN] Xdossier Transport ("work in progress")
http://xdossier.com

[XHTML] XHTML 1.0: The Extensible HyperText Markup Language ("work in 
progress")
http://www.w3.org/TR/WD-html-in-xml 

[XML] Extensible Markup Language (XML) 1.0
http://www.w3.org/TR/rec-xml 

[XSL] Extensible Stylesheet Language Specification ("work in 
progress")
http://www.w3.org/TR/WD-xsl


19. Author

Manuel Tomas CARRASCO BENITEZ
The European Agency for the Evaluation of Medicinal Products
7 Westferry Circus
Canary Wharf
London E14 4HB
U.K. 

Telephone +44 171 418 86 45

carrasco@dragoman.org
http://dragoman.org/carrasco


19.1. Disclaimer

This memo represents only the view of the author.