Visual Net Server

Content Entity Reference:


wordentity

Attributes
Specific
  • flat
  • shallow
  • stylestoelements
  • encode
The wordentity translates MS Word documents into XML for further processing in a Content structure. The result can then be presented, e.g. as HTML, or transformed further.

This content entity can thus be used to integrate the contents from MS Word documents with other XML-structures, or to present those documents in different formats and media.


<SYNTAX>
  <elementName wordentity="'yes'" url="filename" flat="'yes'|'no'" shallow="'yes'|'no'" stylestoelements="'yes'|'no'" encode="'yes'|'no'"/>
</SYNTAX>


The following attributes can be used to control the transformation:
  • url - file name of a Word document
  • flat (optional) ="yes": specifies that the content model should be a flat, linear sequence of paragraphs. The value "no" means that the content model follows the Word outline structure, resulting in a nested paragraph structure. Default value: "no".
  • shallow (optional) ="yes": specifies that only mark-up on paragraph level should be transformed, leaving out any formatting inside text contents, such as italics and bold. Default value: "no".
  • stylestoelements - (optional) can be used to specify that the names of the style elements should be used to name paragraph elements.
  • encode - (optional) is used to specify that reserved XML characters (<, >, &) appearing in the Word source should be encoded in the output XML.
The basic transformation results in a content entity with a content model corresponding to the Word outline of the source document and with all the Word style information kept.

The generated XML is based on an XML vocabulary with element types for:
  • The document structure: paragraphs, text, tables
  • Formatting elements: italics, bold, list items
  • Links, images and footnotes
  • Style definitions
  • Metadata derived from the Word document, including date created, author, and the number of words, lines and characters.
The document element has the following definition,
	
	
where the DOC_INFO and STYLE elements encode the document's metadata and style definitions, and the PARAGRAPH elements represent the structure and contents using the additional mark-up.


Notes:

  • The difference between the two content models, outline and linear (flat), is reflected in the structure of the PARAGRAPH element.

    In the outline model, the document contains one or more PARAGRAPH elements containing other PARAGRAPH or TEXT elements nested, where the TEXT elements contain the character data or other mark-up,
    	
    	
    			
    	
    		
    In the flat model, on the other hand, the document is a sequence of PARAGRAPH elements containing only character data or other mark-up,
    	
    	
    
  • The flat content model is useful when the document's structure is less important for the transformation of a wordentity, e.g. because it makes the XSL stylesheets less complex. The flat model also requires less processing time.
  • The XML elements are described in this DTD (outline model). This DTD is only provided for reference purposes and is not required for creating or transforming wordentities.
  • The metadata elements generated can be transformed to standardized metadata descriptions to increase interoperability with other web applications. For descriptions of such common metadata elements, see the Dublin Core element set.
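As a rough illustration of the last note, the sketch below reads the DOC_INFO element of a wordentity result and renames a few of its children to Dublin Core element names. It runs outside VNS in plain Python, and the pairing in DC_MAP (for example COMPANY to publisher) is an assumption made for this example, not a mapping defined by VNS.

```python
import xml.etree.ElementTree as ET

# Assumed pairing of DOC_INFO children with Dublin Core element names.
# Illustrative only; VNS does not define this mapping.
DC_MAP = {
    "TITLE": "title",
    "AUTHOR": "creator",
    "COMPANY": "publisher",
    "CREATED": "date",
}

def doc_info_to_dublin_core(officedoc_xml):
    """Return a dict of Dublin Core names and values taken from DOC_INFO."""
    root = ET.fromstring(officedoc_xml)
    info = root.find("DOC_INFO")
    result = {}
    if info is None:
        return result
    for child in info:
        dc_name = DC_MAP.get(child.tag)
        # Skip elements without a mapping (e.g. WORDS) and empty elements.
        if dc_name and child.text:
            result[dc_name] = child.text.strip()
    return result

# Sample input modelled on the DOC_INFO output shown in this reference.
SAMPLE = """<OFFICEDOC sourcetype="Word document" sourcename="wrdentity.doc">
  <DOC_INFO>
    <TITLE>REFERENCE</TITLE>
    <AUTHOR>Staff</AUTHOR>
    <COMPANY>CNet Svenska AB</COMPANY>
    <CREATED>2001-04-05 14:05:00</CREATED>
    <WORDS>68</WORDS>
  </DOC_INFO>
</OFFICEDOC>"""

if __name__ == "__main__":
    print(doc_info_to_dublin_core(SAMPLE))
```

Elements without a Dublin Core counterpart in the table, such as WORDS, are simply left out of the result.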





  • Example: Word Entity

    In this example a fragment of a Word document is translated using the flat mark-up model. Below is an HTML mimic of the input document.

    REFERENCE

    NAME

    wordentity

    EXPLANATION

    The wordentity provides a straightforward way to transform the contents in MS Word documents into XML. The result can then be presented e.g, as HTML, or further transformed and integrated with other content entities.

    The basic transformation of a Word document results in a content enity with element mark-up corresponding to the Word outline of the source document and with Word style information kept.


    Here is a content entity declaration with the document name in the url attribute,

    <CONTENTSTRUCTURE>
      <MANUALSECTION wordentity="yes" url="wrdentity.doc" shallow="no" flat="yes" altsource="exceptions.xml"/>
    </CONTENTSTRUCTURE>

    Executing this entity results in the following XML:

    <CONTENTSTRUCTURE>
      <MANUALSECTION wordentity="yes" url="wrdentity.doc" shallow="no" flat="yes" altsource="exceptions.xml">
         <OFFICEDOC sourcetype="Word document" sourcename="wrdentity.doc">
            <DOC_INFO>
               <TITLE>REFERENCE</TITLE>
               <AUTHOR>Staff</AUTHOR>
               <COMPANY>CNet Svenska AB</COMPANY>
               <CREATED>2001-04-05 14:05:00</CREATED>
               <REVISION>3</REVISION>
               <LASTSAVED>2001-05-01 13:24:00</LASTSAVED>
               <PAGES>1</PAGES>
               <WORDS>68</WORDS>
               <CHARACTERS>374</CHARACTERS>
            </DOC_INFO>
            <PARAGRAPH style="Rubrik_1" level="1">REFERENCE</PARAGRAPH>
            <PARAGRAPH style="Rubrik_2" level="2">NAME</PARAGRAPH>
            <PARAGRAPH style="Normal" level="10">wordentity</PARAGRAPH>
            <PARAGRAPH style="Rubrik_2" level="2">EXPLANATION</PARAGRAPH>
            <PARAGRAPH style="Normal" level="10">The
               <ITALIC>wordentity</ITALIC> provides a straightforward way to transform the contents in MS Word documents into XML. The result can then be presented e.g, as HTML, or further transformed and integrated with other content entities.
            </PARAGRAPH>
            <PARAGRAPH style="Normal" level="10">The basic transformation of a Word document results in a content enity with element mark-up corresponding to the Word outline of the source document and with Word style information kept.</PARAGRAPH>
            <STYLE name="Rubrik_1" base="Normal" face="Arial" size="16" bold="bold" italic="normal"/>
            <STYLE name="Rubrik_2" base="Normal" face="Arial" size="14" bold="bold" italic="italic"/>
            <STYLE name="Normal" base="" face="Times New Roman" size="12" bold="normal" italic="normal"/>
         </OFFICEDOC>
      </MANUALSECTION>
    </CONTENTSTRUCTURE>
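To show how little logic the flat model needs downstream, here is a minimal sketch in plain Python (not part of VNS) that turns a flat sequence of PARAGRAPH elements into HTML. It assumes that a level below 10 marks a heading and that level 10 marks body text, as the example output suggests; the function name and the body_level parameter are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

def flat_paragraphs_to_html(officedoc_xml, body_level=10):
    """Render flat PARAGRAPH elements as HTML headings and paragraphs."""
    root = ET.fromstring(officedoc_xml)
    lines = []
    for para in root.findall("PARAGRAPH"):
        level = int(para.get("level", body_level))
        # itertext() also collects text inside inline mark-up such as ITALIC;
        # split/join collapses line breaks and runs of whitespace.
        text = " ".join("".join(para.itertext()).split())
        if level >= body_level:
            tag = "p"
        else:
            # HTML only has h1..h6, so deeper levels are clamped (assumption).
            tag = "h%d" % max(1, min(level, 6))
        lines.append("<%s>%s</%s>" % (tag, escape(text), tag))
    return "\n".join(lines)

# Shortened sample modelled on the output above.
SAMPLE = """<OFFICEDOC>
  <PARAGRAPH style="Rubrik_1" level="1">REFERENCE</PARAGRAPH>
  <PARAGRAPH style="Rubrik_2" level="2">NAME</PARAGRAPH>
  <PARAGRAPH style="Normal" level="10">The
     <ITALIC>wordentity</ITALIC> provides XML.
  </PARAGRAPH>
</OFFICEDOC>"""

if __name__ == "__main__":
    print(flat_paragraphs_to_html(SAMPLE))
```

Because every paragraph is a direct child of OFFICEDOC in the flat model, a single loop is enough; the outline model would need a recursive walk instead.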




    Frequently Asked Questions
    GIF images from my Word documents look too small in my web pages when I have used a wordentity.
    wordentity outputs the width and height of the image as the attributes width and height on an IMG tag. These are coordinates in Word. Try multiplying them by a factor of 1.3-1.4. Another solution is to reduce the width and height of the original image in an imaging program.
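The scaling suggested in this answer is simple arithmetic. The helper below is a hypothetical sketch in plain Python, outside VNS; the default factor of 1.35 is just the midpoint of the suggested 1.3-1.4 range, not a value prescribed by the product.

```python
def scale_image_size(width, height, factor=1.35):
    """Scale Word-coordinate image dimensions by a constant factor.

    The factor is an assumption: any value in the suggested 1.3-1.4
    range can be tried until the image looks right in the browser.
    """
    return round(width * factor), round(height * factor)

if __name__ == "__main__":
    # Hypothetical IMG dimensions as reported by wordentity.
    print(scale_image_size(100, 50, factor=1.3))
```

In a stylesheet, the same multiplication would be applied to the width and height attributes of each IMG element before the page is written out.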
    I am editing a Word document and have saved it, but when I then run wordentity it generates an error message or hangs.
    The wordentity requires your document to be closed before it can perform an operation on it.
    How can I make wordentity execute faster?
    When you use the attributes shallow="no" and flat="no", wordentity performs an in-depth analysis of your Word document, which is time-consuming. On the other hand, you get a detailed mark-up of your document. If speed is an issue, try shallow="yes" and flat="yes" instead. If that is not enough, you can instead try using ="html". Then wordentity works on the HTML version instead of the Word document, which is considerably faster.
    I am using a wordentity to parse Word documents, but I am not interested in creating mark-up, only in the text of the document.
    If you use insertas="text", VNS will skip parsing the Word document for style names and instead simply include the text of the document. insertas="textnl" does the same but also preserves newlines.


    Known Bugs
    A parsing error occurs in wordentity when my document's title or other properties contain an ampersand (&) character.
    This is a confirmed bug. VNS encodes & in the document text but not in the property fields.
    Workaround:  Encode your document properties manually, i.e. change & to &amp;
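When many documents are affected, the encoded property values can be worked out in bulk before they are typed or pasted into Word. The following is a minimal sketch in plain Python, outside VNS; the function name is hypothetical, and the pattern leaves ampersands alone when they already start an entity or character reference.

```python
import re

# Matches "&" unless it already begins an entity such as &amp; or a
# character reference such as &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def encode_ampersands(value):
    """Replace bare ampersands in a property value with &amp;."""
    return BARE_AMP.sub("&amp;", value)

if __name__ == "__main__":
    # Hypothetical document title.
    print(encode_ampersands("Research & Development"))
```

Running the function twice on the same value is harmless, since an already encoded ampersand is not matched again.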
    A parsing error occurs in wordentity when using stylestoelements and my style name contains spaces.
    This is a confirmed bug. VNS does not treat style names containing spaces correctly in the stylestoelements attribute.
    Workaround:  Change your Word style names so that they don't contain spaces.
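A small helper can suggest replacement names when several styles need renaming. This is a hypothetical sketch in plain Python; replacing spaces with underscores is an assumption inspired by style names such as Rubrik_1 in the example output, and any space-free name would do.

```python
def safe_style_name(name):
    """Return a style name without spaces, joining the words with underscores."""
    # split() without arguments also drops leading, trailing and repeated spaces.
    return "_".join(name.split())

if __name__ == "__main__":
    # Hypothetical Word style names.
    for style in ["Heading 1", "Body Text Indent", "Normal"]:
        print(style, "->", safe_style_name(style))
```

The renaming itself still has to be done in Word; the helper only proposes the new names.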



    (c) 2008 CNet, all rights reserved