The xhtmlhook Import Hook

Authors

Implementation: David Boddie <david@boddie.org.uk>
Initial concept: Paul Boddie <paul@boddie.org.uk>
Other contributors:

Latest change: 2005-06-19

__version__ = 1.12

WWW: http://www.boddie.org.uk/python/xhtmlhook/

Download: http://www.boddie.org.uk/python/downloads/xhtmlhook-1.12.zip

Abstract

The xhtmlhook import hook was written to allow Python source code to be included in XHTML documents using a particular class of preformatted text. The underlying mechanics of this includes modifications to the mechanism that the Python interpreter uses to import modules through the ihooks module and use of the xml.dom.minidom module to obtain the code included within documents. Modules can now be written, with some effort, in a web browser which supports editing, although a method for enabling Python to run such code as scripts is not yet in place.

Introduction

The authors appreciate good documentation when they encounter it. However, it is often necessary to rely on documentation generated from comments included in source code. Despite claims that, "The source code is the documentation," by proponents of various software engineering methodologies and language evangelists, such generated documentation often only provides cursory descriptions of the interfaces exposed by libraries and other resources. Learning how to use such resources often requires the developer to inspect the source code itself before tentatively trying various approaches to use within an interactive session.

We decided that we would like to see better documented code included within web pages for convenient browsing. The motivation behind this peculiar aim is to be able to include high quality documentation alongside working code, hopefully making it easier for programmers to produce more maintainable, readable programs. With easy-to-use editing facilities included with web browsers such as Amaya this aim is within reach.

There are a number of steps required to enable the Python interpreter to import code embedded within webpages:

  1. The file extension used for such documents needs to be registered so that the general methods for determining the types of file which can contain source code will include this type of file. Therefore, ".html" will be included as a recognised suffix for source code rather than any other type of Python code, such as bytecode, for example.
    This is achieved by subclassing the Hooks class in the ihooks module.
  2. Although the method used to search for modules on a given path will not need to be modified in order to support the importing of source code in XHTML files, the use of Uniform Resource Locators (URLs) in the paths to be searched requires that the need to be modified in some way so that such files are located using the urlopen function from the urllib2 module.
    This is achieved by subclassing the ModuleLoader class in the ihooks module and reimplementing the find_module method. If URL support is omitted, the find_module class need not be reimplemented.
  3. The XHTML documents need to be intercepted before their contents are compiled to bytecode by the interpreter and the included code converted to a suitable form. The approach taken should not affect the import of existing file types. Support for remote modules must be included where appropriate.
    This is achieved by subclassing the load_module method in the ModuleLoader class in the ihooks module. Although attempts are made to minimise disruption to the import process used by the base class, it is necessary to override the import process both for the case of XHTML documents (locally and remotely stored) and for all file types when remotely stored.
  4. The subclasses must be instantiated and registered through with a subclass of the ModuleImporter class from the ihooks module. This subclass modifies the behaviour of the import_it method to treat XHTML documents as packages when they contain multiple submodules. This instance itself is registered with the import hooks mechanism through a call to its install method.

The following section presents the source code used to implement the module, including comments and docstrings where appropriate. This code is used by the xhtml2py.py script to generate the module when the setup.py script is run; the functions used to extract the Python source code are taken from methods of the subclassed ModuleLoader class. Hence, the module can import itself in its original form.

The module's implementation

This section contains an implementation of the module within an XHTML document. This is used as a base for implementing more advanced features and better documentation.

Importing modules

Let us begin by importing the modules which will allow us to modify the Python interpreter's module loading behaviour. The ihooks module provides support for changing this behaviour; the os and string modules will be generally useful during the loading process


import ihooks, os, string

The imp module contains the PY_SOURCE and PKG_DIRECTORY constants which is used when registering new import hooks to handle the ".html" suffix associated with XHTML files.


from imp import PY_SOURCE, PKG_DIRECTORY, C_BUILTIN

In order to find the preformatted text in XHTML documents, we will use the built in minidom module:


from xml.dom import minidom

The imp module will, for now, be used to create modules explicitly. Ideally, this would be done through the load_module method of the Hooks object from the ihooks module.


import imp

Finally, the urllib2 module will be used to retrieve modules from locations other than in the local filesystem.


import urllib2, urlparse from urllib2 import URLError, HTTPError

The XHTMLHooks class

The ihooks module provides a Hooks class suitable for subclassing which determines which types of files can be imported as modules into Python. To extend the importing behaviour, we need to provide a subclass which will later be registered with the ihooks module through instances of another class we will create.

class XHTMLHooks(ihooks.Hooks):

NewHooks(ihooks.Hooks)

Subclass adding a new filename suffix to those already recognised by the standard library.

    
    def get_suffixes(self):
    
        return [(os.extsep+'html', 'r', PY_SOURCE)] + ihooks.Hooks.get_suffixes(self)
    

Add another method to return the suffixes to be searched for on remote servers, possibly using a different suffix separator character to the local platform.

    
    def get_url_suffixes(self):
    
        suffixes = self.get_suffixes()
        
        l = len(os.extsep)
        
        url_suffixes = []
        
        for suffix, mode, datatype in suffixes:
        
            c = 0
            url_suffix = ""
            
            while 1:
            
                at = string.find(suffix, os.extsep, c)
                
                if at == -1: break
                
                url_suffix = url_suffix + suffix[c:at] + "."
                
                c = at + l
            
            if c < len(suffix):
            
                url_suffix = url_suffix + suffix[c:]
            
            url_suffixes.append((url_suffix, mode, datatype))
        
        return url_suffixes
    

Note that the subclass extends the list of suffixes understood by the imp module by adding the ".html" suffix to the list, indicating that files of this type should be opened using the "r" (read) mode, and specifies that the contents will be Python source code.

This class in itself would allow Python to import XHTML files if it were registered with an instance of the ModuleImporter class from the ihooks module. However, the interpreter would fail to parse the content of such files.

The XHTMLLoader class

To perform the actual loading of the code contained within XHTML files, we create a class based on the ModuleLoader from theihooks module. This will be instantiated with an XHTMLHooks object and passed as an argument to the constructor of a ModuleImporter class from the ihooks module.

class XHTMLLoader(ihooks.ModuleLoader):

Initially, we define a method for properly indenting docstrings.

    
    def _indent_docstring(self, text, cmtuple):
    

list = _indent_docstring(self, text, docstring)

Returned an indented docstring based on the indentation used in the text which it precedes.

    
        line = 0
        indentation = 0
        
        while line < len(text):
            
            this_line = text[line]
            
            spaces = this_line.count(u' ')
            
            if spaces < len(this_line):
            

Not all the characters on the line are spaces.

            
                while indentation <= spaces:
                
                    if this_line[indentation] != u' ':
                    
                        break
                    
                    indentation = indentation + 1
                
                break
            

Try the next line.

            
            line = line + 1
        

Wrap the text found to fit within lines of the required width.

        
        leftover = u''
        

Add any previous text left over from preceding lines and try to split the line at a space.

        
        cmtype, cmtext = cmtuple
        
        if cmtype == "docstring":
        
            cmwidth = 78 - indentation
        
        else:
        
            cmwidth = 76 - indentation
        
        new_text = []
        leftover = u''
        n = 0
        
        while n < len(cmtext) or leftover != u'':
        
            if n < len(cmtext) and leftover == u'':
            

Only read another line if the previous line(s) have been written.

            
                line = cmtext[n]
                n = n + 1
            
            else:
            
                line = leftover
        

Wrap the text to the desired line width.

            
            at = len(line)
            
            while at >= cmwidth:
            
                at = line.rfind(u' ', 0, at)
            

If no spaces are found then keep the whole line together.

            
            if at == -1: at = len(line)
            
            line, leftover = line[:at], line[at + 1:]
            

The comment line contained more than just whitespace so include a leading comment character and a space ("# ").

            
            if cmtype == "comment" and line.count(u' ') < len(line):
            
                line = u'# ' + line
            

Add the new line to the list of lines to write to the output list.

            
            new_text.append((indentation * u' ') + line)
        

The above loop should catch any left over text so we should be able to just return the list of new lines to the calling method.

        
        return new_text
    

One of the clearest ways to format source code in an XHTML document is to leave lines containing the appropriate level of indentation before and after blocks of code. This unfortunately creates duplicate lines of whitespace in the extracted source code. We introduce a method to remove a line from each pair of adjacent duplicate lines which contain only whitespace.

Introduce a method for removing single lines from paris of duplicate adjacent lines of whitespace.

    
    def _reduce_empty(self, text):
    

list = _reduce_empty(self, text)

Try to remove duplicate consecutive lines which only consist of spaces.

Take the first item and place it in a new list. Record the last item added.

    
        new = text[:1]
        last = text[:1]
        

Examine the items in the rest of the list.

        
        for item in text[1:]:
        
            if last is None:
            

If there are no restrictions on the next item in the list then add a new item to the list and update the last item variable.

            
                new.append(item)
                last = item
            
            elif item == last:
            

If the item is the same as the last item added to the new list then check whether it is just made up of spaces.

            
                if item.count(u' ') < len(item):
                

A non-space character was found so add this item to the new list.

                
                    new.append(item)
            

If all the characters were spaces then nothing is added to the new list and the last item remains the same.

            
            else:
            

The last item added to the new list and the current item in this list are not the same. Add the item to the new list and update the last item variable.

            
                new.append(item)
                last = item
        

Return the list of new items.

        
        return new
    

A method will be required to extract preformatted text of the relevant class. This descends an instance of the xml.dom.minidom.Document class, looking for nodes corresponding to preformatted text of class "Python" and collecting the contents of the text nodes found beneath. Although the getElementsByTagName method of this class could be used to collect preformatted text we would still need to programatically look for both text and line break nodes. The _return_text method, while long-winded, is considered to be more elegant that using a separate method for collecting text.

    
    def _return_text(self, document, in_text = None, listings = None, comments = None):
    

listings, comments = return_text(document)

Return a list of tuples, each containing the name of a code block and a list of lines in that block.
The comments list returned contains elements which are lists of lines in each comment or docstring found.

Create a new list of output and comments for this level of recursion.

        
        if listings == None:
        
            listings = [[None, []]]
        
        if comments is None:
        
            comments = []
        
        for node in document.childNodes:
        

Each node is checked to determine whether it has any child nodes.

        
            if node.hasChildNodes():
            

If so, then its name and class (if present) are checked against the values we wish to find.

            
                if node.getAttribute(u'class') == u'Python' and \
                    in_text is None:
                

If the class used for the node is "Python" then the contents of the relevant child nodes are appended to the list of output. We descend, setting the second parameter to one to indicate that text found beneath this node is to be collected.

                
                    listing, cm = self._return_text(node, u'Python')
                    

Join all the text to create a single string and split it at each newline.

                    
                    text = listing[0][1]
                    
                    joined = u''.join(text)
                    text = joined.split(u'\n')
                    

If a comment or docstring was being remembered then indent it correctly using the indentation of the first non-empty line as a guide and reset the docstring to None.

                    
                    for comment in comments:
                    
                        listings[-1][1] = listings[-1][1] + self._indent_docstring(text, comment)
                        
                    comments = []
                    

Add the list of preformatted text lines to the output list.

                    
                    listings[-1][1] = listings[-1][1] + text
                
                elif node.getAttribute(u'class') == u'Docstring' and \
                    in_text is None:
                

For nodes outside Python code, check whether the "Docstring" class is used and generate docstrings if it is. Ensure that instances of quotes are escaped and that the level of indentation remains the same.

                
                    listing, cm = self._return_text(node, u'Docstring')
                    

The docstring value, ds, is meaningless in this case as all the text for the docstring is contained within an element in the listing list.

                    
                    text = listing[0][1]
                   

Join the items in the text list then escape all the quotes used in the complete string.

                    
                    text = u''.join(text)
                    

Join together line breaks (from <br/> elements) and carriage returns (generated from newlines in <p> elements) to form a single newline in each case.

                    
                    pieces = text.split(u'\n\r')
                    text = u'\n'.join(pieces)
                    
                    new_text = []
                    
                    for c in text:
                    
                        if c == u'"':
                            new_text.append(u'\\"')
                        elif c == u'\r':
                            new_text.append(u' ')
                        else:
                            new_text.append(c)
                    
                    new_text = u''.join(new_text)
                    

Compile a docstring list from the text returned.

                    
                    docstring = [u'"""']
                    docstring = docstring + new_text.split(u'\n')
                    docstring.append(u'"""')
                    
                    comments.append(("docstring", docstring))
                
                elif node.getAttribute(u'class') == u'Comment' and \
                    in_text is None:
                

It is useful to be able to include comments in the generated source code, although this is mainly useful for generating a new version of the bootstrap.py file from an XHTML file. Note that we write any docstring being collected since it should appear before comments in the source code.

                
                    listing, cm = self._return_text(node, u'Comment')
                    
                    text = listing[0][1]
                    
                    joined = u''.join(text)
                    

Join together line breaks (from <br/> elements) and carriage returns (generated from newlines in <p> elements) to form a single newline in each case.

                    
                    pieces = joined.split(u'\n\r')
                    joined = u'\n'.join(pieces)
                    
                    pieces = joined.split(u'\r')
                    joined = u' '.join(pieces)
                    
                    text = joined.split(u'\n')
                    

Add the comment lines to the comments list, specifying the appropriate type..

                    
                    comments.append(("comment", text))
                
                elif node.getAttribute(u'class') == u'Submodule' and in_text is None:
                

If we encounter a node with children which has the correct class attribute then collect the textual contents of the child nodes. This does not affect the persistent listings or comments lists until the text has been assembled.

                
                    listing, cm = self._return_text(node, u'Submodule')
                    
                    text = listing[0][1]
                    
                    joined = u''.join(text)
                    

Start a new listing using the text collected as the name of the listing.

                    
                    listings.append([joined, []])
                
                else:
                

If the node fails to satisfy the criteria then we continue looking for suitable preformatted text in its child nodes, maintaining the comments list.

                
                    listings, comments = self._return_text(node, in_text, listings, comments)
            

For nodes with no child nodes, we are only concerned with those which are located within preformatted areas of the required class.

            
            elif in_text is not None:
            

If suitable text has been found the contents of the node are added to the output list.

            
                if node.nodeName == "#text" and in_text == u'Python':
                

Suitable text has been found. Add the contents of the node to the output list.

                    
                    listings[-1][1].append(node.data)
                
                elif node.nodeName == "#text" and (in_text == u'Docstring' or in_text == u'Comment'):
                

For paragraph text being collected for comments or docstrings add the contents of the node to the output list.

                    
                    lines = node.data.split(u'\n')
                    
                    listings[-1][1].append(u'\r'.join(lines))
                
                elif node.nodeName == "#text" and in_text == u'Submodule':
                

For a submodule name declaration, just collect the text as it is for later processing.

                    
                    listings[-1][1].append(node.data)
                

If line breaks are found then insert newlines; this is required because Mozilla appears to find it fashionable to include line break tags in preformatted text rather than actual newlines.

            
                elif node.nodeName == u'br':
                
                    # Append a newline to the text collected.
                    listings[-1][1].append(u'\n')
        

The contents of the output list and the docstring list are returned.

        
        return listings, comments
    

Finding modules

The techniques employed to find modules for Python to use may be modified by implementing a replacement for the find_module method. Here, we add the ability to search for modules stored on remote servers.

    
    def find_module(self, name, path = None):
    

stuff = find_module(name, path = None)

Return information on a module if it can be found or None if it cannot be located. The path parameter, if specified, should be a list of paths to search for the module of the name given.
If successful, the stuff tuple contains an open file-like object for reading the module, the filename of the module and an information tuple; this tuple contains the module file's suffix, reading mode and the data type of its content.

There are two scenarios which concern us: where a module is specified explicitly as a URL and where it is merely specified as a name. In the first case, we will ignore the path given and try to locate the module using the urllib2 module; in the second case, we may encounter paths which are themselves URLs so will have to be prepared to fetch the module from these. To avoid duplication of effort, where the name is given as a complete URL we will split the module filename from the rest of the URL, placing it in the name parameter and replacing the path list with the rest of the URL.

Note: The first case is not pursued in the current version of this module as it is difficult to supply module names with the sufficient range of characters required to specify URLs.

If no path is supplied then we must use the default path, but first we must check for built in modules.

        
        if path is None:
        
            stuff = self.find_builtin_module(name)
            
            if stuff is not None:
            
                return stuff
            

No built in module was found, so check paths in the default list.

            
            path = self.default_path()
        

We now examine each path in turn, trying to find the module.

        
        for direct in path:
        

We test whether the directory exists on the local filesystem as a crude way to determine whether the directory is a non-URL based path.

            if os.path.exists(direct):
            

For directories which are assumed to be non-URL based paths, just use the base class's method for finding the module.

            
                stuff = ihooks.ModuleLoader.find_module(self, name, [direct])
                
                if stuff is not None:
                
                    return stuff
            

For all URL paths, check that the module name is not "_winreg". This is to prevent Windows platforms from looking for non-existant modules on remote servers then complaining when it can't find them.

            
            elif name != "_winreg":
            

For a path which is assumed to be in the form of a URL, check that the name of the module has a suitable suffix. If the module has a suffix then remove it, placing it in a list, otherwise use the name with the standard list of suffixes recognised by the importer.

                at = string.rfind(name, ".")
                
                if at != -1:
                
                    suffix = name[at:]
                    name = name[:at]
                    

Look up the corresponding reading mode and type value for the suffix in use, providing a fallback option in case the suffix is not found.

                    
                    suffixes = []
                    
                    for s, mode, datatype in self.hooks.get_url_suffixes():
                    
                        if s == suffix:
                        
                            suffixes = [(suffix, mode, datatype)]
                            break
                    
                    if suffixes == []:
                    
                        suffixes = self.hooks.get_url_suffixes()
                
                else:
                

For module names which are specified without a suffix, ensure that we check for a package of that name. We may want to add a check for "__init__" files at this point to prevent further searching for packages called "__init__". Typically, this isn't required because the remote server should inform us that there is no directory called "__init__" at the URL given. We place the package directory at the end of the suffix list because it is more intensive to determine whether a package exists than it is to find a module.

                
                    suffixes = self.hooks.get_url_suffixes() + [("", "", PKG_DIRECTORY)]
                

For each suffix, join the directory and file names together using the urljoin function from the urlparse module to create a URL. If the directory name is missing a terminating slash character ("/") this will be added.

            
                if direct != u"" and direct[-1:] != u"/":
                
                    direct = direct + u"/"
                
                for suffix, mode, datatype in suffixes:
                
                    url = urlparse.urljoin(direct, name + suffix)
                    

Try to open this URL constructed from the path and the module name with the urlopen function from the urllib2 module.

                    
                    try:
                    
                        file = urllib2.urlopen(url)
                        

Use the URL associated with the file object:

                        
                        if self.normalise_path(url) == self.normalise_path(file.url):
                            url = file.url
                        else:
                            raise URLError, "URL used was not the URL asked for."
                        

If the URL had no suffix and yet returned a valid file descriptor for the content we asked for, we assume that a directory was found and some form of content returned. We then go on to assume that a file called "__init__.py" exists beneath the directory on the server and try to import that.

                        
                        if suffix == "":
                        
                            stuff = self.find_module("__init__", [url])
                            
                            if stuff:
                            
                                file = stuff[0]
                                if file: file.close()
                                
                                return None, url, ("", "", PKG_DIRECTORY)
                       

Return the file object, filename, suffix, mode and data type.

                        
                        else:
                        
                            return file, url, (suffix, mode, datatype)
                    

If the URL created by combining the path and name does not represent a URL then try and open it using the default find_module method from the base class.

                    
                    except ValueError:
                    
                        stuff = ihooks.ModuleLoader.find_module(self, name, [direct])
                        
                        if stuff is not None:
                        
                            return stuff
                    

If the URL cannot be resolved or cannot be found then continue searching through the suffixes and paths.

                    
                    except (URLError, HTTPError):
                    
                        pass
        

If no modules were found then return None to indicate failure.

        
        return None
    

Loading modules

The load_module method is used by the following method which is called by an instance of the ModuleImporter class when a module is to be imported from a file.

    
    def load_module(self, name, stuff):
    

module = load_module(self, name, stuff)

Return a module object using the imp.load_source function or raise an ImportError if there is a problem with the file in question.

The name parameter passed to this method is the name of the module to be created; the stuff parameter is a tuple commonly used by the imp module to manage information related to modules.

        
        # Unpack the stuff tuple.
        file, filename, info = stuff
        
        # Unpack the info tuple.
        (suffix, mode, datatype) = info
        
        #print "load_module:", name, filename
        

We try to deal with the data that was supplied in the stuff tuple.

        
        try:
        

For built in modules, load the module and return.

        
            if datatype == C_BUILTIN:
            
                return self.hooks.init_builtin(name)
            

We need to deal with remote Python code as well since the load_package function in the imp module insists on getting file objects, which is not going to happen for remote objects. For package directories, we need to manually deal with importing the "__init__" files.

        
            path = None
            
            if datatype == PKG_DIRECTORY:
            

Load a variant of the "__init__.py" file inside the directory using the filename variable given, which will contain the full URL, as the path to the module.

                
                new_stuff = self.find_module("__init__", [filename])
                
                if not new_stuff:
                
                    raise ImportError, "No module named %s" % name
                
                init_file, init_filename, init_info = new_stuff
                

Ensure that the information for the "__init__" file is used instead of that for the directory.

                file = init_file
                path = [filename]
                filename = init_filename
                suffix = init_info[0]
                mode = init_info[1]
                datatype = init_info[2]
            

We are interested in source code found in files ending in ".html". That such files are recognised as source code by the calling object is due to an instance of the replacement XHTMLHooks class which we defined above. Alternatively, if the module to be imported is stored remotely then we will need to perform the compilation and installation since the default loader will refuse to deal with URLs rather than files.

            
            if os.path.exists(filename):
            
                is_url = 0
            
            else:
            
                is_url = 1
            
            if is_url or (datatype == PY_SOURCE and (suffix == os.extsep+"html" or suffix == ".html")):
            

Within the safety of an exception clause, read the contents of the file or URL:

            
                try:
                
                    data = file.read()
                

For consistency with other import methods an ImportError is thrown if any of the file operations fail. Ideally, XML parsing exceptions should also be caught in the same way.

                
                except IOError:
                
                    # Raise an ImportError.
                    raise ImportError, "Failed to import %s" % filename
                

Files specified by URLs are read using the read binary ("rb") mode so objects which should be read using the textual read ("r") mode will need some modification.

                
                if mode == "r":
                
                    pieces = string.split(data, "\r\n")
                    data = string.join(pieces, "\n")
                

For XHTML documents, create an xml.dom.minidom.Document object using the contents of the file or URL in question.

                
                if suffix == os.extsep+"html" or suffix == ".html":
                
                    d = minidom.parseString(data)
                

Retrieve the Python code from the document (the first item in the tuple returned from _return_text) and check for missing newlines.

                    
                    code_listings = self._return_text(d)[0]
                    

For any pair of empty but indented lines in the source code, remove one of the pair.

                    
                    listings = []
                    
                    for code_name, code_listing in code_listings:
                    
                        new_listing = self._reduce_empty(code_listing)
                        

Join the list items together to make a unicode string, checking for the absence of a terminating newline character.

                        
                        code_string = u'\n'.join(new_listing)
                        
                        if code_string != u'' and code_string[-1] != u'\n':
                        
                            code_string = code_string + u'\n'
                        
                        listings.append((code_name, code_string))
                

Some platforms may have difficulty in encoding the unicode string as ASCII so just leave it in the encoding found.

                    
                else:
                
                    code_string = data
                    
                    listings = [(None, code_string)]
                

Ideally, we would pass the source code found to the imp module's load_source function via the hooks mechanism but this appears to fail because an actual file object is required rather than a cStringIO.StringIO object.

We manually create a new module for this source to reside in.

                
                module = self.hooks.add_module(name)
                

The source code from each part of the file is compiled.

                
                for code_name, code_string in listings:
                

If there is a corresponding name for the code then create a submodule for it. The resulting bytecode is executed within the context of the new module.

                
                    code = compile(code_string, filename, "exec")
                    
                    if code_name is not None:
                    

Split the name associated with the code string at each "." to generate a list of submodules in descending order.

                        
                        sub_names = string.split(code_name, ".")
                        
                        parent = module
                        parent.__path__ = [filename]
                        
                        for sub_name in sub_names:
                        

Compile code to generate a submodule from within the newly created module and execute it..

                        
                            if sub_name not in parent.__dict__:
                            
                                mksub = compile(
                                    "import imp\n%s = imp.new_module('%s')\n" % (sub_name, sub_name),
                                    filename, "exec"
                                    )
                            
                                exec mksub in parent.__dict__
                            

For deeper submodules, use the current submodule as the parent module.

                            
                            submodule = parent.__dict__[sub_name]
                            submodule.__file__ = filename
                            
                            parent = submodule
                        

Execute the code listing in the lowest submodule created.

                        
                        exec code in submodule.__dict__
                    
                    else:
                    

Using the technique found in the FancyModuleLoader class from the ihooks module, we assign the path to the module to the __path__ attribute of the module object created.

                        
                        if path is not None:
                        
                            module.__path__ = path
                        
                        exec code in module.__dict__
            

The base class is called on to deal with files other than those handled by the specific functions introduced in this class.

            
            else:
            
                # Use the base class's method.
                #print "Use base class:", name, stuff
                return ihooks.ModuleLoader.load_module(self, name, stuff)
        

Like the base class, some tidying up is required if the import operation fails.

        
        except:
        
            module = None
        

For successful imports, we return the module object to the caller.

        
        else:
        
            module.__file__ = str(filename)
        
        if file: file.close()
        return module
    

    
    def normalise_path(self, url):
    
        pieces = urlparse.urlsplit(url)
        path = pieces[2].split(u"/")
        pieces = tuple(pieces[:2],) + (
            u"/".join(filter(lambda piece: piece != u"", path)),
            ) + tuple(pieces[3:])
        return urlparse.urlunsplit(pieces)

The XHTMLImporter class

We are able to import XHTML files as single modules, with some support for submodules, using just the above classes. However, it is helpful to be able to treat an XHTML document which contains submodules as a simple kind of package. To do this, it is necessary to modify the class which imports the modules into the correct namespaces so that it can deal with these submodules correctly.

class XHTMLImporter(ihooks.ModuleImporter):

    def import_it(self, partname, fqname, parent, force_load=0):
    
        try:
        
            return parent.__dict__[partname]
        
        except (KeyError, AttributeError):
        
            return ihooks.ModuleImporter.import_it(
                self, partname, fqname, parent, force_load
                )

Creating and installing a new importer

With the above classes in place, it now only remains for us to create an instance of the new hooks mechanism:

new_hooks = XHTMLHooks()


This will be passed as a parameter of the constructor as we create an instance of the new loader class.

new_loader = XHTMLLoader(hooks = new_hooks)


We create an instance of the XHTMLImporter class, using the new loader and import hooks to extend its range of supported source files to include Python in XHTML.

importer = XHTMLImporter(loader = new_loader)


Install the new importer with the ihooks module, and therefore with the Python interpreter.

importer.install()

A convenience function for URL importing

We considered it useful to provide a convenience function for occasions where the user or developer wishes to import a module stored at a remote location specified by a URL. Typically, one might import such a module by modifying the sys.path list to include the URL referring to the resource's container before using the import keyword in the usual manner. However, such modification of the sys.path list for temporary purposes could possibly be regarded as bad practice.

We define a function to provide a convenient method for importing remote modules whose locations are specified using URLs.

def load(url, as_name = None):

    stuff = new_loader.find_module(url)
    
    if stuff is None:
    
        # Raise an ImportError.
        raise ImportError, "Failed to import %s" % url
    
    if as_name is None:
    
        as_name = urlparse.urlsplit(url)[2]
        slash = as_name.rfind("/")
        if slash != -1:
            as_name = as_name[slash + 1:]
        dot = as_name.find(".")
        if dot != -1:
            as_name = as_name[:dot]
    
    return new_loader.load_module(as_name, stuff)

Testing the remote importing features


A simple test of the remote importing features of the module is as follows:

def test():

    test_url = "http://www.boddie.org.uk/python/modules"
    print "Testing the module by trying to import a module from %s" % test_url
    print
    
    import sys
    
    print "Appending %s to sys.path" % test_url
    print
    sys.path.append(test_url)
    
    try:
    
        import xhh_test_module
    
    except ImportError:
    
        print "Import failed."
        return
    
    print
    print "Import was successful."
    

Future additions

Possible future additions may include support for:

Changes

2005-06-20
Made changes suggested by Kirby Angell to fix module imports within remote packages.
2004-08-29
Added a convenience function (load) to allow modules to be imported from a location explicitly given by a URL.
2003-05-30
Changed the version number to reflect changes in the DistUtils setup script.
2003-05-19
Added a link to the online location for the archive so that this document can be used as an online index page.
2003-05-18
Changed the class for docstring to "Docstring" and removed the restrictions on the type of nodes in which comments, docstrings and Python code can reside.
2003-05-06
Added check for files which have been opened using the "rb" mode rather than the required "r" mode in the load_module method of the XHTMLLoader class. Corrected conversion of platform suffixes to those suitable for URLs in the get_url_suffixes method of the XHTMLHooks class.
2003-05-05
Subclassed the ModuleImporter class from ihooks in order to treat some XHTML documents like packages.
2003-05-04
Added submodule feature so that multiple example scripts can be included within a single document. Care must be taken to avoid including executable code in the top-level of each submodule as it will be executed when the document is imported. Abandoned disastrous attempt to treat XHTML files as package directories.
Removed dependency on checking for ":" in order to determine whether a path represented a URL or not as this was understandably causing chaos on Windows.
Added support for nested submodules ("module.submodule.subsubmodule") in documents, although importing "subsubmodules" directly and deeper does not work.
2003-05-03
Changed the comment and docstring formatting to use carriage return characters internally so that only newlines caused by line breaks are translated into newlines in the docstrings and comments generated.
Following problems with remote importing caused by some servers returning valid pages instead of issuing a 404 (Not Found) error code. The find_module method now checks whether the URL corresponding to the file descriptor returned by the urlopen function is the same as the URL asked for.
2003-05-02
Tidied up comment and docstring line wrapping. Noted that ipython will not start up if this module is automatically imported.
2003-05-01 (later)
Added support for comments and docstrings. Fixed remote importing to handle other forms of files. Gave many of the paragraphs the relevant class attribute so that they will appear as comments in the xhtmlhook.py file. Moved the docstrings out of the source code into paragraphs with the appropriate class attribute.
2003-05-01
Although remote importing of XHTML modules was implemented fairly trivially, the infrastructure for remote importing of packages proved to be more frustrating. Packages can now be imported from a remote server but the files must be given in XHTML format rather than the usual types, such as source code (.py) or compiled source (.pyc).
2003-04-29
Added the ability to import modules from URLs given in the sys.path list. Try appending a URL for some online resources to this list to see how it works.

For example, try the built in test:

>>> import xhtmlhook
>>> xhtmlhook.test()

2003-04-27
Updated the Introduction section. Newlines keep disappearing from preformatted areas thanks to Amaya's and Mozilla's inability to leave the formatting as it is. This could be a problem for readability.
2003-04-26
Created this file based on the xhtmlhook.py file.

License

From the included LICENSE.txt file:

xhtmlhook license:

Copyright (C) 2003-2005 David Boddie, Paul Boddie

This software is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as
published by the Free Software Foundation; either version 2 of
the License, or (at your option) any later version.

This software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public
License along with this library; see the file COPYING
If not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.