Strategies for migrating legacy HTML to Drupal

Repurposing old HTML is always a challenge. Before you move the old HTML, consider the following questions:

How many pages?

If your content is relatively simple and there aren't many pages, manually pasting each page into a new Drupal node might be the easiest route. How many pages are "too many" is subjective. You can also automate the import of legacy HTML, but you'll need time to configure and test the conversion process. If you have a significant amount of content to move, investing that additional ramp-up time in an automated process is usually the most effective route.

Where does the HTML currently live?

The actions you'll take depend on where your legacy HTML currently exists: in a database, in static HTML files, or on a server that's only accessible through a browser.

In a database

If the legacy HTML is in a database, it will be relatively easy to export the data in a format that Drupal can import through the Feeds module. You can use the Feeds SQL module to pull data directly from the database into a feed.

You can also manually craft a CSV (Comma-Separated Values) file from the database. This involves exporting the data to a CSV file, an RSS or Atom feed, or an XML file. After you create this file, you can set up a feed ingestion node to process it into new Drupal nodes. For more information about this process, see the Feeds documentation on Drupal.org.
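As a rough illustration, a short PHP script can pull rows out of the legacy database and write them to a CSV file that Feeds can ingest. This is a minimal sketch only: the connection details, table name, and column names (pages, id, title, body, created) are assumptions about a hypothetical legacy schema.

  <?php
  // Minimal sketch: export legacy rows to a CSV file for Feeds.
  // The DSN, credentials, table, and column names are assumptions.
  $pdo = new PDO('mysql:host=localhost;dbname=legacy', 'user', 'password');
  $out = fopen('import.csv', 'w');
  // Header row matching the columns a Feeds importer expects.
  fputcsv($out, ['guid', 'title', 'body', 'published']);
  $result = $pdo->query('SELECT id, title, body, created FROM pages');
  foreach ($result as $row) {
    fputcsv($out, [$row['id'], $row['title'], $row['body'], $row['created']]);
  }
  fclose($out);

Using fputcsv() rather than joining strings by hand matters here, because it quotes body text that contains commas, quotes, or line breaks.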

If the legacy HTML is stored in the database as one large block of data in a single row, it may be easier to export the data and then break it up into individual records.

In static files

If the legacy HTML is not in a database, it's probably either:

  • A series of static PHP or ASP files using include statements for the header and footer portions of the document
  • Static HTML files, each with full markup

Only available through a browser

If the legacy HTML is only accessible through a browser, you can use a site-downloading utility (such as HTTrack for Windows or SiteSucker for macOS) to download your site as local, static files. From there you can use the static files to import the data.

How uniform is the non-essential markup?

Regardless of which of these states your content is in, you'll need to extract the contents of the title and body tags and then parse them into valid CSV, RSS, or XML files. One approach is to use command-line tools to perform powerful find-and-replace actions with regular expressions (regex). The following is a basic outline for generating a CSV file that contains the required columns (including title, body, published, and guid). These are the tasks such a script would need to perform (a sketch in PHP follows the list):

  1. Iterate through all HTML files in a directory. For each file:
    • Replace all markup from the beginning of the file to the end of the <title> tag with a unique number and a delimiter, such as the vertical bar (|)
    • Replace all markup from the beginning of the </title> tag to the end of the opening <body> tag with a delimiter, such as the vertical bar (|)
    • Replace all markup from the beginning of the </body> tag to the end of the file with a return character
  2. Concatenate all files into one CSV file.
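Here is one way such a script might look in PHP, using preg_replace() for the three replacements. It's a sketch, not a finished tool: the legacy/ directory and import.csv output path are assumptions, it only produces number, title, and body columns, and the expressions will need adjusting to match your actual markup.

  <?php
  // Sketch of the three regex replacements described above. The input
  // directory and output file name are assumptions.
  $rows = '';
  $id = 1;
  foreach (glob('legacy/*.html') as $path) {
    $html = file_get_contents($path);
    // 1. Everything up to the end of the opening <title> tag -> "id|".
    $html = preg_replace('/^.*?<title[^>]*>/is', $id . '|', $html);
    // 2. Everything from </title> through the opening <body> tag -> "|".
    $html = preg_replace('/<\/title>.*?<body[^>]*>/is', '|', $html);
    // 3. Everything from </body> to the end of the file -> a newline.
    $html = preg_replace('/<\/body>.*$/is', "\n", $html);
    $rows .= $html;
    $id++;
  }
  file_put_contents('import.csv', $rows);

A body that itself contains a vertical bar or a line break will corrupt this simple format; the DOM-based approach described next is more robust against messy markup.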

Another approach is to create a function in PHP, Python, Ruby, or another language to remove the non-essential markup. The following steps describe a basic procedure to generate a CSV file using PHP (a hedged sketch follows the steps):

  1. Iterate through all HTML files in a directory. For each file:
    1. Create a new DOMDocument.
    2. Load the contents of a file into the DOMDocument object.
    3. Use the getElementsByTagName() function to extract the contents of the title and body tags into variables.
    4. Add the variables to an array.
  2. After all files are processed, iterate through the array, concatenating the unique number, the title, and the body contents into a delimited string. Write the string to a file using file_put_contents().
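A sketch of those steps, again with assumed directory and file names:

  <?php
  // Sketch of the DOMDocument-based extraction described above.
  // The legacy/ directory and import.csv name are assumptions.
  $rows = [];
  $id = 1;
  foreach (glob('legacy/*.html') as $path) {
    $doc = new DOMDocument();
    // @ suppresses the warnings loadHTML() raises on malformed legacy markup.
    @$doc->loadHTML(file_get_contents($path));
    $title = $doc->getElementsByTagName('title')->item(0);
    $body = $doc->getElementsByTagName('body')->item(0);
    // Serialize the body's child nodes to keep the inner markup without
    // the <body> wrapper itself.
    $contents = '';
    if ($body) {
      foreach ($body->childNodes as $child) {
        $contents .= $doc->saveHTML($child);
      }
    }
    $rows[] = [$id++, $title ? trim($title->textContent) : '', $contents];
  }
  $output = '';
  foreach ($rows as $row) {
    $output .= implode('|', $row) . "\n";
  }
  file_put_contents('import.csv', $output);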

How much formatting needs to remain?

You can migrate obsolete markup (including font tags and table formatting) to CSS using many tools, one of which is htmLawed. It's wise to use this time to examine whether the obsolete markup serves a semantic purpose or merely exists to complement a prior site design. Wherever possible, removing obsolete markup tied to previous designs makes future content updates less complex and fragile.
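As a trivial illustration of the simplest case, PHP's built-in strip_tags() can drop presentational tags outright by whitelisting only the structural tags you want to keep; htmLawed and similar tools offer far finer control, such as rewriting attributes rather than discarding whole tags.

  <?php
  // Crude illustration only: keep structural tags, drop everything else
  // (including <font>). Note strip_tags() does not touch attributes such
  // as bgcolor or align on the tags it keeps; tools like htmLawed can.
  $allowed = '<p><h1><h2><h3><ul><ol><li><a><em><strong><img><table><tr><td><th>';
  $body = strip_tags($body, $allowed);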

Are there attached media files?

Attached images, movies, or audio files add significant complexity. The Feeds module is equipped to handle the import of associated media from the Drupal public or private files directory. The complexity comes from having to parse each HTML file for instances of the image (<img>) and other media tags. You'll not only need to extract the file references from these tags, but also rewrite them to use the Drupal file system. For instance, an image tag would need its source rewritten to a location such as public://animal_pictures/cat.png.
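For example, a small DOM pass can rewrite the image sources. The /images/ path prefix below is an assumption about where the legacy site stored its media, not something the Feeds module prescribes:

  <?php
  // Sketch: rewrite site-local image paths to Drupal's public:// scheme.
  // The /images/ prefix is an assumed legacy location.
  $doc = new DOMDocument();
  @$doc->loadHTML(file_get_contents('page.html'));
  foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    // Leave externally hosted images alone.
    if (preg_match('#^https?://#', $src)) {
      continue;
    }
    // e.g. /images/animal_pictures/cat.png -> public://animal_pictures/cat.png
    $img->setAttribute('src', preg_replace('#^/?images/#', 'public://', $src));
  }
  $rewritten = $doc->saveHTML();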

Are there modules that can help with this?

You can use the Import HTML module to automate the migration process without writing your own code, but you'll need to configure the module based on the specific HTML and content that you're importing.

The module handles most of the iteration steps described in the preceding sections, but it can only extract data from static HTML documents if the XML/XSLT parser and HTML Tidy extensions for PHP are installed and available.
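You can confirm those extensions are available with a couple of lines of PHP:

  <?php
  // Report whether the PHP extensions Import HTML depends on are loaded.
  foreach (['xsl', 'tidy'] as $extension) {
    echo $extension . ': ' . (extension_loaded($extension) ? 'enabled' : 'missing') . "\n";
  }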
