Strategies for migrating legacy HTML to Drupal

Repurposing old HTML is always a challenge. Before you move the old HTML, consider the following questions:

How many pages?

If your content is relatively simple and there aren't many pages, manually pasting each page into a new Drupal node might be the easiest route; how many pages counts as "too many" is entirely subjective. You can also automate the import of legacy HTML, but you'll need time to configure and test the conversion process. If you have a significant amount of content to move, spending the additional ramp-up time to set up an automated process is the most effective route.

Where does the HTML currently live?

The actions you'll take depend on where your legacy HTML currently lives: in a database, in static HTML files, or on a server that's only accessible through a browser.

In a database

If the legacy HTML is in a database, it's relatively easy to export the data in a format that Drupal can import through the Feeds module. You can also use the Feeds SQL module to pull data directly from the database into a feed.

You can also manually craft an import file from the database by exporting the data to a CSV (comma-separated values) file, an RSS or Atom feed, or an XML file. After you create this file, you can set up a feed ingestion node to process it into new Drupal nodes. For more information about this process, see the Feeds documentation on Drupal.org.
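If you're comfortable scripting the export yourself, the following is a minimal sketch of the idea. It assumes, purely for illustration, a SQLite database named legacy.db with a legacy_pages table containing id, title, body, and created columns; substitute the appropriate driver and query for your actual database:

  import csv
  import sqlite3

  # Hypothetical schema: legacy_pages(id, title, body, created).
  # Swap in the appropriate driver and query for your actual database.
  conn = sqlite3.connect("legacy.db")
  cur = conn.execute("SELECT id, title, body, created FROM legacy_pages")

  with open("import.csv", "w", newline="", encoding="utf-8") as fh:
      writer = csv.writer(fh)
      # The header row supplies the names you'll map to node fields in Feeds.
      writer.writerow(["guid", "title", "body", "published"])
      writer.writerows(cur)

  conn.close()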

If the legacy HTML is stored in the database as a large amount of data in a single row, it may be easier to export the data first and then break it up into individual pages.
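As a rough sketch of that breakup step: if the export concatenates complete HTML documents, so that each page begins at an opening <html tag, you could split along those boundaries (the file names here are arbitrary, and the boundary should be adjusted to match your data):

  import re

  # Assumes the export concatenates complete HTML documents, so each
  # page begins at an opening <html tag.
  with open("export.txt", encoding="utf-8") as fh:
      blob = fh.read()

  pages = [p for p in re.split(r"(?=<html)", blob, flags=re.IGNORECASE)
           if p.strip()]

  for n, page in enumerate(pages, start=1):
      with open(f"page-{n}.html", "w", encoding="utf-8") as out:
          out.write(page)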

In static files

If the legacy HTML is not in a database, it's probably either:

  • A series of static PHP or ASP files that use include statements for the header and footer portions of each document
  • Static HTML files, each with full markup

Only available through a browser

If the legacy HTML is only accessible through a browser, you can use a site-downloading utility (such as HTTrack for Windows or SiteSucker for OS X) to download your site as local, static files. From there, you can use the static files to import the data.
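A dedicated utility is usually the better choice because it handles assets, encodings, and edge cases for you. Purely to illustrate what these tools do, here is a minimal crawler sketch in Python; the starting URL is a placeholder and the page cap is an arbitrary safety limit:

  import urllib.parse
  import urllib.request
  from html.parser import HTMLParser

  START = "http://www.example.com/"  # placeholder starting URL

  class LinkParser(HTMLParser):
      """Collects the href value of every anchor tag on a page."""
      def __init__(self):
          super().__init__()
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  seen = set()
  queue = [START]
  count = 0
  while queue and count < 50:  # arbitrary cap on pages fetched
      url = queue.pop(0)
      if url in seen:
          continue
      seen.add(url)
      try:
          with urllib.request.urlopen(url) as resp:
              html = resp.read().decode("utf-8", errors="replace")
      except OSError:
          continue  # skip pages that fail to download
      count += 1
      with open(f"page-{count}.html", "w", encoding="utf-8") as out:
          out.write(html)
      parser = LinkParser()
      parser.feed(html)
      for link in parser.links:
          absolute = urllib.parse.urljoin(url, link)
          if absolute.startswith(START):  # stay on the original site
              queue.append(absolute)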

How uniform is the non-essential markup?

Regardless of which of these states your content is in, you'll need to extract the contents of the <title> and <body> tags and then parse them into valid CSV, RSS, or XML files. One approach is to use command-line tools to perform powerful find-and-replace actions with regular expressions (regex). The following is a basic example of generating a CSV file that contains the required columns (guid, title, body, and published). These are the tasks the conversion script would need to perform:

  1. Iterate through all HTML files in a directory. For each file:
    • Replace all markup from the beginning of the file to the end of the opening <title> tag with a unique number (the guid) and a delimiter, such as the vertical bar (|)
    • Replace all markup from the closing </title> tag to the end of the opening <body> tag with another delimiter
    • Replace all markup from the closing </body> tag to the end of the file with a delimiter and a published date (for example, the file's last-modified date)
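The same tasks can be expressed in a short script. The following sketch does it in Python rather than with command-line tools; it assumes the legacy files live in a directory named html, uses each file's modification time as a stand-in published date, and relies on the csv module's quoting instead of a vertical-bar delimiter:

  import csv
  import os
  import re
  import time

  SRC = "html"  # directory containing the legacy .html files

  rows = []
  files = sorted(f for f in os.listdir(SRC) if f.endswith(".html"))
  for guid, name in enumerate(files, start=1):
      path = os.path.join(SRC, name)
      with open(path, encoding="utf-8", errors="replace") as fh:
          html = fh.read()

      # Pull out the contents of the <title> and <body> tags.
      title = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
      body = re.search(r"<body[^>]*>(.*?)</body>", html, re.I | re.S)
      if not title or not body:
          continue  # skip files without the expected markup

      # Use the file's modification time as a stand-in published date.
      published = time.strftime("%Y-%m-%d",
                                time.localtime(os.path.getmtime(path)))
      rows.append([guid, title.group(1).strip(),
                   body.group(1).strip(), published])

  with open("import.csv", "w", newline="", encoding="utf-8") as fh:
      writer = csv.writer(fh)
      writer.writerow(["guid", "title", "body", "published"])
      writer.writerows(rows)

You can then point a Feeds importer at the resulting import.csv and map the guid, title, body, and published columns to the corresponding node fields.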
