"Data migration for Drupal 7," Ken Rickard: Twin Cities Drupal Camp 2011

At Drupal Camp Twin Cities 2011, I attended the session “Data Migration for Drupal 7” by Ken Rickard of palantir.net. Ken is a core contributor and a co-author of the book Drupal 7 Module Development. He has worked on a number of important migration efforts, including complicated migrations for ForeignAffairs.com and Grinnell College, and is known as the migration expert at Palantir.

Here is the session description:

We’ll take an in-depth, technical look at the challenges of migrating external data into Drupal. Working from a live example, we'll use Drupal 7’s Migrate module to pull data into a Drupal site.

Here are my notes:


Everybody has to do a migration at some point or another. They can be painful, but within a crisis always lies an opportunity.

Cyrve.com is a business focused entirely on migrations from one system to another.

Planning a migration? Some managers think that they can just state the business logic of what goes where, it goes into a mysterious black box, and out pops a new web site! Black boxes are pretty scary.

Tools

We use tools instead of a mysterious black box. These tools can include:

  • Custom scripts: these can be difficult, because Drupal data storage is not script friendly. APIs are better.
  • Custom Drupal modules
  • Feeds: the Feeds module can be used to import WordPress export feeds, for example.
  • ImportAPI
  • Migrate: This is the best option

Source data

First, you need to understand your source data. Make sure to consider the following:

  1. Analyze your data source
  2. Structure your Drupal objects
  3. Document in plain English

It’s very important to document in plain language, so the project sponsors can clearly understand what you are doing. Also, write out your rules for how you will process the data, which data you will skip, etc. For example, your documentation might be a table with the Drupal field names, the source field names, and notes.

If you are doing anything over 1,000 records, you definitely need this documentation. A hundred nodes you can do by hand; over 500, definitely script it.

Hierarchy of pain

From least painful to most painful:

  • MySQL/pgSQL data
  • SQL data
  • XML data
  • HTML: the content that actually gets put out by the browser
  • A bunch of old Word documents

“Oh dear lord HTML to Drupal migrations are painful.” Bad Word HTML is particularly prone to errors.

Try to encourage clients to move their content into less painful formats.

Are you doing a fire-and-forget migration, where you will import content once and never do it again? Or is this an ongoing process?

Helpful tools

  • DBTNG (Database: The Next Generation)
  • SimpleXML
  • QueryPath: essentially jQuery for PHP
  • Migrate

If you are migrating XML, make sure that it parses correctly before you start working. No validation errors allowed!

The rest of this talk will be about the Migrate module.

Creating the plan

  • Content types
  • Vocabularies
  • Users
  • Etc.

Source data types

  • csv.inc
  • json.inc
  • list.inc
  • mssql.inc
  • sql.inc
  • sqlmap.inc
  • xml.inc

These are all the data types that Migrate understands by default.

Target object types

  • comment.inc
  • entity.inc
  • fields.inc
  • file.inc
  • node.inc
  • path.inc
  • table_copy.inc
  • term.inc
  • user.inc

These are all the types of things you can put your source data into.

Example migration

MPR had thousands of news stories going back to 1972 in an Oracle database. They got a grant to make this content web accessible. Oracle is not really web friendly, so this needed to be migrated.

Challenges included:

  • Stories could be part of one or more story collections about a specific topic
  • Stories needed to be sortable by “on this day”

Database tables

  • {exported_asset}
  • {exported_asset_data}
  • {exported_bin_asset_entry}
  • {exported_collection_bin}
  • {exported_field_type}
  • {exported_user_data}

Migration structure

  • Migrate class for each type
  • Inherit and extend default classes
  • Small module to define migrations

There is not really a user interface to build a migration: you basically write a small module for your particular migration. There is a user interface to display what you are migrating once you have set up your script.

Example migration

  • Get topics list
  • Create taxonomy terms
  • Map foreign keys to Drupal

In this example, there was a foreign key that held the parent key: the id of the topic that was the parent of that topic.

The migration code you write is PHP. In this case, you need to create a vocabulary, lay out the ids for all the fields in the source table, import a row from the source table, and then set a destination, a taxonomy vocabulary in this case. Then you write code to map the source data to the target data: here is this bit of source data, and this is where it should go within Drupal.
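As a rough sketch of what such a class can look like (the class, table, column, and vocabulary names here are hypothetical, and the API details follow the Migrate 7.x-2.x conventions — not necessarily the exact code shown in the session):

```php
<?php
// Sketch: import legacy topics as taxonomy terms with the Migrate module.
// Table {exported_collection_bin}, its columns, and the "topics"
// vocabulary are assumed names for illustration.
class TopicMigration extends Migration {
  public function __construct() {
    parent::__construct();
    $this->description = t('Import legacy topics as taxonomy terms.');

    // Source: a SELECT against the legacy table.
    $query = db_select('exported_collection_bin', 'b')
      ->fields('b', array('bin_id', 'title', 'parent_id'));
    $this->source = new MigrateSourceSQL($query);

    // Destination: terms in the "topics" vocabulary.
    $this->destination = new MigrateDestinationTerm('topics');

    // Map table: records which source row became which Drupal term,
    // so re-runs and rollbacks know what was already imported.
    $this->map = new MigrateSQLMap($this->machineName,
      array('bin_id' => array('type' => 'int', 'not null' => TRUE)),
      MigrateDestinationTerm::getKeySchema()
    );

    // Field mappings: source column -> Drupal field.
    $this->addFieldMapping('name', 'title');
  }
}
```

The class would live in that small custom module, registered so Migrate can discover it.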

To perform the actual migration once the script is ready, there is no user interface: you run the migration through Drush (for example, drush migrate-import), so familiarity with a command line is necessary. There is an opportunity to build a user interface for this module to actually run the scripts.

Once he ran the script, the topics were now taxonomy terms in a particular vocabulary.

This was a simple example, but still pretty powerful.

In theory, you could map from one Drupal installation to another Drupal installation, then use cron to run this on a regular basis. This could be used to move content from a staging server to a live site.

There are a lot of techniques you can use to handle errors, such as ensuring, when you add a piece of content, that you have not already added it before.

Migration dependencies

  • If nodes can have topics…
  • And collections are node references…
  • How do I import stories?

$this->dependencies = array('Collection', 'Topic', 'Contributor');

In other words, run these other imports first.

One challenge can be that CPU usage spikes when importing a ton of nodes. There are scripts you can use that will fire off another process if CPU usage hits a certain point.

You can do things that are “dead sexy,” like translating parent items, which are foreign keys in the source table, into the way Drupal handles that natively.
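In Migrate terms, that foreign-key translation is a field mapping run through another migration's map table. A sketch, assuming a hypothetical migration named 'Topic' and a legacy parent_id column:

```php
// Sketch: translate a legacy parent_id foreign key into the Drupal
// term's parent. sourceMigration() looks up the legacy id in the named
// migration's map table and substitutes the Drupal tid it produced.
// 'Topic' and 'parent_id' are assumed names for illustration.
$this->addFieldMapping('parent', 'parent_id')
  ->sourceMigration('Topic');
```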

Another challenge is that sometimes the source table you are importing from doesn’t have all the fields you need. You can set up the field that will hold this data, then preprocess each row to find the data in other parts of the source material and bring it in. Complicated, but powerful.
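One common place to do that per-row preprocessing is a prepareRow() method on the migration class, which Migrate calls for each source row before the field mappings run. A sketch with hypothetical table and column names:

```php
// Sketch: fill in a field the main source query doesn't provide, and
// skip rows we don't want. {exported_asset_data}, byline, title, and
// bin_id are assumed names for illustration.
public function prepareRow($row) {
  // Returning FALSE tells Migrate to skip this row entirely.
  if (empty($row->title)) {
    return FALSE;
  }
  // Pull a missing value from elsewhere in the source material.
  $row->byline = db_query(
    'SELECT byline FROM {exported_asset_data} WHERE asset_id = :id',
    array(':id' => $row->bin_id))->fetchField();
  return parent::prepareRow($row);
}
```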

If the Migrate module doesn’t know how to natively handle a particular field, you can write some code to tell it how to understand what that field is and map it correctly.

My take

I will be honest, this felt a bit above me: it was an advanced session, because this is very powerful stuff. I can only hope there is great documentation for how to do this. Now I know the Migrate module is there, and it may well be a lifesaver. I have a project coming up where I may need to migrate a literal ton of content into Drupal. If I can do that programmatically, rather than by hand, that will literally save me months of work. With time, I think I will be able to grasp the details.

I am somewhat disappointed there is no UI to help with building the migration or even with running the migration. However, Drupal is a do-ocracy, so that could change in the future. To be fair, migrations are so specific, there may well be no way to build a UI robust enough to handle all the use cases.

So, things to dig into in the future!

Thanks for the great session, Ken!