Cleaning HTML while migrating - Remove Microsoft HTML

Recently I had to move a large number of tables from an old system to a Drupal site. The table data was riddled with the crappy HTML that Microsoft Word inserts.

The MS HTML was 1) redundant, making the HTML almost 5 times its actual size, and 2) at times breaking the page HTML on the new system.

I used the htmLawed library from http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/ind...

Once the library is included, cleaning is as simple as:

<?php
// Run the Word-generated markup through htmLawed with the chosen settings.
$clean_html = htmLawed($crappy_ms_html, $htmlawedsettings);
?>

$htmlawedsettings can carry a multitude of settings, as explained in http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed/htm...

At a minimum, however, it can be:

<?php
$htmlawedsettings = array(
  'clean_ms_char' => 2,
);
?>
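
If Word's inline styling and comments are what bloat your markup, a slightly stricter configuration may help. The following is only a sketch: the variable name $stricter_settings is made up for illustration, and the option values reflect my understanding of the htmLawed documentation linked above, so verify each one there before relying on it.

<?php
// Hypothetical stricter configuration; check every option against the htmLawed docs.
$stricter_settings = array(
  // Replace Microsoft-specific characters, including "smart" quotes.
  'clean_ms_char' => 2,
  // Remove HTML comments, which Word output often carries.
  'comment' => 1,
  // Drop class and style attributes from every element.
  'deny_attribute' => 'class, style',
);
?>

Stripping style and class is a blunt instrument, so it only makes sense if the new site's theme is supposed to control how the tables look.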

Last but not least, since the migration script was a Drupal module, I placed htmLawed.php in the same folder as the .module file and included it with:

<?php
require_once ( dirname(__FILE__) . '/htmLawed.php');
?>
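
With the library loaded, cleaning a whole batch of migrated rows could look roughly like the sketch below. The function name mymodule_clean_rows, the row layout and the 'body' column are assumptions made for illustration, not part of htmLawed or Drupal. The cleaner is passed in as a callable (defaulting to htmLawed) so the helper can be tried out even before the library is in place.

<?php
/**
 * Clean the HTML held in one column of every row.
 *
 * Hypothetical helper: $rows is assumed to be an array of associative arrays,
 * and $cleaner is any callable that accepts (string $html, array $config).
 */
function mymodule_clean_rows(array $rows, $column, array $config, $cleaner = 'htmLawed') {
  foreach ($rows as $key => $row) {
    // Only touch rows that actually have a string in the target column.
    if (isset($row[$column]) && is_string($row[$column])) {
      $rows[$key][$column] = call_user_func($cleaner, $row[$column], $config);
    }
  }
  return $rows;
}
?>

After the require_once above, a call such as mymodule_clean_rows($rows, 'body', array('clean_ms_char' => 2)) would return the same rows with the 'body' column passed through htmLawed.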

There you go: sparkling clean HTML that is close to being W3C compliant!
