A neighborhood association I am involved in has a web site for local news that grew out of some hand-made PHP scripts from 20 years ago. The scripts have really shown their age, but the work involved in updating the site is enormous, as each page is its own PHP script with partial HTML, plus parts in various templates. It has over a thousand news articles that are worth keeping. What to do, in 2022, to get it manageable?
Well, it seems WordPress is the One True System™ for easy publishing these days, so import it there, then? But as I said, this is a home-grown system with over a thousand PHP pages with hand-edited code.
Perl to the rescue. Fortunately, there were two index files, one with “modern” articles (from after the initial scripting broke down) and one with “archived” pages (from before, where the page display quality was rather low), so at least I had a list of all the pages to import. Right? Well, not really: some had chained links to sub-pages (picture carousels and such), so I had to add those. And making something to parse a thousand “almost similar” PHP documents does require some work. But after several hours, I have managed to get the tool to read the indices, insert the missing pages where they belong, read the pages, parse their contents and metadata (headers, dates), and spit out a WordPress-compatible XML file of about 3 megabytes for import.
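This is not the actual script (that one is Perl), but a minimal Python sketch of the last step, turning parsed articles into a WordPress export (WXR) document. The dictionary keys (`title`, `author`, `html`, `date`, `slug`) are made up for the illustration, and the set of elements is only what I assume to be a small subset of the format; a real import will most likely need more fields.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export (WXR 1.2) files.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (NS[prefix], tag)


def build_wxr(articles):
    """Build a WXR-style XML string from a list of article dicts.

    Each dict uses hypothetical keys: title, author, html, date, slug.
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for art in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = art["title"]
        ET.SubElement(item, q("dc", "creator")).text = art["author"]
        # ElementTree escapes the HTML body; no CDATA section is needed.
        ET.SubElement(item, q("content", "encoded")).text = art["html"]
        ET.SubElement(item, q("wp", "post_date")).text = art["date"]
        ET.SubElement(item, q("wp", "post_name")).text = art["slug"]
        ET.SubElement(item, q("wp", "status")).text = "publish"
        ET.SubElement(item, q("wp", "post_type")).text = "post"
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(build_wxr([{
        "title": "Summer fair",
        "author": "admin",
        "html": "<p>See you at the square!</p>",
        "date": "2004-05-17 12:00:00",
        "slug": "summer-fair",
    }]))
```

Letting an XML library do the escaping, rather than gluing strings together, avoids a whole class of import errors when the old pages contain stray ampersands and angle brackets.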
Thanks to WP Sandbox, I have been able to test the XML (and iteratively fix the bugs in it), so I now have one XML file that imports all the articles (minus one that, no matter what I do, ends up being identified as a duplicate of another article; I will cut my losses and lose that one, it was just a link to another site anyway). Of course, I have none of the images and other pages on the sandbox, so I cannot really test that everything works as expected. That will have to be done on the target site.
Oh, and of course, I did make the script generate a giant .htaccess file with Redirect directives to map the old self-published URLs to the new WordPress URLs. We can’t have old links become invalid, can we?
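Generating those directives is the easy part. Again as a Python sketch rather than the real Perl code: given a mapping from old paths to new ones, emit one Apache `Redirect` line per article. The function name, the example paths and the `https://example.org` host are all placeholders for the illustration.

```python
def redirect_lines(mapping, site="https://example.org"):
    """Return Apache mod_alias Redirect directives, one per old URL.

    mapping: dict of old URL path -> new URL path (hypothetical data).
    site:    scheme and host of the new site (placeholder value).
    """
    lines = []
    for old_path, new_path in sorted(mapping.items()):
        # Redirect arguments are whitespace-separated, so a path with
        # spaces would need quoting; this sketch simply refuses it.
        if any(ch.isspace() for ch in old_path + new_path):
            raise ValueError("whitespace in path: %r" % old_path)
        # 301 = moved permanently, so browsers and crawlers update links.
        lines.append(f"Redirect 301 {old_path} {site}{new_path}")
    return lines


if __name__ == "__main__":
    example = {"/news/2004/fair.php": "/2004/05/summer-fair/"}
    print("\n".join(redirect_lines(example)))
```

One thing to keep in mind is that `Redirect` matches on path prefixes, so an old URL that is a directory-style prefix of other old URLs will catch those as well.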
Since this is obviously never ever going to be useful for anyone else, as this is a unique home-grown system, I have published the script over at GitHub.