Social Bookmarking
I’m working with faculty (and students) to use social bookmarking tools (specifically Diigo) to create dynamic and annotated resources for each other (and for and by students).
Transmogrifying those Google Reader JSON dumps into something useful
9For the last few years (my JSON feed tells me: since 2008), I have been tagging and annotating articles of interest as they passed before my eyes in Google Reader. This served a two-fold purpose:
- I could find them again later, easily, because they were tagged and annotated.
- I could share an RSS feed with those annotations to particular interest groups that I worked with (e.g. anything tagged “for robotics” would show up on my advanced computer science class’ portal page, or anything tagged “for academic computing” would show up on my school home page).
This was a great way to share (and manage) resources. Granted, much of what passed before my eyes in Google Reader was trivial and not of lasting value, but this filtering allowed me to hang on to at least a few gems for future reference.
And then Google Reader got the Google+ treatment and sharing items broke. But you could download a JSON dump of all the items that you had ever shared. It wasn’t entirely clear what you could do with this JSON dump, but… there it was. And then: I realized that all of my other information is stashed on my web server (and that I have become increasingly distrustful of relying on cloud services to maintain my data and workflows — e.g. my weekly backup of all my Google Docs… just in case).
Wouldn’t it be handy to import that JSON feed into a new blog on my server? So I wrote a PHP script that converts (at least my) Google Reader JSON dump into an XML file that WordPress can import as a list of posts. With the tags and annotations converted over. In fact, with all of the data in the JSON dump embedded in the XML file (although WordPress doesn’t read all of it).
This comes with a few caveats:
- For items that came from blogs with a full feed, the result is a republication of the original post — which feels ethically dubious to me. (I have made my new blog of Google Reader shared items private, so that I have the data but I’m not sharing it with the world).
- I’ve made guesses as to how to treat some of Google’s data. Reasoned, educated guesses, but guesses nonetheless. For example, I’m not super-clear on which dates in the file correspond with what events — does a publication date refer to when the item was shared or the original post was posted?
- I’ve added in some arbitrary (and therefore, ideally, eventually, configurable) WordPress tags to make the import go more smoothly. Where I have done that, I mark it in the script as a TODO item. (And, in truth, I didn’t really test to see if all of these items were necessary.)
- The original authors of the posts are transfered to the XML file, which means that when the actual import into WordPress is done, you will have the option to either laboriously create a new user for each distinct author or simply revert authorship to the currently logged-in WordPress user. It doesn’t seem like WordPress has a format for exporting or importing users (or, at least, my cursory search didn’t find it). Clearly an ancillary SQL query could be generated that pre-populated the WordPress database with the users that the XML file refers to. But I haven’t bothered to do that.
- You’ll need your own PHP-compatible webserver to run the script, since I have been quick and dirty and simply imported the JSON file from and exported the XML file to the script’s local directory. And I have no interest in setting up my world-facing webserver to take the traffic hit of processing other people’s multi-megabyte JSON dumps.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 | /**********************************************************************
* Google Reader to Wordpress
*
* 2011-11-27
* Seth Battis (seth@battis.net)
*
* This script takes the output of Google Reader's JSON export of
* shared items and converts it into an XML file that can be imported
* into a Wordpress blog as posts. All of the data in the original JSON
* file is preserved in the XML file, either by transfering it to
* an appropriate format (e.g. Google Reader categories are converted
* to WordPress post tags) or simply as an additional XML tag (e.g. the
* Google Reader commentInfo metadata for recent shared items). In
* situations where actual data has to be converted to make it readable
* for WordPress, the original data is included as the JSON attribute
* of that tag (e.g. timestamps and categories).
*
* As currently written, the script looks in its local directory for
* Google Reader "shared-items.json" file and generates a matching
* "shared-items.xml" file, also in its local directory.
*
* There are a number of potentially configurable (i.e. arbitrary)
* values marked as TODO.
*
* Caveat emptor: this has been tested against my ~1000 item Google
* Reader shared items feed and on my WordPress 3.2.1 site. I would
* presume that it should work fairly well for others, but make no
* guarantees!
*********************************************************************/
/* returns a Wordpress slug-version of the given text (only
alphanumeric characters and dashes) */
function sluggify($text)
{
return preg_replace("|[^a-z0-9]+|", "-", strtolower($text));
}
/* SimpleXML doesn't really support namespaces unless you force it */
$xml = '<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0"
xmlns:excerpt="http://wordpress.org/export/1.1/excerpt/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:wp="http://wordpress.org/export/1.1/"
></rss>';
$rss = new SimpleXMLElement($xml);
$namespaces = $rss->getDocNamespaces(true);
/* load the Google Reader JSON file */
$file = file_get_contents("shared-items.json");
$json = json_decode($file, true);
/* Wordpress will choke if our post names aren't unique, so we track
them separately */
$post_names = array();
/* header information describing the file itself */
$channel = $rss->addChild("channel");
$channel->addAttribute("direction", $json["direction"]);
$channel->addAttribute("id", $json["id"]);
$channel->addAttribute("self", $json["self"][0]["href"]);
$channel->addAttribute("author", $json["author"]);
$channel->addChild("title", $json["title"]);
$channel->addChild("link");
$channel->addChild("description");
$pubDate = $channel->addChild("pubDate", gmdate("D, j M Y G:i:s O", $json["updated"]));
$pubDate->addAttribute("json", $json["updated"]);
$channel->addChild("language");
$channel->addChild("wxr_version", "1.1", $namespaces["wp"]);
/* run through the list of items and add them to the XML */
foreach ($json["items"] as $item)
{
$rssItem = $channel->addChild("item");
/* a bunch of Google Reader-specific metadata */
if (isset($item["isReadStateLocked"]))
{
$rssItem->addAttribute("isReadStateLocked", $item["isReadStateLocked"]);
}
$rssItem->addAttribute("crawlTimeMsec", $item["crawlTimeMsec"]);
$rssItem->addAttribute("timestampUsec", $item["timestampUsec"]);
$rssItem->addAttribute("id", $item["id"]);
if (isset($item["commentInfo"]))
{
while (list($commentInfoKey, $commentInfoValue) = each($item["commentInfo"]))
{
$commentInfo = $rssItem->addChild("commentInfo", $commentInfoKey);
$commentInfo->addAttribute("permalinkUrl", $commentInfoValue["permalinkUrl"]);
$commentInfo->addAttribute("commentState", $commentInfoValue["commentState"]);
}
}
/* annoyingly, not every item has its content in the content
element -- sometimes it's in the summary element (and once in a
while, it's just not there). I think this is an artifact of how
the original RSS feeds were constructed. I think. */
if (isset($item["content"]))
{
$content = $item["content"]["content"];
}
else if (isset($item["summary"]["content"]))
{
$content = $item["summary"]["content"];
}
else
{
$content = "";
}
/* sometimes items don't even have titles */
if (isset($item["title"]))
{
$rssItem->addChild("title", $item["title"]);
}
/* most items store the original linkback information in the
alternate element -- Wordpress won't honor this link tag when
it's imported (i.e. it won' treat it like the Daring Fireball
feed), so I have embedded a more descriptive linkback after the
annotations at the start of the content, using some information
from the origin element. The linkback is in the
"google-reader-alternate-href" div (for easy CSS-wrangling!) */
if (isset($item["alternate"]))
{
$rssItem->addChild("link", htmlentities($item["alternate"][0]["href"], ENT_COMPAT, "UTF-8"));
$content = htmlspecialchars("<div class=\"google-reader-alternate-href\"><p>Originally posted at <a href=\"{$item["alternate"][0]["href"]}\">{$item["origin"]["title"]}</a></p></div>", ENT_COMPAT, "UTF-8") . $content;
}
/* I haven't bothered to figure out if the dates are really GMT or
localized -- GMT was an easier assumption to make */
$pubDate = $rssItem->addChild("pubDate", gmdate("D, j M Y G:i:s O", $item["published"]));
$pubDate->addAttribute("json", $item["published"]);
/* not every item has an a author, either -- again, an artifact of
the original RSS feeds */
if (isset($item["author"]))
{
$rssItem->addChild("creator", $item["author"], $namespaces["dc"]);
}
/* annotations were tricky -- I have added them as their own XML
tags _and_ inserted them within a "google-reader-annotation" div
at the top of the post content (to match the original format on-
screen). All of the original data is preserved in the XML tag,
with an ID that matches the embedded div ID. */
foreach($item["annotations"] as $annotation)
{
$annotationHTML = htmlentities("<div id=\"" . md5($annotation["content"] . $annotation["author"]) . "\" class=\"google-reader-annotation\"><blockquote><p>{$annotation["content"]}</p><p class=\"author\">{$annotation["author"]}</p></blockquote></div>", ENT_COMPAT, "UTF-8");
$content = $annotationHTML . $content;
$rssAnnotation = $rssItem->addChild("annotation", $annotation["content"]);
$rssAnnotation->addAttribute("id", md5($annotation["content"] . $annotation["author"]));
$rssAnnotation->addAttribute("author", $annotation["author"]);
$rssAnnotation->addAttribute("userId", $annotation["userId"]);
$rssAnnotation->addAttribute("profileId", $annotation["profileId"]);
$rssAnnotation->addAttribute("profileCardParams", $annotation["profileCardParams"]);
}
/* again, sometimes content is in content, sometimes it's in the
summary element */
$rssContent = $rssItem->addChild("encoded", $content, $namespaces["content"]);
if (isset($item["content"]))
{
$rssContent->addAttribute("direction", $item["content"]["direction"]);
}
if (isset($item["summary"]))
{
$excerpt = $rssItem->addChild("encoded", $item["summary"]["content"], $namespaces["excerpt"]);
$excerpt->addAttribute("direction", $item["summary"]["direction"]);
}
/* more Google reader metadata, this time about the original feed
that the item came from -- which is used above to format the
linkback that is embedded a the start of the content */
$origin = $rssItem->addChild("origin");
$origin->addAttribute("streamId", $item["origin"]["streamId"]);
$origin->addAttribute("title", $item["origin"]["title"]);
$origin->addAttribute("htmlUrl", $item["origin"]["htmlUrl"]);
/* it's not clear to me whether the published or modified date is
when the original post was published or when the item <span class="hiddenGrammarError" pre="item ">was
shared</span> -- I think when published refers to when it was shared. */
$postDate = $rssItem->addChild("post_date", date("Y-m-d G:i:s", $item["published"]), $namespaces["wp"]);
$postDate->addAttribute("json", $item["published"]);
$rssItem->addChild("comment_status", "open", $namespaces["wp"]); // TODO make configurable
$rssItem->addChild("ping_status", "open", $namespaces["wp"]); // TODO make configurable
/* make a Wordpress friendly title slug for the post */
if (isset($item["title"]))
{
$slug = sluggify($item["title"]);
}
else
{
/* if no title, generate the slug from the timestamp */
$slug = date("Y-m-d-G-i-s", $item["published"]);
}
/* make sure that our slug is unique -- add a counter to the end
if it is not, and track those counter values in $post_names[] */
if (isset($post_names[$slug]))
{
$post_names[$slug]++;
$slug .= "-" . $post_names[$slug];
}
else
{
$post_names[$slug] = 0;
}
$rssItem->addChild("post_name", $slug, $namespaces["wp"]);
/* more Wordpress metadata -- all of which could be tweaked */
$rssItem->addChild("status", "publish", $namespaces["wp"]); // TODO make configurable
$rssItem->addchild("post_parent", 0, $namespaces["wp"]); // TODO make configurable
$rssItem->addChild("menu_order", 0, $namespaces["wp"]); // TODO make configurable
$rssItem->addChild("post_type", "post", $namespaces["wp"]); // TODO make configurable
$rssItem->addChild("post_password", "", $namespaces["wp"]); // TODO make configurable
$rssItem->addChild("is_sticky", 0, $namespaces["wp"]); // TODO make configurable
/* convert categories to post tags -- nota bene that Google Reader
has conflated the reader's categories with the original post's
tags, creating a... mish-mash. */
foreach($item["categories"] as $category)
{
if (!preg_match("|.*/com\.google/.*|", $category))
{
$cleanCategory = $category;
$cleanCategory = preg_replace("|user/\d+/label/(.*)|", "$1", $cleanCategory);
$rssCategory = $rssItem->addChild("category", htmlentities($cleanCategory, ENT_COMPAT, "UTF-8"));
$rssCategory->addAttribute("domain", "post_tag");
$rssCategory->addAttribute("nicename", sluggify($cleanCategory));
$rssCategory->addAttribute("json", $category);
}
}
/* add comments -- note that for privacy reasons, while the
commenter's metadata is added as an XML tag, it is not embedded
in the Wordpress-readable wp:comment tags */
foreach($item["comments"] as $comment)
{
$rssComment = $rssItem->addChild("comment", "", $namespaces["wp"]);
$rssComment->addAttribute("id", $comment["id"]);
$commentContent = $rssComment->addChild("comment_content", $comment["htmlContent"], $namespaces["wp"]);
$commentContent->addAttribute("plainContent", $comment["plainContent"]);
$author = $rssComment->addChild("comment_author", $comment["author"], $namespaces["wp"]);
$author->addAttribute("userId", $comment["userId"]);
$author->addAttribute("profileId", $comment["profileId"]);
$author->addAttribute("profileCardParams", $comment["profileCardParams"]);
$author->addAttribute("venueStreamid", $comment["venueStreamId"]);
$commentDate = $rssComment->AddChild("comment_date", $comment["createdTime"], $namespaces["wp"]);
$commentDate->addAttribute("modifiedTime", $comment["modifiedTime"]);
$rssComment->addAttribute("isSpam", $comment["isSpam"]);
}
}
/* dump the converted XML out as a file */
header ("Content-type: text/xml");
echo $rss->asXML();
file_put_contents("shared-items.xml", $rss->asXML()); |