and I'm all out of bubble gum…

So, a couple days ago, Nate posted about my Yahoo Pipes solution for a blog comment aggregating conundrum he was running into in his history class. My solution ended up being a little technical, but I think it’s an interesting enough example of the power of Yahoo Pipes that it’s worth talking through what’s going on. (I love Yahoo Pipes — although one of my colleagues is a major fan of WebRSS, which meets similar needs.)

A mildly cleaned-up version of the solution is below, in the Pipes’ visual programming diagram:

Yahoo Pipes Solution

  1. The pipe itself starts on the left, with the Fetch Feed module, into which Nate had helpfully plugged the comment feeds for all of his students. This takes all of the comment feeds from their blogs and aggregates them into one enormous comment feed.
  2. Following the blue “pipe” from the Fetch Feed module, we reach the Rename module, which is actually used here to copy the “link” field of each item into a new field named “orig_author”. Clearly, this sentence needs some explanation. In an RSS feed, each item represents a single entry (in this case, a single comment on a particular blog). The RSS feed is made up of a bunch of items (usually the ten or so most recent items in the feed, so the ten most recent comments on each blog). We have aggregated these ten most recent comments from each blog into a feed that includes the hundred or so most recent comments from all the blogs (ten blogs × ten comments each = one hundred comments). Each item is made up of a series of fields that describe the item (title, description, link, etc.). Normally, in a feed reader, we only see the title and description, although we may also “see” the link field when the title is turned into a link back to the original comment on the originating blog. In this case, I have made a copy of the link field, for future reference, so that every item now has a second copy of the link field, named “orig_author” (short for original author, because programmers are lazy). It turns out that, in the comment feeds, the link is the only part of the item that refers to the name of the blog on which the comment was made, and it does so only cryptically (e.g. “…”, i.e., as the first word in the address of the blog itself). More on this in a second.
  3. Again, following the blue pipe from the Rename module to the Regex module, we see two lines, which read, more or less:
    1. In item.orig_author replace


    2. In item.title replace
      (Comment on )(.*)( by .*)


      $1${orig_author}’s post “$2”$3

    What this means, in layman’s terms, is basically that I want to take whatever text appears in the orig_author field (which is a copy of the link field, which was a link to the original comment on the originating blog), and extract only the name of the blog from the URL. I won’t dive into the details of regular expressions right now, but what the first one above essentially says is “look for a pattern that starts with ‘http://’ followed by any number of characters — (.*) — followed by ‘.wordpress’ followed by any number of characters — .* — and replace the text that matches that pattern with whatever was in the first set of parentheses — $1.” In other words, replace the entire link with just the name of the blog that’s between ‘http://’ and ‘.wordpress’.
    The second regular expression is a bit more involved. At this point, it helps to realize that each set of parentheses is referred to by its position in the original pattern, prefaced by a dollar sign: $1, $2, $3, etc. In this case, we’re looking at the original title of the comment item, which started off as something like “Comment on My First Blog Post by Seth B.” I want to take my newly discovered blog name (from the first regular expression) and insert it into this title in a meaningful way. To do that, I create a pattern that breaks the original title into its component phrases: “Comment on ”, “My First Blog Post” and “ by Seth B.” With these in hand, I just plug in the current value of the orig_author field (which I clipped down to the blog name in the previous regular expression), add some nice curly quotes, and put it all back together again to read something like “Comment on nkogan’s post “My First Blog Post” by Seth B.”

  4. Finally, following the pipe down to the Sort module, I’ve added the handy little fillip of sorting all of the comments in our aggregated feed by the time at which they were posted (rather than grouping them by the blog on which they were posted, as they would be coming out of the original Fetch Feed module: the first blog linked to, followed by the second blog linked to, etc.).
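
For anyone who thinks better in code than in diagrams, the four steps above can be sketched roughly in Python. This is only an approximation of the idea, not what Pipes actually runs: the sample items, their URLs and the pubDate field are made up for illustration, the first pattern is reconstructed from the description above, and sorting newest-first is my assumption about the sort direction.

```python
import re
from datetime import datetime

# Made-up sample items standing in for the aggregated output of Fetch Feed.
# The URLs, commenter names and dates are illustrative only.
items = [
    {"title": "Comment on My First Blog Post by Seth B.",
     "link": "http://nkogan.wordpress.com/2009/08/29/my-first-blog-post/#comment-1",
     "pubDate": datetime(2009, 8, 29, 10, 0)},
    {"title": "Comment on My First Blog Post by A. Student",
     "link": "http://nkogan.wordpress.com/2009/08/29/my-first-blog-post/#comment-2",
     "pubDate": datetime(2009, 8, 31, 8, 0)},
    {"title": "Comment on Week One Reading by Seth B.",
     "link": "http://otherblog.wordpress.com/2009/08/30/week-one-reading/#comment-7",
     "pubDate": datetime(2009, 8, 30, 12, 0)},
]

# First rule, reconstructed from the prose description (an assumption):
# keep only what sits between "http://" and ".wordpress".
BLOG_NAME = re.compile(r"http://(.*)\.wordpress.*")
# Second rule, as written in the pipe: split the title into three groups.
TITLE_PARTS = re.compile(r"(Comment on )(.*)( by .*)")


def copy_link(item):
    """Rename module in copy mode: duplicate 'link' as 'orig_author'."""
    out = dict(item)
    out["orig_author"] = out["link"]
    return out


def apply_regexes(item):
    """Regex module: trim orig_author to the blog name, then rebuild the title."""
    out = dict(item)
    out["orig_author"] = BLOG_NAME.sub(r"\1", out["orig_author"])
    blog = out["orig_author"]

    def retitle(match):
        prefix, post, suffix = match.groups()
        # \u2019, \u201c and \u201d are the curly apostrophe and curly quotes.
        return f"{prefix}{blog}\u2019s post \u201c{post}\u201d{suffix}"

    out["title"] = TITLE_PARTS.sub(retitle, out["title"])
    return out


def run_pipe(feed_items):
    """Copy the link, apply both regexes, then sort by time posted."""
    rewritten = [apply_regexes(copy_link(item)) for item in feed_items]
    # Newest first is an assumption; drop reverse=True for oldest first.
    return sorted(rewritten, key=lambda item: item["pubDate"], reverse=True)


for item in run_pipe(items):
    print(item["pubDate"], item["title"])
```

Note that both patterns use the greedy (.*), so a post title that itself contains “ by ” would be split at the last “ by ” rather than the first. That happens to be what we want here, since the commenter’s name comes at the end.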

I hope this is useful, or at least intriguing, in thinking about using both Yahoo Pipes and regular expressions. A really wonderful reference on regular expressions can be found at, and I really like JRX as a tool for fine-tuning my regular expressions as I write them.

August 31st, 2009

Posted In: How To
