Scratching my itch

It’s been a while since I wrote some code to scratch purely my own itch. Most of my time is spent writing code to scratch my employer’s itches, and occasionally I get to write little programs that scratch small itches I get while writing code for my employer — things like extensions to git-p4 that allow me to pull information from a git repository and use it to generate merge commands for Perforce, so I don’t have to figure out which commits/changes I want to merge.

I know; nothing someone else would be interested in.

But I’ve had an itch for a little while that someone else might be interested in. I listen to the NPR Hourly News Summary via my podcast app on my Nexus 6. The web page lists a bunch of back episodes, but the RSS feed only publishes the most recent summary. I want to listen to SOME of the older episodes, just not all of them. What I had been doing was having my podcast app keep every episode and then marking the ones I wasn’t interested in as done, but that was tedious, especially considering I only wanted to listen to four or five of the 24 episodes published each day.

So I thought about what I wanted. I wanted a program that would fetch the RSS feed every hour, check whether the currently published episode was one of the ones I wanted, and, if it was, store it in a database and then spit out a new RSS feed with the last N episodes I’d stored. I realized I didn’t need to fetch the episode audio itself, because NPR doesn’t remove episodes once they drop out of the RSS feed (as evidenced by the web page listing multiple episodes). I also realized I didn’t need to generate this new RSS feed dynamically: NPR’s feed only gets updated once an hour, so I only needed to regenerate my feed when NPR’s was updated. And since I wasn’t generating the feed every time my podcast app asked for it, I could generate it on my desktop computer (which has way more computing power than my web server) and just copy the XML file up to the server.

And, of course, I wanted to use perl, because that’s my favorite programming language.

One of perl’s strengths is that, whatever you want to do, there’s probably a CPAN module that will do the heavy lifting for you. There’s also the Perl Cookbook for commonly used patterns in perl programming. I found the recipe for Reading and Writing RSS Files, which has an example of filtering an RSS feed and generating a new one. The example uses LWP::Simple to fetch the RSS feed, XML::RSSLite to parse it, and XML::RSS to generate the new feed. The cookbook even states, “It would be easier, of course, to do it all with XML::RSS”. So I did.
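
Roughly, that filter-and-regenerate pattern, done entirely with XML::RSS, looks something like the sketch below. (This is just an illustration of the idea, not the Cookbook’s code; the actual recipe uses XML::RSSLite for the parsing step, and I’ve hardcoded a couple of times into the filter.)

use strict;
use LWP::Simple qw(get);
use XML::RSS;

my $url = 'http://www.npr.org/rss/podcast.php?id=500005';
my $xml = get($url) or die "couldn't fetch $url";

# parse the live feed
my $old = XML::RSS->new;
$old->parse($xml);

# keep only the items whose titles mention a time we care about
my @keep = grep { $_->{title} =~ /\b(?:7AM|7PM)\b/i } @{ $old->{items} };

# build a brand-new feed containing just those items
my $new = XML::RSS->new( version => '2.0' );
$new->channel(
    title       => $old->channel('title'),
    link        => $old->channel('link'),
    description => $old->channel('description'),
);
$new->add_item(%$_) for @keep;
print $new->as_string;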

Actually, I didn’t rewrite the RSS all that much. Rather than building a completely new RSS feed, I used XML::RSS to parse the existing feed and extract the one item from it. But even though XML::RSS has a method for adding items to a feed, it doesn’t have one for removing them. That left me no choice but to dig through the source code of XML::RSS and figure out what it would take to clear out the list of items. Once I’d cleared the one item out of the feed, I re-loaded it with the items I’d stored.

Wait… I had stored items, right? Oh, crud, I forgot about that part. OK, I need to store the last N items. I could use a text file, but that’s difficult to manage. I could set up a database in PostgreSQL or MySQL, but that’s a lot of overhead for storing just a bit of data. If only there were a self-contained, serverless, zero-configuration, transactional SQL database engine. Something like… SQLite!

So I set up a simple schema: one table with two columns, one to hold the timestamp of the episode and one to hold the blob of data I needed to shove the episode back into the feed. Since it’s SQLite, I didn’t expect the data file to exist the first time I ran the program, so I put in a test to see whether the file existed before I connected to it and, if it didn’t, created the schema right after connecting.

The rest is fairly straightforward: I checked the current episode extracted from the feed to see if it was one of the times I wanted. If it wasn’t, I just skipped ahead to generating the new feed. If it was, I checked whether I already had it in the database; if I did, I skipped ahead. If it wasn’t in the database yet, I added it and then deleted everything older than 7 days.

#!/Users/packy/perl5/perlbrew/perls/perl-5.22.0/bin/perl -w

use strict;

use DBI;
use Data::Dumper::Concise;
use DateTime;                 # used directly for DateTime->now()
use DateTime::Format::Mail;
use LWP::Simple;
use XML::RSS;

use feature 'say';

# list of times we want
my @keywords = qw( 7AM 8AM 12PM 7PM );
my $days_to_keep = 7;

# get the RSS
my $URL = 'http://www.npr.org/rss/podcast.php?id=500005';
my $content = get($URL)
    or die "Failed to fetch $URL";

# parse the RSS
my $rss = XML::RSS->new();
$rss->parse($content);


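# filter the feed's items; note that _get_items isn't part of XML::RSS's
# documented interface, it just returns the internal list of parsed items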
my @items = get_items( $rss->_get_items );

# make new RSS feed
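# XML::RSS has no method for removing items, so clear its internal
# item list and count directly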
$rss->{num_items} = 0;
$rss->{items} = [];

foreach my $item ( @items ) {
    $rss->add_item(%$item);
}

say $rss->as_string;


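# Filter the parsed feed items down to the times we want, insert any new
# ones into the database, prune entries older than $days_to_keep days, and
# return everything currently stored.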
sub get_items {
    my $items = shift;

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    my $mail = DateTime::Format::Mail->new;

    my $dbh = get_dbh();

    my $insert = $dbh->prepare("INSERT INTO shows (pubdate, item) ".
                               "           VALUES (?, ?)");

    my $exists_in_db = $dbh->prepare("SELECT COUNT(*) FROM shows ".
                                     " WHERE pubdate = ?");

    foreach my $item (@$items) {
        my $title = $item->{title};
        $title =~ s{\s+}{ }g;   # collapse runs of whitespace
        $title =~ s{^\s+}{};    # trim leading whitespace
        $title =~ s{\s+$}{};    # trim trailing whitespace

        if ($title !~ /$re/) {
            next;
        }

        my $dt = $mail->parse_datetime($item->{pubDate});
        my $epoch = $dt->epoch;

        $exists_in_db->execute($epoch);
        my ($exists) = $exists_in_db->fetchrow;
        if ($exists > 0) {
            next;
        }

        $insert->execute($epoch, Dumper($item));
    }

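    # prune anything older than we want to keep, then pull back what's left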
    my $now = DateTime->now();
    my $too_old = $now->epoch - ($days_to_keep * 24 * 60 * 60);
    $dbh->do("DELETE FROM shows WHERE pubdate < $too_old");

    my $query = $dbh->prepare("SELECT * FROM shows ORDER BY pubdate");
    $query->execute();

    my @list;
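    # the item column holds a Data::Dumper string; eval it back into a hashref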
    while ( my($pubdate, $item) = $query->fetchrow ) {
        push @list, eval $item;
    }

    return @list;
}

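# Connect to the SQLite data file, creating the schema on the first run
# (i.e., when the data file doesn't exist yet).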
sub get_dbh {
    my $file = '/Users/packy/data/filter-npr-news.db';
    my $exists = -f $file;

    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=$file",
        "", "",
        { RaiseError => 1 },
    ) or die $DBI::errstr;

    unless ($exists) {
        # first time - set up database
        $dbh->do("CREATE TABLE shows (pubdate INTEGER PRIMARY KEY, item TEXT)");
    }
    return $dbh;
}

And it worked!  I then created a git repository on my desktop for it, and pushed it up to my GitHub account under its own project, filter-npr-news.

And that kept me satisfied for a whole day.

Next time, I’ll write about what started bothering me and the changes I made to fix that.