Preview of my next post…

My next project in the “whittling wood”?  Figuring out why XML::RSS::LibXML parses these tags without any problem:

<itunes:category text="News &amp; Politics"/>
<itunes:image href="http://media.npr.org/images/podcasts/2013/primary/hourly_news_summary-c464279737c989a5fbf3049bc229152af3c36b9d.png?s=1400"/>

and produces this internal data structure:

      category => bless( {
        _attributes => [
          "text"
        ],
        _content => "",
        text => "News & Politics"
      }, 'XML::RSS::LibXML::MagicElement' ),
      image => bless( {
        _attributes => [
          "href"
        ],
        _content => "",
        href => "http://media.npr.org/images/podcasts/2013/primary/hourly_news_summary-c464279737c989a5fbf3049bc229152af3c36b9d.png?s=1400"
      }, 'XML::RSS::LibXML::MagicElement' ),

But then it doesn’t include these tags anywhere in the re-rendered XML when it spits it back out again. I know it has to do with the fact that these tags have no content (there’s no opening and closing tag, there’s just the one tag closed with a />), but I don’t know why XML::RSS isn’t properly rendering it when it converts the data structure back to XML.

The easy work-around would be to just do some string matching, recognize these tags in the original XML, copy them and re-insert them into the rendered XML afterwards.
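
Here is a minimal sketch of that string-matching work-around, assuming the tags in question are channel-level, self-closing itunes: tags. The helper name reinsert_empty_itunes_tags is made up for illustration; it is not part of XML::RSS or XML::RSS::LibXML.

```perl
use strict;
use warnings;

# Hypothetical helper: copy self-closing channel-level <itunes:...> tags
# from the original feed into the re-rendered feed, before </channel>.
sub reinsert_empty_itunes_tags {
    my ($original, $rendered) = @_;

    # ignore anything inside <item> blocks; we only want channel tags
    (my $channel_only = $original) =~ s{<item\b.*?</item>}{}gs;

    # grab every self-closing itunes tag, e.g. <itunes:image href="..."/>
    my @tags = $channel_only =~ m{(<itunes:[\w-]+\b[^<>]*/>)}g;

    # skip any tag the renderer did manage to emit
    @tags = grep { index($rendered, $_) < 0 } @tags;
    return $rendered unless @tags;

    # put the missing tags back just before the first </channel>
    my $block = join '', map { "$_\n" } @tags;
    $rendered =~ s{(</channel>)}{$block$1};

    return $rendered;
}
```

It works, but it treats XML as a string, so it will break the moment the feed formats those tags differently.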

The more difficult fix is to figure out what’s wrong with XML::RSS and try to fix it myself.

Guess which road I’m taking?

But what if I’ve still got an itch?

Ok, I wanted to write a follow-up post about my little program and the changes that I needed to make to it about a day later, but I found myself writing more and more code, and not having any time to actually write about writing the code. My wife, Kay, has dubbed it “my whittling wood”. So, let’s run down things that started to bother me about my creation…

The first thing that bothered me was when I was walking out of my office the first night. I checked my podcast app, and I didn’t have the 7PM news podcast yet, and it was 7:15PM already. I knew immediately what happened: NPR had been late updating the feed, and my cron job had run at 7:12 and missed it. So I thought about how to fix that problem (I managed to get the 7PM news podcast at 8:12 because NPR was late with the 8PM episode as well, so I was able to pick up the 7PM episode on that run).

I immediately dismissed the idea of running the script multiple times an hour. It didn’t feel clean to me. What I decided I needed to do was check to see if the episode I was looking at this time was different than the episode I was looking at the last time the script ran–I would need a new table to track this–and, if it was the same episode, sleep for a minute or two and try again, and continue retrying until I either got a new episode or I decided I’d waited long enough (20 attempts seemed to be a good cutoff number).
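
Stripped of the feed and database details, that retry idea looks roughly like this. This is a sketch, not the script itself: wait_for_new_episode and its named arguments are hypothetical, and the fetch callback stands in for "get the feed and return the episode's publish time".

```perl
use strict;
use warnings;

# Hypothetical helper: call $arg{fetch} until it returns a publish time
# that differs from $arg{last_seen}, sleeping between attempts, for at
# most $arg{max_attempts} tries.  Returns the value from the last try.
sub wait_for_new_episode {
    my (%arg) = @_;
    my $current;

    foreach my $attempt (1 .. $arg{max_attempts}) {
        $current = $arg{fetch}->();

        # a different publish time means a new episode has arrived
        last if !defined $arg{last_seen} || $current != $arg{last_seen};

        # don't bother sleeping after the final attempt; just give up
        last if $attempt == $arg{max_attempts};

        sleep $arg{sleep_for};
    }

    return $current;
}

# usage: pretend the feed updates on the third fetch
my @feed = (100, 100, 200);
my $got  = wait_for_new_episode(
    fetch        => sub { shift @feed },
    last_seen    => 100,
    max_attempts => 20,
    sleep_for    => 0,   # a real run would sleep a minute or two
);
print "latest pubdate: $got\n";
```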

Of course, I also decided I wanted to be able to check up on what was happening, so I needed to write a log file. If I was going to be able to see this log file when I wasn’t home, however, I’d have to copy it up to my web server along with the RSS feed XML.

And this brings me to where I was when I wrote my first post. I already had more code in the script, but I blogged about the first draft, wanting to come back to this second draft with a followup blog post.

And that’s when things got crazy.

We had a big filming day coming up for PacKay Productions that weekend, and I had a lot of work to do, some of which I’d already done and blogged about. After the filming was done, I needed to prep for Halloween.  And even with the changes I’d made to this script, things were going wrong with my setup.

One of the things I did wrong was setting up my wrapper shell script to run the perl program.  I’m not really adept at Bourne shell scripting, and I always leave things out. Then, last Thursday night, I was idly wondering how easy it would be to correct the other major annoyance I have with the NPR Hourly News Summary: the inconsistency of the sound levels.

Sometimes, the news summary is recorded at a good level, and I’m able to hear everything just fine.  Other times, the levels are set so low that even with my player’s volume cranked all the way up and my headphones pressed into my ears, I find it impossible to hear what’s being said over the sounds of the street in New York City.

So, of course, I started looking to see if somebody else had already solved my problem.  I ran across this post in the Ask Ubuntu StackExchange forums, which outlined two solutions: Audacity, an open source visual sound editor I was already intimately familiar with, and SoX, which was billed as “the Swiss Army knife of sound processing programs”.

SoX is a command line tool for processing audio files, and the more I read about it, the more I liked it.  Normalizing an audio file used to be a two-step process in SoX: running a command once in an analysis mode to get the maximum volume of the file, and a second time to boost that volume to the maximum possible without distortion. However, with version 14.3 of SoX, its developers made all of that possible in one single command:

sox --norm infile outfile

I briefly pondered cloning SoX’s git repository and building from source, but I realized that chances were slight that I was going to be making changes to SoX; I just wanted it as a command line tool.  So I turned to one of the most wonderful things you can have on your Mac: Homebrew.

Homebrew is a package manager for OS X that’s all git and ruby under the hood, and it has a beer theme! It installs software in a “Cellar”. It doesn’t have packages, it has “bottles”.  It even uses the beer emoji: 🍺

Installing new software with Homebrew is painfully easy:

brew install sox

Once I got SoX installed, modifying my code to use it was dead easy.

Finally, I decided to tackle the big thing that I wasn’t doing in the program itself: copying files up to the webserver. At first I looked at Net::SCP, but for some reason I couldn’t get it to work (it kept telling me that my remote directory didn’t exist).  So I switched over to Net::OpenSSH, and I was able to get the copy working.

I also cleaned up the code a lot, and added a ton of comments.  I want this code to be able to document itself, so it’s really obvious what I’m doing and why. Some would say that once a program is working, it’s done.  But when I’m writing code for myself, it’s not done until I’ve commented the heck out of it, because I know myself: a year later, I’m going to come back to this code and think “What was I smoking when I wrote this?”

I doubt I’ll think that when I come back to this code.

#!/Users/packy/perl5/perlbrew/perls/perl-5.22.0/bin/perl -w

use DBI;
use Data::Dumper::Concise;
use DateTime;
use DateTime::Format::Mail;
use LWP::Simple;
use Net::OpenSSH;
use URI;
use XML::RSS;
use strict;

use feature qw( say state );

# define all the things!
use constant {
    URL         => 'http://www.npr.org/rss/podcast.php?id=500005',
    TITLE_ADD   => ' (filtered by packy)',
    TITLE_MAX   => 40, # characters
    SLEEP_FOR   => 120, # seconds (2 minutes)
    MAX_RETRIES => 10,
    KEEP_DAYS   => 7,

    REMOTE_HOST => 'www.dardan.com',
    REMOTE_USER => 'dardanco',
    REMOTE_DIR  => 'www/packy/',

    MEDIA_URL   => 'https://packy.dardan.com/npr',

    TZ          => 'America/New_York',
    LOGFILE     => '/tmp/npr-news.txt',
    XMLFILE     => '/tmp/npr-news.xml',
    IN_DIR      => '/tmp/incoming',
    OUT_DIR     => '/tmp/outgoing',
    DATAFILE    => '/Users/packy/data/filter-npr-news.db',

    SOX_BINARY  => '/usr/local/bin/sox',
};

# list of times we want - different times on weekends
my @keywords = is_weekday() ? qw( 7AM 8AM 12PM 6PM 7PM )
             :                qw( 7AM     12PM     7PM );

my $dbh = get_dbh();  # used in a couple places, best to be global

my $rss;   # these two vars are only used in the main code block,
my $items; # but can't be scoped to the foreach loop

# since, for cosmetic reasons, we're starting the count at 1, we need
# to loop up to MAX_RETRIES + 1; otherwise, we'll only have the first
# attempt and then (MAX_RETRIES - 1).  If I'd called the constant
# MAX_ATTEMPTS then it would make sense to start at zero...
foreach my $retry (1 .. MAX_RETRIES + 1) {

    # get the RSS
    write_log("Fetching " . URL);
    my $content = get(URL);

    # parse the RSS using a subclass of XML::RSS
    $rss = XML::RSS::NPR->new();
    $rss->parse($content);
    write_log("Parsed XML");

    $items = $rss->_get_items;

    # if a new show was published in the feed, we don't need to wait
    # in a loop for a new one
    last unless same_show_as_last_time( $items );

    # we don't want the script to wait forever - if no new episode
    # appears after a maximum number of retries, give up and generate
    # the feed with the episodes we have
    if ($retry > MAX_RETRIES) {
        write_log("MAX_RETRIES (".MAX_RETRIES.") exceeded");
        last;
    }

    # for debugging purposes, I want to be able to not have the script
    # sleep, and the choices were add command line switch processing
    # or check an environment variable. This was the simpler option.
    if ($ENV{NPR_NOSLEEP}) {
        last;
    }

    # log the fact that we're sleeping so we can observe what the
    # script is doing while it's running
    write_log("Sleeping for ".SLEEP_FOR." seconds...");

    # since I usually want to listen to these podcasts when I'm away
    # from my desktop computer, copy the log file up to the webserver
    # so I can check on it remotely.  this way, if it's spending an
    # inordinate amount of time waiting for a new episode, I can see
    # that from my phone's browser...
    push_log_to_remotehost();

    # actually sleep
    sleep SLEEP_FOR;

    # and note which number retry this is
    write_log("Trying RSS feed again (retry #$retry)");
}

# test to see if the new item matches our inclusion criteria, and then
# fill the item list with items we've cached in our database
get_items_from_database( $items );

# make new RSS feed devoid of the original items... ok, ITEM
$rss->clear_items;

foreach my $item ( @$items ) {
    $rss->add_item(%$item);
}

re_title($rss);

write_log("Writing RSS XML to " . XMLFILE);
open my $fh, '>', XMLFILE
    or die "Can't write " . XMLFILE . ": $!";
say {$fh} $rss->as_string;
close $fh;
push_xml_to_remotehost();

#################################### subs ####################################

sub get_items_from_database {
    my $items = shift;

    # build the regex for matching desired episodes from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    my $insert = $dbh->prepare("INSERT INTO shows (pubdate, item) ".
                               "           VALUES (?, ?)");

    my $exists_in_db = $dbh->prepare("SELECT COUNT(*) FROM shows ".
                                     " WHERE pubdate = ?");

    # I know the feed only has the one item in it, but it SHOULD have
    # more, so let's go through the motions of checking each item

    foreach my $item (@$items) {

        # pawn off the specifics of how we get the information to a sub
        my ($epoch, $title) = item_info($item);

        # again, for debugging purposes, I wanted to be able to not
        # have the script skip the current item, and the choices were
        # add command line switch processing or check an environment
        # variable. This was the simpler option.

        if ($title !~ /$re/ && ! $ENV{NPR_NOSKIP}) {
            write_log("'$title' doesn't match $re; skipping");
            next;
        }

        # check to see if we already have it in the DB
        $exists_in_db->execute($epoch);
        my ($exists) = $exists_in_db->fetchrow;

        if ($exists > 0) {
            write_log("'$title' already in database; skipping");
            next;
        }

        # the NPR news podcast is notoriously bad at normalizing the
        # volume of its broadcasts; some are easy to hear and some are
        # so quiet it's impossible to hear them when listening on a
        # city street, so, let's normalize them to a maximum volume

        normalize_audio($item);

        write_log("Adding '$title' to database");

        # it's easier to store the data in the episode cache table as
        # a perl representation of the parsed data than it is to
        # serialize it back into XML and then re-parse it when we need
        # it again.
        $insert->execute($epoch, Dumper($item));
    }

    # go through the database and dump episodes that are older than
    # our retention period.  Since we're using epoch time (seconds
    # since some date, usually midnight 1970-01-01) as the key to our
    # episode cache table, it's really easy to determine which
    # episodes are too old

    my $now     = DateTime->now();
    my $too_old = $now->epoch - (KEEP_DAYS * 24 * 60 * 60);
    $dbh->do("DELETE FROM shows WHERE pubdate < $too_old");

    # now let's fetch the episodes from the episode cache table in
    # oldest-first order. Again, since we're keyed on the episode's
    # publish date in epoch time, we can do this with a simple numeric
    # sort.
    my $query = $dbh->prepare("SELECT * FROM shows ORDER BY pubdate");
    $query->execute();

    @$items = ();
    while ( my($pubdate, $item) = $query->fetchrow ) {

        # just blindly evaluating text is a potential security problem,
        # but I know all these entries came from me writing dumper-ed code,
        # so I feel safe in doing so...
        my $evaled = eval $item;

        push @$items, $evaled;

        # log which episodes we're putting into the feed
        my ($epoch, $title) = item_info($evaled);
        write_log("Fetched '$title' from database; adding to feed");
    }
}

sub same_show_as_last_time {
    my $items = shift;

    # so we know when the feed is late in publishing a new item,
    # we have a table that stores the publication date of the last
    # episode we saw.  It also stores the title of the episode so
    # we can log which episode it was.

    my $get_last_show = $dbh->prepare("SELECT * FROM last_show");

    # get the information for the current episode
    my ($epoch, $title) = item_info($items->[0]);

    # fetch the last episode from the DB
    $get_last_show->execute;
    my ($last_time, $last_title) = $get_last_show->fetchrow;

    # save the episode we just fetched for next time; the table only
    # ever holds one row, so clear it and insert the current episode
    # (a bare UPDATE never matches on the first run, when the table
    # is still empty)
    $dbh->do("DELETE FROM last_show");
    my $update = $dbh->prepare("INSERT INTO last_show (pubdate, title) ".
                               "               VALUES (?, ?)");
    $update->execute($epoch, $title);

    # now compare the current episode with the one we got from the DB
    my $is_same = defined $last_time && $last_time == $epoch;

    if ($is_same) {
        write_log("RSS feed has not updated since '$last_title' was published");
    }

    return $is_same;
}

#################################### audio ####################################

sub filename_from_uri {
    my $uri = shift;

    # abstract out the complexities of fetching the filename from a
    # URI so the code will read easier; in this case, we're
    # instantiating a new URI class object and calling path_segments()
    # to get the segments of the path, and then returning the last
    # element, which is going to be the filename.

    return( ( URI->new($uri)->path_segments )[-1] );
}

sub normalize_audio {
    my $item = shift;
    my $uri  = item_url($item);
    my $file = filename_from_uri($uri);

    # perl idiom for "if directory doesn't exist, make it"
    -d IN_DIR  or mkdir IN_DIR;
    -d OUT_DIR or mkdir OUT_DIR;

    # construct full pathnames to the file we're downloading and
    # then normalizing to
    my $infile  = join '/', IN_DIR,  $file;
    my $outfile = join '/', OUT_DIR, $file;

    # fetch the MP3 file using LWP::Simple
    my $code = getstore($uri, $infile);
    write_log("Fetched '$uri' to $infile; RESULT $code");
    return unless $code == 200;

    # if, for some reason, we don't have the program to normalize audio,
    # crash with a message complaining about it being missing
    -x SOX_BINARY
        or die "no executable at " . SOX_BINARY;

    # call SoX to normalize the audio
    write_log("Normalizing $infile to $outfile");
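    # list form of system() bypasses the shell, so odd characters in a
    # filename taken from the feed can't be interpreted as shell syntax
    system(SOX_BINARY, '--norm', $infile, $outfile);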

    # the feed doesn't publish an item length in bytes, but it really
    # ought to, so let's get the size of the MP3 file.
    my $size = -s $outfile || 0;

    # re-write the bits of the item we're changing
    item_url($item, join '/', MEDIA_URL, $file);
    item_length($item, $size);

    # send the normalized MP3 file up to the webserver
    push_media_to_remotehost($outfile);

    # clean up after ourselves
    unlink $infile;
    unlink $outfile;
}

#################################### db ####################################

sub get_dbh {
    my $file = DATAFILE;

    # check to see if the datafile exists BEFORE we connect to it
    my $exists = -f $file;

    my $dbh = DBI->connect(          
        "dbi:SQLite:dbname=$file", 
        "",
        "",
        { RaiseError => 1}
    ) or die $DBI::errstr;

    # if the datafile didn't exist before we connected to it, let's set up
    # the schema we're using
    unless ($exists) {
        $dbh->do("CREATE TABLE shows (pubdate INTEGER PRIMARY KEY, item TEXT)");
        $dbh->do("CREATE INDEX shows_idx ON shows (pubdate);");
        $dbh->do("CREATE TABLE last_show (pubdate INTEGER PRIMARY KEY, ".
                 "                        title   TEXT)");
    }

    return $dbh;
}

#################################### time ####################################

sub now {
    # set the time zone in the DateTime object, so we get non-UTC time
    return DateTime->now( time_zone => TZ );
}

sub is_weekday {
    # makes our code easier to read
    return now()->day_of_week < 6;
}

################################### copying ###################################

sub push_to_remotehost {
    my ($from, $to) = @_;

    my $connect = join '@', REMOTE_USER, REMOTE_HOST;

    state $ssh = Net::OpenSSH->new($connect);

    write_log("Copying $from to $connect:$to");

    if ( $ssh->scp_put($from, $to) ) {
        write_log("Copy success");
    }
    else {
        write_log("COPY ERROR: ". $ssh->error);
    }
}

# helper functions to make the code easier to read

sub push_xml_to_remotehost {
    push_to_remotehost(XMLFILE, REMOTE_DIR);
}

sub push_log_to_remotehost {
    push_to_remotehost(LOGFILE, REMOTE_DIR);
}

sub push_media_to_remotehost {
    my $from = shift;
    push_to_remotehost($from, REMOTE_DIR . 'npr/');
}

################################### logging ###################################

sub write_log {
    # I'm opening and closing the logfile every time I write to it so
    # it's easier for external processes to monitor the progress of
    # this script
    open my $logfile, '>>', LOGFILE;

    my $now = now();
    my $ts  = $now->ymd . q{ } . $now->hms . q{ };

    # I don't write multiple lines yet, but I might want to!
    foreach my $line ( @_ ) {
        say {$logfile} $ts . $line;
    }

    close $logfile;
}

BEGIN {
    unlink LOGFILE; # write a new log each time we run
    write_log('Started run'); # log that the run has started

    # register a DIE handler that will write whatever message I die() with
    # to our logfile so I can see it in the logs
    $SIG{__DIE__} = sub {
        my $err = shift;
        write_log('FATAL: '.$err);
        # if we die(), after this runs, the END block will be executed!
    };
}

END {
    # when the program finishes, log that
    write_log('Finished run');

    # and, so I can see these logs remotely, push them up to the webserver
    push_log_to_remotehost();
}

##################################### XML #####################################

sub re_title {
    my $rss = shift;

    # append some text to the channel's title so I can differentiate
    # this feed from the original feed in my podcast app

    my $existing_title = $rss->channel('title');
    my $add_len        = length(TITLE_ADD);

    if (length($existing_title) + $add_len > TITLE_MAX) {
        $existing_title = substr($existing_title, 0, TITLE_MAX - $add_len - 1);
    }

    $rss->channel('title' => $existing_title . TITLE_ADD);
}

sub item_info {
    state $mail = DateTime::Format::Mail->new; # only initialized once!

    my $item  = shift;
    my $title = fix_whitespace($item->{title});
    my $dt    = $mail->parse_datetime($item->{pubDate});
    my $epoch = $dt->epoch;
    return $epoch, $title;
}

sub fix_whitespace {
    my $string = shift;

    # multiple whitespace compressed to a single space
    $string =~ s{\s+}{ }g;

    # remove leading and trailing spaces
    $string =~ s{^\s+}{}; $string =~ s{\s+$}{};

    return $string;
}

# let's define some pseudo-accessors (since these are unblessed
# hashes, not objects) that will make our code easier to read

sub enclosure_pseudo_accessor {
    my $hash = shift;
    my $key  = shift;
    if (@_) {
        $hash->{enclosure}->{$key} = shift;
    }
    return $hash->{enclosure}->{$key};
}

sub item_url {
    my $hash = shift;
    enclosure_pseudo_accessor($hash, 'url', @_);
}

sub item_length {
    my $hash = shift;
    enclosure_pseudo_accessor($hash, 'length', @_);
}

# since XML::RSS doesn't provide a method to clear out the items in an
# already-parsed feed, I'm creating a subclass to provide that
# functionality rather than just executing code that manipulates the
# internal data structure of the object in my main program

package XML::RSS::NPR;
use base qw( XML::RSS );

sub clear_items {
    my $self = shift;
    $self->{num_items} = 0;
    $self->{items} = [];
}

# since we're creating a subclass, we can override the default XML
# modules that are used to be the ones we need - no calling
# add_module() from our main program!

sub _get_default_modules {
    return {
        'http://www.npr.org/rss/'                    => 'npr',
        'http://api.npr.org/nprml'                   => 'nprml',
        'http://www.itunes.com/dtds/podcast-1.0.dtd' => 'itunes',
        'http://purl.org/rss/1.0/modules/content/'   => 'content',
        'http://purl.org/dc/elements/1.1/'           => 'dc',
    };
}

__END__

Read it on GitHub: filter-npr-news

Scratching my itch

It’s been a while since I wrote some code to scratch purely my own itch. Most of my time is spent writing code to scratch my employer’s itches, and occasionally I get to write little programs that scratch small itches I get while writing code for my employer — things like extensions to git-p4 that allow me to pull information from a git repository and use it to generate merge commands for Perforce, so I don’t have to figure out which commits/changes I want to merge.

I know; nothing someone else would be interested in.

But I’ve had an itch for a little while that someone else might be interested in. I listen to the NPR Hourly News Summary via my podcast app on my Nexus 6. The web page might list a bunch of back episodes, but the RSS feed only publishes the most recent summary. But I want to listen to SOME older episodes, just not all of them. What I had been doing was having my podcast app keep every episode and then mark the ones I wasn’t interested in as done, but that was tedious, especially considering I only wanted to listen to four or five of the 24 episodes published each day.

So I thought about what I wanted. I wanted a program that would fetch the RSS feed every hour and check to see if the currently published episode was one of the ones I wanted, and, if it was, store it in a database and then spit out a new RSS feed with the last N episodes I’d stored in the database. I realized I didn’t need to actually fetch the episodes themselves, because NPR doesn’t remove the episodes after they disappear from the RSS feed (as evidenced by the web page with multiple episodes). I also realized I didn’t need to generate this new RSS feed dynamically: NPR’s feed only gets updated once an hour, so I only needed to generate my feed once NPR’s feed was updated, and, since I wasn’t generating the feed every time my podcast app asked for it, I could generate the feed on my desktop computer and then copy the XML file up to my web server (since my desktop has way more computing power than my web server).

And, of course, I wanted to use perl, because that’s my favorite programming language.

One of perl’s strengths is that, whatever you want to do, there’s probably a CPAN module that will do the heavy lifting for you. There’s also a Perl Cookbook for commonly used patterns in perl programming.  I found the recipe for Reading and Writing RSS Files, and there was an example for filtering an RSS feed and generating a new one. The example uses LWP::Simple to fetch the RSS feed, XML::RSSLite to parse the feed, and  XML::RSS to generate a new RSS feed. The cookbook even states “It would be easier, of course, to do it all with XML::RSS”. So I did.

Actually, I didn’t rewrite the RSS too much. Rather than building a completely new RSS feed, I used XML::RSS to parse the feed and extract the one item from it. But even though XML::RSS has a method for adding items to the feed, it doesn’t have a method for removing items from the feed. This left me with no choice but to dig through the source code of XML::RSS and figure out what was necessary to clear out the list of items. Once I cleared the one item out of the feed, I re-loaded the feed with the items I’d stored.

Wait… I had stored items, right?  Oh, crud, I forgot about that part.  Ok, I need to store the last N items. I could use a text file, but that’s difficult to manage.  I could set up a database in PostgreSQL or MySQL, but that’s a lot of overhead for just storing a bit of data. If only there was a self-contained, serverless, zero-configuration, transactional SQL database engine.  Something like… SQLite!

So I set up a simple schema; one table with two columns: one to hold the timestamp of the episode, and one to hold the block data I needed to shove the episode back into the feed. Since it’s SQLite, I didn’t expect the datafile to exist the first time I ran it, so I put in a test to see if the data file existed before I connected to it, and, if it didn’t, create the schema.

The rest is fairly straightforward; I checked the current episode extracted from the feed to see if it was one of the times I wanted. If it wasn’t, I just skipped ahead to generating the new feed. If it was one of the episodes I wanted, I checked to make sure I didn’t already have it in the database.  If I did, I skipped ahead.  If it wasn’t in the database, I added it to the database and then deleted everything in the database older than 7 days.
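
The two decisions in that paragraph, "is this one of the times I want?" and "is this too old to keep?", can be tried out on their own before reading the full script. This is only a sketch: the episode titles are made up (the real feed's wording may differ), and cutoff_epoch is a hypothetical helper that doesn't appear in the script.

```perl
use strict;
use warnings;

my @keywords     = qw( 7AM 8AM 12PM 7PM );
my $days_to_keep = 7;

# \b on both sides makes each keyword match only as a whole word
# (so a keyword of 2PM, say, would not match inside 12PM)
my $re = join '|', @keywords;
$re = qr/\b(?:$re)\b/i;

# hypothetical titles, just to exercise the regex
my @titles = (
    'NPR News: 10-21-2015 7AM ET',
    'NPR News: 10-21-2015 2PM ET',
    'NPR News: 10-21-2015 12PM ET',
);

my @wanted = grep { /$re/ } @titles;
print "keep: $_\n" for @wanted;

# anything published before this epoch time is older than the
# retention period and gets deleted from the database
sub cutoff_epoch {
    my ($now, $days) = @_;
    return $now - ($days * 24 * 60 * 60);
}

print "cutoff: ", cutoff_epoch(time, $days_to_keep), "\n";
```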

#!/Users/packy/perl5/perlbrew/perls/perl-5.22.0/bin/perl -w

use DBI;
use Data::Dumper::Concise;
use DateTime;
use DateTime::Format::Mail;
use LWP::Simple;
use XML::RSS;
use strict;

use feature 'say';

# list of times we want
my @keywords = qw( 7AM 8AM 12PM 7PM );
my $days_to_keep = 7;

# get the RSS
my $URL = 'http://www.npr.org/rss/podcast.php?id=500005';
my $content = get($URL);

# parse the RSS
my $rss = XML::RSS->new();
$rss->parse($content);


my @items = get_items( $rss->_get_items );

# make new RSS feed
$rss->{num_items} = 0;
$rss->{items} = [];

foreach my $item ( @items ) {
    $rss->add_item(%$item);
}

say $rss->as_string;


sub get_items {
    my $items = shift;

    # build the regex from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    my $mail = DateTime::Format::Mail->new;

    my $dbh = get_dbh();

    my $insert = $dbh->prepare("INSERT INTO shows (pubdate, item) ".
                               "           VALUES (?, ?)");

    my $exists_in_db = $dbh->prepare("SELECT COUNT(*) FROM shows ".
                                     " WHERE pubdate = ?");

    foreach my $item (@$items) {
        my $title = $item->{title};
        $title =~ s{\s+}{ }g;  $title =~ s{^\s+}{}; $title =~ s{\s+$}{};

        if ($title !~ /$re/) {
            next;
        }

        my $dt = $mail->parse_datetime($item->{pubDate});
        my $epoch = $dt->epoch;

        $exists_in_db->execute($epoch);
        my ($exists) = $exists_in_db->fetchrow;
        if ($exists > 0) {
            next;
        }

        $insert->execute($epoch, Dumper($item));
    }

    my $now = DateTime->now();
    my $too_old = $now->epoch - ($days_to_keep * 24 * 60 * 60);
    $dbh->do("DELETE FROM shows WHERE pubdate < $too_old");

    my $query = $dbh->prepare("SELECT * FROM shows ORDER BY pubdate");
    $query->execute();

    my @list;
    while ( my($pubdate, $item) = $query->fetchrow ) {
        push @list, eval $item;
    }

    return @list;
}

sub get_dbh {
    my $file = '/Users/packy/data/filter-npr-news.db';
    my $exists = -f $file;

    my $dbh = DBI->connect(          
        "dbi:SQLite:dbname=$file", 
        "",
        "",
        { RaiseError => 1}
    ) or die $DBI::errstr;

    unless ($exists) {
        # first time - set up database
        $dbh->do("CREATE TABLE shows (pubdate INTEGER PRIMARY KEY, item TEXT)");
    }
    return $dbh;
}

And it worked!  I then created a git repository on my desktop for it, and pushed it up to my GitHub account under its own project, filter-npr-news.

And that kept me satisfied for a whole day.

Next time, I’ll write about what started bothering me and the changes I made to fix that.

I need help… at work!

I need help.  There’s a lot of work to do at my day job, and we need another developer.  We’ve got a job posting up on our jobs site, and we’re posting to the appropriate job sites, but I really want to fill this position.  Mostly because I’m lonely.

I used to be a beta geek in an office filled with alpha geeks.  I loved this, because there were always people who understood the ideas I had and had ideas about how to make my ideas better.  I hate being the sole alpha geek in an office because then nobody understands the ideas I have.  But I also hate only having one other alpha geek to bounce ideas off of, because then if we can’t agree, there’s nobody to break the tie.

I’m not going to go into great detail about the job.  It’s a coding job, and it uses either Perl or Java (or both, if you’re so inclined).  If you’re reading this, you know me, and you’ll know that I’m still working for the current incarnation of what I’ve called “the best job I’ve ever had.”  If you’ve got a decade of experience, know either Perl or Java, don’t mind working in jeans and a t-shirt, don’t mind working in New York City and don’t think working with me would be a sign of insanity, let me know and I’ll get you in for an interview.

Eureka!

I think I’ve found the solution to a problem I’ve had at work for ages: Win32::Exe.

I would love to examine the PE version information of a Windows file that’s been uploaded to a Linux server. For a long time, I’ve punted on this problem, and waited until I had the file back on a Windows machine before examining this information, mostly because it’s much easier to get this info using Windows’ API calls than manually parsing the PE header info.  However, just tonight I stumbled across this perl module mentioned in a Stack Overflow post, and it doesn’t depend on modules that we don’t already use.

Now this problem will stop bugging me, and I can go to sleep!

Update: Unfortunately, the files I need to examine are large (> 200MB), and Win32::Exe (via Parse::Binary) seems to load the entire file into memory.  This causes an out of memory error.  But maybe I can use this code as a launching point for a different solution.
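
One possible starting point for that different solution is to seek to the headers and read only those, so memory use doesn't depend on file size. This is a rough sketch under that assumption, not Win32::Exe's approach: read_pe_header is a made-up name, and it only gets as far as the COFF file header. Reaching the version information would still mean walking the section table to the resource section, which this doesn't attempt.

```perl
use strict;
use warnings;

# Hypothetical helper: read only the DOS and COFF headers of a PE file,
# a few dozen bytes in total, no matter how large the file is.
sub read_pe_header {
    my $path = shift;

    open my $fh, '<:raw', $path or die "can't open $path: $!";

    # DOS header: 'MZ' magic at offset 0, and the offset of the PE
    # header (e_lfanew) as a little-endian 32-bit value at offset 0x3C
    read($fh, my $dos, 64) == 64 or die "$path: too short for a DOS header";
    substr($dos, 0, 2) eq 'MZ'   or die "$path: missing MZ signature";
    my $pe_offset = unpack 'V', substr($dos, 0x3C, 4);

    # PE signature ("PE\0\0") followed by the 20-byte COFF file header
    seek $fh, $pe_offset, 0       or die "$path: can't seek: $!";
    read($fh, my $pe, 24) == 24   or die "$path: too short for a PE header";
    substr($pe, 0, 4) eq "PE\0\0" or die "$path: missing PE signature";

    # first two COFF fields: machine type and number of sections,
    # both little-endian 16-bit values
    my ($machine, $sections) = unpack 'v v', substr($pe, 4, 4);
    close $fh;

    return { machine => $machine, sections => $sections };
}
```

The machine value is 0x14c for 32-bit x86 files and 0x8664 for x86-64.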