Oi. I hate finding out I’ve been hacked late in the evening…

Earlier this evening, I got an email from Google saying that they’d added a new administrator to one of the domains I have.

Except I didn’t make anyone an administrator.

It seems that someone had used some of the security holes in WordPress to set up a shadow website inside one of my idle websites, and they’d just told Google they were an administrator by putting a verification HTML file in the web root.

I’ve removed the file, disabled the idle website, and gone through patching the security holes in my WordPress websites.  I’d rather not be hosting a site that’s providing page-ranks for spammy Chinese and Japanese websites.

Now time for sleep.

Preview of my next post…

My next project in the “whittling wood”?  Figuring out why XML::RSS::LibXML parses these tags without any problem:

<itunes:category text="News &amp; Politics"/>
<itunes:image href="http://media.npr.org/images/podcasts/2013/primary/hourly_news_summary-c464279737c989a5fbf3049bc229152af3c36b9d.png?s=1400"/>

and produces this internal data structure:

      category => bless( {
        _attributes => [
          "text"
        ],
        _content => "",
        text => "News & Politics"
      }, 'XML::RSS::LibXML::MagicElement' ),
      image => bless( {
        _attributes => [
          "href"
        ],
        _content => "",
        href => "http://media.npr.org/images/podcasts/2013/primary/hourly_news_summary-c464279737c989a5fbf3049bc229152af3c36b9d.png?s=1400"
      }, 'XML::RSS::LibXML::MagicElement' ),

But then it doesn't include these tags anywhere in the re-rendered XML when it spits the feed back out again. I know it has to do with the fact that these tags have no content (there's no separate opening and closing tag, just the one tag closed with a />), but I don't know why XML::RSS isn't rendering them properly when it converts the data structure back to XML.
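
A minimal round-trip shows the problem (a sketch, assuming the feed has already been saved locally as feed.xml):

use XML::RSS::LibXML;

my $rss = XML::RSS::LibXML->new;
$rss->parsefile('feed.xml');  # the itunes:* tags parse fine here...
print $rss->as_string;        # ...but they're nowhere in this output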

The easy work-around would be to do some string matching: recognize these tags in the original XML, copy them, and re-insert them into the rendered XML after it's been generated.
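
Something along these lines would probably do it (a rough sketch; it assumes the tags I care about are channel-level, self-closing itunes:* elements, and that $content still holds the original feed text):

# grab the self-closing itunes tags from the original XML
my @itunes_tags = $content =~ m{(<itunes:[^>]+/>)}g;

# re-render the feed, then splice the tags back in just before </channel>
my $xml    = $rss->as_string;
my $splice = join "\n", @itunes_tags;
$xml =~ s{</channel>}{$splice\n</channel>};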

The more difficult fix is to figure out what’s wrong with XML::RSS and try to fix it myself.

Guess which road I’m taking?

But what if I’ve still got an itch?

OK, I wanted to write a follow-up post about my little program and the changes I'd needed to make to it about a day later, but I found myself writing more and more code and not having any time to actually write about writing the code. My wife, Kay, has dubbed it "my whittling wood". So, let's run down the things that started to bother me about my creation…

The first thing that bothered me was when I was walking out of my office the first night. I checked my podcast app, and I didn’t have the 7PM news podcast yet, and it was 7:15PM already. I knew immediately what happened: NPR had been late updating the feed, and my cron job had run at 7:12 and missed it. So I thought about how to fix that problem (I managed to get the 7PM news podcast at 8:12 because NPR was late with the 8PM episode as well, so I was able to pick up the 7PM episode on that run).

I immediately dismissed the idea of running the script multiple times an hour; it didn't feel clean to me. What I decided I needed to do was check whether the episode I was looking at this time was different from the episode I saw the last time the script ran (I'd need a new table to track this) and, if it was the same episode, sleep for a minute or two and try again, retrying until I either got a new episode or decided I'd waited long enough (20 attempts seemed like a good cutoff).

Of course, I also decided I wanted to be able to check up on what was happening, so I needed to write a log file. If I was going to be able to see this log file when I wasn’t home, however, I’d have to copy it up to my web server along with the RSS feed XML.

And this brings me to where I was when I wrote my first post. I already had more code in the script, but I blogged about the first draft, wanting to come back to this second draft with a followup blog post.

And that’s when things got crazy.

We had a big filming day coming up for PacKay Productions that weekend, and I had a lot of work to do, some of which I’d already done and blogged about. After the filming was done, I needed to prep for Halloween.  And even with the changes I’d made to this script, things were going wrong with my setup.

One of the things I did wrong was setting up my wrapper shell script to run the Perl program.  I'm not really adept at Bourne shell scripting, and I always leave things out. Then, last Thursday night, I was idly wondering how easy it would be to correct the other major annoyance I have with the NPR Hourly News Summary: the inconsistency of the sound levels.

Sometimes, the news summary is recorded at a good level, and I’m able to hear everything just fine.  Other times, the levels are set so low that even with my player’s volume cranked all the way up and my headphones pressed into my ears, I find it impossible to hear what’s being said over the sounds of the street in New York City.

So, of course, I started looking to see if somebody else had already solved my problem.  I ran across this post on the Ask Ubuntu Stack Exchange site, which outlined two solutions: Audacity, an open source visual sound editor I was already intimately familiar with, and SoX, which was billed as "the Swiss Army knife of sound processing programs".

SoX

SoX is a command line tool for processing audio files, and the more I read about it, the more I liked it.  Normalizing an audio file used to be a two-step process in SoX: running a command once in an analysis mode to find the maximum volume of the file, and running it a second time to boost the file to the maximum volume possible without distortion. With version 14.3 of SoX, however, its developers made all of that possible with a single command:

sox --norm infile outfile
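
For comparison, the old two-pass approach, driven from Perl, would look roughly like this (a sketch with placeholder filenames; SoX's stat -v effect prints the maximum gain that can be applied without clipping, and it prints it to standard error):

# pass 1: ask SoX how much gain can be applied without clipping
chomp( my $gain = qx{sox in.mp3 -n stat -v 2>&1} );

# pass 2: apply that gain while writing the output file
system 'sox', '-v', $gain, 'in.mp3', 'out.mp3';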

I briefly pondered cloning SoX’s git repository and building from source, but I realized that chances were slight that I was going to be making changes to SoX; I just wanted it as a command line tool.  So I turned to one of the most wonderful things you can have on your Mac: Homebrew.

Homebrew is a package manager for OS X that's all git and ruby under the hood, and it has a beer theme! It installs software in a "Cellar". It doesn't have packages, it has "bottles". It even uses the beer emoji: 🍺

Installing new software with Homebrew is painfully easy:

brew install sox

Once I got SoX installed, modifying my code to use it was dead easy.

Finally, I decided to tackle the big thing that I wasn't doing in the program itself: copying files up to the webserver. At first I looked at Net::SCP, but for some reason I couldn't get it to work (it kept telling me that my remote directory didn't exist).  So I switched over to Net::OpenSSH, and I was able to get the copy working.

I also cleaned up the code a lot, and added a ton of comments.  I want this code to be able to document itself, so it’s really obvious what I’m doing and why. Some would say that once a program is working, it’s done.  But when I’m writing code for myself, it’s not done until I’ve commented the heck out of it, because I know myself: a year later, I’m going to come back to this code and think “What was I smoking when I wrote this?”

I doubt I’ll think that when I come back to this code.

#!/Users/packy/perl5/perlbrew/perls/perl-5.22.0/bin/perl -w

use DBI;
use Data::Dumper::Concise;
use DateTime;
use DateTime::Format::Mail;
use LWP::Simple;
use Net::OpenSSH;
use URI;
use XML::RSS;
use strict;

use feature qw( say state );

# define all the things!
use constant {
    URL         => 'http://www.npr.org/rss/podcast.php?id=500005',
    TITLE_ADD   => ' (filtered by packy)',
    TITLE_MAX   => 40, # characters
    SLEEP_FOR   => 120, # seconds (2 minutes)
    MAX_RETRIES => 10,
    KEEP_DAYS   => 7,

    REMOTE_HOST => 'www.dardan.com',
    REMOTE_USER => 'dardanco',
    REMOTE_DIR  => 'www/packy/',

    MEDIA_URL   => 'https://packy.dardan.com/npr',

    TZ          => 'America/New_York',
    LOGFILE     => '/tmp/npr-news.txt',
    XMLFILE     => '/tmp/npr-news.xml',
    IN_DIR      => '/tmp/incoming',
    OUT_DIR     => '/tmp/outgoing',
    DATAFILE    => '/Users/packy/data/filter-npr-news.db',

    SOX_BINARY  => '/usr/local/bin/sox',
};

# list of times we want - different times on weekends
my @keywords = is_weekday() ? qw( 7AM 8AM 12PM 6PM 7PM )
             :                qw( 7AM     12PM     7PM );

my $dbh = get_dbh();  # used in a couple places, best to be global

my $rss;   # these two vars are only used in the main code block,
my $items; # but can't be scoped to the foreach loop

# since, for cosmetic reasons, we're starting the count at 1, we need
# to loop up to MAX_RETRIES + 1; otherwise, we'll only have the first
# attempt and then (MAX_RETRIES - 1).  If I'd called the constant
# MAX_ATTEMPTS then it would make sense to start at zero...
foreach my $retry (1 .. MAX_RETRIES + 1) {

    # get the RSS
    write_log("Fetching " . URL);
    my $content = get(URL);

    # parse the RSS using a subclass of XML::RSS
    $rss = XML::RSS::NPR->new();
    $rss->parse($content);
    write_log("Parsed XML");

    $items = $rss->_get_items;

    # if a new show was published in the feed, we don't need to wait
    # in a loop for a new one
    last unless same_show_as_last_time( $items );

    # we don't want the script to wait forever - if no new episode
    # appears after a maximum number of retries, give up and generate
    # the feed with the episodes we have
    if ($retry > MAX_RETRIES) {
        write_log("MAX_RETRIES (".MAX_RETRIES.") exceeded");
        last;
    }

    # for debugging purposes, I want to be able to not have the script
    # sleep, and the choices were add command line switch processing
    # or check an environment variable. This was the simpler option.
    if ($ENV{NPR_NOSLEEP}) {
        last;
    }

    # log the fact that we're sleeping so we can observe what the
    # script is doing while it's running
    write_log("Sleeping for ".SLEEP_FOR." seconds...");

    # since I usually want to listen to these podcasts when I'm away
    # from my desktop computer, copy the log file up to the webserver
    # so I can check on it remotely.  this way, if it's spending an
    # inordinate amount of time waiting for a new episode, I can see
    # that from my phone's browser...
    push_log_to_remotehost();

    # actually sleep
    sleep SLEEP_FOR;

    # and note which number retry this is
    write_log("Trying RSS feed again (retry #$retry)");
}

# test to see if the new item matches our inclusion criteria, and then
# fill the item list with items we've cached in our database
get_items_from_database( $items );

# make new RSS feed devoid of the original items... ok, ITEM
$rss->clear_items;

foreach my $item ( @$items ) {
    $rss->add_item(%$item);
}

re_title($rss);

write_log("Writing RSS XML to " . XMLFILE);
open my $fh, '>', XMLFILE;
say {$fh} $rss->as_string;
close $fh;
push_xml_to_remotehost();

#################################### subs ####################################

sub get_items_from_database {
    my $items = shift;

    # build the regex for matching desired episodes from keywords
    my $re = join "|", @keywords;
    $re = qr/\b(?:$re)\b/i;

    my $insert = $dbh->prepare("INSERT INTO shows (pubdate, item) ".
                               "           VALUES (?, ?)");

    my $exists_in_db = $dbh->prepare("SELECT COUNT(*) FROM shows ".
                                     " WHERE pubdate = ?");

    # I know the feed only has the one item in it, but it SHOULD have
    # more, so let's go through the motions of checking each item

    foreach my $item (@$items) {

        # pawn off the specifics of how we get the information to a sub
        my ($epoch, $title) = item_info($item);

        # again, for debugging purposes, I wanted to be able to not
        # have the script skip the current item, and the choices were
        # add command line switch processing or check an environment
        # variable. This was the simpler option.

        if ($title !~ /$re/ && ! $ENV{NPR_NOSKIP}) {
            write_log("'$title' doesn't match $re; skipping");
            next;
        }

        # check to see if we already have it in the DB
        $exists_in_db->execute($epoch);
        my ($exists) = $exists_in_db->fetchrow;

        if ($exists > 0) {
            write_log("'$title' already in database; skipping");
            next;
        }

        # the NPR news podcast is notoriously bad at normalizing the
        # volume of its broadcasts; some are easy to hear and some are
        # so quiet it's impossible to hear them when listening on a
        # city street, so let's normalize them to a maximum volume

        normalize_audio($item);

        write_log("Adding '$title' to database");

        # it's easier to store the data in the episode cache table as
        # a perl representation of the parsed data than it is to
        # serialize it back into XML and then re-parse it when we need
        # it again.
        $insert->execute($epoch, Dumper($item));
    }

    # go through the database and dump episodes that are older than
    # our retention period.  Since we're using epoch time (seconds
    # since some date, usually midnight 1970-01-01) as the key to our
    # episode cache table, it's really easy to determine which
    # episodes are too old

    my $now     = DateTime->now();
    my $too_old = $now->epoch - (KEEP_DAYS * 24 * 60 * 60);
    $dbh->do("DELETE FROM shows WHERE pubdate < $too_old");

    # now let's fetch the episodes from the episode cache table in
    # oldest-first order. Again, since we're keyed on the episode's
    # publish date in epoch time, we can do this with a simple numeric
    # sort.
    my $query = $dbh->prepare("SELECT * FROM shows ORDER BY pubdate");
    $query->execute();

    @$items = ();
    while ( my($pubdate, $item) = $query->fetchrow ) {

        # just blindly evaluating text is a potential security problem,
        # but I know all these entries came from me writing dumper-ed code,
        # so I feel safe in doing so...
        my $evaled = eval $item;

        push @$items, $evaled;

        # log which episodes we're putting into the feed
        my ($epoch, $title) = item_info($evaled);
        write_log("Fetched '$title' from database; adding to feed");
    }
}

sub same_show_as_last_time {
    my $items = shift;

    # so we know when the feed is late in publishing a new item,
    # we have a table that stores the publication date of the last
    # episode we saw.  It also stores the title of the episode so
    # we can log which episode it was.

    my $get_last_show = $dbh->prepare("SELECT * FROM last_show");

    # get the information for the current episode
    my ($epoch, $title) = item_info($items->[0]);

    # fetch the last episode from the DB
    $get_last_show->execute;
    my ($last_time, $last_title) = $get_last_show->fetchrow;

    # save the episode we just fetched for next time
    my $update = $dbh->prepare("UPDATE last_show SET pubdate = ?, title = ? ".
                               " WHERE pubdate = ?");
    $update->execute($epoch, $title, $last_time);

    # now compare the current episode with the one we got from the DB
    my $is_same = ($last_time == $epoch);

    if ($is_same) {
        write_log("RSS feed has not updated since '$last_title' was published");
    }

    return $is_same;
}

#################################### audio ####################################

sub filename_from_uri {
    my $uri = shift;

    # abstract out the complexities of fetching the filename from a
    # URI so the code will read easier; in this case, we're
    # instantiating a new URI class object and calling path_segments()
    # to get the segments of the path, and then returning the last
    # element, which is going to be the filename.

    return( ( URI->new($uri)->path_segments )[-1] );
}

sub normalize_audio {
    my $item = shift;
    my $uri  = item_url($item);
    my $file = filename_from_uri($uri);

    # perl idiom for "if directory doesn't exist, make it"
    -d IN_DIR  or mkdir IN_DIR;
    -d OUT_DIR or mkdir OUT_DIR;

    # construct full pathnames for the file we're downloading and
    # the file we're normalizing it to
    my $infile  = join '/', IN_DIR,  $file;
    my $outfile = join '/', OUT_DIR, $file;

    # fetch the MP3 file using LWP::Simple
    my $code = getstore($uri, $infile);
    write_log("Fetched '$uri' to $infile; RESULT $code");
    return unless $code == 200;

    # if, for some reason, we don't have the program to normalize audio,
    # crash with a message complaining about it being missing
    -x SOX_BINARY
        or die "no executable at " . SOX_BINARY;

    # call SoX to normalize the audio
    write_log("Normalizing $infile to $outfile");
    system join(q{ }, SOX_BINARY, '--norm', $infile, $outfile);

    # the feed doesn't publish an item length in bytes, but it really
    # ought to, so let's get the size of the MP3 file.
    my $size = -s $outfile || 0;

    # re-write the bits of the item we're changing
    item_url($item, join '/', MEDIA_URL, $file);
    item_length($item, $size);

    # send the normalized MP3 file up to the webserver
    push_media_to_remotehost($outfile);

    # clean up after ourselves
    unlink $infile;
    unlink $outfile;
}

#################################### db ####################################

sub get_dbh {
    my $file = DATAFILE;

    # check to see if the datafile exists BEFORE we connect to it
    my $exists = -f $file;

    my $dbh = DBI->connect(          
        "dbi:SQLite:dbname=$file", 
        "",
        "",
        { RaiseError => 1}
    ) or die $DBI::errstr;

    # if the datafile didn't exist before we connected to it, let's set up
    # the schema we're using
    unless ($exists) {
        $dbh->do("CREATE TABLE shows (pubdate INTEGER PRIMARY KEY, item TEXT)");
        $dbh->do("CREATE INDEX shows_idx ON shows (pubdate);");
        $dbh->do("CREATE TABLE last_show (pubdate INTEGER PRIMARY KEY, ".
                 "                        title   TEXT)");
    }

    return $dbh;
}

#################################### time ####################################

sub now {
    # set the time zone in the DateTime object, so we get non-UTC time
    return DateTime->now( time_zone => TZ );
}

sub is_weekday {
    # makes our code easier to read
    return now()->day_of_week < 6;
}

################################### copying ###################################

sub push_to_remotehost {
    my ($from, $to) = @_;

    my $connect = join '@', REMOTE_USER, REMOTE_HOST;

    state $ssh = Net::OpenSSH->new($connect);

    write_log("Copying $from to $connect:$to");

    if ( $ssh->scp_put($from, $to) ) {
        write_log("Copy success");
    }
    else {
        write_log("COPY ERROR: ". $ssh->error);
    }
}

# helper functions to make the code easier to read

sub push_xml_to_remotehost {
    push_to_remotehost(XMLFILE, REMOTE_DIR);
}

sub push_log_to_remotehost {
    push_to_remotehost(LOGFILE, REMOTE_DIR);
}

sub push_media_to_remotehost {
    my $from = shift;
    push_to_remotehost($from, REMOTE_DIR . 'npr/');
}

################################### logging ###################################

sub write_log {
    # I'm opening and closing the logfile every time I write to it so
    # it's easier for external processes to monitor the progress of
    # this script
    open my $logfile, '>>', LOGFILE;

    my $now = now();
    my $ts  = $now->ymd . q{ } . $now->hms . q{ };

    # I don't write multiple lines yet, but I might want to!
    foreach my $line ( @_ ) {
        say {$logfile} $ts . $line;
    }

    close $logfile;
}

BEGIN {
    unlink LOGFILE; # write a new log each time we run
    write_log('Started run'); # log that the run has started

    # register a DIE handler that will write whatever message I die() with
    # to our logfile so I can see it in the logs
    $SIG{__DIE__} = sub {
        my $err = shift;
        write_log('FATAL: '.$err);
        # if we die(), after this runs, the END block will be executed!
    };
}

END {
    # when the program finishes, log that
    write_log('Finished run');

    # and, so I can see these logs remotely, push them up to the webserver
    push_log_to_remotehost();
}

##################################### XML #####################################

sub re_title {
    my $rss = shift;

    # append some text to the channel's title so I can differentiate
    # this feed from the original feed in my podcast app

    my $existing_title = $rss->channel('title');
    my $add_len        = length(TITLE_ADD);

    if (length($existing_title) + $add_len > TITLE_MAX) {
        $existing_title = substr($existing_title, 0, TITLE_MAX - $add_len - 1);
    }

    $rss->channel('title' => $existing_title . TITLE_ADD);
}

sub item_info {
    state $mail = DateTime::Format::Mail->new; # only initialized once!

    my $item  = shift;
    my $title = fix_whitespace($item->{title});
    my $dt    = $mail->parse_datetime($item->{pubDate});
    my $epoch = $dt->epoch;
    return $epoch, $title;
}

sub fix_whitespace {
    my $string = shift;

    # multiple whitespace compressed to a single space
    $string =~ s{\s+}{ }g;

    # remove leading and trailing spaces
    $string =~ s{^\s+}{}; $string =~ s{\s+$}{};

    return $string;
}

# let's define some pseudo-accessors (since these are unblessed
# hashes, not objects) that will make our code easier to read

sub enclosure_pseudo_accessor {
    my $hash = shift;
    my $key  = shift;
    if (@_) {
        $hash->{enclosure}->{$key} = shift;
    }
    return $hash->{enclosure}->{$key};
}

sub item_url {
    my $hash = shift;
    enclosure_pseudo_accessor($hash, 'url', @_);
}

sub item_length {
    my $hash = shift;
    enclosure_pseudo_accessor($hash, 'length', @_);
}

# since XML::RSS doesn't provide a method to clear out the items in an
# already-parsed feed, I'm creating a subclass to provide that
# functionality rather than just executing code that manipulates the
# internal data structure of the object in my main program

package XML::RSS::NPR;
use base qw( XML::RSS );

sub clear_items {
    my $self = shift;
    $self->{num_items} = 0;
    $self->{items} = [];
}

# since we're creating a subclass, we can override the default XML
# modules that are used to be the ones we need - no calling
# add_module() from our main program!

sub _get_default_modules {
    return {
        'http://www.npr.org/rss/'                    => 'npr',
        'http://api.npr.org/nprml'                   => 'nprml',
        'http://www.itunes.com/dtds/podcast-1.0.dtd' => 'itunes',
        'http://purl.org/rss/1.0/modules/content/'   => 'content',
        'http://purl.org/dc/elements/1.1/'           => 'dc',
    };
}

__END__

Read it on GitHub: filter-npr-news