Fetching RSS Feeds

You now know how to fetch a URL using curl (see Using curl to make REST API calls). Let’s look at a specific use of the curl wrapper class: fetching an RSS feed and parsing it for the information you are looking for.

First, a brief word on RSS, which is short for Really Simple Syndication. The specification is available at http://cyber.law.harvard.edu/rss/rss.html. RSS is an XML-based format that conforms to XML 1.0. It is a very simple format: the syndicated articles are listed as <item> nodes under the <channel> element, which describes the feed itself, as in the following excerpt from a New York Times feed:

<channel>
        <title>NYT &gt; Home Page</title>
        <link>http://www.nytimes.com/pages/index.html?partner=rss</link>
        <description/>
        <language>en-us</language>

        <copyright>Copyright 2010  The New York Times Company</copyright>
        <lastBuildDate>Sat, 17 Jul 2010 17:43:58 GMT </lastBuildDate>
        <item>
            <title>For BP, Rising Pressure in Oil Well Seen as a Positive Sign </title>
            <link>http://feeds.nytimes.com/click.phdo?i=7ca98bec358b81188e619bc29c3988cf</link>
            <guid isPermaLink="false">http://www.nytimes.com/2010/07/18/us/18spill.html</guid>
            <description>Officials on Saturday said that pressure readings in the well were rising steadily after the valves were closed on a cap at the well’s top.&lt;br clear=&quot;both&quot; style=&quot;clear: both;&quot;/&gt;
&lt;br clear=&quot;both&quot; style=&quot;clear: both;&quot;/&gt;
&lt;a href=&quot;http://ads.pheedo.com/click.phdo?s=7ca98bec358b81188e619bc29c3988cf&amp;p=1&quot;&gt;&lt;img alt=&quot;&quot; style=&quot;border: 0;&quot; border=&quot;0&quot; src=&quot;http://ads.pheedo.com/img.phdo?s=7ca98bec358b81188e619bc29c3988cf&amp;p=1&quot;/&gt;&lt;/a&gt;
&lt;img alt=&quot;&quot; height=&quot;0&quot; width=&quot;0&quot; border=&quot;0&quot; style=&quot;display:none&quot; src=&quot;http://segment-pixel.invitemedia.com/pixel?code=Business&amp;partnerID=167&amp;key=segment&quot;/&gt;&lt;img alt=&quot;&quot; height=&quot;0&quot; width=&quot;0&quot; border=&quot;0&quot; style=&quot;display:none&quot; src=&quot;http://pixel.quantserve.com/pixel/p-8bUhLiluj0fAw.gif?labels=pub.29518.rss.Business.18272,cat.Business.rss&quot;/&gt;</description>
            <pubDate>Sat, 17 Jul 2010 16:30:51 GMT</pubDate>
        </item>
        <item>
            <title>For Kenneth Feinberg, More Delicate Diplomacy</title>
            <link>http://feeds.nytimes.com/click.phdo?i=76e429d7108fe523c9cffe56ef0adac1</link>
            <guid isPermaLink="false">http://www.nytimes.com/2010/07/17/us/17feinberg.html</guid>
            <description>Even with a $20 billion fund from BP to compensate those harmed by the spill, Kenneth R. Feinberg is playing the salesman role.&lt;br clear=&quot;both&quot; style=&quot;clear: both;&quot;/&gt;
&lt;br clear=&quot;both&quot; style=&quot;clear: both;&quot;/&gt;
&lt;a href=&quot;http://ads.pheedo.com/click.phdo?s=76e429d7108fe523c9cffe56ef0adac1&amp;p=1&quot;&gt;&lt;img alt=&quot;&quot; style=&quot;border: 0;&quot; border=&quot;0&quot; src=&quot;http://ads.pheedo.com/img.phdo?s=76e429d7108fe523c9cffe56ef0adac1&amp;p=1&quot;/&gt;&lt;/a&gt;
&lt;img alt=&quot;&quot; height=&quot;0&quot; width=&quot;0&quot; border=&quot;0&quot; style=&quot;display:none&quot; src=&quot;http://segment-pixel.invitemedia.com/pixel?code=Business&amp;partnerID=167&amp;key=segment&quot;/&gt;&lt;img alt=&quot;&quot; height=&quot;0&quot; width=&quot;0&quot; border=&quot;0&quot; style=&quot;display:none&quot; src=&quot;http://pixel.quantserve.com/pixel/p-8bUhLiluj0fAw.gif?labels=pub.29518.rss.Business.18272,cat.Business.rss&quot;/&gt;</description>
            <pubDate>Sat, 17 Jul 2010 16:10:23 GMT</pubDate>
        </item>
</channel>

Though you will find many more sub-nodes under the <item> element for other attributes of an article, the RSS 2.0 specification requires very little. The title, link and description nodes are mandatory only for the <channel>; every sub-node of an <item> is optional, as long as either title or description is present. Most feeds supply all three for each article, so a generic class that parses the articles listed in an RSS feed can reasonably look for title, link and description, and should tolerate an empty value.
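
As a minimal sketch of that idea, the snippet below parses a small inline feed with PHP's SimpleXMLElement and reads title, link and description from each item. The sample XML and the helper name itemFields are made up for illustration. An element that is absent simply casts to an empty string, so the caller can tell what the feed left out.

```php
<?php
// Hypothetical sample feed: the second item has no <link> element.
$xml = <<<XML
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>Sample channel</description>
    <item>
      <title>First article</title>
      <link>http://example.com/1</link>
      <description>Summary one</description>
    </item>
    <item>
      <title>Second article</title>
      <description>Summary two</description>
    </item>
  </channel>
</rss>
XML;

// Illustrative helper (not part of any library): read the three common
// elements of one <item>; a missing element becomes an empty string.
function itemFields(SimpleXMLElement $item): array
{
    return array(
        'title'       => (string)$item->title,
        'link'        => (string)$item->link,
        'description' => (string)$item->description,
    );
}

$rss = new SimpleXMLElement($xml);
$rows = array();
foreach ($rss->channel->item as $item) {
    $rows[] = itemFields($item);
}

// Print each link, flagging items where the feed did not provide one.
foreach ($rows as $row) {
    print ($row['link'] !== '' ? $row['link'] : '(no link)') . "\n";
}
```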

The RssFeed class does just that: given the URL of an RSS feed, it returns the list of articles as an array. The hash that stores each article's attributes also contains pubDate, a common optional attribute that almost every RSS feed seems to include. For specific requirements, the class can be modified to include other attributes.
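
Since pubDate arrives as a plain string, it may help to turn it into a date object before comparing or reformatting it. The following is only a sketch: the helper name pubDateToUtc is an assumption, and it relies on PHP's built-in DateTimeImmutable accepting the date format used in the excerpt above.

```php
<?php
// Illustrative helper: convert an RSS pubDate string into a UTC
// "Y-m-d H:i:s" string, or return null when the value is empty or
// cannot be parsed.
function pubDateToUtc(string $pubDate): ?string
{
    $pubDate = trim($pubDate);
    if ($pubDate === '') {
        return null; // an empty string would otherwise be read as "now"
    }
    try {
        $date = new DateTimeImmutable($pubDate);
    } catch (Exception $e) {
        return null; // unparseable date string
    }
    return $date->setTimezone(new DateTimeZone('UTC'))->format('Y-m-d H:i:s');
}

print pubDateToUtc('Sat, 17 Jul 2010 16:30:51 GMT') . "\n"; // 2010-07-17 16:30:51
```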

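<?php
include_once 'CurlWrap.php';

// Fetches an RSS feed with the curl wrapper class and collects its articles.
class RssFeed
{
   private $feed_url;          // URL of the feed to load
   private $curl;              // CurlWrap instance used for the HTTP request
   private $items=array();     // one hash per article

   // Optionally takes the feed URL and loads its items right away.
   public function __construct($url='')
   {
      $this->curl=new CurlWrap();
      if ($url!='') {
         $this->feed_url=$url;
         $this->loadItems();
      }
   }

   public function setFeedUrl($url) {$this->feed_url=$url; }

   // Fetches the feed and appends its articles to the item list.
   // Nothing is added unless the server answers with HTTP 200.
   public function loadItems()
   {
      $this->curl->exec($this->feed_url);
      if ($this->curl->getHttpCode()==200) {
         // Note: SimpleXMLElement throws an exception on malformed XML.
         $rss=new SimpleXMLElement($this->curl->getExecResponse());
         $items=$rss->channel->item;
         foreach ($items as $item) {
            // Cast each node to a string; a missing node yields ''.
            $title=(string)$item->title;
            $link=(string)$item->link;
            $description=(string)$item->description;
            $pub_date=(string)$item->pubDate;
            $item_hash = array (
                'title'=>$title,
                'link'=>$link,
                'description'=>$description,
                'pubDate'=>$pub_date
            );
            $this->items[]=$item_hash;
         }
      }
   }

   // Returns every article loaded so far.
   public function getItems() {return $this->items; }
}
?>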
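
One thing to be aware of: constructing a SimpleXMLElement from a response that is not well-formed XML throws an exception. If you would rather get an empty list in that case, the parsing step could be guarded roughly as follows. This is a sketch only; the function name parseRssItems is hypothetical, and it works on an XML string instead of going through the curl wrapper.

```php
<?php
// Illustrative helper: parse RSS text into item hashes of the same shape
// that RssFeed builds, returning an empty array when the XML is malformed.
function parseRssItems(string $xml): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $rss = simplexml_load_string($xml); // returns false on failure
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($rss === false || !isset($rss->channel->item)) {
        return array();
    }

    $items = array();
    foreach ($rss->channel->item as $item) {
        $items[] = array(
            'title'       => (string)$item->title,
            'link'        => (string)$item->link,
            'description' => (string)$item->description,
            'pubDate'     => (string)$item->pubDate,
        );
    }
    return $items;
}

// A truncated document yields no items instead of an exception.
print count(parseRssItems('<rss><channel>')) . "\n";
```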

The sample program simply prints the article links and related titles.

<?php
include_once 'RssFeed.php';

$feed=new RssFeed("http://www.nytimes.com/services/xml/rss/nyt/pop_top.xml");
foreach ($feed->getItems() as $item) {
   print "link: ". $item['link']. "\n";
   print "title: ". $item['title']. "\n";
}
?>
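
If you also want to print the description, keep in mind that it may contain HTML, as the escaped markup in the New York Times excerpt shows. A rough way to reduce it to plain text is sketched below; the helper name plainDescription and the sample string are illustrative, not part of the RssFeed class.

```php
<?php
// Illustrative helper: reduce an item description to plain text.
function plainDescription(string $description): string
{
    $text = strip_tags($description);                       // drop <br>, <a>, <img> and similar tags
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8'); // decode any remaining entities
    return trim(preg_replace('/\s+/', ' ', $text));         // collapse runs of whitespace
}

// Hypothetical description in the style of the excerpt above.
$sample = 'Officials said pressure readings were rising.<br clear="both"/>' . "\n"
        . '<a href="http://example.com/ad"><img alt="" src="http://example.com/ad.gif"/></a>';

print plainDescription($sample) . "\n";
```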

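Multiple feeds can be parsed by simply changing the feed URL (setFeedUrl) and reloading the article items (loadItems). Note that loadItems appends to the existing list, so getItems then returns the articles from every feed loaded so far.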
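
Once articles from several feeds are in hand, you may want them in a single list ordered by date. The sketch below assumes item hashes in the shape returned by getItems; the function name mergeByDate and the sample data are made up for illustration.

```php
<?php
// Illustrative helper: combine two item lists and sort them newest first
// by their pubDate strings.
function mergeByDate(array $first, array $second): array
{
    $all = array_merge($first, $second);
    usort($all, function ($a, $b) {
        // strtotime understands the date format used by pubDate.
        return strtotime($b['pubDate']) <=> strtotime($a['pubDate']);
    });
    return $all;
}

// Hypothetical items, as if loaded from two different feeds.
$feedA = array(
    array('title' => 'Older article', 'link' => 'http://example.com/a',
          'description' => '', 'pubDate' => 'Sat, 17 Jul 2010 16:10:23 GMT'),
);
$feedB = array(
    array('title' => 'Newer article', 'link' => 'http://example.com/b',
          'description' => '', 'pubDate' => 'Sat, 17 Jul 2010 16:30:51 GMT'),
);

$merged = mergeByDate($feedA, $feedB);
foreach ($merged as $item) {
    print $item['pubDate'] . "  " . $item['title'] . "\n";
}
```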
