Filtered Web Page Example

To write a good filter, you need to examine the HTML source of an article linked from the feed you wish to filter. You can either save the HTML to a file and open it in an editor, or view the page source from the browser's View menu.

Choose a feed to filter, select an article, and follow the article's link. Examine the HTML you find and locate the parts you wish to display in the filtered web page. Here is a portion of a simple example file:

<html>
[OMITTED HTML]
<td class="storybody">
<!-- Start0 -->
  <p class="h"><b>Heading</b></p>
  <p>Content1</p>
<!-- StartI -->
  <img src="" alt="alt text" height="100" width="100">
<!-- EndI -->
  <p>Content2</p>
<!-- End0 -->
  <div class="xyz3">
    <p>junk</p>
  </div>
  <div class="socialBookMarks">
    <h3>Bookmark with:</h3>
    <ul>
      <li class="stumbleupon">
        <a href="http://www.stumbleupon.com/submit?url=">StumbleUpon</a>
      </li>
    </ul>
  </div>
</td>
[OMITTED HTML]
</html>

The content we are interested in is Heading, Content1 and Content2, all of which lie between the Start0 and End0 comments. Other people may be interested in different parts of the file.

Regular Expression

Filtering this with a regular expression is easy:

<!-- Start0.*<!-- End0 -->

In this regular expression, every character represents itself except the .* in the middle. The . matches any single character and the * says to match the preceding . zero or more times, so the expression matches everything from the Start0 comment through the End0 comment in the HTML. Because * is greedy, the match runs to the last End0 comment on the page. Many sites have such comments at useful locations in their HTML files. As a matter of fact, I didn't really make up the example.
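To try the expression outside the feed reader, a small sketch along the following lines may help. It assumes a plain JavaScript engine; the sample string and the names html, filter and filtered are made up for illustration. One deliberate change: in JavaScript the . does not match line breaks by default, so the sketch uses [\s\S] in its place to let the match span several lines. How the feed reader itself applies the expression is not shown here and may differ.

```javascript
// Sample page, shortened from the example file above.
var html = [
  '<td class="storybody">',
  '<!-- Start0 -->',
  '  <p class="h"><b>Heading</b></p>',
  '  <p>Content1</p>',
  '  <p>Content2</p>',
  '<!-- End0 -->',
  '  <div class="xyz3"><p>junk</p></div>',
  '</td>'
].join("\n");

// [\s\S] stands in for "." so that the match can cross line breaks.
var filter = /<!-- Start0[\s\S]*<!-- End0 -->/;

// match() returns null when the markers are absent, so fall back to "".
var match = html.match(filter);
var filtered = match ? match[0] : "";

console.log(filtered);
```

The matched text begins at the Start0 comment and ends at the End0 comment, so the xyz3 block after it is left out.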

JavaScript

A first try at JavaScript is:

getElementsByClass("storybody", "td", linkDOM)[0].innerHTML;

This returns everything inside the <td> tag of class storybody, including the xyz3 and socialBookMarks blocks, which we might wish to remove.

The following code first goes through all the <div> tags of class xyz3 and all the <div> tags of class socialBookMarks and empties them, then returns the HTML of the <td> tag of class storybody:

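// Empty every <div> of class xyz3.
var junk = getElementsByClass("xyz3", "div", linkDOM);
for (var i = 0; i < junk.length; i++) { junk[i].innerHTML = ""; }
// Empty every <div> of class socialBookMarks.
var social = getElementsByClass("socialBookMarks", "div", linkDOM);
for (var j = 0; j < social.length; j++) { social[j].innerHTML = ""; }
// What remains inside the storybody cell is the filtered page.
getElementsByClass("storybody", "td", linkDOM)[0].innerHTML;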

Someone who didn't wish to remove the social bookmarking links could easily leave them in.
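The choice of what to strip can be made explicit by passing in a list of class names. The following is only a sketch under stated assumptions: filterStory is a hypothetical name, and the getElementsByClass shown here is a guessed stand-in so the sketch is self-contained. The helper actually supplied to the filter may behave differently, and inside a real filter the stand-in would be left out.

```javascript
// Hypothetical stand-in for the helper used above: collect the descendants
// of `root` with the given tag name whose class list contains `className`.
function getElementsByClass(className, tagName, root) {
  var found = [];
  var candidates = root.getElementsByTagName(tagName);
  for (var i = 0; i < candidates.length; i++) {
    var classes = (candidates[i].className || "").split(/\s+/);
    if (classes.indexOf(className) !== -1) { found.push(candidates[i]); }
  }
  return found;
}

// Empty every <div> whose class is listed in `unwanted`, then return the
// HTML of the first <td class="storybody">, or "" if the page has none.
function filterStory(doc, unwanted) {
  for (var i = 0; i < unwanted.length; i++) {
    var els = getElementsByClass(unwanted[i], "div", doc);
    for (var j = 0; j < els.length; j++) { els[j].innerHTML = ""; }
  }
  var story = getElementsByClass("storybody", "td", doc);
  return story.length ? story[0].innerHTML : "";
}
```

With this shape, filterStory(linkDOM, ["xyz3", "socialBookMarks"]) would do the same as the code above, while filterStory(linkDOM, ["xyz3"]) would keep the social bookmarking links.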

A lot of sites are this easy, but of course some are not.