Filtered Web Pages

Version 1.0.2 adds a new NewsFox feature: filtered web pages. It lets you download web content to read while offline, and it lets you filter that content before it is stored, both to make it smaller and to give it a higher signal-to-noise ratio.

Here is how it works. A new column is added to the article pane (labeled '+' in en-US). It is hidden by default, so you'll have to enable it with the column picker (the column is named 'Filtered Web' in en-US). The icon for this column has five states:

  • read icon (small gray dot): the usual description from the feed is displayed, as before
  • blue down arrow: the filtered web page is being downloaded
  • unread icon (larger green dot): the filtered web page will be displayed
  • unread icon with yellow background: images from the filtered web page are being downloaded (the filtered web page will still appear when selected)
  • red X: an error has occurred. Make sure javascript.options.showInConsole is set to true, rerun if necessary, and then check the error console for the errors

Clicking on the icon in the read (regular text view) state starts the download. Clicking on the icon in the unread (filtered web) state reverts to the regular text view. What then happens to the downloaded filtered web page depends on the value of newsfox.x.removeXbody. If this value is 1 (the default), the filtered web page is discarded. If it is -1, the filtered web page is retained. If it is 0, the user is asked whether to delete the filtered web page.
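The three settings can be summarized in a short sketch. This is not NewsFox's own code: the function name shouldDiscardFilteredPage and the askUser callback are hypothetical, and the fallback for other values is an assumption. Only the values 1, -1 and 0 and their meanings come from the description above.

```javascript
// Hypothetical sketch of the removeXbody decision; not taken from NewsFox.
// prefValue: the value of newsfox.x.removeXbody (1, -1 or 0).
// askUser:   hypothetical callback that returns true if the user chooses to delete.
function shouldDiscardFilteredPage(prefValue, askUser) {
  switch (prefValue) {
    case 1:
      return true;               // default: discard the filtered web page
    case -1:
      return false;              // retain the filtered web page
    case 0:
      return Boolean(askUser()); // let the user decide
    default:
      return true;               // assumption: treat unknown values like the default
  }
}
```

For example, shouldDiscardFilteredPage(0, function () { return false; }) yields false, meaning the page is kept because the user declined to delete it.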

How do you tell NewsFox how to filter the web page? In the feed options dialog, there is a new tab, 'Web filter', that lets you construct a filter. This is quite an advanced feature, and problems with its use will typically be outside the scope of NewsFox technical support. There are two methods you can use: 1) regular expressions and 2) JavaScript. With either method, you enter the filter in the text box. You can also simply augment the usual feed description by checking the box to download images and not adding a filter.

Regular Expression method: You enter a regular expression, and the text/html from the article link page that matches the regular expression becomes the filtered web page. The default is the regular expression <p.*?<\/p> which selects all the <p> elements from the web page and throws everything else away (assuming the <p> elements are closed). To construct a more useful regular expression, you need to look at the text/html of the web pages you are interested in and find useful markers to match on.
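To see what the default expression selects, here is a minimal sketch that can be run in any JavaScript console. The sample page source is made up, and the use of the g flag and of join() is an assumption for illustration; how NewsFox applies the expression internally is not described here.

```javascript
// Made-up page source standing in for the text/html of an article link page.
var sampleHTML =
  '<html><body><div id="nav">Home | About</div>' +
  '<p>First paragraph.</p><div class="ad">Buy now</div>' +
  '<p class="body">Second paragraph.</p></body></html>';

// The default filter from the text, applied to every match on the page.
var defaultMatches = sampleHTML.match(/<p.*?<\/p>/g) || [];
var filtered = defaultMatches.join("");

// A stricter variant (an illustration, not the NewsFox default): \b keeps
// "<p" from also matching tags such as <pre>, and [\s\S] lets a paragraph
// span several lines, which "." does not do in JavaScript.
var strictMatches = sampleHTML.match(/<p\b[^>]*>[\s\S]*?<\/p>/g) || [];
var strictFiltered = strictMatches.join("");
```

On this sample both expressions keep the two <p> elements and drop the navigation and advertisement markup. They can differ on real pages, for instance when a paragraph contains line breaks.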

Here are a couple of general references on regular expressions:

JavaScript method: You enter JavaScript in the box. The last value will be used as the filtered web page, so it must evaluate to a string. The JavaScript is evaluated in a sandbox. The following variables and functions are available:

  • linkDOM: the DOM of the article link page
  • linkHTML: the text/html of the article link page
  • win: a copy of the window object of the iframe the result will be displayed in (changes to win do nothing; only the last value of the JavaScript makes any difference)
  • doc: a copy of the document object of the iframe the result will be displayed in
  • getElementsByClass(class, tag, node): returns an array of the elements with the given class and tag contained in node
  • getDomAndHtml(site): returns an object with properties HTML and DOM giving the text/html and DOM of the given site

getDomAndHtml(site) should be avoided, as it needs to work synchronously and thus locks up the entire browser while the request is being processed. An example of JavaScript to enter in the box to return the entire web page would be

getElementsByClass("","html",linkDOM)[0].innerHTML;

or

linkHTML;
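A slightly more selective filter might keep only the part of the page between two markers. The following is a sketch under assumptions: the helper sliceBetween, the sample page and the two marker strings are all hypothetical, and sampleLinkHTML stands in for linkHTML so that the code can be tried outside NewsFox.

```javascript
// Hypothetical helper: keep only the markup from startMarker through endMarker.
function sliceBetween(html, startMarker, endMarker) {
  var start = html.indexOf(startMarker);
  if (start === -1) return html;             // start marker missing: keep the whole page
  var end = html.indexOf(endMarker, start);
  if (end === -1) return html.slice(start);  // end marker missing: keep the rest of the page
  return html.slice(start, end + endMarker.length);
}

// Made-up page source standing in for the sandbox variable linkHTML.
var sampleLinkHTML =
  '<html><body><div id="header">Site header</div>' +
  '<div id="story"><p>Article text.</p></div><!-- end story -->' +
  '<div id="footer">Footer</div></body></html>';

// The last value is a string, as the JavaScript method requires.
sliceBetween(sampleLinkHTML, '<div id="story">', '<!-- end story -->');
```

In the filter box, the last line would presumably pass linkHTML instead of sampleLinkHTML, with markers chosen by inspecting the source of the pages in the feed.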

Note that external CSS is not downloaded in either method. JavaScript on the page is not run, which may make some aspects of the page unavailable. linkHTML and linkDOM are just the text/html and DOM of the given page with nothing else loaded.