pages_to_rss.rb

Last updated: 11/04/2006

pages_to_rss.rb is a Ruby script that scrapes content from web pages and writes it to an RSS file.
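
As a rough illustration of the idea (not the code in pages_to_rss.rb itself), the sketch below fetches pages and wraps their raw HTML in a minimal RSS 2.0 document. The names fetch_page, xml_escape, and build_rss are hypothetical helpers, and the channel title, link, and description are placeholders.

```ruby
require "net/http"
require "uri"

# Fetch the raw HTML of one page (hypothetical helper).
def fetch_page(url)
  Net::HTTP.get(URI.parse(url))
end

# Escape the characters that are special in XML text content.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# Build a minimal RSS 2.0 document with one item per page.
# entries is an array of [url, html] pairs.
def build_rss(entries)
  items = entries.map do |url, html|
    "    <item>\n" \
    "      <title>#{xml_escape(url)}</title>\n" \
    "      <link>#{xml_escape(url)}</link>\n" \
    "      <description>#{xml_escape(html)}</description>\n" \
    "    </item>\n"
  end

  "<?xml version=\"1.0\"?>\n" \
  "<rss version=\"2.0\">\n" \
  "  <channel>\n" \
  "    <title>Scraped pages</title>\n" \
  "    <link>http://example.com/</link>\n" \
  "    <description>Content scraped from web pages</description>\n" \
  "#{items.join}" \
  "  </channel>\n" \
  "</rss>\n"
end
```

With these helpers, each URL could be passed to fetch_page, the resulting [url, html] pairs handed to build_rss, and the returned string written to a file with File.write. Escaping the HTML before placing it in the description element keeps the feed well-formed, though the escaped markup may still display messily in a reader.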

If you’d like to try it out:

  1. Download the program
  2. Unzip the archive
  3. Open pages_to_rss.rb in your favorite text editor
  4. Add the URLs you’d like to scrape to the pages variable
  5. Change the value of write_path to the path where you’d like the program to write the RSS file
  6. Save your changes
  7. Open a terminal window and change to the directory that contains the program
  8. Type ruby pages_to_rss.rb
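
For steps 4 and 5, the edited lines might look something like the following. This is only a sketch: it assumes pages is an array of URL strings and write_path is a string, and the URLs and path shown are placeholders, so check how the script actually declares both before copying anything.

```ruby
# Hypothetical values; the real script may declare these differently.
pages = [
  "http://example.com/news.html",
  "http://example.com/blog/"
]

# Placeholder path; point this at wherever the RSS file should be written.
write_path = "/tmp/pages.rss"
```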

A few words of warning:

  • The content written to the RSS feed’s description elements may be rather messy. I can’t guarantee that the feed will actually be readable in your RSS reader.
  • Parts of the code are very ugly. If you have any suggestions on how to make it better, I’d love to hear from you.
  • I haven’t figured out how to escape high-bit (non-ASCII) characters such as bullets. Pages that contain such characters will likely make your feed invalid. I plan to fix this next.
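
One common way to handle the high-bit problem is to replace each non-ASCII character with an XML numeric character reference. The sketch below is not part of pages_to_rss.rb: escape_high_bit is a hypothetical helper name, and it assumes the scraped text is valid UTF-8 (String#unpack raises an ArgumentError on malformed input).

```ruby
# Replace every character above ASCII 127 with a numeric character
# reference such as &#8226; and leave ASCII characters untouched.
# Assumes text is valid UTF-8.
def escape_high_bit(text)
  text.unpack("U*").map do |codepoint|
    if codepoint > 127
      format("&#%d;", codepoint)
    else
      [codepoint].pack("U")
    end
  end.join
end

puts escape_high_bit("a \u2022 b")  # prints "a &#8226; b"
```

This handles only high-bit characters. Markup characters such as & and < still need their own escaping, and pages served in another encoding (Latin-1, for example) would have to be converted to UTF-8 first.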

This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
