pages_to_rss.rb
Last updated: 11/04/2006
pages_to_rss.rb is a script that scrapes the content from web pages and writes it to an RSS file.
If you’d like to try it out:
- Download the program
- Unzip the archive
- Open pages_to_rss.rb in your favorite text editor
- Add the URLs you’d like to scrape to pages
- Change the value of write_path to the path where you’d like the program to write the RSS file
- Save your changes
- Open a terminal window and change the directory to the location of the program
- Type ruby pages_to_rss.rb
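As a rough sketch of the two edits described above, the configuration might end up looking something like this. The names pages and write_path come from the steps above; the URLs, the output file name, and the exact shape of the assignments are placeholders and assumptions, so check them against what is actually in the script.

```ruby
# Hypothetical values: replace these with the pages you want to scrape.
# The script may declare this list differently; an array of URL strings
# is assumed here.
pages = [
  "http://example.com/news.html",
  "http://example.org/updates.html"
]

# Hypothetical output location: the RSS file is assumed to be written
# to this path, here a file named pages.rss in the current directory.
write_path = File.join(Dir.pwd, "pages.rss")
```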
A few words of warning:
- The content written to the RSS feed’s description elements may be rather messy. I can’t guarantee that the feed will actually be readable in your RSS reader.
- Parts of the code are very ugly. If you have any suggestions on how to make it better, I’d love to hear from you.
- I haven’t figured out how to escape high-bit characters such as bullets. Pages that contain such characters will likely cause your feed to be invalid. I’m planning on fixing this next.
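One possible approach to the high-bit character problem, sketched here rather than taken from the script, is to replace every non-ASCII character with an XML numeric character reference before writing it to the feed. The method name escape_for_rss is hypothetical, and the sketch assumes Ruby 1.9 or later and input that is already valid UTF-8.

```ruby
# Escape text for use inside an RSS element (hypothetical helper).
# - The XML special characters &, < and > become named entities.
# - Every non-ASCII character (a bullet, for example) becomes a numeric
#   character reference such as &#8226;.
# Assumes the string is valid UTF-8.
def escape_for_rss(text)
  text.gsub(/[&<>]|[^\x00-\x7F]/) do |ch|
    case ch
    when "&" then "&amp;"
    when "<" then "&lt;"
    when ">" then "&gt;"
    else format("&#%d;", ch.ord)
    end
  end
end

puts escape_for_rss("Item \u2022 one & two")
# prints: Item &#8226; one &amp; two
```

Because the result contains only ASCII characters, the feed should stay valid whatever encoding it declares.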
This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.
