Here's a handy command line for using the Linux utility wget as a web crawler.
wget -r -np -p -k http://www.example.com
An explanation of the options:
-r - enable recursive downloads, i.e. follow the links found on each page.
-np - "no parent": wget will not follow links that lead above the starting directory. e.g. if the crawl starts at example.com/abcd/, it will not follow a link from example.com/abcd/page1.html to example.com/page2.html.
-p - get all the page requisites. e.g. get all the image/css/js files needed to display the page.
-k - convert the links in downloaded pages to make them suitable for local viewing. Links to files that were downloaded become relative links; links to files that were not downloaded become absolute URLs pointing at the original site.
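The short options above are easier to remember when spelled out in their long form. Here is a minimal sketch of the same crawl written with long options. The function name crawl_site and the WGET variable are made up for this example (they are not part of wget); WGET is set to "echo wget" so the snippet only prints the command instead of hitting the network.

```shell
# Dry run by default: prints the command. Set WGET=wget to really download.
# WGET and crawl_site are hypothetical names used only in this sketch.
WGET="echo wget"

crawl_site() {
    # Long-option equivalents of -r -np -p -k; $1 is the start URL.
    # $WGET is left unquoted on purpose so "echo wget" splits into two words.
    $WGET --recursive --no-parent --page-requisites --convert-links "$1"
}

crawl_site http://www.example.com
```

Once the printed command looks right, change WGET to plain wget and run it again to start the actual crawl.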
Some more useful tips:
-c - continue a previous download. This option is very handy for resuming aborted downloads. wget compares the local file size with the remote file size and downloads only the missing remainder. Beware that if a file has changed on the server in the meantime, you will end up with a garbled file.
-X - supply a comma-separated list of directories to exclude from the download. This is helpful when you do not want to download a particular section of the site. You can include wildcards in the directory patterns (quote them so the shell does not expand them first). e.g. -X /ads will skip everything under the www.example.com/ads folder.
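The two tips combine naturally with the original command. Below is a rough sketch of resuming an interrupted crawl while skipping the /ads directory. As before, resume_crawl and WGET are names invented for this example, and WGET is set to "echo wget" so that nothing is downloaded until you change it.

```shell
# Dry run by default: prints the command. Set WGET=wget to really download.
# WGET and resume_crawl are hypothetical names used only in this sketch.
WGET="echo wget"

resume_crawl() {
    # -c resumes partially downloaded files; -X /ads excludes that directory.
    # $1 is the start URL.
    $WGET -r -np -p -k -c -X /ads "$1"
}

resume_crawl http://www.example.com
# prints: wget -r -np -p -k -c -X /ads http://www.example.com
```

To exclude more than one directory, pass them to -X as a single comma-separated list.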