Author Topic: Scrape a website with wget (Read 1853 times)

proxx · « **on:** June 10, 2013, 08:35:15 PM »

Hello,

This is not really bash, but whatever; I don't know where else to put it.
Wget is a very useful tool, and for those unaware of its power I would like to show how to download a website, partially or completely.
One useful option is the ability to ignore robots.txt.

Code:

wget -H -r --level=5 -e robots=off http://somewebsite.org/

-H, --span-hosts: follow links to foreign hosts when recursing.
-r, --recursive with --level=5: recurse, with a recursion depth of 5 in this case.
-e robots=off: execute a .wgetrc-style command, here one that makes wget ignore robots.txt.

This will download the website, including its files, up to the given recursion depth.
I must have used it a million times. Screw GUI tools.
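
For an offline copy that is actually browsable, a few more switches tend to help. The following is only a sketch: the function name mirror_site is made up for illustration, the one-second wait is an arbitrary choice, and -e robots=off from the command above can be added if you need it.

```shell
# Hypothetical helper: mirror one site for offline browsing.
mirror_site() {
    # -r --level=5        recurse, following links up to 5 levels deep
    # --no-parent         never ascend above the starting directory
    # --page-requisites   also fetch the images, CSS and scripts each page needs
    # --convert-links     rewrite links in the saved pages so they work locally
    # --wait=1            pause one second between requests (arbitrary value)
    wget -r --level=5 --no-parent --page-requisites --convert-links --wait=1 "$1"
}

# Usage: mirror_site http://somewebsite.org/
```

Without -H this stays on the host you start from.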

Have a nice day.

vezzy · « **Reply #1 on:** June 10, 2013, 08:42:56 PM »

Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.

Kulverstukas · « **Reply #2 on:** June 10, 2013, 08:49:57 PM »

If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post, does wget check if the link goes out of the website to another and skip or does wget download everything on the blog too?

proxx · « **Reply #3 on:** June 10, 2013, 09:10:07 PM »

Quote from: Kulverstukas on June 10, 2013, 08:49:57 PM

If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post, does wget check if the link goes out of the website to another and skip or does wget download everything on the blog too?

My mistake, it's recursion depth, not link depth.
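
As for leaving the root website: as far as I know, recursion only follows links to other hosts when -H is given, and --domains can narrow that down to a list. A rough sketch of both variants, where the function names and the domain list in the usage line are just placeholders:

```shell
# Stay on the starting host: without -H, recursion does not follow
# links that point to other hosts.
fetch_same_host() {
    wget -r --level=5 "$1"
}

# Span hosts, but only into the comma-separated domains passed as $2.
fetch_listed_domains() {
    wget -r --level=5 -H --domains="$2" "$1"
}

# Usage (hypothetical domain list):
#   fetch_same_host http://somewebsite.org/
#   fetch_listed_domains http://somewebsite.org/ somewebsite.org,blog.somewebsite.org
```

So with the original command (-H and no --domains), a link to an external blog would be followed as well, up to the recursion depth.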

Quote

Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.

Me too, but for those new to *nix it might be a nice example of its power.

zoup · « **Reply #4 on:** June 10, 2013, 11:46:33 PM »

wget has a --spider switch too. With it, wget doesn't download anything, so you can go through the output and see if you find anything interesting. But I think you already know that. But I have to get posts :-))))
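
A small sketch of how that could look, assuming you write the output to a log file and pull the URLs out of it afterwards. The function names, the log file name and the depth of 2 are made up for the example, and the grep pattern is only a rough match for URLs in the log.

```shell
# Hypothetical helper: crawl a site in spider mode and log what wget visits.
spider_site() {
    # --spider  check that pages exist instead of keeping them on disk
    # -o        write wget's messages to the log file given as $2
    wget --spider -r --level=2 -o "$2" "$1"
}

# Hypothetical helper: list each URL mentioned in the log once.
list_urls() {
    grep -oE 'https?://[^ ]+' "$1" | sort -u
}

# Usage:
#   spider_site http://somewebsite.org/ spider.log
#   list_urls spider.log
```

When combined with -r, wget still has to fetch HTML pages to find their links, so it is not entirely traffic-free; it just does not keep the files.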
