Topic: Scrape a website with wget

Offline proxx

Scrape a website with wget
« on: June 10, 2013, 08:35:15 PM »
Hello,

This is not really bash, but whatever, I don't know where else to put it.
Wget is a very useful tool, and for those unaware of its power I would like to show how to download a website, partially or completely.
One useful option is telling it to ignore robots.txt:
Code:
wget -H -r --level=5 -e robots=off http://somewebsite.org/

-H, --span-hosts     go to foreign hosts when recursing
-r, --recursive      recursive retrieval
-l, --level=5        maximum recursion depth, in this case 5
-e robots=off        execute a .wgetrc-style command, here: ignore robots.txt


This will download the whole website, files included, up to five links deep. Because of -H it also follows links onto other hosts.
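If you only want the one site, a minimal sketch is to leave out -H so wget stays on the host in the URL. The function name scrape_one_host is made up for illustration, and --no-parent, --page-requisites and --convert-links are my additions, not part of the command above:

```shell
# Sketch: crawl a single host only. scrape_one_host is a made-up name.
# Usage: scrape_one_host URL [DEPTH]
scrape_one_host() {
    url=$1
    depth=${2:-5}   # default recursion depth, same as the command above
    # No -H here, so wget stays on the host named in the URL.
    # --no-parent: never climb above the starting directory.
    # --page-requisites / --convert-links: fetch images and CSS, and
    # rewrite links so the copy can be browsed offline.
    wget -r --level="$depth" --no-parent --page-requisites --convert-links \
         -e robots=off "$url"
}

# Example (not run automatically):
# scrape_one_host http://somewebsite.org/ 3
```

Nothing is downloaded until you call the function yourself, so you can paste it into a shell and try a shallow depth first.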
Must have used it a million times. Screw GUI tools :P
Have a nice day.
« Last Edit: June 10, 2013, 09:17:11 PM by proxx »
Wtf were you thinking with that signature? - Phage.
This was another little experiment *evillaughter - Proxx.
Evilception... - Phage

Offline vezzy

Re: Scrape a website with wget
« Reply #1 on: June 10, 2013, 08:42:56 PM »
Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.
Quote from: Dippy hippy
Just brushing though. I will be semi active mainly came to find a HQ botnet, like THOR or just any p2p botnet

Offline Kulverstukas

Re: Scrape a website with wget
« Reply #2 on: June 10, 2013, 08:49:57 PM »
If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post. Does wget check whether the link leaves the website and skip it, or does it download everything on the blog too?

Offline proxx

Re: Scrape a website with wget
« Reply #3 on: June 10, 2013, 09:10:07 PM »
Quote from: Kulverstukas
If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post. Does wget check whether the link leaves the website and skip it, or does it download everything on the blog too?

My mistake, it's recursion depth, not link depth.
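On the off-site part of the question: as far as I can tell from the wget manual, -H lets the recursion follow links onto other hosts, so with the command from the first post an external blog would be crawled too, within the depth limit. A rough sketch of narrowing that down with --domains follows; scrape_domains and blog.somewebsite.org are made-up names for illustration:

```shell
# Sketch: span hosts, but only the ones listed. scrape_domains is a made-up name.
# Usage: scrape_domains DOMAIN_LIST URL [DEPTH]
scrape_domains() {
    domains=$1      # comma-separated, e.g. somewebsite.org,blog.somewebsite.org
    url=$2
    depth=${3:-5}   # default recursion depth, as in the first post
    # -H allows foreign hosts; --domains restricts which ones are followed.
    wget -H --domains="$domains" -r --level="$depth" -e robots=off "$url"
}

# Example (not run automatically):
# scrape_domains somewebsite.org,blog.somewebsite.org http://somewebsite.org/
```

Links to any host that is not in the list should simply be skipped.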

Quote
Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.
Me too, but for those new to *nix it might be a nice example of its power.

Offline zoup

Re: Scrape a website with wget
« Reply #4 on: June 10, 2013, 11:46:33 PM »
wget has a --spider switch too, so it doesn't download anything and you can go through the output and see if you find something interesting. But I think you already know. But I have to get posts :-))))
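A small sketch of the --spider idea, assuming you want the output in a log file to read through afterwards. spider_site and spider.log are made-up names. As far as I know, in recursive mode wget still fetches HTML pages to find their links but does not keep them:

```shell
# Sketch: walk a site without saving it. spider_site and spider.log are made-up names.
# Usage: spider_site URL [DEPTH] [LOGFILE]
spider_site() {
    url=$1
    depth=${2:-5}
    log=${3:-spider.log}
    # --spider: check the links instead of keeping the files.
    # -o: write wget's messages to the log file instead of the terminal.
    wget --spider -r --level="$depth" -e robots=off -o "$log" "$url"
}

# Example (not run automatically):
# spider_site http://somewebsite.org/ 2 && less spider.log
```

The log is plain text, so grep works on it if you are hunting for specific file types or broken links.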

 


