Author Topic: 4chan image download in Ruby  (Read 1105 times)

Offline Fur

  • Knight
  • **
  • Posts: 216
  • Cookies: 34
    • View Profile
4chan image download in Ruby
« on: October 07, 2014, 05:42:24 PM »
Wrote this a few hours ago for the obvious reason.

Code: (Ruby) [Select]
# 4chan-image-downloader.rb
# Ruby script to download images from 4chan threads.
# Author: Fur
# Date: 15th Oct 2014
# Usage: ruby 4chan-image-downloader.rb 4chan.org/<board>/thread/<number> /dir/to/download/images/to
# Notes:
#   Files in the directory will be overwritten, so be careful!
 
require 'open-uri'
require 'json'
require 'fileutils'

require 'trollop'
require 'colored'

VERSION = "0.2.3"

# Parse a 4chan thread url.
# Returns a hash containing the board and the thread number (both strings).
# Raises ArgumentError if the url is not a 4chan thread url.
def parse_4chan_thread_url(thread_url)
  match_result = thread_url.match(%r{4chan\.org/(\w{1,4})/thread/(\d+)})
  # raise, not throw: throw belongs to catch/throw control flow, not exceptions.
  raise ArgumentError, "Url was not a valid 4chan thread url!" unless match_result

  { :board => match_result[1], :thread_number => match_result[2] }
end
 
# Download information about a 4chan thread and parse it as JSON.
# See: https://github.com/4chan/4chan-API
# board is the board that the thread was posted in.
# thread_number is the number of the thread as a string.
# Returns: hash (parsed JSON) representing the 4chan thread.
def download_thread_info(board, thread_number)
  # URI.open (Ruby 2.5+) replaces open-uri's Kernel#open, which stopped accepting urls in Ruby 3.0.
  URI.open("http://a.4cdn.org/%s/thread/%s.json" % [board, thread_number]) do |f|
    JSON.parse(f.read)
  end
end

# Get urls for all images in 4chan thread.
# thread_url is the url for the 4chan thread.
# Returns: array containing image urls as URI objects.
def get_images_in_thread(thread_url)
  url_info = parse_4chan_thread_url(thread_url)

  thread_info = download_thread_info(url_info[:board], url_info[:thread_number])

  images = []
  thread_info["posts"].each do |post|

    # Filename only appears on posts with images and we can't download deleted images.
    if post["filename"] && !post["filedeleted"]
      filename = post["tim"].to_s + post["ext"]
      image_url = "http://i.4cdn.org/%s/%s" % [url_info[:board], filename]
     
      images << URI.parse(image_url)
    end
  end
  images
end


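# Save a file from the Internet to disk.
# url is the url at which the file to download resides.
# path is the path to save the file to.
# Warning: will truncate existing files!
def download_file(url, path)
  # Open the remote file first so a failed request does not leave an empty file behind.
  # URI.open (Ruby 2.5+) replaces open-uri's Kernel#open, which stopped accepting urls in Ruby 3.0.
  URI.open(url, "rb") do |read_file|
    File.open(path, "wb") do |saved_file|
      saved_file.write(read_file.read)
    end
  end
end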

# Download images in thread and print download information to stdout.
# thread_url is the url of the thread to download
# directory is the directory to download the images to. Must exist.
def download_images_in_thread(thread_url, directory)
  images = []
 
  begin
    images = get_images_in_thread(thread_url)
  rescue OpenURI::HTTPError => e
    abort("Could not download thread information: #{e.message}!")
  rescue ArgumentError => e
    abort("Invalid thread url!")
  end

  # with_index(1) numbers the images from 1 for the progress output.
  images.each.with_index(1) do |image, i|
    filename = File.basename(image.path)

    print "downloading %s (%i/%i)... " % [filename, i, images.size]

    begin
      download_file(image, File.join(directory, filename))
    rescue OpenURI::HTTPError => e
      print "error! (#{e.message})".red
    else
      print "done!".green
    end

    print "\n"
  end
end

def main
  # Trollop parses ARGV here and handles --help and --version itself.
  Trollop::options do
    version "#{$0} v#{VERSION}"
    banner <<-EOS
  #{$0} is a command-line program to download images from 4chan threads
  Usage:
          #{$0} <thread_url> <image_download_directory>
    EOS
  end

  thread_url = ARGV.shift
  directory = ARGV.shift

  if !thread_url || !directory
    Trollop::die("Required argument(s) missing!")
  end

  directory = File.expand_path(directory)

  # mkdir_p does nothing if the directory already exists, so no existence
  # check is needed (File.exists? was removed in Ruby 3.2).
  FileUtils.mkdir_p(directory)

  download_images_in_thread(thread_url, directory)
end

main

Dependency installation:
Code: (Bash) [Select]
gem install trollop
gem install json
gem install colored

Changelog:
  Added text colouring to the "done!" and "failed (error)!" output.
  Lowered the default interval from 1 to 0.5
  Script now automatically makes image download directory if it does not exist
  Corrected the help banner.
  Completely refactored code and removed interval parameter because it was useless, given how much bandwidth 4chan has.
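
If you want to poke at the image url logic without hitting the network, here is a minimal sketch. The sample posts are made up; only the field names ("filename", "tim", "ext", "filedeleted") come from the script above, and image_urls is a hypothetical helper, not part of the script.

```ruby
require 'json'
require 'uri'

# Made-up sample data shaped like the "posts" array the script reads.
SAMPLE_THREAD_JSON = <<-EOS
{
  "posts": [
    {"no": 1, "filename": "cat", "tim": 1412700000000, "ext": ".jpg"},
    {"no": 2, "com": "text-only reply"},
    {"no": 3, "filename": "dog", "tim": 1412700000001, "ext": ".png", "filedeleted": 1}
  ]
}
EOS

# Hypothetical helper: build a URI for every post that has a file which
# has not been deleted, the same way get_images_in_thread does.
def image_urls(board, thread_info)
  thread_info["posts"]
    .select { |post| post["filename"] && !post["filedeleted"] }
    .map { |post| URI.parse("http://i.4cdn.org/%s/%s%s" % [board, post["tim"], post["ext"]]) }
end

image_urls("g", JSON.parse(SAMPLE_THREAD_JSON)).each { |url| puts url }
```

Only the first post should survive the filter: the second has no file and the third is flagged as deleted.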
« Last Edit: October 15, 2014, 01:38:34 PM by Fur »

Offline lucid

  • #Underground
  • Titan
  • **
  • Posts: 2683
  • Cookies: 243
  • psychonaut
    • View Profile
Re: 4chan image download in Ruby
« Reply #1 on: October 07, 2014, 06:09:45 PM »
Nice. I've never really seen Trollop used before. I always take care of the args parsing manually, without a library, but it gets a little difficult since my args parsing can get pretty involved. Multiple hosts/files, multiple options, optional arguments, etc.

Perhaps I should just use optparse or something. Anyway, looks good.
« Last Edit: October 07, 2014, 06:10:12 PM by lucid »
"Hacking is at least as much about ideas as about computers and technology. We use our skills to open doors that should never have been shut. We open these doors not only for our own benefit but for the benefit of others, too." - Brian the Hacker

Quote
15:04  @Phage : I'm bored of Python

Offline Fur

Re: 4chan image download in Ruby
« Reply #2 on: October 07, 2014, 06:30:41 PM »
Quote from: lucid
I've never really seen Trollop used before. I always take care of the args parsing manually, without a library, but it gets a little difficult since my args parsing can get pretty involved. Multiple hosts/files, multiple options, optional arguments, etc.
Argument parsing libraries simplify so much. I spend 30 seconds writing argument parsing code instead of like 10 minutes, and my code is much cleaner with Trollop than it would have been with manual argument parsing.

Trollop also makes a lovely little help option:
Quote from: ruby 4chanDownloader.rb --help
4chanDownload.rb is a command-line program to download images from 4chan threads
Usage:
        4chanDownload.rb [options] <thread_url> <image_download_directory>
where [options] are zero or more of:
  --interval, -i <f>:   Seconds to wait after downloading each image (default: 0.5)
       --version, -v:   Print version and exit
          --help, -h:   Show this message

You should really consider using Trollop or Optparse. I can't imagine anyone but masochists going back from them by choice.
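
For comparison, here is a rough sketch of the same command line done with optparse from the standard library. The option mirrors the help output quoted above; parse_downloader_args is a made-up name, and this is not the code from the script.

```ruby
require 'optparse'

# Hypothetical helper: parse the downloader's arguments with OptionParser.
# Returns [options, thread_url, directory].
def parse_downloader_args(argv)
  options = { :interval => 0.5 }

  parser = OptionParser.new do |opts|
    opts.banner = "Usage: #{File.basename($0)} [options] <thread_url> <image_download_directory>"

    # Float makes optparse convert the argument and reject non-numeric input.
    opts.on("-i", "--interval SECONDS", Float,
            "Seconds to wait after downloading each image (default: 0.5)") do |seconds|
      options[:interval] = seconds
    end
  end

  # parse returns the leftover positional arguments and leaves argv untouched.
  thread_url, directory = parser.parse(argv)
  raise ArgumentError, "Required argument(s) missing!" unless thread_url && directory

  [options, thread_url, directory]
end

options, thread_url, directory =
  parse_downloader_args(%w[--interval 0.25 4chan.org/g/thread/123 ./images])
puts "interval=#{options[:interval]} url=#{thread_url} dir=#{directory}"
```

It is a few more lines than the Trollop version, but it needs no gem, and optparse adds --help on its own.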

Offline lucid

Re: 4chan image download in Ruby
« Reply #3 on: October 07, 2014, 09:00:00 PM »
I'll consider it. For some reason, every time I check out optparse examples I can't figure out how to fit it into my code specifically. I get pretty heavy on command-line arguments.

Offline proxx

  • Avatarception
  • Global Moderator
  • Titan
  • *
  • Posts: 2803
  • Cookies: 256
  • ФФФ
    • View Profile
Re: 4chan image download in Ruby
« Reply #4 on: October 07, 2014, 09:39:22 PM »
Quote from: lucid
I'll consider it. For some reason, every time I check out optparse examples I can't figure out how to fit it into my code specifically. I get pretty heavy on command-line arguments.
The point is that if you want to do that manually you have to take a lot into account, and things like
Code: [Select]
-r0vs
Code: [Select]
-r 0vs
Code: [Select]
-r1 1
are plain annoying.
I can recommend it, just rip off an existing example :)
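
A small sketch of what the library takes off your hands, assuming -r takes an integer and -v and -s are plain switches (the long names and parse_flags are made up for the example). The attached, separated and clustered spellings all go through the same three definitions:

```ruby
require 'optparse'

# Hypothetical flags: -r takes an integer, -v and -s are plain switches.
def parse_flags(argv)
  flags = { :retries => nil, :verbose => false, :silent => false }

  parser = OptionParser.new do |opts|
    opts.on("-r", "--retries N", Integer, "Number of retries") { |n| flags[:retries] = n }
    opts.on("-v", "--verbose", "Verbose output") { flags[:verbose] = true }
    opts.on("-s", "--silent", "Silent output") { flags[:silent] = true }
  end

  parser.parse(argv)
  flags
end

# Attached value and clustered switches.
p parse_flags(%w[-r1 -vs])
# Separated value and separate switches.
p parse_flags(%w[-r 1 -v -s])
# Long form with "=".
p parse_flags(%w[--retries=1 -sv])
```

All three calls should end up with the same hash, without any hand-written string splitting.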
« Last Edit: October 07, 2014, 09:40:37 PM by proxx »
Wtf where you thinking with that signature? - Phage.
This was another little experiment *evillaughter - Proxx.
Evilception... - Phage

 


