
Show Posts



Messages - bonewagon

1
Scripting Languages / [Python] Manga downloader
« on: March 19, 2014, 09:24:09 PM »
This script downloads a manga (one at a time) from mangareader.net for offline reading.

Code: (python)
#!/usr/bin/python
import urllib2
from re import compile, findall
from sys import argv
from os import mkdir

chapter_exp = compile('(?<=href=")/(?:\d+-)+\d+/[\w-]+/chapter-\d+\.html') # extract the URLs for each chapter
page_exp = compile('(?<=value=")/(?:\d+-)+\d+/[\w-]+/chapter-\d+\.html') # the URLs for each page in the chapter
img_exp = compile('(?<=src=")https?://i\d*\.mangareader\.net/.*\d+\.jpg') # the image for the page

chapter_number = page_number = 1

def get_matches(url, exp):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0')]
    matches = findall(exp, opener.open(url).read())
    # drop duplicate matches while keeping their original order
    for match in matches:
        while matches.count(match) != 1:
            matches.remove(match)
    return matches

def fetch_image(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0')]
    return opener.open(url).read()

def usage():
    print "Usage: mangadownloader.py [options]"
    print "Options: \n\n"

    print "-starturl=http://www.mangareader.net/311/bloody-monday.html \t download the manga for which the table of contents is at the given mangareader URL (Bloody Monday in this case)\n"
    print "-outputpath=/path/to/output/directory \t Write the manga into this directory\n"
    print "-subdir=BloodyMonday \t In the output path listed above, create a subdirectory titled 'BloodyMonday' to put the manga into (optional)\n"
    print "-mergechapters \t Merge the pages of all chapters into one chapter (optional)\n"
    print "-help or -h \t Display this message"

    exit()

if "-help" in argv or "-h" in argv:
    usage()

arguments = {}

for arg in argv:

    if arg.startswith("-starturl="):
        arguments["starturl"] = arg[len("-starturl="):]

    elif arg.startswith("-outputpath="):
        arguments["outputpath"] = arg[len("-outputpath="):]

    elif arg.startswith("-subdir="):
        arguments["subdir"] = arg[len("-subdir="):]


for required_argument in ["starturl", "outputpath"]:
    if not arguments.has_key(required_argument):
        usage()

if "-mergechapters" in argv:
    arguments["mergechapters"] = True
else:
    arguments["mergechapters"] = False

if arguments.has_key("subdir"):
    arguments["outputpath"] += "/" + arguments["subdir"]
    try:
        mkdir(arguments["outputpath"])
    except: # directory already exists
        pass

chapters = get_matches(arguments["starturl"], chapter_exp)
for chapter in range(len(chapters)):
    chapters[chapter] = "http://www.mangareader.net" + chapters[chapter] # convert to absolute URLs

for chapter in chapters:

    if not arguments["mergechapters"]:
        path = arguments["outputpath"] + "/" + str(chapter_number)
        try:
            mkdir(path)
        except:
            pass
        chapter_number += 1
    else:
        path = arguments["outputpath"]

    pages = get_matches(chapter, page_exp)
    for page in pages:
        image = fetch_image(get_matches("http://www.mangareader.net" + page, img_exp)[0])
        open(path + "/" + str(page_number) + ".jpg", "wb").write(image)
        page_number += 1
    if not arguments["mergechapters"]:
        page_number = 1

raw_input("Done. Press [enter] to quit: ")

Usage:

-starturl=http://www.mangareader.net/326/code-geass-lelouch-of-the-rebellion.html (This downloads the manga at the specified URL - Code Geass in this case. Obviously it must be a mangareader.net URL.)

-outputpath=C:/Users/Me/path/to/output/directory (Specify the directory the manga chapters and pages are to be written into.)

-subdir=CodeGeass (Creates a subdirectory with the specified name - "CodeGeass" in this case - inside the directory specified in the option above, and the manga will be written into it. This is optional.)

-mergechapters (Rather than organizing the manga into chapters, all the pages will be written into the same directory. This is obviously optional, but I don't recommend it; I don't see why anyone would want to do this for anything but the shortest of mangas.)

The chapters will be numbered from 1 up to however many chapters the manga has, and the pages in each chapter will be numbered from 1 up to however many pages that chapter has.
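
Putting the options above together, a full run from the command line would look roughly like this (using the same example URL, output path, and subdirectory as above):

Code:
python mangadownloader.py -starturl=http://www.mangareader.net/326/code-geass-lelouch-of-the-rebellion.html -outputpath=C:/Users/Me/path/to/output/directory -subdir=CodeGeass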

2
I've written a small program that harvests data from Google. I named it "Pygore" (Python Google Regular Expressions).

It is like a web crawler, but a bit unusual, so I'll try to explain what it does:

  • To start with, it makes a user-defined Google query. The user can enter ordinary search terms or use Google operators ("Google hacking") to improve the results of the search.
  • A certain number of URLs are extracted from the results of the query; that number is defined by the user. If the query produced fewer URLs than the user asked for, Pygore simply extracts all of them.
  • Pygore then visits each of these URLs (as a web client) and downloads the HTML source of each page.
  • Finally, Pygore searches the HTML for a user-defined regular expression. The matches are extracted and dumped; optionally, the URLs the matches were found at can be dumped right beside the matches themselves. The results can go to the terminal, a line-by-line text file, or an HTML file. (A rough sketch of this pipeline follows below.)
(I've written it using Tkinter for the GUI, and it uses the xgoogle library to implement the Google searching.
Pygore is split into several modules rather than being a single .py script, so I uploaded it as an attachment instead of posting the source code directly. The attachment contains the source code, although the xgoogle modules were not written by me.)
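
To make the steps above concrete, here is a minimal sketch of the same pipeline in plain Python 2. This is not the actual Pygore source (no Tkinter GUI, no output-format options), and it assumes xgoogle exposes a GoogleSearch class whose get_results() returns result objects with a .url attribute - treat the exact calls as an approximation:

Code: (python)
import re
import urllib2
from xgoogle.search import GoogleSearch  # assumed xgoogle interface

def pygore_sketch(query, max_urls, pattern, dump_urls=False):
    # 1. run the user-defined Google query (Google operators work here too)
    results = GoogleSearch(query).get_results()
    # 2. keep at most max_urls result URLs (all of them if fewer were returned)
    urls = [r.url for r in results][:max_urls]
    exp = re.compile(pattern)
    for url in urls:
        # 3. visit each URL as a web client and grab the HTML source
        try:
            html = urllib2.urlopen(url).read()
        except urllib2.URLError:
            continue
        # 4. extract every match of the user-defined regex and dump it
        for match in exp.findall(html):
            if dump_urls:
                print match, url  # dump the URL right beside the match
            else:
                print match

# example: pygore_sketch('inurl:contact', 10, r'[\w.+-]+@[\w.-]+\.\w+')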
I think something like this can be useful, although I myself am probably never going to use it. Let me know what you think.


