parsed.org

Tips by tag: url

Netcraft Site Report by xinu on Dec 31, 2005 04:06 PM

This is a particularly tasty utility that will tell you what software a site is (and has been) running, along with hosting and DNS lookup information. Just replace http://www.google.com/ in the url parameter with the URL of your choosing.

http://toolbar.netcraft.com/site_report?url=http://www.google.com/
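
If you look up a lot of sites, the report address can be built in code. This is only a rough sketch: the function name site_report_url is made up for illustration, the base address is copied from the tip as-is, and it assumes the service accepts a percent-encoded url parameter (standard query-string encoding) as well as the unencoded form shown above.

```python
from urllib.parse import urlencode

# Base address copied from the tip; it may have changed since 2005.
REPORT_BASE = "http://toolbar.netcraft.com/site_report"


def site_report_url(site):
    """Return the report URL for `site` (hypothetical helper, not part of any library)."""
    # urlencode percent-encodes ':' and '/' so the target URL survives as one parameter.
    return REPORT_BASE + "?" + urlencode({"url": site})


print(site_report_url("http://www.example.com/"))
```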

diagnostic dns online tool url utilities

Python URL regular expression by trick on Sep 19, 2008 10:57 PM

The following Python code builds a regex that should find any valid link. It does not, however, include punctuation at the end of the URL:

SCHEMES = ('http', 'https', 'ftp', 'mailto', 'news', 'gopher',
           'nntp', 'telnet', 'wais', 'prospero', 'aim', 'webcal')
# Note: fragment id is uchar | reserved, see RFC 1738 page 19
# %% for % because of string formatting
# punctuation = ? . : ( ) , ; ! '
# if punctuation is at the end, then don't include it
# every piece is a raw string so the backslashes reach the regex engine untouched
URL_FORMAT = (r'(?<!\w)((?:%s):' # protocol + :
    r'/*(?!/)(?:' # get any starting /'s
    r'[\w$\+\*@&=\-/]' # reserved | unreserved
    r'|%%[a-fA-F0-9]{2}' # escape
    r"|[\?\.:\(\),;!'](?!(?:\s|$))" # punctuation
    r'|(?:(?<=[^/:]{2})#)' # fragment id
    r'){2,}' # at least two characters in the main url part
    r')') % ('|'.join(SCHEMES),)

Code taken from the remark markup library.
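
To see how the pattern treats trailing punctuation, here is a small sketch that compiles it and runs it over a sample string. The names URL_RE and text and the example.com addresses are made up for illustration; the pattern itself is repeated from above (written with raw strings) so the snippet runs on its own.

```python
import re

SCHEMES = ('http', 'https', 'ftp', 'mailto', 'news', 'gopher',
           'nntp', 'telnet', 'wais', 'prospero', 'aim', 'webcal')
URL_FORMAT = (r'(?<!\w)((?:%s):'               # protocol + :
              r'/*(?!/)(?:'                    # any starting /'s
              r'[\w$\+\*@&=\-/]'               # reserved | unreserved
              r'|%%[a-fA-F0-9]{2}'             # escape
              r"|[\?\.:\(\),;!'](?!(?:\s|$))"  # punctuation, unless it ends the url
              r'|(?:(?<=[^/:]{2})#)'           # fragment id
              r'){2,}'                         # at least two characters
              r')') % ('|'.join(SCHEMES),)
URL_RE = re.compile(URL_FORMAT)  # hypothetical name, not from the remark library

# Sample text: each link is followed by punctuation that should be left out.
text = "See http://www.example.com/page, or ftp://ftp.example.org/pub/file.txt."
links = URL_RE.findall(text)  # one capturing group, so findall returns plain strings
print(links)
```

The comma after the first link and the full stop after the second are followed by whitespace or the end of the string, so the lookahead in the punctuation branch rejects them and they stay out of the match.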

python regex url

Ranged Curl by xinu on Jan 15, 2005 02:29 PM

If you have a URL that you need to crawl and you know the range of numbers in the image filenames, you can do something like this:

$ curl -O "http://www.example.com/img/samples[00-99].jpg"

That should (at least attempt to) fetch the images samples00.jpg through samples99.jpg. Enjoy!

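If you're using more than one range, you'll want to build the output filename or path yourself with -o, where #1, #2 and so on stand for the current value of each range. Add the --create-dirs option if the path contains directories. For example:

$ curl "http://www.example.com/imgs[00-99]/samples[00-27].jpg" --create-dirs -o "#1/#2.jpg"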

Alternatively, you can skip the directories and just name the files like dirname_filename.jpg:

$ curl "http://www.example.com/images[00-99]/samples[00-27].jpg" -o "#1_#2.jpg"
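
If you want to preview what such a command would ask for before running it, the ranges can be expanded by hand. The following is a rough Python approximation of the numeric range and #N substitution described above, not curl's own code: the function name expand is made up, it only handles [N-M] ranges (no step values, letters or {a,b} lists), and the order of the combinations is not guaranteed to match curl's.

```python
import itertools
import re

RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand(url_template, name_template):
    """Yield (url, output_name) pairs for every combination of numeric ranges.

    Hypothetical helper that mimics [N-M] ranges and #1, #2 placeholders.
    """
    ranges = []
    for start, end in RANGE.findall(url_template):
        # Keep zero padding when the range start has a leading zero, e.g. 00.
        width = len(start) if start.startswith("0") else 0
        ranges.append([str(n).zfill(width) for n in range(int(start), int(end) + 1)])
    for values in itertools.product(*ranges):
        parts = iter(values)
        # Replace each [N-M] in turn, left to right, with its current value.
        url = RANGE.sub(lambda m: next(parts), url_template)
        name = name_template
        for i, value in enumerate(values, start=1):
            name = name.replace("#%d" % i, value)
        yield url, name


# Small ranges so the listing stays short.
for url, name in expand("http://www.example.com/imgs[00-01]/samples[00-02].jpg", "#1_#2.jpg"):
    print(url, "->", name)
```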

commands curl extras image shell url