wget Batch Downloads Explained: How to Download All Slides from QCon, TCon 2011, and OSCON 2009/2010

April 16, 2011
Filed under: design 

1. Download all PDF files from QCon Beijing 2011

wget `curl -s http://www.qconbeijing.com/schedule.html | sed 's/<\/a>/\n/g' | sed 's/.*href="\([^"]*\)".*$/\1/' | grep download | sed 's/download/http:\/\/www.qconbeijing.com\/download/g'`

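How the command works:
curl fetches schedule.html and writes its contents to stdout.
The first sed replaces every closing link tag </a> with a newline, so that each line contains at most one link (the \n in the replacement requires GNU sed).
The second sed extracts the value between the quotes of href="" and prints it.
grep download keeps only the links of the form download/xxxx.pdf.
The last sed replaces download with the full URL prefix. For example, the link download/panxiaoliang.pdf in the page becomes http://www.qconbeijing.com/download/panxiaoliang.pdf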

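For example, schedule.html contains a line like the following. The second href is the address that needs to be extracted and prefixed with the base URL:
<td><p align="center"><a href="ShowNews.aspx?id=35">Building a High-Performance Microblog System: Sina Weibo Architecture Revisited</a><a target="_blank" href="download/yangweihua.pdf">(Download slides)</a><a href="ShowNews.aspx?id=37"></a><br />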
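The pipeline can be tried offline on the sample line above before pointing it at the live page. The sketch below wraps the four filters in a function. The name extract_qcon_links is made up for this illustration, the link text in the sample is shortened, and the \n in the first sed assumes GNU sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads schedule HTML on stdin, prints absolute PDF URLs.
extract_qcon_links() {
  sed 's/<\/a>/\n/g' |                       # one link per line
    sed 's/.*href="\([^"]*\)".*$/\1/' |      # keep only the href value
    grep download |                          # keep download/xxxx.pdf links
    sed 's/download/http:\/\/www.qconbeijing.com\/download/g'
}

# Sample line modelled on the schedule.html excerpt (link text shortened).
sample='<td><p align="center"><a href="ShowNews.aspx?id=35">Talk title</a><a target="_blank" href="download/yangweihua.pdf">(slides)</a><a href="ShowNews.aspx?id=37"></a><br />'

printf '%s\n' "$sample" | extract_qcon_links

# Against the live page, the same function would be fed from curl:
#   wget `curl -s http://www.qconbeijing.com/schedule.html | extract_qcon_links`
```

Checking the printed URLs first makes it easier to spot a page layout change before wget starts fetching.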

2. Download all slides from QCon San Francisco 2008-2010

wget `curl http://qconsf.com/sf2008/schedule/wednesday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2008#http://qconsf.com/sf2008#'`

wget `curl http://qconsf.com/sf2008/schedule/thursday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2008#http://qconsf.com/sf2008#'`

wget `curl http://qconsf.com/sf2008/schedule/friday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2008#http://qconsf.com/sf2008#'`

Rename each downloaded file to the part of its name that follows %2F:
ls | awk -F%2F '{print "mv " "\""$0"\"", "\""$5"\""}' > ../a.sh
source ../a.sh

wget -c `curl http://qconsf.com/sf2009/schedule/wednesday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2009#http://qconsf.com/sf2009#'`

wget -c `curl http://qconsf.com/sf2009/schedule/thursday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2009#http://qconsf.com/sf2009#'`

wget -c `curl http://qconsf.com/sf2009/schedule/friday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2009#http://qconsf.com/sf2009#'`

Rename each downloaded file to the part of its name that follows %2F:
ls | awk -F%2F '{print "mv " "\""$0"\"", "\""$4"\""}' > ../a.sh
source ../a.sh

wget -c `curl http://qconsf.com/sf2010/schedule/wednesday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2010#http://qconsf.com/sf2010#'`

wget -c `curl http://qconsf.com/sf2010/schedule/thursday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2010#http://qconsf.com/sf2010#'`

wget -c `curl http://qconsf.com/sf2010/schedule/friday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/sf2010#http://qconsf.com/sf2010#'`

Rename each downloaded file to the part of its name that follows %2F:
ls | awk -F%2F '{print "mv " "\""$0"\"", "\""$4"\""}' > ../a.sh
source ../a.sh
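The rename step can also be done without generating an intermediate script. The sketch below assumes the wanted name is always the text after the last %2F, whereas the awk commands above pick a fixed field ($4 or $5). The function name and the demo filename are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename every file in directory $1 whose name contains
# %2F to the text after the last %2F.
rename_after_last_2f() {
  local dir=$1 f base
  for f in "$dir"/*%2F*; do
    [ -e "$f" ] || continue        # the glob matched nothing
    base=${f##*%2F}                # strip everything through the last %2F
    [ -n "$base" ] || continue     # name ends in %2F: leave it alone
    mv -n -- "$f" "$dir/$base"     # -n: never overwrite an existing file
  done
}

# Demo in a scratch directory with a made-up filename.
demo=$(mktemp -d)
touch "$demo/file?path=%2Fqcon%2Fslides%2FSome Talk.pdf"
rename_after_last_2f "$demo"
ls "$demo"
```

Because every expansion is quoted, names containing spaces are handled, and files without %2F in their name are not touched.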

3. Download slides from QCon London 2010-2011

wget `curl http://qconlondon.com/london-2010/schedule/wednesday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/london-2010#http://qconlondon.com/london-2010#'`

wget `curl http://qconlondon.com/london-2011/schedule/wednesday.jsp -s | grep pdf | sed 's#<a href=##' | sed 's#"##' | sed 's#">##' | sed 's#/london-2011#http://qconlondon.com/london-2011#'`

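For both 2010 and 2011 there are also thursday and friday pages, for example the two below. Their download commands are not listed above; just replace wednesday in the commands above with thursday or friday.
http://qconlondon.com/london-2011/schedule/thursday.jsp
http://qconlondon.com/london-2011/schedule/friday.jsp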
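Instead of editing the command by hand for each day, the day names can be looped over. This is a minimal sketch: the function names schedule_urls and download_page are made up here, and download_page is only defined, not called, because it needs network access.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the three schedule page URLs for one conference.
# $1 = site (e.g. qconlondon.com), $2 = year path (e.g. london-2011)
schedule_urls() {
  local site=$1 year=$2 day
  for day in wednesday thursday friday; do
    printf 'http://%s/%s/schedule/%s.jsp\n' "$site" "$year" "$day"
  done
}

# Hypothetical helper: run the filter chain from the commands above on one page.
# The command substitution is left unquoted on purpose so that each extracted
# URL becomes a separate wget argument.
download_page() {
  local url=$1 site=$2 year=$3
  wget -c $(curl -s "$url" | grep pdf | sed 's#<a href=##' | sed 's#"##' |
    sed 's#">##' | sed "s#/$year#http://$site/$year#")
}

schedule_urls qconlondon.com london-2011

# To download everything for one year:
#   for url in $(schedule_urls qconlondon.com london-2011); do
#     download_page "$url" qconlondon.com london-2011
#   done
```

The same loop should work for the San Francisco pages by passing qconsf.com and sf2008, sf2009 or sf2010.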

Rename each downloaded file to the part of its name that follows %2F:

ls | awk -F%2F '{print "mv " "\""$0"\"", "\""$4"\""}' > ../a.sh
source ../a.sh

Update, 2011/07:

4. Download all slides from Taobao Carnival 2011 (Tcon 2011)

wget http://developerclub.taobao.com/schedule/ -O tcon2011.txt
wget `grep ppts tcon2011.txt | sed 's/.*href="\([^"]*\)".*$/\1/' | sed 's#/ppts#http://developerclub.taobao.com/ppts#g'`

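The first wget saves the contents of the schedule page to tcon2011.txt.

grep ppts finds every line that contains a download link and writes it to standard output.

The first sed extracts the link address from each href (a relative link), e.g. /ppts/魏子均More_Weapons_More_Power.pdf.

The second sed turns that relative path into an absolute URL, e.g.:

http://developerclub.taobao.com/ppts/魏子均More_Weapons_More_Power.pdf

The outer wget then downloads all of the links.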
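As a variation, the grep and the two sed calls can be folded into a single sed that prints only the lines where a substitution happened. This is a sketch of that alternative, not the command used above. It matches only hrefs that start with /ppts, which is slightly stricter than grep ppts. The function name tcon_links and the sample lines are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the saved schedule page on stdin and prints one
# absolute URL for each line whose href starts with /ppts.
# -n suppresses normal output; the trailing p prints only substituted lines.
tcon_links() {
  sed -n 's#.*href="\(/ppts[^"]*\)".*#http://developerclub.taobao.com\1#p'
}

# Made-up sample lines standing in for tcon2011.txt.
printf '%s\n' \
  '<li><a href="/ppts/魏子均More_Weapons_More_Power.pdf">slides</a></li>' \
  '<li><a href="/about.html">about</a></li>' |
  tcon_links

# With the real page saved as tcon2011.txt:
#   wget `tcon_links < tcon2011.txt`
```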

5. Download all slides from OSCON 2009/2010

Download the OSCON 2009 slides:

wget http://www.oscon.com/oscon2009/public/schedule/proceedings  -O oscon2009.txt
grep "Presentation File:" -A 6  oscon2009.txt | grep "a href" | sed 's/.*href="\([^"]*\)".*$/wget "\1"/' > download-oscon2009.sh
source download-oscon2009.sh

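The first grep finds each line containing Presentation File: together with the 6 lines that follow it, and prints them.

The second grep keeps only the lines that contain an href link.

The sed command generates one wget command per link, e.g.:

wget "http://assets.en.oreilly.com/1/event/27/Django in the Real World Presentation.pdf"

The result is written to download-oscon2009.sh, which is then executed.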
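The generation step can be checked offline before running the script it produces. The sketch below runs the same grep and sed filters on a made-up excerpt. The real proceedings markup may differ, and the function name make_wget_script is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a saved proceedings page ($1) into wget commands.
make_wget_script() {
  grep -A 6 "Presentation File:" "$1" |
    grep "a href" |
    sed 's/.*href="\([^"]*\)".*$/wget "\1"/'
}

# Made-up excerpt standing in for oscon2009.txt.
page=$(mktemp)
cat > "$page" <<'EOF'
<a href="/oscon2009/public/schedule/detail/0000">Some talk</a>
<dt>Presentation File:</dt>
<dd>
<a href="http://assets.en.oreilly.com/1/event/27/Django in the Real World Presentation.pdf">Django in the Real World Presentation</a>
</dd>
EOF

make_wget_script "$page"
```

The first link is dropped because it does not fall within the 6 lines after a Presentation File: line. The double quotes that sed adds around the URL keep an address containing spaces together as a single wget argument.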

Download the OSCON 2010 slides:

wget http://www.oscon.com/oscon2010/public/schedule/proceedings -O oscon2010.txt
grep "Presentation:" -A 6 oscon2010.txt | grep "a href" | sed 's/.*href="\([^"]*\)".*$/wget "\1"/' > download-oscon2010.sh
source download-oscon2010.sh

The options work the same way as in the OSCON 2009 commands above.

 
