This post is based on: http://tonl.iteye.com/blog/1918245
Python version: 2.7, 64-bit Windows build.
Download Python: http://www.python.org/getit/
Choose the Python 2.7.5 Windows X86-64 Installer (Windows AMD64 / Intel 64 / X86-64 binary; does not include source) and install it.
First, write the following spider.py script:
# -*- coding: utf-8 -*-
# Python 2.7 script: download every URL listed in a text file.
from urllib import urlopen
import os
import sys


class Spider:
    """Download the web pages listed in the given file."""

    def __init__(self, filename, downloadPath):
        """Store both paths; exit if either one does not exist."""
        if not os.path.isfile(filename):
            print 'the given file does not exist, the program will exit'
            sys.exit(1)
        if not os.path.isdir(downloadPath):
            print 'the given download path does not exist, the program will exit'
            sys.exit(1)
        self.fname = filename
        self.dpath = downloadPath

    def download(self):
        """Download the pages listed in the file, one URL per line."""
        fp = open(self.fname, 'r')
        for line in fp:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            # Build the file name from the letters and digits of the URL.
            if 'html' in url:
                tempname = filter(str.isalnum, url).replace('html', '.html')
            else:
                tempname = filter(str.isalnum, url) + '.html'
            self.download_html(url, os.path.join(self.dpath, tempname))
        fp.close()

    def download_html(self, website, filename):
        """Fetch one URL and save the response body to filename."""
        response = urlopen(website)
        data = response.read()
        fp = open(filename, 'wb')  # overwrite on re-run instead of appending
        fp.write(data)
        fp.close()


def test():
    """Entry point: python spider.py <url list file> <download dir>"""
    filename = sys.argv[1]
    downloadPath = sys.argv[2]
    spider = Spider(filename, downloadPath)
    spider.download()


if __name__ == '__main__':
    test()
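The script above takes two arguments. The first is a file listing the addresses of the pages to download, one URL per line, in the following format (websites.txt):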
http://blog.csdn.net/fansy1990
http://www.baidu.com
The second argument is the directory where the downloaded pages are saved.
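To see in advance which file each URL ends up in, the naming rule from download() can be tried on its own. This is only a sketch: page_filename is a hypothetical helper that is not part of spider.py, and it uses a join instead of filter() so that it runs on both Python 2.7 and Python 3.

```python
def page_filename(url):
    """Mirror the naming rule in Spider.download (hypothetical helper)."""
    # Keep only letters and digits, which also drops the trailing newline.
    name = ''.join(ch for ch in url.strip() if ch.isalnum())
    if 'html' in url:
        # Put the dot back in front of every "html" found in the URL.
        return name.replace('html', '.html')
    return name + '.html'


print(page_filename('http://www.baidu.com'))            # httpwwwbaiducom.html
print(page_filename('http://blog.csdn.net/fansy1990'))  # httpblogcsdnnetfansy1990.html
print(page_filename('http://example.com/index.html'))   # httpexamplecomindex.html
```

Since every non-alphanumeric character is dropped, two different URLs can in principle map to the same file name, so it is worth keeping the URL list free of near-duplicates.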
Then run it from the command line:
python D:\spider.py D:\websites.txt D:\download_tmp
Then look in D:\download_tmp for the downloaded files. If they are there, the setup is correct.
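The same check can be done with a few lines of Python instead of browsing the folder by hand. A minimal sketch, assuming the download directory only needs to be scanned for .html files; list_downloads is a hypothetical helper, not part of spider.py.

```python
import os


def list_downloads(download_dir):
    """Return (file name, size in bytes) for each .html file in download_dir."""
    results = []
    for name in sorted(os.listdir(download_dir)):
        path = os.path.join(download_dir, name)
        if name.endswith('.html') and os.path.isfile(path):
            results.append((name, os.path.getsize(path)))
    return results
```

For example, list_downloads(r'D:\download_tmp') should return one entry per line of websites.txt. An entry with size 0 suggests that the page was fetched but came back empty.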
Finally, write the following Java program. It needs the jython-*.jar on the build path (I downloaded version 2.2):
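package test;

import java.io.IOException;

public class PyTest {

    /**
     * Runs spider.py as an external process and waits for it to finish.
     *
     * @param args unused
     * @throws IOException if the python process cannot be started
     * @throws InterruptedException if the wait is interrupted
     */
    public static void main(String[] args) throws IOException, InterruptedException {
        String py_path = "D:\\spider.py";
        String websites = "D:\\websites.txt";
        String outDir = "D:\\tmp"; // must already exist; spider.py does not create it

        // "python" is looked up through the PATH of the Java process
        Process pr = Runtime.getRuntime().exec("python " + py_path + " " + websites + " " + outDir);
        pr.waitFor();
        System.out.println("done ...");
    }
}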
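To run the program above, open the Environment settings of the run configuration in Eclipse and add a PATH variable whose value is the Python installation directory.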
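If the Java program fails to start python, it helps to check whether the interpreter can be found on PATH at all. The following is a rough sketch of such a lookup, assuming a simple directory-by-directory search. find_on_path and its extension list are illustrative, and the real Windows lookup checks a few other locations as well.

```python
import os


def find_on_path(program, path=None, extensions=('', '.exe', '.bat', '.cmd')):
    """Return the first file matching program in the PATH directories, or None."""
    if path is None:
        path = os.environ.get('PATH', '')
    for directory in path.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted
        if not directory:
            continue
        for ext in extensions:
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None
```

Calling find_on_path('python') with the same PATH value that was entered in Eclipse should return the full path to python.exe. A result of None means the directory in the PATH variable is wrong.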
After running it, the following message appears:
*sys-package-mgr*: can't create package cache dir, '*jython-2.2.jar\cachedir\packages'
This can be ignored; it does not affect the program.