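There are many Java crawlers, and WebMagic is one of them. It is well documented and easy to pick up, which makes it a good fit for personal projects that scrape small amounts of data. This post walks through its basic usage, using lottery draw results as the example.
The official WebMagic documentation (Introduction · WebMagic Documents) is thorough: it walks through a complete crawl by example and shows how to persist the results.
WebMagic is well encapsulated. In most cases we only need to define our own PageProcessor (to extract the data) and Pipeline (to handle the extracted data, for example to persist it).
Following that pattern, let's scrape lottery draw results. Everything below is for personal study only.
Requirement: scrape the lottery draw results and write them to a database.
We build on the Spring Boot framework, which makes scheduled crawling easy, and use MyBatis to write the data to the database.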
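----------------------------divider---------------------------
Sources: the official Sports Lottery site (中国体彩网, the website of the Sports Lottery Administration Center of the General Administration of Sport of China), 500彩票 (500彩票网), and 新浪爱彩 (Sina Aicai).
The Sports Lottery games are 大乐透 (Super Lotto), 7星彩 (7-Star), 排列3 (Pick 3), and 排列5 (Pick 5).
Open the official Sports Lottery site; the home page shows the latest draw result for each game.
Clicking a game takes us to its detail page. Why click through at all? For a beginner, handling each game on its own page is simpler and easier to follow.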
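Open the 大乐透 detail page (超级大乐透_中国体彩网), press F12 in Chrome to open the developer tools, refresh the page, and watch the requests it makes.
Going through the requests one by one, one stands out: the call to getDigitalDrawInfoV1.qry on webapi.sporttery.cn, the URL used in the code below.
It returns JSON, and the data matches our needs exactly, so we can use it directly.
You may wonder why I looked at the requests first instead of analyzing the page content. In fact, before finding this request I did analyze the page: its source contains no draw data, so I concluded that the data is loaded afterwards and filled into the page. From there, checking the requests was the obvious next step.
JSON is the best case: we deserialize it and use it directly, skipping the step of extracting data from HTML. The core code follows.
Define the TcOrgProcessor class, which describes how to extract the data we need.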
// fastjson is assumed here, based on the JSON.parseObject / getIntValue calls
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TcOrgProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(TcOrgProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    public static final String DLT_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=85,0&isVerify=1";
    public static final String QXC_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=04,0&isVerify=1";
    public static final String PL5_URL = "https://webapi.sporttery.cn/gateway/lottery/getDigitalDrawInfoV1.qry?param=35,0;350133,0&isVerify=1";
    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        String url = page.getUrl().toString();
        // Pre-check the response
        String text = page.getRawText();
        JSONObject rootInfo = JSON.parseObject(text);
        if (rootInfo.getIntValue("errorCode") != 0) {
            page.setSkip(true);
            logger.error("Request returned an error, URL=>{}, body=>{}", url, text);
            return;
        }
        // Read the "value" field
        JSONObject valueObject = rootInfo.getJSONObject("value");
        // 大乐透, 七星彩 and 排列 share the same result model; only the key differs
        try {
            String[] keys = new String[] { "dlt", "qxc", "plw", "pls" };
            List<DrawInfo> drawInfos = new ArrayList<>();
            for (String key : keys) {
                JSONObject drawInfoObject = valueObject.getJSONObject(key);
                if (drawInfoObject == null || drawInfoObject.isEmpty()) {
                    continue;
                }
                // Read the raw fields
                String gameName = drawInfoObject.getString("lotteryGameName");
                String drawNum = drawInfoObject.getString("lotteryDrawNum");
                String strDrawResult = drawInfoObject.getString("lotteryDrawResult").replaceAll(" ", ",");
                LocalDate drawDate = LocalDate.parse(drawInfoObject.getString("lotteryDrawTime"), DATE_FORMAT);
                String poolBalance = drawInfoObject.getString("poolBalanceAfterdraw").replaceAll(",", "");
                // Build the draw-result model
                List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
                int poolIntValue = new BigDecimal(poolBalance).intValue();
                DrawInfo drawInfo = new DrawInfo(gameName, drawNum, drawDate, drawResult, poolIntValue, Source.TC_ORG);
                drawInfos.add(drawInfo);
            }
            // Hand the results to the pipelines
            page.putField("results", drawInfos);
        } catch (Exception e) {
            logger.error("Parse error: {}", e.getMessage());
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new TcOrgProcessor()).addUrl(DLT_URL).addUrl(QXC_URL).addUrl(PL5_URL).addPipeline(new ConsolePipeline()).run();
    }
}
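The two field conversions inside process() are easy to get subtly wrong, so here is a minimal standalone sketch of them that uses only the JDK. The class name DrawFieldParser and the sample inputs are made up for illustration; they are not part of the original project, nor are they taken from a real API response.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper mirroring the conversions done inside TcOrgProcessor.process(). */
final class DrawFieldParser {

    private DrawFieldParser() {
    }

    /** Turns a space-separated result such as "05 12 20 27 33 02 09" into a list of integers. */
    static List<Integer> parseNumbers(String raw) {
        return Arrays.stream(raw.trim().split("\\s+"))
                .map(Integer::valueOf)
                .collect(Collectors.toList());
    }

    /** Strips thousands separators and truncates any fractional part of the amount. */
    static int parsePoolBalance(String raw) {
        return new BigDecimal(raw.replace(",", "")).intValue();
    }

    public static void main(String[] args) {
        // Invented sample values, for demonstration only
        System.out.println(parseNumbers("05 12 20 27 33 02 09"));
        System.out.println(parsePoolBalance("188,827,520.45"));
    }
}
```

One thing to keep in mind: BigDecimal.intValue() wraps around silently for values above Integer.MAX_VALUE (2,147,483,647), so a long field may be the safer choice if the prize pool can grow past that.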
DrawInfo is the draw-result model. We normalize the extracted data into this model so that it is easy to use later in the Pipeline.
import java.time.LocalDate;
import java.util.List;
import java.util.Objects;

/**
 * Draw-result model
 */
public class DrawInfo {
    // Game name [大乐透, 7星彩, 排列5]
    private String game;
    // Draw number [21035]
    private String expect;
    // Draw date [2021-03-15]
    private LocalDate drawDate;
    // Winning numbers [1,2,3,4,5]
    private List<Integer> drawResult;
    // Prize pool [188827520]
    private int poolBalance;
    // Source the record was collected from
    private Source source;

    public String getGame() {
        return game;
    }

    public void setGame(String game) {
        this.game = game;
    }

    public String getExpect() {
        return expect;
    }

    public void setExpect(String expect) {
        this.expect = expect;
    }

    public LocalDate getDrawDate() {
        return drawDate;
    }

    public void setDrawDate(LocalDate drawDate) {
        this.drawDate = drawDate;
    }

    public List<Integer> getDrawResult() {
        return drawResult;
    }

    public void setDrawResult(List<Integer> drawResult) {
        this.drawResult = drawResult;
    }

    public int getPoolBalance() {
        return poolBalance;
    }

    public void setPoolBalance(int poolBalance) {
        this.poolBalance = poolBalance;
    }

    public Source getSource() {
        return source;
    }

    public void setSource(Source source) {
        this.source = source;
    }

    public DrawInfo(String game, String expect, LocalDate drawDate, List<Integer> drawResult, int poolBalance, Source source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.poolBalance = poolBalance;
        this.source = source;
    }

    @Override
    public String toString() {
        return "DrawInfo{" + "game='" + game + '\'' + ", expect='" + expect + '\'' + ", drawDate=" + drawDate + ", drawResult=" + drawResult
                + ", poolBalance=" + poolBalance + ", source=" + source + '}';
    }

    // Equality is based on game, expect, drawDate and drawResult only
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        DrawInfo drawInfo = (DrawInfo) o;
        return Objects.equals(game, drawInfo.game) && Objects.equals(expect, drawInfo.expect) && Objects.equals(drawDate, drawInfo.drawDate)
                && Objects.equals(drawResult, drawInfo.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }
}
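One consequence of the equals and hashCode above is that poolBalance and source take no part in the comparison, so the same draw fetched from two sites counts as one. The sketch below illustrates this with a trimmed-down stand-in class. Draw and DedupDemo are hypothetical names, and the Source enum is only an assumed shape: the article uses Source.TC_ORG and Source.WUBAI_COM without showing the enum itself.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Assumed shape of the Source enum; only these two constants appear in the article's code. */
enum Source { TC_ORG, WUBAI_COM }

/** Trimmed-down stand-in for DrawInfo that compares the same four fields. */
final class Draw {
    final String game;
    final String expect;
    final LocalDate drawDate;
    final List<Integer> drawResult;
    final Source source;

    Draw(String game, String expect, LocalDate drawDate, List<Integer> drawResult, Source source) {
        this.game = game;
        this.expect = expect;
        this.drawDate = drawDate;
        this.drawResult = drawResult;
        this.source = source;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Draw)) {
            return false;
        }
        Draw other = (Draw) o;
        // source is intentionally not compared, matching DrawInfo.equals
        return Objects.equals(game, other.game) && Objects.equals(expect, other.expect)
                && Objects.equals(drawDate, other.drawDate) && Objects.equals(drawResult, other.drawResult);
    }

    @Override
    public int hashCode() {
        return Objects.hash(game, expect, drawDate, drawResult);
    }
}

final class DedupDemo {
    /** Keeps the first occurrence of each draw, whichever source it came from. */
    static Set<Draw> dedupe(List<Draw> draws) {
        return new LinkedHashSet<>(draws);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2021, 3, 15);
        Draw fromOfficial = new Draw("DLT", "21035", date, Arrays.asList(1, 2, 3, 4, 5), Source.TC_ORG);
        Draw fromWubai = new Draw("DLT", "21035", date, Arrays.asList(1, 2, 3, 4, 5), Source.WUBAI_COM);
        System.out.println(dedupe(Arrays.asList(fromOfficial, fromWubai)).size());
    }
}
```

The flip side is that two sources disagreeing on the winning numbers would produce two distinct entries, which may be a useful signal that one of the parsers needs attention.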
Now that we have scraped the data we need, a custom Pipeline lets us process the results ourselves.
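import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.List;
import java.util.Map;

@Component
public class DrawResultPipeline implements Pipeline {
    private final Logger logger = LoggerFactory.getLogger(DrawResultPipeline.class);

    /**
     * Processes the data handed over by the PageProcessor
     * @param resultItems the fields extracted from one page
     * @param task the task (Spider) that produced them
     */
    @Override
    public synchronized void process(ResultItems resultItems, Task task) {
        Map<String, Object> map = resultItems.getAll();
        logger.info("Crawl result: {}", map);
        //noinspection unchecked
        List<DrawInfo> results = (List<DrawInfo>) map.get("results");
        //TODO: persist to the database
    }
}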
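Since the crawl is meant to run repeatedly on a schedule, the Pipeline will probably receive the same draw many times, so whatever replaces the TODO presumably needs to be idempotent. One way to picture this, assuming a unique key on game plus draw number, is the in-memory DrawStore below. It is a hypothetical stand-in for the real table, not the author's persistence code, and it uses no MyBatis at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** In-memory stand-in for a table with a unique key on (game, expect). */
final class DrawStore {
    private final Map<String, String> rows = new LinkedHashMap<>();

    /** Inserts the draw if its key is new; returns true only when a row was actually added. */
    boolean saveIfAbsent(String game, String expect, String numbers) {
        String key = game + "#" + expect;
        return rows.putIfAbsent(key, numbers) == null;
    }

    /** Number of distinct draws stored so far. */
    int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        DrawStore store = new DrawStore();
        // The second call simulates the next scheduled crawl seeing the same draw again
        System.out.println(store.saveIfAbsent("DLT", "21035", "1,2,3,4,5"));
        System.out.println(store.saveIfAbsent("DLT", "21035", "1,2,3,4,5"));
        System.out.println(store.size());
    }
}
```

With a real database the same effect would typically come from a unique index on those two columns combined with an insert that ignores or updates duplicates; the exact statement depends on the database in use.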
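To pick up new results promptly, we set up a scheduled task that crawls at a fixed interval.
Spring Boot makes scheduled tasks easy: enable scheduling with @EnableScheduling on a configuration class and annotate the task method with @Scheduled. The cron expression used below fires every two minutes between 08:00 and 23:58.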
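import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

// Use jakarta.annotation.Resource instead on Spring Boot 3 and later
import javax.annotation.Resource;

/**
 * Scheduled task that crawls the draw results
 */
@Component
public class SchedulerTask {
    private final Logger logger = LoggerFactory.getLogger(SchedulerTask.class);
    // Inject the custom Pipeline and pass it to WebMagic's Spider
    @Resource
    private DrawResultPipeline drawResultPipeline;

    /**
     * Crawls the draw results on a schedule
     */
    @Scheduled(cron = "0 0/2 8-23 * * ?")
    public void fetch() throws Exception {
        // The Spider needs its start URLs; without them there is nothing to crawl
        Spider.create(new TcOrgProcessor())
                .addUrl(TcOrgProcessor.DLT_URL, TcOrgProcessor.QXC_URL, TcOrgProcessor.PL5_URL)
                .setExitWhenComplete(true)
                .addPipeline(drawResultPipeline)
                .start();
        //TODO: add crawlers for the other sources
    }
}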
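That completes the flow from crawling to persistence.
The other sources differ only in their PageProcessor; the persistence step is the same. So we just write the corresponding PageProcessor and add it to the scheduled task to have it crawled on schedule.
The PageProcessor for 500彩票: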
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Selectable;

import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * The data here is extracted from the page itself, using XPath or regular expressions.
 * Use Chrome's F12 tools to inspect the page source and extract the data step by step.
 * http://kaijiang.500.com/
 */
public class WubaiProcessor implements PageProcessor {
    private final Logger logger = LoggerFactory.getLogger(WubaiProcessor.class);
    private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    public static final String START_URL = "http://kaijiang.500.com";
    private final Site site = Site.me();

    @Override
    public void process(Page page) {
        // Root node of the draw-result table
        Selectable rootNode = page.getHtml().xpath("//table[@class='kj_tablelist01']/tbody");
        List<DrawInfo> drawInfos = new ArrayList<>();
        // 大乐透
        try {
            Selectable dltNode = rootNode.xpath("//tr[@id='dlt']");
            String drawNum = dltNode.xpath("//td[2]/text()").replace("期", "").toString().trim();
            String strDrawDate = dltNode.xpath("//td[3]/text()").toString().trim();
            LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
            String strDrawResult = dltNode.xpath("//td[4]/script").regex("formatResult\\('dlt','(.*)'\\)", 1).toString().trim();
            strDrawResult = strDrawResult.replace("|", ",");
            String poolBalance = dltNode.xpath("//td[5]/script").regex("formatCCMoney\\('dlt','(.*)'\\)", 1).toString().trim();
            logger.info("大乐透: {}, {}, {}, {}", drawNum, drawDate, strDrawResult, poolBalance);
            // Build the draw-result object
            List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
            int poolIntValue = new BigDecimal(poolBalance).intValue();
            DrawInfo dltInfo = new DrawInfo("大乐透", drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
            drawInfos.add(dltInfo);
        } catch (Exception e) {
            logger.error("Failed to parse the 大乐透 row: {}", e.getMessage());
        }
        // 7星彩
        try {
            Selectable qxcNode = rootNode.xpath("//tr[@id='qxc']");
            String drawNum = qxcNode.xpath("//td[2]/text()").replace("期", "").toString().trim();
            String strDrawDate = qxcNode.xpath("//td[3]/text()").toString().trim();
            LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
            String strDrawResult = qxcNode.xpath("//td[4]/script").regex("formatResult\\('qxc','(.*)'\\)", 1).toString().trim();
            String poolBalance = qxcNode.xpath("//td[5]/script").regex("formatCCMoney\\('qxc','(.*)'\\)", 1).toString().trim();
            logger.info("7星彩: {}, {}, {}, {}", drawNum, drawDate, strDrawResult, poolBalance);
            // Build the draw-result object
            List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
            int poolIntValue = new BigDecimal(poolBalance).intValue();
            DrawInfo qxcInfo = new DrawInfo("7星彩", drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
            drawInfos.add(qxcInfo);
        } catch (Exception e) {
            logger.error("Failed to parse the 7星彩 row: {}", e.getMessage());
        }
        // 排列5
        try {
            Selectable plwNode = rootNode.xpath("//tr[@id='plw']");
            String drawNum = plwNode.xpath("//td[2]/text()").replace("期", "").toString().trim();
            String strDrawDate = plwNode.xpath("//td[3]/text()").toString().trim();
            LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
            String strDrawResult = plwNode.xpath("//td[4]/script").regex("formatResult\\('plw','(.*)'\\)", 1).toString().trim();
            String poolBalance = plwNode.xpath("//td[5]/script").regex("formatCCMoney\\('plw','(.*)'\\)", 1).toString().trim();
            logger.info("排列5: {}, {}, {}, {}", drawNum, drawDate, strDrawResult, poolBalance);
            // Build the draw-result object
            List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
            int poolIntValue = new BigDecimal(poolBalance).intValue();
            DrawInfo plwInfo = new DrawInfo("排列5", drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
            drawInfos.add(plwInfo);
        } catch (Exception e) {
            logger.error("Failed to parse the 排列5 row: {}", e.getMessage());
        }
        // 排列3
        try {
            Selectable plsNode = rootNode.xpath("//tr[@id='pls']");
            String drawNum = plsNode.xpath("//td[2]/text()").replace("期", "").toString().trim();
            String strDrawDate = plsNode.xpath("//td[3]/text()").toString().trim();
            LocalDate drawDate = LocalDate.parse(strDrawDate, DATE_FORMAT);
            String strDrawResult = plsNode.xpath("//td[4]/script").regex("formatResult\\('pls','(.*)'\\)", 1).toString().trim();
            String poolBalance = plsNode.xpath("//td[5]/script").regex("formatCCMoney\\('pls','(.*)'\\)", 1).toString().trim();
            logger.info("排列3: {}, {}, {}, {}", drawNum, drawDate, strDrawResult, poolBalance);
            // Build the draw-result object
            List<Integer> drawResult = Arrays.stream(strDrawResult.split(",")).map(Integer::valueOf).collect(Collectors.toList());
            int poolIntValue = new BigDecimal(poolBalance).intValue();
            DrawInfo plsInfo = new DrawInfo("排列3", drawNum, drawDate, drawResult, poolIntValue, Source.WUBAI_COM);
            drawInfos.add(plsInfo);
        } catch (Exception e) {
            logger.error("Failed to parse the 排列3 row: {}", e.getMessage());
        }
        page.putField("results", drawInfos);
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new WubaiProcessor()).addUrl(START_URL).run();
    }
}
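The regex(...) calls above pull the second argument out of an inline script call such as formatResult('dlt','...'). The same extraction can be tried in isolation with plain java.util.regex before wiring it into a PageProcessor. This is only a sketch: the class name ScriptArgExtractor and the sample script text are invented for the demo and are not copied from the real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the quoted second argument of a script call. */
final class ScriptArgExtractor {

    private ScriptArgExtractor() {
    }

    /** Returns the second argument of function('gameId','...'), or null when the call is not found. */
    static String extract(String script, String function, String gameId) {
        // Same shape as the article's pattern, with the function name and id quoted literally
        Pattern pattern = Pattern.compile(
                Pattern.quote(function) + "\\('" + Pattern.quote(gameId) + "','(.*)'\\)");
        Matcher matcher = pattern.matcher(script);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Invented sample markup, for demonstration only
        String script = "<script>formatResult('dlt','05,12,20,27,33|02,09')</script>";
        String raw = extract(script, "formatResult", "dlt");
        System.out.println(raw);
        // The article's code then turns the "|" separator into a comma
        System.out.println(raw.replace("|", ","));
    }
}
```

The capture group (.*) is greedy, so if one script element ever contained two such calls it would swallow everything up to the last closing quote; a tighter group such as ([^']*) is one possible safeguard.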
That concludes this example of using WebMagic to fetch the data we want and persist it to a database.
Questions and discussion are welcome.