Journal of Chuxiong Normal University, 2020, Vol. 35, Issue (6): 124-131.

• Computer Science •

A Focused Crawler Method with Configurable Themes Based on Heritrix

王松1,*, 刘洪基1, 叶晓波2   

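  1. School of Economics & Management, Chuxiong Normal University, Chuxiong, Yunnan Province 675000;
  2. Dept. of Management of State-owned Assets and Informationalization, Chuxiong Normal University, Chuxiong, Yunnan Province 675000
  • Received: 2020-05-06 Online: 2020-11-20 Published: 2021-03-29
  • Corresponding author: * WANG Song (b. 1976), male, master's degree, senior laboratory technician at the School of Economics & Management, Chuxiong Normal University. Research interests: big data mining, management and analysis; machine learning; databases. E-mail: ws@cxtc.edu.cn, Tel. 13508789000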

A Focused Crawler Method with Configurable Themes Based on Heritrix

WANG Song, LIU Hongji, YE Xiaobo   

  1. School of Economics & Management, Chuxiong Normal University, Chuxiong, Yunnan Province 675000;
  2. Dept. of Management of State-owned Assets and Informationalization, Chuxiong Normal University, Chuxiong, Yunnan Province 675000
  • Received: 2020-05-06 Online: 2020-11-20 Published: 2021-03-29

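Abstract: General-purpose search engines cannot meet users' query needs in a targeted way, and search keywords are difficult to formulate accurately. From the perspective of data mining and machine learning, this paper proposes a focused crawler method with configurable themes, built on the open-source web crawler framework Heritrix. Starting from specified source sites and following different crawl strategies, the crawler launches multi-threaded crawling, performs classified searches according to preset keywords and column (site section) information, and retrieves the information that best matches the given conditions and requirements so that it can be selected, judged, analyzed, and used. To a certain extent, this method addresses the information needs that search-engine queries leave unmet, improving user experience and retrieval efficiency.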
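The workflow outlined in the abstract (seed sites, a crawl strategy, multi-threaded fetching, and filtering by preset keywords and columns) can be illustrated with a minimal sketch. It does not use Heritrix and is not the authors' implementation: the `TopicConfig` class, the `PAGES` toy data, the breadth-first strategy, and every function name below are assumptions made purely for illustration, and an in-memory dictionary stands in for real HTTP fetches.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicConfig:
    """Hypothetical theme configuration: preset keywords and site columns."""
    keywords: tuple
    columns: tuple
    min_hits: int = 1  # minimum number of distinct keywords a page must contain


# Toy in-memory "web" standing in for real HTTP fetches (made-up data).
PAGES = {
    "site/index": {"column": "home", "text": "welcome",
                   "links": ["site/news/1", "site/sports/1"]},
    "site/news/1": {"column": "news", "text": "focused crawler built on Heritrix",
                    "links": ["site/news/2"]},
    "site/news/2": {"column": "news", "text": "campus notice", "links": []},
    "site/sports/1": {"column": "sports", "text": "crawler race results", "links": []},
}


def fetch(url):
    """Return (url, page); page is None when the URL is unknown."""
    return url, PAGES.get(url)


def is_relevant(page, topic):
    """A page matches when its column is configured and enough keywords occur."""
    if page["column"] not in topic.columns:
        return False
    text = page["text"].lower()
    hits = sum(1 for kw in topic.keywords if kw.lower() in text)
    return hits >= topic.min_hits


def focused_crawl(seeds, topic, workers=4):
    """Breadth-first crawl from the seeds, fetching each level in parallel."""
    seen = set(seeds)
    frontier = list(seeds)
    matched = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            next_frontier = []
            # pool.map yields results in input order, so output is deterministic.
            for url, page in pool.map(fetch, frontier):
                if page is None:
                    continue
                if is_relevant(page, topic):
                    matched.append(url)
                for link in page["links"]:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return matched


topic = TopicConfig(keywords=("crawler", "Heritrix"), columns=("news",))
print(focused_crawl(["site/index"], topic))
```

In this sketch, changing the theme only means constructing a different `TopicConfig`; the crawl loop itself stays untouched, which is the sense in which the theme is "configurable" here.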

Key words: focused crawler, configurable theme, Heritrix

Abstract: With the development of the Internet, massive amounts of information have been generated online and have become an important asset. Meanwhile, users’ requirements for information search keep rising, and finding key information quickly and effectively remains a difficult problem. General-purpose search engines satisfy basic search needs, but they cannot satisfy users who focus only on particular themes or fields, and keywords alone are often insufficient to describe what such users need. Approaching the problem from the perspective of data mining and machine learning, this study proposes a focused crawler method with configurable themes, based on the open-source web crawler framework Heritrix. To a certain extent, this method can address the above problems and improve user experience and search efficiency.

Key words: focused crawler, configurable theme, Heritrix

CLC number: