Research of a professional search engine system based on Lucene and Heritrix

Ying Hong* and Chao Lv

Abstract

To help users find professional information quickly and accurately, a professional search engine is designed and implemented. First, web pages are collected with an extended Heritrix web crawler, and the data extracted from those pages with jsoup is saved locally. Second, the collected data is processed using Chinese word segmentation, inverted indexing, index retrieval, and an improved web page ranking algorithm. Finally, these components are combined into a working professional search engine. The experimental results show that this search engine substantially improves the accuracy and efficiency of web page information retrieval.
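
The inverted index mentioned above maps each term to the set of documents that contain it, so a query only touches the postings of its own terms. The following is a minimal sketch of that idea in plain Java, not the system described in this paper and not the Lucene API. The class name `InvertedIndexSketch` and its methods are hypothetical, and the tokenizer simply splits on non-letter, non-digit characters, which is an assumption that stands in for the Chinese word segmentation step and would not segment Chinese text correctly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration of an inverted index; not the paper's implementation.
public class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenize the text and record docId in the posting set of every term.
    // Assumption: splitting on non-letter/non-digit characters replaces real
    // word segmentation, which a Chinese-language index would need.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the ids of the documents that contain a single term.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySortedSet());
    }

    // Return the ids of the documents that contain every given term (AND query).
    public SortedSet<Integer> searchAll(String... terms) {
        if (terms.length == 0) {
            return new TreeSet<>();
        }
        SortedSet<Integer> result = new TreeSet<>(search(terms[0]));
        for (int i = 1; i < terms.length; i++) {
            result.retainAll(search(terms[i]));
        }
        return result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene builds an inverted index");
        index.add(2, "Heritrix crawls web pages");
        index.add(3, "An index of crawled pages");
        System.out.println(index.search("index"));
        System.out.println(index.searchAll("index", "pages"));
    }
}
```

The sketch only answers which documents match. Ordering the matches, which the abstract attributes to an improved web page ranking algorithm, is a separate step that is not shown here.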

Relevant Publications in Journal of Chemical and Pharmaceutical Research