crawler.link_extraction_regular_expression
Specifies the regular expression used to extract links from each document.
Key: crawler.link_extraction_regular_expression
Type: String
Can be set in: collection.cfg
Description
This option defines the regular expression that will be used to extract URLs from HTML links like the following:
<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>
Default Value
crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*('|")?\s*(.*?)(>|\4|(\s\w+\=))>?([^<]+)?(</a)?
If no value is defined, then the above default is used.
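To get a feel for what the default pattern captures, it can be tried outside the crawler. The following is a minimal sketch that assumes Python's `re` module behaves like the crawler's own regex engine for this pattern; the helper name `first_link` is hypothetical and not part of the product. It reads the URL from group 5 and the text following the tag from group 8.

```python
import re

# Default pattern copied from this page. Python's `re` is an assumed
# stand-in for the crawler's regex engine, not the engine itself.
DEFAULT_PATTERN = re.compile(
    r"""\s(href|src)(\s)*=(\s)*('|")?\s*(.*?)(>|\4|(\s\w+\=))>?([^<]+)?(</a)?"""
)


def first_link(html):
    """Return (url, trailing_text) for the first match, or None if no match."""
    match = DEFAULT_PATTERN.search(html)
    if match is None:
        return None
    # Group 5 is the lazily matched URL; group 8 is the text after the tag.
    return match.group(5), match.group(8)


print(first_link('<a href="http://www.abc.net.au">ABC</a>'))
# ('http://www.abc.net.au', 'ABC')
```

Because the end of the URL is matched with the backreference `\4`, a URL opened with a double quote is only closed by a double quote, so an apostrophe inside the URL does not cut it short.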
Examples
crawler.link_extraction_group=5
crawler.link_extraction_regular_expression=\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))
Extracted groups:
(href|src): handles link, a or img HTML tags
(\s): optional spaces
(\s): optional spaces
(\'|\"): quotes to begin the URL
(.*?): the URL (non-greedy pattern)
(>|\"|\'|(\s\w+\=)): end of the URL
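The interaction between the pattern and the group number can be sketched as follows. This is an illustration only: it assumes Python's `re` module as a stand-in for the crawler's regex engine, uses `\s*` before the URL group as in the default value, and the names `LINK_EXTRACTION_REGEX`, `LINK_EXTRACTION_GROUP` and `extract_links` are hypothetical. Counting opening parentheses from the left, the URL is the fifth group, which is presumably why the example sets crawler.link_extraction_group=5.

```python
import re

# Hypothetical stand-ins for the two settings in the example above.
LINK_EXTRACTION_REGEX = r"\s(href|src)(\s)*=(\s)*(\'|\")?\s*(.*?)(>|\"|\'|(\s\w+\=))"
LINK_EXTRACTION_GROUP = 5


def extract_links(html, pattern=LINK_EXTRACTION_REGEX, group=LINK_EXTRACTION_GROUP):
    """Return the chosen capture group for every match of the pattern in html."""
    return [m.group(group) for m in re.finditer(pattern, html)]


sample = """<link rel="alternate" href="http://www.abc.net.au/mobile"/>
<a href="http://www.abc.net.au">ABC</a>
<img src="http://www.abc.net.au/logo.png" alt="ABC Logo"/>"""

for url in extract_links(sample):
    print(url)
# http://www.abc.net.au/mobile
# http://www.abc.net.au
# http://www.abc.net.au/logo.png
```

The final group also lets an unquoted attribute value end at the next attribute: for `<a href=page.html class=x>` the sketch returns `page.html`, because `(\s\w+\=)` matches ` class=`.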