Metadata scraper filter
Introduction
The metadata scraper filter is used to extract content out of HTML documents and inject it as metadata for the document.
Enabling
To enable the metadata scraper filter, add MetadataScraper to the filter.jsoup.classes list. In the line below, <default_jsoup_filters> stands for the default value of that setting.
filter.jsoup.classes=<default_jsoup_filters>,MetadataScraper
Configuration
The filter is configured via a separate file metadata_scraper.json which must reside in the collection configuration folder ($SEARCH_HOME/conf/$COLLECTION/metadata_scraper.json).
This file is in JSON format and contains a list of rules. Each rule is applied to a document only if the document's URL matches the rule's regular expression:
[{
"urlRegex": "http://example\\.org/",
"metadataName": "author",
"elementSelector": "div.author-name",
"applyIfNoMatch": false,
"extractionType": "text",
"description": "Get author from DIV"
}, {
"urlRegex": "http://example\\.org/products/",
"metadataName": "productSku",
"elementSelector": "div.product p.sku",
"applyIfNoMatch": false,
"extractionType": "attr",
"attributeName": "data-sku"
}]
Each rule is defined with the following attributes:
urlRegex
Regular expression to specify which documents the rule applies to. The URL of the document will be matched against this regular expression and the rule will be applied only if there's a match.
Note: Because this is a regular expression, special characters such as . must be escaped with a backslash (\.). In addition, backslashes must themselves be escaped with a backslash in JSON, resulting in a double backslash (\\.). Without this escaping, . would mean "any character" in regular expression syntax.
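The double escaping described above can be checked with a short Python sketch. It is only an illustration: the filter itself is not written in Python, and the helper url_matches, along with the assumption that a rule applies when the pattern is found anywhere in the URL, is hypothetical.

```python
import json
import re

# Rule file content as it would appear on disk (note the double backslash).
raw_config = r'[{"urlRegex": "http://example\\.org/"}]'

rules = json.loads(raw_config)
# JSON decoding leaves a single backslash: http://example\.org/
url_regex = rules[0]["urlRegex"]


def url_matches(pattern: str, url: str) -> bool:
    # Assumption: a rule applies when the pattern is found anywhere in the URL.
    return re.search(pattern, url) is not None


print(url_regex)
# The escaped dot only matches a literal dot.
print(url_matches(url_regex, "http://example.org/products/1"))
print(url_matches(url_regex, "http://exampleXorg/products/1"))
# Without escaping, the dot matches any character, so this URL matches too.
print(url_matches("http://example.org/", "http://exampleXorg/products/1"))
```

The second call returns False and the last one returns True, which is the difference the escaping is meant to prevent.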
metadataName
This is the name of the metadata that will be injected into the document. For example, if this is set to author, the following will be injected into the document:
<meta name="author" content="...">
If the rule yields multiple values, they will be injected separately:
<meta name="author" content="Shakespeare">
<meta name="author" content="Yeats">
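As a rough illustration of the "one tag per value" behaviour, the following Python sketch builds the tags shown above. The function render_meta_tags is hypothetical, and escaping the attribute values is an assumption rather than documented behaviour of the filter.

```python
from html import escape
from typing import List


def render_meta_tags(metadata_name: str, values: List[str]) -> List[str]:
    """Build one <meta> tag per extracted value (hypothetical helper)."""
    name = escape(metadata_name, quote=True)
    # Assumption: values are HTML-escaped before being written as attributes.
    return [
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for value in values
    ]


for tag in render_meta_tags("author", ["Shakespeare", "Yeats"]):
    print(tag)
```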
elementSelector
This is a CSS selector to select the HTML element from which to extract the content of the metadata to inject. For example with the following HTML fragment:
<div class="info">
<div class="author-name">William Shakespeare</div>
</div>
With the selector div.author-name, the inner <div> would be selected for extraction.
applyIfNoMatch
This is a boolean which indicates if the rule should get applied when the selector matches (false, this is the default), or when it doesn't match (true).
This is useful for injecting metadata into documents that don't match a specific selector. For example:
{
"urlRegex": "http://example\\.org/products/",
"metadataName": "productCategory",
"elementSelector": "p.category",
"applyIfNoMatch": true,
"processMode": "constant",
"value": "Default category"
}
With this rule, if a document doesn't contain a <p> tag with the category class, a productCategory metadata will be injected with the content Default category.
When applyIfNoMatch is set to true, the rule runs only when the elementSelector does not match. In the example above, if the document did contain a <p> tag with the category class, this rule would not set the productCategory metadata at all. For such a use case, it is recommended to pair this rule with another one that extracts the category.
Note: Setting "processMode": "constant" is also important here. Without it, the default processMode of regex will be applied and this won't match any content.
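The pairing recommended above can be sketched in Python as follows. This models only the decision logic. The names rule_fires and category_values are hypothetical, and the extraction rule (the same selector with applyIfNoMatch set to false) is an assumed companion rule, not one taken from the configuration above.

```python
from typing import List, Optional


def rule_fires(selector_matched: bool, apply_if_no_match: bool = False) -> bool:
    # A normal rule fires on a match; an applyIfNoMatch rule fires on no match.
    return selector_matched != apply_if_no_match


def category_values(category_text: Optional[str]) -> List[str]:
    """category_text is the text of p.category, or None if nothing matched."""
    matched = category_text is not None
    values = []
    # Assumed companion rule: extract the category when the selector matches.
    if rule_fires(matched, apply_if_no_match=False):
        values.append(category_text)
    # Fallback rule from the example: constant value when nothing matches.
    if rule_fires(matched, apply_if_no_match=True):
        values.append("Default category")
    return values


print(category_values("Shoes"))
print(category_values(None))
```

With both rules in place, exactly one of them fires for any given document, so productCategory always receives a value.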
extractionType
This indicates how the content should be extracted from the selected element. Possible values are:
text: The textual content of the matching element will be extracted. If the element contains HTML, all the tags are stripped.
html: The raw HTML content of the matching element will be extracted.
attr: The value of an attribute of the matching element will be extracted. In this mode, attributeName must be provided.
For example, with the following HTML fragment:
<div class="product" data-sku="1234">
<h1>Product title</h1>
<p>Product description</p>
</div>
And the selector div.product:
text would result in the content Product title Product description being extracted.
html would result in the content <h1>Product title</h1> <p>Product description</p> being extracted.
attr, with attributeName: "data-sku", would result in 1234 being extracted.
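The three extraction types can be approximated in Python with the standard library. This is a sketch under assumptions, not the filter's implementation: xml.etree.ElementTree stands in for the HTML parser and only handles well-formed fragments, the extract function is hypothetical, and whitespace in the output may differ from what the filter produces.

```python
import xml.etree.ElementTree as ET

# The product fragment from the example, written without whitespace between tags.
FRAGMENT = (
    '<div class="product" data-sku="1234">'
    "<h1>Product title</h1>"
    "<p>Product description</p>"
    "</div>"
)


def extract(element, extraction_type, attribute_name=None):
    """Hypothetical stand-in for the three extractionType values."""
    if extraction_type == "text":
        # Keep only the text nodes, joined by single spaces.
        return " ".join(t.strip() for t in element.itertext() if t.strip())
    if extraction_type == "html":
        # Serialise the children so their markup is kept.
        inner = (element.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in element
        )
        return inner.strip()
    if extraction_type == "attr":
        if attribute_name is None:
            raise ValueError("attributeName must be provided for attr")
        return element.get(attribute_name)
    raise ValueError("unknown extractionType: {}".format(extraction_type))


product = ET.fromstring(FRAGMENT)  # plays the role of the div.product match
print(extract(product, "text"))
print(extract(product, "html"))
print(extract(product, "attr", "data-sku"))
```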
attributeName
This specifies the name of the attribute to extract the value from when extractionType is set to attr. See the example in the extractionType section above for details.
processMode and value
This indicates how the extracted content is processed. Possible values are:
regex: Apply a regular expression over the extracted content. The regular expression must contain a capture group (using ()) and each match will be injected as a separate metadata value.
constant: Return a hard-coded string.
This setting works in conjunction with value which indicates either the regular expression to apply, or the hard coded value to use.
This setting is optional. If it's not set, the complete extracted content is retained as-is.
For example with the HTML fragment:
<div class="info">
<div class="author-name">William Shakespeare</div>
</div>
And the rule:
{
"urlRegex": "http://example\\.org/",
"metadataName": "author",
"elementSelector": "div.author-name",
"applyIfNoMatch": false,
"extractionType": "text",
"processMode": "regex",
"value": "(\\S+)"
}
This would result in two metadata values being injected, author=William and author=Shakespeare, because the regular expression (\\S+) yields two matches.
If processMode were set to constant and value to Yeats, a single metadata author=Yeats would have been injected.
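The two outcomes described above can be reproduced with a small Python sketch. The process function is hypothetical, and Python's re module is used as an approximation. The filter's own regular expression engine may differ in details.

```python
import re
from typing import List


def process(extracted: str, process_mode: str, value: str) -> List[str]:
    """Hypothetical model of processMode applied to already-extracted content."""
    if process_mode == "regex":
        # Each match contributes the content of its first capture group.
        return [m.group(1) for m in re.finditer(value, extracted)]
    if process_mode == "constant":
        # The extracted content is ignored and the fixed value is returned.
        return [value]
    raise ValueError("unknown processMode: {}".format(process_mode))


# "(\\S+)" in the JSON file decodes to the pattern (\S+).
print(process("William Shakespeare", "regex", r"(\S+)"))
print(process("William Shakespeare", "constant", "Yeats"))
```

The first call returns two values, one for each run of non-whitespace characters, and the second returns the single constant.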
description
This attribute is used to add a comment to the rule. It is optional and is not used when applying the rule.