filter.jsoup.undesirable_text-source.[key_name]
Specifies the sources of undesirable text strings to detect and report in the content auditor.
Key: filter.jsoup.undesirable_text-source.[key_name]
Type: String
Can be set in: collection.cfg
Description
This setting controls where the lists of 'undesirable text' detected by the content auditor are read from.
Several sources can be defined, each with its own key name (which allows collections to override the defaults), using the format:
filter.jsoup.undesirable_text-source.(key_name)=(file_path)
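As an illustration of how key names let a collection override a default source, the following Python sketch collects the sources from configuration text and merges them by key name. It is not the product's own implementation: the function names parse_sources and effective_sources are hypothetical, and the handling of blank lines and '#' comment lines is an assumption about the file syntax.

```python
# Hypothetical helpers; not part of the product.
PREFIX = "filter.jsoup.undesirable_text-source."


def parse_sources(cfg_text):
    """Return {key_name: file_path} for each undesirable-text source line."""
    sources = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(PREFIX):
            sources[key[len(PREFIX):]] = value.strip()
    return sources


def effective_sources(defaults_text, collection_text):
    """Merge by key name: a collection entry replaces a default with the same key."""
    merged = parse_sources(defaults_text)
    merged.update(parse_sources(collection_text))
    return merged
```

Under these assumptions, a collection that reuses the key name default-misspellings replaces the default path for that source, while any other key name adds a further source alongside it.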
The format of the file at the given path is expected to be a list of undesirable word sequences, with
newlines separating each sequence. Where multi-word sequences are used, each word should be separated
by a single space character. Text versions of HTML entities (e.g. \u2014 instead of —) should
be used where applicable.
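The file format described above can be produced from a list of phrases with a short script. The Python sketch below is one possible way to do it: the function name format_undesirable_text is hypothetical, dropping empty and duplicate entries is a convenience rather than a documented requirement, and no conversion of HTML entities is attempted.

```python
def format_undesirable_text(phrases):
    """Return file content with one word sequence per line.

    Runs of whitespace inside a phrase are collapsed to a single space.
    Empty entries and exact duplicates are dropped (a convenience, not a
    documented requirement).
    """
    lines = []
    for phrase in phrases:
        normalized = " ".join(phrase.split())
        if normalized and normalized not in lines:
            lines.append(normalized)
    # Newlines separate the sequences; each line ends with one newline.
    return "".join(line + "\n" for line in lines)
```

The returned string can then be saved as the file referenced by file_path.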
Undesirable text files can be created from the administration interface file manager by selecting undesirable-text.*.cfg
from the create menu. To make use of this file, the file_path must be set to $SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.<name>.cfg.
The key_name can be any string as long as it is unique per collection.
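For a file created from the file manager, the matching configuration line can be assembled from a single name. The Python sketch below assumes the same name is used for both the key_name and the <name> part of the file name, which is a convention borrowed from the examples on this page rather than a requirement. The function name source_setting and the character checks are hypothetical and deliberately conservative.

```python
PREFIX = "filter.jsoup.undesirable_text-source."


def source_setting(name):
    """Build the collection.cfg line for undesirable-text.<name>.cfg."""
    # Assumption: reject names that would break the key=value line or the path.
    if not name or any(ch.isspace() or ch in "/=" for ch in name):
        raise ValueError("unsuitable name: %r" % (name,))
    path = "$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text." + name + ".cfg"
    return PREFIX + name + "=" + path
```

The $SEARCH_HOME and $COLLECTION_NAME placeholders are left as written in the path; the sketch does not expand them.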
Default values
filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/common-misspellings.txt.default
This default setting provides a list of commonly misspelled words in English based on Wikipedia's list of common misspellings for machines.
Examples
The following overrides the misspellings with a custom file, and also includes an additional set from 'undesirable-text.additional.cfg'.
filter.jsoup.undesirable_text-source.default-misspellings=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.default-misspellings.cfg
filter.jsoup.undesirable_text-source.additional=$SEARCH_HOME/conf/$COLLECTION_NAME/undesirable-text.additional.cfg
undesirable-text.additional.cfg contains:
\u2014
etc.
e.g.
aluminum
purple monkey