java.lang.Object TextCleaner
public class TextCleaner
Cleans text into a "canonical" form: a simple form that makes it easy to compare text with very different formatting. The key modifications are to keep only certain "allowed" characters, convert to lower case, remove HTML tags, and collapse successive whitespace into a single space.
Field Summary | |
---|---|
static java.lang.String |
DEFAULT_RETAIN_LIST
The default list of letter, number, and whitespace characters to allow. |
private java.lang.String |
myRetainList
The list of characters to retain on calls to eliminateDisallowedCharacters. |
Constructor Summary | |
---|---|
TextCleaner()
Create the default text cleaner, which retains letter, number, and whitespace characters. |
|
TextCleaner(java.lang.String retainList)
Create a text cleaner with the given retention list (of characters to retain on calls to eliminateDisallowedCharacters). |
Method Summary | |
---|---|
java.lang.String |
clean(java.lang.String text)
Clean the given text by performing all TextCleaner transformations on it (with whitespace collapse happening last and character elimination happening second-to-last). |
static java.lang.String |
collapseWhitespace(java.lang.String text)
Collapse consecutive whitespace characters in the given text to a single space. |
static java.lang.String |
convertToLowercase(java.lang.String text)
Convert the given text to lower case. |
java.lang.String |
eliminateDisallowedCharacters(java.lang.String text)
Eliminate from the string all characters except those in the list returned by getRetainList. |
static java.lang.String |
eliminateDisallowedCharacters(java.lang.String text,
java.lang.String retainList)
Eliminate all the characters except those in the supplied retention list from the string. |
static java.lang.String |
eliminateHTMLTags(java.lang.String text)
Eliminate HTML tags from the given text. |
java.lang.String |
getRetainList()
Get the list of characters to retain on calls to eliminateDisallowedCharacters. |
void |
setRetainList(java.lang.String list)
Set the list of characters to retain on calls to eliminateDisallowedCharacters. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String DEFAULT_RETAIN_LIST
private java.lang.String myRetainList
Constructor Detail |
---|
public TextCleaner()
public TextCleaner(java.lang.String retainList)
retainList
- a list of characters to retain
Method Detail |
---|
public java.lang.String clean(java.lang.String text)
text
- the text to clean (non-null)
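The ordering described for clean can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name CleanPipelineSketch and the constant ASSUMED_RETAIN_LIST are hypothetical, the actual value of DEFAULT_RETAIN_LIST is not shown here, and the relative order of the first two steps (tag removal, then lower-casing) is a guess, since only the last two steps are specified. The regular expressions are compact stand-ins for the character-by-character methods described elsewhere on this page.

```java
import java.util.Locale;

// Hypothetical sketch of the clean() ordering; not the actual TextCleaner source.
final class CleanPipelineSketch {
    // Assumed retain list: lower-case letters, digits, and common whitespace.
    // The real DEFAULT_RETAIN_LIST value is not given in this documentation.
    static final String ASSUMED_RETAIN_LIST =
            "abcdefghijklmnopqrstuvwxyz0123456789 \t\n\r";

    static String clean(String text) {
        // Step 1 (assumed position): drop anything between '<' and '>'.
        String noTags = text.replaceAll("<[^>]*>", "");
        // Step 2 (assumed position): lower-case, locale-independent.
        String lower = noTags.toLowerCase(Locale.ROOT);
        // Step 3 (second-to-last, as documented): keep only retained characters.
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (ASSUMED_RETAIN_LIST.indexOf(c) >= 0) {
                kept.append(c);
            }
        }
        // Step 4 (last, as documented): collapse whitespace runs to one space.
        return kept.toString().replaceAll("\\s+", " ");
    }
}
```

Collapsing whitespace last makes sense because character elimination can leave two whitespace characters side by side (for example, "a - b" becomes "a  b"). Likewise, if the retain list excludes angle brackets, tags have to be removed before character elimination, or the brackets that mark them would already be gone.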
public static java.lang.String collapseWhitespace(java.lang.String text)
Hint: build a new string up character-by-character from the old text. Just add each non-whitespace character of text (see the isWhitespace method of Character). Also, keep track of whether the last character was whitespace or not. Skip all the whitespace, BUT when you see a non-whitespace character immediately after a whitespace character, add in an extra blank as the new word separator.
text
- the text to convert (non-null)
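One way to follow the hint literally is sketched below. The class name CollapseWhitespaceSketch is hypothetical, and this is not necessarily how the actual method is written.

```java
// Hypothetical sketch that follows the hint above; not the actual TextCleaner source.
final class CollapseWhitespaceSketch {
    static String collapseWhitespace(String text) {
        StringBuilder result = new StringBuilder();
        boolean lastWasWhitespace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Skip the whitespace itself, but remember that we saw it.
                lastWasWhitespace = true;
            } else {
                // First non-whitespace character after a whitespace run:
                // emit a single blank as the word separator.
                if (lastWasWhitespace) {
                    result.append(' ');
                }
                result.append(c);
                lastWasWhitespace = false;
            }
        }
        return result.toString();
    }
}
```

Taken literally, this approach drops trailing whitespace entirely but turns leading whitespace into a single leading blank. Whether the real method behaves the same way at the edges is not stated in this documentation.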
public static java.lang.String convertToLowercase(java.lang.String text)
text
- the text to convert (non-null)
public java.lang.String eliminateDisallowedCharacters(java.lang.String text)
text
- the text to convert (non-null)
public static java.lang.String eliminateDisallowedCharacters(java.lang.String text, java.lang.String retainList)
Hint: try building up a new string starting with the empty string, adding to it each character of text that is in the retain list (and NOT adding the characters outside the retain list).
text
- the text to convert (non-null)
retainList
- the list of characters to retain (non-null)
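The hint above might translate into something like the following sketch. The class name RetainListSketch is hypothetical, and a StringBuilder is used in place of repeated string concatenation; the filtering logic is otherwise as the hint describes.

```java
// Hypothetical sketch that follows the hint above; not the actual TextCleaner source.
final class RetainListSketch {
    static String eliminateDisallowedCharacters(String text, String retainList) {
        StringBuilder kept = new StringBuilder(); // starts out empty
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // indexOf returns -1 when c does not occur in the retain list.
            if (retainList.indexOf(c) >= 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}
```

For example, with a retain list of lower-case letters plus a space, the input "Hi, there 42!" comes out as "i there " (the capital H, the punctuation, and the digits are all dropped, while both spaces survive).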
public static java.lang.String eliminateHTMLTags(java.lang.String text)
Hint: build up a new string character-by-character from the old text. Keep track of whether you're currently between a left and right angle bracket. If you are, just don't add any characters at all.
text
- the text to convert (non-null)
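A minimal sketch of the bracket-tracking idea in the hint is shown below. The class name HtmlTagSketch is hypothetical. Keeping a stray '>' that appears outside any tag is an assumption made here, since the hint does not say what should happen in that case.

```java
// Hypothetical sketch that follows the hint above; not the actual TextCleaner source.
final class HtmlTagSketch {
    static String eliminateHTMLTags(String text) {
        StringBuilder result = new StringBuilder();
        boolean insideTag = false; // true while between '<' and the next '>'
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (insideTag) {
                // Inside a tag nothing is copied; '>' ends the tag.
                if (c == '>') {
                    insideTag = false;
                }
            } else if (c == '<') {
                insideTag = true;
            } else {
                // Outside a tag every character is copied, including a
                // stray '>' (an assumption; the hint does not cover it).
                result.append(c);
            }
        }
        return result.toString();
    }
}
```

With this sketch, "<p>Hello <b>world</b></p>" becomes "Hello world", and an unclosed "<" swallows the rest of the input.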
public java.lang.String getRetainList()
public void setRetainList(java.lang.String list)
list
- the list of characters to retain (non-null)