Class TextCleaner

java.lang.Object
  extended by TextCleaner

public class TextCleaner
extends java.lang.Object

Cleans text into a "canonical" form: a simple form in which differently formatted texts are easy to compare. The key modifications are to keep only certain "allowed" characters, convert to lower case, remove HTML tags, and collapse successive whitespace into a single space.


Field Summary
static java.lang.String DEFAULT_RETAIN_LIST
          The default list of letter, number, and whitespace characters to allow.
private  java.lang.String myRetainList
          The list of characters to retain on calls to eliminateDisallowedCharacters.
 
Constructor Summary
TextCleaner()
          Create the default text cleaner, which retains letter, number, and whitespace characters.
TextCleaner(java.lang.String retainList)
          Create a text cleaner with the given retention list (of characters to retain on calls to eliminateDisallowedCharacters).
 
Method Summary
 java.lang.String clean(java.lang.String text)
          Clean the given text by performing all TextCleaner transformations on it (with whitespace collapse happening last and character elimination happening second-to-last).
static java.lang.String collapseWhitespace(java.lang.String text)
          Collapse consecutive whitespace characters in the given text to a single space.
static java.lang.String convertToLowercase(java.lang.String text)
          Convert the given text to lower case.
 java.lang.String eliminateDisallowedCharacters(java.lang.String text)
          Eliminate from the string every character that is not in the list returned by getRetainList.
static java.lang.String eliminateDisallowedCharacters(java.lang.String text, java.lang.String retainList)
          Eliminate from the string every character that is not in the supplied retention list.
static java.lang.String eliminateHTMLTags(java.lang.String text)
          Eliminate HTML tags from the given text.
 java.lang.String getRetainList()
          Get the list of characters to retain on calls to eliminateDisallowedCharacters.
 void setRetainList(java.lang.String list)
          Set the list of characters to retain on calls to eliminateDisallowedCharacters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_RETAIN_LIST

public static final java.lang.String DEFAULT_RETAIN_LIST
The default list of letter, number, and whitespace characters to allow. Used to initialize the cleaner in the default constructor.

See Also:
Constant Field Values

myRetainList

private java.lang.String myRetainList
The list of characters to retain on calls to eliminateDisallowedCharacters.

Constructor Detail

TextCleaner

public TextCleaner()
Create the default text cleaner, which retains letter, number, and whitespace characters.


TextCleaner

public TextCleaner(java.lang.String retainList)
Create a text cleaner with the given retention list (of characters to retain on calls to eliminateDisallowedCharacters).

Parameters:
retainList - a list of characters to retain

Method Detail

clean

public java.lang.String clean(java.lang.String text)
Clean the given text by performing all TextCleaner transformations on it (with whitespace collapse happening last and character elimination happening second-to-last).

Parameters:
text - the text to clean (non-null)
Returns:
the text, cleaned
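
The ordering described above can be sketched as follows. This is a minimal illustration, not the actual TextCleaner source: the class name CleanPipelineSketch is hypothetical, the relative order of tag removal and lower-casing is an assumption (the documentation only fixes the last two steps), and the tag and whitespace steps are compressed into regular-expression calls to keep the sketch short.

```java
import java.util.Locale;

// Hypothetical stand-in for TextCleaner; only the step ordering is taken from the documentation.
final class CleanPipelineSketch {
    private final String retainList;

    CleanPipelineSketch(String retainList) {
        this.retainList = retainList;
    }

    String clean(String text) {
        // Assumed order for the first two steps: strip tags, then lower-case.
        String noTags = text.replaceAll("<[^>]*>", "");
        String lower = noTags.toLowerCase(Locale.ROOT);

        // Second-to-last step: keep only the characters in the retain list.
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (retainList.indexOf(c) >= 0) {
                kept.append(c);
            }
        }

        // Last step: trim the ends and collapse whitespace runs to one space.
        return kept.toString().trim().replaceAll("\\s+", " ");
    }
}
```

One reason for running character elimination late: if the angle brackets are not in the retain list, eliminating characters before removing tags would strip the brackets and leave the tag names behind as ordinary text.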

collapseWhitespace

public static java.lang.String collapseWhitespace(java.lang.String text)
Collapse consecutive whitespace characters in the given text to a single space. Also removes "dangling" whitespace on either end of the text. (See the String method named trim.)

Hint: build up a new string character-by-character from the old text, keeping track of whether the previous character was whitespace (see the isWhitespace method of Character). Skip every whitespace character. Add every non-whitespace character, BUT when one comes immediately after a whitespace character, first add a single blank as the new word separator. Leading whitespace then produces a blank at the start of the result, which trim can remove.

Parameters:
text - the text to convert (non-null)
Returns:
the text, with whitespace collapsed
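
The hint above might be realized along these lines. This is a sketch under stated assumptions rather than the reference solution: the class name CollapseWhitespaceSketch is hypothetical, and instead of calling trim afterwards it avoids the leading blank by only adding a separator once the result is non-empty.

```java
// Hypothetical sketch class; not the actual TextCleaner source.
final class CollapseWhitespaceSketch {
    private CollapseWhitespaceSketch() {}

    static String collapseWhitespace(String text) {
        StringBuilder result = new StringBuilder();
        boolean lastWasWhitespace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Skip the whitespace itself, but remember that we saw it.
                lastWasWhitespace = true;
            } else {
                // Add a separator only between words, never before the first one.
                if (lastWasWhitespace && result.length() > 0) {
                    result.append(' ');
                }
                result.append(c);
                lastWasWhitespace = false;
            }
        }
        // Trailing whitespace never reaches the result, so no trim is needed here.
        return result.toString();
    }
}
```

For example, CollapseWhitespaceSketch.collapseWhitespace("  hello \t\n world  ") returns "hello world".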

convertToLowercase

public static java.lang.String convertToLowercase(java.lang.String text)
Convert the given text to lower case.

Parameters:
text - the text to convert (non-null)
Returns:
the text, in lower case

eliminateDisallowedCharacters

public java.lang.String eliminateDisallowedCharacters(java.lang.String text)
Eliminate from the string every character that is not in the list returned by getRetainList.

Parameters:
text - the text to convert (non-null)
Returns:
the text, with disallowed characters eliminated
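
One plausible way for this instance method to relate to the two-argument static overload is simple delegation, passing along the current retain list. The documentation does not confirm that this is how it is implemented, so the sketch below is an assumption, and the class name RetainListHolderSketch is hypothetical.

```java
// Hypothetical stand-in for TextCleaner showing an assumed delegation between overloads.
final class RetainListHolderSketch {
    private String myRetainList;

    RetainListHolderSketch(String retainList) {
        myRetainList = retainList;
    }

    String getRetainList() {
        return myRetainList;
    }

    void setRetainList(String list) {
        myRetainList = list;
    }

    // Instance overload: assumed to delegate using the stored retain list.
    String eliminateDisallowedCharacters(String text) {
        return eliminateDisallowedCharacters(text, getRetainList());
    }

    // Static overload: keeps only the characters found in retainList.
    static String eliminateDisallowedCharacters(String text, String retainList) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (retainList.indexOf(c) >= 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}
```

With this arrangement, a call to setRetainList changes the result of every later one-argument call, while the static overload stays independent of any instance state.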

eliminateDisallowedCharacters

public static java.lang.String eliminateDisallowedCharacters(java.lang.String text,
                                                             java.lang.String retainList)
Eliminate from the string every character that is not in the supplied retention list.

Hint: try building up a new string starting with the empty string, adding to it each character of text that is in the retain list (and NOT adding the characters outside the retain list).

Parameters:
text - the text to convert (non-null)
retainList - the list of characters to retain (non-null)
Returns:
the text, with disallowed characters eliminated
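
A minimal sketch of the approach in the hint, assuming a case-sensitive membership test via String.indexOf. The class name RetainFilterSketch is hypothetical, and a StringBuilder stands in for repeated string concatenation.

```java
// Hypothetical sketch class; not the actual TextCleaner source.
final class RetainFilterSketch {
    private RetainFilterSketch() {}

    static String eliminateDisallowedCharacters(String text, String retainList) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // indexOf returns -1 when c does not occur in retainList.
            if (retainList.indexOf(c) >= 0) {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}
```

Because the test is case-sensitive in this sketch, a retain list of lower-case letters and a space turns "Hello, World! 123" into "ello orld " (the capitals, punctuation, and digits are dropped; both spaces survive).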

eliminateHTMLTags

public static java.lang.String eliminateHTMLTags(java.lang.String text)
Eliminate HTML tags from the given text. That is, remove everything between each pair of left and right angle brackets, including the brackets.

Hint: build up a new string character-by-character from the old text. Keep track of whether you're currently between a left and right angle bracket. If you are, just don't add any characters at all.

Parameters:
text - the text to convert (non-null)
Returns:
the text, without HTML tags
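
The hint above could be sketched as a small two-state scan. The class name HtmlTagSketch is hypothetical, and two behaviors are assumptions the documentation does not settle: a '>' that appears outside any tag is kept, and an unclosed '<' causes the rest of the text to be dropped.

```java
// Hypothetical sketch class; not the actual TextCleaner source.
final class HtmlTagSketch {
    private HtmlTagSketch() {}

    static String eliminateHTMLTags(String text) {
        StringBuilder result = new StringBuilder();
        boolean insideTag = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (insideTag) {
                // Drop everything up to and including the closing bracket.
                if (c == '>') {
                    insideTag = false;
                }
            } else if (c == '<') {
                // Opening bracket: start skipping.
                insideTag = true;
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }
}
```

For example, HtmlTagSketch.eliminateHTMLTags("<p>Hello <b>World</b></p>") returns "Hello World".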

getRetainList

public java.lang.String getRetainList()
Get the list of characters to retain on calls to eliminateDisallowedCharacters.

Returns:
the list of characters to retain

setRetainList

public void setRetainList(java.lang.String list)
Set the list of characters to retain on calls to eliminateDisallowedCharacters.

Parameters:
list - the list of characters to retain (non-null)