Values for the 'pattern' arg of the wordshape op.
WordShape.BEGINS_WITH_OPEN_QUOTE
:
The input begins with an open quote.
The following strings are considered open quotes:
" QUOTATION MARK
' APOSTROPHE
` GRAVE ACCENT
`` Pair of GRAVE ACCENTs
\uFF02 FULLWIDTH QUOTATION MARK
\uFF07 FULLWIDTH APOSTROPHE
\u00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
\u2018 LEFT SINGLE QUOTATION MARK
\u201A SINGLE LOW-9 QUOTATION MARK
\u201B SINGLE HIGH-REVERSED-9 QUOTATION MARK
\u201C LEFT DOUBLE QUOTATION MARK
\u201E DOUBLE LOW-9 QUOTATION MARK
\u201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK
\u2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
\u300C LEFT CORNER BRACKET
\u300E LEFT WHITE CORNER BRACKET
\u301D REVERSED DOUBLE PRIME QUOTATION MARK
\u2E42 DOUBLE LOW-REVERSED-9 QUOTATION MARK
\uFF62 HALFWIDTH LEFT CORNER BRACKET
\uFE41 PRESENTATION FORM FOR VERTICAL LEFT CORNER BRACKET
\uFE43 PRESENTATION FORM FOR VERTICAL LEFT WHITE CORNER BRACKET
Note: U+00B4 (ACUTE ACCENT) is not included.
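A minimal sketch of the prefix test described above, using only the Python standard library. The helper name `begins_with_open_quote` is hypothetical, and the check follows the prose description only: the pattern listed under Class Variables also constrains the rest of the string, which this sketch does not reproduce.

```python
# Open-quote strings copied from the list above.
OPEN_QUOTES = (
    '"', "'", "`", "``",
    "\uFF02", "\uFF07", "\u00AB", "\u2018", "\u201A", "\u201B",
    "\u201C", "\u201E", "\u201F", "\u2039", "\u300C", "\u300E",
    "\u301D", "\u2E42", "\uFF62", "\uFE41", "\uFE43",
)


def begins_with_open_quote(token: str) -> bool:
    """Hypothetical helper: True if the token starts with a listed open quote."""
    # str.startswith accepts a tuple and tests each candidate prefix.
    return token.startswith(OPEN_QUOTES)


print(begins_with_open_quote("\u201Chello"))  # starts with LEFT DOUBLE QUOTATION MARK
print(begins_with_open_quote("\u00B4hello"))  # ACUTE ACCENT is not in the list
```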
WordShape.BEGINS_WITH_PUNCT_OR_SYMBOL
:
The input starts with a punctuation or symbol character.
WordShape.ENDS_WITH_CLOSE_QUOTE
:
The input ends with a closing quote character.
The following strings are considered close quotes:
" QUOTATION MARK
' APOSTROPHE
` GRAVE ACCENT
'' Pair of APOSTROPHEs
\uFF02 FULLWIDTH QUOTATION MARK
\uFF07 FULLWIDTH APOSTROPHE
\u00BB RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
\u2019 RIGHT SINGLE QUOTATION MARK
\u201D RIGHT DOUBLE QUOTATION MARK
\u203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
\u300D RIGHT CORNER BRACKET
\u300F RIGHT WHITE CORNER BRACKET
\u301E DOUBLE PRIME QUOTATION MARK
\u301F LOW DOUBLE PRIME QUOTATION MARK
\uFE42 PRESENTATION FORM FOR VERTICAL RIGHT CORNER BRACKET
\uFE44 PRESENTATION FORM FOR VERTICAL RIGHT WHITE CORNER BRACKET
\uFF63 HALFWIDTH RIGHT CORNER BRACKET
Note: U+00B4 (ACUTE ACCENT) is not included.
WordShape.ENDS_WITH_ELLIPSIS
:
The input ends with an ellipsis (i.e. with three or more
periods or a Unicode ellipsis character).
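As an illustration, the `ENDS_WITH_ELLIPSIS` pattern text shown under Class Variables can be tried with Python's `re` module. This is a sketch, not the op itself: the helper name is hypothetical, and it assumes the pattern is applied as a full match against the whole input.

```python
import re

# Pattern text copied from the ENDS_WITH_ELLIPSIS class variable.
ENDS_WITH_ELLIPSIS = re.compile(r".*(\.{3}|[…⋯])")


def ends_with_ellipsis(token: str) -> bool:
    """Hypothetical helper: full-match the token against the pattern (assumption)."""
    return ENDS_WITH_ELLIPSIS.fullmatch(token) is not None


for token in ["wait...", "wait....", "wait..", "wait…", "...no"]:
    print(repr(token), ends_with_ellipsis(token))
```

Four periods still match because the leading `.*` absorbs the extra one, while two periods do not.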
WordShape.ENDS_WITH_EMOTICON
:
The input ends with an emoticon.
WordShape.ENDS_WITH_PUNCT_OR_SYMBOL
:
The input ends with a punctuation or symbol character.
WordShape.HAS_CURRENCY_SYMBOL
:
The input contains a currency symbol.
WordShape.HAS_EMOJI
:
The input contains an emoji character.
See http://www.unicode.org/Public/emoji/1.0//emoji-data.txt
Emojis are in Unicode ranges 2600-26FF, 1F300-1F6FF, and 1F900-1F9FF.
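The three code point ranges named above can be checked directly in Python. This sketch covers only those ranges; the `HAS_EMOJI` pattern under Class Variables lists further individual characters that it does not reproduce. The helper name is hypothetical.

```python
# Inclusive code point ranges taken from the description above.
EMOJI_RANGES = ((0x2600, 0x26FF), (0x1F300, 0x1F6FF), (0x1F900, 0x1F9FF))


def has_emoji(token: str) -> bool:
    """Hypothetical helper: True if any character falls in one of the ranges."""
    return any(lo <= ord(ch) <= hi for ch in token for lo, hi in EMOJI_RANGES)


print(has_emoji("ok \U0001F600"))  # U+1F600 lies in 1F300-1F6FF
print(has_emoji("plain text"))
```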
WordShape.HAS_MATH_SYMBOL
:
The input contains a mathematical symbol.
WordShape.HAS_MIXED_CASE
:
The input contains both uppercase and lowercase letterforms.
WordShape.HAS_NON_LETTER
:
The input contains a non-letter character.
WordShape.HAS_NO_DIGITS
:
The input contains no digit characters.
WordShape.HAS_NO_PUNCT_OR_SYMBOL
:
The input contains no Unicode punctuation or symbol characters.
WordShape.HAS_ONLY_DIGITS
:
The input consists entirely of Unicode digit characters.
WordShape.HAS_PUNCTUATION_DASH
:
The input contains at least one Unicode dash character.
Note that this uses the Pd (Dash) Unicode property. This property does
not match soft hyphens or katakana middle dot characters.
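The Pd distinction can be seen with Python's `unicodedata` module, which reports the general category of each character. This is a standard-library sketch of the same idea, not the op's implementation; the helper name is hypothetical.

```python
import unicodedata


def has_punctuation_dash(token: str) -> bool:
    """Hypothetical helper: True if any character has general category Pd."""
    return any(unicodedata.category(ch) == "Pd" for ch in token)


print(has_punctuation_dash("well-known"))  # HYPHEN-MINUS is Pd
print(has_punctuation_dash("co\u00ADop"))  # SOFT HYPHEN is Cf, not Pd
print(has_punctuation_dash("\u30FB"))      # KATAKANA MIDDLE DOT is Po, not Pd
```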
WordShape.HAS_SOME_DIGITS
:
The input contains a mix of digit characters and non-digit
characters.
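The three digit-related shapes (`HAS_NO_DIGITS`, `HAS_ONLY_DIGITS`, `HAS_SOME_DIGITS`) partition inputs by how many characters are decimal digits. The sketch below emulates them with `unicodedata`, treating a digit as any character in category Nd, as the patterns under Class Variables do. The function names are hypothetical, and edge cases such as embedded newlines are not considered.

```python
import unicodedata


def _is_digit(ch: str) -> bool:
    # Nd is the Unicode "decimal number" category used by the patterns.
    return unicodedata.category(ch) == "Nd"


def has_no_digits(token: str) -> bool:
    return not any(_is_digit(ch) for ch in token)


def has_only_digits(token: str) -> bool:
    # The pattern requires at least one digit, so the empty string fails.
    return len(token) > 0 and all(_is_digit(ch) for ch in token)


def has_some_digits(token: str) -> bool:
    flags = [_is_digit(ch) for ch in token]
    return any(flags) and not all(flags)


for token in ["abc", "123", "abc123"]:
    print(token, has_no_digits(token), has_only_digits(token), has_some_digits(token))
```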
WordShape.HAS_SOME_PUNCT_OR_SYMBOL
:
The input contains a mix of punctuation or symbol characters,
and non-punctuation non-symbol characters.
WordShape.HAS_TITLE_CASE
:
The input has title case (i.e. the first character is uppercase or title
case, and the remaining characters are lowercase).
WordShape.IS_ACRONYM_WITH_PERIODS
:
The input is a period-separated acronym.
This matches strings of the form "I.B.M." but not "IBM".
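The `IS_ACRONYM_WITH_PERIODS` pattern under Class Variables is one or more repetitions of an uppercase letter followed by a period. Python's `re` module has no `\p{Lu}` class, so this sketch emulates it with `unicodedata`. The helper name is hypothetical.

```python
import unicodedata


def is_acronym_with_periods(token: str) -> bool:
    """Hypothetical helper: emulate one or more (uppercase letter, '.') pairs."""
    if not token or len(token) % 2 != 0:
        return False
    letters, dots = token[0::2], token[1::2]
    return all(unicodedata.category(ch) == "Lu" for ch in letters) and all(
        ch == "." for ch in dots
    )


for token in ["I.B.M.", "IBM", "I.B.M", "i.b.m."]:
    print(token, is_acronym_with_periods(token))
```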
WordShape.IS_EMOTICON
:
The input is a single emoticon.
WordShape.IS_LOWERCASE
:
The input contains only lowercase letterforms.
WordShape.IS_MIXED_CASE_LETTERS
:
The input contains only uppercase and lowercase letterforms.
WordShape.IS_NUMERIC_VALUE
:
The input is parseable as a numeric value. This will match a
fairly broad set of floating point and integer representations (but
not NaN or Inf).
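To see which representations qualify, the `IS_NUMERIC_VALUE` pattern under Class Variables can be approximated in Python. This sketch substitutes `\d` for `\p{Nd}` (for `str` patterns, Python's `\d` matches Unicode decimal digits) and assumes a full match against the whole input. The helper name is hypothetical.

```python
import re

# IS_NUMERIC_VALUE pattern with \p{Nd} replaced by \d (an approximation).
NUMERIC_VALUE = re.compile(r"([+-]?((\d+\.?\d*)|(\.\d+)))([eE]-?\d+)?")


def is_numeric_value(token: str) -> bool:
    """Hypothetical helper: full-match the token against the pattern (assumption)."""
    return NUMERIC_VALUE.fullmatch(token) is not None


for token in ["42", "-3.14", ".5", "1e-9", "1e+9", "NaN", "Inf", "1,000"]:
    print(token, is_numeric_value(token))
```

As written, the exponent part accepts an optional minus sign but no plus sign, so "1e+9" does not match.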
WordShape.IS_PUNCT_OR_SYMBOL
:
The input contains only punctuation and symbol characters.
WordShape.IS_UPPERCASE
:
The input contains only uppercase letterforms.
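The case-related shapes `IS_UPPERCASE`, `IS_LOWERCASE`, and `IS_MIXED_CASE_LETTERS` all require every character to be a letter. The sketch below emulates the corresponding patterns from Class Variables with `unicodedata` categories (Lu, Ll, and any category starting with L). The function names are hypothetical.

```python
import unicodedata


def _cats(token: str) -> list:
    return [unicodedata.category(ch) for ch in token]


def is_uppercase(token: str) -> bool:
    cats = _cats(token)
    return len(cats) > 0 and all(c == "Lu" for c in cats)


def is_lowercase(token: str) -> bool:
    cats = _cats(token)
    return len(cats) > 0 and all(c == "Ll" for c in cats)


def is_mixed_case_letters(token: str) -> bool:
    cats = _cats(token)
    # Letters only, with at least one uppercase and one lowercase letter.
    return all(c.startswith("L") for c in cats) and "Lu" in cats and "Ll" in cats


for token in ["ABC", "abc", "aBc", "aB1"]:
    print(token, is_uppercase(token), is_lowercase(token), is_mixed_case_letters(token))
```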
Class Variables
BEGINS_WITH_OPEN_QUOTE
<WordShape.BEGINS_WITH_OPEN_QUOTE: '\`\`.*|["\'\`'"‘‚‛“«„‟‹「『〝⹂「﹁﹃][^"\'\`'"‘‚‛“«„‟‹「『〝⹂「﹁﹃]*'>
BEGINS_WITH_PUNCT_OR_SYMBOL
<WordShape.BEGINS_WITH_PUNCT_OR_SYMBOL: '[\\p{P}\\p{S}].*'>
ENDS_WITH_CLOSE_QUOTE
<WordShape.ENDS_WITH_CLOSE_QUOTE: '.*\'\'|[^"\'\`'"»’”›」』〞〟﹂﹄」]*["\'\`'"»’”›」』〞〟﹂﹄」]'>
ENDS_WITH_ELLIPSIS
<WordShape.ENDS_WITH_ELLIPSIS: '.*(\\.{3}|[…⋯])'>
ENDS_WITH_EMOTICON
<WordShape.ENDS_WITH_EMOTICON: ".*(:\\-\\)|:\\)|:o\\)|:\\]|:3|:>|=\\]|=\\)|:\\}|:\\^\\)|:\\-D|:\\-\\)\\)|:\\-\\)\\)\\)|:\\-\\)\\)\\)\\)|:\\-\\)\\)\\)\\)\\)|>:\\[|:\\-\\(|:\\(|:\\-c|:c|:\\-<|:<|:\\-\\[|:\\[|:\\{|;\\(|:\\-\\|\\||:@|>:\\(|:'\\-\\(|:'\\(|:'\\-\\)|:'\\)|D:<|>:O|:\\-O|:\\-o|:\\*|:\\-\\*|:\\^\\*|;\\-\\)|;\\)|\\*\\-\\)|\\*\\)|;\\-\\]|;\\]|;\\^\\)|:\\-,|>:P|:\\-P|:p|=p|:\\-p|=p|:P|=P|;p|;\\-p|;P|;\\-P|>:\\\\|>:/|:\\-/|:\\-\\.|:/|:\\\\|=/|=\\\\|:\\||:\\-\\||:\\$|:\\-\\#|:\\#|O:\\-\\)|0:\\-\\)|0:\\)|0;\\^\\)|>:\\)|>;\\)|>:\\-\\)|\\}:\\-\\)|\\}:\\)|3:\\-\\)|>_>\\^|\\^<_<|\\|;\\-\\)|\\|\\-O|:\\-J|:\\-\\&|:\\&|\\#\\-\\)|%\\-\\)|%\\)|<:\\-\\||\\~:\\-\\\\|\\*<\\|:\\-\\)|=:o\\]|,:\\-\\)|7:\\^\\]|</3|<3|8\\-\\)|\\^_\\^|:D|:\\-D|=D|\\^_\\^;;|O=\\)|\\}=\\)|B\\)|B\\-\\)|=\\||\\-_\\-|o_o;|u_u|:\\-\\\\|:s|:S|:\\-s|:\\-S|;\\*|;\\-\\*|:\\(|=\\(|>\\.<|>:\\-\\(|>:\\(|>=\\(|;_;|T_T|='\\(|>_<|D:|:o|:\\-o|=o|o\\.o|:O|:\\-O|=O|O\\.O|x_x|X\\-\\(|X\\(|X\\-o|X\\-O|:X\\)|\\(=\\^\\.\\^=\\)|\\(=\\^\\.\\.\\^=\\)|=\\^_\\^=|\\-<@%|:\\(\\|\\)|:\\(:\\)|\\(\\]:\\{|<\\\\3|\\~@\\~|8'\\(|XD|DX\\:っ\\)|\\:っC|ಠ\\_ಠ)$">
ENDS_WITH_PUNCT_OR_SYMBOL
<WordShape.ENDS_WITH_PUNCT_OR_SYMBOL: '.*[\\p{P}\\p{S}]'>
HAS_CURRENCY_SYMBOL
<WordShape.HAS_CURRENCY_SYMBOL: '.*\\p{Sc}.*'>
HAS_EMOJI
<WordShape.HAS_EMOJI: '.*(.*[‼⁉ℹ↔-↙↩↪⌚⌛⌨⏏⏩-⏳⏸-⏺Ⓜ▪▫▶◀◻-◾☀-⛿✂✅✈-✍✏✒✔✖✝✡✨✳✴❄❇❌❎❓-❕❗❣❤➕-➗⤴⤵⬅-⬇⬛⬜⭐⭕〰〽㊗㊙🀄🃏🅰🅱🅾🅿🆎🆑-🆚🇦-🇿🈁🈂🈚🈯🈲-🈺🉐🉑🌀-\U0001f6ff🤀-🧿🩰-🩴🩸-🩺🪀-🪆🪐-🪨🪰-🪶🫀-🫂🫐-🫖].*)$'>
HAS_MATH_SYMBOL
<WordShape.HAS_MATH_SYMBOL: '.*\\p{Sm}.*'>
HAS_MIXED_CASE
<WordShape.HAS_MIXED_CASE: '.*\\p{Lu}.*\\p{Ll}.*|.*\\p{Ll}.*\\p{Lu}.*'>
HAS_NON_LETTER
<WordShape.HAS_NON_LETTER: '.*\\P{L}.*'>
HAS_NO_DIGITS
<WordShape.HAS_NO_DIGITS: '\\P{Nd}*'>
HAS_NO_PUNCT_OR_SYMBOL
<WordShape.HAS_NO_PUNCT_OR_SYMBOL: '[^\\p{P}\\p{S}]*'>
HAS_ONLY_DIGITS
<WordShape.HAS_ONLY_DIGITS: '\\p{Nd}+'>
HAS_PUNCTUATION_DASH
<WordShape.HAS_PUNCTUATION_DASH: '.*\\p{Pd}+.*'>
HAS_SOME_DIGITS
<WordShape.HAS_SOME_DIGITS: '.*\\P{Nd}\\p{Nd}.*|.*\\p{Nd}\\P{Nd}.*'>
HAS_SOME_PUNCT_OR_SYMBOL
<WordShape.HAS_SOME_PUNCT_OR_SYMBOL: '.*[^\\p{P}\\p{S}][\\p{P}\\p{S}].*|.*[\\p{P}\\p{S}][^\\p{P}\\p{S}].*'>
HAS_TITLE_CASE
<WordShape.HAS_TITLE_CASE: '\\P{L}*[\\p{Lu}\\p{Lt}]\\p{Ll}+.*'>
IS_ACRONYM_WITH_PERIODS
<WordShape.IS_ACRONYM_WITH_PERIODS: '(\\p{Lu}\\.)+'>
IS_EMOTICON
<WordShape.IS_EMOTICON: ":\\-\\)|:\\)|:o\\)|:\\]|:3|:>|=\\]|=\\)|:\\}|:\\^\\)|:\\-D|:\\-\\)\\)|:\\-\\)\\)\\)|:\\-\\)\\)\\)\\)|:\\-\\)\\)\\)\\)\\)|>:\\[|:\\-\\(|:\\(|:\\-c|:c|:\\-<|:<|:\\-\\[|:\\[|:\\{|;\\(|:\\-\\|\\||:@|>:\\(|:'\\-\\(|:'\\(|:'\\-\\)|:'\\)|D:<|>:O|:\\-O|:\\-o|:\\*|:\\-\\*|:\\^\\*|;\\-\\)|;\\)|\\*\\-\\)|\\*\\)|;\\-\\]|;\\]|;\\^\\)|:\\-,|>:P|:\\-P|:p|=p|:\\-p|=p|:P|=P|;p|;\\-p|;P|;\\-P|>:\\\\|>:/|:\\-/|:\\-\\.|:/|:\\\\|=/|=\\\\|:\\||:\\-\\||:\\$|:\\-\\#|:\\#|O:\\-\\)|0:\\-\\)|0:\\)|0;\\^\\)|>:\\)|>;\\)|>:\\-\\)|\\}:\\-\\)|\\}:\\)|3:\\-\\)|>_>\\^|\\^<_<|\\|;\\-\\)|\\|\\-O|:\\-J|:\\-\\&|:\\&|\\#\\-\\)|%\\-\\)|%\\)|<:\\-\\||\\~:\\-\\\\|\\*<\\|:\\-\\)|=:o\\]|,:\\-\\)|7:\\^\\]|</3|<3|8\\-\\)|\\^_\\^|:D|:\\-D|=D|\\^_\\^;;|O=\\)|\\}=\\)|B\\)|B\\-\\)|=\\||\\-_\\-|o_o;|u_u|:\\-\\\\|:s|:S|:\\-s|:\\-S|;\\*|;\\-\\*|:\\(|=\\(|>\\.<|>:\\-\\(|>:\\(|>=\\(|;_;|T_T|='\\(|>_<|D:|:o|:\\-o|=o|o\\.o|:O|:\\-O|=O|O\\.O|x_x|X\\-\\(|X\\(|X\\-o|X\\-O|:X\\)|\\(=\\^\\.\\^=\\)|\\(=\\^\\.\\.\\^=\\)|=\\^_\\^=|\\-<@%|:\\(\\|\\)|:\\(:\\)|\\(\\]:\\{|<\\\\3|\\~@\\~|8'\\(|XD|DX\\:っ\\)|\\:っC|ಠ\\_ಠ">
IS_LOWERCASE
<WordShape.IS_LOWERCASE: '\\p{Ll}+'>
IS_MIXED_CASE_LETTERS
<WordShape.IS_MIXED_CASE_LETTERS: '\\p{L}*\\p{Lu}\\p{L}*\\p{Ll}\\p{L}*|\\p{L}*\\p{Ll}\\p{L}*\\p{Lu}\\p{L}*'>
IS_NUMERIC_VALUE
<WordShape.IS_NUMERIC_VALUE: '([+-]?((\\p{Nd}+\\.?\\p{Nd}*)|(\\.\\p{Nd}+)))([eE]-?\\p{Nd}+)?'>
IS_PUNCT_OR_SYMBOL
<WordShape.IS_PUNCT_OR_SYMBOL: '[\\p{P}|\\p{S}]+'>
IS_UPPERCASE
<WordShape.IS_UPPERCASE: '\\p{Lu}+'>