A Splitter that uses a state machine to determine sentence breaks.
StateBasedSentenceBreaker splits text into sentences by using a state machine to determine when a sequence of characters indicates a potential sentence break.
The state machine starts in an initial state and transitions to a collecting terminal punctuation state once an acronym, an emoticon, or terminal punctuation (ellipsis, question mark, exclamation point, etc.) is encountered.
It transitions to the collecting close punctuation state when close punctuation (close bracket, end quote, etc.) is found.
If non-punctuation is encountered in either the collecting terminal punctuation state or the collecting close punctuation state, the state machine exits and returns false, indicating that it has moved past the end of a potential sentence fragment.
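The three states described above can be sketched in plain Python. This is a toy illustration, not the library's implementation: the names `State`, `step`, and `split_fragments` are hypothetical, the `TERMINAL` and `CLOSE` character sets are assumptions, and acronym and emoticon detection are left out entirely (so an input such as `U.S.` would be split incorrectly here).

```python
from enum import Enum, auto

# Assumed character classes for this sketch only.
TERMINAL = set(".?!")        # terminal punctuation
CLOSE = set(")]}\"'")        # close punctuation


class State(Enum):
    INITIAL = auto()
    COLLECTING_TERMINAL = auto()
    COLLECTING_CLOSE = auto()


def step(state, ch):
    """Return (next_state, inside), where inside is False once the
    machine has moved past the end of a potential sentence fragment."""
    if state is State.INITIAL:
        if ch in TERMINAL:
            return State.COLLECTING_TERMINAL, True
        return State.INITIAL, True
    if state is State.COLLECTING_TERMINAL:
        if ch in TERMINAL:
            return State.COLLECTING_TERMINAL, True
        if ch in CLOSE:
            return State.COLLECTING_CLOSE, True
        return state, False
    # COLLECTING_CLOSE: simplification, anything but close punctuation exits.
    if ch in CLOSE:
        return State.COLLECTING_CLOSE, True
    return state, False


def split_fragments(text):
    """Split text into fragments, dropping surrounding whitespace."""
    fragments = []
    state = State.INITIAL
    begin = 0
    for i, ch in enumerate(text):
        state, inside = step(state, ch)
        if not inside:
            # The machine exited: text[begin:i] is a finished fragment,
            # and ch is re-fed from the initial state as the next start.
            fragments.append(text[begin:i])
            begin = i
            state, _ = step(State.INITIAL, ch)
    if begin < len(text):
        fragments.append(text[begin:])
    return [f.strip() for f in fragments if f.strip()]


print(split_fragments("Hello...foo bar"))
```

Restarting from the initial state on the character that caused the exit keeps the sketch to a single pass over the input.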
Methods
break_sentences
break_sentences(doc)
Splits doc into sentence fragments and returns the fragments' text.
Args | |
---|---|
doc | A string Tensor of shape [batch] with a batch of documents. |
Returns | |
---|---|
results | A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments. |
break_sentences_with_offsets
break_sentences_with_offsets(doc)
Splits doc into sentence fragments and returns the fragments' text together with their start and end offsets.
Example:

                    1                  1         2         3
          012345678901234    01234567890123456789012345678901234567
    doc: 'Hello...foo bar', 'Welcome to the U.S. don't be surprised'
    fragment_text: [
      ['Hello...', 'foo bar'],
      ['Welcome to the U.S.', 'don't be surprised']
    ]
    start: [[0, 8], [0, 20]]
    end: [[8, 15], [19, 38]]
Args | |
---|---|
doc | A string Tensor of shape [batch] or [batch, 1]. |
Returns | |
---|---|
A tuple of (fragment_text, start, end) where: | |
fragment_text | A string RaggedTensor of shape [batch, (num_sentences)] with each input broken up into its constituent sentence fragments. |
start | An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the inclusive beginning byte offset of a sentence. |
end | An int64 RaggedTensor of shape [batch, (num_sentences)] where each entry is the exclusive ending byte offset of a sentence. |
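Because start is inclusive, end is exclusive, and both are byte offsets, each fragment can be recovered by slicing the UTF-8 bytes of the document. The sketch below uses plain Python lists in place of the RaggedTensor results and copies the offsets from the example above. The helper `slice_fragments` is a hypothetical name, and the offsets in the second, non-ASCII case are hand-computed for illustration rather than produced by the library.

```python
def slice_fragments(doc, starts, ends):
    """Recover fragment text from [start, end) byte offsets into doc."""
    data = doc.encode("utf-8")  # offsets index bytes, not characters
    return [data[s:e].decode("utf-8") for s, e in zip(starts, ends)]


# Values taken from the documented example.
docs = ["Hello...foo bar", "Welcome to the U.S. don't be surprised"]
start = [[0, 8], [0, 20]]
end = [[8, 15], [19, 38]]

fragments = [slice_fragments(d, s, e) for d, s, e in zip(docs, start, end)]
print(fragments)

# Hand-computed offsets (an assumption, not library output): "é" takes
# two bytes in UTF-8, so "Café." spans bytes 0..6 although it has only
# five characters.
accented = slice_fragments("Café. Oui", [0, 7], [6, 10])
print(accented)
```

Slicing the Python str directly with these offsets would give wrong results for the accented input, which is why the sketch encodes to bytes first.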