Splitter that uses a Hub module.
Inherits From: `SplitterWithOffsets`, `Splitter`
```python
text.HubModuleSplitter(
    hub_module_handle
)
```
The TensorFlow graph from the module performs the real work. The Python code from this class handles the details of interfacing with that module, as well as the support for ragged tensors and high-rank tensors.
The Hub module should be loadable with [hub.load()](https://www.tensorflow.org/hub/api_docs/python/hub/load). If it is a v1 module, it should have a graph variant with an empty set of tags; we consider that graph variant to be the module and ignore everything else. The module should have a signature named `default` that takes a `text` input (a rank-1 tensor of strings to split into pieces) and returns a dictionary of tensors, say `output_dict`, such that:
* `output_dict['num_pieces']` is a rank-1 tensor of integers, where `num_pieces[i]` is the number of pieces that `text[i]` was split into.
* `output_dict['pieces']` is a rank-1 tensor of strings containing all pieces for `text[0]`, followed by all pieces for `text[1]`, and so on.
* `output_dict['starts']` is a rank-1 tensor of integers with the byte offsets where the pieces start (relative to the beginning of the corresponding input string).
* `output_dict['ends']` is a rank-1 tensor of integers with the byte offsets right after the end of the pieces (relative to the beginning of the corresponding input string).
The output dictionary may contain other tensors (e.g., for debugging), but this class does not use them.
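As a rough illustration of that contract, the sketch below builds a toy v2 SavedModel whose `default` signature splits each string into individual characters and returns the four tensors described above. `CharSplitterModule` and the export path are hypothetical (this is not how any published Hub module is implemented); it only shows the expected input and output shapes.

```python
import tensorflow as tf

class CharSplitterModule(tf.Module):
  """Hypothetical toy module exposing the output dictionary described above."""

  @tf.function(input_signature=[tf.TensorSpec([None], dtype=tf.string)])
  def default(self, text):
    # Split each input string into characters; `chars` and `starts` are
    # RaggedTensors aligned with the rank-1 `text` input.
    chars, starts = tf.strings.unicode_split_with_offsets(text, "UTF-8")
    piece_bytes = tf.cast(tf.strings.length(chars.flat_values), tf.int64)
    flat_starts = tf.cast(starts.flat_values, tf.int64)
    return {
        "num_pieces": chars.row_lengths(),   # pieces per input string
        "pieces": chars.flat_values,         # all pieces, concatenated
        "starts": flat_starts,               # byte offset where each piece starts
        "ends": flat_starts + piece_bytes,   # byte offset just past each piece
    }

module = CharSplitterModule()
tf.saved_model.save(
    module, "/tmp/char_splitter",
    signatures={"default": module.default})
# The saved directory could then be passed as hub_module_handle, e.g.
# text.HubModuleSplitter("/tmp/char_splitter"); whether a particular module
# plugs in cleanly depends on the signature details described above.
```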
Example:
```python
import tensorflow_hub as hub
from tensorflow_text import HubModuleSplitter

HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = HubModuleSplitter(hub.resolve(HUB_MODULE))
segmenter.split(["新华社北京"])
```
You can also use this splitter to return the split strings and their offsets:
```python
import tensorflow_hub as hub
from tensorflow_text import HubModuleSplitter

HUB_MODULE = "https://tfhub.dev/google/zh_segmentation/1"
segmenter = HubModuleSplitter(hub.resolve(HUB_MODULE))
pieces, starts, ends = segmenter.split_with_offsets(["新华社北京"])
print("pieces: %s starts: %s ends: %s" % (pieces, starts, ends))
```

This prints:

```
pieces: <tf.RaggedTensor [[b'新华社', b'北京']]> starts: <tf.RaggedTensor [[0, 9]]> ends: <tf.RaggedTensor [[9, 15]]>
```
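The offsets are byte offsets, not character offsets: each CJK character in the example occupies 3 bytes in UTF-8, so (given the segmentation shown above) the first piece covers bytes `[0, 9)` and the second covers `[9, 15)`. A quick sanity check in plain Python:

```python
# 3 characters * 3 bytes each = 9 bytes for the first piece,
# 5 characters * 3 bytes each = 15 bytes for the whole input string.
print(len("新华社".encode("utf-8")))      # 9
print(len("新华社北京".encode("utf-8")))  # 15
```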
Currently, this class also supports an older API, which uses slightly different key names for the output dictionary. For new Hub modules, please use the API described above.
Methods
split
```python
split(
    input_strs
)
```
Splits a tensor of UTF-8 strings into pieces.
| Args | |
|---|---|
| `input_strs` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |

| Returns |
|---|
| A `RaggedTensor` of segmented text. The returned shape is the shape of the input tensor with an added ragged dimension for the pieces of each string. |
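For instance, reusing the `segmenter` built in the example above, a rank-1 input of two strings comes back as a `RaggedTensor` with one ragged row of pieces per input string (a sketch of shapes only; the actual pieces depend on the module):

```python
# `segmenter` is the HubModuleSplitter created in the example above.
pieces = segmenter.split(["新华社北京", "你好"])
print(pieces.shape)  # (2, None): an added ragged dimension for the pieces
print(pieces[0])     # the pieces of the first input string
```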
split_with_offsets
```python
split_with_offsets(
    input_strs
)
```
Splits a tensor of UTF-8 strings into pieces with [start,end) offsets.
| Args | |
|---|---|
| `input_strs` | An N-dimensional `Tensor` or `RaggedTensor` of UTF-8 strings. |

| Returns |
|---|
| A tuple `(pieces, start_offsets, end_offsets)` where `pieces` is a `RaggedTensor` of segmented text, and `start_offsets` and `end_offsets` are `RaggedTensor`s of byte offsets into the corresponding input strings marking where each piece starts and where it ends (exclusive), i.e. `[start, end)`. |
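As a sketch of how the offsets relate back to the input (again reusing the `segmenter` from the example above), the `[start, end)` byte ranges can be fed to `tf.strings.substr` with `unit="BYTE"` to recover each piece:

```python
import tensorflow as tf

inputs = tf.constant(["新华社北京"])
pieces, starts, ends = segmenter.split_with_offsets(inputs)

# starts[0] and ends[0] hold the byte offsets of the pieces of the first
# input string; substr over each [start, end) range reproduces the piece.
first = tf.fill(tf.shape(starts[0]), inputs[0])
recovered = tf.strings.substr(first, starts[0], ends[0] - starts[0], unit="BYTE")
# Each element of `recovered` should equal the corresponding element of pieces[0].
```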