hashformers.segmenter package

Submodules

hashformers.segmenter.segmenter module

class hashformers.segmenter.segmenter.BaseSegmenter

Bases: object

predict(*args, **kwargs)
class hashformers.segmenter.segmenter.EkphrasisWordSegmenter(**kwargs)

Bases: ekphrasis.classes.segmenter.Segmenter, hashformers.segmenter.segmenter.BaseSegmenter

find_segment(text, prev='<S>')

Return (log P(words), words), where words is the best estimated segmentation.

Parameters
  • text – the text to be segmented

  • prev – defaults to '<S>'
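The return shape described above, a (log-probability, words) pair for the best split, can be illustrated with a small unigram segmenter. This is a minimal sketch and not the ekphrasis implementation: the counts in COUNTS are made-up toy data, log_prob is a hypothetical helper, and the prev argument is dropped because the sketch scores each word independently of the one before it.

```python
import math
from functools import lru_cache

# Toy unigram counts; hypothetical data for illustration only.
COUNTS = {"hello": 50, "world": 40, "he": 30, "low": 5,
          "or": 20, "ld": 1, "hell": 10, "o": 2}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log10 probability of a word; unseen words are penalised by length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    return math.log10(10.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def find_segment(text):
    """Return (log P(words), words) for the best segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    # Try every split point: first word = text[:i], remainder solved recursively.
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = find_segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    # Tuples compare by score first, so max() picks the most probable split.
    return max(candidates)


score, words = find_segment("helloworld")
print(words)  # ('hello', 'world')
```

Memoising on the remaining text keeps the recursion from re-solving the same suffix for every split point.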

segment(inputs) → List[str]
segment_word(word) → str
class hashformers.segmenter.segmenter.HashtagContainer(hashtags: List[List[str]], hashtag_set: List[str], replacement_dict: dict)

Bases: object

hashtag_set: List[str]
hashtags: List[List[str]]
replacement_dict: dict
class hashformers.segmenter.segmenter.RegexWordSegmenter(regex_rules=None)

Bases: hashformers.segmenter.segmenter.BaseSegmenter

segment(inputs: List[str])
segment_word(rule, word)
segmentation_generator(word_list)
class hashformers.segmenter.segmenter.TweetSegmenter(matcher=None, word_segmenter=None)

Bases: hashformers.segmenter.segmenter.BaseSegmenter

build_hashtag_container(tweets: str, preprocessing_kwargs: dict = {}, segmenter_kwargs: dict = {})
compile_dict(hashtags, segmentations, hashtag_token=None, lower=False, separator=' ', hashtag_character='#')
extract_hashtags(tweets)
replace_hashtags(tweet, regex_pattern, replacement_dict)
segment(tweets: List[str], regex_flag: Any = 0, preprocessing_kwargs: dict = {}, segmenter_kwargs: dict = {})
segmented_tweet_generator(tweets, hashtags, hashtag_set, replacement_dict, flag=0)
class hashformers.segmenter.segmenter.TweetSegmenterOutput(output: List[str], word_segmenter_output: Any)

Bases: object

output: List[str]
word_segmenter_output: Any
class hashformers.segmenter.segmenter.TwitterTextMatcher

Bases: object

class hashformers.segmenter.segmenter.WordSegmenter(segmenter_model_name_or_path='gpt2', segmenter_model_type='gpt2', segmenter_device='cuda', segmenter_gpu_batch_size=1, reranker_gpu_batch_size=2000, reranker_model_name_or_path='bert-base-uncased', reranker_model_type='bert')

Bases: hashformers.segmenter.segmenter.BaseSegmenter

A general-purpose word segmentation API.

segment(word_list: List[str], topk: int = 20, steps: int = 13, alpha: float = 0.222, beta: float = 0.111, use_reranker: bool = False, return_ranks: bool = False) → Any

Segment a list of strings.

Parameters
  • word_list (List[str]) – A list of strings.

  • topk (int, optional) – top-k parameter for the beam search algorithm. A lower top-k value speeds up the algorithm but reduces the number of candidate segmentations in a rank, defaults to 20

  • steps (int, optional) – steps parameter for the beam search algorithm. A lower number of steps speeds up the algorithm, but the algorithm will never detect more words than the number of steps, defaults to 13

  • alpha (float, optional) – alpha parameter for the top-2 ensemble. It controls the weight given to the segmenter candidates. Reasonable values range from 0 to 1, defaults to 0.222

  • beta (float, optional) – beta parameter for the top-2 ensemble. It controls the weight given to the reranker candidates. Reasonable values range from 0 to 1, defaults to 0.111

  • use_reranker (bool, optional) – Whether or not to run the reranker, defaults to False

  • return_ranks (bool, optional) – Return not just the segmented hashtags but also a dictionary of the ranks, defaults to False

Returns

A list of segmented words if return_ranks == False. A dictionary of the ranks and the segmented words if return_ranks == True.

Return type

Any
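A call to segment with the parameters documented above might be wrapped as follows. This is a sketch under stated assumptions: segment_hashtags is a hypothetical helper that is not part of hashformers, stripping the leading '#' before segmentation is an assumption about the expected input, and the commented-out lines assume that hashformers is installed, that model weights can be loaded, and that 'cpu' is an accepted value for segmenter_device.

```python
from typing import Any, List


def segment_hashtags(segmenter: Any, hashtags: List[str], **segment_kwargs: Any) -> Any:
    """Strip a leading '#' from each hashtag and pass the list to segmenter.segment.

    Hypothetical helper: `segmenter` is any object exposing the documented
    segment(word_list, ...) method; keyword arguments are forwarded unchanged.
    """
    word_list = [tag.lstrip("#") for tag in hashtags]
    return segmenter.segment(word_list, **segment_kwargs)


# Intended use (assumes hashformers is installed and models can be loaded):
# from hashformers.segmenter.segmenter import WordSegmenter
# ws = WordSegmenter(segmenter_device="cpu")  # "cpu" is an assumed value
# segment_hashtags(ws, ["#icecold"], topk=5, steps=5)
```

Lower topk and steps values are used in the commented call only to trade candidate coverage for speed, as the parameter descriptions state.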

class hashformers.segmenter.segmenter.WordSegmenterOutput(output: List[str], segmenter_rank: Union[pandas.core.frame.DataFrame, NoneType] = None, reranker_rank: Union[pandas.core.frame.DataFrame, NoneType] = None, ensemble_rank: Union[pandas.core.frame.DataFrame, NoneType] = None)

Bases: object

ensemble_rank: Optional[pandas.core.frame.DataFrame] = None
output: List[str]
reranker_rank: Optional[pandas.core.frame.DataFrame] = None
segmenter_rank: Optional[pandas.core.frame.DataFrame] = None
hashformers.segmenter.segmenter.coerce_segmenter_objects(method)
hashformers.segmenter.segmenter.deleteEncodingLayers(model, layer_list=[0])
hashformers.segmenter.segmenter.prune_segmenter_layers(ws, layer_list=[0])

Module contents