hashformers.segmenter package
Submodules
hashformers.segmenter.segmenter module
- class hashformers.segmenter.segmenter.EkphrasisWordSegmenter(**kwargs)
Bases:
ekphrasis.classes.segmenter.Segmenter, hashformers.segmenter.segmenter.BaseSegmenter
- find_segment(text, prev='<S>')
Return (log P(words), words), where words is the best estimated segmentation.
- Parameters
text – the text to be segmented
prev – defaults to '<S>'
- segment(inputs) → List[str]
- segment_word(word) → str
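The (log P(words), words) return value of find_segment can be illustrated with a minimal sketch. It assumes a unigram model over a toy vocabulary and ignores the prev context entirely; the counts, the unknown-word penalty, and the function names below are hypothetical and are not taken from ekphrasis or hashformers.

```python
import math
from functools import lru_cache

# Toy unigram counts; purely illustrative, not real corpus statistics.
COUNTS = {"new": 50, "york": 30, "city": 40}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log10 probability of a word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log10(COUNTS[word] / TOTAL)
    return math.log10(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def best_segmentation(text):
    """Return (log P(words), words) for the highest-scoring split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = best_segmentation(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    # Tuples compare by score first, so max() picks the most probable split.
    return max(candidates)


score, words = best_segmentation("newyorkcity")
print(words)  # ('new', 'york', 'city')
```

Memoizing on the remaining suffix keeps the search polynomial in the length of the input, even though the number of possible splits grows exponentially.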
- class hashformers.segmenter.segmenter.HashtagContainer(hashtags: List[List[str]], hashtag_set: List[str], replacement_dict: dict)
Bases:
object
- hashtag_set: List[str]
- hashtags: List[List[str]]
- replacement_dict: dict
- class hashformers.segmenter.segmenter.RegexWordSegmenter(regex_rules=None)
Bases:
hashformers.segmenter.segmenter.BaseSegmenter
- segment(inputs: List[str])
- segment_word(rule, word)
- segmentation_generator(word_list)
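Rule-based segmentation of the kind RegexWordSegmenter exposes can be sketched as follows. The rule shown (start a new word before each run of uppercase letters) and the helper names are assumptions made for illustration; the rules applied by the library when regex_rules is None are not documented here.

```python
import re
from typing import List

# Hypothetical rule: start a new word before each run of uppercase letters.
CAMEL_CASE_RULE = re.compile(r"([A-Z]+)")


def segment_with_rule(rule: "re.Pattern", word: str) -> str:
    """Insert a space before every match of the rule, then trim the edges."""
    return rule.sub(r" \1", word).strip()


def segment_all(words: List[str]) -> List[str]:
    """Apply the single illustrative rule to every word in the list."""
    return [segment_with_rule(CAMEL_CASE_RULE, w) for w in words]


print(segment_all(["IceCold", "helloWorld"]))  # ['Ice Cold', 'hello World']
```

A rule this simple only works when the hashtag carries casing information; an all-lowercase input such as "icecold" passes through unchanged.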
- class hashformers.segmenter.segmenter.TweetSegmenter(matcher=None, word_segmenter=None)
Bases:
hashformers.segmenter.segmenter.BaseSegmenter
- build_hashtag_container(tweets: str, preprocessing_kwargs: dict = {}, segmenter_kwargs: dict = {})
- compile_dict(hashtags, segmentations, hashtag_token=None, lower=False, separator=' ', hashtag_character='#')
- extract_hashtags(tweets)
- replace_hashtags(tweet, regex_pattern, replacement_dict)
- segment(tweets: List[str], regex_flag: Any = 0, preprocessing_kwargs: dict = {}, segmenter_kwargs: dict = {})
- segmented_tweet_generator(tweets, hashtags, hashtag_set, replacement_dict, flag=0)
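The method names above suggest a pipeline of extracting hashtags, segmenting them, building a replacement dictionary, and substituting the segmentations back into each tweet. The following is a minimal standalone sketch of that flow, not the library's implementation: the hashtag pattern, the helper names, and the hard-coded segmentation are all assumptions.

```python
import re
from typing import Dict, List

# Simplified hashtag pattern; an assumption, not the library's matcher.
HASHTAG = re.compile(r"#(\w+)")


def find_hashtags(tweets: List[str]) -> List[List[str]]:
    """Collect the hashtags (without '#') found in each tweet."""
    return [HASHTAG.findall(tweet) for tweet in tweets]


def build_replacements(hashtags: List[str], segmentations: List[str]) -> Dict[str, str]:
    """Map each '#hashtag' to its segmented form."""
    return {"#" + tag: seg for tag, seg in zip(hashtags, segmentations)}


def apply_replacements(tweet: str, replacements: Dict[str, str]) -> str:
    """Replace every hashtag that has an entry; leave the others untouched."""
    return HASHTAG.sub(lambda m: replacements.get(m.group(0), m.group(0)), tweet)


tweets = ["Love this #icecold weather", "No hashtags here"]
per_tweet = find_hashtags(tweets)
hashtag_set = sorted({tag for tags in per_tweet for tag in tags})
# Stand-in for the output of a word segmenter run on hashtag_set.
segmentations = ["ice cold"]
replacements = build_replacements(hashtag_set, segmentations)
segmented = [apply_replacements(t, replacements) for t in tweets]
print(segmented)  # ['Love this ice cold weather', 'No hashtags here']
```

Deduplicating hashtags before segmentation means each distinct hashtag is segmented once, however many tweets contain it.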
- class hashformers.segmenter.segmenter.TweetSegmenterOutput(output: List[str], word_segmenter_output: Any)
Bases:
object
- output: List[str]
- word_segmenter_output: Any
- class hashformers.segmenter.segmenter.TwitterTextMatcher
Bases:
object
- class hashformers.segmenter.segmenter.WordSegmenter(segmenter_model_name_or_path='gpt2', segmenter_model_type='gpt2', segmenter_device='cuda', segmenter_gpu_batch_size=1, reranker_gpu_batch_size=2000, reranker_model_name_or_path='bert-base-uncased', reranker_model_type='bert')
Bases:
hashformers.segmenter.segmenter.BaseSegmenter
A general-purpose word segmentation API.
- segment(word_list: List[str], topk: int = 20, steps: int = 13, alpha: float = 0.222, beta: float = 0.111, use_reranker: bool = False, return_ranks: bool = False) → Any
Segment a list of strings.
- Parameters
word_list (List[str]) – A list of strings.
topk (int, optional) – top-k parameter for the beam search algorithm. A lower top-k value speeds up the algorithm but reduces the number of candidate segmentations in a rank, defaults to 20
steps (int, optional) – steps parameter for the beam search algorithm. A lower number of steps speeds up the algorithm, but the algorithm will never detect more words than the number of steps, defaults to 13
alpha (float, optional) – alpha parameter for the top-2 ensemble. It controls the weight given to the segmenter candidates. Reasonable values range from 0 to 1, defaults to 0.222
beta (float, optional) – beta parameter for the top-2 ensemble. It controls the weight given to the reranker candidates. Reasonable values range from 0 to 1, defaults to 0.111
use_reranker (bool, optional) – Whether or not to run the reranker, defaults to False
return_ranks (bool, optional) – Return not just the segmented hashtags but also a dictionary of the ranks, defaults to False
- Returns
A list of segmented words if return_ranks == False. A dictionary of the ranks and the segmented words if return_ranks == True.
- Return type
Any
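The roles of topk and steps can be illustrated with a toy beam search. This is a sketch under stated assumptions, not the library's algorithm: the scoring function is a hypothetical vocabulary lookup standing in for a language model, and each step is assumed to insert at most one space per candidate.

```python
import math
from typing import List

# Toy word probabilities standing in for a language model (hypothetical).
VOCAB = {"ice": 0.3, "cold": 0.3, "i": 0.2, "old": 0.2}


def toy_score(candidate: str) -> float:
    """Negative log-probability of a candidate; lower is better."""
    total = 0.0
    for word in candidate.split(" "):
        prob = VOCAB.get(word, 10.0 ** -(len(word) + 3))
        total -= math.log(prob)
    return total


def beam_search_sketch(word: str, topk: int = 20, steps: int = 13) -> List[str]:
    """Rank segmentations found by inserting one space per step."""
    scored = {word: toy_score(word)}
    frontier = [word]
    for _ in range(steps):
        expanded = set()
        for cand in frontier:
            for i in range(1, len(cand)):
                # Only split between two non-space characters.
                if cand[i - 1] != " " and cand[i] != " ":
                    expanded.add(cand[:i] + " " + cand[i:])
        expanded.difference_update(scored)
        if not expanded:
            break
        for cand in expanded:
            scored[cand] = toy_score(cand)
        # Keep only the top-k new candidates for the next step.
        frontier = sorted(expanded, key=lambda c: (scored[c], c))[:topk]
    return sorted(scored, key=lambda c: (scored[c], c))[:topk]


ranking = beam_search_sketch("icecold", topk=5, steps=3)
print(ranking[0])  # ice cold
```

In this sketch, a run of `steps` steps can produce at most `steps + 1` words, which is why a small step count caps the number of detectable words; the library's exact bound may differ. A smaller `topk` prunes the frontier harder, trading candidate coverage for speed.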
- class hashformers.segmenter.segmenter.WordSegmenterOutput(output: List[str], segmenter_rank: Union[pandas.core.frame.DataFrame, NoneType] = None, reranker_rank: Union[pandas.core.frame.DataFrame, NoneType] = None, ensemble_rank: Union[pandas.core.frame.DataFrame, NoneType] = None)
Bases:
object
- ensemble_rank: Optional[pandas.core.frame.DataFrame] = None
- output: List[str]
- reranker_rank: Optional[pandas.core.frame.DataFrame] = None
- segmenter_rank: Optional[pandas.core.frame.DataFrame] = None
- hashformers.segmenter.segmenter.coerce_segmenter_objects(method)
- hashformers.segmenter.segmenter.deleteEncodingLayers(model, layer_list=[0])
- hashformers.segmenter.segmenter.prune_segmenter_layers(ws, layer_list=[0])