Chinese Word Clouds in RedNotebook

2021-07-01


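RedNotebook is a lightweight note-taking application written in Python. One of its standout features is the "Words" word cloud, but it does not support Chinese: Chinese text has no spaces between words, so every stretch of text not separated by whitespace is treated as a single word and put into the cloud, which defeats the purpose. I located the program files and traced the word-cloud logic to two functions: **get_words() in data.py** and **_get_words_for_cloud() in gui/clouds.py**. Adding Chinese word segmentation to these two files should be enough to get the word cloud I want.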
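To make the problem concrete, here is a minimal illustration of what whitespace splitting does to Chinese text. It is not RedNotebook's actual code, and the sample sentences are made up.

```python
# Whitespace splitting works for English but not for Chinese,
# because Chinese words are not separated by spaces.
english = "the quick brown fox"
chinese = "今天天气很好，我去了公园"

english_words = english.split()
chinese_words = chinese.split()

print(english_words)  # four separate words
print(chinese_words)  # the whole sentence comes back as a single "word"
```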

Package used: THULAC

https://github.com/thunlp/THULAC-Python

sudo pip install thulac

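~~Go to http://thulac.thunlp.org/message_v1_1, fill in your personal details, and download the THULAC model.~~ After downloading it, I found that the thulac package installed by pip already ships with a default model, located in /usr/local/lib/python3.8/dist-packages/thulac/models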

Once it is installed, try it out in a Python shell first:

>>> import thulac
>>> thu1 = thulac.thulac()
Model loaded succeed
>>> text = "清华大学计算机系thulacmodel模型model"
>>> tokenized = thu1.cut(text, text=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ......
  File "/usr/local/lib/python3.8/dist-packages/thulac/character/CBTaggingDecoder.py", line 170, in segmentTag
    start = time.clock()
AttributeError: module 'time' has no attribute 'clock'

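The cause is that time.clock() was removed in Python 3.8 (it had been deprecated since 3.3). Follow the traceback to the offending line and change time.clock() to time.perf_counter(). After that change it works: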

>>> tokenized = thu1.cut(text, text=True)
>>> print(tokenized)
清华大学_ni 计算机系_n thulacmodel_x 模型_n model_x
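If you would rather not edit files inside dist-packages (the change is lost whenever the package is reinstalled), a possible alternative is to restore the missing name before thulac is used. This is only a sketch of that workaround, not what I actually did, and it assumes the time.clock() call in the traceback is the only incompatibility.

```python
import time

# time.clock() no longer exists on Python 3.8+; alias it to perf_counter()
# so that code still calling time.clock() keeps working.
if not hasattr(time, "clock"):
    time.clock = time.perf_counter

# After this point, importing and using thulac should no longer raise
# the AttributeError for time.clock:
#   import thulac
#   thu1 = thulac.thulac()
```

The alias has to be in place before the first call to cut(), so it belongs at the top of whichever module creates the segmenter.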

In actual use, text=False (the default) is the better choice: cut() then returns a list of [word, tag] pairs.

A further test:

>>> text = "小明学习他的离散数学，真是开心啊已经这么晚了"
>>> tokenized = thu1.cut(text)
>>> print(tokenized)
[['小明', 'np'], ['学习', 'v'], ['他', 'r'], ['的', 'u'], ['离散', 'v'], ['数学', 'n'], ['，', 'w'], ['真是', 'd'], ['开心', 'a'], ['啊', 'u'], ['已经', 'd'], ['这', 'r'], ['么', 'q'], ['晚', 'a'], ['了', 'u']]
>>> text = "Let's try english words"
>>> print(thu1.cut(text))
[['Let', 'x'], ["'", 'w'], ['s', 'j'], [' ', 'j'], ['try', 'n'], [' ', 'v'], ['english', 'x'], [' ', 'v'], ['words', 'x']]

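The part-of-speech tags mean:

n/noun np/person name ns/place name ni/organization name nz/other proper noun
m/numeral q/measure word mq/numeral-measure compound t/time word f/locative word s/place word
v/verb a/adjective d/adverb h/prefix k/suffix
i/idiom j/abbreviation r/pronoun c/conjunction p/preposition u/particle y/modal particle
e/interjection o/onomatopoeia g/morpheme w/punctuation x/other

As the English test shows, it does not handle English properly (even the spaces come back as tokens tagged j or v), which was to be expected.

When building the cloud, we should keep only words tagged n* (n, np, ns, ni, nz), v, a, or i, and leave everything else out.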
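That filtering step can be sketched on its own, without loading THULAC, by running it over the tagged output from the test above. The function name filter_for_cloud and the constant KEPT_TAGS are hypothetical names used only for this illustration.

```python
# Tags to keep: all noun variants, verbs, adjectives and idioms.
KEPT_TAGS = {"n", "np", "ns", "ni", "nz", "v", "a", "i"}


def filter_for_cloud(tagged):
    """Keep only the words whose tag is in KEPT_TAGS.

    `tagged` is a list of [word, tag] pairs, as returned by cut(text).
    """
    return [word for word, tag in tagged if tag in KEPT_TAGS]


# Sample input copied from the cut() output shown earlier.
sample = [['小明', 'np'], ['学习', 'v'], ['他', 'r'], ['的', 'u'],
          ['离散', 'v'], ['数学', 'n'], ['，', 'w'], ['真是', 'd'],
          ['开心', 'a'], ['啊', 'u'], ['已经', 'd'], ['这', 'r'],
          ['么', 'q'], ['晚', 'a'], ['了', 'u']]

print(filter_for_cloud(sample))
# ['小明', '学习', '离散', '数学', '开心', '晚']
```

Pronouns, particles, adverbs and punctuation drop out, which is what a word cloud needs.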

Patching the RedNotebook code

data.py

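import re

import thulac

# Assumption: one shared segmenter, created once at module level.
thu = thulac.thulac()

# Only these part-of-speech tags are kept for the word cloud.
KEPT_TAGS = ('n', 'np', 'ns', 'ni', 'nz', 'v', 'a', 'i')


def is_contain_chinese(check_str):
    """Return True if the string contains at least one CJK unified ideograph."""
    for ch in check_str:
        if '\u4e00' <= ch <= '\u9fff':
            return True
    return False

# Inside class Day:
def get_words(self, with_special_chars=False):
    all_text = self.text
    # `punc` is a module-level string of punctuation characters to replace
    # with spaces; its definition is not shown in this post.
    all_text = re.sub(r"[%s]+" % punc, " ", all_text)
    words = all_text.split()
    if with_special_chars:
        return words
    # Strip all ASCII punctuation except for $, %, @ and '.
    words = [w.strip('.|-!"&/()=?*+~#_:;,<>^°`{}[]\\') for w in words]
    result = []
    for word in words:
        if not word:
            continue
        if is_contain_chinese(word):
            # Segment Chinese text and keep only the selected word classes.
            for token, tag in thu.cut(word):
                if tag in KEPT_TAGS:
                    result.append(token)
        else:
            # Words without Chinese characters pass through unchanged.
            result.append(word)
    return result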

gui/clouds.py

def _get_words_for_cloud(self, word_count_dict, ignores, includes):
    words_and_frequencies = [
        (word, freq)
        for (word, freq) in word_count_dict.items()
        if (
            len(word) > 4
            # Words that are not purely ASCII letters (e.g. Chinese) only
            # need two or more characters. This check must sit outside the
            # any() below, otherwise it never runs when `includes` is empty.
            or (len(word) > 1 and not word.encode('utf-8').isalpha())
            or any(pattern.match(word) for pattern in includes)
        )
        and not
        # filter words in ignore_list
        any(pattern.match(word) for pattern in ignores)
    ]
    return self.select_most_frequent_words(words_and_frequencies, CLOUD_WORDS)

Here, Chinese words qualify for the cloud at two or more characters, while English words must be longer than four characters.
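The length rule can also be written as a small standalone predicate. This is a simplified sketch: the name long_enough_for_cloud is hypothetical, it ignores the includes/ignores patterns, and it tests for Chinese characters directly instead of using the isalpha() check from the patch.

```python
def is_contain_chinese(check_str):
    """Return True if the string contains at least one CJK unified ideograph."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in check_str)


def long_enough_for_cloud(word):
    """Chinese words need 2+ characters; other words need more than 4."""
    if is_contain_chinese(word):
        return len(word) >= 2
    return len(word) > 4


for w in ["数学", "晚", "notebook", "word"]:
    print(w, long_enough_for_cloud(w))
```

Testing for Chinese characters explicitly avoids one side effect of the isalpha() check: short non-Chinese tokens that contain digits or apostrophes are not let through by accident.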

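Final result

Looks decent, right? The only drawback I can see is that the words in my notes rarely repeat, so the few that do repeat often will probably stay in the cloud for a very long time...

Also, the THULAC model takes a while to load: after opening RedNotebook, it takes a few seconds before the word cloud appears.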
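One way to soften this might be to postpone loading the model until the segmenter is first needed. The wrapper below is a hypothetical sketch: the class LazySegmenter and the helper load_thulac are not part of RedNotebook or THULAC. It only moves the delay to the first cut() call, so it helps startup only if the cloud is not computed right when the window opens.

```python
class LazySegmenter:
    """Create the real segmenter on first use instead of at import time."""

    def __init__(self, factory):
        self._factory = factory    # zero-argument callable that builds the segmenter
        self._segmenter = None

    def cut(self, text):
        if self._segmenter is None:
            # First call: pay the model-loading cost here, exactly once.
            self._segmenter = self._factory()
        return self._segmenter.cut(text)


def load_thulac():
    # Imported here so the slow import and model load happen only when needed.
    import thulac
    return thulac.thulac()


# In data.py this could stand in for wherever the `thu` object is created;
# nothing is loaded until thu.cut() is first called.
thu = LazySegmenter(load_thulac)
```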

Author: Junetheriver
Link: http://nicklennonliu.github.io/2021/07/01/Rednotebook 中文词云/
License: Unless otherwise stated, all posts on this blog are released under the MIT license. Please credit the source when reposting!