笔记（7）|通过题库词频统计发现高频考点

梦貘 2024-07-20 12:17

笔记

前言

考研复习，我比较喜欢用电子资料。电子资料的好处就是结构灵活，可以随机根据自己的需要进行调整。

比如今天我就突发奇想，能不能通过统计题库词频的方式，来筛查出高频考点？

本篇记录我的折腾过程。老样子，代码部分由GPT-4o编写

准备工作

先说一下我做这个工作前的条件：

一份题库

之前找学长学姐买的，是word版，我把他们合并到一个文件后，又用Pandoc转换成了Markdown格式，因为自己写了个刷题小程序，用markdown比较方便导入
Logseq电子版笔记

前两遍读教材的时候，自己用Logseq做的笔记，页面基本上是以名词作为标题的

折腾过程

由于我比较习惯用python，所以直接问了AI：

我：用纯Python的方式分词有没有什么比较好的工具和方案，比如把“朝鲜旧石器时代古人类分布特点及形成原因”分成“朝鲜”“旧石器时代”“古人类”“分布”“特点”“及”“形成”“原因“

AI：在纯Python环境下进行中文分词，有几个常用的工具和方案。以下是一些比较流行的方法：

1. 使用 jieba 分词库

jieba 是一个非常流行的中文分词库，支持多种分词模式，包括精确模式、全模式和搜索引擎模式。

安装 jieba：
pip install jieba
使用 jieba 进行分词：
import jieba

text = "朝鲜旧石器时代古人类分布特点及形成原因"
words = jieba.lcut(text)
print(words)
输出：
['朝鲜', '旧石器时代', '古人类', '分布', '特点', '及', '形成', '原因']
2. 使用 pkuseg 分词库

pkuseg 是北京大学开发的一款中文分词工具，具有高准确率和高效率。

安装 pkuseg：
pip install pkuseg
使用 pkuseg 进行分词：
import pkuseg

seg = pkuseg.pkuseg()  # 默认配置
text = "朝鲜旧石器时代古人类分布特点及形成原因"
words = seg.cut(text)
print(words)
输出：
['朝鲜', '旧石器时代', '古人类', '分布', '特点', '及', '形成', '原因']
3. 使用 THULAC 分词库

THULAC 是清华大学开发的中文词法分析工具，支持词性标注和分词。

安装 THULAC：
pip install thulac
使用 THULAC 进行分词：
import thulac

thu = thulac.thulac(seg_only=True)  # 只进行分词，不进行词性标注
text = "朝鲜旧石器时代古人类分布特点及形成原因"
words = thu.cut(text, text=True).split()
print(words)
输出：
['朝鲜', '旧石器时代', '古人类', '分布', '特点', '及', '形成', '原因']
4. 自定义词典

有时候你可能需要自定义分词词典，以确保某些特定的词汇能够正确分词。以 jieba 为例：
import jieba

# 添加自定义词汇
jieba.add_word('旧石器时代')
jieba.add_word('古人类')

text = "朝鲜旧石器时代古人类分布特点及形成原因"
words = jieba.lcut(text)
print(words)
输出：
['朝鲜', '旧石器时代', '古人类', '分布', '特点', '及', '形成', '原因']
总结

以上几种方法都是在纯Python环境下进行中文分词的有效方案。jieba 是最常用的，具有较好的性能和易用性。pkuseg 和 THULAC 也都是非常优秀的选择，具体使用哪种工具可以根据你的需求和偏好来决定。

我对比了几个方案的分词效果，最终决定用jieba的分词方案。

然后让它写了个代码，大概实现了以下功能：

1、会自动读取当前目录下的md文件

2、会对文件中的内容进行分析和分词

3、将分词结果自动进行词频统计

4、将词频统计结果按从高到低的结果储存在一个csv文件中

5、去除一些常用词（比如”的“以及一些标点符号、空格之类的），只保留专有名词

和GPT的沟通过程此处省略。但是在实现5的过程中，GPT最开始的思路是自己枚举常用词，我让它不要枚举，上网上找数据集，然后它就帮我找了一个中文常用停词表.

我选择了“中文停用词表.txt”，并根据代码把它重命名成了“hit_stopwords.txt”，放在代码的相同目录下。

代码如下：

import os
import jieba
import jieba.posseg as pseg
import csv
from collections import Counter

def read_md_files(directory):
    """读取指定目录下所有的Markdown文件内容"""
    texts = []
    for filename in os.listdir(directory):
        if filename.endswith('.md'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts

def load_stopwords(file_path):
    """加载停用词表"""
    with open(file_path, 'r', encoding='utf-8') as file:
        stopwords = set(line.strip() for line in file)
    return stopwords

def segment_texts(texts, stopwords):
    """对文本列表进行分词，并过滤停用词和标点符号"""
    words = []
    for text in texts:
        words.extend([word for word, flag in pseg.lcut(text) 
                      if word not in stopwords and flag in ['nr', 'ns', 'nt', 'nz', 'nrt']])
    return words

def count_word_frequencies(words):
    """统计词频"""
    return Counter(words)

def save_to_csv(word_freq, output_file):
    """将词频统计结果保存到CSV文件"""
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Word', 'Frequency'])
        for word, freq in word_freq.most_common():
            writer.writerow([word, freq])

def main():
    directory = '.'  # 当前目录
    output_file = 'word_frequencies.csv'
    stopwords_file = 'hit_stopwords.txt'  # 停用词表文件路径
    
    # 加载停用词表
    stopwords = load_stopwords(stopwords_file)
    
    # 读取Markdown文件
    texts = read_md_files(directory)
    
    # 对文本进行分词，并过滤停用词和标点符号
    words = segment_texts(texts, stopwords)
    
    # 统计词频
    word_freq = count_word_frequencies(words)
    
    # 保存词频统计结果到CSV文件
    save_to_csv(word_freq, output_file)
    
    print(f'词频统计结果已保存到 {output_file}')

if __name__ == '__main__':
    main()

这样统计出来的词频基本是可用的了，但还是有些奇奇怪怪的东西，比如把“木户孝允”拆分成了“木”和“户孝允”，以及出现了单个的字，比如“王”之类的，对我来讲属于无效信息。

于是我想起来，我Logseq的笔记标题不就可以当知识点列表吗，于是写了个python小程序，从我logseq的pages文件夹中获取了所有的页面文件标题，并且把“.md”的文件拓展名去掉了，一行一个储存在“file_list.txt”文件中

import os

def list_files_in_directory(directory, output_file):
    # 打开输出文件
    with open(output_file, 'w', encoding='utf-8') as file:
        # 遍历指定目录
        for root, dirs, files in os.walk(directory):
            for name in files:
                # 写入文件名到输出文件，每个文件名占一行
                file.write(name + '\n')

if __name__ == "__main__":
    # 指定目录
    directory = r'{你的知识库路径}'
    # 输出文件路径
    output_file = r'file_list.txt'
    
    # 调用函数
    list_files_in_directory(directory, output_file)
    print(f"文件名已保存到 {output_file}")

得到这个知识点文件后，继续让AI完善代码

我：在这个基础上，同目录下有个叫file_list.txt的文件，是以换行符分割的名次列表，先检查这个文件，如果出现在这个文件中的名词，就不要分词了，作为一个整体的名词使用（可以忽略标点符号，只看中文汉字部分），请完善这个程序

AI：为了确保 file_list.txt 中的名词不被拆分，我们可以在加载自定义名词时指定词频和词性，以确保这些词在分词时被正确识别。我们还可以编写一个函数来清理文本中的标点符号，以便更好地处理这些名词。

以下是修改后的代码：

import os
import jieba
import jieba.posseg as pseg
import csv
import re
from collections import Counter

def read_md_files(directory):
    """读取指定目录下所有的Markdown文件内容"""
    texts = []
    for filename in os.listdir(directory):
        if filename.endswith('.md'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts

def load_stopwords(file_path):
    """加载停用词表"""
    with open(file_path, 'r', encoding='utf-8') as file:
        stopwords = set(line.strip() for line in file)
    return stopwords

def load_custom_words(file_path):
    """加载自定义名词列表"""
    with open(file_path, 'r', encoding='utf-8') as file:
        custom_words = [line.strip() for line in file if line.strip()]
    return custom_words

def add_custom_words_to_jieba(custom_words):
    """将自定义名词添加到jieba词典"""
    for word in custom_words:
        jieba.add_word(word, freq=1000000, tag='nz')  # 高词频和专有名词标签

def clean_text(text):
    """清理文本中的标点符号"""
    return re.sub(r'[^\w\s]', '', text)

def segment_texts(texts, stopwords):
    """对文本列表进行分词，并过滤停用词和标点符号"""
    words = []
    for text in texts:
        cleaned_text = clean_text(text)
        words.extend([word for word, flag in pseg.lcut(cleaned_text) 
                      if word not in stopwords and flag in ['nr', 'ns', 'nt', 'nz', 'nrt']])
    return words

def count_word_frequencies(words):
    """统计词频"""
    return Counter(words)

def save_to_csv(word_freq, output_file):
    """将词频统计结果保存到CSV文件"""
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Word', 'Frequency'])
        for word, freq in word_freq.most_common():
            writer.writerow([word, freq])

def main():
    directory = '.'  # 当前目录
    output_file = 'word_frequencies.csv'
    stopwords_file = 'hit_stopwords.txt'  # 停用词表文件路径
    custom_words_file = 'file_list.txt'  # 自定义名词列表文件路径
    
    # 加载停用词表
    stopwords = load_stopwords(stopwords_file)
    
    # 加载自定义名词列表
    custom_words = load_custom_words(custom_words_file)
    
    # 将自定义名词添加到jieba词典
    add_custom_words_to_jieba(custom_words)
    
    # 读取Markdown文件
    texts = read_md_files(directory)
    
    # 对文本进行分词，并过滤停用词和标点符号
    words = segment_texts(texts, stopwords)
    
    # 统计词频
    word_freq = count_word_frequencies(words)
    
    # 保存词频统计结果到CSV文件
    save_to_csv(word_freq, output_file)
    
    print(f'词频统计结果已保存到 {output_file}')

if __name__ == '__main__':
    main()

解释

高词频和专有名词标签：在 add_custom_words_to_jieba 函数中，将自定义名词添加到 jieba 词典时，设置了高词频（freq=1000000）和专有名词标签（tag='nz'），以确保这些名词不会被拆分。
清理文本中的标点符号：在 segment_texts 函数中，增加了 clean_text 函数，用于清理文本中的标点符号，以便更好地处理名词。

请确保在运行代码前，在项目目录中放置 hit_stopwords.txt 和 file_list.txt 文件。file_list.txt 文件中的每个名词应占一行。

然后就是让AI处理掉单个字的部分：

我：好，这个代码是可用的，在这个基础上，拆出来的单个字就不要列入词频统计了，更新代码

AI：好的，我们可以在统计词频时过滤掉单个字的词。以下是更新后的代码：

import os
import jieba
import jieba.posseg as pseg
import csv
import re
from collections import Counter

def read_md_files(directory):
    """读取指定目录下所有的Markdown文件内容"""
    texts = []
    for filename in os.listdir(directory):
        if filename.endswith('.md'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                texts.append(file.read())
    return texts

def load_stopwords(file_path):
    """加载停用词表"""
    with open(file_path, 'r', encoding='utf-8') as file:
        stopwords = set(line.strip() for line in file)
    return stopwords

def load_custom_words(file_path):
    """加载自定义名词列表"""
    with open(file_path, 'r', encoding='utf-8') as file:
        custom_words = [line.strip() for line in file if line.strip()]
    return custom_words

def add_custom_words_to_jieba(custom_words):
    """将自定义名词添加到jieba词典"""
    for word in custom_words:
        jieba.add_word(word, freq=1000000, tag='nz')  # 高词频和专有名词标签

def clean_text(text):
    """清理文本中的标点符号"""
    return re.sub(r'[^\w\s]', '', text)

def segment_texts(texts, stopwords):
    """对文本列表进行分词，并过滤停用词和标点符号"""
    words = []
    for text in texts:
        cleaned_text = clean_text(text)
        words.extend([word for word, flag in pseg.lcut(cleaned_text) 
                      if word not in stopwords and flag in ['nr', 'ns', 'nt', 'nz', 'nrt'] and len(word) > 1])
    return words

def count_word_frequencies(words):
    """统计词频"""
    return Counter(words)

def save_to_csv(word_freq, output_file):
    """将词频统计结果保存到CSV文件"""
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Word', 'Frequency'])
        for word, freq in word_freq.most_common():
            writer.writerow([word, freq])

def main():
    directory = '.'  # 当前目录
    output_file = 'word_frequencies.csv'
    stopwords_file = 'hit_stopwords.txt'  # 停用词表文件路径
    custom_words_file = 'file_list.txt'  # 自定义名词列表文件路径
    
    # 加载停用词表
    stopwords = load_stopwords(stopwords_file)
    
    # 加载自定义名词列表
    custom_words = load_custom_words(custom_words_file)
    
    # 将自定义名词添加到jieba词典
    add_custom_words_to_jieba(custom_words)
    
    # 读取Markdown文件
    texts = read_md_files(directory)
    
    # 对文本进行分词，并过滤停用词和标点符号
    words = segment_texts(texts, stopwords)
    
    # 统计词频
    word_freq = count_word_frequencies(words)
    
    # 保存词频统计结果到CSV文件
    save_to_csv(word_freq, output_file)
    
    print(f'词频统计结果已保存到 {output_file}')

if __name__ == '__main__':
    main()

解释

过滤单个字：在 segment_texts 函数中，添加了条件 len(word) > 1，以过滤掉单个字的词。

请确保在运行代码前，在项目目录中放置 hit_stopwords.txt 和 file_list.txt 文件。file_list.txt 文件中的每个名词应占一行。

这时候词频统计的质量就更高了，但是我还想进一步完善，就是只保留file_list.txt中有的知识点部分。

我：帮我写一个不依赖第三方库的python代码： 1、读取同目录下的”word_frequencies.csv“文件，它的结构是这样的：
Word,Frequency
日本,2157
朝鲜,896
2、读取同目录下的file_list.txt文件 3、过滤掉这两个文件中的标点符号，只对纯中文字符的部分进行对比 4、数字进行大小写转换，比如“八一七”和“817”要能够进行双向匹配 5、将word_frequencies.csv的Word字段与file_list.txt比对，Frequency不动，只留下file_list.txt中有的部分，将保留的结果保存到一个新的csv文件中请直接输出代码

AI：好的，这里是一个不依赖第三方库的Python代码，它可以完成你描述的任务：

import csv
import re

# 读取 word_frequencies.csv 文件
def read_word_frequencies(file_path):
    word_frequencies = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        for row in reader:
            word_frequencies[row['Word']] = int(row['Frequency'])
    return word_frequencies

# 读取 file_list.txt 文件
def read_file_list(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        file_list = file.read().splitlines()
    return file_list

# 过滤掉标点符号，只保留纯中文字符
def filter_punctuation(text):
    return re.sub(r'[^\u4e00-\u9fa5]', '', text)

# 数字大小写转换 (简单示例，只处理常见的数字)
def convert_numbers(text):
    num_map = {
        '零': '0', '一': '1', '二': '2', '三': '3', '四': '4', 
        '五': '5', '六': '6', '七': '7', '八': '8', '九': '9'
    }
    for ch, num in num_map.items():
        text = text.replace(ch, num)
    return text

# 处理和比对数据
def process_and_compare(word_frequencies, file_list):
    filtered_word_frequencies = {}
    for word, frequency in word_frequencies.items():
        filtered_word = filter_punctuation(word)
        converted_word = convert_numbers(filtered_word)
        for line in file_list:
            filtered_line = filter_punctuation(line)
            converted_line = convert_numbers(filtered_line)
            if filtered_word in filtered_line or converted_word in filtered_line or filtered_word in converted_line or converted_word in converted_line:
                filtered_word_frequencies[word] = frequency
                break
    return filtered_word_frequencies

# 保存结果到新文件
def save_to_csv(data, file_path):
    with open(file_path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Word', 'Frequency'])
        for word, frequency in data.items():
            writer.writerow([word, frequency])

# 主函数
def main():
    word_frequencies = read_word_frequencies('word_frequencies.csv')
    file_list = read_file_list('file_list.txt')
    filtered_word_frequencies = process_and_compare(word_frequencies, file_list)
    save_to_csv(filtered_word_frequencies, 'filtered_word_frequencies.csv')

if __name__ == '__main__':
    main()

这个代码将会：

读取 word_frequencies.csv 文件并提取单词及其频率。
读取 file_list.txt 文件。
过滤掉所有标点符号，只保留纯中文字符。
将中文数字转换为阿拉伯数字进行双向匹配。
比对 word_frequencies.csv 中的单词和 file_list.txt 中的内容，只保留匹配到的单词及其频率。
将结果保存到一个新的 filtered_word_frequencies.csv 文件中。

这个代码也是可用的，按照这个流程走下来，就可以得到一个完全匹配知识点的CSV词频统计文件。这个文件可以直接用excel打开，甚至另存为xlsx格式，进行数据分析。

以上。

梦的杂物间

笔记（7）|通过题库词频统计发现高频考点

前言

准备工作

折腾过程

1. 使用 `jieba` 分词库

2. 使用 `pkuseg` 分词库

3. 使用 `THULAC` 分词库

4. 自定义词典

总结

解释

解释

笔记（7）|通过题库词频统计发现高频考点

前言

准备工作

折腾过程

1. 使用 jieba 分词库

2. 使用 pkuseg 分词库

3. 使用 THULAC 分词库

4. 自定义词典

总结

解释

解释

1. 使用 `jieba` 分词库

2. 使用 `pkuseg` 分词库

3. 使用 `THULAC` 分词库