Keyword Extraction: How PyTextRank and spaCy Work

Computational Geomechanics

July 26, 2021, 13:58

1 Introduction

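Because I have a research report to write, most of this summer's posts will deal with natural language processing. PyTextRank (see "PyTextRank: Automatic Extraction of Text Keywords") is an extension to the spaCy pipeline for graph-based natural language processing, for building knowledge graphs in practice, and for extracting key phrases and summaries. Its basic procedure is to first use spaCy to extract the noun phrases from a text and then rank those phrases with the TextRank algorithm. This note examines the results produced by PyTextRank and by spaCy to decide whether spaCy still needs to be called separately for the keyword-extraction step, so that the code can be simplified. The libraries and models used in the test are: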

pytextrank V3.1.1

spaCy V3.0.6 (latest release: V3.1.1)

en_core_web_md (V3.0.0 7/23/2021)

en_core_web_lg (V3.0.0 7/23/2021)
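The two-stage procedure described in the introduction (collect candidate phrases, then rank them on a graph) can be illustrated with a small pure-Python sketch. This is a simplified illustration of the TextRank idea, not PyTextRank's actual implementation: the function names, the co-occurrence window, and the phrase score (a plain average of word ranks) are assumptions chosen for brevity.

```python
from collections import defaultdict


def build_graph(sentences, window=3):
    """Undirected co-occurrence graph.

    Each word is linked to the following `window - 1` words
    of the same sentence.
    """
    graph = defaultdict(set)
    for words in sentences:
        for i, word in enumerate(words):
            for other in words[i + 1 : i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        new_rank = {}
        for node in graph:
            # Every neighbour shares its rank equally among its own neighbours.
            incoming = sum(rank[nb] / len(graph[nb]) for nb in graph[node])
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


def rank_phrases(phrases, rank):
    """Order candidate phrases by the average rank of their words (an assumption)."""
    def score(phrase):
        words = phrase.split()
        return sum(rank.get(w, 0.0) for w in words) / len(words)

    return sorted(phrases, key=score, reverse=True)


# Toy input: pre-tokenised, lower-cased content words (hypothetical data).
sentences = [
    ["numerical", "models", "structures"],
    ["major", "structures", "intersections"],
    ["explicit", "structures", "model"],
]
ranks = pagerank(build_graph(sentences))
print(rank_phrases(["explicit model", "major structures"], ranks))
```

In this toy input the word "structures" co-occurs with every other word, so phrases that contain it move to the top of the ranking.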

[Figure 1]

2 Text Preparation

The test uses the following English passage.

text="Analyses by numerical methods are performed using the Fast Langrangian Analysis of Continua (FLAC), FLAC3D, Universal Distinct Element Code (UDEC), and 3DEC computer codes. From 1994 to 1997, FLAC was the most commonly used software for slope-stability analysis. In order to achieve a better representation of the real conditions, it was necessary to include explicitly in the model numerous major structures with several intersections. As the number of these explicit structures and their intersections increased, it was more and more difficult to construct the model. Due to this and the need to include explicitly all major structures, in 1998 the numerical analyses began to be done using UDEC, which allows an easier “handling” of the structures. In certain special cases, three-dimensional numerical models are used. Due to the larger engineering resources required by these three-dimensional models, their use is less frequent than the two-dimensional models. In 1998, 3DEC was used to develop a three-dimensional model of the southern sector of the Chuquicamata Mine. This was used, together with two-dimensional models and in situ observations, to predict the evolution of the subsidence that will affect the sector from 1999 to 2008."

3 PyTextRank Results

This test (geotech-PyTextRank.py) used the en_core_web_lg model (741 MB) and extracted 25 key phrases in total. The ten highest-ranked phrases are:

numerous major structures
several intersections
numerical methods
situ observations
the Chuquicamata Mine
three-dimensional numerical models
two-dimensional models
all major structures
the southern sector
slope-stability analysis

The results from en_core_web_sm and en_core_web_lg were also compared, and no significant difference was found.
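A run like the one above might look roughly as follows. This is a hedged sketch and not the contents of geotech-PyTextRank.py, which is not shown in this post; the helper names build_pipeline and top_phrases are hypothetical. It assumes PyTextRank 3.x, where importing pytextrank registers a "textrank" pipeline component and the ranked phrases are exposed as doc._.phrases.

```python
def build_pipeline(model_name="en_core_web_lg"):
    """Load a spaCy model and append the "textrank" component.

    The imports are kept inside the function so that the module can be
    loaded even where spaCy and PyTextRank are not installed.
    """
    import spacy
    import pytextrank  # noqa: F401  (the import registers the "textrank" factory)

    nlp = spacy.load(model_name)
    nlp.add_pipe("textrank")
    return nlp


def top_phrases(doc, limit=10):
    """Return (text, rank, count) tuples for the highest-ranked phrases."""
    ranked = sorted(doc._.phrases, key=lambda p: p.rank, reverse=True)
    return [(p.text, round(p.rank, 4), p.count) for p in ranked[:limit]]


# Usage, assuming the model has been downloaded and `text` is defined:
#   nlp = build_pipeline("en_core_web_lg")
#   for row in top_phrases(nlp(text)):
#       print(row)
```

Switching the model_name argument between models is enough to repeat the comparison between model sizes.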

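4 spaCy Results

Loading the same model in spaCy yields the same noun phrases as PyTextRank, which shows that PyTextRank does not process the phrases produced by spaCy any further, apart from ranking them. spaCy exposes these phrases through doc.noun_chunks, which iterates over the base noun phrases in the document and, provided the document has been syntactically parsed, yields them as Span objects. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it, so there is no NP-level coordination, no prepositional phrase, and no relative clause.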
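One way to make this comparison explicit is to diff the two sets of phrase texts. The sketch below uses hypothetical helper names; it assumes a doc produced by a pipeline that includes PyTextRank's "textrank" component, and that each ranked phrase carries its source spans in a chunks attribute, as in PyTextRank 3.x.

```python
def noun_chunk_texts(doc):
    """Surface text of every base noun phrase found by spaCy."""
    return {chunk.text for chunk in doc.noun_chunks}


def phrase_texts(doc):
    """Surface text of every span behind PyTextRank's ranked phrases."""
    return {span.text for phrase in doc._.phrases for span in phrase.chunks}


def compare(doc):
    """Report phrases that appear on one side only."""
    chunks, phrases = noun_chunk_texts(doc), phrase_texts(doc)
    return {
        "only_in_noun_chunks": chunks - phrases,
        "only_in_phrases": phrases - chunks,
    }


# Usage, assuming `nlp` includes the "textrank" component:
#   print(compare(nlp(text)))
```

Two empty sets would mean both routes produced the same phrase texts for the passage.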

doc = nlp(text)

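The Doc class is a container for accessing linguistic annotations. In addition, the Token class carries a part-of-speech tag for each token; filtering on token.pos_ == "VERB" gives the following deduplicated list of verbs in this text (in base form): ['achieve', 'affect', 'allow', 'begin', 'construct', 'develop', 'do', 'include', 'increase', 'perform', 'predict', 'require', 'use'].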
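A list like this can be produced with a short filter over the tokens. The sketch below assumes the list holds sorted lemmas (token.lemma_), which matches the base forms shown above but is not stated explicitly; the function name is hypothetical.

```python
def unique_verb_lemmas(tokens):
    """Sorted, de-duplicated lemmas of all tokens tagged as VERB."""
    return sorted({tok.lemma_ for tok in tokens if tok.pos_ == "VERB"})


# Usage with a parsed document:
#   doc = nlp(text)
#   print(unique_verb_lemmas(doc))
```

Collecting the lemmas in a set removes repeated verbs such as "used", which occurs several times in the passage.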

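spaCy's named-entity recognition (doc.ents) assigns several of the software names to ORG, while others receive different labels (for example, FLAC3D is tagged CARDINAL and 3DEC is tagged NORP). Custom entity labels can be defined in code, which will be covered in detail in a later post.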

the Fast Langrangian Analysis ORG

Continua PERSON

FLAC3D CARDINAL

UDEC ORG

3DEC NORP

1994 to 1997 DATE

1998 DATE

UDEC ORG

three CARDINAL

two CARDINAL

1998 DATE
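The listing above can be reproduced by iterating over doc.ents. As a preview of the customization mentioned earlier, the sketch below also shows one possible route: spaCy's built-in "entity_ruler" component, placed before the statistical "ner" component. The SOFTWARE label, the helper names, and the choice of the entity ruler are assumptions for illustration, not necessarily the approach the later post will take.

```python
def list_entities(doc):
    """Return (text, label) pairs for every entity in the document."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def add_software_ruler(nlp, names, label="SOFTWARE"):
    """Add exact-match patterns so the given names get a custom label.

    Assumes `nlp` is a spaCy v3 pipeline that already contains "ner";
    entities set by the ruler are then kept by the statistical model.
    """
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    ruler.add_patterns([{"label": label, "pattern": name} for name in names])
    return nlp


# Usage, assuming `nlp = spacy.load("en_core_web_lg")`:
#   add_software_ruler(nlp, ["FLAC", "FLAC3D", "UDEC", "3DEC"])
#   for ent_text, ent_label in list_entities(nlp(text)):
#       print(ent_text, ent_label)
```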