Improvements to the Markov-Chain (markovify) Text Generation Code

1 Introduction

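Automatic text generation is a very interesting research area in natural language processing. There are currently two main ways to do it. The first is deep learning; a typical example is the Transformers "text-generation" pipeline. Its theoretical basis is causal language modeling, the default model is GPT-2, and it uses Top-K sampling (see the note "Open-Ended Text Generation"). aitextgen, which builds on this, is somewhat more capable, but it did not seem possible to train it on my own data on a local machine, for reasons that are unclear, so only Colab could be used. The second is the Markov chain (see the note "Generating New Documents Randomly with a Markov Chain"). This note briefly records the improvements made to geotech-markovify-text-generation.py, which raise the quality of the generated sentences.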


2 Improvements

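Although the Transformers deep-learning approach uses a large model, GPT-2, tests show that these models do not give satisfactory results for our specialised field. The main reason is that they contain no domain knowledge base, so the generated sentences are disordered and illogical. This is also the main reason for putting effort into improving the Markov chain. On the other hand, a large, heterogeneous dataset cannot produce reasonable, strongly logical sentences, whereas a dataset with a clear topic more readily produces meaningful ones. The first improvement therefore merges in the algorithm from geotech-flashtext-passages.py: topic keywords are used to build a small aggregated dataset, which then serves as the input file for the Markov chain.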
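The keyword-based aggregation step can be sketched as follows. This is only an illustration of the idea, not the code in geotech-flashtext-passages.py: the function name `select_passages`, the `min_hits` parameter, and the sample passages are all hypothetical, and plain regular expressions stand in for the flashtext keyword matcher.

```python
import re


def select_passages(passages, keywords, min_hits=1):
    """Keep passages that mention at least `min_hits` distinct topic keywords.

    Hypothetical helper: a stand-in for the keyword matching done in the
    original script, written with the standard library only.
    """
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    selected = []
    for passage in passages:
        # Count how many distinct keywords occur in this passage.
        hits = sum(1 for pattern in patterns if pattern.search(passage))
        if hits >= min_hits:
            selected.append(passage)
    return selected


# Made-up passages, for illustration only.
passages = [
    "Rock slope failure is often controlled by discontinuities.",
    "The pile foundation was designed for axial load.",
    "Step-path failure involves intact rock bridges in a rock slope.",
]
topic_corpus = select_passages(passages, ["rock slope", "step-path failure"])
```

The resulting `topic_corpus` would then be written to a file and used as the Markov chain input, in place of the full mixed corpus.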


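The second improvement adds a text-cleaning subroutine that removes messy structure from the file, including blank lines, meaningless characters, and sentences shorter than a given length.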
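A minimal sketch of such a cleaning step is shown below. The function name `clean_text`, the `min_words` threshold, and the decision to treat each line as one sentence are assumptions made for illustration; the original subroutine may define "meaningless characters" and "length" differently.

```python
import re


def clean_text(raw, min_words=5):
    """Drop blank lines, non-printable debris, and very short lines.

    Hypothetical helper: assumes one sentence per line and measures
    length in words.
    """
    cleaned = []
    for line in raw.splitlines():
        # Remove control characters but keep whitespace for the next step.
        line = "".join(ch for ch in line if ch.isprintable() or ch.isspace())
        # Collapse tabs and repeated spaces into single spaces.
        line = re.sub(r"\s+", " ", line).strip()
        if not line:
            continue  # blank line
        if len(line.split()) < min_words:
            continue  # too short to be a useful sentence
        cleaned.append(line)
    return "\n".join(cleaned)


raw = (
    "Slope stability depends on joint orientation.\n"
    "\n"
    "\x07\n"
    "Fig. 3\n"
    "The   wedge slides along\tthe line\x07 of intersection.\n"
)
cleaned = clean_text(raw)
```

Here the blank line, the line containing only a control character, and the two-word fragment "Fig. 3" are dropped, while the two full sentences are kept with their whitespace normalised.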


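The third improvement adds two classes to the code, POSifiedText_Spacy and POSifiedText_NLTK, which improve on the plain markovify.Text method. POSifiedText_Spacy uses the latest en_core_web_lg model. The advantage of this change is a marked improvement in the quality of the generated sentences. The drawback is slower run time on large datasets, especially with POSifiedText_Spacy: in a test on a 40M dataset, training took close to 50 minutes.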
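The markovify documentation describes extending markovify.Text by overriding `word_split` and `word_join` so that each token carries its part-of-speech tag; the two classes above presumably follow that pattern. The sketch below reproduces the idea without any dependencies. The `toy_tag` function and its tiny lexicon are hypothetical stand-ins for a real NLTK or spaCy tagger, and `build_chain` is a simplified first-order chain, not markovify's implementation.

```python
from collections import defaultdict

# Hypothetical mini-lexicon; a real tagger (NLTK, spaCy) would replace this.
LEXICON = {
    "the": "DET",
    "predicts": "VERB",
    "failure": "NOUN",
    "engineers": "NOUN",
    "slope": "NOUN",
}


def toy_tag(words):
    """Tag words; 'model' is a noun after 'the' and a verb otherwise."""
    tagged = []
    for i, word in enumerate(words):
        if word == "model":
            tag = "NOUN" if i > 0 and words[i - 1] == "the" else "VERB"
        else:
            tag = LEXICON.get(word, "X")
        tagged.append((word, tag))
    return tagged


def word_split(sentence):
    """Turn a sentence into 'word::TAG' tokens, as in the POSified pattern."""
    return ["::".join(pair) for pair in toy_tag(sentence.split())]


def word_join(tokens):
    """Strip the tags again to recover plain text."""
    return " ".join(token.split("::")[0] for token in tokens)


def build_chain(sentences):
    """Map each tagged token to the tagged tokens that follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        tokens = word_split(sentence)
        for current, following in zip(tokens, tokens[1:]):
            chain[current].append(following)
    return chain


chain = build_chain(["the model predicts failure", "engineers model the slope"])
```

Because "model" as a noun and "model" as a verb become different states, their successors are no longer mixed, which is one plausible reason the tagged variants produce more grammatical sentences. The cost is that every sentence must pass through the tagger during training, consistent with the longer run time noted above.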


The current code therefore includes three sentence-generation methods. If each method is set to produce 5 sentences, a single run yields 15 sentences.


3 Test Example

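As a test example, a small dataset is first aggregated around the topic "rock slope failure", and then geotech-markovify-text-generation.py is run. One small addition is that, given a word, the code can list every word that immediately follows it in the corpus. For example, the nouns that follow "stability" are: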

analysis

issues

assessment

conditions

evaluation

curves

problem

prospective

approaches

field

charts

calculations

computations

models
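A lookup of this kind can be sketched with a simple bigram count. The function name `next_words` and the sample text are hypothetical, and the sketch lists all following words rather than only nouns; restricting the result to nouns, as in the list above, would additionally need a part-of-speech tagger.

```python
import re
from collections import Counter


def next_words(text, target):
    """Count the words that immediately follow `target` within a sentence."""
    target = target.lower()
    counts = Counter()
    # Split on sentence-ending punctuation so pairs never span two sentences.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z][a-z'-]*", sentence.lower())
        for current, following in zip(words, words[1:]):
            if current == target:
                counts[following] += 1
    return counts


# Made-up sample text, for illustration only.
sample = (
    "Stability analysis of the slope was performed. "
    "The stability assessment used stability charts. "
    "Kinematic stability analysis followed."
)
followers = next_words(sample, "stability")
```

Calling `followers.most_common()` then gives the following words ordered by frequency.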

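This feature is useful both as a teaching aid and as support for writing papers. Next, some sentences related to "stability analysis" are generated. Each method is set to produce 5 sentences, giving 15 sentences in total.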
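Generating sentences that begin with a chosen phrase can be sketched with a small second-order Markov chain. markovify provides `make_sentence_with_start` for this purpose, although the note does not say whether the script uses it; the code below is a dependency-free illustration with hypothetical names (`build_model`, `generate`) and a made-up three-sentence corpus.

```python
import random
from collections import defaultdict

END = None  # marker for "sentence ends here"


def build_model(sentences, state_size=2):
    """Map each run of `state_size` words to the words seen after it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - state_size + 1):
            state = tuple(words[i:i + state_size])
            end_index = i + state_size
            following = words[end_index] if end_index < len(words) else END
            model[state].append(following)
    return model


def generate(model, start, rng, max_words=40):
    """Random-walk the chain from the state `start` until END or max_words."""
    words = list(start)
    state = tuple(start)
    while len(words) < max_words:
        followers = model.get(state)
        if not followers:
            break  # the start phrase never occurs in the corpus
        following = rng.choice(followers)
        if following is END:
            break
        words.append(following)
        state = state[1:] + (following,)
    return " ".join(words)


# Made-up corpus, for illustration only.
corpus = [
    "stability analysis of rock slopes requires discontinuity data",
    "stability analysis of the east wall was performed",
    "the stability analysis used limit equilibrium methods",
]
model = build_model(corpus)
rng = random.Random(0)  # fixed seed so repeated runs give the same walk
sentences = [generate(model, ("stability", "analysis"), rng) for _ in range(5)]
```

Each generated sentence starts with "stability analysis" and continues only along word sequences that occur in the corpus, which is why a focused corpus matters so much for this approach.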

[Figure 1]

According to the word cloud, the top keywords in this dataset are: step-path, rock bridge, path failure, rock slope, rock mass, failure mode, intact rock

[1] stability analysis through consideration of the role of stress-induced damage on slope performance.


[2] stability analysis is a conceptual illustration of possible rock slope investigations and finds application in a 3D to a stress-dependent failure mechanism is of great interest in rock slopes --- As large open pit mine.


[3] stability analysis is statically indeterminate and the overall block stability was assessed for 12 metre bench heights using planar and wedge failures.


[4] stability analysis of rock slopes, it is becoming increasingly necessary to consider the interaction between intact rock bridge content or percentage remains one of the open pit, notably in terms of expected breakback angles using a stiff modular applied static loading to fulfill visual excavation to the unfavourable orientations of discontinuities.


[5] stability analysis of planar, wedge and stepped path failures were presented in terms of these limitations with respect to block forming potential and kinematics.


[6] stability analysis , performed using the hybrid FDEM code , ELFEN with fracture mechanics criteria , is moving under the assumption of fully continuous lateral releases , or whether the planes are located so that they actually intersect behind the slope along the line of intersection .


[7] stability analysis used the overlay linear - element process based on the determination of relationships between tension cracks on the stability of rock mass was relatively poor , the dip , dip direction , nature and type of joint coalescence is considered conservative compared to the intersection of the current geographic condition the stability of rock slope instability provided enough block size in the orthogneiss rock unit Two new stereographic projection methods in the model simulations .


[8] stability analysis --- The importance of 3D step - path discontinuities and intact rock fractures and step - path failure are presented in Chapter 7 where step - path failure are important for the East Wall are going to be performed .


[9] stability analysis package using the limit equilibrium methods exist incorporating step - path failure .


[10] stability analysis for the idealised slope geometry .


[11] stability analysis is statically indeterminate and the collapse manifold was planar or wedge failure.


[12] stability analysis of rock slopes---Field data collection of the slope because, for example, a wedge resting on two intersecting discontinuities is of great interest in rock slopes---Wedge analyses for sandstones and quartzites have also been carried out using the SWEDGE software, allowed the identification of fracture propagation using both the hybrid code ELFEN in modelling and highly affects the stability of the rock mass dilation in facilitating the slope toe leading to failure of rock slopes.


[13] stability analysis package using the wedge stability with wedge failure and should be aware of these methods, and rock mass through fracture initiation, propagation and coalescence.


[14] stability analysis and slope monitoring data emphasising the control of fracture initiation and propagation.


[15] stability analysis tool Universal Distinct Element Code Visage is a conceptual large open pit slopes.


4 Conclusion

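This note records the main improvements made to geotech-markovify-text-generation.py. The quality of the generated sentences is considerably better than with the previous method, but the algorithm still needs further work, for example automatically identifying the grammatical relations in each generated sentence and correcting those that are wrong.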

