Topic Modeling on the GeotechSet Dataset

1 Introduction

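Topic modeling is a technique for extracting hidden topics from large volumes of unstructured text. The challenge is to extract high-quality topics that are clear, well separated, and meaningful, which depends largely on the quality of text preprocessing and on the strategy used to find the optimal number of topics. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm. In essence, LDA looks for the distribution of keywords within documents and identifies topic content from how those keywords group together. In "LDA Topic Modeling: The Rocscience 2021 User Conference as an Example" we ran topic modeling with two algorithms, LdaModel and k-means. Building on that work, this note discusses recent developments in topic modeling.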
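Because topic quality depends so heavily on preprocessing, a minimal sketch of that step may help. It assumes English titles only (Chinese titles would need a word segmenter first), and the stop-word list and the function names `tokenize` and `bag_of_words` are hypothetical, chosen for illustration rather than taken from the earlier article.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real run would use a fuller one.
STOP_WORDS = {"of", "the", "a", "an", "by", "in", "to", "and", "with", "for", "on"}


def tokenize(title):
    """Lowercase a title and keep alphabetic tokens longer than two
    characters that are not stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


def bag_of_words(titles):
    """Return one term-count mapping per title."""
    return [Counter(tokenize(t)) for t in titles]


titles = [
    "Stability analysis of steep rock slopes",
    "Soil-Rock Slope Stability Analysis by Considering the Nonuniformity of Rocks",
]
bags = bag_of_words(titles)
for title, bag in zip(titles, bags):
    print(title, "->", dict(bag))
```

Term counts of this kind are the usual input to an LDA model; the sketch keeps "rock" and "rocks" as separate terms, so a stemming or lemmatization step would be a natural next refinement.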

[Figure 1]

rock slopes toppling failure stability analysis

Stability analysis of steep rock slopes

A Brief Overview of Rock Slope Stability Analysis Methods

Stability Analyses of Jointed Rock Slopes with Counter-tilted Failure

Soil-Rock Slope Stability Analysis by Considering the Nonuniformity of Rocks

intake slope


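2 Preparing the Small Dataset

This small dataset has three parts. The first is the Rock Mechanics subset of the GeotechSet dataset, with all document titles gathered into a single file, about 5,000 titles in total. The second is the titles of about 370 articles from this public account, mostly in Chinese. The third is a selection of documents on the Chuquicamata and Palabora mines. The total file size is about 730 KB.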
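Gathering the three parts into one corpus could look roughly like the sketch below. The function name `merge_titles` and the sample entries are illustrative assumptions, not the actual files; the sketch only shows concatenation with whitespace cleanup and removal of exact duplicates.

```python
def merge_titles(*sources):
    """Concatenate title lists, strip whitespace, and drop blank lines
    and exact duplicates while keeping the original order."""
    seen = set()
    merged = []
    for source in sources:
        for title in source:
            clean = title.strip()
            if clean and clean not in seen:
                seen.add(clean)
                merged.append(clean)
    return merged


# Illustrative stand-ins for the three parts of the dataset.
rock_mechanics = [
    "Stability analysis of steep rock slopes",
    "Modelling progressive failure in fractured rock masses",
]
blog_titles = ["岩石边坡稳定性分析方法简述", "   "]
mine_documents = ["Stability analysis of steep rock slopes"]  # duplicate on purpose

corpus = merge_titles(rock_mechanics, blog_titles, mine_documents)
print(len(corpus), "unique titles")
print("\n".join(corpus))
```

Keeping one title per line makes it easy to feed the merged file to an embedding model later, with each line treated as one short document.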

[Figure 2]


3 Modeling the Small Dataset

This run uses the distiluse-base-multilingual-cased embedding model.

(1) About 100 topics were generated in total;

(2) The 50 words most related to 'discontinuity':

'discontinuities', 'discontinuity', 'fracturing', 'discontinuous', 'fractured', 'displacements', 'fractures', 'fracture', 'continuous', 'persistent', 'displacement', 'instability', 'dilation', 'deformation', 'discrete', 'defects', 'fragmentation', 'limitations', 'uncertainty', 'subsidence', 'propagation', 'progressive', 'collapse', 'continuum', 'intensity', 'persistence', 'friction', 'disturbance', 'overburden', 'excavation', 'quantitative', 'stochastic', 'faults', 'flexural', 'finite', 'density', 'seismic', 'failures', 'strain', 'dilution', 'residual', 'dependent', 'intact', 'equilibrium', 'reduction', 'spacing', 'width', 'cracking', 'uniaxial', 'ratio'
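A word list like this is typically obtained by ranking vocabulary words by the cosine similarity of their embeddings to the query word. The sketch below illustrates only that ranking step; the three-dimensional vectors are made-up toy values, not output of the distiluse-base-multilingual-cased model, and `most_similar` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, top_n=3):
    """Rank every other word by cosine similarity to the query word."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy embeddings, invented for illustration only.
toy_vectors = {
    "discontinuity": [0.9, 0.1, 0.0],
    "fracture": [0.8, 0.2, 0.1],
    "friction": [0.3, 0.8, 0.1],
    "ratio": [0.0, 0.2, 0.9],
}

ranked = most_similar("discontinuity", toy_vectors)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

The same ranking applies unchanged when the vectors are embeddings of whole titles rather than single words, which is one way to retrieve the documents closest to a keyword or topic.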

[Figure 3]

(3) The documents in the dataset most relevant to the topic:

[1] Numerical Simulation of Fractured Rock Mass Behavior- Explicit Modeling of Joints

Numerical modelling of slope uncertainty due to rock mass jointing

[2] Application of the numerical manifold method to model progressive failure in rock slopes

[3] Numerical modelling of brittle fracture and step-path failure- From laboratory to rock slope scale

[4] An investigation into the development of toppling at the edge of fractured rock plateaux using a numerical modelling approach

[5] Modelling progressive failure in fractured rock masses

[6] Numerical modelling of the flexural deformation of foliated rock slopes

[7] An investigation of the development of secondary toppling phenomena at the edges of a fractured rock plateau using a numerical modelling approach

[8] Simulation of Toppling Failure of Rock Slope by Numerical Manifold Method

Numerical modelling of brittle rock failure

[Figure 4]

(4) Words semantically related to 'slope':

slopes, slip, sliding, ramp, landslide, wedge, shear, slide, caving, landslides, overburden, strain, fracturing, srk, barrier, shallow, collapse, toppling, spacing, valley


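4 Closing Remarks

Using a small test dataset, this note briefly summarizes some work related to topic modeling, including finding the top topics in a dataset and querying the sentences related to a topic or keyword. For the full GeotechSet dataset (currently 153 MB), run time is still a challenge; splitting it by theme into smaller datasets is more effective in terms of both time and quality control. Incidentally, Transformers was updated to v4.9.2 today.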
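Splitting a large title collection by theme before modeling could be sketched as follows. The theme names and keyword lists are hypothetical, and plain substring matching is used only as a simple stand-in for whatever criterion actually defines the subsets.

```python
from collections import defaultdict

# Hypothetical themes and keywords, for illustration only.
THEMES = {
    "slope": ("slope", "toppling", "landslide"),
    "fracture": ("fracture", "joint", "discontinuity"),
}


def split_by_theme(titles, themes=THEMES):
    """Assign each title to every theme whose keyword it contains;
    titles matching no theme go to 'other'."""
    subsets = defaultdict(list)
    for title in titles:
        lowered = title.lower()
        matched = [
            name
            for name, keywords in themes.items()
            if any(keyword in lowered for keyword in keywords)
        ]
        for name in matched or ["other"]:
            subsets[name].append(title)
    return dict(subsets)


titles = [
    "Simulation of Toppling Failure of Rock Slope by Numerical Manifold Method",
    "Modelling progressive failure in fractured rock masses",
    "Numerical modelling of brittle rock failure",
]
subsets = split_by_theme(titles)
for name, members in subsets.items():
    print(name, len(members))
```

Each subset can then be modeled on its own, which keeps individual runs short and makes the resulting topics easier to inspect.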
