pynini in Depth: Weighted Finite-State Transducers for Text Processing

Published: 2026-05-01

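pynini is a Python library built on OpenFst for constructing and manipulating weighted finite-state transducers (WFSTs). It is widely used in text normalization (TN), inverse text normalization (ITN), and speech recognition post-processing.

This article walks through pynini's core operators and how to use them, with concrete examples.

I. Basic Concepts

What is a WFST?

A finite-state transducer (FST) is a finite-state automaton whose transitions carry both an input label and an output label, so instead of merely accepting or rejecting a string it maps an input string to an output string. A weighted FST (WFST) additionally attaches a weight to each transition, which makes it possible to rank competing outputs.

Input: "three hundred and twenty one"
  ↓ [WFST transduction]
Output: "321"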

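Installing pynini

# pynini depends on OpenFst, so building it from source is involved
# Installing with conda is recommended
conda install -c conda-forge pynini

# Alternatively, use pip; this relies on a prebuilt wheel being available for your platform
pip install pynini

Note: installing pynini on Windows is difficult; WSL or a Linux environment is recommended.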

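II. Basic Operators

1. `accep`: acceptor

The simplest kind of FST: input and output are identical. It is typically used to match a fixed string.

import pynini

# Build an acceptor for the string "hello"
fst = pynini.accep("hello")

# Test: print the single string the acceptor accepts
print(pynini.shortestpath(fst).string())
# Output: hello

FST structure: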

graph LR
    S((S)) -- h:h/0 --> N1((1))
    N1 -- e:e/0 --> N2((2))
    N2 -- l:l/0 --> N3((3))
    N3 -- l:l/0 --> N4((4))
    N4 -- o:o/0 --> F((5))
    
    style S fill:#e1f5fe
    style F fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N1,N2,N3,N4,F state;

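Note: S is the start state and 5 is the final state. Each transition is written as input:output/weight; in an acceptor the input and output labels are identical, and here every weight is 0.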
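
To make the input:output/weight notation concrete, here is a small plain-Python sketch that lists the arcs of a linear-chain acceptor. It does not call pynini; the helper name linear_acceptor_arcs and the tuple layout are illustrative assumptions, not part of any library.

```python
def linear_acceptor_arcs(text):
    """Return (arcs, final_state) for a linear-chain acceptor over text.

    Each arc is (source, input_label, output_label, weight, destination).
    Hypothetical helper for illustration; pynini stores arcs differently.
    """
    arcs = [(i, ch, ch, 0.0, i + 1) for i, ch in enumerate(text)]
    return arcs, len(text)


arcs, final_state = linear_acceptor_arcs("hello")
for src, ilabel, olabel, weight, dst in arcs:
    # Same input:output/weight format as the diagram above
    print(f"{src} -- {ilabel}:{olabel}/{weight:g} --> {dst}")
print("final state:", final_state)
```

A string of n characters yields n arcs and n + 1 states, which is why "hello" ends in state 5.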

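2. `string_file`: building from a file of strings

Unlike accep, string_file does not take the string itself. It takes the path to a tab-separated (TSV) file and compiles every line into one branch of a union: a line with one column becomes an acceptor path, and a line with two columns becomes an input-to-output mapping.

# Write a small TSV file: each line is "input<TAB>output"
with open("numbers.tsv", "w") as f:
    f.write("one\t1\ntwo\t2\n")

fst = pynini.string_file("numbers.tsv")
print(pynini.shortestpath(pynini.compose("two", fst)).string())
# Output: 2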

III. Combining Operators

3. `concat`: concatenation

Joins FSTs in sequence, the counterpart of string concatenation.

fst1 = pynini.accep("hello")
fst2 = pynini.accep(" ")
fst3 = pynini.accep("world")

# Concatenate the three FSTs
result = fst1 + fst2 + fst3
print(pynini.shortestpath(result).string())
# Output: hello world

FST structure:

graph LR
    S((S)) -- hello --> N6((6))
    N6 -- '  ' --> N7((7))
    N7 -- world --> F((12))
    
    style S fill:#e1f5fe
    style F fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N6,N7,F state;

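Note: the three FSTs are joined end to end. In this schematic each labeled edge stands for a chain of single-character arcs: S to 6 spells "hello", 6 to 7 is the space, and 7 to 12 spells "world".

The + operator is shorthand for pynini.concat(), which takes two FSTs at a time:

result = pynini.concat(pynini.concat(fst1, fst2), fst3)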

4. `union`: union

Builds an FST that accepts any one of several alternative paths, the counterpart of | in regular expressions.

fst1 = pynini.accep("cat")
fst2 = pynini.accep("dog")
fst3 = pynini.accep("bird")

# Build the union
result = fst1 | fst2 | fst3

# Try several inputs
for word in ["cat", "dog", "bird", "fish"]:
    try:
        output = pynini.shortestpath(pynini.compose(word, result)).string()
        print(f"{word} -> {output}")
    except pynini.FstOpError:
        # The composition is empty, so there is no path to print
        print(f"{word} -> [rejected]")

# Output:
# cat -> cat
# dog -> dog
# bird -> bird
# fish -> [rejected]

FST structure:

graph TD
    S((S)) -- c:c/0 --> N1((1))
    S -- d:d/0 --> N4((4))
    S -- b:b/0 --> N7((7))
    
    N1 -- a:a/0 --> N2((2))
    N2 -- t:t/0 --> F1((3))
    
    N4 -- o:o/0 --> N5((5))
    N5 -- g:g/0 --> F2((6))
    
    N7 -- i:i/0 --> N8((8))
    N8 -- r:r/0 --> N9((9))
    N9 -- d:d/0 --> F3((10))
    
    style S fill:#e1f5fe
    style F1 fill:#c8e6c9
    style F2 fill:#c8e6c9
    style F3 fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N1,N2,N4,N5,N7,N8,N9,F1,F2,F3 state;

Note: three independent paths branch from the start state S, matching "cat", "dog", and "bird" respectively; each path ends in a final state.

5. `closure`: closure

Builds a path that may repeat zero or more times, the counterpart of * in regular expressions.

# Build an FST that matches "a"
a = pynini.accep("a")

# Kleene closure (zero or more repetitions)
star = pynini.closure(a)

# Test
for s in ["", "a", "aa", "aaa", "b"]:
    try:
        output = pynini.shortestpath(pynini.compose(s, star)).string()
        print(f"'{s}' -> '{output}'")
    except pynini.FstOpError:
        print(f"'{s}' -> [rejected]")

# Output:
# '' -> ''
# 'a' -> 'a'
# 'aa' -> 'aa'
# 'aaa' -> 'aaa'
# 'b' -> [rejected]

FST structure:

graph LR
    S((S)) -- ε:ε/0 --> F((F))
    S -- a:a/0 --> S
    
    style S fill:#e1f5fe
    style F fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,F state;

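Notes:

The ε transition from S to F lets the FST reach a final state without consuming any input, which is the zero-repetition case.

The self-loop on S allows any number of repetitions.

Together these form the core of the Kleene star: zero times (take the ε arc directly) or many times (go around the loop first).

A lower bound can be passed as the second argument, pynini.closure(fst, lower), to require at least that many repetitions:

# At least 2 repetitions
plus2 = pynini.closure(a, 2)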
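
Since closure is described above as the counterpart of * in regular expressions, the same accept/reject behaviour can be checked with Python's re module. This is only an analogy for the languages involved, not a statement about how pynini builds the FST; the variable names are illustrative.

```python
import re

# Regex counterparts of the closure variants discussed above
star_pattern = re.compile(r"(?:a)*")            # zero or more repetitions
at_least_two = re.compile(r"(?:a){2,}")         # at least two repetitions

for s in ["", "a", "aa", "aaa", "b"]:
    # fullmatch mirrors FST acceptance: the whole string must be consumed
    print(repr(s), bool(star_pattern.fullmatch(s)), bool(at_least_two.fullmatch(s)))
```

The empty string and "a" are accepted by the star pattern but rejected once at least two repetitions are required.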

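6. `cross`: transduction

The operator that produces a true transducer: it maps an input string to a different output string.

# Build a transducer: input "one", output "1"
fst = pynini.cross("one", "1")

# Test
result = pynini.compose("one", fst)
print(pynini.shortestpath(result).string())
# Output: 1

FST structure: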

graph LR
    S((S)) -- o:1/0 --> N1((1))
    N1 -- n:ε/0 --> N2((2))
    N2 -- e:ε/0 --> F((3))
    
    style S fill:#e1f5fe
    style F fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N1,N2,F state;

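Note: unlike an acceptor, the first transition is o:1 (input 'o', output '1'), while the transitions for 'n' and 'e' consume a character and output nothing (ε). This is how "one" becomes "1". The diagram is schematic; the exact alignment of input and output symbols in the compiled FST may differ.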
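
The arc-by-arc behaviour in the diagram can be mimicked with a toy table-driven transducer in plain Python. This is a sketch of the idea only: it does not use pynini, it handles just the deterministic, unweighted case, and the names ARCS, FINAL_STATES, and transduce are made up for illustration.

```python
# (state, input symbol) -> (output string, next state); "" plays the role of ε
ARCS = {
    (0, "o"): ("1", 1),
    (1, "n"): ("", 2),
    (2, "e"): ("", 3),
}
FINAL_STATES = {3}


def transduce(text):
    """Follow the arcs for text; return the output, or None if rejected."""
    state, output = 0, []
    for symbol in text:
        if (state, symbol) not in ARCS:
            return None  # no matching arc: the input is rejected
        emitted, state = ARCS[(state, symbol)]
        output.append(emitted)
    # Accept only if the walk ends in a final state
    return "".join(output) if state in FINAL_STATES else None


print(transduce("one"))  # 1
print(transduce("on"))   # None (stops in a non-final state)
print(transduce("two"))  # None (no arc for "t")
```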

Multiple transduction rules:

# Several transduction rules combined with union
num_map = (
    pynini.cross("one", "1") |
    pynini.cross("two", "2") |
    pynini.cross("three", "3") |
    pynini.cross("four", "4") |
    pynini.cross("five", "5")
)

for word in ["one", "three", "five"]:
    result = pynini.compose(word, num_map)
    print(f"{word} -> {pynini.shortestpath(result).string()}")

# Output:
# one -> 1
# three -> 3
# five -> 5

Structure of the union transducer:

graph TD
    S((S)) -- o:1/0 --> N1((1))
    S -- t:2/0 --> N4((4))
    S -- t:3/0 --> N7((7))
    S -- f:4/0 --> N12((12))
    S -- f:5/0 --> N16((16))
    
    N1 -- n:ε/0 --> N2((2))
    N2 -- e:ε/0 --> F1((3))
    
    N4 -- w:ε/0 --> N5((5))
    N5 -- o:ε/0 --> F2((6))
    
    N7 -- h:ε/0 --> N8((8))
    N8 -- r:ε/0 --> N9((9))
    N9 -- e:ε/0 --> N10((10))
    N10 -- e:ε/0 --> F3((11))
    
    N12 -- o:ε/0 --> N13((13))
    N13 -- u:ε/0 --> N14((14))
    N14 -- r:ε/0 --> F4((15))
    
    N16 -- i:ε/0 --> N17((17))
    N17 -- v:ε/0 --> N18((18))
    N18 -- e:ε/0 --> F5((19))
    
    style S fill:#e1f5fe
    style F1 fill:#c8e6c9
    style F2 fill:#c8e6c9
    style F3 fill:#c8e6c9
    style F4 fill:#c8e6c9
    style F5 fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N1,N2,N4,N5,N7,N8,N9,N10,N12,N13,N14,N16,N17,N18,F1,F2,F3,F4,F5 state;

7. `invert`: inversion

Swaps the input and output sides of an FST.

# Original transducer: number word -> digit
forward = (
    pynini.cross("one", "1") |
    pynini.cross("two", "2") |
    pynini.cross("three", "3")
)

# Inverted: digit -> number word
backward = pynini.invert(forward)

# Test
result = pynini.compose("2", backward)
print(pynini.shortestpath(result).string())
# Output: two
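
One way to picture inversion is to model a transducer as a set of (input, output) pairs and swap each pair. The plain-Python sketch below does exactly that; it is an analogy rather than pynini code, and the helpers invert_pairs and outputs_for are hypothetical names.

```python
# A transducer's behaviour modelled as a set of (input, output) pairs
forward_pairs = {("one", "1"), ("two", "2"), ("three", "3")}


def invert_pairs(relation):
    """Swap the two sides of every pair."""
    return {(out, inp) for inp, out in relation}


def outputs_for(relation, text):
    """Return every output the relation associates with text, sorted."""
    return sorted(out for inp, out in relation if inp == text)


backward_pairs = invert_pairs(forward_pairs)
print(outputs_for(backward_pairs, "2"))

# Inverting a many-to-one mapping gives a one-to-many relation
many_to_one = {("old", "new"), ("young", "new")}
print(outputs_for(invert_pairs(many_to_one), "new"))
```

The second case is why a transducer is best thought of as a relation rather than a function: after inversion a single input may have several outputs.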

IV. Advanced Operators

8. `compose`: composition

Composes two FSTs so that the output of the first becomes the input of the second. This is the core operation for building multi-stage transduction pipelines.

# Step 1: number word -> digit
word_to_digit = (
    pynini.cross("one", "1") |
    pynini.cross("two", "2") |
    pynini.cross("three", "3")
)

# Step 2: digit -> Chinese numeral
digit_to_chinese = (
    pynini.cross("1", "一") |
    pynini.cross("2", "二") |
    pynini.cross("3", "三")
)

# Composition: number word -> Chinese numeral
full_pipeline = word_to_digit @ digit_to_chinese

for word in ["one", "two", "three"]:
    result = pynini.compose(word, full_pipeline)
    print(f"{word} -> {pynini.shortestpath(result).string()}")

# Output:
# one -> 一
# two -> 二
# three -> 三

FST composition pipeline:

graph LR
    Input["Input: one"] --> FST1["FST1: English → digit"]
    FST1 --> Mid["Intermediate: 1"]
    Mid --> FST2["FST2: digit → Chinese"]
    FST2 --> Output["Output: 一"]
    
    style Input fill:#e1f5fe
    style Output fill:#c8e6c9
    style FST1 fill:#fff3e0
    style FST2 fill:#fff3e0
    style Mid fill:#f3e5f5

Note: composition chains the two FSTs. The output of FST1 ("1") is fed to FST2 as its input, giving a direct mapping from English number words to Chinese numerals.
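
For one-to-one mappings, composition behaves like chaining two dictionaries. The following plain-Python sketch illustrates that analogy; it is not pynini code, compose_maps is a hypothetical helper, and the extra "four" entry is added only to show what happens when the second stage has no matching input.

```python
word_to_digit_map = {"one": "1", "two": "2", "three": "3", "four": "4"}
digit_to_chinese_map = {"1": "一", "2": "二", "3": "三"}


def compose_maps(first, second):
    """Chain two mappings, keeping only inputs whose intermediate value
    is accepted by the second mapping."""
    return {k: second[v] for k, v in first.items() if v in second}


pipeline_map = compose_maps(word_to_digit_map, digit_to_chinese_map)
print(pipeline_map)
```

"four" drops out of the result because "4" has no entry in the second mapping; in the same way, a composed FST only keeps paths on which the first FST's output matches the second FST's input.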

9. `project`: projection

Projects an FST onto its input side or its output side, turning a transducer into an acceptor.

fst = pynini.cross("hello", "你好")

# Project onto the input side (keep only the input labels)
input_proj = pynini.project(fst, "input")
print("Input projection:", pynini.shortestpath(input_proj).string())
# Output: hello

# Project onto the output side (keep only the output labels)
output_proj = pynini.project(fst, "output")
print("Output projection:", pynini.shortestpath(output_proj).string())
# Output: 你好

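10. `difference`: difference

Removes from the first FST everything the second FST accepts, i.e. the set difference of the two languages. The second argument must be an unweighted, ε-free, deterministic acceptor.

# An acceptor for two strings
alpha = pynini.accep("abc") | pynini.accep("xyz")

# What to exclude
exclude = pynini.accep("xyz")

# Difference: remove "xyz" from alpha
result = pynini.difference(alpha, exclude)

for s in ["abc", "xyz"]:
    try:
        output = pynini.shortestpath(pynini.compose(s, result)).string()
        print(f"{s} -> {output}")
    except pynini.FstOpError:
        print(f"{s} -> [rejected]")

# Output:
# abc -> abc
# xyz -> [rejected]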

11. `intersect`: intersection

Keeps only what both FSTs accept.

fst1 = pynini.accep("apple") | pynini.accep("banana") | pynini.accep("cherry")
fst2 = pynini.accep("banana") | pynini.accep("cherry") | pynini.accep("date")

result = pynini.intersect(fst1, fst2)

for fruit in ["apple", "banana", "cherry", "date"]:
    try:
        output = pynini.shortestpath(pynini.compose(fruit, result)).string()
        print(f"{fruit} -> {output}")
    except pynini.FstOpError:
        print(f"{fruit} -> [rejected]")

# Output:
# apple -> [rejected]
# banana -> banana
# cherry -> cherry
# date -> [rejected]
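
For acceptors over a finite list of strings, difference and intersect behave like Python's set operators on the accepted strings. The sketch below shows that analogy with plain sets; the variable names are illustrative and no pynini calls are involved.

```python
fst1_language = {"apple", "banana", "cherry"}
fst2_language = {"banana", "cherry", "date"}

# Intersection: strings accepted by both
print(sorted(fst1_language & fst2_language))

# Difference: strings accepted by the first but not the second
print(sorted(fst1_language - fst2_language))
```

The advantage of the FST versions is that they also work when the languages are infinite (for example, after closure), where an explicit set cannot be enumerated.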

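12. Replacement

Replacement in the style of str.replace can be expressed as a transducer. pynini.cdrewrite compiles context-dependent rewrite rules that apply anywhere inside a longer string; the simplified demo below uses a plain union of cross transducers instead, which only matches whole strings. (pynini.replace also exists, but it substitutes FSTs for nonterminal labels and is not a string-replacement function.)

# Replacement rules as a union of whole-string transducers
replace_rule = (
    pynini.cross("old", "new") |
    pynini.cross("young", "new")
)

for word in ["old", "young", "ancient"]:
    try:
        result = pynini.compose(word, replace_rule)
        print(f"{word} -> {pynini.shortestpath(result).string()}")
    except pynini.FstOpError:
        print(f"{word} -> [no match]")

# Output:
# old -> new
# young -> new
# ancient -> [no match]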

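13. `optimize`: optimization

Shrinks an FST by removing ε transitions and merging redundant states, which makes later operations faster.

# Build a somewhat larger FST
words = [pynini.accep(w) for w in ["apple", "application", "apply", "applet"]]
big_fst = pynini.union(*words)

print("States before optimization:", big_fst.num_states())

optimized = pynini.optimize(big_fst)
print("States after optimization:", optimized.num_states())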
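
To get a feel for where the savings come from, the plain-Python sketch below counts states for the same four words in two layouts: separate chains sharing only a start state, and a prefix tree in which common prefixes are merged. This is a simplified model under stated assumptions (it ignores the ε arcs that union introduces and the suffix merging that minimization can add), so the numbers pynini reports may differ.

```python
words = ["apple", "application", "apply", "applet"]

# Separate chains: one shared start state plus one state per character
chain_states = 1 + sum(len(w) for w in words)

# Prefix sharing: one state per distinct prefix (the empty prefix is the start)
prefixes = {w[:i] for w in words for i in range(len(w) + 1)}
trie_states = len(prefixes)

print("separate chains:", chain_states)
print("shared prefixes:", trie_states)
```

All four words share the prefix "appl", so merging it alone removes a large fraction of the states.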

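14. `rmepsilon`: removing ε transitions

Removes the empty (epsilon) transitions from an FST.

# closure introduces ε transitions (as do some other operations)
fst = pynini.closure(pynini.accep("a"))

# Remove the ε transitions
cleaned = pynini.rmepsilon(fst)
print(pynini.shortestpath(cleaned).string())
# Every string in a* has the same weight here, so any of them may be
# printed; typically this is the empty string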

V. Practical Examples

Example 1: Number normalization

Converts English number phrases to Arabic numerals (the spoken-to-written direction, usually called ITN).

import pynini

# Single-digit mapping
digit_map = pynini.union(
    pynini.cross("zero", "0"),
    pynini.cross("one", "1"),
    pynini.cross("two", "2"),
    pynini.cross("three", "3"),
    pynini.cross("four", "4"),
    pynini.cross("five", "5"),
    pynini.cross("six", "6"),
    pynini.cross("seven", "7"),
    pynini.cross("eight", "8"),
    pynini.cross("nine", "9"),
)

# Tens
tens_map = pynini.union(
    pynini.cross("twenty", "2"),
    pynini.cross("thirty", "3"),
    pynini.cross("forty", "4"),
    pynini.cross("fifty", "5"),
)

# "hundred" is simply deleted; place-value handling is omitted in this example
hundred = pynini.cross("hundred", "")

# Simple example: turn "twenty three" into "23"
# The space is deleted with cross(" ", ""); accep(" ") would copy it to the output
twenty_three = (
    pynini.cross("twenty", "2") +
    pynini.cross(" ", "") +
    pynini.cross("three", "3")
)

result = pynini.compose("twenty three", twenty_three)
print(pynini.shortestpath(result).string())
# Output: 23

FST structure:

graph LR
    S((S)) -- twenty:2/0 --> N6((6))
    N6 -- space:ε/0 --> N7((7))
    N7 -- three:3/0 --> F((12))
    
    style S fill:#e1f5fe
    style F fill:#c8e6c9
    
    classDef state fill:#fff3e0,stroke:#f57c00,stroke-width:2px;
    class S,N6,N7,F state;

Note: the full path is "twenty:2", then the space mapped to ε, then "three:3"; the input "twenty three" produces the output "23".

Example 2: Time normalization

# Convert "three thirty" to "3:30"
time_converter = (
    pynini.cross("three", "3") +
    pynini.cross(" ", ":") +
    pynini.cross("thirty", "30")
)

result = pynini.compose("three thirty", time_converter)
print(pynini.shortestpath(result).string())
# Output: 3:30

Example 3: Abbreviation expansion

# Expand common abbreviations to their full forms
abbreviations = pynini.union(
    pynini.cross("Dr.", "Doctor"),
    pynini.cross("Mr.", "Mister"),
    pynini.cross("Mrs.", "Missus"),
    pynini.cross("St.", "Street"),
    pynini.cross("Ave.", "Avenue"),
)

for abbr in ["Dr.", "Mr.", "St."]:
    result = pynini.compose(abbr, abbreviations)
    print(f"{abbr} -> {pynini.shortestpath(result).string()}")

# Output:
# Dr. -> Doctor
# Mr. -> Mister
# St. -> Street

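Example 4: Phone number formatting

# Format an all-digit phone number with separators
# For example: 13812345678 -> 138-1234-5678

digit = pynini.union(*[pynini.accep(d) for d in "0123456789"])
insert_dash = pynini.cross("", "-")  # consume nothing, output "-"

# 3 digits, dash, 4 digits, dash, 4 digits
format_rule = (
    pynini.closure(digit, 3, 3) + insert_dash +
    pynini.closure(digit, 4, 4) + insert_dash +
    pynini.closure(digit, 4, 4)
)

print(pynini.shortestpath(pynini.compose("13812345678", format_rule)).string())
# Output: 138-1234-5678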

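VI. Common Pitfalls and Notes

1. Handling spaces

By default pynini treats every byte of a string as one symbol, and a space is an ordinary symbol like any other. The two forms below build equivalent acceptors:

# One call: the space is just another symbol in the string
direct = pynini.accep("hello world")

# Piece by piece: same language, more verbose
pieces = pynini.accep("hello") + pynini.accep(" ") + pynini.accep("world")

What does need care is the output side: accep(" ") copies the space through, so use pynini.cross(" ", "") when the space should be deleted. Square brackets are the characters that need escaping in pynini strings (see pynini.escape).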

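2. Weights

pynini supports weighted FSTs, and weights can be used to rank alternatives:

from pynini.lib import pynutil

# Weighted transduction rules
# With the default tropical semiring, a smaller weight means higher priority
fst1 = pynutil.add_weight(pynini.cross("color", "colour"), 1.0)
fst2 = pynutil.add_weight(pynini.cross("color", "颜色"), 0.5)

result = fst1 | fst2
# shortestpath picks the path with the smallest total weight ("颜色" here)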
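
The "smaller weight wins" rule can be sketched in plain Python. The model below assumes tropical-semiring behaviour (weights along a path are added, and the best path is the one with the minimum total); the candidates list and the path_weight helper are illustrative names, not pynini API.

```python
# Each candidate path: (output string, list of arc weights along the path)
candidates = [
    ("colour", [1.0]),
    ("颜色", [0.5]),
]


def path_weight(arc_weights):
    """Tropical semiring: weights along a path are added."""
    return sum(arc_weights)


# The shortest path is the one with the minimum total weight
best_output, best_weights = min(candidates, key=lambda c: path_weight(c[1]))
print(best_output, path_weight(best_weights))
```

Because path weights add up, a rule with a small per-arc weight can still lose if its path contains many weighted arcs.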

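3. Handling large FSTs

For large FSTs, the recommendation is:

# Build first, optimize afterwards
big_fst = ...  # build the complex FST
# optimize() already removes ε transitions, so no separate rmepsilon pass is needed
optimized = pynini.optimize(big_fst)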

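4. Debugging tips

# Write the FST's structure to a Graphviz file (draw returns nothing, so there is nothing to print)
fst.draw("fst.dot")
# Render with graphviz: dot -Tpng fst.dot -o fst.png

# Inspect the FST's states and transitions
# Labels are integers (byte values by default; 0 is ε)
for state in fst.states():
    print(f"State {state}:")
    for arc in fst.arcs(state):
        print(f"  {arc.ilabel} -> {arc.olabel} (weight: {arc.weight})")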

VII. Complete Example: A Simple Text Normalization Pipeline

import pynini

class SimpleTN:
    """A simple text normalizer."""
    
    def __init__(self):
        # Digits map to themselves
        digits = pynini.union(
            *[pynini.accep(str(i)) for i in range(10)]
        )
        # Number words map to digits
        digits |= pynini.union(
            pynini.cross("zero", "0"),
            pynini.cross("one", "1"),
            pynini.cross("two", "2"),
            pynini.cross("three", "3"),
            pynini.cross("four", "4"),
            pynini.cross("five", "5"),
        )
        
        # Abbreviation mapping
        abbr = pynini.union(
            pynini.cross("Dr.", "Doctor"),
            pynini.cross("Mr.", "Mister"),
        )
        
        # Combine the rules
        self.fst = digits | abbr
    
    def normalize(self, text):
        """Normalize text; return it unchanged if no rule matches."""
        try:
            result = pynini.compose(text, self.fst)
            return pynini.shortestpath(result).string()
        except pynini.FstOpError:
            return text  # no rule matched the whole input

# Usage
tn = SimpleTN()
print(tn.normalize("Dr."))       # Output: Doctor
print(tn.normalize("one"))       # Output: 1
print(tn.normalize("three"))     # Output: 3

FST pipeline:

graph TD
    Input["Input text"] --> FST["SimpleTN FST"]
    
    FST --> Path1["Digit path"]
    FST --> Path2["Abbreviation path"]
    
    Path1 --> D1["0-9: 0-9"]
    Path1 --> D2["zero:0, one:1, two:2, ..."]
    
    Path2 --> A1["Dr.: Doctor"]
    Path2 --> A2["Mr.: Mister"]
    
    D1 --> Output["Output: normalized text"]
    D2 --> Output
    A1 --> Output
    A2 --> Output
    
    style Input fill:#e1f5fe
    style Output fill:#c8e6c9
    style FST fill:#fff3e0
    style Path1 fill:#f3e5f5
    style Path2 fill:#f3e5f5

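Note: SimpleTN contains two main branches (digits and abbreviations). The input is composed with the FST, and shortestpath selects the lowest-weight path among the matches.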
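
The SimpleTN class above only rewrites inputs that match a rule as a whole string. One hedged way to think about applying such rules to a sentence is token by token; the sketch below mimics that with a plain dictionary standing in for the FST. RULES and normalize_tokens are hypothetical names, and splitting on whitespace is an assumption made for illustration.

```python
# A dictionary standing in for the whole-string rules of the FST
RULES = {"Dr.": "Doctor", "Mr.": "Mister", "one": "1", "two": "2", "three": "3"}


def normalize_tokens(text):
    """Rewrite each whitespace-separated token that has a rule; keep the rest."""
    return " ".join(RULES.get(token, token) for token in text.split())


print(normalize_tokens("Dr. Smith has three cats"))
```

In pynini itself, sentence-level rewriting is more commonly built with cdrewrite or with a closure over token-level FSTs, which avoids a separate tokenization step.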

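VIII. Summary

Operator	Function	Regex or set analogue
`accep`	Accepts a fixed string	literal match
`concat` / `+`	Concatenation	`ab`
`union` / `|`	Union	`a|b`
`closure` / `.star`	Closure	`a*`
`cross`	Input-to-output transduction	substitution
`compose` / `@`	Pipeline composition	function composition
`invert`	Swaps input and output	inverse mapping
`project`	Projects onto one side	transducer to acceptor
`difference`	Difference	`a - b`
`intersect`	Intersection	`a & b`
`optimize`	Reduces states and arcs	none
`rmepsilon`	Removes ε transitions	none

The strength of pynini is that many simple FSTs can be combined, through the operators above, into complex transduction pipelines. With these basic operators in hand, you can incrementally build WFST systems for speech recognition post-processing, text normalization, machine translation, and similar tasks.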

2356

https://ysliu.cc/2026/05/01/pynini-detailed-explanation/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 2356 when reposting.
