Lucene 4 Finite State Automata in 10 minutes (Intro & Tutorial)
FSA is a computational model whereby input symbols are read by the computer — the automaton — to drive a state machine. The state machine is simply a graph with nodes and labeled, directed edges. Each node in the graph is a state, and each edge is a transition labeled with a potential input symbol. By matching the current node’s edges to the current input symbol, the automaton follows edges to the next state. The next input symbol is read, and based on the transitions of the new state, we transition to yet another state, so-on and so-forth.
In Practice — An Automaton as a data structure
When compared to a HashSet or TreeSet the memory representation (can be) much, much smaller, with very fast lookups. Moreover, Automata give you fuzzier features like regular expression matching.
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
Finite State Transducers 详解
FST 可以表示成FST<Key, Value>的形式,我们可以用O(length(key))的复杂度,找到key所对应的值。除此之外,FST 还支持用Value来查找key以及查找Value最优的key等功能。
FST 是一种类似于Trie或自动机的数据结构
http://www.cnblogs.com/LBSer/p/4119841.html
FSA is a computational model whereby input symbols are read by the computer — the automaton — to drive a state machine. The state machine is simply a graph with nodes and labeled, directed edges. Each node in the graph is a state, and each edge is a transition labeled with a potential input symbol. By matching the current node’s edges to the current input symbol, the automaton follows edges to the next state. The next input symbol is read, and based on the transitions of the new state, we transition to yet another state, so-on and so-forth.
In Practice — An Automaton as a data structure
When compared to a HashSet or TreeSet the memory representation (can be) much, much smaller, with very fast lookups. Moreover, Automata give you fuzzier features like regular expression matching.
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
Finite State Transducers 详解
FST 可以表示成FST<Key, Value>的形式,我们可以用O(length(key))的复杂度,找到key所对应的值。除此之外,FST 还支持用Value来查找key以及查找Value最优的key等功能。
FST 是一种类似于Trie或自动机的数据结构
http://www.cnblogs.com/LBSer/p/4119841.html
怎么实现一个字典呢?我们马上想到排序数组,即term字典是一个已经按字母顺序排序好的数组,数组每一项存放着term和对应的倒排文档id列表。每次载入索引的时候只要将term数组载入内存,通过二分查找即可。这种方法查询时间复杂度为Log(N),N指的是term数目,占用的空间大小是O(N*str(term))。排序数组的缺点是消耗内存,即需要完整存储每一个term,当term数目多达上千万时,占用的内存将不可接受。
2 常用字典数据结构
很多数据结构均能完成字典功能,总结如下。
数据结构 | 优缺点 |
排序列表Array/List | 使用二分法查找,不平衡 |
HashMap/TreeMap | 性能高,内存消耗大,几乎是原始数据的三倍 |
Skip List | 跳跃表,可快速查找词语,在lucene、redis、Hbase等均有实现。相对于TreeMap等结构,特别适合高并发场景(Skip List介绍) |
Trie | 适合英文词典,如果系统中存在大量字符串且这些字符串基本没有公共前缀,则相应的trie树将非常消耗内存(数据结构之trie树) |
Double Array Trie | 适合做中文词典,内存占用小,很多分词工具均采用此种算法(深入双数组Trie) |
Ternary Search Tree | 三叉树,每一个node有3个节点,兼具省空间和查询快的优点(Ternary Search Tree) |
Finite State Transducers (FST) | 一种有限状态转移机,Lucene 4有开源实现,并大量使用 |