算法之美――求解 字符串间最短距离(动态规划) - 小熊不去实验室 - 博客频道 - CSDN.NET


把两个字符串变成相同的基本操作定义如下:
1. 修改一个字符(如把 a 变成 b)
2. 增加一个字符 (如 abed 变成 abedd)
3. 删除一个字符(如 jackbllog 变成 jackblog)
针对于 jackbllog到jackblog 只需要删除一个或增加一个 l 就可以把两个字符串变为相同。把这种操作需要的次数定义为两个字符串的距离 L, 则相似度定义为1/(L+1) 即距离加一的倒数。那么jackbllog和jackblog的相似度为 1/(1+1)=1/2=0.5 也就是所两个字符串的相似度是 0.5。

1. Definition of  Minimum Edit Distance 

Edit Distance用于衡量两个strings之间的相似性。
两个strings之间的Minimum edit distance是指把其中一个string通过编辑(包括插入,删除,替换操作)转换为另一个string的最小操作数。
如上图所示,d(deletion)代表删除操作,s(substitution)代表替换操作,i(insertion)代表插入操作。
(为了简单起见,后面的Edit Distance 简写为ED)
如果每种操作的cost(成本)为1,那么ED = 5.
如果s操作的cost为2(即所谓的Levenshtein Distance),ED = 8.

2. Computing Minimum Edit Distance

那么如何找到两个strings的minimun edit distance呢?要知道把一个string转换为另一个string可以有很多种方法(或者说"路径")。我们所知道起始状态(第一个string)、终止状态(另一个string)、基本操作(插入、删除、替换),要求的是最短路径。
对于如下两个strings:
X的长度为n
Y的长度为m
我们定义D(i,j)为 X 的前i个字符 X[1...i] 与 Y 的前j个字符 Y[1...j] 之间的距离,其中0<i<n, 0<j<m,因此X与Y的距离可以用D(n,m)来表示。
假如我们想要计算最终的D(n,m),那么可以从头开始,先计算D(i, j) (i和j从1开始)的值,然后基于前面的结果计算更大的D(i, j),直到最终求得D(n,m)。
算法过程如下图所示:
上图中使用的是"Levenshtein Distance"即替换的成本为2.
请读者深入理解一下上图中的循环体部分: D(i,j)可能的取值为:
1. D(i-1, j) +1 ;
2. D(i, j-1) +1 ;
3. D(i-1, j-1) + 2 (当X新增加的字符和Y新增加的字符不同时,需要替换)或者 + 0(即两个字符串新增加的字符相同)
下图即对字符串 INTENTION 和 EXECUTION 一步步求ED形成的表。左上角画红圈的8就是两个字符串间的最小ED。


http://bylijinnan.iteye.com/blog/1627362
Init record[][] to all -1.
  1.      * 解答:动态规划+备忘录 
  2.      * 2012-11-04:主要思路还是递归。字符串记为A和B(当前比较的位置记为K,当前距离记为L),从第一个字符开始按位比较,分两种情况: 
  3.      * 1、A和B在第K位的字符相等(L不变)。那好,各自向后移动,继续比较第K+1位 
  4.      * 2、A和B在第K位的字符不相等(L=L+1)。采取递归,作三种操作,看哪种操作最后得到的距离最短: 
  5.      * 一是A和B同时向后移动(相当于A和B同时删除这个字符),继续比较第K+1位 
  6.      * 二是A移动B不移动,相当于A删除了这个字符,用剩余的字符与B作比较 
  7.      * 三是A不移动B移动,相当于B删除了这个字符,用剩余的字符与A作比较 
  8.      * 递归的好处就是可以递归得到这三种操作到最后得到的距离,哪个是最短 
  9.      * 举个例子,A="abc",B="zbc"。我们可以一眼看出,采用第一种操作算得的距离最短(L=1) 
  10.      * 但程序中要递归执行这另外两种操作并比较: 
  11.      * A1="bc",B2="zbc" -->按位比较得到的L=1+3 
  12.      * A2="abc",B2="bc" -->按位比较得到的L=1+3 
  13.      * 因此程序会选择第一种操作,再接着进行第K+1位的比较 
  14.      */  
  15.       
  16.     private static int[][] record;  //记录子问题的解,0表示子问题未求解  
  17.       
  18.     public static void main(String[] args) {  
  19.         String strA = "abcd";  
  20.         String[] strBB = {  
  21.                 "",  
  22.                 "z",  
  23.                 "a",  
  24.                 "ac",  
  25.                 "adc"  
  26.         };  
  27.         for (String strB : strBB) {  
  28.             int distance = distanceBetween(strA, strB);  
  29.             System.out.println(distance);  
  30.         }  
  31.     }  
  32.   
  33.     public static int distanceBetween(String strA, String strB) {  
  34.         int distance = -1;  
  35.         if (strA != null && strB != null) {  
  36.             int lenA = strA.length();  
  37.             int lenB = strB.length();  
  38.             if (lenA == 0 && lenB == 0) {  
  39.                 distance = 0;  
  40.             }  
  41.             if (lenA != 0 && lenB == 0) {  
  42.                 distance = lenA;  
  43.             }  
  44.             if (lenA == 0 && lenB != 0) {  
  45.                 distance = lenB;  
  46.             }  
  47.             if (lenA != 0 && lenB != 0) {  
  48.                 record = new int[lenA + 1][lenB + 1];  
  49.                 char[] charArrayA = strA.toCharArray();  
  50.                 char[] charArrayB = strB.toCharArray();  
  51.                 distance = distanceHelp(charArrayA, charArrayB, 00, lenA - 1, lenB - 1);  
  52.             }  
  53.         }  
  54.         return distance;  
  55.     }  
  56.       
  57.     //endA和endB是不变的,因此记录子问题的解可用record[beginA][beginB]来表示  
  58.     public static int distanceHelp(char[] charArrayA, char[] charArrayB,  
  59.                                      int beginA, int beginB, int endA, int endB) {  
  60.         if (beginA > endA) {             //递归出口:A从头到尾每个字符遍历完了,B有两种情况:  
  61.             if (beginB > endB) {         //1.B也同时遍历完了,说明这A=B  
  62.                 return 0;     
  63.             } else {  
  64.                 return endB - beginB + 1;   //2.B还没遍历完,那B剩下的长度就是两个字符串不同的地方,即距离  
  65.             }  
  66.         }  
  67.         if (beginB > endB) {  
  68.             if (beginA > endA) {  
  69.                 return 0;  
  70.             } else {  
  71.                 return endA - beginA + 1;  
  72.             }  
  73.         }  
  74.           
  75.         int distance = -1;  
  76.         if (charArrayA[beginA] == charArrayB[beginB]) {  
  77.             distance = record[beginA + 1][beginB + 1];  
  78.             if (distance == 0) {  
  79.                 distance = distanceHelp(charArrayA, charArrayB, beginA + 1, beginB + 1, endA, endB);  
  80.             }  
  81.         } else {  
  82.             int d1 = record[beginA + 1][beginB];  
  83.             if (d1 == 0) {  
  84.                 d1 = distanceHelp(charArrayA, charArrayB, beginA + 1, beginB, endA, endB);   
  85.             }  
  86.             int d2 = record[beginA][beginB + 1];  
  87.             if (d2 == 0) {  
  88.                 d2 = distanceHelp(charArrayA, charArrayB, beginA, beginB + 1, endA, endB);   
  89.             }  
  90.             int d3 = record[beginA + 1][beginB + 1];  
  91.             if (d3 == 0) {  
  92.                 d3 = distanceHelp(charArrayA, charArrayB, beginA + 1, beginB + 1, endA, endB);   
  93.             }  
  94.             distance = min(d1, d2, d3) + 1;  
  95.         }  
  96.         record[beginA][beginB] = distance;  
  97.         return distance;  
  98.     }  
Read full article from 算法之美――求解 字符串间最短距离(动态规划) - 小熊不去实验室 - 博客频道 - CSDN.NET

Labels

LeetCode (1432) GeeksforGeeks (1122) LeetCode - Review (1067) Review (882) Algorithm (668) to-do (609) Classic Algorithm (270) Google Interview (237) Classic Interview (222) Dynamic Programming (220) DP (186) Bit Algorithms (145) POJ (141) Math (137) Tree (132) LeetCode - Phone (129) EPI (122) Cracking Coding Interview (119) DFS (115) Difficult Algorithm (115) Lintcode (115) Different Solutions (110) Smart Algorithm (104) Binary Search (96) BFS (91) HackerRank (90) Binary Tree (86) Hard (79) Two Pointers (78) Stack (76) Company-Facebook (75) BST (72) Graph Algorithm (72) Time Complexity (69) Greedy Algorithm (68) Interval (63) Company - Google (62) Geometry Algorithm (61) Interview Corner (61) LeetCode - Extended (61) Union-Find (60) Trie (58) Advanced Data Structure (56) List (56) Priority Queue (53) Codility (52) ComProGuide (50) LeetCode Hard (50) Matrix (50) Bisection (48) Segment Tree (48) Sliding Window (48) USACO (46) Space Optimization (45) Company-Airbnb (41) Greedy (41) Mathematical Algorithm (41) Tree - Post-Order (41) ACM-ICPC (40) Algorithm Interview (40) Data Structure Design (40) Graph (40) Backtracking (39) Data Structure (39) Jobdu (39) Random (39) Codeforces (38) Knapsack (38) LeetCode - DP (38) Recursive Algorithm (38) String Algorithm (38) TopCoder (38) Sort (37) Introduction to Algorithms (36) Pre-Sort (36) Beauty of Programming (35) Must Known (34) Binary Search Tree (33) Follow Up (33) prismoskills (33) Palindrome (32) Permutation (31) Array (30) Google Code Jam (30) HDU (30) Array O(N) (29) Logic Thinking (29) Monotonic Stack (29) Puzzles (29) Code - Detail (27) Company-Zenefits (27) Microsoft 100 - July (27) Queue (27) Binary Indexed Trees (26) TreeMap (26) to-do-must (26) 1point3acres (25) GeeksQuiz (25) Merge Sort (25) Reverse Thinking (25) hihocoder (25) Company - LinkedIn (24) Hash (24) High Frequency (24) Summary (24) Divide and Conquer (23) Proof (23) Game Theory (22) Topological Sort (22) Lintcode - Review (21) Tree - Modification (21) Algorithm Game (20) CareerCup (20) Company - Twitter (20) DFS + Review (20) DP - Relation (20) Brain Teaser (19) DP - Tree (19) Left and Right Array (19) O(N) (19) Sweep Line (19) UVA (19) DP - Bit Masking (18) LeetCode - Thinking (18) KMP (17) LeetCode - TODO (17) Probabilities (17) Simulation (17) String Search (17) Codercareer (16) Company-Uber (16) Iterator (16) Number (16) O(1) Space (16) Shortest Path (16) itint5 (16) DFS+Cache (15) Dijkstra (15) Euclidean GCD (15) Heap (15) LeetCode - Hard (15) Majority (15) Number Theory (15) Rolling Hash (15) Tree Traversal (15) Brute Force (14) Bucket Sort (14) DP - Knapsack (14) DP - Probability (14) Difficult (14) Fast Power Algorithm (14) Pattern (14) Prefix Sum (14) TreeSet (14) Algorithm Videos (13) Amazon Interview (13) Basic Algorithm (13) Codechef (13) Combination (13) Computational Geometry (13) DP - Digit (13) LCA (13) LeetCode - DFS (13) Linked List (13) Long Increasing Sequence(LIS) (13) Math-Divisible (13) Reservoir Sampling (13) mitbbs (13) Algorithm - How To (12) Company - Microsoft (12) DP - Interval (12) DP - Multiple Relation (12) DP - Relation Optimization (12) LeetCode - Classic (12) Level Order Traversal (12) Prime (12) Pruning (12) Reconstruct Tree (12) Thinking (12) X Sum (12) AOJ (11) Bit Mask (11) Company-Snapchat (11) DP - Space Optimization (11) Dequeue (11) Graph DFS (11) MinMax (11) Miscs (11) Princeton (11) Quick Sort (11) Stack - Tree (11) 尺取法 (11) 挑战程序设计竞赛 (11) Coin Change (10) DFS+Backtracking (10) Facebook Hacker Cup (10) Fast Slow Pointers (10) HackerRank Easy (10) Interval Tree (10) Limited Range (10) Matrix - Traverse (10) Monotone Queue (10) SPOJ (10) Starting Point (10) States (10) Stock (10) Theory (10) Tutorialhorizon (10) Kadane - Extended (9) Mathblog (9) Max-Min Flow (9) Maze (9) Median (9) O(32N) (9) Quick Select (9) Stack Overflow (9) System Design (9) Tree - Conversion (9) Use XOR (9) Book Notes (8) Company-Amazon (8) DFS+BFS (8) DP - States (8) Expression (8) Longest Common Subsequence(LCS) (8) One Pass (8) Quadtrees (8) Traversal Once (8) Trie - Suffix (8) 穷竭搜索 (8) Algorithm Problem List (7) All Sub (7) Catalan Number (7) Cycle (7) DP - Cases (7) Facebook Interview (7) Fibonacci Numbers (7) Flood fill (7) Game Nim (7) Graph BFS (7) HackerRank Difficult (7) Hackerearth (7) Inversion (7) Kadane’s Algorithm (7) Manacher (7) Morris Traversal (7) Multiple Data Structures (7) Normalized Key (7) O(XN) (7) Radix Sort (7) Recursion (7) Sampling (7) Suffix Array (7) Tech-Queries (7) Tree - Serialization (7) Tree DP (7) Trie - Bit (7) 蓝桥杯 (7) Algorithm - Brain Teaser (6) BFS - Priority Queue (6) BFS - Unusual (6) Classic Data Structure Impl (6) DP - 2D (6) DP - Monotone Queue (6) DP - Unusual (6) DP-Space Optimization (6) Dutch Flag (6) How To (6) Interviewstreet (6) Knapsack - MultiplePack (6) Local MinMax (6) MST (6) Minimum Spanning Tree (6) Number - Reach (6) Parentheses (6) Pre-Sum (6) Probability (6) Programming Pearls (6) Rabin-Karp (6) Reverse (6) Scan from right (6) Schedule (6) Stream (6) Subset Sum (6) TSP (6) Xpost (6) n00tc0d3r (6) reddit (6) AI (5) Abbreviation (5) Anagram (5) Art Of Programming-July (5) Assumption (5) Bellman Ford (5) Big Data (5) Code - Solid (5) Code Kata (5) Codility-lessons (5) Coding (5) Company - WMware (5) Convex Hull (5) Crazyforcode (5) DFS - Multiple (5) DFS+DP (5) DP - Multi-Dimension (5) DP-Multiple Relation (5) Eulerian Cycle (5) Graph - Unusual (5) Graph Cycle (5) Hash Strategy (5) Immutability (5) Java (5) LogN (5) Manhattan Distance (5) Matrix Chain Multiplication (5) N Queens (5) Pre-Sort: Index (5) Quick Partition (5) Quora (5) Randomized Algorithms (5) Resources (5) Robot (5) SPFA(Shortest Path Faster Algorithm) (5) Shuffle (5) Sieve of Eratosthenes (5) Strongly Connected Components (5) Subarray Sum (5) Sudoku (5) Suffix Tree (5) Swap (5) Threaded (5) Tree - Creation (5) Warshall Floyd (5) Word Search (5) jiuzhang (5)

Popular Posts