Run Length Encoding | GeeksforGeeks


Given an input string, write a function that returns the Run Length Encoded string for the input string.

a) Pick the first character from source string.
b) Append the picked character to the destination string.
c) Count the number of subsequent occurrences of the picked character and append the count to destination string.
d) Pick the next character and repeat steps b) c) and d) if end of string is NOT reached.

Java Code from http://rosettacode.org/wiki/Run-length_encoding#Java
    public static String encode(String source) {
        StringBuffer dest = new StringBuffer();
        for (int i = 0; i < source.length(); i++) {
            int runLength = 1;
            while (i+1 < source.length() && source.charAt(i) == source.charAt(i+1)) {
                runLength++;
                i++;
            }
            dest.append(runLength);
            dest.append(source.charAt(i));
        }
        return dest.toString();
    }
    public static String decode(String source) {
        StringBuffer dest = new StringBuffer();
        Pattern pattern = Pattern.compile("[0-9]+|[a-zA-Z]");
        Matcher matcher = pattern.matcher(source);
        while (matcher.find()) {
            int number = Integer.parseInt(matcher.group());
            matcher.find();
            while (number-- != 0) {
                dest.append(matcher.group());
            }
        }
        return dest.toString();
    }
char *encode(char *src)
{    
  int rLen;
  char count[MAX_RLEN];
  int len = strlen(src);
  /* If all characters in the source string are different,
    then size of destination string would be twice of input string.
    For example if the src is "abcd", then dest would be "a1b1c1d1"
    For other inputs, size would be less than twice.  */
  char *dest = (char *)malloc(sizeof(char)*(len*2 + 1));
  int i, j = 0, k;
  /* traverse the input string one by one */
  for(i = 0; i < len; i++)
  {
    /* Copy the first occurrence of the new character */
    dest[j++] = src[i];
    /* Count the number of occurrences of the new character */
    rLen = 1;    
    while(i + 1 < len && src[i] == src[i+1])
    {
      rLen++;
      i++;
    }  
    /* Store rLen in a character array count[] */
    sprintf(count, "%d", rLen);
    /* Copy the count[] to destination */
    for(k = 0; *(count+k); k++, j++)
    {
      dest[j] = count[k];
    }
  
  /*terminate the destination string */
  dest[j] = '\0';
  return dest;
}    
http://www.stoimen.com/blog/2012/01/09/computer-algorithms-data-compression-with-run-length-encoding/
aaaaaaaaaabbbaxxxxyyyzyx
The length of this string is 17, which is approximately 70% of the initial length. Obviously this is not the optimal way to compress the given string. For instance we don’t need to use the digit “1” when the character is repeated only once. In some cases this approach can increase the length of the initial string which is exactly the opposite of what we need. In this case we’ll get the string bellow.
a10b3a1x4y3z1y1x1
Now the length of the resulting string is 13, which is 54% of the initial length! A variation of the example above is not to keep a counter of the repetitions of the character, but their position instead. Thus the initial string will be compressed as follows.
a10b3ax4y3zyx
Which of these two approaches you’ll use depends on the goal. In the second case we can achieve a good optimization of binary search.
$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = $j = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   if ($i) 
    $output .= $j;
 
   $output .= $msg[$i];
 
   $prev = $msg[$i];
 
   $j = 0;
  }
  $j++;
  $i++;
 }
 
 $output .= $j;
 
 return $output;
}
 
// a10b3a1x4y3z1y1x1
echo run_length_encode($message);
And slightly optimized.
$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = $j = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   if ($i && $j > 1) 
    $output .= $j;
 
   $output .= $msg[$i];
 
   $prev = $msg[$i];
 
   $j = 0;
  }
  $j++;
  $i++;
 }
 
 if ($j > 1)
  $output .= $j;
 
 return $output;
}
 
// a10b3ax4y3zyx
echo run_length_encode($message);
优步面经
public String encode(String s) {
      int start = 0;
      StringBuilder res = new StringBuilder();
      for (int i=1; i<=s.length(); i++) {
        if (i == s.length() || s.charAt(i) != s.charAt(start)) {
            if (i - start > 2) {
              res.append(String.valueOf(i-start));
              res.append(s.charAt(start));
            } else {
              res.append(s.substring(start, i));
            }
            start = i;
        }
      }
       
      return res.toString();
    }
首先暴力扫描之。然后问有什么问题,答曰decode的时候111aaa会encode成313a然后decode成313个a。提出每encode一个就插入一个特殊符号标记,他说可以。然后追加问是否可以不用特殊符号,答曰创建新的计数单位,比如用A代表1B代表2,然后aaa就变成Ca(you get the idea),这样就不需要插入额外符号了。


Finally a small change – now we store the position of the character.
$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   $output .= $msg[$i] . $i;
 
   $prev = $msg[$i];
 
  }
 
  $i++;
 }
 
 return $output;
}
 
// a0b10a13x14y18z21y22x23
echo run_length_encode($message);
https://en.wikipedia.org/wiki/Run-length_encoding
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. On the previous example, this would give the following:
WW12BWW12BB3WW24BWW14
This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.
One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).

游程编码(RLE,run-length encoding)
整数变动长度表示法
整数固定长度表示法
4位元表示法
顾名思义,利用4个位元来储存整数,以符号C表示整数值。其可表现的最大整数值为15,因此,若资料重复出现次数超过15,便以“分段”方式处理。
假设某资料出现N次,则可以将其分成(N/15)+1段落来处理,其中N/15的值为无条件舍去。例如连续出现33个A:
原始资料:
AAAAAAAAAAAAAAA AAAAAAAAAAAAAAA AAA
压缩结果:
     15A          15A         3A
内部储存码:
1111 01000001 1111 01000001 0011 01000001
15   A    15    A     3     A

8位元表示法

同4位元表示法的概念,其能表示最大整数为255。假设某资料出现N次,则可以将其分成(N/255)+1段落来处理,其中N/255的值为无条件舍去。
http://algorithmsforever.blogspot.com/2011/11/run-length-encoding-is-very-simple-form.html
input = "aabbbcddeef" RLE = "a2b3cd2e2f"
input = "aaabbcddeeeef" RLE = "a3b2cd2e4f"
void rle_encode(char[] input, int N, char[] output){
int index = 0, count = 1;
char last = input[0];

for(int i=1; input[i]; i++){ 


if(last == input[i]){
count++;
}
else {

output[index++] = last;
if(count > 1){

output[index++] = count + '0';
count = 1;
}
last = input[i];
}
}

output[index++] = last;
if(count > 1){

output[index++] = count + '0';
}
}

http://lemire.me/blog/archives/2009/11/24/run-length-encoding-part-i/
How you use the RLE idea depends on your goals such as (1) maximize the compression rate (2) maximize the processing speed.
  • Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.
  • Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-11K. The benefit of this approach is to allow random access in logarithmic time using binary search. However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
  • To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.
http://www.fileformat.info/mirror/egff/ch09_03.htm
Read full article from Run Length Encoding | GeeksforGeeks

Labels

LeetCode (1432) GeeksforGeeks (1122) LeetCode - Review (1067) Review (882) Algorithm (668) to-do (609) Classic Algorithm (270) Google Interview (237) Classic Interview (222) Dynamic Programming (220) DP (186) Bit Algorithms (145) POJ (141) Math (137) Tree (132) LeetCode - Phone (129) EPI (122) Cracking Coding Interview (119) DFS (115) Difficult Algorithm (115) Lintcode (115) Different Solutions (110) Smart Algorithm (104) Binary Search (96) BFS (91) HackerRank (90) Binary Tree (86) Hard (79) Two Pointers (78) Stack (76) Company-Facebook (75) BST (72) Graph Algorithm (72) Time Complexity (69) Greedy Algorithm (68) Interval (63) Company - Google (62) Geometry Algorithm (61) Interview Corner (61) LeetCode - Extended (61) Union-Find (60) Trie (58) Advanced Data Structure (56) List (56) Priority Queue (53) Codility (52) ComProGuide (50) LeetCode Hard (50) Matrix (50) Bisection (48) Segment Tree (48) Sliding Window (48) USACO (46) Space Optimization (45) Company-Airbnb (41) Greedy (41) Mathematical Algorithm (41) Tree - Post-Order (41) ACM-ICPC (40) Algorithm Interview (40) Data Structure Design (40) Graph (40) Backtracking (39) Data Structure (39) Jobdu (39) Random (39) Codeforces (38) Knapsack (38) LeetCode - DP (38) Recursive Algorithm (38) String Algorithm (38) TopCoder (38) Sort (37) Introduction to Algorithms (36) Pre-Sort (36) Beauty of Programming (35) Must Known (34) Binary Search Tree (33) Follow Up (33) prismoskills (33) Palindrome (32) Permutation (31) Array (30) Google Code Jam (30) HDU (30) Array O(N) (29) Logic Thinking (29) Monotonic Stack (29) Puzzles (29) Code - Detail (27) Company-Zenefits (27) Microsoft 100 - July (27) Queue (27) Binary Indexed Trees (26) TreeMap (26) to-do-must (26) 1point3acres (25) GeeksQuiz (25) Merge Sort (25) Reverse Thinking (25) hihocoder (25) Company - LinkedIn (24) Hash (24) High Frequency (24) Summary (24) Divide and Conquer (23) Proof (23) Game Theory (22) Topological Sort (22) Lintcode - Review (21) Tree - Modification (21) Algorithm Game (20) CareerCup (20) Company - Twitter (20) DFS + Review (20) DP - Relation (20) Brain Teaser (19) DP - Tree (19) Left and Right Array (19) O(N) (19) Sweep Line (19) UVA (19) DP - Bit Masking (18) LeetCode - Thinking (18) KMP (17) LeetCode - TODO (17) Probabilities (17) Simulation (17) String Search (17) Codercareer (16) Company-Uber (16) Iterator (16) Number (16) O(1) Space (16) Shortest Path (16) itint5 (16) DFS+Cache (15) Dijkstra (15) Euclidean GCD (15) Heap (15) LeetCode - Hard (15) Majority (15) Number Theory (15) Rolling Hash (15) Tree Traversal (15) Brute Force (14) Bucket Sort (14) DP - Knapsack (14) DP - Probability (14) Difficult (14) Fast Power Algorithm (14) Pattern (14) Prefix Sum (14) TreeSet (14) Algorithm Videos (13) Amazon Interview (13) Basic Algorithm (13) Codechef (13) Combination (13) Computational Geometry (13) DP - Digit (13) LCA (13) LeetCode - DFS (13) Linked List (13) Long Increasing Sequence(LIS) (13) Math-Divisible (13) Reservoir Sampling (13) mitbbs (13) Algorithm - How To (12) Company - Microsoft (12) DP - Interval (12) DP - Multiple Relation (12) DP - Relation Optimization (12) LeetCode - Classic (12) Level Order Traversal (12) Prime (12) Pruning (12) Reconstruct Tree (12) Thinking (12) X Sum (12) AOJ (11) Bit Mask (11) Company-Snapchat (11) DP - Space Optimization (11) Dequeue (11) Graph DFS (11) MinMax (11) Miscs (11) Princeton (11) Quick Sort (11) Stack - Tree (11) 尺取法 (11) 挑战程序设计竞赛 (11) Coin Change (10) DFS+Backtracking (10) Facebook Hacker Cup (10) Fast Slow Pointers (10) HackerRank Easy (10) Interval Tree (10) Limited Range (10) Matrix - Traverse (10) Monotone Queue (10) SPOJ (10) Starting Point (10) States (10) Stock (10) Theory (10) Tutorialhorizon (10) Kadane - Extended (9) Mathblog (9) Max-Min Flow (9) Maze (9) Median (9) O(32N) (9) Quick Select (9) Stack Overflow (9) System Design (9) Tree - Conversion (9) Use XOR (9) Book Notes (8) Company-Amazon (8) DFS+BFS (8) DP - States (8) Expression (8) Longest Common Subsequence(LCS) (8) One Pass (8) Quadtrees (8) Traversal Once (8) Trie - Suffix (8) 穷竭搜索 (8) Algorithm Problem List (7) All Sub (7) Catalan Number (7) Cycle (7) DP - Cases (7) Facebook Interview (7) Fibonacci Numbers (7) Flood fill (7) Game Nim (7) Graph BFS (7) HackerRank Difficult (7) Hackerearth (7) Inversion (7) Kadane’s Algorithm (7) Manacher (7) Morris Traversal (7) Multiple Data Structures (7) Normalized Key (7) O(XN) (7) Radix Sort (7) Recursion (7) Sampling (7) Suffix Array (7) Tech-Queries (7) Tree - Serialization (7) Tree DP (7) Trie - Bit (7) 蓝桥杯 (7) Algorithm - Brain Teaser (6) BFS - Priority Queue (6) BFS - Unusual (6) Classic Data Structure Impl (6) DP - 2D (6) DP - Monotone Queue (6) DP - Unusual (6) DP-Space Optimization (6) Dutch Flag (6) How To (6) Interviewstreet (6) Knapsack - MultiplePack (6) Local MinMax (6) MST (6) Minimum Spanning Tree (6) Number - Reach (6) Parentheses (6) Pre-Sum (6) Probability (6) Programming Pearls (6) Rabin-Karp (6) Reverse (6) Scan from right (6) Schedule (6) Stream (6) Subset Sum (6) TSP (6) Xpost (6) n00tc0d3r (6) reddit (6) AI (5) Abbreviation (5) Anagram (5) Art Of Programming-July (5) Assumption (5) Bellman Ford (5) Big Data (5) Code - Solid (5) Code Kata (5) Codility-lessons (5) Coding (5) Company - WMware (5) Convex Hull (5) Crazyforcode (5) DFS - Multiple (5) DFS+DP (5) DP - Multi-Dimension (5) DP-Multiple Relation (5) Eulerian Cycle (5) Graph - Unusual (5) Graph Cycle (5) Hash Strategy (5) Immutability (5) Java (5) LogN (5) Manhattan Distance (5) Matrix Chain Multiplication (5) N Queens (5) Pre-Sort: Index (5) Quick Partition (5) Quora (5) Randomized Algorithms (5) Resources (5) Robot (5) SPFA(Shortest Path Faster Algorithm) (5) Shuffle (5) Sieve of Eratosthenes (5) Strongly Connected Components (5) Subarray Sum (5) Sudoku (5) Suffix Tree (5) Swap (5) Threaded (5) Tree - Creation (5) Warshall Floyd (5) Word Search (5) jiuzhang (5)

Popular Posts