Massive Algorithms: Run Length Encoding

Run Length Encoding | GeeksforGeeks

Given an input string, write a function that returns the Run Length Encoded string for the input string.

a) Pick the first character from source string.
b) Append the picked character to the destination string.
c) Count the number of subsequent occurrences of the picked character and append the count to destination string.
d) Pick the next character and repeat steps b) c) and d) if end of string is NOT reached.

Java Code from http://rosettacode.org/wiki/Run-length_encoding#Java

    public static String encode(String source) {
        StringBuffer dest = new StringBuffer();
        for (int i = 0; i < source.length(); i++) {
            int runLength = 1;
            while (i+1 < source.length() && source.charAt(i) == source.charAt(i+1)) {
                runLength++;
                i++;
            }
            dest.append(runLength);
            dest.append(source.charAt(i));
        }
        return dest.toString();
    }
    public static String decode(String source) {
        StringBuffer dest = new StringBuffer();
        Pattern pattern = Pattern.compile("[0-9]+|[a-zA-Z]");
        Matcher matcher = pattern.matcher(source);
        while (matcher.find()) {
            int number = Integer.parseInt(matcher.group());
            matcher.find();
            while (number-- != 0) {
                dest.append(matcher.group());
            }
        }
        return dest.toString();
    }

char *encode(char *src)

{

  int rLen;

  char count[MAX_RLEN];

  int len = strlen(src);

  /* If all characters in the source string are different, 

    then size of destination string would be twice of input string.

    For example if the src is "abcd", then dest would be "a1b1c1d1"

    For other inputs, size would be less than twice.  */

  char *dest = (char *)malloc(sizeof(char)*(len*2 + 1));

  int i, j = 0, k;

  /* traverse the input string one by one */

  for(i = 0; i < len; i++)

{

    /* Copy the first occurrence of the new character */

    dest[j++] = src[i];

    /* Count the number of occurrences of the new character */

    rLen = 1;     

    while(i + 1 < len && src[i] == src[i+1])

{

      rLen++;

      i++;

}

    /* Store rLen in a character array count[] */

    sprintf(count, "%d", rLen);

    /* Copy the count[] to destination */

    for(k = 0; *(count+k); k++, j++)

{

      dest[j] = count[k]; 

}

}

  /*terminate the destination string */

  dest[j] = '\0';

  return dest;

}

http://www.stoimen.com/blog/2012/01/09/computer-algorithms-data-compression-with-run-length-encoding/
aaaaaaaaaabbbaxxxxyyyzyx
The length of this string is 17, which is approximately 70% of the initial length. Obviously this is not the optimal way to compress the given string. For instance we don’t need to use the digit “1” when the character is repeated only once. In some cases this approach can increase the length of the initial string which is exactly the opposite of what we need. In this case we’ll get the string bellow.
a10b3a1x4y3z1y1x1
Now the length of the resulting string is 13, which is 54% of the initial length! A variation of the example above is not to keep a counter of the repetitions of the character, but their position instead. Thus the initial string will be compressed as follows.
a10b3ax4y3zyx
Which of these two approaches you’ll use depends on the goal. In the second case we can achieve a good optimization of binary search.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = $j = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   if ($i) 
    $output .= $j;
 
   $output .= $msg[$i];
 
   $prev = $msg[$i];
 
   $j = 0;
  }
  $j++;
  $i++;
 }
 
 $output .= $j;
 
 return $output;
}
 
// a10b3a1x4y3z1y1x1
echo run_length_encode($message);

And slightly optimized.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = $j = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   if ($i && $j > 1) 
    $output .= $j;
 
   $output .= $msg[$i];
 
   $prev = $msg[$i];
 
   $j = 0;
  }
  $j++;
  $i++;
 }
 
 if ($j > 1)
  $output .= $j;
 
 return $output;
}
 
// a10b3ax4y3zyx
echo run_length_encode($message);

优步面经

public String encode(String s) {

      int start = 0;

      StringBuilder res = new StringBuilder();

      for (int i=1; i<=s.length(); i++) {

        if (i == s.length() || s.charAt(i) != s.charAt(start)) {

            if (i - start > 2) {

              res.append(String.valueOf(i-start));

              res.append(s.charAt(start));

            } else {

              res.append(s.substring(start, i));

}

            start = i;

}

}

      return res.toString();

}

首先暴力扫描之。然后问有什么问题，答曰decode的时候111aaa会encode成313a然后decode成313个a。提出每encode一个就插入一个特殊符号标记，他说可以。然后追加问是否可以不用特殊符号，答曰创建新的计数单位，比如用A代表1B代表2，然后aaa就变成Ca（you get the idea），这样就不需要插入额外符号了。

Finally a small change – now we store the position of the character.

$message = 'aaaaaaaaaabbbaxxxxyyyzyx';
 
function run_length_encode($msg)
{
 $i = 0;
 $prev = '';
 $output = '';
 
 while ($msg[$i]) {
  if ($msg[$i] != $prev) {
 
   $output .= $msg[$i] . $i;
 
   $prev = $msg[$i];
 
  }
 
  $i++;
 }
 
 return $output;
}
 
// a0b10a13x14y18z21y22x23
echo run_length_encode($message);

https://en.wikipedia.org/wiki/Run-length_encoding
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Run-length encoding can be expressed in multiple ways to accommodate data properties as well as additional compression algorithms. For instance, one popular method encodes run lengths for runs of two or more characters only, using an "escape" symbol to identify runs, or using the character itself as the escape, so that any time a character appears twice it denotes a run. On the previous example, this would give the following:

WW12BWW12BB3WW24BWW14

This would be interpreted as a run of twelve Ws, a B, a run of twelve Ws, a run of three Bs, etc. In data where runs are less frequent, this can significantly improve the compression rate.

One other matter is the application of additional compression algorithms. Even with the runs extracted, the frequencies of different characters may be large, allowing for further compression; however, if the run lengths are written in the file in the locations where the runs occurred, the presence of these numbers interrupts the normal flow and makes it harder to compress. To overcome this, some run-length encoders separate the data and escape symbols from the run lengths, so that the two can be handled independently. For the example data, this would result in two outputs, the string "WWBWWBBWWBWW" and the numbers (12,12,3,24,14).

游程编码（RLE，run-length encoding）
整数变动长度表示法
整数固定长度表示法
4位元表示法

顾名思义，利用4个位元来储存整数，以符号C表示整数值。其可表现的最大整数值为15，因此，若资料重复出现次数超过15，便以“分段”方式处理。
假设某资料出现N次，则可以将其分成(N/15)+1段落来处理，其中N/15的值为无条件舍去。例如连续出现33个A：

原始资料：
AAAAAAAAAAAAAAA AAAAAAAAAAAAAAA AAA
压缩结果：

     15A          15A         3A

内部储存码：
1111 01000001 1111 01000001 0011 01000001

15   A    15    A     3     A

8位元表示法

同4位元表示法的概念，其能表示最大整数为255。假设某资料出现N次，则可以将其分成(N/255)+1段落来处理，其中N/255的值为无条件舍去。

http://algorithmsforever.blogspot.com/2011/11/run-length-encoding-is-very-simple-form.html

input = "aabbbcddeef" RLE = "a2b3cd2e2f"
input = "aaabbcddeeeef" RLE = "a3b2cd2e4f"

void rle_encode(char[] input, int N, char[] output){

int index = 0, count = 1;
char last = input[0];

for(int i=1; input[i]; i++){

if(last == input[i]){

count++;

}
else {

output[index++] = last;
if(count > 1){

output[index++] = count + '0';
count = 1;

}
last = input[i];

}

}

output[index++] = last;
if(count > 1){

output[index++] = count + '0';

}

http://lemire.me/blog/archives/2009/11/24/run-length-encoding-part-i/
How you use the RLE idea depends on your goals such as (1) maximize the compression rate (2) maximize the processing speed.

Instead of using a counter for each run of characters, you only add a counter after a value has been repeated twice. For example, the string AAABBBBBZWWK becomes AA1-BB3-Z-WW-K. Thus, if many characters are not repeated, you will rarely use an unnecessary counter.

Instead of a counter, you may store the location of the run in the array. For example, the string AAABBBBBZWWK becomes 1A-4B-9Z-10W-11K. The benefit of this approach is to allow random access in logarithmic time using binary search. However, it is also incompatible with some techniques to avoid unnecessary counters. So, it is a compression-speed trade-off. For even more speed, you can store both the location of the run and its length (thus avoiding a subtraction).
To help vectorization, you can group the characters into blocks of k characters. For example, using blocks of two characters, the string AAABBBBBZWWK becomes 1AA-1AB-2BB-1ZW-1WK. Again, this is a compression-speed trade-off because there will be fewer runs to compress after grouping the characters into long blocks.

http://www.fileformat.info/mirror/egff/ch09_03.htm
Read full article from Run Length Encoding | GeeksforGeeks

Run Length Encoding | GeeksforGeeks

8位元表示法

Labels

Popular Posts