Implement bzip2's run-length encoding

Dennis just a moment. 3 answers, 0 views

Background

After applying the BWT (as seen in Burrows, Wheeler and Back) and the MTF (as seen in Move to the printable ASCII front), the bzip2 compressor applies a rather unique form of run-length encoding.

Definition

For the purpose of this challenge, we define the transformation BRLE as follows:

Given an input string s that consists solely of ASCII characters with code points between 0x20 and 0x7A, do the following:

1. Replace each run of equal characters by a single occurrence of the character and store number of repetitions after the first.

2. Encode the number of repetitions after the first occurrence of the character, using bijective base-2 numeration and the symbols { and }.

A non-negative integer n is encoded as the string bk…b0 such that n = 2ki(bk)+…+20i(b0), where i({) = 1 and i(}) = 2.

Note that this representation is always unique. For example, the number 0 is encoded as an empty string.

3. Insert the string of curly brackets that encodes the number of repetitions after the single occurrence of the corresponding character.

Step-by-step example

Input:  "abbcccddddeeeeeffffffggggggghhhhhhhh"
Step 1: "abcdefgh" with repetitions 0, 1, 2, 3, 4, 5, 6, 7
Step 2: "" "{" "}" "{{" "{}" "}{" "}}" "{{{"
Step 3: "ab{c}d{{e{}f}{g}}h{{{"

Implement an involutive program or function that reads a single string from STDIN or as command-line or function argument and prints or returns either the BRLE or its inverse of the input string.

If the input contains no curly brackets, apply the BRLE. If the input contains curly brackets, apply its inverse.

Examples

INPUT:  CODEGOLF
OUTPUT: CODEGOLF

INPUT:  PROGRAMMING
OUTPUT: PROGRAM{ING

INPUT:  PUZ{LES
OUTPUT: PUZZLES

INPUT:  444488888888GGGGGGGGGGGGGGGGWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
OUTPUT: 4{{8{{{G{{{{W{{{{{

INPUT:  y}}}{{
OUTPUT: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy

• You cannot use any built-ins that compute the BRLE or its inverse of a string.

• You can use built-ins that:

• Compute the RLE or RLD of a string, as long as the number of repetitions is not stored in bijective base-2.

• Perform base conversion of any kind.

• Your code may print a trailing newline if you choose STDOUT for output.

• Your code has to work for any input of 1000 or less ASCII characters in the range from 0x20 to 0x7A, plus curly brackets (0x7B and 0x7D).

• If the input contains curly brackets, you may assume that it is a valid result of applying the BRLE to a string.

• Standard code golf rules apply. The shortest submission in bytes wins.

jimmy23013 07/13/2015.

CJam, 50 48 bytes

l_{}:T&1${_T#)_(@a?)+}%{(\2b)*}%@e{(2b1>Tf=}%? Thanks for Dennis for saving 2 bytes. Try it online. Explanation l_ e# Read and duplicate input. {}:T e# T = "{}" & e# If the input has '{ or '}: 1$          e# Input.
{           e# For each character:
_T#)    e# If it is '{ or '}:
_(  e# Return 0 for '{ or 1 for '}.
@a  e# Otherwise, convert the character itself to an array (string).
?
)+      e# If it is a number, increment and append to the previous array.
e# If it is a string with at least 1 character, do nothing.
}%
{(\         e# For each character and bijective base 2 number:
2b)*    e# Repeat the character 1 + that many times.
}%
e# Else:
@           e# Input.
e          e# Run-length encoding.
{(          e# For each character and length:
2b1>    e# Convert the length to base 2 and remove the first bit.
Tf=     e# Map 0 to '{ and 1 to '}.
}%
?               e# End if.

isaacg 07/15/2015.

Pyth, 48 50 bytes

JHs?m*hdi+1xLJtd2tczf-@zTJUz@Jzsm+ed@LJtjhd2rz8

2 bytes thanks to @Maltysen.

Explanation:

JHs?m*hdi+1xLJtd2tczf-@zTJUz@Jzsm+ed@LJtjhd2rz8
Implicit: z = input()
H is empty dict.
JH                                                 J = repr(H) = "{}"
s                                                Print the concatenation of
?                        @Jz                    If z and J share any chars:
f     Uz                       Filter range(len(z))
-@zTJ                         On the absence of z[T] in J.
cz                               Chop z at these indices.
just before each non '{}'.
t                                 Remove empty initial piece.
m*hd                                           Map to d[0] *
i       2                                  the base 2 number
xLJtd                                   index in J mapped over d[:-1]
+1                                        with a 1 prepended.
rz8    Otherwise, run len. encode z
m                  map over (len, char)
jhd2       Convert len to binary.
@LJ            Map to element of J.
+ed               Prepend char.
s                   Concatenate. 

feersum 07/15/2015.

OCaml, 252

t is the function that does the transformation.

#load"str.cma"open Str
let rec(!)=group_beginning and
g=function|1->""|x->(g(x/2)^[|"{";"}"|].(x mod 2))and($)i s=if g i=s then i else(i+1)$s and
t s=global_substitute(regexp"$.$\1*$[{}]*$")(fun s->String.make(1\$matched_group 2 s)s.[!1]^g(!2- !1))s`

At first I was thinking I had to check for the presence of curly braces, but it turned out to be unnecessary. The decoding clearly has no effect on strings that are already decoded, and the encoding part proved equally harmless to ones that were already encoded.