Finding optimal token definitions for compression

I have a collection of strings which have a lot of common substrings,
and I’m trying to find a good way to define tokens to compress them.

For instance, if my strings are:

s1 = "String"
s2 = "Bool"
s3 = "String -> Bool"
s4 = "String -> String"
s5 = "(String -> String) -> String -> [Bool]"

then I might want to use the tokens:

$1 = "String"
$2 = "Bool"
$3 = "$1 -> $1"

so that the strings may be defined as:

s1 = "$1"
s2 = "$2"
s3 = "$1 -> $2"
s4 = "$3"
s5 = "($3) -> $1 -> [$2]"

(In fact, now it’s clear that the definition $4 = " -> " might be a good one to add.)

I’m looking for a good (perhaps the best?) way to choose the
token definitions. I’m interested in minimizing the total length
of the token definitions + the resulting string definitions.
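To make the objective concrete, here is a minimal sketch (in Python, with the token and string definitions from the example above) of the cost I’m trying to minimize; the function name `token_cost` is just for illustration:

```python
def token_cost(rules, encoded):
    # Objective: total characters spent on token definitions
    # plus total characters in the re-encoded strings.
    rule_len = sum(len(name) + len(body) for name, body in rules.items())
    str_len = sum(len(s) for s in encoded)
    return rule_len + str_len

rules = {"$1": "String", "$2": "Bool", "$3": "$1 -> $1"}
encoded = ["$1", "$2", "$1 -> $2", "$3", "($3) -> $1 -> [$2]"]
print(token_cost(rules, encoded))  # → 56, vs. 78 characters for the raw strings
```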

Any ideas?


It’s kinda related to this SO question: Huffman encoding with variable length symbols


The generic term for what you’re trying to do is grammar-based compression. In particular, you seem to be trying to solve the smallest grammar problem.

See e.g. Charikar, Lehman, Lehman, Liu, Panigrahy, Prabhakaran, Sahai, and Shelat (2005), “The Smallest Grammar Problem”, IEEE Transactions on Information Theory 51(7): 2554–2576.
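The smallest grammar problem is NP-hard, but greedy approximations such as Re-Pair work well in practice: repeatedly replace the most frequent pair of adjacent symbols with a fresh token. A minimal Python sketch of that idea (the pair counting is naive and counts overlapping pairs, which is fine for a demonstration):

```python
from collections import Counter

def repair(strings, max_rules=20):
    """Greedy Re-Pair-style compression: while some adjacent symbol pair
    occurs at least twice across the collection, replace it with a new token."""
    seqs = [list(s) for s in strings]
    rules = {}          # token -> (left_symbol, right_symbol)
    next_id = 1
    for _ in range(max_rules):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        tok = f"${next_id}"
        next_id += 1
        rules[tok] = (a, b)
        # Rewrite every sequence left to right, merging occurrences of (a, b).
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(tok)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return rules, seqs
```

This only approximates the optimum (and doesn’t directly minimize the definition-length objective above), but expanding every token always reconstructs the original strings exactly.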

Lempel-Ziv style compression does the kind of thing you’re looking for.

For your application, it sounds like you want a fixed “string dictionary” (LZ’s internal representation of your token definitions), rather than the dynamic one used by standard streaming compression programs like gzip or compress. The general Lempel-Ziv scheme is readily adaptable to this kind of thing.

If nothing else, be aware that Lempel-Ziv-style techniques can naturally help compress the token definition data itself.
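If you just want the fixed-dictionary behavior without implementing anything yourself, zlib already supports it via a preset dictionary (the `zdict` argument to `compressobj`/`decompressobj`). A small sketch; the dictionary contents here are hypothetical, hand-picked from the substrings expected to recur:

```python
import zlib

# Hypothetical shared dictionary: substrings we expect to recur.
# Both sides must use the identical dictionary.
ZDICT = b"String -> Bool -> [Bool] (String -> String)"

def compress(data: bytes) -> bytes:
    c = zlib.compressobj(zdict=ZDICT)
    return c.compress(data) + c.flush()

def decompress(blob: bytes) -> bytes:
    d = zlib.decompressobj(zdict=ZDICT)
    return d.decompress(blob) + d.flush()
```

The dictionary itself is never transmitted with each string, so it plays exactly the role of a fixed set of token definitions amortized across the whole collection.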
