Optimal runtime for simple compression scheme










1















Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.



Consider a very simple compression scheme, whose output consists of a stream of two operations:




  • put(a): outputs a character a


  • dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().



For example, "a**b*c" expands to "aaaabaaaabc".



To compress a given string s, find the shortest list of these two operations to generate it.



For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)



My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)



I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)










share|improve this question



















  • 1





    How? Please add an answer, if you have an idea.

    – Matthias
    Nov 14 '18 at 17:02















1















Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.



Consider a very simple compression scheme, whose output consists of a stream of two operations:




  • put(a): outputs a character a


  • dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().



For example, "a**b*c" expands to "aaaabaaaabc".



To compress a given string s, find the shortest list of these two operations to generate it.



For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)



My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)



I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)










share|improve this question



















  • 1





    How? Please add an answer, if you have an idea.

    – Matthias
    Nov 14 '18 at 17:02













1












1








1


1






Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.



Consider a very simple compression scheme, whose output consists of a stream of two operations:




  • put(a): outputs a character a


  • dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().



For example, "a**b*c" expands to "aaaabaaaabc".



To compress a given string s, find the shortest list of these two operations to generate it.



For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)



My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)



I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)










share|improve this question
















Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.



Consider a very simple compression scheme, whose output consists of a stream of two operations:




  • put(a): outputs a character a


  • dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().



For example, "a**b*c" expands to "aaaabaaaabc".



To compress a given string s, find the shortest list of these two operations to generate it.



For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)



My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)



I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)







algorithm big-o lossless-compression






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 14 '18 at 17:02







Matthias

















asked Nov 14 '18 at 15:00









MatthiasMatthias

589




589







  • 1





    How? Please add an answer, if you have an idea.

    – Matthias
    Nov 14 '18 at 17:02












  • 1





    How? Please add an answer, if you have an idea.

    – Matthias
    Nov 14 '18 at 17:02







1




1





How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02





How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02












1 Answer
1






active

oldest

votes


















1














Yes, a linear run time is possible for this compression scheme.



Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.



dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)


To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.



Speeding up this operation is possible using a modified LCP array.



First, build a suffix array for the string in O(n).



Then, build a longest common prefix array for the suffix array in O(n).



Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.



Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.






share|improve this answer























  • Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

    – Matthias
    Nov 15 '18 at 14:49











  • I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

    – Matthias
    Nov 15 '18 at 14:51











  • @Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

    – merlyn
    Nov 15 '18 at 14:55











  • Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

    – Matthias
    Nov 16 '18 at 3:12












  • BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

    – Matthias
    Dec 18 '18 at 16:03










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53303119%2foptimal-runtime-for-simple-compression-scheme%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1














Yes, a linear run time is possible for this compression scheme.



Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.



dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)


To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.



Speeding up this operation is possible using a modified LCP array.



First, build a suffix array for the string in O(n).



Then, build a longest common prefix array for the suffix array in O(n).



Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.



Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.






share|improve this answer























  • Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

    – Matthias
    Nov 15 '18 at 14:49











  • I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

    – Matthias
    Nov 15 '18 at 14:51











  • @Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

    – merlyn
    Nov 15 '18 at 14:55











  • Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

    – Matthias
    Nov 16 '18 at 3:12












  • BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

    – Matthias
    Dec 18 '18 at 16:03















1














Yes, a linear run time is possible for this compression scheme.



Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.



dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)


To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.



Speeding up this operation is possible using a modified LCP array.



First, build a suffix array for the string in O(n).



Then, build a longest common prefix array for the suffix array in O(n).



Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.



Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.






share|improve this answer























  • Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

    – Matthias
    Nov 15 '18 at 14:49











  • I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

    – Matthias
    Nov 15 '18 at 14:51











  • @Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

    – merlyn
    Nov 15 '18 at 14:55











  • Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

    – Matthias
    Nov 16 '18 at 3:12












  • BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

    – Matthias
    Dec 18 '18 at 16:03













1












1








1







Yes, a linear run time is possible for this compression scheme.



Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.



dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)


To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.



Speeding up this operation is possible using a modified LCP array.



First, build a suffix array for the string in O(n).



Then, build a longest common prefix array for the suffix array in O(n).



Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.



Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.






share|improve this answer













Yes, a linear run time is possible for this compression scheme.



Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.



dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)


To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.



Speeding up this operation is possible using a modified LCP array.



First, build a suffix array for the string in O(n).



Then, build a longest common prefix array for the suffix array in O(n).



Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.



Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 15 '18 at 6:20









merlynmerlyn

1,73011222




1,73011222












  • Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

    – Matthias
    Nov 15 '18 at 14:49











  • I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

    – Matthias
    Nov 15 '18 at 14:51











  • @Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

    – merlyn
    Nov 15 '18 at 14:55











  • Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

    – Matthias
    Nov 16 '18 at 3:12












  • BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

    – Matthias
    Dec 18 '18 at 16:03

















  • Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

    – Matthias
    Nov 15 '18 at 14:49











  • I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

    – Matthias
    Nov 15 '18 at 14:51











  • @Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

    – merlyn
    Nov 15 '18 at 14:55











  • Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

    – Matthias
    Nov 16 '18 at 3:12












  • BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

    – Matthias
    Dec 18 '18 at 16:03
















Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49





Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49













I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51





I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51













@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55





@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55













Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12






Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12














BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03





BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03



















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53303119%2foptimal-runtime-for-simple-compression-scheme%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

Barbados

How to read a connectionString WITH PROVIDER in .NET Core?

Node.js Script on GitHub Pages or Amazon S3