How to group sequences with the same regularity using Spark









To simplify the input parameters and code, I've generated an input list (in real life the input is a large number of CSV files containing numbers). All input numbers should be grouped so that each group follows a single regularity rule. Numbers that cannot be described by any regularity rule should be removed from the output.



Input example:

    [1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100]

Expected output:

    [[1010,1020,1030,1050],[1880,1900,1920,1940]]

[1010,1020,1030,1050] has an increment step of 10, with one exclusion (1040 is missing). [1880,1900,1920,1940] has an increment step of 20, with no exclusions.
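To make the terms "increment step" and "exclusion" concrete, here is a minimal sketch in plain Java. It assumes the step of a group can be taken as the greatest common divisor of the differences between neighbouring values, which matches both example groups but is not stated in the question. The class and method names (`StepAndExclusions`, `step`, `exclusions`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class StepAndExclusions {
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Assumption: the step of an ascending group is the GCD of all
    // differences between neighbouring values.
    static int step(List<Integer> sorted) {
        int g = 0;
        for (int i = 1; i < sorted.size(); i++) {
            g = gcd(g, sorted.get(i) - sorted.get(i - 1));
        }
        return g;
    }

    // Values that the step predicts between neighbours but that are absent.
    static List<Integer> exclusions(List<Integer> sorted) {
        int step = step(sorted);
        List<Integer> missing = new ArrayList<>();
        if (step == 0) {
            return missing; // fewer than two distinct values
        }
        for (int i = 1; i < sorted.size(); i++) {
            for (int v = sorted.get(i - 1) + step; v < sorted.get(i); v += step) {
                missing.add(v);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Integer> first = Arrays.asList(1010, 1020, 1030, 1050);
        List<Integer> second = Arrays.asList(1880, 1900, 1920, 1940);
        System.out.println(step(first) + " " + exclusions(first));   // 10 [1040]
        System.out.println(step(second) + " " + exclusions(second)); // 20 []
    }
}
```

This only describes a group that has already been found; it does not decide which numbers belong together.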



Starting solution with Spark in Java:



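    import java.util.Arrays;
    import java.util.List;
    // Assuming the log4j 1.x Logger that ships with Spark 2.x
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GroupSequences { // class name is a placeholder
        public static void main(String[] args) throws Exception {
            // Silence Spark's own log output
            Logger.getLogger("org").setLevel(Level.OFF);
            SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            List<Integer> inputIntegers = Arrays.asList(1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100);
            JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
            System.out.println("Check count of objects: " + integerRdd.count());

            System.out.println(integerRdd.collect());
            // Sort ascending into a single partition
            JavaRDD<Integer> sorted = integerRdd.sortBy(x -> x, true, 1);
            System.out.println("Check sorted collection:");
            System.out.println(sorted.collect());
            /* TODO
            - enrich data with increment step
            - group by increment step (increment step should be calculated at run time)
            - filter out single values
            */
            sc.stop();
        }
    }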



I think I need to use the increment step, but I do not know how to combine data from different RDDs in Spark so that the increment step can be stored alongside each value in the current RDD.



Does anyone have ideas on how to resolve this with Spark, or how it could be solved with different tools?



P.S. Spark (or an equivalent technology) is mandatory because the solution must be big-data oriented and therefore has to run on a distributed system.
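As an illustration of the "enrich data with increment step" item from the TODO list, the following plain-Java sketch pairs every value with its difference to the previous value in sorted order. It assumes the values are distinct, and the names `IncrementEnrichment` and `withIncrements` are hypothetical. On an RDD, one possible route to the same pairing would be to index the sorted data with `zipWithIndex` and join it with itself on shifted indices; that variant is not shown or tested here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IncrementEnrichment {
    // Maps each value (except the smallest) to its difference from the
    // previous value in ascending order. Assumes distinct input values.
    static Map<Integer, Integer> withIncrements(List<Integer> input) {
        List<Integer> sorted = new ArrayList<>(input);
        Collections.sort(sorted);
        Map<Integer, Integer> result = new LinkedHashMap<>();
        for (int i = 1; i < sorted.size(); i++) {
            result.put(sorted.get(i), sorted.get(i) - sorted.get(i - 1));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(
                1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100);
        // 1022 sits between 1020 and 1030, so 1030 is paired with 8, not 10
        System.out.println(withIncrements(input));
    }
}
```

On the example input, the stray value 1022 gives 1030 an increment of 8 rather than 10, so grouping purely by neighbour difference would split the first expected group. Any grouping rule therefore has to tolerate both outliers and exclusions.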










  • So I guess a group should have at least 3 elements; otherwise any pair of elements would satisfy a regularity rule.
    – mangusta
    Nov 10 at 10:56










  • @mangusta Good point, updated. Thanks.
    – Sergii
    Nov 10 at 10:59










  • Hi! Without additional conditions and requirements, the underlying algorithmic problem seems to be intractable. In your example the input data is already sorted. Also, are you looking only for a constant increment rule, or are other regularity rules acceptable, e.g. the sequence 1,2,4,5,7,8,10,11,...?
    – Yola
    Nov 10 at 11:33











  • It is almost sorted (out-of-order values can occur, e.g. 1010,1020,1050,1030...), and the exclusion rule makes the issue harder. @Yola, do you know how to group numbers by increment value?
    – Sergii
    Nov 10 at 11:38










  • I'm afraid I don't know anything about Spark. First I would sort the sequence with something like bubble sort, since you say it is already almost sorted. Then I would decide on the biggest possible step, which could be deduced from the range of the numbers in the sequence, and maybe from other considerations as well. And then I would just routinely check different steps. The first step to consider would be a[1] - a[0].
    – Yola
    Nov 10 at 11:44
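The approach outlined in the comments can be sketched in plain Java as follows. This is a greedy heuristic under stated assumptions, not a definitive or distributed solution: the step of each candidate group is seeded with a[1] - a[0] as suggested above, a group needs at least 3 elements (per mangusta's comment), and at most one exclusion is tolerated between neighbours. The `MAX_MISSING` limit and all names here are hypothetical choices that the question does not specify.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class GreedyStepGrouping {
    static final int MIN_GROUP_SIZE = 3; // from the comment thread
    static final int MAX_MISSING = 1;    // assumed tolerance for exclusions

    static List<List<Integer>> group(List<Integer> input) {
        List<Integer> remaining = new ArrayList<>(input);
        Collections.sort(remaining);
        List<List<Integer>> groups = new ArrayList<>();
        while (remaining.size() >= MIN_GROUP_SIZE) {
            // Seed the step with the first difference, a[1] - a[0]
            int step = remaining.get(1) - remaining.get(0);
            List<Integer> candidate = new ArrayList<>();
            candidate.add(remaining.get(0));
            if (step > 0) {
                for (int i = 1; i < remaining.size(); i++) {
                    int diff = remaining.get(i) - candidate.get(candidate.size() - 1);
                    // Input is sorted, so later values are even further away
                    if (diff > (long) step * (MAX_MISSING + 1)) {
                        break;
                    }
                    // Accept values that land on a multiple of the step
                    if (diff > 0 && diff % step == 0) {
                        candidate.add(remaining.get(i));
                    }
                }
            }
            if (candidate.size() >= MIN_GROUP_SIZE) {
                groups.add(candidate);
                remaining.removeAll(candidate);
            } else {
                remaining.remove(0); // drop the seed value and retry
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(
                1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100);
        System.out.println(group(input));
        // [[1010, 1020, 1030, 1050], [1880, 1900, 1920, 1940]]
    }
}
```

Because the step is seeded from the first two remaining values, an outlier directly after the start of a real sequence (for example 1010, 1012, 1020, 1030) makes this sketch miss the group, so a more robust version would need to try several candidate steps. The loop is also sequential; it could only be applied in Spark per partition or per pre-bucketed range, which would be an additional design decision.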














java algorithm apache-spark sequence














edited Nov 10 at 10:58

























asked Nov 10 at 10:50









Sergii











