How to group sequences with the same regularity using Spark

To simplify the input parameters and code, I've generated an input list (in real life the data comes from a lot of CSV files containing numbers). All input numbers should be grouped by a single regularity rule. Numbers that cannot be described by any regularity rule should be removed from the output.

input example:

[1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100]

output result example:

[[1010,1020,1030,1050],[1880,1900,1920,1940]]

[1010,1020,1030,1050] has an increment step of 10, and the sequence has one exclusion (1040). [1880,1900,1920,1940] has an increment step of 20 and no exclusions.
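To make the rule concrete, here is a minimal sketch of how the step and the exclusions of one already-separated group could be checked. It assumes the group is sorted ascending with distinct values, and that "step" means the greatest common divisor of the gaps between neighbours; the class and method names (`RegularityCheck`, `step`, `exclusions`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RegularityCheck {
    // Euclid's algorithm; gcd(0, x) == x, so 0 works as the initial value.
    static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }

    // Largest step that divides every gap of an ascending, distinct group.
    static int step(int[] sorted) {
        int g = 0;
        for (int i = 1; i < sorted.length; i++) {
            g = gcd(g, sorted[i] - sorted[i - 1]);
        }
        return g;
    }

    // Values the progression would contain but the group does not.
    static List<Integer> exclusions(int[] sorted) {
        List<Integer> missing = new ArrayList<>();
        int s = step(sorted);
        if (s == 0) {
            return missing; // fewer than two elements: nothing to check
        }
        int idx = 0;
        for (int v = sorted[0]; v <= sorted[sorted.length - 1]; v += s) {
            if (sorted[idx] == v) {
                idx++;
            } else {
                missing.add(v);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        int[] first = {1010, 1020, 1030, 1050};
        int[] second = {1880, 1900, 1920, 1940};
        System.out.println(step(first) + " " + exclusions(first));   // 10 [1040]
        System.out.println(step(second) + " " + exclusions(second)); // 20 []
    }
}
```

This only validates a group that has already been found; it does not decide which numbers belong together.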

Starting solution with Spark on Java:

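import java.util.Arrays;
import java.util.List;

// Assumption: Logger and Level come from log4j, as is common in Spark setups
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// The class name is a placeholder; the original snippet showed only main()
public class GroupSequences {
    public static void main(String[] args) throws Exception {
        // Silence Spark's own logging so only the program output is printed
        Logger.getLogger("org").setLevel(Level.OFF);
        SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> inputIntegers = Arrays.asList(1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100);
        JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
        System.out.println("Check count of objects: " + integerRdd.count());

        System.out.println(integerRdd.collect());
        // Sort ascending into a single partition so neighbours end up adjacent
        JavaRDD<Integer> sorted = integerRdd.sortBy(x -> x, true, 1);
        System.out.println("Check sorted collection:");
        System.out.println(sorted.collect());
        /* TODO
         - enrich data with increment step
         - group by increment step (increment step should be calculated at run time)
         - filter out single values
        */
        sc.close();
    }
}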

I think I need to use the increment step, but I do not know how to work with data from different RDDs (when using Spark) so that I can store the increment step in the current Spark RDD.

Does anyone have an idea how to resolve my Spark issue, or how it can be solved with different tools?

P.S.

Spark (or any equivalent technology) is mandatory because the solution must be big-data oriented and, as a result, run on a distributed system.
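As an illustration of the first TODO item ("enrich data with increment step"), the sketch below computes the gap between each value and its successor in plain Java, without Spark. The class name `NeighbourGaps` and the method `gaps` are hypothetical. It is only meant to show what the enriched data looks like for the example input, not how to distribute the computation.

```java
import java.util.Arrays;

class NeighbourGaps {
    // Returns gaps[i] = sorted[i + 1] - sorted[i] for an ascending copy of the input.
    static int[] gaps(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int[] gaps = new int[Math.max(0, sorted.length - 1)];
        for (int i = 0; i + 1 < sorted.length; i++) {
            gaps[i] = sorted[i + 1] - sorted[i];
        }
        return gaps;
    }

    public static void main(String[] args) {
        int[] input = {1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100};
        // Sorted input: 1010 1020 1022 1030 1050 1880 1900 1920 1940 2010 3100
        System.out.println(Arrays.toString(gaps(input)));
        // prints [10, 2, 8, 20, 830, 20, 20, 20, 70, 1090]
    }
}
```

The gaps around 1022 (2 and 8) show why grouping by neighbour gap alone is not enough for this input: the stray value splits the step-10 run, and the exclusion at 1040 shows up as a gap of 20.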

edited Nov 10 at 10:58

asked Nov 10 at 10:50

Sergii

So I guess a group should have at least 3 elements; otherwise any pair of elements would satisfy a regularity rule.
– mangusta
Nov 10 at 10:56

@mangusta Good point, updated. Thanks.
– Sergii
Nov 10 at 10:59

Hello! Without additional conditions and requirements, the underlying algorithmic problem seems to be intractable. In your example the input data is already sorted. Also, are you looking for the same-increment rule only, or are other regularity rules acceptable, e.g. the sequence 1,2,4,5,7,8,10,11,...?
– Yola
Nov 10 at 11:33

Almost sorted (collisions could happen: 1010,1020,1050,1030...), and the exclusion rules make the issue harder. @yola, do you know how to group numbers by increment value?
– Sergii
Nov 10 at 11:38

I'm afraid that I don't know anything about Spark. First I would sort the sequence with something like bubble sort, since you say it is already almost sorted. Then I would decide on the biggest possible step, which could be deduced from the range of the numbers in the sequence, and maybe from other considerations as well. And then just routinely check different steps. The first step to consider would be a[1] - a[0].
– Yola
Nov 10 at 11:44
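The approach described in this comment could be sketched on a single machine roughly as follows. Several things here are assumptions rather than requirements from the question: a group needs at least 3 members (per the first comment), at most one consecutive exclusion is tolerated, the largest step tried is the remaining range divided by 2, and the first (smallest) step that yields a group wins. The names `StepGrouping`, `group`, and `chain` are invented for the example, and distributing this over Spark is not addressed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class StepGrouping {
    static final int MIN_GROUP_SIZE = 3; // assumption taken from the comments
    static final int MAX_SKIPS = 1;      // assumption: one missing term in a row is allowed

    static List<List<Integer>> group(List<Integer> input) {
        TreeSet<Integer> remaining = new TreeSet<>(input); // sorts and de-duplicates
        List<List<Integer>> groups = new ArrayList<>();
        while (!remaining.isEmpty()) {
            int start = remaining.pollFirst();
            // A group of MIN_GROUP_SIZE members spans at least (MIN_GROUP_SIZE - 1) steps.
            int maxStep = remaining.isEmpty()
                    ? 0
                    : (remaining.last() - start) / (MIN_GROUP_SIZE - 1);
            List<Integer> found = null;
            for (int next : remaining) {          // candidate steps in ascending order
                int step = next - start;
                if (step > maxStep) {
                    break;
                }
                List<Integer> candidate = chain(start, step, remaining);
                if (candidate.size() >= MIN_GROUP_SIZE) {
                    found = candidate;
                    break;
                }
            }
            if (found != null) {
                groups.add(found);
                remaining.removeAll(found);
            }
            // Otherwise 'start' fits no rule and is simply dropped.
        }
        return groups;
    }

    // Walks start, start + step, ... and stops after MAX_SKIPS + 1 misses in a row.
    static List<Integer> chain(int start, int step, TreeSet<Integer> remaining) {
        List<Integer> chain = new ArrayList<>();
        chain.add(start);
        int misses = 0;
        int probe = start + step;
        while (misses <= MAX_SKIPS) {
            if (remaining.contains(probe)) {
                chain.add(probe);
                misses = 0;
            } else {
                misses++;
            }
            probe += step;
        }
        return chain;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(
                1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100);
        System.out.println(group(input));
        // prints [[1010, 1020, 1030, 1050], [1880, 1900, 1920, 1940]]
    }
}
```

Because the search is greedy, a different tie-breaking rule (for example preferring the longest group instead of the smallest step) may give different groups on other inputs.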


java algorithm apache-spark sequence

