How to group sequences with the same regularity using Spark
To simplify the input parameters and code, I've generated an input list (in real life the input is a large number of CSV files containing numbers). The input numbers should be grouped so that each group follows a single regularity rule. Numbers that cannot be described by any regularity rule should be removed from the output.
input example:
[1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100]
output result example:
[[1010,1020,1030,1050],[1880,1900,1920,1940]]
[1010,1020,1030,1050] has increment step 10, with one excluded value (1040). [1880,1900,1920,1940] has increment step 20, with no excluded values.
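One possible reading of this rule, sketched in plain Java: a sorted group fits a step when every gap between neighbours is a positive multiple of that step, so an excluded value simply shows up as a gap of two steps. The class and method names (`StepCheck`, `fitsStep`) are made up for illustration, and the check deliberately does not limit how many values may be excluded.

```java
import java.util.Arrays;
import java.util.List;

public class StepCheck {

    // Hypothetical helper: true when each neighbour gap in the sorted list
    // is a positive multiple of step (a multiple above 1 means excluded values).
    static boolean fitsStep(List<Integer> sorted, int step) {
        if (step <= 0 || sorted.size() < 2) {
            return false;
        }
        for (int i = 1; i < sorted.size(); i++) {
            int gap = sorted.get(i) - sorted.get(i - 1);
            if (gap <= 0 || gap % step != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(fitsStep(Arrays.asList(1010, 1020, 1030, 1050), 10)); // true
        System.out.println(fitsStep(Arrays.asList(1880, 1900, 1920, 1940), 20)); // true
        System.out.println(fitsStep(Arrays.asList(1010, 1020, 1022, 1030), 10)); // false
    }
}
```

The third call fails because the gap 1022 - 1020 = 2 is not a multiple of 10, which is why 1022 has to be dropped from the first group.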
Starting solution with Spark in Java:
    import java.util.Arrays;
    import java.util.List;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // The class name is a placeholder; the original post shows only main().
    public class Reduce {
        public static void main(String[] args) throws Exception {
            // Silence Spark's logging (assumes the log4j 1.x API).
            Logger.getLogger("org").setLevel(Level.OFF);
            SparkConf conf = new SparkConf().setAppName("reduce").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);
            List<Integer> inputIntegers = Arrays.asList(1010,1020,1050,1030,1022,1880,1940,1900,1920,2010,3100);
            JavaRDD<Integer> integerRdd = sc.parallelize(inputIntegers);
            System.out.println("Check count of objects: " + integerRdd.count());
            System.out.println(integerRdd.collect());
            JavaRDD<Integer> sorted = integerRdd.sortBy(x -> x, true, 1);
            System.out.println("Check sorted collection:");
            System.out.println(sorted.collect());
            /* TODO
            - enrich data with increment step
            - group by increment step (increment step should be calculated at run time)
            - filter out single values
            */
        }
    }
I think I need to use the increment step, but I do not know how to combine data from different RDDs in Spark so that the increment step can be stored alongside each value in the current RDD.
Does anyone have ideas on how to solve this with Spark, or how it can be solved with different tools?
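To make the "enrich data with increment step" item from the TODO concrete, here is a minimal single-machine sketch in plain Java, not a Spark job. It pairs each sorted value with the distance to its successor. The names (`StepEnricher`, `withStepToNext`) are hypothetical. In Spark a similar pairing might be built with `zipWithIndex` plus a join on shifted indices, but that part is an assumption and is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StepEnricher {

    // Hypothetical helper: returns {value, distance to the next sorted value}
    // for every value except the last one.
    static List<int[]> withStepToNext(List<Integer> input) {
        List<Integer> sorted = input.stream().sorted().collect(Collectors.toList());
        List<int[]> enriched = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            int value = sorted.get(i);
            enriched.add(new int[] {value, sorted.get(i + 1) - value});
        }
        return enriched;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(
                1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100);
        for (int[] pair : withStepToNext(input)) {
            System.out.println(pair[0] + " -> step " + pair[1]);
        }
    }
}
```

For the sample input the neighbour steps come out as 10, 2, 8, 20, 830, 20, 20, 20, 70, 1090. The stray value 1022 splits the first run into steps 2 and 8, so grouping by neighbour step alone would not recover [1010,1020,1030,1050]; the step has to be tested against non-adjacent values as well.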
P.S.
Spark (or an equivalent technology) is mandatory, because the solution must be big-data oriented and therefore has to run on a distributed system.
java algorithm apache-spark sequence
So I guess a group should have at least 3 elements; otherwise any pair of elements would satisfy the regularity rule.
– mangusta
Nov 10 at 10:56
@mangusta Good point, updated. Thanks.
– Sergii
Nov 10 at 10:59
Hi! Without additional conditions and requirements, the underlying algorithmic problem seems to be intractable. In your example the input data is already sorted. Also, are you looking only for a constant-increment rule, or are other regularity rules acceptable, e.g. the sequence 1,2,4,5,7,8,10,11,...?
– Yola
Nov 10 at 11:33
Almost sorted (collisions can happen, e.g. 1010,1020,1050,1030...), and the exclusion rule makes the problem harder. @yola, do you know how to group numbers by increment value?
– Sergii
Nov 10 at 11:38
I'm afraid I don't know anything about Spark. First I would sort the sequence with something like bubble sort, since you say it is already almost sorted. Then I would decide on the largest possible step, which could be deduced from the range of the numbers in the sequence, and maybe from other considerations as well. After that I would just routinely check each candidate step. The first step to consider would be a[1] - a[0].
– Yola
Nov 10 at 11:44
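The approach from the last comment (sort, then try candidate steps starting from a[1] - a[0]) could be sketched roughly as below. This is plain single-machine Java, not a distributed solution, and it relies on two assumptions that the question does not state precisely: a run may miss at most one expected value in a row (`MAX_SKIPS`), and a run needs at least 3 members (`MIN_SIZE`, following the first comment). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class RegularityGrouper {

    static final int MAX_SKIPS = 1; // assumption: at most one excluded value in a row
    static final int MIN_SIZE = 3;  // assumption: smaller runs are treated as noise

    static List<List<Integer>> group(List<Integer> input) {
        // Sort and drop duplicates so candidate steps are always positive.
        List<Integer> remaining = new ArrayList<>(new TreeSet<>(input));
        List<List<Integer>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<Integer> best = new ArrayList<>();
            // Each later value, taken as the second member, fixes a candidate step.
            for (int j = 1; j < remaining.size(); j++) {
                List<Integer> run = buildRun(remaining, j);
                if (run.size() > best.size()) {
                    best = run; // on ties the smaller step found earlier is kept
                }
            }
            if (best.size() >= MIN_SIZE) {
                result.add(best);
                remaining.removeAll(best);
            } else {
                remaining.remove(0); // the smallest value fits no run, drop it
            }
        }
        return result;
    }

    // Extends the run {sorted[0], sorted[second]} with later values on the same step.
    static List<Integer> buildRun(List<Integer> sorted, int second) {
        int step = sorted.get(second) - sorted.get(0);
        long maxGap = (long) step * (MAX_SKIPS + 1);
        List<Integer> run = new ArrayList<>(Arrays.asList(sorted.get(0), sorted.get(second)));
        int last = sorted.get(second);
        for (int k = second + 1; k < sorted.size(); k++) {
            long gap = (long) sorted.get(k) - last;
            if (gap > maxGap) {
                break; // too many expected values would be missing
            }
            if (gap % step == 0) {
                run.add(sorted.get(k));
                last = sorted.get(k);
            }
        }
        return run;
    }

    public static void main(String[] args) {
        System.out.println(group(Arrays.asList(
                1010, 1020, 1050, 1030, 1022, 1880, 1940, 1900, 1920, 2010, 3100)));
        // prints [[1010, 1020, 1030, 1050], [1880, 1900, 1920, 1940]]
    }
}
```

The greedy search is cubic in the worst case, so it only illustrates the grouping rule. Without the `MAX_SKIPS` bound, step 10 would absorb every value divisible by 10 in the sample, which is one way to see why the problem needs extra conditions before it is well defined.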
edited Nov 10 at 10:58
asked Nov 10 at 10:50
Sergii