Best way to get (millions of rows of) data into Janusgraph via Tinkerpop, with a specific model
I've just started out with TinkerPop and JanusGraph, and I'm trying to figure this out from the documentation.



  • I have three datasets (CSV files), each containing about 20 million rows.

  • There is a specific model describing how the variables and rows need to be connected, i.e. which parts become vertices, which become labels, which become edges, and so on.

  • Once everything is in the graph, I'd of course like to run some basic Gremlin queries to see how well the model works.

But first I need a way to get the data into Janusgraph.



Possibly there already exist scripts for this.
Otherwise, is this something to be written in Python, say: open a CSV file, read each row of a variable X, and add it as a vertex/edge/etc.?
Or am I completely misinterpreting JanusGraph/TinkerPop?



Thanks for any help in advance.



EDIT:



Say I have a few files, each of which contains a few million rows representing people, and several variables representing different metrics. A first example could look like this:



          metric_1  metric_2  metric_3  ..
person_1  a         e         i
person_2  b         f         j
person_3  c         g         k
person_4  d         h         l
..


Should I translate this into files of nodes that are, in the first place, made up of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?



And are [a,..., l] then indexed?



The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping labels/categories). Should each measurement be indexed separately and then linked to the person_x to which it belongs?



Apologies for these probably straightforward questions, but I'm fairly new to this.










  • Does each dataset map to a different graph? Have you already configured a storage backend? – Benoit Guigal, Nov 14 '18 at 14:24

  • In this case there are several datasets (CSV files) that should become one graph. (In another case I will use only one dataset.) For the storage backend: I've downloaded ScyllaDB and performed steps 1 & 2 of scylladb.com/download/debian9 -> since I only want to use this on my desktop, not in a cluster (yet), I have not done step 3. Should I? – nikolai, Nov 14 '18 at 16:13

  • OK, great. For testing purposes, though, I would recommend using the script bin/janusgraph.sh, which will start Cassandra, Elasticsearch and a Gremlin Server. You will then be free in the future to choose which storage backend you want to use. – Benoit Guigal, Nov 14 '18 at 16:56

  • Thanks, I'll download Cassandra and do as stated here: docs.janusgraph.org/latest/cassandra.html. But do I need to use/download Elasticsearch as well? Also, does this not interfere with ScyllaDB? – nikolai, Nov 14 '18 at 17:25

  • If you use the script janusgraph.sh you do not have to download anything; Cassandra and Elasticsearch are packaged with JanusGraph. You do, however, have to stop ScyllaDB to avoid conflicting port bindings. – Benoit Guigal, Nov 14 '18 at 18:01
python gremlin tinkerpop tinkerpop3 janusgraph






edited Nov 14 '18 at 18:35
asked Nov 13 '18 at 20:02
nikolai (133)







2 Answers

JanusGraph uses pluggable storage and index backends. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):



cd /path/to/janus
bin/janusgraph.sh start


Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin Console:



bin/gremlin.sh -e scripts/load_data.script 


An efficient way to load the data is to split it into two files:



  • nodes.csv: one line per node, with all its attributes

  • links.csv: one line per link, with source_id, target_id and all the link's attributes

This might require some data preparation steps.
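As a sketch of that preparation step (assuming the person/metric layout from the question; the file names, column names and the split_csv helper are hypothetical, not JanusGraph API), plain Python is enough to split the source CSV into nodes.csv and links.csv:

```python
import csv

def split_csv(src_path, nodes_path, links_path):
    """Split a person/metric CSV into a node file and a link file.

    Each person and each distinct metric value becomes a node; each
    (person, value) pair becomes a link labelled with the metric name.
    """
    with open(src_path, newline="") as src, \
         open(nodes_path, "w", newline="") as nf, \
         open(links_path, "w", newline="") as lf:
        reader = csv.DictReader(src)
        nodes = csv.writer(nf)
        links = csv.writer(lf)
        nodes.writerow(["node_id", "label"])
        links.writerow(["source_id", "target_id", "metric"])
        seen = set()  # avoid emitting the same node twice
        for row in reader:
            person = row["person"]
            if person not in seen:
                seen.add(person)
                nodes.writerow([person, "person"])
            for metric, value in row.items():
                if metric == "person":
                    continue
                # namespace the value with the metric name so ids stay unique
                value_id = f"{metric}:{value}"
                if value_id not in seen:
                    seen.add(value_id)
                    nodes.writerow([value_id, "measurement"])
                links.writerow([person, value_id, metric])
```

The one design choice worth noting is namespacing each value with its metric name, so that the same raw value appearing under two different metrics does not collapse into one node.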



Here is an example script



The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
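A minimal sketch of that mapping idea (pure Python; insert_vertex and insert_edge are hypothetical stand-ins for whatever actually writes to JanusGraph and returns the backend-generated id):

```python
def load_graph(node_rows, link_rows, insert_vertex, insert_edge):
    """Create all vertices first, remembering the ids JanusGraph generated,
    then resolve each link through that in-memory map instead of looking
    the endpoints up in the graph for every edge.
    """
    id_map = {}  # your CSV id -> id generated by the backend
    for row in node_rows:
        id_map[row["node_id"]] = insert_vertex(row)
    for row in link_rows:
        insert_edge(id_map[row["source_id"]], id_map[row["target_id"]], row)
    return id_map
```

The point of the map is that each edge insert becomes a dictionary lookup rather than a graph query, which is what dominates the cost once you are past a few million edges.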



Even though it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script






answered Nov 14 '18 at 17:27
Benoit Guigal

  • Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'id'. – nikolai, Nov 14 '18 at 18:32

  • You're welcome. I am not sure I understand your graph structure. Are the persons in your files the vertices of the graph? What are the links between the persons? I recommend reading this presentation (slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure. – Benoit Guigal, Nov 15 '18 at 8:40


















Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version, about two years ago, and it's still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph itself: different users have very different data, different formats, and different graph models (some mostly need one vertex with one edge each, e.g. child-mother; others deal with one vertex with many edges, e.g. user followers). Last but definitely not least, the very nature of the tool means dealing with large datasets, and the underlying storage and index databases mostly come preconfigured to replicate heavily (i.e. you might be thinking 20M rows, but you actually end up inserting 60M or 80M entries).



All that said, I've had moderate success bulk loading some tens of millions of rows in decent timeframes (again, it will be painful, but here are the general steps):



  • Provide ids when creating graph elements. If importing from e.g. MySQL, consider combining the table name with the id value to create unique ids, e.g. users1, tweets2.

  • Don't specify the schema up front, because JanusGraph would then need to check that the data conforms on every insert.

  • Don't create indexes up front. This is related to the previous point but really deserves its own entry: bulk insert first, index later.

  • Please, please, please be aware of the underlying database's features for bulk inserts and activate them, i.e. read up on the Cassandra, ScyllaDB or Bigtable docs, especially on replication and indexing.

  • After all of the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate ids), and consider some form of parallelized inserting, e.g. some kind of map-reduce system.
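The batching-and-parallelizing step above can be sketched in plain Python (insert_batch is a hypothetical stand-in for whatever issues the actual writes, e.g. a gremlin-python session or a Groovy loader; batch size and worker count are exactly the knobs you tune by trial and error):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def parallel_load(rows, insert_batch, batch_size=500, workers=4):
    """Fan batches of rows out to `workers` threads.

    insert_batch(batch) stands in for the real write path; each batch
    would typically be committed as one transaction.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(insert_batch, b)
                   for b in batched(rows, batch_size)]
        for f in futures:
            f.result()  # re-raise any insert failure
```

This is a sketch only: real loaders also need retry logic for transient backend failures, and the sweet spot for batch_size/workers depends entirely on your data and cluster.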

I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error. For example, with bulk insert rates, too low is bad (e.g. 10 per second) while too high is equally bad (e.g. 10k per second), and it almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.



All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.



All the best!






share|improve this answer






















    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53288639%2fbest-way-to-get-millions-of-rows-of-data-into-janusgraph-via-tinkerpop-with-a%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3














    JanusGraph uses pluggable storage backends and indexs. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server but we won't use it)



    cd /path/to/janus
    bin/janusgraph.sh start


    Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console



    bin/gremlin.sh -e scripts/load_data.script 


    An efficient way to load the data is to split it into two files:



    • nodes.csv: one line per node with all attributes

    • links.csv: one line per link with source_id and target_id and all the links attributes

    This might require some data preparation steps.



    Here is an example script



    The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.



    Even if it is not mandatory, I strongly recommend you to create an explicit schema for your graph before loading any data. Here is an example script






    share|improve this answer


















    • 1





      Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

      – nikolai
      Nov 14 '18 at 18:32











    • You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

      – Benoit Guigal
      Nov 15 '18 at 8:40















    3














    JanusGraph uses pluggable storage backends and indexs. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server but we won't use it)



    cd /path/to/janus
    bin/janusgraph.sh start


    Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console



    bin/gremlin.sh -e scripts/load_data.script 


    An efficient way to load the data is to split it into two files:



    • nodes.csv: one line per node with all attributes

    • links.csv: one line per link with source_id and target_id and all the links attributes

    This might require some data preparation steps.



    Here is an example script



    The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.



    Even if it is not mandatory, I strongly recommend you to create an explicit schema for your graph before loading any data. Here is an example script






    share|improve this answer


















    • 1





      Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

      – nikolai
      Nov 14 '18 at 18:32











    • You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

      – Benoit Guigal
      Nov 15 '18 at 8:40













    3












    3








    3







    JanusGraph uses pluggable storage backends and indexs. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server but we won't use it)



    cd /path/to/janus
    bin/janusgraph.sh start


    Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console



    bin/gremlin.sh -e scripts/load_data.script 


    An efficient way to load the data is to split it into two files:



    • nodes.csv: one line per node with all attributes

    • links.csv: one line per link with source_id and target_id and all the links attributes

    This might require some data preparation steps.



    Here is an example script



    The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.



    Even if it is not mandatory, I strongly recommend you to create an explicit schema for your graph before loading any data. Here is an example script






    share|improve this answer













    JanusGraph uses pluggable storage backends and indexs. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It allows to quickly get up and running by starting Cassandra and Elasticsearch (it also starts a gremlin-server but we won't use it)



    cd /path/to/janus
    bin/janusgraph.sh start


    Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console



    bin/gremlin.sh -e scripts/load_data.script 


    An efficient way to load the data is to split it into two files:



    • nodes.csv: one line per node with all attributes

    • links.csv: one line per link with source_id and target_id and all the links attributes

    This might require some data preparation steps.



    Here is an example script



    The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.



    Even if it is not mandatory, I strongly recommend you to create an explicit schema for your graph before loading any data. Here is an example script







    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 14 '18 at 17:27









    Benoit GuigalBenoit Guigal

    3481417




    3481417







    • 1





      Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

      – nikolai
      Nov 14 '18 at 18:32











    • You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

      – Benoit Guigal
      Nov 15 '18 at 8:40












    • 1





      Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

      – nikolai
      Nov 14 '18 at 18:32











    • You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

      – Benoit Guigal
      Nov 15 '18 at 8:40







    1




    1





    Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

    – nikolai
    Nov 14 '18 at 18:32





    Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.

    – nikolai
    Nov 14 '18 at 18:32













    You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

    – Benoit Guigal
    Nov 15 '18 at 8:40





    You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.

    – Benoit Guigal
    Nov 15 '18 at 8:40













    3














    Well, the truth is bulk loading of real user data into JanusGraph is a real pain. I've been using JanuGraph since it's very first version about 2 years ago and its still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph because different users have very different data, different formats, different graph models (ie some mostly need one vertex with one edge ( ex. child-mother ) others deal with one vertex with many edges ( ex user followers ) ) and last but definitely not least, the very nature of the tool deals with large data sets, not to mention the underlying storage and index databases mostly come preconfigured to replicate massively (i.e you might be thinking 20m rows but you actually end up inserting 60m or 80m entries)



    All said, I've had moderate success in bulk loading a some tens of millions in decent timeframes (again it will be painful but here are the general steps).



    • Provide IDs when creating graph elements. If importing from eg MySQL think of perhaps combining the tablename with the id value to create unique IDs eg users1, tweets2

    • Don't specify schema up front. This is because JanusGraph will need to ensure the data conforms on each inserting

    • Don't specify index up front. Just related to above but really deserves its own entry. Bulk insert first index later

    • Please, please, please, be aware of the underlying database features for bulk inserts and activate them i.e read up on Cassandra, ScyllaDB, Big Table, docs especially on replication and indexing

    • After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e no duplicate ids) and consider some form of parallelizing insert request e.g some kind of map reduce system

    I think I've covered the major points, again, there's no silver bullet here and the process normally involves quite some trial and error for example the bulk insert rates, too low is bad e.g 10 per second while too high is equally bad eg 10k per second and it almost always depends on your data so its a case by case basis, can't recommend where you should start.



    All said and done, give it a real go, bulk load is the hardest part in my opinion and the struggles are well worth the new dimension it gives your application.



    All the best!






        answered Nov 16 '18 at 2:42









        Don Omondi
