Best way to get (millions of rows of) data into JanusGraph via TinkerPop, with a specific model
I've just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.
- I have three datasets, each containing about 20 million rows (CSV files).
- There is a specific model in which the variables and rows need to be connected, i.e. what are vertices, what are labels, what are edges, etc.
- Once everything is in a graph, I'd of course like to use some basic Gremlin to see how well the model works.
But first I need a way to get the data into JanusGraph.
Possibly there exist scripts for this.
Otherwise, is it something to be written in Python, i.e. open a CSV file, get each row of a variable X, and add it as a vertex/edge/etc.?
Or am I completely misinterpreting JanusGraph/TinkerPop?
Thanks in advance for any help.
EDIT:
Say I have a few files, each containing a few million rows representing people, and several variables representing different metrics. A first example could look like this:
         metric_1 metric_2 metric_3 ..
person_1 a        e        i
person_2 b        f        j
person_3 c        g        k
person_4 d        h        l
..
Should I translate this into files with nodes that are at first made up of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?
And are [a, ..., l] then indexed?
The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping label/category); e.g. should each measurement be indexed separately and then linked to the person_x to which it belongs?
Apologies for these probably straightforward questions, but I'm fairly new to this.
python gremlin tinkerpop tinkerpop3 janusgraph
Does each dataset map to a different graph? Have you already configured a storage backend?
– Benoit Guigal
Nov 14 '18 at 14:24
In this case there are several datasets (CSV files) that should become one graph. (In another case I will use only one dataset.) For the storage backend: I've downloaded ScyllaDB and performed steps 1 & 2 of scylladb.com/download/debian9 -> since I only want to use this on my desktop, not in a cluster (yet), I have not done step 3. Should I?
– nikolai
Nov 14 '18 at 16:13
Ok great. For testing purposes, though, I would recommend using the script bin/janusgraph.sh, which will start Cassandra, Elasticsearch and a Gremlin Server. You will then be free in the future to tune which storage backend you want to use.
– Benoit Guigal
Nov 14 '18 at 16:56
Thanks, I'll download Cassandra and do as stated here: docs.janusgraph.org/latest/cassandra.html. But do I need to use/download Elasticsearch as well? Also, does this not interfere with ScyllaDB?
– nikolai
Nov 14 '18 at 17:25
If you use the script janusgraph.sh you do not have to download anything; Cassandra and Elasticsearch are packaged with JanusGraph. You do, however, have to stop ScyllaDB to avoid conflicting port bindings.
– Benoit Guigal
Nov 14 '18 at 18:01
asked Nov 13 '18 at 20:02 by nikolai, edited Nov 14 '18 at 18:35
2 Answers
JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a Gremlin Server, but we won't use it):
cd /path/to/janus
bin/janusgraph.sh start
Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin Console:
bin/gremlin.sh -e scripts/load_data.script
An efficient way to load the data is to split it into two files:
- nodes.csv: one line per node, with all of its attributes
- links.csv: one line per link, with source_id, target_id and all of the link's attributes
This might require some data preparation steps.
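As a rough illustration of that preparation step, here is a minimal Python sketch that splits a person/metric file like the one in the question into nodes.csv and links.csv. The file layout, the column names, and the choice to model persons and measurements as separate nodes linked by a has_measurement edge are assumptions for the example, not something prescribed by JanusGraph:

```python
import csv

def prepare(input_path, nodes_path, links_path):
    """Split one person/metric CSV into nodes.csv and links.csv.

    Assumes the input has a header row with a 'person' column followed
    by one column per metric (an assumed layout for this sketch).
    """
    with open(input_path, newline="") as src, \
         open(nodes_path, "w", newline="") as nodes, \
         open(links_path, "w", newline="") as links:
        reader = csv.DictReader(src)
        node_writer = csv.writer(nodes)
        link_writer = csv.writer(links)
        node_writer.writerow(["id", "label", "value"])
        link_writer.writerow(["source_id", "target_id", "label"])
        for row in reader:
            person_id = row.pop("person")
            # One node per person, with no value of its own.
            node_writer.writerow([person_id, "person", ""])
            for metric, value in row.items():
                # One node per measurement, linked back to its person.
                m_id = f"{person_id}_{metric}"
                node_writer.writerow([m_id, "measurement", value])
                link_writer.writerow([person_id, m_id, "has_measurement"])
```

If you instead decide to model each person as a single vertex with the metrics as properties, nodes.csv simply keeps one row per person and links.csv may not be needed at all.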
Here is an example script
The trick to speed up the process is to keep a mapping between your own ids and the ids created by JanusGraph during the creation of the nodes.
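That bookkeeping can be sketched as follows. This is plain Python for illustration; create_vertex and create_edge are hypothetical stand-ins for the real Gremlin calls, and in a Groovy load script the mapping would be an ordinary Map kept the same way:

```python
def load(nodes, links, create_vertex, create_edge):
    """Load nodes, remembering each graph-assigned id, then load links.

    nodes: iterable of (own_id, attributes)
    links: iterable of (source_id, target_id), using our own ids
    """
    id_map = {}  # our id -> id assigned by the graph at creation time
    for own_id, attrs in nodes:
        id_map[own_id] = create_vertex(attrs)
    for source_id, target_id in links:
        # One in-memory dictionary lookup per endpoint, instead of an
        # index query against the graph for every edge.
        create_edge(id_map[source_id], id_map[target_id])
```

The point is that resolving link endpoints becomes a dictionary lookup rather than a graph lookup, which matters a great deal when you have millions of edges.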
Even if it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script
answered Nov 14 '18 at 17:27 by Benoit Guigal
Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what exactly is meant by 'id'.
– nikolai
Nov 14 '18 at 18:32
You're welcome. I am not sure I understand your graph structure. Are the persons in your files the vertices of the graph? What are the links between the persons? I recommend reading this presentation (slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.
– Benoit Guigal
Nov 15 '18 at 8:40
Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version, about 2 years ago, and it's still a pain to bulk load data. A lot of that is not necessarily down to JanusGraph itself: different users have very different data, different formats, and different graph models (some mostly need one vertex with one edge, e.g. child-mother; others deal with one vertex with many edges, e.g. user-followers). And last but definitely not least, the very nature of the tool means dealing with large data sets, not to mention that the underlying storage and index databases mostly come preconfigured to replicate heavily (i.e. you might be thinking 20m rows, but you actually end up inserting 60m or 80m entries).
All said, I've had moderate success bulk loading some tens of millions of rows in decent timeframes (again, it will be painful, but here are the general steps):
- Provide IDs when creating graph elements. If importing from e.g. MySQL, think of perhaps combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
- Don't specify a schema up front. Otherwise JanusGraph will need to ensure the data conforms on each insert.
- Don't specify indexes up front. This is related to the above but really deserves its own entry: bulk insert first, index later.
- Please, please, please be aware of the underlying database's bulk insert features and activate them, i.e. read up on the Cassandra, ScyllaDB and Big Table docs, especially on replication and indexing.
- After all of the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate ids), and consider some form of parallelizing the insert requests, e.g. some kind of map-reduce system.
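The first step above, deterministic unique ids from the table name plus the row's primary key, can be sketched in a few lines of Python (the table names and row ids here are made up for illustration):

```python
def element_id(table, row_id):
    """Combine the source table name with the row's primary key so that
    numerically colliding ids from different tables stay distinct."""
    return f"{table}{row_id}"

# users row 1 and tweets row 1 no longer clash:
ids = {element_id(t, i) for t in ("users", "tweets") for i in (1, 2)}
```

The same idea works with any separator or hashing scheme, as long as the result is stable across runs so that re-running the import maps each source row to the same graph element.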
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite a bit of trial and error, for example with the bulk insert rate: too low is bad (e.g. 10 per second) while too high is equally bad (e.g. 10k per second), and it almost always depends on your data, so it's case by case; I can't recommend where you should start.
All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.
All the best!
1
Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.
– nikolai
Nov 14 '18 at 18:32
You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.
– Benoit Guigal
Nov 15 '18 at 8:40
add a comment |
1
Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.
– nikolai
Nov 14 '18 at 18:32
You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.
– Benoit Guigal
Nov 15 '18 at 8:40
1
1
Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.
– nikolai
Nov 14 '18 at 18:32
Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'.
– nikolai
Nov 14 '18 at 18:32
You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.
– Benoit Guigal
Nov 15 '18 at 8:40
You're welcome, I am not sure to understand your graph structure. Are the persons in your files the vertices of the graph ? What are the links between the persons ? I recommend reading this presentation (Slides 5 to 10) slideshare.net/ptgoetz/… about JanusGraph and graph structure.
– Benoit Guigal
Nov 15 '18 at 8:40
add a comment |
Well, the truth is bulk loading of real user data into JanusGraph is a real pain. I've been using JanuGraph since it's very first version about 2 years ago and its still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph because different users have very different data, different formats, different graph models (ie some mostly need one vertex with one edge ( ex. child-mother ) others deal with one vertex with many edges ( ex user followers ) ) and last but definitely not least, the very nature of the tool deals with large data sets, not to mention the underlying storage and index databases mostly come preconfigured to replicate massively (i.e you might be thinking 20m rows but you actually end up inserting 60m or 80m entries)
All said, I've had moderate success in bulk loading a some tens of millions in decent timeframes (again it will be painful but here are the general steps).
- Provide IDs when creating graph elements. If importing from eg MySQL think of perhaps combining the tablename with the id value to create unique IDs eg users1, tweets2
- Don't specify schema up front. This is because JanusGraph will need to ensure the data conforms on each inserting
- Don't specify index up front. Just related to above but really deserves its own entry. Bulk insert first index later
- Please, please, please, be aware of the underlying database features for bulk inserts and activate them i.e read up on Cassandra, ScyllaDB, Big Table, docs especially on replication and indexing
- After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e no duplicate ids) and consider some form of parallelizing insert request e.g some kind of map reduce system
I think I've covered the major points, again, there's no silver bullet here and the process normally involves quite some trial and error for example the bulk insert rates, too low is bad e.g 10 per second while too high is equally bad eg 10k per second and it almost always depends on your data so its a case by case basis, can't recommend where you should start.
All said and done, give it a real go, bulk load is the hardest part in my opinion and the struggles are well worth the new dimension it gives your application.
All the best!
add a comment |
Well, the truth is bulk loading of real user data into JanusGraph is a real pain. I've been using JanuGraph since it's very first version about 2 years ago and its still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph because different users have very different data, different formats, different graph models (ie some mostly need one vertex with one edge ( ex. child-mother ) others deal with one vertex with many edges ( ex user followers ) ) and last but definitely not least, the very nature of the tool deals with large data sets, not to mention the underlying storage and index databases mostly come preconfigured to replicate massively (i.e you might be thinking 20m rows but you actually end up inserting 60m or 80m entries)
All said, I've had moderate success in bulk loading a some tens of millions in decent timeframes (again it will be painful but here are the general steps).
- Provide IDs when creating graph elements. If importing from eg MySQL think of perhaps combining the tablename with the id value to create unique IDs eg users1, tweets2
- Don't specify schema up front. This is because JanusGraph will need to ensure the data conforms on each inserting
- Don't specify index up front. Just related to above but really deserves its own entry. Bulk insert first index later
- Please, please, please, be aware of the underlying database features for bulk inserts and activate them i.e read up on Cassandra, ScyllaDB, Big Table, docs especially on replication and indexing
- After all the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e no duplicate ids) and consider some form of parallelizing insert request e.g some kind of map reduce system
I think I've covered the major points, again, there's no silver bullet here and the process normally involves quite some trial and error for example the bulk insert rates, too low is bad e.g 10 per second while too high is equally bad eg 10k per second and it almost always depends on your data so its a case by case basis, can't recommend where you should start.
All said and done, give it a real go, bulk load is the hardest part in my opinion and the struggles are well worth the new dimension it gives your application.
All the best!
add a comment |
Well, the truth is bulk loading of real user data into JanusGraph is a real pain. I've been using JanuGraph since it's very first version about 2 years ago and its still a pain to bulk load data. A lot of it is not necessarily down to JanusGraph because different users have very different data, different formats, different graph models (ie some mostly need one vertex with one edge ( ex. child-mother ) others deal with one vertex with many edges ( ex user followers ) ) and last but definitely not least, the very nature of the tool deals with large data sets, not to mention the underlying storage and index databases mostly come preconfigured to replicate massively (i.e you might be thinking 20m rows but you actually end up inserting 60m or 80m entries)
All said, I've had moderate success in bulk loading a some tens of millions in decent timeframes (again it will be painful but here are the general steps).
- Provide IDs when creating graph elements. If importing from e.g. MySQL, think of perhaps combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
- Don't specify the schema up front, because otherwise JanusGraph will need to ensure the data conforms on every insert.
- Don't create indexes up front. This is related to the point above but really deserves its own entry: bulk insert first, index later.
- Please, please, please be aware of the underlying database's features for bulk inserts and activate them, i.e. read up on the Cassandra, ScyllaDB, or Bigtable docs, especially on replication and indexing.
- After all of the above, configure JanusGraph for bulk loading, ensure your data integrity is correct (i.e. no duplicate IDs), and consider some form of parallelizing the insert requests, e.g. some kind of map-reduce system.
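As a rough illustration of the first and last points above, here is a minimal Python sketch (the table name, batch size, and CSV layout are hypothetical) that derives unique element IDs from the source table name plus the row id, and groups CSV rows into fixed-size batches so each insert transaction stays small:

```python
import csv
import io
from itertools import islice

def make_id(table, row_id):
    """Combine the source table name with the row id, e.g. ('users', 1) -> 'users1'."""
    return f"{table}{row_id}"

def batches(rows, size):
    """Yield fixed-size lists of rows so each insert/commit stays small."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical in-memory CSV of people; in practice you would open your real file.
data = io.StringIO("id,metric_1,metric_2\n1,a,e\n2,b,f\n3,c,g\n")
rows = list(csv.DictReader(data))

for batch in batches(rows, 2):
    for row in batch:
        vid = make_id("person", row["id"])  # 'person1', 'person2', ...
        # Here you would issue the actual write, e.g. via gremlinpython:
        # g.addV('person').property('uid', vid).property('metric_1', row['metric_1']),
        # committing once per batch rather than once per row.
        print(vid)
```

The actual graph writes (commented out above) depend on your driver and graph model; the point is only the ID scheme and the batching.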
I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite a bit of trial and error. For example, the bulk insert rate matters: too low is bad (e.g. 10 per second), while too high is equally bad (e.g. 10k per second), and the right rate almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.
All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggle is well worth the new dimension it gives your application.
All the best!
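For completeness, the "configure JanusGraph for bulk loading" step above typically means setting a few properties in the graph configuration file. The values below are illustrative only; check them against your JanusGraph version's configuration reference before use:

```
# janusgraph-bulk.properties (illustrative sketch)
storage.backend=cql
storage.hostname=127.0.0.1
storage.batch-loading=true     # disables consistency checks and locking during load
schema.default=none            # optional: fail fast instead of auto-creating schema
ids.block-size=1000000         # reserve larger ID blocks for write-heavy workloads
```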
answered Nov 16 '18 at 2:42
Don Omondi
736813
Does each dataset map to a different graph? Have you already configured a storage backend?
– Benoit Guigal
Nov 14 '18 at 14:24
In this case there are several datasets (csv files) that should become one graph. (In another case I will use only one dataset.) For the storage backend: I've downloaded ScyllaDB and performed steps 1 & 2 from scylladb.com/download/debian9. Since I only want to use this on my desktop, not in a cluster (yet), I have not done step 3. Should I?
– nikolai
Nov 14 '18 at 16:13
Ok great. For testing purposes though, I would recommend using the script bin/janusgraph.sh, which will start Cassandra, Elasticsearch, and a Gremlin Server. You will then be free in the future to tune which storage backend you want to use.
– Benoit Guigal
Nov 14 '18 at 16:56
Thanks, I'll download Cassandra and do as stated at docs.janusgraph.org/latest/cassandra.html, but do I need to download Elasticsearch as well? Also, does this not interfere with ScyllaDB?
– nikolai
Nov 14 '18 at 17:25
If you use the script janusgraph.sh you do not have to download anything; Cassandra and Elasticsearch are packaged with JanusGraph. You do, however, have to stop ScyllaDB to avoid a conflicting port binding.
– Benoit Guigal
Nov 14 '18 at 18:01
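For reference, the start/stop workflow discussed in the comments above looks roughly like this (paths assume the stock JanusGraph distribution layout; this is a usage sketch, not a full setup guide):

```shell
# From the root of the unpacked JanusGraph distribution:
bin/janusgraph.sh start    # starts packaged Cassandra, Elasticsearch, and Gremlin Server
bin/janusgraph.sh status   # check that all three components are up
bin/janusgraph.sh stop     # shut everything down again
```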