How to implement code to manipulate files that runs in parallel?
I'm trying to load 10 dependent directories, each of which contains a bunch of JSON files. The structure is shown below:
import json
import os

import pandas as pd

# charliehebdo, rumours
data = []  # reset before each directory so the DataFrames do not accumulate earlier files
for fpathe1, dirs1, fs1 in os.walk('../input/charliehebdo/rumours/'):
    for f in fs1:
        with open(os.path.join(fpathe1, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
charliehebdo = pd.DataFrame(data)
charliehebdo['label'] = 'TRUE'
charliehebdo['event'] = 'charliehebdo'

# charliehebdo, non-rumours
data = []
for fpathe2, dirs2, fs2 in os.walk('../input/charliehebdo/non-rumours/'):
    for f in fs2:
        with open(os.path.join(fpathe2, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
nonRumourcharliehebdo = pd.DataFrame(data)
nonRumourcharliehebdo['label'] = 'FALSE'
nonRumourcharliehebdo['event'] = 'charliehebdo'

# ferguson, rumours
data = []
for fpathe3, dirs3, fs3 in os.walk('../input/ferguson/rumours/'):
    for f in fs3:
        with open(os.path.join(fpathe3, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
ferguson = pd.DataFrame(data)
ferguson['label'] = 'TRUE'
ferguson['event'] = 'ferguson'

# ferguson, non-rumours (iterate fs4/fpathe4 here, not the fs3/fpathe3 left over from the previous loop)
data = []
for fpathe4, dirs4, fs4 in os.walk('../input/ferguson/non-rumours/'):
    for f in fs4:
        with open(os.path.join(fpathe4, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
nonRumourferguson = pd.DataFrame(data)
nonRumourferguson['label'] = 'FALSE'
nonRumourferguson['event'] = 'ferguson'
However, the sample code is extremely time-consuming (I ran it on my laptop with an Intel Core i7-4720HQ and it took more than 24 hours), so I'm wondering whether there is a better solution.
Well, it seems that my structure figure confused or misled you, so here is the dataset: raw dataset
I intended to illustrate the dataset with a figure, but that turned out to make things worse.
python performance
What does the "structure" diagram convey exactly? What are the words intended to represent, and what are the lines for?
– Mad Physicist
Nov 12 at 6:39
I suggest you first profile your script and see where it's spending most of its time, because that will tell you whether using concurrent processing will be worthwhile or not. See How can you profile a script?
– martineau
Nov 12 at 6:41
Your code looks like it's probably I/O bound, so parallel processing may not speed things up, and could in fact slow things down because of the overhead involved in using it.
– martineau
Nov 12 at 6:50
@MadPhysicist: From the code it looks to me like it's the filesystem directory structure involved.
– martineau
Nov 12 at 6:52
edited Nov 13 at 2:22
asked Nov 12 at 6:02
Tilmant
13