How to implement code to manipulate files that runs in parallel?

I'm trying to load 10 dependent directories, each of which contains a bunch of JSON files. The structure is shown below:

5 events, each divided into 2 categories

import json
import os

import pandas as pd

# Charlie Hebdo, rumours
data = []
for fpathe1, dirs1, fs1 in os.walk('../input/charliehebdo/rumours/'):
    for f in fs1:
        with open(os.path.join(fpathe1, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
    charliehebdo = pd.DataFrame(data)
    charliehebdo['label'] = 'TRUE'
    charliehebdo['event'] = 'charliehebdo'

# Charlie Hebdo, non-rumours
data = []
for fpathe2, dirs2, fs2 in os.walk('../input/charliehebdo/non-rumours/'):
    for f in fs2:
        with open(os.path.join(fpathe2, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
    nonRumourcharliehebdo = pd.DataFrame(data)
    nonRumourcharliehebdo['label'] = 'FALSE'
    nonRumourcharliehebdo['event'] = 'charliehebdo'

# Ferguson, rumours
data = []
for fpathe3, dirs3, fs3 in os.walk('../input/ferguson/rumours/'):
    for f in fs3:
        with open(os.path.join(fpathe3, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
    ferguson = pd.DataFrame(data)
    ferguson['label'] = 'TRUE'
    ferguson['event'] = 'ferguson'

# Ferguson, non-rumours
data = []
for fpathe4, dirs4, fs4 in os.walk('../input/ferguson/non-rumours/'):
    for f in fs4:
        with open(os.path.join(fpathe4, f)) as dir_loc:
            data.append(json.loads(dir_loc.read()))
    nonRumourferguson = pd.DataFrame(data)
    nonRumourferguson['label'] = 'FALSE'
    nonRumourferguson['event'] = 'ferguson'

However, the sample code is extremely time-consuming (I ran it on my laptop with an Intel Core i7-4720HQ and it took more than 24 hours), so I'm wondering if there's a better solution.

Well, it seems that my structure figure was confusing or misleading, so here is the dataset: raw dataset

I intended to illustrate the dataset with a figure, but that turned out to make things worse.

edited Nov 13 at 2:22

asked Nov 12 at 6:02

Tilmant

What does the "structure" diagram convey exactly? What are the words intended to represent, and what are the lines for?
– Mad Physicist
Nov 12 at 6:39

I suggest you first profile your script and see where it's spending most of its time, because that will tell you whether using concurrent processing will be worthwhile. See How can you profile a script?
– martineau
Nov 12 at 6:41
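A minimal sketch of the profiling step suggested above, using the standard-library `cProfile` and `pstats` modules. The helper names `load_json_dir` and `profile_load` and the synthetic temporary directory are assumptions for illustration; they are not part of the original script:

```python
import cProfile
import io
import json
import os
import pstats
import tempfile


def load_json_dir(root):
    """Parse every file under root as JSON (hypothetical helper)."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as fh:
                records.append(json.load(fh))
    return records


def profile_load(root, top=10):
    """Run load_json_dir under cProfile; return (records, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    records = load_json_dir(root)
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return records, buf.getvalue()


if __name__ == "__main__":
    # Synthetic data so the sketch runs without the real dataset.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(50):
            with open(os.path.join(tmp, f"{i}.json"), "w") as fh:
                json.dump({"id": i}, fh)
        records, report = profile_load(tmp)
        print(len(records), "files loaded")
        print(report)
```

Pointing `profile_load` at one of the real directories would show whether the time goes into file reads, JSON parsing, or DataFrame construction.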

Your code looks like it's probably I/O bound, so parallel processing may not speed things up—and could in fact slow things down because of the overhead involved in using it.
– martineau
Nov 12 at 6:50
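One way to check this claim on a given machine is to time a sequential load against a thread-pool load of the same files. The sketch below assumes one JSON document per file. The function names and the worker count of 8 are illustrative choices, and the result will depend on the disk and file sizes:

```python
import json
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def list_files(root):
    """Collect the path of every file under root."""
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    ]


def read_json(path):
    with open(path) as fh:
        return json.load(fh)


def load_sequential(paths):
    return [read_json(p) for p in paths]


def load_threaded(paths, workers=8):
    # Executor.map returns results in the same order as the input paths.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_json, paths))


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Synthetic files stand in for the real dataset.
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(200):
            with open(os.path.join(tmp, f"{i}.json"), "w") as fh:
                json.dump({"id": i, "text": "x" * 100}, fh)
        paths = list_files(tmp)
        _, t_seq = timed(load_sequential, paths)
        _, t_thr = timed(load_threaded, paths)
        print(f"sequential: {t_seq:.4f}s  threaded: {t_thr:.4f}s")
```

If the threaded run is not clearly faster on the real data, the added complexity is probably not worth it.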

@MadPhysicist: From the code it looks to me like it's the filesystem directory structure involved.
– martineau
Nov 12 at 6:52
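Reading the layout from the paths in the question as `../input/<event>/<rumours|non-rumours>/`, the four copy-pasted loops could be collapsed into one helper. This is only a sketch under that assumption, and it also assumes every JSON file holds a single object. The names `LABELS`, `load_category`, and `load_all` are hypothetical:

```python
import json
import os
import tempfile

# Label per category, taken from the values used in the question's code.
LABELS = {"rumours": "TRUE", "non-rumours": "FALSE"}


def load_category(root, event, category):
    """Return one tagged record per JSON file under root/event/category."""
    records = []
    folder = os.path.join(root, event, category)
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in filenames:
            with open(os.path.join(dirpath, name)) as fh:
                record = json.load(fh)  # assumed to be a JSON object (dict)
            record["label"] = LABELS[category]
            record["event"] = event
            records.append(record)
    return records


def load_all(root, events):
    """Load every event/category pair into one flat list of records."""
    records = []
    for event in events:
        for category in LABELS:
            records.extend(load_category(root, event, category))
    return records


if __name__ == "__main__":
    # Tiny synthetic tree mirroring the assumed layout.
    with tempfile.TemporaryDirectory() as root:
        for event in ("charliehebdo", "ferguson"):
            for category in LABELS:
                folder = os.path.join(root, event, category)
                os.makedirs(folder)
                with open(os.path.join(folder, "1.json"), "w") as fh:
                    json.dump({"id": 1}, fh)
        records = load_all(root, ["charliehebdo", "ferguson"])
        print(len(records), "records")
```

The resulting list can be passed to `pd.DataFrame(records)` once at the end, which avoids rebuilding a DataFrame inside the loop. Unlike the question's code, this produces a single table with `label` and `event` columns rather than one DataFrame per directory.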

python performance
