Iterating through a file with millions of lines

I am iterating through a file with millions of lines of data and extracting information from it. Unfortunately this is very slow, and I was wondering how I could make it more efficient.
At the moment I load two files and iterate over them line by line.

The code:

# Retrieve session data (source is an already-open file object)
UBList = array([line.split('\t') for line in source])
SID = set(UBList[:,1])
n_unique_sessions = len(Counter(UBList[:,1]))
source.close()
sessions = {}
session_info = {}
purchases = array([line.split('\t') for line in open('Data/order_overview.txt', 'r').readlines()])

for sid in SID:
    print(sid)
    # All rows of this session; this rescans the whole of UBList for every sid
    s = [line for line in UBList if line[1]==sid]
    uline = [line for line in s if line[3]=='17']
    tline = [line[2] for line in s]
    t_format = "%Y-%m-%d %H:%M:%S.%f"
    s_start = datetime.strptime(tline[0], t_format)
    s_end = datetime.strptime(tline[-1], t_format)
    s_length = (s_end - s_start).total_seconds()
    d_time = [line[2] for line in s if line[3]=='4']
    if len(uline) > 0:
        uid = uline[0][12]
    else:
        uid = 'NotFound'
    num_queries = len([line for line in s if line[3]=='27'])
    num_purchases = nonzero(purchases[:,0]==sid)[0].shape[0]
    sessions.update({sid: (uid, num_queries, num_purchases, s)})
    # One output file per session
    f = open('Results/' + sid + '_' + uid + '_' + str(num_queries) + '_' + str(num_purchases) + '_' + str(s_length) + '.txt', 'w')
    f.writelines(['\t'.join(line) for line in s])
    f.close()

Would something like this speed things up?

dSID = dict((sid, []) for sid in SID)
for line in UBList:
    sid = line[1]
    dSID[sid].append('\t'.join(line))

Also, is it possible to get the next line once a certain criterion is met? For example, when I find a matching line, I want to take its value together with the value on the next line and do a calculation, then add up the results if the matching line occurs multiple times.

edited Nov 13 '18 at 2:05 by Cœur

asked Apr 5 '14 at 17:30 by Emrulez

I think this question suits better in codereview.stackexchange.com
– Ruben Bermudez
Apr 5 '14 at 17:33

Yeah, I'd try caching UBList as you suggest, and use csv.reader (docs.python.org/2/library/csv.html) to read the file, specifying that your separator is a tab (\t). The example is a bit too complex for me to understand what's happening there and provide a more definite answer.
– BorrajaX
Apr 5 '14 at 17:39
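
A minimal sketch of the csv suggestion above, assuming tab-separated rows like the `abc [tab] def [tab] 123 [tab] 456` shape described in this thread. The helper name `read_rows` and the sample data are made up for illustration; with a real file you would pass in a file object opened with `newline=''`.

```python
import csv
import io


def read_rows(fileobj):
    """Yield each tab-separated line as a list of fields, one row at a time."""
    for row in csv.reader(fileobj, delimiter="\t"):
        yield row


# Stand-in for the real file; any iterable of lines works with csv.reader.
sample = io.StringIO("abc\tdef\t123\t456\nghi\tjkl\t789\t012\n")
rows = list(read_rows(sample))  # for a huge file, loop over read_rows() instead
print(rows[0])  # ['abc', 'def', '123', '456']
```

The data does not have to be stored as a CSV file for this to work; the csv module only cares about the delimiter, not the file extension.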

I think that array is the function from the numpy module. Could you describe the structure of the lines in source? I think there are several errors in your use of numpy objects.
– eyquem
Apr 5 '14 at 20:07

@eyquem The lines in the source are tab-separated strings, so something like abc [tab] def [tab] 123 [tab] 456 and so on for each line. @BorrajaX: so instead of storing UBList, read it line by line? I am a bit confused: should I store the data in a CSV file? My current data is in a txt file.
– Emrulez
Apr 5 '14 at 21:36

Is array the function numpy.array or not? If not, I don't understand the notation UBList[:,1] with a comma in it.
– eyquem
Apr 6 '14 at 1:38

Tags: python, loops, large-files


1 Answer

If source is your file object, don't read it all into memory beforehand. You can process each line as it is read: iterating over a file object pulls the data in buffered chunks on demand, so only the current line has to be held at any time. To achieve that, use generator expressions rather than list comprehensions.
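
A minimal sketch of reading and processing in one pass, assuming the session ID is the second tab-separated field as in the question. The function name `group_by_session` and the sample rows are invented for illustration; any iterable of lines, including an open file object, can be passed in.

```python
from collections import defaultdict
import io


def group_by_session(lines, sid_col=1):
    """Group tab-separated lines by session ID in a single pass over the input."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        groups[fields[sid_col]].append(fields)
    return groups


# Stand-in for the real file; io.StringIO is iterated line by line like a file.
sample = io.StringIO(
    "a\ts1\t2014-04-05 17:30:00.000\t17\n"
    "b\ts2\t2014-04-05 17:31:00.000\t27\n"
    "c\ts1\t2014-04-05 17:32:00.000\t27\n"
)
sessions = group_by_session(sample)
print(sorted(sessions))     # ['s1', 's2']
print(len(sessions["s1"]))  # 2
```

Grouping this way visits every line once, instead of rescanning the whole data set for each session ID as the loop in the question does. The grouped rows still live in memory, which is needed if each session is written to its own file afterwards.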

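You create a lot of lists you don't actually need, such as s. Wherever a result is only consumed once, replace the list with a generator and you should get a speedup. Building a standard Python list is a heavy operation because every element is materialized up front, whereas a generator yields items one at a time. As a rough rule of thumb, once the list would be larger than about 4 kB, a generator should be the better choice.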
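
As an illustration of swapping lists for generators, here is a small sketch that counts and searches rows without building intermediate lists. It assumes the event code sits in the fourth field, as in the question; the helper names `count_events` and `first_match` and the sample rows are hypothetical.

```python
def count_events(rows, event_code, event_col=3):
    """Count rows whose event column equals event_code without building a list."""
    return sum(1 for row in rows if row[event_col] == event_code)


def first_match(rows, event_code, event_col=3, default=None):
    """Return the first row with the given event code, stopping at the first hit."""
    return next((row for row in rows if row[event_col] == event_code), default)


session_rows = [
    ["a", "s1", "t0", "17"],
    ["b", "s1", "t1", "27"],
    ["c", "s1", "t2", "27"],
]
num_queries = count_events(session_rows, "27")
uline = first_match(session_rows, "17")
print(num_queries)  # 2
print(uline[0])     # a
```

`first_match` also stops scanning as soon as it finds a hit, which a list comprehension followed by `[0]` cannot do.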

answered Apr 5 '14 at 18:44 by Barafu Albino

So instead of storing it, I should read it and process it as I go? With something like while open(source) do .....
– Emrulez
Apr 5 '14 at 21:39
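
In Python the usual form of that idea is `with open(...) as f:` followed by `for line in f:`. The sketch below streams a file that way and also pairs each matching line with the line after it, as asked in the question. The calculation (next value minus current value, summed over all matches), the numeric third column, and the name `sum_gaps_after` are assumptions made up for the example, since the question does not say what the calculation is.

```python
import os
import tempfile


def sum_gaps_after(path, event_code, event_col=3, value_col=2):
    """Stream a tab-separated file; for every row with event_code, add
    (next row's value - this row's value) to a running total."""
    total = 0.0
    pending = None  # value of a matching row that is waiting for its successor
    with open(path) as f:
        for line in f:  # the file is read one line at a time
            fields = line.rstrip("\n").split("\t")
            value = float(fields[value_col])
            if pending is not None:
                total += value - pending
                pending = None
            if fields[event_col] == event_code:
                pending = value
    return total


# Write a tiny sample file so the example runs on its own.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("a\ts1\t1.0\t4\nb\ts1\t3.5\t27\nc\ts1\t4.0\t4\nd\ts1\t10.0\t27\n")
tmp.close()

total = sum_gaps_after(tmp.name, "4")
os.remove(tmp.name)
print(total)  # 8.5
```

Only one line and one pending value are held at a time, so memory use stays flat no matter how large the file is.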
