Iterating through a file with millions of lines
























I am trying to iterate through a file with millions of lines containing some data that I need to retrieve. Unfortunately this is very slow, and I was wondering how I could make it more efficient.
At the moment I am loading two files and iterating over each line.



The code:



    # Retrieve session data
    from collections import Counter
    from datetime import datetime
    from numpy import array, nonzero

    # source is assumed to be an already-open file object for the tab-separated log
    UBList = array([line.split('\t') for line in source])
    SID = set(UBList[:,1])
    n_unique_sessions = len(Counter(UBList[:,1]))
    source.close()
    sessions = {}
    session_info = {}
    purchases = array([line.split('\t') for line in open('Data/order_overview.txt', 'r').readlines()])

    for sid in SID:
        print(sid)
        # All lines of this session; this rescans the whole of UBList for every sid
        s = [line for line in UBList if line[1]==sid]
        uline = [line for line in s if line[3]=='17']
        tline = [line[2] for line in s]
        t_format = "%Y-%m-%d %H:%M:%S.%f"
        s_start = datetime.strptime(tline[0], t_format)
        s_end = datetime.strptime(tline[-1], t_format)
        s_length = (s_end - s_start).total_seconds()
        d_time = [line[2] for line in s if line[3]=='4']
        if len(uline) > 0:
            uid = uline[0][12]
        else:
            uid = 'NotFound'
        num_queries = len([line for line in s if line[3]=='27'])
        num_purchases = nonzero(purchases[:,0]==sid)[0].shape[0]
        sessions.update({sid: (uid, num_queries, num_purchases, s)})
        f = open('Results/' + sid + '_' + uid + '_' + str(num_queries) + '_' + str(num_purchases) + '_' + str(s_length) + '.txt', 'w')
        f.writelines(['\t'.join(line) for line in s])
        f.close()


Would something like this speed things up?



    # Map every session id to a list of its lines, filled in a single pass
    dSID = dict((sid, []) for sid in SID)
    for line in UBList:
        sid = line[1]
        dSID[sid].append('\t'.join(line))


Also, is it possible to get the next line after a certain criterion is met? For example, I find a line, take its value together with the value of the next line, and do a calculation. If the criterion matches multiple times, I want to add up the results.
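One way to pair each line with the one that follows it is to iterate over adjacent pairs. The following is only a minimal sketch, assuming the rows are already split into fields. The helper names `pairwise` and `sum_gaps_after`, the column positions, and the numeric values in column 2 are hypothetical (the log in the question stores timestamps there):

```python
from itertools import tee


def pairwise(iterable):
    """Yield (item, next_item) for every adjacent pair in iterable."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)


def sum_gaps_after(rows, event_code, code_col=3, value_col=2):
    """Add up (next value - current value) for every row matching event_code."""
    total = 0.0
    for current, following in pairwise(rows):
        if current[code_col] == event_code:
            total += float(following[value_col]) - float(current[value_col])
    return total


# Hypothetical rows: [something, session id, value, event code]
rows = [
    ["a", "s1", "10.0", "4"],
    ["b", "s1", "12.5", "27"],
    ["c", "s1", "20.0", "4"],
    ["d", "s1", "21.0", "17"],
]
result = sum_gaps_after(rows, "4")
print(result)  # 3.5
```

A matching row at the very end of the input has no successor, so it never appears as `current` and contributes nothing to the total.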










  • I think this question is better suited to codereview.stackexchange.com
    – Ruben Bermudez
    Apr 5 '14 at 17:33










  • Yeah, I'd try caching UBList as you say, and use csv.reader (docs.python.org/2/library/csv.html) to read the file, indicating that your separator is a tab (\t). The example is a bit too complex for me to understand what's happening there and provide a more definite answer.
    – BorrajaX
    Apr 5 '14 at 17:39
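As a rough illustration of the `csv.reader` suggestion, the sketch below reads tab-separated lines one at a time. The function name `read_rows` and the sample data are made up for the example, and an in-memory `io.StringIO` stands in for the real file:

```python
import csv
import io


def read_rows(handle):
    """Yield each tab-separated line as a list of fields, one line at a time."""
    for row in csv.reader(handle, delimiter="\t"):
        yield row


# Stand-in for the real file; for a file on disk, the csv documentation
# recommends opening it with open(path, newline="").
sample = io.StringIO("abc\tdef\t123\t456\nghi\tjkl\t789\t012\n")
for row in read_rows(sample):
    print(row[1], row[3])
```

The `csv` module only cares about the delimiter, not the file extension, so a `.txt` file with tab-separated fields can be read this way without converting it.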










  • I think that array is the function from the numpy module. Could you describe the structure of the lines in source? I think there are several errors in your use of numpy objects
    – eyquem
    Apr 5 '14 at 20:07










  • @eyquem The lines in the source are tab-separated strings, so something like abc [tab] def [tab] 123 [tab] 456 and so on for each line. BorrajaX: so instead of storing UBList, read it line by line? I am a bit confused: should I store the data in a csv file? My current data is in a txt file.
    – Emrulez
    Apr 5 '14 at 21:36











  • Is array the function numpy.array or not? If not, I wouldn't understand the notation UBList[:,1] with a comma in it
    – eyquem
    Apr 6 '14 at 1:38
















python loops large-files






edited Nov 13 '18 at 2:05









Cœur











asked Apr 5 '14 at 17:30









Emrulez














1 Answer














If source is your file object, don't read it all into memory beforehand. Iterating over a file object reads it lazily, in buffered chunks, so you can process each line as it is read instead of holding the whole file in memory. To achieve that, use generator expressions, not list comprehensions.
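To make the difference concrete, here is a small sketch that counts matching lines with a generator expression, so no list of rows is ever built. The function name `count_events`, the column index, and the sample lines are assumptions for illustration, and `io.StringIO` stands in for the open file:

```python
import io


def count_events(lines, event_code, code_col=3):
    """Count rows whose event column equals event_code, without storing the rows."""
    # Generator expression: rows are produced one at a time, on demand.
    rows = (line.rstrip("\n").split("\t") for line in lines)
    return sum(1 for row in rows if row[code_col] == event_code)


# Stand-in for the open file object called source in the question.
source = io.StringIO(
    "a\ts1\t2014-04-05 17:30:00.000\t27\n"
    "b\ts1\t2014-04-05 17:31:00.000\t4\n"
    "c\ts2\t2014-04-05 17:32:00.000\t27\n"
)
num_queries = count_events(source, "27")
print(num_queries)  # 2
```

Swapping the parentheses for square brackets would turn `rows` into a list comprehension and load every row into memory before counting starts.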



You create a lot of lists you don't actually need, like s. Replace them with generators and you should get a speedup. Building a large Python list is a heavy operation, and as a rough rule of thumb, once the data is bigger than about 4 kB, generators should do better.
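A related point: in the question, `s` is rebuilt by scanning every line once per session id, so the total work grows with sessions times lines. A possible alternative, sketched here under assumptions, is to collect the per-session figures in a single pass. The function name `summarize_sessions` and the sample rows are hypothetical; the event codes '17' and '27' and the 'NotFound' default follow the question's code:

```python
from collections import defaultdict


def summarize_sessions(rows, sid_col=1, code_col=3, uid_col=12):
    """Collect uid, query count and row count per session id in one pass."""
    sessions = defaultdict(
        lambda: {"uid": "NotFound", "num_queries": 0, "num_rows": 0}
    )
    for row in rows:
        info = sessions[row[sid_col]]
        info["num_rows"] += 1
        if row[code_col] == "27":
            info["num_queries"] += 1
        elif row[code_col] == "17" and info["uid"] == "NotFound":
            # Keep the uid from the first '17' line, as the question does.
            info["uid"] = row[uid_col]
    return dict(sessions)


# Hypothetical short rows; the uid sits in column 4 here instead of 12.
rows = [
    ["x", "s1", "t1", "27", "-"],
    ["x", "s2", "t2", "17", "u9"],
    ["x", "s1", "t3", "17", "u7"],
    ["x", "s1", "t4", "27", "-"],
]
summary = summarize_sessions(rows, uid_col=4)
print(summary["s1"])
```

Because `rows` can be any iterable, the function also accepts a generator that yields split lines straight from the file.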






  • So instead of storing I should read it and process it? With something like while open(source) do .....
    – Emrulez
    Apr 5 '14 at 21:39










answered Apr 5 '14 at 18:44









Barafu Albino











