Read and reverse data chunk by chunk from a csv file and copy to a new csv file
Assume I'm dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is:
1) Read a chunk (e.g. 10 rows) of data from the CSV using pandas.
2) Reverse the order of the data.
3) Copy each row of the chunk to a new CSV file in reverse, so the new file is built up chunk by chunk (10 rows at a time) in reversed order.
In the end the CSV file should be in reversed order, and this should be done without loading the entire file into memory (on Windows).
I am doing time-series forecasting, so I need the data ordered from oldest to latest (first row = oldest entry). I can't load the entire file into memory, so I'm looking for a way to do it one chunk at a time, if that is possible.
I tried it on train.csv of the Rossmann dataset from kaggle. You can get it from this github repo.
My attempt does not copy the rows into the new CSV file properly.
Shown below is my code:
import pandas as pd
import csv

def reverse():
    fields = ["Store", "DayOfWeek", "Date", "Sales", "Customers", "Open",
              "Promo", "StateHoliday", "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)
    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'], row['DayOfWeek'], row['Date'],
                             row['Sales'], row['Customers'], row['Open'],
                             row['Promo'], row['StateHoliday'],
                             row['SchoolHoliday']])

reverse()
Thank you in advance.
python python-3.x pandas csv dataset
– Sujay Kumar (Oct 29 '18 at 7:03): You want the entire data to be time-sorted, but you're sorting & writing 10 rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse train.csv from the end? See stackoverflow.com/a/10933932/2895956
– Suleka_28 (Oct 29 '18 at 7:21): I tried the code given in that post earlier, but it did not give a proper result for me.
– Sujay Kumar (Oct 29 '18 at 7:29): Can you elaborate? What exactly are you getting? And if possible, upload the file to Dropbox. The process is too long (to download that file a person has to log in, register, then fill out information...) just to download the file.
– Sujay Kumar (Oct 29 '18 at 9:20): Note: not a perfect solution, but you can simply reverse the file from the command line with tail -r train.csv > reverse.csv, then manually remove the last line and add it as the first line of reverse.csv.
asked Oct 29 '18 at 6:32 by Suleka_28, edited Nov 17 '18 at 6:35 by Sai Kumar
5 Answers
Using bash, you can take the whole file except the first line with tail, reverse it with tac, and store it with this:
tail -n +2 train.csv | tac > train_rev.csv
If you want to keep the header in the reversed file, write it first and then append the reversed content:
head -1 train.csv > train_rev.csv; tail -n +2 train.csv | tac >> train_rev.csv
– Julian (Nov 20 '18 at 21:24): Unless you have a hard requirement to also run this on a different OS, this is probably going to be the fastest/best answer around.
I would not recommend using pandas for parsing or streaming any files, as you would only be introducing additional overhead. The best way to do it is to read the file from the bottom up. A big part of this code actually comes from here, where it takes in a file and returns its reverse as a generator, which I believe is what you want.
What I did was simply test it with your file train.csv from the provided link and output the results to a new file.
import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order."""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # the first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # if the previous chunk starts right at the beginning of a line,
                # do not concatenate the segment to the last line of the new
                # chunk; instead, yield the segment first
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv', 'w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))
It basically reads the file backwards until it finds a newline, then yields a line, working from the bottom of the file to the top. A pretty interesting way of doing it.
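If you also want to keep the header row on top of the output (as the question's own code does), you can write it first and skip it when it finally comes around. A runnable sketch under assumptions: the file names are illustrative, and for brevity the demo reverses via a plain in-memory read; for a genuinely huge file you would swap in the reverse_readline() generator above.

```python
# Hedged sketch: keep the CSV header first while writing the rest reversed.
# NOTE: reversed_lines() here reads the whole file (fine for a demo only);
# substitute the chunked reverse_readline() above for large files.
def reversed_lines(path):
    with open(path) as fh:
        return reversed(fh.read().splitlines())

# Build a tiny stand-in for train.csv so the sketch is self-contained.
with open('train_small.csv', 'w') as f:
    f.write('Date,Sales\n2015-07-29,1\n2015-07-30,2\n2015-07-31,3\n')

with open('train_small.csv') as f:
    header = f.readline().rstrip('\n')

with open('rev_small.csv', 'w') as out:
    out.write(header + '\n')
    for line in reversed_lines('train_small.csv'):
        if line and line != header:   # drop blanks and the trailing header
            out.write(line + '\n')
```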
– Martijn Pieters♦ (Nov 18 '18 at 14:24): This will actually result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦ (Nov 19 '18 at 8:57): Also, an RFC-compliant CSV file will use \r\n line separators; your code assumes that \n is always used, and that there are no embedded newlines in the cell values.
This does exactly what you are requesting, but without pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can trade the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, the entire file is reversed, including the first line.
For kicks, I ran this on an old Raspberry Pi using a 128 GB flash drive and a 6 MB CSV test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data; for that use case, pandas would be the better way to read the chunks.
from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter, reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
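The line-by-line splitting above assumes no newlines inside quoted cells. A hedged sketch of the same chunk-file strategy, but parsed with pandas so quoted cells containing newlines are handled correctly (the function and file names here are illustrative, not from the code above):

```python
# Hedged variant: chunk files as above, but pandas does the CSV parsing,
# so embedded newlines inside quoted cells are handled correctly.
import os
import pandas as pd

def reverse_csv_pandas(in_path, out_path, chunksize=10):
    tmp_files = []
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        tmp = '{}.part{}'.format(out_path, i)
        chunk.iloc[::-1].to_csv(tmp, index=False, header=False)
        tmp_files.append(tmp)
    # Write the header once, then the chunk files newest-first
    pd.read_csv(in_path, nrows=0).to_csv(out_path, index=False)
    with open(out_path, 'a') as out:
        for tmp in reversed(tmp_files):
            with open(tmp) as fp:
                out.write(fp.read())
            os.remove(tmp)
```

This trades the minimal-footprint goal for correctness on arbitrary CSV content, as discussed in the comments on this answer.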
– Martijn Pieters♦ (Nov 18 '18 at 14:32): This does work, but assumes that there are no newlines in the CSV cell values.
– Mark Warburton (Nov 18 '18 at 23:00): Thanks, @MartijnPieters, you are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' built-in chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirements).
– Martijn Pieters♦ (Nov 19 '18 at 8:58): It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.
If you have sufficient hard disk space, you can read in chunks, reverse each one, and store them. Then pick up the stored chunks in reverse order and write them to a new CSV file.
Below is an example with pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).
import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03d}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03d}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')
print(df_new)
0 1 2 3 4
0 40 41 42 43 44
1 35 36 37 38 39
2 30 31 32 33 34
3 25 26 27 28 29
4 20 21 22 23 24
5 15 16 17 18 19
6 10 11 12 13 14
7 5 6 7 8 9
8 0 1 2 3 4
– Sai Kumar (Nov 16 '18 at 23:56): Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the CSV file. That way the entire file ends up reversed, chunk by chunk, without loading the whole file into memory.
– jpp (Nov 18 '18 at 18:40): @SaiKumar, got it, I have updated with a solution which performs a complete reversal.
You have repeated code blocks and you are not taking advantage of pandas at all.
What @SujayKumar pointed out is very correct; I would read that more closely.
The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do a chunked transfer. Sure, it will take time, but it will work fine, unless you are getting into terabytes; I have not tested with that.
When you read_csv() you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order you need.
Pandas can write CSV too; please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                      usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: if you stick to pandas and its functions, you will be running already-optimized code which doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
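Putting the fragments above together, a minimal self-contained sketch of the sort_index() suggestion (a tiny in-memory CSV via io.StringIO stands in for a real file path; column names are illustrative):

```python
import io
import pandas as pd

# Tiny stand-in for a CSV file whose rows are newest-first
csv_text = "Date,Sales\n2015-07-31,5263\n2015-07-30,5020\n2015-07-29,4782\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")

# Oldest entry first, as the question requires (ascending is the default)
df_sorted = df.sort_index()
print(df_sorted)
```

Note that sort_index() does sort the whole frame in memory; it is the simplest route whenever the file actually fits in RAM, as this answer argues.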
(bash tail/tac answer: answered Nov 20 '18 at 15:47, edited Nov 20 '18 at 16:01 by gustavovelascoh)
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
2
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, a RFC compliant CSV file will usern
line separators, your code assumes thatn
is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
add a comment |
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
2
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, a RFC compliant CSV file will usern
line separators, your code assumes thatn
is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
add a comment |
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order."""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # The first line of the buffer is probably not a complete line,
            # so we save it and append it to the last line of the next
            # buffer we read.
            if segment is not None:
                # If the previous chunk starts right at the beginning of a
                # line, do not concatenate the segment to the last line of
                # the new chunk; instead, yield the segment first.
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv', 'w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))
It basically reads the file in reverse until it finds a newline, then yields a line, working from the bottom of the file to the top. A pretty interesting way of doing it.
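One side effect of reversing every line is that the CSV header ends up at the bottom of the output. As a complement to the buffered approach above, here is a self-contained sketch (my own, not from the answer) that keeps the header on top by recording one byte offset per line on a first pass, then seeking back through the offsets in reverse. It shares the answer's assumption that cell values contain no embedded newlines.

```python
def reverse_csv_keep_header(in_path, out_path):
    """Reverse the data rows of a CSV while keeping the header line first.

    Pass 1 records the byte offset of every line (a few bytes per row,
    never the file contents); pass 2 seeks to each offset from the end.
    Assumes no embedded newlines inside quoted cell values.
    """
    offsets = []
    pos = 0
    with open(in_path, 'rb') as f:
        for line in f:            # stream the file, one line at a time
            offsets.append(pos)
            pos += len(line)
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        dst.write(src.readline())            # header stays first
        for off in reversed(offsets[1:]):    # data rows, last to first
            src.seek(off)
            dst.write(src.readline())
```

The offset list grows with the row count, but it stays tiny compared to the file itself (one integer per row rather than the row's contents).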
answered Nov 17 '18 at 3:02
BernardL
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, an RFC-compliant CSV file will use \r\n line separators; your code assumes that \n is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
This does exactly what you are requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can tune the number of chunk files produced versus RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.
For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB CSV test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data. For that use case, pandas would be the best solution for reading the chunks.
from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True
    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter, reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
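If cell values can contain quoted newlines (the limitation noted above), pandas can take over the parsing while keeping the same temp-chunk-file strategy. A hedged sketch along those lines (the function name and temp-file naming are illustrative, not from the answer):

```python
import os
import pandas as pd


def reverse_csv_pandas(in_path, out_path, chunksize=10):
    """Reverse a CSV chunk by chunk with pandas.

    pandas handles CSV quoting, so embedded newlines in cell values are
    parsed correctly. Each reversed chunk goes to a temp file; the temp
    files are then concatenated in reverse order, so the whole file is
    never in memory at once.
    """
    tmp_files = []
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        tmp = '{}.part{}'.format(out_path, i)
        chunk.iloc[::-1].to_csv(tmp, index=False, header=False)
        tmp_files.append(tmp)
    with open(out_path, 'w', newline='') as out:
        # write the header once, taken from the input's column row
        pd.read_csv(in_path, nrows=0).to_csv(out, index=False)
        for tmp in reversed(tmp_files):
            with open(tmp, newline='') as part:
                out.write(part.read())
            os.remove(tmp)
```

This trades the minimal-footprint design for correctness on quoted data; the chunk files and extra parse add overhead, as discussed in the comments.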
This does work, but assumes that there are no newlines in the CSV cell values.
– Martijn Pieters♦
Nov 18 '18 at 14:32
Thanks, @MartijnPieters. You are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' inbuilt chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirement).
– Mark Warburton
Nov 18 '18 at 23:00
It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.
– Martijn Pieters♦
Nov 19 '18 at 8:58
edited Nov 18 '18 at 23:03
answered Nov 17 '18 at 8:13
Mark Warburton
If you have sufficient hard disk space, you can read in chunks, reverse, and store. Then pick up the stored chunks in reverse order and write them to a new CSV file.
Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).
import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')
print(df_new)
0 1 2 3 4
0 40 41 42 43 44
1 35 36 37 38 39
2 30 31 32 33 34
3 25 26 27 28 29
4 20 21 22 23 24
5 15 16 17 18 19
6 10 11 12 13 14
7 5 6 7 8 9
8 0 1 2 3 4
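One small follow-up: the gzip-pickled chunk files are left behind on disk. Assuming the naming scheme above, a helper like this (my own addition) can delete them once out.csv is written:

```python
import glob
import os


def remove_pickle_chunks(pattern='file_pkl_*.pkl.gzip'):
    """Delete the temporary gzip-pickle chunk files created above."""
    for path in glob.glob(pattern):
        os.remove(path)
```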
Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the csv file. So the entire file will be reversed at the end, but chunk by chunk, without loading the entire file into memory.
– Sai Kumar
Nov 16 '18 at 23:56
@SaiKumar, Got it, have updated with a solution which performs a complete reversal.
– jpp
Nov 18 '18 at 18:40
edited Nov 20 '18 at 10:02
answered Nov 16 '18 at 23:27
jpp
You have repeated code blocks and you are not taking advantage of pandas at all.
What @sujay kumar pointed out is very correct, I would read that more closely.
The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do chunked transfer. Sure, it will take time, but it will work fine, unless you are going into terabytes. I have not tested with that.
When you read_csv() you don't specify any index. If you did, you could call sort_index() with or without ascending=False depending on the order you want.
Pandas can write CSV too, please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                      usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: If you stick to Pandas and its functions you will be running already-optimized code which doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
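Putting those pieces together, a minimal sketch (using a toy file; the column names echo the Rossmann data but the values are made up) that reads the CSV with a date index and sorts it oldest-first:

```python
import pandas as pd

# toy stand-in for train.csv; the real data has many more columns
pd.DataFrame({'Date': ['2015-07-31', '2015-07-29', '2015-07-30'],
              'Sales': [5263, 6064, 4782]}).to_csv('toy_train.csv', index=False)

df = pd.read_csv('toy_train.csv', parse_dates=True, index_col='Date')
df = df.sort_index()  # oldest entry first; pass ascending=False for newest first
print(df.index[0].date())  # 2015-07-29
```

For the forecasting use case in the question, this yields the first row as the oldest entry without any manual reversing, provided the whole file fits in memory as this answer argues.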
edited Nov 18 '18 at 2:34
answered Nov 13 '18 at 23:49
Abhishek Dujari
You want the entire data to be time-sorted, but you're sorting & writing 10 rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse the train.csv from the end? See stackoverflow.com/a/10933932/2895956
– Sujay Kumar
Oct 29 '18 at 7:03
I tried the code given in that post earlier, but it did not give a proper result for me.
– Suleka_28
Oct 29 '18 at 7:21
Can you elaborate? What exactly are you getting? And if possible, upload the file to Dropbox. The process is too long (to download that file a person has to log in, register, then fill out information...) just to download the file.
– Sujay Kumar
Oct 29 '18 at 7:29
Note: not a perfect solution, but you can simply reverse the file using the command line: tail -r train.csv > reverse.csv. Then manually remove the last line and add it to the first line of reverse.csv.
– Sujay Kumar
Oct 29 '18 at 9:20
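The manual header fix-up in that comment can be scripted too. A small sketch (with a toy stand-in file; tac is the GNU coreutils counterpart of BSD's tail -r):

```shell
# toy stand-in for train.csv: header plus three data rows
printf 'Date,Sales\n2015-07-29,1\n2015-07-30,2\n2015-07-31,3\n' > train_demo.csv

# keep the header first, reverse only the data rows
head -n 1 train_demo.csv > reverse_demo.csv
tail -n +2 train_demo.csv | tac >> reverse_demo.csv
```

Like the other line-based approaches here, this assumes no embedded newlines in cell values.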