Read and reverse data chunk by chunk from a csv file and copy to a new csv file

























Assume I'm dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is as follows:




1) Read a chunk (e.g. 10 rows) of data from the CSV using pandas.



2) Reverse the order of the data.



3) Copy each row to the new CSV file in reverse, so each chunk (10 rows) is
written to the CSV from the beginning in reversed order.




In the end the CSV file should be in reversed order, and this should be done without loading the entire file into memory, on Windows.



I am trying to do time series forecasting, so I need the data ordered from oldest to latest (first row = oldest entry). I can't load the entire file into memory; I'm looking for a way to do it one chunk at a time, if possible.



The dataset I tried this on is train.csv of the Rossmann dataset from Kaggle. You can get it from this GitHub repo.



My attempt does not copy the rows into the new CSV file properly.



Shown below is my code:



import pandas as pd
import csv

def reverse():
    fields = ["Store", "DayOfWeek", "Date", "Sales", "Customers", "Open", "Promo",
              "StateHoliday", "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)

    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'], row['DayOfWeek'], row['Date'], row['Sales'],
                             row['Customers'], row['Open'], row['Promo'],
                             row['StateHoliday'], row['SchoolHoliday']])

reverse()


Thank you in advance.





























  • 1





    You want the entire data to be time-sorted, but you're sorting & writing ten rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse train.csv from the end? See stackoverflow.com/a/10933932/2895956

    – Sujay Kumar
    Oct 29 '18 at 7:03







  • 2





    I tried the code given in that post earlier, but it did not give a proper result for me.

    – Suleka_28
    Oct 29 '18 at 7:21






  • 3





    Can you elaborate? What exactly are you getting? And if possible, upload the file to Dropbox. The process is too long (to download that file a person has to log in, register, then fill out information...) just to download the file.

    – Sujay Kumar
    Oct 29 '18 at 7:29







  • 4





    Note: not a perfect solution, but you can simply reverse the file from the command line: tail -r train.csv > reverse.csv. Then manually remove the last line (the header ends up at the bottom) and add it as the first line of reverse.csv.

    – Sujay Kumar
    Oct 29 '18 at 9:20















python python-3.x pandas csv dataset






edited Nov 17 '18 at 6:35 by Sai Kumar

asked Oct 29 '18 at 6:32 by Suleka_28



















5 Answers

Using bash, you can use tail to take the whole file except the first line, then reverse it with tac and store it:



tail -n +2 train.csv | tac > train_rev.csv


If you want to keep the header in the reversed file, write it first and then append the reversed content:



head -1 train.csv > train_rev.csv; tail -n +2 train.csv | tac >> train_rev.csv
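Since the question targets Windows, where tac is not available by default, here is a rough Python sketch of the same idea (not part of the original answer): it keeps only one byte offset per data row in memory, then writes the header followed by the rows walked backwards. File names and contents are stand-ins for demonstration.

```python
# Build a tiny stand-in file so the sketch is self-contained; the real
# case would use train.csv / train_rev.csv instead.
with open('demo.csv', 'w', newline='') as f:
    f.write('a,b\n1,2\n3,4\n5,6\n')

# First pass: record the byte offset of each data row (one int per row).
offsets = []
with open('demo.csv', 'rb') as f:
    header = f.readline()          # keep the header aside
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

# Second pass: header first, then seek to each row from last to first.
with open('demo_rev.csv', 'wb') as dst, open('demo.csv', 'rb') as src:
    dst.write(header)
    for pos in reversed(offsets):
        src.seek(pos)
        dst.write(src.readline())
```

Like the other line-oriented answers here, this assumes no embedded newlines inside cell values.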





  • Unless you have a hard requirement to also run this on a different OS, this is probably going to be the fastest/best answer around.

    – Julian
    Nov 20 '18 at 21:24
































I would not recommend using pandas for parsing or streaming any files, as you would only be introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here, where it takes in a file and returns the reverse in a generator, which I believe is what you want.



What I did was just test it with your train.csv file from the provided link and output the results to a new file.



import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order."""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # The first line of the buffer is probably not a complete line, so
            # save it and append it to the last line of the next buffer we read.
            if segment is not None:
                # If the previous chunk starts right at the beginning of a line,
                # do not concatenate the segment to the last line of the new
                # chunk; instead, yield the segment first.
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty.
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')

with open('rev_train.csv', 'w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))


It basically reads it in reverse until it finds a newline then yields a line from the file from the bottom to top. A pretty interesting way of doing it.






  • 2





    This will actually result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.

    – Martijn Pieters
    Nov 18 '18 at 14:24












  • Next, an RFC-compliant CSV file will use \r\n line separators; your code assumes that \n is used, always, and that there are no embedded newlines in the cell values.

    – Martijn Pieters
    Nov 19 '18 at 8:57
































This does exactly what you are requesting, but without pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can trade off the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.



For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB CSV test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data; for that use case, pandas would be the best solution to read the chunks.



from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter, reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()





  • 2





    This does work, but assumes that there are no newlines in the CSV cell values.

    – Martijn Pieters
    Nov 18 '18 at 14:32











  • Thanks, @MartijnPieters. You are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' inbuilt chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirement).

    – Mark Warburton
    Nov 18 '18 at 23:00






  • 1





    It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.

    – Martijn Pieters
    Nov 19 '18 at 8:58
































If you have sufficient hard disk space, you can read in chunks, reverse each one, and store it. Then pick up the stored chunks in reverse order and write them to a new CSV file.



Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).



import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse;
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')

print(df_new)

    0   1   2   3   4
0  40  41  42  43  44
1  35  36  37  38  39
2  30  31  32  33  34
3  25  26  27  28  29
4  20  21  22  23  24
5  15  16  17  18  19
6  10  11  12  13  14
7   5   6   7   8   9
8   0   1   2   3   4





  • Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the csv file. That way the entire file is reversed at the end, chunk by chunk, without loading the entire file into memory.

    – Sai Kumar
    Nov 16 '18 at 23:56












  • @SaiKumar, Got it, have updated with a solution which performs a complete reversal.

    – jpp
    Nov 18 '18 at 18:40
































You have repeated code blocks and you are not taking advantage of pandas at all.



What @sujay kumar pointed out is very correct; I would read that more closely.



The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do chunked transfers. Sure, it will take time, but it will work fine, unless you are going into terabytes. I have not tested with that.



When you read_csv(), you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order.



Pandas can write CSV too; please use that instead. I am pasting some example code for you to put together.



df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                      usecols=["Date", "Adj Close"], na_values=["nan"])



Sorting a Series:



s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()



Note: if you stick to pandas and its functions, you will be running already-optimized code. It's so easy that it's almost like cheating :)
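Put together for this question's shape of data, a minimal sketch of the approach described above (not from the original answer; file names and columns are stand-ins, and note that, unlike the chunked answers, this read_csv call does load the whole file into memory):

```python
import pandas as pd

# Tiny stand-in file so the sketch is self-contained; the real case
# would read train.csv with its full column set instead.
with open('train_demo.csv', 'w') as f:
    f.write('Date,Sales\n2015-01-03,4\n2015-01-01,3\n2015-01-02,5\n')

# Parse Date as the index, then sort it oldest-first and write back out.
df = pd.read_csv('train_demo.csv', parse_dates=True, index_col='Date')
df.sort_index().to_csv('train_sorted.csv')
```

This gives oldest-to-latest order directly, rather than reversing the file, which also handles input that was not perfectly sorted to begin with.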






edited Nov 20 '18 at 16:01
answered Nov 20 '18 at 15:47 by gustavovelascoh












    • Unless you have a hard requirement to also run this on a different os, this is probably going to be the fastest/best answer around.

      – Julian
      Nov 20 '18 at 21:24

















    • Unless you have a hard requirement to also run this on a different os, this is probably going to be the fastest/best answer around.

      – Julian
      Nov 20 '18 at 21:24
















    Unless you have a hard requirement to also run this on a different os, this is probably going to be the fastest/best answer around.

    – Julian
    Nov 20 '18 at 21:24





    Unless you have a hard requirement to also run this on a different os, this is probably going to be the fastest/best answer around.

    – Julian
    Nov 20 '18 at 21:24













    0














    I would not recommend using pandas for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.



    What I did was just tested it with your file train.csv from the provided link and output the results in a new file.



    import os

    def reverse_readline(filename, buf_size=8192):
    """a generator that returns the lines of a file in reverse order"""
    with open(filename) as fh:
    segment = None
    offset = 0
    fh.seek(0, os.SEEK_END)
    file_size = remaining_size = fh.tell()
    while remaining_size > 0:
    offset = min(file_size, offset + buf_size)
    fh.seek(file_size - offset)
    buffer = fh.read(min(remaining_size, buf_size))
    remaining_size -= buf_size
    lines = buffer.split('n')
    # the first line of the buffer is probably not a complete line so
    # we'll save it and append it to the last line of the next buffer
    # we read
    if segment is not None:
    # if the previous chunk starts right from the beginning of line
    # do not concact the segment to the last line of new chunk
    # instead, yield the segment first
    if buffer[-1] != 'n':
    lines[-1] += segment
    else:
    yield segment
    segment = lines[0]
    for index in range(len(lines) - 1, 0, -1):
    if lines[index]:
    yield lines[index]
    # Don't yield None if the file was empty
    if segment is not None:
    yield segment

    reverse_gen = reverse_readline('train.csv')

    with open('rev_train.csv','w') as f:
    for row in reverse_gen:
    f.write('n'.format(row))


    It basically reads it in reverse until it finds a newline then yields a line from the file from the bottom to top. A pretty interesting way of doing it.






    share|improve this answer


















    • 2





      This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.

      – Martijn Pieters
      Nov 18 '18 at 14:24












    • Next, a RFC compliant CSV file will use rn line separators, your code assumes that n is used, always, and that there are no embedded newlines in the cell values.

      – Martijn Pieters
      Nov 19 '18 at 8:57















    answered Nov 17 '18 at 3:02 by BernardL





















    This does exactly what you are requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can trade off the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.



    For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB csv test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is it does not reposition any file pointers. One limitation is it will not work on CSV files that have newlines in the data. For that use case, pandas would be the best solution to read the chunks.



    from os import remove

    def writechunk(fileCounter, reverseString):
        outFile = 'tmpfile' + str(fileCounter) + '.csv'
        with open(outFile, 'w') as outfp:
            outfp.write(reverseString)
        return

    def main():
        inFile = 'intest.csv'
        outFile = 'outtest.csv'
        # This is our chunk expressed in lines
        maxLines = 10
        # Is there a header line we want to keep at the top of the output file?
        keepHeader = True

        fileCounter = 0
        lineCounter = 0
        with open(inFile) as infp:
            reverseString = ''
            line = infp.readline()
            if (line and keepHeader):
                headerLine = line
                line = infp.readline()
            while (line):
                lineCounter += 1
                reverseString = line + reverseString
                if (lineCounter == maxLines):
                    fileCounter += 1
                    lineCounter = 0
                    writechunk(fileCounter, reverseString)
                    reverseString = ''
                line = infp.readline()
            # Write any leftovers to a chunk file
            if (lineCounter != 0):
                fileCounter += 1
                writechunk(fileCounter, reverseString)
        # Read the chunk files backwards and append each to the outFile
        with open(outFile, 'w') as outfp:
            if (keepHeader):
                outfp.write(headerLine)
            while (fileCounter > 0):
                chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
                with open(chunkFile, 'r') as infp:
                    outfp.write(infp.read())
                remove(chunkFile)
                fileCounter -= 1

    if __name__ == '__main__':
        main()

























    • 2





      This does work, but assumes that there are no newlines in the CSV cell values.

      – Martijn Pieters
      Nov 18 '18 at 14:32











    • Thanks, @MartijnPieters. You are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' inbuilt chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirement).

      – Mark Warburton
      Nov 18 '18 at 23:00






    • 1





      It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.

      – Martijn Pieters
      Nov 19 '18 at 8:58
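Following up on that exchange, here is a hedged sketch of the same temp-file idea but with pandas doing the parsing, so quoted newlines in cell values are handled. The function `reverse_csv_chunked` and its temp-file names are hypothetical, not part of the answer above:

```python
import os
import pandas as pd

def reverse_csv_chunked(in_path, out_path, chunksize=10):
    """Reverse a CSV via temporary chunk files on disk; pandas' parser
    handles quoted embedded newlines that plain line-splitting breaks."""
    tmp_files = []
    # Write each chunk out reversed, keeping track of the part files
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        tmp = f'{out_path}.part{i}'
        chunk.iloc[::-1].to_csv(tmp, index=False, header=False)
        tmp_files.append(tmp)
    with open(out_path, 'w') as fout:
        # Header first, then the chunks from last to first
        fout.write(','.join(pd.read_csv(in_path, nrows=0).columns) + '\n')
        for tmp in reversed(tmp_files):
            with open(tmp) as fin:
                fout.write(fin.read())
            os.remove(tmp)
```

Only one chunk of rows is ever held in memory at a time; the rest of the state lives in the part files, as in the answer above.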















    edited Nov 18 '18 at 23:03; answered Nov 17 '18 at 8:13 by Mark Warburton





















    If you have sufficient hard disk space, you can read in chunks, reverse and store. Then pick up the stored chunks in reverse order and write to a new csv file.



    Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).



    import pandas as pd, numpy as np

    # create a dataframe for demonstration purposes
    df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
    df.to_csv('file.csv', index=False)

    # number of rows we want to chunk by
    n = 3

    # iterate chunks, output to pickle files
    for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
        chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03d}.pkl.gzip', compression='gzip')

    # open file in append mode and write chunks in reverse
    # idx stores the index of the last pickle file written
    with open('out.csv', 'a') as fout:
        for i in range(idx, -1, -1):
            chunk_pkl = pd.read_pickle(f'file_pkl_{i:03d}.pkl.gzip', compression='gzip')
            chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

    # read new file to check results
    df_new = pd.read_csv('out.csv')

    print(df_new)

        0   1   2   3   4
    0  40  41  42  43  44
    1  35  36  37  38  39
    2  30  31  32  33  34
    3  25  26  27  28  29
    4  20  21  22  23  24
    5  15  16  17  18  19
    6  10  11  12  13  14
    7   5   6   7   8   9
    8   0   1   2   3   4

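The `:03d` format spec in those pickle filenames zero-pads the chunk index, so the files also list in numeric order on disk — for example:

```python
# Zero-padded f-string formatting for the chunk filenames
for idx in (0, 7, 42):
    print(f'file_pkl_{idx:03d}.pkl.gzip')
# file_pkl_000.pkl.gzip
# file_pkl_007.pkl.gzip
# file_pkl_042.pkl.gzip
```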





























    • Actually, each file chunk should be reversed and next each reversed chunk of rows should be prepended rather than appending to the csv file. So the entire file will be reversed at the end but chunk by chunk without loading entire file into memory.

      – Sai Kumar
      Nov 16 '18 at 23:56












    • @SaiKumar, Got it, have updated with a solution which performs a complete reversal.

      – jpp
      Nov 18 '18 at 18:40















    edited Nov 20 '18 at 10:02; answered Nov 16 '18 at 23:27 by jpp

















    -3














You have repeated code blocks and you are not taking advantage of pandas at all.

What @sujay kumar pointed out is very correct; I would read that more closely.

The file isn't big at all. I use OHLCV tick data that runs into gigabytes without issues. If you use pandas.read_csv() you don't have to do a chunked transfer. Sure, it will take time, but it will work fine, unless you are going into terabytes. I have not tested with that.

When you read_csv() you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order you need.

Pandas can write CSV too; please use that instead. I am pasting some example code for you to put together.

    df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                          usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

    s = pd.Series(list('abcde'), index=[0,3,2,5,4])
    s.sort_index()

Note: If you stick to pandas and its functions you will be running already-optimized code which doesn't require entire files to be loaded in memory. It's so easy that it's almost like cheating :)
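Putting the pieces of this answer together, here is a hedged sketch of the index-based approach: parse Date as the index, sort it descending, and let pandas write the CSV back out. The CSV content and column names are made up for illustration.

```python
# Sketch of the index-based approach: Date becomes the index, sort_index
# reorders the rows, and to_csv writes the result. Data is illustrative.
import io
import pandas as pd

csv_text = "Date,Adj Close\n2018-01-01,10.0\n2018-01-02,11.0\n2018-01-03,12.0\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")

# ascending=False puts the newest row first; drop the flag for oldest first.
df_sorted = df.sort_index(ascending=False)
out = df_sorted.to_csv()  # or df_sorted.to_csv("processed.csv") to write a file
```

Note that sort_index sorts the whole DataFrame in memory, so this only fits the answer's premise that the file is small enough to load at once.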






        edited Nov 18 '18 at 2:34

























answered Nov 13 '18 at 23:49

Abhishek Dujari


























