Read and reverse data chunk by chunk from a csv file and copy to a new csv file
Assume I'm dealing with a very large CSV file, so I can only read the data into memory chunk by chunk. The expected flow of events is:
1) Read a chunk (e.g. 10 rows) of data from the CSV using pandas.
2) Reverse the order of the data.
3) Copy each row of the chunk to a new CSV file in reverse, so the new file is built up chunk by chunk (10 rows at a time) in reversed order.
In the end the CSV file should be in reversed order, and this should be done without loading the entire file into memory (on Windows).
I am doing time-series forecasting, so I need the data ordered from oldest to latest (first row = oldest entry). I can't load the entire file into memory, so I'm looking for a way to do it one chunk at a time, if that is possible.
I tried it on train.csv of the Rossmann dataset from kaggle. You can get it from this github repo.
My attempt does not copy the rows into the new CSV file properly.
Shown below is my code:
import pandas as pd
import csv

def reverse():
    fields = ["Store", "DayOfWeek", "Date", "Sales", "Customers", "Open",
              "Promo", "StateHoliday", "SchoolHoliday"]
    with open('processed_train.csv', mode='a') as stock_file:
        writer = csv.writer(stock_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(fields)
    for chunk in pd.read_csv("train.csv", chunksize=10):
        store_data = chunk.reindex(index=chunk.index[::-1])
        append_data_csv(store_data)

def append_data_csv(store_data):
    with open('processed_train.csv', mode='a') as store_file:
        writer = csv.writer(store_file, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        for index, row in store_data.iterrows():
            print(row)
            writer.writerow([row['Store'], row['DayOfWeek'], row['Date'],
                             row['Sales'], row['Customers'], row['Open'],
                             row['Promo'], row['StateHoliday'],
                             row['SchoolHoliday']])

reverse()
Thank you in advance.
python python-3.x pandas csv dataset
– Sujay Kumar (Oct 29 '18 at 7:03): You want the entire data to be time-sorted, but you're sorting & writing 10 rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse train.csv from the end? See stackoverflow.com/a/10933932/2895956
– Suleka_28 (Oct 29 '18 at 7:21): I tried the code given in that post earlier, but it did not give a proper result for me.
– Sujay Kumar (Oct 29 '18 at 7:29): Can you elaborate? What exactly are you getting? And if possible, upload the file to Dropbox. The process is too long (to download that file a person has to log in, register, then fill out information...) just to download the file.
– Sujay Kumar (Oct 29 '18 at 9:20): Note: not a perfect solution, but you can simply reverse the file from the command line with tail -r train.csv > reverse.csv, then manually remove the last line and add it as the first line of reverse.csv.
asked Oct 29 '18 at 6:32 by Suleka_28, edited Nov 17 '18 at 6:35 by Sai Kumar
5 Answers
Using bash, you can take the whole file except the first line with tail, reverse it with tac, and store it with this:
tail -n +2 train.csv | tac > train_rev.csv
If you want to keep the header in the reversed file, write it first and then append the reversed content:
head -1 train.csv > train_rev.csv; tail -n +2 train.csv | tac >> train_rev.csv
– Julian (Nov 20 '18 at 21:24): Unless you have a hard requirement to also run this on a different OS, this is probably going to be the fastest/best answer around.
I would not recommend using pandas for parsing or streaming any files, as you would only be introducing additional overhead. The best way to do it is to read the file from the bottom up. A big part of this code actually comes from here, where it takes in a file and returns its reverse as a generator, which I believe is what you want.
What I did was simply test it with your file train.csv from the provided link and output the results to a new file.
import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order."""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # the first line of the buffer is probably not a complete line so
            # we'll save it and append it to the last line of the next buffer
            # we read
            if segment is not None:
                # if the previous chunk starts right at the beginning of a line,
                # do not concatenate the segment to the last line of the new
                # chunk; instead, yield the segment first
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv', 'w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))
It basically reads the file backwards until it finds a newline, then yields a line, working from the bottom of the file to the top. A pretty interesting way of doing it.
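If you also want to keep the header row on top of the output (as the question's own code does), you can write it first and skip it when it finally comes around. A runnable sketch under assumptions: the file names are illustrative, and for brevity the demo reverses via a plain in-memory read; for a genuinely huge file you would swap in the reverse_readline() generator above.

```python
# Hedged sketch: keep the CSV header first while writing the rest reversed.
# NOTE: reversed_lines() here reads the whole file (fine for a demo only);
# substitute the chunked reverse_readline() above for large files.
def reversed_lines(path):
    with open(path) as fh:
        return reversed(fh.read().splitlines())

# Build a tiny stand-in for train.csv so the sketch is self-contained.
with open('train_small.csv', 'w') as f:
    f.write('Date,Sales\n2015-07-29,1\n2015-07-30,2\n2015-07-31,3\n')

with open('train_small.csv') as f:
    header = f.readline().rstrip('\n')

with open('rev_small.csv', 'w') as out:
    out.write(header + '\n')
    for line in reversed_lines('train_small.csv'):
        if line and line != header:   # drop blanks and the trailing header
            out.write(line + '\n')
```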
– Martijn Pieters♦ (Nov 18 '18 at 14:24): This will actually result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦ (Nov 19 '18 at 8:57): Also, an RFC-compliant CSV file will use \r\n line separators; your code assumes that \n is always used, and that there are no embedded newlines in the cell values.
This does exactly what you are requesting, but without pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can trade the number of chunk files produced against the RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, the entire file is reversed, including the first line.
For kicks, I ran this on an old Raspberry Pi using a 128 GB flash drive and a 6 MB CSV test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data; for that use case, pandas would be the better way to read the chunks.
from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True

    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter, reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
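The line-by-line splitting above assumes no newlines inside quoted cells. A hedged sketch of the same chunk-file strategy, but parsed with pandas so quoted cells containing newlines are handled correctly (the function and file names here are illustrative, not from the code above):

```python
# Hedged variant: chunk files as above, but pandas does the CSV parsing,
# so embedded newlines inside quoted cells are handled correctly.
import os
import pandas as pd

def reverse_csv_pandas(in_path, out_path, chunksize=10):
    tmp_files = []
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        tmp = '{}.part{}'.format(out_path, i)
        chunk.iloc[::-1].to_csv(tmp, index=False, header=False)
        tmp_files.append(tmp)
    # Write the header once, then the chunk files newest-first
    pd.read_csv(in_path, nrows=0).to_csv(out_path, index=False)
    with open(out_path, 'a') as out:
        for tmp in reversed(tmp_files):
            with open(tmp) as fp:
                out.write(fp.read())
            os.remove(tmp)
```

This trades the minimal-footprint goal for correctness on arbitrary CSV content, as discussed in the comments on this answer.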
– Martijn Pieters♦ (Nov 18 '18 at 14:32): This does work, but assumes that there are no newlines in the CSV cell values.
– Mark Warburton (Nov 18 '18 at 23:00): Thanks, @MartijnPieters, you are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' built-in chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirements).
– Martijn Pieters♦ (Nov 19 '18 at 8:58): It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.
If you have sufficient hard disk space, you can read in chunks, reverse each one, and store them. Then pick up the stored chunks in reverse order and write them to a new CSV file.
Below is an example with pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).
import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03d}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03d}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')
print(df_new)
0 1 2 3 4
0 40 41 42 43 44
1 35 36 37 38 39
2 30 31 32 33 34
3 25 26 27 28 29
4 20 21 22 23 24
5 15 16 17 18 19
6 10 11 12 13 14
7 5 6 7 8 9
8 0 1 2 3 4
– Sai Kumar (Nov 16 '18 at 23:56): Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the CSV file. That way the entire file ends up reversed, chunk by chunk, without loading the whole file into memory.
– jpp (Nov 18 '18 at 18:40): @SaiKumar, got it, I have updated with a solution which performs a complete reversal.
You have repeated code blocks and you are not taking advantage of pandas at all.
What @SujayKumar pointed out is very correct; I would read that more closely.
The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do a chunked transfer. Sure, it will take time, but it will work fine, unless you are getting into terabytes; I have not tested with that.
When you read_csv() you don't specify any index. If you did, you could call sort_index(), with or without ascending=False depending on the order you need.
Pandas can write CSV too; please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                      usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: if you stick to pandas and its functions, you will be running already-optimized code which doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
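Putting the fragments above together, a minimal self-contained sketch of the sort_index() suggestion (a tiny in-memory CSV via io.StringIO stands in for a real file path; column names are illustrative):

```python
import io
import pandas as pd

# Tiny stand-in for a CSV file whose rows are newest-first
csv_text = "Date,Sales\n2015-07-31,5263\n2015-07-30,5020\n2015-07-29,4782\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")

# Oldest entry first, as the question requires (ascending is the default)
df_sorted = df.sort_index()
print(df_sorted)
```

Note that sort_index() does sort the whole frame in memory; it is the simplest route whenever the file actually fits in RAM, as this answer argues.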
(bash tail/tac answer: answered Nov 20 '18 at 15:47, edited Nov 20 '18 at 16:01 by gustavovelascoh)
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
2
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, a RFC compliant CSV file will usern
line separators, your code assumes thatn
is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
add a comment |
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
2
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, a RFC compliant CSV file will usern
line separators, your code assumes thatn
is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
add a comment |
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os
def reverse_readline(filename, buf_size=8192):
"""a generator that returns the lines of a file in reverse order"""
with open(filename) as fh:
segment = None
offset = 0
fh.seek(0, os.SEEK_END)
file_size = remaining_size = fh.tell()
while remaining_size > 0:
offset = min(file_size, offset + buf_size)
fh.seek(file_size - offset)
buffer = fh.read(min(remaining_size, buf_size))
remaining_size -= buf_size
lines = buffer.split('n')
# the first line of the buffer is probably not a complete line so
# we'll save it and append it to the last line of the next buffer
# we read
if segment is not None:
# if the previous chunk starts right from the beginning of line
# do not concact the segment to the last line of new chunk
# instead, yield the segment first
if buffer[-1] != 'n':
lines[-1] += segment
else:
yield segment
segment = lines[0]
for index in range(len(lines) - 1, 0, -1):
if lines[index]:
yield lines[index]
# Don't yield None if the file was empty
if segment is not None:
yield segment
reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv','w') as f:
for row in reverse_gen:
f.write('n'.format(row))
It basically reads it in reverse until it finds a newline then yields a line
from the file from the bottom to top. A pretty interesting way of doing it.
I would not recommend using pandas
for parsing or streaming any files as you are only introducing additional overhead. The best way to do it is to read the file from the bottom up. Well, a big part of this code actually comes from here where it takes in a file and returns the reverse in a generator, which I believe is what you want.
What I did was just tested it with your file train.csv
from the provided link and output the results in a new file.
import os

def reverse_readline(filename, buf_size=8192):
    """A generator that returns the lines of a file in reverse order."""
    with open(filename) as fh:
        segment = None
        offset = 0
        fh.seek(0, os.SEEK_END)
        file_size = remaining_size = fh.tell()
        while remaining_size > 0:
            offset = min(file_size, offset + buf_size)
            fh.seek(file_size - offset)
            buffer = fh.read(min(remaining_size, buf_size))
            remaining_size -= buf_size
            lines = buffer.split('\n')
            # The first line of the buffer is probably not a complete line,
            # so we save it and append it to the last line of the next
            # buffer we read.
            if segment is not None:
                # If the previous chunk starts right at the beginning of a
                # line, do not concatenate the segment to the last line of
                # the new chunk; instead, yield the segment first.
                if buffer[-1] != '\n':
                    lines[-1] += segment
                else:
                    yield segment
            segment = lines[0]
            for index in range(len(lines) - 1, 0, -1):
                if lines[index]:
                    yield lines[index]
        # Don't yield None if the file was empty
        if segment is not None:
            yield segment

reverse_gen = reverse_readline('train.csv')
with open('rev_train.csv', 'w') as f:
    for row in reverse_gen:
        f.write('{}\n'.format(row))
It basically reads the file in reverse until it finds a newline, then yields a line, working from the bottom of the file to the top. A pretty interesting way of doing it.
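One side effect of reversing every line is that the CSV header ends up at the bottom of the output. As a complement to the buffered approach above, here is a self-contained sketch (my own, not from the answer) that keeps the header on top by recording one byte offset per line on a first pass, then seeking back through the offsets in reverse. It shares the answer's assumption that cell values contain no embedded newlines.

```python
def reverse_csv_keep_header(in_path, out_path):
    """Reverse the data rows of a CSV while keeping the header line first.

    Pass 1 records the byte offset of every line (a few bytes per row,
    never the file contents); pass 2 seeks to each offset from the end.
    Assumes no embedded newlines inside quoted cell values.
    """
    offsets = []
    pos = 0
    with open(in_path, 'rb') as f:
        for line in f:            # stream the file, one line at a time
            offsets.append(pos)
            pos += len(line)
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        dst.write(src.readline())            # header stays first
        for off in reversed(offsets[1:]):    # data rows, last to first
            src.seek(off)
            dst.write(src.readline())
```

The offset list grows with the row count, but it stays tiny compared to the file itself (one integer per row rather than the row's contents).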
answered Nov 17 '18 at 3:02
BernardL
This actually will result in overlapping and missing chunks of text if UTF-8 is used. You can't just seek in a variable-width encoded file! Next, if the CSV file has embedded newlines in cell values, reversing can't just split on newlines arbitrarily.
– Martijn Pieters♦
Nov 18 '18 at 14:24
Next, an RFC-compliant CSV file will use \r\n line separators; your code assumes that \n is used, always, and that there are no embedded newlines in the cell values.
– Martijn Pieters♦
Nov 19 '18 at 8:57
This does exactly what you are requesting, but without Pandas. It reads intest.csv line by line (as opposed to reading the whole file into RAM). It does most of the processing on the file system, using a series of chunk files that are aggregated at the end into the outtest.csv file. If you change maxLines, you can tune the number of chunk files produced versus RAM consumed (higher numbers consume more RAM but produce fewer chunk files). If you want to keep the CSV header as the first line, set keepHeader to True; if set to False, it reverses the entire file, including the first line.
For kicks, I ran this on an old Raspberry Pi using a 128GB flash drive on a 6MB CSV test file, and I thought something had gone wrong because it returned almost immediately, so it's fast even on slower hardware. It imports only one standard Python library function (remove), so it's very portable. One advantage of this code is that it does not reposition any file pointers. One limitation is that it will not work on CSV files that have newlines in the data. For that use case, pandas would be the best solution for reading the chunks.
from os import remove

def writechunk(fileCounter, reverseString):
    outFile = 'tmpfile' + str(fileCounter) + '.csv'
    with open(outFile, 'w') as outfp:
        outfp.write(reverseString)
    return

def main():
    inFile = 'intest.csv'
    outFile = 'outtest.csv'
    # This is our chunk expressed in lines
    maxLines = 10
    # Is there a header line we want to keep at the top of the output file?
    keepHeader = True
    fileCounter = 0
    lineCounter = 0
    with open(inFile) as infp:
        reverseString = ''
        line = infp.readline()
        if (line and keepHeader):
            headerLine = line
            line = infp.readline()
        while (line):
            lineCounter += 1
            reverseString = line + reverseString
            if (lineCounter == maxLines):
                fileCounter += 1
                lineCounter = 0
                writechunk(fileCounter, reverseString)
                reverseString = ''
            line = infp.readline()
        # Write any leftovers to a chunk file
        if (lineCounter != 0):
            fileCounter += 1
            writechunk(fileCounter, reverseString)
    # Read the chunk files backwards and append each to the outFile
    with open(outFile, 'w') as outfp:
        if (keepHeader):
            outfp.write(headerLine)
        while (fileCounter > 0):
            chunkFile = 'tmpfile' + str(fileCounter) + '.csv'
            with open(chunkFile, 'r') as infp:
                outfp.write(infp.read())
            remove(chunkFile)
            fileCounter -= 1

if __name__ == '__main__':
    main()
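If cell values can contain quoted newlines (the limitation noted above), pandas can take over the parsing while keeping the same temp-chunk-file strategy. A hedged sketch along those lines (the function name and temp-file naming are illustrative, not from the answer):

```python
import os
import pandas as pd


def reverse_csv_pandas(in_path, out_path, chunksize=10):
    """Reverse a CSV chunk by chunk with pandas.

    pandas handles CSV quoting, so embedded newlines in cell values are
    parsed correctly. Each reversed chunk goes to a temp file; the temp
    files are then concatenated in reverse order, so the whole file is
    never in memory at once.
    """
    tmp_files = []
    for i, chunk in enumerate(pd.read_csv(in_path, chunksize=chunksize)):
        tmp = '{}.part{}'.format(out_path, i)
        chunk.iloc[::-1].to_csv(tmp, index=False, header=False)
        tmp_files.append(tmp)
    with open(out_path, 'w', newline='') as out:
        # write the header once, taken from the input's column row
        pd.read_csv(in_path, nrows=0).to_csv(out, index=False)
        for tmp in reversed(tmp_files):
            with open(tmp, newline='') as part:
                out.write(part.read())
            os.remove(tmp)
```

This trades the minimal-footprint design for correctness on quoted data; the chunk files and extra parse add overhead, as discussed in the comments.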
This does work, but assumes that there are no newlines in the CSV cell values.
– Martijn Pieters♦
Nov 18 '18 at 14:32
Thanks, @MartijnPieters. You are correct. However, this was the use case expressed in the original question. I will update the answer to point out this limitation. This was designed to have the smallest footprint possible, but we could potentially use pandas' inbuilt chunk function to read bits of the file, which would overcome that limitation at the expense of efficiency (and a modest increase in memory requirement).
– Mark Warburton
Nov 18 '18 at 23:00
It absolutely is in the use case expressed, which asks to be able to handle any giant CSV file.
– Martijn Pieters♦
Nov 19 '18 at 8:58
edited Nov 18 '18 at 23:03
answered Nov 17 '18 at 8:13
Mark Warburton
If you have sufficient hard disk space, you can read in chunks, reverse, and store. Then pick up the stored chunks in reverse order and write them to a new CSV file.
Below is an example with Pandas which also uses pickle (for performance efficiency) and gzip (for storage efficiency).
import pandas as pd, numpy as np

# create a dataframe for demonstration purposes
df = pd.DataFrame(np.arange(5*9).reshape((-1, 5)))
df.to_csv('file.csv', index=False)

# number of rows we want to chunk by
n = 3

# iterate chunks, output to pickle files
for idx, chunk in enumerate(pd.read_csv('file.csv', chunksize=n)):
    chunk.iloc[::-1].to_pickle(f'file_pkl_{idx:03}.pkl.gzip', compression='gzip')

# open file in append mode and write chunks in reverse
# idx stores the index of the last pickle file written
with open('out.csv', 'a') as fout:
    for i in range(idx, -1, -1):
        chunk_pkl = pd.read_pickle(f'file_pkl_{i:03}.pkl.gzip', compression='gzip')
        chunk_pkl.to_csv(fout, index=False, header=False if i != idx else True)

# read new file to check results
df_new = pd.read_csv('out.csv')
print(df_new)
0 1 2 3 4
0 40 41 42 43 44
1 35 36 37 38 39
2 30 31 32 33 34
3 25 26 27 28 29
4 20 21 22 23 24
5 15 16 17 18 19
6 10 11 12 13 14
7 5 6 7 8 9
8 0 1 2 3 4
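One small follow-up: the gzip-pickled chunk files are left behind on disk. Assuming the naming scheme above, a helper like this (my own addition) can delete them once out.csv is written:

```python
import glob
import os


def remove_pickle_chunks(pattern='file_pkl_*.pkl.gzip'):
    """Delete the temporary gzip-pickle chunk files created above."""
    for path in glob.glob(pattern):
        os.remove(path)
```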
Actually, each file chunk should be reversed, and then each reversed chunk of rows should be prepended rather than appended to the csv file. So the entire file will be reversed at the end, but chunk by chunk, without loading the entire file into memory.
– Sai Kumar
Nov 16 '18 at 23:56
@SaiKumar, Got it, have updated with a solution which performs a complete reversal.
– jpp
Nov 18 '18 at 18:40
edited Nov 20 '18 at 10:02
answered Nov 16 '18 at 23:27
jpp
You have repeated code blocks and you are not taking advantage of pandas at all.
What @sujay kumar pointed out is very correct, I would read that more closely.
The file isn't big at all. I use OHLCV tick data that is in GBs without issues. If you use pandas.read_csv() you don't have to do chunked transfer. Sure, it will take time, but it will work fine, unless you are going into terabytes. I have not tested with that.
When you read_csv() you don't specify any index. If you did, you could call sort_index() with or without ascending=False depending on the order you want.
Pandas can write CSV too, please use that instead. I am pasting some example code for you to put together.

df_temp = pd.read_csv(file_path, parse_dates=True, index_col="Date",
                      usecols=["Date", "Adj Close"], na_values=["nan"])

Sorting a Series:

s = pd.Series(list('abcde'), index=[0,3,2,5,4])
s.sort_index()

Note: If you stick to Pandas and its functions you will be running already-optimized code which doesn't require entire files to be loaded into memory. It's so easy that it's almost like cheating :)
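Putting those pieces together, a minimal sketch (using a toy file; the column names echo the Rossmann data but the values are made up) that reads the CSV with a date index and sorts it oldest-first:

```python
import pandas as pd

# toy stand-in for train.csv; the real data has many more columns
pd.DataFrame({'Date': ['2015-07-31', '2015-07-29', '2015-07-30'],
              'Sales': [5263, 6064, 4782]}).to_csv('toy_train.csv', index=False)

df = pd.read_csv('toy_train.csv', parse_dates=True, index_col='Date')
df = df.sort_index()  # oldest entry first; pass ascending=False for newest first
print(df.index[0].date())  # 2015-07-29
```

For the forecasting use case in the question, this yields the first row as the oldest entry without any manual reversing, provided the whole file fits in memory as this answer argues.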
edited Nov 18 '18 at 2:34
answered Nov 13 '18 at 23:49
Abhishek Dujari
You want the entire data to be time-sorted, but you're sorting & writing 10 rows at a time FROM THE BEGINNING. That's why you aren't getting results. Why don't you read & reverse the train.csv from the end? See stackoverflow.com/a/10933932/2895956
– Sujay Kumar
Oct 29 '18 at 7:03
I tried the code given in that post earlier, but it did not give a proper result for me.
– Suleka_28
Oct 29 '18 at 7:21
Can you elaborate? What exactly are you getting? And if possible, upload the file to Dropbox. The process is too long (to download that file a person has to log in, register, then fill out information...) just to download the file.
– Sujay Kumar
Oct 29 '18 at 7:29
Note: not a perfect solution, but you can simply reverse the file using the command line: tail -r train.csv > reverse.csv. Then manually remove the last line and add it to the first line of reverse.csv.
– Sujay Kumar
Oct 29 '18 at 9:20
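The manual header fix-up in that comment can be scripted too. A small sketch (with a toy stand-in file; tac is the GNU coreutils counterpart of BSD's tail -r):

```shell
# toy stand-in for train.csv: header plus three data rows
printf 'Date,Sales\n2015-07-29,1\n2015-07-30,2\n2015-07-31,3\n' > train_demo.csv

# keep the header first, reverse only the data rows
head -n 1 train_demo.csv > reverse_demo.csv
tail -n +2 train_demo.csv | tac >> reverse_demo.csv
```

Like the other line-based approaches here, this assumes no embedded newlines in cell values.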