How can I filter lines on load in Pandas read_csv function?
























How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?



Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.










Tags: pandas






asked Nov 30 '12 at 18:38 by benjaminwilson, edited Oct 16 '17 at 12:25 by Martin Thoma






















4 Answers
































          There isn't an option to filter the rows before the CSV file is loaded into a pandas object.



          You can either load the file and then filter using df[df['field'] > constant], or if you have a very large file and you are worried about memory running out, then use an iterator and apply the filter as you concatenate chunks of your file e.g.:



          import pandas as pd
          iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
          df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])


          You can vary the chunksize to suit your available memory. See here for more details.






answered Nov 30 '12 at 21:31 by Matti John, edited Apr 20 at 9:49 by Madhup Kumar






















          • for chunk['filed']>constant can I sandwich it between 2 constant values? E.g.: constant1 > chunk['field'] > constant2. Or can I use 'in range' ?
            – weefwefwqg3
            Feb 19 '17 at 6:32
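On the comment's question: yes, two comparisons can be combined with `&` (chained Python comparisons like `constant1 > chunk['field'] > constant2` raise an error on a Series), or with `Series.between`. A minimal sketch with illustrative data and bounds:

```python
import pandas as pd

chunk = pd.DataFrame({'field': [1, 5, 10, 20]})
constant1, constant2 = 4, 15

# Combine two boolean masks with & (parentheses are required because
# & binds more tightly than the comparison operators).
mask = (chunk['field'] > constant1) & (chunk['field'] < constant2)

# Series.between is equivalent, but inclusive of both endpoints by default.
mask_between = chunk['field'].between(constant1, constant2)

print(chunk[mask]['field'].tolist())  # [5, 10]
```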
































          I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector df[bool_vec]:



          filtered = df[(df['timestamp'] > targettime)]


          This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than targettime.
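A self-contained sketch of that approach (the file contents and target time are made up for illustration); parse_dates ensures the comparison runs on datetimes rather than strings:

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "timestamp,value\n"
    "2012-11-30 10:00:00,1\n"
    "2012-11-30 12:00:00,2\n"
    "2012-11-30 14:00:00,3\n"
)

# parse_dates converts the column to datetime64 on load
df = pd.read_csv(csv_data, parse_dates=['timestamp'])

targettime = pd.Timestamp('2012-11-30 11:00:00')
filtered = df[df['timestamp'] > targettime]

print(filtered['value'].tolist())  # [2, 3]
```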






answered Nov 30 '12 at 19:43 by Griffin, edited May 23 '17 at 11:47 by Community












































            You can specify the nrows parameter.




            import pandas as pd
            df = pd.read_csv('file.csv', nrows=100)



            This code works in pandas version 0.20.3.
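Note that nrows only truncates the file after a fixed number of rows; it doesn't filter by value. If the filter can be phrased in terms of row numbers, read_csv also accepts a callable for skiprows, called with each row index and skipping rows where it returns True (the every-other-row rule here is just an illustration):

```python
import io

import pandas as pd

csv_data = io.StringIO("a\n0\n1\n2\n3\n4\n5\n")

# Keep the header (row 0), skip every even-numbered line after it.
df = pd.read_csv(csv_data, skiprows=lambda i: i != 0 and i % 2 == 0)

print(df['a'].tolist())  # [0, 2, 4]
```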






answered Nov 12 at 5:59 by user1083290










































              If you are on Linux you can use grep.

              # works on either Python 2 or Python 3
              import subprocess
              from time import time  # not needed, just for timing
              try:
                  from StringIO import StringIO  # Python 2
              except ImportError:
                  from io import StringIO  # Python 3

              import pandas as pd

              def zgrep_data(f, string):
                  '''grep multiple items; f is the filepath, string is what you are filtering for'''

                  grep = 'grep'  # change to zgrep for gzipped files
                  print('{} for {} from {}'.format(grep, string, f))
                  start_time = time()
                  if string == '':
                      # an empty pattern matches every line, header included
                      out = subprocess.check_output([grep, string, f]).decode()
                      grep_data = StringIO(out)
                      data = pd.read_csv(grep_data, sep=',', header=0)
                  else:
                      # read only the first row to get the columns; may need to change
                      # depending on how the data is stored
                      columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]

                      out = subprocess.check_output([grep, string, f]).decode()
                      grep_data = StringIO(out)

                      data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

                  print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
                  return data
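Hypothetical end-to-end usage of the same grep-then-parse pattern, assuming a Unix grep on the PATH (the file contents and pattern are illustrative):

```python
import io
import os
import subprocess
import tempfile

import pandas as pd

# Write a small CSV to disk so grep has a real file to scan
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("timestamp,value\n2012-01-01,1\n2013-01-01,2\n2013-06-01,3\n")
    path = tmp.name

try:
    # Grab the header separately, then let grep pre-filter the data rows
    columns = pd.read_csv(path, nrows=0).columns
    out = subprocess.check_output(['grep', '2013', path]).decode()
    df = pd.read_csv(io.StringIO(out), names=columns)
    print(df['value'].tolist())  # [2, 3]
finally:
    os.unlink(path)
```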





answered Dec 13 '17 at 14:26 by Christopher Bell



















