How can I filter lines on load in the pandas read_csv function?
How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?
Example: we have a CSV with a timestamp column and we'd like to load just the lines with a timestamp greater than a given constant.
Tags: pandas
asked Nov 30 '12 at 18:38 by benjaminwilson (edited Oct 16 '17 at 12:25 by Martin Thoma)
4 Answers
Matti John, answered Nov 30 '12 at 21:31:
There isn't an option to filter the rows before the CSV file is loaded into a pandas object.
You can either load the file and then filter using df[df['field'] > constant], or, if you have a very large file and you are worried about running out of memory, use an iterator and apply the filter as you concatenate chunks of the file, e.g.:
import pandas as pd

# read the file in chunks and keep only the rows that pass the filter
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])
You can vary the chunksize to suit your available memory. See the pandas IO docs on iterating through files chunk by chunk for more details.
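Applied to the question's timestamp example, the same pattern might look like the sketch below; the file name, column name, and cutoff value are hypothetical, and parse_dates makes the comparison datetime-aware rather than string-based:
import pandas as pd

cutoff = pd.Timestamp('2012-11-30')  # hypothetical cutoff
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000,
                       parse_dates=['timestamp'])
# keep only rows whose timestamp is after the cutoff, chunk by chunk
df = pd.concat([chunk[chunk['timestamp'] > cutoff] for chunk in iter_csv])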
Comment from weefwefwqg3 (Feb 19 '17 at 6:32): For chunk['field'] > constant, can I sandwich it between two constant values? E.g. constant1 > chunk['field'] > constant2. Or can I use 'in range'?
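On the comment's range question: a chained comparison like constant1 > chunk['field'] > constant2 raises an error on a pandas Series, because a Series has no single truth value. You can instead combine two boolean masks with &, or use Series.between. A minimal self-contained sketch with hypothetical bounds and data:
import pandas as pd

constant1, constant2 = 10, 20  # hypothetical bounds
chunk = pd.DataFrame({'field': [5, 12, 18, 25]})  # hypothetical data

# combine two masks with & (note the parentheses around each mask)
filtered = chunk[(chunk['field'] > constant1) & (chunk['field'] < constant2)]

# or, inclusive of both endpoints:
filtered = chunk[chunk['field'].between(constant1, constant2)]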
Griffin, answered Nov 30 '12 at 19:43:
I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector, df[bool_vec]:
filtered = df[(df['timestamp'] > targettime)]
This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than the value of targettime. Similar question.
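A minimal end-to-end sketch of this approach; the file name, column name, and target time are hypothetical, and parse_dates ensures the comparison is done on datetimes rather than strings:
import pandas as pd

# parse the (hypothetical) timestamp column as datetimes on load
df = pd.read_csv('file.csv', parse_dates=['timestamp'])
targettime = pd.Timestamp('2012-11-30 18:38:00')  # hypothetical cutoff
filtered = df[df['timestamp'] > targettime]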
user1083290, answered Nov 12 at 5:59:
You can specify the nrows parameter to read only the first n rows of the file; note that this limits how many rows are loaded rather than filtering them by value.
import pandas as pd

# read only the first 100 rows of the file
df = pd.read_csv('file.csv', nrows=100)
This code works well in pandas 0.20.3.
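Relatedly, if you want to skip rows by position on load, read_csv also accepts a callable for skiprows: the callable is evaluated against each row index and returns True for rows to skip. A small sketch, with a hypothetical file name, that keeps the header plus every tenth data row:
import pandas as pd

# keep row 0 (the header) and every tenth data row; skip the rest
df = pd.read_csv('file.csv', skiprows=lambda i: i > 0 and i % 10 != 0)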
Christopher Bell, answered Dec 13 '17 at 14:26:
If you are on Linux, you can use grep to filter the lines before pandas reads them.
# works on either Python 2 or Python 3
import subprocess
import pandas as pd
from time import time  # not needed, just for timing
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

def zgrep_data(f, string):
    '''grep for a pattern; f is the filepath, string is what you are filtering for'''
    grep = 'grep'  # change to zgrep for gzipped files
    print('{} for {} from {}'.format(grep, string, f))
    start_time = time()
    if string == '':
        # an empty pattern matches every line, including the header,
        # so pandas can read the header row normally
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())
        data = pd.read_csv(grep_data, sep=',', header=0)
    else:
        # read only the first row to get the column names; may need to
        # change depending on how the data is stored
        columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]
        out = subprocess.check_output([grep, string, f])
        grep_data = StringIO(out.decode())
        data = pd.read_csv(grep_data, sep=',', names=columns, header=None)
    print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
    return data
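A hypothetical call, filtering on a date prefix as in the question (note that grep matches the pattern anywhere in the line, so an overly generic pattern can also match other columns, and grep exits with a nonzero status when nothing matches, which makes check_output raise CalledProcessError):
data = zgrep_data('file.csv', '2012-11-30')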