How can I filter lines on load in Pandas read_csv function?
























How can I filter which lines of a CSV are loaded into memory using pandas? This seems like an option one should find in read_csv. Am I missing something?



Example: we have a CSV with a timestamp column, and we'd like to load just the lines with a timestamp greater than a given constant.










Tags: pandas






asked Nov 30 '12 at 18:38 by benjaminwilson, edited Oct 16 '17 at 12:25 by Martin Thoma






















4 Answers
































          There isn't an option to filter the rows before the CSV file is loaded into a pandas object.



          You can either load the file and then filter using df[df['field'] > constant], or if you have a very large file and you are worried about memory running out, then use an iterator and apply the filter as you concatenate chunks of your file e.g.:



          import pandas as pd
          iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
          df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])


          You can vary the chunksize to suit your available memory. See here for more details.






answered Nov 30 '12 at 21:31 by Matti John, edited Apr 20 at 9:49 by Madhup Kumar






















          • for chunk['filed']>constant can I sandwich it between 2 constant values? E.g.: constant1 > chunk['field'] > constant2. Or can I use 'in range' ?
            – weefwefwqg3
            Feb 19 '17 at 6:32
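On the comment's question: yes, two comparisons can be combined with `&` (chained Python comparisons like `constant1 > chunk['field'] > constant2` raise an error on a Series), or with `Series.between`. A minimal sketch with illustrative data and bounds:

```python
import pandas as pd

chunk = pd.DataFrame({'field': [1, 5, 10, 20]})
constant1, constant2 = 4, 15

# Combine two boolean masks with & (parentheses are required because
# & binds more tightly than the comparison operators).
mask = (chunk['field'] > constant1) & (chunk['field'] < constant2)

# Series.between is equivalent, but inclusive of both endpoints by default.
mask_between = chunk['field'].between(constant1, constant2)

print(chunk[mask]['field'].tolist())  # [5, 10]
```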
































          I didn't find a straightforward way to do it within the context of read_csv. However, read_csv returns a DataFrame, which can be filtered by selecting rows with a boolean vector df[bool_vec]:



          filtered = df[(df['timestamp'] > targettime)]


          This selects all rows in df (assuming df is any DataFrame, such as the result of a read_csv call, that contains at least a datetime column timestamp) for which the values in the timestamp column are greater than targettime.
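A self-contained sketch of that approach (the file contents and target time are made up for illustration); parse_dates ensures the comparison runs on datetimes rather than strings:

```python
import io

import pandas as pd

csv_data = io.StringIO(
    "timestamp,value\n"
    "2012-11-30 10:00:00,1\n"
    "2012-11-30 12:00:00,2\n"
    "2012-11-30 14:00:00,3\n"
)

# parse_dates converts the column to datetime64 on load
df = pd.read_csv(csv_data, parse_dates=['timestamp'])

targettime = pd.Timestamp('2012-11-30 11:00:00')
filtered = df[df['timestamp'] > targettime]

print(filtered['value'].tolist())  # [2, 3]
```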






answered Nov 30 '12 at 19:43 by Griffin, edited May 23 '17 at 11:47 by Community












































            You can specify the nrows parameter.




            import pandas as pd
            df = pd.read_csv('file.csv', nrows=100)



            This code works in pandas version 0.20.3.
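Note that nrows only truncates the file after a fixed number of rows; it doesn't filter by value. If the filter can be phrased in terms of row numbers, read_csv also accepts a callable for skiprows, called with each row index and skipping rows where it returns True (the every-other-row rule here is just an illustration):

```python
import io

import pandas as pd

csv_data = io.StringIO("a\n0\n1\n2\n3\n4\n5\n")

# Keep the header (row 0), skip every even-numbered line after it.
df = pd.read_csv(csv_data, skiprows=lambda i: i != 0 and i % 2 == 0)

print(df['a'].tolist())  # [0, 2, 4]
```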






answered Nov 12 at 5:59 by user1083290










































              If you are on Linux you can use grep.

              # works on either Python 2 or Python 3
              import subprocess
              from time import time  # not needed, just for timing
              try:
                  from StringIO import StringIO  # Python 2
              except ImportError:
                  from io import StringIO  # Python 3

              import pandas as pd

              def zgrep_data(f, string):
                  '''grep multiple items; f is the filepath, string is what you are filtering for'''

                  grep = 'grep'  # change to zgrep for gzipped files
                  print('{} for {} from {}'.format(grep, string, f))
                  start_time = time()
                  if string == '':
                      # an empty pattern matches every line, header included
                      out = subprocess.check_output([grep, string, f]).decode()
                      grep_data = StringIO(out)
                      data = pd.read_csv(grep_data, sep=',', header=0)
                  else:
                      # read only the first row to get the columns; may need to change
                      # depending on how the data is stored
                      columns = pd.read_csv(f, sep=',', nrows=1, header=None).values.tolist()[0]

                      out = subprocess.check_output([grep, string, f]).decode()
                      grep_data = StringIO(out)

                      data = pd.read_csv(grep_data, sep=',', names=columns, header=None)

                  print('{} finished for {} - {} seconds'.format(grep, f, time() - start_time))
                  return data
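Hypothetical end-to-end usage of the same grep-then-parse pattern, assuming a Unix grep on the PATH (the file contents and pattern are illustrative):

```python
import io
import os
import subprocess
import tempfile

import pandas as pd

# Write a small CSV to disk so grep has a real file to scan
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("timestamp,value\n2012-01-01,1\n2013-01-01,2\n2013-06-01,3\n")
    path = tmp.name

try:
    # Grab the header separately, then let grep pre-filter the data rows
    columns = pd.read_csv(path, nrows=0).columns
    out = subprocess.check_output(['grep', '2013', path]).decode()
    df = pd.read_csv(io.StringIO(out), names=columns)
    print(df['value'].tolist())  # [2, 3]
finally:
    os.unlink(path)
```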





answered Dec 13 '17 at 14:26 by Christopher Bell



















