skip lines from csv file if it contains specific keyword in Pyspark










0















I have a CSV file with the details as shown below:



emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000


I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:



+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+


I can load this file into dataframe and can filter using below expression for each column



newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))


I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.










share|improve this question

















  • 1





    Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

    – Alla Tarighati
    Nov 13 '18 at 7:31












  • Yeah it works. Thanks

    – vikrant rana
    Nov 13 '18 at 8:49











  • @Alla Tarighati- Thanks for your help

    – vikrant rana
    Nov 15 '18 at 13:02











  • Happy to hear that it helped :) and thank you for writing a proper answer for it.

    – Alla Tarighati
    Nov 15 '18 at 16:00
















0















I have a CSV file with the details as shown below:



emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000


I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:



+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+


I can load this file into dataframe and can filter using below expression for each column



newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))


I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.










share|improve this question

















  • 1





    Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

    – Alla Tarighati
    Nov 13 '18 at 7:31












  • Yeah it works. Thanks

    – vikrant rana
    Nov 13 '18 at 8:49











  • @Alla Tarighati- Thanks for your help

    – vikrant rana
    Nov 15 '18 at 13:02











  • Happy to hear that it helped :) and thank you for writing a proper answer for it.

    – Alla Tarighati
    Nov 15 '18 at 16:00














0












0








0








I have a CSV file with the details as shown below:



emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000


I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:



+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+


I can load this file into dataframe and can filter using below expression for each column



newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))


I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.










share|improve this question














I have a CSV file with the details as shown below:



emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000


I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:



+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+


I can load this file into dataframe and can filter using below expression for each column



newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))


I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.







python-3.x csv pyspark






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 13 '18 at 7:17









vikrant ranavikrant rana

6021115




6021115







  • 1





    Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

    – Alla Tarighati
    Nov 13 '18 at 7:31












  • Yeah it works. Thanks

    – vikrant rana
    Nov 13 '18 at 8:49











  • @Alla Tarighati- Thanks for your help

    – vikrant rana
    Nov 15 '18 at 13:02











  • Happy to hear that it helped :) and thank you for writing a proper answer for it.

    – Alla Tarighati
    Nov 15 '18 at 16:00













  • 1





    Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

    – Alla Tarighati
    Nov 13 '18 at 7:31












  • Yeah it works. Thanks

    – vikrant rana
    Nov 13 '18 at 8:49











  • @Alla Tarighati- Thanks for your help

    – vikrant rana
    Nov 15 '18 at 13:02











  • Happy to hear that it helped :) and thank you for writing a proper answer for it.

    – Alla Tarighati
    Nov 15 '18 at 16:00








1




1





Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

– Alla Tarighati
Nov 13 '18 at 7:31






Have you tried to read it to rdd and filter the result using lambda function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())

– Alla Tarighati
Nov 13 '18 at 7:31














Yeah it works. Thanks

– vikrant rana
Nov 13 '18 at 8:49





Yeah it works. Thanks

– vikrant rana
Nov 13 '18 at 8:49













@Alla Tarighati- Thanks for your help

– vikrant rana
Nov 15 '18 at 13:02





@Alla Tarighati- Thanks for your help

– vikrant rana
Nov 15 '18 at 13:02













Happy to hear that it helped :) and thank you for writing a proper answer for it.

– Alla Tarighati
Nov 15 '18 at 16:00






Happy to hear that it helped :) and thank you for writing a proper answer for it.

– Alla Tarighati
Nov 15 '18 at 16:00













1 Answer
1






active

oldest

votes


















1














I was able to filter it using RDD.



textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)

[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']

newsplitRDD = newRDD.map(lambda l: l.split(","))

newDF = newsplitRDD.toDF()

>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+





share|improve this answer






















    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53275750%2fskip-lines-from-csv-file-if-it-contains-specific-keyword-in-pyspark%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    1














    I was able to filter it using RDD.



    textdata = sc.textFile(PATH_TO_FILE)
    header=textdata.first();
    textnewdata = textdata.filter(lambda x:x != header)
    newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)

    [u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
    u'3,GOVIND NIMBHAL ,DWARKA ,92000',
    u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
    u'4,ABHIJAN SINHA ,SAKET ,65000',
    u'5,SUPER DEVELOPER ,USA ,50000',
    u'6,RAJAT TYAGI ,UP ,65000',
    u'7,AJAY SHARMA ,NOIDA ,70000',
    u'8,SIDDHARTH BASU ,SAKET ,72000']

    newsplitRDD = newRDD.map(lambda l: l.split(","))

    newDF = newsplitRDD.toDF()

    >>> newDF.show();
    +---+--------------------+--------+-----+
    | _1| _2| _3| _4|
    +---+--------------------+--------+-----+
    | 1|VIKRANT SINGH RAN...|NOIDA |10000|
    | 3|GOVIND NIMBHAL ...|DWARKA |92000|
    | 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
    | 4|ABHIJAN SINHA ...|SAKET |65000|
    | 5|SUPER DEVELOPER ...|USA |50000|
    | 6|RAJAT TYAGI ...|UP |65000|
    | 7|AJAY SHARMA ...|NOIDA |70000|
    | 8|SIDDHARTH BASU ...|SAKET |72000|
    +---+--------------------+--------+-----+





    share|improve this answer



























      1














      I was able to filter it using RDD.



      textdata = sc.textFile(PATH_TO_FILE)
      header=textdata.first();
      textnewdata = textdata.filter(lambda x:x != header)
      newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)

      [u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
      u'3,GOVIND NIMBHAL ,DWARKA ,92000',
      u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
      u'4,ABHIJAN SINHA ,SAKET ,65000',
      u'5,SUPER DEVELOPER ,USA ,50000',
      u'6,RAJAT TYAGI ,UP ,65000',
      u'7,AJAY SHARMA ,NOIDA ,70000',
      u'8,SIDDHARTH BASU ,SAKET ,72000']

      newsplitRDD = newRDD.map(lambda l: l.split(","))

      newDF = newsplitRDD.toDF()

      >>> newDF.show();
      +---+--------------------+--------+-----+
      | _1| _2| _3| _4|
      +---+--------------------+--------+-----+
      | 1|VIKRANT SINGH RAN...|NOIDA |10000|
      | 3|GOVIND NIMBHAL ...|DWARKA |92000|
      | 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
      | 4|ABHIJAN SINHA ...|SAKET |65000|
      | 5|SUPER DEVELOPER ...|USA |50000|
      | 6|RAJAT TYAGI ...|UP |65000|
      | 7|AJAY SHARMA ...|NOIDA |70000|
      | 8|SIDDHARTH BASU ...|SAKET |72000|
      +---+--------------------+--------+-----+





      share|improve this answer

























        1












        1








        1







        I was able to filter it using RDD.



        textdata = sc.textFile(PATH_TO_FILE)
        header=textdata.first();
        textnewdata = textdata.filter(lambda x:x != header)
        newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)

        [u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
        u'3,GOVIND NIMBHAL ,DWARKA ,92000',
        u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
        u'4,ABHIJAN SINHA ,SAKET ,65000',
        u'5,SUPER DEVELOPER ,USA ,50000',
        u'6,RAJAT TYAGI ,UP ,65000',
        u'7,AJAY SHARMA ,NOIDA ,70000',
        u'8,SIDDHARTH BASU ,SAKET ,72000']

        newsplitRDD = newRDD.map(lambda l: l.split(","))

        newDF = newsplitRDD.toDF()

        >>> newDF.show();
        +---+--------------------+--------+-----+
        | _1| _2| _3| _4|
        +---+--------------------+--------+-----+
        | 1|VIKRANT SINGH RAN...|NOIDA |10000|
        | 3|GOVIND NIMBHAL ...|DWARKA |92000|
        | 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
        | 4|ABHIJAN SINHA ...|SAKET |65000|
        | 5|SUPER DEVELOPER ...|USA |50000|
        | 6|RAJAT TYAGI ...|UP |65000|
        | 7|AJAY SHARMA ...|NOIDA |70000|
        | 8|SIDDHARTH BASU ...|SAKET |72000|
        +---+--------------------+--------+-----+





        share|improve this answer













        I was able to filter it using RDD.



        textdata = sc.textFile(PATH_TO_FILE)
        header=textdata.first();
        textnewdata = textdata.filter(lambda x:x != header)
        newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)

        [u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
        u'3,GOVIND NIMBHAL ,DWARKA ,92000',
        u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
        u'4,ABHIJAN SINHA ,SAKET ,65000',
        u'5,SUPER DEVELOPER ,USA ,50000',
        u'6,RAJAT TYAGI ,UP ,65000',
        u'7,AJAY SHARMA ,NOIDA ,70000',
        u'8,SIDDHARTH BASU ,SAKET ,72000']

        newsplitRDD = newRDD.map(lambda l: l.split(","))

        newDF = newsplitRDD.toDF()

        >>> newDF.show();
        +---+--------------------+--------+-----+
        | _1| _2| _3| _4|
        +---+--------------------+--------+-----+
        | 1|VIKRANT SINGH RAN...|NOIDA |10000|
        | 3|GOVIND NIMBHAL ...|DWARKA |92000|
        | 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
        | 4|ABHIJAN SINHA ...|SAKET |65000|
        | 5|SUPER DEVELOPER ...|USA |50000|
        | 6|RAJAT TYAGI ...|UP |65000|
        | 7|AJAY SHARMA ...|NOIDA |70000|
        | 8|SIDDHARTH BASU ...|SAKET |72000|
        +---+--------------------+--------+-----+






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 13 '18 at 8:51









        vikrant ranavikrant rana

        6021115




        6021115



























            draft saved

            draft discarded
















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid


            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.

            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53275750%2fskip-lines-from-csv-file-if-it-contains-specific-keyword-in-pyspark%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            這個網誌中的熱門文章

            How to read a connectionString WITH PROVIDER in .NET Core?

            Node.js Script on GitHub Pages or Amazon S3

            Museum of Modern and Contemporary Art of Trento and Rovereto