How do I use HashSet to remove duplicates from a text file? (C#)










I'm building a program that does quite a few things. One part of it is a "text tools" section, which loads a text file (via one button) and then offers additional buttons that perform other operations on it: removing whitespace and empty lines, removing duplicates, and removing lines that match a certain pattern, e.g. 123 or abc.

I can import the file and print the list using a foreach loop, and I believe I'm along the right lines, but I still need to remove duplicates. I decided to use HashSet thanks to this thread, which says it's the simplest and fastest method (my file will contain millions of lines).

The problem is that I can't figure out what I'm doing wrong. In the button's click event handler I create a list of strings in memory, loop through each line of the file (adding it to the list), and then create another collection by constructing a HashSet from that list. (Sorry if that's convoluted; it doesn't work for a reason.)

I've looked at every similar Stack Overflow question but can't find a solution. I've also read up on HashSet in general, to no avail.



Here's my code so far:



    private void btnClearDuplicates_Copy_Click(object sender, RoutedEventArgs e)
    {
        List<string> list = new List<string>();
        foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
        {
            list.Add(line);
        }
        var DuplicatesRemoved = new HashSet<String>(list);
    }









c#














edited Nov 15 '18 at 2:28

asked Nov 15 '18 at 2:11

College Ameteur












  • stackoverflow.com/questions/31052953/…

    – Mitch Wheat
    Nov 15 '18 at 2:13











  • docs.microsoft.com/en-us/dotnet/api/…

    – mjwills
    Nov 15 '18 at 2:19











  • cannot convert from 'System.Collections.Generic.List<string>' to 'System.Collections.Generic.IEqualityComparer<System.Windows.Documents.List>'

    – College Ameteur
    Nov 15 '18 at 2:20







  • "Respectfully I didn't open the question to ask for links that I've already found" If you are going to be snarky, at least provide the links that you have read. We aren't mind readers. :)

    – mjwills
    Nov 15 '18 at 2:24







  • I'd suggest dropping the List<string> altogether and using a HashSet<string> instead. You don't need the List. Note that HashSet could, in theory, return data in a different order than in the file (it won't with the current implementation, but it could in future).

    – mjwills
    Nov 15 '18 at 2:36
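
The suggestion in the last comment (skip the intermediate List) might look roughly like the sketch below. The class and method names (`UniqueLines`, `Load`) are made up for illustration and are not from the thread; the optional comparer is an extra assumption, included only to show how case-insensitive matching could be requested.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original question.
public static class UniqueLines
{
    // Builds the set straight from the lazily read lines; no List<string> needed.
    public static HashSet<string> Load(string path)
    {
        return Load(path, StringComparer.Ordinal);
    }

    // Same, but with a caller-supplied comparer
    // (e.g. StringComparer.OrdinalIgnoreCase to treat "ABC" and "abc" as equal).
    public static HashSet<string> Load(string path, IEqualityComparer<string> comparer)
    {
        return new HashSet<string>(File.ReadLines(path, Encoding.UTF8), comparer);
    }
}
```

From the click handler this could be called as `var unique = UniqueLines.Load(FilePath);`. As the comment warns, the enumeration order of the resulting set is not something the documentation guarantees.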


















2 Answers


















To be specific to your question (and to get my last 3 points):

    // Needs: using System.IO; using System.Collections.Generic;
    var lines = File.ReadAllLines("somepath");
    var hashSet = new HashSet<string>(lines);   // duplicates are dropped here
    File.WriteAllLines("somepath", hashSet);    // accepts any IEnumerable<string>

Note that there are other, possibly more performant, ways of doing this; it depends on the number of duplicates and the size of the file.

















answered Nov 15 '18 at 2:29

Michael Randall












    • Two things: 1) Would this write the file to the same path it was read from? (Just to clarify.) 2) I used ReadLines above because people said it was faster; would there be any performance difference between the two methods for a file with millions of lines?

      – College Ameteur
      Nov 15 '18 at 2:32






    • @CollegeAmeteur Millions of lines calls for a completely different level of optimization, and there may be several things involved in making this more efficient than either ReadAllLines or ReadLines. I suggest you download a benchmark tool and see what works for you.

      – Michael Randall
      Nov 15 '18 at 2:36
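
Before reaching for a dedicated benchmark tool (BenchmarkDotNet is one example), a rough first comparison can be done with `Stopwatch`. The sketch below is only an illustration: `DedupTiming` and its method names are hypothetical, and a single timed run is noisy (disk cache, JIT warm-up), so treat the numbers as a hint rather than a measurement.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

// Hypothetical helper for a quick, rough timing comparison.
public static class DedupTiming
{
    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    // Loads the whole file into a string[] first, then builds the set.
    public static int CountUniqueEager(string path)
    {
        string[] lines = File.ReadAllLines(path);
        return new HashSet<string>(lines).Count;
    }

    // Feeds lines into the set as they are read, without the intermediate array.
    public static int CountUniqueStreaming(string path)
    {
        return new HashSet<string>(File.ReadLines(path)).Count;
    }
}
```

For example, `DedupTiming.Measure(() => DedupTiming.CountUniqueEager(path))` and the same call with `CountUniqueStreaming` can be compared on a realistically sized file, ideally over several runs.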

















It is preferable to process the file as a stream if possible. I would not even call it an optimization; I would rather call it not being wasteful. When a streaming approach is available, the ReadAllLines approach is somewhere between almost good and very bad, depending on the situation. It is also a good idea to preserve line order. HashSet does not guarantee enumeration order, so if you store everything in it and then read it back, the lines can come out shuffled.



    using (var outFile = new StreamWriter(outFilePath))
    {
        HashSet<string> seen = new HashSet<string>();
        foreach (string line in File.ReadLines(FilePath, Encoding.UTF8))
        {
            // Add returns false when the line is already in the set
            if (seen.Add(line))
                outFile.WriteLine(line);
        }
    }
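
As a self-contained variant of the streaming approach above, the following sketch wraps it in a method and adds one way to end up with the result at the original path. The names (`LineDeduplicator`, `RemoveDuplicateLines`, `RemoveDuplicateLinesInPlace`) and the `.tmp` suffix are assumptions for illustration, not part of the answer. The input and output paths must differ, because the input is still being read while the output is written.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper built around the streaming idea from the answer.
public static class LineDeduplicator
{
    // Copies each distinct line to outputPath, in first-seen order.
    // Returns the number of lines written.
    public static int RemoveDuplicateLines(string inputPath, string outputPath)
    {
        var seen = new HashSet<string>();
        int written = 0;
        using (var writer = new StreamWriter(outputPath))
        {
            foreach (string line in File.ReadLines(inputPath, Encoding.UTF8))
            {
                // Add returns true only the first time a line is encountered.
                if (seen.Add(line))
                {
                    writer.WriteLine(line);
                    written++;
                }
            }
        }
        return written;
    }

    // Writes to a temporary file next to the original, then swaps it in.
    public static void RemoveDuplicateLinesInPlace(string path)
    {
        string tempPath = path + ".tmp"; // assumed naming scheme
        RemoveDuplicateLines(path, tempPath);
        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Note that the set still holds every distinct line in memory, so this saves the cost of the full line array but not the cost of the unique lines themselves.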



















answered Nov 15 '18 at 3:24

Antonín Lejsek


























