How to remove lines that start with the same characters (but are random) in python?










-1















I am trying to remove lines in a file that start with the same 5 characters. However, the first 5 characters are random (I don't know in advance what they will be).



I have code that reads the last 5 characters of the first line of a file and matches them to the FIRST 5 characters of another line in the file. The problem is that when two or more lines start with the same 5 characters, the code messes up. I need something that reads all the lines in the file and, whenever several lines share the same first 5 characters, keeps only one of them.



Example (issue):



CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT

***GTTAT***ATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT


What I need as the result after one is taken out of the file:



CCTGGATGGCTTATATAAGAT***GTTAT***

***GTTAT***ATAATATACCACCGGGCTGCTT


(no third line)



I would greatly appreciate it if you could also explain in words how I could go about this.


































  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation, as suggested when you created this account. On topic, how to ask, and ... the perfect question apply here. StackOverflow is not a design, coding, research, or tutorial resource. However, if you follow whatever resources you find online, make an honest coding attempt, and run into a problem, you'd have a good example to post.

    – Prune
    Nov 15 '18 at 20:12











  • Hi and welcome to SO. Your posted question does not appear to include any attempt at all to solve the problem. StackOverflow expects you to try to solve your own problem first, as your attempts help us to better understand what you want. Please edit the question to show what you've tried, so as to illustrate a specific problem you're having in a Minimal, Complete, and Verifiable example. For more information, please see How to Ask and take the Tour.

    – quant
    Nov 15 '18 at 20:16











  • Show us the code you wrote so far so we can see how it can be improved

    – Milo Bem
    Nov 15 '18 at 20:19






















python bioinformatics matching dna-sequence














edited Nov 15 '18 at 21:46









quant











asked Nov 15 '18 at 20:09









Alpa Luca
























1 Answer


















0














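You can do this, for example, by remembering the first 5 characters of every line you have already printed in a set and skipping any later line whose first 5 characters are already in that set: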



FILE_NAME = "data.txt"  # the name of the file to read in
NR_MATCHING_CHARS = 5  # the number of characters that need to match

seenBeginnings = set()  # beginnings of the lines that have already been printed
with open(FILE_NAME, "r") as inF:  # open the file
    for line in inF:  # for every line
        line = line.strip()  # drop surrounding whitespace and the newline
        if line == "":
            continue  # skip empty lines
        beginOfSequence = line[:NR_MATCHING_CHARS]
        if beginOfSequence not in seenBeginnings:  # this beginning was not printed yet
            print(line)  # print the line
            seenBeginnings.add(beginOfSequence)  # remember the beginning of the line
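If you want the extra lines removed from the file itself rather than just left out of the printed output, a minimal sketch could look like the following. The names drop_repeated_prefixes and rewrite_file are made up for illustration, the sample sequences are the ones from the question with the emphasis asterisks stripped (an assumption about what the real file contains), and the whole file is assumed to fit in memory.

```python
from typing import Iterable, List


def drop_repeated_prefixes(lines: Iterable[str], prefix_len: int = 5) -> List[str]:
    """Keep only the first line seen for each distinct leading prefix."""
    seen = set()   # prefixes of the lines kept so far
    kept = []      # lines that survive, in their original order
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        prefix = line[:prefix_len]
        if prefix in seen:
            continue  # an earlier line already starts with these characters
        seen.add(prefix)
        kept.append(line)
    return kept


def rewrite_file(path: str, prefix_len: int = 5) -> None:
    """Hypothetical helper: read the file, filter it, and write it back in place."""
    with open(path, "r") as handle:
        kept = drop_repeated_prefixes(handle, prefix_len)
    with open(path, "w") as handle:
        handle.write("".join(line + "\n" for line in kept))


# Sample data from the question (asterisks removed).
sample = [
    "CCTGGATGGCTTATATAAGATGTTAT",
    "GTTATATAATATACCACCGGGCTGCTT",
    "GTTATATAGTTACAGCGGAGTCTTGTGACTGGCTCGAGTCAAAAT",
]
for line in drop_repeated_prefixes(sample):
    print(line)
```

Note that this keeps whichever matching line comes first in the file. If you need a different rule for choosing which of the matching lines survives, the check on the seen set is the place to change it.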















        answered Nov 15 '18 at 20:30









        quant




























