how to subsample a fasta file based on the headers if headers contain certain strings?










0














I have a fasta file like this:



>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC


and I have a text file containing a list of numbers:



373248686
930076459
296280703
......


I want to do the following:



if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.


How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:



from Bio import SeqIO 
import sys

wanted =
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)


seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')

new_seq =

for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)


SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()


Got this error:



new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax


Is there a better way to do this?



Thank you!










share|improve this question




























    0














    I have a fasta file like this:



    >gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
    GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
    CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
    >gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
    ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
    TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
    CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
    >gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
    GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
    GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC


    and I have a text file containing a list of numbers:



    373248686
    930076459
    296280703
    ......


    I want to do the following:



    if the header in the fasta file contains the numbers in the text file:
    write all the matches(header+sequence) to a new output.fasta file.


    How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:



    from Bio import SeqIO 
    import sys

    wanted =
    for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)


    seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
    sys.stdout = open('output.fasta', 'w')

    new_seq =

    for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)


    SeqIO.write(new_seq, sys.stdout, "fasta")
    sys.stdout.close()


    Got this error:



    new_seq.append(seq if i in seq.id for i in wanted)
    ^
    SyntaxError: invalid syntax


    Is there a better way to do this?



    Thank you!










    share|improve this question


























      0












      0








      0







      I have a fasta file like this:



      >gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
      GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
      CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
      >gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
      ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
      TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
      CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
      >gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
      GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
      GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC


      and I have a text file containing a list of numbers:



      373248686
      930076459
      296280703
      ......


      I want to do the following:



      if the header in the fasta file contains the numbers in the text file:
      write all the matches(header+sequence) to a new output.fasta file.


      How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:



      from Bio import SeqIO 
      import sys

      wanted =
      for line in open(sys.argv[2]):
      titles = line.strip()
      wanted.append(titles)


      seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
      sys.stdout = open('output.fasta', 'w')

      new_seq =

      for seq in seqiter:
      new_seq.append(seq if i in seq.id for i in wanted)


      SeqIO.write(new_seq, sys.stdout, "fasta")
      sys.stdout.close()


      Got this error:



      new_seq.append(seq if i in seq.id for i in wanted)
      ^
      SyntaxError: invalid syntax


      Is there a better way to do this?



      Thank you!










      share|improve this question















      I have a fasta file like this:



      >gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
      GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
      CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
      >gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
      ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
      TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
      CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
      >gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
      GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
      GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC


      and I have a text file containing a list of numbers:



      373248686
      930076459
      296280703
      ......


      I want to do the following:



      if the header in the fasta file contains the numbers in the text file:
      write all the matches(header+sequence) to a new output.fasta file.


      How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:



      from Bio import SeqIO 
      import sys

      wanted =
      for line in open(sys.argv[2]):
      titles = line.strip()
      wanted.append(titles)


      seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
      sys.stdout = open('output.fasta', 'w')

      new_seq =

      for seq in seqiter:
      new_seq.append(seq if i in seq.id for i in wanted)


      SeqIO.write(new_seq, sys.stdout, "fasta")
      sys.stdout.close()


      Got this error:



      new_seq.append(seq if i in seq.id for i in wanted)
      ^
      SyntaxError: invalid syntax


      Is there a better way to do this?



      Thank you!







      python bioinformatics






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 13 '18 at 1:08









      quant

      1,58211526




      1,58211526










      asked Nov 12 '18 at 20:26









      stevexstevex

      275




      275






















          2 Answers
          2






          active

          oldest

          votes


















          0














          Use a program like this



          from Bio import SeqIO
          import sys

          # read in the text file
          numbersInTxtFile = set()
          # hint: use with, then you don't need to
          # program file closing. Furhtermore error
          # handling is comming along with this too
          with open(sys.argv[2], "r") as inF:
          for line in inF:
          line = line.strip()
          if line == "": continue
          numbersInTxtFile.add(int(line))

          # read in the fasta file
          with open(sys.argv[1], "r") as inF:
          for record in SeqIO.parse(inF, "fasta"):
          # now check if this record in the fasta file
          # has an id we are searching for
          name = record.description
          id = int(name.split("|")[1])
          print(id, numbersInTxtFile, id in numbersInTxtFile)
          if id in numbersInTxtFile:
          # we need to output
          print(">%s" % name)
          print(record.seq)


          which you can then call like so from the commandline



          python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa





          share|improve this answer




















          • Thank you! It works!
            – stevex
            Nov 12 '18 at 22:18


















          0














          Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.



          keepers = 
          with open("ids.txt", "r") as id_handle:
          for curr_id in id_handle:
          keepers[curr_id] = True


          A list comprehension generates a list, so you don't need to append to another list.



          keeper_seqs = [x for x in seqiter if x.id in keepers]


          With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.



          You should also never assign to sys.stdout without a good reason, if you want to output to STDOUT just use print or sys.stdout.write().






          share|improve this answer




















          • Thank you very much!
            – stevex
            Nov 12 '18 at 22:18










          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269587%2fhow-to-subsample-a-fasta-file-based-on-the-headers-if-headers-contain-certain-st%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          2 Answers
          2






          active

          oldest

          votes








          2 Answers
          2






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Use a program like this



          from Bio import SeqIO
          import sys

          # read in the text file
          numbersInTxtFile = set()
          # hint: use with, then you don't need to
          # program file closing. Furhtermore error
          # handling is comming along with this too
          with open(sys.argv[2], "r") as inF:
          for line in inF:
          line = line.strip()
          if line == "": continue
          numbersInTxtFile.add(int(line))

          # read in the fasta file
          with open(sys.argv[1], "r") as inF:
          for record in SeqIO.parse(inF, "fasta"):
          # now check if this record in the fasta file
          # has an id we are searching for
          name = record.description
          id = int(name.split("|")[1])
          print(id, numbersInTxtFile, id in numbersInTxtFile)
          if id in numbersInTxtFile:
          # we need to output
          print(">%s" % name)
          print(record.seq)


          which you can then call like so from the commandline



          python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa





          share|improve this answer




















          • Thank you! It works!
            – stevex
            Nov 12 '18 at 22:18















          0














          Use a program like this



          from Bio import SeqIO
          import sys

          # read in the text file
          numbersInTxtFile = set()
          # hint: use with, then you don't need to
          # program file closing. Furhtermore error
          # handling is comming along with this too
          with open(sys.argv[2], "r") as inF:
          for line in inF:
          line = line.strip()
          if line == "": continue
          numbersInTxtFile.add(int(line))

          # read in the fasta file
          with open(sys.argv[1], "r") as inF:
          for record in SeqIO.parse(inF, "fasta"):
          # now check if this record in the fasta file
          # has an id we are searching for
          name = record.description
          id = int(name.split("|")[1])
          print(id, numbersInTxtFile, id in numbersInTxtFile)
          if id in numbersInTxtFile:
          # we need to output
          print(">%s" % name)
          print(record.seq)


          which you can then call like so from the commandline



          python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa





          share|improve this answer




















          • Thank you! It works!
            – stevex
            Nov 12 '18 at 22:18













          0












          0








          0






          Use a program like this



          from Bio import SeqIO
          import sys

          # read in the text file
          numbersInTxtFile = set()
          # hint: use with, then you don't need to
          # program file closing. Furhtermore error
          # handling is comming along with this too
          with open(sys.argv[2], "r") as inF:
          for line in inF:
          line = line.strip()
          if line == "": continue
          numbersInTxtFile.add(int(line))

          # read in the fasta file
          with open(sys.argv[1], "r") as inF:
          for record in SeqIO.parse(inF, "fasta"):
          # now check if this record in the fasta file
          # has an id we are searching for
          name = record.description
          id = int(name.split("|")[1])
          print(id, numbersInTxtFile, id in numbersInTxtFile)
          if id in numbersInTxtFile:
          # we need to output
          print(">%s" % name)
          print(record.seq)


          which you can then call like so from the commandline



          python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa





          share|improve this answer












          Use a program like this



          from Bio import SeqIO
          import sys

          # read in the text file
          numbersInTxtFile = set()
          # hint: use with, then you don't need to
          # program file closing. Furhtermore error
          # handling is comming along with this too
          with open(sys.argv[2], "r") as inF:
          for line in inF:
          line = line.strip()
          if line == "": continue
          numbersInTxtFile.add(int(line))

          # read in the fasta file
          with open(sys.argv[1], "r") as inF:
          for record in SeqIO.parse(inF, "fasta"):
          # now check if this record in the fasta file
          # has an id we are searching for
          name = record.description
          id = int(name.split("|")[1])
          print(id, numbersInTxtFile, id in numbersInTxtFile)
          if id in numbersInTxtFile:
          # we need to output
          print(">%s" % name)
          print(record.seq)


          which you can then call like so from the commandline



          python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 12 '18 at 20:44









          quantquant

          1,58211526




          1,58211526











          • Thank you! It works!
            – stevex
            Nov 12 '18 at 22:18
















          • Thank you! It works!
            – stevex
            Nov 12 '18 at 22:18















          Thank you! It works!
          – stevex
          Nov 12 '18 at 22:18




          Thank you! It works!
          – stevex
          Nov 12 '18 at 22:18













          0














          Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.



          keepers = 
          with open("ids.txt", "r") as id_handle:
          for curr_id in id_handle:
          keepers[curr_id] = True


          A list comprehension generates a list, so you don't need to append to another list.



          keeper_seqs = [x for x in seqiter if x.id in keepers]


          With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.



          You should also never assign to sys.stdout without a good reason, if you want to output to STDOUT just use print or sys.stdout.write().






          share|improve this answer




















          • Thank you very much!
            – stevex
            Nov 12 '18 at 22:18















          0














          Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.



          keepers = 
          with open("ids.txt", "r") as id_handle:
          for curr_id in id_handle:
          keepers[curr_id] = True


          A list comprehension generates a list, so you don't need to append to another list.



          keeper_seqs = [x for x in seqiter if x.id in keepers]


          With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.



          You should also never assign to sys.stdout without a good reason, if you want to output to STDOUT just use print or sys.stdout.write().






          share|improve this answer




















          • Thank you very much!
            – stevex
            Nov 12 '18 at 22:18













          0












          0








          0






          Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.



          keepers = 
          with open("ids.txt", "r") as id_handle:
          for curr_id in id_handle:
          keepers[curr_id] = True


          A list comprehension generates a list, so you don't need to append to another list.



          keeper_seqs = [x for x in seqiter if x.id in keepers]


          With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.



          You should also never assign to sys.stdout without a good reason, if you want to output to STDOUT just use print or sys.stdout.write().






          share|improve this answer












          Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.



          keepers = 
          with open("ids.txt", "r") as id_handle:
          for curr_id in id_handle:
          keepers[curr_id] = True


          A list comprehension generates a list, so you don't need to append to another list.



          keeper_seqs = [x for x in seqiter if x.id in keepers]


          With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.



          You should also never assign to sys.stdout without a good reason, if you want to output to STDOUT just use print or sys.stdout.write().







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 12 '18 at 20:45









          hurfdurfhurfdurf

          20228




          20228











          • Thank you very much!
            – stevex
            Nov 12 '18 at 22:18
















          • Thank you very much!
            – stevex
            Nov 12 '18 at 22:18















          Thank you very much!
          – stevex
          Nov 12 '18 at 22:18




          Thank you very much!
          – stevex
          Nov 12 '18 at 22:18

















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269587%2fhow-to-subsample-a-fasta-file-based-on-the-headers-if-headers-contain-certain-st%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          這個網誌中的熱門文章

          How to read a connectionString WITH PROVIDER in .NET Core?

          In R, how to develop a multiplot heatmap.2 figure showing key labels successfully

          Museum of Modern and Contemporary Art of Trento and Rovereto