How to subsample a FASTA file based on whether the headers contain certain strings?

I have a fasta file like this:

>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC

and I have a text file containing a list of numbers:

I want to do the following:

if the header in the fasta file contains one of the numbers in the text file:
    write all the matches (header + sequence) to a new output.fasta file.

How can I do this in Python? It seems easy, and a few for loops should do the job, but I cannot get it to work. Also, if my files are really big, a loop nested inside another loop may take a long time. Here's what I have tried:

from Bio import SeqIO
import sys

wanted = []
for line in open(sys.argv[2]):
    titles = line.strip()
    wanted.append(titles)

seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')

new_seq = []

for seq in seqiter:
    new_seq.append(seq if i in seq.id for i in wanted)

SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()

Got this error:

new_seq.append(seq if i in seq.id for i in wanted)
 ^
SyntaxError: invalid syntax

Is there a better way to do this?

Thank you!

edited Nov 13 '18 at 1:08

quant

1,58211526

asked Nov 12 '18 at 20:26

stevex

275

python bioinformatics


2 Answers

Use a program like this:

from Bio import SeqIO
import sys

# Read in the text file.
numbersInTxtFile = set()
# Hint: use `with`, then you don't need to close the file yourself;
# it is also closed if an error occurs.
with open(sys.argv[2], "r") as inF:
    for line in inF:
        line = line.strip()
        if line == "":
            continue
        numbersInTxtFile.add(int(line))

# Read in the FASTA file.
with open(sys.argv[1], "r") as inF:
    for record in SeqIO.parse(inF, "fasta"):
        # Check whether this record has an ID we are searching for.
        # The ID is the second "|"-separated field of the header.
        name = record.description
        seq_id = int(name.split("|")[1])
        if seq_id in numbersInTxtFile:
            # Output the header and the sequence.
            print(">%s" % name)
            print(record.seq)

which you can then call from the command line like so:

python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
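One caveat with the header parsing above: int(name.split("|")[1]) raises an error on any header that has no second "|"-separated field or whose field is not a number. A minimal sketch of a more forgiving lookup follows. The helper name gi_number and the sample wanted set are made up for illustration, and the headers are shortened versions of the ones in the question.

```python
def gi_number(header):
    """Return the integer after the first '|' in a FASTA header, or None.

    Assumes NCBI-style headers such as 'gi|373248686|emb|HE586118.1| ...'.
    """
    fields = header.lstrip(">").split("|")
    # No second field, or a second field that is not a plain number.
    if len(fields) < 2 or not fields[1].isdecimal():
        return None
    return int(fields[1])


headers = [
    "gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus",
    "HE586118.1 a header without a gi field",
]
wanted = {373248686}  # hypothetical content of the text file

# Headers that cannot be parsed yield None, which is never in the set.
matches = [h for h in headers if gi_number(h) in wanted]
```

With this, records with unexpected headers are skipped instead of stopping the whole run.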

answered Nov 12 '18 at 20:44

quant

1,58211526

Thank you! It works!
– stevex
Nov 12 '18 at 22:18


Import your "keeper" IDs into a dictionary rather than a list. This will be much faster, because a dictionary lookup does not have to scan the whole list for every record.

keepers = {}
with open("ids.txt", "r") as id_handle:
    for curr_id in id_handle:
        # Strip the trailing newline so the key matches the ID exactly.
        keepers[curr_id.strip()] = True

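A list comprehension generates a list, so you don't need to append to another list. For the headers shown in the question, x.id is the whole first word of the header (for example gi|373248686|emb|HE586118.1|), so compare its second "|"-separated field against the keepers:

keeper_seqs = [x for x in seqiter if x.id.split("|")[1] in keepers]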
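As for the SyntaxError in the question: inside a comprehension the filtering if has to come after the for, and "does any wanted string occur in this ID" is what any() expresses. Here is a small sketch with plain strings standing in for record IDs. The wanted list is an assumption, since the question does not show the text file, and substring matching like this would also accept a short number that happens to occur inside a longer one.

```python
# Hypothetical wanted numbers; the real ones come from the text file.
wanted = ["373248686", "930076459"]
record_ids = [
    "gi|373248686|emb|HE586118.1|",
    "gi|1139489917|gb|KX622588.1|",
    "gi|930076459|gb|KR364704.1|",
]

# The filter clause goes after the `for`; any() asks whether at least one
# wanted string occurs somewhere inside the ID.
kept = [rid for rid in record_ids if any(w in rid for w in wanted)]
```

Here kept ends up holding the first and third IDs. This is still a loop inside a loop, so for big inputs an exact lookup on the split-out field is the faster choice.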

With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
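A rough sketch of that one-at-a-time approach, written without Biopython so it runs on its own: it walks the file line by line and copies a record only while the current header matched. The function name filter_fasta is made up, the in-memory StringIO objects stand in for real files, and it assumes headers shaped like the ones in the question.

```python
import io


def filter_fasta(in_handle, out_handle, wanted):
    """Copy records whose second '|'-separated header field is in `wanted`.

    Only one line is held in memory at a time. Returns the number of
    records written.
    """
    keep = False
    written = 0
    for line in in_handle:
        if line.startswith(">"):
            # A new record starts: decide once whether to keep it.
            fields = line[1:].split("|")
            keep = len(fields) > 1 and fields[1] in wanted
            written += keep
        if keep:
            out_handle.write(line)
    return written


fasta = io.StringIO(
    ">gi|373248686|emb|HE586118.1| Streptomyces albus\n"
    "GGATGCGAAGG\n"
    "CGAGGACGACG\n"
    ">gi|1139489917|gb|KX622588.1| Hyalangium minutum\n"
    "ATGCGCAAGCT\n"
)
out = io.StringIO()
count = filter_fasta(fasta, out, {"373248686"})
```

With real files you would pass two handles opened with open() instead. Memory use then stays flat no matter how large the FASTA file is.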

You should also never assign to sys.stdout without a good reason. If you want to output to STDOUT, just use print or sys.stdout.write().
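If the goal is a file rather than STDOUT, one way to get there without touching sys.stdout is to pass the handle to print. This is only a sketch: the record tuple and the temporary directory are placeholders so the snippet runs by itself.

```python
import os
import tempfile

# Placeholder data: (header, sequence) pairs that passed the filter.
records = [("gi|373248686|emb|HE586118.1| Streptomyces albus", "GGATGCGAAGG")]

# A temporary directory is used here only to keep the example self-contained.
out_path = os.path.join(tempfile.mkdtemp(), "output.fasta")
with open(out_path, "w") as out_handle:
    for header, sequence in records:
        # file= sends this one call to the handle; sys.stdout is untouched.
        print(">" + header, file=out_handle)
        print(sequence, file=out_handle)

# Read the file back to show what was written.
with open(out_path) as check:
    content = check.read()
```

Any later print call without file= still goes to the terminal as usual.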

answered Nov 12 '18 at 20:45

hurfdurf

20228

Thank you very much!
– stevex
Nov 12 '18 at 22:18
