how to subsample a fasta file based on the headers if headers contain certain strings?
I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted =
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq =
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
python bioinformatics
add a comment |
I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted =
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq =
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
python bioinformatics
add a comment |
I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted =
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq =
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
python bioinformatics
I have a fasta file like this:
>gi|373248686|emb|HE586118.1| Streptomyces albus subsp. albus salinomycin biosynthesis cluster, strain DSM 41398
GGATGCGAAGGACGCGCTGCGCAAGGCGCTGTCGATGGGTGCGGACAAGGGCATCCACGT
CGAGGACGACGATCTGCACGGCACCGACGCCGTGGGTACCTCGCTGGTGCTGGCCAAGGC
>gi|1139489917|gb|KX622588.1| Hyalangium minutum strain DSM 14724 myxochromide D subtype 1 biosynthetic gene cluster and tRNA-Thr gene, complete sequence
ATGCGCAAGCTCGTCATCACGGTGGGGATTCTGGTGGGGTTGGGGCTCGTGGTCCTTTGG
TTCTGGAGCCCGGGAGGCCCAGTCCCCTCCACGGACACGGAGGGGGAAGGGCGGAGTCAG
CGCCGGCAGGCCATGGCCCGGCCCGGCTCCGCGCAGCTGGAGAGTCCCGAGGACATGGGG
>gi|930076459|gb|KR364704.1| Streptomyces sioyaensis strain BCCO10_981 putative annimycin-type biosynthetic gene cluster, partial sequence
GCCGGCAGGTGGGCCGCGGTCAGCTTCAGGACCGTGGCCGTCGCGCCCGCCAGCACCACG
GAGGCCCCCACGGCCAGCGCCGGGCCCGTGCCCGTGCCGTACGCGAGGTCCGTGCTGAAC
and I have a text file containing a list of numbers:
373248686
930076459
296280703
......
I want to do the following:
if the header in the fasta file contains the numbers in the text file:
write all the matches(header+sequence) to a new output.fasta file.
How to do this in python? It seems easy, just some for loops may do the job, but somehow I cannot make that happen, and if my files are really big, loop in another loop may take a long time. Here's what I have tried:
from Bio import SeqIO
import sys
wanted =
for line in open(sys.argv[2]):
titles = line.strip()
wanted.append(titles)
seqiter = SeqIO.parse(open(sys.argv[1]), 'fasta')
sys.stdout = open('output.fasta', 'w')
new_seq =
for seq in seqiter:
new_seq.append(seq if i in seq.id for i in wanted)
SeqIO.write(new_seq, sys.stdout, "fasta")
sys.stdout.close()
Got this error:
new_seq.append(seq if i in seq.id for i in wanted)
^
SyntaxError: invalid syntax
Is there a better way to do this?
Thank you!
python bioinformatics
python bioinformatics
edited Nov 13 '18 at 1:08
quant
1,58211526
1,58211526
asked Nov 12 '18 at 20:26
stevexstevex
275
275
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
Use a program like this
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
add a comment |
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers =
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout
without a good reason, if you want to output to STDOUT
just use print
or sys.stdout.write()
.
Thank you very much!
– stevex
Nov 12 '18 at 22:18
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269587%2fhow-to-subsample-a-fasta-file-based-on-the-headers-if-headers-contain-certain-st%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Use a program like this
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
add a comment |
Use a program like this
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
add a comment |
Use a program like this
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
Use a program like this
from Bio import SeqIO
import sys
# read in the text file
numbersInTxtFile = set()
# hint: use with, then you don't need to
# program file closing. Furhtermore error
# handling is comming along with this too
with open(sys.argv[2], "r") as inF:
for line in inF:
line = line.strip()
if line == "": continue
numbersInTxtFile.add(int(line))
# read in the fasta file
with open(sys.argv[1], "r") as inF:
for record in SeqIO.parse(inF, "fasta"):
# now check if this record in the fasta file
# has an id we are searching for
name = record.description
id = int(name.split("|")[1])
print(id, numbersInTxtFile, id in numbersInTxtFile)
if id in numbersInTxtFile:
# we need to output
print(">%s" % name)
print(record.seq)
which you can then call like so from the commandline
python3 nameOfProg.py inputFastaFile.fa idsToSearch.txt > outputFastaFile.fa
answered Nov 12 '18 at 20:44
quantquant
1,58211526
1,58211526
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
add a comment |
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
Thank you! It works!
– stevex
Nov 12 '18 at 22:18
add a comment |
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers =
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout
without a good reason, if you want to output to STDOUT
just use print
or sys.stdout.write()
.
Thank you very much!
– stevex
Nov 12 '18 at 22:18
add a comment |
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers =
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout
without a good reason, if you want to output to STDOUT
just use print
or sys.stdout.write()
.
Thank you very much!
– stevex
Nov 12 '18 at 22:18
add a comment |
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers =
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout
without a good reason, if you want to output to STDOUT
just use print
or sys.stdout.write()
.
Import your "keeper" IDs into a dictionary rather than a list, this will be much faster as the list doesn't have to be searched thousands of times.
keepers =
with open("ids.txt", "r") as id_handle:
for curr_id in id_handle:
keepers[curr_id] = True
A list comprehension generates a list, so you don't need to append to another list.
keeper_seqs = [x for x in seqiter if x.id in keepers]
With larger files you will want to loop over seqiter and write the entries one at a time to avoid memory issues.
You should also never assign to sys.stdout
without a good reason, if you want to output to STDOUT
just use print
or sys.stdout.write()
.
answered Nov 12 '18 at 20:45
hurfdurfhurfdurf
20228
20228
Thank you very much!
– stevex
Nov 12 '18 at 22:18
add a comment |
Thank you very much!
– stevex
Nov 12 '18 at 22:18
Thank you very much!
– stevex
Nov 12 '18 at 22:18
Thank you very much!
– stevex
Nov 12 '18 at 22:18
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53269587%2fhow-to-subsample-a-fasta-file-based-on-the-headers-if-headers-contain-certain-st%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown