Performing a find
This is a locked post that has been migrated from our previous forum. Please start a new post if you would like to continue the discussion.
Geneious User:
Can you tell me how I'd conduct a search of a single sequence string, or of a set (i.e. in an AnnotatedPluginDocument[])? I can see some likely methods in the API documentation, but it would be really helpful if you could give me some example code.
I need to search very large numbers of short reads from some Illumina sequencing. Any suggestions for the fastest possible way to search would be appreciated. The find function in Geneious seems to be pretty fast compared to my own attempts in Java.
Many thanks
Geneious Support:
The simplest way is to use SequenceUtilities.getSequences to get all sequences out of some AnnotatedPluginDocuments. For each SequenceDocument, you can call getCharSequence(). On alignments/contigs, the char sequence may be mostly end gaps, which you probably don't care about, so calling getInternalCharSequence() on it gives you the internal sequence, which you can search efficiently. Converting that to a String is easiest but may require a lot of memory; alternatively, you can iterate over the bases one at a time. For example:
// Extract every nucleotide sequence from the selected documents.
List<? extends SequenceDocument> sequences = SequenceUtilities.getSequences(documents,
        SequenceDocument.Alphabet.NUCLEOTIDE, ProgressListener.EMPTY);
for (SequenceDocument sequenceDocument : sequences) {
    // Skip the end gaps that pad sequences in alignments/contigs.
    CharSequence internalSequence = sequenceDocument.getCharSequence().getInternalCharSequence();
    if (internalSequence.toString().contains("GGG")) {
        ...
    }
}
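To illustrate the "iterate over the bases one at a time" option, here is a minimal sketch in plain Java that scans a CharSequence directly, so no String copy of the sequence is made. The class and method names (CharSequenceSearch, indexOf, contains) are hypothetical and not part of the Geneious API, and the match is a simple case-sensitive scan rather than whatever the built-in find function uses.

```java
// Hypothetical helper, not part of the Geneious API: a naive substring scan
// that reads the sequence through CharSequence.charAt instead of toString().
final class CharSequenceSearch {
    private CharSequenceSearch() {
    }

    /** Returns the index of the first occurrence of query in sequence, or -1 if there is none. */
    static int indexOf(CharSequence sequence, CharSequence query) {
        int sequenceLength = sequence.length();
        int queryLength = query.length();
        if (queryLength == 0) {
            return 0; // same convention as String.indexOf("")
        }
        // Try every start position at which the whole query could still fit.
        for (int start = 0; start <= sequenceLength - queryLength; start++) {
            int matched = 0;
            while (matched < queryLength
                    && sequence.charAt(start + matched) == query.charAt(matched)) {
                matched++;
            }
            if (matched == queryLength) {
                return start;
            }
        }
        return -1;
    }

    /** Case-sensitive containment test built on indexOf. */
    static boolean contains(CharSequence sequence, CharSequence query) {
        return indexOf(sequence, query) >= 0;
    }
}
```

With this in place, the test in the loop could be written as CharSequenceSearch.contains(internalSequence, "GGG") instead of internalSequence.toString().contains("GGG"). For short reads the difference is probably small; the saving matters mainly for long sequences where the String copy is expensive.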
Geneious Support:
Note that the getSequences loop in my previous reply works without needing to load all the sequences into memory at once, which is important when dealing with millions of reads.
I'm guessing you probably want to build a new list of the sequences that match your query. You can use a SequenceListOnDisk.Builder to build a new document containing those, without ever needing to load all the sequences into memory (in case there are millions of them). For example:
List<? extends SequenceDocument> sequences = SequenceUtilities.getSequences(documents,
        SequenceDocument.Alphabet.NUCLEOTIDE, ProgressListener.EMPTY);
// The builder writes matching sequences to disk rather than holding them all in memory.
SequenceListOnDisk.Builder builder = new SequenceListOnDisk.Builder(false, SequenceDocument.Alphabet.NUCLEOTIDE, false);
for (SequenceDocument sequenceDocument : sequences) {
    CharSequence internalSequence = sequenceDocument.getCharSequence().getInternalCharSequence();
    if (internalSequence.toString().contains("GGG")) {
        // Strip gaps before adding the matching sequence to the new list.
        builder.addSequence(SequenceExtractionUtilities.removeGaps(sequenceDocument), ProgressListener.EMPTY);
    }
}
DefaultSequenceListDocument sequenceList = builder.toSequenceListDocument(ProgressListener.EMPTY);
return DocumentUtilities.createAnnotatedPluginDocument(sequenceList);