Features¶
This guide provides instructions on creating, querying, and utilising features to manipulate biological sequence data.
How to create a custom Feature
¶
Via add_feature
¶
We can dynamically create a Feature
via the add_feature()
method, providing the key information about the feature.
The new feature will be added to the annotation_db
attribute of the Sequence
and/or Alignment
. A Feature
instance will be returned.
On a Sequence
¶
We use add_feature
to add a Feature
to a Sequence
, providing the biotype, name/id, and a list of start and stop indices.
A feature can also be added as a series of non-overlapping genomic segments:
On a Sequence
within an Alignment
¶
We use add_feature
and provide a value for seqid
to add a feature to a sequence within an Alignment
or SequenceCollection
.
Note the difference between the value provided to spans
, and the value of map
in the returned Feature
… the resulting feature is in alignment coordinates!
The Feature
specifies that seqid="from 'seq1'"
, indicating that it “belongs” to seq1.
On an Alignment
¶
We use add_feature
to add a Feature
to an Alignment
.
The Feature
specifies that seqid=None
, indicating that it belongs to the alignment
Via an AnnotationDb
¶
We can use add_feature
to add a feature directly into an AnnotationDb
, and assign it to the annotation_db
attribute of a Sequence
or Alignment
. For extensive documentation on handling features directly via an AnnotationDb
see Annotation Databases.
How to load bulk Features from a File¶
Typically, we want to load bulk features from a genomic annotation file, such as a GFF or Genbank file. For the following examples, we will use Caenorhabditis elegans chromosome I.
Note
See the list of Data Files Used in the Documentation to download the data used in the following examples.
To load features from a genomic annotation file along with the corresponding sequence, we can use the load_seq
function. The features are stored in a AnnotationDb
and assigned to the annotation_db
attribute of the sequence.
From a Genbank file¶
How to load features and sequence data¶
To load the sequence and all 40,578 features from C. elegans Chromosome 1, we use the load_seq
function ⚡️
The features are stored in the annotation_db
attribute.
Now that the Sequence
is annotated, we can query it for specific features. For more details on querying, skip to Querying for Features.
From a GFF file¶
How to load features and sequence data¶
Given a FASTA file with sequence data and a GFF file with annotations, we can use load_seq
to load both the sequence and its corresponding features.
Warning
total_records=0
? 🤔 This is because load_seq
assumes the sequence names match exactly between files! If the names are different, you need to provide function to the label_to_name
argument.
Because the names above are different, for FASTA its "I dna:chromosome chromosome:WBcel235:I:1:15072434:1 REF"
and for GFF its "I"
, we need a label_to_name
argument. We provide a lambda function.
How to load features and associate them with an existing sequence¶
We can use the annotate_from_gff()
method to associate the features from a GFF file with the existing Sequence
.
If we know that the features and the sequence share the same coordinate space, then we only need to provide the path to the annotation file.
If the feature coordinates precede the sequence, for example, a sequence corresponds to a gene that starts 600 base pairs from the beginning of chromosome, but the annotation file is for the entire chromosome, we need to provide an offset to the annotate_from_gff()
method.
How to load features and associate them with sequences in an existing alignment¶
To annotate one or more Sequence
in an Alignment
, call annotate_from_gff()
on the Alignment
instance, passing in the path to the GFF annotation file and a list of sequence names to annotate to the seq_ids
argument.
For example, first we load an alignment of the brca1 gene in primates.
Next, we annotate with a GFF file that contains features specific to the human gene.
Note that the AnnotationDb
is accessible via the Alignment
(above) and Sequence
(below) attribute.
Note
Alignment.annotate_from_gff()
does not support setting an offset. If you need to set the offset for a sequence within an alignment, you can do so directly using the Sequence.annotation_offset
attribute.
How to query a Sequence or Alignment for Features¶
The method get_features
yields all features that match the given arguments. You can provide conditions for the name, biotype, and start/stop location of a feature.
Querying a Sequence
for Features¶
Querying via Feature Name¶
We can search for a gene given its name (AKA its unique ID). For example we can search for a gene with name="WBGene00021661"
.
Querying via Feature Biotype¶
We can search for features with a certain biotype, for example, all coding sequences (CDS):
We can also provide combinations of argument to search, for example, all CDS with a given name:
Querying via region of interest¶
We can provide start
and end
arguments to get_features()
and all features within the coordinates will be returned.
We can again provide a combination of conditions, for example, querying for all features with biotype="mRNA"
within a certain range, and returning the first match.
Querying a Sequence (via an Alignment) for Features¶
To query for a particular Sequence
within an Alignment
or SequenceCollection
, we can use get_features
as shown above for a Sequence
, but providing the seqid for the sequence of interest.
For example, given an alignment of primates, we can search for features that are just on the human sequence as follows:
Note that seqid="from'Human'"
indicated this feature belongs to this particular sequence.
Querying an Alignment for Features¶
Querying for features on any Sequence
in an Alignment
¶
todo: on_alignment=False
and dont provide seqid
Note there are features from both Rhesus, which we just added, and Human, which we annotated above
Querying for features on an Alignment
¶
todo: on_alignment=True
and dont provide seqid
Using add_feature
we add a feature to the brca1 alignment we have been using above, by specifying on_alignment=True
this feature will be on the Alignment
.
To query for features on the alignment, we use get_features
, again specifying on_alignment=True
.
Note how even though we annotated the Human and Rhesus sequences in the above examples, only the pseudogene we added to Alignment
is returned by this query.
Querying features that span gaps in alignments¶
If you query for a Feature
from a Sequence
(i.e. the feature is in sequence coordinates), its alignment coordinates may be discontinuous. This will lead to an omission of data from other sequences!
Note
In the above, the T
in sequence Y opposite the gap is missing since this approach only returns positions directly corresponding to the feature.
To include the gaps, use the allow_gaps
argument
Examples using the methods available on Features¶
A Feature
has many methods to manipulate the sequence or alignment that they are bound to.
How to slice a Sequence
or Alignment
by its features¶
Given a Feature
, we can directly slice its parent sequence to return its sequence information
Note
This only works for the Sequence
that the Feature
“belongs” to.
We can also achieve this via get_slice()
How to display the features of a Sequence¶
We can display all the features on a sequence using .get_drawable()
, or a subset of biotypes. We do this for only the first 50,000 base pairs. The plotly figure returned, as displayed below, is interactive! 🤩 Zoom in on the dark vertical lines in the big gene and you will see small genes on the opposite strand. Hover your cursor over each block and the gene name is displayed.
Note
If a feature extends beyond the sequence region selected, its name includes the text “(incomplete)”.
How to find the coordinates of a feature¶
These are useful for doing custom things, e.g. if the introns are not annotated for a gene, we can generate the introns from the coordinates of the exons as follows:
We generate the intron coordinates from the second element of the first tuple, and the first element of the second tuple and so on:
We can then add the introns as a Feature
to the sequence!
How to take the union of features¶
We can create a feature that is the union of all coding sequence.
How to get the shadow of a Feature¶
The “shadow” of a feature is a new feature containing all of the sequence except the feature!
How to use the shadow of a Feature to return the intergenic sequence¶
We first need to query our sequence for all genes. Using the union()
method we combine all genes into a single feature.
Taking the “shadow” of all genes will return the intergenic region as a valid Feature
We can slice the sequence by this new Feature to return the complete intergenic sequence!
How to mask annotated regions¶
Masking annotated regions on a sequence¶
We can mask a certain annotation using with_masked_annotations()
The above sequence could then have positions filtered so no position with the ambiguous character ‘?’ was present.
Masking annotated regions on an Alignment¶
We can mask exons on an alignment.
After a reverse complement operation
these persist.
How to find the “children” of a Feature¶
To find the “children” of a feature, we can use the get_children()
method. A “child” refers to a feature that is nested within or contained by another “parent” feature. For example, a child feature could be an exon contained within a gene or a CDS contained within a transcript.
This method returns a generator that yields all the child features of the specified feature.
For example, let’s find the children of the gene “WBGene00021661”:
How to find the “parent” of a Feature¶
To find the “parent” of a feature, we can use the get_parent()
method, which achieves the inverse of the above method.
For example, we can use the first “child” we returned above, "transcript:Y74C9A.2a.3"
, to find the original parent gene!
How to copy features¶
We can copy features onto sequences with the same name. Note that the AnnotationDb
instance bound to the alignment and its member sequences is the same.
Warning
Despite this, it is possible for the attributes to get out-of-sync. So, any copy annotations should be done using alignment.copy_annotations()
, not alignment.get_seq("x").copy_annotations()
.
However, if the feature lies outside the sequence being copied to, you get a lost span
How to get the positions of a feature as one span¶
as_one_span
unifies features with discontinuous alignment coordinates and returns positions spanned by a feature, including gaps.
Behaviour of annotations on nucleic acid sequences¶
Reverse complementing a sequence does not reverse features. Features are considered to have strand specific meaning (e.g. CDS, exons) and so they retain the reference to the frame for which they were defined.