All Classes and Interfaces
Class
Description
ACGT tree
Count singletons and other allele counts per sample
Methods for manipulating arrays.
A Hash that creates new elements if they don't exist
A simple class that calculates averages
A simple class that calculates average of integer numbers
Counts how many bases changed, given an XOR between two longs
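As a rough illustration of the idea, the sketch below assumes that bases are packed two bits each into a 'long' (the 2-bit encoding mentioned for the binary packed DNA sequences in this index). The class and method names are hypothetical and are not taken from the library.

```java
// Hypothetical sketch, not the library's implementation.
final class XorBaseDiffSketch {
    // Mask selecting the low bit of every 2-bit base slot.
    static final long LOW_BITS = 0x5555555555555555L;

    /** Number of 2-bit base slots that are non-zero in 'xor' (i.e. bases that differ). */
    static int changedBases(long xor) {
        // Fold the high bit of each slot onto its low bit, then count the marked slots.
        long anyDiff = (xor | (xor >>> 1)) & LOW_BITS;
        return Long.bitCount(anyDiff);
    }

    public static void main(String[] args) {
        long a = 0b00_01_10_11L; // a c g t
        long b = 0b00_11_10_00L; // a t g a
        System.out.println(changedBases(a ^ b)); // prints 2
    }
}
```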
Formats: Show all annotations that intersect the BED input file.
Opens a sequence change file and iterates over all intervals in BED format.
Formats output as BED file
References: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
FileIterator for BigBed features
WARNING: Removed in 2022-01 due to dependency on IGV's code (which depends on Log4j, which has a major security issue)
Note: I use Broad's IGV code to do all the work, this is just a wrapper
Base class for a binary 'read'.
Calculate binomial distribution
References http://en.wikipedia.org/wiki/Binomial_distribution
Reads all sequences from a file
Warning: You should always call "close()" at the end of the iteration.
BioTypes: Gene or transcript bioType annotation
References: http://www.ensembl.org/info/genome/genebuild/biotypes.html
Biotypes classifies genes and transcripts into groups including: protein coding, pseudogene,
processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA.
A variety of high efficiency bit twiddling routines.
A black box event
Iterate on each line of a GWAS catalog (TXT format)
A mutable boolean
A catalyst activity event
CDS: The coding region of a gene, also known as the coding sequence or CDS (from Coding DNA Sequence), is
that portion of a gene's DNA or RNA, composed of exons, that codes for protein.
Interval for the whole chromosome
If a SNP has no 'ChromosomeInterval' => it is outside the chromosome => Invalid
Convert chromosome names to simple names
A list of <chromosome, position, scores>
How many changes per position do we have in a chromosome.
Correct circular genomic coordinates
Nomenclature: We use coordinates at the beginning of the chromosome and negative coordinates
Calculate a Cochran-Armitage test
Reference: http://en.wikipedia.org/wiki/Cochran-Armitage_test_for_trend
The trend test is applied when the data take the form of a 2 x k contingency
table.
Class used to encode & decode sequences into binary and vice-versa
They are usually stored in 'long' words
Analyze codon changes based on a variant and a Transcript
Calculate codon changes produced by a deletion
Calculate codon changes produced by a duplication
Calculate codon changes produced by an insertion
Calculate codon changes produced by an Interval
Note: An interval does not produce any effect.
Calculate codon changes produced by an inversion
Calculate codon changes produced by a 'mixed' variant
Essentially every 'mixed' variant can be represented as a concatenation of a SNP/MNP + an INS/DEL
Calculate codon changes produced by a MNP
Calculate codon changes produced by a SNP
Calculate codon changes produced by a duplication
A codon translation table
All codon tables are stored here.
Generate all possible 'count' combinations
Command line and arguments
The way to run a command from 'main' is usually:
public static void main(String[] args) {
    Command cmd = new Command();
    cmd.parseArgs(args);
    cmd.run();
}
Compare two elements in a Map (e.g.
Compare two elements in a Map (e.g.
Compare effects in tests cases
Compare our results to ENSEMBL's Variant Effect predictor's output
Compare our results to ENSEMBL's Variant Effect predictor's output
A Reactome compartment (a part of a cell)
A Reactome complex (a bunch of molecules or complexes
Counters indexed by key.
Counters indexed by 'type' (type is a generic string that can mean anything)
A simple class that counts...
A simple class that counts...
Base by base coverage (one chromosome)
Count how many reads map (from many SAM/BAM files) onto markers
Count how many reads map (from many SAM/BAM files) onto markers
Base by base coverage (one chromosome)
Base by base coverage (one chromosome)
This is a custom interval (i.e.
A controlled vocabulary term
Cytband definitions
E.g.: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/cytoBand.txt.gz
A depolymerization event
Binary packed DNA sequence and base calling quality
Notes:
- This is designed for short sequences (such as "short reads")
- Every base is encoded in 8 bits:
- Six bits for the base quality [0 , ..
DnaAndQualitySequence with an ID
Class used to encode & decode sequences into binary and vice-versa
Note: This is a singleton class.
Binary packed DNA sequence that allows also 'N' bases: {A, C, G, T, N}
Class used to encode & decode sequences into binary and vice-versa
- Every base is encoded in 8 bits:
- Six bits for the base quality [0 , ..
Compares two subsequences of DNA (DnaAndQualitySequence)
Binary packed DNA sequence
Notes:
- This is designed for short sequences (such as "short reads")
- Every base is encoded in 2 bits {a, c, g, t} <=> {0, 1, 2, 3}
- All bits are stored in an array of 'words' (integers)
- Most significant bits are the first bases in the sequence (makes comparison easier)
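The notes above can be sketched as follows. This is a minimal illustration that assumes 32-bit 'int' words (16 bases per word) and rejects anything other than a, c, g, t. All names are hypothetical and the real class may differ.

```java
// Hypothetical sketch of 2-bit packing, most significant bits first.
final class PackedDnaSketch {
    static final int BASES_PER_WORD = 16; // assumed: 32-bit words, 2 bits per base

    /** Map a base to its 2-bit code: {a, c, g, t} <=> {0, 1, 2, 3}. */
    static int code(char base) {
        switch (Character.toLowerCase(base)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default: throw new IllegalArgumentException("Unsupported base: " + base);
        }
    }

    /** Pack a sequence; the first base of each word lands in the highest bits. */
    static int[] pack(String seq) {
        int[] words = new int[(seq.length() + BASES_PER_WORD - 1) / BASES_PER_WORD];
        for (int i = 0; i < seq.length(); i++) {
            int shift = 2 * (BASES_PER_WORD - 1 - (i % BASES_PER_WORD));
            words[i / BASES_PER_WORD] |= code(seq.charAt(i)) << shift;
        }
        return words;
    }

    /** Decode the base at position i. */
    static char baseAt(int[] words, int i) {
        int shift = 2 * (BASES_PER_WORD - 1 - (i % BASES_PER_WORD));
        return "acgt".charAt((words[i / BASES_PER_WORD] >>> shift) & 3);
    }
}
```

Because earlier bases occupy higher bits, two sequences of equal length can be ordered by an unsigned comparison of their words, which is presumably why the notes say this layout makes comparison easier.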
Binary packed DNA sequence.
Binary packed DNA sequence with an ID (long)
Pair end DNA sequence (binary packed)
It consists of 2 DNA sequences separated by a gap.
Compares two subsequences of DNA (DnaSequence)
Command line program: Build database
Interval for a gene, as well as some other information: exons, utrs, cds, etc.
Effect type:
Note that effects are sorted (declared) by impact (highest to lowest putative impact).
VcfFields in SnpEff version 2.X have a different format than 3.X
As of version 4.1 we switch to a standard annotation format
A class representing the same data as an EMBL file
References: http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
A file containing one or more set of features (e.g.
A generic enrichment algorithm for selecting gene-sets from a collection of gene-sets
A generic greedy enrichment algorithm for selecting gene-sets
A greedy enrichment algorithm for selecting gene-sets using a variable geneSet-size strategy:
i) Select only from geneSets in low-sizes e.g.
A reactome basic entity (e.g.
Errors and warnings
A reactome event (any generic event, from pathways to polymerizations)
Launches an 'OS command' (e.g.
Interval for an exon
Characterize exons based on alternative splicing
References: "Alternative splicing and evolution - diversification, exon definition and function" (see Box 1)
Characterize exons based on alternative splicing
References: "Alternative splicing and evolution - diversification, exon definition and function" (see Box 1)
Opens a fasta file and iterates over all fasta sequences in the file
Convert FASTQ (phred64) file to FASTQ (phred33)
Opens a fastq file and iterates over all fastq sequences in the file
Unlike BioJava's version, this one does NOT load all sequences in
memory.
Split a fastq into N files
Simple manipulation of fastq sequences
Trim fastq sequence when quality drops below a threshold
The resulting sequence has to be at least 'minBases'
Trim fastq sequence when:
- Median quality drops below a threshold (mean is calculated every 2 bases instead of every base)
- Sequence length is at least 'minBases'
From Adrian Platts
...Also the sliding window was not every base.
Trim fastq sequence when median quality drops below a threshold
A feature in a GenBank or EMBL file
A feature in a GenBank or EMBL file
A class representing a set of features
References: http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html
A file containing one or more set of features (e.g.
Index a file that has "chr \t pos" as the beginning of a line (e.g.
Opens a file and iterates over all objects in the file
Note: The file is not loaded in memory, thus allows to iterate over very large files
A Generic filter interface
Find intervals where rare amino acids occur
Calculate Fisher's exact test (based on hypergeometric distribution)
A simple class that does some basic statistics on double numbers
Type of frame calculations
Internally, we use GFF style frame calculation for Exon / Transcript
Technically, these are 'frame' and 'phase' which are calculated in different ways
UCSC type: Indicates the coding base number modulo 3.
A class representing the same data as a GenBank file (a 'GB' file)
References:
http://www.insdc.org/documents/feature-table
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
A file containing one or more set of features (e.g.
Interval for a gene, as well as transcripts
Count for each 'type' and 'gene'.
Maps different Gene IDs:
- ENSEMBL Gene ID to transcript ID
- ENSEMBL Gene ID to Gene Name
- ENSEMBL Gene ID to Refseq Gene ID
- ENSEMBL Gene ID to Refseq Protein ID
An interval intended as a mark
Opens a file and creates generic markers (one per line)
A collection of genes (marker intervals)
Note: It is assumed that all genes belong to the same genome
A set of genes (that belongs to a collection of gene-sets)
A collection of GeneSets
Genes have associated "experimental values"
A collection of GeneSets
Genes are ranked (usually by 'value')
Some statistics about a gene
This is just used for the Interval class.
This class stores all "relevant" sequences in a genome
This class is able to:
i) Add all regions of interest
ii) Store genomic sequences for those regions of interest
iii) Retrieve genomic sequences by interval
Simple test program
Calculate statistics on genotype
A vector of genotypes in a 'compact' structure
Note: Genotypes 0/0, 0/1, 1/0, 1/1 are stored in 2 bits.
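A compact layout like the one described could look roughly like this. The sketch assumes four genotypes per byte and the code (allele1 << 1) | allele2. Both choices, and all names, are illustrative assumptions rather than the library's actual layout.

```java
// Hypothetical sketch: biallelic genotypes stored in 2 bits each.
final class GenotypeVectorSketch {
    private final byte[] data;

    GenotypeVectorSketch(int size) {
        data = new byte[(size + 3) / 4]; // 4 genotypes per byte
    }

    /** Store genotype allele1/allele2, where each allele is 0 or 1. */
    void set(int sample, int allele1, int allele2) {
        if ((allele1 | allele2) < 0 || allele1 > 1 || allele2 > 1)
            throw new IllegalArgumentException("Only alleles 0 and 1 fit in 2 bits");
        int code = (allele1 << 1) | allele2;
        int shift = 2 * (sample % 4);
        int idx = sample / 4;
        // Clear the 2-bit slot, then write the new code.
        data[idx] = (byte) ((data[idx] & ~(3 << shift)) | (code << shift));
    }

    /** Return the 2-bit code: 0 => 0/0, 1 => 0/1, 2 => 1/0, 3 => 1/1. */
    int get(int sample) {
        return (data[sample / 4] >>> (2 * (sample % 4))) & 3;
    }
}
```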
Opens a sequence change file and iterates over all intervals in GFF3 format.
An interval intended as a mark
A simple wrapper to Google charts API (from charts4j)
Plots integer data
A simple wrapper to Google charts API (from charts4j)
A simple wrapper to Google charts API (from charts4j)
A simple wrapper to Google charts API (from charts4j)
A simple wrapper to Google charts API (from charts4j)
Plots integer data
An instance of a GO term (a node in the DAG)
A collection of GO terms
General purpose routines
General stuff related to HTML
Load data from GTEx files.
A 'column' in a GTEx file (values from one experiment)
An interval intended as a mark
Given a table in a TXT file, try to guess the value types for each column
A Hash<long, long[]> using primitive types instead of wrapped objects
The idea is to be able to add many long values for each key
This could be implemented by simply doing HashMap<Long, Set<Long>> (but it
would consume much more memory)
Note: We call each 'long[]' a bucket
WARNING: This collection does NOT allow elements to be deleted! But you can replace values.
HGVS notation
References: http://www.hgvs.org/
Coding DNA reference sequence
References http://www.hgvs.org/mutnomen/recs.html
Nucleotide numbering:
- there is no nucleotide 0
- nucleotide 1 is the A of the ATG-translation initiation codon
- the nucleotide 5' of the ATG-translation initiation codon is -1, the previous -2, etc.
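The numbering rules above amount to skipping zero. A minimal sketch follows, assuming the input is a zero-based offset from the A of the ATG (0 for the A itself, -1 for the base immediately 5' of it). The names are hypothetical, and positions 3' of the stop codon, which HGVS writes with a '*' prefix, are not handled.

```java
// Hypothetical sketch of HGVS coding DNA numbering (there is no nucleotide 0).
final class HgvsCodingNumberSketch {
    /**
     * @param offsetFromAtg 0 for the A of the ATG, 1 for the T, -1 for the base just 5' of the A
     * @return HGVS coding position: 1, 2, ... downstream and -1, -2, ... upstream
     */
    static int codingNumber(int offsetFromAtg) {
        return offsetFromAtg >= 0 ? offsetFromAtg + 1 : offsetFromAtg;
    }

    static String format(int offsetFromAtg) {
        return "c." + codingNumber(offsetFromAtg);
    }
}
```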
Coding change in HGVS notation (amino acid changes)
References: http://www.hgvs.org/mutnomen/recs.html
Count Hom/Het per sample
From Pierre:
For multiple ALT, I suggest to count the number of REF allele
0/1 => ALT1
0/2 => ALT1
1/1 => ALT2
2/2 => ALT2
1/2 => ALT2
Calculate hypergeometric distribution using an optimized algorithm
that avoids problems with big factorials.
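The optimized algorithm itself is not described here. One common way to sidestep large factorials, shown below purely as an illustration, is to work with sums of logarithms and exponentiate only at the end. The class and parameter names are hypothetical.

```java
// Hypothetical sketch: hypergeometric probability via log-factorials.
// This is a generic technique, not necessarily the one the library uses.
final class HypergeometricSketch {
    /** log(n!) computed as a sum of logs, so no huge intermediate value is formed. */
    static double logFactorial(int n) {
        double sum = 0;
        for (int i = 2; i <= n; i++) sum += Math.log(i);
        return sum;
    }

    static double logChoose(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    /**
     * P(X = k) when drawing 'draws' items without replacement from a population of
     * 'populationSize' items, of which 'successStates' are successes.
     */
    static double pmf(int k, int populationSize, int successStates, int draws) {
        int min = Math.max(0, draws - (populationSize - successStates));
        int max = Math.min(draws, successStates);
        if (k < min || k > max) return 0.0;
        return Math.exp(logChoose(successStates, k)
                + logChoose(populationSize - successStates, draws - k)
                - logChoose(populationSize, draws));
    }
}
```

A production version would normally cache the log-factorials instead of recomputing the sums on every call.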
Generates Id
Maps many IDs to many Names
I.e.
Map IDs
An entry in a ID mapping file
Base class for integration tests
Interval for an intergenic region
Interval for a conserved intergenic region
A genomic interval.
Interval that contains sub intervals.
Compare intervals by end position
Compare intervals by start position
A set of interval trees (e.g.
Node for interval tree structure
The Node class contains the interval tree information for one single node
Iterate over intervals.
An Interval Tree is essentially a map from intervals to objects, which
can be queried for all data associated with a particular interval or
point
Interval tree structure using arrays
This is slightly faster than the new IntervalTree implementation
An Interval Tree is essentially a map from intervals to objects, which
can be queried for all data associated with a particular interval or
point
Histogram of integer numbers
Intron
Interval for a conserved non-coding region in an intron
A simple class that does some basic statistics on integer numbers
Convert an iterator instance to a (fake) iterable
Interval tree interface
Find all bases combinations from a string containing IUB codes
Load PWM matrices from a Jaspar file
A "key = value" pair
Leading edge fraction algorithm
References: "Common Inherited Variation in Mitochondrial Genes Is Not Enriched for Associations with Type 2 Diabetes or Related Glycemic Traits"
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1001058
See page 12, "Step 4"
A simple chr:pos parser
Stores using bytes instead of chars
Iterate on each line.
Iterate on each line in this file
Filter a line before processing
One line per sequence.
A location (i.e.
A location with respect to an isoform
A location with respect to two locations within an isoform
In this case "start" and "end" are not really an interval, but an interaction between
two locations (e.g.
Logging
Log basic usage information to a server (for feedback and stats)
This information can always be suppressed (no info sent at all)
Analyze whether a set of effects can create a "Loss Of Function"
and "Nonsense mediated decays" effects.
An interval intended as a mark
Opens a Marker file and iterates over all markers
This is a marker used as a 'fake' parent during data serialization
A collection of markers
Marker with a DNA sequence
Serialize markers to (and from) file
Create a list of marker types (names or labels for markers)
Generic utility methods for Markers
A Marker that has 'frame' information (Exon and Cds)
A simple entry in a 'Matrix' file
Iterate on each line of a file, creating a MatrixEntry
Entry in a MicroCosm (miRNA target prediction) file
Iterate on each line of a MicroCosm predictions
References:
http://www.ebi.ac.uk/enright-srv/microcosm/
miRna binding site (usually this was predicted by some algorithm)
Mine marker intervals: I.e.
Mine marker intervals: I.e.
Regulatory elements
Opens a regulation file and create Motif elements.
Create a DNA logo for a PWM
References:
- See WebLogo http://weblogo.berkeley.edu/
- "WebLogo: A Sequence Logo Generator"
A Hash that can hold multiple values for each key
Needleman-Wunsch (global sequence alignment) algorithm for sequence alignment (short strings, since it's not memory optimized)
Needleman-Wunsch algorithm for string alignment (short strings, since it's not memory optimized)
NextProt annotation marker
Parse NextProt XML file and build a database
Handler used in XML parsing for NextProt database
It keeps track of the tags and saves state data to create Markers using NextProtMarkerFactory
http://www.nextprot.org/
Creates Markers from nextprot XML annotations
A simple analysis of sequence conservation for each entry type
Why? Many NextProt annotations are only a few amino acids long (or only 1 AA) and often
they only involve very specific sequences
If the sequence is highly conserved and a non-synonymous mutation occurs, then
this might be disruptive (i.e.
Mimics the 'annotation' tag in a NextProt XML file
Mimics the 'entry' in a NextProt XML file
Mimics the 'isoform-mapping' in a NextProt XML file
Mimics a node in NextProt XML file
Binary packed N-mer (i.e.
Mark if an Nmer has been 'seen'
It only counts up to 255 (one byte per counter)
Create a counter that can count Nmers as well as their WC complements
That means that given an Nmer, the nmer and the Watson-Crick complement are counted the same.
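One simple way to count an Nmer and its complement under the same key is to reduce both to a canonical form first. The sketch below uses plain strings and assumes that "Watson-Crick complement" means the reverse complement. That reading, the lexicographic tie-break, and all names are assumptions made for illustration.

```java
// Hypothetical sketch: count an Nmer and its reverse complement as the same key.
final class NmerWcCounterSketch {
    private final java.util.Map<String, Integer> counts = new java.util.HashMap<>();

    static String reverseComplement(String nmer) {
        StringBuilder sb = new StringBuilder(nmer.length());
        for (int i = nmer.length() - 1; i >= 0; i--) {
            switch (nmer.charAt(i)) {
                case 'a': sb.append('t'); break;
                case 'c': sb.append('g'); break;
                case 'g': sb.append('c'); break;
                case 't': sb.append('a'); break;
                default: throw new IllegalArgumentException("Unsupported base in: " + nmer);
            }
        }
        return sb.toString();
    }

    /** The lexicographically smaller of the Nmer and its reverse complement. */
    static String canonical(String nmer) {
        String rc = reverseComplement(nmer);
        return nmer.compareTo(rc) <= 0 ? nmer : rc;
    }

    void add(String nmer) {
        counts.merge(canonical(nmer), 1, Integer::sum);
    }

    int count(String nmer) {
        return counts.getOrDefault(canonical(nmer), 0);
    }
}
```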
An algorithm that does nothing
Calculate Normal distribution (PDF & CDF) using more precision if required
A buffered reader for a file.
Observed over expected values (o/e) ratios
E.g.: CpG dinucleotides in a sequence
Observed over expected values (o/e) of CHG in a sequence
Observed over expected values (o/e) of CHH in a sequence
Observed over expected values (o/e) of CpG in a sequence
An "open" BitSet implementation that allows direct access to the array of words
storing the bits.
A queue of commands to be run.
Run an OS command as a thread
Formats output
How is this used:
- newSection(); // Create a new 'section' on the output format (e.g.
Calculates the best overlap between two sequences
Note: An overlap is a simple 'alignment' which can only contain gaps at the
beginning or at the end of the sequences.
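Since gaps are only allowed at the ends, an overlap is fully described by a single offset between the two sequences. The brute-force sketch below tries every offset and scores +1 per match and -1 per mismatch. The scoring scheme and all names are assumptions made for illustration, not the library's parameters.

```java
// Hypothetical sketch: best end-gap-only overlap between two sequences.
final class OverlapSketch {
    /** Score of aligning b[j] with a[j + offset] over the overlapping region. */
    static int score(String a, String b, int offset) {
        int start = Math.max(0, offset);
        int end = Math.min(a.length(), b.length() + offset);
        int score = 0;
        for (int i = start; i < end; i++) {
            score += (a.charAt(i) == b.charAt(i - offset)) ? 1 : -1;
        }
        return score;
    }

    /** Offset with the highest score; the first one found wins ties. */
    static int bestOffset(String a, String b) {
        int best = 0;
        int bestScore = Integer.MIN_VALUE;
        for (int offset = -(b.length() - 1); offset < a.length(); offset++) {
            int s = score(a, b, offset);
            if (s > bestScore) {
                bestScore = s;
                best = offset;
            }
        }
        return best;
    }
}
```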
Indicate whether an overlap between two sequences should be considered or not
Only allow overlaps between sequences mapped to same/different partition
Only allow sequences with different IDs to be overlapped
An object used to store overlap parameters
Parse a string and return a collection of objects.
A Reactome pathway
Author's data
A structure that reads PDB files
This code is similar to 'PDBFileReader' from BioJava, but the BioJava version
doesn't close file descriptors and eventually produces a crash when reading
many files.
An entry in a PED table.
A family: A group of Tfams with the same familyId
PED file iterator (PED file from PLINK)
Reference: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
A Simple genotype implementation for PED files
A pedigree for cancer samples
Pedigree entry in a VCF file header
E.g.:
##PEDIGREE=<Derived=Patient_01_Somatic,Original=Patient_01_Germline>
or
##PEDIGREE=<Child=CHILD-GENOME-ID,Mother=MOTHER-GENOME-ID,Father=FATHER-GENOME-ID>
A pedigree of PedEntries
PLINK MAP file
References: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
A polymerization event
How many changes per position do we have in a chromosome.
Get promoter sequences from genes
Protein interaction: An amino acid that is "in contact" with another amino acid.
Protein interaction: An amino acid that is "in contact" with another amino acid
within the same protein.
Protein interaction: An amino acid that is "in contact" with another amino acid.
Analyze purity changes in codons and amino acids
A list of pvalues (i.e.
Create a DNA motif count matrix
Reference: http://en.wikipedia.org/wiki/Position-specific_scoring_matrix
Create a DNA motif count matrix and also
count the number of sequences that contribute
to this motif.
Convert qseq file to fastq
Create random markers using a uniform distribution
Calculate rank sum probability distribution function (pdf) and cumulative distribution function (cdf).
Calculate rank sum probability distribution function (pdf) and cumulative distribution function (cdf).
Calculate rank sum probability distribution function (pdf) and cumulative distribution function (cdf).
Rare amino acid annotation:
These are amino acids that occur very rarely in an organism.
A reaction
Reaction regulation types
Load reactome data from TXT files
Calculate the maximum interval length by type, for all markers in a genome
Create a probability model based on binomial distribution.
Regulatory elements
Opens a GFF3 file and create regulatory elements.
Create a regulation consensus from multiple BED files
Create a regulation consensus from a regulation file.
Opens a regulation file and create Regulation elements.
Split regulation files into smaller files (one per 'regulation type')
Regulation files can be quite large and we cannot read them into
memory.
Opens a GFF3 file and create regulatory elements.
Re-sample statistic
Statistic is a sum of a set of integer numbers (e.g.
Resample statistic
Re-sample statistic using ranks of scores (scores are double)
Store a result from a greedy search algorithm
An entry in a SAM file
References: http://samtools.sourceforge.net/SAM-1.3.pdf
Reads a SAM file
Note: This is a very 'rustic' reader (we should use Picard's API instead)
Sam header
Sam header record
SQ header: Reference sequence dictionary.
Perform stats by analyzing some samples
A list of scores
A buffered reader for a file.
Measures the complexity of a sequence
Ideally we'd like to measure the Kolmogorov complexity of the sequence.
A collection of sequences that are indexed using some algorithm
Note: The ID is just the position in the array.
A reference to a sequence.
Rotates a binary packed sequence
WARNING: We only rotate up to Coder.basesPerWord() because after that the sequences are the same (with an integer offset)
NOTE: Left rotation 'n' is the same as a right rotation 'Coder.basesPerWord() - n'
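The equivalence in the note can be checked with a small sketch. It assumes a 64-bit word holding 32 bases at 2 bits each. That value stands in for Coder.basesPerWord() and is an assumption, as are the class and method names.

```java
// Hypothetical sketch: rotating a binary packed word by whole bases.
final class RotationSketch {
    static final int BASES_PER_WORD = 32; // assumed: 64-bit word, 2 bits per base

    static long rotateLeftBases(long word, int n) {
        return Long.rotateLeft(word, 2 * n);
    }

    static long rotateRightBases(long word, int n) {
        return Long.rotateRight(word, 2 * n);
    }

    public static void main(String[] args) {
        long word = 0x0123456789ABCDEFL;
        int n = 5;
        // A left rotation by n bases equals a right rotation by BASES_PER_WORD - n bases.
        System.out.println(rotateLeftBases(word, n) == rotateRightBases(word, BASES_PER_WORD - n));
    }
}
```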
Smith-Waterman (local sequence alignment) algorithm for sequence alignment (short strings, since it's not memory optimized)
SnpEff's main command line program
Available gene database formats
Available input formats
Available output formats
ACAT: Create ACAT score for T2D project
Note: This is just used to compile 'ACAT' score in T2D-GENES project, not
useful at all for general audience.
Command line program: Build database
Parse NextProt XML file and build a database
http://www.nextprot.org/
Command line: Calculate coding sequences from a file and compare them to the ones calculated from our data structures
Command line: Find closest marker to each variant
Count reads from a BAM file given a list of intervals
Show all databases configured in snpEff.config
Create an HTML 'download' table based on the config file
Also creates a list of genomes for Galaxy menu
Command line program: Download and install a (pre built) database
Command line program: Build database
Command line program: Predict variant effects
Simple test program
Command line: Gene-Sets Analysis
Perform gene set analysis
Calculate the maximum interval length by type, for all markers in a genome
PDB distance analysis
Command line: Read protein sequences from a file and compare them to the ones calculated from our data structures
Note: This is done in order to see potential incompatibility
errors between genome sequence and annotation.
Command line program: Show a transcript or a gene
Command line program: Show a transcript or a gene
Analyze sequences from splice sites
Create an SVG representation of a Marker
Predicts effects of SNPs
This class creates a SnpEffectPredictor from a file (or a set of files) and a configuration
This class creates a SnpEffectPredictor from an Embl file.
This class creates a SnpEffectPredictor from a 'features' file.
This class creates a SnpEffectPredictor from a GenBank file.
This class creates a SnpEffectPredictor from a file (or a set of files) and a configuration
The files used are:
- genes.txt : Biomart query from Ensembl (see scripts/genes_dataset.xml)
- Fasta files: One per chromosome (as described in the config file)
This class creates a SnpEffectPredictor from a GFF file.
This class creates a SnpEffectPredictor from a GFF2 file.
This class creates a SnpEffectPredictor from a GFF3 file
References:
- http://www.sequenceontology.org/gff3.shtml
- http://gmod.org/wiki/GFF3
- http://www.eu-sol.net/science/bioinformatics/standards-documents/gff3-format-description
This class creates a SnpEffectPredictor from a GTF 2.2 file
References: http://mblab.wustl.edu/GTF22.html
This class creates a SnpEffectPredictor from a TXT file dumped using UCSC table browser
Fields in this table
Field      Example             SQL type          Info    Description
-----      -------             --------          ----    -----------
name       uc001aaa.3          varchar(255)      values  Name of gene
chrom      chr1                varchar(255)      values  Reference sequence chromosome or scaffold
strand     +                   char(1)           values  + or - for strand
txStart    11873               int(10) unsigned  range   Transcription start position
txEnd      14409               int(10) unsigned  range   Transcription end position
cdsStart   11873               int(10) unsigned  range   Coding region start
cdsEnd     11873               int(10) unsigned  range   Coding region end
exonCount  3                   int(10) unsigned  range   Number of exons
exonStarts 11873,12612,13220,  longblob                  Exon start positions
exonEnds   12227,12721,14409,  longblob                  Exon end positions
proteinID                      varchar(40)       values  UniProt display ID for Known Genes, UniProt accession or RefSeq protein ID for UCSC Genes
alignID    uc001aaa.3          varchar(255)      values  Unique identifier for each (known gene, alignment position) pair
This class creates a random set of chromosomes, genes, transcripts and exons
This class creates a SnpEffectPredictor from a TXT file dumped using UCSC table browser
RefSeq table schema: http://genome.ucsc.edu/cgi-bin/hgTables
field example SQL type info description
bin 585 smallint(5) range Indexing field to speed chromosome range queries.
Interval for a splice site
Reference: http://en.wikipedia.org/wiki/RNA_splicing
Spliceosomal introns often reside in eukaryotic protein-coding genes.
Interval for a splice site acceptor
Note: Splice site acceptors are defined as the last 2 bases of an intron
Reference: http://en.wikipedia.org/wiki/RNA_splicing
A (putative) branch site.
A (putative) U12 branch site.
Interval for a splice site donor
Note: Splice site donors are defined as the first 2 bases of an intron
Reference: http://en.wikipedia.org/wiki/RNA_splicing
Interval for a splice site acceptor
From Sequence Ontology: A sequence variant in which a change has occurred
within the region of the splice site, either within 1-3 bases of the exon
or 3-8 bases of the intron.
Analyze sequences from splice sites
Read the contents of a stream in a separate thread
This class is used when executing OS commands in order to read STDOUT / STDERR and prevent process blocking
It can alert an AlertListener when a given string is in the stream
Compare two subsequences (actually it compares two sequences from different starting points)
Index all suffixes of all the sequences (it indexes using Nmers).
Create an SVG representation of a Marker
Create an SVG representation of a BND (translocation) variant
In a VCF file, there are four possible translocations (BND) entries:
        REF  ALT   Meaning
type 1: s    t[p[  piece extending to the right of p is joined after t
type 2: s    t]p]  reverse comp piece extending left of p is joined after t
type 3: s    ]p]t  piece extending to the left of p is joined before t
type 4: s    [p[t  reverse comp piece extending right of p is joined before t
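The four ALT shapes listed for BND entries can be told apart from the bracket character and where it sits. The following is a minimal sketch with hypothetical names. It classifies a single ALT string and extracts the mate position 'p', with no validation beyond what is shown.

```java
// Hypothetical sketch: classify a VCF BND ALT string into the four types.
final class BndAltSketch {
    /** Returns 1..4 for t[p[, t]p], ]p]t and [p[t respectively. */
    static int type(String alt) {
        if (alt.isEmpty()) throw new IllegalArgumentException("Empty ALT");
        char first = alt.charAt(0);
        char last = alt.charAt(alt.length() - 1);
        if (first == ']') return 3; // ]p]t : joined before t
        if (first == '[') return 4; // [p[t : reverse comp, joined before t
        if (last == '[') return 1;  // t[p[ : joined after t
        if (last == ']') return 2;  // t]p] : reverse comp, joined after t
        throw new IllegalArgumentException("Not a BND ALT: " + alt);
    }

    /** Returns the text between the two brackets, e.g. "17:198982". */
    static String mate(String alt) {
        int from = -1;
        int to = -1;
        for (int i = 0; i < alt.length(); i++) {
            char c = alt.charAt(i);
            if (c == '[' || c == ']') {
                if (from < 0) from = i;
                else to = i;
            }
        }
        if (to < 0) throw new IllegalArgumentException("Not a BND ALT: " + alt);
        return alt.substring(from + 1, to);
    }
}
```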
Create an SVG representation of a Marker
Create an SVG representation of a Marker
Create an SVG representation of a Marker
Create an SVG representation of a Marker
Create an SVG representation of a NextProt annotation tracks
Create an SVG representation of "Scale and Chromosome" labels
Leave an empty vertical space
Create an SVG representation of a transcript
Create an SVG representation of a BND (translocation) variant
In a VCF file, there are four possible translocations (BND) entries:
        REF  ALT   Meaning
type 1: s    t[p[  piece extending to the right of p is joined after t
type 2: s    t]p]  reverse comp piece extending left of p is joined after t
type 3: s    ]p]t  piece extending to the left of p is joined before t
type 4: s    [p[t  reverse comp piece extending right of p is joined before t
Load a table from a file.
Test cases for sequence alignment
Test case
Test case for parsing ANN fields
Test cases: apply a variant (DEL) to a transcript
Test cases: apply a variant (INS) to a transcript
Test cases: apply a variant (MIXED) to a transcript
Test cases: apply a variant (MNP) to a transcript
Test cases: apply a variant (SNP) to a transcript
Base class for some test cases
Test case
Test for Binomial distribution
Test case
Test random SNP changes
Test for Hypergeometric distribution and Fisher exact test
Test cases for circular genomes
Cochran-Armitage test statistic test case
Codon tables
Test case for cytobands
Test random DEL changes
Test random DEL changes
Test Splice sites variants
Test case
Test case for FASTA file parsing
Test cases for file index (chr:pos index on files)
Test for Hypergeometric distribution and Fisher exact test
GenePvalueList statistics test case
Test case
Test cases for GenotypeVector class
Test case for basic HGVS annotations
Test random SNP changes
Test case
Test cases for HGVS's 'dup' on the negative strand
Test random SNP changes
Test random SNP changes
Test case
Test for Hypergeometric distribution and Fisher exact test
Test random SNP changes
Test case
Test 'apply' method (apply variant to marker)
Base class: Provides common methods used for testing
Test cases for annotation of protein interaction loci
Test cases for cancer effect (difference between somatic and germline tissue)
Test cases for canonical transcript selection
Test case
Test case: Make sure VCF entries have some 'coding' (transcript biotype), even
when biotype info is not available (e.g.
Test case
Test COVID19 build
Test Loss of Function prediction
Test cases on deletions
Test case
Test cases for other 'effect' issues
Test case for EMBL file parsing (database creation)
Test cases for error reporting
Test case for exon frames
Filter transcripts
Test case for EMBL file parsing (database creation)
Test case for genomic sequences
Test case for GFF3 file parsing
Test case for GTF22 file parsing
Test random SNP changes
Test cases for HGVS notation on insertions
Test case
Test case
Test case HGVS: Hard cases
Test cases for HGVS notation on insertions
Test random SNP changes
Test case HGVS for MNPs
Test cases for HGVS notation
Test random SNP changes
Test random SNP changes
Test case where VCF entries are huge (e.g.
Test Loss of Function prediction
Test case
Calculate missense over silent ratio
Test mixed variants
Test random SNP changes
Test Motif databases
Test NextProt databases
Test Nonsense mediated decay prediction
Test case where VCF entries have no sequence change (either REF=ALT or ALT=".")
Protein translation test case
Test cases for annotation of protein interaction loci
Test case for GTF22 file parsing
Test case
Test case for sequence ontology
Test SNP variants
Invoke all integration test cases
Invoke multi thread integration test
Test random SNP changes
Test cases for variants
Test SNP variants
Test random SNP changes
Test case where VCF entries hit a transcript that has errors
Test cases for variants
VCF annotations test cases
Test case
Test intergenic markers
Test case for interval tree structure
Test case for interval tree structure
Test case for interval tree structure
Test random Interval Variants (e.g.
Test case for Jaspar parsing
Test random SNP changes
Test cases for protein interaction
Test Reactome circuits
Seekable file reader test case
Test random SNP changes
Test Splice sites variants
Test Splice sites variants
Test case
Gene: geneId1
1:957-1157, strand: +, id:transcript_0, Protein
Exons:
1:957-988 'exon_0_0', rank: 1, frame: ., sequence: gttgcttgaatactgtatagccttgccattgt
1:1045-1057 'exon_0_1', rank: 2, frame: ., sequence: tgtgttgctaact
1:1148-1157 'exon_0_2', rank: 3, frame: ., sequence: agacatggac
CDS : gttgcttgaatactgtatagccttgccattgttgtgttgctaactagacatggac
Protein : VA*ILYSLAIVVLLTRHG?
Test case for structural variants: Duplications
Test cases for structural variants: Inversions
Test case for structural variants: Translocation (fusions)
Test cases: apply a variant (MIXED) to a transcript
Test cases for variant realignment
VCF parsing test cases
Test playground
Creates a simple "genome" for testing:
Invoke all integration test cases
Invoke all Unit test cases for SnpEff
An entry in a TFAM table.
Interval for a transcript, as well as some other information: exons, utrs, cds, etc.
A set of transcripts
Transcript level support
Reference: http://useast.ensembl.org/Help/Glossary?id=492;redirect=no
Pojo for translocation reports
Calculate Ts/Tv ratios per sample (transitions vs transversions)
Tuple: A pair of objects
Interval for a gene, as well as some other information: exons, utrs, cds, etc.
Interval for a UTR (5 prime UTR and 3 prime UTR)
Interval for a UTR (5 prime UTR and 3 prime UTR)
Interval for a UTR (5 prime UTR and 3 prime UTR)
A variant represents a change in a reference sequence
A 'BND' variant (i.e.
Effect of a variant.
This class is only used for SNPs
A Generic ChangeEffect filter
Effect of a structural variant (fusion) affecting two genes
A sorted collection of variant effects
Variants effect statistics
Effect of a structural variant affecting multiple genes
Opens a sequence change file and iterates over all sequence changes
A variant with respect to non-reference (e.g.
Re-align a variant towards the leftmost (rightmost) position
Note: We perform a 'progressive' realignment, asking for more
reference sequence as we need it
Variants statistics
Opens a sequence change file and iterates over all sequence changes
TXT Format: Tab-separated format, containing up to eight columns that correspond to:
chr \t position \t refSeq \t newSeq \t strand \t quality \t coverage \t id \n
Fields strand, quality, coverage and id are optional
E.g.
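A line in the tab-separated layout described above could be split along these lines. This is a minimal sketch that treats the first four columns as required and the remaining ones as optional, as stated. The class and field names are hypothetical.

```java
// Hypothetical sketch: parse one line of the TXT variant format.
final class TxtVariantLineSketch {
    final String chr;
    final int position;
    final String refSeq;
    final String newSeq;
    // Optional columns; null when the line does not provide them.
    final String strand;
    final String quality;
    final String coverage;
    final String id;

    private TxtVariantLineSketch(String[] f) {
        chr = f[0];
        position = Integer.parseInt(f[1]);
        refSeq = f[2];
        newSeq = f[3];
        strand = f.length > 4 ? f[4] : null;
        quality = f.length > 5 ? f[5] : null;
        coverage = f.length > 6 ? f[6] : null;
        id = f.length > 7 ? f[7] : null;
    }

    static TxtVariantLineSketch parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 4)
            throw new IllegalArgumentException("Expected at least 4 tab-separated columns: " + line);
        return new TxtVariantLineSketch(fields);
    }
}
```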
Count variant types (SNP, MNP, INS, DEL)
Variant + VcfEntry
This is used to 'outer-join' a VcfEntry into all its constituent variants.
A variant that has a numeric score.
Annotate a VCF file: E.g.
Maintains a list of VcfAnnotators and applies them one by one
in the specified order
A 'CSQ' entry in a vcf line ('Consequence' from ENSEMBL's VEP)
Format:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP.
A 'CSQ' entry in a vcf header line
An 'ANN' or 'EFF' entry in a VCF INFO field
Note: 'EFF' is the old version that has been replaced by the standardized 'ANN' field (2014-12)
A VCF entry (a line) in a VCF file
Opens a VCF file and iterates over all entries
Format: VCF 4.1
Reference: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Old 4.0 format: http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0
1.
A VCF genotype field
There is one genotype per sample in each VCF entry
Opens a Hapmap phased file and iterates over all entries, returning VcfEntries for each line
Note: Each HapMap file has one chromosome.
Represents the header of a vcf file.
Represents an INFO element in a VCF file's header
References:
https://samtools.github.io/hts-specs/VCFv4.3.pdf
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Represents an INFO element in a VCF file
References: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
INFO fields should be described as follows (all keys are required):
##INFO=<ID=ID,Number=number,Type=type,Description=description>
Possible Types for INFO fields are: Integer, Float, Flag, Character, and String.
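The header layout above can be picked apart with a regular expression. The sketch below handles only the four keys shown (ID, Number, Type, Description) in that order, with an optionally quoted description. Headers carrying extra keys would need a more general parser, and the class name is hypothetical.

```java
// Hypothetical sketch: parse a '##INFO=<...>' header line with the four keys shown above.
final class VcfInfoHeaderSketch {
    private static final java.util.regex.Pattern INFO = java.util.regex.Pattern.compile(
            "##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description=\"?(.*?)\"?>");

    /** Returns {id, number, type, description}. */
    static String[] parse(String line) {
        java.util.regex.Matcher m = INFO.matcher(line.trim());
        if (!m.matches())
            throw new IllegalArgumentException("Not a recognized INFO header: " + line);
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```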
Number of values in an INFO field.
An 'LOF' entry in a vcf line
An 'NMD' entry in a vcf line
Formats output as VCF
Needleman-Wunsch (global sequence alignment) algorithm for sequence alignment
Only used for short strings (algorithm is not optimized)
VCF statistics: These are usually multi-sample statistics
Check if a new version is available