Introducing kyber: minimalistic quality assessment plot for long reads
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
1y ago
I am releasing kyber today, and you can find out all about it on GitHub. In short, kyber produces a heatmap showing where the majority of your reads are. The x-axis shows the log-transformed read length, the y-axis the accuracy (optionally Phred-scaled). The axis ranges are fixed, which makes it easier to compare datasets. The intention is to give a minimalistic and fast impression of your dataset. Example below ..read more
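The two axis transformations mentioned above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not kyber's own code: the function names and the cap at Q60 for error-free reads are assumptions made for this example.

```python
import math


def accuracy_to_phred(accuracy: float, max_q: float = 60.0) -> float:
    """Convert a fractional accuracy (0-1) to the Phred scale.

    Q = -10 * log10(1 - accuracy). The max_q cap is a hypothetical
    choice for this sketch, needed because a perfect read has no
    finite Phred score.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    error = 1.0 - accuracy
    if error == 0.0:
        return max_q
    return min(max_q, -10.0 * math.log10(error))


def log_length(read_length: int) -> float:
    """Log10-transform a read length, as used for the x-axis."""
    if read_length <= 0:
        raise ValueError("read length must be positive")
    return math.log10(read_length)
```

On this scale an accuracy of 99% corresponds to roughly Q20 and 99.9% to roughly Q30, which spreads out the high-accuracy end of the axis where most reads sit.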
Announcing Phasius for visualization of phase blocks
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
1y ago
Long-read sequencing enables the phasing of variants and reads, i.e., separating those into two parental haplotypes. Without trio information, you will not be able to say which variant is inherited from which parent, but you will know which variants were inherited together. And this matters, e.g., for compound heterozygous variants and cis-regulatory variation. Phasing can work out across long distances but ends in the case of long nasty repeats (e.g., segmental duplications) or long regions without heterozygous variants. I have developed Phasius, a tool for visualizing phase blocks (continuou ..read more
Comparing the pore status of flow cells
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
3y ago
A (for us) very useful diagnostic of a nanopore run is the status of the pores: whether they’re sequencing (‘single_pore’), saturated, unavailable or multiple. It is also interesting to see how fast you are losing pores or killing your flow cell with e.g. a particularly blocky library. Fortunately, these metrics have recently started being saved in the mux_scan_data files created by MinKNOW, so I wrote a script to visualize and compare the pore status. One caveat: these files don’t record the time of the mux_scan, just the order. By default the time between muxes is 1.5h. But since you can manually r ..read more
Automate everything: following and reading literature
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
4y ago
I’m a big fan of saving time by automation. Any job you have to do repeatedly and that doesn’t require your brain is a job to automate. Below is how I organize my workflow and the tools which I use to stay up-to-date with the scientific literature and to organize my reading. I use Inoreader to follow RSS feeds from certain journals and preprints, a lot of blogs, newsletters, some important Twitter accounts, and search terms in PubMed. Every day I go through about 150 new items to search for interesting things, and as a ‘reward’ also things like The Onion and quite some webcomics are fol ..read more
Removing accidental 2D/1D^2 reads from an alignment
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
4y ago
In surpyvor I collect scripts which can be helpful for structural variant analysis. Today I added a “purge2d” subcommand to remove accidental 2D/1D^2 nanopore reads from a BAM file. Most typically we do 1D ligation preps, but in some cases the complement fragment is sequenced directly after the template and the two are not recognized as separate molecules, as in what used to be 2D sequencing (with a hairpin) and is nowadays 1D^2 sequencing (without a covalent link). Typically this complement read also has a much lower quality, since the template read starts refolding and pulling the strand through at the trans ..read more
Methplotlib examples
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
4y ago
We recently published methplotlib, a tool for the visualization and analysis of modified nucleotides from nanopore sequencing. It works downstream of tools like nanopolish, nanocompore and direct methylation calling by the guppy basecaller. More information can be found on GitHub. Feedback, suggestions, reporting problems and feature requests are very much appreciated. Below are some example outputs ..read more
Comparing the end reason of reads in a nanopore experiment
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
4y ago
Since a recent version of MinKNOW, the software controlling a nanopore sequencer, a sequencing_summary file is created before basecalling, in which one column is of particular interest: the end_reason. Although I’m not yet sure what each value means, I believe it gives per read the reason why the software decided to stop sequencing here, mostly because the read was finished, but optionally because it was blocking the pore. Blocked pores are mainly an issue with very long reads, and they can drastically reduce the yield of your flow cell. So it’s something to keep an eye on. I wrote a quick script t ..read more
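Tallying the end_reason column can be sketched as below. This is not the script from the post, just a minimal illustration that assumes the sequencing_summary file is tab-separated with a header row containing an end_reason column; the function name is made up for this example.

```python
import csv
from collections import Counter


def count_end_reasons(handle, column="end_reason"):
    """Count how often each value occurs in the given column.

    `handle` is an open text file (or any iterable of lines) holding a
    tab-separated table with a header row. The column name defaults to
    the one mentioned in the post.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    return Counter(row[column] for row in reader)
```

Calling it on an opened summary file, e.g. `with open(path, newline="") as fh: counts = count_end_reasons(fh)`, gives a Counter whose `most_common()` output can be fed to any plotting library to compare runs.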
Stacked bar chart of FILTER information from a multi-sample VCF
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
4y ago
I wanted to make a stacked bar chart to show the number of variants with a certain FILTER status per sample from a multi-sample VCF. Nowadays I make all plots with plotly, because it’s fast, convenient to write and dynamic HTML makes it easier afterwards to select the bits I’m interested in to show. In the example below I have to hide the sample labels, as these are confidential. The example is of course a static screenshot, and the HTML output is lots more fun to play with. In this case the VCF is from Mutect2 somatic variant calling and the number of PASS variants is rather low! Here is t ..read more
Bcftools concat: Failed to open variants.vcf.gz: could not load index
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
5y ago
If you are like me and like to massively parallelize jobs, then you may come across the following, initially cryptic error when using bcftools concat with thousands of vcf files:
bcftools concat -a *.vcf.gz | bcftools sort -o all_variants.vcf
Failed to open a_certain_variant_file.vcf.gz: could not load index
I believe the problem is that too many files are opened. For each vcf the index (.tbi) file is opened as well, so the total exceeds the number of files your operating system allows you to open. You can check that limit using ulimit -n, which in my case is 1024. You could see if your ..read more
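One possible workaround, sketched here under assumptions rather than taken from the post, is to concatenate in batches that stay below the open-file limit. The sketch assumes two file handles per input (the vcf and its .tbi index, as described above), reserves an arbitrary headroom of 64 handles, and uses made-up batch file names. It only builds the bcftools command lines and does not run them; the `resource` module is Unix-only.

```python
import resource


def max_files_per_batch(handles_per_file=2, reserved=64):
    """Estimate how many vcf files fit under the soft open-file limit.

    handles_per_file=2 assumes one handle for the vcf and one for its
    index; `reserved` is an arbitrary safety margin for this sketch.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        soft = 4096  # arbitrary stand-in when no limit is set
    return max(1, (soft - reserved) // handles_per_file)


def chunk(files, size):
    """Split a list of file names into consecutive batches."""
    return [files[i:i + size] for i in range(0, len(files), size)]


def concat_commands(files, size, prefix="batch"):
    """Build one `bcftools concat` command (as an argument list) per batch."""
    commands = []
    for n, batch in enumerate(chunk(files, size)):
        out = f"{prefix}_{n}.vcf.gz"  # hypothetical intermediate file name
        commands.append(
            ["bcftools", "concat", "-a", "-O", "z", "-o", out, *batch]
        )
    return commands
```

Each command could be passed to `subprocess.run`, after which the intermediate batch files would need to be indexed and concatenated once more to produce the final output.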
A handy bash alias for compressing and indexing vcf files
Gigabase or gigabyte - Exploring bioinformatics
by wdecoster
5y ago
I often have a ton of vcf files, which I would like to compress using bgzip and index using tabix, which is necessary for many downstream steps such as bcftools concat. I grew tired of always typing the same command, so I wrote the following bash alias, which uses gnu parallel and is part of my .bash_aliases file.
alias vcfzip="ls *.vcf | parallel --bar 'bgzip {} && tabix {}.gz'"
When I’m in a directory with files that need to be compressed I can simply execute vcfzip and all files will get compressed and indexed, together with a friendly progress bar ..read more