The latest Drosophila melanogaster genome, release 5, has a complete copy of the mitochondrial chromosome embedded in chrU, which makes it a trap for mitochondrial RNAseq reads. If the aligner distributes reads randomly among all matching sites, about half of the mitochondrial reads will go to chrU. If ambiguous alignments are discarded, all of them will be lost.
ChrU is not a real chromosome but a 10 Mb mixed bag of 34,630 small scaffolds that didn’t seem to fit anywhere during shotgun genome assembly. My best guess is that the 19.5 kb mitochondrial scaffold seemed far too small to be anything real, so the Celera shotgun assembler just lumped it into chrU with all the other loose fragments. According to this paper, the fragment is the true mitochondrial genome of the y1, cn1, bw1, sp1 strain, in contrast to the reference chrM, which is a composite of several genomes.
Leaving out all or part of chrU during alignment is an obvious solution. The coordinates of the mitochondrial bit are (roughly) chrU:5288508-5303826, and turning this stretch into N’s preserves the remainder of chrU without trapping mitochondrial reads. Another option is to leave out chrU altogether. This is probably justifiable, given that most of chrU is just duplicated fragments from other parts of the genome, sequenced in much worse quality and therefore not fitting into their original place.
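The masking step can be sketched in a few lines of plain Python. This is only a minimal illustration, not a tested pipeline: the function names (mask_region, read_fasta, mask_fasta) and the file names in the usage comment are made up for this example, and it assumes the coordinates above are 1-based and inclusive and that the sequence is named chrU in the FASTA header.

```python
def mask_region(seq, start, end, mask_char="N"):
    """Replace the 1-based, inclusive region [start, end] with mask_char.

    The sequence length is unchanged, so downstream coordinates stay valid.
    """
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    return seq[:start - 1] + mask_char * (end - start + 1) + seq[end:]


def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mask_fasta(in_handle, out_handle, name, start, end, width=60):
    """Copy a FASTA file, masking [start, end] in the sequence called name."""
    for header, seq in read_fasta(in_handle):
        # Only the first word of the header is treated as the sequence name.
        if header.split()[:1] == [name]:
            seq = mask_region(seq, start, end)
        out_handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")


# Hypothetical usage; the file names are placeholders:
# with open("dm3.fa") as src, open("dm3.masked.fa", "w") as dst:
#     mask_fasta(src, dst, "chrU", 5288508, 5303826)
```

Since the coordinates are only approximate, it is worth checking the masked file afterwards, for example by aligning a few known mitochondrial reads against it.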
[Update] I heard BDGP will remove this snag in the release 6 genome.
Recent release notes on Dmel v6 don’t indicate to me that the ‘mitochondrial genome in Unmapped’ snag has been resolved: http://flybase.org/static_pages/feature/previous/articles/2014_07/FB2014_04.html
Also dmel-all-chromosome-r6.03.fasta contains about 1800 small scaffolds that I guess should be excluded from NGS read-mapping. See ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06/fasta/
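One way to exclude the small scaffolds before mapping is to keep only sequences whose names are on an explicit list. The sketch below is just an illustration: the function names are made up, and the names in the usage comment are my assumption about how the major arms and the mitochondrial genome are labelled, so check the actual headers in the file first.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_named_sequences(records, keep):
    """Yield only the records whose name (first header word) is in keep."""
    for header, seq in records:
        words = header.split()
        if words and words[0] in keep:
            yield header, seq


def write_fasta(records, out_handle, width=60):
    """Write (header, sequence) pairs as FASTA with fixed-width lines."""
    for header, seq in records:
        out_handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            out_handle.write(seq[i:i + width] + "\n")


# Hypothetical usage; the names in KEEP are an assumption, not taken
# from the file, and the output file name is a placeholder:
# KEEP = {"2L", "2R", "3L", "3R", "4", "X", "Y", "mitochondrion_genome"}
# with open("dmel-all-chromosome-r6.03.fasta") as src, \
#         open("dmel-r6.03-main.fasta", "w") as dst:
#     write_fasta(keep_named_sequences(read_fasta(src), KEEP), dst)
```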
What do you reckon?
Thanks for the note. A quick search with some mitochondrial sequences found them only in dmel_mitochondrion_genome in release 6, so this looks good now.
I would be excited to someday find some interesting strong expression in the hard-to-map regions of the genome, so I’m always of two minds about including them. For me it is something for separate exploration, even more so with release 6’s deluge of files and separately listed scaffolds. This makes routine inclusion seem a bit inconvenient unless the scaffolds are all pasted together into a new chrU.