Hello,
I'm trying to use the FluxSimulator to generate some synthetic data on which I'm comparing a number of different RNA-Seq expression estimation tools. A number of existing methods test their quantification procedures on data generated using the generative model that the method itself assumes (thus begging the question and resulting in potentially misleading estimates). The FluxSimulator seems like a great tool — developed independently and with the goal of generating realistic synthetic data — on which to test some of these methods.
Right now, however, I'm getting some very strange results when using eXpress to estimate expression levels on a synthetically generated dataset (.par file – fluxSim_Hsapien_defaultExp_76bpWithError.par). Specifically, eXpress is producing a number of warnings of the form:
"WARNING: The observed alignments appear disporportionately in the forward-reverse order (270392482 vs. 855441). If your library is strand-specific, you should use the --fr-stranded option to avoid incorrect results."
and the resulting quantification estimates are very poor (spearman R ~ 0.64). However, RSEM performs well (spearman R ~ 0.94). I was wondering if the FluxSimulator, run with the parameters included above, will produce a strand-specific library. I read nothing about this in the documentation, and I assumed that the resulting library would not be strand-specific. Could you inform me what the default behavior of the FluxSimulator is regarding the strandedness of the library? If it is strand-specific by default, is there a way to produce a library that is not; what is the parameter that controls this? Thanks for your help!
4 Comments
Rob Patro
Rerunning express with it's '--fr-stranded' option doesn't seem to improve their quantification estimates. However, I'm still interested in the answer to this question regarding the Flux Simulator.
Micha Sammeth
Hi Rob,
sorry for the delay in responding. The Flux Simulator is currently not producing explicitely strand specific libraries, but you could create one by generating paired-end reads and taking always exclusively one of the mates that corresponds to the directionality you go for.
However, if I understand your question, your doubt is actually about the "forward-reverse" asymmetry of 270392482 vs. 855441. If "forward-reverse" is to be read as "sense-antisense", then this is not normal in the output. Usually you loose a couple of poly-A reads that are mainly anti-sense, but not as many as it would explain the discrepancy you report, and as I see in your parameter file, you got poly-adenlyation deactivated anyway.
Also under the assumption that "forward-reverse" refers to the genomic orientation your observation seems odd, because usually the distribution of genes along human chromosomes is not biased towards the one or the other DNA strand. I just ran as a test the parameter file you provide, replacing your annotation "hg19_annotations_sorted.gtf" by the current RefSeq annotation. In the resulting bed file, I find (as expected)
If you obtain different results, please provide a couple of lines from your sequenced bed file for further investigation.
ATB
Rob Patro
Hi Micha,
I figured out what's going on with the help of Adam Roberts, the developer of eXpress (as it turns out, it doesn't seem that the "strandedness" is what was causing the performance issues). However, what was occurring is that with 'UNIQUE_IDS YES' set, the reads were all being generated such that the first read of the mate pair (the read ending in \1) was overwhelmingly being mapped to the forward (sense) strand. Given the wiki section on Read Identifiers, this makes sense, as it states that:
If
UNIQUE_IDS
is enabled, all sense reads (S withoutUNIQUE_IDS
) will be/1
, and all anti-sense reads (A withoutUNIQUE_IDS
) will be/2
.Perhaps, the small fraction of reverse mappings were due to misalignments or some other issue; I'm not sure. So, presumably, the first read of every mate pair is coming from the sense strand; right? If so, I think this explains the asymmetry, since, by chance, we would expect the first read of a mate pair to be produced unbiasedly from either the sense or antisense strand. Either way, if I pass the --fr-stranded option to eXpress, it stops producing the warning. Also, for the purposes of recording this for posterity or for anyone else who runs into a similar "performance issue" using eXpress with reads generated via the flux simulator, the problem was due to the fact that eXpress expects the reads to be provided in a random order. However, the flux simulator outputs reads in a non-random order (they appear to be grouped by source or some such). To properly test eXpress with reads generated by the flux simulator, then, one must appropriately randomize the order of the reads in the FASTA / FASTQ file. If anything I said here sounds fishy, or you have any further thoughts on it, please let me know.
Micha Sammeth
Hi Rob,
yes, apparently I didn't completely understand your scenario last time. Your observations are correct: usually mate numbering and of a read pair, and thus also the strandedness of each read, are randomly assigned; the according code is in the method Sequencer$Processor.process(Gene):
However, when producing "unique IDs" (unique in the sense that, after cutting off the "/1" or "/2" suffix, the prefixes of both of the mated reads match), this random assignment is overridden later-on in Sequencer.appendReadName(...)
That is because there never has been proposed really a standard to put strand information in read IDs, and is in line with current standards to couple read and strand information (i.e., mate 1 is sense). The dilemma is that, in the simulation, we always know the directionality of a read at the time it is produced–and the UNIQUE_ID does not provide any possibility to put that information if not enforcing a strand for a certain mate (1 or 2); if not, this information is lost and not recoverable from the output.
I created a ticket on the current Simulator (version 1.2.1) always outputting stranded mate-info with UNIQUE_IDs on, to discuss if we should introduce another parameter for unstranded data in this scenario:
I updated the documentation to make the situation more clear, thanks for your thourough investigation of the phenomenon!
Micha