To perform co-analysis with the collected RNA-seq experiments in fungiExp, users must make gene and transcript expression files, and two alternative splicing profile files respectively based on JCEC and JC. The expression files include 5 columns corresponding to gene/transcript identifer, read count, coverage and normalized FPKM and TPM expression levels (Click
to see the format). The alternative splicing files must include 6 columns corresponding to event identifer, inclusion read count, skiping read count, inclusion frame length, skiping frame length and normalized psi value (Click
to see the format). To generate available expression and alternative splicing files, do the following steps:
2. A popular protocol based on
Hisat2 and
StringTie2 is employed to obtain expression data for genes and transcripts. Hisat2 is used to align reads against reference genome and produce alignment file in bam format, such as
align.bam. Refer to Hisat2
manual for how build genome index and map RNA-seq reads against genome. Next,
StringTie2 is just used to detect transcript abundance, not to discover novel transcripts. To reach this goal, users should run StringTies by setting "-G" to the improved gene model annotation as reference, "-e" to only estimate the abundance of transcripts in the given reference gene models, and "-A" to a gene abundance file as output.
stringtie align.bam \
-G 5518.improvedAnno.gtf \
-e \
-o transcriptome.gtf \
-A gene.abundance.tab
Note:
-G a reference gene model annotation in gtf format. It refers to the "5518.improvedAnno.gtf" file.
-e switch off discovery of novel transcripts.
-o output a transcriptome file in gtf format containing transcript expression information.
-A output a gene abundance file in tsv format.
3. Running a customized perl script (
gather.expression.for.co-analysis.pl ) to generate gene and transcript expression files whose format are available for online co-analysis with the collected experiments(Click
to see the available format).
perl gather.expression.for.co-analysis.pl \
--python /usr/bin/python \
--prepDE /opt/software/stringtie-2.1.4/prepDE.py \
--readLen 101 \
--transcriptomeFile transcriptome.gtf \
--geneAbundanceFile gene.abundance.tab \
--outputGeneExprFile gene.expr.tsv \
--outputTrsptExprFile transcript.expr.tsv
Note:
--python a python program with full path.
--prepDE a python script named "prepDE.py" to extract the read count information directly
from the "transcriptome.gtf" file generated by StringTie (run with the -e parameter).
prepDE.py can be downloaded from the stringTie2 offical website.
If your python is Python 3 version, only the "prepDE.py3" script is available.
--readLen a number corresponding to RNA-seq read length.
--transcriptomeFile a transcriptome file from stringTie2 in the previous step.
--geneAbundanceFile a gene abundance file from stringTie2 in the previous step.
--outputGeneExprFile output file containing gene expression data that is available for co-analysis.
--outputTrsptExprFile output file containing transcript expression data that is available for co-analysis.
4. Download five types of alternative splicing events collected in a tar package file(
) and a mapping file from alternative splicing event to numeric identifier(
).
5.
rMATS-4.1.1 is employed to count reads on alternative splicing events by setting "--gtf" to the improved gene model annotation and "--fixed-event-set" to directory that contains five alternative splicing event annotation files. Refer to rMATS
manual for the operation details.
python rmats.py \
--gtf 5518.improvedAnno.gtf \
--b1 b.txt \
--readLength 101 \
--fixed-event-set 5518.fromGTF \
--statoff \
--od outputDir \
--tmp /tmp /
Note:
--gtf reference gene model annotation in gtf format.
--b1 a text file containing a comma separated list of the BAM files for RNA-seq sample, for example "/tmp/align.bam"
--readLength a number corresponding to RNA-seq read length, for example 101.
--fixed-event-set a directory where the five alternative splicing event annotation files exist.
The five alternative splicing files are downloaded in the previous step and must be
named "fromGTF.A3SS.txt", "fromGTF.A5SS.txt", "fromGTF.RI.txt", "fromGTF.MXE.txt" and "fromGTF.SE.txt".
--statoff switch off statistic model.
--od a directory where rMATS outputs the result files including JCEC.raw.input.[asType].txt and JC.raw.input.[asType].txt.
--tmp a temporary directory for work.
6. Running a customized perl (
gather.splicing.for.co-analysis.pl )" to calculate PSI values of events, convert numeric identifers to alternative spicing identifiers, and merge 5 "JCEC.raw.input.[asType].txt" files into a single "JCEC" file and merge 5 "JC.raw.input.[asType].txt" into into two single files. The two output files are available for online co-analysis with the collected experiments.
perl gather.splicing.for.co-analysis.pl \
--asIdToNumericId 5518.asId.to.numId.tsv \
--jcecA3SS outDir/JCEC.raw.input.A3SS.txt \
--jcecA5SS outDir/JCEC.raw.input.A5SS.txt \
--jcecMXE outDir/JCEC.raw.input.MXE.txt \
--jcecRI outDir/JCEC.raw.input.RI.txt \
--jcecSE outDir/JCEC.raw.input.SE.txt \
--jcA3SS outDir/JC.raw.input.A3SS.txt \
--jcA5SS outDir/JC.raw.input.A5SS.txt \
--jcMXE outDir/JC.raw.input.MXE.txt \
--jcRI outDir/JC.raw.input.RI.txt \
--jcSE outDir/JC.raw.input.SE.txt \
--outputJCECtsv jcec.tsv \
--outputJCtsv jc.tsv
Note:
--asIdToNumericId a mapping file from alternative spicing event to numeric identifer.
It is downloaded in the previous 4th step.
--jcec[asType] the read count files of rMATS JCEC outputs in the previous 5th step.
--jc[asType] the read count files of rMATS JC outputs in the previous 5th step.
--outputJCECtsv the combined file for the 5 JCEC read count files.
--outputJCtsv the combined file for the 5 JC read count files.
7. In the "Comparison", "Specificity", "Coexpression" or "Cross-species" pages, upload gene.expr.tsv and transcript.expr.tsv(the outputs in the 3rd step) as well as jcec.tsv and jc.tsv (the outputs in the 6th step) with related RNA-seq sample information to perform co-analysis with the collected RNA-seq experiments.