This website contains material on the analysis of small RNA, in particular miRNA, sequencing experiments. It refers to
Supplementary Material for "Identifying Transcriptional miRNA Biomarkers by Integrating High-Throughput Sequencing and Real-Time PCR Data"
Accompanying Sotware
We provide a zip archive small-rna-pipeline.zip containing
- the Snakefile with the workflow description
- required Python scripts, as listed in Appendix A of the manuscript
- a README file explaining how to set up the directory in which to run the pipeline
- necessary files to initialize the pipeline directory
Other required software
Required Resources
-
Human genome reference assembly GRCh37,
- ENSEMBL genome annotation track (version 67),
-
FASTA file with mature miRNAs of all organisms from the current miRBase release,
-
simiar FASTA file with mature miRNAS from the older miRBase release 8.1 (on which the RT-q-PCR kit was based)
Datasets
If you are interested in analyzing our neuroblastoma data, your can
retrieve the ten datasets from the NCBI Sequence Read Archive using
accession number SRA009986.
To convert the downloaded .SRA files to FASTQ, download the NCBI SRA
SDK, and compile and install the fastq-dump binary to your $PATH. For a
single SRA file, run
fastq-dump -F SRR029966.sra
The resulting .fastq file will have, curiously, a file name
consisting of spaces, which needs to be fixed manually. In the
following, we assume the datasets have been labelled 552.fastq,
553.fastq, ..., 561.fastq (ten datasets in total). Our pipeline still expects color space fasta (.csfasta) and separate quality (.qual) files, as output by the SOLiD system.
Previous Workflow
The section below describes parts of our old workflow.
It is here for historical purposes only, it has been completely replaced by the Snakefile-based workflow.
Reference Genome
Get the 1000 Genomes project reference file. It is named human_g1k_v37.fasta.gz. You don't need to unpack it.
Create a FASTA index of the reference:
samtools faidx human_g1k_v37.fasta.gz
Create a BWA colorspace index (this may take a few hours):
bwa index -a bwtsw -p hg1kv37 -c human_g1k_v37.fasta.gz
Remove adapters with cutadapt:
mkdir reads/
for i in $(seq 552 561); do
cutadapt -c -e 0.12 -a 330201030313112312 -m 15 --bwa ${i}.fastq.gz > reads/${i}.fastq.gz
done
Cutadapt is here instructed to
- work with colorspace data (-c)
- allow an error rate of 12 % (-e 0.12)
- remove the adapter
330201030313112312
- discard reads shorter than 15 bp (-m 15)
- create BWA-compatible output (--bwa). Please see the cutadapt command-line help for what this does exactly.
If you don't use cutadapt, be aware that BWA especially has problems with empty sequences (length zero), negative quality values (which do occur in SOLiD data), and requires double-encoded FASTQ files in which the primer nucleotide and the first color have been removed.
Align reads
for i in $(seq 552 561); do
bwa aln -n 3 -t 8 -c hg1kv37 reads/${i}.fastq | bwa samse hg1kv37 | samtools view -bS > ${i}.unsorted.bam
samtools sort ${i}.unsorted.bam ${i}
done
If mapping was successful, you can delete the .unsorted.bam files.
Count reads mapping to miRNAs
Download the miRBase GFF track from ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff.
Count reads mapping to miRNAs with htseq-count:
htseq-count -i ID -t miRNA aligned.sam hsa.gff > counts.txt
for i in $(seq 552 561); do
samtools view -h $i.bam > ${i}.sam && \
htseq-count -i ID -t miRNA ${i}.sam hsa.gff > ${i}.counts.txt && \
rm $i.sam
done
Since htseq-count cannot read from BAM files and it had problems reading from standard input, we can neither use the BAM files directly nor pipe them into the tool. Instead, a temporary SAM file needs to be generated.
Convert to .expressions files
These files will be read in by the normalization script below.
for i in $(seq 552 561); do
echo "name count" > ${i}.expressions
grep -v no_feature ${i}.counts.txt | grep -v ambiguous | tr '\t' ' ' >> ${i}.expressions
done
Normalize miRNA counts
Get the attached normalize-mirna-expression.R script. Create a file named classes.txt that describes to which class each dataset belongs. In our case, it looks like this:
dataset class
552 0
553 0
554 0
555 0
556 0
557 1
558 1
559 1
560 1
561 1
Run the normalization script:
normalize-mirna-expression.R qqscale classes.txt
For each dataset xyz that is mentioned in the classes.txt, the script reads in the corresponding xyz.expressions file. It then normalizes the data according to the given normalization method (qq-scale in this case, described in the paper) and creates a file all.qqscale.expressions. Some PDF plots are also generated in the current directory.
The file all.qqscale.expressions is in CSV format (and can therefore be opened by a spread sheet program). In R, the file can be read in with the following code:
read.table('all.qqscale.expressions',sep=",",header=T,row.names=1)