Research‎ > ‎

Small RNA Sequencing

This website contains material on the analysis of small RNA, in particular miRNA, sequencing experiments. It refers to


Supplementary Material for "Identifying Transcriptional miRNA Biomarkers by Integrating High-Throughput Sequencing and Real-Time PCR Data"


Accompanying Sotware

We provide a zip archive small-rna-pipeline.zip containing

  • the Snakefile with the workflow description
  • required Python scripts, as listed in Appendix A of the manuscript
  • a README file explaining how to set up the directory in which to run the pipeline
  • necessary files to initialize the pipeline directory

Other required software

Required Resources

  • Human genome reference assembly GRCh37,

  • ENSEMBL genome annotation track (version 67),
  • FASTA file with mature miRNAs of all organisms from the current miRBase release,

  • simiar FASTA file with mature miRNAS from the older miRBase release 8.1 (on which the RT-q-PCR kit was based)

Datasets

If you are interested in analyzing our neuroblastoma data, your can retrieve the ten datasets from the NCBI Sequence Read Archive using accession number SRA009986.

To convert the downloaded .SRA files to FASTQ, download the NCBI SRA SDK, and compile and install the fastq-dump binary to your $PATH. For a single SRA file, run

fastq-dump -F SRR029966.sra

The resulting .fastq file will have, curiously, a file name consisting of spaces, which needs to be fixed manually. In the following, we assume the datasets have been labelled 552.fastq, 553.fastq, ..., 561.fastq (ten datasets in total). Our pipeline still expects color space fasta (.csfasta) and separate quality (.qual) files, as output by the SOLiD system.



Previous Workflow

The section below describes parts of our old workflow.
It is here for historical purposes only, it has been completely replaced by the Snakefile-based workflow.

Reference Genome

Get the 1000 Genomes project reference file. It is named human_g1k_v37.fasta.gz. You don't need to unpack it.

Create a FASTA index of the reference:

samtools faidx human_g1k_v37.fasta.gz

Create a BWA colorspace index (this may take a few hours):

bwa index -a bwtsw -p hg1kv37 -c human_g1k_v37.fasta.gz

Remove adapters with cutadapt:

mkdir reads/
for i in $(seq 552 561); do
   cutadapt -c -e 0.12 -a 330201030313112312 -m 15 --bwa ${i}.fastq.gz > reads/${i}.fastq.gz
done

Cutadapt is here instructed to

  • work with colorspace data (-c)
  • allow an error rate of 12 % (-e 0.12)
  • remove the adapter 330201030313112312
  • discard reads shorter than 15 bp (-m 15)
  • create BWA-compatible output (--bwa). Please see the cutadapt command-line help for what this does exactly.

If you don't use cutadapt, be aware that BWA especially has problems with empty sequences (length zero), negative quality values (which do occur in SOLiD data), and requires double-encoded FASTQ files in which the primer nucleotide and the first color have been removed.

Align reads


for i in $(seq 552 561); do
  bwa aln -n 3 -t 8 -c hg1kv37 reads/${i}.fastq | bwa samse hg1kv37 | samtools view -bS > ${i}.unsorted.bam
  samtools sort ${i}.unsorted.bam ${i}
done

If mapping was successful, you can delete the .unsorted.bam files.

Count reads mapping to miRNAs

Download the miRBase GFF track from ftp://mirbase.org/pub/mirbase/CURRENT/genomes/hsa.gff.

Count reads mapping to miRNAs with htseq-count:

htseq-count -i ID -t miRNA aligned.sam hsa.gff > counts.txt

for i in $(seq 552 561); do
  samtools view -h $i.bam > ${i}.sam && \

    htseq-count -i ID -t miRNA ${i}.sam hsa.gff > ${i}.counts.txt && \
    rm $i.sam
done

Since htseq-count cannot read from BAM files and it had problems reading from standard input, we can neither use the BAM files directly nor pipe them into the tool. Instead, a temporary SAM file needs to be generated.

Convert to .expressions files

These files will be read in by the normalization script below.

for i in $(seq 552 561); do
  echo "name count" > ${i}.expressions
 
grep -v no_feature ${i}.counts.txt | grep -v ambiguous | tr '\t' ' ' >> ${i}.expressions
done

Normalize miRNA counts

Get the attached normalize-mirna-expression.R script. Create a file named classes.txt that describes to which class each dataset belongs. In our case, it looks like this:

dataset class
552 0
553 0
554 0
555 0
556 0
557 1
558 1
559 1
560 1
561 1

Run the normalization script:

normalize-mirna-expression.R qqscale classes.txt

For each dataset xyz that is mentioned in the classes.txt, the script reads in the corresponding xyz.expressions file. It then normalizes the data according to the given normalization method (qq-scale in this case, described in the paper) and creates a file all.qqscale.expressions. Some PDF plots are also generated in the current directory.

The file all.qqscale.expressions is in CSV format (and can therefore be opened by a spread sheet program). In R, the file can be read in with the following code:

read.table('all.qqscale.expressions',sep=",",header=T,row.names=1)


Č
ċ
ď
normalize-mirna-expression.R
(10k)
Marcel Martin,
Apr 14, 2011, 6:58 AM
ċ
ď
Sven Rahmann,
Jul 11, 2012, 7:39 AM
Comments