David Quigley
Research Summary   -   CV   -   Publications   -   CARMEN   -   equalizer   -   talks


Affymetrix microarrays are designed using a reference genome from human, mouse, or another organism. Single Nucleotide Polymorphisms (SNPs) in the genomes of individual humans and model organisms that vary from the reference annotations affect the binding of cDNA to microarray probes, affecting the results of experiments measuring gene expression in organisms with distinct genomes. Genetic experiments performed using these arrays are subject to systemic bias unless the effect of these SNPs is accounted for.

The software package equalizer drastically reduces this problem for the commonly used Affymetrix IVT and Gene ST platforms by using whole-genome sequence data to remove probes which overlap SNPs. The customized annotation files generated by equalizer can be used for normalization in oligo or other packages.

equalizer is a free open-source project hosted on Github.


If you use equalizer, please cite:
Quigley D. equalizer reduces SNP bias in Affymetrix microarrays. BMC Bioinformatics 2015, 16:238 (30 July 2015)
Open-access journal link

How it works

equalizer diagram

equalizer rewrites a new copy of the Affymetrix platform description files, removing individual probes that intersect any feature in one or more VCF files you supply. equalizer can then be used to generate a new bioconductor Platform Design Information package.

There are two ways to run equalizer:
  1. Call a the equalizer.py python script directly from the command line
  2. or install the equalizer R package (recommmended). To install equalizer from within R, call:
    equalizer depends on Bioconductor and the packages pdInfoBuilder and findPython. To install all of these:
    NOTE: The R package requires that Python is installed on your system. OS X and unix operating systems ship with Python; if you need to install Python on a Windows box there is an installer at python.org.
  1. The BedTools package

Self-contained examples

To run a complete, self-contained minimal example on a machine running Ubutnu linux with at least 4 GB RAM, first install bedtools and R if they are not already available:
sudo apt-get update
sudo apt-get install r-base 
wget https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz
tar -xzf BEDTools.v2.17.0.tar.gz
cd bedtools-2.17.0
sudo make 
cd ..

This example is sufficient to demonstrate that equalizer will modify MoGene annotation files using a VCF file describing SNPs in a single gene, Cdc26. The example normalizes the a single raw CEL file using both the standard and alternate probe databases and shows the change in expression levels and which probes were removed.
wget http://davidquigley.com/software/equalizer/equalizer_minimal_example.tar.gz
tar -xzf equalizer_minimal_example.tar.gz
cd equalizer_minimal
#  ****************************************************************
#  * IMPORTANT: Before continuing, set the value of BEDTOOLS_PATH *
#  ****************************************************************
sudo Rscript equalizer_minimal_example.R $BEDTOOLS_PATH 
The expected results from the last two lines of output for this code are
> [1] "Original number of probes for Cdc26: 12"
> [1] "Probes remaining after equalizer (should be 3): 3"

**NOTE: requires 500 Mb. data download**

This example demonstrates how equalizer and eqtl can be used together to correct for SNP effects. It re-creates a result in equalizer manuscript. The example includes 96 CEL files from mouse mammary tissue measured on the MoGene platform (also available online at GEO: GSE46077) and all of the annotation files required for this platform. The script *run_example.sh*
  1. runs equalizer on the MoGene data using a VCF file for a single gene, Cdc26.
  2. normalizes the raw CEL files using both the standard and alternate probe databases
  3. calculates eQTL for Cdc26 using the standard and alternate gene expression data
The result is a demonstration that the strongest uncorrected P value is a non-significant trans-eQTL on chromosome 7, while the strongest corrected P value is a significant cis-eQTL on chromosome 4.
wget http://davidquigley.com/software/equalizer/equalizer_example.tar.gz
tar -xzf equalizer_example.tar.gz
cd equalizer_full
#  ****************************************************************
#  * IMPORTANT: Before continuing, set the value of BEDTOOLS_PATH *
#  ****************************************************************
sudo Rscript equalizer_example.R $BEDTOOLS_PATH 
sh eqtl.sh
The expected output for this code (last lines) is:
> uncorrected result:
> 7.616683e-03	Cdc26	10513583	E07.002.028_10	0.601000	8.017787	8.160000	0.000000	2.727430
> corrected result:
> 4.302593e-14	Cdc26	10513583	E04.067.977_13	0.000000	4.393535	5.667245	0.000000	8.885516

I have confirmed these examples will run on a freshly instantiated Amazon Web Services T2.medium Ubuntu linux instance. They should run on any linux or OSX machine although you will need to install R by other means if you are not running Ubuntu linux; consult your local guru if needed.

Annotation files you'll need

I have prepared bundles of commonly used array platforms so you don't need to find them yourself:.

To assemble your own set of files for equalizer you will need Affymetrix's chip definition files for the chip you wish to correct, available from Affymetrix.com.

For platform ARRNAME, the required chip definition files are:

For IVT arrays (e.g. HG-U133A)
  1. ARRNAME.annot.csv
  2. ARRNAME.cdf
  3. ARRNAME.probe_tab
  4. ARRNAME.bed
  5. A CEL file in the same format as ARRNAME

For Gene ST arrays (e.g. HuEx 1.0)
  1. ARRNAME.probeset.csv
  2. ARRNAME.transcript.csv
  3. ARRNAME.pgf
  4. ARRNAME.mps
  5. ARRNAME.clf
  6. ARRNAME.bed

You must also supply a file specifying the genome locations of Affymetrix probes in the BED format.

equalizer command line usage

-f --format              Chip format, one of {gene,IVT}
-p --packagename         New package name
-v --vcf                 Path to VCF files, comma-delimited
-a --affy                Path to affymetrix probe BED file
-b --bedtools            Path to bedTools bin folder
-o --output              Path to write package files, default current directory

-s --annot_version       Affymetrix annotation version (e.g. na34)
-e --annot_revision      Affymetrix chip library revision (e.g. 1)
-g --genome_version_ucsc UCSC Genome version (e.g. mm10)
-d --genome_version_ncbi NCBI Genome version (e.g. 38)
-c --chip                Affymetrix chip name (e.g. MoGene-1_1-st-v1)
-i --chip_version        Affymetrix chip version (e.g. 1.1)
-u --author              Author's Name for package (e.g. Gregor Mendel)
-l --email               Author's email for package (e.g. greg@wrinkledpea.org)
-w --organism            organism common name (e.g. Mouse)
-r --species             species name (e.g. Mus musculus)

-n --file_CDF            Path to affymetrix CDF file 
-x --file_probe_tab      Path to affymetrix probe_tab file
-z --file_CEL            Path to sample Affymetrix CEL file

-y --file_probeset_csv   Path to affymetrix probeset.csv
-t --file_transcript_csv Path to affymetrix transcript.csv 
-q --file_pgf            Path to chip PGF file from Affymetrix library 
-m --file_mps            Path to chip MPS file from Affymetrix library 
-k --file_clf            Path to chip CLF file from Affymetrix library 
Preparing BED files

Affymetrix currently stores BED files for their arrays at this location on their website. These files contain multiple tracks.
Gene ST files

If you obtain a BED file from Affymetrix it will contain several tracks. An example, from the MoGene-2_0-st-v1:
  • Affymetrix MoGene-2_0-st-v1_exon probeset(transcriptID_probesetID)
  • Affymetrix MoGene-2_0-st-v1_gene level exon(transcriptID)
  • Affymetrix MoGene-2_0-st-v1_gene probeset(transcriptID)
  • Affymetrix MoGene-2_0-st-v1_probe(transcriptID_probesetID)
The purpose of the BED file in this context is to obtain the mapped positions of each individual probe. If you call:
grep 17210850_17210851 MoGene-2_0-st-v1.mm10.bed
you'll see:
chr1	3102029	3102110	17210850_17210851	0	+	3102029	3102110	204,102,51	8	24,24,24,24,24,24,24,24,	0,1,5,6,7,51,52,57,
chr1	3102029	3102053	17210850_17210851	0	+	3102029	3102053	0,0,0	1		
chr1	3102030	3102054	17210850_17210851	0	+	3102030	3102054	0,0,0	1		
chr1	3102034	3102058	17210850_17210851	0	+	3102034	3102058	0,0,0	1		
chr1	3102035	3102059	17210850_17210851	0	+	3102035	3102059	0,0,0	1		
chr1	3102036	3102060	17210850_17210851	0	+	3102036	3102060	0,0,0	1		
chr1	3102080	3102104	17210850_17210851	0	+	3102080	3102104	0,0,0	1		
chr1	3102081	3102105	17210850_17210851	0	+	3102081	3102105	0,0,0	1		
chr1	3102086	3102110	17210850_17210851	0	+	3102086	3102110	0,0,0	1
Note that the first entry, from the probeset track, spans chr1:3102029-3102110. The 8 on that line indicates 8 probes make up that probeset. The remaining lines are the locations of the eight individual probes (e.g. chr1:3102029-3102053). If there is a SNP between chr1:3102029-3102110 that does NOT intersect one of the probe locations, it should not affect the hybridization. Therefore, retain only the probe(transcriptID_probesetID) track.

IVT files

The HG-U133A_2 BED file contains two tracks
  • HG-U133A_2 Consensus/Exemplar
  • HG-U133A_2 Probe
In this case, use the HG-U133A_2 Probe track.

The HG-U133 Plus 2 array BED file contains a single track, called "HG-U133_Plus_2 Probes And Consensi". This is an intermingled track, which is not very helpful. equalizer notes this and strips out the probes, ignoring the consensus lines.

Updated June 2015