FINEMAP



FINEMAP-ing articles

- Refining fine-mapping: effect sizes and regional heritability. bioRxiv. (2018).
- Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
- FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

FINEMAP is a program for fine-mapping in genomic regions associated with complex traits and disease. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore the ideal tool for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.

Command-line arguments

--cond    Fine-mapping with stepwise conditional search    Subprogram
--config    Evaluate a single causal configuration without performing shotgun stochastic search    Subprogram
--corr-config    Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold    Default is 0.95
--corr-group    Option to set the absolute-correlation threshold above which a pair of SNPs is grouped    Default is 0.99
--dataset    Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2)    All datasets are processed by default
--flip-beta    Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations    With --cond, --config and --sss
--group-snps    Option to group SNPs on the basis of their correlations    With --cond and --sss
--help    Command-line help
--in-files    Master file (see below)    With --cond, --config and --sss
--log    Option to write output to log files specified in column 'log' in the master file    No log files are written by default
--n-causal-snps    Option to set the maximum number of allowed causal SNPs    Default is 5
--n-configs-top    Option to set the number of top causal configurations to be saved    Default is 50000
--n-convergence    Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-tol) before the shotgun stochastic search is terminated    Default is 1000
--n-iterations    Option to set the maximum number of iterations before the shotgun stochastic search is terminated    Default is 100000
--prior-k    Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file    SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0    Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself    Default is 0.0
--prior-std    Option to specify a comma-separated list of prior standard deviations of effect sizes    Default is 0.05
--prob-tol    Option to set the tolerance at which the added probability mass (over --n-convergence iterations) is considered small enough to terminate the shotgun stochastic search    Default is 0.001
--rsids    Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below)    With --config
--sss    Fine-mapping with shotgun stochastic search    Subprogram
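
The --corr-config rule can be illustrated with a minimal Python sketch. It is not FINEMAP's implementation: the function name, the 3-SNP correlation matrix and the configuration weights are hypothetical, and only the rule itself (zero probability for a configuration containing a pair of SNPs with absolute correlation above the threshold) comes from the option description above.

```python
from itertools import combinations

def violates_corr_config(config, ld, threshold=0.95):
    """Return True if any pair of SNPs in `config` has |r| above `threshold`."""
    return any(abs(ld[i][j]) > threshold for i, j in combinations(config, 2))

# Hypothetical correlation matrix: SNPs 0 and 1 are nearly collinear.
ld = [
    [1.00, 0.97, 0.10],
    [0.97, 1.00, 0.12],
    [0.10, 0.12, 1.00],
]

# Hypothetical posterior weights for three two-SNP configurations.
weights = {(0, 1): 0.5, (0, 2): 0.3, (1, 2): 0.2}

# Configurations that contain a highly correlated pair get probability zero.
filtered = {
    config: (0.0 if violates_corr_config(config, ld) else weight)
    for config, weight in weights.items()
}
```

Here only the configuration (0, 1) is zeroed, because 0.97 exceeds the default threshold of 0.95. Whether and how the remaining probabilities are renormalized is not stated in this document, so the sketch leaves them unchanged.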

Input

(1) Master file

The master file is a semicolon-separated text file and contains no spaces. It contains the following mandatory column names and one dataset per line.
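
The list of mandatory column names is not reproduced in this text. As a format illustration only, the Python sketch below parses a semicolon-separated master file, rejects spaces, and checks that every dataset line has as many fields as the header. The function name, the column names `z` and `ld`, and the file paths are placeholders chosen for the example; only the `log` column is mentioned elsewhere in this document.

```python
def parse_master(text):
    """Parse a semicolon-separated master file into one dict per dataset."""
    lines = [line for line in text.splitlines() if line]
    if any(" " in line for line in lines):
        raise ValueError("master file must not contain spaces")
    header = lines[0].split(";")
    datasets = []
    for number, line in enumerate(lines[1:], start=1):
        fields = line.split(";")
        if len(fields) != len(header):
            raise ValueError(f"dataset {number}: expected {len(header)} fields")
        datasets.append(dict(zip(header, fields)))
    return datasets

# Placeholder column names and paths, for illustration only.
example = "z;ld;log\nexample/data.z;example/data.ld;example/data.log\n"
datasets = parse_master(example)
```

Each dataset becomes a dictionary keyed by column name, so the first dataset's LD file path is `datasets[0]["ld"]`.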

(2) Z file

The dataset.z file is a space-delimited text file and contains the GWAS summary statistics, one SNP per line. It contains the mandatory column names in the following order.

(3) LD file

The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).
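
As a rough sketch of what such a file holds, the Python example below computes Pearson correlations between SNP dosage columns and formats them as a space-delimited matrix. The function names, the toy genotypes and the choice of four decimal places are assumptions made for illustration; in practice the correlations would come from a tool such as LDstore.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ld_matrix(genotypes):
    """genotypes: one list of allele dosages per SNP (all polymorphic)."""
    m = len(genotypes)
    return [[pearson(genotypes[i], genotypes[j]) for j in range(m)]
            for i in range(m)]

def format_ld(matrix, digits=4):
    """Render the matrix as space-delimited text, one SNP per row."""
    return "\n".join(" ".join(f"{r:.{digits}f}" for r in row) for row in matrix)

# Toy dosages for 3 SNPs and 6 samples (hypothetical data).
genotypes = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 2, 0, 1],
    [2, 1, 0, 1, 2, 0],
]
ld_text = format_ld(ld_matrix(genotypes))
```

The sketch assumes every SNP is polymorphic; a monomorphic SNP has zero variance, so its correlation is undefined and would need separate handling.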

(4) BGEN, BGI, SAMPLE and INCL file

These are Oxford file formats, described in the respective BGEN, BGI and SAMPLE format documentation. The dataset.incl file is a text file that restricts estimation of SNP correlations to genotype data from a subset of the samples in dataset.sample. It contains one sample ID per line.

(5) Optional K file

By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities pk = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.
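
The normalization described above can be sketched in Python as follows. This is an illustration of the stated rule (non-negative entries, rescaled to sum to one), not FINEMAP's own code; the function name and the example values are hypothetical.

```python
def read_k_priors(text):
    """Parse space-delimited prior weights and normalize them to sum to one."""
    values = [float(token) for token in text.split()]
    if not values:
        raise ValueError("K file is empty")
    if any(value < 0 for value in values):
        raise ValueError("prior probabilities must be non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one prior probability must be positive")
    return [value / total for value in values]

# Hypothetical K file content with K = 4 entries that sum to 1.2.
priors = read_k_priors("0.6 0.3 0.1 0.2")
```

After normalization the entry for k = 1 is 0.6 / 1.2 = 0.5, and the four entries sum to one.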

Output

(1) SNP file

The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP, one SNP per line.

(2) CONFIG file

The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration, one configuration per line.
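
Per-configuration and per-SNP summaries are related. One standard way to obtain a marginal posterior inclusion probability for a SNP is to sum the posterior probabilities of all configurations that contain it. The Python sketch below shows that calculation under stated assumptions: the function name and the probabilities are hypothetical, and this document does not specify how FINEMAP computes its model-averaged summaries.

```python
from collections import defaultdict

def marginal_pips(config_probs):
    """Sum configuration probabilities over all configurations containing each SNP."""
    pips = defaultdict(float)
    for config, prob in config_probs.items():
        for snp in config:
            pips[snp] += prob
    return dict(pips)

# Hypothetical posterior probabilities for three causal configurations.
config_probs = {
    ("rs30", "rs11"): 0.7,
    ("rs30",): 0.2,
    ("rs11", "rs5"): 0.1,
}
pips = marginal_pips(config_probs)
```

In this toy example rs30 appears in configurations with total probability of about 0.9, rs11 about 0.8 and rs5 about 0.1.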

(3) CRED file

The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal conditional on other causal signals in the genomic region together with conditional posterior inclusion probabilities for each variant. More detailed information TBA.
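
Until more detail is available, the general idea of a credible set can be sketched in Python. The example uses a common construction (rank variants by probability and add them until the cumulative probability reaches the coverage level); it is an assumption that this matches FINEMAP's procedure, and the function name, SNP identifiers and probabilities are hypothetical.

```python
def credible_set(probs, coverage=0.95):
    """Return the top-ranked SNPs whose probabilities sum to at least `coverage`."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    chosen, total = [], 0.0
    for snp, prob in ranked:
        chosen.append(snp)
        total += prob
        if total >= coverage:
            break
    return chosen, total

# Hypothetical conditional inclusion probabilities for one causal signal.
signal_probs = {"rs30": 0.80, "rs31": 0.12, "rs29": 0.05, "rs40": 0.03}
snps_in_set, covered = credible_set(signal_probs)
```

With these values the set contains rs30, rs31 and rs29, because the first two variants cover only 0.92 of the probability.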

(4) LOG file

The dataset.log file outputs additional information. It contains the following output.

(5) DOSE file

The dataset.dose file is a binary file with allele dosage data. A DOSE file contains the following information.

Fine-mapping example

Using genotype data with 50 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain z-scores. SNP correlations were computed from GWAS genotype data.
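
For readers who want to build a comparable toy dataset, the Python sketch below simulates a quantitative phenotype from a linear model with 2 causal SNPs and derives single-SNP z-scores. It is not the procedure used to create the example files: the genotypes are drawn independently (real genotype data has LD), and the causal indices, effect size, allele frequency, seed and function name are all assumptions. The z-score is the usual t-statistic of a simple linear regression, written in terms of the Pearson correlation r as r * sqrt((n - 2) / (1 - r^2)).

```python
import math
import random

def simulate_z_scores(n_samples=5363, n_snps=50, causal=(10, 29),
                      effect=0.5, allele_freq=0.3, seed=1):
    """Simulate independent SNPs, a linear-model phenotype and single-SNP z-scores."""
    rng = random.Random(seed)
    # Each dosage is the number of copies (0, 1 or 2) of an allele.
    geno = [[(rng.random() < allele_freq) + (rng.random() < allele_freq)
             for _ in range(n_samples)] for _ in range(n_snps)]
    # Phenotype: additive effects of the causal SNPs plus standard normal noise.
    pheno = [sum(effect * geno[j][i] for j in causal) + rng.gauss(0.0, 1.0)
             for i in range(n_samples)]
    mean_y = sum(pheno) / n_samples
    syy = sum((y - mean_y) ** 2 for y in pheno)
    z_scores = []
    for column in geno:
        mean_x = sum(column) / n_samples
        sxx = sum((x - mean_x) ** 2 for x in column)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(column, pheno))
        r = sxy / math.sqrt(sxx * syy)
        z_scores.append(r * math.sqrt((n_samples - 2) / (1.0 - r * r)))
    return z_scores

z_scores = simulate_z_scores()
# Indices of the two SNPs with the largest absolute z-scores.
top_two = sorted(range(len(z_scores)), key=lambda j: abs(z_scores[j]), reverse=True)[:2]
```

Because the simulated effects are strong and the SNPs are independent, the two causal SNPs should stand out clearly; with correlated SNPs, non-causal neighbours can show similarly large z-scores, which is the situation fine-mapping addresses.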

Single causal configuration example

The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows:

./finemap_v1.3_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.3_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11

References

Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).

Acknowledgements

Matti Pirinen contributed to the design and implementation of FINEMAP.

LDstore



LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (LD) between variants (i.e. Pearson correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing 1) parallel processing using OpenMP, 2) sparse estimation to achieve smaller file size, and 3) storing of the LD information with additional variant information in the same file to enable fast lookups of LD information. For instance, LDstore can generate LD information for 5,000 variants in less than 30 seconds on an off-the-shelf laptop and store the LD information using less than 100 megabytes. LDstore is therefore the ideal tool for sharing LD information in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing within web portals.

Command-line arguments

--bcor    Name of the BCOR input/output file(s)    Always required
--bgen    Name of the BGEN input file    Requires --bcor
--bplink    Basename of the PLINK BED, BIM and FAM input files    Requires --bcor
--merge    Merge xx BCOR files (having file extensions .bcor_processNumber) into a common BCOR file where xx is the total number of parallel processes used during estimation of LD information    Requires --bcor
--samples    Name of the SNPTEST2 sample file when using BGEN input    Requires --bgen and --incl-samples
--meta    Extract variant information and store it in the specified text file    Requires --bcor
--matrix    Extract LD information in matrix format and store it in the specified text file    Requires --bcor
--table    Extract LD information in table format and store it in the specified text file    Requires --bcor
--incl-range    Specify a genomic range xx-yy to operate on where xx and yy are the start and end coordinate in base pairs    Default is genomic range of the input file
--incl-samples    Include only samples in the estimation of LD information whose sample ID (ID_1 with BGEN input and IID with PLINK input) lies in the specified text file    Requires --samples with BGEN input
--incl-variants    Extract LD information for variants given in the specified text file. The specified file has 5 columns with a header: RSID, position, chromosome, A_allele and B_allele    Requires --matrix or --table
--ld-thold    LD information for two variants is only stored or extracted if their absolute Pearson correlation is above this threshold    Requires --bgen, --bplink or --table. Default is 0.001
--ld-n-samples-avail-prop    LD information for two variants is only stored or extracted if the proportion of all samples with genotype data for the two variants is above this threshold    Requires --bgen, --bplink or --table. Default is 0.1
--n-variants-chunk    Number of variants processed together    Requires --bgen or --bplink. Default is 1000
--variant-window-size    LD information for two variants A and B is computed if B is xx base pairs downstream of A    Requires --bgen or --bplink. Default is 5 megabase pairs
--accuracy    LD information is stored using either low, medium or high accuracy    Requires --bgen or --bplink. Default is medium
--n-threads    Specify the number of parallel processes during estimation of LD information    Requires --bgen or --bplink. Default is max number of CPU cores available
--help    Command-line help
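
The two filters --ld-thold and --ld-n-samples-avail-prop can be read as a joint condition on each pair of variants. The Python sketch below expresses that reading; the function and argument names are hypothetical, and it is an assumption that both conditions are combined with a logical AND.

```python
def keep_pair(r, n_samples_both, n_samples_total,
              ld_thold=0.001, avail_prop=0.1):
    """Decide whether LD for a pair of variants passes both thresholds."""
    strong_enough = abs(r) > ld_thold
    enough_data = (n_samples_both / n_samples_total) > avail_prop
    return strong_enough and enough_data

# Hypothetical pairs from a dataset with 5,000 samples.
kept_weak = keep_pair(r=0.0005, n_samples_both=5000, n_samples_total=5000)
kept_sparse = keep_pair(r=0.4, n_samples_both=400, n_samples_total=5000)
kept_ok = keep_pair(r=-0.4, n_samples_both=4500, n_samples_total=5000)
```

Only the last pair passes: the first has an absolute correlation below 0.001, and the second has genotype data for both variants in only 8% of the samples.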

Input

LDstore supports BGEN files and PLINK BED, BIM and FAM files as input.

Output

LDstore writes LD information between variants (i.e. Pearson correlations) into a binary file format called BCOR. The BCOR format reduces data storage requirements and enables fast lookups of LD information. To speed up computation of LD information, LDstore uses OpenMP for parallel processing and sparse estimation via a window approach, because LD between two variants decreases with their physical distance. Because estimation runs in parallel processes, LDstore creates multiple BCOR files (having file extensions .bcor_processNumber) that can be merged into a common BCOR file.
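
The window approach can be illustrated with a short Python sketch that enumerates the variant pairs for which LD would be computed. It is a simplified reading of --variant-window-size, not LDstore's implementation: the function name and positions are hypothetical, and treating the window boundary as inclusive is an assumption.

```python
def window_pairs(positions, window_bp=5_000_000):
    """Yield index pairs (a, b) where b lies at most window_bp downstream of a."""
    order = sorted(range(len(positions)), key=positions.__getitem__)
    for rank, a in enumerate(order):
        for b in order[rank + 1:]:
            if positions[b] - positions[a] > window_bp:
                break  # every later variant is even further away
            yield a, b

# Hypothetical base-pair positions of four variants on one chromosome.
positions = [100, 2_000_000, 6_000_000, 12_000_000]
pairs = list(window_pairs(positions))
```

With a 5 megabase pair window only the pairs (0, 1) and (1, 2) are evaluated, which is why the stored LD information is sparse.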

Examples

Genotype data on 50 SNPs and 5,000 samples is provided in BGEN and PLINK files. The second variant is monomorphic and there are 500 samples with missing genotype data at the third variant.

Estimation of LD information

Merging of multiple BCOR files

Although only a single thread was used to generate LD information, the --merge option can still be used:

./ldstore \
--bcor example/data_bgen.bcor \
--merge 1
./ldstore \
--bcor example/data_plink.bcor \
--merge 1

Note that the value given to --bcor must be the same as the value given to --bcor during estimation of LD information. LDstore searches for xx BCOR files (having file extensions .bcor_processNumber), where xx is the total number of parallel processes used during estimation of LD information, and merges them into a common file called data_bgen.bcor or data_plink.bcor.

Extraction of variant information

Variant information can be extracted and stored in a text file as follows:

./ldstore \
--bcor example/data_bgen.bcor \
--meta example/data_bgen.meta

The header and the first 3 variants in the file data_bgen.meta are:

index RSID position chromosome A_allele B_allele A_allele_freq B_allele_freq
1 rs1 1 01 A G 0.2267 0.7733
2 rs2 2 01 A G 1.0000 0.0000
3 rs3 3 01 A G 0.4239 0.4761
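
The lines above can be read back with a few lines of Python. The sketch below is a minimal parser for the space-delimited layout shown, with hypothetical function and variable names; it also flags monomorphic variants and sums the two allele frequencies per variant.

```python
meta_text = """index RSID position chromosome A_allele B_allele A_allele_freq B_allele_freq
1 rs1 1 01 A G 0.2267 0.7733
2 rs2 2 01 A G 1.0000 0.0000
3 rs3 3 01 A G 0.4239 0.4761
"""

def parse_meta(text):
    """Parse a space-delimited meta file into one dict per variant."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

variants = parse_meta(meta_text)

# A variant is monomorphic when one of its allele frequencies is zero.
monomorphic = [
    v["RSID"] for v in variants
    if min(float(v["A_allele_freq"]), float(v["B_allele_freq"])) == 0.0
]

# Sum of the two allele frequencies for each variant.
freq_sums = {
    v["RSID"]: float(v["A_allele_freq"]) + float(v["B_allele_freq"])
    for v in variants
}
```

In this excerpt rs2 is monomorphic, matching the description of the example data. The frequencies of rs3 sum to 0.9 rather than 1, which would be consistent with 500 of the 5,000 samples having missing genotype data at the third variant.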

Extraction of LD information

LDstore implements several ways to extract LD information from BCOR files. Below are a few examples.

References

Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).