Christian Benner

FINEMAP-ing articles

- Refining fine-mapping: effect sizes and regional heritability. bioRxiv. (2018).
- Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
- FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

FINEMAP is a program for

1. identifying causal SNPs
2. estimating effect sizes of causal SNPs
3. estimating the heritability contribution of causal SNPs

in genomic regions associated with complex traits and diseases. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore an ideal tool for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.

Download

(license)

finemap_v1.3.1_MacOSX.tgz (Mac OS X)
finemap_v1.3.1_x86_64.tgz (Unix)
Updated 19-Oct-2018
(1) Credible sets

finemap_v1.3_MacOSX.tgz (Mac OS X)
finemap_v1.3_x86_64.tgz (Unix)
Updated 06-July-2018
(1) BGEN support (e.g. for UK Biobank data)
(2) Stepwise conditional search
(3) Group-wise SNP probabilities
Mac OS X users: If you see "dyld: Library not loaded: /usr/local/lib/libzstd.1.dylib", install Zstandard.

finemap_v1.2_MacOSX.tgz (Mac OS X)
finemap_v1.2_x86_64.tgz (Unix)
Documentation

Be aware that FINEMAP v1.1 cannot handle large effect size regions!
finemap_v1.1_MacOSX.tgz (Mac OS X)
finemap_v1.1_x86_64.tgz (Unix)
Documentation

To receive email reminders about updates of FINEMAP, send an email to finemap@christianbenner.com.

Command-line arguments

--cond	Fine-mapping with stepwise conditional search	Subprogram
--config	Evaluate a single causal configuration without performing shotgun stochastic search	Subprogram
--corr-config	Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold	Default is 0.95
--corr-group	Option to set the threshold for grouping a pair of SNPs with absolute correlation above this threshold	Default is 0.99
--dataset	Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2)	All datasets are processed by default
--flip-beta	Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations	With --cond, --config and --sss
--group-snps	Option to group SNPs on the basis of their correlations	With --cond and --sss
--help	Command-line help
--in-files	Master file (see below)	With --cond, --config and --sss
--log	Option to write output to log files specified in column 'log' in the master file	No log files are written by default
--n-causal-snps	Option to set the maximum number of allowed causal SNPs	Default is 5
--n-configs-top	Option to set the number of top causal configurations to be saved	Default is 50000
--n-convergence	Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-tol) before the shotgun stochastic search is terminated	Default is 1000
--n-iterations	Option to set the maximum number of iterations before the shotgun stochastic search is terminated	Default is 100000
--prior-k	Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file	SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0	Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself	Default is 0.0
--prior-std	Option to specify a comma-separated list of prior standard deviations of effect sizes	Default is 0.05
--prob-tol	Option to set the tolerance at which the added probability mass (over --n-convergence iterations) is considered small enough to terminate the shotgun stochastic search	Default is 0.001
--rsids	Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below)	With --config
--sss	Fine-mapping with shotgun stochastic search	Subprogram

Input

(1) Master file

The master file is a semicolon-separated text file that contains no spaces. It consists of a header line with the column names listed below, followed by one dataset per line.

z column contains the names of Z files (input)
ld column contains the names of LD files (input)
bgen column contains the names of BGEN files (input)
bgi column contains the names of BGI files (input)
dose column contains the names of DOSE files (output)
sample column contains the names of SAMPLE files (input)
incl column contains the names of INCL files (input)
snp column contains the names of SNP files (output)
config column contains the names of CONFIG files (output)
cred column contains the names of CRED files (output)
n_samples column contains the GWAS sample sizes
k column contains the optional K files (optional input)
log column contains the optional LOG files (optional output)

File extensions must correspond with the column names in the header line!
The master file can contain columns ld, bgen, bgi and dose simultaneously. For each dataset per line, entries need to be specified for precomputed SNP correlations in column ld or for BGEN support using all three columns bgen, bgi and dose. If a line contains entries in all four columns, then precomputed SNP correlations are used.
Entries in columns sample and incl need to be specified if the GWAS sample size in column n_samples is smaller than the number of samples in the BGEN file.
A master file with two datasets using precomputed SNP correlations could look as follows.

z;ld;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363

A master file with two datasets using precomputed SNP correlations in the first dataset and BGEN support in the second dataset could look as follows.

z;ld;bgen;bgi;dose;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;;;;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;;dataset2.bgen;dataset2.bgi;dataset2.dose;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363

A master file with one dataset using BGEN support and a subset of 5,000 samples could look as follows.

z;bgen;bgi;dose;sample;incl;snp;config;cred;log;n_samples
dataset.z;dataset.bgen;dataset.bgi;dataset.dose;dataset.sample;dataset.incl;dataset.snp;dataset.config;dataset.cred;dataset.log;5000
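The master file layout described above can also be generated programmatically. The following is a minimal sketch, assuming precomputed SNP correlations and file names derived from a common prefix; the helper names master_rows and render_master are hypothetical and not part of FINEMAP.

```python
import csv
import io

# Column order follows the first master file example (precomputed SNP correlations).
COLUMNS = ["z", "ld", "snp", "config", "cred", "log", "n_samples"]


def master_rows(prefixes, n_samples):
    """Build one master-file row per dataset prefix.

    File extensions are derived from the column names, because the
    extensions must correspond with the names in the header line.
    """
    rows = []
    for prefix in prefixes:
        row = {col: f"{prefix}.{col}" for col in COLUMNS if col != "n_samples"}
        row["n_samples"] = str(n_samples)
        rows.append(row)
    return rows


def render_master(rows):
    """Render the rows as semicolon-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


master_text = render_master(master_rows(["dataset1", "dataset2"], 5363))
print(master_text, end="")
```

Writing master_text to a file (for example one named data, as in the examples further below) gives a file that can be passed to --in-files.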

(2) Z file

The dataset.z file is a space-delimited text file and contains the GWAS summary statistics, one SNP per line. It contains the following mandatory columns in this order.

rsid column contains the SNP identifiers. The identifier can be an rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy)
chromosome column contains the chromosome names. The chromosome names can be chosen freely with precomputed SNP correlations (e.g. 'X', '0X' or 'chrX')
position column contains the base pair positions
allele1 column contains the "first" allele of the SNPs. In SNPTEST this corresponds to 'allele_A', whereas BOLT-LMM uses 'ALLELE1'
allele2 column contains the "second" allele of the SNPs. In SNPTEST this corresponds to 'allele_B', whereas BOLT-LMM uses 'ALLELE0'
maf column contains the minor allele frequencies
beta column contains the estimated effect sizes as given by GWAS software
se column contains the standard errors of effect sizes as given by GWAS software
flip optional column - see below

Columns beta and se are required for fine-mapping. Column maf is needed to output posterior effect size estimates on the allelic scale. All other columns are not required for computations and can be specified arbitrarily.
When using BGEN support, entries for each SNP in columns rsid, chromosome, position, allele1 and allele2 need to correspond with the information in BGEN files. The chromosome column may have to contain '0X' for X = 1,...,9, where X is the chromosome number, to correspond to the information in the BGEN file. Listing SNPs in a BGEN file with the BGENIX software outputs columns first_allele and alternative_alleles, whereas QCTOOL uses allele_A and allele_B as column names. These columns correspond with FINEMAP's allele1 and allele2 columns.
It is recommended to compute all SNP correlations from allele counts of one of the alleles. In this case, estimated effect sizes and their standard errors from GWAS software can be used directly if the software always codes the same allele as the effect allele. This is the case in software SNPTEST (uses 'allele_B' as the effect allele) and BOLT-LMM (uses 'ALLELE1' as the effect allele). However, if the GWAS software codes the minor allele of the SNPs as the effect allele, then the direction of estimated effect sizes needs to be flipped to either the first or the second allele. This can be done by specifying the --flip-beta command-line argument and augmenting dataset.z by a flip column which contains 1 in a line if the direction of the estimated effect size of the SNP needs to be flipped and 0 otherwise.
SNPs do not have to be ordered by genomic position and can reside on different chromosomes. However, the order of SNPs in dataset.z must correspond to the order of SNPs in dataset.ld!
A dataset.z file with three SNPs could look as follows.

rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
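A Z file in this layout can be checked before running FINEMAP. The sketch below parses the example, verifies the mandatory column order and computes z-scores as beta / se. The function parse_z and its flip_beta parameter are hypothetical, and negating beta for lines with flip equal to 1 is an assumption about what flipping the direction means; this is not FINEMAP's own parser.

```python
Z_COLUMNS = ["rsid", "chromosome", "position", "allele1", "allele2", "maf", "beta", "se"]

Z_TEXT = """\
rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
"""


def parse_z(text, flip_beta=False):
    """Parse space-delimited Z file text into a list of dicts with a z-score."""
    lines = text.strip().splitlines()
    header = lines[0].split(" ")
    if header[: len(Z_COLUMNS)] != Z_COLUMNS:
        raise ValueError("mandatory columns are missing or out of order")
    if flip_beta and "flip" not in header:
        raise ValueError("flip_beta requires a 'flip' column")
    records = []
    for number, line in enumerate(lines[1:], start=1):
        fields = line.split(" ")
        if len(fields) != len(header):
            raise ValueError(f"SNP line {number} has {len(fields)} fields")
        record = dict(zip(header, fields))
        beta, se = float(record["beta"]), float(record["se"])
        # Assumption: flipping the direction of an effect means negating beta.
        if flip_beta and record["flip"] == "1":
            beta = -beta
        record["z"] = beta / se
        records.append(record)
    return records


records = parse_z(Z_TEXT)
for record in records:
    print(record["rsid"], round(record["z"], 3))
```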

(3) LD file

The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).

Ideally, the SNP correlation matrix is computed from the genotype data on the same samples from which the GWAS summary statistics originate. Read here what could happen if SNP correlations from reference genotypes (e.g. 1000 Genomes Project) do not match well with the GWAS summary statistics.
With imputed biobank-scale genotype data, it is important to compute SNP correlations from the same genotype data used in GWAS software. Read here for an example highlighting the importance of computing SNP correlations from the same dosage data used in GWAS software. For example, if GWAS summary statistics are generated with BOLT-LMM using SNP dosages (e.g. when used with BGEN files), then SNP correlations need to be computed from the same SNP dosage data. The same applies to SNPTEST when using the -method expected option to deal with genotype uncertainty. If GWAS summary statistics are computed from SNP dosage data using BGEN files, we recommend using the LDstore software or FINEMAP's BGEN support to compute SNP correlations, and advise against converting genotype probabilities to best-guess genotypes in order to compute SNP correlations.
The order of the SNPs in the dataset.ld must correspond to the order of variants in dataset.z.
A dataset.ld file with three SNPs could look as follows.

1.00 0.95 0.98
0.95 1.00 0.96
0.98 0.96 1.00
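Because a Pearson correlation matrix is square and symmetric with a unit diagonal, an LD file can be sanity-checked before use. The following sketch assumes the matrix fits in memory; parse_ld and check_ld are hypothetical helpers and not part of FINEMAP.

```python
def parse_ld(text):
    """Parse space-delimited LD file text into a list of rows of floats."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]


def check_ld(matrix, n_snps, tol=1e-6):
    """Return a list of problems found; an empty list means the checks passed."""
    if len(matrix) != n_snps or any(len(row) != n_snps for row in matrix):
        return [f"expected a {n_snps} x {n_snps} matrix"]
    problems = []
    for i in range(n_snps):
        if abs(matrix[i][i] - 1.0) > tol:
            problems.append(f"diagonal entry {i + 1} is not 1")
        for j in range(i + 1, n_snps):
            if abs(matrix[i][j]) > 1.0 + tol:
                problems.append(f"entry ({i + 1},{j + 1}) is outside [-1, 1]")
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(f"entries ({i + 1},{j + 1}) and ({j + 1},{i + 1}) differ")
    return problems


good = parse_ld("1.00 0.95 0.98\n0.95 1.00 0.96\n0.98 0.96 1.00\n")
bad = parse_ld("1.00 0.95 0.98\n0.95 1.00 0.96\n0.90 0.96 1.00\n")
print(check_ld(good, n_snps=3))
print(check_ld(bad, n_snps=3))
```

Here n_snps would be the number of SNP lines in the corresponding dataset.z file, since the two files must list SNPs in the same order.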

(4) BGEN, BGI, SAMPLE and INCL files

These are Oxford file formats and described here (BGEN), here (BGI) and here (SAMPLE). The dataset.incl file is a text file to restrict estimation of SNP correlations to genotype data from a subset of samples in dataset.sample. It contains one sample ID per line.

FINEMAP supports the BGEN format up to v1.3.
Genotype data from the UK Biobank are available in this format.

(5) Optional K file

By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities p_k = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.

We assume that the genomic region includes at least one causal SNP and thus p₀ = 0. A non-zero prior probability p₀ that there is no causal SNP in the genomic region can be specified with the command-line argument --prior-k0. This value is only used when computing posterior probabilities p_k|data = Pr(# of causal SNPs is k | data) but not during fine-mapping itself. We further assume that p_k = 0 for k = K + 1,...,m, where m is the number of SNPs in the dataset.z file.
A dataset.k file allowing for three causal SNPs with p₁ = 0.6, p₂ = 0.3 and p₃ = 0.1 would look as follows.

0.6 0.3 0.1
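The normalization of K file entries can be illustrated as follows. This is a minimal sketch of the rule stated above (non-negative entries, normalized to sum to one); read_k_priors is a hypothetical helper, not FINEMAP code.

```python
def read_k_priors(text):
    """Map k = 1,...,K to normalized prior probabilities p_k."""
    values = [float(x) for x in text.split()]
    if not values:
        raise ValueError("the K file is empty")
    if any(v < 0 for v in values):
        raise ValueError("prior probabilities must be non-negative")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one prior probability must be positive")
    return {k: v / total for k, v in enumerate(values, start=1)}


# Unnormalized weights 6, 3 and 1 give the same prior as "0.6 0.3 0.1".
priors = read_k_priors("6 3 1")
print(priors)
```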

Output

(1) SNP file

The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.

index column contains the line numbers in which SNPs appear in the dataset.z file
rsid, chromosome, position, allele1 and allele2 columns are the SNP identifiers from the Z file
maf column contains the minor allele frequencies as given in the Z file
beta column contains the estimated effect sizes as given in the Z file
se column contains the standard errors of effect size estimates as given in the Z file
z column contains the z-scores
prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal.
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence that the l th SNP is causal with log₁₀ Bayes factors greater than 2 reporting considerable evidence
group column contains the group number that the SNP belongs to
corr_group column contains the correlation with the marginally most significant SNP among SNPs in the same group with this SNP
prob_group column contains the posterior probability that there is at least one causal signal among SNPs in the same group with this SNP
log10bf_group column contains the log₁₀ Bayes factors for quantifying the evidence that there is at least one causal signal among SNPs in the same group with the SNP. Bayes factors greater than 2 report considerable evidence
mean column contains the marginalized shrinkage estimates of the posterior effect size mean for the same allele as in column beta. The marginalized shrinkage estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the dataset.config file assuming that the effect size of the l th SNP is zero if the SNP is absent from a causal configuration
sd column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean
mean_incl column contains the conditional estimates of the posterior effect size mean for the same allele as in column beta. The conditional estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in the dataset.config file in which it is included
sd_incl column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean

The PIPs in column prob are computed by summing up the posterior probabilities of all causal configurations in the dataset.config file in which the l th SNP is included. The PIPs sum to 1.0 if the maximum number of allowed causal SNPs is set to 1 with the --n-causal-snps command-line argument.
In the presence of strong correlations, evidence for a causal signal is split among the strongly correlated SNPs. In such cases, it is recommended to consider the posterior probability in column prob_group that there is at least one causal signal among a group of strongly correlated SNPs. SNPs with the same entry in column group constitute a group of SNPs with absolute correlations greater than the threshold set by the --corr-group command-line argument (default is 0.99). A SNP whose entries in columns index and group are identical is the marginally most significant SNP in its group.
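The relation between causal configurations and the per-SNP columns prob, mean and mean_incl can be sketched with a toy example. The configurations and numbers below are invented for illustration. Weighting each configuration by its posterior probability when averaging effect size means is an assumption of this sketch, and summarize is a hypothetical helper rather than FINEMAP's implementation.

```python
# Each toy configuration: (SNPs in the configuration, posterior probability,
# joint posterior effect size means). The probabilities sum to one.
configs = [
    (("rs1",), 0.50, {"rs1": 0.04}),
    (("rs2",), 0.20, {"rs2": 0.03}),
    (("rs1", "rs2"), 0.30, {"rs1": 0.05, "rs2": 0.01}),
]


def summarize(configs):
    """Compute PIP, marginalized mean and conditional mean for each SNP."""
    pip = {}
    weighted_mean = {}
    for snps, prob, means in configs:
        for snp in snps:
            # PIP: sum of the probabilities of configurations containing the SNP.
            pip[snp] = pip.get(snp, 0.0) + prob
            # A SNP absent from a configuration contributes an effect size of zero.
            weighted_mean[snp] = weighted_mean.get(snp, 0.0) + prob * means[snp]
    return {
        snp: {
            "prob": pip[snp],
            "mean": weighted_mean[snp],
            # Conditional on inclusion: renormalize by the PIP.
            "mean_incl": weighted_mean[snp] / pip[snp],
        }
        for snp in pip
    }


summary = summarize(configs)
for snp, values in summary.items():
    print(snp, values)
```

In this toy example the PIPs sum to 1.3 rather than 1.0 because one configuration contains two SNPs.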

(2) CONFIG file

The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.

rank column contains the ranking
config column contains the SNP identifiers
prob column contains the posterior probabilities that configurations are the causal configuration
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence for a causal configuration over the null configuration (no SNPs are causal)
odds column contains the odds of the top causal configurations
h2 column contains the heritability contribution of SNPs
h2_0.95CI column contains the 95% credible interval of the heritability contribution of SNPs
mean column contains the joint posterior effect size means
sd column contains the joint posterior effect size standard deviations

(3) CRED file

The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal conditional on other causal signals in the genomic region together with conditional posterior inclusion probabilities for each variant. More detailed information TBA.

(4) LOG file

The dataset.log file outputs additional information. It contains the following output.

Posterior probabilities p_k|data = Pr(# of causal SNPs is k | data) for k = 1,...,K, where K is the maximum number of allowed causal SNPs
A log₁₀ Bayes factor to quantify the evidence of at least one causal SNP in the genomic region
Model-averaged heritability and 95% credible interval to quantify the contribution from causal SNPs

Credible sets do not account for correlations between SNPs. To quantify the number of causal signals in the genomic region, it is recommended to use the posterior probabilities p_k|data. The posterior probabilities account for correlations between SNPs and can be summarized as the expected number of causal SNPs, ∑_k p_k|data × k.
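The expected number of causal SNPs follows directly from the posterior probabilities reported in the LOG file. A minimal sketch with invented posterior probabilities (expected_n_causal is a hypothetical helper):

```python
def expected_n_causal(posterior):
    """Return sum over k of p_k|data * k for a dict mapping k to p_k|data."""
    total = sum(posterior.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("posterior probabilities must sum to one")
    return sum(k * p for k, p in posterior.items())


# Invented values for illustration: Pr(k = 1), Pr(k = 2), Pr(k = 3) given the data.
posterior = {1: 0.25, 2: 0.5, 3: 0.25}
expected = expected_n_causal(posterior)
print(expected)
```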

(5) DOSE file

The dataset.dose file is a binary file with allele dosage data. A DOSE file contains the following information.

Header

Bytes	Description
7	Magic number (dose1.0)
4	Unsigned integer indicating the length L_{BGEN_filename} of the BGEN filename in bytes
L_{BGEN_filename}	Name of the BGEN file
8	Unsigned integer indicating the size S_{BGEN_file} of the BGEN file in bytes
Min(1000, S_{BGEN_file})	First bytes of the BGEN file
8	Unsigned integer indicating the size S_{DOSE_file} of the DOSE file in bytes
4	Unsigned integer indicating the number of samples N_Samples included from the BGEN file
4	Unsigned integer indicating the number of SNPs N_SNPs included from the BGEN file

SNP identifiers from Z file (sequence of N_SNPs blocks)

Bytes	Description
4	Unsigned integer indicating the length L_Block of the SNP identifier block in bytes
4	Unsigned integer indicating the line in which the SNP appears in the Z file
2	Unsigned integer indicating the length L_rsid of the entry in column rsid of the Z file in bytes
L_rsid	Entry in column rsid of the Z file
4	Unsigned integer indicating the entry in column position of the Z file
2	Unsigned integer indicating the length L_chromosome of the entry in column chromosome of the Z file in bytes
L_chromosome	Entry in column chromosome of the Z file
4	Unsigned integer indicating the length L_allele1 of the entry in column allele1 of the Z file in bytes
L_allele1	Entry in column allele1 of the Z file
4	Unsigned integer indicating the length L_allele2 of the entry in column allele2 of the Z file in bytes
L_allele2	Entry in column allele2 of the Z file

L_Block = 20 + L_rsid + L_chromosome + L_allele1 + L_allele2 number of bytes for the SNP identifier block

Dosage data offsets

Bytes	Description
4	Unsigned integer indicating the length L_Block of the dosage data offset block in bytes
8 × N_SNPs	Unsigned integers indicating the start position of dosage data for each SNP

Dosage data (sequence of N_SNPs blocks)

Bytes	Description
2 × N_Samples	Half-precision floating-point numbers representing standardized allele dosages with respect to the allele in column allele2 of the Z file
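The header layout above can be read with Python's struct module. The sketch below builds a small synthetic header in memory and parses it again. Little-endian byte order and UTF-8 encoding of the file name are assumptions, since neither is stated above; parse_dose_header and build_example_header are hypothetical helpers.

```python
import struct


def parse_dose_header(buf):
    """Parse the DOSE header fields from bytes; return (fields, bytes consumed)."""
    offset = 0
    magic = buf[offset:offset + 7]
    offset += 7
    if magic != b"dose1.0":
        raise ValueError("not a DOSE file: unexpected magic number")
    # Assumption: all integers are stored little-endian ("<").
    (name_len,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    bgen_name = buf[offset:offset + name_len].decode("utf-8")
    offset += name_len
    (bgen_size,) = struct.unpack_from("<Q", buf, offset)
    offset += 8
    n_first = min(1000, bgen_size)
    bgen_first_bytes = buf[offset:offset + n_first]
    offset += n_first
    dose_size, n_samples, n_snps = struct.unpack_from("<QII", buf, offset)
    offset += 16
    fields = {
        "bgen_name": bgen_name,
        "bgen_size": bgen_size,
        "bgen_first_bytes": bgen_first_bytes,
        "dose_size": dose_size,
        "n_samples": n_samples,
        "n_snps": n_snps,
    }
    return fields, offset


def build_example_header():
    """Create a synthetic header that pretends the BGEN file is 12 bytes long."""
    name = b"dataset.bgen"
    first_bytes = b"\x00" * 12
    return (
        b"dose1.0"
        + struct.pack("<I", len(name)) + name
        + struct.pack("<Q", len(first_bytes)) + first_bytes
        + struct.pack("<QII", 4096, 5363, 50)
    )


example = build_example_header()
header, consumed = parse_dose_header(example)
print(header["bgen_name"], header["n_samples"], header["n_snps"], consumed)
```

The standardized dosages themselves are half-precision floats, which struct can decode with the "e" format character.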

Fine-mapping example

Using genotype data with 50 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain z-scores. SNP correlations were computed from GWAS genotype data.

Fine-mapping the SNPs in genomic region 1 in the example folder using shotgun stochastic search is done as follows.

./finemap_v1.3_MacOSX --sss --in-files example/data --dataset 1

./finemap_v1.3_x86_64 --sss --in-files example/data --dataset 1

Fine-mapping the SNPs in genomic region 2 in the example folder using stepwise conditional search is done as follows.

./finemap_v1.3_MacOSX --cond --in-files example/data --dataset 2

./finemap_v1.3_x86_64 --cond --in-files example/data --dataset 2

The stepwise conditional search starts with a causal configuration that contains only the SNP with the lowest P-value. It then iteratively adds the SNP that yields the highest posterior model probability, and stops when no further SNP yields a higher posterior model probability.
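The greedy logic of a stepwise search of this kind can be sketched generically. The code below is an illustration only and not FINEMAP's implementation: the score function stands in for the posterior model probability of a configuration, and toy_scores, p_values and the max_causal cap are hypothetical.

```python
def stepwise_search(snps, p_values, score, max_causal=5):
    """Greedy forward search over causal configurations.

    Starts from the SNP with the lowest P-value and keeps adding the SNP
    that gives the highest score until no addition improves the score.
    """
    current = [min(snps, key=lambda s: p_values[s])]
    best = score(tuple(current))
    while len(current) < max_causal:
        candidates = [s for s in snps if s not in current]
        if not candidates:
            break
        top_score, top_snp = max((score(tuple(current + [s])), s) for s in candidates)
        if top_score <= best:
            break  # no further SNP yields a higher score
        current.append(top_snp)
        best = top_score
    return current, best


# Toy scores standing in for posterior model probabilities.
toy_scores = {
    frozenset({"rs1"}): 0.20,
    frozenset({"rs1", "rs2"}): 0.55,
    frozenset({"rs1", "rs3"}): 0.30,
    frozenset({"rs1", "rs2", "rs3"}): 0.40,
}


def toy_score(config):
    return toy_scores.get(frozenset(config), 0.0)


p_values = {"rs1": 1e-9, "rs2": 1e-4, "rs3": 1e-6}
selected, best_score = stepwise_search(["rs1", "rs2", "rs3"], p_values, toy_score)
print(selected, best_score)
```

In this toy example the search selects rs1 and rs2 and stops, because adding rs3 lowers the score.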

Single causal configuration example

The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows.

./finemap_v1.3_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.3_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11

References

Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).

Acknowledgements

Matti Pirinen contributed to the design and implementation of FINEMAP.