FINEMAP-ing articles
- Refining fine-mapping: effect sizes and regional heritability. bioRxiv (2018).
- Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
- FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
FINEMAP is a program for
- (1) identifying causal SNPs
- (2) estimating effect sizes of causal SNPs
- (3) estimating the heritability contribution of causal SNPs
in genomic regions associated with complex traits and disease. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore the ideal tool for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.
Download
(license)
- finemap_v1.3.1_MacOSX.tgz (Mac OS X)
- finemap_v1.3.1_x86_64.tgz (Unix)
- Updated 19-Oct-2018
- (1) Credible sets
- finemap_v1.3_MacOSX.tgz (Mac OS X)
- finemap_v1.3_x86_64.tgz (Unix)
- Updated 06-July-2018
- (1) BGEN support (e.g. for UK biobank data)
- (2) Stepwise conditional search
- (3) Group-wise SNP probabilities
- Mac OS X users: If you see the error dyld: Library not loaded: /usr/local/lib/libzstd.1.dylib, install Zstandard.
- finemap_v1.2_MacOSX.tgz (Mac OS X)
- finemap_v1.2_x86_64.tgz (Unix)
- Documentation
- Be aware that FINEMAP v1.1 cannot handle large effect size regions!
- finemap_v1.1_MacOSX.tgz (Mac OS X)
- finemap_v1.1_x86_64.tgz (Unix)
- Documentation
- To receive email reminders about updates of FINEMAP, send an email to finemap@christianbenner.com.
Command-line arguments
--cond | Fine-mapping with stepwise conditional search | Subprogram
--config | Evaluate a single causal configuration without performing shotgun stochastic search | Subprogram
--corr-config | Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold | Default is 0.95
--corr-group | Option to set the threshold for grouping a pair of SNPs with absolute correlation above this threshold | Default is 0.99
--dataset | Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2) | All datasets are processed by default
--flip-beta | Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations | With --cond, --config and --sss
--group-snps | Option to group SNPs on the basis of their correlations | With --cond and --sss
--help | Command-line help |
--in-files | Master file (see below) | With --cond, --config and --sss
--log | Option to write output to log files specified in column 'log' in the master file | No log files are written by default
--n-causal-snps | Option to set the maximum number of allowed causal SNPs | Default is 5
--n-configs-top | Option to set the number of top causal configurations to be saved | Default is 50000
--n-convergence | Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-tol) before the shotgun stochastic search is terminated | Default is 1000
--n-iterations | Option to set the maximum number of iterations before the shotgun stochastic search is terminated | Default is 100000
--prior-k | Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file | SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0 | Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself | Default is 0.0
--prior-std | Option to specify a comma-separated list of prior standard deviations of effect sizes | Default is 0.05
--prob-tol | Option to set the tolerance at which the added probability mass (over --n-convergence iterations) is considered small enough to terminate the shotgun stochastic search | Default is 0.001
--rsids | Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below) | With --config
--sss | Fine-mapping with shotgun stochastic search | Subprogram
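The --corr-config rule can be illustrated with a small sketch. This is not FINEMAP's implementation; the function name config_allowed, the 0-based SNP indices and the toy correlation matrix are assumptions made for illustration only.

```python
from itertools import combinations

def config_allowed(config, ld, corr_config=0.95):
    """Return False if any SNP pair in config has |r| above corr_config.

    config is a tuple of 0-based SNP indices (hypothetical representation),
    ld is a square list-of-lists correlation matrix.
    """
    return all(abs(ld[i][j]) <= corr_config for i, j in combinations(config, 2))

# Toy correlation matrix, not taken from any real dataset.
ld = [
    [1.00, 0.97, 0.10],
    [0.97, 1.00, 0.12],
    [0.10, 0.12, 1.00],
]
print(config_allowed((0, 1), ld))  # SNPs 0 and 1 are correlated at 0.97
print(config_allowed((0, 2), ld))  # SNPs 0 and 2 are correlated at 0.10
```

With the default threshold of 0.95, the first configuration would receive posterior probability zero and the second would be kept.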
Input
(1) Master file
The master file is a semicolon-separated text file and contains no spaces. It contains the following mandatory column names and one dataset per line.
z column contains the names of Z files (input)
ld column contains the names of LD files (input)
bgen column contains the names of BGEN files (input)
bgi column contains the names of BGI files (input)
dose column contains the names of DOSE files (output)
sample column contains the names of SAMPLE files (input)
incl column contains the names of INCL files (input)
snp column contains the names of SNP files (output)
config column contains the names of CONFIG files (output)
cred column contains the names of CRED files (output)
n_samples column contains the GWAS sample sizes
k column contains the optional K files (optional input)
log column contains the optional LOG files (optional output)
File extensions must correspond with the column names in the header line!
The master file can contain columns ld, bgen, bgi and dose simultaneously. For each dataset per line, entries need to be specified for precomputed SNP correlations in column ld or for BGEN support using all three columns bgen, bgi and dose. If a line contains entries in all four columns, then precomputed SNP correlations are used.
Entries in columns sample and incl need to be specified if the GWAS sample size in column n_samples is smaller than the number of samples in the BGEN file.
A master file with two datasets using precomputed SNP correlations could look as follows.
z;ld;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363
A master file with two datasets using precomputed SNP correlations in the first dataset and BGEN support in the second dataset could look as follows.
z;ld;bgen;bgi;dose;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;;;;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;;dataset2.bgen;dataset2.bgi;dataset2.dose;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363
A master file with one dataset using BGEN support and a subset of 5,000 samples could look as follows.
z;bgen;bgi;dose;sample;incl;snp;config;cred;log;n_samples
dataset2.z;dataset2.bgen;dataset2.bgi;dataset2.dose;dataset.sample;dataset.incl;dataset.snp;dataset.config;dataset.cred;dataset.log;5000
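A master file of the first kind (precomputed SNP correlations only) can also be generated programmatically. The following is a minimal sketch; the helper names master_rows and render_master are hypothetical, and the file names simply follow the "prefix.column" pattern, which satisfies the rule that file extensions match the column names.

```python
import csv
import io

# Column order taken from the precomputed-correlation example above.
COLUMNS = ["z", "ld", "snp", "config", "cred", "log", "n_samples"]

def master_rows(prefixes, n_samples):
    """Build one master-file row per dataset prefix (placeholder file names)."""
    rows = []
    for prefix in prefixes:
        row = {col: f"{prefix}.{col}" for col in COLUMNS if col != "n_samples"}
        row["n_samples"] = str(n_samples)
        rows.append(row)
    return rows

def render_master(rows):
    """Render rows as semicolon-separated text and reject any spaces."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, delimiter=";", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    text = buf.getvalue()
    if " " in text:
        raise ValueError("master file must not contain spaces")
    return text

master_text = render_master(master_rows(["dataset1", "dataset2"], 5363))
print(master_text, end="")
```

Writing master_text to a file gives an input for --in-files; datasets that use BGEN support would need the additional bgen, bgi and dose columns.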
(2) Z file
The dataset.z file is a space-delimited text file and contains the GWAS summary statistics one SNP per line. It contains the mandatory column names in the following order.
rsid column contains the SNP identifiers. The identifier can be a rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy)
chromosome column contains the chromosome names. The chromosome names can be chosen freely with precomputed SNP correlations (e.g. 'X', '0X' or 'chrX')
position column contains the base pair positions
allele1 column contains the "first" allele of the SNPs. In SNPTEST this corresponds to 'allele_A', whereas BOLT-LMM uses 'ALLELE1'
allele2 column contains the "second" allele of the SNPs. In SNPTEST this corresponds to 'allele_B', whereas BOLT-LMM uses 'ALLELE0'
maf column contains the minor allele frequencies
beta column contains the estimated effect sizes as given by GWAS software
se column contains the standard errors of effect sizes as given by GWAS software
flip optional column - see below
Columns beta and se are required for fine-mapping. Column maf is needed to output posterior effect size estimates on the allelic scale. All other columns are not required for computations and can be specified arbitrarily.
When using BGEN support, entries for each SNP in columns rsid, chromosome, position, allele1 and allele2 need to correspond with the information in BGEN files. The chromosome column may have to contain '0X' for X = 1,...,9, where X is the chromosome number, to correspond to the information in the BGEN file. Listing SNPs in a BGEN file with the BGENIX software outputs columns first_allele and alternative_alleles, whereas QCTOOL uses allele_A and allele_B as column names. These columns correspond with FINEMAP's allele1 and allele2 columns.
It is recommended to compute all SNP correlations from allele counts of one of the alleles. In this case, estimated effect sizes and their standard errors from GWAS software can be used directly if the software always codes the same allele as the effect allele. This is the case in SNPTEST (uses 'allele_B' as the effect allele) and BOLT-LMM (uses 'ALLELE1' as the effect allele). However, if the GWAS software codes the minor allele of the SNPs as the effect allele, then the direction of estimated effect sizes needs to be flipped to either the first or the second allele. This can be done by specifying the --flip-beta command-line argument and augmenting dataset.z with a flip column that contains 1 in a line if the direction of the estimated effect size of the SNP needs to be flipped and 0 otherwise.
SNPs do not have to be ordered by genomic position and can reside on different chromosomes. However, the order of SNPs in dataset.z must correspond to the order of SNPs in dataset.ld!
A dataset.z file with three SNPs could look as follows.
rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
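Before running FINEMAP it can help to check that a Z file has the mandatory columns in the required order. The sketch below does this and also derives z-scores. It assumes the usual definition z = beta / se and assumes that "flipping" means negating beta; the function name parse_z is hypothetical and none of this reproduces FINEMAP's own parser.

```python
Z_COLUMNS = ["rsid", "chromosome", "position", "allele1", "allele2", "maf", "beta", "se"]

def parse_z(text):
    """Parse Z-file text and attach a z-score to each SNP record."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    extra = header[len(Z_COLUMNS):]
    # Only the optional 'flip' column may follow the mandatory columns.
    if header[:len(Z_COLUMNS)] != Z_COLUMNS or extra not in ([], ["flip"]):
        raise ValueError(f"unexpected header: {header}")
    records = []
    for number, ln in enumerate(lines[1:], start=1):
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(f"SNP line {number}: expected {len(header)} fields")
        rec = dict(zip(header, fields))
        beta, se = float(rec["beta"]), float(rec["se"])
        if rec.get("flip") == "1":
            beta = -beta  # assumption: flipping negates the effect size
        rec["z"] = beta / se  # assumption: z-score is beta divided by se
        records.append(rec)
    return records

example = """rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
"""
records = parse_z(example)
for rec in records:
    print(rec["rsid"], round(rec["z"], 3))
```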
(3) LD file
The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).
Ideally, the SNP correlation matrix is computed from the genotype data on the same samples from which the GWAS summary statistics originate. Read here what could happen if SNP correlations from reference genotypes (e.g. 1000 Genomes Project) do not match well with the GWAS summary statistics.
With imputed biobank-scale genotype data, it is important to compute SNP correlations from the same genotype data used in GWAS software. Read here for an example highlighting the importance of computing SNP correlations from the same dosage data used in GWAS software. For example, if GWAS summary statistics are generated with BOLT-LMM using SNP dosages (e.g. when used with BGEN files), then SNP correlations need to be computed from the same SNP dosage data. The same applies to SNPTEST when using the -method expected option to deal with genotype uncertainty. If GWAS summary statistics are computed from SNP dosage data using BGEN files, we recommend using the LDstore software or FINEMAP's BGEN support to compute SNP correlations, and advise against converting genotype probabilities to best-guess genotypes in order to compute SNP correlations.
The order of the SNPs in the dataset.ld must correspond to the order of variants in dataset.z.
A dataset.ld file with three SNPs could look as follows.
1.00 0.95 0.98
0.95 1.00 0.96
0.97 0.96 1.00
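Because the LD file is a Pearson correlation matrix, it should be square with one row per SNP in dataset.z, symmetric, and have ones on the diagonal. The following sketch checks these properties; the function names and the tolerance are assumptions, and the matrix shown is a toy example.

```python
def parse_ld(text):
    """Parse space-delimited LD text into a list of float rows."""
    return [[float(x) for x in ln.split()] for ln in text.splitlines() if ln.strip()]

def check_ld(matrix, n_snps, tol=1e-8):
    """Return a list of human-readable problems (empty if none were found)."""
    if len(matrix) != n_snps or any(len(row) != n_snps for row in matrix):
        return [f"expected a {n_snps} x {n_snps} matrix"]
    problems = []
    for i in range(n_snps):
        if abs(matrix[i][i] - 1.0) > tol:
            problems.append(f"diagonal entry {i + 1} is not 1")
        for j in range(i + 1, n_snps):
            if abs(matrix[i][j]) > 1.0 + tol:
                problems.append(f"entry ({i + 1},{j + 1}) is outside [-1, 1]")
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(f"entries ({i + 1},{j + 1}) and ({j + 1},{i + 1}) differ")
    return problems

toy = parse_ld("1.00 0.50 0.20\n0.50 1.00 0.30\n0.20 0.30 1.00\n")
print(check_ld(toy, n_snps=3))
```

The n_snps argument would be the number of SNP lines in the matching Z file, which also guards against the two files having different lengths.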
(4) BGEN, BGI, SAMPLE and INCL file
These are Oxford file formats and are described here (BGEN), here (BGI) and here (SAMPLE). The dataset.incl file is a text file to restrict estimation of SNP correlations to genotype data from a subset of samples in dataset.sample. It contains one sample ID per line.
FINEMAP supports the BGEN format up to v1.3.
Genotype data from the UK biobank is available in this format.
(5) Optional K file
By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities pk = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.
We assume that the genomic region includes at least one causal SNP and thus p0 = 0. A non-zero prior probability p0 that there is no causal SNP in the genomic region can be specified with the command-line argument --prior-k0. This value is only used when computing posterior probabilities pk|data = Pr(# of causal SNPs is k | data) but not during fine-mapping itself. We further assume that pk = 0 for k = K + 1,...,m, where m is the number of SNPs in the dataset.z file.
A dataset.k file allowing for three causal SNPs with p1 = 0.6, p2 = 0.3 and p3 = 0.1 would look as follows.
0.6 0.3 0.1
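The normalization of K-file entries described above can be sketched as follows. The function name prior_from_k is hypothetical and the code only mirrors the stated rule (non-negative entries, normalized to sum to one); it is not FINEMAP's own reader.

```python
def prior_from_k(text):
    """Map k = 1,...,K to normalized prior probabilities from K-file text."""
    values = [float(x) for x in text.split()]
    if not values:
        raise ValueError("K file is empty")
    if any(v < 0 for v in values):
        raise ValueError("prior probabilities must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("at least one prior probability must be positive")
    return {k: v / total for k, v in enumerate(values, start=1)}

# Unnormalized weights 6:3:1 give the same prior as the example above.
prior = prior_from_k("6 3 1")
print(prior)
```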
Output
(1) SNP file
The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.
index column contains the line numbers in which SNPs appear in the dataset.z file
rsid, chromosome, position, allele1 and allele2 columns are the SNP identifiers from the Z file
maf column contains the minor allele frequencies as given in the Z file
beta column contains the estimated effect sizes as given in the Z file
se column contains the standard errors of effect size estimates as given in the Z file
z column contains the z-scores
prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal.
log10bf column contains the log10 Bayes factors. The Bayes factor quantifies the evidence that the l th SNP is causal with log10 Bayes factors greater than 2 reporting considerable evidence
group column contains the group number that the SNP belongs to
corr_group column contains the correlation with the marginally most significant SNP among SNPs in the same group with this SNP
prob_group column contains the posterior probability that there is at least one causal signal among SNPs in the same group with this SNP
log10bf_group column contains the log10 Bayes factors for quantifying the evidence that there is at least one causal signal among SNPs in the same group with the SNP. log10 Bayes factors greater than 2 report considerable evidence
mean column contains the marginalized shrinkage estimates of the posterior effect size mean for the same allele as in column beta. The marginalized shrinkage estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the dataset.config file assuming that the effect size of the l th SNP is zero if the SNP is absent from a causal configuration
sd column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean
mean_incl column contains the conditional estimates of the posterior effect size mean for the same allele as in column beta. The conditional estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in the dataset.config file in which it is included
sd_incl column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean
The PIPs in column prob are computed by summing up the posterior probabilities of all causal configurations in the dataset.config file in which the l th SNP is included. The PIPs sum to 1.0 if the maximum number of allowed causal SNPs is set to 1 with the --n-causal-snps command-line argument.
In the presence of strong correlations, evidence for a causal signal is split among strongly correlated SNPs. In such cases, it is recommended to consider the posterior probability in column prob_group that there is at least one causal signal among a group of strongly correlated SNPs. SNPs with the same entry in column group constitute a group of SNPs with absolute correlations greater than the specified --corr-group command-line argument (default is 0.99). SNPs with the same entry in column index and group are marginally most significant among SNPs in the same group with the SNP.
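The relationship between configuration probabilities and the per-SNP columns prob, mean and mean_incl can be sketched on a toy example. The sketch assumes that "averaging" means weighting each configuration by its posterior probability and that the listed configuration probabilities sum to one; the data structure and all numbers are invented for illustration.

```python
from collections import defaultdict

# Toy configurations: (SNPs in configuration, posterior probability,
# joint posterior effect size means). Values are made up.
configs = [
    (("rs1",), 0.50, {"rs1": 0.10}),
    (("rs2",), 0.25, {"rs2": 0.08}),
    (("rs1", "rs2"), 0.25, {"rs1": 0.06, "rs2": 0.04}),
]

def summarize(configs):
    """Compute PIP, marginalized mean and conditional mean per SNP."""
    pip = defaultdict(float)
    weighted_mean = defaultdict(float)
    for snps, prob, means in configs:
        for snp in snps:
            pip[snp] += prob                        # sum over configurations with the SNP
            weighted_mean[snp] += prob * means[snp]  # absent SNPs contribute zero
    return {
        snp: {
            "prob": pip[snp],
            "mean": weighted_mean[snp],                  # marginalized shrinkage estimate
            "mean_incl": weighted_mean[snp] / pip[snp],  # conditional on inclusion
        }
        for snp in pip
    }

summary = summarize(configs)
for snp, row in summary.items():
    print(snp, {key: round(value, 4) for key, value in row.items()})
```

Under these assumptions the marginalized mean equals the PIP times the conditional mean, which is why mean is always shrunk toward zero relative to mean_incl.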
(2) CONFIG file
The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.
rank column contains the ranking
config column contains the SNP identifiers
prob column contains the posterior probabilities that configurations are the causal configuration
log10bf column contains the log10 Bayes factors. The Bayes factor quantifies the evidence for a causal configuration over the null configuration (no SNPs are causal)
odds column contains the odds of the top causal configurations
h2 column contains the heritability contribution of SNPs
h2_0.95CI column contains the 95% credible interval of the heritability contribution of SNPs
mean column contains the joint posterior effect size means
sd column contains the joint posterior effect size standard deviations
(3) CRED file
The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal conditional on other causal signals in the genomic region together with conditional posterior inclusion probabilities for each variant. More detailed information TBA.
(4) LOG file
The dataset.log file outputs additional information. It contains the following output.
Posterior probabilities pk|data = Pr(# of causal SNPs is k | data) for k = 1,...,K, where K is the maximum number of allowed causal SNPs
A log10 Bayes factor to quantify the evidence of at least one causal SNP in the genomic region
Model-averaged heritability and 95% credible interval to quantify the contribution from causal SNPs
Credible sets do not account for correlations between SNPs. To quantify the number of causal signals in the genomic region, it is recommended to use the posterior probabilities pk|data . The posterior probabilities account for correlations between SNPs and can be summarized to the expected number of causal SNPs as ∑k pk|data × k .
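The expected number of causal SNPs is a probability-weighted sum over k. A minimal sketch, with hypothetical posterior probabilities rather than values from a real log file:

```python
def expected_causal(posterior_k):
    """Expected number of causal SNPs from a mapping k -> Pr(k | data)."""
    return sum(k * p for k, p in posterior_k.items())

# Hypothetical posterior probabilities for k = 1, 2, 3.
posterior_k = {1: 0.2, 2: 0.7, 3: 0.1}
expected = expected_causal(posterior_k)
print(round(expected, 2))
```

Here the sum is 0.2 × 1 + 0.7 × 2 + 0.1 × 3 = 1.9, suggesting roughly two causal signals in the region.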
(5) DOSE file
The dataset.dose file is a binary file with allele dosage data. A DOSE file contains the following information.
Header
SNP identifiers from Z file (sequence of NSNPs blocks)
Dosage data offsets
Dosage data (sequence of NSNPs blocks)
Header
Bytes | Description
7 | Magic number (dose1.0)
4 | Unsigned integer indicating the length LBGEN_filename of the BGEN filename in bytes
LBGEN_filename | Name of the BGEN file
8 | Unsigned integer indicating the size SBGEN_file of the BGEN file in bytes
Min(1000, SBGEN_file) | First bytes of the BGEN file
8 | Unsigned integer indicating the size SDOSE_file of the DOSE file in bytes
4 | Unsigned integer indicating the number of samples NSamples included from the BGEN file
4 | Unsigned integer indicating the number of SNPs NSNPs included from the BGEN file
SNP identifiers from Z file (one block per SNP)
Bytes | Description
4 | Unsigned integer indicating the length LBlock of the SNP identifier block in bytes
4 | Unsigned integer indicating the line in which the SNP appears in the Z file
2 | Unsigned integer indicating the length Lrsid of the entry in column rsid of the Z file in bytes
Lrsid | Entry in column rsid of the Z file
4 | Unsigned integer indicating the entry in column position of the Z file
2 | Unsigned integer indicating the length Lchromosome of the entry in column chromosome of the Z file in bytes
Lchromosome | Entry in column chromosome of the Z file
4 | Unsigned integer indicating the length Lallele1 of the entry in column allele1 of the Z file in bytes
Lallele1 | Entry in column allele1 of the Z file
4 | Unsigned integer indicating the length Lallele2 of the entry in column allele2 of the Z file in bytes
Lallele2 | Entry in column allele2 of the Z file
LBlock = 20 + Lrsid + Lchromosome + Lallele1 + Lallele2 is the number of bytes in the SNP identifier block
Dosage data offsets
Bytes | Description
4 | Unsigned integer indicating the length LBlock of the dosage data offset block in bytes
8 × NSNPs | Unsigned integers indicating the start position of dosage data for each SNP
Dosage data (one block per SNP)
Bytes | Description
2 × NSamples | Half-precision floating-point numbers representing standardized allele dosages with respect to the allele in column allele2 of the Z file
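The SNP identifier block layout can be made concrete with a round-trip sketch using Python's struct module. Two things are assumptions not stated in the format description: little-endian byte order and ASCII text encoding. The function names are hypothetical, and this is an illustration of the field layout rather than a tested reader for real DOSE files.

```python
import struct

def pack_snp_block(line, rsid, position, chromosome, allele1, allele2):
    """Pack one SNP identifier block (assumed little-endian, ASCII text)."""
    r, c, a1, a2 = (s.encode("ascii") for s in (rsid, chromosome, allele1, allele2))
    body = b"".join([
        struct.pack("<I", line),            # line of the SNP in the Z file
        struct.pack("<H", len(r)), r,       # Lrsid and rsid
        struct.pack("<I", position),        # position
        struct.pack("<H", len(c)), c,       # Lchromosome and chromosome
        struct.pack("<I", len(a1)), a1,     # Lallele1 and allele1
        struct.pack("<I", len(a2)), a2,     # Lallele2 and allele2
    ])
    # LBlock counts the bytes after the length field itself.
    return struct.pack("<I", len(body)) + body

def unpack_snp_block(buf, offset=0):
    """Unpack one block; return (record, offset of the next block)."""
    def read(fmt, pos):
        (value,) = struct.unpack_from(fmt, buf, pos)
        return value, pos + struct.calcsize(fmt)

    def read_text(length, pos):
        return buf[pos:pos + length].decode("ascii"), pos + length

    l_block, pos = read("<I", offset)
    start = pos
    line, pos = read("<I", pos)
    l_rsid, pos = read("<H", pos)
    rsid, pos = read_text(l_rsid, pos)
    position, pos = read("<I", pos)
    l_chr, pos = read("<H", pos)
    chromosome, pos = read_text(l_chr, pos)
    l_a1, pos = read("<I", pos)
    allele1, pos = read_text(l_a1, pos)
    l_a2, pos = read("<I", pos)
    allele2, pos = read_text(l_a2, pos)
    if pos - start != l_block:
        raise ValueError("block length does not match LBlock")
    record = {"line": line, "rsid": rsid, "position": position,
              "chromosome": chromosome, "allele1": allele1, "allele2": allele2}
    return record, pos

block = pack_snp_block(1, "rs1", 1, "10", "T", "C")
record, next_offset = unpack_snp_block(block)
print(len(block) - 4, record)
```

For this SNP, LBlock = 20 + 3 + 2 + 1 + 1 = 27 bytes, matching the formula in the table.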
Fine-mapping example
Using genotype data with 50 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain
Fine-mapping the SNPs in genomic region 1 in the example folder using shotgun stochastic search is done as follows.
./finemap_v1.3_MacOSX --sss --in-files example/data --dataset 1
./finemap_v1.3_x86_64 --sss --in-files example/data --dataset 1
Fine-mapping the SNPs in genomic region 2 in the example folder using stepwise conditional search is done as follows.
./finemap_v1.3_MacOSX --cond --in-files example/data --dataset 2
./finemap_v1.3_x86_64 --cond --in-files example/data --dataset 2
The stepwise conditional search starts with a causal configuration containing the SNP with the lowest P-value alone and then iteratively adds to the causal configuration the SNP giving the highest posterior model probability until no further SNP yields a higher posterior model probability.
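The greedy logic of the stepwise conditional search can be sketched as forward selection. This is only an outline of the idea, not FINEMAP's code: model_prob is a hypothetical callable standing in for the posterior model probability, the toy scores are invented, and capping the search at max_causal (by analogy with --n-causal-snps) is an assumption.

```python
def stepwise_search(snps, pvalues, model_prob, max_causal=5):
    """Greedy forward selection over causal configurations.

    model_prob takes a tuple of SNP identifiers and returns a score that
    plays the role of the posterior model probability (hypothetical).
    """
    # Start with the SNP that has the lowest P-value.
    config = [min(snps, key=lambda s: pvalues[s])]
    best = model_prob(tuple(config))
    while len(config) < max_causal:
        candidates = [s for s in snps if s not in config]
        if not candidates:
            break
        # Score every one-SNP extension of the current configuration.
        top_prob, top_snp = max((model_prob(tuple(config + [s])), s) for s in candidates)
        if top_prob <= best:
            break  # no further SNP yields a higher probability
        config.append(top_snp)
        best = top_prob
    return config, best

# Toy inputs, invented for illustration.
pvalues = {"rs1": 1e-12, "rs2": 1e-6, "rs3": 1e-4}
scores = {
    ("rs1",): 0.30,
    ("rs1", "rs2"): 0.50,
    ("rs1", "rs3"): 0.40,
    ("rs1", "rs2", "rs3"): 0.45,
}
toy_model_prob = lambda config: scores.get(tuple(sorted(config)), 0.0)

selected, prob = stepwise_search(list(pvalues), pvalues, toy_model_prob)
print(selected, prob)
```

In the toy example the search adds rs2 to rs1 and then stops, because adding rs3 would lower the score from 0.50 to 0.45.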
Single causal configuration example
The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows.
./finemap_v1.3_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.3_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11
References
Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).
Acknowledgements
Matti Pirinen contributed to the design and implementation of FINEMAP.