FINEMAP



FINEMAP-ing articles

- Refining fine-mapping: effect sizes and regional heritability. bioRxiv. (2018).
- Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
- FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

FINEMAP is a program for

in genomic regions associated with complex traits and disease. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore well suited for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.

Command-line arguments

--cond    Fine-mapping with stepwise conditioning    Subprogram
--cond-pvalue    Option to set the p-value threshold for declaring genome-wide significance    Default is 5 × 10^-8
--config    Evaluate a single causal configuration without performing shotgun stochastic search    Subprogram
--corr-config    Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold    Default is 0.95
--dataset    Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2)    All datasets are processed by default
--flip-beta    Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations    With --cond, --config and --sss
--force-n-samples    Option to allow correlations in a BCOR file to be computed on a set of samples with a different size than the GWAS sample size    With --cond, --config and --sss
--help    Command-line help
--in-files    Master file (see below)    With --cond, --config and --sss
--log    Option to write output to log files specified in column 'log' in the master file    No log files are written by default
--n-causal-snps    Option to set the maximum number of allowed causal SNPs    Default is 5
--n-configs-top    Option to set the number of top causal configurations to be saved    Default is 50000
--n-conv-sss    Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-conv-sss-tol) before the shotgun stochastic search is terminated    Default is 100
--n-iter    Option to set the maximum number of iterations before the shotgun stochastic search is terminated    Default is 100000
--n-threads    Option to set the number of parallel threads    Default is 1
--prior-k    Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file    SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0    Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself    Default is 0.0
--prior-snps    Option to read a column 'prob' in the Z file with prior probabilities that a SNP is causal in order to define the prior probability for each causal configuration    With --sss
--prior-std    Option to specify a comma-separated list of prior standard deviations of effect sizes    Default is 0.05
--prob-conv-sss-tol    Option to set the tolerance at which the added probability mass (over --n-conv-sss iterations) is considered small enough to terminate the shotgun stochastic search    Default is 0.001
--prob-cred-set    Option to set the probability at which the credible interval includes a causal SNP    Default is 0.95
--pvalue-snps    Option to set a p-value threshold at which SNPs are included    Default is 1.0
--rsids    Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below)    With --config
--sss    Fine-mapping with shotgun stochastic search    Subprogram
--std-effects    Option to print mean and standard deviation of the posterior effect size distribution for standardized dosages    Default is allele dosage
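
The rule behind --corr-config can be pictured with a small sketch. The function below is hypothetical and not part of FINEMAP; it only illustrates the rule as stated in the table: a causal configuration is given zero posterior probability when any pair of its SNPs has absolute correlation above the threshold. The correlation matrix and SNP indices are made up for illustration.

```python
from itertools import combinations

def config_allowed(config, ld, threshold=0.95):
    """Return False if any SNP pair in `config` has |r| above `threshold`.

    `config` is a list of SNP indices and `ld` a square correlation matrix
    (list of lists). Both are illustrative inputs, not FINEMAP data structures.
    """
    return all(abs(ld[i][j]) <= threshold for i, j in combinations(config, 2))

# Toy 3-SNP correlation matrix (made-up values).
ld = [
    [1.00, 0.97, 0.10],
    [0.97, 1.00, 0.20],
    [0.10, 0.20, 1.00],
]

print(config_allowed([0, 1], ld))  # False: SNPs 0 and 1 are correlated at 0.97
print(config_allowed([0, 2], ld))  # True
```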

Input

(1) Master file

The master file is a semicolon-separated text file and must not contain any spaces. It contains the following mandatory column names and one dataset per line.
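
The layout rules (semicolon-separated, no spaces, one dataset per line) can be checked with a short sketch such as the one below. The function name `read_master` is hypothetical, and the column names and paths in the example text are placeholders chosen for illustration only; they are not a statement of which columns FINEMAP requires.

```python
import csv
import io

def read_master(text):
    """Parse semicolon-separated master-file text into one dict per dataset."""
    if " " in text:
        raise ValueError("master file must not contain spaces")
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return list(reader)

# Placeholder column names and paths, for illustration only.
example = "z;ld;n_samples\nexample/data.z;example/data.ld;5363\n"
datasets = read_master(example)
print(len(datasets))       # 1 dataset
print(datasets[0]["ld"])   # example/data.ld
```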

(2) Z file

The dataset.z file is a space-delimited text file and contains the GWAS summary statistics one SNP per line. It contains the mandatory column names in the following order.

(3) LD file

The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).
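
Before fine-mapping it can be useful to sanity-check an LD file. The sketch below is not part of FINEMAP; it assumes the file holds one matrix row per line and checks the properties any Pearson correlation matrix must have: it is square and symmetric, with ones on the diagonal and entries in [-1, 1]. The function names and the example values are made up.

```python
def read_ld(text):
    """Parse space-delimited LD text into a list of rows of floats."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]

def check_ld(ld, tol=1e-8):
    """Check that `ld` is square and symmetric, with unit diagonal and |r| <= 1."""
    n = len(ld)
    if any(len(row) != n for row in ld):
        return False
    for i in range(n):
        if abs(ld[i][i] - 1.0) > tol:
            return False
        for j in range(i):
            # Symmetry makes it sufficient to range-check the lower triangle.
            if abs(ld[i][j] - ld[j][i]) > tol or abs(ld[i][j]) > 1.0 + tol:
                return False
    return True

ld = read_ld("1.00 0.50 0.20\n0.50 1.00 0.30\n0.20 0.30 1.00\n")
print(check_ld(ld))  # True
```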

(4) BCOR file

See here for BCOR file format description.

(5) Optional K file

By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities pk = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.
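
The normalization described above can be sketched as follows. This is an illustration rather than FINEMAP's own code: the function name is hypothetical, the weights are made up, and writing the values on a single space-delimited line is an assumption about the K file layout.

```python
def normalize_k_priors(weights):
    """Normalize non-negative weights p_1..p_K so that they sum to one."""
    if any(w < 0 for w in weights):
        raise ValueError("prior probabilities must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one prior probability must be positive")
    return [w / total for w in weights]

# Unnormalized weights for k = 1, 2, 3 causal SNPs (made-up values).
priors = normalize_k_priors([6, 3, 1])
print(priors)  # [0.6, 0.3, 0.1]

# Space-delimited values, as they might be written to a dataset.k file.
print(" ".join(str(p) for p in priors))
```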

Output

(1) SNP file

The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.

(2) CONFIG file

The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.

(3) CRED file

The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal in the genomic region. For each credible set, the following posterior summaries are provided

CRED files are generated for those cases of k causal SNPs in the genomic region that have largest posterior probability. For specific k, FINEMAP takes the k-SNP causal configuration with highest posterior probability and then asks, for the l th SNP in that set, which are the other candidates that could possibly replace that SNP in this causal configuration. The l th credible set shows the best candidate SNPs and their posterior probability of being in a k-SNP causal configuration that additionally contains k - 1 SNPs. Note that the k - 1 SNPs are chosen to have highest posterior probability in their credible set.
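
As a rough illustration of what a credible set at a given probability level means, the sketch below uses the common generic construction: rank the candidate SNPs for one signal by posterior probability and keep adding them until the cumulative (normalized) probability reaches the level. This is not FINEMAP's exact procedure, which conditions on the other SNPs in the top configuration as described above; the function name and the probabilities are made up.

```python
def credible_set(probs, level=0.95):
    """Smallest top-ranked set of SNPs whose normalized probabilities reach `level`.

    `probs` maps SNP identifier -> posterior probability for one signal.
    """
    total = sum(probs.values())
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, cumulative = [], 0.0
    for snp, p in ranked:
        chosen.append(snp)
        cumulative += p / total
        if cumulative >= level:
            break
    return chosen

# Made-up probabilities for four candidate SNPs.
probs = {"rs1": 0.70, "rs2": 0.20, "rs3": 0.07, "rs4": 0.03}
print(credible_set(probs))  # ['rs1', 'rs2', 'rs3']
```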

(4) LOG file

The dataset.log file outputs additional information. It contains the following output.

Fine-mapping example

Using genotype data with 55 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain z-scores. SNP correlations were computed from GWAS genotype data.

Single causal configuration example

The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows.

./finemap_v1.4_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.4_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11

References

Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).

Acknowledgements

Matti Pirinen contributed to the design and implementation of FINEMAP.

LDstore



LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.

LDstore is a program for

  1. compressing sequencing data
  2. converting genotype probabilities to dosage data
  3. computing SNP correlations
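
To make the second item concrete: a dosage is the expected allele count implied by the three genotype probabilities of a biallelic SNP. The sketch below is a generic illustration, not LDstore code; the function name is hypothetical and the choice of which allele is counted is an assumption.

```python
def dosage(p_aa, p_ab, p_bb):
    """Expected count of allele B given genotype probabilities.

    Counting allele B (rather than A) is an assumption made for this sketch.
    """
    return p_ab + 2.0 * p_bb

print(dosage(0.0, 1.0, 0.0))  # 1.0 for a certain heterozygote
print(dosage(0.1, 0.3, 0.6))  # roughly 1.5
```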

Command-line arguments

--bcor-to-text    Convert BCOR file to a text file    Subprogram
--bdose-version    Option to set the BDOSE file version (see below)    With --write-bdose
--compression    Option to specify the compression level (see below) of a BDOSE/BCOR file as 'ultra-low' (1 byte), 'low' (2 bytes), 'medium' (4 bytes) or 'high' (8 bytes)    Default is medium. With --write-bcor and --write-bdose
--dataset    Option to specify a delimiter-separated list of datasets as given in the master file (e.g. 1,2 or 1|2)    All datasets are processed by default
--in-files    Master file (see below)    With --bcor-to-text, --write-bcor, --write-bdose and --write-text
--memory    Option to limit the amount of memory in gigabytes that can be used during computation of SNP correlations    Default is 1 GB. With --read-bdose or --write-bdose when using either --write-bcor or --write-text
--n-threads    Option to set the number of parallel threads    Default is 1. With --write-bcor, --write-bdose or --write-text
--read-bdose    Read dosage data from a BDOSE file    With --write-bcor or --write-text
--read-only-bgen    Read genotype probabilities from a BGEN file and store dosage data in memory    With --write-bcor or --write-text
--rsids    Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in a Z file (see below)    With --bcor-to-text or --write-text
--sample-miss    Option to set a missing data threshold between 0 and 1. If the missing data rate for a SNP is above the specified threshold, then the correlation of any SNP pair that includes this SNP is set to NA. If the missing data rate for a SNP is below the specified threshold, then missing data is mean-imputed    Default is 0.1. With --write-bcor or --write-text
--write-bcor    Write SNP correlations to a BCOR file    Subprogram
--write-bdose    Write dosage data to a BDOSE file    Subprogram and with --write-bcor or --write-text
--write-text    Write SNP correlations to a text file    Subprogram
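
The behaviour described for --sample-miss can be sketched in a few lines. This is an illustration under stated assumptions, not LDstore's implementation: missing dosages are represented as None, all function names are hypothetical, the data are made up, and the case where the missing rate exactly equals the threshold (which the description leaves open) is treated here as "not above".

```python
from math import sqrt

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def mean_impute(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corr_with_missing(x, y, sample_miss=0.1):
    """Return None (NA) if either SNP exceeds the threshold, else r after mean imputation."""
    if missing_rate(x) > sample_miss or missing_rate(y) > sample_miss:
        return None
    return pearson(mean_impute(x), mean_impute(y))

snp_a = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, None]        # 1 of 12 missing
snp_b = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2, 0, 1]           # complete
snp_c = [0, None, 2, None, 0, 2, None, 0, 1, 2, 0, 1]  # 3 of 12 missing

print(corr_with_missing(snp_a, snp_b))  # a correlation computed after imputation
print(corr_with_missing(snp_c, snp_b))  # None, since snp_c is above the threshold
```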

Input

(1) Master file

The master file is a semicolon-separated text file and must not contain any spaces. It contains the following mandatory column names and one dataset per line.

(2) Z file

The dataset.z file is a space-delimited text file and contains meta information about the SNPs one SNP per line. It contains the mandatory column names in the following order.

(3) BGEN, BGI, SAMPLE and INCL file

These are Oxford file formats and described here (BGEN), here (BGI) and here (SAMPLE). The dataset.incl file is a text file to specify inclusion of samples in any processing. It contains one sample ID per line.

Output

(1) BCOR v1.1 file

BCOR v1.1 files are binary files that store SNP correlations together with information about the SNPs in the same file for fast lookups. BCOR v1.1 files can be used with FINEMAP v1.4 and also include correlations for more SNPs than will be fine-mapped.

The BCOR v1.1 file format is described here.

(2) BDOSE v1.0 file

BDOSE v1.0 files are binary files meant for speeding up one-time computations of SNP correlations in a genomic region when memory is limited. LDstore converts genotype probabilities from a BGEN file to dosage data and writes that data in floating-point format to a BDOSE v1.0 file (possibly in parallel). I/O speedups are achieved by memory-mapping the BDOSE v1.0 file, and memory limitations are satisfied by computing SNP correlations in a block-wise fashion.

The BDOSE v1.0 file format is described here.

(3) BDOSE v1.1 file

BDOSE v1.1 files are binary files and meant for compressing sequencing data and storing whole-chromosome dosage data. LDstore 1) converts genotype probabilities from a BGEN file to dosage data, 2) converts dosage data from floating-point format to integer format, 3) compresses dosage data in integer format according to the Zstandard compression algorithm, and 4) writes the compressed dosage data to a BDOSE v1.1 file (possibly in parallel). Memory limitations are satisfied by computing SNP correlations in a block-wise fashion.

The BDOSE v1.1 file format is described here.
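
Steps 2) and 3) above, converting floating-point dosages to integers and compressing them, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear mapping of dosages in [0, 2] onto unsigned integers, the function names, and the example values are hypothetical, and zlib from the Python standard library stands in for Zstandard. The actual encoding is defined by the BDOSE v1.1 file format.

```python
import struct
import zlib

def quantize(dosages, n_bytes=1):
    """Map dosages in [0, 2] to unsigned integers that fit in `n_bytes` bytes."""
    max_int = 2 ** (8 * n_bytes) - 1
    return [round(d / 2.0 * max_int) for d in dosages]

def dequantize(codes, n_bytes=1):
    """Invert `quantize` up to rounding error."""
    max_int = 2 ** (8 * n_bytes) - 1
    return [c * 2.0 / max_int for c in codes]

dosages = [0.0, 1.0, 2.0, 0.37, 1.92]                  # made-up values
codes = quantize(dosages, n_bytes=2)
packed = struct.pack("<%dH" % len(codes), *codes)      # 2 bytes per value, little-endian
compressed = zlib.compress(packed)                     # zlib as a stand-in for Zstandard

restored = dequantize(codes, n_bytes=2)
max_error = max(abs(a - b) for a, b in zip(dosages, restored))
print(len(packed), max_error)  # 10 bytes; error bounded by the quantization step
```

Using fewer bytes per value shrinks the file at the cost of a larger rounding error, which is the trade-off the --compression levels expose.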

(4) LD file

LD files are space-delimited text files that contain SNP correlation matrices. An LD file with three SNPs could look as follows.

1.00 0.95 0.98
0.95 1.00 0.96
0.98 0.96 1.00

Examples

BGEN to BDOSE v1.1 file conversion

Genotype data with 55 SNPs and 5363 individuals in BGEN format can be converted to dosage data in BDOSE v1.1 format as follows.

./ldstore_v2.0_x86_64 --in-files example/data --write-bdose --bdose-version 1.1

Dosage data in the BDOSE v1.1 file can be compressed by using

./ldstore_v2.0_x86_64 --in-files example/data --write-bdose --bdose-version 1.1 --compression low

Computation of SNP correlations

SNP correlations for the same data as in the example above can be computed and written to a BCOR v1.1 file. There are several options for 1) storing intermediate dosage data in memory, 2) writing dosage data first to a BDOSE file, or 3) reading from an existing BDOSE file.

./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --read-only-bgen
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --write-bdose --bdose-version 1.0
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --write-bdose --bdose-version 1.1
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --read-bdose

SNP correlations for a subset of SNPs can be computed and written to an LD file by specifying, after --rsids, either a comma-delimited list of SNP identifiers or a text file with one SNP identifier per line. The same options for handling dosage data apply as in the example above.

./ldstore_v2.0_x86_64 --in-files example/data --write-text --read-only-bgen --rsids rs30,rs11
./ldstore_v2.0_x86_64 --in-files example/data --write-text --read-only-bgen --rsids rsids.txt

BCOR v1.1 to LD file conversion

SNP correlations in a BCOR v1.1 file can be extracted and written to an LD file as follows.

./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text
./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text --rsids rs30,rs11
./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text --rsids rsids.txt

Python library

The LDstore library for Python 2.7 contains functions for reading files from LDstore as well as limited functionality for computing SNP correlations.

Installation

pip install ldstore

BDOSE v1.0 example

>>> from ldstore.bdose import bdose

>>> myBdose = bdose( 'example/data_v1.0.bdose' )

>>> myBdose.getFname()
'example/data_v1.0.bdose'

>>> myBdose.getFsize()
2363184

>>> myBdose.getMeta().loc[ range( 5 ) ]
   rsid  position chromosome allele1 allele2
0   rs1       1.0         01       A       G
1   rs2       2.0         01       A       G
2   rs3       3.0         01       A       G
3   rs4       4.0         01       A       G
4   rs5       5.0         01       A       G

>>> myBdose.getMissingness()[ range( 5 ) ]
array([0, 0, 0, 0, 0], dtype=uint32)

>>> myBdose.getNumOfSNPs()
55

>>> myBdose.getNumOfSamples()
5363

>>> myBdose.getOffsets()[ range( 5 ) ]
array([ 3464, 46368, 89272, 132176, 175080], dtype=uint64)

>>> myBdose.readDosages( [ 29, 10 ] )[ 0, : ]
array([-1.50769106, -1.57917029])

>>> myBdose.readDosages( [] )[ 0, [ 29, 10 ] ]
array([-1.50769106, -1.57917029])

>>> myBdose.computeCorr( [ 29, 10 ] )
          0         1
0  1.000000 -0.082955
1 -0.082955  1.000000

>>> myBdose.computeCorr( [] ).loc[ 29, 10 ]
-0.0829552808503373

BDOSE v1.1 example

>>> from ldstore.bdose import bdose

>>> myBdose = bdose( 'example/data_v1.1.bdose' )

>>> myBdose.getFname()
'example/data_v1.1.bdose'

>>> myBdose.getFsize()
125054

>>> myBdose.getMeta().loc[ range( 5 ) ]
   rsid  position chromosome allele1 allele2
0   rs1       1.0         01       A       G
1   rs2       2.0         01       A       G
2   rs3       3.0         01       A       G
3   rs4       4.0         01       A       G
4   rs5       5.0         01       A       G

>>> myBdose.getNumOfSNPs()
55

>>> myBdose.getNumOfSamples()
5363

>>> myBdose.getOffsets()[ range( 5 ) ]
array([45066, 46430, 48122, 49810, 51508], dtype=uint64)

>>> myBdose.getSamples()[ 0 : 5 ]
['1', '2', '3', '4', '5']

>>> myBdose.computeMAF( [ 29, 10 ] )
array([0.13499907, 0.43921313])

>>> myBdose.computeMAF( [] )[ [ 29, 10 ] ]
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele1( [ 29, 10 ] )
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele1( [] )[ [ 29, 10 ] ]
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele2( [ 29, 10 ] )
array([0.86500093, 0.56078687])

>>> myBdose.computeFrqAllele2( [] )[ [ 29, 10 ] ]
array([0.86500093, 0.56078687])

>>> myBdose.readDosages( [ 29, 10 ] )[ 0, : ]
array([1., 0.])

>>> myBdose.readDosages( [] )[ 0, [ 29, 10 ] ]
array([1., 0.])

>>> myBdose.computeCorr( [ 29, 10 ] )
          0         1
0  1.000000 -0.082955
1 -0.082955  1.000000

>>> myBdose.computeCorr( [] ).loc[ 29, 10 ]
-0.0829552808503373

BCOR v1.1 example

>>> from ldstore.bcor import bcor

>>> myBcor = bcor( 'example/data.bcor' )

>>> myBcor.getFname()
'example/data.bcor'

>>> myBcor.getFsize()
7723

>>> myBcor.getMeta().loc[ range( 5 ) ]
   rsid  position chromosome allele1 allele2
0   rs1       1.0         01       A       G
1   rs2       2.0         01       A       G
2   rs3       3.0         01       A       G
3   rs4       4.0         01       A       G
4   rs5       5.0         01       A       G

>>> myBcor.getNumOfSNPs()
55

>>> myBcor.getNumOfSamples()
5363

>>> myBcor.readCorr( [ 29, 10 ] )[ 29, : ]
array([ 1. , -0.0829553])

>>> myBcor.readCorr( [] )[ 29, 10 ]
-0.08295530080795288

References

Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).