In (Dash et al., 2021; Al Hajri et al., 2020), when iterating through all 2-hit combinations (out of \(G\) genes), the two outer loops over \(i\) and \(j\) are “flattened” into a single \(\lambda\) loop (\(\lambda \gets 1\ldots\binom{G}{2}\)). To reconstruct \(i\) and \(j\) from \(\lambda\), the following formulas are used:

\[\begin{align} j &= \lfloor \sqrt{1/4 + 2 \lambda} + 1/2 \rfloor \\ i &= \lambda - j (j - 1) / 2 \end{align}\]

Let’s try to derive those formulas.

In the expression for \(i\) it is easy to spot the formula \(S_{j-1} := \sum_{t=1}^{j-1} t = \frac{j (j - 1)}{2}\) for the sum of all positive integers up to \(j-1\), which leads us to:

\[\lambda = i + \frac{j(j - 1)}{2} = i + \sum_{t=1}^{j-1} t\]

The same \(S_{j-1}\) formula is also present in the expression for \(j\) (we begin by removing the \(\lfloor \cdot \rfloor\)):

\[\begin{align} j &= \sqrt{1/4 + 2 \lambda} + 1/2 \\ j - 1/2 &= \sqrt{1/4 + 2 \lambda} \\ (j - 1/2)^2 &= 1/4 + 2 \lambda \\ j^2 - j + 1/4 &= 1/4 + 2 \lambda \\ j^2 - j &= 2 \lambda \\ \lambda &= \frac{j (j-1)}{2} \end{align}\]

According to (Dash et al., 2021; Al Hajri et al., 2020) this flattened \(\lambda\) loop corresponds to the following \(i\) and \(j\) loop:

N = 5
count = 0
# Nested loop over all pairs with i < j
for i in range(N):
    for j in range(i + 1, N):
        count += 1
        print(count, ":", i, j)

If we implement the flattened loop, we see that this is only true in the sense that the set of visited combinations is the same; the order in which they are visited is different.

import math

N = 5
Nc2 = math.comb(N, 2)  # number of 2-combinations, 10 for N = 5
for L in range(Nc2):
    # Recover (i, j) from the flat index L
    j = math.floor(math.sqrt(0.25 + 2*L) + 0.5)
    i = L - j*(j-1)//2
    print(L, ":", i, j)
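The claim about "same set, different order" can be checked directly. The following is a minimal sketch, assuming a 0-based \(\lambda\) as in the code above; the helper name `unflatten` is introduced here for illustration only and does not come from the cited papers.

```python
import math

def unflatten(lam):
    """Map a 0-based flat index lam to a pair (i, j) with 0 <= i < j."""
    j = math.floor(math.sqrt(0.25 + 2 * lam) + 0.5)
    i = lam - j * (j - 1) // 2
    return i, j

N = 5
# Pairs in the order of the nested i, j loop
nested = [(i, j) for i in range(N) for j in range(i + 1, N)]
# Pairs in the order of the flattened lambda loop
flat = [unflatten(lam) for lam in range(math.comb(N, 2))]

print(set(nested) == set(flat))  # the same combinations are visited
print(nested == flat)            # but not in the same order
```

The nested loop fixes the smaller index \(i\) and advances \(j\), while the flattened loop fixes the larger index \(j\) and advances \(i\).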

So to generate the combinations in the same order as the initial \(i\), \(j\) loops, we need to traverse \(\lambda\) in reverse and mirror both indices:

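import math

N = 5
Nc2 = math.comb(N, 2)  # number of 2-combinations, 10 for N = 5
# Walk the flat index backwards and mirror both indices
for L in reversed(range(Nc2)):
    j = math.floor(math.sqrt(0.25 + 2*L) + 0.5)
    i = L - j*(j-1)//2
    print(L, ":", N - 1 - j, N - 1 - i)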
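As a sanity check, the reversed and mirrored mapping can be compared with `itertools.combinations`, which yields pairs in the same lexicographic order as the nested \(i\), \(j\) loop. This is only a sketch; the function name `nested_order_pair` and the 0-based position `k` are assumptions made for this example.

```python
import math
from itertools import combinations

def nested_order_pair(k, n):
    """Return the k-th pair (0-based) in the order of the nested i < j loop."""
    lam = math.comb(n, 2) - 1 - k  # walk the flat index backwards
    j = math.floor(math.sqrt(0.25 + 2 * lam) + 0.5)
    i = lam - j * (j - 1) // 2
    return n - 1 - j, n - 1 - i    # mirror both indices

N = 5
mapped = [nested_order_pair(k, N) for k in range(math.comb(N, 2))]
print(mapped == list(combinations(range(N), 2)))
```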

Alternatively, we can modify the original \(i\), \(j\) loop so that it matches the \(\lambda\) loop and the mathematical derivation:

N = 5
count = 0
# j is now the outer loop, and i runs over 0..j-1
for j in range(N):
    for i in range(j):
        count += 1
        print(count, ":", i, j)
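The loop above can be cross-checked against the \(\lambda\) formulas. The sketch below also replaces the floating-point square root with `math.isqrt`, using the identity \(\lfloor \sqrt{1/4 + 2\lambda} + 1/2 \rfloor = \lfloor (1 + \lfloor \sqrt{8\lambda + 1} \rfloor) / 2 \rfloor\). This integer-only variant is an assumption of this example, not something taken from the cited papers; it may be preferable when \(\lambda\) is too large for `math.sqrt` to be exact.

```python
import math

def unflatten_int(lam):
    """Integer-only variant of the lambda -> (i, j) formulas (0-based lam)."""
    # floor(sqrt(1/4 + 2*lam) + 1/2) == (1 + isqrt(8*lam + 1)) // 2
    j = (1 + math.isqrt(8 * lam + 1)) // 2
    i = lam - j * (j - 1) // 2
    return i, j

N = 5
lam = 0
for j in range(N):
    for i in range(j):
        # The loop position lam must map back to the current (i, j)
        assert unflatten_int(lam) == (i, j)
        lam += 1
print(lam == math.comb(N, 2))  # every flat index was visited exactly once
```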

Graphical representation

Because \(\lambda \mapsto \sqrt{1/4 + 2 \lambda} + 1/2\) is monotonically increasing, restoring the \(\lfloor \cdot \rfloor\) that we skipped in the calculations above means that \(j\) is the largest integer with \(\frac{j(j-1)}{2} \le \lambda\), that is, the largest integer such that \(\lambda = i + \frac{j(j - 1)}{2}\) for a non-negative integer \(i\). In the figure we can read \(j=4\): the blue dots represent \(\sum_{t=1}^{j-1} t\), and the red dots show the calculation of \(i = \lambda - \sum_{t=1}^{j-1} t\).
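As a worked example of this reading, take \(\lambda = 8\). This value is chosen purely for illustration and is not necessarily the one drawn in the figure:

\[\begin{align} j &= \lfloor \sqrt{1/4 + 2 \cdot 8} + 1/2 \rfloor = \lfloor \sqrt{16.25} + 1/2 \rfloor = \lfloor 4.53\ldots \rfloor = 4 \\ S_{3} &= \sum_{t=1}^{3} t = \frac{4 \cdot 3}{2} = 6 \\ i &= \lambda - S_{3} = 8 - 6 = 2 \end{align}\]

Indeed \(S_3 = 6 \le 8 < 10 = S_4\), so \(j = 4\) is the largest integer whose triangular number \(\frac{j(j-1)}{2}\) does not exceed \(\lambda\), and \(0 \le i < j\) as required.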

References

Scaling Out a Combinatorial Algorithm for Discovering Carcinogenic Gene Combinations to Thousands of GPUs

Sajal Dash, Qais Al-Hajri, Wu-chun Feng, and 2 more authors

In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021

Identifying Multi-Hit Carcinogenic Gene Combinations: Scaling up a Weighted Set Cover Algorithm Using Compressed Binary Matrix Representation on a GPU

Qais Al Hajri, Sajal Dash, Wu-chun Feng, and 2 more authors

Scientific Reports, Feb 2020

