Massive Algorithms: Rafal's Blog - Codility Genomic Range Query

A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?

The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 ≤ K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).

For example, consider string S = CAGCCTA and arrays P, Q such that:

    P[0] = 2    Q[0] = 4
    P[1] = 5    Q[1] = 5
    P[2] = 0    Q[2] = 6

The answers to these M = 3 queries are as follows:

The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.

The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.

The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.
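As a baseline for checking the example above, here is a minimal brute-force sketch that scans every queried range directly. It is not taken from the original post: the class name NaiveGenomicRangeQuery and the helpers impact and solve are hypothetical, and the approach takes O(N·M) time in the worst case, so it only serves as a reference for small inputs.

```java
import java.util.Arrays;

public class NaiveGenomicRangeQuery {
    // Maps a nucleotide letter to its impact factor (A=1, C=2, G=3, T=4).
    static int impact(char c) {
        switch (c) {
            case 'A': return 1;
            case 'C': return 2;
            case 'G': return 3;
            case 'T': return 4;
            default: throw new IllegalArgumentException("Unexpected nucleotide: " + c);
        }
    }

    // Answers each query by scanning s[p[k]..q[k]] (inclusive) for the minimum.
    static int[] solve(String s, int[] p, int[] q) {
        int[] result = new int[p.length];
        for (int k = 0; k < p.length; k++) {
            int min = 4; // largest possible impact factor
            for (int i = p[k]; i <= q[k]; i++) {
                min = Math.min(min, impact(s.charAt(i)));
            }
            result[k] = min;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] answers = solve("CAGCCTA", new int[] {2, 5, 0}, new int[] {4, 5, 6});
        System.out.println(Arrays.toString(answers)); // prints [2, 4, 1]
    }
}
```

The output matches the three answers worked out above, which makes this a convenient cross-check for any faster solution.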

The optimal idea is to keep, for every position in the target string, a prefix count of the occurrences of each letter from the set {A, C, G, T}. Then, to evaluate the minimal nucleotide between indices (a, b), we can compute the number of occurrences of each nucleotide in that range in O(1) time from two prefix values, and pick the smallest impact factor whose count is positive. This leads to a total running time of O(N+M).

public static int[] genome(String S, int[] P, int[] Q) {
    int len = S.length();
    // arr[i][j] will hold how many times nucleotide j (0=A, 1=C, 2=G, 3=T)
    // occurs in S[0..i], inclusive
    int[][] arr = new int[len][4];
    int[] result = new int[P.length];

    // mark the nucleotide found at each position
    for (int i = 0; i < len; i++) {
        char c = S.charAt(i);
        if (c == 'A') arr[i][0] = 1;
        if (c == 'C') arr[i][1] = 1;
        if (c == 'G') arr[i][2] = 1;
        if (c == 'T') arr[i][3] = 1;
    }
    // compute prefixes
    for (int i = 1; i < len; i++) {
        for (int j = 0; j < 4; j++) {
            arr[i][j] += arr[i-1][j];
        }
    }

    for (int i = 0; i < P.length; i++) {
        int x = P[i];
        int y = Q[i];

        // try nucleotides in increasing order of impact factor;
        // the first one present in S[x..y] is the answer
        for (int a = 0; a < 4; a++) {
            int sub = 0; // occurrences before position x
            if (x-1 >= 0) sub = arr[x-1][a];
            if (arr[y][a] - sub > 0) {
                result[i] = a+1;
                break;
            }
        }
    }
    return result;
}

http://luisramalho.com/genomic-range-query/

I did that by checking whether the difference between the prefix sum at the upper end of the range and the prefix sum at the lower end is above 0. If it is, a nucleotide of that impact factor must have occurred in the range, hence it’s our answer and we can break out of the loop.
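To make that difference check concrete, the following small sketch applies it to the sample string for a single nucleotide at a time. The class PrefixCountDemo and the helpers prefixCounts and countInRange are hypothetical names introduced only for illustration. It assumes a prefix array of length N + 1 where prefix[k] counts occurrences in s[0..k-1].

```java
public class PrefixCountDemo {
    // Builds prefix[k] = number of occurrences of target in s[0..k-1].
    static int[] prefixCounts(String s, char target) {
        int[] prefix = new int[s.length() + 1];
        for (int k = 1; k <= s.length(); k++) {
            prefix[k] = prefix[k - 1] + (s.charAt(k - 1) == target ? 1 : 0);
        }
        return prefix;
    }

    // Occurrences of the target in s[x..y], inclusive, from two prefix values.
    static int countInRange(int[] prefix, int x, int y) {
        return prefix[y + 1] - prefix[x];
    }

    public static void main(String[] args) {
        String s = "CAGCCTA";
        int[] a = prefixCounts(s, 'A');
        int[] c = prefixCounts(s, 'C');
        System.out.println(countInRange(a, 2, 4)); // prints 0: no A in "GCC"
        System.out.println(countInRange(c, 2, 4)); // prints 2: C occurs twice
    }
}
```

For the query (2, 4) the difference is 0 for A and positive for C, so C is the first nucleotide found when checking in increasing order of impact factor, and the answer is 2.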

https://github.com/luisramalho/codility/blob/master/codility/L03E2GenomicRangeQuery.java

// Assumed value: the constant is not defined in this excerpt (A, C, G, T).
private static final int NUMBER_OF_NUCLEOTIDES = 4;

public static int[] solution(final String s, final int[] p, final int[] q) {
    // Mark the position of each element
    int[][] a = new int[NUMBER_OF_NUCLEOTIDES][s.length()];
    for (int i = 0; i < s.length(); i++) {
        char ch = s.charAt(i);
        switch (ch) {
            case 'A':
                a[0][i]++;
                break;
            case 'C':
                a[1][i]++;
                break;
            case 'G':
                a[2][i]++;
                break;
            case 'T':
                a[3][i]++;
                break;
            default:
                break;
        }
    }

    // Compute prefix sum: prefixSum[j][k] counts nucleotide j in s[0..k-1]
    int[][] prefixSum = new int[NUMBER_OF_NUCLEOTIDES][s.length() + 1];
    for (int k = 1; k < s.length() + 1; k++) {
        for (int j = 0; j < NUMBER_OF_NUCLEOTIDES; j++) {
            prefixSum[j][k] = prefixSum[j][k - 1] + a[j][k - 1];
        }
    }

    // Count total: the first nucleotide present in s[x..y] has the minimal impact factor
    int[] m = new int[p.length];
    for (int i = 0; i < p.length; i++) {
        int x = p[i];
        int y = q[i];
        for (int j = 0; j < NUMBER_OF_NUCLEOTIDES; j++) {
            if (prefixSum[j][y + 1] - prefixSum[j][x] > 0) {
                m[i] = j + 1;
                break;
            }
        }
    }
    return m;
}

http://codesays.com/2014/solution-to-genomic-range-query-by-codility/

http://codility-lessons.blogspot.com/2014/07/lesson-3-genomicrangequery.html
