Massive Algorithms: Problem C. Welcome to Code Jam - Qualification Round 2009

Dashboard - Qualification Round 2009 - Google Code Jam

Problem

So you've registered. We sent you a welcoming email, to welcome you to code jam. But it's possible that you still don't feel welcomed to code jam. That's why we decided to name a problem "welcome to code jam." After solving this problem, we hope that you'll feel very welcome. Very welcome, that is, to code jam.
If you read the previous paragraph, you're probably wondering why it's there. But if you read it very carefully, you might notice that we have written the words "welcome to code jam" several times: 400263727 times in total. After all, it's easy to look through the paragraph and find a 'w'; then find an 'e' later in the paragraph; then find an 'l' after that, and so on. Your task is to write a program that can take any text and print out how many times that text contains the phrase "welcome to code jam".
To be more precise, given a text string, you are to determine how many times the string "welcome to code jam" appears as a sub-sequence of that string. In other words, find a sequence s of increasing indices into the input string such that the concatenation of input[s[0]], input[s[1]], ..., input[s[18]] is the string "welcome to code jam".
The result of your calculation might be huge, so for convenience we would only like you to find the last 4 digits.
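
To make the definition concrete, here is a minimal brute-force sketch that follows the statement literally: it tries every increasing sequence of indices and checks whether the selected characters spell the phrase. The names `brute_force_count` and `last_four_digits` are illustrative and not part of the problem. This approach is only usable on very short strings.

```python
from itertools import combinations

def brute_force_count(text, phrase):
    """Count increasing index sequences of `text` that spell `phrase`."""
    count = 0
    # combinations() yields index tuples in strictly increasing order.
    for indices in combinations(range(len(text)), len(phrase)):
        if all(text[i] == ch for i, ch in zip(indices, phrase)):
            count += 1
    return count

def last_four_digits(count):
    """Format a count as its last 4 digits, zero-padded."""
    return "%04d" % (count % 10000)
```

For example, `brute_force_count("wweell", "wel")` finds 2 * 2 * 2 = 8 occurrences, and `last_four_digits(400263727)` gives `"3727"`.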

Input

The first line of input gives the number of test cases, N. The next N lines of input contain one test case each. Each test case is a single line of text, containing only lower-case letters and spaces. No line will start with a space, and no line will end with a space.

Official Analysis: https://code.google.com/codejam/contest/90101/dashboard#s=a&a=2

The word we want to find is S = "welcome to code jam", in a long string T. In fact the solution is not very different when we want to find any S. It is actually illustrative to picture the cases for short words.

When S is just a single character, you only need to count how many times that character appears in T. If S = "xy" is a string of length 2, then instead of brute-forcing all possible pairs of positions, you can count in linear time by scanning T from left to right: for each occurrence of 'y', you need to know how many 'x's appeared before that 'y', and the answer is the sum of those counts.
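
The two-character case can be sketched in a few lines of Python. The function name `count_pairs` and its parameters are chosen here for illustration only.

```python
def count_pairs(text, first, second):
    """Count index pairs i < j with text[i] == first and text[j] == second."""
    seen_first = 0  # occurrences of `first` strictly before the current position
    total = 0
    for ch in text:
        # Handle `second` before updating the counter, so that a character
        # never pairs with itself when first == second.
        if ch == second:
            total += seen_first
        if ch == first:
            seen_first += 1
    return total
```

On `"xxyxy"` the first 'y' has two 'x's before it and the second 'y' has three, so `count_pairs("xxyxy", "x", "y")` returns 5.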

The general solution follows this pattern. Let us again use S = "welcome to code jam" as an example. The formal solution will be clear from the example; and you can always download the good solutions (with nice programming techniques) from the scoreboard.

So, let us define, for each position i in T, T⁽ⁱ⁾ to be the string consisting of the first i characters of T, and write

Dp[i,1]: How many times we can find "w" in T⁽ⁱ⁾?
Dp[i,2]: How many times we can find "we" in T⁽ⁱ⁾?
Dp[i,3]: How many times we can find "wel" in T⁽ⁱ⁾?
Dp[i,4]: How many times we can find "welc" in T⁽ⁱ⁾?
...
Dp[i,18]: How many times we can find "welcome to code ja" in T⁽ⁱ⁾?
Dp[i,19]: How many times we can find "welcome to code jam" in T⁽ⁱ⁾?

Assume Dp[i,j] has been computed for every j. Let us see how easily we can compute, say, Dp[i+1,4]:

If the (i+1)-th character of T is not 'c', then Dp[i+1,4] = Dp[i,4].
If the (i+1)-th character of T is 'c', then we can include all the "welc"s found in T⁽ⁱ⁾, as well as those "welc"s that end exactly on the (i+1)-th character, so Dp[i+1,4] = Dp[i,4] + Dp[i,3].

Finally, if n is the length of the text T, then Dp[n,19] is our answer.
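
The recurrence above can be written out as a full table. The following Python sketch is one way to do it, under two assumptions that the analysis does not spell out: an extra column Dp[i,0] = 1 (the empty prefix is found exactly once) serves as the base case, and every entry is reduced modulo 10000 because only the last 4 digits are required. The function name `count_subsequences` is illustrative.

```python
def count_subsequences(text, pattern="welcome to code jam", modulo=10000):
    """Return the number of times `pattern` occurs in `text` as a
    sub-sequence, reduced modulo `modulo`."""
    n, m = len(text), len(pattern)
    # dp[i][j]: ways to find pattern[:j] in the first i characters of text.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 1  # assumed base case: the empty prefix is found once
    for i in range(n):
        for j in range(1, m + 1):
            # Occurrences that do not use character i of the text.
            dp[i + 1][j] = dp[i][j]
            # Occurrences that end exactly on character i of the text.
            if text[i] == pattern[j - 1]:
                dp[i + 1][j] = (dp[i][j] + dp[i][j - 1]) % modulo
    return dp[n][m]
```

For instance, `count_subsequences("wweellccoommee to code qps jam")` returns 256: each of the seven doubled letters can be picked in two ways, and the space before "jam" in two ways. Row i+1 depends only on row i, so a single array updated from right to left is enough.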

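private void solve() {
    char[] pattern = "welcome to code jam".toCharArray();
    int patternN = pattern.length;
    int MODULO = 10000;

    // nextLine() and out are presumably helpers of the enclosing class (not shown).
    char[] line = nextLine().toCharArray();

    // d[j]: number of ways (mod 10000) to find the first j pattern
    // characters in the text read so far; the empty prefix is found once.
    int[] d = new int[patternN + 1];
    d[0] = 1;

    for (char c : line) {
        // Go from right to left so that d[i] still holds the count from before c.
        for (int i = patternN - 1; i >= 0; i--) {
            if (c == pattern[i]) {
                d[i + 1] += d[i];
                d[i + 1] %= MODULO;
            }
        }
    }

    // Pad the answer with leading zeros to exactly 4 digits.
    String answer = Integer.toString(d[patternN]);
    while (answer.length() < 4) {
        answer = "0" + answer;
    }
    out.print(' ');
    out.print(answer);
}
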
http://wilanw.blogspot.com/2009/09/google-code-jam-2009-qualification.html
Hence you're allowed to skip characters in T so long as you use all the characters in P in order (here P is the phrase and T is the text). This lends itself to a recursive solution in which we keep track of just two indexes: the index into P and the index into T. Together they tell us how far we are within a particular instance of the problem and which character of P we need to match next within T. There are two conditions under which the recursion ends: we have reached the end of P (i.e. this is a valid occurrence of P inside T), or we have reached the end of T but there are still characters of P to process (i.e. an invalid occurrence, as we haven't finished laying out the characters of P on T). We return 1 and 0 in the respective cases.
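
A minimal Python sketch of this plain recursion, without any caching, might look as follows. The name `count_naive` and its parameter names are illustrative; at each step it either skips the current character of T or, when it matches, uses it for the current character of P.

```python
def count_naive(text, pattern, t_idx=0, p_idx=0):
    """Count occurrences of `pattern` in `text` as a sub-sequence by
    plain recursion. Exponential time: only for tiny inputs."""
    if p_idx == len(pattern):
        return 1  # every character of the pattern has been placed
    if t_idx == len(text):
        return 0  # ran out of text with pattern characters left
    # Option 1: skip the current character of the text.
    total = count_naive(text, pattern, t_idx + 1, p_idx)
    # Option 2: use it, if it matches the next pattern character.
    if text[t_idx] == pattern[p_idx]:
        total += count_naive(text, pattern, t_idx + 1, p_idx + 1)
    return total
```

For example, `count_naive("wweell", "wel")` returns 8.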

This solution will time out on the large input, as the number of possible occurrences skyrockets (this should be evident from the problem statement, which counts 400263727 occurrences of "welcome to code jam" in one small paragraph). The simple optimisation is memoisation: cache the result for each pair of indexes so we don't re-compute sub-problems we have already solved. Another approach is to convert the recurrence into a bottom-up dynamic programming algorithm. A minor implementation note concerns keeping only the last 4 digits: the safest and easiest way is modular arithmetic, reducing every intermediate result modulo 10000. Note that it is not sufficient to do this only at the end (before outputting), as the calculations may already have overflowed the integer type, and taking that value modulo 10000 will just yield an incorrect answer.

#include <cstring>
#include <string>
using namespace std;

// memo cache; must be filled with -1 before each test case,
// e.g. memset(dp, -1, sizeof(dp))
int dp[511][21];
string lineR;      // used to keep track of the line we are given
string textR = "welcome to code jam";   // the phrase we are looking for

// number of ways (mod 10000) to match textR[textIdx..] inside lineR[lineIdx..]
int calc(int lineIdx, int textIdx) {
   // base cases
   if (textIdx >= (int)textR.size()) return 1;
   if (lineIdx >= (int)lineR.size()) return 0;
   // seen it before - return the previously computed value
   if (dp[lineIdx][textIdx] != -1) return dp[lineIdx][textIdx];
   dp[lineIdx][textIdx] = 0;
   for (int i = lineIdx; i < (int)lineR.size(); i++) {
      // valid transition - add all the combinations
      if (lineR[i] == textR[textIdx]) {
         dp[lineIdx][textIdx] = (dp[lineIdx][textIdx] + calc(i+1, textIdx+1)) % 10000;
      }
   }
   return dp[lineIdx][textIdx] % 10000;
}

http://stevekrenzel.com/articles/welcome_to_code_jam
http://codejamdaemon.blogspot.com/2012/03/qualification-round-2009-alien-language.html

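# Python 2 (print statement, xrange, stdin.next())
from sys import stdin
import numpy as np

S = 'welcome to code jam'
lS = len(S)
T = int(stdin.next().strip())
for t in xrange(1, T+1):
    L = stdin.next().strip()
    lL = len(L)
    # cache[l,s]: ways (mod 10000) to find S[s:] in L[l:]; -1 means not computed yet
    cache = -1 * np.ones((501,20), 'int')
    def count(l,s):
        if cache[l,s] == -1:
            if s==lS:
                # the whole phrase has been matched
                cache[l,s] = 1
            else:
                try:
                    # offset of the next occurrence of S[s] at or after position l
                    f = L[l:].index(S[s])
                    # either skip that occurrence or use it for S[s]
                    cache[l,s] = (count(l+f+1,s) + count(l+f+1,s+1)) % 10000
                except ValueError:
                    # S[s] does not occur in the rest of the line
                    cache[l,s] = 0
        return cache[l,s]
    print 'Case #%d: %0.4d' % (t, count(0,0))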

