Massive Algorithms: Retrieve String From Sampled Slices

Retrieve String From Sampled Slices | Bo's Blog
Given a string, such as 01001010101001101011, we can randomly sliced multiple substrings. Assume that during the slicing, due to some unexpected noises, some characters may flip(0->1 or 1->0). For example:

Position: 0123456789.........
String:   01001010101001101011

slice1:    1001110101000
slice2:       1010111001111
slice3:       10101
slice4:                1101011

In above sample, slice1 starts from position 1(assume that the index of a string begins with 0), slice2 starts from position 4, slice3 starts from 4, and slice4 starts from 13. In slice1, 0 flips to 1 at position 5 and 1 flips to 0 at position 13.
For one specific position in the original string, if it is 1, then the probability of flipping to 0 in a slicing is 0.1; and vice versa(i.e.Prob(0->1)=0.1).
The problem is: if we only have multiple slices(the length of each slice may vary) and their starting positions in the string, and we don't know the original string, given an arbitrary position in the original string, how can we calculate the probability that position is a 1?
Assume most positions will be covered at least once in slices, and we have following parameters:

p01=0.1; // Probability a '0' in string but flipped to a '1' in a slice  p10=0.1; // Probability a '1' in string but flipped to a '0' in a slice  p1=0.5;  // Prior probability that any given position in string is a '1'

We can also assume that the string is a random string of 0s and 1s, and during slicing, each position is sampled independently.
For the above example string and four slices, we already have following probabilities for each position:

2. Solution

Obviously, the numbers of 0s and 1s in all slices for each position can be counted using a hashmap(unordered_map in C++). However, the difficult part is to find a model to calculate the probabilities. Simple multiplicaiton or subtraction won't work, because we cannot get 0.999 with the given probabilities and arithmetic operations. Besides, the Bayes' theorem also cannot be used directly, because we don't know the number of slices and the computation would be too complicated. And we also may be tempted to try Markov Model, but Markov Model cannot fit all positions in the above example, such as positions 4(1,1,1), 5(1,0,0), 9(0,1) and 13(0,1,1).
The best model to fit this problem is the Binomial Distribution:

The binomial distribution models the total number of successes in repeated trials from an infinite population under the following conditions:

Only two outcomes are possible on each of n trials.

The probability of success for each trial is constant.

All trials are independent of each other.

The probability density function(pdf) of Binomial Distribution is:
$f(x|n,p)={n \choose x} p^x (1-p)^{n-x}; \quad \quad x=0,1,2,\cdots,n.$
As for this problem, given a position in the original string, we first need to find the probability of a 0 occurs in this position(in this case 0=>1, say bp01), and then calculate the probability of a 1 occurs in this position(in this case 1=>1, say bp11). Finally, we can get the probability of 1 in this position in the original string by: p=bp11/(bp01+bp11).
Take position 5 as an example. We get one 1 and two 0s in position 5 in three observations. We can calculate b11 using Matlab binopdf function. Since p11=1-p10=0.9, therefore we have bp11=binopdf(1,3,0.9)=0.027. As for bp01, because p01=0.1, we have bp01=binopdf(1,3,0.1)=0.243. Therefore, we can calc the probability of 1 in position 5 by p=0.027/(0.243 + 0.027)=0.1.

3. Source Code

The implementation of this problem can be found in my GitHub Channel.

References:

Read full article from Retrieve String From Sampled Slices | Bo's Blog

Retrieve String From Sampled Slices | Bo's Blog

2. Solution

3. Source Code

References:

Labels

Popular Posts