Massive Algorithms: Dropbox: Get Total Hits in Last 15 mins

Dropbox: Get Total Hits in Last 15 mins - log hitter

Software Engineer Interview Questions | nuttynana
Write a counter for hits received in the past 15 minutes.
Suppose we only need 1-second granularity; then the problem becomes simpler. We can keep a rolling-window array of 900 elements that stores the number of hits in each second, plus a little metadata: the current position, the time of the last update, and the running sum.

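#include <algorithm>  // std::min
#include <cstring>    // memset
#include <ctime>      // time, time_t

const int N = 900;  // one slot per second over 15 minutes

struct Counter {
 int count[N];      // hits recorded in each one-second slot
 int lastPosition;  // slot that corresponds to lastTime
 time_t lastTime;   // second at which move() last ran
 int sum;           // total hits currently inside the window
 Counter() {
  memset(count, 0, sizeof(count));
  lastPosition = 0;
  sum = 0;
  lastTime = time(NULL);
 }
 // Advance to the current second, clearing every slot that has expired.
 void move() {
  time_t t = time(NULL);
  // Never clear more than N slots, however long the gap was.
  int gap = (int)std::min<time_t>(t - lastTime, N);
  for (int k = 0; k < gap; ++k) {
   lastPosition = (lastPosition + 1) % N;
   sum -= count[lastPosition];
   count[lastPosition] = 0;
  }
  lastTime = t;
 }
 void hit() {
  move();
  count[lastPosition]++;
  sum++;
 }
 int getcount() {
  move();
  return sum;
 }
};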

If multiple threads access the struct, there are two options. The first is to protect the whole struct with a single lock. The second is to give each thread its own thread-local copy of the struct: a hit updates only the local copy, and a query collects the data from every copy and sums it. Both versions require locks. With the first, however, every hit() may have to wait on the lock, while with the second only getcount() contends with the writers. In practice hit() is usually called far more often than getcount(), so the second option can perform better. It is also harder to implement, so which one to choose is a trade-off that depends on the actual requirements.
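The second option can be approximated without true thread-local storage by sharding. The sketch below is one possible reading, not the original author's code: `ShardedCounter`, `Shard`, `kShards` and `exampleFourThreads` are hypothetical names, the shard count of 8 is an arbitrary assumption, and the current time is passed in as `now` (whole seconds, assumed non-negative and non-decreasing) so that the behavior is deterministic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kWindow = 900;  // seconds in the window (15 minutes)
constexpr int kShards = 8;    // assumed shard count, not from the original post

// One rolling-window counter with its own lock.
struct Shard {
    std::mutex mu;
    std::array<int, kWindow> count{};  // hits per second, indexed by second % kWindow
    long lastTime = 0;                 // last second this shard was advanced to
    int sum = 0;                       // hits currently inside the window

    // Caller must hold mu. Clears the slots of seconds that have expired.
    void advance(long now) {
        long gap = now - lastTime;
        if (gap <= 0) return;
        if (gap >= kWindow) {
            count.fill(0);
            sum = 0;
        } else {
            for (long t = lastTime + 1; t <= now; ++t) {
                int idx = static_cast<int>(t % kWindow);
                sum -= count[idx];
                count[idx] = 0;
            }
        }
        lastTime = now;
    }
};

class ShardedCounter {
    std::array<Shard, kShards> shards_;

    // Map the calling thread to a shard so that threads rarely share a lock.
    static std::size_t shardIndex() {
        return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
    }

public:
    // Locks only the caller's shard.
    void hit(long now) {
        Shard& s = shards_[shardIndex()];
        std::lock_guard<std::mutex> lock(s.mu);
        s.advance(now);
        ++s.count[now % kWindow];
        ++s.sum;
    }

    // Locks each shard in turn and adds up the per-shard sums.
    int getcount(long now) {
        int total = 0;
        for (Shard& s : shards_) {
            std::lock_guard<std::mutex> lock(s.mu);
            s.advance(now);
            total += s.sum;
        }
        return total;
    }
};

// Usage: four threads record 1000 hits each in the same second.
int exampleFourThreads() {
    ShardedCounter c;
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&c] { for (int j = 0; j < 1000; ++j) c.hit(1000); });
    for (std::thread& t : threads) t.join();
    assert(c.getcount(1899) == 4000);  // still inside the 900-second window
    return c.getcount(1000 + kWindow); // the hits have expired by now
}
```

Because getcount() takes the shard locks one at a time, the total is not a snapshot of a single instant; whether that is acceptable is part of the trade-off described above.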

What if we want full accuracy? This is actually simpler, because there is little room for optimization: we have to record the raw data for every hit. A queue serves the purpose.
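#include <chrono>
#include <queue>

class Counter {
 std::queue<double> hits; // each entry is the timestamp of one hit, in seconds
 double now;

 // Current time in seconds. A double is used because a float cannot
 // represent an epoch-scale timestamp with sub-second precision.
 static double getCurrentTime() {
  return std::chrono::duration<double>(
    std::chrono::system_clock::now().time_since_epoch()).count();
 }

 // Refresh now and drop every hit older than 900 seconds.
 void move() {
  now = getCurrentTime();
  while (!hits.empty() && hits.front() < now - 900) {
   hits.pop();
  }
 }

public:
 void hit() {
  move();
  hits.push(now);
 }

 int getcounter() {
  move();
  return (int)hits.size();
 }
};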

http://www.mitbbs.com/article_t/JobHunting/32549839.html

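I solved it assuming one-second precision: I used an array of size 300 that stores, for each slot, a hit count and the time of the last hit, and used time % 300 to find the slot index. If the slot's last hit time turns out to be more than a second old, the count restarts from 0. getHits() then scans the array and adds up the counts. The interviewer seemed fairly satisfied, but I did not do well on the follow-ups: first I was asked to write unit tests, and then to discuss how to make it concurrent.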
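The approach described in this post can be sketched as follows. This is an interpretation rather than the poster's actual code: `SlotCounter` and `exampleSlotCounter` are hypothetical names, the time is passed in as `now` (whole seconds, assumed non-negative and non-decreasing) instead of being read from the clock, and a slot is treated as stale when its stored second differs from the second being written.

```cpp
#include <array>
#include <cassert>

constexpr int kSlots = 300;  // 5-minute window at one slot per second

class SlotCounter {
    std::array<long, kSlots> stamp_{};  // second in which each slot was last written
    std::array<int, kSlots> hits_{};    // hits recorded during that second

public:
    void hit(long now) {
        int i = static_cast<int>(now % kSlots);
        if (stamp_[i] != now) {  // the slot still holds data from an older second
            stamp_[i] = now;
            hits_[i] = 0;
        }
        ++hits_[i];
    }

    // Scan every slot and add up the ones written within the last kSlots seconds.
    int getHits(long now) const {
        int total = 0;
        for (int i = 0; i < kSlots; ++i) {
            if (now - stamp_[i] < kSlots) total += hits_[i];
        }
        return total;
    }
};

// Usage: two hits in one second, one in the next.
int exampleSlotCounter() {
    SlotCounter c;
    c.hit(1000);
    c.hit(1000);
    c.hit(1001);
    assert(c.getHits(1001) == 3);
    return c.getHits(1300);  // the two hits from second 1000 have expired
}
```

Unlike the rolling-window struct earlier, hit() here does constant work and never has to clear a run of slots; the cost moves to getHits(), which always scans all 300 slots.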

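I could only think of these three test cases:
1. The last hit and the current hit fall in the same second.
2. The last hit and the current hit fall in different seconds; check that the elements between the last hit and the current hit are correctly reset to 0.
3. Same as 2, but with a very large gap between the last hit and the current hit (e.g. 30000 s); check the running time to see whether it returns early after resetting 300 elements.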
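The three cases above become easy to write once the clock is injectable. The sketch below is an assumption-laden variant of the rolling-window struct from earlier in this article, not the poster's code: the time is passed to every call, N is set to 300 to match the post, and `resets` is a hypothetical test hook that records how many slots the last move() cleared, which stands in for measuring the running time in case 3.

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>

const int N = 300;  // window size used in the forum post

struct Counter {
    int count[N];
    int lastPosition = 0;
    long lastTime;
    int sum = 0;
    int resets = 0;  // test hook (assumption): slots cleared by the last move()

    explicit Counter(long start) : lastTime(start) {
        std::memset(count, 0, sizeof(count));
    }
    void move(long t) {
        long gap = std::min<long>(t - lastTime, N);  // never clear more than N slots
        resets = 0;
        for (long k = 0; k < gap; ++k) {
            lastPosition = (lastPosition + 1) % N;
            sum -= count[lastPosition];
            count[lastPosition] = 0;
            ++resets;
        }
        lastTime = t;
    }
    void hit(long t) { move(t); count[lastPosition]++; sum++; }
    int getcount(long t) { move(t); return sum; }
};

// Case 1: both hits land in the same second, so nothing is cleared between them.
void testSameSecond() {
    Counter c(0);
    c.hit(10);
    c.hit(10);
    assert(c.resets == 0);
    assert(c.getcount(10) == 2);
}

// Case 2: hits in different seconds; slots in between are cleared and
// each hit expires exactly N seconds after it was recorded.
void testDifferentSeconds() {
    Counter c(0);
    c.hit(10);
    c.hit(15);
    assert(c.resets == 5);
    assert(c.getcount(309) == 2);
    assert(c.getcount(310) == 1);
    assert(c.getcount(315) == 0);
}

// Case 3: a 30000-second gap clears exactly N slots, not 30000.
void testLargeGap() {
    Counter c(0);
    c.hit(10);
    c.hit(30010);
    assert(c.resets == N);
    assert(c.getcount(30010) == 1);
}

void runAllTests() {
    testSameSecond();
    testDifferentSeconds();
    testLargeGap();
}
```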

For concurrency, is it enough to wrap the body of hit() between mutex.lock() and mutex.unlock()? I'm not sure...
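On the question above: locking hit() alone is not sufficient in C++, because a getHits() that reads the same data without holding the lock still races with the writers. A minimal sketch of the coarse-grained approach, with every public call taking the same mutex, might look like this. `Locked`, `TotalOnly` and `exampleLocked` are hypothetical names, and `TotalOnly` is only a stand-in for whichever single-threaded counter is being protected.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Wraps any single-threaded counter that offers hit(long) and getHits(long).
template <typename C>
class Locked {
    std::mutex mu_;
    C inner_;

public:
    void hit(long now) {
        std::lock_guard<std::mutex> guard(mu_);  // released automatically on return
        inner_.hit(now);
    }
    int getHits(long now) {
        std::lock_guard<std::mutex> guard(mu_);  // readers need the lock too
        return inner_.getHits(now);
    }
};

// Stand-in counter (assumption) that ignores time and just counts calls.
struct TotalOnly {
    int n = 0;
    void hit(long) { ++n; }
    int getHits(long) const { return n; }
};

// Usage: four threads record 1000 hits each through the locked wrapper.
int exampleLocked() {
    Locked<TotalOnly> c;
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([&c] { for (int j = 0; j < 1000; ++j) c.hit(0); });
    for (std::thread& t : threads) t.join();
    return c.getHits(0);
}
```

Using std::lock_guard instead of explicit lock() and unlock() calls means the mutex is released even if the wrapped call throws.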

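This problem is quite tricky, and a circular buffer is not easy to get right. A circular buffer was my first idea, but after I wrote it the interviewer said there was a bug; I fixed it and the interviewer said there was still a bug, and we went back and forth several times. The main things to consider are whether two adjacent hits fall into the same bucket and how far apart they are (more or less than 5 minutes). Beyond two consecutive hits, you also have to consider how far getHit is from the previous hit (again, more or less than 5 minutes). Partway through I asked the interviewer whether I could use a background job to clear the buckets and was told no. In comparison, a linked list is probably easier to write.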

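After all the bugs were fixed, the interviewer asked how the code should change for multiple threads. Ideally I would have put a read-write lock on every variable, but there was no time left, so I just locked the entire function. Given the many rounds of bug fixes and the very inefficient locking, all I can do now is wait for the rejection letter :(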

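A question I got earlier in a Twitter phone interview also asked about concurrency. The problem was a stream of hashtags being inserted continuously, with the most frequent one reported at any time. Like this problem, it has an insert function (similar to hit). My answer was to dedicate one thread to insertion and have the other threads send insert requests to it, so the only part that needs locking is sending the request. If Go is allowed, just use a channel and it is done without locks (I don't know whether the channel implementation itself uses locks). The interviewer was satisfied.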
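The single-writer idea from this post can be applied to the hit counter as well. The following is a rough C++ sketch under several assumptions: `SingleWriterCounter`, `Request` and `exampleSingleWriter` are hypothetical names, timestamps are passed in as whole seconds and assumed non-decreasing, and a mutex-protected inbox plus a condition variable play the role that a channel would play in Go. Only the inbox is locked; the hit data itself is touched by the worker thread alone.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

constexpr long kWindowSecs = 900;

class SingleWriterCounter {
    struct Request {
        long now;
        std::shared_ptr<std::promise<int>> reply;  // null for a hit, set for a query
    };

    std::mutex mu_;               // guards inbox_ and stop_ only
    std::condition_variable cv_;
    std::deque<Request> inbox_;
    bool stop_ = false;
    std::deque<long> hits_;       // touched only by the worker thread
    std::thread worker_;          // declared last so it starts after the other members

    void send(Request r) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            inbox_.push_back(std::move(r));
        }
        cv_.notify_one();
    }

    void run() {
        for (;;) {
            std::deque<Request> batch;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stop_ || !inbox_.empty(); });
                if (inbox_.empty()) return;  // stop requested and nothing left to do
                batch.swap(inbox_);
            }
            // Process the batch without holding the lock.
            for (Request& r : batch) {
                while (!hits_.empty() && hits_.front() <= r.now - kWindowSecs)
                    hits_.pop_front();
                if (r.reply) r.reply->set_value(static_cast<int>(hits_.size()));
                else hits_.push_back(r.now);
            }
        }
    }

public:
    SingleWriterCounter() : worker_([this] { run(); }) {}
    ~SingleWriterCounter() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    void hit(long now) { send({now, nullptr}); }

    // Sends a query request and blocks until the worker answers it.
    int getcount(long now) {
        auto reply = std::make_shared<std::promise<int>>();
        std::future<int> result = reply->get_future();
        send({now, reply});
        return result.get();
    }
};

// Usage: requests from one thread are handled in the order they were sent.
int exampleSingleWriter() {
    SingleWriterCounter c;
    c.hit(1000);
    c.hit(1000);
    c.hit(1500);
    assert(c.getcount(1500) == 3);
    return c.getcount(1900);
}
```

hit() returns as soon as the request is queued, so callers never wait for the counter itself; the price is that getcount() has to make a round trip through the worker.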

http://blog.csdn.net/whuwangyi/article/details/41010325

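This is a high-frequency question at this company. I implemented it with a deque; the amortized time of hit is O(1), which I think is about as good as it gets.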
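A deque-based counter along these lines might look like the sketch below. The post does not show its code, so the details are assumptions: `DequeCounter` and `exampleDequeCounter` are hypothetical names, the window is taken to be the 15 minutes from the article title, time is passed in as whole non-decreasing seconds, and hits in the same second are merged into one entry to bound memory. Every entry is pushed once and popped at most once, which is what makes hit() amortized O(1).

```cpp
#include <cassert>
#include <deque>
#include <utility>

constexpr long kWindowSecs = 900;  // 15 minutes

class DequeCounter {
    std::deque<std::pair<long, int>> buckets_;  // (second, hits in that second), oldest first
    int total_ = 0;                             // sum of all hits held in buckets_

    // Drop entries that have fallen out of the window.
    void evict(long now) {
        while (!buckets_.empty() && buckets_.front().first <= now - kWindowSecs) {
            total_ -= buckets_.front().second;
            buckets_.pop_front();
        }
    }

public:
    void hit(long now) {
        evict(now);
        if (!buckets_.empty() && buckets_.back().first == now) {
            ++buckets_.back().second;  // same second as the previous hit
        } else {
            buckets_.emplace_back(now, 1);
        }
        ++total_;
    }

    int getHits(long now) {
        evict(now);
        return total_;
    }
};

// Usage: three hits in one second and one in the next.
int exampleDequeCounter() {
    DequeCounter c;
    c.hit(100);
    c.hit(100);
    c.hit(100);
    c.hit(101);
    assert(c.getHits(101) == 4);
    return c.getHits(1000);  // the three hits from second 100 have expired
}
```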

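Later I was asked to write a parallel version. I forgot to ask whether it was meant for shared memory or a distributed setting, and I got a bit stuck, going back and forth between using a lock and using multiple local copies. The design felt like the CAP theorem in practice: if consistency is required, you cannot partition, or, put differently, availability will be low if you use locks. If I keep multiple local copies, I don't think consistency can be guaranteed.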

http://www.careercup.com/question?id=14908664

Also see the book The Art of Readable Code:
http://massivetechinterview.blogspot.com/2015/09/the-art-of-readable-code.html
Designing and Implementing a “Minute/Hour Counter”
// Track the cumulative counts over the past minute and over the past hour.
// Useful, for example, to track recent bandwidth usage.
class MinuteHourCounter {
public:
 // Add a new data point (count >= 0).
 // For the next minute, MinuteCount() will be larger by +count.
 // For the next hour, HourCount() will be larger by +count.
 void Add(int count);

 // Return the accumulated count over the past 60 seconds.
 int MinuteCount();

 // Return the accumulated count over the past 3600 seconds.
 int HourCount();
};

An Easier-to-Read Version
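#include <ctime>
#include <list>
using std::list;

// One data point: a count and the time at which it was added.
struct Event {
 Event(int count, time_t time) : count(count), time(time) {}
 int count;
 time_t time;
};

class MinuteHourCounter {
 list<Event> events;

 // Sum the counts of all events newer than cutoff, scanning from newest to oldest.
 int CountSince(time_t cutoff) {
  int count = 0;
  for (list<Event>::reverse_iterator rit = events.rbegin();
       rit != events.rend(); ++rit) {
   if (rit->time <= cutoff) {
    break;
   }
   count += rit->count;
  }
  return count;
 }

public:
 void Add(int count) {
  events.push_back(Event(count, time(NULL)));
 }

 int MinuteCount() {
  return CountSince(time(NULL) - 60);
 }

 int HourCount() {
  return CountSince(time(NULL) - 3600);
 }
};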

It is better to have all the difficult code confined to one place.

The loop is written as a “traditional” for loop of the form for(begin; end; advance) because that form is easiest to read: the reader can immediately understand it as “go through all the elements” and doesn’t have to think about it further.

Attempt 2: Conveyor Belt Design
class MinuteHourCounter {
 // Event holds a count and the time at which it was added.
 list<Event> minute_events;
 list<Event> hour_events; // only contains elements NOT in minute_events

 int minute_count;
 int hour_count; // counts ALL events over past hour, including past minute

 // Find and delete old events, and decrease hour_count and minute_count accordingly.
 void ShiftOldEvents(time_t now_secs) {
  const time_t minute_ago = now_secs - 60;
  const time_t hour_ago = now_secs - 3600;

  // Move events more than one minute old from 'minute_events' into 'hour_events'
  // (Events older than one hour will be removed in the second loop.)
  while (!minute_events.empty() && minute_events.front().time <= minute_ago) {
   hour_events.push_back(minute_events.front());

   minute_count -= minute_events.front().count;
   minute_events.pop_front();
  }

  // Remove events more than one hour old from 'hour_events'
  while (!hour_events.empty() && hour_events.front().time <= hour_ago) {
   hour_count -= hour_events.front().count;
   hour_events.pop_front();
  }
 }

public:
 // Both running totals must start at zero.
 MinuteHourCounter() : minute_count(0), hour_count(0) {}

 void Add(int count) {
  const time_t now_secs = time(NULL);
  ShiftOldEvents(now_secs);

  // Feed into the minute list (not into the hour list--that will happen later)
  minute_events.push_back(Event(count, now_secs));

  minute_count += count;
  hour_count += count;
 }

 int MinuteCount() {
  ShiftOldEvents(time(NULL));
  return minute_count;
 }

 int HourCount() {
  ShiftOldEvents(time(NULL));
  return hour_count;
 }
};

Attempt 3: A Time-Bucketed Design
The key idea is to bucket all the events within a small time window together, and summarize those events with a single total. For instance, the events over the past minute could be inserted into 60 discrete buckets, each 1 second wide. The events over the past hour could also be inserted into 60 discrete buckets, each 1 minute wide.
This design also has a fixed, predictable memory usage.

// TrailingBucketCounter and ConveyorQueue are defined below.
class MinuteHourCounter {
 TrailingBucketCounter minute_counts;
 TrailingBucketCounter hour_counts;

public:
 MinuteHourCounter() :
  minute_counts(/* num_buckets = */ 60, /* secs_per_bucket = */ 1),
  hour_counts( /* num_buckets = */ 60, /* secs_per_bucket = */ 60) {
 }

 void Add(int count) {
  time_t now = time(NULL);
  minute_counts.Add(count, now);
  hour_counts.Add(count, now);
 }

 int MinuteCount() {
  time_t now = time(NULL);
  return minute_counts.TrailingCount(now);
 }

 int HourCount() {
  time_t now = time(NULL);
  return hour_counts.TrailingCount(now);
 }
};

class TrailingBucketCounter {
 ConveyorQueue buckets;
 const int secs_per_bucket;
 time_t last_update_time; // the last time Update() was called

 // Calculate how many buckets of time have passed and Shift() accordingly.
 void Update(time_t now) {
  int current_bucket = now / secs_per_bucket;
  int last_update_bucket = last_update_time / secs_per_bucket;

  buckets.Shift(current_bucket - last_update_bucket);
  last_update_time = now;
 }

public:
 TrailingBucketCounter(int num_buckets, int secs_per_bucket) :
  buckets(num_buckets),
  secs_per_bucket(secs_per_bucket),
  last_update_time(0) { // initialized so the first Update() reads a defined value
 }

 void Add(int count, time_t now) {
  Update(now);
  buckets.AddToBack(count);
 }

 int TrailingCount(time_t now) {
  Update(now);
  return buckets.TotalSum();
 }
};

// A queue with a maximum number of slots, where old data gets shifted off the end.
class ConveyorQueue {
 queue<int> q;
 int max_items;
 int total_sum; // sum of all items in q

public:
 ConveyorQueue(int max_items) : max_items(max_items), total_sum(0) {
 }

 int TotalSum() {
  return total_sum;
 }

 void Shift(int num_shifted) {
  // In case too many items shifted, just clear the queue.
  if (num_shifted >= max_items) {
   q = queue<int>(); // clear the queue
   total_sum = 0;
   return;
  }

  // Push all the needed zeros.
  while (num_shifted > 0) {
   q.push(0);
   num_shifted--;
  }

  // Let all the excess items fall off.
  while (q.size() > max_items) {
   total_sum -= q.front();
   q.pop();
  }
 }

 void AddToBack(int count) {
  if (q.empty()) Shift(1); // Make sure q has at least 1 item.
  q.back() += count;
  total_sum += count;
 }
};
Read full article from Software Engineer Interview Questions | nuttynana
