Massive Algorithms: Google Interview Questions Deconstructed

Google Interview Questions Deconstructed

https://medium.com/@alexgolec

https://medium.com/@alexgolec/introducing-google-interview-questions-deconstructed-a012e41ea631

https://hackernoon.com/google-interview-questions-deconstructed-the-knights-dialer-f780d516f029
More at LeetCode 935 - Knight Dialer

I like it because it hits a number of sweet spots:

It’s easy to state and understand.
It has a number of solutions, each requiring varying degrees of algorithms and data structures knowledge. Also, a little bit of insight goes a long way.
Each solution can be implemented in relatively few lines, making it perfect for a time-constrained environment.

Google Interview Questions Deconstructed: The Knight’s Dialer (Logarithmic Time Edition)

Google Interview Questions Deconstructed: Synonymous Queries

Imagine you operate a popular search engine and in your logs you observe two queries, let’s say “obama approval ratings” and “obama popularity rate.” Those two queries are different strings, but I think we can all agree that they’re basically searching for the same thing, and should be considered equivalent when counting queries, showing results, etc. How can we detect that two queries are synonymous?

To make this concrete, here is a sample input to illustrate:

SYNONYMS = [
('rate', 'ratings'),
('approval', 'popularity'),
]

QUERIES = [
('obama approval rate', 'obama popularity ratings'),
('obama approval rates', 'obama popularity ratings'),
('obama approval rate', 'popularity ratings obama')
]

Part 1: The (Not So) Simple Case

However candidates arrive at these questions, they inevitably end up asking me for an answer, and I always start off with the simplest case possible: words can have multiple synonyms, order matters, synonyms are not transitive, and synonyms can only map from one word to another. This makes for a pretty limited feature in a search engine, but there’s more than enough subtlety in it to make for an interesting interview question.

You’d think, but there’s more subtlety here than you’d see at first glance. By far the trickiest component of this simple algorithm is the synonym comparison. While simple to understand and describe, there are a lot of ways the synonym comparison component can go wrong. I’ll go into some of the more common ones I’ve seen here.

To be clear, none of these mistakes are disqualifying in my mind; if the candidate produced an implementation with an error, I would simply point it out, they would adjust their solution, and we would move on. However, an interview is, first and foremost, a battle against time. Making, noticing, and correcting mistakes is expected, but it saps time that could be spent elsewhere, such as producing a more optimal solution. Very few candidates make no mistakes, but candidates who make fewer make more progress simply because they spend less time cleaning up after themselves.

This is why I like this problem: unlike the knight’s dialer, which requires a flash of algorithmic insight followed by a (hopefully) simple implementation, this question requires a multitude of small, incremental steps in the right direction. Each step represents a tiny hurdle over which the candidate can either leap gracefully or trip and have to recover. Good candidates avoid these little pitfalls using their experience and intuition and are rewarded with a more fleshed-out and correct solution, while weaker ones waste time and energy on mistakes and are usually left with buggy code.

While every interview saw a different mix of leaps and faceplants, here is a small sampling of the more common errors I saw.

Accidental Runtime Killers

In terms of practical advice, this is actually an easy mistake to avoid. First off, never forget the types of your objects, even if you’re using a dynamically typed language like Python! Second, remember that when you use the in keyword on a list, that’s a linear search. Unless that list is guaranteed to always be very small, it’s going to be a performance killer.
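To make the cost concrete, here is a minimal sketch (not from the original post) contrasting a membership test on the raw list of pairs with the same test on a set built once up front. The function names are made up for illustration.

```python
SYNONYMS = [
    ('rate', 'ratings'),
    ('approval', 'popularity'),
]

def is_listed_slow(w1, w2):
    # `in` on a list compares against every element in turn: O(n) per check.
    return (w1, w2) in SYNONYMS

# Build the set once; each later check is a hash lookup, O(1) on average.
SYNONYM_SET = set(SYNONYMS)

def is_listed_fast(w1, w2):
    return (w1, w2) in SYNONYM_SET
```

Both functions give the same answers; only the cost per lookup differs, which matters once the synonym list is large or the check runs for every word of every query.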

Reminding candidates that the input structure is a list is usually enough to rouse them. What happens after I give a hint is very informative. The better candidates immediately think to preprocess the synonyms somehow, which is a good start. However, that approach is not without its pitfalls…

Use the Right Data Structure

Which of these two the candidate chooses is less interesting to me than what they put in it. (By the way, never use a dict/hashmap that goes to True or False. That’s called a set.) Most candidates settle on some sort of dict/hashmap. The most common error I see is a subconscious assumption that each word can have at most one synonym:
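A small sketch of the one-synonym-per-word pitfall, under my own assumptions: the helper names and the extra pair ('rate', 'score') are hypothetical and only there to show the overwrite. Both helpers store only the direction given in the input.

```python
from collections import defaultdict

PAIRS = [
    ('rate', 'ratings'),
    ('rate', 'score'),  # hypothetical second synonym for the same word
]

def build_single(pairs):
    # Pitfall: each assignment overwrites the previous synonym for w1.
    synonyms = {}
    for w1, w2 in pairs:
        synonyms[w1] = w2
    return synonyms

def build_multi(pairs):
    # Mapping each word to a set keeps every synonym it was paired with.
    synonyms = defaultdict(set)
    for w1, w2 in pairs:
        synonyms[w1].add(w2)
    return synonyms
```

With the input above, build_single remembers only the last pair for 'rate', while build_multi keeps both.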

A slightly more serious problem is not realizing that the synonym relationship goes both ways. You’ll notice the above code does this. Correcting this, however, can be error prone. Consider the following approach to implementing this property:

Why perform two insertions and use double the memory when you can use no additional memory and perform two checks?

The takeaway: always ask yourself if you can do less work! In hindsight, permuting the lookup is an obvious way of saving time if you look for it, but using a suboptimal implementation suggests the candidate didn’t think to look for ways to optimize. Again, I’m happy to give a hint, but it’d be better if I didn’t have to.
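Putting the pieces of the simple case together, one possible sketch (mine, not necessarily the post's listing) stores each pair once in a set and checks both orderings at lookup time. The function name synonym_queries is an assumption.

```python
SYNONYMS = [
    ('rate', 'ratings'),
    ('approval', 'popularity'),
]

QUERIES = [
    ('obama approval rate', 'obama popularity ratings'),
    ('obama approval rates', 'obama popularity ratings'),
    ('obama approval rate', 'popularity ratings obama'),
]

def synonym_queries(synonym_pairs, queries):
    # One insertion per pair; the symmetry is handled by checking twice.
    pairs = set(synonym_pairs)
    results = []
    for q1, q2 in queries:
        words1, words2 = q1.split(), q2.split()
        if len(words1) != len(words2):
            results.append(False)
            continue
        # Order matters in the simple case, so compare position by position.
        results.append(all(
            a == b or (a, b) in pairs or (b, a) in pairs
            for a, b in zip(words1, words2)
        ))
    return results

print(synonym_queries(SYNONYMS, QUERIES))  # [True, False, False]
```

The second query fails because 'rates' is not listed as a synonym of anything, and the third fails because word order matters in this version of the problem.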

Transitivity: Naive Approaches

The first constraint I like to relax is the one around transitivity, meaning that if words A and B are synonymous, and if words B and C are synonymous, then words A and C are synonymous. Sharp candidates quickly realize they can adapt their earlier solution to solve this because they’re still determining whether simple pairs of words represent synonym pairs, whereas the other relaxations invalidate the core logic of the earlier algorithm.

So far, so good, but there’s clearly no blinding performance just yet. The genius of this structure is in a procedure called compaction.

Yes you can, it turns out. In a way, every element in this tree is destined to arrive at “fast.” Instead of traversing the tree multiple times, why not simply change the parent of each element along the way to “fast” and save ourselves the work? This process is called compaction.
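The structure being described appears to be a disjoint set (union-find), and what is called compaction here is more commonly known as path compression. A minimal sketch under that assumption follows; the class name and the sample words other than “fast” are mine.

```python
class DisjointSet:
    def __init__(self):
        self.parent = {}  # word -> parent word; a root is its own parent

    def find(self, word):
        # Unknown words start out as their own one-element set.
        self.parent.setdefault(word, word)
        # First pass: walk up to the root.
        root = word
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass (compaction): point every word on the path at the root.
        while self.parent[word] != root:
            self.parent[word], word = root, self.parent[word]
        return root

    def union(self, a, b):
        # Attach the root of a's set under the root of b's set.
        self.parent[self.find(a)] = self.find(b)

    def are_synonyms(self, a, b):
        return a == b or self.find(a) == self.find(b)


ds = DisjointSet()
for w1, w2 in [('quick', 'fast'), ('speedy', 'quick'), ('rapid', 'speedy')]:
    ds.union(w1, w2)
```

After the unions, 'rapid' and 'fast' share a root even though they never appeared in the same pair, which is exactly the transitive behavior being asked for. This sketch leaves out union by rank or size, which a production version would likely add.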

This leaves us with a few more follow-ups: a version of this question where word order doesn’t matter, and one where synonyms can span multiple words. The solutions to each of these are challenging and delightful.

Google Interview Problems: Ratio Finder

More at LeetCode 399 - Evaluate Division

Given a list of conversion rates (formatted in the language of your choice) as a collection of origin unit, destination unit, and multiplier, for example:

foot inch 12

foot yard 0.3333333

etc…

Such that ORIGIN * MULTIPLIER = DESTINATION, design an algorithm that takes two arbitrary unit values and returns the conversion rate between them.

For context, framing is the act of translating a problem where the solution is not obvious into an equivalent one where the solution yields naturally. If that sounds completely abstract and unapproachable, forgive me because it is.

For instance, what if there is no conversion? The obvious approach doesn’t tell you anything about whether there actually is a conversion, and if I were given a thousand conversion rates, I would have a very difficult time determining whether such a conversion exists. Perhaps I’m being asked to convert between unfamiliar (or made up) units called wobbles and thingles, and I have no idea where to even start. How would the intuitive approach handle that?

I have to admit, that’s kind of a contrived scenario, but there’s another, more realistic one to consider. You’ll notice that my problem statement only includes units of distance. This is very intentional. What if I ask my system to translate from inches to kilograms? You and I both know this can’t be done because those units measure different things, but our input tells us nothing about the “kind” of thing each unit measures.

This is where the careful statement of the question allows strong candidates to shine. Strong candidates think through the edge cases of a system before they design an algorithm, and this problem statement purposefully gives them an opportunity to ask me whether we’ll be translating different units. It’s not a huge deal if they don’t catch this issue early on, but it’s always a good sign when someone asks me “what should I return if there is no conversion?” Stating the question this way gives me an indication.

The Graph Framing

Framing the problem as a graph unlocks all the classic graph search problems. In particular, two algorithms are useful here: breadth first search (BFS) and depth first search (DFS).
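As a rough sketch of the framing, units become nodes and each conversion becomes a weighted edge. The representation below (a dict of dicts) and the name build_graph are my own choices, and I am assuming we also want the reverse edge with the reciprocal multiplier.

```python
RATES = [
    ('foot', 'inch', 12.0),
    ('foot', 'yard', 0.3333333),
]

def build_graph(rates):
    # graph[origin][destination] = multiplier, so that
    # origin_value * multiplier = destination_value.
    graph = {}
    for origin, destination, multiplier in rates:
        graph.setdefault(origin, {})[destination] = multiplier
        # Assumed: every conversion can be reversed by dividing instead.
        graph.setdefault(destination, {})[origin] = 1.0 / multiplier
    return graph

GRAPH = build_graph(RATES)
```

With this layout, asking whether a unit is known is a key lookup, and the neighbors of a unit are the keys of its inner dict.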

We don’t want to find whether a path exists, we want to find the conversion rate! This is where the candidate must make a leap: it turns out you can modify any search algorithm to find the conversion rate, simply by keeping additional state as you traverse.
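The original listing is not reproduced here, so what follows is only a sketch of what a rate-tracking recursive DFS might look like. The graph literal and the name dfs_rate are assumptions for illustration.

```python
# Assumed adjacency layout: GRAPH[origin][destination] = multiplier.
GRAPH = {
    'foot': {'inch': 12.0, 'yard': 1.0 / 3.0},
    'inch': {'foot': 1.0 / 12.0},
    'yard': {'foot': 3.0},
}

def dfs_rate(graph, origin, destination, visited=None):
    """Return the multiplier from origin to destination, or None."""
    if origin not in graph or destination not in graph:
        return None
    if visited is None:
        visited = set()
    if origin == destination:
        return 1.0
    visited.add(origin)
    for neighbor, multiplier in graph[origin].items():
        if neighbor in visited:
            continue
        # The extra state: multiply the edge weight into whatever
        # the rest of the path returns.
        rest = dfs_rate(graph, neighbor, destination, visited)
        if rest is not None:
            return multiplier * rest
    return None
```

Here dfs_rate(GRAPH, 'inch', 'yard') walks inch to foot to yard and multiplies the two edge weights along the way.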

This is a fine implementation, but it suffers from two major weaknesses. First off, it is recursive. If it turns out the path we need is more than a thousand or so hops long, we’ll crash. Sure, it’s not likely, but if there’s one thing you don’t want happening in a long-running service, it’s crashing. Second off, even if we were to stop successfully, our answer has some undesirable properties.

I actually already gave you a hint way up at the top of the post. Did you notice how Google says the conversion rate is 1.0739e-17 but the conversion I computed manually came out to 1.0737e-17? It turns out given all the floating point multiplications we’re performing, we have to start worrying about error propagation. The subtleties are a little more than I want to go into for this post, but the gist of it is we want to perform as few floating point multiplications as possible to avoid errors accumulating and causing trouble.

DFS is a fine search algorithm, and if a solution exists it will find it, but it lacks a crucial property: it does not necessarily find the shortest path. This is relevant to us because a shorter path means fewer hops, which means fewer error-propagating floating point multiplications. To get around this, we’ll want to use BFS.

The recursive DFS solution’s major weaknesses are that it’s recursive and it doesn’t minimize the number of multiplications. BFS, as we’ll soon see, does minimize the number of multiplications, and it also happens to be very tricky to implement recursively.
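A minimal iterative BFS sketch under the same assumptions (a dict-of-dicts graph; the name bfs_rate and the sample units are mine). Because BFS explores paths in order of hop count, the first time it reaches the destination it has used the fewest multiplications.

```python
from collections import deque

# Assumed adjacency layout: GRAPH[origin][destination] = multiplier.
GRAPH = {
    'foot': {'inch': 12.0, 'yard': 1.0 / 3.0},
    'inch': {'foot': 1.0 / 12.0},
    'yard': {'foot': 3.0},
    'kilogram': {'gram': 1000.0},
    'gram': {'kilogram': 0.001},
}

def bfs_rate(graph, origin, destination):
    """Return the multiplier along a fewest-hops path, or None."""
    # Edge case: one of the units is not in the graph at all.
    if origin not in graph or destination not in graph:
        return None
    queue = deque([(origin, 1.0)])
    visited = {origin}
    while queue:
        unit, rate = queue.popleft()
        if unit == destination:
            return rate
        for neighbor, multiplier in graph[unit].items():
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, rate * multiplier))
    # Edge case: both units exist but no path connects them.
    return None
```

No recursion means no stack depth limit to worry about, and the two None returns cover the unknown-unit and no-conversion cases separately.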

Understand the question
Frame the conversion network as a graph
Realize conversion rates can be mapped to paths through the graph
Recognize they can use search algorithms to accomplish this
Choose their favorite algorithm and modify it to track the conversion rate
If they implemented DFS as a naive solution, realize its weaknesses
Implement BFS
Step back and examine the edge cases:
What if we’re asked for a nonexistent node?
What if the conversion rate does not exist?
Realize that reversing the conversions is possible and likely necessary

Part 4: Can You Do Any Better?

Part 4: Constant Time

So it turns out that the “cache everything” solution is actually not far off from the mark. In that approach, we (eventually) end up with an edge between each node and every other node, meaning our conversions happen in a single edge lookup. But do we really need to store conversions from every node to every node? What if we just stored the conversion rates from one node to every other node?
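One way this idea could be sketched: pick a root per connected component, record the rate from that root to every unit in the component, and answer a query by dividing two stored rates. The names build_root_table and convert are hypothetical, and the sketch assumes the graph already contains reverse edges so that a search from any node reaches its whole component.

```python
from collections import deque

# Assumed adjacency layout with reverse edges already present.
GRAPH = {
    'foot': {'inch': 12.0, 'yard': 1.0 / 3.0},
    'inch': {'foot': 1.0 / 12.0},
    'yard': {'foot': 3.0},
    'kilogram': {'gram': 1000.0},
    'gram': {'kilogram': 0.001},
}

def build_root_table(graph):
    # table[unit] = (root, rate), where root_value * rate = unit_value.
    table = {}
    for start in graph:
        if start in table:
            continue
        # First unseen node of a component becomes its root (arbitrary).
        table[start] = (start, 1.0)
        queue = deque([start])
        while queue:
            unit = queue.popleft()
            root, rate = table[unit]
            for neighbor, multiplier in graph[unit].items():
                if neighbor not in table:
                    table[neighbor] = (root, rate * multiplier)
                    queue.append(neighbor)
    return table

def convert(table, origin, destination):
    if origin not in table or destination not in table:
        return None
    origin_root, origin_rate = table[origin]
    destination_root, destination_rate = table[destination]
    # Different roots means different components: no conversion exists.
    if origin_root != destination_root:
        return None
    # origin -> root is 1 / origin_rate; root -> destination is destination_rate.
    return destination_rate / origin_rate

TABLE = build_root_table(GRAPH)
```

The table costs one search per component to build, after which every query is two dict lookups and one division, and the root comparison doubles as the check for whether a conversion exists at all.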

But Wait, There’s More!

First, a warm-up: in the constant time solution I laid out, I chose the root node of each connected component arbitrarily. In particular, I use the first node of that component we encounter. This is not optimal, because for all we know we’ve chosen some node way off on the fringes of the graph, while some other node might be more central and so have shorter paths to all other nodes. Your assignment is to replace this arbitrary choice with one which minimizes the number of multiplications required and thus the floating point error propagation.

Second, this whole discussion assumes all equal-length paths through the graph are created equal, which isn’t always the case. One interesting variant of this problem is currency conversions: nodes are currencies, and edges from A to B and vice-versa are the bid/ask prices of each currency pair. We can rephrase the unit conversion question as a forex arbitrage question: implement an algorithm that, given a currency conversion graph, computes a cycle through the graph that can leave a trader with more money than when they started. Assume no transaction fees.

Finally, a real doozie: some units are expressed as a combination of various basic units. For instance, the watt is defined, in SI units, as “kilogram meters squared per second cubed”. The final challenge is to extend this system to support converting between these units given only the definitions of the basic SI units.
