Spam Classifier
Problem Statement
Given a list of messages and a list of spam words, label each message as "spam" if it contains two or more spam words; otherwise label it "not spam". Matching is case-sensitive — "Win" and "win" are different.
Example 1:
Input:
messages = ["Win money now", "Hello there", "Free win prize"]
spamWords = ["win", "prize", "money"]
Output: ["not spam", "not spam", "spam"]
Explanation:
"Win money now" — contains "money" (spam) but NOT "win" (different case). 1 spam word → not spam.
"Hello there" — 0 spam words → not spam.
"Free win prize" — contains "win" and "prize" → 2 spam words → spam.Example 2:
Input:
messages = ["buy now", "buy buy buy"]
spamWords = ["buy"]
Output (if counting duplicate tokens): ["not spam", "spam"]
Output (if counting distinct spam words): ["not spam", "not spam"]
Constraints:
- 1 <= messages.length <= 10^4
- 1 <= spamWords.length <= 10^3
- Clarify: do duplicate occurrences of the same spam word count separately?
Difficulty: Easy · Round: OA
Pattern Recognition
- What is the input structure? A list of strings (messages) and a list of strings (spam vocabulary).
- What are we optimizing? O(1) membership lookup per token.
- What pattern does this fit? Hash set + tokenization.
- Why does this pattern fit? You need to answer "is this token a spam word?" many times. A hash set gives O(1) lookup. Tokenizing on whitespace (or word boundaries) turns each message into a stream of candidate words.
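The pattern in miniature — a minimal sketch with the Example 1 vocabulary; `spamWordCount` is a hypothetical helper for illustration, not part of the required API:

```typescript
// Vocabulary from Example 1; matching stays case-sensitive.
const spamSet = new Set(["win", "prize", "money"]);

// Count how many tokens of a message are spam words.
function spamWordCount(message: string): number {
  let count = 0;
  for (const token of message.split(/\s+/)) {
    if (spamSet.has(token)) count++; // O(1) average lookup
  }
  return count;
}

console.log(spamWordCount("Free win prize")); // 2
console.log(spamWordCount("Win money now"));  // 1 ("Win" !== "win")
```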
Approach
Brute Force — Nested Loop
For each message, tokenize, and for each token, linearly scan the spam list.
Time: O(M * W * S) where M = number of messages, W = avg tokens per message, S = spam-list size. Space: O(1) beyond the output. Why insufficient: every token pays an O(S) scan of the spam list; with S up to 10^3, a hash set turns each of those scans into an O(1) lookup.
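A sketch of this brute force, assuming whitespace tokenization; `classifyBrute` is an illustrative name, and `Array.prototype.includes` supplies the linear scan:

```typescript
// Brute force: for each token, linearly scan spamWords.
// Same labels as the optimal version, just O(S) per membership test.
function classifyBrute(messages: string[], spamWords: string[]): string[] {
  return messages.map(msg => {
    let count = 0;
    for (const token of msg.split(/\s+/)) {
      if (spamWords.includes(token)) count++; // O(S) scan each time
      if (count >= 2) return "spam";          // short-circuit at 2
    }
    return "not spam";
  });
}
```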
Optimal — Hash Set + Linear Tokenize
Key Insight: Put the spam words in a hash set once. Then scan each message's tokens, counting matches. Short-circuit at 2.
Step-by-step:
- Build `spamSet = new Set(spamWords)`.
- For each message:
  - Tokenize on whitespace.
  - Walk the tokens, incrementing `count` when the token is in `spamSet`.
  - If `count >= 2`, label "spam" and break.
  - Otherwise label "not spam".
- Return the labels.
Time: O(T) total where T is the total number of tokens across all messages. Space: O(S) for the set.
Walkthrough
Trace on messages = ["Win money now", "Hello there", "Free win prize"], spamWords = ["win", "prize", "money"]:
spamSet = {"win", "prize", "money"}.
Message 1: "Win money now"
| token | in set? | count |
|---|---|---|
| Win | no (case-sensitive) | 0 |
| money | yes | 1 |
| now | no | 1 |
count = 1 → "not spam".
Message 2: "Hello there"
Both tokens are misses. count = 0 → "not spam".
Message 3: "Free win prize"
| token | in set? | count |
|---|---|---|
| Free | no | 0 |
| win | yes | 1 |
| prize | yes | 2 → break |
count = 2 → "spam".
Answer: ["not spam", "not spam", "spam"].
Implementation
```typescript
function classifyMessages(messages: string[], spamWords: string[]): string[] {
  const spamSet = new Set(spamWords);
  return messages.map(msg => {
    const tokens = msg.split(/\s+/);
    let count = 0;
    for (const t of tokens) {
      if (spamSet.has(t)) count++;
      if (count >= 2) return "spam";
    }
    return "not spam";
  });
}
```

```cpp
#include <string>
#include <vector>
#include <sstream>
#include <unordered_set>

std::vector<std::string> classifyMessages(const std::vector<std::string>& messages,
                                          const std::vector<std::string>& spamWords) {
  std::unordered_set<std::string> spamSet(spamWords.begin(), spamWords.end());
  std::vector<std::string> out;
  out.reserve(messages.size());
  for (const auto& msg : messages) {
    std::stringstream ss(msg);
    std::string token;
    int count = 0;
    while (ss >> token) {
      if (spamSet.count(token)) count++;
      if (count >= 2) break;
    }
    out.push_back(count >= 2 ? "spam" : "not spam");
  }
  return out;
}
```

Watch out:
- Case sensitivity. Do not lowercase anything. "Win" != "win".
- Duplicates count. The reported version counts each occurrence. If unclear, ask: "Does 'buy buy' count as 2 matches or 1?"
- Punctuation. If tokens can have trailing punctuation (`"win!"`), either strip it or use a looser split. Clarify.
Complexity Analysis
| Approach | Time | Space |
|---|---|---|
| Brute Force (Linear Scan) | O(M * W * S) | O(1) |
| Optimal (Hash Set) | O(T) | O(S) |
Bottleneck: Total tokens across all messages. Set construction is O(S) once upfront.
Edge Cases
- Empty message — 0 tokens, label "not spam".
- Message is only whitespace — `split(/\s+/)` may yield `[""]`; filter empty tokens or trim.
- Spam words list empty — every message labeled "not spam".
- Message with exactly 2 spam words — boundary; the `>=` check includes it.
- Case variants — deliberately excluded by case-sensitive matching; do not normalize.
- Non-ASCII characters — Unicode works if both sides are normalized consistently. Otherwise, inputs like `"café"` vs `"cafe\u0301"` can silently miss.
Related Problems
- LC 771 — Jewels and Stones — Hash set membership test, simpler variant.
- LC 49 — Group Anagrams — Hash key construction from strings.
- LC 290 — Word Pattern — Bijection between pattern chars and tokens.
OA Strategy Notes
This is the easiest OA problem you will see. 5-10 minutes of implementation. The only risk is reading the problem sloppily and missing "case-sensitive" or misunderstanding the duplicate rule. Re-read the problem once before coding.
What is Expected at Each Level
Junior: Solves quickly. Likely normalizes case by habit; needs to unlearn that for this problem.
Mid-level: Hash set, tokenize, count. Asks the case-sensitivity and duplicate questions upfront.
Senior: Discusses tokenization robustness (punctuation, Unicode, whitespace variants). Mentions that in production, stemming and n-gram matching would apply but are out of scope for the stated spec.