
Spam Classifier ​

Problem Statement ​

Given a list of messages and a list of spam words, label each message as "spam" if it contains two or more spam words; otherwise label it "not spam". Matching is case-sensitive — "Win" and "win" are different.

Example 1:

Input:
  messages = ["Win money now", "Hello there", "Free win prize"]
  spamWords = ["win", "prize", "money"]
Output: ["not spam", "not spam", "spam"]
Explanation:
  "Win money now" — contains "money" (spam) but NOT "win" (different case). 1 spam word → not spam.
  "Hello there" — 0 spam words → not spam.
  "Free win prize" — contains "win" and "prize" → 2 spam words → spam.

Example 2:

Input:
  messages = ["buy now", "buy buy buy"]
  spamWords = ["buy"]
Output (if counting duplicate tokens): ["not spam", "spam"]
Output (if counting distinct spam words): ["not spam", "not spam"]

Constraints:

  • 1 <= messages.length <= 10^4.
  • 1 <= spamWords.length <= 10^3.
  • Clarify: do duplicate occurrences of the same spam word count separately?

Difficulty: Easy
Round: OA


Pattern Recognition ​

  • What is the input structure? A list of strings (messages) and a list of strings (spam vocabulary).
  • What are we optimizing? O(1) membership lookup per token.
  • What pattern does this fit? Hash set + tokenization.
  • Why does this pattern fit? You need to answer "is this token a spam word?" many times. A hash set gives O(1) lookup. Tokenizing on whitespace (or word boundaries) turns each message into a stream of candidate words.
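
A minimal sketch of these two ingredients together, using the data from Example 1:

```typescript
// Sketch: Set membership + whitespace tokenization (Example 1 data).
const spamSet = new Set(["win", "prize", "money"]); // O(1) lookups
const tokens = "Free win prize".split(/\s+/);       // candidate words
const matches = tokens.filter(t => spamSet.has(t)).length;
console.log(matches); // 2 — "win" and "prize" hit the set
```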

Approach ​

Brute Force — Nested Loop ​

For each message, tokenize, and for each token, linearly scan the spam list.

Time: O(M * W * S) where M = number of messages, W = average tokens per message, S = spam-list size. Space: O(1) extra beyond the output. Why insufficient: every token pays an O(S) linear scan; with S up to 10^3 and up to 10^4 messages, that work is avoidable.
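
For contrast, a sketch of the nested-loop version, where Array.prototype.includes does the O(S) scan per token:

```typescript
// Brute force sketch: for every token, linearly scan the spam list.
function classifyBrute(messages: string[], spamWords: string[]): string[] {
    return messages.map(msg => {
        let count = 0;
        for (const token of msg.split(/\s+/)) {
            if (spamWords.includes(token)) count++; // O(S) scan per token
        }
        return count >= 2 ? "spam" : "not spam";
    });
}
```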

Optimal — Hash Set + Linear Tokenize ​

Key Insight: Put the spam words in a hash set once. Then scan each message's tokens, counting matches. Short-circuit at 2.

Step-by-step:

  1. Build spamSet = Set(spamWords).
  2. For each message:
    • Tokenize on whitespace.
    • Walk tokens, incrementing count when the token is in spamSet.
    • If count >= 2, label "spam" and break.
    • Otherwise label "not spam".
  3. Return the labels.

Time: O(T) total where T is the total number of tokens across all messages. Space: O(S) for the set.


Walkthrough ​

Trace on messages = ["Win money now", "Hello there", "Free win prize"], spamWords = ["win", "prize", "money"]:

spamSet = {"win", "prize", "money"}.

Message 1: "Win money now"

| token | in set? | count |
| --- | --- | --- |
| Win | no (case-sensitive) | 0 |
| money | yes | 1 |
| now | no | 1 |

count = 1 → "not spam".

Message 2: "Hello there"

Both tokens are misses. count = 0 → "not spam".

Message 3: "Free win prize"

| token | in set? | count |
| --- | --- | --- |
| Free | no | 0 |
| win | yes | 1 |
| prize | yes | 2 → break |

count = 2 → "spam".

Answer: ["not spam", "not spam", "spam"].


Implementation ​

```typescript
function classifyMessages(messages: string[], spamWords: string[]): string[] {
    const spamSet = new Set(spamWords);
    return messages.map(msg => {
        const tokens = msg.split(/\s+/);
        let count = 0;
        for (const t of tokens) {
            if (spamSet.has(t)) count++;
            if (count >= 2) return "spam";
        }
        return "not spam";
    });
}
```
```cpp
#include <string>
#include <vector>
#include <sstream>
#include <unordered_set>

std::vector<std::string> classifyMessages(const std::vector<std::string>& messages,
                                          const std::vector<std::string>& spamWords) {
    std::unordered_set<std::string> spamSet(spamWords.begin(), spamWords.end());
    std::vector<std::string> out;
    out.reserve(messages.size());
    for (const auto& msg : messages) {
        std::stringstream ss(msg);
        std::string token;
        int count = 0;
        while (ss >> token) {
            if (spamSet.count(token)) count++;
            if (count >= 2) break;
        }
        out.push_back(count >= 2 ? "spam" : "not spam");
    }
    return out;
}
```

Watch out:

  1. Case sensitivity. Do not lowercase anything. "Win" != "win".
  2. Duplicates count. The reported version counts each occurrence. If unclear, ask: "Does 'buy buy' count as 2 matches or 1?"
  3. Punctuation. If tokens can have trailing punctuation ("win!"), either strip it or use a looser split. Clarify.
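
If the interviewer answers "distinct words only" or allows punctuation, a variant like the following covers both clarifications. The option names `distinct` and `stripPunctuation` are illustrative, not part of the stated spec:

```typescript
// Variant sketch: distinct counting and punctuation stripping are opt-in.
// Option names are illustrative — confirm the exact rules with the interviewer.
function classifyStrict(
    messages: string[],
    spamWords: string[],
    opts: { distinct?: boolean; stripPunctuation?: boolean } = {}
): string[] {
    const spamSet = new Set(spamWords);
    return messages.map(msg => {
        const seen = new Set<string>();
        let count = 0;
        for (let token of msg.split(/\s+/)) {
            // Drop anything that is not a letter or digit, e.g. "win!" -> "win".
            if (opts.stripPunctuation) token = token.replace(/[^\p{L}\p{N}]+/gu, "");
            if (!spamSet.has(token)) continue;
            if (opts.distinct) {
                if (!seen.has(token)) { seen.add(token); count++; }
            } else {
                count++;
            }
            if (count >= 2) break;
        }
        return count >= 2 ? "spam" : "not spam";
    });
}
```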

Complexity Analysis ​

| Approach | Time | Space |
| --- | --- | --- |
| Brute Force (Linear Scan) | O(M * W * S) | O(1) |
| Optimal (Hash Set) | O(T) | O(S) |

Bottleneck: Total tokens across all messages. Set construction is O(S) once upfront.


Edge Cases ​

  1. Empty message — 0 tokens, label "not spam".
  2. Message is only whitespace — split(/\s+/) may yield [""]; filter empty tokens or trim.
  3. Spam words list empty — every message labeled "not spam".
  4. Message with exactly 2 spam words — boundary; the >= check includes it.
  5. Case variants — deliberately excluded by case-sensitive matching; do not normalize.
  6. Non-ASCII characters — Unicode works if both sides are normalized consistently. Otherwise, inputs like "café" vs "cafe\u0301" can silently miss.
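
Edge case 6 can be demonstrated with String.prototype.normalize: normalizing both the vocabulary and the tokens to NFC makes composed and decomposed forms compare equal.

```typescript
// Sketch: NFC normalization so "café" matches regardless of encoding form.
const unicodeSpamSet = new Set(["caf\u00e9"]);   // precomposed "café" (NFC)
const decomposed = "cafe\u0301";                 // "e" + combining acute (NFD)
console.log(unicodeSpamSet.has(decomposed));                  // false — different code points
console.log(unicodeSpamSet.has(decomposed.normalize("NFC"))); // true after normalizing
```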

Related Problems ​

  • LC 771 — Jewels and Stones — Hash set membership test, simpler variant.
  • LC 49 — Group Anagrams — Hash key construction from strings.
  • LC 290 — Word Pattern — Bijection between pattern chars and tokens.

OA Strategy Notes ​

This is the easiest OA problem you will see. 5-10 minutes of implementation. The only risk is reading the problem sloppily and missing "case-sensitive" or misunderstanding the duplicate rule. Re-read the problem once before coding.


What is Expected at Each Level ​

Junior: Solves quickly. Likely normalizes case by habit; needs to unlearn that for this problem.

Mid-level: Hash set, tokenize, count. Asks the case-sensitivity and duplicate questions upfront.

Senior: Discusses tokenization robustness (punctuation, Unicode, whitespace variants). Mentions that in production, stemming and n-gram matching would apply but are out of scope for the stated spec.

Frontend interview preparation reference.