
Spam Classifier ​

Problem Statement ​

Given a list of messages and a list of spam words, label each message as "spam" if it contains two or more spam words; otherwise label it "not spam". Matching is case-sensitive — "Win" and "win" are different.

Example 1:

Input:
  messages = ["Win money now", "Hello there", "Free win prize"]
  spamWords = ["win", "prize", "money"]
Output: ["not spam", "not spam", "spam"]
Explanation:
  "Win money now" — contains "money" (spam) but NOT "win" (different case). 1 spam word → not spam.
  "Hello there" — 0 spam words → not spam.
  "Free win prize" — contains "win" and "prize" → 2 spam words → spam.

Example 2:

Input:
  messages = ["buy now", "buy buy buy"]
  spamWords = ["buy"]
Output (if counting duplicate tokens): ["not spam", "spam"]
Output (if counting distinct spam words): ["not spam", "not spam"]

Constraints:

  • 1 <= messages.length <= 10^4.
  • 1 <= spamWords.length <= 10^3.
  • Clarify: do duplicate occurrences of the same spam word count separately?

Difficulty: Easy
Round: OA


Pattern Recognition ​

  • What is the input structure? A list of strings (messages) and a list of strings (spam vocabulary).
  • What are we optimizing? O(1) membership lookup per token.
  • What pattern does this fit? Hash set + tokenization.
  • Why does this pattern fit? You need to answer "is this token a spam word?" many times. A hash set gives O(1) lookup. Tokenizing on whitespace (or word boundaries) turns each message into a stream of candidate words.
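
A minimal sketch of these two ingredients together, using the data from Example 1:

```typescript
// Sketch: Set membership + whitespace tokenization (Example 1 data).
const spamSet = new Set(["win", "prize", "money"]); // O(1) lookups
const tokens = "Free win prize".split(/\s+/);       // candidate words
const matches = tokens.filter(t => spamSet.has(t)).length;
console.log(matches); // 2 — "win" and "prize" hit the set
```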

Approach ​

Brute Force — Nested Loop ​

For each message, tokenize, and for each token, linearly scan the spam list.

Time: O(M * W * S) where M = number of messages, W = average tokens per message, S = spam-list size. Space: O(1) extra beyond the output. Why insufficient: every token pays an O(S) linear scan; with S up to 10^3 and up to 10^4 messages, that work is avoidable.
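
For contrast, a sketch of the nested-loop version, where Array.prototype.includes does the O(S) scan per token:

```typescript
// Brute force sketch: for every token, linearly scan the spam list.
function classifyBrute(messages: string[], spamWords: string[]): string[] {
    return messages.map(msg => {
        let count = 0;
        for (const token of msg.split(/\s+/)) {
            if (spamWords.includes(token)) count++; // O(S) scan per token
        }
        return count >= 2 ? "spam" : "not spam";
    });
}
```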

Optimal — Hash Set + Linear Tokenize ​

Key Insight: Put the spam words in a hash set once. Then scan each message's tokens, counting matches. Short-circuit at 2.

Step-by-step:

  1. Build spamSet = Set(spamWords).
  2. For each message:
    • Tokenize on whitespace.
    • Walk tokens, incrementing count when the token is in spamSet.
    • If count >= 2, label "spam" and break.
    • Otherwise label "not spam".
  3. Return the labels.

Time: O(T) total where T is the total number of tokens across all messages. Space: O(S) for the set.


Walkthrough ​

Trace on messages = ["Win money now", "Hello there", "Free win prize"], spamWords = ["win", "prize", "money"]:

spamSet = {"win", "prize", "money"}.

Message 1: "Win money now"

| token | in set? | count |
| --- | --- | --- |
| Win | no (case-sensitive) | 0 |
| money | yes | 1 |
| now | no | 1 |

count = 1 → "not spam".

Message 2: "Hello there"

Both tokens are misses. count = 0 → "not spam".

Message 3: "Free win prize"

| token | in set? | count |
| --- | --- | --- |
| Free | no | 0 |
| win | yes | 1 |
| prize | yes | 2 → break |

count = 2 → "spam".

Answer: ["not spam", "not spam", "spam"].


Implementation ​

```typescript
function classifyMessages(messages: string[], spamWords: string[]): string[] {
    const spamSet = new Set(spamWords);
    return messages.map(msg => {
        const tokens = msg.split(/\s+/);
        let count = 0;
        for (const t of tokens) {
            if (spamSet.has(t)) count++;
            if (count >= 2) return "spam";
        }
        return "not spam";
    });
}
```
```cpp
#include <string>
#include <vector>
#include <sstream>
#include <unordered_set>

std::vector<std::string> classifyMessages(const std::vector<std::string>& messages,
                                          const std::vector<std::string>& spamWords) {
    std::unordered_set<std::string> spamSet(spamWords.begin(), spamWords.end());
    std::vector<std::string> out;
    out.reserve(messages.size());
    for (const auto& msg : messages) {
        std::stringstream ss(msg);
        std::string token;
        int count = 0;
        while (ss >> token) {
            if (spamSet.count(token)) count++;
            if (count >= 2) break;
        }
        out.push_back(count >= 2 ? "spam" : "not spam");
    }
    return out;
}
```

Watch out:

  1. Case sensitivity. Do not lowercase anything. "Win" != "win".
  2. Duplicates count. The reported version counts each occurrence. If unclear, ask: "Does 'buy buy' count as 2 matches or 1?"
  3. Punctuation. If tokens can have trailing punctuation ("win!"), either strip it or use a looser split. Clarify.
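
If the interviewer answers "distinct words only" or allows punctuation, a variant like the following covers both clarifications. The option names `distinct` and `stripPunctuation` are illustrative, not part of the stated spec:

```typescript
// Variant sketch: distinct counting and punctuation stripping are opt-in.
// Option names are illustrative — confirm the exact rules with the interviewer.
function classifyStrict(
    messages: string[],
    spamWords: string[],
    opts: { distinct?: boolean; stripPunctuation?: boolean } = {}
): string[] {
    const spamSet = new Set(spamWords);
    return messages.map(msg => {
        const seen = new Set<string>();
        let count = 0;
        for (let token of msg.split(/\s+/)) {
            // Drop anything that is not a letter or digit, e.g. "win!" -> "win".
            if (opts.stripPunctuation) token = token.replace(/[^\p{L}\p{N}]+/gu, "");
            if (!spamSet.has(token)) continue;
            if (opts.distinct) {
                if (!seen.has(token)) { seen.add(token); count++; }
            } else {
                count++;
            }
            if (count >= 2) break;
        }
        return count >= 2 ? "spam" : "not spam";
    });
}
```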

Complexity Analysis ​

| Approach | Time | Space |
| --- | --- | --- |
| Brute Force (Linear Scan) | O(M * W * S) | O(1) |
| Optimal (Hash Set) | O(T) | O(S) |

Bottleneck: Total tokens across all messages. Set construction is O(S) once upfront.


Edge Cases ​

  1. Empty message — 0 tokens, label "not spam".
  2. Message is only whitespace — split(/\s+/) may yield [""]; filter empty tokens or trim.
  3. Spam words list empty — every message labeled "not spam".
  4. Message with exactly 2 spam words — boundary; the >= check includes it.
  5. Case variants — deliberately excluded by case-sensitive matching; do not normalize.
  6. Non-ASCII characters — Unicode works if both sides are normalized consistently. Otherwise, inputs like "café" vs "cafe\u0301" can silently miss.
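
Edge case 6 can be demonstrated with String.prototype.normalize: normalizing both the vocabulary and the tokens to NFC makes composed and decomposed forms compare equal.

```typescript
// Sketch: NFC normalization so "café" matches regardless of encoding form.
const unicodeSpamSet = new Set(["caf\u00e9"]);   // precomposed "café" (NFC)
const decomposed = "cafe\u0301";                 // "e" + combining acute (NFD)
console.log(unicodeSpamSet.has(decomposed));                  // false — different code points
console.log(unicodeSpamSet.has(decomposed.normalize("NFC"))); // true after normalizing
```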

Related Problems ​

  • LC 771 — Jewels and Stones — Hash set membership test, simpler variant.
  • LC 49 — Group Anagrams — Hash key construction from strings.
  • LC 290 — Word Pattern — Bijection between pattern chars and tokens.

OA Strategy Notes ​

This is the easiest OA problem you will see. 5-10 minutes of implementation. The only risk is reading the problem sloppily and missing "case-sensitive" or misunderstanding the duplicate rule. Re-read the problem once before coding.


What is Expected at Each Level ​

Junior: Solves quickly. Likely normalizes case by habit; needs to unlearn that for this problem.

Mid-level: Hash set, tokenize, count. Asks the case-sensitivity and duplicate questions upfront.

Senior: Discusses tokenization robustness (punctuation, Unicode, whitespace variants). Mentions that in production, stemming and n-gram matching would apply but are out of scope for the stated spec.

Frontend interview preparation reference.