Download Work - 840 -2024- Bengla -www.mazabd.click... <ULTIMATE>
suspicious_word_list = "download","click","open","update","verify","invoice","account", "password","login","security","confirm"
def entropy(s): """Shannon entropy of a string.""" probs = np.bincount(list(s.encode())) / len(s) probs = probs[probs > 0] return -np.sum(probs * np.log2(probs))
# ---- Entropy ------------------------------------------------------------ char_entropy = entropy(subject) Download WORK - 840 -2024- Bengla -www.mazabd.click...
def extract_features(subject: str) -> dict: # ---- Basic tokenisation ------------------------------------------------- tokens = re.split(r'\s+', subject.strip()) n_tokens = len(tokens) n_chars = len(subject)
| # | Feature | Why it matters | Extraction | |---|---------|----------------|------------| |13| | www.mazabd.click is a new TLD ( .click ) frequently used by malicious actors. | tldextract.extract(subject).registered_domain | |14| Domain Age (in days) | Newly registered domains are riskier. | Use WHOIS API → creation_date → today - creation_date . | |15| Domain Reputation Score | Public blacklists (VirusTotal, Google Safe Browsing) give a numeric trust rating. | Query API → reputation_score . | |16| Top‑Level‑Domain (TLD) Popularity | .click , .xyz , .top are over‑represented in phishing. | Encode TLD as categorical (one‑hot) or assign risk weight (e.g., .com =0, others=1). | |17| Number of Sub‑domains | More sub‑domains → higher chance of URL‑shortening or obfuscation. | subject_url.count('.') - 1 . | |18| Presence of Hyphens in Domain | Hyphens are often used to mimic legitimate names ( mazabd ). | '−' in domain (boolean). | |19| URL Length | Very long URLs are suspicious. | len(url) | |20| URL Entropy | Randomly generated strings boost entropy. | Same entropy formula as above, applied to url . | |21| IDN / Punycode | Internationalised domain names can hide malicious domains. | url.startswith('xn--') . | |22| SSL Certificate Validity | Self‑signed or expired certs are a warning sign (if you later fetch the URL). | Use ssl / requests to check cert.notAfter . | |23| IP‑Address in URL | Direct IP links are uncommon in legitimate business mail. | re.search(r'\b\d1,3(?:\.\d1,3)3\b', url) . | 3. Structural / Formatting Features | # | Feature | Why it matters | Extraction | |---|---------|----------------|------------| |24| Number of Hyphens ( - ) | Overuse of hyphens often separates “spammy” tokens. | subject.count('-') | |25| Pattern of Numeric Tokens | The sequence 840 -2024 is a “number‑dash‑year” pattern typical of fake invoice titles. | Regex: r'\b\d3,\s*-\s*\d4\b' → boolean | |26| Presence of Ellipsis ( ... ) | Indicates truncation; spammers often hide the rest of a malicious URL. | subject.endswith('...') | |27| Bracket/Parentheses Ratio | Unbalanced punctuation is a heuristic for malformed messages. | subject.count('(') != subject.count(')') | |28| Whitespace Anomalies (multiple spaces, tabs) | Spam generators sometimes add extra spaces to bypass simple filters. | re.search(r'\s2,', subject) | |29| Encoding Flags (e.g., =?UTF-8?B?…?= ) | MIME‑encoded subjects can hide malicious strings. | Detect with email.header.decode_header . | |30| Subject Prefix / Tag Count | Tags like [Urgent] , [Notice] can be abused. | re.findall(r'\[.*?\]', subject) → count. | 4. Aggregated / Meta‑Features You can combine the raw values into risk scores : | |15| Domain Reputation Score | Public blacklists
"Download WORK - 840 -2024- Bengla -www.mazabd.click..." into an that can be fed to a spam‑/phishing‑detection model (e.g., a classic‑ML classifier, a gradient‑boosted tree, or a shallow neural net). The ideas are grouped by what the feature describes , why it matters, and how to compute it in a reproducible way (Python‑friendly pseudo‑code is included). 1. Text‑Based Features (what the subject says ) | # | Feature | Why it matters | Simple extraction (Python) | |---|---------|----------------|-----------------------------| | 1 | Token Count | Very short or very long subjects are atypical for legitimate business mail. | len(subject.split()) | | 2 | Character Count | Spam often packs many characters to hide keywords. | len(subject) | | 3 | Average Token Length | Long words → possible obfuscation. | np.mean([len(t) for t in subject.split()]) | | 4 | Upper‑case Ratio | Excessive caps = “shouting”, common in phishing. | sum(1 for c in subject if c.isupper()) / len(subject) | | 5 | Digit Ratio | High proportion of numbers (e.g., “840‑2024”) is a red flag. | sum(c.isdigit() for c in subject) / len(subject) | | 6 | Presence of Action Verbs (download, click, open, update…) | Direct calls‑to‑action are hallmark of malicious prompts. | any(v in subject.lower() for v in "download","click","open","update","verify") | | 7 | Suspicious Keywords (work, urgent, invoice, account, password…) | Common lure words. | any(k in subject.lower() for k in suspicious_word_list) | | 8 | Stop‑Word Ratio | Spam often reduces natural language flow → low stop‑word density. | stop_words = set(nltk.corpus.stopwords.words('english')) stop_ratio = sum(1 for t in tokens if t.lower() in stop_words) / len(tokens) | | 9 | N‑gram TF‑IDF Scores (bi‑grams, tri‑grams) | Captures patterns like “download work”, “840‑2024”. | Use sklearn.feature_extraction.text.TfidfVectorizer(ngram_range=(2,3)) on a corpus of subjects. | |10| Language Detection | “Bengla” hints at a language mismatch (English subject + foreign term). | langdetect.detect(subject) – flag if not the primary language of the organization. | |11| Spell‑Check Ratio | Misspellings (“Bengla” vs “Bangla”) are common in malicious mail. | spellchecker.unknown(tokens) → proportion. | |12| Entropy of Characters | High entropy can indicate random strings or encoded data. | entropy = -sum(p*np.log2(p) for p in np.bincount(list(subject.encode()))/len(subject)) | 2. URL‑Centric Features (what the subject exposes ) Even though the URL lives after the dash, the presence and shape of a domain in the subject is a strong signal.
# Example simple risk score (0‑10) risk = 0 risk += int(upper_ratio > 0.4) * 1 risk += int(digit_ratio > 0.2) * 1 risk += int(has_action_verb) * 1 risk += int(has_suspicious_keyword) * 1 risk += int(domain_age_days < 30) * 2 risk += int(tld not in 'com','org','net','gov','edu') * 1 risk += int(num_hyphens > 2) * 1 risk += int(url_entropy > 4.0) * 1 risk = min(risk, 10) A more sophisticated approach is to feed all raw features into a (XGBoost, LightGBM) which automatically learns interaction effects (e.g., “high digit ratio and unknown TLD”). 5. Practical Implementation Checklist | Step | Action | Tool / Library | |------|--------|----------------| |1| Collect a labeled corpus (spam vs. legitimate subjects).| CSV / Parquet | |2| Parse each subject for the features above.| re , tldextract , email , nltk , sklearn | |3| Enrich URLs via external APIs (whois, VirusTotal, Google Safe Browsing).| python-whois , requests | |4| Vectorise text (TF‑IDF, word‑embeddings) for deeper semantic signals.| sklearn , gensim , sentence‑transformers | |5| Scale numeric columns (StandardScaler or MinMax) if using linear models.| sklearn.preprocessing | |6| Train & evaluate (cross‑validation, ROC‑AUC, PR‑AUC).| sklearn.model_selection | |7| Deploy as a micro‑service (FastAPI/Flask) that receives a subject line, returns a risk score + optional explanations (e.g., “high digit ratio, unknown TLD”).| FastAPI, Docker | |8| Monitor drift – keep an eye on feature distributions (e.g., sudden rise in new TLDs).| Prometheus + Grafana | 6. Example Code Snippet (End‑to‑End) import re, tldextract, datetime, numpy as np from collections import Counter from sklearn.feature_extraction.text import TfidfVectorizer | Encode TLD as categorical (one‑hot) or assign
# Dummy placeholders for reputation / age (replace with real API calls) domain_age_days = 9999 # e.g., today - creation_date domain_risk = 0 # 0 = clean, 1 = flagged