nlnieuws/python/Claude.md
Peter Kleiweg 8289072a52 weekadd
2026-06-19 12:59:45 +02:00


| title | updated | created | source |
| --- | --- | --- | --- |
| Claude | 2026-06-18 13:20:25Z | 2026-06-18 13:15:32Z | https://claude.ai/share/55700eff-d389-48d5-b1f5-cff59059ce61 |

I have collected a dataset of news articles from the past few weeks. I would like to know which words occur significantly more or less frequently in the last week compared to the four weeks before. What are good methods to do this?

There are several good methods for detecting statistically significant shifts in word frequency between two time periods. Here's an overview:

Corpus Comparison Methods

1. Log-Likelihood Ratio (G²)
The most widely used method in corpus linguistics (Dunning, 1993). It compares observed vs. expected frequencies under the null hypothesis that both corpora are drawn from the same distribution. It handles low-frequency words better than chi-squared and works well even when the corpora are very different sizes — which matters here since you're comparing 1 week vs. 4 weeks.

2. Chi-Squared (χ²) Test
A simpler alternative, but it's unreliable for low-frequency words (expected count < 5) and can produce many false positives in sparse data. Log-likelihood is generally preferred over it for text.

3. Mann-Whitney U / Permutation Tests
Non-parametric tests that make no distributional assumptions. Useful if you want to compare word rates per article rather than pooled counts, since they respect article-level variance.
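
As a rough illustration of the article-level approach, here is a minimal sketch using scipy.stats.mannwhitneyu. The per-article rates are made-up numbers, and the variable names are only placeholders for whatever your own pipeline produces.

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-article rates of one word (occurrences per 1,000 tokens).
# Each value is one article; these numbers are invented for illustration.
rates_recent = [4.1, 0.0, 2.7, 5.3, 3.8, 0.0, 6.2, 4.4]
rates_reference = [0.0, 1.2, 0.0, 0.8, 2.1, 0.0, 1.5, 0.0, 0.9, 1.1]

# Two-sided test: do the two periods differ in how often articles use the word?
u_stat, p_value = mannwhitneyu(rates_recent, rates_reference,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

Because the unit of observation is the article, a single long article that repeats the word many times counts once here, whereas it could dominate a test on pooled counts.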

Effect Size Measures (use alongside significance)

Significance alone doesn't tell you how much a word has changed. Pair it with:

  • %DIFF: simple percentage difference in relative frequency, easy to interpret
  • Bayes Factor (BIC approximation): penalises complexity, good for prioritising the most meaningful differences
  • Log Ratio: log2(freq_week5 / freq_weeks1-4), computed on relative frequencies; a symmetric, interpretable effect size where +1 means doubled and −1 means halved

Practical Considerations

| Issue | Recommendation |
| --- | --- |
| Unequal corpus sizes | Normalize to relative frequencies (per million words) for the effect sizes; give the significance tests the raw counts, since the contingency table already accounts for corpus size |
| Multiple comparisons | Apply Benjamini-Hochberg FDR correction, since you're testing thousands of words simultaneously |
| Low-frequency words | Set a minimum frequency threshold (e.g., appears ≥ 5 times total) to reduce noise |
| Stop words | Decide whether to include them; they shift too and can be informative (e.g., rising use of "we" vs. "they") |
| Multiword expressions | Consider bigrams/trigrams if single words feel too decontextualised |

Recommended Workflow

  1. Tokenize and normalize (lowercase, optional lemmatization)
  2. Count frequencies in each time window
  3. Filter by minimum frequency
  4. Run Log-Likelihood for significance; compute Log Ratio for effect size
  5. Apply FDR correction across all words
  6. Rank by effect size among significant words
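
Steps 1 to 3 can be sketched with the standard library alone. This is only an illustration: the toy articles, the regex tokenizer, and the names tokenize, count_tokens and MIN_COUNT are assumptions, and a real pipeline would probably use nltk or spacy for tokenization and lemmatization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters (no lemmatization)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def count_tokens(articles):
    """Return (word counts, total token count) for a list of article texts."""
    counts = Counter()
    for text in articles:
        counts.update(tokenize(text))
    return counts, sum(counts.values())

# Hypothetical toy data standing in for the real article texts.
recent_articles = ["The crisis deepens as talks stall.", "Crisis talks resume."]
reference_articles = ["Talks continue in Brussels.", "The summit ends without a deal."]

counts_recent, total_recent = count_tokens(recent_articles)
counts_reference, total_reference = count_tokens(reference_articles)

# Step 3: keep words that occur at least MIN_COUNT times across both windows.
MIN_COUNT = 2  # 5 is the threshold suggested for real data; 2 suits this toy set
vocabulary = sorted(
    w for w in set(counts_recent) | set(counts_reference)
    if counts_recent[w] + counts_reference[w] >= MIN_COUNT
)
print(vocabulary)
```

Building the vocabulary from the union of both windows keeps words that disappeared in the recent week, which matters if you also want to find words that became less frequent.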

Tools & Libraries

  • Python: scipy.stats for chi-squared/G², statsmodels for FDR correction, nltk/spacy for tokenization
  • R: textstat_keyness() implements log-likelihood and several other keyness measures out of the box, which is very convenient for exactly this use case; since quanteda version 3 it lives in the companion package quanteda.textstats rather than in quanteda itself
  • Ready-made: AntConc (free GUI tool) supports keyness analysis directly

Would you like help implementing any of these in code?

For the %DIFF, do you calculate it as "freq_week5 - freq_weeks1-4" ?

Can you show scipy.stats code for chi-squared and G^2 ?

%DIFF Formula

It's the relative difference in normalised frequencies (not raw counts, since your corpora are different sizes):

%DIFF = ((freq_B / size_B) - (freq_A / size_A)) / (freq_A / size_A) * 100

Where A is the reference corpus (weeks 1-4) and B is the target (week 5). So it answers: "by what percentage did the relative frequency go up or down?"
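
To make the formula concrete, here is a small sketch with made-up counts. The function name pct_diff and the choice to return infinity for a word that is absent from the reference corpus are assumptions, not part of any standard definition.

```python
def pct_diff(freq_a, size_a, freq_b, size_b):
    """Percentage change in relative frequency from reference A to target B."""
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    if rel_a == 0:
        # Undefined when the word never occurs in the reference corpus;
        # returning infinity is one possible convention.
        return float("inf")
    return (rel_b - rel_a) / rel_a * 100

# Hypothetical word: 60 hits in 40,000 reference tokens, 30 hits in 10,000 recent tokens.
print(pct_diff(60, 40_000, 30, 10_000))
```

The raw count is lower in the recent week (30 vs. 60), but the relative frequency has doubled, so %DIFF comes out positive. That is why the formula works on normalised frequencies.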


Chi-Squared and G² in scipy

import numpy as np
from scipy.stats import chi2_contingency

def corpus_stats(word, counts_recent, counts_reference, total_recent, total_reference):
    """
    word             : the word being tested
    counts_recent    : raw count in week 5
    counts_reference : raw count in weeks 1-4
    total_recent     : total tokens in week 5
    total_reference  : total tokens in weeks 1-4
    """
    a = counts_recent        # word in recent
    b = counts_reference     # word in reference
    c = total_recent - a     # non-word in recent
    d = total_reference - b  # non-word in reference

    contingency = np.array([[a, b],
                            [c, d]])

    # --- Chi-Squared ---
    # correction=False disables Yates' continuity correction
    chi2_stat, p_chi2, _, _ = chi2_contingency(contingency, correction=False)

    # --- Log-Likelihood (G²) ---
    # G² = 2 * sum(observed * log(observed / expected))
    # lambda_="log-likelihood" selects G²; correction=False is needed here too,
    # because chi2_contingency otherwise applies Yates' correction to any 2×2 table
    g2_stat, p_g2, _, _ = chi2_contingency(contingency, correction=False,
                                           lambda_="log-likelihood")

    # --- Effect sizes ---
    freq_recent    = a / total_recent
    freq_reference = b / total_reference

    # %DIFF is undefined when the word is absent from the reference corpus
    if freq_reference > 0:
        pct_diff = (freq_recent - freq_reference) / freq_reference * 100
    else:
        pct_diff = float("inf")

    # Avoid log(0) with a small epsilon
    eps = 1e-9
    log_ratio = np.log2((freq_recent + eps) / (freq_reference + eps))

    return {
        "word":           word,
        "freq_recent":    freq_recent,
        "freq_reference": freq_reference,
        "pct_diff":       pct_diff,
        "log_ratio":      log_ratio,
        "chi2":           chi2_stat,
        "p_chi2":         p_chi2,
        "g2":             g2_stat,
        "p_g2":           p_g2,
    }

A few notes:

  • correction=False disables Yates' continuity correction, which chi2_contingency applies by default to every 2×2 table; leaving it off is standard for large corpora
  • lambda_="log-likelihood" is scipy's built-in way to get G²; it uses the same contingency table logic, and the default Yates' correction is applied to it as well unless correction=False is passed
  • Both tests have 1 degree of freedom for a 2×2 table, so the p-values are directly comparable
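
One detail worth checking for yourself: chi2_contingency applies Yates' correction by default to any 2×2 table, whichever statistic lambda_ selects. A minimal sketch with made-up counts shows the effect on G²; the table values are purely illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2×2 table: rows = (word, other words), columns = (recent, reference).
table = np.array([[12, 20],
                  [4988, 19980]])

# Default call: Yates' continuity correction is applied before computing G².
g2_corrected, p_corrected, _, _ = chi2_contingency(
    table, lambda_="log-likelihood")

# Plain G², matching the textbook formula 2 * sum(O * ln(O / E)).
g2_plain, p_plain, _, _ = chi2_contingency(
    table, correction=False, lambda_="log-likelihood")

print(f"with correction:    G2 = {g2_corrected:.3f}, p = {p_corrected:.4f}")
print(f"without correction: G2 = {g2_plain:.3f}, p = {p_plain:.4f}")
```

The corrected statistic is the smaller of the two, so forgetting correction=False makes the test slightly more conservative than the formula suggests.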

Running it over a full vocabulary

from statsmodels.stats.multitest import multipletests

# Test every word that occurs in either window, so that words which
# vanished from the recent week are included as well
vocabulary = sorted(set(counts_recent) | set(counts_reference))

results = [
    corpus_stats(word, counts_recent.get(word, 0), counts_reference.get(word, 0),
                 total_recent, total_reference)
    for word in vocabulary
]

# FDR correction across all words
p_values = [r["p_g2"] for r in results]
_, p_adjusted, _, _ = multipletests(p_values, method="fdr_bh")

for r, p_adj in zip(results, p_adjusted):
    r["p_g2_adjusted"] = p_adj
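
The last workflow step, ranking by effect size among the significant words, could then look roughly like this. The rows below are invented stand-ins with the same keys as the results list, and the 0.05 threshold (ALPHA) is an assumed choice.

```python
# Hypothetical rows shaped like the results list; all values are made up.
results = [
    {"word": "crisis", "log_ratio": 1.0,  "p_g2_adjusted": 0.004},
    {"word": "summit", "log_ratio": -2.3, "p_g2_adjusted": 0.010},
    {"word": "the",    "log_ratio": 0.02, "p_g2_adjusted": 0.800},
    {"word": "talks",  "log_ratio": 0.6,  "p_g2_adjusted": 0.030},
]

ALPHA = 0.05  # assumed FDR threshold

# Keep only words that survive the FDR correction.
significant = [r for r in results if r["p_g2_adjusted"] < ALPHA]

# Rank by the size of the change, regardless of direction.
ranked = sorted(significant, key=lambda r: abs(r["log_ratio"]), reverse=True)

rising = [r["word"] for r in ranked if r["log_ratio"] > 0]
falling = [r["word"] for r in ranked if r["log_ratio"] < 0]
print("rising: ", rising)
print("falling:", falling)
```

Sorting on the absolute log ratio puts the strongest changes first in both lists, so a word that halved ranks as high as a word that doubled.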

Would you like help loading your articles and building the frequency counts to feed into this?

And what about tf-idf?

TF-IDF is a slightly different tool — worth understanding the distinction before deciding whether to use it.

What TF-IDF actually measures

TF-IDF scores how distinctive a word is for one document relative to a collection of documents. The formula is:

TF-IDF(word, doc) = TF(word, doc) × IDF(word, corpus)

  • TF: how often the word appears in a specific document
  • IDF: penalises words that appear in many documents (so "the" scores low)

It was designed for ranking documents by relevance to a query, not for comparing two time periods.

The core problem for your use case

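If you treat each time period as one big document, the IDF part breaks down. With only two "documents", IDF can take just two values, and every word that occurs in both periods gets the lower one (exactly zero with the unsmoothed log(N/df) formula). The words you care about, those present in both periods but at different rates, are therefore scored at zero or by little more than their raw term frequency, while words unique to one period rise to the top. Nothing in the score tests whether a difference is larger than chance, which isn't what you want.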

You could treat each article as a document and then aggregate scores by time period, but then you're measuring within-article distinctiveness, not frequency shifts over time.
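
To see what happens when each period is treated as a single document, here is a toy sketch with the unsmoothed IDF, log(N / df). The token lists and the helper tf_idf are made up for illustration; libraries such as scikit-learn use smoothed variants that give small non-zero values instead of exact zeros.

```python
import math
from collections import Counter

# Two hypothetical "documents", one per time period.
periods = {
    "week5":    "crisis crisis talks the the".split(),
    "weeks1-4": "talks summit the the the".split(),
}

n_docs = len(periods)

# Document frequency: in how many periods does each word occur?
doc_freq = Counter()
for tokens in periods.values():
    doc_freq.update(set(tokens))

def tf_idf(word, tokens):
    """Relative term frequency times unsmoothed IDF."""
    tf = tokens.count(word) / len(tokens)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

for word in ["crisis", "talks", "the"]:
    print(word, round(tf_idf(word, periods["week5"]), 3))
```

Only "crisis", which is absent from the reference period, gets a non-zero score. "talks" and "the" occur in both periods, so their IDF is log(2/2) = 0 and any change in their frequency is invisible.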

When TF-IDF is useful in your pipeline

| Use case | Good fit? |
| --- | --- |
| Detecting words that increased/decreased in frequency week-over-week | ✗ Use G² + log ratio instead |
| Finding the most topically distinctive words within a single article | ✓ |
| Building features for a classifier or clustering articles by topic | ✓ |
| Summarising what a time period was "about" relative to a background corpus | Partially (see below) |

The partial exception: keyness as pseudo-TF-IDF

Some researchers compute a keyness score by treating the recent period as the "document" and the reference period as the "corpus" — which is conceptually similar to TF-IDF. But the statistically rigorous version of this is exactly the log-likelihood + log ratio approach already described. G² is essentially the principled, significance-tested version of what TF-IDF approximates intuitively.

Bottom line

For your goal (words that shifted significantly between week 5 and weeks 1-4), G² with log ratio is the right tool. TF-IDF would give you a rough proxy at best and a misleading ranking at worst.

The corpus_stats() function contains the following comment:

# G² = 2 * sum(observed * log(observed / expected))

Is this the formula to calculate the log-likelihood ratio? What do the observed and expected variables contain?

Yes, that's the correct formula for G². Let me break it down.

The full formula

G² = 2 × Σ Oᵢ × ln(Oᵢ / Eᵢ)

The sum runs over all four cells of the contingency table:

| | Recent (week 5) | Reference (weeks 1-4) |
| --- | --- | --- |
| Target word | a | b |
| All other words | c | d |

Observed values

Simply the raw counts from your data:

O₁ = a  (word in recent)
O₂ = b  (word in reference)
O₃ = c  (all other words in recent)
O₄ = d  (all other words in reference)

Expected values

The counts you would expect if the word's relative frequency were identical in both corpora — i.e. under the null hypothesis that nothing changed. They're computed from the row and column marginals:

E₁ = (a + b) × (a + c) / N
E₂ = (a + b) × (b + d) / N
E₃ = (c + d) × (a + c) / N
E₄ = (c + d) × (b + d) / N

Where N = a + b + c + d is the total number of tokens across both corpora.
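
The four expected values are just (row total × column total) / N, which is easy to compute in one step. A small sketch with made-up counts, cross-checked against scipy.stats.contingency.expected_freq:

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical counts: rows = (target word, other words),
# columns = (recent, reference).
observed = np.array([[12, 20],
                     [4988, 19980]])

row_totals = observed.sum(axis=1, keepdims=True)  # a+b and c+d
col_totals = observed.sum(axis=0, keepdims=True)  # a+c and b+d
n = observed.sum()                                # N = a+b+c+d

# Broadcasting gives all four E values at once.
expected = row_totals * col_totals / n
print(expected)

# scipy computes the same table from the marginals.
print(np.allclose(expected, expected_freq(observed)))
```

The row and column totals of the expected table equal those of the observed table; only the distribution of the word across the two periods is evened out.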

Concrete example

Say your recent corpus has 10,000 tokens and your reference has 40,000 tokens, and the word "crisis" appears 30 times recently and 60 times in the reference:

a = 30,    b = 60
c = 9970,  d = 39940
N = 50000

E₁ = (30+60)  × (30+9970)  / 50000 = 90 × 10000 / 50000 = 18
E₂ = (30+60)  × (60+39940) / 50000 = 90 × 40000 / 50000 = 72
E₃ = (9970+39940) × 10000  / 50000 = 49910 × 10000 / 50000 = 9982
E₄ = (9970+39940) × 40000  / 50000 = 49910 × 40000 / 50000 = 39928

G² = 2 × (30×ln(30/18) + 60×ln(60/72) + 9970×ln(9970/9982) + 39940×ln(39940/39928))
   ≈ 2 × (15.325 − 10.939 − 11.993 + 12.002)
   ≈ 8.79

"crisis" appears at about 1.67× its expected count (30/18), which corresponds to twice its relative frequency in the reference corpus (0.30% vs. 0.15%). G² ≈ 8.79 exceeds the chi-squared critical value of 3.84 for p < 0.05 with 1 degree of freedom, so the increase is significant.