Using Screaming Frog embeddings + Excel to decide what to merge, refresh, prune (2025)

Here’s a story about how I used Screaming Frog vector embeddings + ChatGPT + Excel formulas to create a merge, refresh, and prune database for a blog of roughly 500 posts. The goal was to create an objective roadmap for what to do with these posts pre-website migration that didn’t rely solely on SEO performance metrics or subjective interpretation of blog post content.

——————————————————

I didn’t set out to build a spreadsheet monster. I set out to answer one deceptively simple question: Which posts should we merge, which deserve a refresh, and which can we safely let go? The site was a classic mature-resource-center situation—hundreds of posts, years of overlapping coverage, a new go‑to‑market story, and more “best practices” than any reader could possibly want.

I knew I’d need something more truthful than “gut feel,” but also something my stakeholders could actually use. The answer turned into a hybrid: AI embeddings to understand what posts mean, and Excel to turn that meaning into decisions. This is the story of how I did it—what worked, what didn’t, and exactly how I’d do it again.

The hypothesis

My hypothesis was straightforward: If I combine semantic similarity (how close two posts are in meaning) with business labels (pain points, jobs‑to‑be‑done) and a few simple performance signals, I can create a repeatable, defensible system for action: Merge, Refresh, Prune, or Keep.

I wanted it to be auditable (show your work), tunable (change thresholds; watch the plan update), and boringly practical (live inside Excel so no one has to learn a new tool). That constraint shaped everything that followed.

The process

Crawling and embeddings

I started with a Screaming Frog crawl and used the Gemini API to produce a vector embedding for each post. Think of an embedding as a GPS coordinate for meaning; if two posts land close together in that space, they’re probably talking about the same thing.

From points to clusters

In Python, I built a similarity graph from those vectors (edges only where cosine ≥ a threshold) and labeled connected components. That gave me a cluster ID for each URL—useful, but still a bit broad. Clusters tell you “these are cousins,” not always “these are duplicates.”
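
If you’re curious what that Python step looks like, here’s a minimal sketch. It assumes the crawl export (with the Gemini embeddings Screaming Frog produced) has a URL column called Address and the vector stored as a comma-separated string in an Embedding column; the file name, column names, and the 0.85 threshold are placeholders rather than exact export headers.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Load the crawl export (hypothetical file and column names).
df = pd.read_csv("crawl_with_embeddings.csv")
vectors = np.vstack([
    np.array([float(x) for x in s.split(",")]) for s in df["Embedding"]
])

# Cosine similarity: normalise each row, then take the dot product.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Keep edges only where similarity clears the threshold, then label
# connected components. Each component becomes a cluster ID.
THRESHOLD = 0.85  # tune for your corpus
adjacency = csr_matrix(sim >= THRESHOLD)
_, labels = connected_components(adjacency, directed=False)
df["cluster"] = labels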

Adding the business lens

So I overlaid two label sets: the current MappedPain taxonomy and the legacy OriginalJTBD (jobs‑to‑be‑done) tags. That’s when things snapped into focus. I combined them with the cluster to create a micro‑group key I call a pod:

PodKey = cluster | MappedPain | OriginalJTBD

A pod is small on purpose. If two URLs share the same cluster and the same pain and the same JTBD, they aren’t cousins—they’re siblings.
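
Continuing the Python sketch, the pod step is one concatenation and one group count; MappedPain and OriginalJTBD stand in for whatever your own taxonomy sheet calls those columns.

# Build the pod key and count how many URLs share it.
df["PodKey"] = (
    df["cluster"].astype(str) + " | " + df["MappedPain"] + " | " + df["OriginalJTBD"]
)
df["PodSize"] = df.groupby("PodKey")["Address"].transform("count")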

Electing a hero post

Inside each pod I needed one page to be the destination when consolidation made sense. My first attempt was “highest six‑month clicks wins.” It worked… until it didn’t. Sometimes a page with slightly fewer clicks had a much stronger CTR, which told me searchers preferred it.

My fix: a two‑step rule—highest clicks and highest CTR wins; if there’s a tie, highest clicks alone wins. That tiny change cut most of the weird cases. Every pod now had a single hero, which later made redirect mapping trivial.
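
In pandas, one reading of that rule looks like the sketch below: sort each pod by six-month clicks with CTR as the tie-breaker, crown the top row, then stamp the hero’s URL onto every sibling. Clicks6m and CTR are assumed column names from the Search Console join, not anything Screaming Frog outputs by default.

# Rank within each pod: clicks first, CTR as the tie-breaker.
ranked = df.sort_values(["PodKey", "Clicks6m", "CTR"], ascending=[True, False, False])
df["IsHero"] = df.index.isin(ranked.groupby("PodKey").head(1).index)

# Put the hero URL on every row so merge candidates map straight to a 301 target.
hero_urls = df.loc[df["IsHero"]].set_index("PodKey")["Address"]
df["HeroURL"] = df["PodKey"].map(hero_urls)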

The dials I gave myself (and why they matter)

I didn’t want hard‑coded opinions. I wanted dials. I put four numbers in a tiny Parameters table so I can change strategy in seconds:

  • RelCut = 0.10 → a page is “weak” if it gets ≤10% of its pod hero’s six‑month clicks.

  • CTRCut = 0.04 → CTR under 4% reads as low engagement.

  • TrendCut = −0.05 → a 3‑month decline of 5% or more is “slipping.”

  • ClickFloor = 60 → if a page gets <60 clicks in six months, it’s probably not pulling its weight.

A stakeholder asks, “Could we merge fewer pages?” I nudge RelCut from 0.10 to 0.05 and the workbook recalculates live. Magic, but also not magic—just XLOOKUPs.
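
In the workbook those dials live in a named Parameters table that the XLOOKUPs read from; if you’re scripting the analysis instead, the equivalent is just a small dictionary the rules reference later. A minimal sketch:

# The four dials, kept in one place so a strategy change is a one-line edit.
PARAMS = {
    "RelCut":     0.10,   # "weak" = at or below 10% of the pod hero's six-month clicks
    "CTRCut":     0.04,   # CTR under 4% reads as low engagement
    "TrendCut":  -0.05,   # a 3-month decline of 5% or more is "slipping"
    "ClickFloor": 60,     # under 60 clicks in six months isn't pulling its weight
}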

Turning signals into one clear action per URL

Early on I made a mistake a lot of audits make: I let multiple flags fire at once. A page could be a merge candidate and a refresh candidate, which is true but useless. So I made the flags mutually exclusive and set a priority. I chose:

PRUNE → REFRESH → KEEP → MERGE

Why that order? If a page is a lonely, low‑value outlier, get rid of it before you spend time refreshing anything. If it’s valuable but slipping, fix it. If it’s fine, don’t touch it. And only then go hunting for duplicate consolidation.

Here’s the logic in words, not formulas:

  • PRUNE if it’s the only page for that pain, sits in the bottom 25% for six‑month clicks, has low CTR, and is declining.

  • REFRESH if it’s on an important pain (there are multiple pages in that pain bucket), has more than bottom‑quartile traffic, and is declining.

  • MERGE if it shares a pod with at least one other URL and it’s weak on every front: a low share of the hero’s clicks (≤RelCut), low CTR, declining, and below the absolute click floor.

  • KEEP if none of the above applies.

The outcome is refreshingly boring: one—and only one—action per URL, driven by numbers you can defend.
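
If you’d rather script this step than nest IFs, here’s one way to express those rules, continuing the Python sketch from earlier. Trend3m (the 3-month click change as a fraction) and the pain-bucket count are assumed inputs; KEEP falls out as the default when nothing else fires, which ends up matching the priority above because MERGE only triggers on pages that are weak across the board.

# Pre-compute the inputs the rules need.
q1 = df["Clicks6m"].quantile(0.25)  # bottom-quartile click cutoff
df["HeroClicks"] = df["PodKey"].map(
    df.loc[df["IsHero"]].set_index("PodKey")["Clicks6m"]
)
df["PainCount"] = df.groupby("MappedPain")["Address"].transform("count")

def action(row):
    declining = row["Trend3m"] <= PARAMS["TrendCut"]
    low_ctr = row["CTR"] < PARAMS["CTRCut"]
    # PRUNE: lonely page for its pain, low value, low engagement, slipping.
    if row["PainCount"] == 1 and row["Clicks6m"] <= q1 and low_ctr and declining:
        return "PRUNE"
    # REFRESH: important pain, real traffic, but slipping.
    if row["PainCount"] > 1 and row["Clicks6m"] > q1 and declining:
        return "REFRESH"
    # MERGE: has pod siblings and is weak on every front.
    if (row["PodSize"] > 1 and not row["IsHero"]
            and row["Clicks6m"] <= PARAMS["RelCut"] * row["HeroClicks"]
            and low_ctr and declining
            and row["Clicks6m"] < PARAMS["ClickFloor"]):
        return "MERGE"
    return "KEEP"

df["Action"] = df.apply(action, axis=1)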

What the sheet actually produces

Two deliverables make this whole exercise feel concrete:

  1. Redirects: every row with MERGE comes with a HeroURL (found via the pod’s hero). That’s a ready‑to‑ship 301 map. No manual pairing required.

  2. Refresh backlog: every REFRESH row carries its pain/JTBD context and recent trend, which is all a writer needs to scope a focused update.

Behind the scenes I keep a small Prune Review list as well, mainly to spot‑check for backlinks or product relevance before anything is removed.
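
Continuing the sketch, shipping those three lists is a few filtered exports; the file names are placeholders.

# Redirect map: every MERGE loser paired with its pod hero.
df.loc[df["Action"] == "MERGE", ["Address", "HeroURL"]].to_csv(
    "redirects_301.csv", index=False
)

# Refresh backlog: enough context for a writer to scope the update.
df.loc[df["Action"] == "REFRESH",
       ["Address", "MappedPain", "OriginalJTBD", "Trend3m"]].to_csv(
    "refresh_backlog.csv", index=False
)

# Prune review list: spot-check backlinks and product relevance before removal.
df.loc[df["Action"] == "PRUNE", ["Address", "Clicks6m", "CTR"]].to_csv(
    "prune_review.csv", index=False
)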

Where I hit friction—and how I resolved it

1) Heroes that didn’t feel like heroes

Picking by clicks alone occasionally chose the wrong destination. Adding CTR as a tie‑breaker fixed most of it. When in doubt, I skimmed H1s and kept the page that read closer to the pain/JTBD, not just the keyword.

Caveat here: Relying solely on performance and clusters to pick a hero didn’t always work. This is a spot where manual intervention and more logical content mapping were needed.

2) Clusters were too big

Embeddings are great, but a raw cluster often groups multiple intents. Adding MappedPain and OriginalJTBD to form PodKey was the unlock. It turned “topical family” into “same niche problem” and removed most false merges.

3) Too many merges, then too few

At first almost everything looked like a merge. I tightened the gates (added ClickFloor and required PodSize > 1) and the count dropped to something realistic. Later, when stakeholders wanted more consolidation, I loosened RelCut. The point: the dials give you control without rewriting formulas.

A quick walkthrough you can replicate

If you want to run this yourself, here’s the narrative version of the workflow I wish I’d had the first time:

  1. Crawl and embed. Get a vector per URL. Any modern embedding model will do; consistency matters more than squeezing out the last 2%.

  2. Cluster. Build a similarity graph, pick a threshold, label components as cluster.

  3. Layer business labels. Bring in MappedPain (current taxonomy) and OriginalJTBD (legacy). Don’t skip this—it’s what keeps merges honest.

  4. Build pods. Concatenate cluster | MappedPain | OriginalJTBD into PodKey. Count PodSize.

  5. Elect the hero. Highest clicks + CTR tie‑breaker; fallback to clicks if needed. Mark a single HERO per pod.

  6. Create dials. Put RelCut, CTRCut, TrendCut, ClickFloor in a tiny table.

  7. Tag actions. Apply the four rules above with the priority PRUNE → REFRESH → KEEP → MERGE. Make the flags mutually exclusive so only one sticks.

  8. Ship the work. Export a Redirects CSV (loser → hero) and a Refresh backlog list (URL + pain/JTBD + trend + brief notes). Keep a Prune Review list for a quick human check.

  9. Measure. After redirects, watch hero URLs in GSC; after updates, track CTR and impressions. Expect consolidation to clean up cannibalisation and refreshes to lift decayed winners.

What I learned about stakeholder comfort

People don’t love being told “we’re deleting 30% of the blog.” They do appreciate:

  • A chart that shows how many posts landed in each action,

  • A redirect map that requires zero guessing,

  • A refresh backlog organized by pain point (so it ladders into strategy), and

  • The ability to say, “what if we keep more?” and watch you tweak RelCut from 0.10 to 0.05 on the call.

It’s not just a content audit; it’s a planning tool.

Things I’d change next time

  • Recency guardrails: I sometimes exclude posts updated this year from any action by converting messy UTC stamps into real Excel dates and filtering them out. It avoids re‑working fresh edits.

  • Query overlap (optional): If I need extra confidence for edge cases, I’ll compare top queries in GSC between a loser and its hero. If overlap ≥70%, I feel great about merging (there’s a quick sketch of this check after this list). Not required, just nice when politics are spicy.

  • Automated briefs: Small GPT prompts can pre‑draft refresh outlines using the pain/JTBD context. The spreadsheet gets you 80% there; a little assist gets you over the line faster.
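
For the query-overlap check above, here’s what it looks like as a tiny function, assuming you’ve already pulled each URL’s top queries out of GSC. The example queries are made up purely for illustration.

def query_overlap(loser_queries, hero_queries):
    # Share of the loser's top queries that also appear in the hero's.
    loser, hero = set(loser_queries), set(hero_queries)
    return len(loser & hero) / len(loser) if loser else 0.0

# Placeholder data purely for illustration.
loser_top = ["onboarding checklist", "new hire checklist", "onboarding template", "onboarding steps"]
hero_top = ["onboarding checklist", "new hire checklist", "onboarding steps", "employee onboarding"]

if query_overlap(loser_top, hero_top) >= 0.70:
    print("Comfortable merging these two")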

The payoff

After tuning, my split looked like this: a solid Keep core, a meaningful Refresh backlog, and a Merge set that actually felt like consolidation rather than scorched earth. Redirects shipped quickly because hero mapping was automatic; refreshes landed in two sprints because each one came with context.

And the best part: when the business changes again (it will), I don’t have to rebuild anything. I just swap in a new pain taxonomy or nudge two numbers in the Parameters table and rerun the sheet. The decisions fall out the bottom like clockwork.

Steal this

Take the idea, rename the columns, and make it yours. The exact model or threshold doesn’t matter as much as the pattern:

Use embeddings to understand meaning. Add business labels to capture intent. Compare siblings with simple, tunable metrics. Assign one clear action. Ship.

That’s how I turned “we should clean up the blog” into a concrete roadmap the team could execute—without arguing about which post is “better” for the next three weeks.

Brendan McConnell

Hi, I’m Brendan. Freelance writer, SEO, and fractional content marketer based in Ottawa.

I create content, strategies, and marketing frameworks that help companies (and people) like you grow traffic and leads from content.
