SearchLoop · Field Guide

Build a proprietary acquisition screen from raw data.

How to find the founder-owned companies Grata and Inven can't — without a $15–40k/yr seat. A worked playbook for partners, searchers, and independent sponsors, built with Claude Code.

The whole build, start to finish — sources, the test loop, and the funnel from 24,000 records to ~80 names.


01

The problem with renting a database

The expensive tools are opaque, they lock you in — and in a fragmented market they can't even give you a clean list.

Start with the part nobody at the database companies will say out loud: in a fragmented market like dental, the expensive tools can't even hand you a clean list.

The same practice shows up three times — under the dentist's own name, a DBA, and a stale duplicate. Ownership hides behind holdcos and management-services organizations. And half the single-location practices you actually want never made it into the firmographic feed at all. You're paying $15–40k a year, per seat, for a deduplication problem you then have to solve yourself.

There are three problems here, in order of how much they should bother you.

01 · Price & theatre

You book a call to see a price

Pricing is quote-based, demo-gated, and per-seat. Public estimates put entry around $15k, climbing past $40k with seats, data export, and API add-ons. The tell isn't the number — it's that you have to take a sales call to learn it. The price is whatever they think you'll pay.

02 · Lock-in

You own nothing at renewal

Annual contract, per seat. The list isn't yours, the enrichment isn't yours, and the day you stop paying your "pipeline" disappears. You were renting a view — not building an asset.

03 · The blind spot

Worst at the deals you want

These platforms aggregate digital exhaust — websites, funding, news, LinkedIn. The two-truck HVAC firm and the solo practice with a Wix site and no press are exactly the proprietary targets — and exactly what the feed under-indexes.

A database you share with every other fund is, by definition, not proprietary origination. You're all querying the same index and emailing the same top results.

The good news: Grata and Inven aren't magic. They're a clean UI and a scoring layer on top of public and semi-public data — most of it free, and some of it more authoritative than anything they resell. You can assemble the precise slice you need, own it outright, and tune it to your thesis instead of theirs.

| | Rented seat (Grata / Inven) | Owned screen (this guide) |
| --- | --- | --- |
| Cost | $15–40k+/yr, per seat, quote-based | ~$50–300 in API calls per pull |
| Price transparency | Book a call to see it | Every line item visible |
| Who else has it | Every other fund | Only you |
| Founder-owned coverage | Weak: under-indexes off-grid SMBs | Built from the index they live in |
| At renewal | Access ends; you keep nothing | A database & scripts you own |
| Fit to your thesis | Their filters | Your gates, your weights |
| Explainability | Black-box relevance | Every score carries its reasons |

02

Raw sources of truth

Every company is registered somewhere before it ever appears in a database product. Find that index and you're upstream of the aggregators.

There's a system of record — a place a business has to exist to operate. Work from it directly and you're using the same raw material Grata buys, normalizes, and rents back to you. Three sources cover most of the lower middle market.

Spine for trades

Google Maps

The census of Main Street. Every local, physical business has a listing because that's how customers find them — name, address, phone, website, category, rating, review count. More complete than any firmographic feed, because being listed is existential, not marketing.

Spine for healthcare

The NPI registry (NPPES)

The federal enumeration of every US healthcare provider. Free, bulk-downloadable as CSV, queryable through a public no-auth API, and taxonomy-coded, so you isolate "general-practice dentists in Texas" with a code, not a guess. 7M+ active records, kept current by law.

Spine for the UK

Companies House

The unfair advantage for UK targets. Free API with company profiles, officers, and the PSC register (owners >25%). Accounts are free to download, ~60% as structured XBRL you can parse for revenue and headcount. Ownership and financials, one source.

The pattern across all three: one source is your spine — the index — and the others become enrichment and corroboration. In the worked example below, NPPES is the spine; Google Maps, the practice website, and reviews are the enrichment.

NPPES pull · free · current · yours
# Bulk: the full FOIA-disclosable file, updated monthly (+ weekly deltas)
https://download.cms.gov/nppes/NPI_Files.html

# Or the public API — no auth, 200 records/page:
https://npiregistry.cms.hhs.gov/api/?version=2.1&taxonomy_description=Dentist&state=TX&limit=200&skip=0

# Dental taxonomy codes you'll filter on (NUCC):
122300000X  Dentist (grouping)      1223X0400X  Orthodontics
1223G0001X  General Practice        1223E0200X  Endodontics
1223P0221X  Pediatric Dentistry     1223P0300X  Periodontics

Tell Claude "go to the NPI registry", and watch for the one thing that trips everyone up: the API caps at 1,200 results per query (200 per page, with skip topping out at 1,000) and filters by taxonomy description. So for a whole state you pull the bulk file and match your codes across all 15 taxonomy columns, not just the first. The downloadable kit's NPI playbook covers the rest: skipping deactivated records, using the practice (not mailing) address, and the decision-maker that org records hand you for free.
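A minimal sketch of that bulk-file filter, in Python since that is what Claude Code writes here. The column names are assumptions based on the published NPPES CSV header, so check them against the header row of the file you actually download. The sample rows are fictional, and the deactivation check is a simplification that ignores reactivated records.

```python
import csv
import io

# Assumed NPPES bulk-file column names; verify against your download's header.
TAXONOMY_COLUMNS = [f"Healthcare Provider Taxonomy Code_{i}" for i in range(1, 16)]
STATE_COLUMN = "Provider Business Practice Location Address State Name"
DEACTIVATION_COLUMN = "NPI Deactivation Date"

TARGET_CODES = {"1223G0001X", "1223P0221X"}  # General Practice, Pediatric Dentistry


def matches_thesis(row, state="TX", codes=TARGET_CODES):
    """True if the row is active, in-state, and has a target code in ANY taxonomy column."""
    if (row.get(DEACTIVATION_COLUMN) or "").strip():
        return False  # deactivated (simplification: reactivations are ignored)
    if (row.get(STATE_COLUMN) or "").strip().upper() != state:
        return False  # uses the practice address state, not the mailing address
    row_codes = {(row.get(col) or "").strip() for col in TAXONOMY_COLUMNS}
    return bool(row_codes & codes)


def filter_bulk(lines, **kwargs):
    """Filter an open CSV file (or any iterable of lines) down to matching rows."""
    return [row for row in csv.DictReader(lines) if matches_thesis(row, **kwargs)]


# Fictional sample: only two taxonomy columns present, the rest read as blank.
SAMPLE = "\n".join([
    f"NPI,{STATE_COLUMN},{DEACTIVATION_COLUMN},{TAXONOMY_COLUMNS[0]},{TAXONOMY_COLUMNS[1]}",
    "1000000001,TX,,1223G0001X,",            # match on the first column
    "1000000002,TX,,122300000X,1223P0221X",  # match only on the second column
    "1000000003,TX,01/15/2020,1223G0001X,",  # deactivated
    "1000000004,OK,,1223G0001X,",            # wrong state
    "1000000005,TX,,1223E0200X,",            # endodontics, not in thesis
])

kept = [row["NPI"] for row in filter_bulk(io.StringIO(SAMPLE))]
print(kept)
```

The second sample row is the case the text warns about: a first-column-only match would have dropped it.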

03

The toolkit

Seven tools. You don't operate most of them — Claude Code does. Your job is to read what it writes and tell it when it's wrong.

| Tool | What it does | Cost |
| --- | --- | --- |
| Claude Code | The operator. Reads your thesis, writes the Python, runs the pulls, iterates in plain English. You direct and review; you don't code. | |
| Apify | Google Maps scrape (compass/crawler-google-places). Turns a search into structured rows with rating and review count. | ~$0.004/listing |
| Firecrawl | The web reader. Returns a company site as clean markdown so a model can read the About page without choking on markup. | free tier |
| Tavily | Research-grade search. Have a name but no website? It finds the site, the LinkedIn, the local-news mention. | pay-as-you-go |
| OpenRouter | One key, every model. Route cheap classification to Gemini Flash; escalate hard judgment to a frontier model. | per token |
| Supabase | Where the screen lives once it outgrows a JSON file: Postgres + API + auth. Also what a client-facing dashboard runs on. | free tier |
| A design plugin | huashu-design / frontend-design in Claude Code, so the screen looks like a product, not a generated table. Anti-AI-slop. | free |

All-in cost to stand this up: the keys are pay-as-you-go, and a single vertical pull runs roughly $50–300 depending on how deep you enrich. That is about two orders of magnitude under a seat.

04

Architecture — so it doesn't hallucinate

Accuracy is the entire game. A screen that's 95% right means opening a call by congratulating someone on a practice they sold three years ago.

One bad row doesn't cost you one deal — it costs you your credibility with that buyer for every deal after it. The pipeline is six stages, each one inspectable and idempotent:

01 pull · NPPES · taxonomy + state
02 normalize · Dedupe · to practices
03 enrich · Web + reviews · read the exhaust
04 corroborate · 2-source · claim vs fact
05 score · 2-stage · cheap → smart
06 store · Supabase · + dashboard
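Stage 02 is where individual provider records collapse into practices. One way to sketch it, assuming each pulled record has already been reduced to `npi`, `name`, `entity_type`, `phone`, and `address` fields (hypothetical names, not the NPPES headers), is to group on a normalized phone plus street address. The key is a deliberately crude heuristic: it will not reconcile "St" with "Street", so treat it as a starting point to ground-truth, not a finished matcher.

```python
import re
from collections import defaultdict


def practice_key(record):
    """Normalize phone + street address into a grouping key (crude heuristic)."""
    phone = re.sub(r"\D", "", record.get("phone", ""))[-10:]
    address = re.sub(r"[^a-z0-9 ]", "", record.get("address", "").lower())
    address = re.sub(r"\s+", " ", address).strip()
    return (phone, address)


def collapse_to_practices(records):
    """Group individual (Type 1) and organization (Type 2) NPIs into practices."""
    groups = defaultdict(list)
    for rec in records:
        groups[practice_key(rec)].append(rec)
    practices = []
    for (phone, address), members in groups.items():
        practices.append({
            "phone": phone,
            "address": address,
            "names": sorted({m["name"] for m in members}),
            "npis": sorted(m["npi"] for m in members),
            # Type 1 NPIs are individuals, so this doubles as a provider count.
            "provider_count": sum(1 for m in members if m.get("entity_type") == "1"),
        })
    return practices


# Fictional records: the first two are the same practice under two names.
records = [
    {"npi": "1000000001", "name": "Karen Vogel DDS", "entity_type": "1",
     "phone": "(512) 555-0100", "address": "100 Main St., Suite 2"},
    {"npi": "1000000002", "name": "Lakeside Family Dentistry", "entity_type": "2",
     "phone": "512-555-0100", "address": "100 MAIN ST SUITE 2"},
    {"npi": "1000000003", "name": "Hill Country Dental Care", "entity_type": "2",
     "phone": "830-555-0199", "address": "45 River Rd"},
]

practices = collapse_to_practices(records)
print(len(records), "records ->", len(practices), "practices")
```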

Pull 20 → eyeball against ground truth → refine → only then scale.

Pull the first 20 records. For each, establish the truth from outside your own system — open Google, load the website, look at the listed phone. If 1 of 20 is wrong, that's a 5% error rate, and at 24,000 records it's 1,200 wrong rows. Fix the prompt now, while it's 20 rows. You re-run this loop every time you change a step, and you never trust a pull you haven't ground-truthed.

What you actually type · plain English
"Pull every dentist NPI in Texas from the NPPES bulk file (taxonomies
1223G0001X and 1223P0221X) and write them to a JSON file. Then
take the first 20, look each one up on Google, and show me every
row where our data disagrees with reality before we scale."
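The arithmetic behind the 20-row loop is worth keeping as a small script, so every re-run reports the same numbers. This is a sketch only: the helper names are hypothetical, and the "truth" side still has to come from a human checking Google, the website, and the phone.

```python
import random


def sample_for_review(rows, n=20, seed=0):
    """Pick a reproducible random sample to check by hand."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def disagreements(ours, truth, fields):
    """List (field, our_value, true_value) wherever our row differs from ground truth."""
    return [(f, ours.get(f), truth.get(f)) for f in fields if ours.get(f) != truth.get(f)]


def projected_errors(wrong_in_sample, sample_size, total_rows):
    """Error rate in the sample, and the wrong-row count it implies at full scale."""
    rate = wrong_in_sample / sample_size
    return rate, round(rate * total_rows)


rate, wrong_rows = projected_errors(wrong_in_sample=1, sample_size=20, total_rows=24_000)
print(f"{rate:.0%} error rate -> about {wrong_rows:,} wrong rows")

diff = disagreements(
    {"phone": "5125550100", "website": "lakeside.example"},
    {"phone": "5125550100", "website": "lakesidedental.example"},
    fields=["phone", "website"],
)
print(diff)
```

A 20-row sample is a coarse estimate, so a clean sample is a reason to scale cautiously, not proof the full pull is clean.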

Four hard rules enforce the discipline. Put them in your project's CLAUDE.md so Claude Code follows them every session:

1

Ground-truth before you trust

Establish truth from outside the system on a 20-row sample before scaling any step.

2

Two sources, or it's a claim

Owner, ownership, location count, revenue — require two independent sources, or store it as a claim, not a fact. Zero sources: the field stays blank.

3

Thin scrape = no classification

Under ~200 characters back from Firecrawl usually means the site blocked you. Don't ask the model what they do: it'll invent. Mark needs_review.

4

Temperature 0, log the evidence

Every classification runs at temperature 0 and stores the raw text it was based on, so you can audit any row back to its source.
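Rules 2 and 3 are mechanical enough to express as code that sits in front of any model call. The sketch below is one possible shape, not the kit's implementation: the function names, status strings, and the tie-breaking choice (take the value with the most distinct sources) are all assumptions.

```python
MIN_SCRAPE_CHARS = 200  # rule 3: below this, treat the scrape as blocked or empty


def resolve_field(observations):
    """observations: list of (source, value) pairs for one field.

    Two or more distinct sources agreeing -> "fact"; one -> "claim"; none -> "blank".
    """
    by_value = {}
    for source, value in observations:
        by_value.setdefault(value, set()).add(source)
    if not by_value:
        return {"value": None, "status": "blank", "sources": []}
    # Assumption: when sources disagree, keep the best-supported value.
    value, sources = max(by_value.items(), key=lambda kv: len(kv[1]))
    status = "fact" if len(sources) >= 2 else "claim"
    return {"value": value, "status": status, "sources": sorted(sources)}


def classification_input(scraped_markdown):
    """Gate a scrape before classification; thin text is never sent to the model."""
    text = (scraped_markdown or "").strip()
    if len(text) < MIN_SCRAPE_CHARS:
        return {"status": "needs_review", "evidence": text}
    return {"status": "ready", "evidence": text}  # evidence is stored with the result


print(resolve_field([("nppes", "Independent"), ("practice_website", "Independent")]))
print(resolve_field([("google_maps", "Independent")]))
print(resolve_field([]))
print(classification_input("Access denied")["status"])
```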

05

Enrichment — reading signal in the noise

The signal that matters to an acquirer is almost never in the structured fields. It's in the unstructured exhaust around the business.

The website

Services and specialty mix (implants + ortho + sedation = bigger). Provider count — your cheapest size proxy. The About page. And a "Locations" dropdown or "now part of —" banner is a consolidation tell.

The reviews

Count is a volume proxy; rating is quality; velocity tells you growing vs. winding down. And the text is gold: "Dr. Alvarez has been my dentist for 22 years" gives tenure, owner identity, and succession risk in one line.

Ownership

Shared branding, a corporate footer, one phone routing several "locations," "part of the [X] family of practices" — these mark DSO/PE assets. A disqualifier, or a comp. Either way you tag it, with evidence.
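Some of these tells can be caught with a cheap pattern pre-pass before any model reads the text. To be clear, this is a regex sketch swapped in for illustration, not the model call the guide relies on. The patterns and phrase list are assumptions and will miss plenty of phrasings, which is why each result keeps the raw text as evidence.

```python
import re

# Hypothetical patterns; tune them against real reviews before trusting them.
TENURE_RE = re.compile(
    r"\bfor\s+(?:over\s+|more than\s+)?(\d{1,2})\s+(?:years|yrs)\b", re.IGNORECASE
)
DOCTOR_RE = re.compile(r"\bDr\.?\s+([A-Z][a-zA-Z'-]+)")
DSO_TELLS = ("family of practices", "now part of")


def review_signals(review_text):
    """Pull a named doctor and a tenure in years out of one review, if present."""
    tenure = TENURE_RE.search(review_text)
    doctor = DOCTOR_RE.search(review_text)
    return {
        "doctor": doctor.group(1) if doctor else None,
        "tenure_years": int(tenure.group(1)) if tenure else None,
        "evidence": review_text,  # keep the raw text so the row can be audited
    }


def dso_tells(site_text):
    """Return the consolidation phrases found in a page's text."""
    lowered = site_text.lower()
    return [phrase for phrase in DSO_TELLS if phrase in lowered]


signals = review_signals("Dr. Alvarez has been my dentist for 22 years")
print(signals["doctor"], signals["tenure_years"])
print(dso_tells("Proud to be part of the Acme family of practices."))
```

A hit here is a claim, not a fact. It still needs a second source before it lands in the record.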

Stack the signals and a specific shape emerges — the thing you're actually hunting:

Single location · established ~20–30 years ago · owner's surname behind the practice name · no DSO branding · one phone · a founder who keeps appearing in decade-old reviews.

That is the founder-owned, no-obvious-successor practice a DSO or a search fund most wants to buy — and the exact profile Grata is least likely to surface, because none of those signals live in a firmographic feed. You're not finding companies; you're finding situations. Every enrichment is a model call over corroborated raw text — never a guess.

One enriched record · illustrative
{
  "practice": "Lakeside Family Dentistry",
  "location": "Round Rock, TX",
  "taxonomy": "1223G0001X · General Practice",
  "locations": 1,   "providers_listed": 2,   "established": 2002,
  "google_reviews": 418,  "rating": 4.8,
  "ownership": "Independent",
  "ownership_evidence": "single phone; no DSO footer; surname branding",
  "owner_signal": "Dr. Karen Vogel named in 31 reviews; 'my dentist 19 yrs'",
  "succession_flag": true,
  "sources": ["nppes", "google_maps", "practice_website"]
}
// Note what's NOT here: no invented revenue, no guessed email, no owner age.
// If two sources didn't support it, it isn't in the record.
06

Scoring & the two-stage funnel

You don't run a frontier model over 24,000 records. You filter cheap, then spend real money only on the survivors — and let the funnel do the work.

Stage 1: the cheap filter. A fast model (Gemini Flash) runs over the structured fields and kills the obvious no's: wrong taxonomy, wrong geography, clearly multi-location, obvious DSO, below a volume floor. Nearly free at this volume.

Stage 2: enrich and qualify the survivors. You spend real money only on the few thousand that matter: the deeper enrichment, with a frontier model reserved for the hard judgment calls (is this PE-owned? who owns it?). Qualification itself stays deterministic: an operator-tunable gate model, not a black-box AI score. Every result carries the gates it passed and the evidence behind them.

| Gate | Passes when | Source |
| --- | --- | --- |
| Specialty fit | Taxonomy matches thesis (e.g. GP + pediatric) | NPPES |
| Geography | Inside target metro / state | NPPES + Maps |
| Size | Provider count / review volume in band | Web + reviews |
| Ownership | Independent, not DSO/PE-held | Corroborated |

The one rule that matters here: "unknown" is not "fail." An unknown gate is a research task, not a rejection — it routes back for another enrichment pass. Treating unknowns as fails is how you silently delete your best, hardest-to-read targets — precisely the off-grid ones you came for.
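In code, that rule comes down to giving every gate three outcomes instead of two. A minimal sketch, with hypothetical names and a made-up size band:

```python
from enum import Enum


class Gate(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


def size_gate(provider_count, low=1, high=4):
    """Example gate: a missing value is UNKNOWN, never FAIL. Band is illustrative."""
    if provider_count is None:
        return Gate.UNKNOWN
    return Gate.PASS if low <= provider_count <= high else Gate.FAIL


def route(gates):
    """gates: dict of gate name -> Gate. Returns the decision and the gates behind it."""
    failed = [name for name, g in gates.items() if g is Gate.FAIL]
    unknown = [name for name, g in gates.items() if g is Gate.UNKNOWN]
    if failed:
        return {"decision": "reject", "because": failed}
    if unknown:
        # Not a rejection: send the record back for another enrichment pass.
        return {"decision": "research", "because": unknown}
    return {"decision": "qualify", "because": list(gates)}


decision = route({
    "specialty": Gate.PASS,
    "geography": Gate.PASS,
    "size": size_gate(None),  # site was thin, provider count not found yet
    "ownership": Gate.PASS,
})
print(decision)
```

A known fail still wins over an unknown, so a confirmed DSO asset is rejected without spending another enrichment pass on its missing fields.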

Figures below are illustrative — representative magnitudes for a state-wide pull, not a delivered count.

Raw NPPES pull · 24,000 · Dental taxonomies, TX (Type 1 + Type 2 NPIs)
Collapse to distinct practices · 9,200 · Dedupe individuals → practices; drop inactive
Stage 1, cheap filter · 2,600 · Single-location, independent, minimum volume
Stage 2, enrich + qualify survivors · 310 · Acquirability above cutoff
Human review of the top tier · ~80 · Worth a personal, partner-led approach
24,000 → 80

Illustrative magnitudes for a state-wide pull — the shape is the point. Every name's score is a transparent roll-up of the gates it passed, not a model's guess, so you can defend each one in an IC meeting instead of a relevance number you can't explain.
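To see the shape as numbers, a few lines compute how much of each stage survives into the next. The counts are the illustrative ones from the funnel, not measured results, and the stage labels are shortened for the sketch.

```python
# Illustrative funnel counts from the text; replace with your own pull's numbers.
FUNNEL = [
    ("Raw NPPES pull", 24_000),
    ("Distinct practices", 9_200),
    ("Stage 1 survivors", 2_600),
    ("Stage 2 qualified", 310),
    ("Human-reviewed top tier", 80),
]


def retention(funnel):
    """For each step after the first: (name, count, share kept from the prior step)."""
    rows = []
    for (_, previous), (name, count) in zip(funnel, funnel[1:]):
        rows.append((name, count, count / previous))
    return rows


steps = retention(FUNNEL)
for name, count, kept in steps:
    print(f"{name:<26}{count:>7,}  kept {kept:.1%}")

overall = FUNNEL[-1][1] / FUNNEL[0][1]
print(f"Overall: 1 in {round(1 / overall)} raw records reaches a partner")
```

A stage whose retention drifts far from its usual level on a new pull is a prompt to re-run the 20-row check on that stage before trusting the output.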

Acquirability is a transparent weighted roll-up of the gates — not a model score. Illustrative.

| Practice | City | Loc. | Est. | Reviews | Acq. | Why |
| --- | --- | --- | --- | --- | --- | --- |
| Lakeside Family Dentistry | Round Rock | 1 | 2002 | 418 | 91 | Independent, single-site, strong succession signal |
| Hill Country Dental Care | New Braunfels | 1 | 1998 | 263 | 88 | Long-tenured owner, no successor named |
| Brushy Creek Smiles | Cedar Park | 2 | 2009 | 540 | 64 | Two sites, younger owner, lower urgency |
| Capital Dental Group | Austin | 6 | 2014 | 1,910 | 22 | Multi-site, DSO branding: comp, not target |

And because the thresholds are knobs, the screen is yours to re-tune. Your size band, your geographies, your weighting of succession versus scale — move the gates and the list re-ranks. A rented database gives you their filters. This gives you your thesis, expressed as a screen.
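A sketch of what "move the gates and the list re-ranks" can look like. Everything here is hypothetical: the signal names, the 0-to-1 signal values, and both weight sets are invented for illustration and are not the weights behind the table above. It assumes a practice is only scored after it has cleared every gate.

```python
def acquirability(signals, weights):
    """Weighted roll-up of 0..1 signals into a 0-100 score, plus each signal's share.

    A missing signal raises KeyError rather than silently scoring as zero.
    """
    total = sum(weights.values())
    contributions = {name: w * signals[name] / total for name, w in weights.items()}
    return round(100 * sum(contributions.values())), contributions


def rank(practices, weights):
    """Score every practice under one weight set and sort best-first."""
    scored = [(acquirability(p["signals"], weights)[0], p["name"]) for p in practices]
    return sorted(scored, reverse=True)


# Fictional practices with made-up signal values.
practices = [
    {"name": "Practice A", "signals": {"independence": 1.0, "succession": 0.9, "scale": 0.3}},
    {"name": "Practice B", "signals": {"independence": 1.0, "succession": 0.3, "scale": 0.9}},
]

succession_thesis = {"independence": 1, "succession": 3, "scale": 1}
scale_thesis = {"independence": 1, "succession": 1, "scale": 3}

print(rank(practices, succession_thesis))
print(rank(practices, scale_thesis))
```

Same two practices, same evidence, opposite order: the only thing that changed is the thesis. Because `acquirability` returns the per-signal contributions, each score can be explained line by line.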

07

Where I stop

I just handed you the screen. I'll be just as direct about what I didn't hand you, and why.

The screen is the commoditizable part. Known sources, scripted pulls, a scoring funnel. It takes discipline — mostly the testing discipline above — but anyone serious can build it from this guide. So I give it away. It also makes the case for what I do better than any pitch deck could.

What I don't give away is turning that screen into booked calls — because that's where the actual difficulty, and the judgment, lives:

Copy

The reply you have to earn

Getting a skeptical 62-year-old founder who's been cold-pitched by a dozen DSOs to reply to you is not a template. It's voice, segment-specific framing, and a few-shot library built from messages that have actually worked.

Deliverability

Where a perfect list dies

Private IPs, domain warmup, inbox placement. Get this wrong and your 80-name list lands in spam — and you never find out it happened.

Orchestration

A system, not a script

Email, LinkedIn, and — where it fits — WhatsApp, sequenced, each prospect locked to one sender, replies routed and handled. Then the feedback loop sharpens both the copy and the screen.

The data is the cost of entry. The conversion is the moat.

I'm comfortable giving you the data layer because I've built the conversion layer enough times to know that's where the real work — and the real edge — is.

KIT

Take the starter kit

The keepable toolkit to build the screen yourself — plain text you own. Free, no email.

A ready-to-run folder: a setup guide for the whole toolbox (the CLIs — Supabase, Vercel, Playwright — a .env key template for OpenRouter/Apify/Firecrawl/Tavily, and a ready Apify MCP config), the CLAUDE.md with the four hard rules that keep it accurate, a thesis + scoring template (your tunable gates), a copy-paste Day-1 bootstrap prompt that scaffolds the pipeline in Claude Code, the DESIGN.md behind this page, and a fictional sample dataset.

The screen is the free part

Want the whole origination engine?

The screen above, plus the outreach that turns it into a live, replying pipeline — done-for-you, tuned to your thesis, run as a system. I'm an ex-investor, so I build origination the way a deal team actually uses it, not the way a software vendor imagines you might.

Russell Taylor — SearchLoop rt@searchloop.ai searchloop.ai