Build a proprietary acquisition screen from raw data.
How to find the founder-owned companies Grata and Inven can't — without a $15–40k/yr seat. A worked playbook for partners, searchers, and independent sponsors, built with Claude Code.
The whole build, start to finish — sources, the test loop, and the funnel from 24,000 records to ~80 names.
The problem with renting a database
The expensive tools are opaque, they lock you in — and in a fragmented market they can't even give you a clean list.
Start with the part nobody at the database companies will say out loud: in a fragmented market like dental, the expensive tools can't even hand you a clean list.
The same practice shows up three times — under the dentist's own name, a DBA, and a stale duplicate. Ownership hides behind holdcos and management-services organizations. And half the single-location practices you actually want never made it into the firmographic feed at all. You're paying $15–40k a year, per seat, for a deduplication problem you then have to solve yourself.
There are three problems here, in order of how much they should bother you.
You book a call to see a price
Pricing is quote-based, demo-gated, and per-seat. Public estimates put entry around $15k, climbing past $40k with seats, data export, and API add-ons. The tell isn't the number — it's that you have to take a sales call to learn it. The price is whatever they think you'll pay.
You own nothing at renewal
Annual contract, per seat. The list isn't yours, the enrichment isn't yours, and the day you stop paying your "pipeline" disappears. You were renting a view — not building an asset.
Worst at the deals you want
These platforms aggregate digital exhaust — websites, funding, news, LinkedIn. The two-truck HVAC firm and the solo practice with a Wix site and no press are exactly the proprietary targets — and exactly what the feed under-indexes.
A database you share with every other fund is, by definition, not proprietary origination. You're all querying the same index and emailing the same top results.
The good news: Grata and Inven aren't magic. They're a clean UI and a scoring layer on top of public and semi-public data — most of it free, and some of it more authoritative than anything they resell. You can assemble the precise slice you need, own it outright, and tune it to your thesis instead of theirs.
| | Rented seat (Grata / Inven) | Owned screen (this guide) |
|---|---|---|
| Cost | $15–40k+/yr, per seat, quote-based | ~$50–300 in API calls per pull |
| Price transparency | Book a call to see it | Every line item visible |
| Who else has it | Every other fund | Only you |
| Founder-owned coverage | Weak — under-indexes off-grid SMBs | Built from the index they live in |
| At renewal | Access ends; you keep nothing | A database & scripts you own |
| Fit to your thesis | Their filters | Your gates, your weights |
| Explainability | Black-box relevance | Every score carries its reasons |
Raw sources of truth
Every company is registered somewhere before it ever appears in a database product. Find that index and you're upstream of the aggregators.
There's a system of record — a place a business has to exist to operate. Work from it directly and you're using the same raw material Grata buys, normalizes, and rents back to you. Three sources cover most of the lower middle market.
Google Maps
The census of Main Street. Every local, physical business has a listing because that's how customers find them — name, address, phone, website, category, rating, review count. More complete than any firmographic feed, because being listed is existential, not marketing.
The NPI registry (NPPES)
The federal enumeration of every US healthcare provider. It is free, bulk-downloadable as CSV, queryable through a public no-auth API, and taxonomy-coded, so you isolate "general-practice dentists in Texas" with a code, not a guess. 7M+ active records, kept current by law.
Companies House
The unfair advantage for UK targets. Free API with company profiles, officers, and the PSC register (owners >25%). Accounts are free to download, ~60% as structured XBRL you can parse for revenue and headcount. Ownership and financials, one source.
The pattern across all three: one source is your spine — the index — and the others become enrichment and corroboration. In the worked example below, NPPES is the spine; Google Maps, the practice website, and reviews are the enrichment.
# Bulk: the full FOIA-disclosable file, updated monthly (+ weekly deltas)
https://download.cms.gov/nppes/NPI_Files.html
# Or the public API (no auth, 200 records/page):
https://npiregistry.cms.hhs.gov/api/?version=2.1&taxonomy_description=Dentist&state=TX&limit=200&skip=0
# Dental taxonomy codes you'll filter on (NUCC):
122300000X  Dentist (grouping)
1223X0400X  Orthodontics
1223G0001X  General Practice
1223E0200X  Endodontics
1223P0221X  Pediatric Dentistry
1223P0300X  Periodontics
Tell Claude to "go to the NPI registry" and one thing trips everyone up: the API caps at 1,200 results per query and filters by taxonomy description. For a whole state, pull the bulk file instead and match your codes across all 15 taxonomy columns, not just the first. The downloadable kit's NPI playbook covers the rest: skipping deactivated records, using the practice (not mailing) address, and picking up the decision-maker that organization records hand you for free.
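The bulk-file filter described above can be sketched in a few lines of Python. This is a minimal sketch, not the kit's implementation: the column names follow the NPPES dissemination file header as I understand it, so check them against the header of the file you actually download. The deactivation handling (skip a row that has a deactivation date and no reactivation date) is my assumption, and the NPIs in the sample are made up.

```python
import csv
import io

# Assumed column names; verify against the header row of your downloaded file.
TAXONOMY_COLUMNS = [f"Healthcare Provider Taxonomy Code_{i}" for i in range(1, 16)]
STATE_COLUMN = "Provider Business Practice Location Address State Name"
DEACTIVATION_COLUMN = "NPI Deactivation Date"
REACTIVATION_COLUMN = "NPI Reactivation Date"

TARGET_CODES = {"1223G0001X", "1223P0221X"}  # general practice + pediatric


def matches_thesis(row, state="TX", codes=TARGET_CODES):
    """True if the row is active, practices in `state`, and lists a target code in any slot."""
    deactivated = (row.get(DEACTIVATION_COLUMN) or "").strip()
    reactivated = (row.get(REACTIVATION_COLUMN) or "").strip()
    if deactivated and not reactivated:
        return False
    # Practice address, not mailing address.
    if (row.get(STATE_COLUMN) or "").strip().upper() != state:
        return False
    # Check every taxonomy slot, not just the first one.
    listed = {(row.get(col) or "").strip() for col in TAXONOMY_COLUMNS}
    return bool(listed & codes)


def filter_bulk_file(handle, **kwargs):
    """Stream rows from an open CSV handle so the full file never sits in memory."""
    for row in csv.DictReader(handle):
        if matches_thesis(row, **kwargs):
            yield row


# Tiny made-up sample with only two of the fifteen taxonomy columns present.
SAMPLE = (
    "NPI,Provider Business Practice Location Address State Name,"
    "NPI Deactivation Date,NPI Reactivation Date,"
    "Healthcare Provider Taxonomy Code_1,Healthcare Provider Taxonomy Code_2\n"
    "1000000001,TX,,,1223G0001X,\n"            # match in slot 1
    "1000000002,TX,,,122300000X,1223P0221X\n"  # match only in slot 2
    "1000000003,OK,,,1223G0001X,\n"            # wrong state
    "1000000004,TX,01/15/2020,,1223G0001X,\n"  # deactivated
)

kept = [row["NPI"] for row in filter_bulk_file(io.StringIO(SAMPLE))]
print(kept)  # ['1000000001', '1000000002']
```

The second sample row is the case that a first-column-only filter would miss: the target code sits in slot 2 behind the generic grouping code.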
The toolkit
Seven tools. You don't operate most of them — Claude Code does. Your job is to read what it writes and tell it when it's wrong.
| Tool | What it does | Cost |
|---|---|---|
| Claude Code | The operator. Reads your thesis, writes the Python, runs the pulls, iterates in plain English. You direct and review — you don't code. | — |
| Apify | Google Maps scrape (compass/crawler-google-places) — turns a search into structured rows with rating and review count. | ~$0.004/listing |
| Firecrawl | The web reader. Returns a company site as clean markdown so a model can read the About page without choking on markup. | free tier |
| Tavily | Research-grade search. Have a name but no website? It finds the site, the LinkedIn, the local-news mention. | pay-as-you-go |
| OpenRouter | One key, every model. Route cheap classification to Gemini Flash; escalate hard judgment to a frontier model. | per token |
| Supabase | Where the screen lives once it outgrows a JSON file — Postgres + API + auth. Also what a client-facing dashboard runs on. | free tier |
| A design plugin | huashu-design / frontend-design in Claude Code — so the screen looks like a product, not a generated table. Anti-AI-slop. | free |
All-in cost to stand this up: the keys are pay-as-you-go, and a single vertical pull lands between $50 and $300, depending on how deep you enrich. That is roughly two orders of magnitude under a seat.
Architecture — so it doesn't hallucinate
Accuracy is the entire game. A screen that's 95% right means opening a call by congratulating someone on a practice they sold three years ago.
One bad row doesn't cost you one deal — it costs you your credibility with that buyer for every deal after it. The pipeline is six stages, each one inspectable and idempotent:
Pull 20 → eyeball against ground truth → refine → only then scale.
Pull the first 20 records. For each, establish the truth from outside your own system: open Google, load the website, look at the listed phone. If 1 of 20 is wrong, that's a 5% error rate, and at 24,000 records it's 1,200 wrong rows. Fix the prompt now, while it's 20 rows. You re-run this loop every time you change a step, and you never trust a pull you haven't ground-truthed.
"Pull every dentist NPI in Texas from the NPPES bulk file, taxonomies 1223G0001X and 1223P0221X, and write them to a JSON file. Then take the first 20, look each one up on Google, and show me every row where our data disagrees with reality before we scale."
Four hard rules enforce the discipline. Put them in your project's CLAUDE.md so Claude Code follows them every session:
Ground-truth before you trust
Establish truth from outside the system on a 20-row sample before scaling any step.
Two sources, or it's a claim
Owner, ownership, location count, revenue — require two independent sources, or store it as a claim, not a fact. Zero sources: the field stays blank.
Thin scrape = no classification
Under ~200 characters back from Firecrawl means the site blocked you. Don't ask the model what they do — it'll invent. Mark needs_review.
Temperature 0, log the evidence
Every classification runs deterministically and stores the raw text it was based on, so you can audit any row back to its source.
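As a rough illustration, the rules above might translate into Python along these lines. This is a sketch under stated assumptions: the function names, the "fact" / "claim" / "blank" status strings, and the `classify` callable (a stand-in for whatever temperature-0 model call you use) are hypothetical. Only `needs_review` and the ~200-character floor come from the rules themselves.

```python
MIN_SCRAPE_CHARS = 200  # the ~200-character floor from the thin-scrape rule


def projected_bad_rows(errors_in_sample, sample_size, total_records):
    """Rule one: extrapolate the sample's error rate to the full pull."""
    return round(errors_in_sample * total_records / sample_size)


def field_status(sources):
    """Rule two: two independent sources make a fact, one a claim, zero a blank."""
    distinct = set(sources)  # the same source counted twice is still one source
    if len(distinct) >= 2:
        return "fact"
    if len(distinct) == 1:
        return "claim"
    return "blank"


def classify_site(markdown, classify):
    """Rules three and four: refuse thin scrapes, keep the evidence next to the label."""
    text = (markdown or "").strip()
    if len(text) < MIN_SCRAPE_CHARS:
        # Likely a block page; asking a model here invites invention.
        return {"label": None, "status": "needs_review", "evidence": text}
    return {"label": classify(text), "status": "classified", "evidence": text}


print(projected_bad_rows(1, 20, 24_000))        # 1200
print(field_status(["nppes", "google_maps"]))   # fact
print(field_status(["practice_website"]))       # claim
print(classify_site("Access denied", lambda t: "dentist")["status"])  # needs_review
```

Storing the evidence string on every result, including the rejected ones, is what lets you audit a `needs_review` row later and see that the scrape came back as a block page.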
Enrichment — reading signal in the noise
The signal that matters to an acquirer is almost never in the structured fields. It's in the unstructured exhaust around the business.
The website
Services and specialty mix (implants + ortho + sedation = bigger). Provider count — your cheapest size proxy. The About page. And a "Locations" dropdown or "now part of —" banner is a consolidation tell.
The reviews
Count is a volume proxy; rating is quality; velocity tells you growing vs. winding down. And the text is gold: "Dr. Alvarez has been my dentist for 22 years" gives tenure, owner identity, and succession risk in one line.
Ownership
Shared branding, a corporate footer, one phone routing several "locations," "part of the [X] family of practices" — these mark DSO/PE assets. A disqualifier, or a comp. Either way you tag it, with evidence.
Stack the signals and a specific shape emerges — the thing you're actually hunting:
Single location · established ~20–30 years ago · owner's surname behind the practice name · no DSO branding · one phone · a founder who keeps appearing in decade-old reviews.
That is the founder-owned, no-obvious-successor practice a DSO or a search fund most wants to buy — and the exact profile Grata is least likely to surface, because none of those signals live in a firmographic feed. You're not finding companies; you're finding situations. Every enrichment is a model call over corroborated raw text — never a guess.
{
"practice": "Lakeside Family Dentistry",
"location": "Round Rock, TX",
"taxonomy": "1223G0001X · General Practice",
"locations": 1, "providers_listed": 2, "established": 2002,
"google_reviews": 418, "rating": 4.8,
"ownership": "Independent",
"ownership_evidence": "single phone; no DSO footer; surname branding",
"owner_signal": "Dr. Karen Vogel named in 31 reviews; 'my dentist 19 yrs'",
"succession_flag": true,
"sources": ["nppes", "google_maps", "practice_website"]
}
// Note what's NOT here: no invented revenue, no guessed email, no owner age.
// If two sources didn't support it, it isn't in the record.
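Fields like `owner_signal` in the record above come from review text. The guide routes that reading through a model call; the sketch below swaps in a much cruder technique, a regex pre-pass, only to show the shape of the signal. The patterns, function name, and output keys are all hypothetical, the reviews are invented, and the regexes will miss or mangle plenty of real phrasing. Treat whatever it returns as a claim to corroborate, not a fact.

```python
import re
from collections import Counter

# "Dr. Alvarez" or "Dr. Maria Alvarez": one or two capitalized words after the title.
DOCTOR = re.compile(r"\bDr\.?\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)")
# "for 22 years", "over 10 yrs"
TENURE = re.compile(r"\b(?:for|over)\s+(\d{1,2})\s*(?:years|yrs)\b", re.IGNORECASE)


def review_signals(reviews):
    """Count per-review surname mentions and collect stated patient tenures."""
    mentions = Counter()
    tenures = []
    for text in reviews:
        # Key on the last captured word so both name forms count as one person.
        surnames = {m.group(1).split()[-1] for m in DOCTOR.finditer(text)}
        mentions.update(surnames)  # each surname counts once per review
        tenures.extend(int(m.group(1)) for m in TENURE.finditer(text))
    top = mentions.most_common(1)
    return {
        "owner_candidate": top[0][0] if top else None,
        "owner_mentions": top[0][1] if top else 0,
        "max_tenure_years": max(tenures, default=None),
    }


reviews = [
    "Dr. Alvarez has been my dentist for 22 years.",
    "Love Dr. Alvarez and the whole team!",
    "Dr. Maria Alvarez fixed my crown, been coming over 10 yrs",
    "Quick cleaning, friendly staff.",
]
print(review_signals(reviews))
# {'owner_candidate': 'Alvarez', 'owner_mentions': 3, 'max_tenure_years': 22}
```

A pre-pass like this is cheap enough to run on every review, which makes it one way to decide which practices deserve the more expensive model read.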
Scoring & the two-stage funnel
You don't run a frontier model over 24,000 records. You filter cheap, then spend real money only on the survivors — and let the funnel do the work.
Stage 1: the cheap filter. A fast model (Gemini Flash) runs over the structured fields and kills the obvious no's: wrong taxonomy, wrong geography, clearly multi-location, obvious DSO, below a volume floor. Nearly free at this volume.
Stage 2: enrich and qualify the survivors. You spend real money only on the few thousand that matter: the deeper enrichment, with a frontier model reserved for the hard judgment calls (is this PE-owned? who owns it?).
Qualification itself stays deterministic: an operator-tunable gate model, not a black-box AI score. Every result carries the gates it passed and the evidence behind them.
| Gate | Passes when | Source |
|---|---|---|
| Specialty fit | Taxonomy matches thesis (e.g. GP + pediatric) | NPPES |
| Geography | Inside target metro / state | NPPES + Maps |
| Size | Provider count / review volume in band | Web + reviews |
| Ownership | Independent, not DSO/PE-held | Corroborated |
The one rule that matters here: "unknown" is not "fail." An unknown gate is a research task, not a rejection — it routes back for another enrichment pass. Treating unknowns as fails is how you silently delete your best, hardest-to-read targets — precisely the off-grid ones you came for.
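One way to keep "unknown" from collapsing into "fail" is to make it a third gate state in the code itself. The sketch below is illustrative only: the `Gate` enum, the provider-count band of 1 to 4, and the routing labels are hypothetical names and numbers, and the choice to reject on any confirmed fail even when other gates are still unknown is my assumption.

```python
from enum import Enum


class Gate(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNKNOWN = "unknown"


def size_gate(providers, low=1, high=4):
    """Missing data is UNKNOWN, never FAIL. The band is a tunable placeholder."""
    if providers is None:
        return Gate.UNKNOWN
    return Gate.PASS if low <= providers <= high else Gate.FAIL


def route(gates):
    """gates maps gate name to Gate. Returns where the record goes next."""
    results = list(gates.values())
    if Gate.FAIL in results:
        return "reject"       # a confirmed miss on any gate
    if Gate.UNKNOWN in results:
        return "re_enrich"    # a research task, not a rejection
    return "qualified"


record_gates = {
    "specialty": Gate.PASS,
    "geography": Gate.PASS,
    "size": size_gate(None),  # website listed no providers
    "ownership": Gate.PASS,
}
print(route(record_gates))  # re_enrich
```

With a boolean gate the same record would have been dropped silently; with three states it shows up in a queue you can count and work through.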
Figures below are illustrative: representative magnitudes for a state-wide pull, not a delivered count. The shape is the point. Every name's score is a transparent roll-up of the gates it passed, not a model's guess, so in an IC meeting you defend each name with its evidence instead of a relevance number you can't explain.
Acquirability is a transparent weighted roll-up of the gates — not a model score. Illustrative.
| Practice | City | Loc. | Est. | Reviews | Acq. | Why |
|---|---|---|---|---|---|---|
| Lakeside Family Dentistry | Round Rock | 1 | 2002 | 418 | 91 | Independent, single-site, strong succession signal |
| Hill Country Dental Care | New Braunfels | 1 | 1998 | 263 | 88 | Long-tenured owner, no successor named |
| Brushy Creek Smiles | Cedar Park | 2 | 2009 | 540 | 64 | Two sites, shorter-tenured owner, lower urgency |
| Capital Dental Group | Austin | 6 | 2014 | 1,910 | 22 | Multi-site, DSO branding — comp, not target |
And because the thresholds are knobs, the screen is yours to re-tune. Your size band, your geographies, your weighting of succession versus scale — move the gates and the list re-ranks. A rented database gives you their filters. This gives you your thesis, expressed as a screen.
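A weighted roll-up with tunable weights could look something like the following. It is a sketch, not the scoring used for the table above: the weights, the 0.0 to 1.0 partial scores, the reasons, and the practice names "A" and "B" are all invented for illustration.

```python
# Hypothetical weights; these are the knobs you would tune to your thesis.
DEFAULT_WEIGHTS = {"specialty": 20, "geography": 15, "size": 25, "ownership": 40}


def acquirability(gate_scores, weights=DEFAULT_WEIGHTS):
    """gate_scores maps gate name to (score 0.0-1.0, reason). Returns (0-100 score, reasons)."""
    total = sum(weights.values())
    earned = sum(weights[g] * gate_scores[g][0] for g in weights)
    reasons = [f"{g}: {gate_scores[g][1]}" for g in weights]
    return round(100 * earned / total), reasons


def rank(practices, weights=DEFAULT_WEIGHTS):
    """Sort practices by score, highest first, under the given weights."""
    return sorted(
        practices,
        key=lambda p: acquirability(p["gates"], weights)[0],
        reverse=True,
    )


practices = [
    {"name": "A", "gates": {
        "specialty": (1.0, "GP taxonomy"),
        "geography": (1.0, "inside target metro"),
        "size": (0.6, "2 providers listed"),
        "ownership": (1.0, "single phone; no DSO footer"),
    }},
    {"name": "B", "gates": {
        "specialty": (1.0, "GP taxonomy"),
        "geography": (1.0, "inside target metro"),
        "size": (1.0, "4 providers listed"),
        "ownership": (0.0, "DSO branding in footer"),
    }},
]

print([p["name"] for p in rank(practices)])  # ['A', 'B']

# Re-tune: weight size only, and the list re-ranks.
size_only = {"specialty": 0, "geography": 0, "size": 100, "ownership": 0}
print([p["name"] for p in rank(practices, size_only)])  # ['B', 'A']
```

Because each score returns its reasons alongside the number, every row in a ranked list can be traced back to the gates and evidence that produced it.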
Where I stop
I just handed you the screen. I'll be just as direct about what I didn't hand you, and why.
The screen is the commoditizable part. Known sources, scripted pulls, a scoring funnel. It takes discipline — mostly the testing discipline above — but anyone serious can build it from this guide. So I give it away. It also makes the case for what I do better than any pitch deck could.
What I don't give away is turning that screen into booked calls — because that's where the actual difficulty, and the judgment, lives:
The reply you have to earn
Getting a skeptical 62-year-old founder who's been cold-pitched by a dozen DSOs to reply to you is not a template. It's voice, segment-specific framing, and a few-shot library built from messages that have actually worked.
Where a perfect list dies
Private IPs, domain warmup, inbox placement. Get this wrong and your 80-name list lands in spam — and you never find out it happened.
A system, not a script
Email, LinkedIn, and — where it fits — WhatsApp, sequenced, each prospect locked to one sender, replies routed and handled. Then the feedback loop sharpens both the copy and the screen.
The data is the cost of entry. The conversion is the moat.
I'm comfortable giving you the data layer because I've built the conversion layer enough times to know that's where the real work — and the real edge — is.
Take the starter kit
The keepable toolkit to build the screen yourself — plain text you own. Free, no email.
A ready-to-run folder: a setup guide for the whole toolbox (the CLIs — Supabase, Vercel, Playwright — a .env key template for OpenRouter/Apify/Firecrawl/Tavily, and a ready Apify MCP config), the CLAUDE.md with the four hard rules that keep it accurate, a thesis + scoring template (your tunable gates), a copy-paste Day-1 bootstrap prompt that scaffolds the pipeline in Claude Code, the DESIGN.md behind this page, and a fictional sample dataset.
Want the whole origination engine?
The screen above, plus the outreach that turns it into a live, replying pipeline — done-for-you, tuned to your thesis, run as a system. I'm an ex-investor, so I build origination the way a deal team actually uses it, not the way a software vendor imagines you might.