Methodology — DataDawn

Why DataDawn exists, and how it’s built

U.S. federal records are public by law and fragmented by design. A member of Congress’s votes, stock trades, campaign donors, sponsored bills, the lobbyists who work them, and the regulations they shape all sit in separate databases, in separate formats, behind separate search tools. The information is technically open; the connections — the part that actually tells you something — are invisible unless you have the time and technical skill to assemble them yourself. DataDawn does that assembly and publishes the result: free, no account, no paywall, no advertising.

DataDawn is built by three collaborators: a human who sets direction and standards, and two AI systems — Claude (Anthropic) and DJ Crabdaddy (Crabdaddy for short, Claude Code) — that write the extraction pipelines, the parsers, and the entity-matching logic that do the real work of the platform. We say this plainly, up front, because it bears on how you should read everything below. You should not have to take an AI-built data pipeline on faith — and here you don’t, for a reason that doesn’t rest on our code being open: every substantive input is a primary government source you can re-pull yourself, and every record can be checked against the filing behind it — directly for what we publish as filed, by auditing the connection for the links we infer. Where our code is published it’s released CC0 (Section 7 says which parts are public). That is the point of the design: it makes the work checkable on its own terms, not contingent on trusting whoever — or whatever — produced it.

That reproducibility is also why DataDawn ingests only primary government data — IRS, SEC, FEC, the Federal Register, Congress.gov, and the like — plus a short list of open public registries. No commercial aggregators; no curated datasets from other public-interest organizations, even good ones. This costs us real things: lower match rates on the hard problems, slower coverage, more pipelines to maintain. We pay it on purpose. Building on someone else’s curated data would make our conclusions only as good — and only as auditable — as their methodology, which is usually a trade secret. Primary-sources-only means every claim on DataDawn ends at a government filing you can open — a record we publish as filed is that filing; a link we infer audits against the filings it connects.

Most of what we publish is data exactly as filed: we parsed what the government released, and we don’t correct, score, or characterize it. But some of the platform’s value comes from connecting records across datasets — and some of those connections are matched, not given. Where a link rests on a stable government identifier (an EIN, a member’s bioguide ID), it’s reliable and reproducible. Where it rests on name-matching or inference, it carries error we can measure — and would rather show you than bury. Throughout this page we mark which is which, and we report coverage and error rates as they are, including when they’re unflattering.

Finally, two lines we hold, on two different axes. We don’t editorialize about meaning: no ratings, no scores, no rankings, no judgments about who looks good or bad. The data is the data; the conclusions are yours. Separately, we do make principled decisions about individual privacy: we name people acting in a public capacity, because surfacing them is the whole point — but we won’t turn a private citizen who merely appears in a federal dataset (a tourist in a visitor log, a $50 donor in an FEC file) into a one-click search target. That’s a publication standard, not an editorial one. We never shade what the data says; we do decline to amplify incidental exposure of private people.

How to read this page

Everything below is organized around one question: how do we know a given fact, and how sure can you be of it? Three tiers of confidence run through the whole platform, and we mark which one applies wherever it matters.

As filed. Parsed directly from what the government published — a 990 return, a Federal Register notice, a roll-call vote. Most of the platform is this. We don’t correct, score, or second-guess it; if the source contains an error it propagates, and the correction lies with the filer, not with us.

Joined on a stable government identifier. Two records linked because they share a key the government itself assigns — an EIN, a member’s bioguide ID, a docket number, a bill ID. These joins are reproducible and low-error, but they rarely cover 100% of either side; where coverage is partial we report the coverage rather than impute the gap.

Matched or inferred. A link we decided — most often by matching a name — because no shared identifier exists. This is where judgment lives, and where error is measurable. We never state one of these in the confident voice of an as-filed fact; Section 3c sets out, field by field, exactly how each match is made, how much it covers, and where it breaks.

A note on freshness, because it bears on every figure here: DataDawn carries no single “last updated” date, because each dataset refreshes on its own schedule. Each dataset’s last successful pull is recorded on its Datasette explore page — that per-dataset stamp is the authoritative recency signal, not anything printed on this page.

The data, and how it connects

This is the body of the page, organized by the three tiers above. It opens with an inventory of every source dataset and the tier it falls under, then walks the tiers in turn. The hard part — and what a skeptical reader should scrutinize hardest — is the third, where links are matched rather than given; it gets the most space, on purpose.

Every source dataset on the platform, with its source agency and the confidence tier it falls under. Coverage windows are written open-ended (“to present”) wherever the dataset is still updated; no row counts appear here, because those change every build. For a live count, query the table itself. The table lists every source dataset, followed by the five linking layers (crosswalks, entity resolution, industry classification) that Sections 3b and 3c examine. Other derived summaries and rollups, such as a per-year comment count or an employer rollup, add no source data beyond these rows and are left out of the table.

Dataset	Source	Coverage	Tier
Federal Register documents	Office of the Federal Register (NARA)	1994 to present	As filed
Presidential documents	Office of the Federal Register	1993 to present	As filed
Regulations.gov dockets, documents & public comments	GSA / federal rulemaking agencies	rulemaking record, 2000s to present	As filed
Open public-comment snapshots	Regulations.gov	rolling open-comment periods	As filed
Code of Federal Regulations	GPO / Office of the Federal Register (eCFR)	current text; a subset of titles, now 7, 9, 21, 40, 50	As filed
OIRA regulatory reviews & meetings	OMB Office of Information & Regulatory Affairs	1981 to present	As filed
Legislation (bills, actions, cosponsors, subjects)	Congress.gov (Library of Congress)	Congresses 93 to present	As filed
Roll-call votes & member votes	U.S. Senate + House Clerk	1998 to present	As filed
Congressional Record & floor speeches	U.S. Government Publishing Office	1994 to present	As filed
Members of Congress	Congress.gov + public-domain roster re-export (`congress-legislators`)	current & historical roster	As filed
Committees & memberships	Congress.gov / House & Senate	current & historical	As filed
Committee hearings & witnesses	GPO / Congress.gov	to present (sparse before ~1990)	As filed
Nominations	Congress.gov	1987 to present	As filed
Treaties	Congress.gov	1965 to present	As filed
CRS reports	Congressional Research Service	1993 to present	As filed
GAO reports	Government Accountability Office	to present (sparse before ~1990)	As filed
Inspector General reports	Federal Inspectors General (oversight.gov)	1986 to present	As filed
CBO cost estimates	Congressional Budget Office	2003 to present	As filed
Congressional stock trades	House Clerk + Secretary of the Senate (STOCK Act)	2012 to present	As filed (member & issuer links: matched, §3c)
FEC contributions (donor name/address dropped — see §5)	Federal Election Commission	by cycle, to present	As filed
FEC committees, candidates & expenditures (operating, independent, electioneering, communication-cost, PAC)	Federal Election Commission	by cycle, to present	As filed
Federal spending awards	USAspending.gov (Treasury)	FY2008 to present	As filed
Earmarks / directed appropriations	Congressional disclosure	FY2022 to present	As filed
Congressional disbursements — House (non-personnel itemized; personnel aggregated to office, see §5)	House Clerk — Statement of Disbursements	to present	As filed
Congressional disbursements — Senate (office totals)	Secretary of the Senate	to present	As filed
IRS Form 990 filings (returns, officers, grants, Schedule I, capital gains, investments, related orgs, contributors, program activities, contractors, top employees)	IRS e-file bulk XML	tax years 2014 to present	As filed
IRS Business Master File (+ group exemptions)	Internal Revenue Service	current monthly extract	As filed
Lobbying disclosures (LD-1/LD-2 filings, activities, lobbyists, bills)	U.S. Senate (Lobbying Disclosure Act)	1999 to present	As filed
FARA registrations (registrants, foreign principals, short forms, documents)	U.S. Department of Justice (FARA)	to present	As filed
White House visitor logs (privacy-redacted, see §5)	The White House (WAVES)	2009 to 2024, as each administration releases	As filed
OGE Form 278 appointee filings & transactions	U.S. Office of Government Ethics	2020 to present	As filed
Plum Book appointees	GPO / U.S. House & Senate (Plum Book)	editions 2004, 2008, 2012, 2016, 2020, 2024	As filed
White House personnel reports	The White House (3 U.S.C. §113)	2013 to present (with gaps)	As filed
USDA APHIS animal-welfare inspections (facilities, inspections, enforcement, annual reports)	USDA Animal & Plant Health Inspection Service	to present	As filed
SAM.gov UEI registry	GSA / SAM.gov	current	As filed
Census government units	U.S. Census Bureau (Census of Governments)	current	As filed
Federal Register ↔ Regulations.gov crosswalk	derived from the two government IDs	—	Deterministic join
Congressional Record speaker → member	GPO speaker markup	—	Deterministic join
FEC candidate ↔ member crosswalk	in-house, hand-verified	—	Matched (§3c)
Entity resolution — the organization graph	in-house, identifier-first (EIN/CIK/LEI/UEI/FEC)	—	Matched (§3c)
Industry classification	in-house, rule-based (~40 categories)	—	Matched (§3c)

3a · As-filed primary data

Most of the inventory above is this tier: we parsed what the government published and publish it back as filed. We don’t correct, score, or characterize the data — if a filer made an error, it’s in our copy too, and the fix is theirs to file. The one thing we adjust is individual privacy, never the data’s substance (see Publication standards below); everything else is exactly as the government released it. What that buys you is a clean chain of custody: every record traces to a government filing you can open yourself. What it costs you is that the data carries the source’s limits, and an honest reader should know the ones that actually change how you’d use it.

Electronic filings only. The 990 corpus is built from the IRS e-file archive; paper-filed returns aren’t in it, and that is why coverage thins the further back you go. Before the Taxpayer First Act (2019), only the largest organizations (broadly, those with $10M+ in assets that file 250+ returns a year) and the smallest (the 990-N e-Postcard) were required to file electronically; the mid-size organizations in between could still file on paper. The Act extended mandatory e-filing to those mid-size filers, effective for tax years beginning after July 1, 2019 and reaching calendar-year filers around 2020. So machine-readable coverage is strong for recent years and progressively thinner in earlier ones, where more of the mid-size organizations were still on paper.

Filing lag. Records appear when the government posts them, not when the activity happened. The most recent periods are still accumulating — a low count for last year usually means “not filed yet,” not “didn’t happen.” It’s why the coverage windows above are written open-ended rather than to a fixed end.

Sparse early years. Some sources reach back decades but thin out the further back you look — committee hearings and GAO reports carry occasional records from the early 20th century, but the dense, reliable coverage is recent. Where the inventory marks a window “sparse before ~1990,” read the early tail as illustrative, not complete.

Self-reported fields. Much of this is what a filer said about themselves — a lobbyist’s described issues, a former official’s covered position, an officer’s title. We publish the self-report as the record of what was disclosed, not as independently verified fact.

Range-based amounts. Some disclosures report brackets, not exact figures — a congressional stock trade discloses a dollar range per transaction, not a precise amount. Any total built from them is therefore itself a range; we don’t collapse a bracket to its midpoint and present that as a number.
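As a minimal sketch of how a total built from brackets stays a range, the Python below sums the low ends and the high ends separately. The `Bracket` type and the dollar values are illustrative assumptions, not DataDawn's schema or a statement of the official disclosure brackets.

```python
from typing import Iterable, NamedTuple


class Bracket(NamedTuple):
    """A disclosed dollar range for one transaction (hypothetical shape)."""
    low: int
    high: int


def total_range(brackets: Iterable[Bracket]) -> Bracket:
    """Sum the low ends and the high ends separately; never use midpoints."""
    items = list(brackets)
    return Bracket(sum(b.low for b in items), sum(b.high for b in items))


# Illustrative values only.
trades = [Bracket(1_001, 15_000), Bracket(15_001, 50_000), Bracket(1_001, 15_000)]
total = total_range(trades)
print(f"${total.low:,} to ${total.high:,}")
```

The result is reported as a low bound and a high bound; no single figure is implied.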

Two dataset-specific shapes a query can get wrong, surfaced here because a careful reader will hit them: the 990 set excludes Form 990-T (a different return on a different schema), and campaign-finance records mix transaction types that can mean opposite things — a contribution and an expenditure spent against a candidate can sit in one table — so a sum across types conflates them. Both are documented on the relevant explore pages and in the schema, not left as traps.
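A small sketch of the second trap, assuming a table in which each row carries a transaction type: totals are kept per type so that money spent against a candidate is never added to money given to that candidate. The field names and type labels are hypothetical, not the FEC's own codes or DataDawn's column names.

```python
from collections import defaultdict

# Hypothetical rows; real records use the source's own type codes.
rows = [
    {"candidate": "C1", "type": "contribution", "amount": 2_000},
    {"candidate": "C1", "type": "expenditure_against", "amount": 5_000},
    {"candidate": "C1", "type": "contribution", "amount": 500},
]


def totals_by_type(records):
    """Total amounts per (candidate, transaction type), never across types."""
    totals = defaultdict(int)
    for record in records:
        totals[(record["candidate"], record["type"])] += record["amount"]
    return dict(totals)


by_type = totals_by_type(rows)
for (candidate, kind), amount in sorted(by_type.items()):
    print(candidate, kind, amount)
```

A single sum over all three rows would report 7,500 for the candidate, a figure that describes neither support nor opposition.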

3b · Deterministic joins on stable government identifiers

The platform’s value isn’t only the datasets — it’s that they connect. Most of those connections are made the reliable way: on an identifier the government itself assigns and keeps stable. A member of Congress carries one bioguide_id across their votes, sponsored bills, committee seats, floor speeches, and trades; a nonprofit carries one EIN across its return, its officers, and the grants it makes; a rulemaking carries one docket number across its Federal Register notice and its public comments. Join on those and the link is reproducible — you would get the same result from the same public IDs.

The honest part is coverage. A shared identifier links the records that carry it, and that is rarely 100% of either side — a speech without a clean speaker tag, a filing without a parsable ID. Where a join is partial, we report the coverage as it is and leave the gap a gap. We don’t impute the missing links, and we don’t quietly drop the unmatched records to make the rate look complete: the unmatched rows stay in the database, queryable, simply without that join key. A coverage number here is a statement about how much connects, never a claim that what doesn’t connect isn’t there.
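The coverage idea can be sketched with Python's built-in `sqlite3` module. The table names, column names, and member IDs below are made up for illustration; the point is only that a LEFT JOIN keeps the unmatched rows and lets coverage be counted rather than assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (bioguide_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE speeches (speech_id INTEGER PRIMARY KEY, bioguide_id TEXT);
    INSERT INTO members VALUES ('X000001', 'Example Member One'),
                               ('X000002', 'Example Member Two');
    INSERT INTO speeches VALUES (1, 'X000001'), (2, 'X000002'),
                                (3, NULL), (4, 'X000001');
""")

# LEFT JOIN keeps every speech; a row without a join key stays, with no member.
joined = conn.execute("""
    SELECT s.speech_id, m.name
    FROM speeches AS s
    LEFT JOIN members AS m USING (bioguide_id)
    ORDER BY s.speech_id
""").fetchall()

matched = sum(1 for _, name in joined if name is not None)
coverage = matched / len(joined)
print(f"{matched} of {len(joined)} speeches joined ({coverage:.0%})")
```

An INNER JOIN would have returned three rows and hidden the fourth; the LEFT JOIN reports the gap instead.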

Two linkages sit in this tier but earn a visible caveat, because “provided by the government” is not the same as “perfect.” Floor-speech speakers are matched to members from the GPO’s own markup of the Congressional Record — provided, not guessed — and nearly all speech entries carry a member ID; but the GPO occasionally misattributes a speaker in back-and-forth colloquy, so treat a single speech attribution as reproducible, not infallible. And the Federal Register–to–Regulations.gov crosswalk links a rule’s notice to its comment docket on the two documents’ own identifiers; its limit is completeness, not correctness — only a minority of Federal Register documents have a Regulations.gov counterpart, so it under-connects rather than mis-connects.

3c · Matched and inferred fields

The fields here are different in kind from everything above. In 3a we published what the government filed; in 3b we joined records on a shared government identifier that either matches or doesn’t. Here there is no shared identifier — we are deciding that this name in one dataset refers to the same real-world person or organization as that name in another. That decision is a judgment, made in-house, and wrong some measurable fraction of the time. This section is where we tell you how often, and in which direction.

Two kinds of error matter throughout, and they are not symmetric. A false merge says two different things are the same — it fabricates a connection, putting a trade, grant, or donation on the wrong person or organization. A missed match fails to connect two things that are the same — it understates, leaving a real connection invisible, but it doesn’t invent anything. Where we have a choice we bias toward missed matches over false merges, and toward disclosing an ambiguity rather than resolving it by guessing. Where we’ve built that discipline in — the revolving-door link — an ambiguous name is disclosed, not resolved to an arbitrary pick. Elsewhere, matching a name can still mis-attach; we don’t claim it never does — we measure how often and show it (the over-merge and confusable-name rates below).

Stock trade → member of Congress

What we link, and why. Congressional Periodic Transaction Reports disclose individual securities trades. To make them useful next to a member’s votes, committee seats, and sponsored bills, each trade has to carry that member’s identifier.

The rule. A report is filed by an identified filer; we map that filer to the congressional roster by name and chamber. This is not a fuzzy search over the whole population — it is a name match against a known, finite list of members serving in a known chamber and period.

Coverage. Effectively every trade we publish carries a member identifier. Read that correctly: it is a statement about coverage, not correctness. We retain trades we can attribute to a member, so the join rate is high by construction — it is not evidence that every attribution is right.

Known error (false merge). The failure mode is same-name collision: across the history of Congress, some members share a name. Two structural facts bound it — the filer pool is small (a few hundred members), and matching resolves a filer to the sitting member, so a same-named member from an earlier era is not a likely false target. In the current data the exposure is tiny: about one filer name in several hundred is ambiguous against the full historical roster, touching only a handful of trades. Where two members of the same era are genuinely confusable an attribution can still land on the wrong one — so treat a confusable name as a lead to verify against the underlying filing, which we link, not as settled fact.
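A hedged sketch of a bounded name match of this kind: a filer is compared only against roster entries in the same chamber who were serving on the filing date. The `Member` shape, the IDs, and the names are invented for illustration. Returning an explicit "ambiguous" status for a same-era collision is this sketch's choice, not a description of the production pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Member:
    bioguide_id: str  # illustrative IDs, not real ones
    name: str
    chamber: str
    start: date
    end: date


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def match_filer(name, chamber, filed_on, roster):
    """Return (bioguide_id or None, status) for one filer."""
    hits = [m for m in roster
            if normalize(m.name) == normalize(name)
            and m.chamber == chamber
            and m.start <= filed_on <= m.end]
    if len(hits) == 1:
        return hits[0].bioguide_id, "matched"
    if not hits:
        return None, "unmatched"
    return None, "ambiguous"  # same-era collision: flag it, do not pick


roster = [
    Member("E000001", "Pat Example", "House", date(1975, 1, 3), date(1981, 1, 3)),
    Member("E000002", "Pat Example", "House", date(2015, 1, 3), date(2025, 1, 3)),
    Member("E000003", "Sam Sample", "Senate", date(2011, 1, 3), date(2023, 1, 3)),
]
print(match_filer("pat  example", "House", date(2020, 3, 1), roster))
```

The service-date filter is what keeps the earlier same-named member from being a candidate at all.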

Stock trade → issuer (SEC CIK)

What we link, and why. Tagging each traded security with the issuer’s SEC Central Index Key lets a trade join to that company’s regulatory filings.

The rule. A ticker-to-issuer crosswalk maps the disclosed ticker to a CIK. Where the ticker maps cleanly the trade gets a CIK; where it doesn’t, the trade stays in the database with no CIK rather than a guessed one.

Coverage. Roughly two-thirds of trades carry a CIK.

Known error (missed match). The third without one is not random: it concentrates in securities that don’t map to a domestic SEC filer — foreign issuers, municipal bonds, options and other derivatives, private holdings, funds, and delisted or renamed tickers. This is overwhelmingly a missed-match problem: an absent CIK means “we couldn’t confidently map this,” not “we mapped it wrong.” The CIK column is a convenience where present, not a complete index of what members traded.
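The no-guess rule can be sketched as a plain lookup that returns nothing when the ticker is unknown. The crosswalk contents, tickers, and CIK values here are placeholders, not real issuers or DataDawn's actual table.

```python
from typing import Optional

# Hypothetical crosswalk; tickers and CIKs are placeholders.
TICKER_TO_CIK = {"EXMPL": "0000000001", "DEMO": "0000000002"}


def cik_for(ticker: Optional[str]) -> Optional[str]:
    """Return a CIK only on a clean ticker match; otherwise None, never a guess."""
    if not ticker:
        return None
    return TICKER_TO_CIK.get(ticker.strip().upper())


disclosed = ["exmpl", "DEMO", "MUNI-BOND", None, "EXMPL", "ZZZZ"]
ciks = [cik_for(t) for t in disclosed]
coverage = sum(c is not None for c in ciks) / len(ciks)
print(ciks, f"coverage {coverage:.0%}")
```

Every trade stays in the list; the unmapped ones carry `None`, which reads as "not mapped" and not as "mapped wrong".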

Former members of Congress who lobby (“revolving door”)

What we link, and why. Lobbying registrations name lobbyists and, where applicable, the government positions they previously held. Isolating former members of Congress who now appear on lobbying filings is a small, high-interest slice of the broader revolving door.

The rule. Two inferences stacked. First we read the self-reported “covered position” text on lobbying records and pattern-match it for former-congressional-service language. Then we name-match those people back to the congressional roster.

Coverage. A deliberately narrow, conservative view — a small set of former members, not the full universe of ex-government lobbyists. A starting point, not a census.

Known error (both directions). Two inferences means two ways to be wrong. The source text is self-reported free text, so a position can be mis-stated or phrased in a way our patterns miss (missed match). And the name join can be ambiguous: a former member’s name sometimes matches more than one person on the historical roster (false merge). Where a name resolves to a single former member we assert it; where it collides, we disclose the ambiguity rather than guess. Either way, treat an individual revolving-door attribution as a lead to confirm against the underlying lobbying filing, which we link.
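The two stacked inferences can be sketched as follows. The regular expression is a deliberately crude stand-in for the real patterns, and the record fields, names, and IDs are hypothetical. The sketch shows only the shape: flag on the self-reported text first, then link by name when the name is unique and disclose the candidates when it is not.

```python
import re

# Crude illustrative pattern: position text that opens with congressional service.
COVERED_POSITION = re.compile(
    r"^\s*(?:former\s+)?(?:member\s+of\s+congress|u\.?s\.?\s+(?:representative|senator))\b",
    re.IGNORECASE,
)


def normalize(name: str) -> str:
    return " ".join(name.lower().split())


def resolve_lobbyist(record: dict, roster_by_name: dict) -> dict:
    """Step 1: pattern-match the covered position. Step 2: name-match the roster."""
    if not COVERED_POSITION.search(record.get("covered_position") or ""):
        return {"status": "not_flagged"}
    ids = roster_by_name.get(normalize(record["name"]), [])
    if len(ids) == 1:
        return {"status": "linked", "bioguide_id": ids[0]}
    if ids:
        return {"status": "ambiguous", "candidates": sorted(ids)}
    return {"status": "unlinked"}


# Hypothetical roster index: normalized name -> member IDs sharing that name.
roster_by_name = {"alex sample": ["S000001"], "jordan example": ["E000001", "E000002"]}

print(resolve_lobbyist(
    {"name": "Alex Sample", "covered_position": "Member of Congress (1999-2011)"},
    roster_by_name,
))
```

A staff title such as "Legislative Director, Rep. Example" is not flagged by this pattern, which illustrates the missed-match side: anything phrased outside the patterns is simply not picked up.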

FEC candidate ID ↔ member identifier

What we link, and why. The FEC identifies candidates with its own ID, unrelated to the congressional member identifier. A crosswalk between them is what lets a member’s campaign finance join to their legislative record.

The rule. This is a finite, hand-verified mapping we built — not an automated fuzzy match. Each pairing was checked.

Coverage. It maps members to their FEC candidate IDs and is small by design (see the live table). It is deliberately not a complete FEC-to-Congress index: the FEC registers tens of thousands of candidates, the vast majority of whom never served in Congress and so have no member identifier to map to.

Known error (missed match). Because it’s hand-verified, false merges are rare — this is the highest-confidence linkage in this section. Its limitation is completeness, not correctness: a member missing from the crosswalk is a gap to fill, never a wrong pairing asserted.

Officer → employer organization

What we link, and why. Most named people on the platform are nonprofit officers, directors, and trustees drawn from 990 filings. Connecting each to the organization they serve is what lets an organization page show its leadership, and a person be traced across the filings they appear in.

The rule. For an officer on a 990, the employer is the filing organization, and the filing carries that organization’s EIN. So this link is anchored to a government identifier, not inferred from a name — which makes it far more reliable than the name-only matches elsewhere in this section.

Coverage. The large majority of role records — roughly nine in ten — carry an employer organization. The unlinked remainder is mostly actor types that don’t come with an EIN-bearing filing.

Known error — two very different things, kept separate. The officer-to-organization link is EIN-anchored and high-precision; our sampling put it in the high-nineties percent. The residual is structural — e.g., an officer of a fiscally-sponsored project attributed to the sponsoring organization rather than the project (a narrow false merge). Separately, person-to-person identity — deciding that “John Smith” on one filing is the same human as “John Smith” on another — is a different problem, and the one place on the platform we do not claim to have solved. There is no public person-level identifier, so we do not assert that same-named individuals across organizations are one person; treat cross-organization identity as unconfirmed unless a stable ID ties them together.

Industry classification

What we link, and why. A coarse industry tag on policy-relevant organizations (roughly forty categories) lets you ask sector-level questions — which industries lobby on what, who funds whom — that no single government dataset answers directly.

The rule. A deterministic, rule-based classifier applied in fixed precedence: manual pins for a handful of known entities, then SIC industrial codes, then a selective read of nonprofit NTEE codes, then foreign-filer patterns, then inheritance from a PAC’s connected organization — plus a curated trade-association map. Every tag traces to a stated rule, not a model’s judgment.

Coverage. Only a small, deliberately chosen slice of organizations is tagged — the federally policy-relevant ones (tens of thousands), not the millions in the platform. An untagged organization is the default, not a failure.

Known error (both directions). The rules are deliberately blunt. A broad code can over-include (an NTEE bucket that lumps a zoo with an advocacy group), and a curated federation name — a national union or trade association — tags all of that federation’s local affiliates. That last case is usually correct, but it marks federation membership broadly rather than identifying one headquarters organization — so read it as a sector hint, not a precise per-entity coding. In the other direction, anything outside the rules’ reach is simply left untagged. We publish it as a navigational aid, explicitly not as an authoritative industry classification of any single organization.
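A minimal sketch of a fixed-precedence classifier of this kind: rules are tried in order, the first one that yields a tag wins, and the rule's name is returned with the tag so the result traces to a stated rule. Every mapping, pattern, and field name below is an illustrative assumption. The curated trade-association map is left out because its place in the precedence order is not stated here.

```python
import re

# Illustrative lookups only; the real tables are curated and much larger.
MANUAL_PINS = {"ORG-001": "defense"}
SIC_TO_TAG = {"2834": "pharmaceuticals", "6021": "banking"}
NTEE_TO_TAG = {"E": "health"}
FOREIGN_PATTERN = re.compile(r"\b(embassy|ministry)\b", re.IGNORECASE)


def by_pin(org):
    return MANUAL_PINS.get(org.get("org_id"))


def by_sic(org):
    return SIC_TO_TAG.get(org.get("sic"))


def by_ntee(org):
    return NTEE_TO_TAG.get((org.get("ntee") or "")[:1])


def by_foreign(org):
    return "foreign_government" if FOREIGN_PATTERN.search(org.get("name") or "") else None


def by_connected_org(org):
    parent = org.get("connected_org")  # a PAC inherits its connected org's tag
    return classify(parent)[0] if parent else None


RULES = [
    ("manual_pin", by_pin),
    ("sic", by_sic),
    ("ntee", by_ntee),
    ("foreign_filer", by_foreign),
    ("pac_connected_org", by_connected_org),
]


def classify(org):
    """Return (tag, rule_name) from the first rule that applies, else (None, None)."""
    for rule_name, rule in RULES:
        tag = rule(org)
        if tag is not None:
            return tag, rule_name
    return None, None


print(classify({"name": "Example PAC", "connected_org": {"sic": "6021"}}))
```

Returning `(None, None)` for an organization no rule reaches matches the stance above that untagged is the default.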

Entity resolution — the organization graph

This is the largest and most consequential matched layer on the platform, and the one a skeptical reader should scrutinize hardest. It is live — it powers the organization pages and the entity finders. It is also, by its nature, the layer with the most measurable error, and the rest of this page would be dishonest if it described everyone else’s error rates and hid its own.

What it is. A single resolved organization that gathers a company, nonprofit, union, trade association, or political committee together across the many datasets that name it differently — so you can ask “show me every federal touchpoint of this organization” without hand-assembling it from fragile name matches.

The rule. The backbone is identifier-first, not name-first. Nearly every resolved organization — the org node itself — is anchored to a stable identifier: an IRS EIN for nonprofits, plus SEC CIKs, FEC committee IDs, GLEIF LEIs, and SAM.gov UEIs where applicable. Organizations are merged on the identifier where one exists. Name-matching enters at a different step — attaching incoming records that arrive without an identifier (a grant recipient, a lobbying client) to those nodes — and that attachment, not the node anchoring, is where the matched-layer error measured below lives. There the discipline is resolve or disclose: we link a name to an organization only when the name is globally unique, or can be uniquely pinned by adding state and then city; when it stays ambiguous, we disclose the ambiguity rather than pick one.
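The resolve-or-disclose step can be sketched as a narrowing loop: try the bare name, then add state, then city, and link only when exactly one candidate remains. The `OrgNode` shape, the placeholder EINs, and the exact-match name comparison are assumptions of this sketch, not the production matcher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrgNode:
    ein: str  # placeholder identifiers
    name: str
    state: str
    city: str


def resolve_or_disclose(name, state, city, nodes):
    """Return ("linked", [node]) only when one candidate remains; else disclose."""
    candidates = [n for n in nodes if n.name.casefold() == name.casefold()]
    if not candidates:
        return "unmatched", []
    if len(candidates) == 1:
        return "linked", candidates  # globally unique name
    for attr, value in (("state", state), ("city", city)):
        if not value:
            break
        narrowed = [n for n in candidates if getattr(n, attr).casefold() == value.casefold()]
        if len(narrowed) == 1:
            return "linked", narrowed
        if not narrowed:
            break
        candidates = narrowed
    return "ambiguous", candidates  # disclosed, never an arbitrary pick


nodes = [
    OrgNode("00-0000001", "Unique Example Fund", "OH", "Columbus"),
    OrgNode("00-0000002", "Community Foundation", "TX", "Austin"),
    OrgNode("00-0000003", "Community Foundation", "TX", "Dallas"),
    OrgNode("00-0000004", "Community Foundation", "WA", "Seattle"),
]
status, found = resolve_or_disclose("Community Foundation", "TX", "Dallas", nodes)
print(status, [n.ein for n in found])
```

With only a state of "TX" and no city, the same name returns "ambiguous" with both Texas candidates listed.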

Coverage. Millions of organizations, the large majority identifier-anchored. The relationship graph that connects parents to subsidiaries and predecessors to successors is dominated by IRS-certified group-exemption memberships (an authoritative government grouping), with smaller hand-curated sets of successor and affiliate links.

Known error — stated toward the unflattering side, because this is where it counts. The hard case is linking a free-text organization name that arrives with no identifier — most importantly grant recipients, since foundation grant records carry a recipient name but no recipient EIN.

Over-merge (false merge) — collapsing distinct organizations that share a name. On grant-to-organization links this affects about one linked grant in fourteen. By dollars the figure looks larger, near one in eight — but that gap is the opposite of what it appears. A single recipient name accounts for roughly seventy percent of all the over-merge dollars: one large foundation that shares a bare name with a small unrelated namesake, a case a state-level check separates cleanly. Set that one name aside and the dollar rate falls to the low single digits, below the per-link rate. And measured directly on the most-funded recipient names — the ones a reader is most likely to look up — the over-merge rate stays in that same low range — no higher on the big, recognizable names than across the data as a whole, within the noise of these small samples. Worth stating plainly, because it is the kind of thing easy to get backwards: the raw dollar figure made over-merge look concentrated in exactly the prominent names a reader would search — and measuring it directly showed the reverse. The method corrected our own first reading. So over-merge is real but uncommon; the dollar headline is one resolvable outlier, not broad contamination; and a name total for a well-known recipient is usually right — though because a colliding large name can still carry a large sum, confirm a name-based total against the EIN wherever one is reported.
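A back-of-envelope check of the outlier arithmetic, using only the rounded figures quoted above (one in eight by dollars, one in fourteen by links, seventy percent from one name). It assumes the outlier name's dollars are removed from both the over-merge total and the overall total. It is a consistency check on the rounded numbers, not the measured computation.

```python
# Rounded figures from the text, expressed as fractions of all linked grant dollars.
dollar_rate = 1 / 8      # over-merge share by dollars
per_link_rate = 1 / 14   # over-merge share by linked grants
outlier_share = 0.70     # share of over-merge dollars from the one outlier name

outlier_dollars = outlier_share * dollar_rate        # 0.0875 of all dollars
remaining_overmerge = dollar_rate - outlier_dollars  # 0.0375 of all dollars
remaining_total = 1.0 - outlier_dollars              # assumption: drop the outlier from the base too
rate_without_outlier = remaining_overmerge / remaining_total

print(f"by links: {per_link_rate:.1%}")
print(f"by dollars: {dollar_rate:.1%}")
print(f"by dollars, outlier set aside: {rate_without_outlier:.1%}")
```

Under these assumptions the dollar rate without the outlier comes to about 4%, which sits below the per-link rate of about 7% and matches the "low single digits" description.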

Fragmentation (missed match) — the opposite error, and in practice the larger of the two; this, not over-merge, is the caveat to carry away from this section. One real organization appears under several name spellings, so a name-based view of it is incomplete. To size it honestly we checked grant recipient names that do carry an EIN against the canonical IRS registry, which splits the apparent mismatch three ways: roughly a quarter to a third are genuinely different spellings of the same organization — that is the real fragmentation; about one in seven differ only in formatting (punctuation, a leading “The,” an “Inc”), which is not fragmentation at all; and roughly one in six point to an organization not in the registry snapshot, which is a coverage gap rather than a matching failure. So the honest headline is a quarter to a third — the largest residual in this section, but well short of “most.”
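The three-way split can be sketched as a comparison of each EIN-bearing recipient name against a registry keyed by EIN. The normalization rules (punctuation, a leading "The", a trailing "Inc") follow the examples in the text; the suffix list, the registry contents, and the placeholder EINs are assumptions of this sketch.

```python
import re

SUFFIXES = {"inc", "incorporated"}  # illustrative; a real list would be longer


def normalize_name(name: str) -> str:
    """Strip case, punctuation, a leading 'the', and trailing corporate suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.casefold()).split()
    if words and words[0] == "the":
        words = words[1:]
    while words and words[-1] in SUFFIXES:
        words.pop()
    return " ".join(words)


def classify_mismatch(grant_name: str, ein: str, registry: dict) -> str:
    """Compare a grant's recipient name to the registry name for the same EIN."""
    canonical = registry.get(ein)
    if canonical is None:
        return "not_in_registry"     # coverage gap, not a matching failure
    if grant_name == canonical:
        return "exact"
    if normalize_name(grant_name) == normalize_name(canonical):
        return "formatting_only"     # not fragmentation
    return "different_spelling"      # genuine fragmentation


# Hypothetical registry snapshot keyed by placeholder EINs.
registry = {"00-0000001": "Example Community Fund", "00-0000002": "Sample Relief Society"}

print(classify_mismatch("The Example Community Fund, Inc.", "00-0000001", registry))
print(classify_mismatch("Example Cmty Fund", "00-0000001", registry))
```

Only the "different_spelling" bucket counts toward fragmentation; the other two mismatch buckets are formatting noise and registry coverage.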

What this means for a name lookup. Most recipient names are unambiguous and link cleanly. The hard subset is names that collide with several organizations — a common or generic name shared across many filers; there, the state and city on the filing often still aren’t enough to single one out. We don’t paper that over or dress it up: where a colliding name resolves uniquely we link it, and where it doesn’t we disclose the ambiguity rather than guess. We can measure how often these collisions resolve, and it is poor — but that is a resolvability rate on the hardest names, not a matcher accuracy rate for the whole surface, and we won’t present one as the other. Treat a name-based total for a common or generically-named recipient as a lead to verify against the underlying filings, not a settled attribution.

We mitigate the over-merge with the resolve-or-disclose rule above (and by preferring an EIN whenever a record actually carries one): a shared name is linked only when state, then city, pins it to a single organization, and is otherwise disclosed rather than guessed. That removes the arbitrary-pick defect without discarding correct unique-name links. Fragmentation is the harder, more open problem; we are honest that a single name-based lookup of a large organization may not show you all of it.

Sourcing: primary government data only

The intro states the principle; this section is how we apply it at the edges, because a sourcing rule is only as good as its hardest cases. The rule itself is narrow on purpose.

The rule. DataDawn ingests substantive data only from primary government sources — the body that collected or compelled the record publishes it, and we parse that. We do not ingest facts from commercial data aggregators, and we do not build on another organization’s curated dataset, however reputable. Alongside the government sources we use a short, named list of open public registries, and we use those only to anchor identity — never as a source of the trades, grants, votes, or filings themselves.

What counts as a primary source. A record published or maintained by the government body that collected it: the IRS for 990 returns, the SEC for securities filings, the FEC for campaign finance, GPO and the Federal Register for rules, the Census Bureau for the registry of government units, GSA’s SAM.gov for the federal entity registry, Congress.gov and the Clerk and Secretary for legislative records. We take these as filed. The test is provenance, not subject matter: if a fact on DataDawn cannot be traced to a record some government body published, it does not belong here.

The carve-outs, named. Three things we rely on are not, strictly, a government body publishing its own record — so we name them and say exactly why each is consistent with the rule rather than an exception we hope you won’t notice. The common thread: every carve-out supplies an identifier or roster the government itself adopted, never the substantive data.

Carve-out	What it is	Why it doesn’t break the rule
SAM.gov Unique Entity ID	The 12-character entity identifier the federal government issues through GSA’s SAM.gov.	It is a government-issued identifier — the one federal award systems require under 2 CFR Part 25. Using it is using a federal source.
Legacy DUNS number	The nine-digit identifier issued by Dun & Bradstreet that the UEI replaced on April 4, 2022.	It appears in older federal award records because the government adopted it as the authoritative entity ID for two decades. We carry it as a join key into that historical data, not as a present-day data source.
Public-domain member roster (`congress-legislators`)	An open, public-domain re-export of congressional identifiers and service dates.	Every value in it traces to an official source (the Biographical Directory of Congress and the chambers’ own records). We use it for the member roster and ID crosswalk — identifiers, not facts — and a reader can re-derive any of it from the official record.

One more registry earns a word: where an organization carries a Legal Entity Identifier (LEI, an open, free identifier overseen by the Global Legal Entity Identifier Foundation, GLEIF), we use it as one more anchor for entity resolution. Like the carve-outs above, an LEI tells us which organization a record belongs to; it is never the source of what that record says.
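To make “identifiers, not facts” concrete, here is a small Python sketch of an identity anchor as a typed value that carries nothing but a kind and an ID. The class and function names are hypothetical, and the patterns are shape checks only (12 characters for a UEI, nine digits for a DUNS, 20 characters ending in two check digits for an LEI); they do not verify check digits or any issuer-specific character rules.

```python
import re
from dataclasses import dataclass

# Shape checks only. These patterns are simplifications and do not verify
# check digits or any issuer-specific character restrictions.
SHAPES = {
    "uei": re.compile(r"[A-Z0-9]{12}"),          # SAM.gov UEI: 12 characters
    "duns": re.compile(r"[0-9]{9}"),             # legacy DUNS: nine digits
    "lei": re.compile(r"[A-Z0-9]{18}[0-9]{2}"),  # LEI: 20 characters, two trailing check digits
}


@dataclass(frozen=True)
class IdentityAnchor:
    kind: str   # "uei", "duns", or "lei"
    value: str  # the identifier itself; no substantive data rides along

    def __post_init__(self):
        if self.kind not in SHAPES:
            raise ValueError(f"unknown anchor kind: {self.kind!r}")
        if not SHAPES[self.kind].fullmatch(self.value):
            raise ValueError(f"{self.value!r} is not shaped like a {self.kind}")


def anchor(kind: str, raw: str) -> IdentityAnchor:
    # Normalize case and strip separators before the shape check.
    cleaned = raw.strip().upper().replace("-", "").replace(" ", "")
    return IdentityAnchor(kind.lower(), cleaned)
```

The kind is passed in rather than guessed from the string, because shape alone is ambiguous: a nine-digit value could be a DUNS or an EIN, and only the source field says which.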

What the rule costs. This discipline is not free, and the costs fall exactly where a commercial shortcut would help most. We get lower match rates on the hardest entity problems, because we won’t buy a vendor’s pre-resolved crosswalk; we add coverage more slowly, because each new source is a pipeline we build and maintain ourselves; and we leave gaps open rather than fill them from a dataset we can’t show you. We accept all of it deliberately. Building on a curated dataset would make every DataDawn conclusion only as sound — and only as auditable — as a methodology that is usually a trade secret. Primary-sources-only is what lets every claim here end at a government filing you can open — as-filed records are those filings; matched links audit against the filings they connect.

Publication standards

This is the second of our two axes of honesty, and the one most likely to be misread as the first. We do not editorialize about what the data means. Separately, we make principled decisions about individual privacy — and those are publication decisions, never editorial ones. Here is the rule we actually apply, and every place we’ve applied it.

The principle. DataDawn’s value comes from naming real people who act in public roles — a Cabinet member’s trades, a lobbyist’s clients, a nonprofit’s officers. That requires naming them. But the platform should not amplify the incidental exposure of a private citizen whose name merely happens to appear in a federal dataset. The rest of this section is how we draw that line, consistently and in the open.

We publish a named individual when all three hold: (A) they act in an official public capacity (an elected, appointed, or Senate-confirmed official; a senior executive hire; a registered LDA lobbyist; a listed 990 officer; a public advisory-committee member; or a filer of a legally mandated public disclosure such as a STOCK Act report, an OGE Form 278, or an FEC candidate filing); (B) the data element concerns that public role (a salary, vote, trade, meeting, filing, affiliation, or registration), not incidental private life; and (C) the underlying disclosure is either voluntary or legally mandated to be public.

We aggregate, redact, or exclude when any one holds: (a) the person is a private citizen appearing incidentally (a tour visitor, a pro-se commenter, a constituent, junior staff whose role doesn’t require public identification); (b) the element reveals personal financial, medical, location, or relationship information beyond the public role; or (c) an aggregated form keeps the analytical value without identifying anyone.
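The two tests can be written as a pair of boolean checks. The Python sketch below is illustrative, not DataDawn’s code, and it makes two assumptions the text does not state: that a withhold condition takes precedence when both tests fire, and that an element satisfying neither test is sent to review. The `Element` fields and the returned labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    public_capacity: bool                      # (A)
    concerns_public_role: bool                 # (B)
    voluntary_or_mandated: bool                # (C)
    incidental_private_citizen: bool = False   # (a)
    reveals_private_details: bool = False      # (b)
    aggregate_keeps_value: bool = False        # (c)


def publication_decision(e: Element) -> str:
    # Any one of (a), (b), (c) is enough to withhold the name.
    withhold = (
        e.incidental_private_citizen
        or e.reveals_private_details
        or e.aggregate_keeps_value
    )
    # All three of (A), (B), (C) are required to publish it.
    publish = e.public_capacity and e.concerns_public_role and e.voluntary_or_mandated

    if withhold:
        return "aggregate_redact_or_exclude"  # assumed to win over publish
    if publish:
        return "publish_named"
    return "needs_review"  # assumption: neither test settles the case
```

The asymmetry is the substance of the rule: publishing needs every condition, withholding needs only one.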

The paradigm case is FEC individual contributions. We ingest the full itemized individual-contribution file for analysis, but the version we publish drops the donor’s name and address, keeping employer, occupation, state, amount, and recipient. We want to be precise about what that does and doesn’t do, because a privacy claim is easy to overstate: we are not anonymizing anyone. The named itemized record is public at the FEC source, and employer plus amount plus date can still fingerprint an individual. Dropping the name and address fields does something narrower and real: it declines to make DataDawn itself a name-searchable index of small donors. You can still ask “which employers’ people gave to which campaigns” without our turning a $200 donor into a one-click search target here. Those are rules (a) and (c), and the template for the decisions below.
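One way to express that projection is an allowlist of published fields rather than a blocklist of dropped ones. This Python sketch is an assumption about shape, not the actual build step, and the dictionary keys are hypothetical field names, not the FEC file’s real column names.

```python
# Hypothetical field names for the five elements the published version keeps.
PUBLISHED_FIELDS = ("employer", "occupation", "state", "amount", "recipient")


def to_published_row(row: dict) -> dict:
    """Keep only allowlisted fields; anything else never reaches the output."""
    return {field: row.get(field) for field in PUBLISHED_FIELDS}


raw = {
    "name": "DOE, JANE",              # made-up example donor
    "address": "1 EXAMPLE ST",
    "employer": "EXAMPLE CORP",
    "occupation": "ENGINEER",
    "state": "OH",
    "amount": 200,
    "recipient": "EXAMPLE COMMITTEE",
}
published = to_published_row(raw)
```

An allowlist fails closed: if the source file gains a new identifying column, it stays out of the published table until someone adds it deliberately. As the text says, this is not anonymization; the remaining fields can still fingerprint a donor.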

Every applied decision, in one place:

Source / element	Decision	Why
OGE Form 278 appointee filings	Publish	All filers Senate-confirmed (A), filings concern the role (B), legally mandated (C).
White House visitor logs — visitees	Publish	The official being visited is acting in their public capacity.
WH visitor logs — tour / group / social context	Exclude name	Large groups, blank names, and tour/reception/ceremony context are incidental (a).
WH visitor logs — other visitors	Enrich, then redact	Keep the name only if it matches a known public actor — a member of Congress, a federal appointee (Plum Book or Senate-confirmed), an FEC-registered candidate, a FARA-registered foreign agent, or a registered lobbyist — otherwise redact the name while keeping date, visitee, and location for aggregate queries. (A) vs (a) by match.
Congressional disbursements — member level	Publish	Members are elected officials (A)(B)(C).
Congressional disbursements — named staff + salary	Aggregate to office level	Junior staff don’t require public identification; office-level totals keep the research value (a)(c).
FEC individual contributions	Drop donor name + address	Keep employer, occupation, state, amount, recipient. Not an anonymity claim — the named record is public at the FEC and remains re-identifiable; dropping the fields declines to make DataDawn a name-searchable donor index. (a)(c).
990 officers, directors, key employees	Publish	The 990 legally requires public listing of compensated individuals (C).
990 Schedule B contributors	Use what the IRS publishes	The IRS already masks individual donor identities; we mirror the public form.
Rulemaking comment headers (submitter name)	Publish	Commenting is voluntary public participation (C).
Rulemaking comment full text	Mirror the source; no pre-screen	Regulations.gov posts comments publicly; we surface what is already public there.
Lobbying registrants and lobbyists	Publish	The LDA requires public registration (A)(B)(C).
FARA registrants and foreign principals	Publish	FARA requires public registration (A)(B)(C).
Plum Book political appointees	Publish	Appointees named in their public capacity in a mandated public roster (A)(B)(C).
White House personnel reports	Publish	WH staff named in the legally mandated annual personnel report (A)(C).

One honest limit: the visitor-log redaction is rule-based, and the rule is conservative, so it over-redacts. Some genuinely public actors (career senior executives, lower-level political staff) get redacted because they aren’t yet in a public-actor list we can match against. The redaction re-runs every build, so a name un-redacts automatically as our coverage of public roles grows. The keep-condition and the redaction pass are in the public build script (05_build_database.py); the matcher that flags a visitor as a known public actor is part of the matching code that isn’t yet published.
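The enrich-then-redact pass for visitor logs, and the way a name un-redacts on a later build, can be sketched in a few lines of Python. This is not the code in 05_build_database.py: the exact-match lookup on a normalized name stands in for the unpublished matcher, and the row keys and function names are hypothetical.

```python
def _norm(name: str) -> str:
    # Hypothetical normalization: case-fold and collapse whitespace.
    return " ".join(name.casefold().split())


def redact_visitors(rows: list[dict], public_actors: set[str]) -> list[dict]:
    """Keep a visitor's name only if it matches a known public actor."""
    known = {_norm(name) for name in public_actors}
    out = []
    for row in rows:
        keep = _norm(row["visitor"]) in known
        out.append({
            "visitor": row["visitor"] if keep else None,  # redact by default
            # Date, visitee, and location survive for aggregate queries.
            "date": row["date"],
            "visitee": row["visitee"],
            "location": row["location"],
        })
    return out


# Made-up rows and names, for illustration only.
rows = [
    {"visitor": "Pat Example", "date": "2023-01-05", "visitee": "Official A", "location": "WW"},
    {"visitor": "Sam Sample", "date": "2023-01-05", "visitee": "Official A", "location": "WW"},
]
first_build = redact_visitors(rows, {"Pat Example"})
later_build = redact_visitors(rows, {"Pat Example", "Sam Sample"})
```

Because the pass starts from the raw rows each time and redacts by default, growing the public-actor list is the only change needed for `Sam Sample` to appear in the later build.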

AI accessibility

The same reproducibility we offer a human reader, we offer a machine one. An AI agent auditing a source shouldn’t have to scrape rendered pages or rely on whatever it happened to absorb in training; it should be able to query the live data directly and get back the same rows a person would. So we publish the platform through several read surfaces:

A live MCP server (datadawn-mcp), so an agent can query the platform as a tool.
An llms.txt guide served at the database hosts (data.datadawn.org, regs.datadawn.org), pointing a model to the structured surfaces below.
A JSON API: every table and every SQL query on the public databases returns JSON.
A machine-readable OpenAPI 3.1 description of the JSON interface.
Question catalogs — worked example queries an agent can adapt rather than guess at the schema.

Concretely, on each database host: the guide at /llms.txt, the OpenAPI description at /api/openapi.json, and the question catalog at /api/questions.json.
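For an agent or script, those three paths are enough to bootstrap. The Python sketch below builds the URLs for each host named above and includes a small fetch helper. The `https` scheme, the helper names, and the timeout are assumptions; nothing here presumes what the responses contain beyond the JSON surfaces being JSON.

```python
import json
from urllib.parse import urlunsplit
from urllib.request import urlopen

HOSTS = ("data.datadawn.org", "regs.datadawn.org")
SURFACES = {
    "guide": "/llms.txt",
    "openapi": "/api/openapi.json",
    "questions": "/api/questions.json",
}


def surface_urls(host: str) -> dict:
    # Assumes the hosts are served over HTTPS.
    return {name: urlunsplit(("https", host, path, "", "")) for name, path in SURFACES.items()}


def fetch_json(url: str, timeout: float = 10.0):
    # Performs a live request; call it only where network access is available.
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


all_urls = {host: surface_urls(host) for host in HOSTS}
```

A reasonable first step for an agent is to read the guide, then load the OpenAPI description with `fetch_json(all_urls["data.datadawn.org"]["openapi"])` and adapt a query from the question catalog rather than guessing at the schema.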

Two honest points about what this is and isn’t. First, every one of these surfaces reads the live databases — the same 990 and OpenRegs tables a human can open — so an answer reflects the current records, not a stale snapshot, and an agent’s claim traces back to a row you can re-query yourself. Second, accessibility is not endorsement of an answer: these surfaces hand a model the data and its confidence tiers, but the tier-3 caveats in Section 3c bind a machine reader exactly as they bind a human one. A matched link is still a matched link when an agent retrieves it.

Independence, reproducibility, and corrections

Independence. DataDawn has no institutional affiliations and takes no funding from any organization that appears in its datasets. That matters here for one concrete reason: it removes the incentive to shade. A platform whose revenue depended on an entity in the data would have a reason to soften how that entity is shown; we have none, and the design — no advertising, no client relationships, no positions — is what makes “we don’t editorialize” a structural fact rather than a promise.

Reproducibility. Two things hold for everything on the platform, and neither rests on our code being public. First, every substantive input is a primary government source you can re-pull yourself, from the same public APIs and bulk files we used. Second, every record can be checked against the specific filing behind it: a record we publish as filed is that filing; a link we infer can be audited against the two filings it connects — the connection is our inference, not a fact either filing states. That is the floor, and it doesn’t depend on us.

Our code is open where we’ve published it, released into the public domain under a CC0 dedication. What’s published lives in openregulations-public (the 990 build in 990database-public; the agent interface in datadawn-mcp) — the repositories are the authoritative list of what’s open. Not in them: the entity-resolution and classification layers, whose method and measured error are set out in Section 3c; and the ingestion for a number of later sources — congressional disbursements, White House visitor logs, OGE filings, FACA, and government-unit data among them — which you audit against their filings, not our code. Open code is a growing part of the picture, not the thing the platform’s checkability depends on.

Freshness. As noted in Section 2, the platform carries no single “last updated” date, because each dataset refreshes on its own schedule. The authoritative recency signal is the last successful pull date recorded on each dataset’s Datasette explore page — not anything printed here, which would start rotting the moment a pipeline ran.

Limitations worth stating plainly. Three hold across everything above. The data is as reported: we publish what was filed, source errors and all, and a correction belongs to the filer who made the record. The platform is a set of point-in-time snapshots, not a real-time feed; what you see reflects each dataset as of its last pull. And the connections we surface are associations in the public record — a shared donor, a co-sponsored bill, a meeting on a calendar — which document that two things are linked, not that one caused or improperly influenced the other. Correlation in this data is not causation, and we don’t present it as such.

Corrections. If you find an error (a mis-parse, a wrong link, a privacy call we got wrong), tell us, and point to the underlying filing so we can check it against the source. The most direct route is to open an issue on the relevant public repository or to email [email protected]. Where the source itself is right and our copy is wrong, we fix it on the next build and the change is visible in the public commit history.

A note on the name. Claude Code’s mascot is a crab. During the early builds, it spun out pipelines and parsers at a pace that earned it a DJ’s billing — hence DJ Crabdaddy, Crabdaddy for short. No crabs were harmed in the making of this database.