Platform overview
DataDawn operates six production databases covering 25+ distinct federal data sources. All data is sourced exclusively from primary U.S. federal government APIs and public registries per DataDawn's Data Sourcing Policy (see the Standards & Governance section below). Records are published as filed, with no editorial filtering, scoring, or ranking applied. Total: approximately 180 million records across ~71 GB of SQLite, served via two Datasette instances behind Caddy on a dedicated Hetzner server.
990 Database (data.datadawn.org)
IRS nonprofit filings: 990, 990-EZ, 990-PF, and 990-T returns from tax years 2014–2025, plus foundation grants, DAF disbursements, and the IRS Business Master File. Approximately 5.2 million filings, 13.6 million foundation grants, and 1.27 million Schedule I DAF/charity grants. Total: over 115 million records across 14 public tables, 21.1 GB. The filings are additionally available as rendered Form 990 images at forms.datadawn.org.
OpenRegs instance (regs.datadawn.org): five databases
The regs.datadawn.org Datasette instance serves five production databases from one subdomain:
- OpenRegs (35.2 GB, 189 tables): Congress, floor proceedings, roll call votes, stock trades, legislation (Congresses 93–119), federal rulemaking (Federal Register, Regulations.gov, CFR), FEC campaign finance, federal spending, OIRA regulatory reviews and meetings, Inspector General reports, GAO reports, CRS reports, committee hearings, nominations, treaties, and earmarks.
- Lobbying (14.5 GB, 9 tables): Senate Lobbying Disclosure Act (LDA) bulk data, served as its own database with 10 dedicated canned queries (richer than the mirror tables in OpenRegs).
- FARA (42 MB, 6 tables): DOJ Foreign Agents Registration Act data: registrants, foreign principals, short forms, and document filings.
- APHIS (0.1 GB): USDA Animal Welfare Act licensed facilities and inspections.
- Open Comments Snapshot: daily snapshot of Regulations.gov dockets currently accepting public comment, updated every morning at 06:30 UTC.
A separate entity-resolution staging database holds 2.3 million resolved organizational entities, 13.4 million role records, and 405,000 relationships, built entirely from primary government data. This layer is not yet deployed to the public databases; it will be merged in once source-table FK backfill rates stabilize (currently 49–88% across source tables). See the Entity Resolution section below.
Both Datasette instances provide interactive browsing, arbitrary SQL queries, JSON API access, and full data downloads in SQLite and CSV formats. No account, API key, or registration is required. As of April 2026, the platform also exposes a Model Context Protocol server at mcp.datadawn.org for AI-agent access; see the AI Accessibility section.
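As a rough illustration of the JSON API access described above, the sketch below builds a Datasette SQL query URL without making any network request. The database name `openregs` is an assumption (the actual database path is not stated here); `congress_members` is a table named later in this document, and `_shape=array` is a standard Datasette option that returns a plain list of row objects.

```python
from urllib.parse import urlencode


def datasette_sql_url(base_url: str, database: str, sql: str) -> str:
    """Build a Datasette JSON API URL for a read-only SQL query."""
    # _shape=array asks Datasette for a plain JSON list of row objects.
    query = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{query}"


# "openregs" is an assumed database name, used only for illustration.
url = datasette_sql_url(
    "https://regs.datadawn.org",
    "openregs",
    "select count(*) as n from congress_members",
)
print(url)
```

The resulting URL can be fetched with any HTTP client; since no API key is required, no headers need to be set.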
Cross-referencing methodology
The distinctive feature of DataDawn's OpenRegs database is that datasets are linked
through shared identifiers, enabling queries that span multiple data sources.
The primary linkage key is bioguide_id, the Biographical Directory of
the United States Congress identifier assigned to every member of Congress.
FEC-to-bioguide crosswalk
The Federal Election Commission uses its own candidate ID system, separate from
the Congressional bioguide_id. DataDawn maintains a fec_candidate_crosswalk
table with 1,712 verified mappings between FEC candidate IDs
and bioguide IDs, enabling queries that link campaign contributions to legislators'
voting records, stock trades, and floor statements.
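A minimal sketch of the crosswalk join, using an in-memory SQLite database with made-up rows. The table names come from this document; the column names (`fec_candidate_id`, `amount`, `name`) and all sample values are assumptions for illustration and may not match the deployed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE congress_members (bioguide_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE fec_candidate_crosswalk (fec_candidate_id TEXT, bioguide_id TEXT);
CREATE TABLE fec_contributions (fec_candidate_id TEXT, amount REAL);
-- Hypothetical sample rows; the IDs are placeholders, not real identifiers.
INSERT INTO congress_members VALUES ('X000001', 'Example Member');
INSERT INTO fec_candidate_crosswalk VALUES ('H0XX00001', 'X000001');
INSERT INTO fec_contributions VALUES
  ('H0XX00001', 2500), ('H0XX00001', 1000), ('S0YY00002', 500);
""")

# Contributions reach a legislator only through the crosswalk table;
# the third contribution has no mapping and drops out of the inner join.
rows = conn.execute("""
    SELECT m.bioguide_id, m.name, SUM(c.amount) AS total
    FROM fec_contributions AS c
    JOIN fec_candidate_crosswalk AS x ON x.fec_candidate_id = c.fec_candidate_id
    JOIN congress_members AS m ON m.bioguide_id = x.bioguide_id
    GROUP BY m.bioguide_id, m.name
""").fetchall()
print(rows)
```

Once a contribution row carries a bioguide_id, the same key can be joined onward to votes, trades, or floor statements.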
Congressional Record speaker linkage
Floor speeches in the Congressional Record are linked to members via the
crec_speakers table. 99.6% of speaker entries
have a verified bioguide_id, enabling reliable attribution of floor statements
to individual legislators.
Stock trade linkage
Congressional stock trade disclosures are linked to members' bioguide IDs through name and chamber matching. 100% of trades (61,895 transactions) are linked to a bioguide_id. This was achieved by filtering to only Periodic Transaction Report (PTR) filings and matching member names against the congressional member roster.
CIK enrichment on stock trades
As of April 2026, individual stock trades also carry SEC Central Index Key (CIK) identifiers for the traded issuer, populated via a ticker-SIC crosswalk. Coverage is 67.7% of trades, with the remaining trades mostly in tickers not mapped cleanly to an SEC filer (foreign issuers, municipal bonds, private securities appearing on certain member disclosures).
Linkage transparency: All cross-reference rates are reported as-is. Key linkage rates: stock trades 100%, CREC speakers 99.6%, FEC crosswalk 1,712 verified mappings, stock trades → CIK 67.7%, Federal Register → Regulations.gov 395,621 crossrefs. Where linkage is incomplete, unlinked records remain in the database and are queryable; they simply lack a bioguide_id join key. We do not impute or guess linkages.
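One way a linkage rate of this kind could be computed is the share of rows whose join key is not NULL. The sketch below assumes unlinked rows store NULL in `bioguide_id`; the helper name `linkage_rate` and the sample rows are hypothetical.

```python
import sqlite3


def linkage_rate(conn, table, key_column="bioguide_id"):
    """Fraction of rows in `table` with a non-NULL join key."""
    # COUNT(column) skips NULLs, COUNT(*) does not.
    # Identifiers are interpolated directly, so pass trusted names only.
    total, linked = conn.execute(
        f"SELECT COUNT(*), COUNT({key_column}) FROM {table}"
    ).fetchone()
    return linked / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crec_speakers (speech_id INTEGER, bioguide_id TEXT)")
conn.executemany(
    "INSERT INTO crec_speakers VALUES (?, ?)",
    [(1, "X000001"), (2, "X000002"), (3, None), (4, "X000001")],
)
rate = linkage_rate(conn, "crec_speakers")
print(f"{rate:.1%}")
```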
Full-text search
DataDawn builds SQLite FTS5 (Full-Text Search) indexes on key text fields across both Datasette instances, enabling instant search across millions of records.
| Dataset | Fields indexed |
|---|---|
| 990 Returns | Organization name |
| Foundation Grants | Recipient name, purpose |
| DAF Grants | Recipient name, funder name |
| Federal Register | Title, abstract |
| Regulations.gov Dockets | Title |
| Regulations.gov Documents | Title |
| Regulations.gov Comments | Title, submitter name |
| Congressional Record | Full speech text |
| CFR Sections | Full regulatory text |
| Lobbying Filings | Filing descriptions |
| FARA Registrants | Registrant name, business |
| FARA Foreign Principals | Registrant, principal, country |
| Legislation | Bill title, summary |
| Spending Awards | Recipient name, description |
| FEC Employers | Employer name |
| GAO Reports | Title, abstract, subjects |
| Hearings | Title, witness names |
| Earmarks | Recipient, project description |
| CRS Reports | Title, summary |
| Nominations | Description |
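For readers unfamiliar with FTS5, here is a minimal sketch of indexing and searching organization names. The virtual table name `returns_fts`, the column `org_name`, and the sample rows are assumptions; the example also assumes the local SQLite build includes the FTS5 extension, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An FTS5 virtual table with a single indexed text column.
conn.execute("CREATE VIRTUAL TABLE returns_fts USING fts5(org_name)")
conn.executemany(
    "INSERT INTO returns_fts (org_name) VALUES (?)",
    [
        ("Example Community Foundation",),
        ("Sample Housing Trust",),
        ("Example Food Bank",),
    ],
)

# MATCH runs a token search; the default tokenizer is case-insensitive.
hits = [
    row[0]
    for row in conn.execute(
        "SELECT org_name FROM returns_fts WHERE returns_fts MATCH ? ORDER BY org_name",
        ("example",),
    )
]
print(hits)
```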
IRS 990 nonprofit filings
Data source
All data is extracted from IRS electronic filings (e-files) published at
apps.irs.gov/pub/epostcard/990/xml/. This is the same public dataset
used by ProPublica, GuideStar, and academic researchers. The IRS releases new
batches periodically throughout the year, typically monthly.
DataDawn currently covers filings from tax years 2014 through 2025, with earlier years having sparser coverage due to lower e-filing adoption rates before the IRS mandate took effect. Coverage is most complete for tax years 2016 onward.
Form types parsed
| Form | Filed by | What we extract |
|---|---|---|
| 990 | Public charities (revenue > $200K or assets > $500K) | Revenue, expenses, assets, officers, contractors, program activities |
| 990-EZ | Small public charities (revenue < $200K and assets < $500K) | Revenue, expenses, assets, officers |
| 990-PF | Private foundations | Revenue, assets, grants paid, officers, investments, contributors |
| 990-T | Orgs with unrelated business income | Basic filing data |
Database tables
Raw filings are parsed into 12 structured tables. No editorial filtering is applied: if the IRS published it, we parsed it.
| Table | Records | Description |
|---|---|---|
| returns | ~5.2M | Core filing data: org name, EIN, state, revenue, expenses, assets, return type, tax year |
| grants | ~13.6M | Foundation grants from 990-PF filings: recipient, amount, purpose, date, location |
| schedule_i_grants | ~1.27M | Schedule I disbursements (DAF sponsors and public charities) |
| bmf | ~1.9M | IRS Business Master File: NTEE codes, ruling dates, asset codes |
| officers | ~44.8M | Officers, directors, trustees, and key employees with compensation |
| capital_gains | ~24.1M | Capital gains and losses from 990-PF |
| related_orgs | ~9.0M | Schedule R related organizations |
| investments | ~5.8M | Foundation investments (from 990-PF Part II) |
| contributors | ~662K | Contributors to private foundations (from 990-PF Schedule B) |
| program_activities | ~576K | Program service accomplishments and expenses |
| program_investments | ~314K | Program-related investments (PRIs) |
| contractors | ~77K | Independent contractors receiving > $100K |
| top_employees | ~74K | Highest-compensated employees |
Extraction pipeline
IRS e-files are XML documents following IRS-defined schemas that have evolved across filing years. DataDawn's extraction scripts handle schema variations across years, mapping different XML element paths to consistent database columns.
- Download: New XML batches are synced from the IRS S3 bucket. Batch completion is tracked with marker files to prevent reprocessing.
- Parse: Three extraction scripts process 990/990-EZ returns, 990-PF detail filings (grants, investments, contributors), and Schedule I grants respectively.
- Deduplicate: Filings are keyed on a combination of EIN and object ID to prevent duplicate insertion from overlapping IRS releases.
- Index: Full-text search indexes (SQLite FTS5) are built on organization names and grant recipient names for instant search.
- Publish: The public database is built from an allowlist of raw data tables. No analysis or curated tables are included in the public release.
Known limitations: 990 data
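The deduplication step in the pipeline above can be sketched with a composite primary key and `INSERT OR IGNORE`. This is one possible approach, not necessarily how DataDawn's scripts do it; the column names and the `load_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE returns (
        ein TEXT NOT NULL,
        object_id TEXT NOT NULL,
        org_name TEXT,
        PRIMARY KEY (ein, object_id)  -- the deduplication key
    )
""")


def load_batch(conn, filings):
    """Insert filings, skipping any (ein, object_id) already present."""
    before = conn.execute("SELECT COUNT(*) FROM returns").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO returns (ein, object_id, org_name) VALUES (?, ?, ?)",
        filings,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM returns").fetchone()[0]
    return after - before  # number of genuinely new rows


# Two overlapping releases: the second repeats one filing from the first.
first = load_batch(conn, [("000000001", "OBJ1", "Org A"), ("000000002", "OBJ2", "Org B")])
second = load_batch(conn, [("000000002", "OBJ2", "Org B"), ("000000003", "OBJ3", "Org C")])
print(first, second)
```

Because the key is enforced by the database, re-running a batch is harmless, which complements the marker-file tracking in the download step.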
E-file only
DataDawn only includes electronically filed returns. Paper filings, roughly one-third of all 990s, are not included. E-filing rates have increased over time, so recent years have better coverage than earlier years.
Filing lag
Organizations file 990s after their fiscal year ends, and the IRS publishes e-files on a rolling basis. The most recent tax year will always have incomplete data.
Sparse early years
Tax years 2014–2015 have limited coverage because the IRS e-filing mandate was not yet in full effect. Coverage is most reliable from 2016 onward.
Grant dates
Foundation grant dates come from the filer's reported grant date field. Some foundations report the approval date, others the payment date, and some leave it blank. Year-level analysis is more reliable than month-level.
Name matching
Organization names are as reported on the filing. The same organization may appear under slightly different names across years. The public 990 database does not apply entity resolution, so search results should be verified by checking the EIN.
Amount discrepancies
Financial figures reflect what was reported on the filing. Amended returns may not overwrite original filings. In rare cases, both an original and amended filing for the same tax year may appear.
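Because an original and an amended return can both be present, it may help to check for EINs with more than one filing in a tax year before summing financial figures. The query below is a sketch; the column names (`ein`, `tax_year`, `object_id`, `total_revenue`) and the rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE returns (ein TEXT, tax_year INTEGER, object_id TEXT, total_revenue REAL);
-- Hypothetical rows: the first organization has two filings for 2021.
INSERT INTO returns VALUES
  ('000000001', 2021, 'A1', 100000),
  ('000000001', 2021, 'A2', 120000),
  ('000000002', 2021, 'B1', 50000);
""")

# Any (ein, tax_year) pair with more than one row needs a manual look
# before its revenue is added into a total.
dupes = conn.execute("""
    SELECT ein, tax_year, COUNT(*) AS n_filings
    FROM returns
    GROUP BY ein, tax_year
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)
```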
DAF grant identification
Donor-advised fund (DAF) disbursements are extracted from Schedule I of 990 filings submitted by DAF sponsor organizations. DataDawn identifies and parses grants from major DAF sponsors including Vanguard Charitable, Fidelity Charitable, Schwab Charitable, National Philanthropic Trust, Silicon Valley Community Foundation, and others.
These are grants made by DAF sponsors to recipient nonprofits. They do not identify the individual donors who recommended the grants; that information is not available in any public filing.
Why this matters: DAF grants represent a large and growing share of philanthropic funding, but because they flow through intermediary sponsors, they are difficult to trace using traditional 990-PF data alone. Combining 990-PF grants with Schedule I DAF data provides a more complete picture of institutional funding flows.
Congress members
The congress_members table is the universal identity table for the OpenRegs
database. It contains 12,765 members of Congress,
both historical and current, sourced from the Congress.gov API and the Biographical
Directory of the United States Congress.
Each member is identified by their bioguide_id, which serves as the
primary join key across all member-related datasets. The table includes name,
party, state, chamber, number of terms served, and service dates.
A precomputed member_stats table provides aggregate counts per member
(total trades, speeches, bills sponsored, votes cast) for quick summary views.
Committee assignments (3,908 current assignments
across 233 committees and subcommittees) are maintained in separate
committees and committee_memberships tables with
leadership title indicators.
Congressional Record
Data source
Floor proceedings from the Congressional Record, sourced from the Government Publishing Office (GPO) via govinfo.gov bulk data. Coverage spans 1994 to present, encompassing speeches, debates, remarks, and other floor proceedings from both chambers.
Tables
| Table | Records | Description |
|---|---|---|
| congressional_record | 878,583 | Floor proceedings with full text, date, chamber, section |
| crec_speakers | 944,216 | Speaker-to-speech linkage, 99.6% with bioguide_id |
| crec_bills | 1,561,719 | Bill references extracted from floor proceedings |
Known limitations
The Congressional Record is not a verbatim transcript. Members may revise and extend their remarks after delivery. The "Extensions of Remarks" section includes statements that were not delivered orally on the floor. Speaker attribution relies on GPO markup, which occasionally misattributes statements in colloquy or debate.
Roll call votes
Data source
Roll call voting data from both chambers, sourced from the Congress.gov API and official House/Senate clerk records.
Tables
| Table | Records | Description |
|---|---|---|
| roll_call_votes | 26,439 | Vote metadata: question, result, date, congress, chamber |
| member_votes | 8,336,815 | Individual vote records: Yea, Nay, Present, Not Voting per member per vote |
Known limitations
Roll call votes capture only recorded votes, not voice votes or unanimous consent agreements. Many legislative actions proceed without a recorded vote. "Not Voting" may indicate absence, abstention, or recusal; the data does not distinguish between these.
Congressional stock trades
Data source
Financial disclosure data from both chambers: House Periodic Transaction Reports (PTRs, parsed from PDF filings) and Senate electronic Financial Disclosures (eFD, scraped from the Senate disclosure website).
Coverage
61,895 transactions are currently in the database, with 100% linked to a bioguide_id and 67.7% carrying SEC CIK for the traded issuer (via ticker-SIC crosswalk, added April 2026). Trades include ticker symbol, transaction date, transaction type (purchase/sale/exchange), amount range, owner (self/spouse/child), and source (House PTR or Senate eFD).
House trades are parsed from PDF Periodic Transaction Reports. Senate trades are extracted from the Senate electronic Financial Disclosure system. Only PTR filings are included β annual disclosures, candidate reports, and amendments are excluded as they do not represent individual transactions.
Known limitations
Disclosure amounts are reported in ranges (e.g., $1,001–$15,000), not exact figures.
Filing deadlines allow up to 45 days after a transaction, and extensions are common.
Trades by spouses and dependent children are included in disclosures and identified
via the owner field where available.
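Since amounts arrive as ranges, analysis usually works with lower and upper bounds rather than a single value. The sketch below parses a range string into bounds. The exact string format stored in the database is an assumption, and `parse_amount_range` is a hypothetical helper.

```python
import re


def parse_amount_range(text):
    """Return (low, high) in whole dollars; high is None if open-ended."""
    # Grab each run of digits and commas, then strip the commas.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        raise ValueError(f"no dollar amount found in {text!r}")
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else None
    return low, high


low, high = parse_amount_range("$1,001 - $15,000")
print(low, high)

# An open-ended range (assumed wording) yields no upper bound.
print(parse_amount_range("Over $50,000,000"))
```

Summing the lower bounds and the upper bounds separately gives a defensible minimum and maximum for a member's trading volume; a midpoint estimate is possible but overstates the precision of the source data.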
Legislation
Data source
Bill data from the Congress.gov API covering Congresses 93–119 (1973–present).
Tables
| Table | Records | Description |
|---|---|---|
| legislation | 375,620 | Bills with title, sponsor, policy area, latest action, status |
| legislation_cosponsors | 4,067,601 | Cosponsor records with bioguide_id linkage |
| legislation_actions | 2,310,777 | Action steps: introduced, referred, passed, signed |
| legislation_subjects | 3,036,186 | Subject tags assigned by the Congressional Research Service |
| cbo_cost_estimates | 17,201 | CBO bill scoring and cost analysis |
Bills are identified by a composite bill_id in the format
{congress}-{type}-{number} (e.g., 118-hr-1234),
which links to floor speech references in crec_bills and lobbying
activity records in lobbying_activities.
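The composite key format lends itself to a small parse/format helper. The sketch below follows the `{congress}-{type}-{number}` pattern given above; the `BillId` type, the function names, and the choice to lowercase the bill type are assumptions.

```python
from typing import NamedTuple


class BillId(NamedTuple):
    congress: int
    bill_type: str
    number: int


def parse_bill_id(bill_id: str) -> BillId:
    """Split a bill_id such as '118-hr-1234' into its three parts."""
    congress, bill_type, number = bill_id.split("-")
    return BillId(int(congress), bill_type, int(number))


def format_bill_id(congress: int, bill_type: str, number: int) -> str:
    """Build a bill_id; lowercasing the type is an assumed convention."""
    return f"{congress}-{bill_type.lower()}-{number}"


parsed = parse_bill_id("118-hr-1234")
print(parsed)
print(format_bill_id(118, "HR", 1234))
```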
Campaign finance (FEC)
Data source
Federal Election Commission bulk data files covering candidates, committees, and contributions.
Tables (deployed)
| Table | Records | Description |
|---|---|---|
| fec_candidates | 64,679 | FEC-registered candidates |
| fec_committees | 154,967 | PACs, party committees, campaign committees |
| fec_contributions | 4,395,926 | PAC/committee-to-candidate contributions |
| fec_candidate_crosswalk | 1,712 | Verified FEC candidate ID to bioguide_id mappings |
| fec_operating_expenditures | 15,358,447 | PAC/party/candidate disbursements (where the money goes) |
| fec_independent_expenditures | 666,910 | Independent expenditures supporting or opposing candidates |
| fec_pac_summary | 98,614 | Per-committee aggregate financials by cycle |
| fec_electioneering | 1,679 | Electioneering communications |
| fec_communication_costs | 25,641 | Internal communications costs reported by corps/unions |
Employer-aggregated donations (the fec_employers rollup)
In addition to the PAC/committee-level tables above, DataDawn publishes a family of employer-aggregated rollups built from the individual-donor contribution file. These rollups preserve the research utility of the raw data (which employers give to which candidates, which occupations concentrate in which party, etc.) without publishing any individual donor's name, address, or transaction.
| Table | Records | Description |
|---|---|---|
| fec_employer_totals | 352,103 | Per-employer totals: donation count, total amount, unique states |
| fec_employer_to_candidate | 476,534 | Employer-to-candidate aggregates with party and office |
| fec_employer_to_party | 286,908 | Employer-to-party totals by cycle |
| fec_top_occupations | 666,254 | Top occupations per employer (aggregate, not per-person) |
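A rollup of this kind can be sketched as a `GROUP BY` over the individual-donor file that keeps only aggregate columns. The source table `individual_contributions`, its columns, and the sample rows are hypothetical; only the output table name and its described fields come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE individual_contributions (
    donor_name TEXT, employer TEXT, state TEXT, amount REAL
);
-- Hypothetical rows standing in for the raw individual-donor file.
INSERT INTO individual_contributions VALUES
  ('Donor A', 'EXAMPLE CORP', 'CA', 500),
  ('Donor B', 'EXAMPLE CORP', 'NY', 250),
  ('Donor C', 'SAMPLE LLC',  'TX', 100);

-- The rollup keeps counts and sums per employer and drops donor identity.
CREATE TABLE fec_employer_totals AS
SELECT employer,
       COUNT(*)              AS donation_count,
       SUM(amount)           AS total_amount,
       COUNT(DISTINCT state) AS unique_states
FROM individual_contributions
GROUP BY employer;
""")

cursor = conn.execute("SELECT * FROM fec_employer_totals ORDER BY employer")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Only the rollup table would be copied into a published database; the source table with donor names stays in the local analysis database.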
Why individual contributions are not deployed (PII)
The Federal Election Commission publishes individual contribution records that include donor
names, addresses, employers, and occupations. The full file runs to approximately 104
million records (49 GB). DataDawn ingests this entire corpus into a local analysis
database (fec.db: 44 million committee transactions plus 4.4 million contributions
to candidates plus the candidate and committee metadata) but does not publish it
on the deployed databases.
This is a deliberate editorial decision, not a cost-saving one. The rollup tables above answer essentially every research question about influence, employer giving patterns, PAC activity, and occupation-concentration without exposing individual donor records to a one-click public query. The raw file remains accessible at the FEC's own site for anyone who needs it; DataDawn is choosing not to replicate a second public surface for individual-named records.
As of April 19, 2026, this approach is codified as DataDawn's formal PII Standard and extended beyond FEC to govern White House visitor log handling, OGE Public Available Filer disclosures, FACA advisory committee membership, and federal disbursement records. See the "DataDawn Standards & Governance" section below for the underlying philosophy.
Other known limitations
FEC data has its own filing lag: contributions may not appear for weeks or months after they
are made. FEC amount figures are reported exactly (unlike the range-based financial disclosures
from Congress). Committees that register but never file activity still appear in
fec_committees; filter by activity or cycle when doing meaningful counts.
Federal Register
Data source
The Federal Register API (federalregister.gov/api), which provides
structured data for every document published in the Federal Register.
Tables
| Table | Records | Description |
|---|---|---|
| federal_register | 994,487 | Rules, proposed rules, notices, presidential documents with title, abstract, dates, PDF/HTML URLs. As of April 2026, includes a regulation_id_numbers column populated on 103,532 documents (joinable to OIRA reviews by RIN). |
| federal_register_agencies | 1,532,539 | Agency tags (many documents have multiple agencies) |
| presidential_documents | 5,925 | Executive orders, proclamations, memoranda |
| fr_regs_crossref | 395,621 | Links Federal Register document numbers to Regulations.gov dockets (3.4× the March 2026 count after the comprehensive crossref rebuild) |
Regulations.gov
Data source
The Regulations.gov API (api.regulations.gov), the federal government's
public comment and rulemaking system.
Tables
| Table | Records | Description |
|---|---|---|
| dockets | 254,910 | Regulatory dockets from 126 federal agencies |
| documents | 1,703,711 | Regulatory documents: rules, proposed rules, notices, supporting materials |
| comments | 9,764,809 | Public comment headers: submitter, date, agency, docket |
| comment_details | 428,838 | Full-text comment bodies: approximately 4.4% of all comment headers, up from 1.6% in March 2026 |
Known limitations
The Regulations.gov API has strict rate limits (approximately 1,000 requests per hour per key).
Full-text comment bodies (comment_details) are being downloaded incrementally using
a dual-key forward/reverse approach, prioritizing organizational submissions. Comment header data
(submitter name, date, docket) is complete for all 9.76M comments across 126 federal agencies.
The majority of Regulations.gov comments are identical form letters submitted through advocacy
campaigns; the substantive organizational comments are a much smaller fraction of the total.
Some agencies do not publish all comments through Regulations.gov.
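To give a sense of scale for the incremental download, the arithmetic below derives the coverage share and a rough lower bound on the time needed to fetch the remaining bodies. It assumes one API request per comment body and two keys running continuously at the full stated limit; both are simplifying assumptions, and real throughput is likely lower.

```python
TOTAL_COMMENTS = 9_764_809          # comment headers (from the table above)
BODIES_DOWNLOADED = 428_838         # rows in comment_details
REQUESTS_PER_HOUR_PER_KEY = 1_000   # approximate stated rate limit
API_KEYS = 2                        # "dual-key" approach

coverage = BODIES_DOWNLOADED / TOTAL_COMMENTS
remaining = TOTAL_COMMENTS - BODIES_DOWNLOADED

# Assumption: one request per comment body, both keys always at the limit.
hours = remaining / (REQUESTS_PER_HOUR_PER_KEY * API_KEYS)
days = hours / 24

print(f"coverage: {coverage:.1%}")
print(f"remaining bodies: {remaining:,}")
print(f"best-case days to finish: {days:.1f}")
```

Even under these optimistic assumptions the backlog takes months to clear, which is consistent with prioritizing organizational submissions over form letters.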
Code of Federal Regulations
Data source
Bulk XML downloads from the Electronic Code of Federal Regulations (eCFR)
at ecfr.gov.
Coverage
123,480 regulatory sections from 19 CFR titles covering major regulatory domains: Agriculture (7), Animals (9), Energy (10), Aeronautics (14), Commerce (15), Commodities (17), Employee Benefits (20), Food and Drugs (21), Housing (24), Judicial Administration (28), Labor (29), Navigation (33), Education (34), Pensions (38), Environment (40), Emergency Management (44), Shipping (46), Transportation (49), and Wildlife (50). Full regulatory text is indexed for full-text search.
Known limitations
19 of 50 CFR titles are currently included. The CFR is updated continuously as agencies publish final rules; the DataDawn snapshot reflects the eCFR as of the most recent bulk download. Regulations that have been proposed but not finalized are not included in the CFR data (they appear in the Federal Register).
Lobbying disclosures
Data source
Senate Lobbying Disclosure Act (LDA) filings, downloaded from the Senate Office of Public Records bulk data system. Data covers 1999 through present.
Served as a standalone database
Lobbying is served at regs.datadawn.org as its own Datasette database
(lobbying.db, 14.5 GB, 9 tables, 10 dedicated canned queries). This gives
the lobbying corpus its own query surface with richer schema than the subset mirrored
into OpenRegs for cross-database joins.
Tables (lobbying.db)
| Table | Records | Description |
|---|---|---|
| lobbying_filings_raw | 1,915,098 | Disclosure filings: client, registrant, income/expenses, year, period. As of April 2026 also carries client_state (91% coverage), registrant_house_id, client_government_entity, and affiliated_organizations columns recovered by the re-parse migration. |
| lobbying_registrations | 138,328 | Registration records (not mirrored into OpenRegs) |
| lobbying_activities | 3,811,121 | Activity records: issue codes, descriptions, specific bills lobbied |
| lobbying_lobbyists | 4,730,966 | Lobbyist entries with covered_position (revolving door indicator) |
| lobbying_contributions | 3,670,570 | LDA-reported contributions (not mirrored into OpenRegs) |
| lobbying_affiliated_orgs | 29,447 | Subsidiary graph extracted from LDA affiliated_organizations field (April 2026) |
| lobbying_issue_codes | 79 | Standard issue category codes (reference) |
| lobbying_gov_entities | 257 | Covered government-entity list (reference) |
| lobbying_filing_types | 300 | Filing type codes (reference) |
April 2026 re-parse migration
An April 2026 migration reprocessed the 1.9 million raw LDA JSON filings already on disk
to recover fields that had been dropped by the original pipeline: client_state
(now 91% coverage), registrant_house_id, client_government_entity,
and affiliated_organizations. This was an in-place enrichment, not a fresh
download. The lobbying_affiliated_orgs subsidiary graph (29,447 rows) was
extracted as a companion table during the same migration.
Revolving door
The covered_position field in lobbyist records identifies individuals
who previously held government positions (the "revolving door" between government
service and lobbying). This field is self-reported by the registrant.
DataDawn builds a materialized revolving_door table that cross-references
lobbyists whose covered_position indicates former congressional service (matching 12
position patterns such as "U.S. Senator," "Member of Congress," "Former Representative")
against the congress_members table using name matching. This enables queries
showing which former members lobby on which issues, for which clients.
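The matching idea can be sketched as a two-step filter: test the covered_position text against position patterns, then look the lobbyist's normalized name up in a member index. Only three of the 12 patterns are quoted above, so the pattern list here is incomplete; the regular expressions, the normalization rule, and all function names are assumptions, not DataDawn's actual implementation.

```python
import re

# Three patterns quoted in the text; the other nine are not listed here.
POSITION_PATTERNS = [
    re.compile(r"\bU\.?S\.? Senator\b", re.IGNORECASE),
    re.compile(r"\bMember of Congress\b", re.IGNORECASE),
    re.compile(r"\bFormer Representative\b", re.IGNORECASE),
]


def indicates_congressional_service(covered_position):
    """True if the self-reported position text matches any pattern."""
    if not covered_position:
        return False
    return any(p.search(covered_position) for p in POSITION_PATTERNS)


def normalize_name(name):
    """Lowercase, drop punctuation, collapse whitespace (assumed rule)."""
    return " ".join(re.sub(r"[^a-z ]", " ", name.lower()).split())


def match_member(lobbyist_name, covered_position, members_by_name):
    """Return a bioguide_id for a former member, or None if no match."""
    if not indicates_congressional_service(covered_position):
        return None
    return members_by_name.get(normalize_name(lobbyist_name))


# Hypothetical member index keyed by normalized name.
members = {normalize_name("Jane Q. Example"): "X000001"}
print(match_member("JANE Q EXAMPLE", "U.S. Senator (2001-2013)", members))
print(match_member("Jane Q. Example", "Legislative Assistant", members))
```

Exact name matching of this kind will miss nicknames and can collide on common names, so matches are best treated as leads to verify rather than confirmed identities.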
Known limitations
LDA filings are self-reported by registrants and are not independently audited.
Income and expense figures are reported in ranges on some filing types. The
lobbying_activities table links to specific bill numbers when reported,
but lobbyists are not required to list every bill they lobby on.
Foreign agents (FARA)
Data source
Foreign Agents Registration Act data from the Department of Justice FARA
database at fara.gov. Served at regs.datadawn.org as its own
Datasette database (fara.db, 42 MB, 6 tables) and also mirrored as tables
in OpenRegs for cross-dataset joins.
Tables
| Table | Records | Description |
|---|---|---|
| fara_registrants | 7,043 | Registered foreign agents (firms and individuals) |
| fara_foreign_principals | 17,652 | Foreign government and entity clients |
| fara_short_forms | 44,416 | Individual agents working under registrations |
| fara_registrant_docs | 151,987 | Filed documents with PDF links |
Known limitations
FARA registration is self-reported and enforcement has historically been limited. The DOJ has acknowledged that compliance rates are uncertain. Some entities that may be required to register under FARA instead register under the LDA, which has less stringent disclosure requirements. Cross-referencing FARA registrants with lobbying filings (by firm name) can reveal some of these overlaps but is not definitive.
Federal spending
Data source
USAspending.gov bulk award data covering grants, contracts, and other federal awards across 20 agencies.
Coverage
863,632 awards including recipient name, award amount, funding agency, award type, and date ranges. Linkable to agencies referenced in Federal Register documents and lobbying filings.
Known limitations
USAspending.gov data has known reporting quality issues acknowledged by the government itself. Not all agencies report at the same level of detail or timeliness. Sub-award data is not currently included. The 20-agency scope covers the most active federal funders but is not comprehensive across all federal agencies.
OIRA regulatory reviews & meetings
Data source
OIRA (Office of Information and Regulatory Affairs) data is sourced from
reginfo.gov. EO 12866 regulatory review records are downloaded
as XML bulk files covering 1981 to present. Meeting data (2014-present) is
collected from the reginfo.gov search interface, with individual meeting
detail pages scraped for full attendee lists.
Coverage
48,434 regulatory reviews (1981–present), 8,663 meetings with outside parties (2014–present), and 90,711 individual meeting attendees with organization affiliations and participation type.
Cross-reference potential
OIRA meetings connect to lobbying data through requestor organization names, to Federal Register documents through Regulation Identifier Numbers (RIN), and to the rulemaking timeline through meeting dates relative to rule publication dates.
Inspector General reports
Data source
Inspector General reports are scraped from oversight.gov,
the federal IG community's centralized reporting portal. Both listing
metadata and individual report details (including recommendations) are
collected.
Coverage
34,880 IG reports with 11,999 individual recommendations. Reports include questioned costs, funds for better use, agency reviewed, report type, and links to original PDF documents.
Committee hearings
Data source
Congressional committee hearing transcripts and metadata from GovInfo
(govinfo.gov/bulkdata/CHRG), covering hearings published since 1995.
Tables
| Table | Records | Description |
|---|---|---|
| hearings | 46,177 | Hearing metadata: title, committee, date, congress, chamber, package_id |
| hearing_members | 1,244,920 | Member attendance at hearings, linked to congress_members by bioguide_id |
89.6% of hearings link to at least one member record. Witnesses are captured in the hearing transcript text but are not yet broken out into a structured table; that is planned for a future build.
Congressional Research Service (CRS) reports
Data source
CRS reports via the Congress.gov API v3. CRS is Congress's in-house research arm. Whether its reports should be released publicly was debated for years; the question was settled in 2018, when statute directed that the reports be made public, and they are now published to Congress.gov.
Coverage
13,727 CRS reports with title, publication date,
URL, and full text where available. A companion crs_report_bills table
(135,890 rows) links reports to bill_ids referenced in the report body, enabling
queries like "all CRS analysis related to a specific bill or topic area."
Executive nominations & treaties
Data source
Nominations and treaties data from the Congress.gov API v3, covering the Senate's advice-and-consent calendar.
Tables
| Table | Records | Description |
|---|---|---|
| nominations | 40,167 | Executive nominations: nominee, position, agency, date received, Senate action |
| treaties | 777 | Treaties submitted to the Senate for ratification: title, topic, date received |
The full historical Senate treaty record is small (fewer than 800 treaties since the founding); the nominations stream is substantially larger and covers both confirmed and withdrawn/returned nominations.
Earmarks & directed appropriations
Data source
Congressional earmark data (also called Congressionally Directed Spending / Community Project Funding) from the House Appropriations Committee and Senate Appropriations Committee public disclosures.
Coverage
70,826 earmark records with requesting member (bioguide_id), recipient organization, dollar amount requested, fiscal year, and project description. Linkable to members' voting, sponsored-bills, and campaign-donor records through bioguide_id.
Earmark disclosures have varying detail depending on chamber and fiscal year; some historical windows had earmarks suspended entirely and contain zero records.
Government Accountability Office (GAO) reports
Data source
GAO reports sourced from two feeds: GovInfo bulk data for the historical
GAO archive up through mid-2008, and direct scraping from gao.gov for
the post-2008 period when GAO moved off GovInfo. The two feeds are reconciled into a
single gao_reports table.
Coverage
73,725 GAO reports with 99.95% detail-file coverage (57K direct stubs from gao.gov plus GovInfo historical). Reports include title, subjects, agencies reviewed, abstract, and PDF URL. Full-text search indexes title, abstract, and subject tags.
Entity resolution layer (in progress, staged)
What it is
A staging database that builds a unified organization graph from DataDawn's primary
government data. When complete, it will let users ask a single question ("show me every
federal touchpoint of Company X") and get an answer across lobbying, grants, federal
spending, regulatory participation, congressional hearings, and personnel filings,
joined by a single entity_id instead of by fragile name-matching.
Current state (April 19, 2026)
| Table | Records | Description |
|---|---|---|
| entities | 2,335,378 | Resolved organizations: nonprofits, private companies, trade associations, unions, political committees, PACs, public companies, and manually-pinned mutual insurance companies |
| public_actors | 13,431,058 | Role records: nonprofit officers (13.26M), lobbyists-under-LDA (86,587), FARA agents (43,243), FEC candidates (30,615), congress members (12,765), OGE Senate-confirmed filers (147) |
| entity_relationships | 405,304 | BMF group-exemption memberships (401,894), predecessor-successor links (2,427 including Wyeth→Pfizer, Conseco→CNO, Sunoco→Energy Transfer), LDA affiliated-org links (983) |
Industry classification (companion)
32,706 federal-policy-relevant entities have been tagged into a 40-industry taxonomy by a 5-pass deterministic classifier (mutual pins → SIC 4-digit → NTEE selective → foreign-filer regex → PAC connected-organization inheritance) plus a 215-entry curated trade-association map. Top classified industries: hospitals_health_systems (10,492), education (8,700), agriculture_agribusiness (5,663), insurance_non_health (1,434), pharmaceuticals (893), banking (739), big_tech (698).
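One way to read the pass order above is as a first-match-wins cascade: each pass is tried in sequence and the first one that yields an industry decides the tag. The sketch below follows that reading. Every rule table, field name, and example entity in it is a made-up stand-in, and the actual classifier may combine passes differently.

```python
import re

# Tiny hypothetical rule tables; the real maps are far larger and are not shown here.
MUTUAL_PINS = {"Example Mutual": "insurance_non_health"}
SIC_TO_INDUSTRY = {"2834": "pharmaceuticals", "6021": "banking"}
NTEE_TO_INDUSTRY = {"E22": "hospitals_health_systems", "B43": "education"}
FOREIGN_FILER_PATTERNS = [(re.compile(r"\bpharma", re.IGNORECASE), "pharmaceuticals")]

def foreign_filer_pass(entity):
    """Apply name regexes, but only to entities flagged as foreign filers."""
    if not entity.get("is_foreign_filer"):
        return None
    for pattern, industry in FOREIGN_FILER_PATTERNS:
        if pattern.search(entity.get("name", "")):
            return industry
    return None

def classify(entity, classified_orgs):
    """Return (industry, pass_name) from the first pass that matches, else (None, None).

    classified_orgs maps already-classified organization names to industries,
    which lets a PAC inherit the industry of its connected organization.
    """
    passes = [
        ("mutual_pin", lambda e: MUTUAL_PINS.get(e.get("name"))),
        ("sic_4digit", lambda e: SIC_TO_INDUSTRY.get(e.get("sic"))),
        ("ntee_selective", lambda e: NTEE_TO_INDUSTRY.get(e.get("ntee"))),
        ("foreign_filer_regex", foreign_filer_pass),
        ("pac_inheritance", lambda e: classified_orgs.get(e.get("connected_org"))),
    ]
    for pass_name, rule in passes:
        industry = rule(entity)
        if industry:
            return industry, pass_name
    return None, None

pac_result = classify(
    {"name": "Example Bank PAC", "connected_org": "Example Bank"},
    {"Example Bank": "banking"},
)
```

Because the passes run in a fixed order with no randomness or scoring, the same input always produces the same tag, and the returned pass name records which rule was responsible.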
Documented structural gaps
- Mutual insurance companies (TIAA, State Farm, Nationwide, MassMutual, Liberty Mutual, USAA, NY Life, Northwestern Mutual, others): manually pinned with entity_type='mutual_company'. They file no 990 and no SEC registration, which breaks both sides of our normal name-to-identifier crosswalk. Future closure via state Secretary of State data.
- Federated labor-union nationals: 8 of 10 (IBEW, Teamsters, SEIU, AFSCME, NEA, AFT, UFCW, IFPTE) are pinned as manual_chapter_national aliases. AFL-CIO national and UAW International are not yet in the entity pool (subsection and name-normalization issues); this is an open investigation.
- Federated nonprofit HQs (Sierra Club, NAACP, ACLU, United Way Worldwide, Boys & Girls Clubs of America, Habitat for Humanity International, Salvation Army, Planned Parenthood Federation of America, YMCA) are also pinned as manual chapter-nationals.
Source-table FK backfill rates
These are the rates at which source tables have been backfilled with entity_id
foreign keys from the scratch DB:
- schedule_i_grants (enriched): 88%
- schedule_i_990: 79%
- grants (Schedule F foundation grants): 57%
- lobbying: 49%
- earmarks: 59%
- hearings: 26%
- spending (USAspending): 32%, pending re-download after an April 2026 audit found the pipeline was dropping the UEI field from API responses; expected to reach ~92% after completion
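For readers who want to reproduce a figure like these, one plausible definition of the backfill rate is the share of rows in a source table whose entity_id is non-NULL. That definition is an assumption, as are the demo table contents below.

```python
import sqlite3

def backfill_rate(conn, table):
    """Fraction of rows in `table` with a non-NULL entity_id (assumed definition).

    `table` must come from a trusted list, since it is interpolated into the SQL.
    """
    total, linked = conn.execute(
        f"SELECT COUNT(*), COUNT(entity_id) FROM {table}"  # COUNT(col) skips NULLs
    ).fetchone()
    return linked / total if total else 0.0

# Made-up demo data: three of four rows have been linked.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hearings (witness_org TEXT, entity_id INTEGER)")
conn.executemany(
    "INSERT INTO hearings VALUES (?, ?)",
    [("Org A", 10), ("Org B", 11), ("Org C", None), ("Org A", 10)],
)
rate = backfill_rate(conn, "hearings")
```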
AI accessibility layer
DataDawn is built to be queryable by AI agents, not only by humans browsing a website. The platform exposes four public accessibility surfaces for machine consumption:
- llms.txt at each subdomain (regs.datadawn.org/llms.txt, data.datadawn.org/llms.txt): plain-text orientation guide for LLM agents about what each database contains and how to query it.
- REST APIs: Datasette natively exposes every table, every view, and every canned query as a JSON endpoint. Appending .json to any URL on data.datadawn.org or regs.datadawn.org returns structured data.
- OpenAPI specification: auto-generated by Datasette, available at each subdomain for agent builders pointing standard tooling at the platform.
- Live MCP server at mcp.datadawn.org: Model Context Protocol server with named tools including search_nonprofit, search_legislation, search_lobbying, search_comments, search_federal_register, search_grants, search_daf_grants, lookup_ein, lookup_member, member_trades, org_grants_made, org_officers, plus raw-SQL passthroughs (run_990_sql, run_openregs_sql).
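To make the .json convention concrete, the sketch below builds a table URL using Datasette's documented _shape and _size query parameters plus a column filter. It only constructs the URL and makes no network call. The database path segment (openregs) and the filter value are assumptions; check the live site for the actual database and table names.

```python
from urllib.parse import urlencode

def table_json_url(host, database, table, size=10, **filters):
    """Build a Datasette table URL that returns a plain JSON array of row objects.

    _shape=array and _size are standard Datasette parameters; any extra keyword
    arguments become exact-match column filters (column=value).
    """
    params = {"_shape": "array", "_size": size, **filters}
    return f"https://{host}/{database}/{table}.json?{urlencode(params)}"

# Hypothetical database name and made-up bioguide_id value.
url = table_json_url("regs.datadawn.org", "openregs", "earmarks", bioguide_id="X000001")
```

An agent or script could then fetch that URL with any HTTP client and parse the response as JSON.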
Journalists, researchers, and policy analysts increasingly use AI tools in their workflow. DataDawn is designed to be a first-class destination for those tools: queries hit live data, not training snapshots or scraped summaries. No other comparable transparency platform currently offers a live MCP server.
DataDawn standards & governance
DataDawn operates under two load-bearing standards adopted on April 19, 2026: a Data Sourcing Policy that governs what the platform ingests, and a PII Standard that governs what the platform publishes about named individuals. The philosophy of each is captured below. The full internal policies, decision tables, and implementation patterns are maintained as engineering documents; the essential commitments are public here.
Data Sourcing Policy
DataDawn ingests only primary U.S. federal government data and open public registries
(GLEIF, SAM.gov, and similar). No commercial aggregators (Candid/GuideStar, Bloomberg
Government, LegiStorm, Factiva) and no NGO-curated derivative tables from OpenSecrets,
OpenCorporates, ProPublica, or GovTrack. This is not a judgment about the quality of
those sources; many of them do excellent work. The point is that building DataDawn on
top of them would make DataDawn's own work unauditable in the way we want to be
auditable. Every join, every crosswalk, every aggregate on the platform can be
reproduced from the same federal APIs anyone else can access. When a claim on DataDawn
depends on an entity-resolution decision, that decision is made in-house from primary
data, not licensed from a third party whose methodology is a trade secret. Two narrow
carve-outs: federally-adopted identifiers with commercial lineage that are now part of
the federal stack (UEI, legacy DUNS in SAM.gov) and community-maintained federal
re-exports where the source is itself a federal database (e.g.,
unitedstates/congress-legislators).
PII Standard
DataDawn names people who act in a public capacity: Cabinet members, elected officials, Senate-confirmed appointees, registered lobbyists, nonprofit officers with published 990s, foreign-agent principals, public advisory-committee members. Those are the named actors the platform exists to surface, and the value of cross-referencing them across votes, trades, filings, meetings, and money depends on being able to name them. For incidental appearances of private citizens in federal datasets (a tourist in a visitor log, a $50 donor in an FEC file, a pro-se commenter in a rulemaking docket, a junior staffer on a disbursement line), the platform aggregates, redacts, or excludes. The record stays; the individual identifier does not.
The FEC deployment is the paradigm case. The FEC publishes approximately 104 million
individual contribution records with donor names, addresses, employers, and occupations
(about 49 GB on disk). DataDawn ingests the entire file locally for analysis but
does not publish it on the public databases. What gets published is
the family of aggregated rollups described in the FEC section above:
fec_employer_totals (352,103 employers with their giving histories),
fec_employer_to_candidate (476,534 employer-to-candidate aggregates),
fec_employer_to_party (286,908 employer-to-party totals), and
fec_top_occupations (666,254 occupation rollups). These answer essentially
every research question about which employers' staff give to which candidates, without
DataDawn becoming a second public search surface for individual donor names. The raw
file remains accessible at fec.gov for anyone who specifically needs it; DataDawn chooses
not to replicate that surface. The same principle extends to how the platform handles
White House visitor logs, OGE Public Available Filer disclosures, FACA advisory-committee
membership, and federal disbursement records.
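The aggregate-then-publish idea can be sketched in a few lines: individual rows go in, and only employer-level totals come out. The field names, the upper-casing used to normalize employers, and the sample rows are assumptions for illustration, not the production rollup logic.

```python
from collections import defaultdict

# Made-up individual contribution rows; real FEC rows carry many more fields.
raw_contributions = [
    {"donor_name": "Jane Doe", "employer": "Example Corp", "candidate_id": "C1", "amount": 50},
    {"donor_name": "John Roe", "employer": "example corp ", "candidate_id": "C1", "amount": 200},
    {"donor_name": "Ann Poe", "employer": "Other LLC", "candidate_id": "C2", "amount": 100},
]

def employer_to_candidate(rows):
    """Roll individual rows up to (employer, candidate) totals, dropping donor identity."""
    totals = defaultdict(lambda: {"n_contributions": 0, "total": 0})
    for row in rows:
        # Simplistic normalization, assumed here: trim whitespace and upper-case.
        key = (row["employer"].strip().upper(), row["candidate_id"])
        totals[key]["n_contributions"] += 1
        totals[key]["total"] += row["amount"]
    # Only aggregate fields are emitted; donor_name never reaches the output.
    return [
        {"employer": employer, "candidate_id": candidate, **stats}
        for (employer, candidate), stats in sorted(totals.items())
    ]

rollup = employer_to_candidate(raw_contributions)
```

The published table then answers "which employers' staff gave to which candidates" while containing no column from which an individual donor could be looked up.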
General limitations
Data as reported
DataDawn publishes data as reported in source filings and government databases. We do not correct, impute, or editorialize. Errors in source filings propagate to our database. Where we are aware of systematic data quality issues, they are documented in the dataset-specific sections above.
Entity resolution (in progress)
The same real-world entity may appear under different names across datasets (e.g., "ASPCA" vs "American Society for the Prevention of Cruelty to Animals" in 990 data, or variant name spellings across FEC and Congressional records). As of April 2026, DataDawn maintains a staging database with 2.3 million resolved organizational entities, 13.4 million role records, and 405,000 relationships, built entirely from primary government data (see the Entity Resolution section above). The layer is not yet deployed to the public databases; it will be merged in once source-table FK backfill rates stabilize (currently 49–88% across sources). Until then, cross-dataset organization queries should rely on stable identifiers (EIN, bioguide_id, FEC candidate ID, UEI) where possible, and name-based matching on unresolved entities has its usual pitfalls.
Point-in-time snapshots
Each dataset reflects the state of its source at the time of DataDawn's most recent extraction. Government agencies update their data on different schedules. The database is not a real-time feed.
Correlation is not causation
Cross-referencing datasets enables powerful queries (e.g., stock trades within 30 days of floor speeches on related topics), but temporal or thematic proximity does not establish a causal or improper relationship. DataDawn provides the data; interpretation is the user's responsibility.
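A date-window match of the kind mentioned above might look like the following sketch. The record fields and sample rows are hypothetical, and a hit only means two dates fall close together for the same member; it says nothing about why.

```python
from datetime import date

def trades_near_speeches(trades, speeches, window_days=30):
    """Return (ticker, topic, gap_in_days) for same-member pairs within the window."""
    hits = []
    for trade in trades:
        for speech in speeches:
            if trade["bioguide_id"] != speech["bioguide_id"]:
                continue
            gap = abs((trade["date"] - speech["date"]).days)
            if gap <= window_days:
                hits.append((trade["ticker"], speech["topic"], gap))
    return hits

# Made-up records for one member.
trades = [
    {"bioguide_id": "X000001", "ticker": "EXMP", "date": date(2025, 3, 20)},
    {"bioguide_id": "X000001", "ticker": "OTHR", "date": date(2025, 6, 1)},
]
speeches = [
    {"bioguide_id": "X000001", "topic": "example topic", "date": date(2025, 3, 1)},
]
hits = trades_near_speeches(trades, speeches)
```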
Update schedule
As of April 2026, DataDawn runs on an automated update pipeline monitored end-to-end by Healthchecks.io. Before April 2026 the platform was rebuilt manually; the move to automated updates with health monitoring is a significant operational maturity upgrade.
| Job | Schedule | Coverage |
|---|---|---|
| OpenRegs weekly | Sat 04:00 CT | Federal Register, eCFR, Congressional Record, stock trades, USAspending, legislation, votes, nominations, CRS → rebuild → deploy |
| OpenRegs monthly | 15th 20:00 CT | Heavier sources: Regulations.gov dockets/comments, lobbying, FEC, OIRA, FARA, GAO, IG → calls weekly pipeline |
| 990 monthly | 1st 04:30 CT | IRS e-file download → parse → build → deploy |
| Daily open comments | 06:30 UTC | Open-for-comment snapshot, no rebuild |
All 10 automated cron jobs (four pipelines above plus six backup/compliance crons) are
monitored by Healthchecks.io with per-job grace periods; the maintainer is alerted on failure.
Overlap between weekly and monthly runs is prevented by flock. Post-build
full-text-search verification aborts the pipeline if any expected FTS table is missing,
empty, or undersized. The deploy step uses an atomic-rename pattern (upload to
${REMOTE_DB}.new, then mv) to prevent the corruption caused by an interrupted
rsync --partial transfer, which destroyed a live database on April 11, 2026. A separate
validate_dates() QC check catches silent schema-drift bugs on every build.
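The verify-then-swap sequence described above can be approximated locally as follows. This is a sketch under stated assumptions, not the production pipeline: the function names and minimum row counts are hypothetical, a local file copy stands in for the remote upload, and the demo uses an ordinary table in place of a real FTS table because the check only counts rows.

```python
import os
import shutil
import sqlite3
import tempfile

def verify_tables(db_path, min_rows):
    """Raise RuntimeError if any expected table is missing, empty, or undersized."""
    conn = sqlite3.connect(db_path)
    try:
        existing = {
            row[0]
            for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        }
        for table, minimum in min_rows.items():
            if table not in existing:
                raise RuntimeError(f"{table}: missing")
            (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            if count < max(minimum, 1):
                raise RuntimeError(f"{table}: {count} rows, expected at least {minimum}")
    finally:
        conn.close()

def publish_atomically(built_db, live_db):
    """Copy to live_db + '.new', then rename over the live file in one step."""
    staging = live_db + ".new"
    shutil.copyfile(built_db, staging)  # an interrupted copy only damages the .new file
    os.replace(staging, live_db)        # atomic on POSIX when both paths share a filesystem

with tempfile.TemporaryDirectory() as tmp:
    built = os.path.join(tmp, "build.db")
    live = os.path.join(tmp, "live.db")
    conn = sqlite3.connect(built)
    conn.execute("CREATE TABLE gao_reports_fts (title TEXT)")  # stand-in for an FTS table
    conn.executemany("INSERT INTO gao_reports_fts VALUES (?)", [("a",), ("b",)])
    conn.commit()
    conn.close()

    verify_tables(built, {"gao_reports_fts": 2})  # abort before deploy if this raises
    publish_atomically(built, live)
    published = os.path.exists(live) and not os.path.exists(live + ".new")
```

The point of the pattern is that readers of the live path see either the old complete file or the new complete file, never a half-written one.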
Individual government source APIs update at their own cadences: the Federal Register updates daily, FEC bulk data refreshes quarterly, and IRS 990 e-files arrive in batches on the IRS's own schedule. DataDawn documents the last successful pull date for each dataset on its Datasette explore pages.
Independence statement
DataDawn is an independent project with no institutional affiliations. It receives no funding from any nonprofit, foundation, government agency, or organization represented in its datasets. All data is sourced exclusively from public records filed with federal government agencies.
DataDawn does not endorse, evaluate, or rank any organization, legislator, or entity. The platform provides raw data and search tools. Interpretation and analysis are the responsibility of the user.
All source code, extraction pipelines, and database schemas are published on GitHub under a CC0 1.0 Universal (public domain) license.
Corrections and feedback
If you find a data quality issue or parsing error, or have questions about the methodology, you can reach DataDawn at [email protected] or via the GitHub repository.