Contents
Platform overview
Cross-referencing methodology
Full-text search
IRS 990 nonprofit filings
DAF grant identification
Congress members
Congressional Record
Roll call votes
Congressional stock trades
Legislation
Committee hearings
CRS reports
Executive nominations & treaties
Campaign finance (FEC)
Federal Register
Regulations.gov
Code of Federal Regulations
Lobbying disclosures
Foreign agents (FARA)
Federal spending
Earmarks & directed appropriations
OIRA regulatory reviews
Inspector General reports
GAO reports
Entity resolution layer
AI accessibility layer
DataDawn standards & governance
Update schedule
General limitations
Independence statement
Corrections and feedback

Platform overview

DataDawn operates six production databases covering 25+ distinct federal data sources. All data is sourced exclusively from primary U.S. federal government APIs and public registries per DataDawn's Data Sourcing Policy (see the Standards & Governance section below). Records are published as filed, with no editorial filtering, scoring, or ranking applied. Total: approximately 180 million records across ~71 GB of SQLite, served via two Datasette instances behind Caddy on a dedicated Hetzner server.

990 Database (data.datadawn.org)

IRS nonprofit filings: 990, 990-EZ, 990-PF, and 990-T returns from tax years 2014–2025, plus foundation grants, DAF disbursements, and the IRS Business Master File. Approximately 5.2 million filings, 13.6 million foundation grants, and 1.27 million Schedule I DAF/charity grants. Total: over 115 million records across 14 public tables, 21.1 GB. All filings are additionally available as rendered Form 990 images at forms.datadawn.org.

OpenRegs instance (regs.datadawn.org): five databases

The regs.datadawn.org Datasette instance serves five production databases from one subdomain:

A separate entity-resolution staging database holds 2.3 million resolved organizational entities, 13.4 million role records, and 405,000 relationships, built entirely from primary government data. This layer is not yet deployed to the public databases; it will be merged in once source-table FK backfill rates stabilize (currently 49–88% across source tables). See the Entity Resolution section below.

Both Datasette instances provide interactive browsing, arbitrary SQL queries, JSON API access, and full data downloads in SQLite and CSV formats. No account, API key, or registration is required. As of April 2026, the platform also exposes a Model Context Protocol server at mcp.datadawn.org for AI-agent access; see the AI Accessibility section.
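As an illustration of the JSON API access described above, the sketch below builds a Datasette query URL without sending any request. Datasette accepts arbitrary SQL through a `sql` query-string argument on the `/{database}.json` endpoint and binds named parameters from additional arguments. The database path `openregs` and the column names `name` and `state` are assumptions made for this example, not confirmed schema details.

```python
from urllib.parse import urlencode

# Assumption: the OpenRegs database is exposed under the path "/openregs".
# The actual database name in the URL may differ.
BASE_URL = "https://regs.datadawn.org"


def datasette_sql_url(database: str, sql: str, **params: str) -> str:
    """Build a Datasette JSON API URL for a read-only SQL query.

    Named SQL parameters (":name" in the query) are passed as extra
    query-string arguments, which is how Datasette binds them.
    """
    query = {"sql": sql, "_shape": "array", **params}
    return f"{BASE_URL}/{database}.json?{urlencode(query)}"


# Hypothetical query: column names are illustrative only.
url = datasette_sql_url(
    "openregs",
    "select bioguide_id, name from congress_members where state = :state limit 5",
    state="VT",
)
print(url)
```

The `_shape=array` argument asks Datasette to return a plain JSON array of row objects, which is usually the easiest form for scripts to consume.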

Cross-referencing methodology

The distinctive feature of DataDawn's OpenRegs database is that datasets are linked through shared identifiers, enabling queries that span multiple data sources. The primary linkage key is bioguide_id, the Biographical Directory of the United States Congress identifier assigned to every member of Congress.

bioguide_id (universal member key)
│
├── congress_members: identity, party, state, terms
├── congressional_record: floor speeches (via crec_speakers)
├── member_votes: roll call voting record
├── stock_trades: financial disclosures
├── legislation: sponsored/cosponsored bills
├── fec_candidate_crosswalk → FEC contributions
├── hearing_members: committee hearing attendance
├── earmarks: directed appropriations
└── revolving_door → lobbying after Congress

docket_id (regulatory chain)
│
├── dockets: regulatory docket metadata
├── documents: rules, proposed rules, notices
└── comments: public comments

bill_id ({congress}-{type}-{number})
│
├── legislation: bill text and status
├── crec_bills: floor speech references
├── cbo_cost_estimates: CBO bill scoring
├── crs_report_bills: CRS report references
└── lobbying_activities: lobbying on specific bills

document_number (Federal Register)
│
└── fr_regs_crossref → dockets/documents

rin (Regulation Identifier Number)
│
├── oira_reviews: regulatory review decisions
├── oira_meetings → meeting attendees, requestor orgs
└── federal_register → RIN column (April 2026, 103,532 docs tagged)

cmte_id (FEC Committee ID)
│
├── fec_contributions: PAC-to-candidate money
├── fec_operating_expenditures: where PACs spend
├── fec_independent_expenditures: IEs for/against candidates
└── fec_pac_summary: committee financials

entity_id (entity-resolution scratch DB, in progress)
│
├── entities: 2.3M resolved organizations
├── public_actors: 13.4M role records across lobbyists,
│   FARA agents, FEC candidates, OGE filers, nonprofit officers
└── entity_relationships: 405K parent-subord, successor,
    group-exemption links (Wyeth→Pfizer, Catholic parish GEN 0928, etc.)

FEC-to-bioguide crosswalk

The Federal Election Commission uses its own candidate ID system, separate from the Congressional bioguide_id. DataDawn maintains a fec_candidate_crosswalk table with 1,712 verified mappings between FEC candidate IDs and bioguide IDs, enabling queries that link campaign contributions to legislators' voting records, stock trades, and floor statements.
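The crosswalk join can be sketched against a toy in-memory SQLite database. The table names come from this document; the column names (`fec_candidate_id`, `amount`, `ticker`) and all rows are invented for illustration and may not match the deployed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy schema: table names follow the text, column names are assumptions.
CREATE TABLE fec_candidate_crosswalk (fec_candidate_id TEXT, bioguide_id TEXT);
CREATE TABLE fec_contributions (cmte_id TEXT, fec_candidate_id TEXT, amount INTEGER);
CREATE TABLE stock_trades (bioguide_id TEXT, ticker TEXT);

INSERT INTO fec_candidate_crosswalk VALUES ('H0XX00001', 'A000001');
INSERT INTO fec_contributions VALUES
    ('C00000001', 'H0XX00001', 5000),
    ('C00000002', 'H0XX00001', 2500),
    ('C00000003', 'H0XX00999', 1000);  -- candidate with no crosswalk entry
INSERT INTO stock_trades VALUES ('A000001', 'XYZ'), ('A000001', 'ABC');
""")

# Correlated subqueries keep the two aggregates independent, so joining
# contributions and trades cannot multiply rows against each other.
rows = conn.execute("""
SELECT x.bioguide_id,
       (SELECT SUM(c.amount) FROM fec_contributions c
         WHERE c.fec_candidate_id = x.fec_candidate_id) AS pac_total,
       (SELECT COUNT(*) FROM stock_trades t
         WHERE t.bioguide_id = x.bioguide_id) AS trade_count
FROM fec_candidate_crosswalk x
""").fetchall()
print(rows)
```

Contributions to a candidate without a crosswalk row are simply absent from the result, which mirrors the "no imputed linkages" policy described in this section.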

Congressional Record speaker linkage

Floor speeches in the Congressional Record are linked to members via the crec_speakers table. 99.6% of speaker entries have a verified bioguide_id, enabling reliable attribution of floor statements to individual legislators.

Stock trade linkage

Congressional stock trade disclosures are linked to members' bioguide IDs through name and chamber matching. 100% of trades (61,895 transactions) are linked to a bioguide_id. This was achieved by filtering to only Periodic Transaction Report (PTR) filings and matching member names against the congressional member roster.

CIK enrichment on stock trades

As of April 2026, individual stock trades also carry SEC Central Index Key (CIK) identifiers for the traded issuer, populated via a ticker-SIC crosswalk. Coverage is 67.7% of trades, with the remaining trades mostly in tickers not mapped cleanly to an SEC filer (foreign issuers, municipal bonds, private securities appearing on certain member disclosures).

Linkage transparency: All cross-reference rates are reported as-is. Key linkage rates: stock trades 100%, CREC speakers 99.6%, FEC crosswalk 1,712 verified mappings, stock trades → CIK 67.7%, Federal Register ↔ Regulations.gov 395,621 crossrefs. Where linkage is incomplete, unlinked records remain in the database and are queryable; they simply lack the corresponding join key (for example, a bioguide_id). We do not impute or guess linkages.
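A linkage rate of this kind is just the share of rows with a non-null join key. The sketch below shows one way to compute it, and to list the unlinked rows that stay queryable, on a toy table. The column names `speech_id` and `speaker` are assumptions, and the helper `linkage_rate` is hypothetical rather than part of DataDawn's tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE crec_speakers (speech_id INTEGER, speaker TEXT, bioguide_id TEXT)"
)
conn.executemany(
    "INSERT INTO crec_speakers VALUES (?, ?, ?)",
    [
        (1, "Mr. A", "A000001"),
        (2, "Ms. B", "B000002"),
        (3, "The PRESIDING OFFICER", None),  # unlinked row, kept as-is
        (4, "Mr. A", "A000001"),
    ],
)


def linkage_rate(conn, table, key):
    """Percentage of rows in `table` whose `key` column is non-null.

    Identifiers are interpolated directly, so only pass trusted names.
    """
    total, linked = conn.execute(
        f"SELECT COUNT(*), COUNT({key}) FROM {table}"
    ).fetchone()
    return 100.0 * linked / total if total else 0.0


rate = linkage_rate(conn, "crec_speakers", "bioguide_id")
unlinked = conn.execute(
    "SELECT speech_id FROM crec_speakers WHERE bioguide_id IS NULL"
).fetchall()
print(rate, unlinked)
```

`COUNT(column)` skips NULLs while `COUNT(*)` does not, so a single pass over the table yields both the numerator and the denominator.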

Full-text search

DataDawn builds SQLite FTS5 (Full-Text Search) indexes on key text fields across both Datasette instances, enabling instant search across millions of records.

Dataset | Fields indexed
990 Returns | Organization name
Foundation Grants | Recipient name, purpose
DAF Grants | Recipient name, funder name
Federal Register | Title, abstract
Regulations.gov Dockets | Title
Regulations.gov Documents | Title
Regulations.gov Comments | Title, submitter name
Congressional Record | Full speech text
CFR Sections | Full regulatory text
Lobbying Filings | Filing descriptions
FARA Registrants | Registrant name, business
FARA Foreign Principals | Registrant, principal, country
Legislation | Bill title, summary
Spending Awards | Recipient name, description
FEC Employers | Employer name
GAO Reports | Title, abstract, subjects
Hearings | Title, witness names
Earmarks | Recipient, project description
CRS Reports | Title, summary
Nominations | Description
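For readers unfamiliar with FTS5, the following minimal sketch shows how an index over organization names can be built and queried. It assumes the local SQLite build includes the FTS5 extension (most Python distributions do). The table name `returns_fts`, the column names, and the rows are invented for the example and are not DataDawn's actual index definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy source table; column names are assumptions.
CREATE TABLE returns (ein TEXT, org_name TEXT, tax_year INTEGER);
INSERT INTO returns VALUES
    ('000000001', 'Example Community Foundation', 2021),
    ('000000002', 'Sample Health Clinic Inc', 2022),
    ('000000003', 'Example Arts Council', 2022);

-- Index only org_name; ein is stored but excluded from the text index.
CREATE VIRTUAL TABLE returns_fts USING fts5(org_name, ein UNINDEXED);
INSERT INTO returns_fts (org_name, ein) SELECT org_name, ein FROM returns;
""")

# The default tokenizer is case-insensitive, so "example" matches "Example".
hits = conn.execute(
    "SELECT ein, org_name FROM returns_fts WHERE returns_fts MATCH ? ORDER BY ein",
    ("example",),
).fetchall()
print(hits)
```

Because matches are on tokens rather than substrings, a search for "example" finds both organizations above but would not match a name such as "Examples United" unless a prefix query (`example*`) is used.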

IRS 990 nonprofit filings

Data source

All data is extracted from IRS electronic filings (e-files) published at apps.irs.gov/pub/epostcard/990/xml/. This is the same public dataset used by ProPublica, GuideStar, and academic researchers. The IRS releases new batches periodically throughout the year, typically monthly.

DataDawn currently covers filings from tax years 2014 through 2025, with earlier years having sparser coverage due to lower e-filing adoption rates before the IRS mandate took effect. Coverage is most complete for tax years 2016 onward.

Form types parsed

Form | Filed by | What we extract
990 | Public charities (revenue > $200K or assets > $500K) | Revenue, expenses, assets, officers, contractors, program activities
990-EZ | Small public charities (revenue < $200K and assets < $500K) | Revenue, expenses, assets, officers
990-PF | Private foundations | Revenue, assets, grants paid, officers, investments, contributors
990-T | Orgs with unrelated business income | Basic filing data

Database tables

Raw filings are parsed into 12 structured tables. No editorial filtering is applied: if the IRS published it, we parsed it.

Table | Records | Description
returns | ~5.2M | Core filing data: org name, EIN, state, revenue, expenses, assets, return type, tax year
grants | ~13.6M | Foundation grants from 990-PF filings: recipient, amount, purpose, date, location
schedule_i_grants | ~1.27M | Schedule I disbursements (DAF sponsors and public charities)
bmf | ~1.9M | IRS Business Master File: NTEE codes, ruling dates, asset codes
officers | ~44.8M | Officers, directors, trustees, and key employees with compensation
capital_gains | ~24.1M | Capital gains and losses from 990-PF
related_orgs | ~9.0M | Schedule R related organizations
investments | ~5.8M | Foundation investments (from 990-PF Part II)
contributors | ~662K | Contributors to private foundations (from 990-PF Schedule B)
program_activities | ~576K | Program service accomplishments and expenses
program_investments | ~314K | Program-related investments (PRIs)
contractors | ~77K | Independent contractors receiving > $100K
top_employees | ~74K | Highest-compensated employees

Extraction pipeline

IRS e-files are XML documents following IRS-defined schemas that have evolved across filing years. DataDawn's extraction scripts handle schema variations across years, mapping different XML element paths to consistent database columns.

  1. Download: New XML batches are synced from the IRS S3 bucket. Batch completion is tracked with marker files to prevent reprocessing.
  2. Parse: Three extraction scripts process 990/990-EZ returns, 990-PF detail filings (grants, investments, contributors), and Schedule I grants respectively.
  3. Deduplicate: Filings are keyed on a combination of EIN and object ID to prevent duplicate insertion from overlapping IRS releases.
  4. Index: Full-text search indexes (SQLite FTS5) are built on organization names and grant recipient names for instant search.
  5. Publish: The public database is built from an allowlist of raw data tables. No analysis or curated tables are included in the public release.
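The deduplication step in the list above can be sketched with a composite primary key and `INSERT OR IGNORE`. This is one plausible way to key filings on EIN plus object ID, not necessarily how DataDawn's scripts do it. The column names and the `load_batch` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE returns (
    ein TEXT NOT NULL,
    object_id TEXT NOT NULL,
    org_name TEXT,
    tax_year INTEGER,
    PRIMARY KEY (ein, object_id)  -- the deduplication key
)
""")


def load_batch(conn, filings):
    """Insert filings, skipping any (ein, object_id) pair already present.

    Returns the number of rows actually inserted.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO returns "
        "VALUES (:ein, :object_id, :org_name, :tax_year)",
        filings,
    )
    conn.commit()
    return conn.total_changes - before


batch_a = [
    {"ein": "000000001", "object_id": "obj-1", "org_name": "Example Org", "tax_year": 2021},
    {"ein": "000000002", "object_id": "obj-2", "org_name": "Sample Org", "tax_year": 2021},
]
# A later IRS release that overlaps the first batch by one filing.
batch_b = [
    {"ein": "000000002", "object_id": "obj-2", "org_name": "Sample Org", "tax_year": 2021},
    {"ein": "000000002", "object_id": "obj-3", "org_name": "Sample Org", "tax_year": 2022},
]

inserted_a = load_batch(conn, batch_a)
inserted_b = load_batch(conn, batch_b)
total = conn.execute("SELECT COUNT(*) FROM returns").fetchone()[0]
print(inserted_a, inserted_b, total)
```

Keying on the object ID as well as the EIN means an amended return with a new object ID is kept alongside the original, consistent with the "Amount discrepancies" limitation noted below.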

Known limitations: 990 data

E-file only

DataDawn only includes electronically filed returns. Paper filings (roughly one-third of all 990s) are not included. E-filing rates have increased over time, so recent years have better coverage than earlier years.

Filing lag

Organizations file 990s after their fiscal year ends, and the IRS publishes e-files on a rolling basis. The most recent tax year will always have incomplete data.

Sparse early years

Tax years 2014–2015 have limited coverage because the IRS e-filing mandate was not yet in full effect. Coverage is most reliable from 2016 onward.

Grant dates

Foundation grant dates come from the filer's reported grant date field. Some foundations report the approval date, others the payment date, and some leave it blank. Year-level analysis is more reliable than month-level.

Name matching

Organization names are as reported on the filing. The same organization may appear under slightly different names across years. DataDawn's public 990 database does not perform entity resolution, so search results should be verified by checking the EIN.

Amount discrepancies

Financial figures reflect what was reported on the filing. Amended returns may not overwrite original filings. In rare cases, both an original and amended filing for the same tax year may appear.

DAF grant identification

Donor-advised fund (DAF) disbursements are extracted from Schedule I of 990 filings submitted by DAF sponsor organizations. DataDawn identifies and parses grants from major DAF sponsors including Vanguard Charitable, Fidelity Charitable, Schwab Charitable, National Philanthropic Trust, Silicon Valley Community Foundation, and others.

These are grants made by DAF sponsors to recipient nonprofits. They do not identify the individual donors who recommended the grants; that information is not available in any public filing.

Why this matters: DAF grants represent a large and growing share of philanthropic funding, but because they flow through intermediary sponsors, they are difficult to trace using traditional 990-PF data alone. Combining 990-PF grants with Schedule I DAF data provides a more complete picture of institutional funding flows.

Congress members

The congress_members table is the universal identity table for the OpenRegs database. It contains 12,765 members of Congress, both historical and current, sourced from the Congress.gov API and the Biographical Directory of the United States Congress.

Each member is identified by their bioguide_id, which serves as the primary join key across all member-related datasets. The table includes name, party, state, chamber, number of terms served, and service dates.

A precomputed member_stats table provides aggregate counts per member (total trades, speeches, bills sponsored, votes cast) for quick summary views. Committee assignments (3,908 current assignments across 233 committees and subcommittees) are maintained in separate committees and committee_memberships tables with leadership title indicators.

Congressional Record

Data source

Floor proceedings from the Congressional Record, sourced from the Government Publishing Office (GPO) via govinfo.gov bulk data. Coverage spans 1994 to present, encompassing speeches, debates, remarks, and other floor proceedings from both chambers.

Tables

Table | Records | Description
congressional_record | 878,583 | Floor proceedings with full text, date, chamber, section
crec_speakers | 944,216 | Speaker-to-speech linkage, 99.6% with bioguide_id
crec_bills | 1,561,719 | Bill references extracted from floor proceedings

Known limitations

The Congressional Record is not a verbatim transcript. Members may revise and extend their remarks after delivery. The "Extensions of Remarks" section includes statements that were not delivered orally on the floor. Speaker attribution relies on GPO markup, which occasionally misattributes statements in colloquy or debate.

Roll call votes

Data source

Roll call voting data from both chambers, sourced from the Congress.gov API and official House/Senate clerk records.

Tables

Table | Records | Description
roll_call_votes | 26,439 | Vote metadata: question, result, date, congress, chamber
member_votes | 8,336,815 | Individual vote records: Yea, Nay, Present, Not Voting per member per vote

Known limitations

Roll call votes capture only recorded votes, not voice votes or unanimous consent agreements. Many legislative actions proceed without a recorded vote. "Not Voting" may indicate absence, abstention, or recusal; the data does not distinguish between these.

Congressional stock trades

Data source

Financial disclosure data from both chambers: House Periodic Transaction Reports (PTRs, parsed from PDF filings) and Senate electronic Financial Disclosures (eFD, scraped from the Senate disclosure website).

Coverage

61,895 transactions are currently in the database, with 100% linked to a bioguide_id and 67.7% carrying SEC CIK for the traded issuer (via ticker-SIC crosswalk, added April 2026). Trades include ticker symbol, transaction date, transaction type (purchase/sale/exchange), amount range, owner (self/spouse/child), and source (House PTR or Senate eFD).

House trades are parsed from PDF Periodic Transaction Reports. Senate trades are extracted from the Senate electronic Financial Disclosure system. Only PTR filings are included; annual disclosures, candidate reports, and amendments are excluded as they do not represent individual transactions.

Known limitations

Disclosure amounts are reported in ranges (e.g., $1,001–$15,000), not exact figures. Filing deadlines allow up to 45 days after a transaction, and extensions are common. Trades by spouses and dependent children are included in disclosures and identified via the owner field where available.
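Because amounts arrive as ranges, analysis usually starts by turning each range string into numeric bounds. The helper below is a hypothetical sketch, not DataDawn's parser. It assumes ranges are written with dollar signs, as in the example above, and treats a single-value bracket such as "Over $50,000,000" as open-ended.

```python
import re

# A dollar sign followed by digits with optional thousands separators.
_AMOUNT = re.compile(r"\$\s*(\d[\d,]*)")


def parse_amount_range(text):
    """Return (low, high) dollar bounds for a disclosure range string.

    `high` is None for an open-ended bracket with a single dollar value.
    """
    values = [int(m.replace(",", "")) for m in _AMOUNT.findall(text)]
    if len(values) == 2:
        return values[0], values[1]
    if len(values) == 1:
        return values[0], None
    raise ValueError(f"unrecognized amount range: {text!r}")


low, high = parse_amount_range("$1,001–$15,000")
open_ended = parse_amount_range("Over $50,000,000")
print(low, high, open_ended)
```

Keeping both bounds, rather than collapsing to a midpoint, lets downstream queries report totals as a minimum and maximum and avoids implying precision the disclosures do not have.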

Legislation

Data source

Bill data from the Congress.gov API covering Congresses 93–119 (1973–present).

Tables

Table | Records | Description
legislation | 375,620 | Bills with title, sponsor, policy area, latest action, status
legislation_cosponsors | 4,067,601 | Cosponsor records with bioguide_id linkage
legislation_actions | 2,310,777 | Action steps: introduced, referred, passed, signed
legislation_subjects | 3,036,186 | Subject tags assigned by the Congressional Research Service
cbo_cost_estimates | 17,201 | CBO bill scoring and cost analysis

Bills are identified by a composite bill_id in the format {congress}-{type}-{number} (e.g., 118-hr-1234), which links to floor speech references in crec_bills and lobbying activity records in lobbying_activities.
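The composite key is easy to build and take apart in client code. The helpers below are a hypothetical sketch. Lowercasing the bill type is an assumption based on the `118-hr-1234` example, and type codes other than `hr` are not confirmed by this document.

```python
from typing import NamedTuple


class BillId(NamedTuple):
    congress: int
    bill_type: str
    number: int


def make_bill_id(congress: int, bill_type: str, number: int) -> str:
    """Format a key as {congress}-{type}-{number}, e.g. 118-hr-1234."""
    return f"{congress}-{bill_type.lower()}-{number}"


def parse_bill_id(bill_id: str) -> BillId:
    """Split a composite bill_id back into its three components."""
    congress, bill_type, number = bill_id.split("-")
    return BillId(int(congress), bill_type, int(number))


key = make_bill_id(118, "HR", 1234)
parts = parse_bill_id(key)
print(key, parts)
```

Parsing the key back into typed fields is useful when filtering by Congress number, since comparing the raw strings would sort "99-hr-1" after "118-hr-1".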

Campaign finance (FEC)

Data source

Federal Election Commission bulk data files covering candidates, committees, and contributions.

Tables (deployed)

Table | Records | Description
fec_candidates | 64,679 | FEC-registered candidates
fec_committees | 154,967 | PACs, party committees, campaign committees
fec_contributions | 4,395,926 | PAC/committee-to-candidate contributions
fec_candidate_crosswalk | 1,712 | Verified FEC candidate ID to bioguide_id mappings
fec_operating_expenditures | 15,358,447 | PAC/party/candidate disbursements (where the money goes)
fec_independent_expenditures | 666,910 | Independent expenditures supporting or opposing candidates
fec_pac_summary | 98,614 | Per-committee aggregate financials by cycle
fec_electioneering | 1,679 | Electioneering communications
fec_communication_costs | 25,641 | Internal communications costs reported by corps/unions

Employer-aggregated donations (the fec_employers rollup)

In addition to the PAC/committee-level tables above, DataDawn publishes a family of employer-aggregated rollups built from the individual-donor contribution file. These rollups preserve the research utility of the raw data (which employers give to which candidates, which occupations concentrate in which party, etc.) without publishing any individual donor's name, address, or transaction.

Table | Records | Description
fec_employer_totals | 352,103 | Per-employer totals: donation count, total amount, unique states
fec_employer_to_candidate | 476,534 | Employer-to-candidate aggregates with party and office
fec_employer_to_party | 286,908 | Employer-to-party totals by cycle
fec_top_occupations | 666,254 | Top occupations per employer (aggregate, not per-person)
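The idea behind the rollups, aggregating donor-level rows into employer-level totals so that no individual record is published, can be sketched as follows. The source table `individual_contributions`, its columns, and all rows are invented for illustration. Only the output table name and its three measures (donation count, total amount, unique states) come from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical donor-level input; this is the data that stays private.
CREATE TABLE individual_contributions (
    donor_name TEXT, employer TEXT, state TEXT, amount INTEGER
);
INSERT INTO individual_contributions VALUES
    ('Donor One',   'ACME CORP',   'OH', 500),
    ('Donor Two',   'ACME CORP',   'TX', 250),
    ('Donor Three', 'ACME CORP',   'OH', 100),
    ('Donor Four',  'EXAMPLE LLC', 'CA', 1000);

-- The published rollup carries only aggregates, never donor columns.
CREATE TABLE fec_employer_totals AS
SELECT employer,
       COUNT(*)              AS donation_count,
       SUM(amount)           AS total_amount,
       COUNT(DISTINCT state) AS unique_states
FROM individual_contributions
GROUP BY employer;
""")

cursor = conn.execute("SELECT * FROM fec_employer_totals ORDER BY employer")
columns = [d[0] for d in cursor.description]
totals = cursor.fetchall()
print(columns, totals)
```

A production rollup would likely also normalize employer spellings and might suppress very small groups, since an employer with a single donor can still identify a person. Neither step is shown here, and this document does not say whether DataDawn applies them.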

Why individual contributions are not deployed (PII)

The Federal Election Commission publishes individual contribution records that include donor names, addresses, employers, and occupations. The full file runs to approximately 104 million records (49 GB). DataDawn ingests this entire corpus into a local analysis database (fec.db: 44 million committee transactions plus 4.4 million contributions to candidates plus the candidate and committee metadata) but does not publish it on the deployed databases.

This is a deliberate editorial decision, not a cost-saving one. The rollup tables above answer essentially every research question about influence, employer giving patterns, PAC activity, and occupation-concentration without exposing individual donor records to a one-click public query. The raw file remains accessible at the FEC's own site for anyone who needs it; DataDawn is choosing not to replicate a second public surface for individual-named records.

As of April 19, 2026, this approach is codified as DataDawn's formal PII Standard and extended beyond FEC to govern White House visitor log handling, OGE Public Available Filer disclosures, FACA advisory committee membership, and federal disbursement records. See the "DataDawn Standards & Governance" section below for the underlying philosophy.

Other known limitations

FEC data has its own filing lag: contributions may not appear for weeks or months after they are made. FEC amount figures are reported exactly (unlike the range-based financial disclosures from Congress). Committees that register but never file activity still appear in fec_committees; filter by activity or cycle when doing meaningful counts.

Federal Register

Data source

The Federal Register API (federalregister.gov/api), which provides structured data for every document published in the Federal Register.

Tables

Table | Records | Description
federal_register | 994,487 | Rules, proposed rules, notices, presidential documents with title, abstract, dates, PDF/HTML URLs. As of April 2026, includes a regulation_id_numbers column populated on 103,532 documents (joinable to OIRA reviews by RIN).
federal_register_agencies | 1,532,539 | Agency tags (many documents have multiple agencies)
presidential_documents | 5,925 | Executive orders, proclamations, memoranda
fr_regs_crossref | 395,621 | Links Federal Register document numbers to Regulations.gov dockets (3.4× the March 2026 count after the comprehensive crossref rebuild)

Regulations.gov

Data source

The Regulations.gov API (api.regulations.gov), the federal government's public comment and rulemaking system.

Tables

Table | Records | Description
dockets | 254,910 | Regulatory dockets from 126 federal agencies
documents | 1,703,711 | Regulatory documents: rules, proposed rules, notices, supporting materials
comments | 9,764,809 | Public comment headers: submitter, date, agency, docket
comment_details | 428,838 | Full-text comment bodies: approximately 4.4% of all comment headers, up from 1.6% in March 2026

Known limitations

The Regulations.gov API has strict rate limits (approximately 1,000 requests per hour per key). Full-text comment bodies (comment_details) are being downloaded incrementally using a dual-key forward/reverse approach, prioritizing organizational submissions. Comment header data (submitter name, date, docket) is complete for all 9.76M comments across 126 federal agencies. The majority of Regulations.gov comments are identical form letters submitted through advocacy campaigns; the substantive organizational comments are a much smaller fraction of the total. Some agencies do not publish all comments through Regulations.gov.
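The rate limit explains why full-text coverage grows slowly. The back-of-envelope calculation below uses the record counts and the approximate per-key limit stated in this section. It assumes one API request per comment body and two keys running continuously at the full rate. Both are simplifications, so the result is a lower bound on elapsed time rather than a schedule.

```python
HEADERS = 9_764_809        # comment headers in the comments table
BODIES = 428_838           # rows in comment_details
REQUESTS_PER_HOUR = 1_000  # approximate per-key limit
KEYS = 2                   # assumption: "dual-key" means two API keys

coverage_pct = 100 * BODIES / HEADERS
remaining = HEADERS - BODIES

# Assumption: each remaining body costs exactly one request.
hours = remaining / (REQUESTS_PER_HOUR * KEYS)
days = hours / 24

print(f"{coverage_pct:.1f}% covered; {remaining:,} bodies remaining")
print(f"at least {days:.0f} days of continuous downloading")
```

Under these assumptions the remaining backlog is on the order of six months of uninterrupted requests, which is consistent with prioritizing organizational submissions over form-letter campaigns.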

Code of Federal Regulations

Data source

Bulk XML downloads from the Electronic Code of Federal Regulations (eCFR) at ecfr.gov.

Coverage

123,480 regulatory sections from 19 CFR titles covering major regulatory domains: Agriculture (7), Animals (9), Energy (10), Aeronautics (14), Commerce (15), Commodities (17), Employee Benefits (20), Food and Drugs (21), Housing (24), Judicial Administration (28), Labor (29), Navigation (33), Education (34), Pensions (38), Environment (40), Emergency Management (44), Shipping (46), Transportation (49), and Wildlife (50). Full regulatory text is indexed for full-text search.

Known limitations

19 of 50 CFR titles are currently included. The CFR is updated continuously as agencies publish final rules; the DataDawn snapshot reflects the eCFR as of the most recent bulk download. Regulations that have been proposed but not finalized are not included in the CFR data (they appear in the Federal Register).

Lobbying disclosures

Data source

Senate Lobbying Disclosure Act (LDA) filings, downloaded from the Senate Office of Public Records bulk data system. Data covers 1999 through present.

Served as a standalone database

Lobbying is served at regs.datadawn.org as its own Datasette database (lobbying.db, 14.5 GB, 9 tables, 10 dedicated canned queries). This gives the lobbying corpus its own query surface with richer schema than the subset mirrored into OpenRegs for cross-database joins.

Tables (lobbying.db)

Table | Records | Description
lobbying_filings_raw | 1,915,098 | Disclosure filings: client, registrant, income/expenses, year, period. As of April 2026 also carries client_state (91% coverage), registrant_house_id, client_government_entity, and affiliated_organizations columns recovered by the re-parse migration.
lobbying_registrations | 138,328 | Registration records (not mirrored into OpenRegs)
lobbying_activities | 3,811,121 | Activity records: issue codes, descriptions, specific bills lobbied
lobbying_lobbyists | 4,730,966 | Lobbyist entries with covered_position (revolving door indicator)
lobbying_contributions | 3,670,570 | LDA-reported contributions (not mirrored into OpenRegs)
lobbying_affiliated_orgs | 29,447 | Subsidiary graph extracted from LDA affiliated_organizations field (April 2026)
lobbying_issue_codes | 79 | Standard issue category codes (reference)
lobbying_gov_entities | 257 | Covered government-entity list (reference)
lobbying_filing_types | 300 | Filing type codes (reference)

April 2026 re-parse migration

An April 2026 migration reprocessed the 1.9 million raw LDA JSON filings already on disk to recover fields that had been dropped by the original pipeline: client_state (now 91% coverage), registrant_house_id, client_government_entity, and affiliated_organizations. This was an in-place enrichment, not a fresh download. The lobbying_affiliated_orgs subsidiary graph (29,447 rows) was extracted as a companion table during the same migration.

Revolving door

The covered_position field in lobbyist records identifies individuals who previously held government positions (the "revolving door" between government service and lobbying). This field is self-reported by the registrant.

DataDawn builds a materialized revolving_door table that cross-references lobbyists whose covered_position indicates former congressional service (matching 12 position patterns such as "U.S. Senator," "Member of Congress," "Former Representative") against the congress_members table using name matching. This enables queries showing which former members lobby on which issues, for which clients.
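The two-stage match described above (filter on position patterns, then match names against the member roster) can be sketched as follows. Only the three patterns quoted in the text are used; the other nine are not listed in this document. The record layout, the `normalize_name` helper, and the exact-match rule are assumptions for illustration.

```python
import re

# Three of the 12 position patterns, as quoted in the text above.
PATTERNS = ["U.S. Senator", "Member of Congress", "Former Representative"]
_CONGRESS_RE = re.compile(
    "|".join(re.escape(p) for p in PATTERNS), re.IGNORECASE
)


def normalize_name(name):
    """Lowercase, drop punctuation and digits, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z ]", " ", name.lower()).split())


def revolving_door(lobbyists, members):
    """Return (lobbyist name, bioguide_id) pairs for matched former members."""
    by_name = {normalize_name(m["name"]): m["bioguide_id"] for m in members}
    matches = []
    for lob in lobbyists:
        # Stage 1: the self-reported position must indicate congressional service.
        if not _CONGRESS_RE.search(lob["covered_position"] or ""):
            continue
        # Stage 2: exact match on the normalized name.
        bioguide_id = by_name.get(normalize_name(lob["name"]))
        if bioguide_id:
            matches.append((lob["name"], bioguide_id))
    return matches


members = [{"name": "Jane Q. Example", "bioguide_id": "E000001"}]
lobbyists = [
    {"name": "JANE Q EXAMPLE", "covered_position": "Former Representative (OH-99)"},
    {"name": "John Sample", "covered_position": "Legislative Assistant, Sen. Doe"},
    {"name": "Pat Other", "covered_position": "U.S. Senator"},  # not on the roster
]
result = revolving_door(lobbyists, members)
print(result)
```

Exact matching on normalized names will miss nickname and middle-initial variants and can collide when two members share a name, so matches produced this way are best treated as candidates for review.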

Known limitations

LDA filings are self-reported by registrants and are not independently audited. Income and expense figures are reported in ranges on some filing types. The lobbying_activities table links to specific bill numbers when reported, but lobbyists are not required to list every bill they lobby on.

Foreign agents (FARA)

Data source

Foreign Agents Registration Act data from the Department of Justice FARA database at fara.gov. Served at regs.datadawn.org as its own Datasette database (fara.db, 42 MB, 6 tables) and also mirrored as tables in OpenRegs for cross-dataset joins.

Tables

Table | Records | Description
fara_registrants | 7,043 | Registered foreign agents (firms and individuals)
fara_foreign_principals | 17,652 | Foreign government and entity clients
fara_short_forms | 44,416 | Individual agents working under registrations
fara_registrant_docs | 151,987 | Filed documents with PDF links

Known limitations

FARA registration is self-reported and enforcement has historically been limited. The DOJ has acknowledged that compliance rates are uncertain. Some entities that may be required to register under FARA instead register under the LDA, which has less stringent disclosure requirements. Cross-referencing FARA registrants with lobbying filings (by firm name) can reveal some of these overlaps but is not definitive.

Federal spending

Data source

USAspending.gov bulk award data covering grants, contracts, and other federal awards across 20 agencies.

Coverage

863,632 awards including recipient name, award amount, funding agency, award type, and date ranges. Linkable to agencies referenced in Federal Register documents and lobbying filings.

Known limitations

USAspending.gov data has known reporting quality issues acknowledged by the government itself. Not all agencies report at the same level of detail or timeliness. Sub-award data is not currently included. The 20-agency scope covers the most active federal funders but is not comprehensive across all federal agencies.

OIRA regulatory reviews & meetings

Data source

OIRA (Office of Information and Regulatory Affairs) data is sourced from reginfo.gov. EO 12866 regulatory review records are downloaded as XML bulk files covering 1981 to present. Meeting data (2014–present) is collected from the reginfo.gov search interface, with individual meeting detail pages scraped for full attendee lists.

Coverage

48,434 regulatory reviews (1981–present), 8,663 meetings with outside parties (2014–present), and 90,711 individual meeting attendees with organization affiliations and participation type.

Cross-reference potential

OIRA meetings connect to lobbying data through requestor organization names, to Federal Register documents through Regulation Identifier Numbers (RIN), and to the rulemaking timeline through meeting dates relative to rule publication dates.

Inspector General reports

Data source

Inspector General reports are scraped from oversight.gov, the federal IG community's centralized reporting portal. Both listing metadata and individual report details (including recommendations) are collected.

Coverage

34,880 IG reports with 11,999 individual recommendations. Reports include questioned costs, funds for better use, agency reviewed, report type, and links to original PDF documents.

Committee hearings

Data source

Congressional committee hearing transcripts and metadata from GovInfo (govinfo.gov/bulkdata/CHRG), covering hearings published since 1995.

Tables

Table | Records | Description
hearings | 46,177 | Hearing metadata: title, committee, date, congress, chamber, package_id
hearing_members | 1,244,920 | Member attendance at hearings, linked to congress_members by bioguide_id

89.6% of hearings link to at least one member record. Witnesses are captured in the hearing transcript text but are not yet broken out into a structured table; that is planned for a future build.

Congressional Research Service (CRS) reports

Data source

CRS reports via the Congress.gov API v3. CRS is Congress's in-house research arm; its reports are published to Congress.gov. Public release was the subject of a long-running debate that was settled in 2018 by a statutory direction to make the reports public.

Coverage

13,727 CRS reports with title, publication date, URL, and full text where available. A companion crs_report_bills table (135,890 rows) links reports to bill_ids referenced in the report body, enabling queries like "all CRS analysis related to a specific bill or topic area."

Executive nominations & treaties

Data source

Nominations and treaties data from the Congress.gov API v3, covering the Senate's advice-and-consent calendar.

Tables

Table | Records | Description
nominations | 40,167 | Executive nominations: nominee, position, agency, date received, Senate action
treaties | 777 | Treaties submitted to the Senate for ratification: title, topic, date received

The full historical Senate treaty record is small (fewer than 800 treaties since the founding); the nominations stream is substantially larger and covers both confirmed and withdrawn/returned nominations.

Earmarks & directed appropriations

Data source

Congressional earmark data (also called Congressionally Directed Spending / Community Project Funding) from the House Appropriations Committee and Senate Appropriations Committee public disclosures.

Coverage

70,826 earmark records with requesting member (bioguide_id), recipient organization, dollar amount requested, fiscal year, and project description. Linkable to members' voting, sponsored-bills, and campaign-donor records through bioguide_id.

Earmark disclosures have varying detail depending on chamber and fiscal year; some historical windows had earmarks suspended entirely and contain zero records.

Government Accountability Office (GAO) reports

Data source

GAO reports sourced from two feeds: GovInfo bulk data for the historical GAO archive up through mid-2008, and direct scraping from gao.gov for the post-2008 period when GAO moved off GovInfo. The two feeds are reconciled into a single gao_reports table.

Coverage

73,725 GAO reports with 99.95% detail-file coverage (57K direct stubs from gao.gov plus GovInfo historical). Reports include title, subjects, agencies reviewed, abstract, and PDF URL. Full-text search indexes title, abstract, and subject tags.

Entity resolution layer (in progress, staged)

What it is

A staging database that builds a unified organization graph from DataDawn's primary government data. When complete, it will let users ask a single question ("show me every federal touchpoint of Company X") and get an answer across lobbying, grants, federal spending, regulatory participation, congressional hearings, and personnel filings, joined by a single entity_id instead of by fragile name-matching.

Current state (April 19, 2026)

Table | Records | Description
entities | 2,335,378 | Resolved organizations: nonprofits, private companies, trade associations, unions, political committees, PACs, public companies, and manually-pinned mutual insurance companies
public_actors | 13,431,058 | Role records: nonprofit officers (13.26M), lobbyists-under-LDA (86,587), FARA agents (43,243), FEC candidates (30,615), congress members (12,765), OGE Senate-confirmed filers (147)
entity_relationships | 405,304 | BMF group-exemption memberships (401,894), predecessor-successor links (2,427 including Wyeth→Pfizer, Conseco→CNO, Sunoco→Energy Transfer), LDA affiliated-org links (983)

Industry classification (companion)

32,706 federal-policy-relevant entities have been tagged into a 40-industry taxonomy by a 5-pass deterministic classifier (mutual pins → SIC 4-digit → NTEE selective → foreign-filer regex → PAC connected-organization inheritance) plus a 215-entry curated trade-association map. Top classified industries: hospitals_health_systems (10,492), education (8,700), agriculture_agribusiness (5,663), insurance_non_health (1,434), pharmaceuticals (893), banking (739), big_tech (698).
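One way to read the ordered pass list is as a first-match-wins chain: each pass either returns an industry or defers to the next. The sketch below assumes that reading and wires up only the first three passes; the foreign-filer and PAC-inheritance passes would plug into the same list. The lookup tables, entity fields, and entity IDs are hypothetical; only the industry labels are taken from the text above.

```python
from typing import Callable, Optional

Entity = dict
ClassifierPass = Callable[[Entity], Optional[str]]

# Hypothetical lookup tables; the real pins and code maps are not shown in this document.
MANUAL_PINS = {"entity-001": "insurance_non_health"}
SIC_TO_INDUSTRY = {"2834": "pharmaceuticals", "6021": "banking"}
NTEE_TO_INDUSTRY = {"E22": "hospitals_health_systems"}


def pin_pass(entity: Entity) -> Optional[str]:
    """Pass 1: manually pinned entities."""
    return MANUAL_PINS.get(entity.get("entity_id"))


def sic_pass(entity: Entity) -> Optional[str]:
    """Pass 2: 4-digit SIC code lookup."""
    return SIC_TO_INDUSTRY.get(entity.get("sic"))


def ntee_pass(entity: Entity) -> Optional[str]:
    """Pass 3: selective NTEE code lookup."""
    return NTEE_TO_INDUSTRY.get(entity.get("ntee"))


PASSES = [pin_pass, sic_pass, ntee_pass]


def classify(entity: Entity, passes=PASSES) -> Optional[str]:
    """Return the industry from the first pass that produces one, else None."""
    for classifier_pass in passes:
        industry = classifier_pass(entity)
        if industry is not None:
            return industry
    return None
```

Because the passes are plain functions evaluated in a fixed order, the same input always yields the same label, which is what makes a classifier of this shape deterministic and reproducible.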

Documented structural gaps

Source-table FK backfill rates

These are the rates at which source tables have been backfilled with entity_id foreign keys from the scratch DB:

AI accessibility layer

DataDawn is built to be queryable by AI agents, not only by humans browsing a website. The platform exposes four public accessibility surfaces for machine consumption:

  1. llms.txt at each subdomain (regs.datadawn.org/llms.txt, data.datadawn.org/llms.txt): plain-text orientation guide for LLM agents about what each database contains and how to query it.
  2. REST APIs: Datasette natively exposes every table, every view, and every canned query as a JSON endpoint. Appending .json to any URL on data.datadawn.org or regs.datadawn.org returns structured data.
  3. OpenAPI specification: auto-generated by Datasette, available at each subdomain for agent builders pointing standard tooling at the platform.
  4. Live MCP server at mcp.datadawn.org: Model Context Protocol server with named tools including search_nonprofit, search_legislation, search_lobbying, search_comments, search_federal_register, search_grants, search_daf_grants, lookup_ein, lookup_member, member_trades, org_grants_made, org_officers, plus raw-SQL passthroughs (run_990_sql, run_openregs_sql).
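For the REST surface, a small helper can assemble a table's JSON URL. The sketch relies on standard Datasette conventions (the .json suffix, the _shape and _size parameters, and column=value exact-match filters); the database name "openregs" and table name "earmarks" in the usage line are placeholders, not confirmed names.

```python
from urllib.parse import urlencode


def table_json_url(host, database, table, size=100, **filters):
    """Build a Datasette table URL that returns rows as a JSON array.

    filters are passed through as column=value exact-match filters.
    """
    params = {"_shape": "array", "_size": size, **filters}
    return f"https://{host}/{database}/{table}.json?{urlencode(params)}"


# Hypothetical database and table names, shown only to illustrate the URL shape.
example_url = table_json_url(
    "regs.datadawn.org", "openregs", "earmarks", size=5, bioguide_id="X000000"
)
```

The resulting URL can be fetched with any HTTP client; an agent should consult llms.txt or the OpenAPI specification first to learn the actual database and table names.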

Journalists, researchers, and policy analysts increasingly use AI tools in their workflow. DataDawn is designed to be a first-class destination for those tools: queries hit live data, not training snapshots or scraped summaries. No other comparable transparency platform currently offers a live MCP server.

DataDawn standards & governance

DataDawn operates under two load-bearing standards adopted on April 19, 2026: a Data Sourcing Policy that governs what the platform ingests, and a PII Standard that governs what the platform publishes about named individuals. The philosophy of each is captured below. The full internal policies, decision tables, and implementation patterns are maintained as engineering documents; the essential commitments are public here.

Data Sourcing Policy

DataDawn ingests only primary U.S. federal government data and open public registries (GLEIF, SAM.gov, and similar). It uses no commercial aggregators (Candid/GuideStar, Bloomberg Government, LegiStorm, Factiva) and no NGO-curated derivative tables from OpenSecrets, OpenCorporates, ProPublica, or GovTrack. This is not a judgment about the quality of those sources; many of them do excellent work. The point is that building DataDawn on top of them would make DataDawn's own work unauditable in the way we want to be auditable. Every join, every crosswalk, every aggregate on the platform can be reproduced from the same federal APIs anyone else can access. When a claim on DataDawn depends on an entity-resolution decision, that decision is made in-house from primary data, not licensed from a third party whose methodology is a trade secret. Two narrow carve-outs: federally-adopted identifiers with commercial lineage that are now part of the federal stack (UEI, legacy DUNS in SAM.gov) and community-maintained federal re-exports where the source is itself a federal database (e.g., unitedstates/congress-legislators).

PII Standard

DataDawn names people who act in a public capacity: Cabinet members, elected officials, Senate-confirmed appointees, registered lobbyists, nonprofit officers with published 990s, foreign-agent principals, and public advisory-committee members. Those are the named actors the platform exists to surface, and the value of cross-referencing them across votes, trades, filings, meetings, and money depends on being able to name them. For incidental appearances of private citizens in federal datasets (a tourist in a visitor log, a $50 donor in an FEC file, a pro-se commenter in a rulemaking docket, a junior staffer on a disbursement line), the platform aggregates, redacts, or excludes. The record stays; the individual identifier does not.

The FEC deployment is the paradigm case. The FEC publishes approximately 104 million individual contribution records with donor names, addresses, employers, and occupations (about 49 GB on disk). DataDawn ingests the entire file locally for analysis but does not publish it on the public databases. What gets published is the family of aggregated rollups described in the FEC section above: fec_employer_totals (352,103 employers with their giving histories), fec_employer_to_candidate (476,534 employer-to-candidate aggregates), fec_employer_to_party (286,908 employer-to-party totals), and fec_top_occupations (666,254 occupation rollups). These answer essentially every research question about which employers' staff give to which candidates, without DataDawn becoming a second public search surface for individual donor names. The raw file remains accessible at fec.gov for anyone who specifically needs it; DataDawn chooses not to replicate that surface. The same principle extends to how the platform handles White House visitor logs, OGE Public Available Filer disclosures, FACA advisory-committee membership, and federal disbursement record-handling.
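The aggregate-then-publish idea can be sketched as a rollup that groups individual contributions by employer and candidate and carries no donor-level fields into the output. The input field names, the upper-casing of employer strings, and the sample rows are assumptions for illustration; the actual build logic behind fec_employer_to_candidate is not described here.

```python
from collections import defaultdict


def employer_to_candidate(contributions):
    """Roll individual contributions up to (employer, candidate) totals.

    Donor names and any other per-person fields are never copied to the output.
    Assumption: employers are matched on a trimmed, upper-cased string.
    """
    totals = defaultdict(lambda: {"total_amount": 0, "n_contributions": 0})
    for c in contributions:
        key = (c["employer"].strip().upper(), c["candidate_id"])
        totals[key]["total_amount"] += c["amount"]
        totals[key]["n_contributions"] += 1
    return [
        {"employer": employer, "candidate_id": candidate, **agg}
        for (employer, candidate), agg in sorted(totals.items())
    ]
```

The output rows answer "which employers' staff give to which candidates" while containing nothing that identifies an individual donor.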

General limitations

Data as reported

DataDawn publishes data as reported in source filings and government databases. We do not correct, impute, or editorialize. Errors in source filings propagate to our database. Where we are aware of systematic data quality issues, they are documented in the dataset-specific sections above.

Entity resolution (in progress)

The same real-world entity may appear under different names across datasets (e.g., "ASPCA" vs "American Society for the Prevention of Cruelty to Animals" in 990 data, or variant name spellings across FEC and Congressional records). As of April 2026, DataDawn maintains a staging database with 2.3 million resolved organizational entities, 13.4 million role records, and 405,000 relationships, built entirely from primary government data (see the Entity Resolution section above). The layer is not yet deployed to the public databases; it will be merged in once source-table FK backfill rates stabilize (currently 49–88% across sources). Until then, cross-dataset organization queries should rely on stable identifiers (EIN, bioguide_id, FEC candidate ID, UEI) where possible, because name-based matching on unresolved entities has its usual pitfalls.

Point-in-time snapshots

Each dataset reflects the state of its source at the time of DataDawn's most recent extraction. Government agencies update their data on different schedules. The database is not a real-time feed.

Correlation is not causation

Cross-referencing datasets enables powerful queries (e.g., stock trades within 30 days of floor speeches on related topics), but temporal or thematic proximity does not establish a causal or improper relationship. DataDawn provides the data; interpretation is the user's responsibility.
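The example query can be sketched as a date-window match between two record lists. Field names and rows are hypothetical, and a match only means that two events fall close together in time; it says nothing about cause or intent.

```python
from datetime import date


def trades_near_speeches(trades, speeches, window_days=30):
    """Pair each trade with same-member speeches within window_days of it.

    Returns (ticker, topic, gap) tuples, where gap is trade date minus
    speech date in days (negative when the trade came first).
    """
    matches = []
    for trade in trades:
        for speech in speeches:
            if trade["bioguide_id"] != speech["bioguide_id"]:
                continue
            gap = (trade["trade_date"] - speech["speech_date"]).days
            if abs(gap) <= window_days:
                matches.append((trade["ticker"], speech["topic"], gap))
    return matches


# Invented records for illustration only.
sample_trades = [
    {"bioguide_id": "A000001", "ticker": "XYZ", "trade_date": date(2024, 3, 15)},
]
sample_speeches = [
    {"bioguide_id": "A000001", "topic": "energy", "speech_date": date(2024, 3, 1)},
    {"bioguide_id": "A000001", "topic": "energy", "speech_date": date(2024, 1, 1)},
    {"bioguide_id": "B000002", "topic": "energy", "speech_date": date(2024, 3, 10)},
]
```

The nested loop is fine for small samples; at database scale the same window would normally be expressed as a SQL join on bioguide_id with a date-difference condition.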

Update schedule

As of April 2026, DataDawn runs on an automated update pipeline monitored end-to-end by Healthchecks.io. Before April 2026 the platform was rebuilt manually; the move to automated updates with health monitoring is a significant operational maturity upgrade.

Job | Schedule | Coverage
OpenRegs weekly | Sat 04:00 CT | Federal Register, eCFR, Congressional Record, stock trades, USAspending, legislation, votes, nominations, CRS → rebuild → deploy
OpenRegs monthly | 15th 20:00 CT | Heavier sources: Regulations.gov dockets/comments, lobbying, FEC, OIRA, FARA, GAO, IG → calls weekly pipeline
990 monthly | 1st 04:30 CT | IRS e-file download → parse → build → deploy
Daily open comments | 06:30 UTC | Open-for-comment snapshot, no rebuild

All 10 automated cron jobs (four pipelines above plus six backup/compliance crons) are monitored by Healthchecks.io with per-job grace periods; maintainer is alerted on failure. Overlap between weekly and monthly runs is prevented by flock. Post-build full-text-search verification aborts the pipeline if any expected FTS table is missing, empty, or undersized. The deploy step uses an atomic-rename pattern (upload to ${REMOTE_DB}.new, then mv) to prevent the rsync --partial interrupt corruption that destroyed a live database on April 11, 2026. A separate validate_dates() QC check catches silent schema-drift bugs on every build.
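The atomic-rename pattern can be illustrated with a local-filesystem analogue. This is a sketch of the general technique, not the production deploy script (which uploads to a remote host): the new database is written in full to a ".new" sibling, and only then swapped into place in a single rename, so readers see either the old file or the complete new one.

```python
import os
import shutil


def atomic_publish(build_path, live_path):
    """Replace live_path with the contents of build_path via a staged rename."""
    staging_path = live_path + ".new"
    # Step 1: copy the full build next to the live file. An interruption here
    # leaves only a stray .new file; the live database is untouched.
    shutil.copyfile(build_path, staging_path)
    # Step 2: flush the staged copy to disk before it becomes visible.
    with open(staging_path, "r+b") as fh:
        os.fsync(fh.fileno())
    # Step 3: os.replace renames over the destination in one step when both
    # paths are on the same filesystem.
    os.replace(staging_path, live_path)
    return live_path
```

Keeping the staging file in the same directory as the live file matters: a rename is only atomic within a single filesystem.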

Individual government source APIs update at their own cadences: the Federal Register updates daily, FEC bulk data refreshes quarterly, and IRS 990 e-files arrive in batches on the IRS's own schedule. DataDawn documents the last successful pull date for each dataset on its Datasette explore pages.

Independence statement

DataDawn is an independent project with no institutional affiliations. It receives no funding from any nonprofit, foundation, government agency, or organization represented in its datasets. All data is sourced exclusively from public records filed with federal government agencies.

DataDawn does not endorse, evaluate, or rank any organization, legislator, or entity. The platform provides raw data and search tools. Interpretation and analysis are the responsibility of the user.

All source code, extraction pipelines, and database schemas are published on GitHub under a CC0 1.0 Universal (public domain) license.

Corrections and feedback

If you find a data quality issue or parsing error, or have questions about the methodology, you can reach DataDawn at [email protected] or via the GitHub repository.