Methodology

How we score, verify, and trust

c0mm0.com is independent of any national portal: we harvest, verify, and score what every portal publishes. This page explains exactly how that works, so the signals on every entry page are auditable.

1. Officialness score

Every candidate API or dataset is scored on a 0–100 "officialness" scale before it can enter the public catalog. The score is the weighted sum of five proof signals:

Signal	Weight	What it checks
Domain verification	40	The candidate URL resolves under an institution's official_domains (e.g. .insee.fr, .data.gov.uk).
Portal backlink	25	A known portal links to this URL (e.g. data.europa.eu lists it as a dataset of this publisher).
GitHub org match	20	Source code lives under the institution's verified GitHub org.
Reciprocal link	10	The candidate page links back to the institution's official site.
Policy presence	5	A privacy policy / terms / contact under an official domain.

Entries scoring ≥ 80 with a known institution are auto-approved. Entries scoring 40–79 are quarantined for human review. Candidates scoring below 40, or hosted under a blocklisted domain, are rejected.
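
As a rough illustration, the scoring and triage rules above could be sketched as follows. The weights and thresholds come from this page; the signal keys, the function names, the assumption that each signal is a simple pass/fail, and the handling of a score ≥ 80 without a known institution (sent to review here) are hypothetical choices made for the sketch, not a description of the production code.

```python
# Weights copied from the table above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "domain_verification": 40,
    "portal_backlink": 25,
    "github_org_match": 20,
    "reciprocal_link": 10,
    "policy_presence": 5,
}


def officialness_score(signals: dict) -> int:
    """Sum the weights of every signal that passed (assumes pass/fail signals)."""
    return sum(weight for name, weight in WEIGHTS.items() if signals.get(name, False))


def triage(score: int, known_institution: bool, blocklisted: bool) -> str:
    """Map a score to a catalog decision using the thresholds stated above."""
    if blocklisted or score < 40:
        return "rejected"
    if score >= 80 and known_institution:
        return "auto_approved"
    # Scores of 40 to 79 go to human review. Assumption: a score of 80 or more
    # without a known institution is also reviewed rather than approved.
    return "quarantined"


signals = {"domain_verification": True, "portal_backlink": True, "github_org_match": True}
score = officialness_score(signals)
decision = triage(score, known_institution=True, blocklisted=False)
```

Here a candidate that passes domain verification, the portal backlink check, and the GitHub org match scores 40 + 25 + 20 = 85 and is auto-approved when its institution is known.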

2. Continuous verification

Every active entry is re-checked on a rolling 6-hour cadence — APIs by hitting their declared base or health URL with the appropriate method, datasets by re-checking each declared distribution. Three consecutive failures move an entry from VERIFIED to FAILING; a clean check restores VERIFIED. We treat 401/403 as "up, requires auth" rather than "down" — many official APIs gate behind API keys but stay alive.
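
The status transitions described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, a missing response (timeout, DNS failure) is modelled as `None`, and treating any 2xx or 3xx status as "up" is an assumption, since this page only specifies the 401/403 rule.

```python
from dataclasses import dataclass
from typing import Optional

FAILURE_THRESHOLD = 3  # consecutive failures before VERIFIED becomes FAILING


def is_up(http_status: Optional[int]) -> bool:
    """Classify one check result; None means no HTTP response was received."""
    if http_status is None:
        return False
    if http_status in (401, 403):
        return True  # "up, requires auth"
    return 200 <= http_status < 400  # assumption: 2xx and 3xx count as up


@dataclass
class EntryState:
    status: str = "VERIFIED"
    consecutive_failures: int = 0

    def record(self, http_status: Optional[int]) -> str:
        """Apply one check result and return the resulting status."""
        if is_up(http_status):
            self.consecutive_failures = 0
            self.status = "VERIFIED"  # a clean check restores VERIFIED
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.status = "FAILING"
        return self.status


state = EntryState()
history = [state.record(s) for s in (500, None, 503, 403)]
```

In this run the entry stays VERIFIED through the first two failures, flips to FAILING on the third, and returns to VERIFIED when the 403 arrives.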

Verification evidence is recorded as rows in the checks table: each row carries the timestamp, HTTP status, latency, and a short error snippet on failure. The result is a per-entry uptime history, exposed at /api/v1/entries/{slug}/checks.

3. Field-level provenance

Every visible value (name, description, license, base URL, …) carries a provenance row stating which extractor produced it, from which source URL, with what confidence, and a 1 KB evidence snippet. The cascade runs in priority order — OpenAPI specs > repo metadata > HTML docs > heuristics — and we record the winning extractor, not just the value. See any entry's Provenance tab.
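
One way to picture the cascade is as a pick-the-best-candidate function over provenance rows. The extractor labels, field names, and the tie-break on confidence within one extractor tier are assumptions made for illustration; only the priority order and the 1 KB snippet size are taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Priority order from the text above; the string labels are hypothetical.
PRIORITY = ["openapi", "repo_metadata", "html_docs", "heuristics"]
SNIPPET_LIMIT = 1024  # bytes, matching the 1 KB evidence snippet


@dataclass(frozen=True)
class Provenance:
    field: str
    value: str
    extractor: str
    source_url: str
    confidence: float  # assumed to lie in the range 0.0 to 1.0
    evidence: str


def clip_evidence(text: str) -> str:
    """Trim evidence to at most 1 KB of UTF-8 without splitting a character."""
    return text.encode("utf-8")[:SNIPPET_LIMIT].decode("utf-8", errors="ignore")


def resolve_field(field: str, candidates: List[Provenance]) -> Optional[Provenance]:
    """Return the winning provenance row for a field, or None if nothing matched."""
    relevant = [c for c in candidates if c.field == field and c.extractor in PRIORITY]
    if not relevant:
        return None
    # Higher-priority extractor wins; confidence only breaks ties within a tier.
    return min(relevant, key=lambda c: (PRIORITY.index(c.extractor), -c.confidence))


candidates = [
    Provenance("name", "A", "heuristics", "https://example.org/docs", 0.9, "..."),
    Provenance("name", "B", "openapi", "https://example.org/openapi.json", 0.6, "..."),
]
winner = resolve_field("name", candidates)
```

Note that the OpenAPI-derived value wins even though the heuristic reported higher confidence, because extractor priority is evaluated first in this sketch.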

4. License normalization

Open-data licenses are a thicket: CC variants, OGL-UK, Licence Ouverte, IODL, DL-DE, ODbL, EUPL, and assorted national variants. We normalize raw license strings into a canonical id by pattern-matching against a reference table of 19 licenses, preserving the original string for audit. Each id carries boolean flags for is open, is redistributable, and requires attribution. Unrecognised strings stay raw rather than being silently misclassified.
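
The normalization step might look roughly like the sketch below. It uses three illustrative rows rather than the full 19-license table, and the regular expressions, the SPDX-style ids, and the function names are assumptions chosen for the example; the actual reference table and its patterns are not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class License:
    id: str
    is_open: bool
    is_redistributable: bool
    requires_attribution: bool


# Illustrative subset only; the patterns and ids are hypothetical.
REFERENCE = [
    (re.compile(r"\bcc[\s-]?0\b", re.I),
     License("CC0-1.0", True, True, False)),
    (re.compile(r"\bcc[\s-]?by[\s-]?4\.0\b", re.I),
     License("CC-BY-4.0", True, True, True)),
    (re.compile(r"open government licen[cs]e\s+v?3(\.0)?\b", re.I),
     License("OGL-UK-3.0", True, True, True)),
]


def normalize(raw: str) -> Tuple[str, Optional[License]]:
    """Return the untouched raw string plus a canonical license, if one matches."""
    for pattern, canonical in REFERENCE:
        if pattern.search(raw):
            return raw, canonical
    return raw, None  # unrecognised: stays raw, no guess is made


raw, matched = normalize("Open Government Licence v3.0")
_, unknown = normalize("Custom municipal terms of reuse")
```

Returning the raw string alongside the match keeps the original available for audit, and returning `None` for unknown inputs avoids forcing a string into the nearest category.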

5. Standards we speak

DCAT-AP 3.0.1 + the W3C DCAT 3 vocabulary: the catalog is served as Turtle (/api/v1/catalog/dcat.ttl) and JSON-LD (/api/v1/catalog/dcat.jsonld), harvestable by data.europa.eu and national portals.
schema.org/Dataset: every entry page emits a JSON-LD block readable by Google Dataset Search.
FAIR Signposting: each entry page carries a rel="describedby" link to its machine-readable record.
OpenAPI 3.1: we publish our own API surface at /openapi.json.
llms.txt + .well-known/ai-plugin.json: agent-discoverability for AI clients that prefer the convention over web search.
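
To make the schema.org/Dataset and Signposting items concrete, here is a minimal sketch of the markup an entry page could emit. The `entry` dictionary, its keys, the example.org URLs, and the helper names are hypothetical; `name`, `description`, `url`, and `license` are standard schema.org Dataset properties, but the exact set of properties emitted on real entry pages is not specified on this page.

```python
import html
import json


def dataset_jsonld(entry: dict) -> str:
    """Serialize a minimal schema.org/Dataset description as JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": entry["name"],
        "description": entry["description"],
        "url": entry["url"],
        "license": entry["license_url"],
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)


def describedby_link(record_url: str) -> str:
    """Build a Signposting-style link element pointing at the machine-readable record."""
    return '<link rel="describedby" href="{}">'.format(html.escape(record_url, quote=True))


entry = {
    "name": "Example population dataset",
    "description": "Hypothetical entry used only for illustration.",
    "url": "https://example.org/entries/example-population",
    "license_url": "https://creativecommons.org/licenses/by/4.0/",
}
jsonld = dataset_jsonld(entry)
link = describedby_link("https://example.org/entries/example-population.jsonld")
```

On a real page the JSON-LD string would sit inside a `<script type="application/ld+json">` element in the document head, next to the link element.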

6. Independence

We are not a national portal. We do not represent any institution we list. Every signal you see is reproducible from primary sources — institutions, portals, distributions — and the methodology is the same regardless of who publishes. If you spot a wrong call, the submission flow takes corrections.

Last reviewed: 2026-06. Methodology revisions are tracked in the repository changelog.