index-creation.md
February 25, 2026
Below is a description of the process used to create the Federal Website Index, which is then used as the target URL list for the Site Scanning engine to scan. The actual code that does this is here.
- The Federal Website Index is created by combining and processing a number of individual source datasets. The list of datasets is managed here (in the `fetchAllSourceListData` function), and the URLs for these datasets are managed here.
- The specified source datasets are copied and imported into memory. In the process, each URL is cleaned by removing any protocols, `www.`, and paths. Snapshots of each individual dataset are stored here.
- The various source datasets are combined. A snapshot of this combined list is stored here.
- The combined list of websites is then deduplicated. A snapshot of the deduplicated list is stored here.
- The base domain and top-level domain of each website are extracted and added as new fields. A snapshot is stored here.
- Agency, bureau, and branch information is added to each website by pulling in the relevant information for its base domain from these three files (snapshots of which are here, here, and here).
- Agency and bureau information from the list of websites in OMB's 21st Century IDEA engagement is then used to override the information from the previous step, since it should be more accurate. A snapshot of the end result is here.
- The list of websites is then compared to two ignore files (the `begins with` list and the `contains` list). Note that the `contains` list actually requires that the specified string have non-alphanumeric characters both before and after it. The purpose of this is to try to identify non-public websites, which are then labeled `Filtered = TRUE`.
- The same three domain lists above are then used to limit the list of websites to those with currently registered federal domains. The result is snapshotted here, and the list of sites that are removed can be found here.
- Analytics data is then added in by pulling in pageviews and visits from this copy of this dataset. Note that analytics results for associated `www.` and non-`www.` domains are combined in the process to ensure a more accurate count. A snapshot is then stored here.
- Websites that have consistently failed DNS resolution and are very likely no longer alive are added to this file, which is then used to provide a list of websites that should be removed as suspected dead sites. A snapshot is here, and a list of the sites removed can be found here.
- The dataset's columns are then reordered to align with what the Site Scanning end data looks like. A snapshot is here.
- As a final step, the dataset is alphabetized by `base_domain` then `initial_url`. The result is stored here as the completed website index.
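
Several of the steps above (URL cleaning, deduplication, the ignore-list check, combining `www.` and non-`www.` analytics, and the final sort) can be illustrated with a minimal Python sketch. This is not the project's actual code: every function name, the lowercase `filtered` key, and the sample ignore-list entries used with it are hypothetical, and the base-domain helper naively keeps the last two labels of the hostname, which is only a rough stand-in for the real extraction.

```python
import re
from urllib.parse import urlsplit


def clean_url(raw: str) -> str:
    """Strip the protocol, a leading 'www.', and any path from a URL."""
    raw = raw.strip().lower()
    # urlsplit only fills in the host when a scheme or '//' is present.
    if "//" not in raw:
        raw = "//" + raw
    host = urlsplit(raw).hostname or ""
    return host[4:] if host.startswith("www.") else host


def dedupe(urls):
    """Drop repeated entries while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def naive_base_domain(host: str) -> str:
    """ASSUMPTION: treat the last two labels as the base domain."""
    return ".".join(host.split(".")[-2:])


def is_filtered(url, begins_with, contains):
    """Apply the two ignore lists to a cleaned URL."""
    if any(url.startswith(prefix) for prefix in begins_with):
        return True
    for term in contains:
        # The term must have a non-alphanumeric character on each side.
        if re.search(rf"[^a-z0-9]{re.escape(term)}[^a-z0-9]", url):
            return True
    return False


def combine_analytics(rows):
    """Sum pageviews so 'www.' and non-'www.' hosts share one total."""
    totals = {}
    for host, pageviews in rows:
        key = clean_url(host)
        totals[key] = totals.get(key, 0) + pageviews
    return totals


def build_index(sources, begins_with, contains):
    """Combine, clean, deduplicate, label, and sort the source lists."""
    combined = [clean_url(u) for source in sources for u in source]
    records = [
        {
            "initial_url": url,
            "base_domain": naive_base_domain(url),
            "filtered": is_filtered(url, begins_with, contains),
        }
        for url in dedupe(combined)
    ]
    return sorted(records, key=lambda r: (r["base_domain"], r["initial_url"]))
```

Calling `build_index` with a list of source lists plus the two ignore lists returns one record per unique cleaned hostname, ordered by `base_domain` and then `initial_url`. A real pipeline would also need the agency lookups, the federal-domain check, and the dead-site removal described above, none of which are sketched here.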