candidate-scans.md

March 19, 2026 · View on GitHub

This is a list of individual scans that could be run with the Site Scanning program. This is a brainstorming page and all ideas are welcome.

The list of scans that have already been built and are active can be found here. Feel free to suggest further ideas though by editing this page, filing an issue or emailing us at site-scanning@gsa.gov.

Scan status/diagnostics for Site Scanning engine

  • time page takes to load (link for ideas). E.g. scans time out after 60 seconds. Is it possible to capture what amount of time shorter than that it took?

Functional Scans for TTS Programs

  • For data.gov:
  • For the Digital Analytics Program:
    • Whether migrated to GA4.
    • Detect CORS policy.
    • Detect subagency filter.
    • Detect misconfigurations of the DAP snippet?
    • Detect and analyze GTM implementations.
    • Note if it is registered in the analytics.usa.gov export.
    • Add in GA traffic data.
  • For the Feedback Analytics program:
    • Priority - page load time, other performance data.
    • Monitor the adoption of Touchpoints.
    • Presence of forms.
      • Analyze the source code of each page that loads in order to detect forms and/or PRA numbers. (a lot more technical detail on ways to do this here).
  • For the US Digital Registry:
    • Maybe some way to validate or authenticate official government social media accounts?
    • Find new social media accounts that need to be added to the registry by looking at all links in the page.
    • List of all hyperlinks on the page.
    • List of all domains that have hyperlinks on the page.
    • List of all email addresses (both text and in mailto: links) in the page
    • A more radical idea - take over what it does by consuming social_media.json files. Relevant mainly if there's ever a future need to deprecate it's primary engine.
  • For search.gov:
    • Implement the search scan that Carter built for them internally.
    • Scan for CMS (site-inspector, builtwith API).
    • Have our scan data serve a domain owner to prepare and know that they are ready to move websites or make another major change.
    • Or, have a series of datapoints that get to what they look for each time. E.g. Is domain of entries the same as the domain.
    • Scan for last mod date (of the sitemap/robot files?).
    • Scan for and analyze the search.gov snippet?
    • Search for other major site scanning implementations.
  • For USWDS:
    • Try to highlight bad, resolving urls to the same end, e.g. www1 and www.
    • Ingest a YML file of their known users to indicate human-verified.
    • Scan just for the banner itself
  • For login.gov:
    • Presence of a sign in form.
    • Which signup form solution is used.
    • Whether the form employs 2FA.
  • For TTS:
    • Which websites use which TTS services.
      • E.g. for federalist, "curl --head https://" and examining the result for the line that says "X-Server: Federalist".
    • High level numbers - e.g. what percent of each agency's sites have implemented DAP; avg. USWDS total; etc.
    • Incorporate DAP analytics.
  • For Cloud.gov Pages
    • Whether a site is hosted on cloud.gov pages or not.
    • Continuous monitoring for substantial homepage changes, e.g. defacement or remote javascript loading.

Business Intelligence for TTS Programs

2025 new ideas:

  • How to associate subdomains together as veritably unified with the primary website
  • buckets of traffic by domain, by website

21st century idea - MVP scan possibilities

  • content length header, as a proxy for hashing the page

  • a11y

  • Same page title?

  • Has some search capacity

  • More ssl

  • library of customer service, analytics third party services

    • "capturing data layer values in scan data; might be possible to sniff out targeting/sampling parameters for sites that e.g. only fire a feedback survey script every nth pageLoad."
  • mobile header?

  • count of pdfs on a site - e.g. this query

  • content similarity (via comparison of page content [prototype via Bard], page titles, page desciptions, keywords, and/or page headers)

  • Other meta tags like these

  • Required URLs (list here) - aka slash somethings

  • Other language support aspects, as perhaps revealed by content on https://www.lep.gov/digital-services-and-websites

  • length of sentences

  • character strings of self-hosted feedback tools (e.g. census.gov homepage)

  • [Also, if they required implementation reports being in a machine readable, fixed location - we could analyze that report]

  • Maybe do the same scans on one page linked from the homepage

Best Practices

  • Plain language, vis-a-vis ISO Standard - https://dhs-gov.libguides.com/ld.php?content_id=72424017.
  • AMP compliant?
  • offers structured data - https://search.google.com/test/rich-results/
  • stuff in a .well-known folder
  • just the presence of the /.well-known/ folder? (I don't think that'll work but it's interesting to ponder).
  • Lots of ideas from https://www.screamingfrog.co.uk/seo-spider/
  • More with Lighthouse (list here).
  • Core Web Vitals (interest by leadership and relevance to Google search results [note]).
  • Performance (perhaps addressed in ^^^).
  • Mobile-friendliness (perhaps addressed in ^^^) - also, note possible IDEA Act dashboard needs.
  • A11y: Note that the legislative branch has a specific need, spelled out here. Note that there's a similar effort at https://www.feda11y.com/, run by former Pulse engineer Scott McAllister.
    • scan image alt text for filenames or urls, a common bad practice.
  • The entire redirect path.
  • Scan for problematic or un-inclusive language.
  • Scan for plainlanguage (there's a variety of potentially scannable items in https://www.plainlanguage.gov/guidelines).
  • Scan for a11y, for tts programs as part of their before-you-ship (there's interest).

Technical Analysis

  • Use of CDNs.
  • Other things we could look up from DNS lookup that we're already doing.
  • wappalyzer

Security

Privacy

  • Referrer policy (affects what is sent to 3rd parties).

Policy Compliance

  • /foia pages:
    • Presence and characteristics of the pages.
    • Are there possibly standards around XML files at certain locations?
  • Privacy officers:
    • more of the material from these reports.
    • Presence of PII.
    • Presence of forms.
    • Form fields (perhaps fed from a manually generated list of URLS).
    • For any of these scans that detect a machine readable file, copy and store in perpetuity the files themselves, so that powers that be can track trends.
  • /socialmediaTOS - Per this and this.
  • More from fisma reports, as seen in these questions.
  • Required links - About page = “About us OR About”; Contact page = “Contact us”; Accessibility = “Accessibility”; budget/perf = “Budget and Performance”; Equal employment = “No FEAR Act Data”; FOIA = “FOIA or Freedom of Information Act”; Usa.gov link; Privacy = “Privacy Policy”; IG = “Office of Inspector General”; ?presence of a search bar by searching for the word 'search' (or in a button, or text field?)?
    • Other ideas - "Contact" “Email us,” “Get in touch,” “Get Support,” etc.

Other

  • Tools like these: 1, 2, 3, 4, 5
  • Look at popular alternatives like SiteImprove, Screaming Frog
  • Federal Website Standards
  • More tests that Google tests for - https://support.google.com/webmasters/topic/9456557?hl=en&ref_topic=4558844&sjid=12716533698097656242-NA
  • Average age of content, via lastmods of sitemap.xml files
  • Number of words on a page.
  • Number of words on a page when viewed as via mobile user agent? Compared ^^^?
  • Blacklight scans.
  • Links to translated sites or other language options
  • Whether a subdomain is a website or a system.
  • 3rd party services/performance/? for pages like this.
  • 404 pages - Try to use the detection of 404 pages to discover broken links or websites. For instance, one idea could be to take ever link found in a page and then check links for their status.
  • Well known services hosted by webservers: more info here and here.
  • /security.txt ?
  • Bad SEO/website management in the form of duplicative URLs to the same webpage, e.g. with and without and www., or 2 domains that are CNAMED to the same thing.
  • Data for GSA's privacy dashboard - https://github.com/18f/privacy-dashboard.
  • Social media pages: Uncover agency's social media pages for inclusion in the U.S. Digital Registry.
  • HTML Metadata: Uncover HTML metadata, which could help to populate sub-domain scanning capabilities.
  • Others from https://policy.cio.gov/.
  • Contact information for a website (beyond perhaps what can be found on the dotgov whois page).
  • Find phone numbers and email addresses on a page.
  • Similar to services like siteimprove (used by gsa.gov team).
  • More from thinking about involved javascript.
  • Others possibly could be found in here.
  • Any way to derive what the top tasks for sites might be?
  • A list of all links on the page, to help with sunsetting websites.
  • List of links (or text) in the menu? list of menu headers? list of links (or hyperlinked text) in the footer?
  • count of h1s, h2s, h3s
  • Links that have words 'about', 'contact', etc. in them.
  • Readability. e.g. https://www.webfx.com/tools/read-able/?url=https://gsa.gov
  • More from schema.org ?
  • More from this market research.
  • A replacement for statuspage.
  • Misc. ideas from Jay - CDNs; js Libraries - e.g. what % of the gov us using a (outdated) jQuery; 3rd Party Ads; % of sites that are PHP & Drupal and vulnerable to the drupalgeddon; % of sites that are a CMS (and it would be interesting to know which ones are not); % of sites w/ PDF forms (and tie to Rokus data + DAP usage).
  • What % of federal .gov websites are hosted on Code.gov?" and/or What % of websites hosted on code.gov that utilize the USWDS (per code.gov) are correctly scanned by the web scanner?
  • Finding new .org, .com, .net federal websites to add to https://github.com/GSA/govt-urls.
  • Add a layer to the data to reflect if the result has been human verified (perhaps with a date).
  • A fetcher scan: to consume some foundational data that can be integrated into the other scans.
  • Finding geospatial tools like https://resources.hud.gov/.
  • Add in a layer to offer a real, simple and usable nested list of federal agencies, a la https://github.com/unitedstates/orgchart/tree/master/wikipedia.
  • Add a layer of metadata if an agency has a machine-readable websites.yml file at a fixed location that would enable one to filter for that agencies list of websites.
  • Is www. required? Does including www. break the link? (details, examples, and a methodology here)

Other Scopes to Consider Scanning Against

  • Run scans against second level domains, subdomains, and top 500 (or) 1000 pages in DAP. Also, consider CFO act delineations, agency delineations, what else?
  • Create individual add-on lists for individual scans that would allow people to append further urls to scan to the target list for that scan.
  • Reach out to .mil again to see if they are interested. Note this list.
  • Integrating the non-.gov and non-.mil websites.