socid_extractor

May 9, 2026

Turn any public profile page into a structured account record — usernames, display names, bios, avatars, locations, joined-at dates, follower counts, external links, and the stable internal identifiers that uniquely pin an account across renames, redesigns, and deletions.

socid_extractor parses HTML pages and API responses from 130+ platforms and returns a flat, machine-readable dictionary of account fields. No API keys required, no headless browser — just a single function call on response text.

Why it's useful

  • Stable cross-service IDs. Get GAIA ID (Google), Facebook UID, Yandex Public ID, Instagram pk, and dozens more — values that survive username changes and let you correlate accounts across leaks, archives, and search-engine indices.
  • One uniform interface. Same extract() call for Instagram, GitHub, VK, Reddit, Substack, Bluesky, TikTok — no per-platform glue code on your side.
  • Field ontology. Normalized field names across platforms (username, fullname, created_at, is_verified, …) so downstream pipelines don't need 130 mappings.
  • Battle-tested. Powers Maigret and a number of other OSINT tools.
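Because field names are normalized, downstream code can consume records from any platform through one path. A minimal sketch of such a consumer, assuming an `extract()`-style dict as input (the `Account` type and the `UID_FIELDS` subset below are illustrative, not part of the library):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Account:
    # Hypothetical downstream record type -- not part of socid_extractor.
    username: Optional[str] = None
    fullname: Optional[str] = None
    uid: Optional[str] = None  # the platform's stable internal identifier

# Illustrative subset of stable-ID field names a record might carry.
UID_FIELDS = ('patreon_id', 'gaia_id', 'fb_uid', 'yandex_public_id')

def normalize(record: dict) -> Account:
    """Collapse an extract()-style dict into one uniform Account."""
    uid = next((record[f] for f in UID_FIELDS if f in record), None)
    return Account(
        username=record.get('username') or record.get('patreon_username'),
        fullname=record.get('fullname'),
        uid=uid,
    )

acc = normalize({'patreon_id': '33913189',
                 'patreon_username': 'annetlovart',
                 'fullname': 'Annet Lovart'})
print(acc.uid, acc.username)  # 33913189 annetlovart
```

The same `normalize()` call then works whether the record came from Patreon, GitHub, or VK, which is the point of the shared ontology.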

Installation

Python: 3.10+.

pip install socid-extractor

For a clean CLI install on a workstation:

pipx install socid-extractor

The latest development version:

pip install -U git+https://github.com/soxoj/socid-extractor.git

Quick start

As a CLI:

$ socid_extractor --url https://www.deviantart.com/muse1908
country: France
created_at: 2005-06-16 18:17:41
gender: female
username: Muse1908
website: www.patreon.com/musemercier
links: ['https://www.facebook.com/musemercier', 'https://www.instagram.com/muse.mercier/', 'https://www.patreon.com/musemercier']
tagline: Nothing worth having is easy...

As a Python library:

import requests
import socid_extractor

r = requests.get('https://www.patreon.com/annetlovart')
print(socid_extractor.extract(r.text))
# {'patreon_id': '33913189', 'patreon_username': 'annetlovart',
#  'fullname': 'Annet Lovart',
#  'links': "['https://www.facebook.com/322598031832479', ...]"}

Tip — batch runs: pass --skip-fetch-if-no-url-hint to skip the HTTP request when the URL doesn't match any known site hint (faster, but may skip generic engines such as forum templates):

$ socid_extractor --url https://example.com/foo --skip-fetch-if-no-url-hint
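The same pre-filtering idea can be applied in a script before fetching at all. A rough sketch, assuming you maintain your own hint set (the `KNOWN_HOSTS` values below are illustrative; the real hint list lives in the library's schemes):

```python
from urllib.parse import urlparse

# Illustrative subset of known host hints -- not the library's real list.
KNOWN_HOSTS = {'www.patreon.com', 'www.deviantart.com', 'github.com'}

def worth_fetching(url: str) -> bool:
    """Mimic --skip-fetch-if-no-url-hint: only fetch URLs whose host
    matches a known site hint, skipping the rest without an HTTP call."""
    return urlparse(url).netloc in KNOWN_HOSTS

urls = ['https://www.patreon.com/annetlovart', 'https://example.com/foo']
print([u for u in urls if worth_fetching(u)])
# ['https://www.patreon.com/annetlovart']
```

Note the trade-off is the same as with the CLI flag: hosts outside the hint set are skipped even if a generic engine (e.g. a Discourse forum) would have matched.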

Supported sites

130+ schemes — see METHODS.md for the full list.

A non-exhaustive sample:

  • Major networks: Facebook (user & group pages), Instagram, VK.com, OK.ru, Reddit, TikTok, Bluesky, Tumblr, Flickr
  • Google ecosystem: Google docs/maps contributions (cookies required), Google Play, YouTube
  • Mail.ru: my.mail.ru user mainpage, photo, video
  • Dev / writing platforms: GitHub, Stack Overflow (HTML + API), LeetCode, Hashnode, Medium, Substack, Paragraph, WordPress.org, Virgool
  • Forums (universal detectors): Discourse, MediaWiki / Fandom wikis, Mastodon
  • Niche / vertical: Chess.com, Roblox, MyAnimeList, Scratch, Wikipedia, DailyMotion, SlideShare, Weebly, Calendly, Amazon Author, Boosty, Warpcast (Farcaster), Fragment (TON/Telegram), Rarible, CSSBattle, lnk.bio, Spatial, TwitchTracker, Max (max.ru)

…and many others.

For data examples, see tests/test_e2e.py; for the parsing logic, see socid_extractor/schemes.py; for the field ontology, see FIELDS.md.

Use cases

  • Pivot from a profile to everything you can see. One call returns the visible info plus the hidden internal IDs the platform uses behind the scenes. Background reading: Week in OSINT — Getting a grasp on Google IDs.
  • Track accounts across renames, redesigns, and deletions. Stable IDs (GAIA, FB UID, Yandex Public ID, Instagram pk, …) let you re-identify the same person even when every visible field has changed. Background: Aware Online — User IDs in social-media investigations.
  • Search by cross-service UID. Once you have a stable identifier you can pivot into:
    • SQL / leaked databases (forum dumps, breach data) where the UID is the join key,
    • Google / Yandex / archive.org indices that captured URLs containing the UID.
  • Feed downstream OSINT tooling. A normalized record is much easier to ingest than per-site scrapers — used by Maigret and similar tools for enrichment.
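The UID-pivot step can be sketched in a few lines. The search-URL templates below are illustrative examples of the pivots described above, not endpoints the library itself provides:

```python
# Given a stable identifier pulled from an extract() record, build
# pivot queries for search engines and archives. Templates are
# illustrative -- adapt them to whatever sources you actually use.
def pivot_queries(uid: str) -> dict:
    return {
        'google': f'https://www.google.com/search?q="{uid}"',
        'yandex': f'https://yandex.com/search/?text="{uid}"',
        'wayback': f'https://web.archive.org/web/*/{uid}',
    }

q = pivot_queries('33913189')  # the patreon_id from the Quick start
print(q['wayback'])  # https://web.archive.org/web/*/33913189
```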

SOWEL classification

Maps to the following SOWEL techniques:

Tools using socid_extractor

  • Maigret — powerful namechecker that generates a report with all available info from accounts found across 3000+ sites.
  • TheScrapper — scrape emails, phone numbers, and social-media accounts from a website.
  • InfoHunter — open-source OSINT tool to search, collect, and analyze information online.
  • YaSeeker — gather all available information about a Yandex account by login/email.
  • Marple — scrape search-engine results for a given username.

Testing

python3 -m pytest tests/test_e2e.py -n 10 -k 'not cookies' -m 'not github_failed and not rate_limited'

Every new scheme must have an e2e test in tests/test_e2e.py hitting a real URL/API. Unit tests with inline fixtures (tests/test_socid_improvements.py) are also required but do not replace e2e coverage. See docs/testing-and-ci.md for details.

Developer documentation (architecture, modules, CI) lives in docs/.

Contributing

See the contributing guide if you want to add a new scheme or fix anything.