dît

May 18, 2026

dît ("found" in Kurdish) uses machine learning to identify the type of an HTML page, the forms on it, and each form's fields.

It classifies pages (login, error, landing, blog, etc.), detects whether a form is a login, search, registration, password recovery, contact, mailing list, order form, or something else, and classifies each field (username, password, email, search query, etc.). Zero external ML dependencies.

Install

go install github.com/happyhackingspace/dit/cmd/dit@latest

Usage

As a Library

import (
    "fmt"

    "github.com/happyhackingspace/dit"
)

// Load the classifier. On first call, if no model.json is found in the current
// directory (walked up to the nearest go.mod) or in ~/.dit/, the pretrained
// model is downloaded from Hugging Face to ~/.dit/model.json (~93MB, one-time)
// and reused on subsequent calls. Errors are discarded here for brevity;
// check them in real code.
c, _ := dit.New()

// Or load an explicit file instead (no network, no search).
c, _ = dit.Load("path/to/model.json")

// Classify page type
page, _ := c.ExtractPageType(htmlString)
fmt.Println(page.Type)  // "login"
fmt.Println(page.Forms) // form classifications included

// Classify forms in HTML
results, _ := c.ExtractForms(htmlString)
for _, r := range results {
    fmt.Println(r.Type)   // "login"
    fmt.Println(r.Fields) // {"username": "username or email", "password": "password"}
}

// With probabilities
pageProba, _ := c.ExtractPageTypeProba(htmlString, 0.05)
formProba, _ := c.ExtractFormsProba(htmlString, 0.05)

// Train a new model (reuses the c variable declared above)
c, _ = dit.Train("data/", &dit.TrainConfig{Verbose: true})
c.Save("model.json")

// Evaluate via cross-validation
result, _ := dit.Evaluate("data/", &dit.EvalConfig{Folds: 10})
fmt.Printf("Form accuracy: %.1f%%\n", result.FormAccuracy*100)
fmt.Printf("Page accuracy: %.1f%%\n", result.PageAccuracy*100)
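
The lookup order described in the dit.New comment above can be sketched with the standard library alone. This is an illustration of that description, not the library's actual implementation: the function name findModel, the injected exists callback, and the reading that every directory from the start up to the one containing go.mod is checked are all assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findModel returns the first model.json found by the search order described
// in the comment above. exists reports whether a path is present; it is a
// parameter so the logic can be exercised without touching the file system.
func findModel(startDir, homeDir string, exists func(string) bool) (string, bool) {
	dir := startDir
	for {
		candidate := filepath.Join(dir, "model.json")
		if exists(candidate) {
			return candidate, true
		}
		// Assumption: the upward walk stops at the module root (go.mod).
		if exists(filepath.Join(dir, "go.mod")) {
			break
		}
		parent := filepath.Dir(dir)
		if parent == dir { // reached the top of the path
			break
		}
		dir = parent
	}
	cached := filepath.Join(homeDir, ".dit", "model.json")
	if exists(cached) {
		return cached, true
	}
	// Per the README, dit.New would download the model at this point.
	return "", false
}

func main() {
	onDisk := func(p string) bool { _, err := os.Stat(p); return err == nil }
	cwd, err := os.Getwd()
	if err != nil {
		fmt.Println("cannot determine working directory:", err)
		return
	}
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("cannot determine home directory:", err)
		return
	}
	if path, ok := findModel(cwd, home, onDisk); ok {
		fmt.Println("would load", path)
	} else {
		fmt.Println("no local model found; a download would be needed")
	}
}
```

A sketch like this can help predict whether a call to dit.New will touch the network, for example in CI, where dit.Load with an explicit path avoids the search entirely.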

As a CLI

# Classify page type and forms on a URL
dit run https://github.com/login

# Classify forms in a local file
dit run login.html

# With probabilities
dit run https://github.com/login --proba

# Download training data and model from Hugging Face
dit data download

# Train a model
dit train model.json --data-folder data

# Evaluate model accuracy
dit evaluate --data-folder data

# Upload training data and model to Hugging Face
dit data upload
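
Both the library's ExtractPageTypeProba / ExtractFormsProba calls and the CLI's --proba flag report class probabilities. The README does not spell out what the 0.05 argument means; the sketch below assumes it is a minimum-probability cutoff and shows that idea on a plain map. The labelProb type, the filterProba function, and the sample numbers are hypothetical and are not part of dît.

```go
package main

import (
	"fmt"
	"sort"
)

// labelProb pairs a class label with its probability (hypothetical type).
type labelProb struct {
	Label string
	Prob  float64
}

// filterProba keeps labels whose probability is at least threshold and
// returns them sorted from most to least likely (ties broken by label).
func filterProba(probs map[string]float64, threshold float64) []labelProb {
	kept := make([]labelProb, 0, len(probs))
	for label, p := range probs {
		if p >= threshold {
			kept = append(kept, labelProb{Label: label, Prob: p})
		}
	}
	sort.Slice(kept, func(i, j int) bool {
		if kept[i].Prob != kept[j].Prob {
			return kept[i].Prob > kept[j].Prob
		}
		return kept[i].Label < kept[j].Label
	})
	return kept
}

func main() {
	// Made-up probabilities, for illustration only.
	probs := map[string]float64{
		"login":        0.91,
		"registration": 0.06,
		"search":       0.02,
		"other":        0.01,
	}
	for _, lp := range filterProba(probs, 0.05) {
		fmt.Printf("%-12s %.2f\n", lp.Label, lp.Prob)
	}
}
```

With a 0.05 cutoff only login and registration survive in this made-up example, which keeps the output short when a model spreads small probabilities over many classes.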

Page Types

| Type | Description |
| --- | --- |
| login | Login page |
| registration | Registration / signup page |
| search | Search results page |
| checkout | Checkout / payment page |
| contact | Contact page |
| password_reset | Password reset page |
| landing | Landing / home page |
| product | Product page |
| blog | Blog / article page |
| settings | Settings / account page |
| soft_404 | Soft 404 (HTTP 200 but "not found" content) |
| error | Error page (404, 403, 500, etc.) |
| captcha | CAPTCHA / bot detection page |
| parked | Domain parking page |
| coming_soon | Under construction / maintenance page |
| admin | Admin panel / dashboard |
| directory_listing | Open directory index |
| default_page | Unconfigured server default |
| waf_block | WAF block page |
| other | Other page type |
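
Labels such as soft_404 and waf_block are useful because the HTTP status code alone does not reveal them. As a rough sketch of how a crawler might act on a page type, the function below maps the labels from the table to decisions. The action function and the decisions themselves are hypothetical and are not part of dît.

```go
package main

import "fmt"

// action maps a page type label from the table above to a hypothetical
// crawler decision. The decisions are illustrative only.
func action(pageType string) string {
	switch pageType {
	case "soft_404", "error", "parked", "default_page", "coming_soon":
		return "skip" // no real content behind these pages
	case "captcha", "waf_block":
		return "back off" // the site is rejecting automated traffic
	case "login", "registration", "password_reset", "admin":
		return "flag for review" // authentication surface
	default:
		return "crawl"
	}
}

func main() {
	for _, t := range []string{"soft_404", "waf_block", "login", "blog"} {
		fmt.Printf("%s -> %s\n", t, action(t))
	}
}
```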

Form Types

| Type | Description |
| --- | --- |
| login | Login form |
| search | Search form |
| registration | Registration / signup form |
| password/login recovery | Password reset / recovery form |
| contact/comment | Contact or comment form |
| join mailing list | Newsletter / mailing list signup |
| order/add to cart | Order or add-to-cart form |
| other | Other form type |

Field Types

| Category | Types |
| --- | --- |
| Authentication | username, password, password confirmation, email, email confirmation, username or email |
| Names | first name, last name, middle name, full name, organization name, gender |
| Address | country, city, state, address, postal code |
| Contact | phone, fax, url |
| Search | search query, search category |
| Content | comment text, comment title, about me text |
| Buttons | submit button, cancel button, reset button |
| Verification | captcha, honeypot, TOS confirmation, remember me checkbox, receive emails confirmation |
| Security | security question, security answer |
| Time | full date, day, month, year, timezone |
| Product | product quantity, sorting option, style select |
| Other | other number, other read-only, other |

The full list of 79 field type codes is in data/config.json (run dit data download to fetch the data).
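
The library example prints r.Fields as a mapping from input name to field type. Assuming that mapping can be treated as a map[string]string (an assumption; the README does not state the Go type), the sketch below picks out the inputs that carry credentials. The credentialFields function is hypothetical.

```go
package main

import "fmt"

// credentialFields scans a field-name -> field-type map and returns the names
// of the inputs that carry the user identifier and the password. If several
// inputs share a type, which one is returned is unspecified because Go map
// iteration order is random.
func credentialFields(fields map[string]string) (user, pass string, ok bool) {
	for name, typ := range fields {
		switch typ {
		case "username", "email", "username or email":
			user = name
		case "password":
			pass = name
		}
	}
	return user, pass, user != "" && pass != ""
}

func main() {
	// Sample classification shaped like the r.Fields output in the README.
	fields := map[string]string{
		"login":  "username or email",
		"pw":     "password",
		"commit": "submit button",
	}
	if user, pass, ok := credentialFields(fields); ok {
		fmt.Printf("identifier input: %s, password input: %s\n", user, pass)
	}
}
```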

Accuracy

Cross-validation results (10-fold, grouped by domain):

| Metric | Score |
| --- | --- |
| Form type accuracy | 82.9% (1135/1369) |
| Field type accuracy | 86.6% (4518/5218) |
| Sequence accuracy | 78.7% (1025/1302) |
| Page type accuracy | 53.4% (403/754) |
| Page macro F1 | 40.2% |
| Page weighted F1 | 53.6% |
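
Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of samples. A macro score below the weighted score, as in the page results, usually means that rarer classes are predicted less well than common ones. The sketch below computes all three page metrics from paired labels. It is a generic illustration, not dît's evaluation code; the scores function is hypothetical, and as a simplification it only counts classes that appear in the true labels.

```go
package main

import "fmt"

// scores returns accuracy, macro F1, and support-weighted F1 for paired
// true/predicted labels. The slices must have equal, non-zero length.
func scores(truth, pred []string) (acc, macroF1, weightedF1 float64) {
	tp := map[string]float64{}      // true positives per class
	fp := map[string]float64{}      // false positives per class
	fn := map[string]float64{}      // false negatives per class
	support := map[string]float64{} // number of true samples per class
	correct := 0.0
	for i, t := range truth {
		support[t]++
		if pred[i] == t {
			tp[t]++
			correct++
		} else {
			fn[t]++
			fp[pred[i]]++
		}
	}
	for class, n := range support {
		f1 := 0.0
		if d := 2*tp[class] + fp[class] + fn[class]; d > 0 {
			f1 = 2 * tp[class] / d
		}
		macroF1 += f1 / float64(len(support))      // every class counts equally
		weightedF1 += f1 * n / float64(len(truth)) // classes count by support
	}
	return correct / float64(len(truth)), macroF1, weightedF1
}

func main() {
	// Made-up labels: the rare class "blog" is misclassified once.
	truth := []string{"login", "login", "login", "login", "blog", "blog"}
	pred := []string{"login", "login", "login", "login", "login", "blog"}
	acc, macro, weighted := scores(truth, pred)
	fmt.Printf("accuracy=%.3f macroF1=%.3f weightedF1=%.3f\n", acc, macro, weighted)
}
```

In this made-up example accuracy is 5/6, macro F1 is 7/9 (about 0.778), and weighted F1 is 22/27 (about 0.815): the single error on the rare class pulls macro F1 down more than the weighted score.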

Trained on 1000+ annotated web forms and 754 annotated web pages.
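
Grouping cross-validation folds by domain means that samples from one site never appear in both the training and the test split, which would otherwise inflate the scores. One simple way to get that property is to derive the fold from the host name only, as sketched below. This is not necessarily how dit evaluate assigns folds: the foldFor function, the FNV hash, and the use of the full host name as the "domain" are assumptions, and hashing does not balance fold sizes.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

// foldFor assigns a page URL to one of k folds using only its host name, so
// every sample from the same host ends up in the same fold.
func foldFor(rawURL string, k int) (int, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	h := fnv.New32a()
	h.Write([]byte(u.Hostname())) // hash.Hash writes never return an error
	return int(h.Sum32() % uint32(k)), nil
}

func main() {
	urls := []string{
		"https://example.com/login",
		"https://example.com/signup",
		"https://example.org/search?q=go",
	}
	for _, raw := range urls {
		fold, err := foldFor(raw, 10)
		if err != nil {
			fmt.Println("skipping", raw, err)
			continue
		}
		fmt.Printf("fold %d: %s\n", fold, raw)
	}
}
```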

Used By

  • katana - A next-generation crawling and spidering framework
  • httpx - A fast and multi-purpose HTTP toolkit

Contributing

See CONTRIBUTING.md.

Credits

Go port of Formasaurus.

License

MIT