docfreq

July 19, 2018

Bags of features extracted in July 2018 from 7.8 million distinct files in PGA (Public Git Archive), taking only the HEAD commit. Extraction used every extractor implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and every language parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency of a feature is its frequency across all documents (files); only features that appeared at least 5 times were kept.

Example:

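from sourced.ml.models import OrderedDocumentFrequencies
# Load the model by its ID (see the metadata table below).
df = OrderedDocumentFrequencies().load("55215392-36fc-43e5-b277-500f5b68d0c6")
# len(df) counts the distinct features; df.docs is the number of documents.
print("Number of documents:", df.docs)
print("Number of features:", len(df))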
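
To make the pruning rule concrete, here is a minimal sketch of how document frequencies can be computed from bags of features. It is not the sourced.ml implementation: the function `document_frequencies`, the `min_docs` parameter and the toy feature names are hypothetical, and it assumes that "appeared at least 5 times" means "appeared in at least 5 documents".

```python
from collections import Counter


def document_frequencies(bags, min_docs=5):
    """Count, for each feature, the number of documents that contain it.

    `bags` is an iterable of documents, each given as an iterable of features.
    Features found in fewer than `min_docs` documents are dropped
    (assumption: the threshold is applied to the document count).
    """
    counts = Counter()
    for bag in bags:
        # A feature counts once per document, however often it repeats there.
        counts.update(set(bag))
    return {feature: n for feature, n in counts.items() if n >= min_docs}


# Toy corpus with made-up feature names: five identical files and one outlier.
bags = [["i.foo", "i.bar", "i.foo"]] * 5 + [["i.baz"]]
docfreq = document_frequencies(bags)
print(docfreq)
```

In this toy corpus `i.foo` and `i.bar` each occur in five documents and are kept, while `i.baz` occurs in only one and is pruned.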

ID	55215392-36fc-43e5-b277-500f5b68d0c6
Uploaded	2018-06-20 14:51:45.469503
Version	1.0.0
File	https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf
Size	69.9 MB
Data collection date	July 2018
Number of distinct documents (files)	7,873,334
Number of distinct features	6,194,874
License