docfreq

July 19, 2018

Bags of features extracted in July 2018 from 7.8 million distinct files in PGA (HEAD commit only), using every extractor implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and every language parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). The document frequency of a feature is the number of documents in which that feature appears; only features that appeared at least 5 times were kept.
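To make the notion of document frequency concrete, here is a minimal sketch in plain Python. It is not the sourced.ml implementation: the function name `document_frequencies`, the `min_count` parameter and the toy feature names are hypothetical, and it assumes the "at least 5 times" threshold is applied to the per-feature document count.

```python
from collections import Counter


def document_frequencies(bags, min_count=5):
    """Count in how many documents each feature occurs, dropping rare features.

    `bags` is an iterable of dicts mapping feature -> in-document count.
    Assumption: the threshold applies to the number of documents, not to
    the total number of occurrences.
    """
    df = Counter()
    for bag in bags:
        # Only presence in the document matters, so count each key once.
        df.update(set(bag))
    return {feature: n for feature, n in df.items() if n >= min_count}


# Toy corpus: each document is a bag of hypothetical identifier features.
bags = [
    {"i.foo": 3, "i.bar": 1},
    {"i.foo": 1},
    {"i.foo": 2, "i.baz": 4},
]
df = document_frequencies(bags, min_count=2)
print(df)  # {'i.foo': 3}
```

With `min_count=2`, only `i.foo` survives because it is the only feature present in more than one document.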

Example:

from sourced.ml.models import OrderedDocumentFrequencies
df = OrderedDocumentFrequencies().load("55215392-36fc-43e5-b277-500f5b68d0c6")
print("Number of documents:", len(df))

References

ID: 55215392-36fc-43e5-b277-500f5b68d0c6
Uploaded: 2018-06-20 14:51:45.469503
Version: 1.0.0
File: https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fdocfreq%2F55215392-36fc-43e5-b277-500f5b68d0c6.asdf
Size: 69.9 MB
Data collection date: July 2018
Number of distinct documents (files): 7,873,334
Number of distinct features: 6,194,874
License