BOW

August 21, 2018

Bags of features extracted in July 2018 from 7.8 million distinct files in PGA (HEAD commits only), using all extractors implemented in sourced.ml at the time (identifiers, literals, graphlets, children, node2vec and uast2seq) and all languages parsable by Babelfish (Go, Java, Python, Bash, JavaScript and Ruby). This was done to try using apollo at scale. We hit scipy.sparse limits while trying to merge the sparse matrices for all bags, so this is only one of the three BOW models that hold the bags.

Example:

from sourced.ml.models import BOW
bow = BOW().load("1e0deee4-7dc1-400f-acb6-74c0f4aec471")
print("Number of documents:", len(bow))
print("Number of tokens:", len(bow.tokens))
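
To illustrate what merging the bags involves, here is a minimal pure-Python sketch. It is not the sourced.ml or apollo implementation. It assumes each part carries its own document list, its own token list, and one sparse row per document, and it remaps every part onto a shared vocabulary. The helper merge_bags, the file names, and the feature names are all hypothetical toy values.

```python
def merge_bags(parts):
    """Merge bag-of-features parts onto one shared vocabulary.

    Each part is a (documents, tokens, rows) tuple (hypothetical layout):
      documents: list of document names
      tokens:    list of feature names, indexed by column
      rows:      list of {column_index: weight} dicts, one per document
    """
    vocab = {}          # feature name -> column index in the merged bag
    merged_docs = []
    merged_rows = []
    for documents, tokens, rows in parts:
        # Translate this part's local column indices to merged indices,
        # assigning a fresh index to every feature not seen before.
        remap = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        for doc, row in zip(documents, rows):
            merged_docs.append(doc)
            merged_rows.append({remap[col]: w for col, w in row.items()})
    merged_tokens = sorted(vocab, key=vocab.get)
    return merged_docs, merged_tokens, merged_rows


# Toy data only: two tiny parts with partially overlapping features.
part_a = (["a.py"], ["i.foo", "l.1"], [{0: 2, 1: 1}])
part_b = (["b.go", "c.rb"], ["l.1", "i.bar"], [{0: 3}, {0: 1, 1: 1}])

docs, tokens, rows = merge_bags([part_a, part_b])
stored_entries = sum(len(row) for row in rows)

print("Documents:", docs)
print("Tokens:", tokens)
print("Stored entries:", stored_entries)
```

The number of stored entries grows with every part that is merged, which is the quantity that matters when a single sparse matrix has to hold all bags. The description does not say which scipy.sparse limit was reached, so the sketch does not try to reproduce it.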

References

ID: 1e0deee4-7dc1-400f-acb6-74c0f4aec471
Uploaded: 2018-07-17 10:16:51.105969
Version: 1.0.0
File: https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbow%2F1e0deee4-7dc1-400f-acb6-74c0f4aec471.asdf
Size: 5.9 GB
Data collection date: July 2018
Number of distinct documents (files): 864,458
Number of distinct features: 6,194,874
Other parts: 694c20a0-9b96-4444-80ae-f2fa5bd1395b and da8c5dee-b285-4d55-8913-a5209f716564
License

Dependencies