Standard Template Construct

October 31, 2023

Welcome, developer! You've arrived at the repository for STC: the library, search engine, and AI tooling offering free access to academic knowledge and works of fiction.

STC | Help Center

Getting Started

  • Explore our search features at Web STC, or through one of the Telegram bots listed in the bio of our channel (not an ad, just a safety measure)
  • Discover how to set up your own STC instance, enabling you to enjoy the same search capabilities in your local environment
  • Learn how to access a large corpus of high-quality scholarly texts using Python and use them in AI apps

Details

In essence, STC is the Summa search engine coupled with databanks. These databanks reside on IPFS in a format that allows searching without downloading the entire dataset. The search engine library can function as a standalone server, an embeddable Python library (requiring no additional software!), or a WASM-compiled module that runs in a browser. The last option makes it possible to embed the search engine in a static site, which can itself be deployed over IPFS. This is how Web STC runs.

Putting everything on IPFS allows you to open STC in your browser or on your server and avoid the use of centralized servers that may lose or censor data.
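The idea of searching without downloading the whole dataset can be illustrated with a small sketch. It assumes, purely for illustration, that an index file is split into fixed-size chunks and that a query only needs a few byte ranges of that file. The chunk size, the function names `chunks_for_ranges` and `fraction_fetched`, and the example numbers are hypothetical and do not describe Summa's actual storage layout.

```python
CHUNK_SIZE = 256 * 1024  # hypothetical chunk size in bytes, not Summa's real value


def chunks_for_ranges(byte_ranges, chunk_size=CHUNK_SIZE):
    """Return the sorted chunk indices that cover the given byte ranges.

    Each range is a (start, end) pair with `end` exclusive.
    """
    needed = set()
    for start, end in byte_ranges:
        if start < 0 or end <= start:
            raise ValueError(f"invalid byte range: {(start, end)}")
        first = start // chunk_size       # chunk holding the first byte
        last = (end - 1) // chunk_size    # chunk holding the last byte
        needed.update(range(first, last + 1))
    return sorted(needed)


def fraction_fetched(byte_ranges, total_size, chunk_size=CHUNK_SIZE):
    """Estimate which share of the file's chunks a query has to download."""
    total_chunks = -(-total_size // chunk_size)  # ceiling division
    return len(chunks_for_ranges(byte_ranges, chunk_size)) / total_chunks


if __name__ == "__main__":
    # A made-up query touching three small regions of a 1 GiB index file.
    ranges = [(0, 4096), (300_000_000, 300_010_000), (900_000_000, 900_500_000)]
    print(chunks_for_ranges(ranges))
    print(f"{fraction_fetched(ranges, 1024 ** 3):.4%} of chunks fetched")
```

Under these assumptions a client fetches only the chunks a query touches, which is why a large databank can be queried from a browser or a small device.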

Components

  • Web STC is a browser-based interface with an embedded search engine that can be deployed entirely on IPFS and used in browsers
  • GECK is a Python library and Bash tool for setting up and interacting with STC programmatically
  • Cybrex AI is a library that pairs STC with AI tools such as OpenAI or free LLMs for processing stored data
  • STC Hub API is a plain API for accessing scholarly publications by their DOIs through the kubo command-line tools or over HTTP
  • Telegram Nexus Bot allows users to access STC via Telegram, one of the most popular messaging platforms
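As a rough sketch of what DOI-based access over HTTP could look like, the snippet below builds a gateway URL for a DOI. It assumes a local kubo gateway on its default address `127.0.0.1:8080`. The hub path `/ipns/hub.example.org`, the `.pdf` suffix, and the function name `build_hub_url` are hypothetical placeholders, not the documented STC Hub API layout; check the STC Hub API documentation for the real path scheme.

```python
from urllib.parse import quote


def build_hub_url(
    doi,
    gateway="http://127.0.0.1:8080",   # kubo's default local gateway address
    hub_path="/ipns/hub.example.org",  # hypothetical hub location
    suffix=".pdf",                     # hypothetical file suffix
):
    """Build a gateway URL for a DOI under an assumed path layout."""
    doi = doi.strip()
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # DOIs contain '/' and may contain other reserved characters,
    # so percent-encode everything that is not URL-safe.
    encoded = quote(doi, safe="")
    return f"{gateway.rstrip('/')}{hub_path.rstrip('/')}/{encoded}{suffix}"


if __name__ == "__main__":
    print(build_hub_url("10.1000/182"))
```

Percent-encoding the whole DOI keeps its slash from being read as a path separator, whichever path layout the hub actually uses.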

Roadmap

| Part | Task | Description |
|---|---|---|
| Library Stewardship | | |
| | ✅ Assimilation of LibGen corpus | Transition of all items to nexus_science |
| | 🚧 Assimilation of SciMag corpus | Significant task of transferring the scimag corpus to IPFS |
| | ✅ Structured content | Enhance GROBID extraction (headers + content) and store content in the structured_content JSON column. Extract entities for cross-linking in Web STC |
| | 🚧 Implementing classification (articles, books) | |
| Web STC | | |
| | UX improvement | STC often requires loading large data chunks, which is currently reflected only by a spinner. The UX needs improvement. Following the structured content implementation, we can highlight headers and generate cross-links in abstracts/content |
| | Enhancing availability | Further testing needed on diverse devices and networks |
| | Bookshelf | STC has all the tools for generating bookshelves that may offer users high-quality suggestions on what to read |
| Cybrex AI | | |
| | First-class support of local LLM | Extensive testing of prompts with documents is required to identify the smallest model capable of efficiently executing QA and summarization tasks. Most 13-15B models are currently failing (quantized, on CPU) |
| | Building an embeddings dataset | The goal is to build a comprehensive dataset with DOIs and document embeddings. Currently, the Instructor XL model appears most promising, but further testing is necessary |
| | Refining and fixing metadata (cleaning content) | Areas for improvement include: detected language, tags, keywords, automated abstracts, Dewey classification |
| | Build QA on local LLM | Such a system should be independently operable and also accessible via Telegram |
| | Fine-tuning LLMs on STC | |
| Distribution | | |
| | Building STC Box | Develop and maintain a definitive guide and scripts for replicating and launching STC on compact devices like PI computers or TV Boxes |
| | Global replication | The goal is to replicate STC (including the search database and papers) a minimum of 100 times across at least 30 countries |
| | Establishing Frontier Outposts | Investigate strategies to replicate STC on an orbiting satellite or another planet in the solar system (Mars or Europa preferred) |
| Communities | | |
| | ✅ Forming Science Communities on Telegram | Initiate the first version of Telegram-based forums focusing on specific scientific topics |
| | Addressing Copyright Issues | Organize more activities aimed at challenging the copyright laws for scholarly and educational writings |