Standard Template Construct

October 31, 2023

Welcome, developer! You've arrived at the repository for STC: a library, search engine, and AI tooling that offer free access to academic knowledge and works of fiction.

STC | Help Center

Getting Started

Explore our search features at Web STC, or through one of the Telegram bots listed in the bio of our channel (this is not an ad, just a safety measure)
Discover how to set up your own STC instance and get the same search capabilities in your local environment
Learn how to access a large corpus of high-quality scholarly texts using Python and use it in AI apps

In essence, STC is the Summa search engine coupled with databanks. These databanks reside on IPFS in a format that allows searching without downloading the entire dataset. The search engine library can function as a standalone server, as an embeddable Python library (requiring no additional software!), or as a WASM-compiled module that runs in a browser. The last option makes it possible to embed the search engine in a static site, which can itself be deployed over IPFS. This is how Web STC works.

Putting everything on IPFS lets you open STC in your browser or on your server and avoid centralized servers that may lose or censor data.
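
To make the idea of searching without downloading the whole dataset more concrete, here is a minimal, generic sketch of chunked range reads: a reader that fetches only the chunks covering the requested bytes and caches them. This is an illustration of the general technique, not Summa's actual storage format or API; `ChunkedReader`, `fetch_chunk`, and `make_in_memory_fetcher` are hypothetical names, and the in-memory fetcher stands in for a real network call to IPFS.

```python
from typing import Callable, Dict, List, Tuple


class ChunkedReader:
    """Reads byte ranges from a remote blob, fetching only the chunks it needs."""

    def __init__(self, fetch_chunk: Callable[[int], bytes], chunk_size: int):
        # fetch_chunk is a hypothetical callback: chunk index -> chunk bytes.
        self.fetch_chunk = fetch_chunk
        self.chunk_size = chunk_size
        self.cache: Dict[int, bytes] = {}

    def read(self, offset: int, length: int) -> bytes:
        if length <= 0:
            return b""
        # Indices of the first and last chunks that overlap the requested range.
        first = offset // self.chunk_size
        last = (offset + length - 1) // self.chunk_size
        parts = []
        for index in range(first, last + 1):
            if index not in self.cache:
                self.cache[index] = self.fetch_chunk(index)
            parts.append(self.cache[index])
        data = b"".join(parts)
        # Trim the joined chunks down to exactly the requested bytes.
        start = offset - first * self.chunk_size
        return data[start:start + length]


def make_in_memory_fetcher(blob: bytes, chunk_size: int) -> Tuple[Callable[[int], bytes], List[int]]:
    """Stand-in for a network fetch; records which chunk indices were requested."""
    fetched: List[int] = []

    def fetch(index: int) -> bytes:
        fetched.append(index)
        return blob[index * chunk_size:(index + 1) * chunk_size]

    return fetch, fetched
```

With a 1024-byte blob and 100-byte chunks, reading 120 bytes at offset 250 touches only chunks 2 and 3, and a repeated read inside that range is served from the cache without another fetch.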

Components

Web STC is a browser-based interface with an embedded search engine that can be deployed entirely on IPFS
GECK is a Python library and Bash tool for setting up and interacting with STC programmatically
Cybrex AI is a library that pairs STC with AI tools such as OpenAI or free LLMs for processing stored data
STC Hub API is a plain API for accessing scholarly publications by their DOIs through kubo command-line tools or over HTTP
Telegram Nexus Bot allows users to access STC via Telegram, one of the most popular messaging platforms
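
As a rough illustration of DOI-based access over HTTP, the sketch below builds a gateway URL for a given DOI. The path layout, the `hub_root` value, and the file extension are assumptions made for this example and are not taken from the STC Hub API documentation; `hub_url` is a hypothetical helper. The one firm point is that a DOI contains a `/`, so it has to be percent-encoded before it can be used as a single path segment.

```python
from urllib.parse import quote


def hub_url(gateway: str, hub_root: str, doi: str, extension: str = "pdf") -> str:
    """Build a URL for a DOI under a hypothetical hub path on an IPFS HTTP gateway."""
    # safe="" makes quote() encode "/" as well, keeping the DOI in one path segment.
    encoded = quote(doi.strip(), safe="")
    return f"{gateway.rstrip('/')}/{hub_root.strip('/')}/{encoded}.{extension}"


# Example with placeholder values (the hub root here is not a real address):
# hub_url("http://127.0.0.1:8080", "ipns/hub.example", "10.1000/182")
# returns "http://127.0.0.1:8080/ipns/hub.example/10.1000%2F182.pdf"
```

Check the STC Hub API documentation for the real hub address and naming scheme before relying on this.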

Roadmap

Part	Task	Description
Library Stewardship
	✅ Assimilation of LibGen corpus	Transition of all items to `nexus_science`
	🚧 Assimilation of SciMag corpus	Significant task of transferring the SciMag corpus to IPFS
	✅ Structured content	Enhance GROBID extraction (headers + content) and store content in the `structured_content` JSON column. Extract entities for cross-linking in Web STC
	🚧 Implementing classification (articles, books)
Web STC
	UX improvement	STC often requires loading of large data chunks, currently reflected only by a spinner. The UX needs improvement. Following structured content implementation, we can highlight headers and generate cross-links in abstracts/content
	Enhancing availability	Further testing needed on diverse devices and networks
	Bookshelf	STC has all the tools needed to generate bookshelves that could offer users high-quality reading suggestions
Cybrex AI
	First-class support for local LLMs	Extensive testing of prompts with documents is required to identify the smallest model capable of efficiently executing QA and summarization tasks. Most 13-15B models currently fail (quantized, on CPU)
	Building an embeddings dataset	The goal is to build a comprehensive dataset with DOIs and document embeddings. Currently, the Instructor XL model appears most promising, but further testing is necessary
	Refining and fixing metadata (cleaning `content`)	Areas for improvement include: detected language, tags, keywords, automated abstracts, Dewey classification
	Build QA on local LLM	Such a system should be independently operable and also accessible via Telegram.
	Fine-tuning LLMs on STC
Distribution
	Building STC Box	Develop and maintain a definitive guide and scripts for replicating and launching STC on compact devices like PI computers or TV Boxes
	Global replication	The goal is to replicate STC (including the search database and papers) a minimum of 100 times across at least 30 countries
	Establishing Frontier Outposts	Investigate strategies to replicate STC on an orbiting satellite or another planet in the solar system (Mars or Europa preferred)
Communities
	✅ Forming Science Communities on Telegram	Initiate the first version of Telegram-based forums focusing on specific scientific topics
	Addressing Copyright Issues	Organize more activities aimed at challenging the copyright laws for scholarly and educational writings
