Search engine ideas
Oliver Leaver-Smith <oliver@leaversmith.com> on 2022-11-07 08:44:37
There is yet more resistance to the big web brewing, and for this reason I think it's time that I dusted off an old project: a curated search engine that anyone can run their own instance of (not federated).
I had relative personal success with a project called veri, which met my immediate goals; however, I now want to improve it so that it becomes a viable alternative for other people. It will still be called veri, which is both the Latin word for truth, reality, or fact, and the Turkish word for data; all of these definitions seem pretty apt.
I have previously stated that the goals of veri are as follows:
- To be deployable by anyone to create their own specific-interest search engine
- To understand the web (http and https), gemini, and gopher URL schemes
- To be modular, so that any of the individual components can be deployed without the others
- To be a good citizen of the Internet, respecting robots.txt and sending a configurable User-Agent that provides contact details for the instance
All of these points still stand.
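To make the "good citizen" goal a little more concrete, here is a minimal sketch using only Python's standard library. It is not how veri does it; the User-Agent string, the sample robots.txt, and the helper names (`build_robots`, `may_fetch`, `build_request`) are all hypothetical and exist only for illustration.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent: an instance operator would configure their own
# name and contact details here.
USER_AGENT = "veri-example/0.1 (+mailto:operator@example.org)"

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt allows this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def build_request(url: str, user_agent: str = USER_AGENT) -> Request:
    """Build (but do not send) a request that identifies the instance."""
    return Request(url, headers={"User-Agent": user_agent})


robots = build_robots(SAMPLE_ROBOTS)
print(may_fetch(robots, "https://example.org/docs/intro"))
print(may_fetch(robots, "https://example.org/private/notes"))
```

A crawler would call `may_fetch` before every request and skip any URL for which it returns False, and would send every request through something like `build_request` so that site owners can see who is crawling them.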
The workflow of veri will be as follows:
- An instance operator maintains a list of "tier 1" URLs which are trusted sources of good information. A good example would be the sitemap of a useful site.
- veri-crawl periodically spiders this list of URLs to generate a list of "tier 2" URLs
- veri-scrape periodically scrapes the full list of URLs for their text content and metadata to be stored in a database
- veri-search is an API that processes queries, ad hoc reindexing, and deletion
- veri-www is a web front end to interact with the search API
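The workflow above can be sketched end to end in a few dozen lines. This is only an illustration under heavy assumptions: it is written in Python with the standard library, it reads pages from an in-memory dictionary (`FAKE_WEB`) instead of the network, it handles HTML only, and the table layout and function names are hypothetical and not taken from veri. The search is a plain substring match, standing in for whatever indexing the real API would use.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects outgoing http(s) links and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute = urljoin(self.base_url, href)
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(url, html):
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    return parser


def crawl(tier1_pages):
    """Stand-in for veri-crawl: tier 1 pages in, de-duplicated tier 2 URLs out."""
    tier2 = []
    for url, html in tier1_pages.items():
        for link in parse_page(url, html).links:
            if link not in tier1_pages and link not in tier2:
                tier2.append(link)
    return tier2


def scrape(db, url, html):
    """Stand-in for veri-scrape: store the page's text content by URL."""
    text = " ".join(parse_page(url, html).text_parts)
    db.execute("INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)", (url, text))


def search(db, query):
    """Stand-in for veri-search: substring match (LIKE wildcards are not escaped)."""
    rows = db.execute(
        "SELECT url FROM pages WHERE content LIKE ? ORDER BY url", (f"%{query}%",)
    )
    return [row[0] for row in rows]


# Hypothetical pages standing in for real fetches.
FAKE_WEB = {
    "https://example.org/": '<h1>Index</h1><a href="/guide">Guide</a>'
                            '<a href="mailto:a@example.org">Mail</a>',
    "https://example.org/guide": "<p>Goroutines are lightweight threads.</p>"
                                 "<script>var x = 1;</script>",
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")

tier1 = {"https://example.org/": FAKE_WEB["https://example.org/"]}
tier2 = crawl(tier1)
for url in list(tier1) + tier2:
    scrape(db, url, FAKE_WEB[url])

print(tier2)
print(search(db, "goroutines"))
```

Keeping crawl, scrape, and search as separate functions that only share a list of URLs and a database mirrors the modularity goal: each piece could be swapped out or run on its own schedule.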
Following this pattern, an example search engine might be one created by the Go community, which uses the trending Stack Overflow pages, Go documentation, popular Go bloggers and blog aggregators, etc. in order to provide a trusted and curated way to search for information on a particular subject.
Some additional goals that are not set in stone are:
- veri-www shows an archive of the text content of the page from when it was scraped, in case the content is unavailable at the time of the search
- veri-proxy allows a user to view the page through veri so that their identity is not revealed to the requested website
As and when I have time to work on this, I will be updating this blog with relevant changes. And for those who know me: no, this won't use YAML for storage. I actually want people to use this particular project!