New spider ideas


Freenet already has a primitive searching system:

- A built-in spider (and many other spiders) produces index files, which are inserted under a USK.
- The Librarian plugin downloads this index and runs the user's search against it.
- Thaw allows for indexes which are collections of files with metadata, described in an XML-based format. This format was originally intended to be used by Librarian as well, though it may need some extensions to be useful for page searching.
- The above has major scaling problems, and none of it addresses spam.
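
As a rough illustration of the index-then-search flow described above, here is a minimal sketch in Python. It is not the format the spider or Librarian actually uses: the function names, the whitespace tokenisation and the placeholder page keys are all assumptions made for the example.

```python
def build_index(pages):
    """Build a toy inverted index.

    pages maps a page key (hypothetical placeholder strings here) to its text.
    Returns a dict mapping each lower-cased term to the set of page keys
    that contain it.
    """
    index = {}
    for key, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(key)
    return index


def search(index, query):
    """Return the page keys that contain every term of the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical data: the keys are placeholders, not real Freenet keys.
pages = {
    "USK@a/site/1": "Freenet search ideas",
    "USK@b/blog/3": "freenet plugin notes",
}
index = build_index(pages)
```

With this data, `search(index, "freenet plugin")` returns only the second page. The whole index has to be downloaded before any search can run, which is where the scaling problem comes from.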

Useful projects:

- Make Librarian use the XML-based file format Thaw uses, with any necessary extensions.
- Make Librarian support keeping a list of indexes, adding and removing indexes, and indicating which site was found in which indexes (possibly by attaching icons for each source to the title?)
- Integrate Librarian better with fproxy. There should be a search box on the home page, above the "fetch a key" box and below the bookmarks.
- Support huge indexes. The spider inserts an index site, which is a main index file plus a set of subsidiary indexes, each covering a chunk of the search space. When the user types a query into Librarian, it checks each search term to determine which sub-indexes are needed, fetches them, cross-references them and feeds the results back to the user. Sub-indexes may be divided up by the first few characters of the search term, though not by a fixed number of characters: X, Y and Z might share a single sub-index while the letter E is split across several, so that each sub-index is roughly the same size. Alternatively we might hash the terms and divide them up by hash. Which is better?
- The user interface for Librarian would need to be much closer to a traditional p2p search. At the moment Librarian looks like Google, so the user expects instant results. This is not realistic, especially with big indexes.
- The spider should save its progress to disk, and pick up from where it left off when restarted (using the Bdb database already integrated in the node could be an interesting solution). It should understand updatable keys. It might want to try to determine reachability in some cases.
- The spider should index files (by filename) as well as pages.
- The spider should be able to read other indexes (e.g. Thaw .frdx files) and combine them, without necessarily trusting the index sources 100%.
- The spider should be able to index files referred to on Frost or posted on its filesharing system. It might also be useful to be able to search Frost messages at the same time as pages, although they are ephemeral. In either case you might want to talk to the Worst authors about the interface.
- Currently the spider fetches all files, or ignores some based solely on their filenames. It should be possible for it to ask Freenet to fetch only so much of the file as is needed to determine its content type, length etc.
- Make both Librarian and the spider, if installed, accessible to FCP clients.
- Build a Swing GUI on top of this.
- Use some sort of page ranking algorithm / trust metric to sort results. For some non-patented algorithms, see this thread (ignore the stuff about premix routing, it won't be implemented before 0.8).
- Pick up category markers and other page metadata.
- If the spider is done as a node plugin, writing it with plenty of comments would be nice too: it could help other developers understand how to make plugins.
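
To make the "huge indexes" idea above more concrete, here is a minimal Python sketch of both ways of choosing a sub-index. Everything in it is an assumption made for illustration: the function names, the use of a per-term size (for example, the number of pages a term appears on) and the choice of SHA-256 are not part of any existing spider or Librarian code.

```python
import bisect
import hashlib


def build_range_index(term_sizes, target_size):
    """Split the sorted terms into contiguous chunks of roughly target_size.

    term_sizes maps each term to the size of its entry.
    Returns (boundaries, chunks), where boundaries[i] is the first term
    stored in chunks[i]. The boundaries list is what the main index file
    would need to carry.
    """
    boundaries, chunks = [], []
    current, current_size = [], 0
    for term in sorted(term_sizes):
        # Close the current chunk if adding this term would overfill it.
        if current and current_size + term_sizes[term] > target_size:
            boundaries.append(current[0])
            chunks.append(current)
            current, current_size = [], 0
        current.append(term)
        current_size += term_sizes[term]
    if current:
        boundaries.append(current[0])
        chunks.append(current)
    return boundaries, chunks


def range_subindex(boundaries, term):
    """Return the number of the sub-index whose range would hold term."""
    return max(bisect.bisect_right(boundaries, term) - 1, 0)


def hash_subindex(term, num_subindexes):
    """Alternative: choose the sub-index from a hash of the term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_subindexes


# Hypothetical sizes: several large entries under E, small ones under X, Y, Z.
sizes = {"eagle": 40, "earth": 50, "echo": 55, "xenon": 10, "yak": 15, "zebra": 20}
boundaries, chunks = build_range_index(sizes, target_size=60)
```

With these sizes the three E terms each get their own sub-index, while xenon, yak and zebra share one. The range split needs the boundary list in the main index but keeps neighbouring terms together, which might help prefix searches. The hash split needs no boundary list and spreads terms about evenly by count, but it scatters related terms and does not balance by entry size.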

How much integration do we want with Thaw (indexing files as well as pages etc)? What extensions don't make sense for indexing pages?

Known problems with the current spiders:

- Some people have reported that the current spiders (Spider.java & NinjaSpider.java) seem to get stuck on some files after a while (maybe not totally blocked, but they lose a lot of time on just a few files), so maybe a new download strategy should be used (something like: "if no download has progressed at all for X minutes, start another download"?)
- The UI must be redesigned (to allow changing the parameters, etc.)
- For the NinjaSpider, I (Jflesch) made a stupid mistake: I used the DOM generator instead of the SAX generator, which is far from optimal :(
- Everything is in one source file: it would be more readable if this plugin had its own package (inside a package specific to the plugins?).
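
The "start another download if nothing has progressed for X minutes" strategy suggested above could look roughly like the following Python sketch. The class and method names are hypothetical and it is not tied to the node's plugin API. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
class StallWatchdog:
    """Decide when the spider should start an extra download."""

    def __init__(self, stall_seconds):
        self.stall_seconds = stall_seconds
        self.last_progress = {}  # download key -> time of its last progress

    def on_progress(self, key, now):
        """Record that the download for key received data at time now."""
        self.last_progress[key] = now

    def on_finished(self, key):
        """Forget a download that has completed or failed."""
        self.last_progress.pop(key, None)

    def should_start_another(self, now):
        """Return True if nothing is running, or if no running download
        has made any progress within the last stall_seconds."""
        if not self.last_progress:
            return True
        newest = max(self.last_progress.values())
        return now - newest >= self.stall_seconds
```

In a real spider the caller would pass something like `time.monotonic()` as `now` and call `should_start_another` from a periodic timer. Whether stalled downloads should also be cancelled or just left running alongside the new one is a separate decision that this sketch does not make.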

Related plugin project:

- Produce a human readable HTML index (of pages). Should be highly customizable and easy to use so anyone can run an index and make it look good with no coding required. Should automatically insert updates, should publish an index of the source data etc. Could aggregate data from other indexes, but also allow input from the user e.g. categorisation of uncategorised (or wrongly categorised) sites, which could then be fed back into the published index and used by other index sites and even by searches ("Another says this site is porn, but The Freedom Engine says it is biology" etc).
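
One possible shape for the aggregation step is sketched below in Python. It is an assumption, not a design decision: the per-index trust weights, the rule that the highest total trust wins and the function name are all made up for the example. The point is only that each category keeps the list of indexes that asserted it, so a page can show who says what, and that the local user's own categorisation overrides everything else.

```python
from collections import defaultdict


def aggregate_categories(claims, trust, user_overrides=None):
    """Combine category claims from several indexes.

    claims: iterable of (index_name, site, category) tuples.
    trust: maps index_name to a weight; unknown indexes count as 0.
    user_overrides: optional map of site -> category chosen by the local user.
    Returns site -> {"category": chosen, "sources": {category: [index names]}}.
    """
    user_overrides = user_overrides or {}
    scores = defaultdict(lambda: defaultdict(float))
    sources = defaultdict(lambda: defaultdict(list))
    for index_name, site, category in claims:
        scores[site][category] += trust.get(index_name, 0.0)
        sources[site][category].append(index_name)

    result = {}
    for site, site_scores in scores.items():
        if site in user_overrides:
            chosen = user_overrides[site]
        else:
            # Highest total trust wins; ties are broken alphabetically
            # so the output is deterministic.
            chosen = min(site_scores, key=lambda c: (-site_scores[c], c))
        result[site] = {
            "category": chosen,
            "sources": {c: list(names) for c, names in sources[site].items()},
        }
    return result


# Hypothetical claims and weights, using the index names from the example above.
claims = [
    ("Another", "some-site", "porn"),
    ("The Freedom Engine", "some-site", "biology"),
    ("Third Index", "some-site", "biology"),
]
trust = {"Another": 0.9, "The Freedom Engine": 0.5, "Third Index": 0.5}
merged = aggregate_categories(claims, trust)
```

Here `merged["some-site"]["category"]` is "biology", because the two indexes asserting it together outweigh the one that disagrees, while `merged["some-site"]["sources"]` still records which index said what.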