OK, you read the headline.
Imagine if Calibre had an AI tool that trained itself on all of its books, so that a user could ask a question about those specific books. And then extrapolate: imagine if anyone, anywhere, could ask questions of your local AI and get answers without you actually sharing your books.
Right now I host my own Calibre server, but I don’t even know if I can search a term and get back a particular book that contains that term. I think it can search the title and metadata. I’m probably wrong. But the point is that it could be so much more. And it could circumvent the copyright laws that have always held back knowledge.
Like, maybe my car broke down and I could ask the AI why and how to fix it. It would start asking for the make and model and what sort of sounds the car made. It would then search forums and our books and formulate an answer in the form of a book written specifically for me, about my car’s particular problem and how to solve it. Or better yet, a spoken walkthrough you could listen to while fixing the car step by step… “now look a little to the left and you’ll find a large box with 3 screws…”
It would be awesome to have that locally for my books and have access to everyone else’s knowledge in books too.
You’re thinking of retrieval-augmented generation (RAG) with a vector database. It’s something that’s being actively developed. I haven’t had time to dig much into it, especially currently. But the terms themselves should give you a starting point.
Edit: a quick google gives what could be a promising starting point https://huggingface.co/learn/cookbook/rag_llamaindex_librarian
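In broad strokes it looks like this (a minimal sketch, assuming your books are already exported as plain text into a books/ folder; note that out of the box LlamaIndex calls OpenAI for embeddings and generation, while that cookbook swaps in local models via Ollama):

```python
# pip install llama-index
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the books, chunk and embed them, and build a vector index.
documents = SimpleDirectoryReader("./books").load_data()
index = VectorStoreIndex.from_documents(documents)

# A query retrieves the most relevant chunks and hands them to the LLM.
query_engine = index.as_query_engine()
print(query_engine.query("Which of my books covers second-curtain flash?"))
```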
RAG is fucking awesome, but in its current state it can’t handle unlimited amounts of data. On consumer machines I think you can throw around 100 MB at it before it starts losing it. That’s quite a lot of text, but not really a decent collection of books. They might be able to get away with separating the books into categories and adding them as different knowledge bases. They’d have to select which knowledge base they wanted to ask, but if they could keep the size down it might work relatively well.
I fed mine about a year’s worth of Slack traffic from work. I would ask it how many times people had trouble with a certain system. It would say three; meanwhile, there were 500 tickets in the system from people having trouble with it.
Now, if I asked it about those three things, it would have great detail. I can even ask it for the sentiment of the people who were talking about it. It would recognize reasonably well whether they were upset, understanding, or angry.
What are you talking about? RAG is a method you use. It only has the limitations you design in. Your datastore can be whatever you want it to be. The LLM performs a tool use YOU define. RAG isn’t one thing. You can build a RAG system out of flat files or a huge vector datastore. You determine how much data is returned to the context window. Python and ChromaDB easily scale to gigabytes on consumer hardware, completely suitable for local RAG.
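A rough sketch of what that looks like with ChromaDB (pip install chromadb; the collection name, IDs, and chunks are made up for illustration, and ChromaDB embeds documents with its default embedding model unless you plug in your own):

```python
import chromadb

# A persistent client writes the index to disk, so it survives restarts.
client = chromadb.PersistentClient(path="./library_db")
collection = client.get_or_create_collection("books")

# Index some book chunks; ChromaDB embeds them automatically.
collection.add(
    ids=["repair-manual-p12", "repair-manual-p13"],
    documents=[
        "To drain the oil, locate the pan bolt under the engine...",
        "Torque the drain bolt to spec before refilling...",
    ],
    metadatas=[{"book": "repair_manual"}, {"book": "repair_manual"}],
)

# Retrieve the most relevant chunks; you decide how many reach the context window.
results = collection.query(query_texts=["how do I change the oil?"], n_results=2)
print(results["documents"][0])
```

Separate collections also map neatly onto the “one knowledge base per category” idea: one collection per subject, and you pick which one to query.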
I explained what I did, and how it worked.
generally, this: https://www.youtube.com/watch?v=qV1Ab0qWyT8
the numbers came from my experience, ymmv.
I think there are graph/tree-based solutions for RAG, which ideally should have the books at leaf nodes and their overlapping summaries as parent nodes.
Kinda like a BST for books to RAG from.
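That’s roughly the idea behind tree-organized retrieval schemes like RAPTOR. Here’s a toy sketch of the routing part (everything below is hypothetical, and a real system would score with embedding similarity rather than word overlap):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                        # summary of everything beneath this node
    children: list["Node"] = field(default_factory=list)
    book: str | None = None             # set only on leaf nodes

def overlap(query: str, text: str) -> int:
    """Crude stand-in for embedding similarity: count of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def route(node: Node, query: str) -> str:
    """Descend toward the child whose summary best matches the query."""
    if node.book is not None:
        return node.book
    best = max(node.children, key=lambda c: overlap(query, c.summary))
    return route(best, query)

# Tiny example tree: two subject areas, one book each.
root = Node("photography and car repair", children=[
    Node("photography optics lenses flash shutter", book="optics_handbook"),
    Node("car repair engine oil brakes", book="repair_manual"),
])
print(route(root, "how do I change my engine oil"))  # -> repair_manual
```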
Not quite what you’re asking for, but you can self-host Ollama. And based on some recent lawsuits against Meta, I’m pretty sure all companies are using as many books as they can get their hands on to train their models. So their training sets already contain the books you have in Calibre, and more.
Try asking your questions to llama3.3, or whichever model you choose.
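For example, assuming Ollama is running locally and you’ve already pulled the model (ollama pull llama3.3), you can hit its local HTTP API from Python:

```python
# pip install requests
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3",
        "prompt": "Why would a 2013 Ford Focus squeal on cold starts?",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])
```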
Do you know how any of this works?
If copyright were magically not an issue, why does this need to be local/self-hosted?
Like sure, some people will still self-host, and we need some people to keep information independent of corporations. But for people who just want a summary of a car maintenance task, why would they go to a local repository instead of the largest one they can find?
Because those who dedicate themselves to finding the best information would have the best AI specific to that information. I would want my question routed to Joe’s shop, not Jiffy Lube’s.
Maybe your question is about photography, or maybe it’s specific to photography chemistry, or to the mechanisms of a shutter. Maybe you want to know how to set up a second-curtain flash, or what specific wavelengths your first doublet filters for, or what wavefront shape the light beam reflecting off the camera’s sensor will have, or what sort of material can absorb it best so it doesn’t reflect back as a haze. Who knows! Well, the very best people who do that sort of thing know, and they’ve got all the books about it! So why not share that information, if the books are just sitting there every day collecting dust?
Sure, something like this hasn’t been done yet. But it’s not because it can’t be done; it’s because it’s difficult to do. All the pieces are already there. We just need a few good puzzle masters to put them all together.
Ah, you want specialized instead of general.
Well then, are a couple of books enough to train your LLM? How many books are there on the wavelengths your first doublet filters for?
Seems like you’d also want a forum full of topic-specific comments to feed into the model. A photography textbook with a section on lenses is good; real questions and answers from actual photographers with real scenarios would be better for most people.
Not just that, but also my own notes from, say, Joplin. Can it consume those and expand my knowledge based on things I’ve already done and results I’ve already tried? And that’s still just an example. I could be a race car driver learning a particular raceway, or a set of other drivers, etc.
So, that kinda already exists. For example, if I ask DeepSeek R1 “How do I change the oil on a 2013 Ford Focus SE?” it will output the steps, going so far as to list the part number for the oil filter. If you are just looking for terms or phrases in a book collection, you wouldn’t really need AI for that. You’d probably convert the ebook to a parseable format like TXT and then just use regular expressions to pull out the matches, along with which book and where in the book each match was found.
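Something like this, say (just a sketch: ebook-convert ships with Calibre, and the file name and search pattern here are only examples):

```python
import re
import subprocess
from pathlib import Path

def find_term(epub: Path, pattern: str) -> None:
    """Convert an ebook to plain text with Calibre, then regex-search it."""
    txt = epub.with_suffix(".txt")
    # ebook-convert infers the formats from the file extensions.
    subprocess.run(["ebook-convert", str(epub), str(txt)], check=True)
    for lineno, line in enumerate(txt.read_text(errors="ignore").splitlines(), 1):
        if re.search(pattern, line, re.IGNORECASE):
            print(f"{epub.name}:{lineno}: {line.strip()}")

find_term(Path("repair_manual.epub"), r"oil filter")
```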
I only used that as an example that is relatable. At work we are always making new data, gigabytes hand over fist. But deciphering that data is a gargantuan task for little or no profit to the company. That means nobody gets to learn from it. But if I could write my observations along with the data in a format AI could learn from, it could help me dig deeper and deeper into answers. But you know, it’s all complicated stuff. Back at home, sure, the Dali llama can tell me the square root of the speed of light in lithium crystals. That’s everyday information. But I want to know how to use that information to build bigger and better things.
LOL:
In summary, the speed of light in lithium crystals such as lithium niobate is about 1.36×10⁸ m/s, and in lithium fluoride it’s about 2.16×10⁸ m/s, both significantly slower than in a vacuum due to their respective refractive indices. Just take the square root of that LOL.