Submitted by iacoposk8 t3_zvkgar in deeplearning
knight1511 t1_j1u3uv3 wrote
Reply to comment by iacoposk8 in Search engine within a text document by iacoposk8
Also, if you are comfortable coding in Java, you can look at Apache Lucene, or at Solr, which is built on top of it. Lucene is what many search engines, including Elasticsearch, use behind the scenes. It is low level, but it lets you configure your program to do exactly what you want.
By the way, do you want to search the filenames as well or just the content inside the files?
iacoposk8 OP t1_j1udt2j wrote
I would like to search only in the content of a text, but in an intelligent way.
So if in the text file it says: "Hardware is the physical part of the computer and software is the logical part"
and to find it I wrote: "Hardware is the part of the computer that we can touch while software is the programs"
It should be able to find it for me anyway, right?
knight1511 t1_j1w32f7 wrote
What you are describing borders on semantic search. That is, you are not looking for the exact phrase but for something that means the same thing. Unfortunately, this is not something Elasticsearch can do for you out of the box.
Traditional search engines work by matching the words of your query against the words that occur in each document. They do this by building an inverted index, which is essentially a lookup table from each word/token to the documents that contain it, covering every document you want to index. When a query comes in, they use a similarity algorithm to score the indexed documents against the words/tokens in the query, then return the documents ranked from most to least similar. The semantics, or “meaning”, of the text is not considered.
If your query shares some words with the document you are looking for, then sure, you will get some relevant documents back. But if they share NO meaningful words (ignoring common ones like “the” and “is”), this will not work. For example, you can't expect it to return a document containing the phrase “the computer is not working” for the query “the pc is broken”.
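To make the two paragraphs above concrete, here is a minimal sketch of an inverted index with overlap-count ranking. It is a toy, not how Lucene or Elasticsearch are actually implemented: the stop-word list, the `tokenize`, `build_index` and `search` helpers, and the scoring rule (number of distinct query tokens found in a document) are all assumptions made up for this example.

```python
from collections import defaultdict

# Tiny stop-word list, chosen only for this example.
STOP_WORDS = {"the", "is", "a", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip('.,!?"') for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_index(docs):
    """Map each token to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Rank documents by how many distinct query tokens they contain."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = [
    "The computer is not working",
    "Hardware is the physical part of the computer and software is the logical part",
]
index = build_index(docs)

# Shared words are found and ranked as (doc_id, score) pairs.
print(search(index, "hardware and software of a computer"))  # [(1, 3), (0, 1)]

# No shared content words, so nothing comes back.
print(search(index, "the pc is broken"))  # []
```

The second query returns nothing even though document 0 means roughly the same thing, which is exactly the limitation described above.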
What you are looking for is semantic search. There are pre-trained language models on HuggingFace whose embeddings can be used as a search index, and Facebook's open-source FAISS library lets you search them. But the specifics depend heavily on your use case. The implementation is also a bit more complex and requires some coding expertise and an understanding of ML. You will need professional help unless you already know this stuff or are willing to learn it.
I have done this before. But to be honest, it is a time-consuming task and not something to be done for free ;)
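As a rough sketch of the idea behind embedding search: each text becomes a vector, and the search returns the stored vectors closest to the query vector. The snippet below does that by brute force with cosine similarity in plain Python. The 3-dimensional vectors are made up by hand purely for illustration; in a real setup they would come from a pre-trained model and have hundreds of dimensions, and a library such as FAISS would do the nearest-neighbour lookup efficiently at scale. The `normalize` and `cosine_search` helpers are hypothetical names, not part of any library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_search(doc_vectors, query_vector, k=1):
    """Return the k (doc_id, similarity) pairs closest to the query."""
    q = normalize(query_vector)
    scored = []
    for doc_id, vec in enumerate(doc_vectors):
        d = normalize(vec)
        # Dot product of unit vectors is the cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda item: -item[1])
    return scored[:k]

docs = ["the computer is not working", "the weather is nice today"]

# ASSUMPTION: hand-made toy vectors standing in for model embeddings.
doc_vectors = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]

query = "the pc is broken"
query_vector = [0.8, 0.9, 0.2]  # also a made-up stand-in

best_id, score = cosine_search(doc_vectors, query_vector, k=1)[0]
print(docs[best_id])  # the computer is not working
```

The match here does not depend on shared words at all, only on how close the vectors are, so the quality of the results rests entirely on how good the embedding model is.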
iacoposk8 OP t1_j1uj6mu wrote
I installed it. There is no GUI, right?
Could you give me an example of a command to index and search inside the file? Thank you so much