Partial Indexing of an XML file (Bleve) - go

I am evaluating a couple different libraries to see which one will best fit what I need.
Right now I am looking at Bleve, but I am happy to use any library.
I am looking to index full files, except for specific ones which are in XML format. For those, I only want Bleve to index specific tags, as most of the tags are worthless to search. I am trying to evaluate whether this is possible but, being new to Bleve, I am not sure which part I need to customize.
The documentation is very good, but I can't seem to find this answer. All I need is an explanation with keywords and steps; no code is required. I just need a push, as I have spent hours spinning my wheels with Google searches and I am getting nowhere.

There are probably many ways to approach this. Here's one.
Bleve indexes documents which are collections of key/value metadata pairs.
In your case, a document could be represented by two key/value pairs: the name of the .xml file (to uniquely identify the document) and the content of the file.
type Doc struct {
    Name string
    Body string
}
The issue is that the body is XML, and Bleve doesn't support XML out of the box.
One way to address this is to pre-process the XML file by stripping the unwanted tags and content. You can do that with the encoding/xml package from the standard library.
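For instance, here is a rough sketch of that approach, assuming the Bleve v2 module (github.com/blevesearch/bleve/v2); the tag names "title" and "abstract", the file names, and the index path are placeholders for your own:

package main

import (
    "encoding/xml"
    "io"
    "log"
    "os"
    "strings"

    "github.com/blevesearch/bleve/v2"
)

type Doc struct {
    Name string
    Body string
}

// extractText walks the XML token stream and keeps character data only
// from the elements we care about; everything else is dropped.
func extractText(r io.Reader, wanted map[string]bool) (string, error) {
    dec := xml.NewDecoder(r)
    var sb strings.Builder
    depth := 0 // > 0 while we are inside a wanted element
    for {
        tok, err := dec.Token()
        if err == io.EOF {
            break
        }
        if err != nil {
            return "", err
        }
        switch t := tok.(type) {
        case xml.StartElement:
            if depth > 0 || wanted[t.Name.Local] {
                depth++
            }
        case xml.EndElement:
            if depth > 0 {
                depth--
            }
        case xml.CharData:
            if depth > 0 {
                sb.Write(t)
                sb.WriteByte(' ')
            }
        }
    }
    return sb.String(), nil
}

func main() {
    f, err := os.Open("manuscript.xml") // placeholder input file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Keep only the tags worth searching; "title" and "abstract" are made up.
    body, err := extractText(f, map[string]bool{"title": true, "abstract": true})
    if err != nil {
        log.Fatal(err)
    }

    index, err := bleve.New("example.bleve", bleve.NewIndexMapping())
    if err != nil {
        log.Fatal(err)
    }
    if err := index.Index("manuscript.xml", Doc{Name: "manuscript.xml", Body: body}); err != nil {
        log.Fatal(err)
    }
}

Searching the index afterwards (e.g. bleve.NewMatchQuery plus index.Search) then only hits the text you kept.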
For an example of a similar task, see the code of https://github.com/blevesearch/fosdem-search/
There they index a file in a custom format (https://github.com/blevesearch/fosdem-search/blob/master/fosdem.ical) by parsing it into a form they can submit to Bleve for indexing (https://github.com/blevesearch/fosdem-search/blob/master/ical.go).

Related

How do file converters work in general (e.g. Word to PDF, XML to JSON, Word to TXT)?

I've used many types of file converters: Word to PDF, XML to JSON, Word to TXT, etc.
How do they work on the backend? Are there specific guidelines each of them follows? Are there similarities in the way they are implemented?
I tried searching for this, but most of the articles take me to a web app that can convert the document; none of them gives clarity on how it's done.
All of them work by parsing the first document into a data structure, then generating a document in the other format from that data structure using recursion.
Parsing itself is a giant topic that people take courses on in computer science. But long story short, it proceeds by breaking the document into tokens, and then fitting the tokens into a parse tree using one of a standard set of methods. They have all sorts of fancy names like Recursive Descent and LALR(1). That's where most of the theory you'd want to learn is.
For example if you're writing a JSON to XML converter, you'd first need to parse that JSON. A JSON Parser shows how you could write that, from scratch, using recursive descent. Once written you just need to write a recursive function that takes each data type and does something appropriate with it to generate text in the format that you want.
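As a toy illustration of that parse-then-generate loop (not a production converter), here is a sketch in Go: encoding/json parses the source into a generic data structure, and a recursive function walks it to emit XML-ish text. The input document, the root element name, and the lack of character escaping are all simplifications made for the sketch.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "strings"
)

// toXML recursively turns the generic structure produced by the JSON
// parser into XML-like text. A real converter would also escape special
// characters and define how arrays and nulls map onto elements.
func toXML(name string, v interface{}, out *strings.Builder) {
    switch val := v.(type) {
    case map[string]interface{}:
        fmt.Fprintf(out, "<%s>", name)
        for k, child := range val {
            toXML(k, child, out)
        }
        fmt.Fprintf(out, "</%s>", name)
    case []interface{}:
        for _, item := range val {
            toXML(name, item, out) // repeat the element for each array item
        }
    default:
        // strings, numbers, booleans and nulls become text content
        fmt.Fprintf(out, "<%s>%v</%s>", name, val, name)
    }
}

func main() {
    input := `{"person": {"name": "Ada", "languages": ["en", "fr"]}}`

    // Step 1: parse the source format into a data structure.
    var doc map[string]interface{}
    if err := json.Unmarshal([]byte(input), &doc); err != nil {
        log.Fatal(err)
    }

    // Step 2: generate the target format from that structure, recursively.
    var out strings.Builder
    toXML("root", doc, &out)
    fmt.Println(out.String()) // e.g. <root><person><name>Ada</name>...</person></root>
}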
Incidentally you can also write a "document converter" that converts from a document format to the same document format. Why would someone want to do that? The two most common use cases are to prettify or minify code. Despite the fact that only one format is being dealt with, the principles of how you do it are exactly the same.

How to index files while uploading to Google Drive and then search using that index?

Is it possible to index file content during upload to Drive and later search against it using the Google API?
That way the search should be faster. I am also thinking of extracting content (using Vision) and using it to search later via the search APIs.
The search function for the files.list response is very limited. You can't search on all the fields, so unless you want to add some special tag to the name of the file, I don't think there is any way you're going to be able to index them for faster searching.
Yes you can.
You need to duplicate the file content and set it as contentHints.indexableText within your POST body.
From https://developers.google.com/drive/api/v3/reference/files/create
contentHints.indexableText (string): Text to be indexed for the file to improve fullText queries. This is limited to 128KB in length and may contain HTML elements.
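As a rough sketch of what that looks like with the Go client for the Drive v3 API (google.golang.org/api/drive/v3); the credentials file, the uploaded file, and the extracted text below are placeholders:

package main

import (
    "context"
    "log"
    "os"

    drive "google.golang.org/api/drive/v3"
    "google.golang.org/api/option"
)

func main() {
    ctx := context.Background()

    // Authentication is elided; credentials.json is a placeholder.
    srv, err := drive.NewService(ctx, option.WithCredentialsFile("credentials.json"))
    if err != nil {
        log.Fatal(err)
    }

    f, err := os.Open("scan.png") // placeholder file to upload
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Text extracted elsewhere (e.g. by an OCR / Vision pass) is duplicated
    // into contentHints.indexableText so fullText queries can match it.
    meta := &drive.File{
        Name: "scan.png",
        ContentHints: &drive.FileContentHints{
            IndexableText: "text extracted from the image goes here",
        },
    }

    uploaded, err := srv.Files.Create(meta).Media(f).Do()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("uploaded file id: %s", uploaded.Id)
}

A later files.list call with a query like fullText contains 'extracted' should then match on that indexable text.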

Tiny Elasticsearch-like search library in Ruby?

Is there an alternative to Elasticsearch? I am looking for something lightweight and tiny; some sort of Ruby library would be ideal. For example, instead of having to run a separate application, I could use the library to search YAML or JSON files for matching text and get back a list sorted by match score.
An example should make this clear.
Let's say I have a single YAML file with an array of strings, e.g.:
values:
  - "some nice text here"
  - "this is another sentence"
  - "some nice"
  - "nice text here"
So now if I search for "some nice text here", the first entry should come out on top, but there is also a slight similarity with the last two lines, so the search should return the results in this order:
some nice text here
nice text here
some nice
Thanks
Sphinx and Solr come to mind, but they're not small. Hell, even using PostgreSQL until you really need to scale is fine (and you'll know when you need to scale), and it's probably already baked into your stack.
If your needs are just to search YAML files, you could easily write your own naive approach, as sketched below.
There's also https://github.com/mezis/fuzzily but it might require some updating.
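For what it's worth, here is what such a naive approach could look like: rank entries by how many of the query's words they contain. It's sketched in Go purely for illustration; the same handful of lines translate directly to Ruby, reading the strings from your YAML file rather than a hard-coded list.

package main

import (
    "fmt"
    "sort"
    "strings"
)

// score counts how many of the query's words appear in the candidate.
// A real implementation might stem words or use TF-IDF / trigram similarity.
func score(query, candidate string) int {
    words := make(map[string]bool)
    for _, w := range strings.Fields(strings.ToLower(candidate)) {
        words[w] = true
    }
    n := 0
    for _, w := range strings.Fields(strings.ToLower(query)) {
        if words[w] {
            n++
        }
    }
    return n
}

func main() {
    // In the question these values come from a YAML file.
    values := []string{
        "some nice text here",
        "this is another sentence",
        "some nice",
        "nice text here",
    }
    query := "some nice text here"

    // Sort by descending score and drop entries with no overlap at all.
    sort.Slice(values, func(i, j int) bool {
        return score(query, values[i]) > score(query, values[j])
    })
    for _, v := range values {
        if score(query, v) > 0 {
            fmt.Println(v) // prints the three matching lines, best match first
        }
    }
}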

exist-db: XQuery and documents with XInclude

I'm embarking on a new project with eXist. We'll be storing a few hundred TEI XML documents that represent manuscripts. A number of the things we want to capture are repetitive, mainly people and places. My colleague has asked the TEI community about strategies for representing what we want to capture, and using XInclude was suggested as a way of reducing duplication.
I've had a quick play with adding an XInclude into a document, and the serialized XML does render the included XML file. However, the included text was missing from an XQuery. I notice in the eXist docs (http://exist-db.org/exist/apps/doc/xinclude.xml) that:
eXist-db expands XIncludes at serialization time, which means that the query engine will see the XInclude tags before they are expanded. You therefore cannot query across XIncludes - unless you create your own code (e.g. an XQuery function) for it. We would certainly like to support queries over xincluded content in the future though.
What is the best practice for querying files that use XInclude?
I'm wondering whether I should have a 'job' that serializes the source TEI XML files to expand the XIncludes and store these files in a separate collection? In that case, would file:serialize be the correct function for this task?
We are at the start of the project, so any advice appreciated.
Can you describe what kind of query you tried that was missing the text?
Generally, since the files referenced via XInclude are well-formed XML documents, you can use collections (folders) to organise your queries in eXist-db. So instead of for $search in doc("mydoc.xml") you could use for $search in collection('/app/mydata')/*
A more elaborate answer would follow the href attribute of the unexpanded XInclude element in the source document and find the matching element in the target, but it's difficult to abstract that without a concrete MWE.
Have you tried creating a temporary, expanded fragment in a let clause and querying that instead of the stored XML?
Beware of namespaces!
Hope this helps, and greetings to Sebastiaan.

Classify documents with tags

I have a huge number of documents (mainly PDFs and DOCs) I want to classify, so I can search over them according to certain tags. These tags could either be my own (I assign the tags to the document) or extracted from the text.
I've just seen a post related to this (Classify data using Apache Mahout), but perhaps there is something even more simple.
Mahout might be overkill for your problem - but you can get a fairly quick, easy solution by using OpenNLP.
http://opennlp.sourceforge.net/api/index.html
Specifically, look at the opennlp.tools.doccat package. Essentially, you have to go through and manually tag a small(ish) set of the items for each category you desire. If they are really distinct, you can get away with a small sample size.
You can use the DocumentCategorizerME.train() static function to train a model on a collection of documents, where each document requires a category tag and the block of text to train on. Then, you can initialize the DocumentCategorizerME with the trained model and begin classifying all the rest of your documents.
Once you do this, you can (I think) write the model to a file so you don't have to ever do that again.
This post on extracting keywords and classifying webpages is related and may be helpful. In your example it sounds like you can use tags in lieu of the keyword extraction piece (although you may want to use both in combination). Weka is easy to use, I would definitely recommend giving it a look.
