I am looking at adding full-text search over the files in multiple GitHub repos, and I am considering Elasticsearch and Logstash as options. Please suggest an approach and point me to any references or examples for this kind of setup. Thanks!
I installed Hue with Cloudera Manager on AWS and uploaded some directories with a few files in them. If I am on the /user/hdfs path, there are directories like project1 and project2. If I search for "project", the project directories show up as results. But if I search for files inside the project directories, like file1, I get no results. I have looked at the config files for HDFS, Hue, and Solr, but cannot find the relevant configuration parameter. How can I fix this so that I can search deeper into HDFS with the Hue File Browser?
Hue doesn't use Solr for the File Browser; it uses WebHDFS, which can only search within the current folder.
The search within the File Browser is just a filter over the files in the current directory, not a full-text search across all directories.
Currently the top search only covers tables and documents: http://gethue.com/realtime-catalog-search-with-hue-and-apache-atlas/
In the future, it will be extended to leverage the File search of Apache Atlas.
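In the meantime, if you really need to search below the current folder, you can walk the tree over WebHDFS yourself. A rough Python sketch, assuming the requests package; the NameNode host/port is a placeholder:

    # recursive filename search over WebHDFS -- the NameNode address below
    # is a placeholder, and the match is a simple substring test on names
    import requests

    WEBHDFS = 'http://namenode.example.com:50070/webhdfs/v1'

    def find(path, pattern, user='hdfs'):
        """Yield paths under `path` whose names contain `pattern`."""
        r = requests.get(WEBHDFS + path,
                         params={'op': 'LISTSTATUS', 'user.name': user})
        r.raise_for_status()
        for entry in r.json()['FileStatuses']['FileStatus']:
            child = path.rstrip('/') + '/' + entry['pathSuffix']
            if pattern in entry['pathSuffix']:
                yield child
            if entry['type'] == 'DIRECTORY':
                yield from find(child, pattern, user)

    for match in find('/user/hdfs', 'file1'):
        print(match)

Note that this issues one LISTSTATUS call per directory, so it can be slow on large trees.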
In our team we have several projects, each documented separately with Sphinx.
We want a central documentation page that includes all of our projects.
Of course we could build an HTML page with links to the different documentations.
Is there a way to combine the documentations with Sphinx itself?
Maybe with a separate documentation project, which somehow includes the documentation from the other projects?
You could try intersphinx to create links between the existing documentations. This works much like the way Wikipedia links to other wikis in the Wikimedia universe.
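For example, a minimal sketch of the central project's conf.py (the project names and URLs are placeholders for your own published docs):

    # conf.py of the central documentation project -- minimal intersphinx setup
    extensions = ['sphinx.ext.intersphinx']

    # map a short name to each project's published documentation
    intersphinx_mapping = {
        'project1': ('https://docs.example.com/project1/', None),
        'project2': ('https://docs.example.com/project2/', None),
    }

Cross-references like :ref:`project1:some-label` then resolve against the other project's published inventory.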
An alternative is to check out all documentation directories into a new repository. You might need to create new toctrees, because the existing ones either won't fit the overall structure or the nesting gets too deep. For that you could provide alternative index files, named e.g. master.rst instead of index.rst.
An individual documentation is then still built from its original index.rst, but when it is included in the master documentation, only the master.rst files are considered as toctree roots.
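A hedged way to wire that up in each project's conf.py (DOCS_ROOT is a made-up environment variable here, not a Sphinx setting):

    # individual builds keep index.rst as the root document; the combined
    # build selects master.rst by setting DOCS_ROOT in the environment
    import os

    master_doc = os.environ.get('DOCS_ROOT', 'index')  # 'root_doc' in Sphinx >= 4

The master build would then run sphinx-build with DOCS_ROOT=master set, while the plain per-project builds keep working unchanged.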
I'm trying to index a lot of LaTeX and Markdown files that sit in different folders, using Elasticsearch from the command line.
So far I haven't been able to find a tutorial with detailed information on how to do this.
Is there anyone with Elasticsearch experience who could help me out?
Thank you very much.
Collecting files is easy with Logstash.
But what are you trying to achieve: capturing the full LaTeX source, or just the raw text?
If you're only after the raw text, I'd use Detex, and you can actually call it from Logstash with the exec plugin. It should be pretty straightforward.
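If you'd rather script it directly instead of going through Logstash, here is a minimal Python sketch of the same idea (assuming detex is on your PATH, a local Elasticsearch node on the default port, the 7.x Python client, and a made-up index name):

    # strip LaTeX markup with detex and index the raw text
    import subprocess
    from pathlib import Path
    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch()  # assumes http://localhost:9200

    for tex in Path('docs').rglob('*.tex'):
        raw = subprocess.run(['detex', str(tex)],
                             capture_output=True, text=True).stdout
        es.index(index='papers', body={'path': str(tex), 'content': raw})

Markdown files are already mostly raw text, so those you could index as-is with the same loop.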
I have a folder on my PC containing many epub and PDF files that I want to be able to search in full text.
I know Windows already has an indexing service, but I would like to apply more logic than a simple keyword search.
So I would like to import those epub and PDF files into Elasticsearch. Does anyone know a script that can do this?
Elasticsearch has a plugin for mapping attachments, so I hope this will help you:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/mapper-attachments.html
https://github.com/elastic/elasticsearch-mapper-attachments
It works fine for me.
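Note the plugin targets older Elasticsearch releases; from 5.x on it was replaced by the ingest-attachment processor. A minimal sketch of using it from Python (index and type names are made up; the client calls follow the old 2.x-era API):

    # the plugin expects the file as a base64 string in an 'attachment' field;
    # it runs the content through Apache Tika, which handles pdf and epub
    import base64
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index='library', body={
        'mappings': {'doc': {'properties': {'file': {'type': 'attachment'}}}}
    })

    with open('book.pdf', 'rb') as f:
        es.index(index='library', doc_type='doc', body={
            'file': base64.b64encode(f.read()).decode('ascii')
        })

After that, a match query on file.content searches the extracted text.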
I am trying to organize the files on a common file share of my department, which contains thousands of documents of various file types. My idea was to sort them by content-related keywords. Only a few files contain valid info in the Keywords file attribute provided by Windows. So my idea was to let some desktop search engine index the files (and their content) and then reuse the keywords generated for the index.
The problem is that I don't know how to read these generated keywords from the search index.
Neither Microsoft nor Copernic seem to provide any information on how to access their index files.
MSDN only documents how to query the Windows Search engine directly from your program, but the results contain only Windows file attributes and file information, not the generated keywords used for indexing.
Copernic does not seem to provide any info at all.
I am very grateful for any idea on how to access these generated keywords.
Thank you in advance!
If Google Desktop Search is an option, you can use the Google Desktop Search API.
A more programming-intensive option is using Lucene. Somewhere in the middle is Nutch.
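For reference, the MSDN route mentioned in the question looks roughly like this from Python (a sketch assuming the adodbapi package; the share path is a placeholder). It also shows the limitation described above: System.Keywords returns only explicitly set keywords, not the engine's internal index terms.

    # query the Windows Search index over OLE DB
    import adodbapi

    conn = adodbapi.connect(
        "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';")
    cur = conn.cursor()
    cur.execute("SELECT System.ItemName, System.Keywords FROM SYSTEMINDEX "
                "WHERE SCOPE='file:C:/fileshare'")
    for row in cur.fetchall():
        print(row[0], row[1])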