I started playing with Elasticsearch. I want to create an index for text files: I have multiple text files in a folder and want to index them so that I can perform a text search across these files. Is there a way to do this from the command line? Please guide me with an example.
Yes, you can, by using the FS river plugin together with the mapper attachment plugin. Here is a link to the source page.
I ran a few tests with it a little while ago and it works fine. Be aware, though, that the file has to be local for this to work (although you can mount a remote filesystem to a local path).
Hope this helps.
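If you would rather not rely on the river plugin, the same result can be scripted directly. Below is a minimal sketch using the official Python client (elasticsearch, 8.x API assumed); the cluster address, folder path, and index name are placeholders to adapt, not anything from the plugin's documentation.

```python
from pathlib import Path
from elasticsearch import Elasticsearch  # pip install elasticsearch (8.x client assumed)

es = Elasticsearch("http://localhost:9200")   # placeholder cluster address
folder = Path("/path/to/textfiles")           # placeholder folder with the .txt files

# Index each file's content under a "content" field.
for txt in folder.glob("*.txt"):
    es.index(
        index="textfiles",                    # placeholder index name
        document={"filename": txt.name, "content": txt.read_text(encoding="utf-8")},
    )

# Full-text search across all indexed files.
result = es.search(index="textfiles", query={"match": {"content": "your search terms"}})
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["filename"])
```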
I work with CSV files and upload them to an S3 server.
Sometimes, after some small processing I do on a file, I get hidden characters before the first column. I want to write a script that cleans the files before upload, but I can only see those characters in specific text editors like nano; Python doesn't show them, and I only notice them in Amazon Athena after the query has already been created, so I have to fix the file and upload it again.
Does anyone know a solution to this problem?
After a little research I learned that the symbol is called a BOM (byte order mark) and that it gets added to the files because I write them with encoding='utf-8'.
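If it helps, here is a minimal sketch of a clean-up step you could run before uploading; the glob pattern is a placeholder, and it relies on the fact that Python's 'utf-8-sig' codec transparently drops a leading BOM when reading.

```python
import glob

def strip_bom(path):
    # The 'utf-8-sig' codec drops a leading BOM if one is present and
    # behaves exactly like plain 'utf-8' otherwise.
    with open(path, encoding="utf-8-sig", newline="") as f:
        text = f.read()
    # Rewrite with plain UTF-8 so no BOM is written back; newline="" keeps
    # the original line endings untouched.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

for csv_path in glob.glob("*.csv"):   # placeholder pattern; point it at your folder
    strip_bom(csv_path)
```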
Good morning,
I spent almost an hour trying to find some lines of code that I knew I had written somewhere, but couldn't remember in which file.
I tried many things with the Windows search tool to find that file, such as using wildcards (* for a string, ? for a character) or using the content: filter, but it never managed to find it, even though all the files are indexed and I use the search tool often enough to know that it works in those folders (usually searching directly for a file name, not for content).
I did find the file eventually (just by opening and scrolling through each of them...), but it still bothers me that the search below wasn't able to find this file, see the screenshot.
Is there any way to make this search work in the W7 search tool? It's a .py file, so the content: filter should be able to search inside text files, shouldn't it?
I installed Hue with Cloudera Manager on AWS and uploaded some directories with a few files in them. If I am on the /user/hdfs path, there are directories like project1 and project2. If I search for project, I get the projects as results. But if I search for files inside the project directories, like file1, I get no results. I have looked at the config files for HDFS, Hue, and Solr, but cannot find the relevant configuration parameter. How can I fix this so that I can search deeper into HDFS with the Hue file browser?
Hue doesn't use Solr for the file browser; it uses WebHDFS, which can only list from the current folder.
The search within the File Browser is just a filtering of the files in the current directory, not a full-text search across all the directories.
Currently the top search is only about tables and documents: http://gethue.com/realtime-catalog-search-with-hue-and-apache-atlas/
In the future, it will be extended to leverage the File search of Apache Atlas.
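If you need to find files in subdirectories today, one option outside the File Browser is to walk HDFS yourself through the WebHDFS REST API. Here is a minimal sketch, assuming simple authentication and that the NameNode's WebHDFS endpoint is reachable at http://namenode:9870 (50070 on Hadoop 2.x); host, port, and user name are placeholders.

```python
import requests

WEBHDFS = "http://namenode:9870/webhdfs/v1"   # placeholder NameNode WebHDFS endpoint
USER = "hdfs"                                 # placeholder user for simple auth

def find(path, needle):
    """Recursively yield paths under `path` whose name contains `needle`."""
    url = f"{WEBHDFS}{path}"
    listing = requests.get(url, params={"op": "LISTSTATUS", "user.name": USER}).json()
    for status in listing["FileStatuses"]["FileStatus"]:
        child = f"{path.rstrip('/')}/{status['pathSuffix']}"
        if needle in status["pathSuffix"]:
            yield child
        if status["type"] == "DIRECTORY":
            yield from find(child, needle)

for match in find("/user/hdfs", "file1"):
    print(match)
```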
I'm trying to index a lot of LaTeX and Markdown files, which are spread across different folders, into Elasticsearch from the command line.
So far I haven't been able to find a tutorial which gives me detailed information on how to do it.
Is there anyone with Elasticsearch experience who could help me out?
Thank you very much.
Collecting files is easy with Logstash.
But what are you trying to achieve? Capturing the full LaTeX file or just the raw text?
If you're only after the raw text, I'd use Detex, and you can actually call it from Logstash with the exec plugin. It should be pretty straightforward.
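If you end up scripting it instead of (or before) wiring up Logstash, the same idea is a few lines of Python: run detex on each .tex file, read Markdown as-is, and index the text. A minimal sketch, assuming detex is on the PATH and reusing the placeholder cluster and 8.x client from the earlier example; the folder and index name are placeholders too.

```python
import subprocess
from pathlib import Path
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder cluster address
root = Path("/path/to/notes")                 # placeholder root folder to walk

for path in root.rglob("*"):
    if path.suffix == ".tex":
        # detex strips the LaTeX commands and leaves the raw text on stdout.
        text = subprocess.run(["detex", str(path)], capture_output=True, text=True).stdout
    elif path.suffix == ".md":
        text = path.read_text(encoding="utf-8")
    else:
        continue
    es.index(index="notes", document={"path": str(path), "content": text})  # placeholder index
```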
Should the file name contain a number for textFileStream to pick it up? My program picks up new files only if the file name contains a number, and ignores all other files even if they are new. Is there any setting I need to change to pick up all the files? Please help.
No. It scans the directory for new files that appear within the window. If you are writing to S3, do a direct write with your code, as the file doesn't appear until the final close(), so there is no need to rename. In contrast, if you are working with file streaming sources against normal filesystems, you should create the file outside the scanned directory and rename it in at the end; otherwise work-in-progress files may get read. And once a file has been read, it is never re-read.
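To illustrate, here is a minimal PySpark sketch of the file streaming source plus the "write outside the scanned directory, then rename it in" pattern for a normal filesystem; the directories, batch interval, and local-mode setup are placeholders.

```python
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

incoming = "/tmp/incoming"    # placeholder directory scanned by the stream
staging = "/tmp/staging"      # placeholder directory on the same filesystem
os.makedirs(incoming, exist_ok=True)
os.makedirs(staging, exist_ok=True)

sc = SparkContext("local[2]", "file-stream-demo")
ssc = StreamingContext(sc, 10)                 # 10-second batches

# Every file that newly appears in the directory is picked up, whatever its name.
lines = ssc.textFileStream("file://" + incoming)
lines.count().pprint()
ssc.start()

# Producer side: write the file outside the scanned directory, then rename it in,
# so a half-written file is never picked up.
tmp_path = os.path.join(staging, "batch-001.txt")
with open(tmp_path, "w") as f:
    f.write("some records\n")
os.rename(tmp_path, os.path.join(incoming, "batch-001.txt"))

ssc.awaitTermination()
```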
After spending hours analyzing the stack trace, I figured out that the problem was the S3 address. I was providing "s3://mybucket", which worked on Spark 1.6 and Scala 2.10.5. On Spark 2.0 (and Scala 2.11), it must be provided as "s3://mybucket/". Maybe it is some regex-related thing. It is working fine now. Thanks for all the help.
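For anyone hitting the same thing, the only difference is the trailing slash on the bucket URL. A small sketch (the bucket name and the read call are placeholders for whatever your job actually does):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-read").getOrCreate()

# On Spark 2.0 the bare bucket form "s3://mybucket" failed for me;
# adding the trailing slash fixed it.
df = spark.read.text("s3://mybucket/")   # placeholder bucket
df.show()
```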