I have two questions.
1. I have indexed some documents into my Elasticsearch cluster (running on my local machine), and some files (_0.cfe, _0.cfs, _0.si, segments_e, write.lock) were created in the data directory (E:\elasticsearch-6.1.1\data\nodes\0\indices\atLl0jUNTbuAJKT3OxgpUQ\4\index).
I want to know which of these files contains the actual data that I have indexed.
2. I tried to view these files, but they are not human readable. Is there any way to view the actual data?
Elasticsearch is a search engine, according to Wikipedia. This implies it is not a database and does not store the data it is indexing (but presumably does store its indexes).
There are presumably two ways to get data into ES: log shipping, or directly via the API.
Let's say my app wants to write an old-fashioned log file entry:
Logger.error(now() + " something bad happened in module " + module + ";" + message);
This could either write to a file or put the data directly into ES using a REST API.
If it was done via the REST API, does ES store the entire log message, in which case you don't need to waste disk writing the logs to files for compliance etc.? Or does it only index the data, so you need to keep a separate copy? If you delete or move the original log file, how does ES know, and is what it does store still useful?
If you write to a log file and then use Logstash or similar to put the log data into ES, does ES store the entire log file as well as any indexes?
How does ES parse or index arbitrary log files? Does it treat a log line as a single string, or does it require logs to have a specific format such as CSV or JSON?
Does anyone know of a resource with this key info?
Elasticsearch does store the data you are indexing.
When you ingest data into Elasticsearch, it is stored in one or more indices and can then be searched. To be able to search something with Elasticsearch you need to store the data in Elasticsearch; it cannot, for example, search external files.
In your example, if you have an app sending logs to Elasticsearch, it will store the entire message you send, and once it is in Elasticsearch you don't need the original log anymore.
If you need to parse your documents into different fields, you can do it before sending the log to Elasticsearch as a JSON document, use Logstash to do this, or use an ingest pipeline in Elasticsearch.
A good starting point to learn more about how it works is the official documentation.
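For illustration, here is a minimal Java sketch (assuming the 6.x high-level REST client, a local cluster on localhost:9200, and a hypothetical app-logs index) that sends an already parsed log line as a JSON document. Elasticsearch keeps the full source of this document, not just an inverted index over it:

import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        // Connect to a local Elasticsearch node (adjust host/port as needed).
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));

        // The log line, already split into fields before sending it.
        Map<String, Object> doc = new HashMap<>();
        doc.put("timestamp", "2018-01-15T10:22:31Z");
        doc.put("level", "ERROR");
        doc.put("module", "billing");
        doc.put("message", "something bad happened");

        // The whole document is stored as _source and is retrievable later.
        client.index(new IndexRequest("app-logs", "doc").source(doc));
        client.close();
    }
}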
While working with self-driving car data, I have a requirement where, given a range of timestamps, I need to search Elasticsearch and get the vehicle's location image, video, latitude, longitude, and speed. The input data will be loaded into Hadoop using Kafka + Spark Streaming; this transformed data contains JSON files, images, and videos. The JSON file has attributes like course, speed, timestamp, latitude, longitude, and accuracy. The JSON file name matches the corresponding image and video names, only the extensions (.JPG, .JSON) differ. As the search should fetch results very fast on petabytes of data, we are asked to use Elasticsearch and Kibana here by integrating with Hadoop (Impala or Presto may not match the performance of Kibana).
The problem here is that we can store image and video data in Hadoop using the sequence file format. But can the same data be indexed in ES with Hadoop integration? Or can we fetch image or video data directly from Hive into Kibana while searching? Or is there an alternative way? Your help is much appreciated.
I am new to Apache Solr.
I have a requirement in my project where I have to upload PDF documents from HDFS to Solr, and from there I want to retrieve them using Solr's REST APIs.
I have 40k PDF documents in total in my local file system; first I will push them to HDFS. But from there to Solr I really don't have any idea.
Another thing is that while indexing into Solr, I want to read some data from each PDF document and index that data into Solr as well.
Example: I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema which looks like:
name: "candidate_name"
location: "candidate_location"
document: "pdf_document"
I searched for this on the internet but couldn't find the right solution.
Try using https://github.com/lucidworks/hadoop-solr
You should try the DirectoryIngestMapper; it has Tika parsing, but you will have to customize it.
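If you need more control over the extraction than the Hadoop job gives you out of the box, a rough sketch of the idea in plain Java (using Tika to pull text out of a PDF and SolrJ to index it; the core name, field names and the extractName/extractLocation helpers are placeholders for your own logic) could look like this:

import java.io.File;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class PdfToSolr {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/candidates").build();
        Tika tika = new Tika();

        // Extract the raw text of one PDF with Tika.
        String text = tika.parseToString(new File("resume.pdf"));

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("name", extractName(text));
        doc.addField("location", extractLocation(text));
        doc.addField("document", text);

        solr.add(doc);
        solr.commit();
        solr.close();
    }

    // Placeholders: implement with regexes, NLP, etc. over the extracted text.
    private static String extractName(String text) { return ""; }
    private static String extractLocation(String text) { return ""; }
}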
I have written code in Java to search through Solr, and it returns a particular document. But now I want to show the directory where the data is stored as output to the user.
If the user searches for a keyword, I want to show them in which directory their data is stored.
You can use a (complicated) workaround: retrieve the name of the shard used by adding "fl=*,[shard]" to your request. From there you can find the name of the core via the core.properties file in the shard's directory. The index data will be stored in HDFS under the core's name.
However, this does look like an XY problem.
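If you do go with that workaround, the first step in SolrJ could look roughly like this (a sketch, assuming a hypothetical collection URL; Solr fills in the [shard] pseudo-field on distributed requests):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ShardLookup {
    public static void main(String[] args) throws Exception {
        // Point this at the collection you already query from your Java code.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build();

        SolrQuery query = new SolrQuery("keyword");
        // Ask Solr to return, for each hit, the shard it came from.
        query.setFields("*", "[shard]");

        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id")
                    + " -> " + doc.getFieldValue("[shard]"));
        }
        solr.close();
    }
}

From the shard name you can then locate the matching core.properties and, from that, the index directory in HDFS.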
I'm trying to cluster documents I have collected as part of a research project. I am trying to use the Carrot2 Workbench and can't find out how to point Carrot2 at the folder containing the documents. How do I do this, please? (I have a small number of documents (.txt) to compare, and they're on a standalone research machine, so I can't connect to the web and process them there.)
Any help gratefully received!
(I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't the right tool then I would be grateful for alternative suggestions!)
Many thanks,
John
Currently Carrot2 Workbench does not support clustering files directly from a local folder. There are a few solutions here:
Convert all your text files to the Carrot2 XML format and cluster the XML file in Carrot2 Workbench.
Index your files in Apache Solr and query your Solr index from Carrot2 Workbench.
Convert your files to a Lucene index and query the index from Carrot2 Workbench. I wrote a simple utility for that task called folder2index (source code).
Assuming you're on Windows, the indexing process is the following:
Unzip the folder2index tool somewhere; let's assume you unzipped it to c:\carrot2\folder2index-0.0.2.
To index text files from some directory (let's assume c:\txt-input) and create the index in c:\txt-input-index, do this:
a. Open command line console (Start menu -> Run program -> type cmd and press Enter).
b. In the console, type:
cd c:\carrot2\folder2index-0.0.2
java -jar folder2index-0.0.2.jar --index c:\txt-input-index --folders c:\txt-input --use-tika
After a short while you should see something like:
...
Index created: c:\txt-input-index
Once you've indexed the files, you can cluster them in Carrot2 Workbench using the Lucene document source. Use the content field name to refer to the content of your text files; the name of the file is stored in the fileName field.
A couple of notes:
Currently only PDF, HTML and TXT files are indexed, other files are ignored.
If the index already exists, files are added to the index. This means that if you run the command twice with the same parameters, the index will contain duplicate documents. To re-index a folder to which you've just added some files, it's best to delete the index directory first.
You can use the Query field in Carrot2 Workbench to select specific files from the index, e.g.:
*:* -- retrieves all the content (up to the requested number of results)
mining -- retrieves all the documents that contain the word "mining" in them (again, up to the requested number of results)
"data mining" -- retrieves documents that contain the exact phrase "data mining"
fileName:92* -- retrieves contents of files whose names start with "92"
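If you want to sanity-check the index outside the Workbench, a small Lucene sketch along these lines (assuming a Lucene version compatible with the one folder2index uses, and the fileName/content fields described above) can list what was indexed:

import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class ListIndexedFiles {
    public static void main(String[] args) throws Exception {
        // Open the index created by folder2index.
        DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get("c:\\txt-input-index")));

        for (int i = 0; i < reader.maxDoc(); i++) {
            Document doc = reader.document(i);
            // fileName holds the original file name; content holds the extracted text.
            System.out.println(doc.get("fileName"));
        }
        reader.close();
    }
}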
I recently built a document clustering application. The software is written in Java and is absolutely free. Document Organizer Software can cluster a huge collection of documents with the following extensions:
txt
pdf
doc
docx
xls
xlsx
ppt
pptx
If this software doesn't fulfill your requirements, please let me know.
Here's the link:
http://www.computergodzilla.com
If you want to read more, refer here:
http://computergodzilla.blogspot.com/2013/07/document-organizer-software.html