Carrot2 - can I cluster documents from a folder?

I'm trying to cluster documents I have collected as part of a research project. I am trying to use the Carrot2 Workbench and can't work out how to point Carrot2 at the folder containing the documents. How do I do this please? (I have a small number of documents (.txt) to compare, and they're on a standalone research machine, so I can't connect to the web and process them there.)
Any help gratefully received!
(I am trying to identify similarities/themes/groups across the documents; if Carrot2 isn't the right tool then would be grateful for alternative suggestions!)
Many thanks,
John

Currently Carrot2 Workbench does not support clustering files directly from a local folder. There are a few solutions here:
Convert all your text files to the Carrot2 XML format and cluster the XML file in Carrot2 Workbench (a conversion sketch follows this list).
Index your files in Apache Solr and query your Solr index from Carrot2 Workbench.
Convert your files to a Lucene index and query the index from Carrot2 Workbench. I wrote a simple utility for that task called folder2index (source code).
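For option 1, the Carrot2 XML input format is (to the best of my recollection; double-check against the Carrot2 documentation) a simple searchresult element containing document entries with title, snippet and optional url children. A minimal Python sketch that packs a folder of .txt files into one such XML file might look like this (the paths are just examples):

import glob
import os
from xml.sax.saxutils import escape

input_dir = r"c:\txt-input"            # folder with your .txt files (example path)
output_file = r"c:\carrot2-input.xml"  # XML file to load in Carrot2 Workbench

with open(output_file, "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n<searchresult>\n')
    for doc_id, path in enumerate(sorted(glob.glob(os.path.join(input_dir, "*.txt")))):
        with open(path, encoding="utf-8", errors="replace") as f:
            text = f.read()
        out.write('  <document id="%d">\n' % doc_id)
        out.write('    <title>%s</title>\n' % escape(os.path.basename(path)))
        out.write('    <snippet>%s</snippet>\n' % escape(text))
        out.write('  </document>\n')
    out.write('</searchresult>\n')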
Assuming you're on Windows and you go with the Lucene index route (option 3), the indexing process is the following:
Unzip the folder2index tool somewhere; let's assume you unzipped it to c:\carrot2\folder2index-0.0.2.
To index text files from some directory (let's assume c:\txt-input) and create the index in c:\txt-input-index, do this:
a. Open a command-line console (Start menu -> Run -> type cmd and press Enter).
b. In the console, type:
cd c:\carrot2\folder2index-0.0.2
java -jar folder2index-0.0.2.jar --index c:\txt-input-index --folders c:\txt-input --use-tika
After a short while you should see something like:
...
Index created: c:\txt-input-index
Once you've indexed the files, you can cluster them in Carrot2 Workbench using the Lucene document source. Use the content field name to refer to the content of your text files; the name of each file is stored in the fileName field.
A couple of notes:
Currently only PDF, HTML and TXT files are indexed; other files are ignored.
If the index already exists, files are added to the index. This means that if you run the command twice with the same parameters, the index will contain duplicate documents. To re-index a folder to which you've just added some files, it's best to delete the index directory first.
You can use the Query field in Carrot2 Workbench to select specific files from the index, e.g.:
*:* -- retrieves all the content (up to the requested number of results)
mining -- retrieves all the documents that contain the word "mining" in them (again, up to the requested number of results)
"data mining" -- retrieves documents that contain the exact phrase "data mining"
fileName:92* -- retrieves contents of files whose names start with "92"

I recently built a document clustering application. It is written in Java and is absolutely free. The Document Organizer software can cluster a huge collection of documents with the following extensions:
txt
pdf
doc
docx
xls
xlsx
ppt
pptx
If this software doesn't fulfill your requirements, please let me know.
Here's the link:
http://www.computergodzilla.com
If you want to read more, refer here:
http://computergodzilla.blogspot.com/2013/07/document-organizer-software.html

Related

Indexing .png/JPG/PDF files in elastic search from fileserver

I have a search-based requirement. I am able to index Oracle database tables into Elasticsearch using Logstash. In the same way, I now have to index PNG/JPG/PDF files which are all present on a fileserver.
I am using Elasticsearch version 6.2.3. Does anyone have an idea how to index files from the fileserver into Elasticsearch?
Purpose - why I am looking at indexing PNG/JPG/PDF:
I have to search for and display some products with their product information, and along with that I have to display the product picture, which is stored on the fileserver.
I also have a feature to search for documents (PDF), so if I search with any keywords, it should also search the contents of the documents and return those documents as results. The documents' file paths are available in the DB; only the files themselves are on the fileserver.
For these two purposes, I am looking at indexing PNG/JPG/PDF files.
You just have to get the bytes of your image (you can do it in any programming language) and then save them in a field with the binary type. But that is not a good idea; try saving a link to the image instead.
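To make that concrete, here is a rough sketch using the Python Elasticsearch client (index name, field names and paths are all made up; adapt them to your schema and 6.x mapping types). It stores the product data, the text extracted from the PDF, and a link to the picture on the fileserver instead of the raw bytes:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Single mapping type ("_doc"), as required by Elasticsearch 6.x.
es.indices.create(index="products", ignore=400, body={
    "mappings": {
        "_doc": {
            "properties": {
                "name":        {"type": "text"},
                "description": {"type": "text"},
                "image_path":  {"type": "keyword"},  # link to the picture on the fileserver
                "pdf_content": {"type": "text"}      # text extracted from the PDF (e.g. with Tika)
            }
        }
    }
})

es.index(index="products", doc_type="_doc", body={
    "name": "Example product",
    "description": "Product information from the database",
    "image_path": r"\\fileserver\products\92001.jpg",
    "pdf_content": "Extracted text of the product datasheet ..."
})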

On which file does elasticsearch store its data

I have two questions.
1. I have indexed some documents into my Elasticsearch cluster (running on a local machine), and some files (_0.cfe, _0.cfs, _0.si, segments_e, write.lock) are created in the data directory (E:\elasticsearch-6.1.1\data\nodes\0\indices\atLl0jUNTbuAJKT3OxgpUQ\4\index).
I want to know which of these files contains the exact data that I have indexed.
2. I tried to view these files, but they are not human readable. Is there any way to view the actual data?

How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search the text within these files. I've done the tutorials and read some of the documentation, but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make Elasticsearch index all the documents in this directory and be able to search for what I want?
Elasticsearch can only search documents it has indexed. Indexed means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch; I'm afraid there is no "one-click way" to achieve this...
You need:
1. A running cluster
2. An index designed for the documents
3. A way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and provide Elasticsearch with documents.
5 seconds of using Google brought me to
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try.
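If you'd rather script it yourself than use fscrawler, a very rough sketch of step 3 with the Python Elasticsearch client could look like the following (index name, field names and the directory path are assumptions, and fscrawler or Logstash will be more robust for anything serious):

import pathlib

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def generate_docs(root):
    # Walk the directory and turn every .csv/.hdr file into an index action.
    for path in pathlib.Path(root).rglob("*"):
        if path.suffix.lower() in (".csv", ".hdr"):
            yield {
                "_index": "my-files",
                "_source": {
                    "path": str(path),
                    "content": path.read_text(errors="replace"),
                },
            }

bulk(es, generate_docs("/data/my-directory"))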

How to index pdf files from HDFS to Solr

I am new to Apache Solr.
I have a requirement in my project where I have to upload PDF documents from HDFS to Solr, and from there I want to retrieve them using the Solr REST APIs.
I have a total of 40k PDF documents in my local file system; first I will push them to HDFS, but how to get them from there into Solr I really don't know.
Another thing: while indexing into Solr, I want to read some data from each PDF document and index that data into Solr as well.
Example: I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema which looks like:
name: "candidate_name"
location: "candidate_location"
document: "pdf_document"
I searched for this all over the internet, but couldn't find the right solution.
Try using https://github.com/lucidworks/hadoop-solr
You should try the DirectoryIngestMapper; it has Tika parsing, but you will have to customize it.
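If the hadoop-solr route turns out to be too heavy, an alternative sketch (not what the answer above describes) is to post each PDF straight to Solr's Tika-based extracting request handler and attach the candidate fields as literals. The core name, field names and how you obtain the name/location beforehand are all assumptions here:

import requests

solr_core = "http://localhost:8983/solr/candidates"   # hypothetical core

with open("resume-0001.pdf", "rb") as pdf:
    resp = requests.post(
        solr_core + "/update/extract",
        params={
            "literal.id": "resume-0001",
            "literal.name": "Jane Doe",         # candidate_name, extracted beforehand
            "literal.location": "Hyderabad",    # candidate_location, extracted beforehand
            "commit": "true",
        },
        files={"file": ("resume-0001.pdf", pdf, "application/pdf")},
    )
resp.raise_for_status()  # the PDF body lands in the field configured for extracted text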

Export/Import Kibana 4 saved Searches, Visualization & Dashboards

I'm looking for a list of commands required to export and then import all Kibana 4 saved Searches, Visualizations and Dashboards.
I'd also like to have the default Kibana 4 index pattern created automatically for logstash.
I've tried using elasticdump as outlined here http://air.ghost.io/kibana-4-export-and-import-visualizations-and-dashboards/ but the default Kibana index pattern isn't created and the saved searches don't seem to get exported.
You can export saved visualizations, dashboards and searches from Settings >> Objects.
You also have to export the visualizations and searches associated with a dashboard; clicking Export on the dashboard alone will not include the dependent objects.
All information pertaining to saved objects like saved searches, index patterns, dashboards and visualizations is saved in the .kibana index in Elasticsearch.
The GitHub project elastic/beats-dashboards contains a Python script for dumping Kibana definitions (to JSON, one file per definition), and a shell script for loading those exported definitions into an Elasticsearch instance.
The Python script dumps all Kibana definitions, which, in my case, is more than I want.
I want to distribute only some definitions: specifically, the definitions for a few dashboards (and their visualizations and searches), rather than all of the dashboards on my Elasticsearch instance.
I considered various options, including writing scripts to fetch a specific dashboard definition, parse it, and then fetch the visualization and search definitions it cites, but for now I've gone with the following solution (inelegant but pragmatic).
In Kibana, I edited each definition, and inserted a string into the Description field that identifies the definition as being one that I want to export. For example, "#exportme".
In the Python script (from beats-dashboards) that dumps the definitions, I introduced a query parameter into the search function call, limiting it to definitions with that identifying string. For example:
res = es.search(
    index='.kibana',
    doc_type=doc_type,
    size=1000,
    q='description:"#exportme"')
(Rather than hardcoding the "hashtag", it is better practice to specify it via a command-line argument.)
One aspect of the dump'n'load scripts provided with elastic/beats-dashboards that I particularly like is their granularity: one JSON file per definition. I find this useful for version control.
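Continuing the modified search call above, the per-definition dump itself can be a few lines of Python (the file layout is my own choice, not part of beats-dashboards):

import json
import os

for hit in res["hits"]["hits"]:
    out_dir = hit["_type"]                  # e.g. dashboard, visualization, search
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, hit["_id"] + ".json"), "w") as f:
        json.dump(hit["_source"], f, indent=2)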
You can get searches using elasticdump like so:
elasticdump --input=http://localhost:9200/.kibana --output=$ --type=data --searchBody='{"filter": {"type": {"value": "search"}} }'
