I am new to Apache Solr.
I have a requirement in my project where I have to upload PDF documents from HDFS to Solr, and from there I want to retrieve them using the Solr REST APIs.
I have about 40k PDF documents in my local file system. First I will push them to HDFS, but I have no idea how to get them from there into Solr.
Another thing is that while indexing into Solr, I want to read some data from each PDF document and index that data into Solr as well.
Example: I want to extract the candidate name and candidate location from the PDF document and push them into a Solr schema which looks like:
name: "candidate_name"
location: "candidate_location"
document: "pdf_document"
I searched for this on the internet but couldn't find the right solution.
Try using https://github.com/lucidworks/hadoop-solr
You should try the DirectoryIngestMapper; it has Tika parsing, but you will have to customize it.
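If the hadoop-solr toolkit feels too heavy, a hand-rolled pipeline is another option. Below is a minimal sketch, not the hadoop-solr API: it reads one PDF from HDFS, extracts its text with Apache Tika, and indexes it into Solr with SolrJ. The HDFS path, Solr core URL, and the name/location extraction stubs are assumptions you would replace with your own logic.

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class HdfsPdfIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical locations - adjust to your cluster and collection.
            Path pdfPath = new Path("hdfs:///resumes/candidate-0001.pdf");
            String solrUrl = "http://localhost:8983/solr/resumes";

            FileSystem fs = FileSystem.get(new Configuration());
            SolrClient solr = new HttpSolrClient.Builder(solrUrl).build();

            try (InputStream in = fs.open(pdfPath)) {
                // Extract the PDF body text with Tika (-1 = no limit on handler size).
                BodyContentHandler handler = new BodyContentHandler(-1);
                new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
                String text = handler.toString();

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", pdfPath.getName());
                doc.addField("name", extractName(text));         // candidate name
                doc.addField("location", extractLocation(text)); // candidate location
                doc.addField("document", text);                  // full extracted text

                solr.add(doc);
                solr.commit();
            } finally {
                solr.close();
            }
        }

        // Stubs: the question does not say how these values appear in the PDFs,
        // so real parsing rules (regexes, NLP, etc.) would go here.
        private static String extractName(String text) { return "candidate_name"; }
        private static String extractLocation(String text) { return "candidate_location"; }
    }

For all 40k files you would loop over fs.listFiles(...) and batch the solr.add calls, or post the raw PDFs to Solr's extracting request handler (Solr Cell) and let Solr run Tika for you.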
Related
While working with self-driving car data, I have a requirement where, for a given timestamp range, I need to search Elasticsearch and get the vehicle location image, video, latitudes, longitudes and speeds of the vehicle. The input data is loaded into Hadoop using Kafka + Spark Streaming; this transformed data contains JSON files, images and videos. Each JSON file has attributes such as course, speed, timestamp, latitude, longitude and accuracy. The JSON file name matches the corresponding image and video names; only the extensions (.JPG, .JSON) differ. Since searches should return results very fast on petabytes of data, we have been asked to use Elasticsearch and Kibana here, integrated with Hadoop (Impala or Presto may not match the performance of Kibana).
The problem is that we can store the image and video data in Hadoop using the sequence file format, but can the same data be indexed in ES through the Hadoop integration? Or can we fetch image or video data directly from Hive into Kibana while searching? Or is there an alternative way? Your help is much appreciated.
I have a search-based requirement. I am able to index Oracle database tables into Elasticsearch by using Logstash. In the same way, I have to index PNG/JPG/PDF files which are currently stored on a file server.
I am using Elasticsearch version 6.2.3. Does anyone have any idea about indexing files from a file server into Elasticsearch?
Purpose - why I am looking at indexing PNG/JPG/PDF:
I have to search for and display some products with their product information, and along with that I also have to display the product picture, which is stored on the file server.
I have a feature to search for documents (PDF), so if I search with any keyword, it should also search the contents of the documents and bring those documents back as search results. Here the documents' file paths are available in the DB; only the files themselves are on the file server.
For these two purposes, I am looking at indexing PNG/JPG/PDF files.
You just have to get the bytes from your image (you can do it in any programming language) and then save them in a field with the binary type. But that is not a good idea; instead, try saving a link to the image.
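As a rough illustration of the "save the link" suggestion, here is a hedged sketch using the Elasticsearch 6.x Java high-level REST client: the product document stores the picture's file-server path plus text extracted from the PDF, so keyword search works without putting binaries into the index. The index name, field names, host and paths are made up for the example, and exact client signatures vary between 6.x minor releases.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.http.HttpHost;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.RestClient;
    import org.elasticsearch.client.RestHighLevelClient;

    public class ProductIndexer {
        public static void main(String[] args) throws Exception {
            try (RestHighLevelClient client = new RestHighLevelClient(
                    RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

                Map<String, Object> doc = new HashMap<>();
                doc.put("product_name", "Sample product");
                doc.put("description", "Product information used for keyword search");
                // Store only the file-server locations of the picture and the PDF ...
                doc.put("picture_path", "\\\\fileserver\\products\\sample.jpg");
                doc.put("pdf_path", "\\\\fileserver\\docs\\sample.pdf");
                // ... and the text extracted from the PDF (e.g. with Apache Tika),
                // so content search can match the document body.
                doc.put("pdf_content", "extracted text of the PDF goes here");

                // "doc" is the single mapping type typically used in 6.x indices.
                IndexRequest request = new IndexRequest("products", "doc", "1").source(doc);
                client.index(request);
            }
        }
    }

If Elasticsearch itself has to parse the PDFs, the ingest-attachment plugin can decode a base64-encoded field at index time; for pictures, a stored path that your UI resolves against the file server is usually enough.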
I have crawled a website with Apache Nutch. I ran the steps in this order: inject, generate segments, fetch, parse, updatedb. In which directory is the extracted data stored? When I looked through the Nutch directories such as crawldb and segments, the data was in an unreadable format. After some searching I ran the dump command, which gives me the data in HTML format. Is that the right way of extracting the data?
Thank you.
You can use Solr to index that data. That way you can filter it with queries.
http://lucene.apache.org/solr/
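Nutch ships an indexing job that posts the parsed segments to a Solr URL; once the crawl is in Solr, any client can query it. A small SolrJ sketch, assuming a core named "nutch" and the standard url/title/content fields produced by Nutch's Solr schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class NutchSearch {
        public static void main(String[] args) throws Exception {
            // Hypothetical core name; use whatever core/collection Nutch indexed into.
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/nutch").build()) {

                SolrQuery query = new SolrQuery("content:keyword"); // full-text search
                query.addFilterQuery("title:*");                    // example filter: title must exist
                query.setFields("url", "title");
                query.setRows(10);

                QueryResponse response = solr.query(query);
                for (SolrDocument doc : response.getResults()) {
                    System.out.println(doc.getFieldValue("url") + " : " + doc.getFieldValue("title"));
                }
            }
        }
    }

The readable page text lives in Solr after indexing, so you normally do not need to read the binary crawldb/segment files directly.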
I have to set up a PHP project that maintains an Elasticsearch server. The client provides me with a large TXT/CSV file containing all the data that I want to import (update) into an index in Elasticsearch. With a bulk operation I have to specify a valid JSON structure, but I have a text file instead.
Is there a way of doing this using the Elasticsearch API? Or is it possible at all without converting the CSV file to JSON?
I am totally new to Elasticsearch and am having difficulty finding a solution.
I have a Solr search which uses a Lucene index as its backend.
I also have some data in Hadoop that I would like to use.
How do I copy this data into Solr?
The only links I can find when googling tell me how to use an HDFS index instead of a local index in Solr.
I don't want to read the index directly from Hadoop; I want to copy the data into Solr and read it from there.
How do I copy it? And it would be great if there were some incremental copy mechanism.
If you have a standalone Solr instance, then you could face some scaling issues, depending on the volume of data.
I am assuming high volume, given that you are using Hadoop/HDFS, in which case you might need to look at SolrCloud.
As for reading from HDFS, here is a tutorial from LucidImagination that addresses this issue and recommends the use of Behemoth.
You might also want to look at the Katta project, which claims to integrate with Hadoop and provide near real-time read access to large datasets. The architecture is illustrated here.
EDIT 1
Solr has an open ticket for this. Support for HDFS is scheduled for Solr 4.9. You can apply the patch if you feel like it.
You cannot just copy custom data to Solr; you need to index* it. Your data may have any type and format (free text, XML, JSON or even binary data). To use it with Solr, you need to create documents (flat maps with key/value pairs as fields) and add them to Solr. Take a look at this simple curl-based example.
Note that reading data from HDFS is a different question. For Solr, it doesn't matter where you read the data from, as long as you provide it with documents.
Storing the index on local disk or in HDFS is also a separate question. If you expect your index to be really large, you can configure Solr to use HDFS (see the configuration sketch below); otherwise you can keep the default settings and use the local disk.
* - "Indexing" is a common term for adding documents to Solr, but in fact adding documents to Solr's internal storage and indexing them (making fields searchable) are two distinct things and can be configured separately.