How to load data from Hadoop to Solr using Sqoop?

I want to copy indexes that were created by MR jobs, and which now reside in HDFS, into Solr. Is this possible using Sqoop?
If yes, which JDBC connector or driver should I use? If not Sqoop, is there any other way to do this?

You may want to consider using Flume. https://flume.apache.org/FlumeUserGuide.html#flume-1-5-2-user-guide
MorphlineSolrSink: This sink is well suited for use cases that stream raw data into HDFS (via the HdfsSink) and simultaneously extract, transform and load the same data into Solr (via MorphlineSolrSink).
For more: https://flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
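To give a rough idea of how that is wired up, here is a minimal Flume agent sketch, not a definitive configuration: the agent name, the spooling-directory source, and the file paths are assumptions for illustration; only the MorphlineSolrSink type and its morphlineFile property come from the Flume user guide.

```
# Hypothetical agent "agent1"; adjust the source to wherever your raw data arrives.
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = solrSink

# Example source (assumption): spool a local directory of files.
agent1.sources.src1.type     = spooldir
agent1.sources.src1.spoolDir = /var/spool/flume/raw
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type     = memory
agent1.channels.ch1.capacity = 10000

# MorphlineSolrSink: the morphline file defines the extract/transform/load into Solr.
agent1.sinks.solrSink.type          = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.morphlineFile = /etc/flume/conf/morphline.conf
agent1.sinks.solrSink.channel       = ch1
```

The actual mapping of fields to the Solr schema lives in the morphline file, so most of the ETL logic ends up there rather than in the agent configuration.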

Related

I want to ingest data using NiFi in two directions, one to HDFS and one to an Oracle database. Is it possible?

We are using NiFi to ingest data into HDFS. Can the same data be ingested into Oracle or any other database at the same time using NiFi?
I need to publish the same data to two places (HDFS and an Oracle database) and do not want to write two subscriber programs.
NiFi has processors to get data from an RDBMS (e.g. Oracle), such as QueryDatabaseTable and ExecuteSQL, and also from HDFS (ListHDFS, FetchHDFS, etc.). It also has processors to put data into an RDBMS (PutDatabaseRecord, PutSQL, etc.) or into HDFS (PutHDFS). So you can get your data from multiple sources and send it to multiple targets with NiFi.
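Conceptually, the fan-out is just one connection routed to two downstream processors. A rough sketch only; the source processor before the fan-out is an assumption, and only the "put" processors are named in the answer above:

```
  [source processor, e.g. ConsumeKafka or ListFile + FetchFile]
                        |
                        |  route the same outgoing relationship to both branches
            +-----------+-----------+
            |                       |
        [PutHDFS]           [PutDatabaseRecord]
    writes to HDFS        writes to Oracle through a DBCPConnectionPool
                          controller service configured with the Oracle JDBC driver
```

Because NiFi copies a FlowFile to every connection attached to the same relationship, no extra "tee" logic is needed to publish to both targets.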

How to filter data from multiple sources using Apache Flume?

I am using Flume to handle data from multiple sources and store it in HDFS, but I cannot figure out how to filter the data before it is stored in HDFS.
You have two options:
Use a Flume interceptor (check the answer here; a minimal config sketch is shown below).
Use a streaming solution (Apache Spark, Apache Heron/Storm) to filter the records and then store them in HDFS.
The second option gives you more flexibility to write different kinds of streaming patterns. Add a comment if you have more questions.
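For the interceptor route, a minimal sketch; the agent and source names are made up, while regex_filter, regex and excludeEvents are the documented properties of Flume's Regex Filtering Interceptor:

```
# Hypothetical agent "a1" with a source "r1" already defined.
a1.sources.r1.interceptors = i1

# Regex Filtering Interceptor: keep only events whose body matches the pattern...
a1.sources.r1.interceptors.i1.type  = regex_filter
a1.sources.r1.interceptors.i1.regex = ^ERROR.*
# ...or set excludeEvents = true to drop the matching events instead.
a1.sources.r1.interceptors.i1.excludeEvents = false
```

Events filtered out by the interceptor never reach the channel, so they are dropped before anything is written to HDFS.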

HDFS and HBase: how do they work?

Hi everybody,
I'm quite new to big data. I have installed an HDFS + HBase test database and I use Talend Big Data (an ETL tool) to run my tests.
I would like to know: if I put a file directly into HDFS, without going through HBase, will I never be able to query that data? I mean, would I have to read the entire file if I want to filter out just the data I'm interested in, is that right?
Thanks a lot for any help!
HDFS is just a distributed file system; you cannot query your files without going through an intermediate component.
HBase is a NoSQL database that persists your data on HDFS; use it when you need random access to your data.
If you want to store your files on HDFS as they are and still query them, you can create an external table on top of them using Hive.
The best option is to use Hive on top of the files that are on HDFS. You can use bucketing and partitioning in Hive for performance improvement.
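To illustrate, a rough sketch of such an external table, assuming the HDFS files are tab-delimited text; the table name, columns, and path are placeholders:

```sql
-- Hypothetical schema over files already sitting in HDFS;
-- Hive only stores metadata, the files stay where they are.
CREATE EXTERNAL TABLE my_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/my_events';

-- Then you can filter with SQL instead of reading the whole file yourself:
SELECT user_id, payload
FROM my_events
WHERE event_time LIKE '2015-01-01%';
```

Because the table is EXTERNAL, dropping it removes only the metadata; the files in HDFS are left untouched.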

Clarification of Sqoop and Flume

I am very new to big data and I have a little confusion regarding Sqoop and Flume.
So I understand the basic difference between Sqoop and Flume:
Sqoop is for transferring bulk data from an RDBMS
Flume is for streaming data such as log files
My confusion comes from the big data architecture I am looking at (of which I have no virtual copy): structured data is grouped and transferred by Sqoop, and unstructured data is streamed by Flume.
My question regarding that is: does that mean Flume is only for streaming?
What about high-frequency data? And does Flume support transfer of unstructured data that are not log files (i.e. audio, video), or would Sqoop be able to handle that?
My final question is: can Sqoop work with federated data sources? If yes, with both real and virtual ones?
Thanks,
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email messages and pretty much any data source possible.
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases (it imports the data, transforms it in Hadoop MapReduce, and then exports it).
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Source: sqoop-vs-flume-battle-of-the-hadoop
Reference: INGESTION AND STREAMING
Flume is efficient with streams, and if you just want to dump data from an RDBMS, why not use Sqoop?
By high-frequency data, if you mean something like social media, then yes, Flume can handle it. Unstructured data: yes, Flume may handle that too.
Sqoop is essentially a tool to ingest data into HDFS from an RDBMS. Under the hood, it generates simple Java code which submits a query to the RDBMS and writes the result to HDFS. This means that you can import with Sqoop anything that can be accessed via a JDBC connection and has a Java driver available. For this reason, you can't use it for files (like logs) or things like that.
So Sqoop can't handle video or audio files.
Flume, instead, is used to monitor and ingest information in real time. You can ingest anything for which a Flume source is available (https://flume.apache.org/FlumeUserGuide.html#flume-sources).
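To make the JDBC point concrete, a typical Sqoop import looks roughly like the following; the connection string, credentials, table, and target directory are placeholders:

```
# Hypothetical example: pull one table from MySQL into HDFS.
# Sqoop turns this into a MapReduce job; --num-mappers controls parallelism.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.db_password \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

By default Sqoop splits the import across mappers on the table's primary key; pass --split-by if you want to split on a different column.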

Processing HDFS files

Let me begin by saying I am a complete newbie to Hadoop. My requirement is to analyse server log files using Hadoop infrastructure. The first step I took in this direction was to stream the log files and dump them raw into my single node Hadoop cluster using Flume HDFS sink. Now I have a bunch of files with records which look something like this:
timestamp req-id level module-name message
My next step is to parse the files (separate out the fields) and store them back so that they are ready for searching.
What approach should I use for this? Can I do this using Hive? (sorry if the question is naive). The information available on the internet is overwhelming.
You can use HCatalog or Impala for faster querying.
From your explanation, you have time-series data. Hadoop with HDFS by itself is not meant for random access or querying. You can use HBase, a database for Hadoop that uses HDFS as its backend file system; it is good for random access.
Also, for your need to parse and rearrange the data, you can make use of Hadoop's MapReduce. HBase has built-in support for this: HBase can be used as the input/output of a MapReduce job.
You can get basic information from here. For a better understanding, try the HBase: The Definitive Guide / HBase in Action books.
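If you do try the Hive route mentioned in the question, one common approach is to lay an external table over the raw files written by the Flume HDFS sink and let a SerDe split the fields. A rough sketch only, assuming the five fields are whitespace-separated and the files sit under a made-up path:

```sql
-- Hypothetical table over the raw log files written by the Flume HDFS sink.
-- The regex groups map, in order, to: timestamp, req-id, level, module-name, message.
CREATE EXTERNAL TABLE server_logs (
  ts          STRING,
  req_id      STRING,
  log_level   STRING,
  module_name STRING,
  message     STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(.*)"
)
STORED AS TEXTFILE
LOCATION '/flume/logs';

-- Example query once the table is defined:
SELECT module_name, COUNT(*) AS errors
FROM server_logs
WHERE log_level = 'ERROR'
GROUP BY module_name;
```

This keeps the raw files as the system of record while making the parsed fields searchable with plain SQL; if you later need low-latency random access by key, that is where HBase comes in.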
