Store data in a Linux local directory using Flume - Hadoop

Is it possible to get data from Oracle using Flume and store it in local Linux folders, not in HDFS?

Using "File Roll Sink" you can store streaming data to local system. But Flume can't use to ingest data from any RDMS tool.

Not sure about Oracle, but writing to the local filesystem is implemented by the File Roll Sink.
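For reference, a minimal agent sketch using the File Roll Sink (the agent, source, channel, and sink names are arbitrary placeholders, and the netcat source is only for illustration; it is not a way to read from Oracle):

    # Illustrative Flume agent: some source -> memory channel -> file_roll sink
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Placeholder source; swap in whatever actually produces your events
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444

    # Buffer events in memory
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # File Roll Sink: write events to a local directory, rolling the file every 300 seconds
    a1.sinks.k1.type = file_roll
    a1.sinks.k1.sink.directory = /var/flume/output
    a1.sinks.k1.sink.rollInterval = 300

    # Wire source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1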

Related

Is it possible to configure ClickHouse data storage to be HDFS

Currently, ClickHouse stores data under the /var/lib/clickhouse path, and I've read that it doesn't have support for deep storage.
By the way, does it have any configs for an HDFS setup in the config.xml file?
Storing the ClickHouse data dir on HDFS is a really BAD idea ;)
because HDFS is not a POSIX-compatible filesystem, ClickHouse will be extremely slow with that deployment variant.
You can use https://github.com/jaykelin/clickhouse-hdfs-loader to load data from HDFS into ClickHouse, and in the near future (https://clickhouse.yandex/docs/en/roadmap/) ClickHouse may support the Parquet format for loading data.
ClickHouse has its own solutions for high availability and clustering; please read
https://clickhouse.yandex/docs/en/operations/table_engines/replication/ and https://clickhouse.yandex/docs/en/operations/table_engines/distributed/
@MajidHajibaba ClickHouse was initially designed for data locality; that means you have a local disk and data will be read from the local disk as fast as possible.
Three years later: S3 and HDFS as remote data storage with local caching is now a well-implemented approach.
See https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/#table_engine-mergetree-s3 for details,
and look at the cache_enabled and cache_path options,
and https://clickhouse.com/docs/en/operations/storing-data/#configuring-hdfs
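Roughly, the HDFS disk described on that storing-data page is declared in the server configuration along these lines (the endpoint, disk, and policy names below are placeholders; check the linked page for the exact syntax of your ClickHouse version):

    <clickhouse>
        <storage_configuration>
            <disks>
                <!-- a disk whose data lives in HDFS instead of on local storage -->
                <hdfs_disk>
                    <type>hdfs</type>
                    <endpoint>hdfs://hdfs1:9000/clickhouse/</endpoint>
                </hdfs_disk>
            </disks>
            <policies>
                <hdfs_policy>
                    <volumes>
                        <main>
                            <disk>hdfs_disk</disk>
                        </main>
                    </volumes>
                </hdfs_policy>
            </policies>
        </storage_configuration>
    </clickhouse>

A MergeTree table can then be pointed at that disk with SETTINGS storage_policy = 'hdfs_policy'.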
The HDFS engine provides integration with the Apache Hadoop ecosystem by allowing ClickHouse to manage data on HDFS. This engine is similar to the File and URL engines, but provides Hadoop-specific features.
https://clickhouse.yandex/docs/ru/operations/table_engines/hdfs/
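A sketch of the HDFS table engine described above (the URI, format, and table definition are placeholders taken loosely from the linked docs):

    -- Table whose data lives in HDFS rather than on the local disk
    CREATE TABLE hdfs_engine_table (name String, value UInt32)
        ENGINE = HDFS('hdfs://hdfs1:9000/other_storage', 'TSV');

    -- Read and write through it like an ordinary table
    INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2);
    SELECT * FROM hdfs_engine_table;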

Loading Batch Offline Data to DWH environment with Kafka as the "Entering door"

Some context to my question.
As you can see here:
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
There are two "doors" for loading data into HDFS:
Sqoop
Kafka
Using this topology as an example, what would be the best practice for loading batch offline data, which is hosted on an FTP server, into HDFS?
Let's also assume that no changes need to be performed on the file; we need to store it in HDFS in the same structure in which it is stored on the FTP server.
Thoughts?
Kafka isn't exactly configured to transfer "file sized" data by default. At least, not entire files in one message. Maybe break the lines apart, but then you need to reorder them and put them back together in HDFS.
In my experience, I've seen a few options from an FTP server.
Vanilla Hadoop, no extra software required
Use an NFS Gateway, WebHDFS, or HttpFS to copy files directly to HDFS as if it were another filesystem (a curl-based sketch follows this list)
Additional Software required
Your own code with an FTP and HDFS client connection
Spark Streaming w/ an FTP Connector and HDFS write output
Kafka & Kafka Connect with an FTP Connector source and HDFS Sink
A Flume agent running on the FTP Server with an HDFS sink
Apache NiFi with a GetFTP and PutHDFS processor
Streamsets Data Collector doing something similar to NiFi (don't know the terms for this one)
Regarding "we need to store it in HDFS in the same structure it is stored in the FTP server":
If these are small files, you're better off at least compressing them into a Hadoop-supported archive format before uploading to HDFS.
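As a concrete baseline for the "vanilla" end of that list, an FTP download can be piped straight into the hdfs CLI so the file never lands on a local disk (host, credentials, and paths below are placeholders):

    # Stream one file from the FTP server directly into HDFS
    curl -s ftp://user:password@ftp.example.com/exports/data.csv \
      | hdfs dfs -put - /landing/ftp/data.csv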

Flume and Sqoop limitations

I have a terabyte of data files on different machines. I want to collect them on a centralized machine for some processing. Is it advisable to use Flume?
The same amount of data is in an RDBMS, which I would like to put into HDFS. Is it advisable to use Sqoop to transfer terabytes of data? If not, what is the alternative?
Using Sqoop to transfer a few terabytes from an RDBMS to HDFS is a great idea, highly recommended. This is Sqoop's intended use case, and it does so reliably.
Flume is mostly intended for streaming data, so if the files all have events, and you get new files frequently, then Flume with Spooling Directory source can work.
Otherwise, "HDFS -put" is a good way to copy files to HDFS.

How to convert Cassandra to an HDFS file system for Shark/Hive queries

Is there any way to expose Cassandra data as HDFS and then perform Shark/Hive queries on that HDFS?
If yes, kindly provide some links on how to transform a Cassandra DB into HDFS.
You can write identity MapReduce code which takes its input from CFS (the Cassandra File System) and dumps the data to HDFS.
Once you have the data in HDFS, you can map a Hive table over it and run queries.
The typical way to access Cassandra data in Hive is to use the CqlStorageHandler.
For details, see Hive Support for Cassandra CQL3.
But if you have some reasons to access the data directly, take a look at Cassowary. It is a "Hive storage handler for Cassandra and Shark that reads the SSTables directly. This allows total control over the resources used to run ad-hoc queries so that the impact on real-time Cassandra performance is controlled."
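For orientation, the CqlStorageHandler approach boils down to a Hive external table definition roughly like the one below. The handler class and property names vary between the different Cassandra storage handler projects, so treat the keyspace, table, host, and port here purely as placeholders in a sketch:

    -- Hive external table backed by a Cassandra CQL3 table (class/property names vary by project)
    CREATE EXTERNAL TABLE users (user_id string, name string, email string)
    STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
    TBLPROPERTIES (
      "cassandra.ks.name" = "my_keyspace",
      "cassandra.cf.name" = "users",
      "cassandra.host"    = "127.0.0.1",
      "cassandra.port"    = "9160"
    );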
I think you are trying to run Hive/Shark against data already in Cassandra. If that is the case then you don't need to access it as HDFS but you need a hive-handler for using it against Cassandra.
For this you can use Tuplejump's project, CASH. The README provides instructions on how to build and use it. If you want to put your "big files" in Cassandra and query them, like you do from HDFS, you will need a filesystem that runs on Cassandra, such as DataStax's CFS (present in DSE) or Tuplejump's SnackFS (present in the Calliope Project Early Access Repo).
Disclaimer: I work for Tuplejump, Inc.
You can use Tuplejump Calliope Project.
https://github.com/tuplejump/calliope
Configure an external Cassandra table in Shark (like Hive) using the storage handler provided in the Tuplejump code.
All the best!
Three Cassandra Hive storage handlers:
https://github.com/2013Commons/hive-cassandra for Cassandra 2.0 and Hadoop 2
https://github.com/dvasilen/Hive-Cassandra/tree/HIVE-0.11.0-HADOOP-2.0.0-CASSANDRA-1.2.9
https://github.com/richardalow/cassowary reads directly from SSTables

Hadoop - streaming data from HTTP upload (PUT) into HDFS directly

I have the following application deployment:
web front-end taking data from client through HTTP/FTP
hadoop cluster
I need to store the client's data in HDFS. What is the best way of doing that? Is it possible to stream the data to HDFS directly, without first consuming all of the data from the client onto a local drive and then putting it into HDFS?
The feasible options which I can think of right now are:
HttpFS
WebHDFS
FTP client over HDFS
HDFS over WebDAV
Choosing the "best" one is totally upto you, based on your convenience and ease.
Personally, if you want low latency access to HDFS, your best bet is HBase. You can put and get values very easily since it is just a key value store. We are using the same thing in our application(s) and it works fabulously.
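To give a taste of how simple that key-value access is, here is an HBase shell sketch (table, column family, and row key are placeholders):

    # Create a table with one column family, write a value, read it back
    create 'client_data', 'cf'
    put 'client_data', 'row-001', 'cf:payload', 'bytes received from the front-end'
    get 'client_data', 'row-001'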
