File transfer to HDFS [closed] - hadoop

I need to bring files (zip, csv, xml, etc.) from a Windows share location to HDFS. What is the best approach? I have Kafka -> Flume -> HDFS in mind. Please suggest an efficient way.
I tried sending the files through a Kafka producer:
producer.send(new ProducerRecord<>(topicName, key, value));
I expect an efficient approach.

Kafka is not designed to send files, only individual messages of up to 1 MB by default.
You can install the NFS Gateway in Hadoop; then you should be able to copy directly from the Windows share to HDFS without any streaming technology, using only a scheduled script on the Windows machine or one run externally.
Or you can mount the Windows share on some Hadoop node and schedule a cron job if you need continuous file delivery - https://superuser.com/a/1439984/475508
Other solutions I've seen use tools like NiFi / StreamSets, which can be used to read/move files:
https://community.hortonworks.com/articles/26089/windows-share-nifi-hdfs-a-practical-guide.html
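If you go the mount-and-copy route, here is a minimal sketch of the copy step using the Hadoop FileSystem API that a scheduled job could run on an edge node; the mount point /mnt/share, the HDFS target /data/landing, and the class name are illustrative assumptions, not details from the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShareToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                // connects to the default HDFS

        Path src = new Path("file:///mnt/share/incoming");   // the mounted Windows share (local filesystem)
        Path dst = new Path("/data/landing");                // target directory in HDFS

        // Copy the share contents into HDFS; delSrc=false keeps the originals,
        // overwrite=true replaces files that already exist in the target.
        fs.copyFromLocalFile(false, true, src, dst);
        fs.close();
    }
}

The same thing can of course be done with a plain hdfs dfs -put from a cron job; the API version is just easier to extend with filtering or archiving logic.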

Related

Insist on using Kafka brokers on Windows [closed]

I know Java services love Linux/Unix hosts much more.
However, there are scenarios where it's not feasible to ask the customer to install a Linux cluster in their environment just to deploy Kafka, i.e. Windows 10 / Windows Server may be their only acceptable choice.
To describe our application briefly: it is not a service running constantly; we just want to introduce Kafka as a reliable communication broker to exchange data among quite a few distributed processes (on different machines in the network, probably including some machines in the cloud) when a certain operation starts and runs for a variable duration, say, from 1 hour up to 48 hours. Each run will create many temporary topics.
In such cases, is Kafka on Windows a production option?
BTW, I encountered quite a few known issues with Kafka on Windows, e.g. this one. For this specific issue, we simply assume there will be someone in the customer company, or some scheduled script, responsible for cleaning up the out-dated topics from the logs, say, topics from one month ago (a sketch of such a script is given after this question).
Are there any other unsolvable roadblocks to using Kafka on Windows?
Any thoughts or comments are appreciated.
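For the scheduled cleanup of out-dated temporary topics mentioned above, a minimal sketch using Kafka's AdminClient could look like this; the run- prefix, the date-based naming convention, and the bootstrap address are assumptions made for illustration, not part of the original question.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class StaleTopicCleanup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Cutoff date one month back, formatted as yyyyMMdd, e.g. "20240115".
            String cutoff = LocalDate.now().minusMonths(1)
                    .format(DateTimeFormatter.BASIC_ISO_DATE);

            // Assume temporary topics are named "run-yyyyMMdd-<something>".
            List<String> stale = admin.listTopics().names().get().stream()
                    .filter(t -> t.startsWith("run-") && t.length() >= 12)
                    .filter(t -> t.substring(4, 12).compareTo(cutoff) < 0)
                    .collect(Collectors.toList());

            if (!stale.isEmpty()) {
                admin.deleteTopics(stale).all().get();       // wait for the deletes to complete
            }
        }
    }
}

Run from the Windows Task Scheduler, something along these lines would keep the broker's log directories from accumulating topics from old runs.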
Is it an option? Yes. Is it a sensible option? … perhaps not.
As you've identified, there are several known issues with running Kafka on Windows. There are workarounds, etc., but do you really want to be dealing with those in production? It's one thing to run a hack to get your sandbox working; if you've got production workloads, it's quite another.
Here is one option if you really want to run Kafka on Windows - do so using WSL2.

How to use the 'GetFile' processor in Apache NiFi to get files from network-attached disks [closed]

I've just started using Apache NiFi. What I want to do is use the 'GetFile' processor to get some files from a remote network-attached machine onto my local disks, but I don't know how to configure this in the processor's settings, and I can't find any documentation on this question. Any help is appreciated, thanks.
There is extensive documentation on the Apache NiFi website, and within your running instance of NiFi you can right-click on any processor and select "Usage" to see this documentation inline.
To configure any processor, right-click and select "Configuration", then switch to the "Properties" tab. In GetFile, you need to provide the path to the directory you want to monitor as the Input Directory property, and the file name or pattern you want to retrieve as the File Filter. If this is a specific file known a priori, you can provide a literal name. If it is a pattern (e.g. all CSV files), you can use a pattern like [^\.].*\.csv. You should use the same input path as you would use to browse to the files on the host operating system.

What does a Spark cluster mean? [closed]

I have used Spark on my local machine using Python for analytical purposes.
Recently I've heard the term "Spark cluster" and I was wondering what exactly it is.
Is it just Spark running on some cluster of machines?
And how can it be used on a cluster without a Hadoop system? Is that possible? Can you please describe?
Apache Spark is a distributed computing system. While it can run on a single machine, it is meant to run on a cluster and to take advantage of the parallelism the cluster makes possible. Spark can use much of the Hadoop stack, such as the HDFS file system, but it does not require Hadoop; it ships with its own standalone cluster manager as well. Spark overlaps considerably with the Hadoop distributed computing stack: Hadoop centers around the MapReduce programming pattern, while Spark is more general with regard to program design and also has features to help increase performance.
For more information, see https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
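To make the distinction concrete, here is a minimal sketch in Java of the kind of program you may have written locally in Python; the only thing that changes on a cluster is the master. The master URLs and the input path are placeholders, and in practice the master is usually passed to spark-submit rather than hard-coded.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ClusterVsLocal {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cluster-vs-local")
                // "local[*]" runs everything inside this single JVM (what you did on your laptop).
                // Pointing at "spark://master-host:7077" instead would send the same job to a
                // standalone Spark cluster, which needs no Hadoop installation at all.
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read().option("header", "true").csv("/tmp/example.csv"); // placeholder input
        System.out.println("Row count: " + df.count());

        spark.stop();
    }
}

On a Hadoop cluster the master would be yarn; on a standalone Spark cluster it is a spark:// URL, which is how Spark runs on a cluster without Hadoop.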

Connection refused to quickstart.cloudera:8020 [closed]

I'm using the Cloudera QuickStart 5.5.0 VirtualBox image.
I'm trying to run this in the terminal. As you can see below, there is an exception. I've searched for a solution on the internet and found something:
1) Configuring the core-site.xml file. https://datashine.wordpress.com/2014/09/06/java-net-connectexception-connection-refused-for-more-details-see-httpwiki-apache-orghadoopconnectionrefused/
But I can only open this file read-only and haven't been able to change it. It seems I need to be root or the hdfs user (su hdfs -), but it asks me for a password which I don't know.
Network configuration is not your problem. You don't need to touch any configuration in the VM; you need to start the services. In this image, for example, the HDFS service on the left is disabled, and I get the same error on that last command.
You have to start Cloudera Manager and start ZooKeeper, YARN, and HDFS (in that order).
To open Cloudera Manager, go to http://quickstart.cloudera:7180 in Firefox on the VM.
Then start the mentioned services.
After you start the services, you can use HDFS commands.
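Once the services are running, a quick connectivity check against the NameNode from Java could look like the sketch below; it assumes the VM's default fs.defaultFS of hdfs://quickstart.cloudera:8020 and simply lists the HDFS root.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsConnectivityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Explicit URI so the check fails fast with "connection refused" if HDFS is still down.
        FileSystem fs = FileSystem.get(URI.create("hdfs://quickstart.cloudera:8020"), conf);

        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}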

How can you change the file being redirected to while script is still running? [closed]

Assume I have this script running continually:
node myprogram.js > logfile.log
If I want to make the output dump to a new log file every day without stopping or restarting "node myprogram.js", what should I do?
For example, every day, I want to see the logs saved as 2015-12-01.log, 2015-12-02.log, 2015-12-03.log, etc, and not have the logs be dumped into a single file.
I would use logrotate; it's the pre-installed utility most Linux distributions use for exactly what you are describing, and more. Typical default settings automatically compress log files of a certain age and eventually delete the oldest ones.
The utility runs automatically once a day and performs log rotations as per a configuration you define. Because the shell redirection keeps a single file handle open, you'll want logrotate's copytruncate option so rotation works without restarting node.
I would prefer this question on the Server Fault sister site. Nonetheless, there are many tools to use. Check out logrotate / rotatelogs.
