So I am following the book Machine Learning Hands-On for Developers by Jason Bell. I got quite far into it until I had to connect my Spring XD streams to Hadoop. I am running Spring XD 1.2.1 and Hadoop on port 9000 (I have tried both 1.2.1 and 2.6.0). In the tutorial we are supposed to take a Twitter stream and pipe it to a file in Hadoop, but when I created and deployed that stream, the file it created was not getting populated with tweets. So, to make things simpler, I am now just trying to get a stream connected to HDFS by creating this stream:
stream create --name ticktock --definition "time | hdfs" --deploy
which should pipe the current time to a file at /xd/ticktock/ticktock-0.txt.tmp. However, when I run the command
hadoop fs -cat /xd/ticktock/ticktock-0.txt.tmp
it produces nothing, which leads me to assume that no data is reaching it. I did place a tap on this stream and routed it to a local file, and that file was recording the times correctly, so I know the stream is doing its job and producing output; it's just not reaching Hadoop for some reason.
The file does get created in Hadoop, so it's not like Hadoop is completely ignoring the stream; there's just nothing inside the file it creates.
I did find someone who was having the same problem as me; they fixed it by changing their VM's networking mode (to NAT, I think), but I am not running inside a VM.
I have tried chmod-ing the /xd folder to 777,
I have made sure that I can ssh to my local machine without a password,
I have made sure that there is a DataNode running in my Hadoop cluster,
and I have made sure that cat works by placing a file I created into HDFS and running the cat command on it, both from within the Spring XD shell and from a regular terminal.
Unfortunately I am at a loss; could someone help me out here?
If you need any information about my Hadoop cluster or Spring XD setup, let me know; I am still a newbie with these technologies.
1. You can see the files in the HDFS sink once you destroy the stream.
2. Rollover: even while the stream is alive, once the stored data exceeds 1 GB (the default value), Spring XD rolls that 1 GB of content over into an HDFS file, creates a new tmp file, and stores the current ticktock values in it.
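For example, you can lower that threshold when creating the stream so files are closed (and become readable) sooner. The option names below (rollover, idleTimeout) are my recollection of the Spring XD hdfs sink options, so check them against the docs for your version:
stream create --name ticktock --definition "time | hdfs --rollover=1M --idleTimeout=10000" --deploy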
Thanks
S.Satish
Okay, I fixed it. I re-read the error message and saw that there were no DataNodes running again. I restarted Hadoop, this time on 2.6.0, ran that test stream for a couple of seconds, and then destroyed it. Sure enough, that did the trick. Thanks Satish Srinivasan, I had no idea the stream had to be destroyed before the file could be read.
Related
I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from a Kafka topic, and these messages are all really small.
Requirement: store these raw messages in HDFS (for replay capability) before doing further processing on them.
I was thinking of running Hadoop Archive (HAR) on them periodically. Is that something I can do through NiFi? The har command seems like a command-line tool rather than something I could execute through NiFi. I would love to know a solution that achieves my requirement without bringing down HDFS due to the small files.
Ginil
You can execute a command line inside NiFi with the ExecuteProcess processor:
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
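For reference, the archive itself would be built with the standard hadoop archive command, which is what an ExecuteProcess processor could invoke on a schedule (the paths here are just placeholders):
hadoop archive -archiveName raw-messages.har -p /user/nifi/raw /user/nifi/archive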
You can also take a look at Kafka Connect HDFS for putting Kafka records into HDFS.
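That route skips NiFi for landing the raw messages entirely: the sink connector reads the topic and writes batched files to HDFS, so you get fewer, larger files. A rough sketch of the connector properties (names as I recall them from the Confluent HDFS connector quickstart; the topic, URL and flush size are placeholders):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=raw-messages
hdfs.url=hdfs://namenode:9000
flush.size=10000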
I need to build a server that reads large CSV data files (hundreds of GB) from a directory, transforms some fields, and streams them to a Hadoop cluster.
These files are copied over from other servers at random times (hundreds of times per day), and it takes a long time to finish copying a file.
I need to:

1. Regularly check for new files to process (i.e., encrypt and stream)
2. Check whether a CSV has been completely copied over before kicking off encryption
3. Stream multiple files in parallel, but prevent two processes from streaming the same file
4. Mark files that were streamed successfully
5. Mark files that failed to stream and restart the streaming process
My question is: is there an open-source ETL tool that provides all five of these and works well with Hadoop/Spark Streaming? I assume this process is fairly standard, but I haven't found one yet.
Thank you.
Flume or Kafka will serve your purpose. Both are well integrated with Spark and Hadoop.
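If you go the Flume route, a minimal agent that watches a landing directory and writes into HDFS would look roughly like the sketch below (agent/component names and paths are placeholders; note that the spooling directory source expects a file to be complete and immutable by the time it appears, which ties into your requirement 2):
agent.sources = csvDir
agent.channels = fileCh
agent.sinks = hdfsSink
agent.sources.csvDir.type = spooldir
agent.sources.csvDir.spoolDir = /data/incoming
agent.sources.csvDir.channels = fileCh
agent.channels.fileCh.type = file
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:9000/data/raw/%Y-%m-%d
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfsSink.channel = fileCh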
Try taking a look at the great library https://github.com/twitter/scalding. Maybe it can point you in the right direction :)
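To make that pointer a bit more concrete, a Scalding job that reads a CSV from HDFS, transforms a field, and writes the result back would be roughly the sketch below. The field names and the reverse step are placeholders (stand-ins for your real schema and encryption), and the Csv/field syntax is from memory of the field-based API, so treat it as a starting point rather than finished code:
import com.twitter.scalding._

class TransformCsv(args: Args) extends Job(args) {
  Csv(args("input"), fields = ('id, 'name, 'payload))
    .read
    .map('payload -> 'payloadEnc) { p: String => p.reverse } // stand-in for real encryption
    .discard('payload)
    .write(Csv(args("output")))
}
You would then submit it with the usual com.twitter.scalding.Tool runner (hadoop jar your-assembly.jar com.twitter.scalding.Tool TransformCsv --hdfs --input ... --output ...). Scheduling, locking of in-flight files, and success/failure marking would still need to live outside the job, e.g. in Oozie or a small driver script.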
I have created a real-time application in which I write data streams from weblogs to HDFS using Flume, and then process that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. By contrast, as you have experienced, if you put a batch of data (that's your file: a batch, not a stream), it is ready to be read as soon as it is uploaded.
Anyway, not being an expert on Spark Streaming, it seems from the Overview section of the Spark Streaming Programming Guide that you are not using the intended deployment. From the picture shown there, the stream (in this case generated by Flume) should be sent directly to the Spark Streaming engine; the results can then be written to HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, then my suggestion is to create mini-batches of data in temporary HDFS folders: once a mini-batch is ready, start storing new data in a second mini-batch and pass the first one to Spark for analysis.
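To sketch that idea on the Spark side (the paths and batch interval are made up): Spark Streaming's textFileStream only picks up files that appear atomically in the directory it monitors, so have Flume (or a small mover job) write each mini-batch elsewhere and then rename the finished file into the monitored folder.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsMiniBatches {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flume-hdfs-minibatches")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Only files moved/renamed into this directory after the job starts are processed.
    val lines = ssc.textFileStream("hdfs://namenode:9000/flume/ready")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}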
HTH
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an Avro sink that points to your Spark Streaming instance.
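The relevant piece of the Flume agent configuration looks roughly like this (the channel is assumed to be defined elsewhere, and the hostname/port are placeholders that must match what you pass to FlumeUtils.createStream on the Spark side):
# excerpt from the agent's .conf
agent.sinks = sparkSink
agent.sinks.sparkSink.type = avro
agent.sinks.sparkSink.hostname = spark-receiver-host
agent.sinks.sparkSink.port = 9988
agent.sinks.sparkSink.channel = memCh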
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once at the end of the process:
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
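A rough continuation of that snippet, just to show the "write once to HDFS" part (the output path is a placeholder):
val events = flumeStream.map(e => new String(e.event.getBody.array(), "UTF-8"))
events.saveAsTextFiles("hdfs://namenode:9000/flume/out/batch")
streamingContext.start()
streamingContext.awaitTermination()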
I want to run some executables outside of Hadoop (but on the same cluster) using input files that are stored in HDFS.
Do these files need to be copied locally to the node, or is there a way to access HDFS outside of Hadoop?
Any other suggestions on how to do this are welcome. Unfortunately, my executables cannot be run within Hadoop.
Thanks!
There are a couple of typical ways:
You can access HDFS files through the HDFS Java API if you are writing your program in Java (or another JVM language). You are probably looking for the open call on FileSystem, which gives you a stream that behaves like a generic open file; there is a small sketch after this list.
You can stream your data with hadoop cat if your program takes input through stdin: hadoop fs -cat /path/to/file/part-r-* | myprogram.pl. You could also bridge to that command from inside your program with something like popen.
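A minimal sketch of the first option using the Hadoop FileSystem API (shown in Scala for brevity; the Java calls are identical, and the namenode address and path are placeholders):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:9000") // placeholder namenode address
    val fs = FileSystem.get(conf)
    val in = fs.open(new Path(args(0)))              // e.g. /path/to/file/part-r-00000
    try {
      Source.fromInputStream(in).getLines().foreach(println)
    } finally {
      in.close()
      fs.close()
    }
  }
}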
Also check out WebHDFS, which made it into the 1.0.0 release and will be in the 23.1 release as well. Since it is based on a REST API, any language can access it, and Hadoop does not need to be installed on the node that needs the HDFS files. It's also about as fast as the other options mentioned by orangeoctopus.
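For example, reading a file over WebHDFS is a plain HTTP call (host, port and path are placeholders; -L follows the redirect from the namenode to a datanode):
curl -L "http://namenode:50070/webhdfs/v1/path/to/file?op=OPEN"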
The best way is to install the "hadoop-0.20-native" package on the box where you are running your code.
The hadoop-0.20-native package can access the HDFS filesystem and can act as an HDFS proxy.
I had a similar issue and asked a related question. I needed to access HDFS/MapReduce services from outside the cluster. After I found a solution I posted an answer there for HDFS. The most painful part turned out to be user authentication, which in my case was solved in the simplest way possible (the complete code is in my question).
If you need to minimize dependencies and don't want to install Hadoop on clients, there is a nice Cloudera article on how to configure Maven to build a JAR for this. It worked 100% for my case.
The main difference between submitting a remote MapReduce job and plain HDFS access is just one configuration setting (check the mapred.job.tracker variable).
I need a system to analyze large log files. A friend pointed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes in my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit that file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly. It's really easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
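The td-agent configuration for that plugin is roughly the following (host, port and path are placeholders, and the directive names are from the article linked above, so verify them against your plugin version):
<match access.**>
  type webhdfs
  host namenode.example.com
  port 50070
  path /log/%Y%m%d/access.log.${hostname}
  flush_interval 10s
</match>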
Of course, you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd:
Fluentd: Data Import from Java Applications
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the MapReduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
I'd recommend using Flume to collect the log files from your servers into HDFS.