I have Spark code that runs on a YARN cluster and converts CSV to Parquet using the Databricks CSV library.
It works fine when the CSV source is on HDFS, but when the source is non-HDFS, which is usually the case, I run into this exception.
It should not happen, since the same code works when the CSV source is on HDFS.
Complete link to the issue:
https://issues.apache.org/jira/browse/SPARK-19344
As discussed in the comments:
When the files are only on the driver node and are not accessible by the worker nodes, the read will fail.
When reading an input file (e.g. with spark.read in Spark 2.0), the files should be accessible by all executor nodes (e.g. when the files are on HDFS, Cassandra, etc.).
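For illustration, here is a minimal PySpark sketch of the conversion under that constraint; the paths here are hypothetical, and with Spark 2.x the built-in CSV reader takes the place of the Databricks spark-csv package:

from pyspark.sql import SparkSession

# The input path must be on storage every executor can reach (HDFS, S3, ...),
# not only on the driver's local disk.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
df = spark.read.option("header", "true").csv("hdfs:///data/input/source.csv")
df.write.mode("overwrite").parquet("hdfs:///data/output/source_parquet")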
Related
I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from a Kafka topic; these messages are all really small.
Requirement: store these raw messages in HDFS (for replay capability) before doing further processing on them.
I was thinking of running Hadoop Archive (HAR) on them periodically. Is that something I can do through NiFi? The har command seems like a command-line thing rather than something I could execute through NiFi. I would love to know a solution that meets my requirement without bringing down HDFS due to the small files.
Ginil
You can execute a command line inside NiFi with the ExecuteProcess processor:
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
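As a rough sketch of what that processor would invoke, here is the periodic archiving command expressed in Python for illustration; the paths and archive name are hypothetical:

import subprocess

# Pack the small files under /data/raw/2018-01-01 into a single HAR archive.
subprocess.run([
    "hadoop", "archive",
    "-archiveName", "raw-2018-01-01.har",
    "-p", "/data/raw",        # parent directory of the sources
    "2018-01-01",             # source path, relative to the parent
    "/data/archived",         # destination directory for the .har
], check=True)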
You can also take a look at Kafka Connect HDFS for putting Kafka records into HDFS.
We are generating Parquet files using Apache NiFi in a non-Hadoop environment, and we need to run analytics on those Parquet files.
Apart from Apache frameworks like Hive, Spark, etc., is there any open source BI or reporting tool that can read Parquet files, or any other workaround for this? In our environment we have the Jasper reporting tool.
Any suggestions are appreciated. Thanks.
You can easily process Parquet files in Python:
To read/write Parquet files, you can use pyarrow or fastparquet.
To analyze the data, you can use Pandas (which can even read/write Parquet itself, using one of the implementations mentioned in the previous item behind the scenes); a short sketch follows after this answer.
To get a nice interactive data exploration environment, you can use Jupyter Notebook.
All of these work in a non-Hadoop environment.
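Here is a small sketch of reading and summarizing a Parquet file without any Hadoop dependency; the file name and column name are hypothetical:

import pandas as pd  # uses pyarrow (or fastparquet) under the hood

df = pd.read_parquet("events.parquet", engine="pyarrow")
print(df.head())
print(df.groupby("event_type").size())  # a simple aggregation for reporting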
I have created a real-time application in which I am writing data streams to HDFS from weblogs using Flume, and then processing that data using Spark Streaming. But while Flume is writing and creating new files in HDFS, Spark Streaming is unable to process those files. If I put the files into the HDFS directory using the put command, Spark Streaming is able to read and process them. Any help regarding this would be great.
You have detected the problem yourself: while the stream of data continues, the HDFS file is "locked" and cannot be read by any other process. On the contrary, as you have experienced, if you put a batch of data (that's your file, a batch, not a stream), it is ready to be read as soon as it is uploaded.
Anyway, not being an expert on Spark Streaming, it seems from the Spark Streaming Programming Guide, Overview section, that you are not using the right deployment. From the picture shown there, the stream (in this case generated by Flume) should be sent directly to the Spark Streaming engine; the results are then put in HDFS.
Nevertheless, if you want to keep your deployment, i.e. Flume -> HDFS -> Spark, my suggestion is to create mini-batches of data in temporary HDFS folders and, once a mini-batch is ready, start storing new data in a second mini-batch while passing the first one to Spark for analysis.
HTH
In addition to frb's answer, which is correct: Spark Streaming with Flume acts as an Avro RPC server, so you'll need to configure an AvroSink that points to your Spark Streaming instance.
With Spark 2 you can now connect your Spark Streaming job directly to Flume (see the official docs) and then write to HDFS once at the end of the process.
import org.apache.spark.streaming.flume._
val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
Using Pig or Hadoop streaming, has anyone loaded and uncompressed a zipped file? The original CSV file was compressed using pkzip.
Not sure if this helps, because it's mainly focused on using MapReduce in Java, but there is a ZipFileInputFormat available for Hadoop. Its use via the Java API is described here:
http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
The main part of this is the ZipFileRecordReader, which uses Java's ZipInputStream to process each ZipEntry. The Hadoop reader is probably not going to work for you out of the box, because it passes the file path of each ZipEntry as the key and the ZipEntry contents as the value.
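For comparison, here is a small Python sketch of the same idea (not the Hadoop input format itself): walk a pkzip archive and emit (entry name, contents) pairs, which you could adapt as a pre-processing step before loading the data into HDFS. The archive name is hypothetical:

import zipfile

with zipfile.ZipFile("logs.zip") as archive:
    for entry in archive.infolist():
        with archive.open(entry) as f:
            data = f.read()                  # the CSV bytes for this entry
        print(entry.filename, len(data))     # key = entry name, value = contents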
I need a system to analyze large log files. A friend directed me to Hadoop the other day, and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit the temp file once it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly. It's really easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can also import data directly from your applications. Here's a Java example that posts logs to Fluentd:
Fluentd: Data Import from Java Applications
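The linked example is in Java; as a rough sketch of the same idea using the fluent-logger Python package (the tag, host, port, and record fields here are assumptions for illustration):

from fluent import sender

# Post one event to a local Fluentd agent; the out_webhdfs plugin on the agent
# can then stream it into HDFS.
logger = sender.FluentSender("app.access", host="localhost", port=24224)
logger.emit("follow", {"from": "userA", "to": "userB"})
logger.close()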
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database, etc., so they can be queried online.
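A hedged sketch of that periodic pattern; the streaming jar path, mapper, and reducer names are assumptions for illustration:

import subprocess
from datetime import datetime

# Each run writes to a fresh, timestamped output folder that a later job can pick up.
out_dir = "/logs/processed_logs_{}".format(datetime.now().strftime("%Y%m%d%H%M"))

subprocess.run([
    "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-streaming.jar",
    "-input", "/logs/raw",
    "-output", out_dir,
    "-mapper", "parse_logs.py",
    "-reducer", "aggregate.py",
    "-file", "parse_logs.py",     # ship the scripts to the task nodes
    "-file", "aggregate.py",
], check=True)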
I'd recommend using Flume to collect the log files from your servers into HDFS.