I would like to convert Avro files to Parquet in NiFi. I know it's possible to convert to ORC via the ConvertAvroToORC processor, but I haven't found a solution for converting to Parquet.
I'm converting JSON to Avro via a ConvertRecord processor (JsonTreeReader and AvroRecordSetWriter). After that I would like to convert the Avro payload to Parquet before putting it into an S3 bucket. I don't want to store it in HDFS, so the PutParquet processor doesn't seem applicable.
I would need a processor such as: ConvertAvroToParquet
@Martin, you can use the very handy ConvertAvroToParquet processor, which I recently contributed to NiFi. It should be available in the latest version.
The purpose of this processor is exactly what you are looking for. For more details on this processor and why it was created, see NIFI-5706.
Code Link.
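A minimal flow sketch for the setup described in the question; the ingest processor at the front is an assumption, and PutS3Object is NiFi's standard processor for writing to S3:
GetFile (or any ingest processor)   -- brings in the JSON source
-> ConvertRecord                    -- JsonTreeReader as reader, AvroRecordSetWriter as writer
-> ConvertAvroToParquet             -- rewrites the Avro flow file content as Parquet
-> PutS3Object                      -- uploads the Parquet flow file to the S3 bucket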
Actually, it is possible to use the PutParquet processor.
The following description is from a working flow in NiFi 1.8.
Place the following libs into a folder, e.g. /home/nifi/s3libs/:
aws-java-sdk-1.11.455.jar (+ Third-party libs)
hadoop-aws-3.0.0.jar
Create an XML file, e.g. /home/nifi/s3conf/core-site.xml. It might need some additional tweaking; use the right endpoint for your zone.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>s3a://BUCKET_NAME</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>ACCESS-KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>SECRET-KEY</value>
</property>
<property>
<name>fs.AbstractFileSystem.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3A</value>
</property>
<property>
<name>fs.s3a.multipart.size</name>
<value>104857600</value>
<description>The configuration parser could not handle 100M, so the size is given in bytes. Maybe not needed after further testing.</description>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-central-1.amazonaws.com</value>
<description>Frankfurt</description>
</property>
<property>
<name>fs.s3a.fast.upload.active.blocks</name>
<value>4</value>
<description>
Maximum number of blocks a single output stream can have
active (uploading, or queued to the central FileSystem
instance's pool of queued operations).
This stops a single stream overloading the shared thread pool.
</description>
</property>
<property>
<name>fs.s3a.threads.max</name>
<value>10</value>
<description>The total number of threads available in the filesystem for data
uploads *or any other queued filesystem operation*.</description>
</property>
<property>
<name>fs.s3a.max.total.tasks</name>
<value>5</value>
<description>The number of operations which can be queued for execution</description>
</property>
<property>
<name>fs.s3a.threads.keepalivetime</name>
<value>60</value>
<description>Number of seconds a thread can be idle before being terminated.</description>
</property>
<property>
<name>fs.s3a.connection.maximum</name>
<value>15</value>
</property>
</configuration>
Usage
Create a PutParquet processor. Under Properties, set:
Hadoop Configuration Resources: /home/nifi/s3conf/core-site.xml
Additional Classpath Resources: /home/nifi/s3libs
Directory: s3a://BUCKET_NAME/folder/ (Expression Language available)
Compression Type: tested with NONE and SNAPPY
Remove CRC: true
The flow file must contain a filename attribute - no fancy characters or slashes.
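If the upstream flow does not already provide a usable filename, an UpdateAttribute processor in front of PutParquet can set one. A minimal sketch using standard NiFi Expression Language (the naming pattern is just an assumption for illustration):
UpdateAttribute property:
filename : ${UUID()}.parquet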
Related
I have just started using StreamSets, and I'm trying to load a text file from the local disk to HDFS.
Please note: I'm using Cloudera Manager. Here is a view of core-site.xml:
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
<final>true</final>
</property>
<property>
<name>hadoop.proxyuser.sdc.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.sdc.groups</name>
<value>*</value>
</property>
</configuration>
The local file is a text file stored in /home/cloudera/Desktop.
Here is a view of the source (local) configuration in StreamSets:
Here is a view of the Hadoop FS configuration in StreamSets:
It was validated successfully!
After I played the pipeline, I expected to find the file in the HDFS directory that I specified, specifically at /user/cloudera.
But when I run it, the file is not loaded.
I'm sure I missed something, and I couldn't find an answer for this.
Could you please help?
Thanks,
You need to play (run) the pipeline, not just validate it; validation only checks the configuration, it does not move any data.
I am trying to integrate my Elasticsearch 2.2.0 with Hadoop HDFS. In my environment I have 1 master node and 1 data node, and Elasticsearch is installed on the master node.
But while integrating it with HDFS, my ResourceManager application jobs get stuck in the ACCEPTED state.
I found a link suggesting I change my yarn-site.xml settings:
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2200</value>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>500</value>
</property>
I have done this as well, but it is not giving me the expected output.
Configuration:
My core-site.xml:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
My mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>
My hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
Please help me understand how I can get my RM jobs into the RUNNING state, so that I can use my Elasticsearch data on HDFS.
If the screenshot is correct, you have 0 NodeManagers, so the application can't start running. You need to start at least 1 NodeManager so that the application master, and later the tasks, can be started.
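A NodeManager will only register with the ResourceManager if its yarn-site.xml points at the right host. A minimal sketch, assuming the ResourceManager runs on a host named master (the hostname is an assumption for illustration):
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
<description>Host the NodeManagers register with; replace with the actual ResourceManager host.</description>
</property>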
Normally I do the following to use LZO:
Use the lzop command to compress the data file on the local disk.
Put it into HDFS.
Use the distributed LZO indexer to generate the .index files.
I'm wondering: is there a way to compress and index the raw files on HDFS in place, at the same time?
Yep, you can:
In your core-site.xml, on both the client and the server, append com.hadoop.compression.lzo.LzopCodec to the comma-separated list of codecs:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
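Depending on how hadoop-lzo was installed, some distributions also expect the class of the plain LZO codec to be declared in core-site.xml; a hedged sketch of that companion property (it may not be required in your setup):
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>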
Edit the mapred-site.xml file on the JobTracker host machine:
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
Using a Hadoop multi-node setup (1 master, 1 slave).
After starting start-mapred.sh on the master, I found the error below in the TaskTracker logs (slave node):
org.apache.hadoop.mapred.TaskTracker: Failed to get system directory
Can someone help me understand what can be done to avoid this error?
I am using
Hadoop 1.2.0
jetty-6.1.26
java version "1.6.0_23"
mapred-site.xml file
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/workspace</value>
</property>
</configuration>
It seems that you just added hadoop.tmp.dir and started the job. You need to restart the Hadoop daemons after adding any property to the configuration files. You mentioned in your comment that you added this property at a later stage, which means all the data and metadata, along with other temporary files, is still in the /tmp directory. Copy everything from there into your /home/hduser/workspace directory, restart Hadoop, and re-run the job.
Do let me know the result. Thank you.
If it is your Windows PC and you are using Cygwin to run Hadoop, then the TaskTracker will not work.
We have installed a Hadoop cluster and want to use HBase on top of it. My hbase-site.xml is below:
<property>
<name>hbase.rootdir</name>
<value>hdfs://ali:54310/hbase</value>
<description>The directory shared by RegionServers.
</description>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
<description>
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>ali,reg_server1</value>
<description>Comma-separated list of servers in the ZooKeeper quorum.
</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>
</description>
</property>
And I have 2 RegionServers, ali and reg_server1. When I open the page at http://ali:60010 I see that server reg_server1 has 0 regions, but server ali has n > 0 regions. I put some data into HBase, but server reg_server1 still has 0 regions. Does this mean that this node is not participating in the cluster? How can I resolve it?
Thanks
No, you are OK as long as you see both RegionServers in the master's web UI. When you write to an HBase table, the write goes to one region (a region always lives on exactly one RegionServer, in your case ali). Once you write enough data to make the region exceed the configured maximum file size, the region will be split and distributed across the two RegionServers.
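If you want the split (and therefore distribution across both RegionServers) to happen sooner while testing, the split threshold can be lowered in hbase-site.xml. A hedged sketch, with the value chosen purely for illustration:
<property>
<name>hbase.hregion.max.filesize</name>
<value>268435456</value>
<description>Split a region once its store files exceed this size (256 MB here, for illustration only).</description>
</property>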