When I run my jobs I see:
parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
It is set to 5 by default, but what is it, and how can I use it to get better performance?
Yes, it defaults to 5.
The configuration parameter's name is parquet.metadata.read.parallelism. It only controls how many threads are used to read the metadata of Parquet files.
I believe it does not affect performance much, as it only relates to reading the metadata, not the data itself.
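If you still want to experiment with it, here is a minimal sketch of how the value could be raised in a job driver, assuming a standard MapReduce setup; the class name and the value 16 are just illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParquetMetaParallelismSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Number of threads used to read Parquet footers/metadata; default is 5.
        conf.setInt("parquet.metadata.read.parallelism", 16);
        Job job = Job.getInstance(conf, "parquet-read-job");
        // ... configure input format, paths, mapper, etc. as usual ...
    }
}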
Related
I am using Hadoop 1.0.3. Can the input split/block size be changed (increased/decreased) at run time based on some constraints? Is there a class to override to accomplish this, like FileSplit/TextInputFormat? Can we have variable-size blocks in HDFS, depending on a logical constraint, in one job?
You're not limited to TextInputFormat... That's entirely configurable based on the data source you are reading. Most examples are line-delimited plain text, but that obviously doesn't work for XML, for example.
No, block boundaries can't change during runtime as your data should already be on disk, and ready to read.
But the InputSplit depends on the InputFormat for the given job, which should remain consistent throughout a particular job; the Configuration object in the code, on the other hand, is basically a HashMap, which can be changed while running, sure.
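To make the first point concrete, swapping in a different InputFormat is just a driver-side setting; a rough sketch (the class name is a placeholder, and SequenceFileInputFormat stands in for whatever format matches your data):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class CustomInputFormatDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "custom-input-format"); // Hadoop 1.x style constructor
        // Pick the InputFormat that matches the data source; it stays fixed for the whole job.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // ... mapper, reducer and output settings as usual ...
    }
}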
If you want to change the block size only for a particular run or application, you can do so by passing "-D dfs.block.size=134217728". That lets you change the block size for your application instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728
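One caveat worth hedging: the -D flag is only honoured if the driver runs through ToolRunner/GenericOptionsParser; otherwise the same value can be set in code. A sketch under that assumption (the class name is made up):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class BlockSizeDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // already contains any -D overrides from the command line
        // Equivalent to -D dfs.block.size=134217728: files written by this job use 128 MB blocks.
        conf.setLong("dfs.block.size", 134217728L);
        // ... build and submit the MapReduce job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new BlockSizeDriver(), args));
    }
}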
Right now we are using a mapreduce job to convert data and store the result in the Parquet format.
There is a summary file (_metadata) generated as well. But the problem is that it is too big (over 5G). Is there any way to reduce the size?
Credits to Alex Levenson and Ryan Blue:
Alex Levenson:
You can push the reading of the summary file to the mappers instead of reading it on the submitter node:
ParquetInputFormat.setTaskSideMetaData(conf, true);
(Ryan Blue: This is the default from 1.6.0 forward)
or setting "parquet.task.side.metadata" to true in your configuration. We
had a similar issue, by default the client reads the summary file on the
submitter node which takes a lot of time and memory. This flag fixes the
issue for us by instead reading each individual file's metadata from the
file footer in the mappers (each mapper reads only the metadata it needs).
Another option, which is something we've been talking about in the past, is to disable creating this metadata file at all, as we've seen creating it can be expensive too, and if you use the task-side metadata approach, it's never used.
(Ryan Blue: There's an option to suppress the files, which I recommend. Now that file metadata is handled on the tasks, there's not much need for the summary files.)
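Putting both suggestions together, a hedged sketch of the driver-side changes; setTaskSideMetaData comes from the answer above, while "parquet.enable.summary-metadata" is the key I believe pre-Apache parquet-mr releases use to suppress the summary file, so verify it against your version:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetInputFormat; // org.apache.parquet.hadoop in newer releases

public class ParquetMetadataSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read each file's footer in the tasks instead of on the submitter node
        // (the default from parquet 1.6.0 onward).
        ParquetInputFormat.setTaskSideMetaData(conf, true);
        // Assumed key name: skip writing the _metadata summary file on output.
        conf.setBoolean("parquet.enable.summary-metadata", false);
        Job job = Job.getInstance(conf, "parquet-conversion");
        // ... rest of the conversion job setup ...
    }
}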
I have gone through a lot of blogs on Stack Overflow and also the Apache wiki to understand how the number of mappers is set in Hadoop. I also went through this post: hadoop - how total mappers are determined.
Some say it's based on the InputFormat, and some posts say it's based on the number of blocks the input file is split into.
Somehow I am confused by the default setting.
When I run a wordcount example I see the number of mappers is as low as 2. What is really happening in that setting? Also, in this example program (http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/QuasiMonteCarlo.java), the number of mappers is set based on user input. How can one do this setting manually?
I would really appreciate some help in understanding how mappers work.
Thanks in advance
Use the Hadoop configuration properties mapred.min.split.size and mapred.max.split.size to guide Hadoop toward the split size you want. This won't always work, particularly when your data is in a compression format that is not splittable (e.g. gz; bzip2, by contrast, is splittable).
So if you want more mappers, use a smaller split size. Simple!
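As an illustration, a small driver sketch that caps the split size at 32 MB so each 128 MB block yields roughly four mappers; the property names are the old mapred.* ones from above (newer Hadoop spells them mapreduce.input.fileinputformat.split.minsize/maxsize), and the class name is made up:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Smaller max split size -> more splits -> more mappers (for splittable input).
        conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);
        conf.setLong("mapred.min.split.size", 1L);
        Job job = Job.getInstance(conf, "more-mappers");
        // ... input paths, mapper, reducer, output as usual ...
    }
}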
(Updated as requested) Now, this won't work well for many small files; in particular, you'll end up with more mappers than you want. For that situation use CombineFileInputFormat... for Scalding, this SO answer explains how: Create Scalding Source like TextLine that combines multiple files into single mappers
I'm parsing data from one table and writing it back to another one. The input is characteristics, written as text. The output is a boolean field that needs to be updated. For example, a characteristic would be "has 4 wheel drive" and I want to set a boolean has_4weeldrive to true.
I'm going through all the characteristics that belong to a car and setting the field to true if the characteristic is found, else to null. The filter after tmap_1 keeps the rows for which the attribute is true, and then updates them in a table. I want to do that for all the different characteristics (around 10).
If I do it for one characteristic, the job runs fine; as soon as I have more than one, it only loads 1 record and waits indefinitely. I can of course make 10 jobs and it will run, but then I have to touch all the characteristics 10 times, which doesn't feel right. Is this a locking issue? Is there a better way to do this? The target and source db is PostgreSQL, if that makes a difference.
Shared connections could cause problems like this.
Also make sure you're committing after each update. Talend uses 1 thread for execution (except in the enterprise version), so multiple shared outputs could cause problems.
Setting the commit to 1 should eliminate the problem.
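If it helps to see the locking idea outside of Talend, here is a plain JDBC illustration of why committing after each update avoids the indefinite wait when two connections write to the same table; everything here (connection string, table name, id values) is made up for the example:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CommitPerRowSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative connection details only.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/cars", "user", "password");
        conn.setAutoCommit(false);
        PreparedStatement update = conn.prepareStatement(
                "UPDATE car SET has_4weeldrive = true WHERE id = ?");
        for (long id : new long[] {1L, 2L, 3L}) {
            update.setLong(1, id);
            update.executeUpdate();
            // Commit each row so the row locks are released and a second
            // connection updating the same table is not left waiting.
            conn.commit();
        }
        update.close();
        conn.close();
    }
}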
Sorry guys, just a simple question, but I cannot find the exact question on Google.
The question is about what dfs.replication means. If I make one file named filmdata.txt in HDFS and set dfs.replication=1, is there just that one file (one filmdata.txt), or will Hadoop create another replica besides the main file (filmdata.txt)?
Shortly said: if dfs.replication=1 is set, is there one filmdata.txt in total, or two?
Thanks in Advance
The total number of copies of the file in the file system will be whatever is specified by the dfs.replication factor. So, if you set dfs.replication=1, there will be only one copy of the file in the file system.
Check the Apache Documentation for the other configuration parameters.
To ensure high availability of data, Hadoop replicates the data.
When we store files in HDFS, the Hadoop framework splits each file into a set of blocks (64 MB or 128 MB), and these blocks are then replicated across the cluster nodes. The configuration property dfs.replication specifies how many replicas are required.
The default value for dfs.replication is 3, but it is configurable depending on your cluster setup.
Hope this helps.
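If you only want a different replication factor for specific files rather than cluster-wide, a small sketch of doing it from Java; the path is a placeholder and filmdata.txt is taken from the question:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created by this client will be written with a single replica.
        conf.setInt("dfs.replication", 1);
        FileSystem fs = FileSystem.get(conf);
        // Or change the replication factor of an existing file after the fact.
        fs.setReplication(new Path("/data/filmdata.txt"), (short) 1);
        fs.close();
    }
}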
The link provided by Praveen is now broken.
Here is the updated link describing the parameter dfs.replication.
Refer to Hadoop Cluster Setup for more information on configuration parameters.
You may want to note that files can span multiple blocks, and each block will be replicated the number of times specified in dfs.replication (default value is 3). The size of those blocks is specified by the parameter dfs.block.size.
In the HDFS framework we use commodity machines to store the data. These commodity machines are not high-end machines like servers with lots of RAM, so there is a chance of losing a data node (d1, d2, d3) or a block (b1, b2, b3). To handle that, HDFS replicates each block of data (64 MB or 128 MB) three times by default and stores each replica on a separate data node (d1, d2, d3). Now suppose block b1 gets corrupted on data node d1: a copy of b1 is still available on data nodes d2 and d3, so the client can ask d2 to process block b1's data and return the result, and likewise, if d2 fails, the client can ask d3 to process block b1's data. This is what dfs.replication means.
Hope you got some clarity.