I am trying to export some data to MySQL using Sqoop. Though I specify --num-mappers 12, it allocates only one mapper to the job, and it is extremely slow. How do I make sure the Sqoop job gets more than one mapper?
The number of mappers is determined by these criteria:
the size of the file in HDFS
the format of the file and whether it is splittable
whether the compression codec is splittable, if the file is compressed
Run hadoop fs -du -s -h <path_of_the_file> to get the size of the file. Also check the minimum and maximum split size parameters (mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize).
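As a rough illustration (the paths, connection string and credentials below are placeholders, and --num-mappers is only an upper bound - the actual mapper count still depends on the input size and splittability):
# Check how big the export directory really is
hadoop fs -du -s -h /user/hive/warehouse/mydb.db/mytable
# Re-run the export asking for 12 mappers
sqoop export \
--connect jdbc:mysql://dbhost/mydb \
--username myuser -P \
--table mytable \
--export-dir /user/hive/warehouse/mydb.db/mytable \
--num-mappers 12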
I tried processing (word-labeling of sentences) a large dataset (about 150 GB) using Tez, but the problem is that it took too long (a week or more), so
I tried to specify the number of mappers.
Though I set mapred.map.tasks=2000,
the number of mappers stays at about 150,
so I can't do what I want to do.
I specify the map value in the Oozie workflow file and use Tez.
How can I specify the number of mappers?
Ultimately I want to speed up the process; it is OK not to use Tez.
In addition, I would like to count the labeled sentences with a reducer, and that also takes a lot of time.
I would also like to know how to adjust the memory size used by each mapper and reducer process.
To manually set the number of mappers in a Hive query when Tez is the execution engine, the configuration tez.grouping.split-count can be used...
... set tez.grouping.split-count=4 will create 4 mappers
https://community.pivotal.io/s/article/How-to-manually-set-the-number-of-mappers-in-a-TEZ-Hive-job
However, overall, you should optimize the storage format and the Hive partitions before you even begin tuning the Tez settings. Do not try to process data STORED AS TEXTFILE in Hive; convert it to ORC or Parquet first.
If Tez isn't working out for you, you can always try Spark. Besides, labelling sentences is probably a Spark MLlib workflow you can find somewhere.
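As a hedged sketch of how those settings could be passed to a Hive-on-Tez job (the script name and the values are placeholders to tune for your cluster, not recommendations):
# hive.tez.container.size / tez.task.resource.memory.mb control the memory (in MB)
# available to each Tez task; for plain MapReduce jobs the equivalents are
# mapreduce.map.memory.mb and mapreduce.reduce.memory.mb
hive --hiveconf hive.execution.engine=tez \
--hiveconf tez.grouping.split-count=2000 \
--hiveconf hive.tez.container.size=4096 \
--hiveconf tez.task.resource.memory.mb=4096 \
-f label_sentences.hql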
I am looking for default compression in HDFS. I saw this, but I don't want my files to have gzip-like extensions (in fact, they should be accessible as if they weren't compressed). What I am looking for is exactly like the "Compress contents to save disk space" option on Windows, which compresses files internally but lets them be accessed like ordinary files. Any ideas would be helpful.
Thanks
This doesn't exist in standard HDFS implementations; you have to manage the compression yourself. However, MapR, a proprietary Hadoop distribution, does offer this, if solving the problem is important enough for you.
After using Hadoop for a while this doesn't really bother me anymore. Pig, MapReduce and the like handle the compression transparently enough for me. I know that's not a real answer, but I couldn't tell from your question whether you are simply annoyed or this is causing you a real problem. Getting used to adding | gunzip to everything didn't take long. For example:
hadoop fs -cat /my/file.gz | gunzip
cat file.txt | gzip | hadoop fs -put - /my/file.txt.gz
When you're using compressed files you need to think about making them splittable - i.e. whether Hadoop can split the file when running a MapReduce job (if the file is not splittable it will only be read by a single map task).
The usual way around this is to use a container format, e.g. SequenceFile, ORC file etc., where you can enable compression. If you are using simple text files (CSV etc.), there is a splittable-LZO project by Twitter, but I haven't used it personally.
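To make the container-format suggestion concrete, here is a rough sketch assuming a plain-text staging table called logs_text already exists (table names and the SNAPPY codec are illustrative only); ORC compresses its data internally, so the resulting files stay splittable:
hive -e "
CREATE TABLE logs_orc (line STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
INSERT INTO TABLE logs_orc SELECT line FROM logs_text;
"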
The standard way to store compressed files in HDFS is to pass a compression argument while writing the file into HDFS. This is available in the mapper libraries, Sqoop, Flume, Hive, the HBase catalog and so on. I am quoting some examples here from Hadoop. You don't need to worry about compressing the file locally for efficiency; it is best to let the HDFS-side file format option do this work. This type of compression integrates smoothly with Hadoop's mapper processing.
Job written through Mapper Library
This applies when creating the writer in your mapper program; here is the definition. You write your own mapper and reducer to write the file into HDFS, with your codec passed as an argument to the createWriter method.
createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec)
Sqoop Import
The option below passes the default compression argument for the file import into HDFS:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress
With Sqoop you can also specify a particular codec with the --compression-codec option:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Hive import
In the example below, you set your desired compression option before loading the file into Hive. This again is a property you set while reading from your local file.
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY; --this is the default actually
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS PARQUET;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw;
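One caveat with the example above: LOAD DATA only moves the file into the table's directory, so a common variant (table names here are illustrative) is to load the raw text into a TEXTFILE staging table first and then INSERT it into the Parquet table, which is when the Snappy-compressed Parquet files are actually written:
hive -e "
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY;
CREATE TABLE raw_staging (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw_staging;
CREATE TABLE raw_parquet (line STRING) STORED AS PARQUET;
INSERT INTO TABLE raw_parquet SELECT line FROM raw_staging;
"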
I have not mentioned every method of compressing data while importing it into HDFS.
The HDFS CLI (e.g. hdfs dfs -copyFromLocal) doesn't provide any direct way to compress; that is my understanding from working with the Hadoop CLI.
I am using Hadoop to parse a large number (about 1 million) of text files, each containing a lot of data.
First I uploaded all my text files into HDFS using Eclipse. But when uploading the files, my MapReduce operation resulted in a huge number of files in the directory C:\tmp\hadoop-admin\dfs\data.
So, is there any mechanism by which I can shrink the size of my HDFS (basically the above-mentioned drive)?
To shrink your HDFS size you can set a greater value (in bytes) for the following hdfs-site.xml property (the default is 0, i.e. no space reserved for non-HDFS use):
dfs.datanode.du.reserved=0
You can also lower the amount of intermediate data generated by map outputs by enabling map output compression.
mapreduce.map.output.compress=true
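If the job is launched from the command line and uses ToolRunner/GenericOptionsParser, the same setting can be passed per job like this (jar name, class and paths are placeholders):
hadoop jar my-job.jar com.example.MyJob \
-D mapreduce.map.output.compress=true \
-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
/input/path /output/path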
hope that helps.
I am new to PIG and HDFS. Here is what I am trying to do.
I have a lot of flat-text, LZO-compressed, ill-formatted server log files - about 2 GB each, generated from around 400 servers daily.
I am trying to take advantage of MapReduce to format and clean up the data in HDFS using my Java formatter and then load the output into Hive.
My problem is that my Pig script spawns only one mapper, which takes around 15 minutes to read the file sequentially. This is not practical for the amount of data I have to load into Hive daily.
Here is my pig script.
SET default_parallel 100;
SET output.compression.enabled true;
SET output.compression.codec com.hadoop.compression.lzo.LzopCodec;
SET mapred.min.split.size 256000;
SET mapred.max.split.size 256000;
SET pig.noSplitCombination true;
SET mapred.max.jobs.per.node 1;
register file:/apps/pig/pacudf.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING PigStorage() as (field1);
pac = foreach raw1 generate pacudf.filegenerator(field1);
store pac into '/data/bazooka/';
It looks like the mapred.min.split.size setting isn't working. I can see only one mapper being started, and it works on the whole 2 GB file on a single server of the cluster. Since we have a 100-node cluster, I was wondering if I could make use of more servers in parallel by spawning more mappers.
Thanks in advance
Compression support in PigStorage does not provide splitting ability. For splittable LZO compression support with Pig, you need the elephant-bird library from Twitter. Also, to get splitting to work properly with existing regular LZO files, you need to index them before loading them in your Pig script.
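A rough sketch of that workflow, assuming the hadoop-lzo and elephant-bird jars are installed (the jar paths and output directory are placeholders): first index the .lzo file so it becomes splittable, then load it with elephant-bird's LZO loader instead of PigStorage.
# Build an .index file next to the .lzo so map tasks can seek into it
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer \
/data/serverx/20120710/serverx_20120710.lzo
# Load the indexed file with elephant-bird so Pig can spawn one mapper per split
pig -e "
REGISTER /path/to/elephant-bird-pig.jar;
raw1 = LOAD '/data/serverx/20120710/serverx_20120710.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader() AS (field1:chararray);
STORE raw1 INTO '/data/bazooka_indexed';
"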
Now I am trying to export data from a DB table and write it into HDFS.
The problem is: will the name node become a bottleneck? And how does the mechanism work - will the name node cache a slice (64 MB block) and then hand it to a data node?
Is there a better way than writing to HDFS directly? I think this approach doesn't take advantage of parallelism.
Thanks :)
Have you considered using Sqoop? Sqoop can be used to extract data from any DB that supports JDBC and put it into HDFS.
http://www.cloudera.com/blog/2009/06/introducing-sqoop/
The Sqoop import command takes the number of map tasks to run (it defaults to 1). Also, when parallelizing the work (map tasks > 1), the splitting column can be specified, or Sqoop will make a guess based on the table's primary key. Each map task will create a separate file for its results in the target directory. The NN will not be a bottleneck unless the number of files created is huge (the NN keeps the metadata about the files in memory).
Sqoop can also detect the source DB (Oracle, MySQL or others) and use DB-specific tools like mysqldump for the import instead of the JDBC channel, for better performance.
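A minimal sketch of such an import (connection string, credentials, table and column names are placeholders): --num-mappers sets the parallelism, --split-by names the column Sqoop partitions the ranges on, and --direct switches to the DB-specific fast path (e.g. mysqldump for MySQL) mentioned above.
sqoop import \
--connect jdbc:mysql://dbhost/mydb \
--username myuser -P \
--table orders \
--split-by order_id \
--num-mappers 8 \
--target-dir /data/orders \
--direct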