HDFS file compression internally - hadoop

I am looking for default compression in HDFS. I saw this, but I don't want my files to have gzip-like extensions (in fact, they should be accessible as if they weren't compressed). What I am looking for is exactly like the "Compress contents to save disk space" option on Windows. That option compresses the files internally, but they can be accessed just like usual files. Any ideas will be helpful.
Thanks

This doesn't exist in standard HDFS implementations; you have to manage your own compression. However, MapR, a proprietary Hadoop distribution, does this, if solving this problem is important enough for you.
After using Hadoop for a little while this doesn't really bother me anymore. Pig, MapReduce, and such handle the compression transparently enough for me. I know that's not a real answer, but I couldn't tell from your question whether you are simply annoyed or this is causing you a real problem. Getting used to adding | gunzip to everything didn't take long. For example:
hadoop fs -cat /my/file.gz | gunzip
cat file.txt | gzip | hadoop fs -put - /my/file.txt.gz

When you're using compressed files you need to think about making them splittable - i.e., can Hadoop split the file when running a MapReduce job? (If the file is not splittable, it will only be read by a single mapper.)
The usual way around this is to use a container format, e.g. SequenceFile, ORC, etc., where you can enable compression. If you are using simple text files (CSV, etc.), there's an LZO project by Twitter, but I haven't used it personally.
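If you want to check up front whether Hadoop can split a given compressed file, here is a minimal sketch using the standard codec factory (the class name and the command-line argument are placeholders of my own, not from the answer above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // the codec is picked by file extension, e.g. .gz -> GzipCodec, .bz2 -> BZip2Codec
        Path file = new Path(args[0]);
        CompressionCodec codec = factory.getCodec(file);

        if (codec == null) {
            System.out.println(file + ": no known compression extension, treated as plain (splittable) text");
        } else if (codec instanceof SplittableCompressionCodec) {
            System.out.println(file + ": " + codec.getClass().getSimpleName() + ", splittable");
        } else {
            System.out.println(file + ": " + codec.getClass().getSimpleName() + ", NOT splittable - single mapper");
        }
    }
}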

The standard way to store files with compression in HDFS is through the compression argument supplied while writing a file into HDFS. This is available in the mapper libraries, Sqoop, Flume, Hive, the HBase catalog, and so on. I am quoting some examples here from Hadoop. You don't need to worry about compressing the file locally for efficiency in Hadoop; it's best to use the HDFS file format options to do this work. This type of compression integrates smoothly with Hadoop mapper processing.
Job written through Mapper Library
You specify the codec while creating the writer in your mapper program. You will write your own mapper and reducer to write the file into HDFS, with your codec passed as an argument to the writer method. Here is the definition:
createWriter(Configuration conf, FSDataOutputStream out, Class keyClass, Class valClass, org.apache.hadoop.io.SequenceFile.CompressionType compressionType, CompressionCodec codec)
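For illustration, here is a minimal standalone sketch of creating such a compressed writer, using the newer option-style overload of the same createWriter call (the output path, key/value types, and records are placeholders; Snappy also needs the native library to be available):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class CompressedSequenceFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/data/lines.seq");

        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    // block compression with the Snappy codec
                    SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new SnappyCodec()));

            writer.append(new LongWritable(1), new Text("first record"));
            writer.append(new LongWritable(2), new Text("second record"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}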
Sqoop Import
The option below sends the default compression argument for the file import into HDFS:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs/ --compress
With Sqoop you can also specify a particular codec with the --compression-codec option:
sqoop import --connect jdbc:mysql://yourconnection/rawdata --table loglines --target-dir /tmp/data/logs --compression-codec org.apache.hadoop.io.compress.SnappyCodec
Hive import
In the example below, you can use your desired compression option while loading the file into Hive. Again, this is a property you can set while loading from your local file.
SET hive.exec.compress.output=true;
SET parquet.compression=SNAPPY; --this is the default actually
CREATE TABLE raw (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS PARQUET;
LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log' INTO TABLE raw;
I have not covered every method of compressing data while importing into HDFS.
The HDFS CLI (e.g. hdfs dfs -copyFromLocal) doesn't provide any direct way to compress. This is my understanding from working with the Hadoop CLI.
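If you do need the "compress while putting" behaviour the CLI lacks, a minimal sketch with the Java API looks like this (the class name, argument handling, and choice of GzipCodec are my own illustration, not part of the answer above): it gzips a local file as it is written into HDFS.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressedPut {
    // usage: CompressedPut <local file> <hdfs destination path without extension>
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // wrap the HDFS output stream with the codec so bytes are compressed on the way in
        Path dest = new Path(args[1] + codec.getDefaultExtension());
        OutputStream out = codec.createOutputStream(fs.create(dest));
        IOUtils.copyBytes(in, out, 4096, true); // copies and closes both streams
    }
}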

Related

Data cleaning in HDFS without using Hive

Is there an option where I can do hadoop fs -sed? Essentially, I am trying to replace "\" with "something" in my data directly in HDFS, without having to bring the data to local and load it again.
Currently I am using getmerge to bring the data to local, cleaning it, and loading it back to HDFS with copyFromLocal. It takes a lot of time this way, so is there an easier or faster way of doing the character replacement?
Not clear why you'd use Hive for this anyway.
Pig or Spark are far better options that don't require an explicit schema for the data.
See Pig REPLACE function
In any case, Hadoop CLI has no sed option
Another option would be NiFi, but that requires more setup, and is overkill for this task.
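If you go the Spark route, a minimal sketch of the replacement job might look like this (the input/output paths and the replacement string are placeholders; note that the result goes to a new directory rather than being edited in place):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsReplace {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("hdfs-replace");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // read directly from HDFS, replace the character, write the cleaned copy back
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input");
        JavaRDD<String> cleaned = lines.map(line -> line.replace("\\", "something"));
        cleaned.saveAsTextFile("hdfs:///data/output");

        sc.stop();
    }
}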

Can we use Sqoop to move any structured data file apart from moving data from RDBMS?

This question was asked to me in a recent interview.
As per my knowledge, we can use Sqoop to transfer data between an RDBMS and Hadoop ecosystem components (HDFS, Hive, Pig, HBase).
Can someone please help me in finding answer?
As per my understanding, Sqoop can't move an arbitrary structured data file (like a CSV) into HDFS or other Hadoop ecosystem components like Hive, HBase, etc.
Why would you use Sqoop for this?
You can simply put any data file directly into HDFS using its REST, web, or Java APIs.
Sqoop is not meant for this type of use case.
The main purpose of sqoop import is to fetch data from an RDBMS in parallel.
Apart from that, Sqoop has the import-mainframe tool.
The import-mainframe tool imports all sequential datasets in a partitioned dataset (PDS) on a mainframe to HDFS. A PDS is akin to a directory on open systems. The records in a dataset can contain only character data. Records will be stored with the entire record as a single text field.

How do I output data in a MapReduce job for Sqoop to export?

I've read a lot about importing from SQL using Sqoop, but there are only tidbits on exporting, and the examples always assume that you're exporting imported/pre-formatted data for some reason, or are using Hive.
How, from a MapReduce job, do I write data to HDFS that Sqoop can read and export?
This Sqoop documentation shows me the file formats supported. I guess I can use text/CSV, but how do I get there in MapReduce?
I've found this answer, which says to just modify the options for TextOutputFormat, but that just writes key/value pairs. My "values" are multiple fields/columns!
Try using other storage formats like Avro or Parquet (more buggy), so you have a schema. Then you can "query" those files and export their data into an RDBMS.
However, it looks like that support was a bit buggy/broken, and only worked properly if you created the files with Kite or Sqoop (which internally uses Kite).
http://grokbase.com/t/sqoop/user/1532zggqb7/how-does-sqoop-export-detect-avro-schema
I used the codegen tool to generate classes that could write to SequenceFiles:
sqoop/bin/sqoop-codegen --connect jdbc:sqlserver://... --table MyTable --class-name my.package.name.ClassForMyTable --outdir ./out/
And then I was able to read those in using Sqoop, exporting with the bulk setting. But the performance was abysmal. In the end, I instead just wrote simple CSV-like text files importable with the BCP tool, and what took hours with Sqoop completed in minutes.
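Whether you then hand the files to sqoop export or to BCP, the trick in the MapReduce job is the same: concatenate your columns into a single Text value and emit a NullWritable key, so TextOutputFormat writes exactly one delimited row per line. A minimal sketch (the tab-split input parsing and column names are placeholders; pass --input-fields-terminated-by ',' to sqoop export to make the delimiter explicit):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextOutputFormat, a NullWritable key means only the value is written,
// so each output line is one comma-delimited row that the export tool can parse.
public class CsvExportMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text outLine = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // placeholder parsing: build your real columns however you need
        String[] parts = value.toString().split("\t");
        String id = parts[0];
        String name = parts[1];
        String amount = parts[2];

        outLine.set(id + "," + name + "," + amount);
        context.write(NullWritable.get(), outLine);
    }
}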

Sqoop export to mysql very slow

I am trying to export some data to MySQL using Sqoop. Though I specify the parameter --num-mappers 12, it allocates only one mapper to the job, and it is extremely slow. How do I make sure the Sqoop job gets more mappers than just one?
The number of mappers is determined based on these criteria:
Size of the file in HDFS
Format of the file and whether it is splittable
Whether the codec is splittable, if the file is compressed
Run hadoop fs -du -s -h <path_of_the_file> to get the size of the file. Also check the values of the minimum and maximum split size parameters.

How to use DistCp to directly convert data into tables in Hive?

I am using DistCp to copy data from cluster 1 to cluster 2. I was able to successfully copy the table data from cluster 1 into cluster 2. However, on HDFS the data has only landed as plain files (visible in the file browser).
Is there any direct way to turn this HDFS data into Hive tables (including data types, delimiters, etc.) using DistCp command(s)? I can certainly query it to pull the data from HDFS, but then I'll have to convert the tables one by one. I'm trying to find a more efficient way to do this. Thanks!
Example:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
I haven't found documentation that lets you copy tables directly with DistCp. However, if anyone is facing a similar situation, they can use the approach below. It worked for me.
--hive
export table <<<table_name>>> to '<<<hdfs path>>>';
#bash/shell
hadoop distcp source destination
--hive
import table <<<table_name>>> from '<<<hdfs path>>>';
