filename of the sqoop import - sqoop

When we import from an RDBMS to HDFS using Sqoop, we give a target directory to store the data; once the job completes we see files named like part-m-00000 as the mapper output. Is there any way to pass the filename in which the data will be stored? Does Sqoop have an option like that?

According to this answer, you can specify arguments passed to MapReduce with the -D option, which can accept file-name options:
-Dmapreduce.output.basename=myoutputprefix
Although this will change the basename of your file, it will not change the part numbers.
Same answers on other sites:
cloudera
hadoopinrealworld
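For illustration, the flag goes right after the tool name, before the Sqoop-specific arguments; the connection string, table, and directory below are made-up placeholders, not from the question:
sqoop import \
  -Dmapreduce.output.basename=myoutputprefix \
  --connect jdbc:mysql://dbhost/mydb \
  --table mytable \
  --target-dir /user/hadoop/mytable \
  -m 4
The files then come out as myoutputprefix-m-00000, myoutputprefix-m-00001, and so on; the part numbers stay, as noted above.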

No, you can't rename it.
You can specify --target-dir <dir> to tell Sqoop the directory where all the data is imported.
In this directory you will see many part files (e.g. part-m-00000). These part files are created by the various mappers (remember the -m <number> in your sqoop import command).
Since the data is imported into multiple files, how would you name each part file?
I don't see any additional benefit to this renaming.

Related

How to specify number of partitions when writing a Parquet file?

parquet_writer.write_table(table)
This line writes a single file.
The documentation says:
This creates a single Parquet file. In practice, a Parquet dataset may consist of many files in many directories. We can read a single file back with read_table:
Is there a way for PyArrow to create a Parquet file in the form of a directory with multiple part files in it, such as:
ls -lrt permit-inspections-recent.parquet
... 14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
... 14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
Regards,
Yash
You need to tell Arrow how to partition the data. This is done with the partition_cols argument. See here: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_to_dataset.html
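As a minimal sketch of that suggestion (the table and the year column are invented here, not taken from the question), write_to_dataset with partition_cols produces a directory of part files instead of a single file:
import pyarrow as pa
import pyarrow.parquet as pq

# toy table with a column to partition on
table = pa.table({
    "year": [2020, 2020, 2021],
    "value": [1.0, 2.0, 3.0],
})

# writes a directory tree such as permits.parquet/year=2020/<some-uuid>.parquet
pq.write_to_dataset(table, root_path="permits.parquet", partition_cols=["year"])
Note that the layout is Hive-style (one subdirectory per partition value) rather than the flat part-00000/part-00001 naming shown in the question.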

Working with zips in pyspark

I have n zips in a directory and I want to extract each one of them, pull some data out of a file or two inside each zip, and add it to a graph DB. I have made a sequential Python script for this whole thing, but I am stuck at converting it for Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at a more initial step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# function extract_and_populate_graphDB() returns 1 after doing all the work.
# This was done so that a reduce can be applied to trigger the Spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphDB).reduce(lambda a, b: a + b)
What I can't figure out is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or, should I extract the data before putting it into the HDFS? Or maybe there's some other approach to deal with this?
Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
Unfortunately, there is no innate support of Zip files, so you'll need to do a bit more legwork to get there.
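One common way to do that legwork, sketched below under the assumption that each archive fits in executor memory (the HDFS path and the UTF-8 decoding are placeholders, not from the original answer), is to read whole archives with sc.binaryFiles and unpack them with Python's zipfile module:
import io
import zipfile

def read_zip_contents(path_and_bytes):
    # binaryFiles yields (filename, raw bytes) pairs, so the archive arrives intact
    path, content = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(content)) as zf:
        # return every member as text; adapt this to pull out just the file or two you need
        return [(path, name, zf.read(name).decode("utf-8", errors="replace"))
                for name in zf.namelist()]

zips = sc.binaryFiles("hdfs:///path/to/zips/*.zip")
records = zips.flatMap(read_zip_contents)
From records you can then map your extract-and-populate function over the parsed contents instead of over raw file names.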

Multiple source files for s3distcp

Is there a way to copy a list of files from S3 to HDFS instead of a complete folder using s3distcp? This is for cases where srcPattern cannot work.
I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.
The workaround I am currently using is to list all the file names in srcPattern:
hadoop jar s3distcp.jar \
  --src s3n://bucket/src_folder/ \
  --dest hdfs:///test/output/ \
  --srcPattern '.*somefile.*|.*anotherone.*'
Can this work when the number of files is very large, say around 10,000?
hadoop distcp should solve your problem.
We can use distcp to copy data from S3 to HDFS.
It also supports wildcards, and we can provide multiple source paths in the command.
http://hadoop.apache.org/docs/r1.2.1/distcp.html
Go through the usage section at this particular URL.
Example:
Consider that you have the following files in an S3 bucket (test-bucket) inside the test1 folder:
abc.txt
abd.txt
defg.txt
And inside the test2 folder you have:
hijk.txt
hjikl.txt
xyz.txt
And your hdfs path is hdfs://localhost.localdomain:9000/user/test/
Then the distcp command for a particular pattern is as follows:
hadoop distcp \
  s3n://test-bucket/test1/ab*.txt \
  s3n://test-bucket/test2/hi*.txt \
  hdfs://localhost.localdomain:9000/user/test/
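When the file list runs into the thousands and a single wildcard pattern gets unwieldy, distcp (the version documented at the link above) can also read its sources from a list file with the -f option; the file names and paths below are just placeholders:
# put one source URI per line into a file and upload it to HDFS
hadoop fs -put srclist.txt hdfs://localhost.localdomain:9000/user/test/srclist.txt

# -f tells distcp to use that list as its set of sources
hadoop distcp -f hdfs://localhost.localdomain:9000/user/test/srclist.txt \
  hdfs://localhost.localdomain:9000/user/test/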
Yes you can. Create a manifest file with all the files you need and use the --copyFromManifest option, as mentioned here.

Hadoop: How to generate custom reduce output file name?

Now, I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filename, such as "xxxx-[r/m]-00001".
Is there any way to do this?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on the HDFS.
I think you can do it in the job driver: when your job completes, change the file names. You could also do it via terminal commands.
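As a sketch of the terminal route (the output path and file names are placeholders), the rename is just an HDFS move per part file, and merging is also an option:
# strip the "-r-00000" suffix from a reducer output file
hadoop fs -mv /user/me/job-output/xxxx-r-00000 /user/me/job-output/xxxx

# or pull all part files together into one local file
hadoop fs -getmerge /user/me/job-output /tmp/xxxx-merged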

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-0000 files. I would like to read the contents of these files for better understanding, but the cat command doesn't seem to work; it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-0000 files using the -text option of Hadoop's filesystem shell. Just get into the Hadoop directory and type the following:
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to add the missing jar to your HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper \
  -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 \
  -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
