Working with zips in pyspark - hadoop

I have n zips in a directory, and I want to extract each one of them, pull some data out of a file or two inside each zip, and add it to a graph DB. I have written a sequential Python script for the whole thing, but I am stuck converting it to Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at an earlier step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# function extract_and_populate_graphDB() returns 1 after doing all the work.
# This was done so that a closure can be applied to start the spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphDB).reduce(lambda a, b: a + b)
What I can't work out, in order to test this, is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?

Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
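For illustration, here is a minimal PySpark sketch of reading and writing gzip-compressed text; the HDFS paths are hypothetical placeholders.
# sc.textFile decompresses .gz input transparently
rdd = sc.textFile("hdfs:///data/logs/events.txt.gz")
# write gzip-compressed output by naming the codec for this call
# (a per-call alternative to the --conf settings above)
rdd.saveAsTextFile(
    "hdfs:///data/logs/events_out",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)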
Unfortunately, there is no built-in support for Zip files, so you'll need to do a bit more legwork to get there.
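One workaround is to read each zip as a whole binary file and unpack it with Python on the workers. Below is a minimal sketch, assuming the zips sit under a hypothetical HDFS directory and that the members of each zip are small enough to hold in memory; read_zip_contents is a made-up helper name.
import io
import zipfile
def read_zip_contents(pair):
    # pair is (hdfs_path, raw_bytes) for one whole zip file
    path, raw_bytes = pair
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as zf:
        for name in zf.namelist():
            yield (name, zf.read(name).decode("utf-8"))
# binaryFiles keeps each zip intact as a single (path, bytes) record
zips = sc.binaryFiles("hdfs:///data/zips/*.zip")
files_inside = zips.flatMap(read_zip_contents)
From files_inside you could parse the one or two files you care about per zip and feed your extract_and_populate_graphDB logic, instead of parallelizing only the zip names.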

Related

How does a shell script read the data in a batch test folder

I recently replicated a SEGAN experiment based on TensorFlow 0.12.1. The author provides a shell script for testing (clean_wav.sh), as shown in the figure below:
This is the original version provided by the author. I modified it to match the path of my test data, as follows:
noisy_testset_wav_16k is my test data folder, but when I run the script, the system reports an error:
The path is a directory, but when I change it to a single file:
NOISY_WAVNAME='/home/zyf/SEGAN/SEGAN/segan-master1/noisy_testset_wav_16k/p232_023.wav'
the script runs normally and the program does what it should.
However, it can only process one audio file at a time and cannot process files in batches. If anyone knows the reason or has any suggestions, please give me a word of advice or two. Thank you very much.
The code is written so that it only processes one file at a time; you can add a loop in the shell script to process all files in the folder:
# NOISY_WAVDIR should be your noisy test folder and SAVE_PATH the output folder
for f in "$NOISY_WAVDIR"/*.wav; do
python main.py --init_noise_std 0. --save_path segan_v1.1 \
--batch_size 100 --g_nl prelu --weights SEGAN-41700 \
--preemph 0.95 --bias_deconv True \
--bias_downconv True --bias_D_conv True \
--test_wav "$f" --save_clean_path "$SAVE_PATH"
done
but that would not be an optimal use of the GPU, since you would not be processing the audio in batches. Ideally you'd modify the Python code to process audio in batches, but that would not be a trivial task.

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop allowing multiple reduce tasks running on different machines to assemble a single HDFS file as output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, then each reducer creates, as output, a variable number of data blocks and assigns them to this file in a certain order.
The answer is no, that would be an unnecessary complication for a rare use case.
What you should do
option 1 - add some code at the end of your Hadoop driver
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
    // list the job output directory and merge the part-r-XXXXX files into one
    // HDFS file; FileUtil.copyMerge (available up to Hadoop 2.x) does this,
    // or you can open the parts with FileSystem readers and merge them in whatever way you want
    Configuration conf = job.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    Path outputDir = FileOutputFormat.getOutputPath(job);
    FileUtil.copyMerge(fs, outputDir, fs, new Path(outputDir.getParent(), "merged"), false, conf, null);
}
All of the required methods are present in the Hadoop FileSystem API.
option 2 - add a job to merge the files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline with your main job.
This would work faster for big inputs.
If you want the merged output file on the local filesystem, you can use the hadoop getmerge command to combine the multiple reduce task files into a single local output file; the command is below.
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All 4 CSV files have the same headers, and the column order is also the same.
I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
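If you do go the Spark route, here is a minimal PySpark sketch; the input and output paths are hypothetical placeholders.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("merge-csv").getOrCreate()
# read all the CSVs (same header) from one directory,
# then collapse to a single partition so only one part file is written
df = spark.read.csv("hdfs:///data/emp_csvs/*.csv", header=True)
df.coalesce(1).write.csv("hdfs:///data/emp_merged", header=True, mode="overwrite")
Note that this still produces a directory containing a single part-*.csv file (plus a _SUCCESS marker), not a bare file.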
Ideally, you should not be storing raw CSV in Hadoop for very long anyway; rather, convert it to ORC or Parquet as columnar data. Especially if you are already reading CSV to begin with -- do not output CSV again.
If the idea is to produce one CSV to later download, then I would suggest using Hive + Beeline to do that.
This will store the result into a file in the local file system.
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQlFile.sql > theFileWhereToStoreTheData.csv
Try using the getmerge utility to merge the CSV files.
For example, say you have EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv placed at some location on HDFS. You can merge all these files and place the merged file at some new location.
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /newhdfsfilelocation/MERGED_EMP_FILE.csv

Generating TDB Dataset from archive containing N-TRIPLES files

Apologies, in advance, for a possible duplicate.
I have an archive containing 117,426 files (each in the N-TRIPLES format) that I wish to load into the default graph of a TDB dataset. Due to the large number of files, I need to be able to perform this import without manually selecting individual files for upload.
I am in Bash, with Jena and Fuseki distributions at my disposal.
If possible, I want to avoid the worst-case scenario of just writing a java application to do this. If I have to write a java application for this, what hooks exist in RIOT/TDB to perform programmatic bulk-loading?
As a general comment, one way is to concatenate the N-Triples files to generate one single file.
You can load many files at once with either tdbloader or tdbloader2.
tdbloader --loc DB ... your files ...
The 117,426 files may strain your OS for a single command-line invocation. You can pipe the files into tdbloader (it's just like concatenating the files first):
... | tdbloader --loc DB -- -
where ... is some way to get bash to cat the files (possibly from a subshell).
e.g. (you'll need to adjust this to cover all 117,426 files):
( for x in data*.nt
do
cat $x
done
) | tdbloader --loc DB -- -

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to be working fine. Out of curiosity, I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-00000 files. I would like to read the contents of these files for a better understanding, but the cat command doesn't seem to work; it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
You can read part-00000 files using Hadoop's filesystem -text option. Just go into the Hadoop directory and type the following:
bin/hadoop dfs -text /Path-to-part-file/part-m-00000
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the jar containing the corresponding class to the HADOOP_CLASSPATH variable:
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper \
    -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 \
    -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file

Resources