Move files to HDFS using Spring XD

How can I move files from the local disk to HDFS using Spring XD?
I do not want just the contents; I want to move the whole file for archival, saving it with its original name and content.
Here is what I have tried:
stream create --name fileapple --definition "file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | WHATTODOHERE"
I can now see that, with reference mode, the file names with full paths are made available; how do I move those files to HDFS?

You might want to check the batch job that imports data from files to HDFS and see if that fits your requirement. You can also check file | hdfs as a stream and see if that works for you.

An example like the one below will load files from the data folder into HDFS and save them into folders by date (if there are multiple records with different dates), partitioned by the record column named LastModified; the data file is a JSON file with one record per line.
file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | hdfs --directory=/user/file_folder --partitionPath=path(dateFormat('yyyy-MM-dd',#jsonPath(payload,'$.LastModified'),'yyyy-MM-dd')) --fileName=output_file_name_prefix --fsUri=hdfs://HDFShostname.company.com:8020 --idleTimeout=30000
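For reference, a definition like that would be created and deployed from the XD shell roughly as follows; the stream name is arbitrary and the paths and fsUri are the ones from the example above, so adjust them to your environment:
stream create --name fileToHdfs --definition "file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | hdfs --directory=/user/file_folder --fsUri=hdfs://HDFShostname.company.com:8020 --idleTimeout=30000" --deploy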

Related

Writing Spark dataframe as parquet to S3 without creating a _temporary folder

Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark tries to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary directory, as that's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root of the bucket (it can't delete it, you see).
You need to write to a subdirectory under the bucket, with a full prefix, e.g. s3a://mybucket/work/out.
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with what is essentially ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so fail to copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR's S3 works around this by using DynamoDB to offer a consistent listing.
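A minimal sketch of that workaround, with made-up paths and bucket name: have the job write its output to HDFS, then push the finished result to S3 in one step, for example with distcp:
hadoop distcp hdfs://namenode:8020/user/me/spark-output s3a://mybucket/work/out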
I had the same issue when writing to the root of an S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")

Hadoop: Using Pig to add text at the end of every line of an HDFS file

We have files in HDFS with raw logs; each individual log entry is one line, as these logs are line separated.
Our requirement is to add a text (' 12345', for example) at the end of every log line in these files, using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks
AJ
Load the files so that each log entry goes into one field, i.e. line:chararray, and use CONCAT to add the text to each line, then store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into a new file instead of doing a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
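If you do parameterize the script as suggested, the load and store paths can be passed in at run time with -param; the script name and parameter names below are made up, and inside the script you would reference them as '$INPUT' and '$OUTPUT':
pig -param INPUT=/path/app1.log -param OUTPUT=/path/app1_new.log append_text.pig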
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt
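If there are several such files, the same pipeline can be wrapped in a small loop; the file names below are only placeholders:
for f in logfile1.log logfile2.log; do
  hdfs dfs -cat /user/hdfs/"$f" | sed 's/$/ 12345/' | hdfs dfs -put - /user/hdfs/new_"$f"
done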

Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5 TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code; I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is that you want to concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it, but it works. Suppose you have two files, file1 and file2, and you want to get a combined file called ConcatenatedFile. Here is the script for that:
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
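Afterwards you can sanity-check that the combined size roughly matches the sum of the inputs (same path as above):
hadoop fs -du -s -h /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt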
Hope this helps.
HDFS by itself does not provide such a capability. All out-of-the-box features (like hdfs dfs -text * with pipes, or FileUtil's copy methods) transfer all the data through your client machine.
In my experience, we have always used our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
Write your own simple MapReduce/Spark job to combine text files with your format.
Find an already implemented solution for this kind of purpose.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you; check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you run it with them.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
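Note that STORE writes a directory of part-* files under the output path rather than one literal file; you can list them with:
hadoop fs -ls /path/to/outputFile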
Doing an hdfs cat and then putting the data back into HDFS means all of it is processed on the client node and will degrade your network.

How to convert a Hive table's output, or the text file in HDFS on which the Hive table was created, to CSV format

There is one constraint on the cluster I'm working on: nothing can be taken out of the cluster to a Linux box.
The files on which the Hive tables are built are in sequence file format or text format.
I need to change those files to CSV format without outputting them to a Linux box; alternatively, if possible, I could create a table from the existing table that is STORED AS a CSV file (I'm not sure whether I can do that).
I have tried a lot of things, but couldn't do it without outputting to a Linux box. Any help is appreciated.
You can create another hive table like this:
CREATE TABLE hivetable_csv ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' as
select * from hivetable;
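If you are not sure where the new table's files ended up, you can check its Location first; the exact warehouse path varies per installation:
hive -e "DESCRIBE FORMATTED hivetable_csv;"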
Then copy the table contents to a new directory
hadoop fs -cat /user/hive/warehouse/hivetable_csv/* | hadoop fs -put - /user/username/hivetable.csv
Alternatively, you can also try
hadoop fs -cp
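For example, assuming the new table sits in the default warehouse location (this copies the whole directory of delimited files within HDFS):
hadoop fs -cp /user/hive/warehouse/hivetable_csv /user/username/hivetable_csv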

How to prevent corrupted .gz files in Hadoop

I'm using the following simple code to upload files to HDFS.
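// config is an org.apache.hadoop.conf.Configuration; src and dst are org.apache.hadoop.fs.Path objects (local source, HDFS destination)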
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a webserver Java component, then rotated and closed by logback in .gz format. I've noticed that sometimes the .gz file is corrupted:
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous, since the aggregation for the whole day fails, and several slave nodes get blacklisted in such cases.
What can I do in such a case?
Can the Hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't do - this error is normally associated with GZip files which haven't been closed out when originally written to local disk, or are being copied to HDFS before they have finished being written to.
You should be able to check by running an md5sum on the original file and that in HDFS - if they match then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files - the one in HDFS should have been modified after the local file system one.
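A quick way to compare them; the local stat invocation below assumes GNU coreutils, so adjust it on other platforms:
hadoop fs -stat "%y" /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
stat -c '%y' /path/to/local/logfile.log_2013_02_20_07.close.gz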
