I'm trying to compress a list of files generated by a previous processor. The names are random except for a repeated beginning and end.
Ex:
part-00000-1dfde626-2a4f-4bc2-aa43-eaf3c940b2a8-c000.csv
part-00000-547c93da-088e-46c4-a478-a41aabfef9ea-c000.csv
I'm trying to zip all the files into one single file using the ExecuteStreamCommand processor. Following are my command and its arguments, which don't work:
command: /bin/zip
Argument: finalCompressedFile.zip;part.*csv
The pattern part.*csv does match all the generated file names, but I suspect the * is being passed to the shell as a literal character. If I give a single full file name it does the job, but then I won't be compressing all the files.
Any idea on this?
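A common workaround is to run the command through a shell so the glob gets expanded before zip sees it (a sketch, untested against your flow: the directory path is a placeholder, zip wants a shell glob like part*.csv rather than the regex part.*csv, and ExecuteStreamCommand splits its arguments on ; by default):
command: /bin/bash
Argument: -c;cd /path/to/generated/files && zip finalCompressedFile.zip part*.csv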
Related
I want to split a file into two parts
I want to split a single file into two different files using a shell script.
You can use the Linux split command, either by lines (split -l<num_of_lines> <file_name>) or by size (split -b<size><K|M|G> <file_name>).
For example, split -l100 a.txt will split a.txt into separate files of 100 lines each.
See man split for more examples and all the details.
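If you literally need exactly two parts, GNU split can also divide by chunk count (a sketch; the -n option requires GNU coreutils):
split -n l/2 a.txt    # two chunks of roughly equal size, never breaking a line
split -n 2 a.txt      # two byte-equal chunks (may cut a line in half)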
We have hundreds of files in the compressed archive trx_date.tar.bz2, which holds requests and responses. The structure of trx_date.tar.bz2 is: trx_date.tar contains trx_date, which contains the files log1, log2, log3. These hold XML requests with some sensitive info that I would like to mask to a default value. A request carries a tag like <number>1234567</number>, and I want to mask it, i.e. update the log file so it reads <number>3333333</number>.
I am able to grep it using the following:
Number1=bzcat $LOGDIR/$LOG_FORMAT | grep "<number>[0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9][0-2,4-9]"
How can we overwrite those values in the log files using a shell script?
The log file contains requests and responses, with a tag like <number>123456</number> among other tags. I want to read all the lines of the log file, replace that specific tag with <number>333333</number>, and save the result back to the same file. We have an info tag containing 333333 as well, but I don't want that one touched.
In principle, you cannot do directly what you want (without extracting the file from your .tar.bz2 compressed archive), since a .tar.bz2 file is a bzip2-ed compression of a tar archive. So the only good solution would be to extract files from the archive, do the modification on the extracted files (e.g. with sed(1) or awk), and recreate an archive from it. Using sed on one particular textual file to replace a pattern like <number>[0-9]*</number> by <number>0000000</number> is easy. Writing a bash for loop to iterate that on several files is easy. So combine both approaches, or write a tiny shell or Python script doing that (on the extracted files).
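For concreteness, a minimal sketch of that extract-modify-repack approach (archive and file names taken from the question; sed -i assumes GNU sed):
tar xjf trx_date.tar.bz2                  # extract the bzip2-ed tar archive
for f in trx_date/log*; do                # iterate over the extracted log files
    sed -i 's|<number>[0-9]*</number>|<number>3333333</number>|g' "$f"
done
tar cjf trx_date.new.tar.bz2 trx_date     # recreate a compressed archive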
In practice (but that is risky and I don't recommend it) you could hope that <number> digits </number> happens only in the files of the tar archive you want to modify in place, and then you could perhaps replace (directly in the uncompressed tar archive), using e.g. sed(1), such sequences with other sequences of the same byte length (read more about the tar format: metadata such as file sizes appear in textual form, padded with NUL bytes).
You might also consider using tardy, a tar post-processor (that you need to install).
I strongly recommend extracting the tar archive, operating on the extracted files, then recreating the archive. Of course, you need enough disk space, and you have to estimate it. But tell your manager that disk space is cheap, generally cheaper than your labor costs.
PS. The command given in your question is really wrong and does not do what you dream of. Read more about redirection, pipelines, globbing, and unix shells. Read carefully the documentation of Bash (notably basic shell features, shell expansion, command substitution). Read also the documentation of each command that you want to use, e.g. tar(1), grep(1), sed(1), etc. Read the relevant man-pages(7), perhaps with the man(1) command.
I want to create a script to split large files into multiple files by line count. The main requirement is that every split file must begin and end with a complete line.
No partial line should be present in any of the split files.
split is what you might be looking for.
split --lines <linenumber> <file>
and you will get a bunch of split files named like PREFIXaa, PREFIXab...
For further info see man split.
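For example (big.log is a hypothetical input; the part_ prefix is optional):
split --lines=100000 big.log part_    # produces part_aa, part_ab, ...
Because split --lines only ever cuts at line boundaries, no partial lines can appear in the output files.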
Specifying multiple inputs for command line tool?
I am new to bash and want to loop a command-line program over a folder containing numerous files.
The script takes two input files which, in my case, differ in one field of the file name ("...R1" vs "...R2"). Running a single instance of the tool looks like this:
tool_name infile1 infile2 -o outfile_suffix
Actual example:
casper sample_name_R1_001.out.fastq sample_name_R2_001.out.fastq -o sample_name_merged
File name format:
DCP-137-5102-T1A3_S33_L001_R1_001.fastq
DCP-137-5102-T1A3_S33_L001_R2_001.fastq
One field of the name (5102 in the example above) will vary between different pairs (e.g., 2000, 2110, 5100, etc.), with each pair distinguished by either R1 or R2.
I would like to know how to loop the script over a folder containing numerous pairs of matched files, and also ensure that the output (-o) gets the 'sample_name' suffix.
I am familiar with the basic for file in ./*.*; do ... $file...; done but that obviously won't work for this example. Any suggestions would be appreciated!
You want to loop over the R1's and derive the R2 and merged-file names from that, something like:
for file1 in ./*R1*; do
    file2=${file1/R1/R2}            # substitute R1 with R2 to get the mate file
    merge=${file1%_R1*}_merged      # strip everything from _R1 onward, keeping the sample name
    casper "$file1" "$file2" -o "$merge"
done
Note: ${file1/R1/R2} is bash pattern substitution, and ${file1%_R1*} removes the shortest suffix matching _R1*, leaving only the sample-name prefix.
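With the question's example pair, file1=./DCP-137-5102-T1A3_S33_L001_R1_001.fastq yields file2=./DCP-137-5102-T1A3_S33_L001_R2_001.fastq and merge=./DCP-137-5102-T1A3_S33_L001_merged.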
As I've noted previously, Pig doesn't cope well with empty (0-byte) files. Unfortunately, there are lots of ways these files can be created (even within Hadoop utilities).
I thought that I could work around this problem by explicitly loading only files that match a given naming convention in the LOAD statement using Hadoop's glob syntax. Unfortunately, this doesn't seem to work, as even when I use a glob to filter down to known-good input files, I still run into the 0-byte failure mentioned earlier.
Here's an example: Assume I have the following files in S3:
mybucket/a/b/ (0 bytes)
mybucket/a/b/myfile.log (>0 bytes)
mybucket/a/b/yourfile.log (>0 bytes)
If I use a LOAD statement like this in my pig script:
myData = load 's3://mybucket/a/b/*.log' as ( ... )
I would expect that Pig would not choke on the 0-byte file, but it still does. Is there a trick to getting Pig to actually only look at files that match the expected glob pattern?
This is a fairly ugly solution, but globs that don't rely on the * wildcard syntax appear to work. So, in our workflow (before calling our pig script), we list all of the files below the prefix we're interested in, and then create a specific glob that consists of only the paths we want.
For example, in the example above, we list "mybucket/a":
hadoop fs -lsr s3://mybucket/a
Which returns a list of files, plus other metadata. We can then create the glob from that data:
myData = load 's3://mybucket/a/b{/myfile.log,/yourfile.log}' as ( ... )
This requires a bit more front-end work, but allows us to specifically target the files we're interested in and avoid 0-byte files.
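For concreteness, here is a sketch of that front-end step in shell (the awk field positions assume the classic hadoop fs -lsr output, where $5 is the size and $8 is the path; adjust for your Hadoop version):
prefix=s3://mybucket/a/b
# keep only non-empty .log files, strip the prefix, then join the /name parts with commas
names=$(hadoop fs -lsr "$prefix" \
  | awk -v p="$prefix" '$5 > 0 && $8 ~ /\.log$/ { sub(p, "", $8); print $8 }' \
  | paste -s -d, -)
echo "${prefix}{${names}}"    # -> s3://mybucket/a/b{/myfile.log,/yourfile.log}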
Update: Unfortunately, I've found that this solution fails when the glob pattern gets long; Pig ends up throwing an exception "Unable to create input slice".