Trouble with the process substitution in the file command - bash

It's easier to show this than to describe it in words.
find . -name jo\* -print > list
cat list
#./jo1
#./jo2
#./jo3
# the "file" by reading the list of files from the file "list"
file -f list
#./jo1: ASCII text
#./jo2: ASCII text
#./jo3: ASCII text
#now with process substitution
file -f <(find . -name jo\* -print)
outputs nothing.. ;(
#repeat with -x
set -x
file -f <(find . -name jo\* -print)
set +x
#shows
+ file -f /dev/fd/63
++ find . -name 'jo*' -print
+ set +x
So, it should work. But it doesn't. Why?
EDIT
Please note: process substitution should work anywhere you would otherwise enter a filename, for example:
diff <(some command) <(another command)
bash runs the above as
diff /dev/fd/... /dev/fd/...
Also, for example with grep, you can use:
grep -f <(command_for_produce_the_patterns) files..
again, bash internally runs this as
grep -f /dev/fd/63 files....
So, the same should work with file:
file -f <(command)

You're doing things right. It's a bug in your implementation of file, which I can reproduce on mine (file 5.22 on Debian jessie). It expects the argument to -f to be a seekable file, and doesn't detect the error when the file isn't seekable. That's why it works with a regular file, but not with a pipe (which is what process substitution uses to pass the data between the two processes).
You can observe what's going on with strace:
$ strace file -f <(echo foo)
…
open("/proc/self/fd/13", O_RDONLY) = 3
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
…
read(3, "foo\n", 4096) = 5
…
read(3, "", 4096) = 0
lseek(3, 0, SEEK_SET) = -1 ESPIPE (Illegal seek)
read(3, "", 4096) = 0
close(3) = 0
The file program opens the list of file names on file descriptor 3 and reads it. Then it tries to seek back to the beginning of the file. This fails, but the program doesn't notice and reads from the file again, which yields no data since the file position is already at the end. Thus file ends up with an empty list of file names.
In the source code, the -f option triggers the unwrap function:
private int
unwrap(struct magic_set *ms, const char *fn)
{
    // …
    if (strcmp("-", fn) == 0) {
        f = stdin;
        wid = 1;
    } else {
        if ((f = fopen(fn, "r")) == NULL) {
            (void)fprintf(stderr, "%s: Cannot open `%s' (%s).\n",
                progname, fn, strerror(errno));
            return 1;
        }
        while ((len = getline(&line, &llen, f)) > 0) {
            // … code to determine column widths
        }
        rewind(f);
    }
    // Code to read the file names from f follows
}
If the file name isn't - (which means read from standard input), then the code reads the file twice: once to determine the maximum width of the file names and once to process the files. The call to rewind has no error handling. With - as the file name, the code doesn't try to align the columns.
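Until the bug is fixed in file, a simple work-around that keeps the same approach is to feed the list on standard input, since the - code path shown above never rewinds:
find . -name jo\* -print | file -f -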

Instead of using the <(cmd) syntax, use $(cmd). This should overcome the bug.
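Presumably that means passing the names as ordinary arguments rather than through -f; a rough sketch of that idea (note that, unlike -f, it breaks on file names containing whitespace):
file $(find . -name jo\* -print)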

Related

rsync rename duplicated files in dest directory

I have implemented an rsync-based system to move files from different environments to others.
The problem I'm facing now is that sometimes there are files with the same name but different paths and content.
I want to make rsync (if possible) rename duplicated files, because I need and use the --no-relative option.
Duplicated files can occur in two ways:
There was already a file with the same name in the destination directory.
In the same rsync execution, we are transferring files with the same name from different locations. Ex: dir1/file.txt and dir2/file.txt
Adding the -b --suffix options lets me handle at least one repetition for the first type of duplicate mentioned.
A minimum example (for Linux based systems):
mkdir sourceDir1 sourceDir2 sourceDir3 destDir;
echo "1" >> sourceDir1/file.txt;
echo "2" >> sourceDir2/file.txt;
echo "3" >> sourceDir3/file.txt;
rsync --no-relative sourceDir1/file.txt destDir
rsync --no-relative -b --suffix="_old" sourceDir2/file.txt sourceDir3/file.txt destDir
Is there any way to achieve my requirements?
I don't think that you can do it directly with rsync.
Here's a work-around in bash that does some preparation work with find and GNU awk and then calls rsync afterwards.
The idea is to categorize the input files by "copy number" (for example sourceDir1/file.txt would be the copy #1 of file.txt, sourceDir2/file.txt the copy #2 and sourceDir3/file.txt the copy #3) and generate a file per "copy number" containing the list of all the files in that category.
Then, you just have to launch an rsync with --files-from and a customized --suffix per category.
Pros
fast: incomparable to firing one rsync per file.
safe: it won't ever overwrite a file (see step #3 below).
robust: handles any filename, even with newlines in it.
Cons
the destination directory has to be empty (or else it might overwrite a few files).
the code is a little long (and I made it longer by using a few process substitutions and by splitting the awk call into two).
Here are the steps:
0)   Use a correct shebang for bash in your system.
#!/usr/bin/env bash
1)   Create a directory for storing the temporary files.
tmpdir=$( mktemp -d ) || exit 1
2)   Categorize the input files by "duplicate number", generate the files for rsync --files-from (one per dup category), and get the total number of categories.
read filesCount < <(
    find sourceDir* -type f -print0 |
    LANG=C gawk -F '/' '
        BEGIN {
            RS = ORS = "\0"
            tmpdir = ARGV[2]
            delete ARGV[2]
        }
        {
            id = ++seen[$NF]
            if ( ! (id in outFiles) ) {
                outFilesCount++
                outFiles[id] = tmpdir "/" id
            }
            print $0 > outFiles[id]
        }
        END {
            printf "%d\n", outFilesCount
        }
    ' - "$tmpdir"
)
3)   Find a unique suffix for rsync --suffix, generated from a given set of characters, so that the suffixed backup names can never collide with existing file names.
note: You can skip this step if you know for sure that there's no existing filename that ends with _old+number.
(( filesCount > 0 )) && IFS='' read -r -d '' suffix < <(
    LANG=C gawk -F '/' '
        BEGIN {
            RS = ORS = "\0"
            charsCount = split( ARGV[2], chars)
            delete ARGV[2]
            for ( i = 1; i <= 255; i++ )
                ord[ sprintf( "%c", i ) ] = i
        }
        {
            l0 = length($NF)
            l1 = length(suffix)
            if ( substr( $NF, l0 - l1, l1) == suffix ) {
                n = ord[ substr( $NF, l0 - l1 - 1, 1 ) ]
                suffix = chars[ (n + 1) % charsCount ] suffix
            }
        }
        END {
            print suffix
        }
    ' "$tmpdir/1" '0/1/2/3/4/5/6/7/8/9/a/b/c/d/e/f'
)
4)   Run the rsync(s).
for (( i = filesCount; i > 0; i-- ))
do
    fromFile=$tmpdir/$i
    rsync --no-R -b --suffix="_old${i}_$suffix" -0 --files-from="$fromFile" ./ destDir/
done
5)   Clean up the temporary directory.
rm -rf "$tmpdir"
I guess it's not possible with rsync alone. You have to make a list of files first and analyze it to work around the duplicates. Take a look at this command:
$ rsync --no-implied-dirs --relative --dry-run --verbose sourceDir*/* dst/
sourceDir1/file.txt
sourceDir2/file.txt
sourceDir3/file.txt
sent 167 bytes received 21 bytes 376.00 bytes/sec
total size is 6 speedup is 0.03 (DRY RUN)
Let's use it to create a list of source files:
mapfile -t list < <(rsync --no-implied-dirs --relative --dry-run --verbose sourceDir*/* dst/)
Now we can loop through this list with something like this:
declare -A count
for item in "${list[@]}"; {
    [[ $item =~ ^sent.*bytes/sec$ ]] && break
    [[ $item ]] || break
    fname=$(basename "$item")
    echo "$item dst/$fname${count[$fname]}"
    ((count[$fname]++))
}
sourceDir1/file.txt dst/file.txt
sourceDir2/file.txt dst/file.txt1
sourceDir3/file.txt dst/file.txt2
Change echo to rsync and that is it.
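With quoting added, the replacement line inside the loop could look like this (just a sketch; rsync treats the not-yet-existing destination path as the new file name, which is what performs the rename):
rsync "$item" "dst/$fname${count[$fname]}"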

Paste hundreds of files with a specific name pattern in bash/awk/c

I have 500 files and I want to merge them by adding columns.
My first file
3
4
1
5
My second file
7
1
4
2
Output should look like
3 7
4 1
1 4
5 2
But I have 500 files (sum_1.txt, sum_501.txt, and so on up to sum_249501.txt), so I need 500 columns, and it would be very tedious to write out 500 file names.
Is there an easier way to do this? I tried the following, but instead of making 500 columns it makes a lot of rows:
#!/bin/bash
file_name="sum"
tmp=$(mktemp) || exit 1
touch ${file_name}_calosc.txt
for first in {1..249501..500}
do
    paste -d ${file_name}_calosc.txt ${file_name}_$first.txt >> ${file_name}_calosc.txt
done
Something like this (untested) should work regardless of how many files you have:
awk '
BEGIN {
    for (i=1; i<=249501; i+=500) {
        ARGV[ARGC++] = "sum_" i ".txt"
    }
}
{ vals[FNR] = (NR==FNR ? "" : vals[FNR] OFS) $0 }
END {
    for (i=1; i<=FNR; i++) {
        print vals[i]
    }
}
'
It'd only fail if the total content of all the files was too big to fit in memory.
Your command says to paste two files together; to paste more files, give more files as arguments to paste.
You can paste a number of files together like
paste sum_{1..249501..500}.txt > sum_calosc.txt
but if the number of files is too large for paste, or the resulting command line is too long, you may still have to resort to temporary files.
Here's an attempt to paste 25 files at a time, then combine the resulting 20 files in a final big paste.
#!/bin/bash
d=$(mktemp -d -t pastemanyXXXXXXXXXXX) || exit
# Clean up when done
trap 'rm -rf "$d"; exit' ERR EXIT
for ((i=1; i<=249501; i+=500*25)); do
    printf -v dest "paste%06i.txt" "$i"
    for ((j=1, k=i; j<=25; j++, k+=500)); do
        printf "sum_%i.txt\n" "$k"
    done |
        xargs paste >"$d/$dest"
done
paste "$d"/* >sum_calosc.txt
The function of xargs is to combine its inputs into a single command line (or more than one if it would otherwise be too long; but we are specifically trying to avoid that here, because we want to control exactly how many files we pass to paste).
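A tiny illustration of that behaviour, separate from the script above:
printf 'sum_1.txt\nsum_501.txt\nsum_1001.txt\n' | xargs paste
# runs a single command: paste sum_1.txt sum_501.txt sum_1001.txt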

Split large csv file into multiple files and keep header in each part

How do I split a large CSV file (1 GB) into multiple files (say one part with 1000 rows, the 2nd part with 10000 rows, the 3rd with 100000 rows, etc.) and preserve the header in each part?
How can I turn this
h1 h2
a aa
b bb
c cc
.
.
12483720 rows
into
h1 h2
a aa
b bb
.
.
.
1000 rows
And
h1 h2
x xx
y yy
.
.
.
10000 rows
Another awk. First some test records:
$ seq 1 1234567 > file
Then the awk:
$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file
Explained:
$ awk '
NR==1 {           # first record:
    n=1000        # set first output file size and
    h=$0          # store the header
}
{
    print > n     # output to file
}
NR==n+c {         # once target NR has been reached. close(n) goes here if needed
    n*=10         # grow target magnitude
    c=NR-1        # set the correction factor.
    print h > n   # first the head
}' file
Count the records:
$ wc -l 1000*
1000 1000
10000 10000
100000 100000
1000000 1000000
123571 10000000
1234571 total
Here is a small adaptation of the solution from: Split CSV files into smaller files but keeping the headers?
awk -v l=1000 '(NR==1){header=$0;next}
(n==l || file==""){
    c=sprintf("%0.5d",c+1)
    if (file!="") close(file)
    file=FILENAME; sub(/csv$/,c".csv",file)
    print header > file
    if (n==l) l*=10
    n=0
}
{print $0 > file; n++}' file.csv
This works in the following way:
(NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
(n==l || file==""){...}: Every time we have written the requested number of records/lines, or when no output file has been opened yet, we need to start writing to a new file. We then perform the following actions:
c=sprintf("%0.5d",c+1): increase the counter by one and format it as 000xx
if (file!="") close(file): close the file we just wrote to (if one is already open).
file=FILENAME; sub(/csv$/,c".csv",file): define the new filename
print header > file: open the file and write the header to that file.
if (n==l) l*=10: increase the maximum record count for the next file (only when the previous file was actually filled)
n=0: reset the current record count
{print $0 > file; n++}: write the entries to the file and increment the record count
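A quick sanity check after running it, assuming the input was file.csv so the parts are named file.00001.csv, file.00002.csv, and so on:
wc -l file.0*.csv        # each part should hold its target row count plus one header line
head -1 file.00002.csv   # should print the same header as file.csv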
Hacky, but it utilizes the split utility, which does most of the heavy lifting for splitting the files. Since the split files follow a well-defined naming convention, I then loop over the files without the header, write the header concatenated with the file body to tmp.txt, and move that file back to the original filename.
# Use `split` utility to split the file csv, with 5000 lines per files,
# adding numerical suffixs, and adding additional suffix '.split' to help id
# files.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv
# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'
# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'
# Walk the files that do not have headers. For each one, cat the header from
# file with header, with rest of body, output to tmp.txt, then mv tmp.txt to
# original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
    cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
    mv tmp.txt $f
done
Here's a first approach:
#!/bin/bash
head -1 $1 >header
split $1 y
for f in y*; do
    cp header h$f
    cat $f >>h$f
done
rm -f header
rm -f y*
The following bash solution should work nicely :
IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}
The outer loop produces the number of rows we're going to write to each successive file. It generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has, and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.
Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this inner loop we try to read a line from the input. If that works, we write it to the current output file; if it doesn't (we've reached the end of the input), we break out of both levels of loop.
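A minimal usage sketch, assuming the loop above is saved as a script named split_by_rows.sh (a name chosen here only for illustration):
bash split_by_rows.sh < file.csv
ls file_with_*_rows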

Why do I get unexpected results when using a pipe and a herestring?

When trying to both pipe the output of a command into another, and also have the second command read from a herestring, I do not get the results I expected. For example:
echo "a" | grep -f - <<<abc
I would expect to produce the output
abc
but I get nothing instead
Both the pipe and the herestring are trying to become stdin for the second command, here grep. Running strace on it shows that, in this case, only the herestring is actually available to grep. grep then gets an empty "file" to search and finds no matches. Here is a portion of the strace output when the search space is a file instead of a herestring:
read(0, "a\n", 4096) = 2
read(0, "", 4096) = 0
<snip>
read(3, "abc\n", 32768) = 4
read(3, "", 32768) = 0
but with the herestring we instead see:
read(0, "abc\n", 4096) = 4
read(0, "", 4096) = 0
<snip>
read(0, "", 32768) = 0
so we never read the values from the pipe that we expected to have been our pattern space.
Using a process substitution gets around this problem, because then the pattern space and the search space no longer both come from file descriptor 0:
echo "a" | grep -f - <(echo "abc")
for example, produces the expected output abc.
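The same reasoning works the other way around; a sketch that keeps the herestring as the search space and moves the pattern list into the process substitution:
grep -f <(echo "a") <<<abc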

Bash 'Process substitution' - what is going on here?

I am using process substitution to create a shorthand inline XSL function that I have written...
function _quickxsl() {
    if [[ $1 == "head" ]] ; then
        cat <<'HEAD'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY apos "'">
]>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    xmlns:func="http://exslt.org/functions"
    xmlns:kcc="http://www.adp.com/kcc"
    extension-element-prefixes="func kcc">
HEAD
    else
        cat <<'FOOT'
</xsl:stylesheet>
FOOT
    fi
}
function quickxsl() {
    {
        _quickxsl head && cat && _quickxsl foot
    } | xsltproc - "$@"
}
It seems to work fine if I provide real files as arguments to xsltproc. In the case where I call it with a process substitution on the other hand:
$ quickxsl <(cat xml/kconf.xml) <<QUICKXSL
QUICKXSL
warning: failed to load external entity "/dev/fd/63"
unable to parse /dev/fd/63
Now, I understand that the pipe path is being provided to a sub process connected via another pipe (xsltproc). So I rewrote it slightly:
function quickxsl() {
    xsltproc - "$@" < <( _quickxsl head && cat && _quickxsl foot )
}
That seemed to improve things a little:
/dev/fd/63:1: parser error : Document is empty
^
/dev/fd/63:1: parser error : Start tag expected, '<' not found
^
unable to parse /dev/fd/63
Any idea why the pipe cannot be inherited?
Update:
If I simplify the quickxsl function again:
function quickxsl() {
    xsltproc <( _quickxsl head && cat && _quickxsl foot ) "$@"
}
I get the same issue, but it's easy to identify which fifo is causing the issue with a bit of xtrace...
$ quickxsl <(cat xml/kconf.xml) <<QUICKXSL
QUICKXSL
+ quickxsl /dev/fd/63
++ cat xml/kconf.xml
+ xsltproc /dev/fd/62 /dev/fd/63
++ _quickxsl head
++ [[ head == \h\e\a\d ]]
++ cat
++ cat -
++ _quickxsl foot
++ [[ foot == \h\e\a\d ]]
++ cat
/dev/fd/62:1: parser error : Document is empty
^
/dev/fd/62:1: parser error : Start tag expected, '<' not found
^
cannot parse /dev/fd/62
The purpose of this exercise is to have the process substitution pipe connected to a function that returns XML on its standard output, which it does correctly. If I write the contents to a file and pass that to the function, all is well. If I use process substitution, the child process can't read from the pipe, and the pipe appears closed or inaccessible. Example:
quickxsl <(my_soap_service "query") <<XSL
<xsl:template match="/">
<xsl:value-of select="/some/path/text()"/>
</xsl:template>
XSL
As you can see, it provides some shortcuts.
Update:
A good point was made that pipes can't be opened and closed repeatedly. The strace output for xsltproc reveals that it only opens the file once.
$ grep /dev/fd !$
grep /dev/fd /tmp/xsltproc.strace
execve("/usr/bin/xsltproc", ["xsltproc", "/dev/fd/62"], [/* 31 vars */]) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
open("/dev/fd/62", O_RDONLY) = 3
write(2, "/dev/fd/62:1: ", 14) = 14
write(2, "/dev/fd/62:1: ", 14) = 14
write(2, "cannot parse /dev/fd/62\n", 24) = 24
Blimey, I overlooked seeking:
read(3, "<?xml version=\"1.0\" encoding=\"UT"..., 16384) = 390
read(3, "", 12288) = 0
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
lseek(3, 18446744073709547520, SEEK_SET) = -1 ESPIPE (Illegal seek)
read(3, "", 4096) = 0
It seems I found a bug in xsltproc. It doesn't recognise FIFO file types and tries to seek on the FIFO file descriptor after reading in the document. I have raised a bug. A work-around is to scan the xsltproc arguments for FIFO file types and convert them into temporary files that xsltproc can seek; a sketch of that idea follows the bug link below.
https://bugzilla.gnome.org/show_bug.cgi?id=730545
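A rough sketch of that work-around, using a hypothetical wrapper function (not part of xsltproc): it copies every FIFO argument into a temporary regular file before calling xsltproc, so the parser gets something it can actually seek.
quickxsltproc() {
    local arg tmp rc args=() tmpfiles=()
    for arg in "$@"; do
        if [[ -p $arg ]]; then           # FIFO such as /dev/fd/63
            tmp=$(mktemp) || return 1
            cat -- "$arg" > "$tmp"       # drain the pipe into a seekable file
            tmpfiles+=("$tmp")
            args+=("$tmp")
        else
            args+=("$arg")
        fi
    done
    xsltproc "${args[@]}"
    rc=$?
    (( ${#tmpfiles[@]} )) && rm -f -- "${tmpfiles[@]}"
    return "$rc"
}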
