I am using process substitution to create a shorthand inline XSL function that I have written...
function _quickxsl() {
    if [[ $1 == "head" ]] ; then
        cat <<'HEAD'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY apos "'">
]>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:exsl="http://exslt.org/common"
xmlns:func="http://exslt.org/functions"
xmlns:kcc="http://www.adp.com/kcc"
extension-element-prefixes="func kcc">
HEAD
    else
        cat <<'FOOT'
</xsl:stylesheet>
FOOT
    fi
}
function quickxsl() {
    {
        _quickxsl head && cat && _quickxsl foot
    } | xsltproc - "$@"
}
It seems to work fine if I provide real files as arguments to xsltproc. In the case where I call it with a process substitution on the other hand:
$ quickxsl <(cat xml/kconf.xml) <<QUICKXSL
QUICKXSL
warning: failed to load external entity "/dev/fd/63"
unable to parse /dev/fd/63
Now, I understand that the pipe path is being handed to a subprocess (xsltproc) that is itself connected to the shell via another pipe. So I rewrote it slightly:
function quickxsl() {
    xsltproc - "$@" < <( _quickxsl head && cat && _quickxsl foot )
}
It seemed to resolve things a little:
/dev/fd/63:1: parser error : Document is empty
^
/dev/fd/63:1: parser error : Start tag expected, '<' not found
^
unable to parse /dev/fd/63
Any idea why the pipe cannot be inherited?
Update:
If I simplify the quickxsl function again:
function quickxsl() {
    xsltproc <( _quickxsl head && cat && _quickxsl foot ) "$@"
}
I get the same issue, but it's easy to identify which FIFO is causing it with a bit of xtrace...
$ quickxsl <(cat xml/kconf.xml) <<QUICKXSL
QUICKXSL
+ quickxsl /dev/fd/63
++ cat xml/kconf.xml
+ xsltproc /dev/fd/62 /dev/fd/63
++ _quickxsl head
++ [[ head == \h\e\a\d ]]
++ cat
++ cat -
++ _quickxsl foot
++ [[ foot == \h\e\a\d ]]
++ cat
/dev/fd/62:1: parser error : Document is empty
^
/dev/fd/62:1: parser error : Start tag expected, '<' not found
^
cannot parse /dev/fd/62
The purpose of this exercise is to have the process substitution pipe connected to a function that returns XML on its standard output, which it does, and that works correctly. If I write the contents to a file and pass that to the function, all is well. If I use process substitution, the child process can't read from the pipe: the pipe appears closed or inaccessible. Example:
quickxsl <(my_soap_service "query") <<XSL
<xsl:template match="/">
<xsl:value-of select="/some/path/text()"/>
</xsl:template>
XSL
As you can see, it provides some shortcuts.
Update:
A good point was raised that a pipe can't be repeatedly opened and closed. Strace output for xsltproc reveals it only opens the file once.
$ grep /dev/fd !$
grep /dev/fd /tmp/xsltproc.strace
execve("/usr/bin/xsltproc", ["xsltproc", "/dev/fd/62"], [/* 31 vars */]) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
stat("/dev/fd/62", {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
open("/dev/fd/62", O_RDONLY) = 3
write(2, "/dev/fd/62:1: ", 14) = 14
write(2, "/dev/fd/62:1: ", 14) = 14
write(2, "cannot parse /dev/fd/62\n", 24) = 24
Blimey, I overlooked seeking:
read(3, "<?xml version=\"1.0\" encoding=\"UT"..., 16384) = 390
read(3, "", 12288) = 0
lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
lseek(3, 18446744073709547520, SEEK_SET) = -1 ESPIPE (Illegal seek)
read(3, "", 4096) = 0
Seems that I found a bug in xsltproc. It doesn't recognise FIFO file types and tries to seek on the FIFO file descriptor after reading in the document. I have raised a bug. A work-around is to scan the xsltproc arguments for FIFO file types and copy them into temporary files that xsltproc can seek on.
https://bugzilla.gnome.org/show_bug.cgi?id=730545
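A minimal sketch of that work-around (my own wrapper, not the patch from the bug report): test each argument with -p, which reports FIFOs such as the /dev/fd/* paths seen in the strace above, and copy those into mktemp files before handing them to xsltproc.
function quickxsl() {
    # Sketch: copy any FIFO argument into a temporary regular file
    # so that xsltproc gets something it can seek on.
    local args=() tmps=() arg tmp status
    for arg in "$@"; do
        if [[ -p $arg ]]; then            # -p: true if the argument is a FIFO
            tmp=$(mktemp) || return 1
            cat "$arg" > "$tmp"
            tmps+=("$tmp")
            args+=("$tmp")
        else
            args+=("$arg")
        fi
    done
    { _quickxsl head && cat && _quickxsl foot; } | xsltproc - "${args[@]}"
    status=$?
    (( ${#tmps[@]} )) && rm -f "${tmps[@]}"
    return $status
}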
Related
I am trying to write a util function in a bash script that can take a multi-line string and append it to the supplied file if the string is not already present in the file.
This works fine using grep if the pattern does not contain \n.
if grep -qF "$1" $2
then
return 1
else
echo "$1" >> $2
fi
Example usage
append 'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt
I am on macOS, by the way, which has presented some problems: some of the solutions I've seen posted elsewhere are very Linux-specific. I'd also like to avoid installing any other tools to achieve this if possible.
Many thanks
If the files are small enough to slurp into a Bash variable (you should be OK up to a megabyte or so on a modern system), and don't contain NUL (ASCII 0) characters, then this should work:
IFS= read -r -d '' contents <"$2"
if [[ "$contents" == *"$1"* ]]; then
return 1
else
printf '%s\n' "$1" >>"$2"
fi
In practice, the speed of Bash's built-in pattern matching might be more of a limitation than the ability to slurp the file contents.
See the accepted, and excellent, answer to Why is printf better than echo? for an explanation of why I replaced echo with printf.
Using awk:
awk '
    BEGIN {
        n = 0   # length of pattern in lines
        m = 0   # number of matching lines
    }
    NR == FNR {
        pat[n++] = $0
        next
    }
    {
        if ($0 == pat[m])
            m++
        else if (m > 0 && $0 == pat[0])
            m = 1
        else
            m = 0
    }
    m == n {
        exit
    }
    END {
        if (m < n) {
            for (i = 0; i < n; i++)
                print pat[i] >>FILENAME
        }
    }
' - "$2" <<EOF
$1
EOF
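A usage sketch, assuming the awk command above is dropped into the asker's append function in place of the grep/echo logic; $'…' quoting is one way to turn \n and \t into real newlines and tabs in the first argument:
append $'sometext\nthat spans\n\tmultiple lines' ~/textfile.txt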
If necessary, one would need to properly escape any metacharacters inside FS/OFS:
jot 7 9 |
{m,g,n}awk 'BEGIN { FS = OFS = "11\n12\n13\n"
_^= RS = (ORS = "") "^$" } _<NF || ++NF'
9
10
11
12
13
14
15
jot 7 -2 | (... awk stuff ...)
-2
-1
0
1
2
3
4
11
12
13
When trying to both pipe the output of one command into another and also have the second command read from a herestring, I do not get the results I expected. For example:
echo "a" | grep -f - <<<abc
I would expect to produce the output
abc
but I get nothing instead
Both the pipe and the herestring are trying to become stdin for the second command, here grep. Using strace on it shows that only the herestring, in this case, is actually available to grep. It then gets an empty "file" to search, and finds no matches. Here is a portion of strace when the search space is a file instead of a herestring:
read(0, "a\n", 4096) = 2
read(0, "", 4096) = 0
<snip>
read(3, "abc\n", 32768) = 4
read(3, "", 32768) = 0
but with the herestring we instead see:
read(0, "abc\n", 4096) = 4
read(0, "", 4096) = 0
<snip>
read(0, "", 32768) = 0
so we never read the values from the pipe that we expected to have been our pattern space.
Using a process substitution does get around this problem, because then either the pattern space or the search space are not coming from file handle 0:
echo "a" | grep -f - <(echo "abc")
for example produces the expected abc output.
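The same trick works the other way around, too: keep the herestring on stdin as the search space and supply the patterns through a process substitution instead, for example:
grep -f <(echo "a") <<<abc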
Easier to show than to describe with words.
find . -name jo\* -print > list
cat list
#./jo1
#./jo2
#./jo3
# run the "file" command, reading the list of files from the file "list"
file -f list
#./jo1: ASCII text
#./jo2: ASCII text
#./jo3: ASCII text
#now with process substitution
file -f <(find . -name jo\* -print)
outputs nothing.. ;(
#repeat with -x
set -x
file -f <(find . -name jo\* -print)
set +x
#shows
+ file -f /dev/fd/63
++ find . -name 'jo*' -print
+ set +x
So, it should work. But it doesn't. Why?
EDIT
Please note: process substitution should work anywhere you would enter a filename, say:
diff <(some command) <(another command)
which bash runs as
diff /dev/fd/... /dev/fd/...
Also, for example with grep, you can use:
grep -f <(command_for_produce_the_patterns) files..
again, bash internally runs this as
grep -f /dev/fd/63 files....
So the same should work with file:
file -f <(command)
You're doing things right. It's a bug in your implementation of file, which I can reproduce on mine (file 5.22 on Debian jessie). It expects the argument to -f to be a seekable file, and doesn't detect the error when the file isn't seekable. That's why it works with a regular file, but not with a pipe (which is what process substitution uses to pass the data between the two processes).
You can observe what's going on with strace:
$ strace file -f <(echo foo)
…
open("/proc/self/fd/13", O_RDONLY) = 3
fstat(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
…
read(3, "foo\n", 4096) = 5
…
read(3, "", 4096) = 0
lseek(3, 0, SEEK_SET) = -1 ESPIPE (Illegal seek)
read(3, "", 4096) = 0
close(3) = 0
The file program opens the list of file names on file descriptor 3 and reads it. Then it tries to seek back to the beginning of the file. This fails, but the program reads from the file again, which yields no data since the file position is already at the end. Thus file ends up with an empty list of file names.
In the source code, the -f option triggers the unwrap function:
private int
unwrap(struct magic_set *ms, const char *fn)
{
    // …
    if (strcmp("-", fn) == 0) {
        f = stdin;
        wid = 1;
    } else {
        if ((f = fopen(fn, "r")) == NULL) {
            (void)fprintf(stderr, "%s: Cannot open `%s' (%s).\n",
                progname, fn, strerror(errno));
            return 1;
        }
        while ((len = getline(&line, &llen, f)) > 0) {
            // … code to determine column widths
        }
        rewind(f);
    }
    // Code to read the file names from f follows
}
If the file name isn't - (instructing to read from standard input) then the code reads the file twice, once to determine the maximum width of the file names and once to process the files. The call to rewind is missing error handling. With - as a file name, the code doesn't try to align the columns.
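Given that code path, a work-around that sidesteps the rewind is to pass - and feed the list on standard input, for example (a sketch; behaviour may differ between file versions):
find . -name jo\* -print | file -f -
# or, keeping the process substitution but attached to stdin:
file -f - < <(find . -name jo\* -print)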
Instead of using the <(cmd) syntax, use $(cmd). This should overcome the bug.
How to split file by percentage of no. of lines?
Let's say I want to split my file into 3 portions (60%/20%/20% parts), I could do this manually, -_- :
$ wc -l brown.txt
57339 brown.txt
$ bc <<< "57339 / 10 * 6"
34398
$ bc <<< "57339 / 10 * 2"
11466
$ bc <<< "34398 + 11466"
45864
$ bc <<< "34398 + 11466 + 11475"
57339
$ head -n 34398 brown.txt > part1.txt
$ sed -n 34399,45864p brown.txt > part2.txt
$ sed -n 45865,57339p brown.txt > part3.txt
$ wc -l part*.txt
34398 part1.txt
11466 part2.txt
11475 part3.txt
57339 total
But I'm sure there's a better way!
There is a utility that takes as arguments the line numbers that should become the first of each respective new file: csplit. This is a wrapper around its POSIX version:
#!/bin/bash
usage () {
    printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
}
# Collect csplit options
while getopts "ksf:n:" opt; do
    case "$opt" in
        k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
        f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
        *) usage; exit 1 ;;
    esac
done
shift $(( OPTIND - 1 ))
fname=$1
shift
ratios=("$@")
len=$(wc -l < "$fname")
# Sum of ratios and array of cumulative ratios
for ratio in "${ratios[@]}"; do
    (( total += ratio ))
    cumsums+=("$total")
done
# Don't need the last element
unset cumsums[-1]
# Array of numbers of first line in each split file
for sum in "${cumsums[@]}"; do
    linenums+=( $(( sum * len / total + 1 )) )
done
csplit "${args[@]}" "$fname" "${linenums[@]}"
After the name of the file to split up, it takes the ratios for the sizes of the split files relative to their sum, i.e.,
percsplit brown.txt 60 20 20
percsplit brown.txt 6 2 2
percsplit brown.txt 3 1 1
are all equivalent.
Usage similar to the case in the question is as follows:
$ percsplit -s -f part -n 1 brown.txt 60 20 20
$ wc -l part*
34403 part0
11468 part1
11468 part2
57339 total
Numbering starts with zero, though, and there is no txt extension. The GNU version supports a --suffix-format option that would allow for .txt extension and which could be added to the accepted arguments, but that would require something more elaborate than getopts to parse them.
This solution plays nice with very short files (split file of two lines into two) and the heavy lifting is done by csplit itself.
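As an aside on the --suffix-format remark: with GNU csplit, a call along these lines (a sketch, using the first-line numbers worked out manually in the question; -b is the short form of --suffix-format) produces part0.txt, part1.txt and part2.txt directly:
csplit -s -f part -b '%d.txt' brown.txt 34399 45865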
$ cat file
a
b
c
d
e
$ cat tst.awk
BEGIN {
    split(pcts,p)
    nrs[1]
    for (i=1; i in p; i++) {
        pct += p[i]
        nrs[int(size * pct / 100) + 1]
    }
}
NR in nrs { close(out); out = "part" ++fileNr ".txt" }
{ print $0 " > " out }
$ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
a > part1.txt
b > part1.txt
c > part1.txt
d > part2.txt
e > part3.txt
Change the " > " to just > to actually write to the output files.
Usage
The following bash script allows you to specify the percentages like
./split.sh brown.txt 60 20 20
You can also use the placeholder ., which fills the remaining percentage up to 100%.
./split.sh brown.txt 60 20 .
The split files are written to
part1-brown.txt
part2-brown.txt
part3-brown.txt
The script always generates as many part files as numbers specified.
If the percentages sum up to 100, cat part* will always generate the original file (no duplicated or missing lines).
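A quick way to check that property (a sketch; it relies on the lexical ordering of part1-, part2-, ..., so it is only safe for up to nine parts):
cat part*-brown.txt | cmp - brown.txt && echo "files match"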
Bash Script: split.sh
#! /bin/bash
file="$1"
fileLength=$(wc -l < "$file")
shift
part=1
percentSum=0
currentLine=1
for percent in "$@"; do
    [ "$percent" == "." ] && ((percent = 100 - percentSum))
    ((percentSum += percent))
    if ((percent < 0 || percentSum > 100)); then
        echo "invalid percentage" 1>&2
        exit 1
    fi
    ((nextLine = fileLength * percentSum / 100))
    if ((nextLine < currentLine)); then
        printf "" # create empty file
    else
        sed -n "$currentLine,$nextLine"p "$file"
    fi > "part$part-$file"
    ((currentLine = nextLine + 1))
    ((part++))
done
BEGIN {
    split(w, weight)
    total = 0
    for (i in weight) {
        weight[i] += total
        total = weight[i]
    }
}
FNR == 1 {
    if (NR != 1) {
        write_partitioned_files(weight, a)
        split("", a, ":")   # empty a portably
    }
    name = FILENAME
}
{ a[FNR] = $0 }
END {
    write_partitioned_files(weight, a)
}
function write_partitioned_files(weight, a) {
    split("", threshold, ":")
    size = length(a)
    for (i in weight) {
        threshold[length(threshold)] = int((size * weight[i] / total) + 0.5) + 1
    }
    l = 1
    part = 0
    for (i in threshold) {
        close(out)
        out = name ".part" ++part
        for (; l < threshold[i]; l++) {
            print a[l] " > " out
        }
    }
}
Invoke as:
awk -v w="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
Replace " > " with > in script to actually write partitioned files.
The variable w expects space separated numbers. Files are partitioned in that proportion. For example "2 1 1 3" will partition files into four with number of lines in proportion of 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.
For large files the array a may consume too much memory. If that is an issue, here is an alternative awk script:
BEGIN {
    split(w, weight)
    for (i in weight) {
        total += weight[i]; weight[i] = total   # cumulative sum
    }
}
FNR == 1 {
    # get number of lines. take care of single quotes in filename.
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    "wc -l '" name "'" | getline size
    split("", threshold, ":")
    for (i in weight) {
        threshold[length(threshold)+1] = int((size * weight[i] / total) + 0.5) + 1
    }
    part = 1; close(out); out = FILENAME ".part" part
}
{
    if (FNR >= threshold[part]) {
        close(out); out = FILENAME ".part" ++part
    }
    print $0 " > " out
}
This passes through each file twice: once to count lines (via wc -l), and again while writing the partitioned files. Invocation and effect are similar to the first method.
I like Benjamin W.'s csplit solution, but it's so long...
#!/bin/bash
# usage ./splitpercs.sh file 60 20 20
n=`wc -l <"$1"` || exit 1
echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
awk -v n=$n 'BEGIN{r=1} {r+=n*$0/100; if(r > 1 && r < n){printf "%d\n",r}}' |
uniq | xargs csplit -sfpart "$1"
(the if(r > 1 && r < n) and uniq bits are to prevent creating empty files or strange behavior for small percentages, files with small numbers of lines, or percentages that add to over 100.)
I just followed your lead and made what you do manually into a script. It may not be the fastest or "best", but if you understand what you are doing now and can just "scriptify" it, you may be better off should you need to maintain it.
#!/bin/bash
# thisScript.sh yourfile.txt 20 50 10 20
YOURFILE=$1
shift
# changed to cat | wc so I don't have to remove the filename which comes from
# wc -l
LINES=$(cat $YOURFILE | wc -l )
startpct=0;
PART=1;
for pct in "$@"
do
    # I am assuming that each parameter is on top of the last
    # so 10 30 10 would become 10, 10+30 = 40, 10+30+10 = 50, ...
    endpct=$( echo "$startpct + $pct" | bc)
    # your math but changed parts of 100 instead of parts of 10.
    # change bc <<< to echo "..." | bc
    # so that one can capture the output into a bash variable.
    FIRSTLINE=$( echo "$LINES * $startpct / 100 + 1" | bc )
    LASTLINE=$( echo "$LINES * $endpct / 100" | bc )
    # use sed every time because the special case for head
    # doesn't really help performance.
    sed -n $FIRSTLINE,${LASTLINE}p $YOURFILE > part${PART}.txt
    ((PART++))
    startpct=$endpct
done
# get the rest if the % don't add to 100%
if [[ $( echo "$endpct < 100" | bc ) -gt 0 ]] ; then
    sed -n "$(( LASTLINE + 1 )),${LINES}p" $YOURFILE > part${PART}.txt
fi
wc -l part*.txt
I'm currently working on a maths project and have just run into a bit of a brick wall with programming in bash.
Currently I have a directory containing 800 text files, and what I want to do is run a loop to cat the first 80 files (_01 through to _80) into a new file and save it elsewhere, then the next 80 files (_81 to _160), and so on.
All the files in the directory are named like so: ath_01, ath_02, ath_03 etc.
Can anyone help?
So far I have:
#!/bin/bash
for file in /dir/*
do
echo ${file}
done
Which just simply lists my files. I know I need to use cat file1 file2 > newfile.txt somehow, but the numbered suffixes _01, _02 etc. are confusing me.
Would it help if I changed the file names to use something other than an underscore, like ath.01 etc.?
Cheers,
Since you know ahead of time how many files you have and how they are numbered, it may be easier to "unroll the loop", so to speak, and use copy-and-paste and a little hand-tweaking to write a script that uses brace expansion.
#!/bin/bash
cat ath_{001..080} > file1.txt
cat ath_{081..160} > file2.txt
cat ath_{161..240} > file3.txt
cat ath_{241..320} > file4.txt
cat ath_{321..400} > file5.txt
cat ath_{401..480} > file6.txt
cat ath_{481..560} > file7.txt
cat ath_{561..640} > file8.txt
cat ath_{641..720} > file9.txt
cat ath_{721..800} > file10.txt
Or else, use nested for-loops and the seq command:
N=800
B=80
for n in $( seq 1 $B $N ); do
    for i in $( seq $n $((n+B - 1)) ); do
        cat ath_$i
    done > file$((n/B + 1)).txt
done
The outer loop will iterate n through 1, 81, 161, etc. The inner loop will iterate i over 1 through 80, then 81 through 160, etc. The body of the inner loop just dumps the contents of the ith file to standard output, and the aggregated output of the inner loop is stored in file1.txt, then file2.txt, etc.
You could try something like this:
cat "$file" >> "concat_$(( ${file#/dir/ath_} / 80 ))"
with ${file#/dir/ath_} you remove the prefix /dir/ath_ from the filename
with $(( ${file#/dir/ath_} / 80 )) you get that suffix divided by 80 (integer division)
Also change the loop to
for file in /dir/ath_*
so that you only get the files you need.
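Putting those pieces together, a sketch of the whole loop might look like this (the 10# base prefix is my addition to stop leading zeros such as 08 being read as octal, and the - 1 groups files 1-80, 81-160, ... into the same output file):
for file in /dir/ath_*; do
    n=${file#/dir/ath_}
    cat "$file" >> "concat_$(( (10#$n - 1) / 80 ))"
done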
If you want groups of 80 files, you'd do best to ensure that the names are sortable; that's why leading zeroes were often used. Assuming that you only have one underscore in the file names, and no newlines in the names, then:
SOURCE="/path/to/dir"
TARGET="/path/to/other/directory"
(
cd $SOURCE || exit 1
ls |
sort -t _ -k2,2n |
awk -v target="$TARGET" \
    '{  file[n++] = $1
        if (n >= 80)
        {
            printf "cat"
            for (i = 0; i < 80; i++)
                printf(" %s", file[i])
            printf(" >%s/%s.%.2d\n", target, "newfile", ++number)
            n = 0
        }
    }
    END {
        if (n > 0)
        {
            printf "cat"
            for (i = 0; i < n; i++)
                printf(" %s", file[i])
            printf(" >%s/%s.%.2d\n", target, "newfile", ++number)
        }
    }' |
sh -x
)
The two directories are specified (where the files are and where the summaries should go); the command changes directory to the source directory (where the 800 files are). It lists the names (you could specify a glob pattern if you needed to) and sorts them numerically. The output is fed into awk which generates a shell script on the fly. It collects 80 names at a time, then generates a cat command that will copy those files to a single destination file such as "newfile.01"; tweak the printf() command to suit your own naming/numbering conventions. The shell commands are then passed to a shell for execution.
While testing, replace the sh -x with nothing, or sh -vn or something similar. Only add an active shell when you're sure it will do what you want. Remember, the generated shell commands run in the source directory.
Superficially, the xargs command would be nice to use; the difficulty is coordinating the output file number. There might be a way to do that with the -n 80 option to group 80 files at a time and some fancy way to generate the invocation number, but I'm not aware of it.
Another option is to use xargs -n to execute a shell script that can deduce the correct output file number by listing what's already in the target directory. This would be cleaner in many ways:
SOURCE="/path/to/dir"
TARGET="/path/to/other/directory"
(
cd $SOURCE || exit 1
ls |
sort -t _ -k2,2n |
xargs -n 80 cpfiles "$TARGET"
)
Where cpfiles looks like:
TARGET="$1"
shift
if [ $# -gt 0 ]
then
old=$(ls -r newfile.?? | sed -n -e 's/newfile\.//p; 1q')
new=$(printf "%.2d" $((old + 1)))
cat "$@" > "$TARGET/newfile.$new"
fi
The test for zero arguments avoids trouble with xargs executing the command once with zero arguments. On the whole, I prefer this solution to the one using awk.
Here's a macro for @chepner's first solution, using GNU Make as the templating language:
SHELL := /bin/bash
N = 800
B = 80
fileNums = $(shell seq 1 $$((${N}/${B})) )
files = ${fileNums:%=file%.txt}
all: ${files}
file%.txt : start = $(shell echo $$(( ($*-1)*${B}+1 )) )
file%.txt : end = $(shell echo $$(( $* * ${B} )) )
file%.txt:
	cat ath_{${start}..${end}} > $@
To use:
$ make -n all
cat ath_{1..80} > file1.txt
cat ath_{81..160} > file2.txt
cat ath_{161..240} > file3.txt
cat ath_{241..320} > file4.txt
cat ath_{321..400} > file5.txt
cat ath_{401..480} > file6.txt
cat ath_{481..560} > file7.txt
cat ath_{561..640} > file8.txt
cat ath_{641..720} > file9.txt
cat ath_{721..800} > file10.txt