How can I diff two (or more) files, displaying the file name at the beginning of each line?
That is, instead of this:
--- file1.c
+++ file2.c
## -1 +1 ##
-int main() {
+int main(void) {
I would prefer something like this:
file1.c:- int main() {
file2.c:+ int main(void) {
This is not so useful when there are only two files, but extremely handy when using --from-file/--to-file.
I could not find a more concise solution, so I wrote my own script to do it, using several calls to diff to add a different prefix each time.
#!/bin/bash
# the first argument is the original file that others are compared with
orig=$1
len1=${#1}
shift
# we compute the length of the filenames to ensure they are aligned
for arg in "$#"
do
len2=${#arg}
maxlen=$((len1 > len2 ? len1 : len2))
prefix1=$(printf "%-${maxlen}s" "$orig")
prefix2=$(printf "%-${maxlen}s" "$arg")
diff --old-line-format="$prefix1:-%L" \
--new-line-format="$prefix2:+%L" \
--unchanged-line-format="" $orig $arg
echo "---" # not necessary, but helps visual separation
done
Improved above script by adding more flavors like group-formats and also printed file names as top header, which is the basic need, especially when we run diff on several sub-directories recursively. Also added diff -rwbit to ignore white spaces etc. Please remove this option -wbit if you don't need. Keep -r which is harmless.
I use this quite a lot with git as follows: git difftool -v -y -x mydiff
+ cat mydiff
#!/usr/bin/bash
# the first argument is the original file that others are compared with
orig=$1
len1=${#1}
shift
# we compute the length of the filenames to ensure they are aligned
for arg in "$#"
do
len2=${#arg}
maxlen=$((len1 > len2 ? len1 : len2))
prefix1=$(printf "%-${maxlen}s" "$orig")
prefix2=$(printf "%-${maxlen}s" "$arg")
echo -e "\nmydiff $orig $arg =========================\n"
diff -rwbit \
--old-line-format="$prefix1:-%L" \
--new-line-format="$prefix2:+%L" \
--old-group-format='%df%(f=l?:,%dl)d%dE
%<' \
--new-group-format='%dea%dF%(F=L?:,%dL)
%>' \
--changed-group-format='%df%(f=l?:,%dl)c%dF%(F=L?:,%dL)
%<---
%>' \
--unchanged-line-format="" $orig $arg
echo "---" # not necessary, but helps visual separation
done
+ cat -n test1
1 1st line in test1
2 2nd line in test1
3 3rd line in test1
4 4th line in test1
5 6th line in test1
+ cat -n test2
1 1st line in test1
2 2nd line in test2 changed
3 3rd line added in test2
4 4th line in test1
+ mydiff test1 test2
mydiff test1 test2 =========================
2,3c2,3
test1:-2nd line in test1
test1:-3rd line in test1
---
test2:+2nd line in test2 changed
test2:+3rd line added in test2
5d4
test1:-6th line in test1
---
Related
I am currently working on a language that aims to compile to POSIX shell languages and I want to introduce a pop feature. Just like how you can use "shift" to remove the first argument passed to a function:
f() {
shift
printf '%s' "$*"
}
f 1 2 3 #=> 2 3
I want some code that when introduced below can remove the last argument.
g() {
# pop
printf '%s' "$*"
}
g 1 2 3 #=> 1 2
I am aware of the array method as detailed in (Remove last argument from argument list of shell script (bash)), but I want something portable that will work in at least the following shells: ash, dash, ksh (Unix), bash, and zsh. I also want something reasonably speedy; something that opens external processes/subshells would be too heavy for small argument counts, thought if you have a creative solution I wouldn't mind seeing it regardless (and they can still be used as a fallback for large argument counts). Something as fast as those array methods would be ideal.
This is my current answer:
pop() {
local n=$(($1 - ${2:-1}))
if [ -n "$ZSH_VERSION" -o -n "$BASH_VERSION" ]; then
POP_EXPR='set -- "${#:1:'$n'}"'
elif [ $n -ge 500 ]; then
POP_EXPR="set -- $(seq -s " " 1 $n | sed 's/[0-9]\+/"${\0}"/g')"
else
local index=0
local arguments=""
while [ $index -lt $n ]; do
index=$((index+1))
arguments="$arguments \"\${$index}\""
done
POP_EXPR="set -- $arguments"
fi
}
Note that local is not POSIX, but since all major sh shells support it (and specifically the ones I asked for in my question) and not having it can cause serious bugs, I decided to include it in this leading function. But here's a fully compliant POSIX version with obfuscated arguments to reduce the chance of bugs:
pop() {
__pop_n=$(($1 - ${2:-1}))
if [ -n "$ZSH_VERSION" -o -n "$BASH_VERSION" ]; then
POP_EXPR='set -- "${#:1:'$__pop_n'}"'
elif [ $__pop_n -ge 500 ]; then
POP_EXPR="set -- $(seq -s " " 1 $__pop_n | sed 's/[0-9]\+/"${\0}"/g')"
else
__pop_index=0
__pop_arguments=""
while [ $__pop_index -lt $__pop_n ]; do
__pop_index=$((__pop_index+1))
__pop_arguments="$__pop_arguments \"\${$__pop_index}\""
done
POP_EXPR="set -- $__pop_arguments"
fi
}
Usage
pop1() {
pop $#
eval "$POP_EXPR"
echo "$#"
}
pop2() {
pop $# 2
eval "$POP_EXPR"
echo "$#"
}
pop1 a b c #=> a b
pop1 $(seq 1 1000) #=> 1 .. 999
pop2 $(seq 1 1000) #=> 1 .. 998
pop_next
Once you've created the POP_EXPR variable with pop, you can use the following
function to change it to omit further arguments:
pop_next() {
if [ -n "$BASH_VERSION" -o -n "$ZSH_VERSION" ]; then
local np="${POP_EXPR##*:}"
np="${np%\}*}"
POP_EXPR="${POP_EXPR%:*}:$((np == 0 ? 0 : np - 1))}\""
return
fi
POP_EXPR="${POP_EXPR% \"*}"
}
pop_next is a much simpler operation than pop in posix shells (though it's
slightly more complex than pop on zsh and bash)
It's used like this:
main() {
pop $#
pop_next
eval "$POP_EXPR"
}
main 1 2 3 #=> 1
POP_EXPR and variable scope
Note that if you're not going to be using eval "$POP_EXPR" immediately after
pop and pop_next, if you're not careful with scoping some function call
inbetween the operations could change the POP_EXPR variable and mess things
up. To avoid this, simply put local POP_EXPR at the start of every function
that uses pop, if it's available.
f() {
local POP_EXPR
pop $#
g 1 2
eval "$POP_EXPR"
printf '%s' "f=$*"
}
g() {
local POP_EXPR
pop $#
eval "$POP_EXPR"
printf '%s, ' "g=$*"
}
f a b c #=> g=1, f=a b
popgen.sh
This particular function is good enough for my purposes, but I did create a
script to generate further optimized functions.
https://gist.github.com/fcard/e26c5a1f7c8b0674c17c7554fb0cd35c#file-popgen-sh
One of the ways to improve performance without using external tools here is
to realize that having several small string concatenations is slow, so doing
them in batches makes the function considerably faster. calling the script
popgen.sh -gN1,N2,N3 creates a pop function that handles the operations
in batches of N1, N2, or N3 depending on the argument count. The script also
contains other tricks, exemplified and explained below:
$ sh popgen \
> -g 10,100 \ # concatenate strings in batches\
> -w \ # overwrite current file\
> -x9 \ # hardcode the result of the first 9 argument counts\
> -t1000 \ # starting at argument count 1000, use external tools\
> -p posix \ # prefix to add to the function name (with a underscore)\
> -s '' \ # suffix to add to the function name (with a underscore)\
> -c \ # use the command popsh instead of seq/sed as the external tool\
> -# \ # on zsh and bash, use the subarray method (checks on runtime)\
> -+ \ # use bash/zsh extensions (removes runtime check from -#)\
> -nl \ # don't use 'local'\
> -f \ # use 'function' syntax\
> -o pop.sh # output file
An equivalent to the above function can be generated with popgen.sh -t500 -g1 -#.
In the gist containing popgen.sh you will find a popsh.c file that can be
compiled and used as a specialized, faster alternative to the default shell
external tools, it will be used by any function generated with
popgen.sh -c ... if it's accessible as popsh by the shell.
Alternatively, you can create any function or tool named popsh and use
it in its place.
Benchmark
Benchmark functions:
The script I used for benchmarking can be found on this gist:
https://gist.github.com/fcard/f4aec7e567da2a8e97962d5d3f025ad4#file-popbench-sh
The benchmark functions are found in these lines:
https://gist.github.com/fcard/f4aec7e567da2a8e97962d5d3f025ad4#file-popbench-sh-L233-L301
The script can be used as such:
$ sh popbench.sh \
> -s dash \ # shell used by the benchmark, can be dash/bash/ash/zsh/ksh.\
> -f posix \ # function to be tested\
> -i 10000 \ # number of times that the function will be called per test\
> -a '\0' \ # replacement pattern to model arguments by index (uses sed)\
> -o /dev/stdout \ # where to print the results to (concatenates, defaults to stdout)\
> -n 5,10,1000 # argument sizes to test
It will output a time -p style sheet with a real, user and sys time values,
as well as an int value, for internal, that is calculated inside the benchmark
process using date.
Times
The following are the int results of calls to
$ sh popbench.sh -s $shell -f $function -i 10000 -n 1,5,10,100,1000,10000
posix refers to the second and third clauses, subarray refers to the first,
while final refers to the whole.
value count 1 5 10 100 1000 10000
---------------------------------------------------------------------------------------
dash/final 0m0.109s 0m0.183s 0m0.275s 0m2.270s 0m16.122s 1m10.239s
ash/final 0m0.104s 0m0.175s 0m0.273s 0m2.337s 0m15.428s 1m11.673s
ksh/final 0m0.409s 0m0.557s 0m0.737s 0m3.558s 0m19.200s 1m40.264s
bash/final 0m0.343s 0m0.414s 0m0.470s 0m1.719s 0m17.508s 3m12.496s
---------------------------------------------------------------------------------------
bash/subarray 0m0.135s 0m0.179s 0m0.224s 0m1.357s 0m18.911s 3m18.007s
dash/posix 0m0.171s 0m0.290s 0m0.447s 0m3.610s 0m17.376s 1m8.852s
ash/posix 0m0.109s 0m0.192s 0m0.285s 0m2.457s 0m14.942s 1m10.062s
ksh/posix 0m0.416s 0m0.581s 0m0.768s 0m4.677s 0m18.790s 1m40.407s
bash/posix 0m0.409s 0m0.739s 0m1.145s 0m10.048s 0m58.449s 40m33.024s
On zsh
For large argument counts setting set -- ... with eval is very slow on zsh no
matter no matter the method, save for eval 'set -- "${#:1:$# - 1}"'. Even as
simple a modification as changing it to eval "set -- ${#:1:$# - 1}"
(ignoring that it doesn't work for arguments with spaces) makes it two orders
of magnitude slower.
value count 1 5 10 100 1000 10000
---------------------------------------------------------------------------------------
zsh/subarray 0m0.203s 0m0.227s 0m0.233s 0m0.461s 0m3.643s 0m38.396s
zsh/final 0m0.399s 0m0.416s 0m0.441s 0m0.722s 0m4.205s 0m37.217s
zsh/posix 0m0.718s 0m0.913s 0m1.182s 0m6.200s 0m46.516s 42m27.224s
zsh/eval-zsh 0m0.419s 0m0.353s 0m0.375s 0m0.853s 0m5.771s 32m59.576s
More benchmarks
For more benchmarks, including only using external tools, the c popsh tool or the naive algorithm, see this file:
https://gist.github.com/fcard/f4aec7e567da2a8e97962d5d3f025ad4#file-benchmarks-md
It's generated like this:
$ git clone https://gist.github.com/f4aec7e567da2a8e97962d5d3f025ad4.git popbench
$ cd popbench
$ sh popgen_run.sh
$ sh popbench_run.sh --fast # or without --fast if you have a day to spare
$ sh poptable.sh -g >benchmarks.md
Conclusion
This has been the result of a week-long research on the subject, and I thought
I'd share it. Hopefully it's not too long, I tried to trim it to the main
information with links to the gist. This was initially made as an answer to
(Remove last argument from argument list of shell script (bash)) but I felt the focus on POSIX
made it off topic.
All the code in the gists linked here is licensed under the MIT license.
alias pop='set -- $(eval printf '\''%s\\n'\'' $(seq $(expr $# - 1) | sed '\''s/^/\$/;H;$!d;x;s/\n/ /g'\'') )'
EDIT:
this is a POSIX shell solution that use aliases instead of functions; if called in a function, this gives the desired effect (it resets the function arguments by using the same number of arguments minus the last; being an alias, and with eval, it can change the values of the enclosing function):
func () {
pop
pop
echo "$#"
}
func a b c d e # prints a b c
pop () {
i=0
while [ $((i+=1)) -lt $# ]; do
set -- "$#" "$1"
shift
done # 1 2 3 -> 3 1 2
printf '%s' "$1" # last argument
shift # $# is now without last argument
}
I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the worlds simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is to stitch file2 to the previous file3. Look at this screenshot of the directory for better understanding.
See how these files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code, in fact it would be probably best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about and needs further explanation please let me know.
"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion w/substing removal say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
| d1 | | d2 |
then you simply subtract 1 from d1 into say date0 or d0 and then construct a previous filename out of d0 and d1 using _snip instead of _end. Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file to the previous _snip file. e.g.
#!/bin/bash
for i in *end; do ## find all _end files
d1="${i#*stuff_}" ## isolate first date in filename
d1="${d1%%T*}"
d2="${i%T*}" ## isolate second date
d2="${d2##*_}"
d0=$((d1 - 1)) ## subtract 1 from first, get snip d1
prev="${i/$d1/$d0}" ## create previous 'snip' filename
prev="${prev/$d2/$d1}"
prev="${prev%end}snip"
if [ -f "$prev" ] ## test that prev snip file exists
then
printf "paste to : %s\n" "$prev"
printf " from : %s\n\n" "$i"
fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
You could store the previous $file3 value in a variable (and do a check if it is not the first run with -z check):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in $destination*.ascii
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
if [ -z "$prev" ]; then
cat $prev $file2 > outfile
fi
prev=$file3
done
I have a file and contents are like :
|T1234
010000000000
02123456878
05122345600000000000000
07445678920000000000000
09000000000123000000000
10000000000000000000000
.T1234
|T798
013457829
0298365799
05600002222222222222222
09348977722220000000000
10000057000004578933333
.T798
Here one complete batch means it will start from |T and end with .T.
In the file i have 2 batches.
I want to edit this file to delete a batch for record 10(position1-2),if from position 3 till position 20 is 0 then delete the batch.
Please let me know how i can achieve this by writing a shell script or syncsort or sed or awk .
I am still a little unclear about exactly what you want, but I think I have it enough to give you an outline on a bash solution. The part I was unclear on is exactly which line contained the first two characters of 10 and remaining 0's, but it looks like that is the last line in each batch. Not knowing exactly how you wanted the batch (with the matching 10) handled, I have simply written the remaining wanted batch(es) out to a file called newbatch.txt in the current working directory.
The basic outline of the script is to read each batch into a temporary array. If during the read, the 10 and 0's match is found, it sets a flag to delete the batch. After the last line is read, it checks the flag, if set simply outputs the batch number to delete. If the flag is not set, then it writes the batch to ./newbatch.txt.
Let me know if your requirements are different, but this should be fairly close to a solution. The code is fairly well commented. If you have questions, just drop a comment.
#!/bin/bash
ifn=${1:-dat/batch.txt} # input filename
ofn=./newbatch.txt # output filename
:>"$ofn" # truncate output filename
declare -i bln=0 # batch line number
declare -i delb=0 # delete batch flag
declare -a ba # temporary batch array
[ -r "$ifn" ] || { # test input file readable
printf "error: file not readable. usage: %s filename\n" "${0//*\//}"
exit 1
}
## read each line in input file
while read -r line || test -n "$line"; do
printf " %d %s\n" $bln "$line"
ba+=( "$line" ) # add line to array
## if chars 1-2 == 10 and chars 3 on == 00...
if [ ${line:0:2} == 10 -a ${line:3} == 00000000000000000000 ]; then
delb=1 # set delete flag
fi
((bln++)) # increment line number
## if the line starts with '.'
if [ ${line:0:1} == '.' ]; then
## if the delete batch flag is set
if [ $delb -eq 1 ]; then
## do nothing (but show batch no. to delete)
printf " => deleting batch : %s\n" "${ba[0]}"
## if delb not set, then write the batch to output file
else
printf "%s\n" ${ba[#]} >> "$ofn"
fi
## reset line no., flags, and uset array.
bln=0
delb=0
unset ba
fi
done <"$ifn"
exit 0
Output (to stdout)
$ bash batchdel.sh
0 |T1234
1 010000000000
2 02123456878
3 05122345600000000000000
4 07445678920000000000000
5 09000000000123000000000
6 10000000000000000000000
7 .T1234
=> deleting batch : |T1234
0 |T798
1 013457829
2 0298365799
3 05600002222222222222222
4 09348977722220000000000
5 10000057000004578933333
6 .T798
Output (to newbatch.txt)
$ cat newbatch.txt
|T798
013457829
0298365799
05600002222222222222222
09348977722220000000000
10000057000004578933333
.T798
I'm looking for a way to merge 4 lines of dna probing results into one line.
The problem here is:
I don't want to append the lines. But associating them
The 4 lines of dna probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seem to be a kind of general problem, so the language choosed to solve this don't matter.
For fun: one bash way:
lines=(
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
)
result=""
for ((i=0;i<${#lines};i++)) ;do
chr=- c=()
for ((l=0;l<${#lines[#]};l++)) ;do
[ "${lines[l]:i:1}" != "-" ] &&
chr="${lines[l]:i:1}" &&
c+=($l)
done
[ ${#c[#]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
[ ${#c[#]} -gt 1 ] && c="${c[*]}" && chr="*" &&
printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
result+=$chr
done
echo $result
With provided input, there is no conflict and all characters is replaced. So the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: Question stand for 4 different files, so lines= syntax could be:
lines=($(cat file1 file2 file3 file4))
But with a wrong input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
Small perl filter
But if input are not to be verified, this little perl filter could do the job:
(Thanks #jm666 for }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n process all lines without output
-l whipe leading cariage return at end of lines
y+lhs+rhs+ replace (translate) chars from 'lhs' to 'rhs'
\0 is the *null* character, binary 0.
$, is a variable
|= binary or, between himself and current line ($_)
}{ at END, once all lines processed
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") read the 1st line into the variable line1
the seq ${#line1} generate a sequence of numbers 1..char_count_in_the_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" will run for each above number the command cut like cut -c22 filename - what extract the given column from the file, e.g. you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - will join the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally the tr -cd '[A-Z\n]' remove everything what isn't uppercase character or newline, so will get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC