I'm looking for a way to merge 4 lines of DNA probing results into one line.
The problem here is:
I don't want to append the lines, but to combine them position by position.
The 4 lines of DNA probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seems to be a fairly general problem, so the language chosen to solve it doesn't matter.
For fun: one bash way:
lines=(
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
)
result=""
for ((i=0;i<${#lines[0]};i++)) ;do
chr=- c=()
for ((l=0;l<${#lines[@]};l++)) ;do
[ "${lines[l]:i:1}" != "-" ] &&
chr="${lines[l]:i:1}" &&
c+=($l)
done
[ ${#c[@]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
[ ${#c[@]} -gt 1 ] && c="${c[*]}" && chr="*" &&
printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
result+=$chr
done
echo $result
With the provided input, there is no conflict and all characters are replaced, so the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: the question is actually about 4 different files, so the lines= assignment could be:
lines=($(cat file1 file2 file3 file4))
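If word splitting or globbing of the file contents were a concern, a safer way to fill the array would be bash's mapfile (a sketch; with this input both forms give the same result):
mapfile -t lines < <(cat file1 file2 file3 file4)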
But with faulty input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
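For comparison, the same merge can be sketched as an awk one-liner (probes.txt is a hypothetical file holding the four lines; this sketch assumes equal-length lines and, unlike the loop above, does not report conflicts):
awk '{
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c != "-") out[i] = c        # last non-dash seen at this position wins
    }
    if (length($0) > w) w = length($0)
}
END {
    for (i = 1; i <= w; i++) printf "%s", ((i in out) ? out[i] : "-")
    print ""
}' probes.txt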
Small perl filter
But if the input does not need to be verified, this little perl filter can do the job:
(Thanks @jm666 for the }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n process all lines, without printing them
-l strip the trailing newline from each input line
y+lhs+rhs+ replace (transliterate) chars from 'lhs' to 'rhs'
\0 is the null character, binary 0
$, is normally the output field separator; here it is simply used as an accumulator variable
|= bitwise OR between the accumulator and the current line ($_) - demonstrated below
}{ runs the final say once, after all lines are processed (an END-block trick)
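To see the trick in isolation, here is a minimal demo (assuming perl 5.14+ for the tr///r flag): the dashes become NUL bytes, and the bitwise string OR keeps the non-zero byte at each position.
perl -le 'print "A--A" =~ tr/-/\0/r | "-CC-" =~ tr/-/\0/r'
ACCA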
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") reads the 1st line into the variable line1
the seq ${#line1} generates a sequence of numbers 1..char_count_in_the_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" runs the cut command for each of the above numbers, e.g. cut -c22 filename, which extracts the given column from the file, so you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - will join the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally the tr -cd '[A-Z\n]' removes everything that isn't an uppercase letter or a newline, so you get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Related
Let's suppose that you need to generate a NUL-delimited stream of timestamped filenames.
On Linux & Solaris I can do it with:
stat --printf '%.9Y %n\0' -- *
On BSD, I can get the same info, but delimited by newlines, with:
stat -f '%.9Fm %N' -- *
The man page mentions a few format sequences, but the NUL byte doesn't seem to be supported:
If the % is immediately followed by one of n, t, %, or @, then a newline character, a tab character, a percent character, or the current file number is printed.
Is there a way to work around that? edit: (accurately and efficiently?)
Update:
Sorry, the glob * is misleading. The arguments can contain any path.
I have a working solution that forks a stat call for each path. I want to improve it because of the massive number of files to process.
You may try this work-around if running the stat command on files:
stat -nf "%.9Fm %N/" * | tr / '\0'
Here:
-n: To suppress newlines in stat output
Added / as terminator for each entry from stat output
tr / '\0': To convert / into NUL byte
Another work-around is to use a control character in stat and use tr to replace it with \0 like this:
stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0'
This will work with directories also.
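Either way, the resulting NUL-delimited stream can be consumed safely from bash, for example (a sketch):
while IFS= read -r -d '' entry; do
    printf 'got: %s\n' "$entry"
done < <(stat -nf "%.9Fm %N"$'\1' * | tr '\1' '\0')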
Unfortunately, stat out of the box does not offer this option, and so what you ask is not directly achievable.
However, you can easily implement the required functionality in a scripting language like Perl or Python.
#!/usr/bin/env python3
from pathlib import Path
from sys import argv
for arg in argv[1:]:
    print(
        Path(arg).stat().st_mtime,
        arg, end="\0")
Demo: https://ideone.com/vXiSPY
The demo exhibits a small discrepancy in the mtime which does not seem to be a rounding error, but the result could be different on MacOS (the demo platform is Debian Linux, apparently). If you want to force the result to a particular number of decimal places, Python has formatting facilities similar to those of stat and printf.
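For instance, forcing nine decimal places could look like this (a sketch of the idea as a command-line variant, not the original demo code):
python3 -c '
from pathlib import Path
from sys import argv
for arg in argv[1:]:
    print(f"{Path(arg).stat().st_mtime:.9f}", arg, end="\0")
' *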
With any command that can't produce NUL-terminated (or any other character/string terminated) output, you can just wrap it in a function to call the command and then printf its output with a terminating NUL instead of a newline, for example:
nulstat() {
local fmt=$1 file
shift
for file in "$#"; do
printf '%s\0' "$(stat -f "$fmt" "$file")"
done
}
nulstat '%.9Fm %N' *
For example:
$ > foo
$ > $'foo\nbar'
$ nulstat '%.9Fm %N' foo* | od -c
0000000 1 6 6 3 1 6 2 5 3 6 . 4 7 7 9 8
0000020 0 1 4 0 f o o \0 1 6 6 3 1 6 2
0000040 5 3 9 . 3 8 8 0 6 9 9 3 0 f o
0000060 o \n b a r \0
0000066
1. What you can do (accurate but slow):
Fork a stat command for each input path:
for p in "$#"
do
stat -nf '%.9Fm' -- "$p" &&
printf '\t%s\0' "$p"
done
2. What you can do (accurate but twisted):
In the input paths, replace each occurrence of (possibly overlapping) /././ with a single /./, make stat output /././\n at the end of each record, and use awk to substitute each /././\n by a NUL byte:
#!/bin/bash
shopt -s extglob
stat -nf '%.9Fm%t%N/././%n' -- "${@//\/.\/+(.\/)//./}" |
awk -F '/\\./\\./' '{
if ( NF == 2 ) {
printf "%s%c", record $1, 0
record = ""
} else
record = record $1 "\n"
}'
N.B. If you wonder why I chose /././\n as record separator then take a look at Is it "safe" to replace each occurrence of (possibly overlapped) /./ with / in a path?
3. What you should do (accurate & fast):
You can use the following perl one‑liner on almost every UNIX/Linux:
LANG=C perl -MTime::HiRes=stat -e '
foreach (@ARGV) {
my @st = stat($_);
if ( @st > 0 ) {
printf "%.9f\t%s\0", $st[9], $_;
} else {
printf STDERR "stat: %s: %s\n", $_, $!;
}
}
' -- "$#"
note: for perl < 5.8.9, remove the -MTime::HiRes=stat from the command line.
ASIDE: There's a bug in BSD's stat:
When %N is at the end of the format string and the filename ends with a newline character, then its trailing newline might get stripped:
For example:
stat -f '%N' -- $'file1\n' file2
file1
file2
For getting the output that one would expect from stat -f '%N' you can use the -n switch and add an explicit %n at the end of the format string:
stat -nf '%N%n' -- $'file1\n' file2
file1
file2
Is there a way to work around that?
If all you need is to replace all newlines with NULs, then the following tr should suffice:
stat -f '%.9Fm %N' * | tr '\n' '\000'
Explanation: \000 is NUL expressed as an octal value.
I am stuck on this. I have a while-read loop in my code that takes very long, and I would like to run it on several processors. I'd like to split the input file and run 14 loops (because I have 14 threads), one for each split file, in parallel. The thing is that I don't know how to tell each while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried:
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But with neither was I able to tell which file should be the input to the loop, so that parallel would understand "take each of those xa* files and feed each one to an individual loop in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequence alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800 Kb) with DNA sequence accessions, reads it line by line, and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums/divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, take sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has all the accessions of the sequences it matched. What the loop does is find the taxonomies for all those sequences and calculate the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about ~12h running on one processor to do it for all 45000 DNA sequence accessions. So, I would like to split input_file and run it on all the processors I have (14) because it would reduce the time spent on that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, and run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
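For illustration, the accession-to-taxonomy lookup alone can be done in a single pass instead of one grep per accession; a rough sketch of that idea (assuming taxonomy.file.tsv is tab-separated with the accession in column 1 and the taxonomy in column 2; this is only the lookup part, not a drop-in replacement for doit):
awk -F '\t' '
FNR == NR { tax[$1] = $2; next }       # first file: build accession -> taxonomy lookup
          { print $1 "\t" tax[$1] }    # second file: one hash lookup per accession
' taxonomy.file.tsv input_file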
As an alternative I threw together a quick test.
#!/usr/bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
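For example, the same hash-lookup idea written in Perl might look like this (a sketch using the same files a and b as above; @ARGV is non-empty only while the first file is still being read, and the output is not byte-for-byte identical to the awk version):
time perl -ne 'if (@ARGV) { $h{$_} = 1 } else { print if $h{$_} }' b a > /dev/null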
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
wait          # let the background chunks finish before cleaning up
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel; instead this uses native bash to spawn background processes with &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait
Consider a plain text file containing page-breaking ASCII control character "Form Feed" ($'\f'):
alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f
Note that each page has a random number of lines.
I need a bash routine that returns the page number of a given line number in a text file containing the page-breaking ASCII control character.
After a long time researching the solution I finally came across this piece of code:
function get_page_from_line
{
local nline="$1"
local input_file="$2"
local npag=0
local ln=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( ++npag ))
ln=$(echo -n "$page" | wc -l)
total=$(( total + ln ))
if [ $total -ge $nline ]; then
echo "${npag}"
return
fi
done < "$input_file"
echo "0"
return
}
But, unfortunately, this solution proved to be very slow in some cases.
Any better solution ?
Thanks!
The idea to use read -d $'\f' and then to count the lines is good.
This version might appear inelegant: if nline is greater than or equal to the number of lines in the file, then the file is read twice.
Give it a try, because it is super fast:
function get_page_from_line ()
{
local nline="${1}"
local input_file="${2}"
if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
printf "0\n"
else
printf "%d\n" $(( $(head -n ${nline} "${input_file}" | grep -c "^"$'\f') + 1 ))
fi
}
The performance of awk is better than the above bash version; awk was created for this kind of text processing.
Give this tested version a try:
function get_page_from_line ()
{
awk -v nline="${1}" '
BEGIN {
npag=1;
}
{
if (index($0,"\f")>0) {
npag++;
}
if (NR==nline) {
print npag;
linefound=1;
exit;
}
}
END {
if (!linefound) {
print 0;
}
}' "${2}"
}
When \f is encountered, the page number is increased.
NR is the current line number.
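For example, after sourcing the function, with the question's sample saved as (hypothetically) pages.txt and each \f beginning the first line of a new page:
$ get_page_from_line 7 pages.txt
2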
----
For history, here is another bash version.
This version uses only built-in commands to count the lines in the current page.
The speedtest.sh that you provided in the comments showed it is a little ahead (approx. 20 sec), which makes it roughly equivalent to your version:
function get_page_from_line ()
{
local nline="$1"
local input_file="$2"
local npag=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( npag + 1 ))
IFS=$'\n'
for line in ${page}
do
total=$(( total + 1 ))
if [[ total -eq nline ]] ; then
printf "%d\n" ${npag}
unset IFS
return
fi
done
unset IFS
done < "$input_file"
printf "0\n"
return
}
awk to the rescue!
awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file
3
updated anchoring as commented below.
$ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" '$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done
01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3
A script of similar length can be written in bash itself to locate and respond to the embedded <form-feed>'s contained in a file (it will work in POSIX shell as well, with a substitute for the substring expansion and expr for the math). For example,
#!/bin/bash
declare -i ln=1 ## line count
declare -i pg=1 ## page count
fname="${1:-/dev/stdin}" ## read from file or stdin
printf "\nln:pg text\n" ## print header
while read -r l; do ## read each line
if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found
((pg++))
printf "<ff>\n%2s:%2s '%s'\n" "$ln" "$pg" "${l:1}"
else
printf "%2s:%2s '%s'\n" "$ln" "$pg" "$l"
fi
((ln++))
done < "$fname"
Example Input File
The simple input file with embedded <form-feed>'s was created with:
$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt
Which when output gives:
$ cat dat/affex.txt
a
b
c
d
e
f
g
h
i
j
k
l
Example Use/Output
$ bash affex.sh <dat/affex.txt
ln:pg text
1: 1 'a'
2: 1 'b'
3: 1 'c'
<ff>
4: 2 'd'
5: 2 'e'
6: 2 'f'
7: 2 'g'
8: 2 'h'
<ff>
9: 3 'i'
10: 3 'j'
11: 3 'k'
12: 3 'l'
With Awk, you can set RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of horizontal whitespace) to newline (\n) and obtain the number of lines as the number of "fields" in a "record", which is a "page".
The placement of form feeds in your data will produce some empty lines within a page so the counts are off where that happens.
awk -F '\n' -v RS='\f' '{ print NF }' file
You could reduce the number by one if $NF == "", and perhaps pass in the number of the desired page as a variable:
awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file
To obtain the page number for a particular line, just feed head -n number to the script, or loop over the numbers until you have accrued the sum of lines.
line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
old=$line
((line += count))
echo "Lines $old through line are on page $page"
((page++)
done
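And the head -n variant mentioned above can be sketched like this (lineno and file are placeholders; the page number is simply the count of \f-separated records seen up to that line):
head -n "$lineno" file | awk -v RS='\f' 'END { print NR }'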
This gnu awk script prints the "page" for the linenumber given as command line argument:
BEGIN { ffcount=1;
search = ARGV[2]
delete ARGV[2]
if (!search ) {
print "Please provide linenumber as argument"
exit(1);
}
}
$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }
/[\f]/ { ffcount++ }
Use it like awk -f formfeeds.awk formfeeds.txt 05 where formfeeds.awk is the script, formfeeds.txt is the file and '05' is a linenumber.
The BEGIN rule deals mostly with the command line argument. The other rules are simple rules:
$1 ~ search applies when the first field matches the commandline argument stored in search
/[\f]/ applies when there is a formfeed
I have a large .xml file like this:
c1="a1" c2="b1" c3="cccc1"
c1="aa2" c2="bbbb2" c3="cc2"
c1="aaaaaa3" c2="bb3" c3="cc3"
...
I need the result like the following:
a1 b1 cccc1
aa2 bbbb2 cc2
aaaaaa3 bb3 cc3
...
How can I get the columns in Bash?
I have the following method in PL/SQL, but it's very inconvenient:
SELECT C1,
TRIM(BOTH '"' FROM REGEXP_SUBSTR(C1, '"[^"]+"', 1, 1)) c1,
TRIM(BOTH '"' FROM REGEXP_SUBSTR(C1, '"[^"]+"', 1, 2)) c2,
TRIM(BOTH '"' FROM REGEXP_SUBSTR(C1, '"[^"]+"', 1, 3)) c3
FROM TEST;
Use cut:
cut -d'"' -f2,4,6 --output-delimiter=" " test.txt
Or you can use sed if the number of columns is not known:
sed 's/[a-z][a-z0-9]\+="\([^"]\+\)"/\1/g' < test.txt
Explanation:
[a-z][a-z0-9]\+ - matches a string starting with an alpha char followed by any number of alphanumeric chars
"\([^"]\+\)" - captures any string inside the quotes
\1 - represents the captured string that in this case is used to replace the entire match
A perl approach (based on the awk answer by @A-Ray)
perl -F'"' -ane 'print join(" ",@F[ map { 2 * $_ + 1} (0 .. $#F) ]),"\n";' < test.txt
Explanation:
-F'"' set input separator to "
-a turn autosplit on - this results in #F being filed with content of fields in the input
-n iterate through all lines but don't print them by default
-e execute code following
map { 2 * $_ + 1} (0 .. $#F) generates a list of indexes (1,3,5 ...)
#F[map { 2 * $_ + 1} (0 .. $#F)] takes a slice from the array, selecting only odd fields
join - joins the slice with spaces
NOTE: I would not use this approach without a good reason, the first two are easier.
Some benchmarking (on a Raspberry Pi, with a 60000 lines input file and output thrown away to /dev/null)
cut - 0m0.135s no surprise there
sed - 0m5.864s
perl - 0m8.218s - I guess regenerating the index list every line isn't that fast (with a hard coded slice list it goes to half, but that would defeat the purpose)
the read based solution - 0m52.027s
You can also look at the built-in substring replacement/removal bash offers. Either in a short script or a one-liner:
#!/bin/bash
while read -r line; do
new=${line//c[0-9]=/} ## remove 'cX=', where X is '0-9'
new=${new//\"/} ## remove all '"' (double-quotes)
echo "$new"
done <"$1"
exit 0
Input
$ cat dat/stuff.xml
c1="a1" c2="b1" c3="cccc1"
c1="aa2" c2="bbbb2" c3="cc2"
c1="aaaaaa3" c2="bb3" c3="cc3"
Output
$ bash parsexmlcx.sh dat/stuff.xml
a1 b1 cccc1
aa2 bbbb2 cc2
aaaaaa3 bb3 cc3
As a One-Liner
while read -r line; do new=${line//c[0-9]=/}; new=${new//\"/}; echo "$new"; done <dat/stuff.xml
awk -F '"' '{ for(i=2; i<=NF; i+=2) { printf $i" " } print "" }'
Explanation
-F '"' makes Awk treat quotation marks (") as field delimiters. For example, Awk will split the line...
c1="a1" c2="b1" c3="cccc1"
...into fields numbered as...
1: 'c1='
2: 'a1'
3: ' c2='
4: 'b1'
5: ' c3='
6: 'cccc1'
7: ''
for(i=2; i<=NF; i+=2) { printf $i" " } starts at field 2, prints the value of the field, skips a field, and continues. In this case, fields 2, 4, and 6 will be printed.
print outputs a string followed by a newline. printf also outputs a string, but doesn't append a newline. Therefore...
printf $i" "
...outputs the value of field $i followed by a space.
print ""
...simply outputs a newline.
In a UNIX environment, I have a file.txt that contains the following details:
Data recording started:
0001100 Matched at 412090
0001101 Mismatched at 414798
0001102 Matched at 420007
0001103 Mismatched at 420015
Job completed
How can I get the first Matched value by searching for the word "Matched" (line 2), and also the first "Mismatched" value (line 3),
find the difference between them, and store it as a variable, "dif"?
The result is Matched minus Mismatched, and it cannot be found by specifying line numbers (i.e. last integer of line 3 minus last integer of line 2), because the Mismatched line may come first, as in the following:
Data recording started:
0001100 Mismatched at 412090
0001101 Matched at 414798
0001102 Mismatched at 420007
0001103 Matched at 420015
Job completed
One way:
echo $((
$(grep Matched input | head -1 | sed 's/.*at //')
- $(grep Mismatched input | head -1 | sed 's/.*at //')
))
or using only sed:
echo $((
$(sed -n 's/.*Matched.*at //p' input | head -1)
- $(sed -n 's/.*Mismatched.*at //p' input | head -1)
))
Output
-2708
We can use grep -m 1 to avoid the need for head.
dif=$((
$(grep -m 1 'Matched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
- $(grep -m 1 'Mismatched' a.txt | sed 's/.*at \([0-9]*\).*/\1/')
))
echo $dif
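For example, with the first sample file this gives 412090 - 414798 = -2708, and with the reordered sample it gives 414798 - 412090 = 2708.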