How to pad out values line by line while mainting overall record length in a Unix Shell script ksh - shell

IFS=$'\n'
while read -r line
do
--header/trailer record
if echo ${line} | grep -e '000000000000000' -e '999999999999999' >/dev/null 2>&1
then
echo ${line} >> outfile.01.DAT.sampleNEW
elif echo ${line} | grep '+0' >/dev/null 2>&1
then
echo ${line} | sed -e 's/+/+00000000/; s/ X/X/' >> outfile.01.DAT.sampleNEW
else
echo ${line} | sed -e 's/-/-00000000/; s/ X/X/' >> outfile.01.DAT.sampleNEW
fi
done < Inputfile.01.DAT
I have a large file that I need to pad out the amount fields (signed) but retain the overall record length so have to remove some filler spaces at the end (each line ends with X). The file has a header/trailer that does not need to change. I have come up with a way but it is very slow when using a large input file. I am sure the use of grep here is not good.
Sample records. end with X - Overall length 107 bytes
000000000000000PPPPPPPPP Information INV TRANSACTION 0120160505201605052154HI203.SEQ 01 X
000000000000001PPPPP14PA 000YYYYYY488 -0001235.2520150319 X
000000000000002PPPMS PA 000RRRRR4539 +0008285.0020160301 X
000000000000003PPPP506 000TTTTTT605 -0000225.0020150608 X
9999999999999990000000000000439.940000000079802782.180000005 X

I suspect you want something like this, but it is very hard to tell given the way you have presented your question:
awk '
/000000000000000/ || /999999999999999/ {print;next}
/\+0/ {sub(/\+0/,"+00000000"); sub(/ X/,'X'); print; next}
/\-0/ {sub(/\-0/,"-00000000"); sub(/ X/,'X'); print; next}
' Inputfile.01.DAT
That says... "if the line contains a string of 15 zeroes or 15 nines, print it and move to the next line. If the line contains +0, replace it with +00000000 and remove 8 spaces before the final X, then print. Likewise for -0."
You could also maybe use Perl, and do something like this:
perl -nle '/0{15}|9{15}/ && print; s/([+-])0/$1\0000000000/ && s/ X/X/ && print' Inputfile.01.DAT

Related

How to filter text data in bash more efficiently

I have data file which I need to filter with bash script, see data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name value pairs like so apples=10, if current line starts with name and next line starts with name, first line should be omitted entirely. So result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came with this simple solution which works but is very slow, it takes 5 min to complete for file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done
I'm aware that this kind of task is not very suitable for bash, but script is already written and this is just small part of it.
How can I optimize this task in bash?
Do not run any commands in subshells, it slows your script a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read k v ; do
if [[ $k == name ]] ; then
name=$v
elif [[ $k == value ]] ; then
printf '%s=%s\n' "$name" "$v"
fi
done
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}"
fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter from the variable $IFS (inter-field separator).
while IFS='=' read -r key value; do
case "$key" in
name) name="$value";;
value) echo "$name=$value";;
fi
done < input.txt > output.txt
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I have IFS= read -r in the first two iterations.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls.
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '{name = $2} $1 == "value" {print name "=" $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as character =. The first block is executed only if the first field of a line ($1) is equal to string value. It prints variable name followed by character = and the second field ($2). The second block is executed on each line and it stores the second field ($2) in variable name.
Normally, if your input resembles what you show, this should automatically skip the first line. Else, we can exclude it explicitly using a test on the NR variable which value is the line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show but not on inputs where you would have other types of lines or several value=... consecutive lines. If you really want to test that the name/value pair is on two consecutive lines we need something more. For instance, test if the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the 2 blocks in a slightly more intuitive order (but the opposite would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
With awk there might be a more elegant solution but you can have:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=
FS="\nvalue=" says that the field separator for each record is value=
if($2) says to only proceed the printf is the second field exists

How to get values from one file that fall in a list of ranges from another file

I have bunch of files with sorted numerical values, in example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through file with ranges and then look for all values that fall in range (in every line) in file with appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
cat ~/blahblah/${line[0]}_file.val | while read number;
#get tag name and cat the appropriate file
do
if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
#check if current value fall into range
then
echo $number >> ${line[0]}.output
#toss the value that fall into interval to another file
elif [[ "$number" -gt "${line[2]}" ]]
then break
fi
done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat file tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
<${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour if this version is slightly different in two ways - the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
another not so efficient implementation with awk
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
{for(k in t)
if(t[k]==FILENAME) {
inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
print > inout;
next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
note that I renamed files to match the tag names, otherwise you have to add tag extraction from filenames. Suffix ".in" for in ranges and ".out" for not. Depends on the sorted order of the files. If you have thousands of tag files adding a another layer to filter out the ranges per tag will speed it up. Now it iterates over ranges.
I'd write
while read -u3 -r tag start end; do
f="${tag}_file.val"
if [[ -r $f ]]; then
while read -u4 -r num; do
(( start <= num && num <= end )) && echo "$num"
done 4< "$f"
fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
bash while-read loops are very slow. I'll be back if a few minutes with an alternate solution
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
#load "filefuncs"
function read_file(tag, start, end, file, number, statdata) {
file = tag "_file.val"
if (stat(file, statdata) != -1) {
while (getline number < file) {
if (start <= number && number <= end) print number
}
}
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
$file = $F[0] . "_file.val";
next unless -r $file;
open $fh, "<", $file;
while ($num = <$fh>) {
print $num if $F[1] <= $num and $num <= $F[2]
}
close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format called .bed is used for description of ranges on chromosomes, but should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool, which might help you is intersect.
With this installed it becomes a task of formating the data for the tool:
#!/bin/bash
#reformating your positions to .bed format;
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparision of the ranges with bedtools
bedtools intersect -a all_positions_in_one-range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#spliting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer oneliners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

remove duplicate lines with similar prefix

I need to remove similar lines in a file which has duplicate prefix and keep the unique ones.
From this,
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/
123/456/789/
xyz/
to this
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
Appreciate any suggestions,
Answer in case reordering the output is allowed.
sort -r file | awk 'a!~"^"$0{a=$0;print}'
sort -r file : sort lines in revers this way longer lines with the same pattern will be placed before shorter line of the same pattern
awk 'a!~"^"$0{a=$0;print}' : parse sorted output where a holds the previous line and $0 holds the current line
a!~"^"$0 checks for each line if current line is not a substring at the beginning of the previous line.
if $0 is not a substring (ie. not similar prefix), we print it and save new string in a (to be compared with next line)
The first line $0 is not in a because no value was assigned to a (first line is always printed)
A quick and dirty way of doing it is the following:
$ while read elem; do echo -n "$elem " ; grep $elem file| wc -l; done <file | awk '$2==1{print $1}'
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
123/456/789/
xyz/
where you read the input file and print each elements and the number of time it appears in the file, then with awk you print only the lines where it appears only 1 time.
Step 1: This solution is based on assumption that reordering the output is allowed. If so, then it should be faster to reverse sort the input file before processing. By reverse sorting, we only need to compare 2 consecutive lines in each loop, no need to search all the file or all the "known prefixes". I understand that a line is defined as a prefix and should be removed if it is a prefix of any another line. Here is an example of remove prefixes in a file, reordering is allowed:
#!/bin/bash
f=sample.txt # sample data
p='' # previous line = empty
sort -r "$f" | \
while IFS= read -r s || [[ -n "$s" ]]; do # reverse sort, then read string (line)
[[ "$s" = "${p:0:${#s}}" ]] || \
printf "%s\n" "$s" # if s is not prefix of p, then print it
p="$s"
done
Explainations: ${p:0:${#s}} take the first ${#s} (len of s) characters in string p.
Test:
$ cat sample.txt
abc/def/ghi/
abc/def/ghi/jkl/one/
abc/def/ghi/jkl/two/
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/
123/456/789/
xyz/
$ ./remove-prefix.sh
xyz/
abc/def/ghi/jkl/two/two
abc/def/ghi/jkl/one/one
123/456/789/
Step 2: If you really need to keep the order, then this script is an example of removing all prefixes, reordering is not allowed:
#!/bin/bash
f=sample.txt
p=''
cat -n "$f" | \
sed 's:\t:|:' | \
sort -r -t'|' -k2 | \
while IFS='|' read -r i s || [[ -n "$s" ]]; do
[[ "$s" = "${p:0:${#s}}" ]] || printf "%s|%s\n" "$i" "$s"
p="$s"
done | \
sort -n -t'|' -k1 | \
sed 's:^.*|::'
Explanations:
cat -n: numbering all lines
sed 's:\t:|:': use '|' as the delimiter -- you need to change it to another one if needed
sort -r -t'|' -k2: reverse sort with delimiter='|' and use the key 2
while ... done: similar to solution of step 1
sort -n -t'|' -k1: sort back to original order (numbering sort)
sed 's:^.*|::': remove the numbering
Test:
$ ./remove-prefix.sh
abc/def/ghi/jkl/one/one
abc/def/ghi/jkl/two/two
123/456/789/
xyz/
Notes: In both solutions, the most costed operations are calls to sort. Solution in step 1 calls sort once, and solution in the step 2 calls sort twice. All other operations (cat, sed, while, string compare,...) are not at the same level of cost.
In solution of step 2, cat + sed + while + sed is "equivalent" to scan that file 4 times (which theorically can be executed in parallel because of pipe).
The following awk does what is requested, it reads the file twice.
In the first pass it builds up all possible prefixes per line
The second pass, it checks if the line is a possible prefix, if not print.
The code is:
awk -F'/' '(NR==FNR){s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]};next}
{if (! ($0 in a) ) {print $0}}' <file> <file>
You can also do it with reading the file a single time, but then you store it into memory :
awk -F'/' '{s="";for(i=1;i<=NF-2;i++){s=s$i"/";a[s]}; b[NR]=$0; next}
END {for(i=1;i<=NR;i++){if (! (b[i] in a) ) {print $0}}}' <file>
Similar to the solution of Allan, but using grep -c :
while read line; do (( $(grep -c $line <file>) == 1 )) && echo $line; done < <file>
Take into account that this construct reads the file (N+1) times where N is the amount of lines.

Using cut on stdout with tabs

I have a file which contains one line of text with tabs
echo -e "foo\tbar\tfoo2\nx\ty\tz" > file.txt
I'd like to get the first column with cut. It works if I do
$ cut -f 1 file.txt
foo
x
But if I read it in a bash script
while read line
do
new_name=`echo -e $line | cut -f 1`
echo -e "$new_name"
done < file.txt
Then I get instead
foo bar foo2
x y z
What am I doing wrong?
/edit: My script looks like that right now
while IFS=$'\t' read word definition
do
clean_word=`echo -e $word | external-command'`
echo -e "$clean_word\t<b>$word</b><br>$definition" >> $2
done < $1
External command removes diacritics from a Greek word. Can the script be optimized any further without changing external-command?
What is happening is that you did not quote $line when reading the file. Then, the original tab-delimited format was lost and instead of tabs, spaces show in between words. And since cut's default delimiter is a TAB, it does not find any and it prints the whole line.
So quoting works:
while read line
do
new_name=`echo -e "$line" | cut -f 1`
#----------------^^^^^^^
echo -e "$new_name"
done < file.txt
Note, however, that you could have used IFS to set the tab as field separator and read more than one parameter at a time:
while IFS=$'\t' read name rest;
do
echo "$name"
done < file.txt
returning:
foo
x
And, again, note that awk is even faster for this purpose:
$ awk -F"\t" '{print $1}' file.txt
foo
x
So, unless you want to call some external command while looping the file, awk (or sed) is better.

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command That counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
so this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
[[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) use some scripting language for counting, like awk or perl and such. Awk solution already posted, here is an perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
both examples prints
4
2.) The second method is based on shell-pipelining
filter out the right rows
extract the column with the count
sum the numbers
e.g some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
the grep filter out the right rows
the cut get the last column
the paste creates an line like 2+2 what can be counted by
the bc for counting
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
the sed filter and extract
another example - different way of counting
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
numbers are counted by RPN calculator called dc, e.g. it works like 0 2 + - first comes the values and as the last the operation.
the first echo puts into the stack 0
the sed creates a stream of numbers like 2+ 2+
the last echo p prints the stack
exists many other possibilies how count a strem of numbers.
e.g counting by bash
while read -r num
do
sum=$(( $sum + $num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
and pure bash:
while IFS=: read -r maker model year color count
do
if [[ "$maker" == "Ford" && "$color" == "Red" ]]
then
(( sum += $count ))
fi
done < carfile
echo $sum

Resources