shell - grep - how to get only lines that have certain amount char - shell

good morning.
I have the following lines :
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
And I wanna get only the lines with 7 "|" and the same first field.
So the output for these two lines will be nothing, but for these two lines :
1 | blah | 2 | 1993 | 86 | 0 | NA | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
The output will be "error".
I'm getting the inputs from a file using the following command :
grep '.*|.*|.*|.*|.*|.*|.*|.*' < $1 | sort -nbsk1 | cut -d "|" -f1 | uniq -d |
while read line2; do
echo error
done
But this implementation would still print error even if I have more then 7 "|".
Any suggestions ?
P.S - I can assume that there is a \n in the end of each line.

For printing lines containing only 7 |, try:
awk -F'|' 'NF == 8' filename
If you want to use bash to count the number of | in a given line, try:
line="1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123";
count=${line//[^|]/};
echo ${#count};

With grep
grep '^\([^|]*|[^|]*\)\{7\}$'

Assuming zz.txt is:
$ cat zz.txt
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
$ cut -d\| -f1-8 zz.txt
above cut will give you the output you need.

I would suggest that you use awk for this job.
BEGIN { FS = "|" }
NF == 8 && $1 == '1' { print $0}
would do the job (although the == and && could be = and & ; my awk is a bit rusty)

Related

Inconsistency in output field separator

We have to find the difference(d) Between last 2 nos and display rows with the highest value of d in ascending order
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see there is space before and after $ sign but for the last column (avg) there is no space, please explain why its happening
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not mentiond | as the output field separator but still it appears, why is this happening and the difference is zero too
I am just 6 days old in unix,please answer even if its easy
your field separator is only the pipe symbol, so surrounding whitespace is part of the field definitions and that's what you see in the output. In combined uses pipe has the regex special meaning and need to be escaped. In your second case it means space or space is the field separator.
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
a slight rewrite will eliminate the number of fields dependency and fixes the format.

Match string in file1 with string in file2

my data examples are
1.txt
MTQZ3CODT0SQKGE3QE6B | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
2.txt
MTQZ3CODT0SQKGE3QE6B | joe#example.com
desired output
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
I suppose to match & replace 1st column from 1.txt
with 2nd column in 2.txt
so far i did try :
awk 'BEGIN { while((getline < "file2.txt") > 0) a[$1]=$3 } { $1 = a[$1] } 1' file1.txt
Its work well but after 12hours of running i just finalise only 1GB looks very slow
INFO: file1.txt=7GB file2.txt=4GB my memory 16GB
I'm not sure what cause the slowly thing but i hope if there's another fast way then i'm using of awk
will be helpfull.
Thanks!!
Note: I'm running out of memory is there another way to do it
and that's to not have an array at all?
Also in my case lines are randomly and not in the same lines!
$ join <(sort 2.txt) <(sort 1.txt) | cut -d' ' -f3-
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
If that's not all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for.
You may use this awk:
awk -F ' *\\| *' -v OFS=' | ' '
FNR == NR {
map[$1]=$2
next
}
$1 in map {
$1 = map[$1]
} 1' 2.txt 1.txt
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05

bash looping and extracting of the fragment of txt file

I am dealing with the analysis of big number of dlg text files located within the workdir. Each file has a table (usually located in different positions of the log) in the following format:
File 1:
CLUSTERING HISTOGRAM
____________________
________________________________________________________________________________
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
_____|___________|_____|___________|_____|____:____|____:____|____:____|____:___
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
_____|___________|_____|___________|_____|______________________________________
The aim is to loop over all the dlg files and take the single line from the table corresponding to wider cluster (with bigger number of slashes in Histogram column). In the above example from the table this is the third line.
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to the final_log.txt together with the name of the log file (that should be specified before the line). So in the end I should have something in following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
#!/bin/bash
do
file_name2=$(basename "$f")
file_name="${file_name2/.dlg}"
echo "Processing of $f..."
# take a name of the file and save it in the log
echo "$file_name" >> $PWD/final_results.log
# search of the beginning of the table inside of each file and save it after its name
cat $f |grep 'CLUSTERING HISTOGRAM' >> $PWD/final_results.log
# check whether it works
gedit $PWD/final_results.log
done
Here I need to substitute combination of echo and grep in order to take selected parts of the table.
You can use this one, expected to be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines which are then sorted in reverse order by last field, that means lines with most # on the top, and finally awk removes the duplicates. Note that when grep is parsing more than one file, it has -H by default to print the filenames at the beginning of the line, so if you test it for one file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
Here is a modification to get the first appearence in case of many equal max lines in a file:
grep "#$" *.dlg | sort -k11 | tac | awk '!seen[$1]++'
We replaced the reversed parameter in sort, with the 'tac' command which is reversing the file stream, so now for any equal lines, initial order is preserved.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from different directory and want to keep only the basename of every file, to remove the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {sub(".*/","",i); print i ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
#!/bin/bash
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though;
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up to the first line where looking is no longer increasing.
I would suggest processing using awk:
for i in $FILES
do
echo -n \""$i\": "
awk 'BEGIN {
output="";
outputlength=0
}
/(^ *[0-9]+)/ { # process only lines that start with a number
if (length(substr($10, 2)) > outputlength) { # if line has more hashes, store it
output=$0;
outputlength=length(substr($10, 2))
}
}
END {
print output # output the resulting line
}' "$i"
done

convert a text file to csv using shell script

I am new to shell scripting, can anyone give me shell script for the condition below
My Input:
id | name | values
----+------+--------
1 | abc | 2
1 | abc | 3
1 | abc | 4
1 | abc | 5
1 | abc | 6
1 | abc | 7
Expected Output:
1,abc,2
"
"
1 million records
You can use awk for this:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, 'NF==3 && NR>1{sub(/^[[:blank:]]*/, "", $1); print}' file
1,abc,2
1,abc,3
1,abc,4
1,abc,5
1,abc,6
1,abc,7

Can't iterate over array in Bash

I need to add a new column with a (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
Incoming table has more then 500 000 lines (rows) of data and 7 columns, e.g. https://www.dropbox.com/s/g2u68fxrkttv4gq/incoming_data.csv?dl=0
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
-----------------
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
--------------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
First column is ID, so I've tried to group all lines with the same ID and iterate over them. Script (I don't know bash scripting, to be honest):
FILE=$PWD/$1
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')
for id in $IDS_ARRAY; do
# Group $FILE by $id from $IDS_ARRAY.
cat $FILE | grep $id >> temp_mail_group.csv
ROW_GROUP=$PWD/temp_mail_group.csv
# Add a number after each row.
# NF+1 — add a column after last existing.
awk -F'|' '{$(NF+1)=++i;}1' OFS="|", $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
rm -f $PWD/temp_mail_group.csv
done
Right now this script works almost like I want to, except that it thinks that (for example) ID 2834 and 772834 are the same.
UPD: Although I marked one answer as approved it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}
'
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
$ awk '
BEGIN{ w = 8 }
{
if (NR==1) {
val = sprintf("%*s|",w,"")
}
else if (NR==2) {
val = sprintf("%*s",w+1,"")
gsub(/ /,"-",val)
}
else {
val = sprintf(" %-*s|",w-1,++cnt[$2])
}
print $0 val
}
' file
| id | Name | |
----------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
An awk way
Without considering the dotted line being extended.
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
output
| id | Name |
-------------
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
inputfile=$1
prev_id=
while IFS= read -r line ; do
printf '%s' "$line"
IFS=$'| \t\n' read t1 id name t2 <<<"$line"
if [[ $line == -* ]] ; then
printf '%s\n' '---------'
elif [[ $id == 'id' ]] ; then
printf ' Number |\n'
else
if [[ $id != "$prev_id" ]] ; then
id_count=0
prev_id=$id
fi
printf '%2d |\n' "$(( ++id_count ))"
fi
done <"$inputfile"

Resources