How to delete lines with the same middle block? - shell

This is probably an easy question for command-line ninjas, but I can't figure it out for the life of me. As of now, I'm using a PHP script to accomplish this, but I need to do it using awk/sed/cut or similar.
I have got a log file like this:
123 | foo | 12.13
756 | bar | 14.25
236 | baz | 11.23
536 | foo | 10.13
947 | bar | 34.25
134 | baz | 11.26
I need to delete all lines that have the same middle element. If there is a duplicate, the newer (later) occurrence should be kept. After the removal, the output should look like:
536 | foo | 10.13
947 | bar | 34.25
134 | baz | 11.26
I'm new to this and have no idea how to do this, so a little nudge in the right direction would be of great help.

$ tac file | awk -F' +[|] +' '!seen[$2]++' | tac
536 | foo | 10.13
947 | bar | 34.25
134 | baz | 11.26
The tac calls reverse the file so that !seen[$2]++, which keeps only the first occurrence of each middle field, ends up keeping the newest entry; the final tac restores the original order. Or, if you prefer an awk-only solution that reads the file twice:
$ awk -F' +[|] +' 'NR==FNR{fnr[$2]=FNR; next} FNR==fnr[$2]' file file
536 | foo | 10.13
947 | bar | 34.25
134 | baz | 11.26
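The two-pass version records the last line number seen for each middle field on the first read, then prints only those lines on the second. Both solutions keep the newest entry per key; the idea is easiest to see in this minimal, unordered variant (a sketch; the output order follows awk's arbitrary array traversal):
awk -F' +[|] +' '{last[$2]=$0} END{for (k in last) print last[k]}' file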

You can use this awk command with a custom field separator; it records the first-seen order of the keys in a, always overwrites data[$2] with the latest line, and prints the latest lines in first-seen order at the END:
awk -F' *\\| *' '!data[$2]{a[++k]=$2} {data[$2]=$0}
END{for (i=1; i<=k; i++) print data[a[i]]}' file
536 | foo | 10.13
947 | bar | 34.25
134 | baz | 11.26

If you don't care about the output order:
perl -F'\s*\|\s*' -lanE '$s{$F[1]}=$_}{say $s{$_} for keys %s' <ca.txt
prints
134 | baz | 11.26
947 | bar | 34.25
536 | foo | 10.13

sed -e ":a
$ !{N;ba
}
:b
s/[0-9]* | \([^ ]*\) | [0-9.]*\n\(.*\)\1/\2\1/g
t b" YourFile
This is a POSIX sed version (use --posix with GNU sed, notably because of the | used inside s///). The :a loop first slurps the whole file into the pattern space; the :b loop then repeatedly deletes any line whose middle field occurs again later in the buffer, so only the newest occurrence of each middle value survives.

How to fix newline character in csv exported in shell script?

I want to fix the issue below in a csv file using unix. I don't have access to the source, so I have to fix it with this csv file alone. I need the desired output; is it achievable? Please help.
I have tried the code below, but it doesn't work.
perl -p00e 's/\n|/|/g' test.csv
Issue:
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENT
REGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|
Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|
0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|
1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
Desired Result:
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENT REGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
You can fix the output fairly simply with awk using 3 rules. Specifically, check whether each line begins with a date in your format and ends (i.e. the 4th field, $4) with 4 digits. If so, just print the line (rule 1). If not, but the line does begin with a date in your format, output it without a '\n' so the next line can be appended to it (rule 2). If a line satisfies neither rule 1 nor rule 2, it is the tail of the previous line; output it with a '\n' to complete the previous line (rule 3).
That can be done with:
awk -F'|' '
    NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
    $1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ {
        printf "%s",$0
        next
    }
    { print }
' f.csv
Example Use/Output
With your input file in f.csv you would obtain:
$ awk -F'|' '
> NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
> $1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ {
> printf "%s",$0
> next
> }
> { print }
> ' f.csv
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENTREGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
This matches the output you specified except on the EVENT REGISTRATION line, where the space at the original break is lost because the fragments are joined without a separator (see the tweak after the condensed form below).
You can write it in condensed form with one rule per line as:
awk -F'|' '
    NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
    $1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ { printf "%s",$0; next }
    { print }
' f.csv
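If you also need to restore the space that disappears where a record broke mid-word (the EVENT REGISTRATION line), one possible tweak, based on the observation that in your sample only breaks that do not follow a '|' need a space, is (a sketch, not a guaranteed general fix):
awk -F'|' '
    NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
    $1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ {
        # append a space only when the fragment does not already end in a pipe
        printf "%s%s", $0, ($0 ~ /\|$/ ? "" : " ")
        next
    }
    { print }
' f.csv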
Look things over and let me know if you have further questions.
There is also a very simple solution (note that it hardcodes the 2021- date prefix as the record anchor):
perl -pe 's/\n/ /g;s/2021-/\n2021-/g;s/\| */|/g' input.txt
which gives you (rendered here as a table for readability; the actual output is pipe-delimited):
+------------+--------------+---------------------------------------------+--------+
| DATECODE | SUBCLASSCODE | SUBCLASS_NAME | CLASS |
+------------+--------------+---------------------------------------------+--------+
| 2021-05-25 | 2202 | Bras | 1310 |
| 2021-05-25 | 1119 | No Longer in Use - Depleted by 2019 Reclass | 0805 |
| 2021-05-25 | 0949 | No Longer in Use - Depleted by 2021 Reclass | 0231 |
| 2021-05-25 | 1928 | Fishing Gloves | 1155 |
| 2021-05-25 | 1604 | Training FW | 1080 |
| 2021-05-25 | 0894 | Hunting Waders | 0894 |
| 2021-05-25 | 1873 | Small Game | 0326 |
| 2021-05-25 | 9950 | EVENT REGISTRATION FEE | 9950 |
| 2021-05-25 | 0476 | Regular Golf Gloves | 0476 |
| 2021-05-25 | 1366 | Shorts | 0988 |
| 2021-05-25 | 1914 | Wade Shoes | 0894 |
| 2021-05-25 | 0537 | No Longer in Use - Depleted by 2019 Reclass | 0537 |
| 2021-05-25 | 1635 | Pickleball FW | 0021 |
| 2021-05-25 | 0679 | Case Sunglasses | 0679 |
| 2021-05-25 | 1544 | Sandals | 0001 |
| 2021-05-25 | 1527 | Golf/Tennis Accessories | 1059 |
| 2021-05-25 | 1582 | Lifestyle FW | 0502 |
+------------+--------------+---------------------------------------------+--------+

Inconsistency in output field separator

We have to find the difference (d) between the last two numbers and display the rows with the highest values of d in ascending order.
INPUT
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
awk 'BEGIN{FS="|";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
Output:
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after the $ sign, but for the last column (avg) there is no space. Please explain why this is happening.
2)
awk 'BEGIN{FS=" | ";OFS="$";}{
avg=sqrt(($5-$6)^2)
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
OUTPUT
4$|$Theekshana$|$Second$|$0
5$|$Teju$|$First$|$0
6$|$Theekshitha$|$Second$|$0
I have not mentioned | as the output field separator, but it still appears. Why is this happening? And why is the difference zero, too?
I am just 6 days old in unix, so please answer even if it's easy.
In your first case the field separator is only the pipe symbol, so the surrounding whitespace stays part of each field, and that is what you see in the output. When FS is longer than one character it is treated as a regular expression, in which an unescaped pipe means alternation and needs to be escaped. So in your second case FS=" | " means "space or space", i.e. every single space acts as the field separator.
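You can see the "space or space" effect directly (a quick check):
$ echo 'a | b | c' | awk 'BEGIN{FS=" | "} {print NF, $2}'
5 |
Making the whitespace part of an escaped-pipe pattern fixes both problems: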
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
4$Theekshana$Second$DAV$97$100$3
5$Teju$First$Sangamithra$89$100$11
3$Sethu$First$DAV$86$98$12
This slight rewrite eliminates the dependency on a fixed field count (by using $NF and $(NF-1)) and fixes the format.
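One detail worth calling out: the $1=$1 assignment forces awk to rebuild the record with the new OFS, which is what turns the pipes into $ signs. A minimal illustration:
$ echo 'a | b | c' | awk 'BEGIN{FS=" *\\| *"; OFS="$"} {$1=$1; print}'
a$b$c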

Extract URLs (multiple lines) from texttable

My source:
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| positives | total | scan_date | url |
+===========+=======+======================+==================================================================================+
| 4 | 65 | 2015-09-21 23:29:33 | http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/ |
| | | | prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| 1 | 64 | 2015-09-17 19:28:50 | http://thebackpack.fr/ |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| 1 | 64 | 2015-09-17 08:44:16 | http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/ |
| | | | prettyphoto/images/prettyPhoto/light_rounded/ |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
I would like to extract the full URLs (each full URL on one line):
hxxp://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt
hxxp://thebackpack.fr/
hxxp://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/
The URLs that span multiple lines are my problem. I tried, for example: awk '{print $9}'
Thanks in advance for your help!
You can use this awk command; it skips the two header lines and the low-field separator rows, starts a new URL whenever the positives column ($2) is non-empty, and appends continuation rows to the current URL:
awk -F '[[:blank:]]*\\|[[:blank:]]*' 'NR<3 || NF<5{next}
$2{if (url) print url; url=$5; next}
{url=url $5}
END{print url}' file
Output:
http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt
http://thebackpack.fr/
http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/

Can't iterate over array in Bash

I need to add a new column with an (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
The incoming table has more than 500,000 lines (rows) of data and 7 columns, e.g. https://www.dropbox.com/s/g2u68fxrkttv4gq/incoming_data.csv?dl=0
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
-----------------
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
--------------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
The first column is the ID, so I've tried to group all lines with the same ID and iterate over them. The script (I don't know bash scripting, to be honest):
FILE=$PWD/$1
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')
for id in $IDS_ARRAY; do
    # Group $FILE by $id from $IDS_ARRAY.
    cat $FILE | grep $id >> temp_mail_group.csv
    ROW_GROUP=$PWD/temp_mail_group.csv
    # Add a number after each row.
    # NF+1 - add a column after last existing.
    awk -F'|' '{$(NF+1)=++i;}1' OFS="|" $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
    rm -f $PWD/temp_mail_group.csv
done
Right now this script works almost like I want to, except that it thinks that (for example) ID 2834 and 772834 are the same.
UPD: Although I marked one answer as accepted, it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}
'
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
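A minimal illustration of the counter, on a hypothetical three-line input:
$ printf '%s\n' 'x|1|a' 'x|1|b' 'x|2|c' | awk -F'|' '{print $2, ++a[$2]}'
1 1
1 2
2 1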
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
$ awk '
BEGIN{ w = 8 }                            # width of the new column
{
    if (NR==1) {                          # header row: append an empty cell
        val = sprintf("%*s|",w,"")
    }
    else if (NR==2) {                     # separator row: extend the dashes
        val = sprintf("%*s",w+1,"")
        gsub(/ /,"-",val)
    }
    else {                                # data row: append the per-ID counter
        val = sprintf(" %-*s|",w-1,++cnt[$2])
    }
    print $0 val
}
' file
| id | Name | |
----------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
An awk way, without extending the dashed separator line:
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
output
| id | Name |
-------------
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
inputfile=$1
prev_id=
while IFS= read -r line ; do
    printf '%s' "$line"
    IFS=$'| \t\n' read t1 id name t2 <<<"$line"
    if [[ $line == -* ]] ; then
        printf '%s\n' '---------'
    elif [[ $id == 'id' ]] ; then
        printf ' Number |\n'
    else
        if [[ $id != "$prev_id" ]] ; then
            id_count=0
            prev_id=$id
        fi
        printf '%2d |\n' "$(( ++id_count ))"
    fi
done <"$inputfile"
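For the record, the reason the original loop treats 2834 and 772834 as the same is that grep $id matches the ID anywhere in the line. Anchoring the ID to the start of the line as a whole field avoids that (a sketch assuming the real file is comma-separated, since the pipes above are only for clarity):
grep "^${id}," "$FILE" >> temp_mail_group.csv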

shell - grep - how to get only lines that have a certain number of a character

Good morning.
I have the following lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
I want to get only the lines with 7 "|" and the same first field.
So the output for these two lines will be nothing, but for these two lines:
1 | blah | 2 | 1993 | 86 | 0 | NA | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
The output will be "error".
I'm getting the inputs from a file using the following command:
grep '.*|.*|.*|.*|.*|.*|.*|.*' < $1 | sort -nbsk1 | cut -d "|" -f1 | uniq -d |
while read line2; do
    echo error
done
But this implementation still prints error even if a line has more than 7 "|".
Any suggestions?
P.S. - I can assume that there is a \n at the end of each line.
For printing lines containing exactly 7 |, try:
awk -F'|' 'NF == 8' filename
If you want to use bash to count the number of | in a given line, try:
line="1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123";
count=${line//[^|]/};
echo ${#count};
With grep, anchor the whole line and repeat, seven times, a group that contains exactly one pipe:
grep '^\([^|]*|[^|]*\)\{7\}$'
Assuming zz.txt is:
$ cat zz.txt
1 | blah | 2 | 1993 | 86 | 0 | NA | 123 | 123
1 | blah | TheBeatles | 0 | 3058 | NA | NA | 11
$ cut -d\| -f1-8 zz.txt
the cut above keeps only the first eight |-delimited fields of each line.
I would suggest that you use awk for this job:
BEGIN { FS = "|" }
NF == 8 && $1+0 == 1 { print $0 }
would do the job (with FS="|" the first field keeps its trailing space, so the numeric comparison $1+0 == 1 is safer than a string comparison; the operators are indeed == and &&).
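If you want the two requirements combined, exactly 7 pipes and a duplicated first field, into the single "error" check from the question, a one-pass sketch:
awk -F'|' 'NF==8 && seen[$1]++ { err=1 } END { if (err) print "error" }' "$1"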
