How to remove partial duplicates from text file? - bash

How can I remove partial duplicates in bash using either awk, grep or sort?
Input:
"3","6"
"3","7"
"4","9"
"5","6"
"26","48"
"543","7"
Expected Output:
"3","6"
"3","7"
"4","9"
"26","48"

Could you please try the following and let me know if it helps.
awk -F'[",]' '!a[$5]++' Input_file
Output will be as follows.
"3","6"
"3","7"
"4","9"
"26","48"
EDIT: Adding an explanation as well.
awk -F'[",]' ' ##Set the field separator to either " or , for every line of Input_file.
!a[$5]++ ##With that separator the second quoted value is the 5th field. a[$5] counts how often that value has been seen; the condition is true only the first time (count still 0), and the count is then incremented so later lines with the same value are skipped. No action is given, so awk performs the default action and prints the current line.
' Input_file ##Name of the input file.
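The !a[key]++ idiom does not depend on that particular separator. As a minimal sketch, assuming the values never contain commas themselves, the same de-duplication can be keyed on the second comma-separated field (quotes included). The array name seen and the shell variable result are illustrative choices, not part of the original answer:

```shell
# Sample data from the question, fed to awk through a pipe.
# awk prints a line only the first time its 2nd comma-separated field appears.
result=$(printf '%s\n' '"3","6"' '"3","7"' '"4","9"' '"5","6"' '"26","48"' '"543","7"' |
  awk -F',' '!seen[$2]++')
printf '%s\n' "$result"
```

Because the key is the raw field, "6" and 6 would count as different values here; strip the quotes first if the data mixes both forms.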

Related

Transpose rows to column after nth column in bash

I have a file like below format:
$ cat file_in.csv
1308123;28/01/2019;28/01/2019;22/01/2019
1308456;20/11/2018;27/11/2018;09/11/2018;15/11/2018;10/11/2018;02/12/2018
1308789;06/12/2018;04/12/2018
1308012;unknown
How can I transpose it as below, starting from the second column:
1308123;28/01/2019
1308123;28/01/2019
1308123;22/01/2019
1308456;20/11/2018
1308456;27/11/2018
1308456;09/11/2018
1308456;15/11/2018
1308456;10/11/2018
1308456;02/12/2018
1308789;06/12/2018
1308789;04/12/2018
1308012;unknown
I'm testing my script, but I get a wrong result:
echo "123;23/05/2018;24/05/2018" | awk -F";" 'NR==3{a=$1";";next}{a=a$1";"}END{print a}'
Thanks in advance
1st Solution: The easiest approach is to set the field separator to ; and loop over the fields, printing $1 together with each field on its own line. The loop runs from i=2 to NF, skipping the first field, because the output starts from the 2nd column.
awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
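As a quick check, the loop can be run against a few of the sample rows from the question. This is only a sketch; the shell variable result is an illustrative name:

```shell
# Feed three sample rows to the field loop and capture the transposed output.
result=$(printf '%s\n' \
  '1308123;28/01/2019;28/01/2019;22/01/2019' \
  '1308789;06/12/2018;04/12/2018' \
  '1308012;unknown' |
  awk 'BEGIN { FS = OFS = ";" } { for (i = 2; i <= NF; i++) print $1, $i }')
printf '%s\n' "$result"
```

Note that a row containing only an ID (a single field) produces no output at all, because the loop body never runs for it.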
2nd Solution: Using the one-time substitution (sub) and global substitution (gsub) functions of awk. First the very first occurrence of ; is replaced with ### (this assumes ### never appears in your Input_file; if it does, pick any other string that cannot occur in the data). Then every remaining ; is globally replaced with ORS, val (a variable holding the value of $1) and ;, which starts a new line with the first field for each remaining value. Finally ### in the first field is turned back into ;. The first ; has to be protected this way because otherwise gsub would replace it too and insert an unwanted line break right after the first field. (As per Ed's comment, this solution was tested with a single Input_file and may have issues when reading multiple Input_files.)
awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1' Input_file
Another awk
awk -F";" '{ OFS="\n" $1 ";"; $1=""; print substr($0,2) }' file

Filter records based on Text in Unix

I'm trying to extract all the records that match the text "IN" in the 10th field from this file.
I tried the following, but it's not giving me accurate results. Any help would be highly appreciated.
awk '$10 == "IN" {print $0}'
input_file: my input file
A1|A2|A3|A4|A5|A6|A7|A8|A9|PK|A11|A13|A14|A15|A16|A17|A18
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AW|BW|CQ|AA|AR|AF|RR|AKL|ASD|US|PP|BN|TY|OL|Q3|M8|I7|V6
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
output_file: my output should be
1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18
AR|BR|CR|A8|AN|AQ|RU|A11|A13|IN|P9P|B0N|T2Y|O4L|Q43|M88|I71|V16
all the records that matched "IN" in the 10th field should be filtered.
Since you haven't set a field separator in your awk code, awk splits on whitespace by default. Your Input_file is | (pipe) delimited, so the whole line ends up in $1 and $10 is empty. Set the separator explicitly in the code.
Could you please try the following.
awk -F'|' '$10=="IN"' Input_file
Explanation of the above code:
awk -F'|' ' ##Set the field separator to | (pipe) for all lines of Input_file.
$10=="IN" ##True when the 10th field equals IN; with no action given, the current line is printed.
' Input_file ##Name of the input file.
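The effect of the separator can be made visible by counting fields with and without -F. The following is a small sketch using two rows from the question; the variable names line, default_nf, pipe_nf and result are illustrative only:

```shell
line='1|2|3|4|5|6|7|8|9|IN|11|12|13|14|15|16|17|18'

# Without -F the line contains no whitespace, so awk sees a single field.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')
# With -F'|' every pipe-delimited value becomes its own field.
pipe_nf=$(printf '%s\n' "$line" | awk -F'|' '{ print NF }')
echo "fields with default separator: $default_nf"
echo "fields with pipe separator: $pipe_nf"

# Filter on the 10th field; only the first of the two rows should survive.
result=$(printf '%s\n' "$line" 'AW|BW|CQ|AA|AR|AF|RR|AKL|ASD|US|PP|BN|TY|OL|Q3|M8|I7|V6' |
  awk -F'|' '$10 == "IN"')
printf '%s\n' "$result"
```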

sed squeeze multiple occurrence of word

I have text file with lines like below:
this is the code ;rfc1234;rfc1234
this is the code ;rfc1234;rfc1234;rfc1234;rfc1234
How can I squeeze the repeating words in the file to a single word like below:
this is the code ;rfc1234
this is the code ;rfc1234
I tried 'tr' command but it's limited to squeezing characters only
With sed, for arbitrary repeated strings prefixed with ;
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
or, if you want to delete everything after the first token without checking whether the later tokens match it
$ sed -E 's/(\S);.*/\1/' file
Explanation
(;[^;]+) is to capture a string starting with semicolon
(\1)+ followed by the same captured string one or more times
/\1/g replace the whole chain with one instance, and repeat
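To see the back-reference version at work, here is a sketch that runs it over the two sample lines plus a third, made-up line with two different tokens (rfc5678 is a hypothetical value, not from the question). It assumes a sed that accepts back-references with -E, as GNU sed does:

```shell
# Squeeze consecutive repeats of a ;-prefixed token into a single instance.
result=$(printf '%s\n' \
  'this is the code ;rfc1234;rfc1234' \
  'this is the code ;rfc1234;rfc1234;rfc1234;rfc1234' \
  'this is the code ;rfc1234;rfc5678;rfc5678' |
  sed -E 's/(;[^;]+)(\1)+/\1/g')
printf '%s\n' "$result"
```

The pattern is not anchored at token boundaries, so a token that is a prefix of the one following it (for example ;rfc1234;rfc12345) would also be collapsed; that case needs a stricter pattern.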
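The following awk may help here. It looks at all the items in the last column of your Input_file and keeps only the unique values. Note that it prints ; directly after the remaining text, so the space before ; in the sample is not preserved, and that rebuilding the line after NF-- works in GNU awk and most current awks but is not guaranteed by POSIX.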
awk '{num=split($NF,array,";");for(i=1;i<=num;i++){if(!array1[array[i]]++){val=val?val ";" array[i]:array[i]}};NF--;print $0";"val;val="";delete array;delete array1}' Input_file
The same solution in multi-line form:
awk '
{
num=split($NF,array,";");
for(i=1;i<=num;i++){
if(!array1[array[i]]++){
val=val?val ";" array[i]:array[i]}
};
NF--;
print $0";"val;
val="";
delete array;
delete array1
}' Input_file
Explanation:
awk '
{
num=split($NF,array,";"); ##Split the last field on ; into array; num holds the number of pieces.
for(i=1;i<=num;i++){ ##Loop over the pieces from i=1 to num.
if(!array1[array[i]]++){ ##True only the first time the value array[i] is seen on this line; array1 counts occurrences.
val=val?val ";" array[i]:array[i]} ##Append array[i] to val, separated by ; when val already holds a value.
};
NF--; ##Decrement NF (number of fields) to drop the last field from the current line.
print $0";"val; ##Print the line (without its last field), then ; and the unique values in val.
val=""; ##Reset val for the next line.
delete array; ##Clear array.
delete array1 ##Clear array1.
}' Input_file ##Name of the input file.
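A variant that avoids NF-- and keeps the text before the first ; verbatim (including the space) could look like the sketch below. It is not the answerer's code: the names pos, out, tok and seen are illustrative, and it assumes everything after the first ; on a line is a ;-separated token list.

```shell
result=$(printf '%s\n' \
  'this is the code ;rfc1234;rfc1234' \
  'this is the code ;rfc1234;rfc1234;rfc1234;rfc1234' |
  awk '{
    pos = index($0, ";")                      # position of the first ; (0 if none)
    if (pos == 0) { print; next }             # nothing to squeeze on this line
    out = substr($0, 1, pos - 1)              # text before the first ;, kept as is
    n = split(substr($0, pos + 1), tok, ";")  # tokens after the first ;
    split("", seen)                           # portable way to empty the array
    for (i = 1; i <= n; i++)
      if (!seen[tok[i]]++) out = out ";" tok[i]  # append each token once
    print out
  }')
printf '%s\n' "$result"
```

Like the original, this removes every repeated token on the line, not only repeats that are directly adjacent.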
I started playing around with s/(.+)\1/\1/g. It seemed to work with perl (even found the is_is_) but didn't quite take me there:
$ perl -pe 's/(.+)\1+/\1/g' file
this the code ;rfc1234
this the code ;rfc1234;rfc1234
sed 's/\(;[^;]*\).*/\1/' file
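With GNU sed you can delete the second and every later occurrence of the token together with its leading ; (combining a number with the g flag is a GNU extension):
echo "this is the code ;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
echo "this is the code ;rfc1234;rfc1234;rfc1234;rfc1234" | sed 's/;rfc1234//2g'
or
sed 's/;rfc1234//2g' yourfile.txt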
This might work for you (GNU sed):
sed -r ':a;s/(\S+)\1+/\1/g;ta' file
The substitution is repeated (the ta command jumps back to label a after each successful replacement) until no repeated pattern remains.

Bash script to echo line only when a parameter exceeds certain value

Have the following lines in logs
queryparam={createdTime=1524456000000,limit=1000, sort=name}
queryparam={createdTime=1524457000000, sort=name,limit=1001}
queryparam={createdTime=1524457000000, sort=name, name=alpha, limit=1001}
queryparam={createdTime=1524458000000, sort=name}
Is there a bash script to fetch only rows which have limit greater than 1000?
P.S. Not sure how to parse the field to get limit value and check whether it is greater than 1000.
EDIT: Since OP mentioned field is NOT always same so adding this solution now.
awk 'match($0,/limit=[0-9]+/){split(substr($0,RSTART,RLENGTH),array,"=");if(array[2]>1000){print}}' Input_file
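The following awk may help you here.
awk -F"[=,]" '/limit/ && $5>1000' Input_file
It looks only at lines that contain the string limit and prints those whose 5th field is greater than 1000. This relies on the limit value being the 5th field, i.e. on limit being the second parameter; use the match() version above when its position varies.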
Explanation:
awk -F"[=,]" ' ##Set the field separator to = or , for all lines of Input_file.
/limit/ && $5>1000 ##True when the line contains the string limit and its 5th field is greater than 1000. Only a condition is given and no action, so awk performs the default action and prints the current line.
' Input_file ##Name of the input file.
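The position-independent match() approach can be checked against the four sample log lines. This is a sketch along the lines of the EDIT solution; it uses substr instead of split to pull out the number, and the names value and result are illustrative:

```shell
result=$(printf '%s\n' \
  'queryparam={createdTime=1524456000000,limit=1000, sort=name}' \
  'queryparam={createdTime=1524457000000, sort=name,limit=1001}' \
  'queryparam={createdTime=1524457000000, sort=name, name=alpha, limit=1001}' \
  'queryparam={createdTime=1524458000000, sort=name}' |
  awk 'match($0, /limit=[0-9]+/) {
    # "limit=" is 6 characters long, so the digits start at RSTART + 6.
    value = substr($0, RSTART + 6, RLENGTH - 6) + 0   # + 0 forces a numeric value
    if (value > 1000) print
  }')
printf '%s\n' "$result"
```

The first line is dropped because its limit is exactly 1000, and the last because it has no limit at all.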

How to Compare two files line by line and output the whole line if different

I have two sorted files in question
1)one is a control file(ctrl.txt) which is external process generated
2)and other is line count file(count.txt) that I generate using `wc -l`
$more ctrl.txt
Thunderbird|1000
Mustang|2000
Hurricane|3000
$more count.txt
Thunder_bird|1000
MUSTANG|2000
Hurricane|3001
I want to compare these two files while ignoring differences in column 1 (filenames), such as "_" (for Thunder_bird) or upper case (for MUSTANG), so that my output shows only the line below, the one file for which the counts don't match.
Hurricane|3000
I have this idea to only compare second column from both the files and output whole line if they are different
I have seen other examples in AWK but I could not get anything to work.
Could you please try the following awk and let me know if it helps.
awk -F"|" 'FNR==NR{gsub(/_/,"");a[tolower($1)]=$2;next} {gsub(/_/,"")} ((tolower($1) in a) && $2!=a[tolower($1)])' ctrl.txt count.txt
The same solution in multi-line form:
awk -F"|" '
FNR==NR{
gsub(/_/,"");
a[tolower($1)]=$2;
next}
{ gsub(/_/,"") }
((tolower($1) in a) && $2!=a[tolower($1)])
' ctrl.txt count.txt
Explanation of the above code:
awk -F"|" ' ##Set the field separator to | (pipe) for all lines of the Input_file(s).
FNR==NR{ ##FNR==NR is TRUE only while the first Input_file (ctrl.txt here) is being read; the block below runs for those lines.
gsub(/_/,""); ##Use gsub to remove every _ from the current line.
a[tolower($1)]=$2; ##Store $2 in array a, indexed by the first field in lower case so that case differences are ignored.
next} ##next skips the remaining statements for this line, so they only run while the 2nd Input_file (count.txt) is being read.
{ gsub(/_/,"") } ##From here on the 2nd Input_file is being read; remove every _ from the line.
((tolower($1) in a) && $2!=a[tolower($1)]) ##True when the lower-case $1 exists in array a and $2 of the current line differs from the stored value. No action is given, so the current line (from count.txt) is printed by default.
' ctrl.txt count.txt ##The Input_file names passed to awk.
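The command above prints the mismatching line from count.txt (with underscores removed). If the line from ctrl.txt is wanted instead, as in the expected output of the question, one possible sketch is to read count.txt first and normalise the names in a helper function so the printed line stays untouched. The function name key, the array name count and the temporary directory are illustrative assumptions:

```shell
# Recreate the two sample files from the question in a temporary directory.
dir=$(mktemp -d)
printf '%s\n' 'Thunderbird|1000' 'Mustang|2000' 'Hurricane|3000' > "$dir/ctrl.txt"
printf '%s\n' 'Thunder_bird|1000' 'MUSTANG|2000' 'Hurricane|3001' > "$dir/count.txt"

result=$(awk -F'|' '
  # Normalise a file name: drop underscores and lower-case it.
  function key(name) { gsub(/_/, "", name); return tolower(name) }
  # First file (count.txt): remember the count for each normalised name.
  FNR == NR { count[key($1)] = $2; next }
  # Second file (ctrl.txt): print the line when the stored count differs.
  (key($1) in count) && $2 != count[key($1)]
' "$dir/count.txt" "$dir/ctrl.txt")
printf '%s\n' "$result"
rm -rf "$dir"
```

Because the function works on a copy of $1, the original line is printed unchanged.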

Resources