Checking number prefix - shell

I have some trouble in my script.
I am currently using:
awk '{anum=substr($1,3,22); sub(/^0+/, "", anum); print anum}' file1 | grep -nf file2 | cut -d: -f1 | awk 'FNR==NR{a[$1];next};FNR in a' - file1
file1
5000000000009855892590xxxx xxx
5000000000000068582654xxxx xxx
5000000000009855892580xxxx xxx
5000000000000765432100xxxx xxx
file2
9855892588
985589259
8265
76543210
Using the two files above (file1 and file2), I am getting this output:
5000000000009855892590xxxx xxx
5000000000000068582654xxxx xxx
5000000000000765432100xxxx xxx
But my expected output is just:
5000000000009855892590xxxx xxx
5000000000000765432100xxxx xxx
My problem is that it matches 8265 in the middle of 5000000000000068582654xxxx, which is wrong. What can I use in place of grep -nf to meet my condition? The numbers in file2 should match a prefix of, or the whole of, the 3rd to 22nd digits of file1 (without leading zeros).

This will work for your example, but as I'm not really sure exactly how you determine what's valid or not, it may not be very robust.
gawk 'NR==FNR{a[$1]=$1;next}{match($0,/0+([1-9][0-9]+)0/,b)}a[b[1]]' file{2,1}
5000000000009855892590xxxx xxx
5000000000000765432100xxxx xxx
It creates an array of all the first fields in the first file (file2), then matches, in the second file, a string that I have guessed is your valid string. If that string has been saved in the array, the line is printed.
Non-gawk version:
awk 'NR==FNR{a[$1]=$1;next}{n=substr($1,3,22);sub(/^0+/, "", n)
     for(i in a)if(n~"^"a[i])print}' file2 file1
It starts the same as the other: it then strips the start of the line as the OP has done, and for each saved element checks whether the newly created string starts with it.
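As an aside, a minimal fix to the original pipeline may be to anchor each file2 pattern so that grep -f can only match at the start of the stripped number; a sketch (assuming bash, for the process substitution):

awk '{anum=substr($1,3,22); sub(/^0+/, "", anum); print anum}' file1 |
    grep -nf <(sed 's/^/^/' file2) |   # prefix each pattern with ^ so 8265 cannot match mid-number
    cut -d: -f1 |
    awk 'FNR==NR{a[$1];next} FNR in a' - file1

This keeps the OP's structure and only changes how the patterns are interpreted.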

Related

Bash vlookup kind of solution

I have two files:
File 1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
File 2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
I want to look up the column 7 value from File 1 to check if it matches the column 7 value from File 2, and when it matches, replace that line in File 2 with the corresponding line from File 1.
So the output would be
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
Thanks in advance.
You can do that with the following script:
BEGIN { FS="," }

NR==FNR {
    lookup[$7] = $0
    next
}

{
    if (lookup[$7] != "") {
        $0 = lookup[$7]
    }
    print
}

END {
    print ""
    print "Lookup table used was:"
    for (i in lookup) {
        print "   Key '"i"', Value '"lookup[i]"'"
    }
}
The BEGIN section simply sets the field separator to , so individual fields can be easily processed.
The NR and FNR variables are, respectively, the line number of the full input stream (all files) and the line number of the current file in the input stream. When you are processing the first (or only) file, these will be equal, so we use this as a means to simply store the lines from the first file, keyed on field seven.
When NR and FNR are not equal, it's because you've started the second file and this is where we want to replace lines if their key exists in the first file.
This is done by simply checking if a line exists in the lookup table with the desired key and, if it does, replacing the current line with the lookup table line. Then we print the (original or replaced) line.
The END section is there just for debugging purposes: it outputs the lookup table that was created and used. You can remove it once you're satisfied the script works as expected.
You'll see the output in the following transcript, illustrating hopefully that it is working correctly:
pax$ cat file1
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54
3,3,0,0,Test3,1540591243,36
pax$ cat file2
2,1,0,2,Test1,1540584051,52
6,5,0,2,Test2,1540579206,54
pax$ awk -f sudarshan.awk file1 file2
2,1,1,1,Test1,1540584051,52
6,5,1,1,Test2,1540579206,54

Lookup table used was:
   Key '36', Value '3,3,0,0,Test3,1540591243,36'
   Key '52', Value '2,1,1,1,Test1,1540584051,52'
   Key '54', Value '6,5,1,1,Test2,1540579206,54'
If you need it as a "short as possible" one-liner to use from your script, just use:
awk -F, 'NR==FNR{x[$7]=$0;next}{if(x[$7]!=""){$0=x[$7]};print}' file1 file2
though I prefer the readable version myself.
This might work for you (GNU sed):
sed -r 's|^([^,]*,){6}([^,]*).*|/^([^,]*,){6}\2/s/.*/&/p|' file1 | sed -rnf - file2
This turns file1 into a sed script: using the 7th field as a lookup key, it replaces any line in file2 that matches.
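For illustration, applied to the file1 above, the first sed invocation generates a script along these lines (the & in the outer replacement expands to the whole file1 line at generation time):

/^([^,]*,){6}52/s/.*/2,1,1,1,Test1,1540584051,52/p
/^([^,]*,){6}54/s/.*/6,5,1,1,Test2,1540579206,54/p
/^([^,]*,){6}36/s/.*/3,3,0,0,Test3,1540591243,36/p

The second sed then runs this generated script against file2, printing each matching line rewritten to its file1 counterpart.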
In your example the 7th field is the last one, so a short version of the above solution is:
sed -r 's|.*,(.*)|/.*,\1/s/.*/&/p|' file1 | sed -nf - file2

match awk column value to a column in another file

I need to know if I can match an awk value while I am inside a piped command, like below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
From here I need to check whether the computed value $4*10^10+$6 is present in (matches) any column value of another file. If it is present, print the line; otherwise just move on.
The file where the value needs to be matched is as below:
a,b,c,d,e
1,2,30000000000,3,4
I need to match with the 3rd column of the above file.
I would ideally like this to be in the same command, because if this check is not applied, it prints more than 100 million rows (a very large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output (just showing 1 iteration of output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command: now I use awk
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-command output: currently the values below are printed (see how I computed 1*10^10+10588429 and got 10010588429):
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now I need to put a check (within the command, near awk) to print only if 10010588429 is present in another file (say another_file.csv, as below).
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row of the awk output, we check for an entry in column C of the other file.
Using the associative array approach from the previous question, include a hyphen in place of the first file to direct awk to read the piped input stream.
Example:
grep -A3 "sometext" | grep "somemoretext" |
awk -F '[:|]' '
    BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}
    NR==FNR {                         # first input: the piped data, read via "-"
        key = $4*10^10 + $6
        query[key] = key
        out[key]   = $4 OFS $6 OFS key OFS $8
        next
    }
    query[$3]==$3 { print out[$3] }   # second input: matched on its 3rd column
' - FS=' *, *' another_file.csv > output.csv   # FS is switched to comma (tolerating stray spaces) for the csv
More info on the merging process is in the answer cited in the question:
Using AWK to Process Input from Multiple Files
I'll post a template which you can adapt for your computation:
awk 'BEGIN {FS=OFS=","}
     NR==FNR {lookup[$3]; next}
     /sometext/ {c=4}
     c&&c--&&/somemoretext/ {value = $4*10^10 + $6   # implement your computation here
                             if(value in lookup)
                                 print "what you want"}' lookup.file FS='[:|]' grep.files...
Here awk loads the values in the third column of the first file (which is comma-delimited) into the lookup array (a hashmap in disguise). For the next set of files, it switches the delimiter (to the same [:|] set the question uses) and, similar to grep -A3, looks within 3 lines of the first pattern for the second pattern, does the computation, and prints what you want.
In awk you also have more control over which column your pattern matches; here I just replicated the grep example.
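For instance, to anchor a pattern to a specific field rather than the whole line (a sketch; the field number is just an example):

awk -F '[:|]' '$2 ~ /sometext/' file   # match only when field 2 contains the pattern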
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
The first process creates 1000 random records, and the second one filters the ones where the first field is in the lookup table.
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
For the lookup file:
$ head perfect.numbers
6
28
496
8128
The piped data is substituted as the second file via -.
You can pipe your grep or awk output into a while read loop, which gives you some degree of freedom. There you can decide whether to forward a line:
grep -A3 "sometext" | grep "somemoretext" | while read -r LINE; do
    # extract just the computed key; grepping for the whole 4-column row would never match
    COMPUTED=$(echo "$LINE" | awk -F '[:|]' '{print $4*10^10+$6}')
    if grep -q "$COMPUTED" /the/file/to/search; then
        echo "$LINE"
    fi
done

sed | awk : Keep end of String until special character is reached

I'm trying to cut HDD IDs in sed so that only the serial number of the drive remains. The IDs look like:
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
So, I only want to keep the "WD2DWMC4N2575116". Serial numbers are not fixed-length, so I tried to keep the characters from the end of the string until the first "_" appears. Unfortunately I suck at RegExp :(
To capture all characters after the last _, using a backreference:
$ sed 's/.*_\(.*\)/\1/' <<< "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116"
WD2DWMC4N2575116
Or, as pointed out in a comment, you can just remove all characters from the beginning of the line up to the last _:
sed 's/.*_//' file
echo "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116" | rev | awk -F '_' '{print $1}' | rev
It works only if the ID is at the end.
Another in awk, this time using sub:
Data:
$ cat file
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
Code + result:
$ awk 'sub(/^.*_/,"")' file
WD2DWMC4N2575116
i.e. replace everything from the first character to the last _. As sub returns the number of substitutions made, that value is used to trigger the implicit output. If you have several records to process and not all of them have _s, add ||1 after the sub:
$ cat foo >> file
$ awk 'sub(/^.*_/,"") || 1' file
WD2DWMC4N2575116
foo
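If the ID is already in a shell variable, plain parameter expansion avoids external commands entirely; a minimal sketch (POSIX shell):

id='t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116'
echo "${id##*_}"   # strips the longest prefix ending in _, leaving WD2DWMC4N2575116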

Print lines whose 1st and 4th column differ

I have a file with a bunch of lines of this form:
12 AAA 423 12 BBB beta^11 + 3*beta^10
18 AAA 1509 18 BBB -2*beta^17 - beta^16
18 AAA 781 12 BBB beta^16 - 5*beta^15
Now I would like to print only the lines where the 1st and the 4th column differ (the columns are space-separated; the values AAA and BBB are fixed). I know I can do that by getting all possible values in the first column and then using:
for i in $values; do
cat file.txt | grep "^$i" | grep -v " $i BBB"
done
However, this runs through the file as many times as there are distinct values in the first column. Is there a way to do it in a single pass? I think I can manage the comparison; my main problem is that I have no idea how to extract the space-separated columns.
This is something quite straightforward for awk:
awk '$1 != $4' file
With awk, you refer to the first field with $1, the second with $2 and so on. This way, you can compare the first and the fourth with $1 != $4. If this is true (that is, $1 and $4 differ), awk performs its default action: print the current line.
For your sample input, this works:
$ awk '$1 != $4' file
18 AAA 781 12 BBB beta^16 - 5*beta^15
Note you can define a different field separator with -v FS="...". This way, you can tell awk that your fields are separated by tabs, commas, etc. All together it would look like this: awk -v FS="\t" '$1 != $4' file.
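For clarity, the one-liner relies on awk's default action; written out explicitly it is equivalent to:

awk '$1 != $4 { print $0 }' file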

Print lines where first column matches, second column different

In a text file, how do I print out only the lines where the first column is duplicated but the 2nd column is different? I want to reconcile these differences, possibly using awk/sed/bash.
Input:
Jon AAA
Jon BBB
Ellen CCC
Ellen CCC
Output:
Jon AAA
Jon BBB
Note that the real file is not sorted.
Thanks for any help.
This should do it (I broke the one-liner into 3 lines for readability):
awk '!($1 in a) {a[$1]=$2;next}
     $1 in a && $2!=a[$1]{p[$1 FS $2];p[$1 FS a[$1]]}
     END{for(x in p)print x}' file
Line 1 saves $2 under key $1 in the array a the first time that key is seen.
Line 2: for an existing $1 with a different $2, put both lines into the array p, so that the same $1,$2 combination won't be printed multiple times.
The END block prints the indices of the array p.
sort file | uniq -u
will print only the lines that appear exactly once.
This might work for you:
sort file | uniq -u | rev | uniq -Df1 | rev
This sorts the file, removes any fully duplicated lines (uniq -u), reverses each line so the key moves to the end, keeps only lines that share a key with another line (uniq -D, ignoring the first field), and then reverses each line back to its original form.
This will drop duplicate lines and lines with singleton keys.
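To see why this works, here is each stage applied to the sample input (uniq -D is a GNU extension; output shown as expected, not captured from a live run):

$ sort file | uniq -u                       # drops the duplicated "Ellen CCC" pair
Jon AAA
Jon BBB
$ sort file | uniq -u | rev                 # moves the key to the end of the line
AAA noJ
BBB noJ
$ sort file | uniq -u | rev | uniq -Df1     # keeps lines sharing a key, ignoring field 1
AAA noJ
BBB noJ
$ sort file | uniq -u | rev | uniq -Df1 | rev
Jon AAA
Jon BBB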
A plain removal of duplicate lines should work:
awk '!a[$0]++' test
