sed | awk : Keep end of String until special character is reached - bash

I'm trying to trim HDD IDs with sed so that only the serial number of the drive remains. The IDs look like:
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
So, I only want to keep the "WD2DWMC4N2575116". Serial numbers are not fixed length, so I tried to keep everything from the last character back to the first "_". Unfortunately I suck at RegExp :(

To capture all characters after last _, using backreference:
$ sed 's/.*_\(.*\)/\1/' <<< "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116"
WD2DWMC4N2575116
Or as pointed out in comment, you can just remove all characters from beginning of the line up to last _:
sed 's/.*_//' file
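If the ID is already in a shell variable, the same "drop everything up to the last _" idea can be sketched without sed, using the shell's longest-prefix removal. This is an alternative technique, not the one from the answer above, and the variable names id and serial are only illustrative:

```bash
#!/usr/bin/env bash
# Hypothetical variable holding the ID from the question.
id='t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116'

# ${var##pattern} removes the longest prefix matching the glob pattern,
# here everything up to and including the last underscore.
serial=${id##*_}

printf '%s\n' "$serial"
```

This avoids starting an external process, which may matter when many IDs are processed in a loop.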

echo "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116" | rev | awk -F '_' '{print $1}' | rev
This works only if the serial number is at the end of the line.

Another in awk, this time using sub:
Data:
$ cat file
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
Code + result:
$ awk 'sub(/^.*_/,"")' file
WD2DWMC4N2575116
i.e. it replaces everything from the first character through the last _ with nothing. As sub returns the number of substitutions made, that value is used to trigger the implicit output. If you have several records to process and not all of them have _s, add ||1 after the sub:
$ echo foo >> file
$ awk 'sub(/^.*_/,"") || 1' file
WD2DWMC4N2575116
foo
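The effect of the || 1 can be checked without touching any file. A minimal sketch, where the helper name strip_to_last_underscore is made up for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: strips everything up to the last underscore and
# prints every line, including lines that contain no underscore at all.
strip_to_last_underscore() {
  awk 'sub(/^.*_/, "") || 1'
}

result=$(printf '%s\n' \
  't10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116' \
  'foo' | strip_to_last_underscore)

printf '%s\n' "$result"
```

Without the || 1, the foo line would be dropped, because sub returns 0 for it and the pattern is then false.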

Related

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get a list of the users in my CSV file into a bash variable. The problem is that the number of users varies and can be anywhere from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How can I remove all the trailing "," from my result? Sometimes there is just one, but sometimes the result can be user1,,,,
Can I do this in a better way? Users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
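For the literal question of stripping trailing commas from an existing result, a minimal sketch with sed is shown here. The variable names result and trimmed are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Sample value taken from the question.
result='user2,user3,user4,,'

# ,*$ matches any run of commas anchored at the end of the line.
trimmed=$(printf '%s\n' "$result" | sed 's/,*$//')

printf '%s\n' "$trimmed"
```

The same substitution also handles the user1,,,, case, since the run of commas can be of any length.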
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
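If only the users of one record are wanted, as in the question, the cut and tr steps above can be restricted with grep first. A sketch under that assumption, using a temporary copy of the sample data; the variable names csv and list_of_users are illustrative:

```bash
#!/usr/bin/env bash
# Recreate the sample file from the question in a temporary location.
csv=$(mktemp)
cat > "$csv" <<'EOF'
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
EOF

# Keep only the wanted record, drop the first three fields, strip the quotes.
list_of_users=$(grep -F '"record2_data2"' "$csv" | cut -d, -f4- | tr -d '"')
rm -f "$csv"

printf '%s\n' "$list_of_users"
```

Because cut -f4- prints only the fields that exist, no trailing commas appear, whatever the number of users.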
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The first expression deletes every " by substituting it globally with the empty string. The second applies only to lines containing record2_data: it substitutes (s) everything up to and including the 3rd , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
gsub(/"|,$/,"",o);
print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.

how to use cut command -f flag as reverse

This is a text file called a.txt
ok.google.com
abc.google.com
I want to select every subdomain separately
cat a.txt | cut -d "." -f1 (this selects ok from the left side)
cat a.txt | cut -d "." -f2 (this selects google from the left side)
Is there any way to select fields counting from the right side instead?
cat a.txt | cut (so that it selects com from the right side)
There are a few ways to do this. One is a rev + cut + rev pipeline: the first rev reverses each line, cut splits on . and picks fields counted from the left (which, because the line is reversed, are really counted from the right), and the second rev restores the original character order.
rev Input_file | cut -d'.' -f 1 | rev
You can use awk to print the last field:
awk -F. '{print $NF}' a.txt
-F. sets the field separator to "."
$NF is the last field
And you can give your file directly as an argument, so you can avoid the famous "Useless use of cat"
For other fields counted from the end, you can use expressions, as suggested in the comment by @sundeep or described in the user's guide under
4.3 Nonconstant Field Numbers. For example, to get the domain before the TLD, you can subtract 1 from the number of fields NF:
awk -F. '{ print $(NF-1) }' a.txt
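The $(NF-1) idea can be generalized to "the n-th field from the right". A small sketch, where the function name field_from_right and its argument are made up for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the n-th dot-separated field counted from the
# right (1 = last field). Lines with fewer than n fields are skipped.
field_from_right() {
  awk -F. -v n="$1" 'NF >= n { print $(NF - n + 1) }'
}

tld=$(printf 'ok.google.com\n' | field_from_right 1)
domain=$(printf 'ok.google.com\n' | field_from_right 2)

printf '%s %s\n' "$tld" "$domain"
```

The NF >= n guard is a design choice for this sketch: without it, a too-large n would make awk evaluate $0 or a negative field number.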
You might use sed with a quantifier for the grouped value repeated till the end of the string.
( Start group
\.[^[:space:].]+ Match 1 dot and 1+ occurrences of any char except a space or dot
){1} Close the group followed by a quantifier
$ End of string
Example
sed -E 's/(\.[^[:space:].]+){1}$//' file
Output
ok.google
abc.google
If the quantifier is {2} the output will be
ok
abc
Depending on what you want to do after getting the values then you could use bash for splitting your domain into an array of its components:
#!/bin/bash
IFS=. read -ra comps <<< "ok.google.com"
echo "${comps[-2]}"
# or for bash < 4.2
echo "${comps[${#comps[@]}-2]}"
google

Replace every 4th occurence of char "_" with "#" in multiple files

I am trying to replace every 4th occurrence of "_" with "#" in multiple files with bash.
E.g.
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo..
would become
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo...
#perl -pe 's{_}{++$n % 4 ? $& : "#"}ge' *.txt
I have tried perl but the problem is this replaces every 4th _ carrying on from the last file. So for example, some files the first _ is replaced because it is not starting each new file at a count of 0, it carries on from the previous file.
I have tried:
#awk '{for(i=1; i<=NF; i++) if($i=="_") if(++count%4==0) $i="#"}1' *.txt
but this also does not work.
Using sed I cannot find a way to keep replacing every 4th occurrence, as each file has a different number of _. Some files have 20 _, some have 200 _. Therefore, I can't specify a range.
I am really lost what to do, can anybody help?
You just need to reset the counter in the perl one using eof to tell when it's done reading each file:
perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' *.txt
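The same reset-per-file idea can also be sketched in plain awk, where FNR == 1 marks the first line of each input file. This is an alternative to the perl fix, not part of it; the function name replace_every_4th and the sample file names are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace every 4th underscore with #, counting across
# lines but restarting the count at the start of each file.
replace_every_4th() {
  awk '
    FNR == 1 { n = 0 }            # first line of a file: restart the count
    {
      rest = $0; out = ""
      while ((i = index(rest, "_")) > 0) {
        n++
        out = out substr(rest, 1, i - 1) (n % 4 == 0 ? "#" : "_")
        rest = substr(rest, i + 1)
      }
      print out rest
    }
  ' "$@"
}

# Two small sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a_b_c\nd_e_f_g\n' > "$tmpdir/one.txt"
printf 'p_q_r_s_t\n' > "$tmpdir/two.txt"
result=$(replace_every_4th "$tmpdir/one.txt" "$tmpdir/two.txt")
rm -rf "$tmpdir"

printf '%s\n' "$result"
```

In the sample, the 4th underscore of one.txt falls on its second line, and two.txt starts counting from zero again, so its own 4th underscore is the one replaced.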
This MAY be what you want, using GNU awk for RT:
$ awk -v RS='_' '{ORS=(FNR%4 ? RT : "#")} 1' file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo..
It only reads each _-separated string into memory 1 at a time so should work no matter how large your input file, assuming there are _s in it.
It assumes you want to replace every 4th _ across the whole file as opposed to within individual lines.
A simple sed would handle this:
s='foo_foo_foo_foo_foo_foo_foo_foo_foo_foo'
sed -E 's/(([^_]+_){3}[^_]+)_/\1#/g' <<< "$s"
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
Explanation:
(: Start capture group #1
([^_]+_){3}: Match 1+ of non-_ characters followed by a _. Repeat this group 3 times to match 3 such words separated by _
[^_]+: Match 1+ of non-_ characters
): End capture group #1
_: Match a _
Replacement is \1# to replace 4th _ with a #
With GNU sed:
sed -nsE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
-n suppresses the automatic printing, -s processes each file separately, -E uses extended regular expressions.
The script is a loop between label a (:a) and the branch-to-label-a command (ba). Each iteration appends the next line of input to the pattern space (N). This way, after the last line has been read, the pattern space contains the whole file(*). During the last iteration, when the last line has been read ($), a substitute command (s) replaces every 4th _ in the pattern space by a # (s/(([^_]*_){3}[^_]*)_/\1#/g) and prints (p) the result.
Once you are satisfied with the result, you can change the options:
sed -i -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, or:
sed -i.bkp -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, but keep a *.txt.bkp backup of each file.
(*) Note that if you have very large files this could cause memory overflows.
With your shown samples, please try the following awk program. It uses an awk variable named fieldNum, set to 4 here because the OP wants a # in place of every 4th _; change it to suit your needs.
awk -v fieldNum="4" '
BEGIN{ FS=OFS="_" }
{
val=""
for(i=1;i<=NF;i++){
val=val $i (i<NF?(i%fieldNum==0?"#":OFS):"")
}
print val
}
' Input_file
With GNU awk
$ cat ip.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
123_45678_90
_
$ awk -v RS='(_[^_]+){3}_' -v ORS= '{sub(/_$/, "#", RT); print $0 RT}' ip.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
123_45678_90
#
-v RS='(_[^_]+){3}_' set input record separator to cover sequence of four _ (text matched by this separator will be available via RT)
-v ORS= empty output record separator
sub(/_$/, "#", RT) change last _ to #
Use -i inplace for inplace editing.
If the count should reset for each line:
perl -pe's/(?:_[^_]*){3}\K_/\#/g'
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
If the count shouldn't reset for each line, but should reset for each file:
perl -0777pe's/(?:_[^_]*){3}\K_/\#/g'
The -0777 causes the whole file to be treated as one line. This makes the count work properly across lines.
But since a new match is used for each file, the count is reset between files.
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -0777pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
To avoid reading the entire file at once, you could continue using the counter-based approach from the question, but with the following added:
$n = 0 if eof;
Note that eof is not the same thing as eof()! See perldoc -f eof.

Get last four characters from a string

I am trying to parse the last 4 characters of Mac serial numbers from terminal. I can grab the serial with this command:
serial=$(ioreg -l |grep "IOPlatformSerialNumber"|cut -d ""="" -f 2|sed -e s/[^[:alnum:]]//g)
but I need to output just the last 4 characters.
Found it in a linux forum echo ${serial:(-4)}
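A minimal sketch of that expansion, assuming bash and a made-up sample serial. The space before -4 (or the parentheses used above) is needed so that bash does not read :- as the "use default value" operator:

```bash
#!/usr/bin/env bash
# Sample serial number for illustration only.
serial='A02UV13KDNMJ'

# Negative offset: take the substring starting 4 characters from the end.
last4=${serial: -4}

printf '%s\n' "$last4"
```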
Using a shell parameter expansion to extract the last 4 characters after the fact works, but you could do it all in one step:
ioreg -k IOPlatformSerialNumber | sed -En 's/^.*"IOPlatformSerialNumber".*(.{4})"$/\1/p'
ioreg -k IOPlatformSerialNumber returns far fewer lines than ioreg -l, so it speeds up the operation considerably (about 80% faster on my machine).
The sed command matches the entire line of interest, and replaces it with the last 4 characters before the " that ends the line; i.e., it returns the last 4 chars. of the value.
Note: The ioreg output line of interest looks something like this:
| "IOPlatformSerialNumber" = "A02UV13KDNMJ"
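The sed command can be tried on any machine by feeding it a line shaped like the sample above instead of real ioreg output. A sketch under that assumption, with illustrative variable names line and last4:

```bash
#!/usr/bin/env bash
# Stand-in for one line of ioreg output, modelled on the sample above.
line='    |   "IOPlatformSerialNumber" = "A02UV13KDNMJ"'

# Same sed expression as in the answer: capture the 4 characters that
# precede the closing double quote at the end of the line.
last4=$(printf '%s\n' "$line" |
  sed -En 's/^.*"IOPlatformSerialNumber".*(.{4})"$/\1/p')

printf '%s\n' "$last4"
```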
As for your original command: cut -d ""="" is the same as cut -d = - the shell simply removes the empty strings around the = before cut sees the value. Note that cut only accepts a single delimiter char.
You can also do: grep -Eo '.{4}$' <<< "$serial"
I don't know what the output of ioreg -l looks like, but it seems to me that you are using a lot of pipes for something that awk alone could handle:
use = as field separator
vvv
awk -F= '/IOPlatformSerialNumber/ { #match lines containing IOPlatform...
gsub(/[^[:alnum:]]/, "", $2) # remove all non-alphanumeric chars from 2nd field
print substr($2, length($2)-3, length($2)) # print last 4 characters
}'
Or even sed (a bit ugly because the pattern is repeated): catch the first 4 alphanumeric characters occurring after the first =:
sed -rn '/IOPlatformSerialNumber/{
s/^[^=]*=[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9])[^a-zA-Z0-9]*([a-zA-Z0-9]).*$/\1\2\3\4/;p
}'
Test
$ cat a
aaa
bbIOPlatformSerialNumber=A_+23B/44C//55=ttt
IOPlatformSerialNumber=A_+23B/44C55=ttt
asdfasd
The last 4 alphanumeric characters between the 1st and 2nd = are 4C55:
$ awk -F= '/IOPlatformSerialNumber/ {gsub(/[^[:alnum:]]/, "", $2); print substr($2, length($2)-3, length($2))}' a
4C55
4C55
Without you posting some sample output of ioreg -l this is untested and a guess but it looks like all you need is something like:
ioreg -l | sed -r -n 's/IOPlatformSerialNumber=[[:alnum:]]+([[:alnum:]]{4})/\1/p'

Oneliner to calculate complete size of all messages in maillog

Ok guys I'm really at a dead end here, don't know what else to try...
I am writing a script for some e-mail statistics, one of the things it needs to do is calculate the complete size of all messages in the maillog, this is what I wrote so far:
egrep ' HOSTNAME sendmail\[.*.from=.*., size=' maillog | awk '{print $8}' |
tr "," "+" | tr -cd '[:digit:][=+=]' | sed 's/^/(/;s/+$/)\/1048576/' |
bc -ql | awk -F "." '{print $1}'
And here is a sample line from my maillog:
Nov 15 09:08:48 HOSTNAME sendmail[3226]: oAF88gWb003226:
from=<name.lastname@domain.com>, size=40992, class=0, nrcpts=24,
msgid=<E08A679A54DA4913B25ADC48CC31DD7F@domain.com>, proto=ESMTP,
daemon=MTA1, relay=[1.1.1.1]
So I'll try to explain it step by step:
First I grep through the file to find all the lines containing the actual "size", next I print the 8th field, in this case "size=40992,".
Next I replace all the comma characters with a plus sign.
Then I delete everything except the digits and the plus sign.
Then I replace the beginning of the line with a "(", and I replace the last extra plus sign with a ")" followed by "/1048576". So I get a huge expression looking like this:
"(1+2+3+4+5...+n)/1048576"
Because I want to add up all the individual message sizes and divide the total so I get the result in MB.
The last awk command is there because I get a decimal number and I really don't care about precision, so I just print the part before the decimal point.
The problem is, this doesn't work... And I could swear it was working at one point, could it be my expression is too long for bc to handle?
Thanks if you took the time to read through :)
I think a one-line awk script will work too. It matches any line that your egrep pattern matches, then for those lines it splits the eighth field on the = sign and adds the second part (the number) to the SUM variable. When it reaches the END of the file it prints the value of SUM/1048576 (i.e. the byte count in mebibytes).
awk '/ HOSTNAME sendmail\[.*.from=.*., size=/{ split($8,a,"=") ; SUM += a[2] } END { print SUM/1048576 }' maillog
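To check the awk approach on known numbers, here is a small sketch that runs it on a made-up three-line log. The addresses, queue IDs and sizes are invented for illustration, and printf is used instead of print only to fix the number of decimals:

```bash
#!/usr/bin/env bash
# Invented sample log: two "from=" lines with sizes, one unrelated "to=" line.
log=$(mktemp)
cat > "$log" <<'EOF'
Nov 15 09:08:48 HOSTNAME sendmail[3226]: oAF88gWb003226: from=<a@example.com>, size=1048576, class=0, nrcpts=1
Nov 15 09:08:49 HOSTNAME sendmail[3227]: oAF88gWb003227: to=<b@example.com>, delay=00:00:01, stat=Sent
Nov 15 09:09:10 HOSTNAME sendmail[3230]: oAF88gWb003230: from=<c@example.com>, size=524288, class=0, nrcpts=2
EOF

# Field 8 looks like "size=1048576,"; the part after = converts to a
# number because awk ignores the trailing comma in arithmetic.
total_mb=$(awk '/ HOSTNAME sendmail\[.*.from=.*., size=/ {
               split($8, a, "="); sum += a[2]
             }
             END { printf "%.2f\n", sum / 1048576 }' "$log")
rm -f "$log"

printf '%s\n' "$total_mb"
```

The two sizes add up to 1572864 bytes, i.e. one and a half mebibytes, so the sketch prints 1.50.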
bc chokes if there is no newline in its input, as happens with your expression. You have to change the sed part to:
sed 's/^/(/;s/+$/)\/1048576\n/'
The final awk will happily eat all your output if the total size is less than 1MB and bc outputs something like .03333334234. If you are not interested in the decimal part remove that last awk command and the -l parameter from bc.
I'd do it with this one-liner:
grep ' HOSTNAME sendmail[[0-9][0-9]*]:..*:.*from=..*, size=' maillog | sed 's|.*, size=\([0-9][0-9]*\), .*|\1+|' | tr -d '\n' | sed 's|^|(|; s|$|0)/1048576\n|' | bc

Resources