Command output with empty values to csv - bash

> lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME
NAME      LABEL FSTYPE MOUNTPOINT      SIZE TYPE
nvme0n1                               894.3G disk
nvme0n1p1              [SWAP]             4G part
nvme0n1p2                                 1G part
nvme0n1p3 root         /home/cg/root  889.3G part
I need the output of this command in csv format, but all the methods I've tried so far don't handle the empty values correctly, generating bad rows like the ones I got with sed:
> lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME | sed -E 's/ +/,/g'
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,894.3G,disk
nvme0n1p1,[SWAP],4G,part
nvme0n1p2,1G,part
nvme0n1p3,root,/home/cg/root,889.3G,part
Any idea how to add the extra commas for the empty fields? The desired output would be:
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,,,,894.3G,disk

Make sure that the fields that may be empty are at the end of the line, and then re-arrange them into the required sequence:
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,LABEL -x NAME | awk '{ print $1,";",$6,";",$4,";",$5,";",$2,";",$3 }'
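A small refinement (my addition, not part of the answer above): letting OFS supply the separator avoids the spaces that the comma-separated print arguments put around each ';' and yields real CSV. Note the reordering trick still assumes every empty column ends up at the end of the line; a row like nvme0n1p1 above, where FSTYPE is empty but MOUNTPOINT is not, would still have its fields shifted.
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,LABEL -x NAME |
  awk 'BEGIN{OFS=","} {print $1, $6, $4, $5, $2, $3}'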

Just:
lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME -r | tr ' ' ','
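With the sample device list above, this should produce the desired CSV directly: raw mode (-r) prints exactly one space between columns and leaves empty cells empty, so consecutive spaces become consecutive commas, and values containing blanks are hex-escaped (e.g. \x20) rather than split. Expected output, assuming the data shown in the question:
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,,,,894.3G,disk
nvme0n1p1,,,[SWAP],4G,part
nvme0n1p2,,,,1G,part
nvme0n1p3,root,,/home/cg/root,889.3G,part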

Not really bash, but a quick and dirty Perl would be something like:
my $state = 0;
my @input = <>;
my $maxlength = 0;
for my $line (0 .. $#input) {
    my $curlength = length($input[$line]);
    if ($curlength > $maxlength) { $maxlength = $curlength; }
}
my $fill = ' ' x $maxlength;
for my $line (0 .. $#input) {
    chomp $input[$line];
    $input[$line] = "$input[$line] $fill";
}
for (my $pos = 0; $pos < $maxlength; $pos++) {
    my $spacecol = 1;
    for my $line (0 .. $#input) {
        if (substr($input[$line], $pos, 1) ne ' ') {
            $spacecol = 0;
        }
    }
    if ($spacecol == 1) {
        for my $line (0 .. $#input) {
            substr($input[$line], $pos, 1) = ';';
        }
    }
}
for my $line (0 .. $#input) {
    print "$input[$line]\n";
}

Assumptions:
output format is fixed-width
header record does not contain any blank fields
no fields contain white space (ie, only white space occurs between fields)
Design overview:
parse the header to get the initial index for each field; if all columns were left-justified this would be all we need to do; however, with the existence of right-justified columns (eg, SIZE) we need to look for right-justified values that are longer than the associated header field (ie, the value starts at a lower index than the associated header)
for non-header rows we loop through our set of potential fields, using substr()/match() to find the non-space fields in the line and ...
if said field starts and ends before the next field's index then add the field's value to our output variable but ...
if said field starts before next field's index but ends after next field's index then we're looking at a right-justified value of the next field which happens to have an earlier index than the associated header's index; in this case update the index for the next field and add a blank value (for the current field) to our output variable
if said field starts after the index of the next field then the current field is empty; again, add the empty/blank value to our output variable
once we've completed processing a line of input print the output to stdout
One awk idea:
awk '
BEGIN { OFS="," }

# use header record to determine initial set of indexes
FNR==1 { maxNF=NF
         header=$0
         out=sep=""
         for (i=1;i<=maxNF;i++) {
             match(header,/[^[:space:]]+/)                        # find first non-space string
             ndx[i]=ndx[i-1] + prevlen + RSTART - (i==1 ? 0 : 1)  # make note of index
             out=out sep substr(header,RSTART,RLENGTH)            # add value to our output variable
             sep=OFS
             prevlen=RLENGTH                                      # needed for next pass through loop
             header=substr(header,RSTART+RLENGTH)                 # strip off matched string and repeat loop
         }
         print out                                                # print header to stdout
         ndx[1]=1                 # in case 1st field is right-justified, override index and set to 1
         next
       }

# for the rest of the records determine which fields are empty and/or which fields need the associated index updated
       { out=sep=""
         for (i=1;i<maxNF;i++) {                                  # loop through all but last field
             restofline=substr($0,ndx[i])                         # work with current field thru to end of line
             if ( match(restofline,/[^[:space:]]+/) )             # if we find a non-space match ...
                 if ( ndx[i]-1+RSTART < ndx[i+1] )                # if match starts before index of next field and ...
                     if ( ndx[i]-1+RSTART+RLENGTH < ndx[i+1] )    # ends before index of next field then ...
                         out=out sep substr(restofline,RSTART,RLENGTH)  # append value to our output variable
                     else {                       # else if match finishes beyond index of next field then ...
                         out=out sep ""                           # this field is empty and ...
                         diff=ndx[i+1]-(ndx[i]+RSTART-1)          # figure the difference and ...
                         ndx[i+1]-=diff                           # update the index for the next field
                     }
             else                                                 # current field is empty
                 out=out sep ""
             sep=OFS
         }
         field=substr($0,ndx[maxNF])                              # process last field
         gsub(/[[:space:]]/,"",field)                             # remove all remaining spaces
         print out, field                                         # print new line to stdout
       }
' lsblk.out
This generates:
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,,,,894.3G,disk
nvme0n1p1,,,[SWAP],4G,part
nvme0n1p2,,,,1G,part
nvme0n1p3,root,,/home/cg/root,889.3G,part

Related

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed from which I save the values, and then a python script in which I combine the data from the different files into a single csv file. However, this is quite slow and I want to speed it up.
The file, let's call it my_file_1.txt, has a structure that looks something like this:
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like it could be done more easily with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified the awk version, since gensub is a GNU awk extension and is not available in all implementations.
BEGIN {
    counter = 1
    OFS = ","    # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"    # Print the header line
}
/start value/ {
    startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
    epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
    stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""    # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
I accepted this answer because it is the most understandable.
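Since gensub is a GNU awk extension, running this variant would look something like the following (the file name extract_gensub.awk is my placeholder, not part of the answer):
gawk -f extract_gensub.awk my_file_1.txt > my_data.csv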
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided, but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
    print FILENAME, f["start"], f["stop"], f["epoch"], ++run
    delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
    counter = 1
    OFS = ","    # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"    # Print the header line
}
/start value/ {
    startValue = $NF    # when a line contains "start value" store the last field as startValue
}
/epoch/ {
    epoch = $NF
}
/stop value/ {
    stopValue = $NF
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""    # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed, the first to format everything except the file name and the second to embed the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.
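As a rough illustration of that flow (my reconstruction, with placeholders rather than real values): the stream between the two sed invocations has the file name, emitted by the F command, on its own line ahead of each data line, and the second sed joins each such pair:
file,start,stop,epoch,run
my_file_1.txt
,<start>,<stop>,<epoch>,<run>
my_file_1.txt
,<start>,<stop>,<epoch>,<run>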

Search field and display next data to it

Is there an easy way to search the following data for a specific field, based on the field ##id:?
This is the sample data file called sample
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
For example, if I want to search for an ID of 123 and get the gender, I would do this.
Basically this is the prototype that I want:
#!/bin/bash
# search.sh
# usage: search.sh <id> <field>
# eg: search.sh 123 age
search="$1"
field="$2"
grep "^##id: ${search}" sample | # FILTER <FIELD>
So when I search an ID 123 like below:
search.sh 123 gender
The output would be
Male
Up until now, based on the code above, I am only able to grep one line based on the ID, and I'm not sure what the best, fastest and least complicated method is to get the value that follows the specified field (eg. age).
1st solution: With your shown samples, please try the following bash script. This assumes that you want an exact string match.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match($0,"##id:[[:space:]]*"search){
    value=""
    match($0,"##"field":[[:space:]]*[^#]+")
    value=substr($0,RSTART,RLENGTH)
    sub(/.*: +/,"",value)
    print value
}
' Input_file
2nd solution: In case you want to search strings (values) irrespective of their case (lower/upper) in each line, then try the following code.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match(tolower($0),"##id:[[:space:]]*"tolower(search)){
    value=""
    match(tolower($0),"##"tolower(field)":[[:space:]]*[^#]+")
    value=substr($0,RSTART,RLENGTH)
    sub(/.*: +/,"",value)
    print value
}
' Input_file
Explanation: A simple explanation of the code: we create a bash script that expects 2 parameters when it is run. These parameters are passed as values to the awk program, which then uses the match function to match the id in each line and print the value of the passed field (eg: name, Gender, etc.).
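Assuming the second (case-insensitive) version is saved as script.bash, made executable, and Input_file is replaced with the actual data file (sample here), a run would look like:
$ chmod +x script.bash
$ ./script.bash 123 gender
Male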
Since you want to extract a part of each line found, different from the part you are matching against, sed or awk would be a better tool than grep. You could pipe the output of grep into one of the others, but that's wasteful because both sed and awk can do the line selection directly. I would do something like this:
#!/bin/bash
search="$1"
field="$2"
sed -n "/^##id: ${search}"'\>/ { s/.*##'"${field}"': *//i; s/ *##.*//; p }' sample
Explanation:
sed is instructed to read file sample, which it will do line by line.
The -n option tells sed to suppress its usual behavior of automatically outputting its pattern space at the end of each cycle, which is an easy way to filter out lines that don't match the search criterion.
The sed expression starts with an address, which in this case is a pattern matching lines by id, according to the script's first argument. It is much like your grep pattern, but I append \>, which matches a word boundary. That way, searches for id 123 will not also match id 1234.
The rest of the sed expression edits out everything in the line except the value of the requested field, with the field name being matched case-insensitively, and prints the result. The editing is accomplished by the two s/// commands, and the p command is of course for "print". These are all enclosed in curly braces ({}) and separated by semicolons (;) to form a single compound associated with the given address.
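For illustration (my example), assuming the script above is saved as search.sh, the \> word boundary keeps a shorter id from matching a longer one:
$ ./search.sh 123 gender
Male
$ ./search.sh 12 gender        # prints nothing: \> stops a partial match against id 123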
Assumptions:
'label' fields have format ##<string>:
need to handle case-insensitive searches
'label' fields could be located anywhere in the line (ie, there is no set ordering of 'label' fields)
the 1st input search parameter is always a value associated with the ##id: label
the 2nd input search parameter is to be matched as a whole word (ie, no partial label matching; nam will not match against ##name:)
if there are multiple 'label' fields that match the 2nd input search parameter, we print the value associated with the 1st match found in the line
One awk idea:
awk -v search="${search}" -v field="${field}" '
BEGIN { field = tolower(field) }
{ n=split($0,arr,"##|:")                   # split current line on dual delimiters "##" and ":", place fields into array arr[]
  found_search = 0
  found_field = 0
  for (i=2;i<=n;i=i+2) {                   # loop through list of label fields
      label=tolower(arr[i])
      value = arr[i+1]
      sub(/^[[:space:]]+/,"",value)        # strip leading white space
      sub(/[[:space:]]+$/,"",value)        # strip trailing white space
      if ( label == "id" && value == search )
          found_search = 1
      if ( label == field && ! found_field )
          found_field = value
  }
  if ( found_search && found_field )
      print found_field
}
' sample
Sample input:
$ cat sample
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
##name: Archibald P. Granite, III, Ph.D, M.D. ##age: 20 ##Gender: not specified ##id: 567
Test runs:
search=123 field=gender => Male
search=123 field=ID => 123
search=123 field=Age => 18
search=345 field=name => Sarah Benson
search=567 field=name => Archibald P. Granite, III, Ph.D, M.D.
search=567 field=GENDER => not specified
search=999 field=age => <no output>
For the given data format, you could set the field separator to optional spaces followed by ## to prevent trailing spaces for the printed field.
Then create a key value mapping per row (making the keys and the field to search for lowercase) and search for the key, which will be independent of the order in the string.
If the key is present, then print the value.
#!/bin/bash
search="$1"
field="$2"
awk -v search="${search}" -v field="${field}" '
BEGIN { FS = "[[:blank:]]*##" }                      # Set field separator to optional spaces followed by ##
{
    for (i = 1; i <= NF; i++) {                      # Loop over all the fields
        split($i, a, /[[:blank:]]*:[[:blank:]]*/)    # Split the field on : with optional surrounding spaces
        kv[tolower(a[1])] = a[2]                     # Create a key-value array using the split values
    }
    val = kv[tolower(field)]                         # Get the value from kv based on the lowercase key
    if (kv["id"] == search && val) print val         # If there is a matching key and a value, print the value
}' file
And then run
./search.sh 123 gender
Output
Male

Search a CSV file for a value in the first column, if found shift the value of second column one row down

I have CSV files that look like this:
786,1702
787,1722
-,1724
788,1769
789,1766
I would like to have a bash command that searches the first column for the - and if found then shifts the values in the second column down. The - reccurr several times in the first column and would need to start from the top to preserve the order of the second column.
The second column would be blank
Desired output:
786,1702
787,1722
-,
788,1724
789,1769
790,1766
So far I have: awk -F ',' '$1 ~ /^-$/' filename.csv to find the hyphens, but shifting the 2nd column down is tricky...
Assuming that the left column continues with incremental IDs, so the right column can keep shifting down until the stack is empty:
awk 'BEGIN{start=0;FS=","}$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1,stack[i]}}' filename.csv
# or
<filename.csv awk -F, -v start=0 '$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1,stack[i]}}'
Or, explained:
I am using a shifted stack here to avoid rewriting indexes, with start as the pointer to the first useful element of the stack and stacklen marking its end. This avoids the costly operation of shifting all array elements whenever we want to remove the first one.
chmod +x shift_when_dash
./shift_when_dash filename.csv
with shift_when_dash being an executable file containing:
#!/usr/bin/awk -f
BEGIN {             # Everything in this block is executed once before opening the file
    start = 0       # Needed because we are using it in a scalar context before initialization
    FS = ","        # Input field separator is a comma
}
$1 == "-" {                        # We match the special case where the first column is a simple dash
    stack[stacklen++] = $2         # We store the second column on top of our stack
    print $1 ","                   # We print the dash without a second column as asked by OP
    next                           # We stop processing the current record and go on to the next record
}
stacklen - start {                 # In case we still have something in our stack
    stack[stacklen++] = $2         # We store the current 2nd column on the stack
    print $1 "," stack[start]      # We print the current ID with the first stacked element
    delete stack[start++]          # Free up some memory and increment our pointer
    next
}
1                                  # We print the line as-is, without any modification.
                                   # This applies to lines which were not skipped by the
                                   # 'next' statements above, so in our case all lines before
                                   # the first dash is encountered.
END {
    for (i = start; i < stacklen; i++) {     # For every element remaining in the stack after the last line
        print $1-start+i+1 "," stack[i]      # We print a new incremental id with the stack element
    }
}
next is an awk statement similar to continue in other languages, with the difference that it skips to the next input line instead of the next loop element. It is useful to emulate a switch-case.
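A tiny sketch of that idea (mine, not part of the answer): each pattern-action block handles one case, and next keeps the later blocks from firing on the same line, much like break at the end of a switch-case branch.
awk -F, '
$1 == "-"       { print "dash row";    next }
$1 ~ /^[0-9]+$/ { print "numeric row"; next }
                { print "other row" }
' filename.csv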

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to find a matching record. For example, match a number (7789) on a line starting with XYZ and, once matched, look for a matching number (7345) in the lines below starting with 1, until a line starting with 9 is reached. Then retrieve the entire line record. How can I accomplish this using a shell script, awk, sed or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disabled automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which is coming
# after the previous match
/^9$/ { } # Match line that is 9
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively. The names are arbitrary.) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
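As noted above, the two values can come straight from shell variables; a minimal hypothetical wrapper (the script name and argument layout are my own) might look like:
#!/bin/bash
# usage: find_record.sh <record-number> <field-number> <file>
rid="$1"
fid="$2"
awk -v rid="$rid" -v fid="$fid" -v RS='\n9\n' -F '\n' \
    'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' "$3"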
EDIT: Misread question the first time around.
An extendable awk script can be:
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
Set flag s when a line starts with XYZ and contains 7789; reset it when a line is just 9; print when the flag is set and the line contains the pattern 7345.
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

Count bytes in a field

I have a file that looks like this:
ASDFGHJ|ASDFEW|ASFEWFEWAFEWASDFWE FEWFDWAEWA FEWDWDFEW|EWFEW|ASKOKJE
IOJIKNH|ASFDFEFW|ASKDFJEO JEWIOFJS IEWOFJEO SJFIEWOF WE|WEFEW|ASFEWAS
I'm having trouble with this file because it's written in Cyrillic and the database complains about the number of bytes (vs the number of characters). I want to check if, for example, the first field is larger than 10 bytes, the second field is larger than 30 bytes, etc.
I've been trying a lot of different things: awk, wc... I know with wc -c I can count bytes, but how can I retrieve only the lines that have a field that is larger than X?
Any idea?
If you are open to using perl then this could help. I have added comments to make it easier for you to follow:
#!/usr/bin/perl
use strict;
use warnings;
use bytes;

## Change the file to the path where your file is located
open my $data, '<', 'file';

## Define an array with acceptable sizes for each field
my @size = qw( 10 30 ... );

LINE: while (<$data>) {    ## Read one line at a time
    chomp;                 ## Remove the newline from each line read

    ## Split the line on | and store each field in an array
    my @fields = split /\|/;

    for ( 0 .. $#fields ) {    ## Iterate over the array
        ## If the size is less than the desired size, move to the next line
        next LINE unless bytes::length($fields[$_]) > $size[$_];
    }

    ## If all sizes matched, print the line
    print "$_\n";
}
Here's a Perl one-liner that prints the whole line if the field in bytes is longer than the respective member in an array @m:
perl -F'\|' -Mbytes -lane '@m=(10,10,30,10); print if grep { bytes::length $_ > shift @m } @F' file
As the name suggests, bytes::length ignores the encoding and returns the length of each field in bytes. The -a switch to Perl enables auto-split mode, which creates an array @F containing all the fields. I've used the pipe | as the delimiter (it needs escaping with a backslash). The -l switch removes the newline from the end of the line, ensuring that your final field is the correct length.
The -n switch tells Perl to loop through each line in the file. grep filters the array @F on the condition in the block. I'm using shift to remove and return the first element of @m, so that each field in @F is being compared with the respective element in @m. The filtered list will evaluate to true in this context if it contains any elements (i.e. if any of the fields were longer than their limit).
To obtain the number of bytes in a certain FIELD on a certain LINE you can issue the following awk command:
awk -F'|' -v LINE=1 -v FIELD=3 'NR==LINE{print $FIELD}' input.txt | wc -c
To print the number of bytes for every field you may use a little loop:
awk -F'|' '{for(i=1;i<NF;i++)print $i}' a.txt | \
while read field ; do
nb=$(wc -c <<<"$field")
echo "$field $nb"
# Check if the field is too long
if [ "$nb" -gt 40 ] ; then
echo "field $field is too long"
exit 1
fi
done
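An awk-only check is also possible (a sketch of mine, assuming a 30-byte limit on the third field): forcing the C locale makes length() count bytes rather than characters, so the offending lines can be selected without a shell loop.
LC_ALL=C awk -F'|' 'length($3) > 30' input.txt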
