Analyze a control table with a shell script - bash

A shell script analyses a control table to get the right parameters for its processing.
Currently it is simple: grep finds the correct line, and awk '{print $n}' picks the right columns.
Columns are separated by spaces only. No special rules, just values separated by whitespace.
All is fine and working, the users like it.
As long as none of the columns is left empty, that is. It's OK to leave the last column empty, but if somebody does not fill in a column in the middle, it confuses the awk '{print $n}' logic.
Of course, one could ask the users to fill in every entry, or one could define ";" as the column delimiter and use ";;" wherever a value is skipped. However, I would prefer not to change the table style.
So the question is:
How do I effectively analyze a table that has blanks in its column values? The table looks like this:
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write                            3345       johnny         walker
Update             curiosity.org            jerry
What might be of some help:
If there is a value set in a column, it is (more or less precisely) aligned under its column header description.
Cheers,
Tarik

You don't say what your desired output is but this shows you the right approach:
$ cat tst.awk
NR==1 {
    print
    while ( match($0,/[^[:space:]]+[[:space:]]*/) ) {
        width[++i] = RLENGTH
        $0 = substr($0,RSTART+RLENGTH)
    }
    next
}
{
    i = 0
    while ( (fld = substr($0,1,width[++i])) != "" ) {
        gsub(/^ +| +$/,"",fld)
        printf "%-*s", width[i], (fld == "" ? "[empty]" : fld)
        $0 = substr($0,width[i]+1)
    }
    print ""
}
$
$ awk -f tst.awk file
ApplikationService ServerName    PortNumber ControlValue_1 ControlValue_2
Read               chavez.com    3599       john           doe
Write              [empty]       3345       johnny         walker
Update             curiosity.org [empty]    jerry          [empty]
It uses the width of each field in the title line to determine the width of every field in every line of the file, then replaces empty fields with the string "[empty]" and left-aligns every field just to pretty it up a bit.
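If you have GNU awk, the same header-derived widths can instead be handed to gawk's FIELDWIDTHS variable, letting gawk do the fixed-width splitting itself. A minimal sketch of that variant (gawk-only; it prints the fields single-space separated rather than re-padded):
$ cat tst_gawk.awk
NR==1 {
    print
    # build a space-separated width list from the header, e.g. "19 14 11 15 14"
    while ( match($0,/[^[:space:]]+[[:space:]]*/) ) {
        fw = fw RLENGTH " "
        $0 = substr($0,RSTART+RLENGTH)
    }
    FIELDWIDTHS = fw    # gawk extension: takes effect from the next record on
    next
}
{
    for (i=1; i<=NF; i++) {
        fld = $i
        gsub(/^ +| +$/,"",fld)    # trim the fixed-width slice
        printf "%s%s", (fld == "" ? "[empty]" : fld), (i<NF ? " " : "\n")
    }
}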

Related

Search field and display next data to it

Is there an easy way to search the following data for a specific field, based on the ##id: field?
This is the sample data, in a file called sample:
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
For example, if I want to search for the ID 123 and get the gender, this is basically the prototype I want:
#!/bin/bash
# search.sh
# usage: search.sh <id> <field>
# eg: ./search.sh 123 age
search="$1"
field="$2"
grep "^##id: ${search}" sample | # FILTER <FIELD>
So when I search for ID 123 like below:
search.sh 123 gender
The output would be
Male
Up until now, based on the code above, I am only able to grep the line matching the ID; I'm not sure of the simplest way to then get the value that follows the specified field (e.g. age).
1st solution: With your shown samples, please try the following bash script. This assumes you want an exact string match.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match($0,"##id:[[:space:]]*"search){
    value=""
    match($0,"##"field":[[:space:]]*[^#]+")
    value=substr($0,RSTART,RLENGTH)
    sub(/.*: +/,"",value)
    print value
}
' Input_file
2nd solution: In case you want to match the strings (values) irrespective of their case (lower/upper) in each line, then try the following code.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match(tolower($0),"##id:[[:space:]]*"tolower(search)){
    value=""
    match(tolower($0),"##"tolower(field)":[[:space:]]*[^#]+")
    value=substr($0,RSTART,RLENGTH)
    sub(/.*: +/,"",value)
    print value
}
' Input_file
Explanation: A simple explanation of the code: create a bash script that expects 2 parameters when it is run, and pass these parameters as values into the awk program. There, use the match function to find the id in each line and print the value of the passed field (e.g. name, Gender, etc.).
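For example, assuming the data file shown above is named sample (i.e. Input_file in the scripts is replaced with sample), a run of the second, case-insensitive version could look like this:
$ ./script.bash 123 gender
Male
$ ./script.bash 345 AGE
20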
Since you want to extract a part of each line found, different from the part you are matching against, sed or awk would be a better tool than grep. You could pipe the output of grep into one of the others, but that's wasteful because both sed and awk can do the line selection directly. I would do something like this:
#!/bin/bash
search="$1"
field="$2"
sed -n "/^##id: ${search}"'\>/ { s/.*##'"${field}"': *//i; s/ *##.*//; p }' sample
Explanation:
sed is instructed to read file sample, which it will do line by line.
The -n option tells sed to suppress its usual behavior of automatically outputting its pattern space at the end of each cycle, which is an easy way to filter out lines that don't match the search criterion.
The sed expression starts with an address, which in this case is a pattern matching lines by id, according to the script's first argument. It is much like your grep pattern, but I append \>, which matches a word boundary. That way, searches for id 123 will not also match id 1234.
The rest of the sed expression edits out everything in the line except the value of the requested field (the field name being matched case-insensitively) and prints the result. The editing is accomplished by the two s/// commands, and the p command is of course "print". These are all enclosed in curly braces ({}) and separated by semicolons (;) to form a single compound command associated with the given address.
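Saved as search.sh and run against the sample file, this behaves like the prototype (note that \> and the i flag on s/// are GNU sed extensions):
$ ./search.sh 123 gender
Male
$ ./search.sh 12 gender    # no output: \> keeps "12" from matching id 123
$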
Assumptions:
'label' fields have format ##<string>:
need to handle case-insensitive searches
'label' fields could be located anywhere in the line (ie, there is no set ordering of 'label' fields)
the 1st input search parameter is always a value associated with the ##id: label
the 2nd input search parameter is to be matched as a whole word (ie, no partial label matching; nam will not match against ##name:)
if there are multiple 'label' fields that match the 2nd input search parameter, we print the value associated with the 1st match found in the line
One awk idea:
awk -v search="${search}" -v field="${field}" '
BEGIN { field = tolower(field) }
{
    n = split($0,arr,"##|:")            # split current line on the dual delimiters "##" and ":", placing fields into array arr[]
    found_search = 0
    found_field  = 0
    for (i=2; i<=n; i=i+2) {            # loop through the list of label fields
        label = tolower(arr[i])
        value = arr[i+1]
        sub(/^[[:space:]]+/,"",value)   # strip leading white space
        sub(/[[:space:]]+$/,"",value)   # strip trailing white space
        if ( label == "id" && value == search )
            found_search = 1
        if ( label == field && ! found_field )
            found_field = value
    }
    if ( found_search && found_field )
        print found_field
}
' sample
Sample input:
$ cat sample
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
##name: Archibald P. Granite, III, Ph.D, M.D. ##age: 20 ##Gender: not specified ##id: 567
Test runs:
search=123 field=gender => Male
search=123 field=ID => 123
search=123 field=Age => 18
search=345 field=name => Sarah Benson
search=567 field=name => Archibald P. Granite, III, Ph.D, M.D.
search=567 field=GENDER => not specified
search=999 field=age => <no output>
For the given data format, you could set the field separator to optional blanks followed by ##, which prevents trailing spaces in the printed field.
Then create a key-value mapping per row (lowercasing both the keys and the field to search for) and look up the key, which makes the search independent of the field order within the line.
If the key is present, print the value.
#!/bin/bash
search="$1"
field="$2"
awk -v search="${search}" -v field="${field}" '
BEGIN { FS = "[[:blank:]]*##" }                   # set the field separator to optional blanks followed by ##
{
    delete kv                                     # reset the mapping so values cannot leak over from a previous row
    for (i = 1; i <= NF; i++) {                   # loop over all the fields
        split($i, a, /[[:blank:]]*:[[:blank:]]*/) # split the field on ":" with optional surrounding blanks
        kv[tolower(a[1])] = a[2]                  # build a key-value array from the split values
    }
    val = kv[tolower(field)]                      # get the value from kv based on the lowercase key
    if (kv["id"] == search && val) print val      # if there is a matching key with a value, print the value
}' sample
And then run
./search.sh 123 gender
Output
Male

Add column from one file to another based on multiple matches while retaining unmatched

So I am really new to this kind of stuff (seriously, sorry in advance) but I figured I would post this question since it is taking me some time to solve it and I'm sure it's a lot more difficult than I am imagining.
I have the file small.csv:
id,name,x,y,id2
1,john,2,6,13
2,bob,3,4,15
3,jane,5,6,17
4,cindy,1,4,18
and another file big.csv:
id3,id4,name,x,y
100,{},john,2,6
101,{},bob,3,4
102,{},jane,5,6
103,{},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
The problem is that I am attempting to put id2 of small.csv into the id4 column of big.csv, but only if the name AND x AND y match. I have tried various awk and join commands in Git Bash but am coming up short. Again, I am sorry for the newbie perspective on all of this, but any help would be awesome. Thank you in advance.
EDIT: Sorry, this is what the final desired output should look like:
id3,id4,name,x,y
100,{13},john,2,6
101,{15},bob,3,4
102,{17},jane,5,6
103,{18},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
And one of my latest attempts was the following:
$ join -j 1 -o 1.5,2.1,2.2,2.3,2.4,2.5 <(sort -k2 small.csv) <(sort -k2 big.csv)
But I received this error:
join: /dev/fd/63: No such file or directory
Probably not trivial to solve with join but fairly easy with awk:
awk -F, -v OFS=, '      # set input and output field separators to comma
# create a lookup table from the lines of small.csv
NR==FNR {
    # ignore the header; map columns 2/3/4 to column 5
    if (NR>1) lut[$2,$3,$4] = $5
    next
}
# process the lines of big.csv:
# if the lookup table has a mapping for columns 3/4/5, update column 2
v = lut[$3,$4,$5] {
    $2 = "{" v "}"
}
# print the (possibly modified) lines of big.csv
1
' small.csv big.csv >bignew.csv
The code assumes small.csv contains only one line for each distinct column 2/3/4 combination.
NR==FNR { ...; next } is a way to process contents of the first file argument. (FNR is less than NR when processing lines from second and subsequent file arguments. next skips execution of the remaining awk commands.)
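A tiny standalone demonstration of the idiom, using two throwaway files:
$ printf 'a\nb\n' >f1; printf 'c\n' >f2
$ awk 'NR==FNR { print "from f1:", $0; next } { print "from f2:", $0 }' f1 f2
from f1: a
from f1: b
from f2: c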

Using awk to print a new column without apostrophes or spaces

I'm processing a text file and adding a column composed of certain components of other columns. A new requirement to remove spaces and apostrophes was requested, and I'm not sure of the most efficient way to accomplish this task.
The file's content can be created by the following script:
content=(
    john smith thomas blank 123 123456 10
    jane smith elizabeth blank 456 456123 12
    erin "o'brien" margaret blank 789 789123 9
    juan "de la cruz" carlos blank 1011 378943 4
)
# put this into a tab-separated file, with the syntactic (double) quotes above removed
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "${content[@]}" >infile
This is what I have now, but it fails to remove spaces and apostrophes:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 tolower(substr($2,0,3)); }' infile > outfile
My second attempt throws the error "sub third parameter is not a changeable object", which makes sense since I'm trying to modify the output of a function instead of a variable, I guess:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 sub("'\''", "",tolower(substr($2,0,3))); }' infile > outfile
Is there a way I can print a combination of column 6 and part of column 2 in lower case, all while removing spaces and apostrophes from the output in the new column? Worst case, I can just create a new file with my first command and process that output with a second awk command, but I'd like to do it in one pass if possible.
The second approach was close, but sub() only replaces the first match (gsub() replaces them all), and the order of operations matters:
awk -F "\t" '
BEGIN { OFS="\t"; }
{
    var=$2;
    gsub(/['\''[:space:]]/, "", var);   # remove ALL apostrophes and whitespace; sub() would stop after the first
    var=tolower(substr(var, 1, 3));     # awk positions start at 1, not 0; lower-case per the stated requirement
    print $1,$2,$3,$5,$6,$7,$6 var;
}
'
Assigning the contents you want to modify to a variable lets that variable be modified in-place.
Characters you want to remove should be removed before taking the substring; otherwise the removal shortens your 3-character result.
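A quick demonstration of that second point, using the space character:
$ echo 'de la cruz' | awk '{ v=substr($0,1,3); gsub(/ /,"",v); print v }'
de
$ echo 'de la cruz' | awk '{ v=$0; gsub(/ /,"",v); print substr(v,1,3) }'
del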
It's a guess since you didn't provide the expected output, but is this what you're trying to do?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    abbr = $2
    gsub(/[\047[:space:]]/,"",abbr)
    abbr = tolower(substr(abbr,1,3))
    print $1,$2,$3,$5,$6,$7,$6 abbr
}
$ awk -f tst.awk infile
john smith thomas 123 123456 10 123456smi
jane smith elizabeth 456 456123 12 456123smi
erin o'brien margaret 789 789123 9 789123obr
juan de la cruz carlos 1011 378943 4 378943del
Note that the way to represent a ' in a '-enclosed awk script is with the octal escape \047, which will continue to work if/when you move your script to a file (unlike "'\''", which only works from the command line). Also, strings, arrays, and fields in awk start at 1, not 0, so your substr(..,0,3) is wrong: awk quietly clamps the invalid start position of 0 to the valid part of the string, which can cost you a character, so always start at 1.
The "sub third parameter is not a changeable object" error you were getting is because sub() modifies the object passed as its 3rd argument, and a literal string (here, the output of tolower(substr(...))) cannot be modified. Try sub(/o/,"","foo") and you'll get the same error, whereas var="foo"; sub(/o/,"",var) is valid because the content of a variable can be modified.

Parsing a CSV file using shell

My shell is a bit rusty so I would greatly appreciate some help in parsing the following data.
Each row in the input file contains data separated by comma.
[name, record_timestamp, action, field_id, field_name, field_value, number_of_fields]
The rows are instructions to create or update information about persons. So for example the first line says that the person John Smith will be created and that the following 6 rows will contain information about him.
The field_id number always represent the same field.
input.csv
John Smith,2017-03-03 11:56:02,create,,,,6
,,,,1,BIRTH_DATE,1985-02-16,,
,,,,2,BIRTH_CITY,Portland,,
,,,,3,SEX,Male,,
,,,,5,CITY,Seattle,,
,,,,7,EMPLOYER,Microsoft,,
,,,,9,MARRIED,Yes,,
Susan Anderson,2017-03-01 12:09:36,create,,,,8
,,,,1,BIRTH_DATE,1981-09-12,,
,,,,2,BIRTH_CITY,San Diego,,
,,,,3,SEX,Female,,
,,,,5,CITY,Palo Alto,,
,,,,7,EMPLOYER,Facebook,,
,,,,8,SALARY,5612,,
,,,,9,MARRIED,No,,
,,,,10,TELEPHONE,5107586290,,
Brad Bradly,2017-02-29 09:15:12,update,,,,3
,,,,3,SEX,Male,,
,,,,7,EMPLOYER,Walmart,,
,,,,9,MARRIED,No,,
Sarah Wilson,2017-02-28 16:21:39,update,,,,5
,,,,2,BIRTH_CITY,Miami,,
,,,,3,SEX,Female,,
,,,,7,EMPLOYER,Disney,,
,,,,8,SALARY,5110,,
,,,,9,MARRIED,Yes,,
I want to parse each of these persons into comma separated strings that looks like this:
name,birth date,birth city,sex,employer,salary,marriage status,record_timestamp
but we should only output such a string if both the fields birth date and birth city, or both the fields employer and salary, are available for that person. Otherwise just leave that person's line out (see the example below).
Given our input above the output should then be
John Smith,1985-02-16,Portland,Male,,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,,Female,Disney,5110,Yes,2017-02-28 16:21:39
I've figured out that I should probably do something along the following lines. But then I cannot figure out how to implement an inner loop or if there is some other way to proceed.
#!/bin/bash
IFS=','
cat test.txt | while read -a outer
do
echo ${outer[0]}
#...
done
Thanks in advance for any advice!
A UNIX shell is an environment from which to call UNIX tools (and manipulate files and processes) with a language to sequence those calls. It is NOT a tool to manipulate text.
The standard UNIX tool to manipulate text is awk:
$ cat tst.awk
BEGIN {
    numFlds = split("name BIRTH_DATE BIRTH_CITY SEX EMPLOYER SALARY MARRIED timestamp",nr2name)
    FS = OFS = ","
}
$1 != "" {
    prtRec()
    rec["name"] = $1
    rec["timestamp"] = $2
    next
}
{ rec[$6] = $7 }
END { prtRec() }

function prtRec(   fldNr) {
    if ( ((rec["BIRTH_DATE"] != "") && (rec["BIRTH_CITY"] != "")) ||
         ((rec["EMPLOYER"] != "") && (rec["SALARY"] != "")) ) {
        for (fldNr=1; fldNr<=numFlds; fldNr++) {
            printf "%s%s", rec[nr2name[fldNr]], (fldNr<numFlds ? OFS : ORS)
        }
    }
    delete rec
}
$ awk -f tst.awk file
John Smith,1985-02-16,Portland,Male,Microsoft,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,Miami,Female,Disney,5110,Yes,2017-02-28 16:21:39
Any time you have records consisting of name+value data like you do, the approach that results in by far the simplest, clearest, most robust, and easiest to enhance/debug code is to first populate an array (rec[] above) containing the values indexed by the names. Once you have that array it's trivial to print and/or manipulate the contents by their names.
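One nice property of that layout: adding an output column only means editing the name list in the BEGIN section. For instance, to also print the CITY field from the sample input, the split() call would become
numFlds = split("name BIRTH_DATE BIRTH_CITY SEX CITY EMPLOYER SALARY MARRIED timestamp",nr2name)
and nothing else needs to change, because rec["CITY"] is already being populated by the rec[$6] = $7 line.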
Use awk, or something like:
while IFS=, read -r name timestamp action f_id f_name f_value nr_fields; do
    if [ -n "${name}" ]; then
        : # process a start record; store the fields you need for the following lines
    else
        : # process a detail record
    fi
done < test.txt
awk to the rescue!
awk -F, '
function pr(a) {
    if (!(7 in a && 8 in a)) a[7]=a[8]=""    # blank an incomplete EMPLOYER/SALARY pair
    if (!(1 in a && 2 in a)) a[1]=a[2]=""    # blank an incomplete BIRTH_DATE/BIRTH_CITY pair
    for (i=0; i<=10; i++) printf "%s,", a[i]
    printf "%s\n", a["ts"]
}
NR>1 && $1!="" { pr(a); delete a }
$1!="" { a[0]=$1; a["ts"]=$2 }
$1=="" { a[$5]=$7 }
END { pr(a) }' file
This should cover the general case, including the conditioned field pairs; you may want to filter out the extra fields you don't need. For your input this will print:
John Smith,1985-02-16,Portland,Male,,Seattle,,,,Yes,,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,,Palo Alto,,Facebook,5612,No,5107586290,2017-03-01 12:09:36
Brad Bradly,,,Male,,,,,,No,,2017-02-29 09:15:12
Sarah Wilson,,,Female,,,,Disney,5110,Yes,,2017-02-28 16:21:39
Avoid IFS hacks like the plague; they are ugly stuff.
Play with the -d option of read to specify the comma as the delimiter.
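For example, a minimal sketch of reading comma-delimited tokens one at a time with -d (note that read returns non-zero at end of input, so the here-string below carries a trailing comma to terminate the last token):
while IFS= read -r -d ',' tok; do
    printf '%s\n' "$tok"    # prints a, b, c on separate lines
done <<< 'a,b,c,'
Unlike IFS splitting, -d makes the comma the terminator of each read, so embedded newlines stay inside a token.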

How to combine two lines that share the same keyword?

Let's say I have a file looking somewhat like this:
X NeedThis1 KEYWORD
.
.
NeedThis2 X KEYWORD
And I need to combine the two lines into one like this:
NeedThis2 NeedThis1 KEYWORD
It needs to be done for every pair of lines in that file that share the same KEYWORD, but it must not combine two lines that look like this (both X's at the first, or both at the second, position):
X NeedThis1 KEYWORD
X NeedThis2 KEYWORD
I consider myself a bash noob, so any advice on whether this can be done with something like awk or sed would be appreciated.
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}
' file
Try this:
awk '
$1=="X" {key = $NF; value = $2; next}
$2=="X" && $NF==key {print $1, value, key}' file
Explanation:
On a line whose first field is X, store the last field as key and the second field as value.
Look for the next line whose second field is X and whose last field matches the key stored by the previous action.
When found, print the first field of the current line, the stored value, and the key.
This will most definitely break if your data does not match the sample you have shown (e.g. if it has more spaces or fields in between), so feel free to adjust it to your needs.
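For example, with the two lines from the question in a file (hypothetically named file):
$ printf 'X NeedThis1 KEYWORD\nNeedThis2 X KEYWORD\n' > file
$ awk '$1=="X" {key = $NF; value = $2; next}
       $2=="X" && $NF==key {print $1, value, key}' file
NeedThis2 NeedThis1 KEYWORD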
I won't give you the full answer, but if you have some way to identify "KEYWORD" (not given in your problem statement), then use a bash associative array:
declare -A keys
while IFS= read -u3 -r line
do
    set -- $line
    eval keyword=\$$#                   # grab the last whitespace-separated word as the keyword
    keys[$keyword]+=${line%$keyword}    # accumulate everything before the keyword, keyed by it
done 3< file                            # feed the input (here assumed to be named "file") on fd 3, matching read -u3
You'll certainly have to do some more fiddling, but your problem statement is incomplete, and some of the work needs to be an exercise for the reader.
