Is there an easy way to search the following data for a specific field, based on the ##id: field?
This is a sample data file called sample:
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
For example, if I want to search for an ID of 123 and get his gender, I would do this:
Basically this is the prototype that I want:
# search.sh
#!/bin/bash
# usage: search.sh <id> <field>
# eg: search 123 age
search="$1"
field="$2"
grep "^##id: ${search}" sample | # FILTER <FIELD>
So when I search an ID 123 like below:
search.sh 123 gender
The output would be
Male
Up until now, based on the code above, I am only able to grep one line based on the ID, and I'm not sure of the best, fastest, or least complicated method to get the value that follows a specified field (e.g. age).
1st solution: With your shown samples, please try the following bash script. This assumes that you want an exact string match.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match($0,"##id:[[:space:]]*"search){
value=""
match($0,"##"field":[[:space:]]*[^#]+")
value=substr($0,RSTART,RLENGTH)
sub(/.*: +/,"",value)
print value
}
' Input_file
2nd solution: In case you want to match the search strings (values) case-insensitively in each line, then try the following code.
cat script.bash
#!/bin/bash
search="$1"
field="$2"
awk -v search="$search" -v field="$field" '
match(tolower($0),"##id:[[:space:]]*"tolower(search)){
value=""
match(tolower($0),"##"tolower(field)":[[:space:]]*[^#]+")
value=substr($0,RSTART,RLENGTH)
sub(/.*: +/,"",value)
print value
}
' Input_file
Explanation: A simple explanation of the code: we create a bash script that expects 2 parameters when it is run, and pass those parameters as values to the awk program. We then use the match function to match the id in each line and print the value of the requested field (e.g. name or Gender).
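As a quick sanity check, here is the same match()/RSTART/RLENGTH logic run against a single sample line inline (note that this first solution matches the field name case-sensitively, so the data's capitalized Gender is used):

```shell
echo '##id: 123 ##name: John Doe ##age: 18 ##Gender: Male' |
awk -v search="123" -v field="Gender" '
match($0, "##id:[[:space:]]*" search) {        # select the line for this id
    match($0, "##" field ":[[:space:]]*[^#]+") # locate "##Gender: Male"
    value = substr($0, RSTART, RLENGTH)        # extract that matched piece
    sub(/.*: +/, "", value)                    # strip everything up to ": "
    print value                                # -> Male
}'
```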
Since you want to extract a part of each line found, different from the part you are matching against, sed or awk would be a better tool than grep. You could pipe the output of grep into one of the others, but that's wasteful because both sed and awk can do the line selection directly. I would do something like this:
#!/bin/bash
search="$1"
field="$2"
sed -n "/^##id: ${search}"'\>/ { s/.*##'"${field}"': *//i; s/ *##.*//; p }' sample
Explanation:
sed is instructed to read file sample, which it will do line by line.
The -n option tells sed to suppress its usual behavior of automatically outputting its pattern space at the end of each cycle, which is an easy way to filter out lines that don't match the search criterion.
The sed expression starts with an address, which in this case is a pattern matching lines by id, according to the script's first argument. It is much like your grep pattern, but I append \>, which matches a word boundary. That way, searches for id 123 will not also match id 1234.
The rest of the sed expression edits out everything in the line except the value of the requested field, with the field name matched case-insensitively, and prints the result. The editing is accomplished by the two s/// commands, and the p command is of course for "print". These are all enclosed in curly braces ({}) and separated by semicolons (;) to form a single compound command associated with the given address.
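To illustrate, here is the same sed expression run against the sample data inline, with the id and field hard-coded (the \> word boundary and the i flag on s/// are GNU sed extensions, so this sketch assumes GNU sed):

```shell
printf '%s\n' \
    '##id: 123 ##name: John Doe ##age: 18 ##Gender: Male' \
    '##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female' |
sed -n '/^##id: 123\>/ { s/.*##gender: *//i; s/ *##.*//; p }'   # -> Male
```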
Assumptions:
'label' fields have format ##<string>:
need to handle case-insensitive searches
'label' fields could be located anywhere in the line (ie, there is no set ordering of 'label' fields)
the 1st input search parameter is always a value associated with the ##id: label
the 2nd input search parameter is to be matched as a whole word (ie, no partial label matching; nam will not match against ##name:)
if there are multiple 'label' fields that match the 2nd input search parameter, we print the value associated with the 1st match found in the line
One awk idea:
awk -v search="${search}" -v field="${field}" '
BEGIN { field = tolower(field) }
{ n=split($0,arr,"##|:") # split current line on dual delimiters "##" and ":", place fields into array arr[]
found_search = 0
found_field = 0
for (i=2;i<=n;i=i+2) { # loop through list of label fields
label=tolower(arr[i])
value = arr[i+1]
sub(/^[[:space:]]+/,"",value) # strip leading white space
sub(/[[:space:]]+$/,"",value) # strip trailing white space
if ( label == "id" && value == search )
found_search = 1
if ( label == field && ! found_field )
found_field = value
}
if ( found_search && found_field )
print found_field
}
' sample
Sample input:
$ cat sample
##id: 123 ##name: John Doe ##age: 18 ##Gender: Male
##id: 345 ##name: Sarah Benson ##age: 20 ##Gender: Female
##name: Archibald P. Granite, III, Ph.D, M.D. ##age: 20 ##Gender: not specified ##id: 567
Test runs:
search=123 field=gender => Male
search=123 field=ID => 123
search=123 field=Age => 18
search=345 field=name => Sarah Benson
search=567 field=name => Archibald P. Granite, III, Ph.D, M.D.
search=567 field=GENDER => not specified
search=999 field=age => <no output>
For the given data format, you could set the field separator to optional spaces followed by ## to prevent trailing spaces for the printed field.
Then create a key value mapping per row (making the keys and the field to search for lowercase) and search for the key, which will be independent of the order in the string.
If the key is present, then print the value.
#!/bin/bash
search="$1"
field="$2"
awk -v search="${search}" -v field="${field}" '
BEGIN {FS="[[:blank:]]*##"} # Set field separator to optional spaces and ##
{
delete kv # Reset the map per line so values cannot leak through from a previous row
for (i = 1; i <= NF; i++) { # Loop over all the fields
split($i, a, /[[:blank:]]*:[[:blank:]]*/) # Split the field on : with optional surrounding spaces
kv[tolower(a[1])]=a[2] # Create a key-value array using the split values
}
val = kv[tolower(field)] # Get the value from kv based on the lowercase key
if (kv["id"] == search && val) print val # If there is a matching key and a value, print the value
}' file
And then run
./search.sh 123 gender
Output
Male
So I am really new to this kind of stuff (seriously, sorry in advance), but I figured I would post this question since it is taking me some time to solve, and I'm sure it's a lot more difficult than I am imagining.
I have the file small.csv:
id,name,x,y,id2
1,john,2,6,13
2,bob,3,4,15
3,jane,5,6,17
4,cindy,1,4,18
and another file big.csv:
id3,id4,name,x,y
100,{},john,2,6
101,{},bob,3,4
102,{},jane,5,6
103,{},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
The problem is that I am attempting to put id2 of small.csv into the id4 column of big.csv, but only if the name AND x AND y match. I have tried various awk and join commands in Git Bash but am coming up short. Again, I am sorry for the newbie perspective on all of this, but any help would be awesome. Thank you in advance.
EDIT: Sorry, this is what the final desired output should look like:
id3,id4,name,x,y
100,{13},john,2,6
101,{15},bob,3,4
102,{17},jane,5,6
103,{18},cindy,1,4
104,{},alice,7,8
105,{},jane,0,3
106,{},cindy,1,7
And one of the latest trials I did was the following:
$ join -j 1 -o 1.5,2.1,2.2,2.3,2.4,2.5 <(sort -k2 small.csv) <(sort -k2 big.csv)
But I received this error:
join: /dev/fd/63: No such file or directory
Probably not trivial to solve with join but fairly easy with awk:
awk -F, -v OFS=, ' # set input and output field separators to comma
# create lookup table from lines of small.csv
NR==FNR {
# ignore header
# map columns 2/3/4 to column 5
if (NR>1) lut[$2,$3,$4] = $5
next
}
# process lines of big.csv
# if lookup table has mapping for columns 3/4/5, update column 2
(v = lut[$3,$4,$5]) != "" {
$2 = "{" v "}"
}
# print (possibly-modified) lines of big.csv
1
' small.csv big.csv >bignew.csv
Code assumes small.csv contains only one line for each distinct column 2/3/4.
NR==FNR { ...; next } is a way to process contents of the first file argument. (FNR is less than NR when processing lines from second and subsequent file arguments. next skips execution of the remaining awk commands.)
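A stripped-down illustration of that idiom, using two tiny hypothetical files (keys.txt and data.txt are made-up names): the first pass remembers keys, the second pass consults them.

```shell
printf 'a\nb\n'    > keys.txt   # hypothetical first file
printf 'b\nc\na\n' > data.txt   # hypothetical second file

# First file: remember each key. Second file: print lines whose key was seen.
awk 'NR==FNR { seen[$1]; next } ($1 in seen)' keys.txt data.txt
# -> prints "b" then "a"
```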
I'm processing a text file and adding a column composed of certain components of other columns. A new requirement to remove spaces and apostrophes was requested and I'm not sure the most efficient way to accomplish this task.
The file's content can be created by the following script:
content=(
john smith thomas blank 123 123456 10
jane smith elizabeth blank 456 456123 12
erin "o'brien" margaret blank 789 789123 9
juan "de la cruz" carlos blank 1011 378943 4
)
# put this into a tab-separated file, with the syntactic (double) quotes above removed
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "${content[@]}" >infile
This is what I have now, but it fails to remove spaces and apostrophes:
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 tolower(substr($2,0,3)); }' infile > outfile
I also tried wrapping the output in sub():
awk -F "\t" '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$6 sub("'\''", "",tolower(substr($2,0,3))); }' infile > outfile
This throws the error "sub third parameter is not a changeable object", which makes sense since I'm trying to process output instead of input, I guess.
Is there a way I can print a combination of column 6 and part of column 2 in lower case, all while removing spaces and apostrophes from the output to the new column? Worst case scenario, I can just create a new file with my first command and process that output with a new awk command, but I'd like to do it in one pass is possible.
The second approach was close, but for order of operations:
awk -F "\t" '
BEGIN { OFS="\t"; }
{
var=$2;
gsub("['\''[:space:]]", "", var);   # gsub, not sub, to remove ALL apostrophes and spaces
var=tolower(substr(var, 1, 3));     # substr positions start at 1 in awk
print $1,$2,$3,$5,$6,$7,$6 var;
}
'
Assigning the contents you want to modify to a variable lets that variable be modified in-place.
Characters you want to remove should be removed before taking the substring, since otherwise you shorten your 3-character substring.
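A minimal sketch of why the order matters, using one of the question's own values ("de la cruz" contains multiple spaces, which is also why gsub rather than sub is needed to remove them all):

```shell
# Strip first, then take the substring: all three kept characters are letters.
echo "de la cruz" | awk '{
    s = $0
    gsub(/[[:space:]]/, "", s)   # remove ALL spaces first: "delacruz"
    print substr(s, 1, 3)        # -> del
}'

# Substring first: one of the three characters is a space, so data is lost.
echo "de la cruz" | awk '{ print substr($0, 1, 3) }'   # -> "de "
```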
It's a guess since you didn't provide the expected output but is this what you're trying to do?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
abbr = $2
gsub(/[\047[:space:]]/,"",abbr)
abbr = tolower(substr(abbr,1,3))
print $1,$2,$3,$5,$6,$7,$6 abbr
}
$ awk -f tst.awk infile
john smith thomas 123 123456 10 123456smi
jane smith elizabeth 456 456123 12 456123smi
erin o'brien margaret 789 789123 9 789123obr
juan de la cruz carlos 1011 378943 4 378943del
Note that the way to represent a ' in a '-enclosed awk script is with the octal escape \047, which will continue to work if/when you move your script to a file (unlike "'\''", which only works from the command line). Also note that strings, arrays, and fields in awk start at 1, not 0, so your substr(..,0,3) is wrong: awk treats the invalid start position of 0 as if you had used the first valid start position, which is 1.
The "sub third parameter is not a changeable object" error you were getting occurs because sub() modifies the object passed as its 3rd argument, but you were calling it with a literal string (the output of tolower(substr(...))), and you can't modify a literal string. Try sub(/o/,"","foo") and you'll get the same error, whereas var="foo"; sub(/o/,"",var) is valid since you can modify the content of a variable.
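A quick demonstration of both points, reusing the question's o'brien value (this assumes an awk such as gawk that accepts the \047 octal escape in a regexp):

```shell
echo "o'brien" | awk '{ gsub(/\047/, ""); print }'   # \047 matches the apostrophe -> obrien
echo "abc"     | awk '{ print substr($0, 1, 3) }'    # positions start at 1 -> abc
```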
My shell is a bit rusty so I would greatly appreciate some help in parsing the following data.
Each row in the input file contains data separated by comma.
[name, record_timestamp, action, field_id, field_name, field_value, number_of_fields]
The rows are instructions to create or update information about persons. So for example the first line says that the person John Smith will be created and that the following 6 rows will contain information about him.
The field_id number always represent the same field.
input.csv
John Smith,2017-03-03 11:56:02,create,,,,6
,,,,1,BIRTH_DATE,1985-02-16,,
,,,,2,BIRTH_CITY,Portland,,
,,,,3,SEX,Male,,
,,,,5,CITY,Seattle,,
,,,,7,EMPLOYER,Microsoft,,
,,,,9,MARRIED,Yes,,
Susan Anderson,2017-03-01 12:09:36,create,,,,8
,,,,1,BIRTH_DATE,1981-09-12,,
,,,,2,BIRTH_CITY,San Diego,,
,,,,3,SEX,Female,,
,,,,5,CITY,Palo Alto,,
,,,,7,EMPLOYER,Facebook,,
,,,,8,SALARY,5612,,
,,,,9,MARRIED,No,,
,,,,10,TELEPHONE,5107586290,,
Brad Bradly,2017-02-29 09:15:12,update,,,,3
,,,,3,SEX,Male,,
,,,,7,EMPLOYER,Walmart,,
,,,,9,MARRIED,No,,
Sarah Wilson,2017-02-28 16:21:39,update,,,,5
,,,,2,BIRTH_CITY,Miami,,
,,,,3,SEX,Female,,
,,,,7,EMPLOYER,Disney,,
,,,,8,SALARY,5110,,
,,,,9,MARRIED,Yes,,
I want to parse each of these persons into comma separated strings that looks like this:
name,birth date,birth city,sex,employer,salary,marrage status,record_timestamp
but we should only output such a string if both the fields birth date and birth city or both the fields employer and salary are available for that person. Otherwise just leave it empty (see example below).
Given our input above the output should then be
John Smith,1985-02-16,Portland,Male,,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,,Female,Disney,5110,Yes,2017-02-28 16:21:39
I've figured out that I should probably do something along the following lines. But then I cannot figure out how to implement an inner loop or if there is some other way to proceed.
#!/bin/bash
IFS=','
cat test.txt | while read -a outer
do
echo ${outer[0]}
#...
done
Thanks in advance for any advice!
A UNIX shell is an environment from which to call UNIX tools (and manipulate files and processes) with a language to sequence those calls. It is NOT a tool to manipulate text.
The standard UNIX tool to manipulate text is awk:
$ cat tst.awk
BEGIN {
numFlds=split("name BIRTH_DATE BIRTH_CITY SEX EMPLOYER SALARY MARRIED timestamp",nr2name)
FS=OFS=","
}
$1 != "" {
prtRec()
rec["name"] = $1
rec["timestamp"] = $2
next
}
{ rec[$6] = $7 }
END { prtRec() }
function prtRec( fldNr) {
if ( ((rec["BIRTH_DATE"] != "") && (rec["BIRTH_CITY"] != "")) ||
((rec["EMPLOYER"] != "") && (rec["SALARY"] != "")) ) {
for (fldNr=1; fldNr<=numFlds; fldNr++) {
printf "%s%s", rec[nr2name[fldNr]], (fldNr<numFlds ? OFS : ORS)
}
}
delete rec
}
$ awk -f tst.awk file
John Smith,1985-02-16,Portland,Male,Microsoft,,Yes,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,Facebook,5612,No,2017-03-01 12:09:36
Sarah Wilson,,Miami,Female,Disney,5110,Yes,2017-02-28 16:21:39
Any time you have records consisting of name+value data like you do, the approach that results in by far the simplest, clearest, most robust, and easiest-to-enhance/debug code is to first populate an array (rec[] above) containing the values indexed by the names. Once you have that array, it's trivial to print and/or manipulate the contents by their names.
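As a stripped-down sketch of that idiom, here are two of the question's field lines piped through an awk that indexes the values by name ($6 and $7 are the field_name and field_value columns of the input format):

```shell
printf '%s\n' ',,,,1,BIRTH_DATE,1985-02-16,,' ',,,,2,BIRTH_CITY,Portland,,' |
awk -F, '
    { rec[$6] = $7 }                                     # index each value by its name
    END { print rec["BIRTH_CITY"], rec["BIRTH_DATE"] }   # -> Portland 1985-02-16
'
```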
Use awk or something like
while IFS=, read -r name timestamp action f_id f_name f_value nr_fields; do
if [ -n "${name}" ]; then
# process start record, store the fields you need for the next lines
else
# process next record
fi
done < test.txt
awk to the rescue!
awk -F, 'function pr(a) {if(!(7 in a && 8 in a)) a[7]=a[8]="";
if(!(1 in a && 2 in a)) a[1]=a[2]="";
for(i=0;i<=10;i++) printf "%s,",a[i];
printf "%s\n", a["ts"]}
NR>1 && $1!="" {pr(a); delete a}
$1!="" {a[0]=$1; a["ts"]=$2}
$1=="" {a[$5]=$7}
END {pr(a)}' file
This should cover the general case and the conditional fields. You may want to filter out the other fields you don't need.
This will print for your input
John Smith,1985-02-16,Portland,Male,,Seattle,,,,Yes,,2017-03-03 11:56:02
Susan Anderson,1981-09-12,San Diego,Female,,Palo Alto,,Facebook,5612,No,5107586290,2017-03-01 12:09:36
Brad Bradly,,,Male,,,,,,No,,2017-02-29 09:15:12
Sarah Wilson,,,Female,,,,Disney,5110,Yes,,2017-02-28 16:21:39
Avoid IFS hacks like the plague. They are ugly stuff.
Play with the -d option to read to specify the comma as the delimiter.
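For what it's worth, a minimal sketch of bash's read -d with a comma delimiter (the trailing comma in the input matters, since read returns failure on a final unterminated token):

```shell
printf 'a,b,c,' | while IFS= read -r -d ',' tok; do
    echo "tok=$tok"
done
# -> tok=a
#    tok=b
#    tok=c
```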