How to count number of values in a row and store total count to array - shell

I have a scenario where I want to get the count of values in each row and store it in a dynamic array.
Data in file :
"A","B","C","B"
"P","W","R","S"
"E","U","C","S"
"Y","F","C"
first row : 4 values
second row : 4 values
third row : 4 values
fourth row : 3 values
Expected Output :
store to array : array_list=(4,4,4,3)
I have written a script, but it is not working:
array_list=()
while read -r line
do
var_comma_count=`echo "$line" | tr -cd , | wc -c`
array_list=+($( var_comma_count))
done < demo.txt
When I print the array it should give me all the values: echo "${array_list[@]}"
Note :
The file might contain empty lines at the end, which should not be read.
When I count the lines in the file it gives me 5; it should have ignored the last line, which is empty.
Whereas when I use awk it gives me the proper count: awk '{print NF}' demo.txt -> 4
I know processing a file with a while loop is not best practice, but any better solution would be appreciated.

Perhaps this might be easier using awk: set the FS to a comma and check if the number of fields is larger than 0:
#!/bin/bash
array_list=($(awk -v FS=, 'NF>0 {print NF}' demo.txt))
echo "${array_list[#]}"
Output
4 4 4 3
The awk command explained:
awk -v FS=, ' # Start awk, set the Field Separator (FS) to a comma
NF>0 {print NF} # If the Number of Fields (NF) is greater than 0, print the NF
' demo.txt # Close awk and set demo.txt as the input file
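If you're on bash 4 or newer, mapfile (a.k.a. readarray) is another way to capture that awk output into the array; a minimal sketch:
mapfile -t array_list < <(awk -v FS=, 'NF>0 {print NF}' demo.txt)   # read each output line into one array element
echo "${array_list[@]}"    # 4 4 4 3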
Another option could be first matching the format of the whole line. If it matches, there is at least a single occurrence.
Then split the line on a comma.
array_list=($(awk '/^"[A-Z]"(,"[A-Z]")*$/{print(split($0,a,","));}' demo.txt))
echo "${array_list[#]}"
Output
4 4 4 3
The awk command explained:
awk '/^"[A-Z]"(,"[A-Z]")*$/{ # Regex pattern for the whole line, match a single char A-Z between " and optionally repeat preceded by a comma
print(split($0,a,",")); # Split the whole line `$0` on a comma and print the number of parts
}
' demo.txt
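For completeness, a minimal sketch of the corrected while-read loop from the question (count the commas and add 1 for each non-empty line); the awk versions above are simpler, though:
array_list=()
while IFS= read -r line
do
    [ -z "$line" ] && continue                               # skip empty lines (the trailing blank lines in the file)
    var_comma_count=$(printf '%s' "$line" | tr -cd , | wc -c)
    array_list+=( $((var_comma_count + 1)) )                 # commas + 1 = number of values in the row
done < demo.txt
echo "${array_list[@]}"                                      # 4 4 4 3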

Related

Modify values of one column based on values of another column on a line-by-line basis

I'm looking to use bash/awk/sed in order to modify a document.
The document contains multiple columns. Column 5 currently has the value "A" at every row. Column 6 is composed of increasing numbers. I'm attempting a script that goes through the document line by line, checks the value of column 6, and, if that value is greater than a certain integer (specifically 275), changes the value of column 5 on that same line to "B".
while IFS="" read -r line ; do
awk 'BEGIN {FS = " "}'
Num=$(awk '{print $6}' original.txt)
if [ $Num > 275 ] ; then
awk '{ gsub("A","B",$5) }'
fi
done < original.txt >> edited.txt
For the above, I've tried setting the Num variable both inside and outside of the while loop.
I've also tried using a for loop and cat:
awk 'BEGIN {FS = " "}' original.txt
Num=$(awk '{print $6}' heterodimer_P49913/unrelaxed_model_1.pdb)
integer=275
for data in $Num ; do
if [ $data > $integer ] ; then
##Change value in other column to "B" for all lines containing column 6 values greater than "integer"
fi
done
Thanks in advance.
GNU AWK does not need an external while loop (it has an implicit loop over the input); if you need further explanation, read the awk info page. Let file.txt content be
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 A 300
and task to be
checks the value of Column 6, if the value is greater than a certain
integer (specifically 275) the value of Column 5 in that same line is
changed to "B".
then it might be done using GNU AWK following way
awk '$6>275{$5="B"}{print}' file.txt
which gives output
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 B 300
Explanation: the action that sets the value of the 5th field ($5) to B is applied conditionally, to rows where the value of the 6th field is greater than 275. The print action is applied unconditionally, to all lines. Observe that the change, if applied, is made before printing.
(tested in GNU Awk 5.0.1)
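If you also want the result written to edited.txt, as in your while loop, simply redirect the awk output; a minimal sketch:
awk '$6>275{$5="B"}{print}' original.txt > edited.txt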

How to loop through a character array to match array items from one column and extract full rows from a tab separated text file

I have a tab-separated text file with the first column being a subject ID (characters) and another 23 columns (all values, except the second column, which is also characters).
There are 700 rows (subjects), but I want to extract a subset of 200 rows based on matching with the subject ID column.
I have tried using grep and sed and awk with various combinations but I have not been successful. Some things that have failed include:
for sub in ${subjects[@]}; do
grep $sub | \
sed < baseline_subs.txt > baseline_moresubs.txt;
done
and
awk '{if ($1 == $sub) { print } }' baseline_moresubs.txt
Please help
Suggestion 1 awk script: scanning baseline_moresubs.txt ${#subjects[@]} times
Scan the same input file many times, each time screening for one subject.
for sub in "${subjects[@]}"; do
# screen baseline_moresubs.txt for 1st field,
awk "\$1 == \"$sub\" {print}" baseline_moresubs.txt
done
Suggestion 2 awk script: scanning baseline_moresubs.txt once
Map the subjects[] array into an awk associative array (dictionary), subjDict. Then scan each line for any of the subjects.
awk -v inpArr="${subjects[*]}" '
BEGIN { # pre processing input
split(inpArr,arr); # map inpArr into arr indexed 1,2,3...
for (i in arr) subjDict[arr[i]] = 1; # map arr into dictionary
}
$1 in subjDict{print} # if 1st field in dictionary print line
' baseline_moresubs.txt
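A rough usage sketch with hypothetical subject IDs (and a made-up output file name), showing how the bash array feeds suggestion 2:
subjects=(SUBJ001 SUBJ042 SUBJ317)      # hypothetical IDs; use your real 200 subject IDs
awk -v inpArr="${subjects[*]}" '
BEGIN { split(inpArr, arr); for (i in arr) subjDict[arr[i]] = 1 }
$1 in subjDict { print }
' baseline_moresubs.txt > baseline_subset.txt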

Use AWK to get unique count of record file based on certain columns

I have an AWK command to modify to get a unique count of records in a file based on primary keys. Inside the record file there are 21 elements, with columns 1 and 18 being the PKs. Each record is all on one row; the record separator is \^ and the field separator is |. This is what I have so far, but it still gives me the total number of records in the file, not the unique count:
awk 'BEGIN{RS="\\^";FS="\\|";} {a[ $1 $18 ]++;}END{print length(a);}' filename
Sample Data:
1|01212121|0|OUTGOING| | | | | |57 OHARE DR|not available|DALLAS|TX|03560|US|1131142334825|1|Jan 15 2004 11:12:06:576AM|Jan 15 2004 2:54:41:226PM|SYSTEM|\^
There are 2 million rows of this sort of data and I have 30 duplicates.
The expected output should be: 1999970
Use GNU awk for multi-char RS and use SUBSEP between your array index component fields to make the result unique:
awk 'BEGIN{RS="\\^"; FS="|"} NF>1{a[$1,$18]} END{print length(a)}' filename
You need the NF>1 test if your input file/line ends with \^\n instead of just \n. We know it does end with \n because you said that wc -l on the file returns 1, and wc -l only counts \ns; your one sample input line ends in \^, so that all leads me to believe that your file ends with \^\n, and so the NF>1 test is necessary to avoid counting the blank record after the final \^.
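A rough illustration of that point, using shortened hypothetical 3-field records keyed on $1,$3 instead of $1,$18 (run with GNU awk):
printf 'K1|x|A\\^K1|x|A\\^K2|y|B\\^\n' > demo_recs.txt        # two duplicates + one unique, file ends in \^ then newline
awk 'BEGIN{RS="\\^"; FS="|"} NF>1{a[$1,$3]} END{print length(a)}' demo_recs.txt   # 2
awk 'BEGIN{RS="\\^"; FS="|"}      {a[$1,$3]} END{print length(a)}' demo_recs.txt   # 3 (the blank record after the final \^ gets counted)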
In standard (POSIX) awk the record separator RS can only hold a single character. Since everything is crammed into a single line, you need to choose the last character of your data rows as RS and discard the last field (consisting of \).
Fix it like this:
awk 'BEGIN{RS="^";FS="|"} {a[$1,$18]++} END{print length(a)}' filename
Note that awk will now split on every ^ it encounters in the input. Should you need to split only on \^, I'd suggest the following:
sed 's/\\^/\n/g' filename |awk 'BEGIN{FS="|"} {a[$1,$18]++} END{print length(a)}'
Edit:
Incorporated remarks from @Ed.

Extract first 5 fields from semicolon-separated file

I have a semicolon-separated file with 10 fields on each line. I need to extract only the first 5 fields.
Input:
A.txt
1;abc ;xyz ;0.0000;3.0; ; ;0.00; ; xyz;
Output file:
B.txt
1;abc ;xyz ;0.0000;3.0;
You can cut from field1-5:
cut -d';' -f1-5 file
If the trailing ; is needed, you can append it with another tool, or use grep (assuming your grep has the -P option):
kent$ grep -oP '^(.*?;){5}' file
1;abc ;xyz ;0.0000;3.0;
In sed you can match the pattern [^;]*; (non-semicolons followed by a ;) 5 times:
sed 's/\(\([^;]*;\)\{5\}\).*/\1/' A.txt
or, when your sed supports -r:
sed -r 's/(([^;]*;){5}).*/\1/' A.txt
cut -f-5 -d";" A.txt > B.txt
Where:
- -f selects the fields (-5 means from the start up to field 5)
- -d provides the delimiter (here the semicolon)
Given that the input is field-based, using awk is another option:
awk 'BEGIN { FS=OFS=";"; ORS=OFS"\n" } { NF=5; print }' A.txt > B.txt
If you're using BSD/macOS, insert $1=$1; after NF=5; to make this work.
FS=OFS=";" sets both the input field separator, FS, and the output field separator, OFS, to a semicolon.
The input field separator is used to break each input record (line) into fields.
The output field separator is used to rebuild the record when individual fields are modified or the number of fields is modified.
ORS=OFS"\n" sets the output record separator to a semicolon followed by a newline, given that a trailing ; should be output.
Simply omit this statement if the trailing ; is undesired.
{ NF=5; print } truncates the input record to 5 fields, by setting NF, the number (count) of fields to 5 and then prints the modified record.
It is at this point that OFS comes into play: the first 5 fields are concatenated to form the output record, using OFS as the separator.
Note: BSD/macOS Awk doesn't modify the record just by setting NF; you must additionally modify a field explicitly for the changed field count to take effect: a dummy operation such as $1=$1 (assigning field 1 to itself) is sufficient.
awk -F';' -v OFS=';' '{print $1,$2,$3,$4,$5 OFS}' A.txt > B.txt
1;abc ;xyz ;0.0000;3.0;

Converting file with single field into multiple comma separated fields

I have a .dat file in which there is no delimiter between fields.
Eg: 2014HELLO2500
I have to convert the file into a comma-separated file, with commas at specific positions,
i.e. 2014,HELLO,2500
I could convert the file using a for loop, but can it be done with a single command?
I tried using the --output-delimiter option of the cut command, but it does not work.
I am using AIX OS.
Thanks
Assuming your field widths are known, you can use gawk like this:
awk -v FIELDWIDTHS="4 5 4 ..." -v OFS=, '{print $1,$2,$3,$4,$5...}' file
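For the sample row in the question, a concrete invocation might look like this (FIELDWIDTHS is a GNU extension, so this needs gawk):
echo "2014HELLO2500" | gawk -v FIELDWIDTHS="4 5 4" -v OFS=, '{print $1,$2,$3}'
2014,HELLO,2500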
Using awk
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ awk -v s='4 5' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' file
2014,HELLO,2500
As an example of the exact same code but applied with many fields, consider this test file:
$ cat alphabet
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Let's divide it up:
$ awk -v s='1 2 3 2 1 2 3 2 1 2 3 2' 'BEGIN{n=split(s,a)} {pos=1; for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}; print substr($0,pos)}' alphabet
A,BC,DEF,GH,I,JK,LMN,OP,Q,RS,TUV,WX,YZ
How it works:
-v s='1 2 3 2 1 2 3 2 1 2 3 2'
This creates a variable s which defines the lengths of all but the last field. (There is no need to specify a length of the last field.)
BEGIN{n=split(s,a)}
This converts the string variable s to an array with each number as an element of the array.
pos=1
At the beginning of each line, we initialize the position variable, pos, to the value 1.
for (i=1;i<=n;i++) {printf "%s,",substr($0,pos,a[i]); pos+=a[i]}
For each element in array a, we print the required number of characters starting at position pos followed by a comma. After each print, we increment position pos so that the next print will start with the next character.
print substr($0,pos)
We print the last field on the line using however many characters are left after position pos.
Using sed
Assuming that you know the lengths of the fields, say, for example, 4 characters for the first field and 5 for the second, then try this:
$ sed -E 's/(.{4})(.{5})/\1,\2,/' file
2014,HELLO,2500
This approach can be used for up to nine fields at a time. To get 15 fields, two passes would be needed.
Assuming you always want a delimiter between letters and digits (in either order), you can use this:
$ sed -r -e 's/([A-Za-z])([0-9])/\1,\2/g' -e 's/([0-9])([A-Za-z])/\1,\2/g' <<< "2014HELLO2500"
2014,HELLO,2500
$
When numbers and strings alternate, you can use
echo "2014HELLO2500other_string121312Other_word10" |
sed 's/\([A-Za-z]\)\([0-9]\)/\1,\2/g; s/\([0-9]\)\([A-Za-z]\)/\1,\2/g'
echo TEP_CHECK.20180627023645.txt | cut -d'.' -f2 | awk 'BEGIN{OFS="_"} {print substr($1,1,4),substr($1,5,2),substr($1,7,2),substr($1,9,2),substr($1,11,2),substr($1,13,2)}'
Output:
2018_06_27_02_36_45
