How to read each cell of a column in csv and take each as input for jq in bash - bash

I am trying to read each cell of a CSV file and treat it as input for the jq command. Below is my code:
line.csv
| Line |
|:---- |
| 11 |
| 22 |
| 33 |
Code to read CSV:
while read line
do
echo "Line is : $line"
done < line.csv
Output:
Line is : 11
Line is : 22
jq Command
jq 'select(.scan.line == '"$1"') | .scan.line,"|", .scan.service,"|", .scan.comment_1,"|", .scan.comment_2,"|", .scan.comment_3' linescan.json | xargs
I have a linescan.json which has values for line, service, comment_1, comment_2, and comment_3.
I want to read each value of the CSV and use it in the jq query where $1 is mentioned.

Given the input files and desired output:
line.csv
22,Test1
3389,Test2
10,Test3
linescan.json
{
"scan": {
"line": 3389,
"service": "Linetest",
"comment_1": "Line is tested1",
"comment_2": "Line is tested2",
"comment_3": "Line is tested3"
}
}
desired output:
Test2 | 3389 | Linetest | Line is tested1 | Line is tested2 | Line is tested3
Here's a solution with jq:
jq -sr --rawfile lineArr line.csv '
(
$lineArr | split("\n") | del(.[-1]) | .[] | split(",")
) as [$lineNum,$prefix] |
.[] | select(.scan.line == ($lineNum | tonumber)) |
[
$prefix,
.scan.line,
.scan.service,
.scan.comment_1,
.scan.comment_2,
.scan.comment_3
] |
join(" | ")
' linescan.json
Update: with jq 1.5, which lacks the --rawfile option, you can use --slurpfile instead:
#!/bin/bash
jq -sr --slurpfile lineArr <(jq -R 'split(",")' line.csv) '
($lineArr | .[]) as [$lineNum,$prefix] |
.[] | select(.scan.line == ($lineNum | tonumber)) |
[
$prefix,
(.scan.line | tostring),
.scan.service,
.scan.comment_1,
.scan.comment_2,
.scan.comment_3
] |
join(" | ")
' linescan.json
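A plainer alternative (slower, since it invokes jq once per row) is a shell loop that reads each CSV cell into a variable and passes both in; this is only a sketch, assuming the two-column line.csv and the linescan.json shown above:

```shell
# Sketch: call jq once per CSV row, passing both cells in as variables.
while IFS=, read -r lineNum prefix; do
    jq -r --arg p "$prefix" --argjson n "$lineNum" '
        select(.scan.line == $n) |
        [$p, (.scan.line | tostring), .scan.service,
         .scan.comment_1, .scan.comment_2, .scan.comment_3] |
        join(" | ")' linescan.json
done < line.csv
```

For the sample files this prints only the matching row: Test2 | 3389 | Linetest | Line is tested1 | Line is tested2 | Line is tested3.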

Related

Redirect the table generated from Beeline to text file without the grid (Shell Script)

I am currently trying to find a way to redirect the standard output from the beeline shell to a text file without the grid. The biggest problem I am facing right now is that my columns have negative values, and when I use a regex to remove the '-' characters, it also affects the column values.
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
Here's what I'm doing:
beeline -u jdbc:hive2://localhost:10000/default \
-e "SELECT * FROM $db.$tbl;" | sed 's/\+//g' | sed 's/\-//g' | sed 's/\|//g' > table.txt
I am trying to clean this file so I can read all the data into a variable.
Assuming all your data follows the same pattern, where no significant '-' characters are wrapped in '+':
[root@machine]# cat boo
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
[root@machine]# cat boo | sed 's/\+-*+//g' | sed 's/\--//g' | sed 's/|//g'
col
-100
22
-120
-190
-800
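The same cleanup can also be done in a single awk pass that drops the +----+ border lines and strips the pipes and padding, which leaves the '-' signs of the values untouched. A sketch against a sample like the one above:

```shell
# Drop border lines, then strip pipes and surrounding whitespace.
printf '%s\n' \
  '+-------------------+' \
  '|        col        |' \
  '|       -100        |' \
  '|        22         |' \
  '+-------------------+' |
awk '/^\+/ { next }   # skip +----+ border lines
     { gsub(/\|/, ""); gsub(/^[ \t]+|[ \t]+$/, ""); print }'
```

This prints col, -100, and 22 on separate lines, with no blank lines left behind.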

Bash extract strings between two characters

I have the output of a query result in a bash variable, stored as a single line.
-------------------------------- | NAME | TEST_DATE | ----------------
--------------------- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | -------------------------------------
I would like to ignore the column names (NAME | TEST_DATE) and store the actual values of each name and test_date as a tuple in an array.
So here is the logic I am thinking of: extract, from the third string onwards, each string between two '|' characters. Within a tuple the values are comma-separated, and when a space is encountered we start the next tuple in the array.
Expected output:
array=(TESTTT_1,2019-01-15 TEST_2,2018-02-16 TEST_NAME_3,2020-03-17)
Any help is appreciated. Thanks.
Let's say your string is stored in variable a (or pipe your query output into the command below):
echo "$a"
-------------------------------- | NAME | TEST_DATE | ----------------
--------------------- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | ------------------------------------
Command to obtain desired results is:
array="$(echo "$a" | cut -d '|' -f2,3,5,6,8,9 | tail -n1 | sed 's/ | /,/g')"
The above will store the output in a variable named array, as you expected.
Output of above command is:
echo "$array"
TESTTT_1,2019-01-15,TEST_2,2018-02-16,TEST_NAME_3,2020-03-17
Explanation of the command: the output of echo "$a" is piped into cut, which, using '|' as the delimiter, cuts out fields 2,3,5,6,8,9; the output is then piped into tail to remove the undesired NAME and TEST_DATE column headers and keep only the values; finally, as per your expected output, ' | ' is converted to ',' using sed.
This string contains only three dates; if you have more, just add more field numbers to the cut command. Given the format of your string, the field numbers continue in the style 2,3,5,6,8,9,11,12,14,15 ... and so on.
Hope it solved your problem.
echo "$a" | awk -F "|" '{ for(i=2; i<=NF; i++){ print $i }}' | sed -e '1,3d' -e '$d' | tr ' ' '\n' | sed '/^$/d' | sed 's/^/,/g' | sed -e 'N;s/\n/ /' | sed 's/^.//g' | xargs | sed 's/ ,/, /g'
Above is an awk-based solution.
Output:
TESTTT_1, 2019-01-15 TEST_2, 2018-02-16 TEST_NAME_3, 2020-03-17
Is this OK?
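If bash with grep -o is acceptable, the tuples can also be pulled out directly: a date pattern on the second field naturally skips the NAME | TEST_DATE header row. A sketch, with the sample string abbreviated from the question:

```shell
a='---- | NAME | TEST_DATE | ---- | TESTTT_1 | 2019-01-15 | | TEST_2 | 2018-02-16 | | TEST_NAME_3 | 2020-03-17 | ----'

# Match "name | YYYY-MM-DD" pairs; the header never matches the date part.
mapfile -t array < <(grep -oE '[A-Za-z0-9_]+ \| [0-9]{4}-[0-9]{2}-[0-9]{2}' <<<"$a" | sed 's/ | /,/')
printf '%s\n' "${array[@]}"
```

This fills the array with TESTTT_1,2019-01-15, TEST_2,2018-02-16, and TEST_NAME_3,2020-03-17, matching the expected array contents.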

Openstack snapshot create and restore scripting with bash commands

First of all, it amazes me that there is so little information and so few script examples for OpenStack, but that is not my question.
I want to create a snapshot and a simple way to restore it. Because of the way our hosting provider uses the underlying storage, I am unable to use the rebuild command, so I need to destroy the running VM and recreate it with the snapshot image as a base. Creating the image only works when all information about the running VM is provided as input parameters, and that is where my trouble begins.
The information needed is provided by 3 commands.
Command 1: nova show
Output:
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Property | Value |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NWFINFRA_1600 network | 10.0.0.39 |
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | gn3a |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | - |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2019-10-02T14:25:21.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2019-10-02T14:25:05Z |
| description | - |
| flavor:disk | 0 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {"ostype": "win", "hw:cpu_cores": "1", "hw:cpu_sockets": "2"} |
| flavor:original_name | win.2large |
| flavor:ram | 8192 |
| flavor:swap | 0 |
| flavor:vcpus | 2 |
| hostId | 18aa94c61106a53b2d9e672e93619a6fce76abb1ee6ba9da471491f9 |
| id | 70941fbf-9143-4f1c-a5e7-979f818ace23 |
| image | IFW039-InstanceSnapshot (8ee1104d-55e4-4c99-93e5-ceb4a53ce13f) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | IWF039 |
| os-extended-volumes:volumes_attached | [{"id": "1134fe12-777b-4c26-ac2b-e6ecb6ad4f70", "delete_on_termination": false}, {"id": "f610a46e-46ad-460f-81b3-e2b34acfbbfc", "delete_on_termination": false}] |
| progress | 0 |
| status | ACTIVE |
| tags | [] |
| tenant_id | 4c15fd467dde4bd6a25427d6bab64a7f |
| trusted_image_certificates | - |
| updated | 2019-10-02T14:25:21Z |
| user_id | ddff2ce854114bef873bac9a1476805e |
+--------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Command 2: ./Scripts/openstack port show NWFINFRA_1600_IWF039
where NWFINFRA_1600_IWF039 is a combination of NWFINFRA_1600 network from the previous output and the server name IWF039
Output:
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up | UP |
| allowed_address_pairs | |
| binding_host_id | None |
| binding_profile | None |
| binding_vif_details | None |
| binding_vif_type | None |
| binding_vnic_type | normal |
| created_at | 2019-10-02T06:49:08Z |
| data_plane_status | None |
| description | |
| device_id | 70941fbf-9143-4f1c-a5e7-979f818ace23 |
| device_owner | compute:gn3a |
| dns_assignment | fqdn='iwf039.rijkscloud.local.', hostname='iwf039', ip_address='10.0.0.39' |
| dns_domain | |
| dns_name | iwf039 |
| extra_dhcp_opts | |
| fixed_ips | ip_address='10.0.0.39', subnet_id='3298e8d0-b317-465c-8757-c1a4f2cad298' |
| id | b49a7d3a-bb0d-49cb-a04b-64c5dbf9df20 |
| location | cloud='', project.domain_id='default', project.domain_name=, project.id='4c15fd467dde4bd6a25427d6bab64a7f', project.name='vws-pgb', region_name='Groningen3', zone= |
| mac_address | fa:16:3e:99:cc:3c |
| name | NWFINFRA_1600_IWF039 |
| network_id | 450dcc7a-5e55-4e38-9f4e-de9a9c685502 |
| port_security_enabled | False |
| project_id | 4c15fd467dde4bd6a25427d6bab64a7f |
| propagate_uplink_status | None |
| qos_policy_id | None |
| resource_request | None |
| revision_number | 15 |
| security_group_ids | |
| status | ACTIVE |
| tags | |
| trunk_details | None |
| updated_at | 2019-10-03T11:10:00Z |
+-------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
With these outputs I can create the boot command to restore the snapshot:
nova boot --poll --flavor win.2large --image IFW039-InstanceSnapshot --security-groups default --availability-zone gn3a --nic net-id=450dcc7a-5e55-4e38-9f4e-de9a9c685502 IWF039
Note: the image is the name of the created snapshot.
I am trying to script this so I can have a simple snapshot create and restore procedure, but I get stuck on the table layout of the output. It displays really nicely, but I cannot use it in my scripting to redirect the output into input variables.
I tried using this: { read foo ; read ID Name MAC IP status;} < <(./Scripts/openstack port list --server IWF039 | sed 's/+--------------------------------------+----------------------+-------------------+--------------------------------------------------------------------------+-------- +//' | sed 's/|//' | sed 's/MAC Address/MAC/' | sed 's/Fixed IP Addresses/IP/')
But the variables get contents like the '|' character, etc.
So echo $Name gives '|' as output.
There must be a simpler way, but I am unable to see it.
Please help ...
I managed to get it almost working by using awk instead of grep.
I now have this code:
#!/bin/bash
#
# Query needed variables
#
echo -e "\nQuery needed information"
NETWORK=$(nova show IWF039 | awk '/network/ {print $2}')
ZONE=$(nova show IWF039 | awk '/OS-EXT-AZ:availability_zone/ {print $4}')
FLAVOR=$(nova show IWF039 | awk '/flavor:original_name/ {print $4}')
SERVERID=$(nova show IWF039 | awk -F '|' '/id/ {print $3; exit}')
NETWORKPORT=$(nova interface-list IWF039 | awk -F '|' '/ACTIVE/ {print $3}')
# Print out variables
echo "network: $NETWORK"
echo "zone: $ZONE"
echo "flavor: $FLAVOR"
echo "server_id: $SERVERID"
echo "network_port_id: $NETWORKPORT"
# Remove current instance
echo -e "\nRemove current instance"
nova delete $SERVERID
# Rebuild instance from snapshot image
echo -e "\nRebuild instance from snapshot"
nova boot --poll --flavor $FLAVOR --image IFW039-InstanceSnapshot2 --security-groups default --availability-zone $ZONE --nic port-id=$NETWORKPORT IWF039
If I run the script, however, the last item, e.g. IWF039, which is the name of the instance I want to use, throws an error:
error: unrecognized arguments: IWF039
Can anyone tell me why?
If I run the line on the command line it works; it only fails from within the bash script.
#!/bin/bash
#
# Script restores snapshot created in OpenStack
# Script created by Lex IT, Alex Flora
# usage: snapshot-restore.sh <snapshot image name> <server name to restore to>
# to use: make sure to install the following python openstack modules:
# pip install python-openstackclient python-keystoneclient python-glanceclient python-novaclient python-neutronclient
#
#
# Query needed variables
#
if [ "$#" -eq "0" ]
then
echo -e "usage: restore_snapshot <name of snapshot> <name of server>"
echo -e "Querying available snapshots, one moment please ..."
glance image-list
echo -e "\n\033[0;33mGive name of snapshot to restore"
echo -e "\033[0m"
read SNAPSHOT
echo -e "\n\033[0;33mGive server name to restore"
echo -e "\033[0m"
read SERVER
else
SNAPSHOT=$1
SERVER=$2
fi
echo -e "\n\033[0mQuery needed server information from server $SERVER, one moment please ..."
NETWORK=$(nova show "$SERVER" | awk '/network/ {print $2}' | sed -e 's/^[[:space:]]*//')
ZONE=$(nova show "$SERVER" | awk '/OS-EXT-AZ:availability_zone/ {print $4}' | sed -e 's/^[[:space:]]*//')
FLAVOR=$(nova show "$SERVER" | awk '/flavor:original_name/ {print $4}' | sed -e 's/^[[:space:]]*//')
SERVERID=$(nova show "$SERVER" | awk -F '|' '/\<id\>/ {print $3; exit}' | sed -e 's/^[[:space:]]*//')
NETWORKPORT=$(nova interface-list "$SERVER" | awk -F '|' '/ACTIVE/ {print $3}' | sed -e 's/^[[:space:]]*//')
# Print out variables
echo -e "\033[0mnetwork: \033[0;32m$NETWORK"
echo -e "\033[0mzone: \033[0;32m$ZONE"
echo -e "\033[0mflavor: \033[0;32m$FLAVOR"
echo -e "\033[0mserver_id: \033[0;32m$SERVERID"
echo -e "\033[0mnetwork_port_id: \033[0;32m$NETWORKPORT"
echo -e "\033[0mSnapshot image: \033[0;32m$SNAPSHOT"
echo -e "\033[0mServer naam: \033[0;32m$SERVER"
# Ask confirmation
echo -e "\n\033[0mGoing to restore snapshot image $SNAPSHOT to server $SERVER"
read -p "Is this correct (y/n) ? " -n 1 -r
if [[ $REPLY =~ ^[Yy]$ ]]
then
# Remove current instance
echo -e "\n\033[0mRemove current instance"
nova delete $SERVERID
sleep 3
# Rebuild instance from snapshot image
echo -e "\nRebuild instance from snapshot using command:"
echo -e "\033[0mnova boot --poll --flavor $FLAVOR --image $SNAPSHOT --security-groups default --availability-zone $ZONE --nic port-id=$NETWORKPORT $SERVER"
nova boot --poll --flavor $FLAVOR --image $SNAPSHOT --security-groups default --availability-zone $ZONE --nic port-id=$NETWORKPORT $SERVER
fi
This is my complete script. The problem was that the output contained spaces in front of the strings; I resolved this with the following sed command: sed -e 's/^[[:space:]]*//'
Hopefully someone has some use of the script.
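For scripts like this, a small reusable helper that pulls the Value column for a given Field out of the '| Field | Value |' tables (and trims the padding) keeps the awk/sed noise in one place. This is a sketch: the get_field name and the sample table are invented for illustration.

```shell
# Print the trimmed Value for the first row whose Field matches $1.
get_field() {
    awk -F'|' -v key="$1" '
        $2 ~ key { gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3; exit }'
}

# Hypothetical two-row sample resembling "nova show" output:
table='| flavor:original_name        | win.2large |
| OS-EXT-AZ:availability_zone | gn3a       |'

FLAVOR=$(printf '%s\n' "$table" | get_field 'flavor:original_name')
ZONE=$(printf '%s\n' "$table" | get_field 'availability_zone')
echo "$FLAVOR $ZONE"    # win.2large gn3a
```

The gsub handles the leading-whitespace problem directly, so no trailing sed stage is needed.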

Combine multiple grep variables in one column-wise file

I have some grep commands that count the number of lines matching a string, each one for a group of files with a different extension:
Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)
Each of these greps outputs the sample name and the number of occurrences, as follows.
The first one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
The second one:
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
And so on. I need to create a .txt file with these numerical grep outputs as columns, using the sample name as the key column. The sample name is the part of the file name before "_R1" (V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA, V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG, ...):
Sample | Nreads_ini | Nreads_align |
-----------------------------------------------------------------------
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589 |
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934 |
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981 |
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896 |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617 |
Any idea? Is there another easier solution for my problem?
Thanks!
In this answer the variable names are shortened to ini and align.
First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
Then we join the extracted data into one file. Lines with the same sample name will be combined.
join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
Now we nearly have the expected output. We only have to add the header and draw lines for the table.
join ... | column -to " | " -N Sample,ini,align
This will print
Sample | ini | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617
Adding a horizontal line after the header is left as an exercise for the reader :)
This approach also works with more than two number columns; the join and -N parts have to be extended. join can only work with two files at a time, requiring us to use an unwieldy workaround ...
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN
... so it would be easier to add another helper function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
Alternatively, if all inputs contain the same samples in the same order, join can be replaced with one single paste command.
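That paste variant can be sketched as follows; the sample file contents are invented, and this is only safe when every input lists the same samples in the same order:

```shell
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }

# Hypothetical grep -c output for two stages of the same two samples:
ini='PATH/sampleA_R1.trim.contigs.fasta:13175
PATH/sampleB_R1.trim.contigs.fasta:14801'
align='PATH/sampleA_R1.trim.contigs.good.unique.align:12589
PATH/sampleB_R1.trim.contigs.good.unique.align:13934'

# Keep the sample column once, paste the counts side by side.
paste <(e <<< "$ini") <(e <<< "$align" | cut -f2)
```

The result is tab-separated (sampleA, 13175, 12589, and so on per row) and can be fed to column -t for the table decoration, exactly as with the join version.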
Assuming you have files containing the data you want to parse:
$ cat file1
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.fasta:13475
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.fasta:13424
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.fasta:12053
$ cat file2
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
$ cat file3 # This is a copy of file2 but could be different
PATH/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
PATH/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934
PATH/V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT_R1.trim.contigs.good.unique.align:12981
PATH/V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG_R1.trim.contigs.good.unique.align:12896
PATH/V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA_R1.trim.contigs.good.unique.align:11617
If there is a key like V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT, you could use awk:
$ awk -F'[/.:]' '
BEGINFILE{
col[FILENAME]
}
{
row[$2]
a[FILENAME,$2]=$NF
next
}
END{
for(i in row) {
printf "%s ",substr(i,1,length(i)-3)
for(j in col)
printf "%s ",a[j SUBSEP i]; printf "\n"
}
}' file1 file2 file3
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG 13424 12896 12896
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT 13175 12589 12589
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT 13475 12981 12981
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT 14801 13934 13934
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA 12053 11617 11617
This awk script fills three arrays, col, row, and a, which respectively store the column names (the filenames), the row keys, and the values from all files.
The END statement prints the content of the array a by looping through all rows and columns.
If you need table decoration, use this:
{ printf "Sample Nreads_ini Nreads_align Nreads_align \n"; awk -F'[/.:]' 'BEGINFILE{col[FILENAME]}{row[$2];a[FILENAME,$2]=$NF;next}END{for(i in row) { printf "%s ",substr(i,1,length(i)-3); for(j in col) printf "%s ",a[j SUBSEP i]; printf "\n" }}' file1 file2 file3; } | column -t -s' ' -o ' | '
Could you please try the following and let me know if this helps you.
awk --re-interval -F"[/.:]" '
BEGIN{
print "Sample | Nreads_ini | Nreads_align |"
}
FNR==NR{
match($2,/.*[A-Z]{10}/);
array[substr($2,RSTART,RLENGTH)]=$NF;
next
}
match($2,/.*[A-Z]{10}/) && (substr($2,RSTART,RLENGTH) in array){
print substr($2,RSTART,RLENGTH),array[substr($2,RSTART,RLENGTH)],$NF
}
' OFS=" | " first_one second_one | column -t
Output will be as follows.
Sample | Nreads_ini | Nreads_align |
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617

Can't iterate over array in Bash

I need to add a new column with an (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
The incoming table has more than 500,000 lines (rows) of data and 7 columns, e.g. https://www.dropbox.com/s/g2u68fxrkttv4gq/incoming_data.csv?dl=0
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
-----------------
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
--------------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
First column is ID, so I've tried to group all lines with the same ID and iterate over them. Script (I don't know bash scripting, to be honest):
FILE=$PWD/$1
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')
for id in $IDS_ARRAY; do
# Group $FILE by $id from $IDS_ARRAY.
cat $FILE | grep $id >> temp_mail_group.csv
ROW_GROUP=$PWD/temp_mail_group.csv
# Add a number after each row.
# NF+1 — add a column after last existing.
awk -F'|' '{$(NF+1)=++i;}1' OFS="|", $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
rm -f $PWD/temp_mail_group.csv
done
Right now this script works almost like I want it to, except that it treats (for example) ID 2834 and 772834 as the same.
UPD: Although I marked one answer as accepted, it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
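For what it's worth, the partial-ID mix-up comes from cat $FILE | grep $id matching the ID anywhere in the line; anchoring the match to the whole delimited field avoids it. A minimal demo with two sample rows:

```shell
# grep 2834 alone would also match 772834; matching the fixed string
# "| 2834 |" as a complete field does not.
printf '%s\n' '| 2834 | Foo |' '| 772834 | Bar |' | grep -F '| 2834 |'
```

This prints only | 2834 | Foo |. In the real file the field padding varies, so a field-aware tool is more robust still (e.g. awk -F'|' with a numeric comparison on the ID column).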
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}
'
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
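The running counter can be seen in isolation with a minimal demo:

```shell
# ++a[$1] keeps one counter per distinct key, starting at 1 on first sight.
printf '%s\n' 1 1 1 4242 4242 | awk '{ print $1, ++a[$1] }'
```

This prints 1 1, 1 2, 1 3, 4242 1, 4242 2: each key gets its own independent sequence.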
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
$ awk '
BEGIN{ w = 8 }
{
if (NR==1) {
val = sprintf("%*s|",w,"")
}
else if (NR==2) {
val = sprintf("%*s",w+1,"")
gsub(/ /,"-",val)
}
else {
val = sprintf(" %-*s|",w-1,++cnt[$2])
}
print $0 val
}
' file
| id | Name | |
----------------------
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
An awk way, without bothering to extend the dotted header line:
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
output
| id | Name |
-------------
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
inputfile=$1
prev_id=
while IFS= read -r line ; do
printf '%s' "$line"
IFS=$'| \t\n' read t1 id name t2 <<<"$line"
if [[ $line == -* ]] ; then
printf '%s\n' '---------'
elif [[ $id == 'id' ]] ; then
printf ' Number |\n'
else
if [[ $id != "$prev_id" ]] ; then
id_count=0
prev_id=$id
fi
printf '%2d |\n' "$(( ++id_count ))"
fi
done <"$inputfile"
