how to merge two data based on one

I have two data files saved as .txt in a folder.
The first file, data1, has a single column named from:
from
A0A0A6YXQ7
A0A0A6YXS5
A0A0A6YXW8
A0A0A6YXX6
A0A0A6YXZ1
A0A0A6YY28
A0A0A6YY43
A0A0A6YY47
A0A0A6YY78
A0A0A6YY89
A0A0A6YY91
A0A0A7NQN9
The second file, data2, has two columns, from and to:
from to
A0A0A6YXQ7 Myo1f
A0A0A6YXW8 Pak2
A0A0A6YXX6 Arhgap15
A0A0A6YXZ1 Igtp
A0A0A6YY28 pol
A0A0A6YY47 MumuTL
A0A0A6YY78 MumuTL
A0A0A6YY78 MumuTLM
A0A0A6YY91 MumuTL
A0A0A6YY91 MumuTLM
Both data1 and data2 have a column named from.
Every string in data1 should also appear in data2, but some do not.
I want to load both files and, for any string from data1 that does not exist in data2, add it to the output with - in the to column. When a string has several entries in data2, the values should be joined with a semicolon.
For example, the following strings are missing from data2:
A0A0A6YXS5, A0A0A6YY43, A0A0A6YY89 and A0A0A7NQN9
So the output should look like this:
From To
A0A0A6YXQ7 Myo1f
A0A0A6YXS5 -
A0A0A6YXW8 Pak2
A0A0A6YXX6 Arhgap15
A0A0A6YXZ1 Igtp
A0A0A6YY28 pol
A0A0A6YY43 -
A0A0A6YY47 MumuTL
A0A0A6YY78 MumuTL;MumuTLM
A0A0A6YY89 -
A0A0A6YY91 MumuTL;MumuTLM
A0A0A7NQN9 -

How about:
#!/bin/bash
declare -A hash
# scan data2 and build a key -> value(s) table
while read -r key value; do
    if [ -z "${hash[$key]}" ]; then
        hash[$key]=$value
    else
        hash[$key]="${hash[$key]};$value"
    fi
done < data2
# read data1 keys and print the matching value(s), or "-" if there are none
while read -r line; do
    if [ -z "${hash[$line]}" ]; then
        echo "$line -"
    else
        echo "$line ${hash[$line]}"
    fi
done < data1
Note that the header needs no special handling: data2 maps the key "from" to the value "to", so the "from" line of data1 happens to be printed correctly as "from to".
Hope this helps.
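The same lookup can also be sketched in awk, which avoids the bash 4 requirement of declare -A. This is only a sketch under assumptions: merge_ids is a hypothetical function name, both files are whitespace-separated, and data2 is not empty (with an empty first file the NR == FNR test would also be true for the second file).

```shell
#!/bin/bash
# merge_ids DATA2 DATA1 (hypothetical helper name)
# Prints every key of DATA1 with its values from DATA2 joined by ";",
# or "-" when DATA2 has no entry for that key.
merge_ids() {
    awk '
        NR == FNR {                      # first file (data2): build key -> values
            if ($1 in map) map[$1] = map[$1] ";" $2
            else           map[$1] = $2
            next
        }
        {                                # second file (data1): look up each key
            print $1, (($1 in map) ? map[$1] : "-")
        }
    ' "$1" "$2"
}
```

It would be called as merge_ids data2 data1. As in the bash version, the header line is handled by the lookup itself, so it comes out as "from to".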

Related

UNIX: cut inside if

I have a simple search script that, based on the user's options, searches a certain column of a file.
The file looks similar to passwd:
openvpn:x:990:986:OpenVPN:/etc/openvpn:/sbin/nologin
chrony:x:989:984::/var/lib/chrony:/sbin/nologin
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
radvd:x:75:75:radvd user:/:/sbin/nologin
Now the function, based on the user's options, will search in different columns of the file. For example:
-1 "searchedword" -2 "secondword"
will search in the first column for "searchedword" and in the second column for "secondword".
The function looks like this:
while [ $# -gt 0 ]; do
case "$1" in
-1|--one)
c=1
;;
-2|--two)
c=2
;;
-3|--three)
c=3
;;
...
esac
The c variable holds the number of the column I want to search.
cat data | if [ "$( cut -f $c -d ':' )" == "$2" ]; then cut -d: -f 1-7 >> result; fi
This is what I have so far: I try to select the right column, compare it to the second argument (in this case "searchedword"), and then copy the whole line into the result file. But it doesn't work: nothing is copied into the result file.
Does anyone know where the problem is?
Thanks for any answers.
(At the end of the script I use:
shift
shift
to get the next two options)

I suggest using awk for this task, as awk is a better tool for processing delimited columns and rows.
Consider this awk command, where the search column numbers and their corresponding search values are passed in two colon-separated strings, cols and vals:
awk -v cols='1:3' -v vals='rpcuser:29' 'BEGIN {
FS=OFS=":" # set input/output field separator as :
nc = split(cols, c, /:/) # split column # by :
split(vals, v, /:/) # split values by :
}
{
p=1 # initialize p as 1
for(i=1; i<=nc; i++) # iterate the search cols/vals and set p=0
if ($c[i] !~ v[i]) { # if any match fails
p=0
break
} # finally value of p decides if a row is printing or not
} p' file
Output:
rpcuser:x:29:29:RPC Service User:/var/lib/nfs:/sbin/nologin
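For the single-column case in the question, a smaller sketch may be enough. The pipeline in the question hands the whole file to one cut call, so the test compares all column values at once, and the second cut then finds standard input already consumed. The function below instead tests each line separately; search_col is a hypothetical name, and unlike the !~ test above it does an exact string comparison rather than a regex match.

```shell
#!/bin/bash
# search_col COLUMN VALUE FILE (hypothetical helper name)
# Prints every line of FILE whose colon-separated COLUMN equals VALUE exactly.
search_col() {
    # Appending "" forces a string comparison, so "029" does not equal "29".
    awk -F: -v col="$1" -v val="$2" '($col "") == val' "$3"
}
```

Inside the option loop it could be used as search_col "$c" "$2" data >> result. Note that awk interprets backslash escapes in values passed with -v, so search words containing backslashes would need extra care.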

Storing multiple columns of data from a file in a variable

I'm trying to read from a file the data that it contains and get 2 important pieces of data from the file and use it in a bash script. A string and then a number for example:
Box 12
Toy 85
Dog 13
Bottle 22
I was thinking I could write a while loop to loop through the file and store the data into a variable. However I need two different variables, one for the number and one for the word. How do I get them separated into two variables?

Example code:
#!/bin/bash
declare -a textarr numarr
while read -r text num;do
textarr+=("$text")
numarr+=("$num")
done <file
echo "${textarr[1]} ${numarr[1]}" # prints: Toy 85
The data are stored in two array variables: textarr and numarr.
You can access a single element by index with ${textarr[$index]}, or all of them at once with ${textarr[@]}.

To read all the data into a single associative array (in bash 4.0 or newer):
#!/bin/bash
declare -A data=( )
while read -r key value; do
data[$key]=$value
done <file
With that done, you can retrieve a value by key efficiently:
echo "${data[Box]}"
...or iterate over all keys:
for key in "${!data[@]}"; do
value=${data[$key]}
echo "Key $key has value $value"
done
You'll note that read takes multiple names on its argument list. When given more than one argument, it splits fields by IFS, putting columns into their respective variables (with the entire rest of the line going into the last variable named, if more columns exist than variables are named).
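A minimal sketch of that splitting behaviour, assuming the default IFS (split_line is a hypothetical helper used only for illustration):

```shell
#!/bin/bash
# split_line TEXT (hypothetical helper name)
# Reads TEXT into three variables and prints them separated by "|",
# to show that the last variable receives the whole rest of the line.
split_line() {
    local first second rest
    read -r first second rest <<< "$1"
    printf '%s|%s|%s\n' "$first" "$second" "$rest"
}
```

Calling split_line "Box 12 extra words here" prints Box|12|extra words here, while split_line "Box 12" prints Box|12| because the third variable is left empty.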

Here is my own solution, which is open to discussion; I am not sure whether it is a good one. A while read loop that is fed by a pipe runs in a subshell, so it cannot update a variable outside the loop (a loop that reads through a redirection, as in done <file, does not have this limitation). Here is some example code which you can modify to suit your own needs. If you have more columns of data to use, a slight adjustment is needed.
#!/bin/sh
res=$(awk 'BEGIN{OFS=" "}{print $2, $3 }' mytabularfile.tab)
n=0
for x in $res; do
row=$(expr $n / 2)
col=$(expr $n % 2)
#echo "row: $row column: $col value: $x"
if [ $col -eq 0 ]; then
if [ $n -gt 0 ]; then
echo "row: $row "
echo col1=$col1 col2=$col2
fi
col1=$x
else
col2=$x
fi
n=$(expr $n + 1)
done
row=$(expr $row + 1)
echo "last row: $row col1=$col1 col2=$col2"
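The subshell concern raised in this answer can be checked with a small experiment. The sketch below assumes bash with its default settings (the lastpipe option off); the two function names are hypothetical.

```shell
#!/bin/bash
# Count arguments with a piped loop: the loop body runs in a subshell,
# so the increments are lost once the pipeline ends.
count_with_pipe() {
    local count=0
    printf '%s\n' "$@" | while read -r _line; do
        count=$((count + 1))
    done
    echo "$count"
}

# Count arguments with a redirected loop: the loop runs in the current
# shell, so the variable keeps its final value.
count_with_redirect() {
    local count=0
    while read -r _line; do
        count=$((count + 1))
    done < <(printf '%s\n' "$@")
    echo "$count"
}
```

With these definitions, count_with_pipe Box Toy Dog prints 0 and count_with_redirect Box Toy Dog prints 3, which is why the loops in the other answers, which read with done <file, can fill their arrays.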

Merging rows in .csv in order

After analysing brain scans I ended up with around 1000 .csv files, one for each scan. I've merged them into one file, in order (by subject ID and date). My problem is that some subjects had two or more consecutive scans and some had only one. The merged file now looks like this:
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530 //first scan of A
024_S_0985, 437.50, 204.80, 0.131074 //second scan of A
024_S_0985, 400.75, 198.80, 0.127420 //third scan of A
024_S_1063, 544.50, 214.34, 0.148939 //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453 //first scan of C
024_S_1171, 659.50, 242.21, 0.141269 //second scan of C
...
But I want it to look like this:
ID, CC_area, CC_perimeter, CC_circularity, CC_area2, CC_perimeter2, CC_circularity2, CC_area3, CC_perimeter3, CC_circularity3, ..., CC_circularity6
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420, ... ,
024_S_1063, 544.50, 214.34, 0.148939,,,,,, ...,
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269,,, ... ,
...
Importantly, the order of the data must not change (first the columns of scan 1, then scan 2, and so on), and the number of rows per ID is not known in advance (it varies from 1 to 6). Could you help me with a solution for this using bash? I am not experienced in programming and have lost hope that I could do it myself.

You can combine the lines that share the same ID (the leading index) using a normal while read loop that handles three cases: (1) the line is the first one following the header; (2) the current index is equal to the last; and (3) the current index differs from the last. There are a number of ways to approach this, but a short bash script could look like the following:
#!/bin/bash
fn="${1:-/dev/stdin}" ## accept filename or stdin
[ -r "$fn" ] || { ## validate file is readable
    printf "error: file not found: '%s'\n" "$fn"
    exit 1
}
declare -i cnt=0 ## flag for 1st iteration
lidx="" ## index of the previous line
while read -r line; do ## for each line in file
    ## read header, print & continue
    [ "${line%%,*}" = ID ] && printf "%s\n" "$line" && continue
    line="${line%% //*}" ## strip trailing " //first scan of A" comment
    idx="${line%%,*}" ## parse file index from line
    line="${line#*, }" ## strip index
    if [ "$cnt" -eq 0 ]; then ## if first line - print
        printf "%s, %s" "$idx" "$line"
        ((cnt++))
    elif [ "$idx" = "$lidx" ]; then ## if indexes equal, append
        printf ", %s" "$line"
    else ## else, newline & print
        printf "\n%s, %s" "$idx" "$line"
    fi
    lidx=$idx ## save last index
done <"$fn"
printf "\n"
Input
$ cat dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530 //first scan of A
024_S_0985, 437.50, 204.80, 0.131074 //second scan of A
024_S_0985, 400.75, 198.80, 0.127420 //third scan of A
024_S_1063, 544.50, 214.34, 0.148939 //first and only scan of B
024_S_1171, 654.75, 240.33, 0.142453 //first scan of C
024_S_1171, 659.50, 242.21, 0.141269 //second scan of C
Output
$ bash cmbcsv.sh dat/cmbcsv.dat
ID, CC_area, CC_perimeter, CC_circularity
024_S_0985, 407.00, 192.15, 0.138530, 437.50, 204.80, 0.131074, 400.75, 198.80, 0.127420
024_S_1063, 544.50, 214.34, 0.148939
024_S_1171, 654.75, 240.33, 0.142453, 659.50, 242.21, 0.141269
Note: I didn't know whether you needed all the additional commas or ellipses or if they were just there to show there could be more of the same index (e.g. ,,...,). You can easily add them if need be.
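If the padding commas are needed, one possible sketch is to do the merge in awk and pad each row to a fixed number of scans. The assumptions here are mine: merge_scans is a hypothetical name, every scan contributes exactly three values, the maximum number of scans is passed as a second argument (default 6, taken from the question), and the header line is passed through unchanged rather than extended.

```shell
#!/bin/bash
# merge_scans FILE [MAX_SCANS] (hypothetical helper name)
# Joins consecutive rows with the same ID onto one line and pads rows that
# have fewer than MAX_SCANS scans with three empty fields per missing scan.
merge_scans() {
    awk -v max="${2:-6}" '
        function flush(   i) {           # print the row collected so far
            if (id == "") return
            for (i = n; i < max; i++) row = row ",,,"
            print row
        }
        NR == 1 { print; next }          # header line: print as-is
        {
            sub(/ *\/\/.*$/, "")         # drop a trailing "//comment"
            key = $0;  sub(/,.*/, "", key)           # ID before the first comma
            rest = $0; sub(/^[^,]*, */, "", rest)    # values after the ID
            if (key == id) { row = row ", " rest; n++ }
            else           { flush(); id = key; row = key ", " rest; n = 1 }
        }
        END { flush() }
    ' "$1"
}
```

For example, merge_scans dat/cmbcsv.dat 3 would leave the 024_S_0985 row unpadded and append ",,,,,," to the 024_S_1063 row.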

Well, if you know which scan belongs to which person, you can add an extra column such as the patient name or ID, but I guess that only works if you still have the original information about how many scans each person had.

Bash - String verification method

I have a lot of Teradata SQL files (example code from one of these files is below).
create multiset volatile table abc_mountain_peak as(
select
a.kkpp_nip as nip,
from BM_RETABLE_BATOK.EDETON a
) with data on commit preserve rows;
create multiset table qazxsw_asd_1 as (
select
a.address_id,
from DE30T_BIOLOB.HGG994P_ABS_ADDRESS_TRE a,
) with data on commit preserve rows;
create multiset volatile table xyz_sea_depth as(
select
a.trip,
from tele_line_tryt a
) with data on commit preserve rows;
CREATE multiset table wsxzaq_zxc_2 AS (
SELECT
a.bend_data
FROM lokl_station a ,
) WITH data on commit preserve rows;
CREATE multiset table rfvbgt_ttuop_3 AS (
SELECT
a.heret_bini
FROM fvgty_blumion a ,
) WITH data on commit preserve rows;
DROP qazxsw_asd_1;
DROP wsxzaq_zxc_2;
.EXIT
What I need is a bash script that verifies whether the multiset tables are dropped.
Two kinds of tables are created:
multiset volatile tables (which shouldn't be dropped), and
multiset tables (which must be dropped)
In my example code, 2 of the 3 multiset tables are dropped (which is correct), and one of them is not (which is incorrect).
Do you have any idea how to write a script that could verify this and report which table or tables aren't dropped? I am a complete beginner in bash. My idea (which could be wrong) is to build an array holding the names of the multiset tables (but not the multiset volatile tables), then build a second array from the names that follow 'drop', and finally check whether every table in the first array is also in the second.
What do you think? Any help will be greatly appreciated.

You can do it fairly easily by reading each line in the file and isolating the table names associated with the multiset table commands into one array (dropnames) and the table names following the DROP statements into another array (droptable). Then it is just a matter of comparing both arrays to find the table in one that is not in the other. A short script like the following will do it for you:
#!/bin/bash
declare -a tmparray ## declare array names
declare -a dropnames
declare -a droptable
volstr="multiset volatile table" ## set query strings
dropstr="multiset table"
## read all lines and collect table names
while read -r line; do
[[ $line =~ $dropstr ]] && { ## collect "multiset table" names
tmparray=( $line )
dropnames+=( ${tmparray[3]} )
}
[[ $line =~ DROP ]] && { ## collect DROP table names
tmp="${line/DROP /}"
droptable+=( ${tmp%;*} )
}
unset tmparray
done
## compare droptable to dropnames, print missing table(s)
if [ ${#dropnames[@]} -gt ${#droptable[@]} ]; then
    printf "\n The following tables are missing from DROP tables:\n\n"
    for i in "${dropnames[@]}"; do
        found=0
        for j in "${droptable[@]}"; do
            [ "$i" = "$j" ] && found=1 && break
        done
        [ $found -eq 0 ] && printf " %s\n" "$i"
    done
elif [ ${#dropnames[@]} -lt ${#droptable[@]} ]; then
    printf "\n The following DROP tables have no matching multiset table:\n\n"
    for i in "${droptable[@]}"; do
        found=0
        for j in "${dropnames[@]}"; do
            [ "$i" = "$j" ] && found=1 && break
        done
        [ $found -eq 0 ] && printf " %s\n" "$i"
    done
fi
printf "\n"
exit 0
Output
$ bash sqlfinddrop.sh <dat/sql.dat
The following tables are missing from DROP tables:
rfvbgt_ttuop_3

I would do it in two parts using sed:
Create list of creates:
sed -ne 's/^.*create multiset \(volatile \)\?table \(\w\+\).*$/\2/Ip' INPUT FILES | sort > creates.txt
Create list of deletes:
sed -ne 's/^.*drop \(\w\+\).*$/\1/Ip' INPUT FILES | sort > drops.txt
Tables which were created and dropped:
join creates.txt drops.txt
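Tables created and not dropped (combine comes from the moreutils package; comm -23 creates.txt drops.txt gives the same result with standard tools). Note that the first pattern also matches volatile tables, so they will show up in this list as well:
combine creates.txt not drops.txt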
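A single-pass variant that leaves the volatile tables out could be sketched as follows. It rests on a few assumptions of mine: undropped_tables is a hypothetical name, each create and drop statement starts on its own line as in the example, and table names can be compared case-insensitively (the sketch lowercases every line before matching).

```shell
#!/bin/bash
# undropped_tables FILE... (hypothetical helper name)
# Prints the non-volatile multiset tables that are created but never dropped.
undropped_tables() {
    awk '
        { line = tolower($0) }
        line ~ /create multiset table / {        # skips "multiset volatile table"
            n = split(line, w, /[ \t]+/)
            for (i = 1; i < n; i++)
                if (w[i] == "table") { created[w[i + 1]] = 1; break }
        }
        line ~ /^[ \t]*drop / {                  # "drop name;" or "drop table name;"
            name = line
            sub(/^[ \t]*drop +(table +)?/, "", name)
            sub(/[ ;].*$/, "", name)
            dropped[name] = 1
        }
        END { for (t in created) if (!(t in dropped)) print t }
    ' "$@" | sort
}
```

Run against the example file from the question, this should report only rfvbgt_ttuop_3.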

Bash script that analyzes report files

I have the following bash script which I will use to analyze all report files in the current directory:
#!/bin/bash
# methods
analyzeStructuralErrors()
{
# do something with $1
}
# main
reportFiles=`find $PWD -name "*_report*.txt"`;
for f in $reportFiles
do
echo "Processing $f"
analyzeStructuralErrors $f
done
My report files are formatted as such:
Error Code for Issue X - Description Text - Number of errors.
col1_name,col2_name,col3_name,col4_name,col5_name,col6_name
1143-1-1411-247-1-72953-1
1143-2-1411-247-436-72953-1
2211-1-1888-204-442-22222-1
Error Code for Issue Y - Description Text - Number of errors.
col1_name,col2_name,col3_name,col4_name,col5_name,col6_name
Other data
.
.
.
I'm looking for a way to go through each file and aggregate the report data. In the above example, we have two unique issues of type X, which I would like to handle in analyzeStructural. Other types of issues can be ignored in this routine. Can anyone offer advice on how to do this? I want to read each line until I hit the next error basically, and put that data into some kind of data structure.

Below is a working awk implementation that uses its pseudo-multidimensional arrays. I've included sample output to show you how it looks. I took the liberty of adding a 'Count' column to denote how many times a certain "Issue" was hit for a given Error Code.
#!/bin/bash
awk '
/Error Code for Issue/ {
errCode[currCode=$5]=$5
}
/^ *[0-9-]+$/ {
split($0, tmpArr, "-")
error[errCode[currCode],tmpArr[1]]++
}
END {
for (code in errCode) {
printf("Error Code: %s\n", code)
for (item in error) {
split(item, subscr, SUBSEP)
if (subscr[1] == code) {
printf("\tIssue: %s\tCount: %s\n", subscr[2], error[item])
}
}
}
}
' *_report*.txt
Output
$ ./report.awk
Error Code: B
Issue: 1212 Count: 3
Error Code: X
Issue: 2211 Count: 1
Issue: 1143 Count: 2
Error Code: Y
Issue: 2961 Count: 1
Issue: 6666 Count: 1
Issue: 5555 Count: 2
Issue: 5911 Count: 1
Issue: 4949 Count: 1
Error Code: Z
Issue: 2222 Count: 1
Issue: 1111 Count: 1
Issue: 2323 Count: 2
Issue: 3333 Count: 1
Issue: 1212 Count: 1

As suggested by Dave Jarvis, awk:
handles this better than bash
is fairly easy to learn
is likely available wherever bash is available
I've never had to look farther than The AWK Manual.
It would make things easier if you used a consistent field separator for both the list of column names and the data. Perhaps you could do some pre-processing in a bash script using sed before feeding to awk. Anyway, take a look at multi-dimensional arrays and reading multiple lines in the manual.

Bash has one-dimensional arrays that are indexed by integers. Bash 4 adds associative arrays. That's it for data structures. AWK has one dimensional associative arrays and fakes its way through two dimensional arrays. If you need some kind of data structure more advanced than that, you'll need to use Python, for example, or some other language.
That said, here's a rough outline of how you might parse the data you've shown.
#!/bin/bash
# methods
analyzeStructuralErrors()
{
    local f=$1
    local Xpat="Error Code for Issue X"
    local notXpat="Error Code for Issue [^X]"
    local flag=false line issue field
    local -a columns=() issues=() array
    while read -r line
    do
        if [[ $line =~ $Xpat ]]
        then
            flag=true
        elif [[ $line =~ $notXpat ]]
        then
            flag=false
        elif $flag && [[ $line =~ , ]]
        then
            # columns could be overwritten if there are more than one X section
            IFS=, read -ra columns <<< "$line"
        elif $flag && [[ $line =~ - ]]
        then
            issues+=("$line")
        else
            echo "unrecognized data line"
            echo "$line"
        fi
    done < "$f"
    for issue in "${issues[@]}"
    do
        IFS=- read -ra array <<< "$issue"
        # do something with ${array[0]}, ${array[1]}, etc.
        # or iterate
        for field in "${array[@]}"
        do
            : # do something with $field
        done
    done
}
# main
find . -name "*_report*.txt" | while read -r f
do
    echo "Processing $f"
    analyzeStructuralErrors "$f"
done
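The way awk fakes two-dimensional arrays can be imitated in bash 4 or newer by building a composite key for an associative array. The sketch below counts issues per error code that way; count_issues is a hypothetical name, the "code,issue" key format is my own choice, and the issue number is assumed to be the first dash-separated field of each data line.

```shell
#!/bin/bash
# count_issues FILE (hypothetical helper name)
# Prints "CODE ISSUE COUNT" lines, using "CODE,ISSUE" as a composite key
# in a bash associative array (requires bash 4.0 or newer).
count_issues() {
    local -A counts=()
    local code="" line key
    local re_head='^Error Code for Issue ([^ ]+)'
    local re_data='^[[:space:]]*([0-9]+)-[0-9-]+$'
    while IFS= read -r line; do
        if [[ $line =~ $re_head ]]; then
            code=${BASH_REMATCH[1]}              # remember the current section
        elif [[ -n $code && $line =~ $re_data ]]; then
            key="$code,${BASH_REMATCH[1]}"       # composite key: code,issue
            counts[$key]=$(( ${counts[$key]:-0} + 1 ))
        fi
    done < "$1"
    for key in "${!counts[@]}"; do
        printf '%s %s %s\n' "${key%%,*}" "${key#*,}" "${counts[$key]}"
    done | sort
}
```

For the sample report in the question this would print X 1143 2 and X 2211 1. Anything that needs real nesting beyond such keys is still easier in awk or a general-purpose language, as noted above.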
