Compare two CSV files, find the matching rows and output them using the Linux shell

I am really confused by uniq, sort and awk, so I am asking here.
I have two CSV files:
tail 300513-code.csv
11916
11922
11896
11897
128647
1319760
1321176
1017022
1017017
1220901
tail 30-05-4UTF.csv
131318,"...","st365-3",0,5
1220357,"Ящик алюминиевый зимний",,0,1
,"!!Марко Поло",,,
1014492,"Коробка Марко Поло TF1331D 13.8х7.7х3.1см.","1694.13.31"," 16,00",1
1017795,"Ящик Марко Поло FS2000 white-black 2-х полочный 29х16х14см.","1694.20.01"," 122,00",5
10923,"Ящик Марко Поло TR2045 red 2-х секционый большой 51.5х39.5х56.5см.","1694.20.45"," 351,00",4
10925,"Ящик Марко Поло TR2045 yellow 2-х секционый большой 51.5х39.5х56.5см.","1694.20.47"," 351,00",1
12717,"Металоискатель CARRETT",," 4050,00",1
1319913,"Пакет 50 коп.","01.янв",0,269
17596,"Пакет полиэтиленовый 40х50",1," 1,00",4843
The first file holds codes. I need to find the rows of the second file whose first field matches one of those codes, and output only those rows. Example output.csv:
12717,"Металоискатель CARRETT",," 4050,00",1
1319913,"Пакет 50 коп.","01.янв",0,269
17596,"Пакет полиэтиленовый 40х50",1," 1,00",4843
(supposing these three lines had a match)

Your given input and output don't match: I cannot find 12717, 1319913 or 17596 in your first file, so I assume they are just examples. I think the following line is what you are looking for:
awk -F, 'NR==FNR{a[$0];next}$1 in a' 300513-code.csv 30-05-4UTF.csv
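As a rough illustration of how that one-liner behaves, here is a small sketch on made-up data. The file names codes.csv and data.csv, the English product names and the array name seen are hypothetical stand-ins, not the asker's real files. NR==FNR holds only while awk reads the first file, so the first block collects the codes and the second pattern filters the rows of the second file.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Hypothetical miniature inputs: codes.csv has one code per line,
# data.csv has full rows whose first field is the code.
printf '%s\n' 12717 17596 99999 > "$tmp/codes.csv"
printf '%s\n' \
  '10925,"yellow box","1694.20.47"," 351,00",1' \
  '12717,"metal detector",," 4050,00",1' \
  '17596,"plastic bag 40x50",1," 1,00",4843' > "$tmp/data.csv"

# First file (NR==FNR): remember each whole line as an array key, then skip.
# Second file: print the row if its first comma-separated field is a stored key.
matches=$(awk -F, 'NR==FNR { seen[$0]; next } $1 in seen' \
  "$tmp/codes.csv" "$tmp/data.csv")

rm -r "$tmp"
echo "$matches"
```

Only the rows for 12717 and 17596 are printed. If the code file has Windows (CRLF) line endings, every key carries a trailing carriage return and nothing matches; stripping it first, for example with tr -d '\r', is one way around that.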

If you are trying to link the two files on their first field (bash on Linux), sort both on that field and join them:
join -1 1 -2 1 -t, <(sort -k1,1 -t, 300513-code.csv) \
    <(sort -k1,1 -t, 30-05-4UTF.csv)

Related

Multiplication of two variables containing tuples in BASH script

I have two variables containing tuples of same length generated from a PostgreSQL database and several successful follow on calculations, which I would like to multiply to generate a third variable containing the answer tuple. Each tuple contains 100 numeric records. Variable 1 is called rev_p_client_pa and variable 2 is called lawnp_p_client. I tried the following which gives me a third tuple but the answer rows are not calculated correctly:
rev_p_client_pa data is:
0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814
lawn_p_client data is:
52.17
45
30.43
50
40
35
50
The command I used in the script:
awk -v var3="$rev_p_client_pa" 'BEGIN{print var3}' | awk -v var4="$lawnp_p_client" -F ',' '{print $(1)*var4}'
The command gives the following output:
0.948607
1.05808
0.713477
0.699511
0.564312
0.740657
1.05808
However, when I calculate it manually in LibreOffice Calc I get:
0.94860711
0.912663
0.41616068
0.670415
0.432672
0.496895
1.01407
I used this awk structure to multiply a tuple variable with numeric value variable in a previous calculation and it calculated correctly. Does someone know how the correct awk statement should be written or maybe you have some other ideas that might be useful? Thanks for your help.

Use paste to join the two data sets together, forming a list of pairs, each separated by tab.
Then pipe the result to awk to multiply each pair of numbers, resulting in a list of products.
#!/bin/bash
rev_p_client_pa='0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814'
lawn_p_client='52.17
45
30.43
50
40
35
50'
paste <(echo "$rev_p_client_pa") <(echo "$lawn_p_client") | awk '{print $1*$2}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407
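For context, the original command multiplied every row by 52.17: when awk converts the multi-line var4 to a number it reads only the leading numeric prefix. The sketch below repeats the paste approach on the first three rows from the question and uses printf "%.8f" so the digits line up with the spreadsheet values. The temporary files are an assumption made here so that the snippet also runs under plain sh, in place of bash process substitution.

```shell
#!/bin/sh
# Sample values copied from the question (first three rows only).
rev_p_client_pa='0.018183
0.0202814
0.013676'
lawn_p_client='52.17
45
30.43'

# Temporary files stand in for bash process substitution.
tmp=$(mktemp -d)
printf '%s\n' "$rev_p_client_pa" > "$tmp/rev"
printf '%s\n' "$lawn_p_client" > "$tmp/lawn"

# paste puts line N of each file side by side (tab-separated);
# printf "%.8f" prints eight decimals instead of awk's default
# six significant digits.
products=$(paste "$tmp/rev" "$tmp/lawn" |
  LC_ALL=C awk '{ printf "%.8f\n", $1 * $2 }')

rm -r "$tmp"
echo "$products"
```

The first and third results, 0.94860711 and 0.41616068, match the LibreOffice figures digit for digit.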

All awk:
awk -v rev_p_client_pa="$rev_p_client_pa" \
    -v lawn_p_client="$lawn_p_client" '   # "tuples" in as vars
BEGIN {
    split(lawn_p_client,l,/\n/)        # split the "tuples" by \n
    n=split(rev_p_client_pa,r,/\n/)    # get count of the other
    for(i=1;i<=n;i++)                  # loop the elements
        print r[i]*l[i]                # multiply and output
}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407

Sorting using -k

I tried this solution on my list, but I can't get what I want after sorting.
I have this list:
m_2_mdot_3_a_1.dat ro= 303112.12
m_1_mdot_2_a_0.dat ro= 300.10
m_2_mdot_1_a_3.dat ro= 221.33
m_3_mdot_1_a_1.dat ro= 22021.87
I used sort -k 2 -n >name.txt
I would like the list ordered from the lowest ro to the highest ro. What did I do wrong?
The output was sorted either by the names in the first column or by the last value, but in an order like 1000, 100001, 1000.2 ..., as if only the first four significant digits were compared.

cat test.txt | tr . , | sort -k3 -g | tr , .
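The linked answer, Sort scientific and float, explains this well.
In brief:
the -g option sorts general numeric values, including scientific notation (-n is enough for plain decimals like these);
the -k option counts fields from 1, not 0;
and in some locales sort expects , rather than . as the decimal separator, which is what the tr round trip works around.
However, be careful if your file contains , characters: the final tr turns them into . as well.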
Because a space or a tab separates ro= from the numeric value, the number is the 3rd field, not the 2nd, so you need to sort on the 3rd column. Your command becomes:
sort -k 3 -n input.txt
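To see both points together on the question's own rows, here is a minimal sketch. With -k 2 the key starts at "ro=", which is not a number, so every line compares equal numerically and sort falls back to comparing whole lines. Setting LC_ALL=C is an assumption added here to force . as the decimal point without the tr round trip.

```shell
#!/bin/sh
# Rows copied from the question.
list='m_2_mdot_3_a_1.dat ro= 303112.12
m_1_mdot_2_a_0.dat ro= 300.10
m_2_mdot_1_a_3.dat ro= 221.33
m_3_mdot_1_a_1.dat ro= 22021.87'

# -k3,3n: the key starts and ends at field 3 and is compared numerically.
# LC_ALL=C makes "." the decimal point regardless of the user's locale.
sorted=$(printf '%s\n' "$list" | LC_ALL=C sort -k3,3n)
echo "$sorted"
```

The rows come out in the order 221.33, 300.10, 22021.87, 303112.12.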

Unix sort: inconsistent between 2 files

[1.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample1_1.fq.gz
Sample13_1.fq.gz
[2.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
As you can see, the only difference is the digit after the "_".
Anyway, here are the results of sort:
[sort 1.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample1_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz
[sort 2.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
Discrepancy: "Sample1_" is sorted between "Sample11" and "Sample12" in 1.txt, but it's between "Sample12" and "Sample13" in 2.txt.
Am I doing something wrong to make this inconsistency happen?

Use sort -V (version sort, available in GNU sort), which compares the runs of digits inside the names as numbers:
sort -V 1.txt
Sample1_1.fq.gz
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz
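As for why the two files sort differently, a likely cause is locale collation. In many UTF-8 locales sort ignores punctuation such as _ and . on the first comparison pass, so Sample1_1 is compared roughly like Sample11 and Sample1_2 like Sample12, and they land in different places. If a consistent order matters more than a natural one, forcing byte order with LC_ALL=C is one option, sketched here on both lists from the question:

```shell
#!/bin/sh
# Lists copied from the question.
list1='Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample1_1.fq.gz
Sample13_1.fq.gz'
list2='Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz'

# LC_ALL=C compares raw bytes: digits (0x30-0x39) sort before "_" (0x5F),
# so "Sample1_" lands after every "Sample1<digit>" name in both lists.
sorted1=$(printf '%s\n' "$list1" | LC_ALL=C sort)
sorted2=$(printf '%s\n' "$list2" | LC_ALL=C sort)

echo "$sorted1"
echo "$sorted2"
```

In byte order Sample1_ comes last in both files, so the two lists at least line up with each other. sort -V remains the choice when Sample1 should come first.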

`join` with -e "NA" parameter incorrectly fills "NA" into a non-empty field

I am encountering a weird issue with join in a script I've written.
I have two files, say:
File1.txt (1st field: cluster size; 2nd field: brain coordinates)
54285;-40,-64,-2
5446;-32,6,24
File2.txt (1st field: cluster index; 2nd field: z-value; 3rd field: brain coordinates)
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
1;5.66;-50,16,34
Where I am joining on the last field, the brain coordinates.
When I use the command
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" File1.txt File2.txt
I expect
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
But I get
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
1;NA;5.66;-50,16,34
Such that the cluster size is missing on row 3 (i.e., cluster size for cluster #1, "5446").
If I edit File2 to take out lines that don't have a match in File1, i.e.:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
I get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
If I edit File2.txt like so, adding a line without a cluster-size value to cluster #1:
File2.txt
2;7.59;-40,-64,-2
1;5.78;-32,6,24
1;5.66;-50,16,34
I also get the expected output:
2;54285;7.59;-40,-64,-2
1;5446;5.78;-32,6,24
1;NA;5.66;-50,16,34
BUT, if I edit File2.txt like so, adding a line without a cluster-size value to cluster #2:
File2.txt
2;7.59;-40,-64,-2
2;7.33;62,-60,14
1;5.78;-32,6,24
Then I do not receive the expected output:
2;54285;7.59;-40,-64,-2
2;NA;7.33;62,-60,14
1;NA;5.78;-32,6,24
Can anyone give me any insight into why this is occurring? Have I done something wrong, or is there something quirky going on with join that I haven't been able to suss out from the man page?
Although alternative solutions for joining these files (that is, using tools other than join) are welcome, I am most interested in figuring out why the current command isn't working.

Input files to the join command must be sorted on the join fields.
Try this instead (note that it uses process substitution, which is a bashism):
join -a 2 -e "NA" -1 2 -2 3 -t ";" -o "2.1 1.1 2.2 0" <(sort -k2,2 -t';' File1.txt) \
    <(sort -k3,3 -t';' File2.txt)
1;5446;5.78;-32,6,24
2;54285;7.59;-40,-64,-2
1;NA;5.66;-50,16,34
2;NA;7.33;62,-60,14
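The failure in the question follows from how join walks its inputs. After the first match, File2's "62,-60,14" compares greater than File1's remaining "-32,6,24", so join advances past that File1 line without pairing it, and every later File2 line is reported as unpaired. The sketch below uses sort -c to show that the input is not in key order, then repeats the join on sorted copies. The temporary files and the export of LC_ALL=C (so that sort and join agree on collation) are assumptions added here.

```shell
#!/bin/sh
export LC_ALL=C   # sort and join must use the same collation; C is byte order
tmp=$(mktemp -d)

# Files copied from the question.
printf '%s\n' '54285;-40,-64,-2' '5446;-32,6,24' > "$tmp/File1.txt"
printf '%s\n' '2;7.59;-40,-64,-2' '2;7.33;62,-60,14' \
              '1;5.78;-32,6,24' '1;5.66;-50,16,34' > "$tmp/File2.txt"

# sort -c only checks order; a non-zero status means "not sorted on this key".
if sort -c -t';' -k3,3 "$tmp/File2.txt" 2>/dev/null; then
  file2_sorted=yes
else
  file2_sorted=no
fi

# Sort each file on its own join field, then join the sorted copies.
sort -t';' -k2,2 "$tmp/File1.txt" > "$tmp/File1.sorted"
sort -t';' -k3,3 "$tmp/File2.txt" > "$tmp/File2.sorted"
joined=$(join -a 2 -e NA -1 2 -2 3 -t ';' -o '2.1 1.1 2.2 0' \
  "$tmp/File1.sorted" "$tmp/File2.sorted")

rm -r "$tmp"
echo "File2 sorted on field 3: $file2_sorted"
echo "$joined"
```

The rows come out in join-field order rather than in File2's original order, so a further sort is needed if the original cluster order matters.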

multiple field and numeric sort

List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop, but it sorts by the number only, whereas I need a sequence like 1,4,8,16 within each mode:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that numeric sort (-n) sorts by number alone (1,1,1,1,4,4,4,4...), but I need a sequence like 1,4,8,16,1,4,8,16...

Sort by more columns:
sort -t- -k5,5 -k7n
The primary key is the 5th field only (5,5 stops the key at field 5 instead of letting it run to the end of the line); the secondary key is the number in the 7th field.

The for loop is completely unnecessary as is the -1 argument to ls when piping its output. This yields
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at column 5 and the second key begins and ends at column 7 and is numeric.
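A quick way to check the two-key sort is to run it on a shuffled subset of the names from the question. In this sketch the names are fed from a variable instead of ls, which is a choice made here only to keep the example self-contained.

```shell
#!/bin/sh
# A shuffled subset of the file names from the question.
names='sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4'

# -t- splits fields on "-"; -k5,5 sorts by mode as text,
# then -k7,7n breaks ties by thread count as a number.
sorted=$(printf '%s\n' "$names" | LC_ALL=C sort -t- -k5,5 -k7,7n)
echo "$sorted"
```

The rndrd names come first in the order 1, 4, 8, 16, followed by rndrw with 4 and 16.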
