Insert column delimiters before a pattern in a sorted file on a Mac - macos

I have a resulting file which contains values from different XML files.
The file has 5 columns separated by ";" when all patterns match.
First column = neutral Index
Second column = specific Index1
Third column = file that contains Index1
Fourth column = specific Index2
Fifth column = file that contains Index2
Lines where one of the patterns did not match (the short 3-column lines below) should also end up with 5 columns, with empty fields in place of the missing Index/file pair so the remaining values line up as in the first two lines.
The sorted file looks like:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;AAA.2F1;file_C
BBB;BBB.2G1;file_D
CCC;CCC.1B1;file_H
YYY;YYY.2M1;file_N
The desired result would be:
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
CCC;CCC.1B1;file_H;;
YYY;;;YYY.2M1;file_N
If you have any idea/hint, your help is appreciated! Thanks in advance!

Updated Answer
In the light of the updated requirement, I think you want something like this:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"}
NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' file
which can be written as a one-liner:
awk -F';' 'NF==3 && $2~/\.1/{$0=$0 ";;"} NF==3 && $2~/\.2/{$0=$1 ";;;" $2 ";" $3} 1' YourFile
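Spread over multiple lines with comments (logic unchanged), the two rules read:
awk -F';' '
  NF==3 && $2~/\.1/ { $0 = $0 ";;" }             # short line whose 2nd field is an Index1 value: append the missing Index2 pair as empty fields
  NF==3 && $2~/\.2/ { $0 = $1 ";;;" $2 ";" $3 }  # short line whose 2nd field is an Index2 value: insert empty Index1 fields after the first column
  1                                              # print every line, padded or not
' YourFile
Run against the sample file above, this should reproduce the desired result shown in the question.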
Original Answer
I would do that with awk:
awk -F';' 'NF==3{$0=$1 ";;;" $2 ";" $3}1' YourFile
AAA;AAA.1D1;file_X;AAA.2D1;file_Y
AAA;AAA.1E1;file_A;AAA.2E1;file_B
AAA;;;AAA.2F1;file_C
BBB;;;BBB.2G1;file_D
YYY;;;YYY.2M1;file_N
That says: "run awk on YourFile using ';' as the field separator. If there are only 3 fields on any line, recreate the line using the existing first field, three semicolons and then the other two fields. The 1 at the end means print the current line".
If you don't use awk much, NF refers to the number of fields, $0 refers to the entire current line, $1 refers to the first field on the line, $2 refers to the second field etc.
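For example, to see those variables on one of the short lines:
echo "AAA;AAA.2F1;file_C" | awk -F';' '{print NF, $1, $2, $0}'
3 AAA AAA.2F1 AAA;AAA.2F1;file_C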

Related

Searching for a string between two characters

I need to find two numbers from lines which look like this:
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for the colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second one and print the result on the screen for each line.
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, the result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk: split each row into multiple columns using : and - as separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899

Transpose rows to column after nth column in bash

I have a file in the format below:
$ cat file_in.csv
1308123;28/01/2019;28/01/2019;22/01/2019
1308456;20/11/2018;27/11/2018;09/11/2018;15/11/2018;10/11/2018;02/12/2018
1308789;06/12/2018;04/12/2018
1308012;unknown
How can I transpose it as below, starting from the second column:
1308123;28/01/2019
1308123;28/01/2019
1308123;22/01/2019
1308456;20/11/2018
1308456;27/11/2018
1308456;09/11/2018
1308456;15/11/2018
1308456;10/11/2018
1308456;02/12/2018
1308789;06/12/2018
1308789;04/12/2018
1308012;unknown
I'm testing my script, but I obtain the wrong result:
echo "123;23/05/2018;24/05/2018" | awk -F";" 'NR==3{a=$1";";next}{a=a$1";"}END{print a}'
Thanks in advance
1st Solution: The easiest solution will be to loop through all the fields (with the field separator set to ;, of course) and print $1 along with each field on a new line. Note that the loop runs from i=2 up to the value of NF, skipping the first field, since we need to print from the 2nd column onwards on new lines.
awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
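For a single sample line, the loop simply pairs $1 with each remaining field:
echo "1308789;06/12/2018;04/12/2018" | awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++) print $1,$i}'
1308789;06/12/2018
1308789;04/12/2018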
2nd Solution: Using one substitution (sub) and a global substitution (gsub). Here I change the very first occurrence of ; to ### (this assumes your Input_file does NOT contain that string; if it does, use any unique character(s) that are NOT in your Input_file in place of ###), then globally substitute the remaining occurrences of ; with ORS, val (a variable holding the value of $1) and ;, which moves each value onto a new line prefixed by the first column. Finally the ### in the first field is turned back into ;. We protect the very first ; this way because, if we did NOT substitute it with another character first, the global substitution would insert a NEW LINE right after the first field, which we do NOT want. (Also, as per Ed sir's comment, this solution was tested on one Input_file and may have issues while reading multiple Input_files.)
awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1' Input_file
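Roughly what happens to the first sample line (\n marks the newlines inserted via ORS):
1308123;28/01/2019;28/01/2019;22/01/2019                        original line, val="1308123"
1308123###28/01/2019;28/01/2019;22/01/2019                      after sub(";","###")
1308123###28/01/2019\n1308123;28/01/2019\n1308123;22/01/2019    after gsub(";",ORS val ";")
1308123;28/01/2019\n1308123;28/01/2019\n1308123;22/01/2019      after sub("###",";",$1), printed by the trailing 1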
Another awk
awk -F";" '{ OFS="\n" $1 ";"; $1=$1;$1=""; printf("%s",$0) } ' file
This sets OFS to a newline followed by the first field and ";", forces $0 to be rebuilt with $1=$1 so the fields are re-joined with that separator, blanks the now-redundant first field, and prints the result with printf so no extra newline is added.

How do I match columns in a small file to a larger file and do calculations using awk

I have this small file small.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-002559|110|7|15
SUCCEEDED|staging|L3-002241|46||24
And this bigger file big.csv:
STATE|STAGE|SUBCAT_ID|TOTAL TIMING|FAMA_COEFF_TIMING|DB_IMPORT_TIMING|COMMENT
SUCCEEDED|fe|L3-004082|16|0|8
SUCCEEDED|staging|L3-002730|85||57
SUCCEEDED|staging|L3-002722|83||56
SUCCEEDED|fe|L3-002559|100|7|15
I need a command (probably awk) that will loop over the small.csv file, check whether the 1st, 2nd and 3rd columns match a record in the big.csv file, and then, based on the 4th column, calculate the difference small minus big. So in the example above, since the 1st record's first 3 columns match the 4th record in big.csv, the output would be:
SUCCEEDED|fe|L3-002559|10
where 10 is 110-100
Thank you
Assuming that lines with the same first three fields do not occur more than twice in the two files taken together, this works:
awk -F '|' 'FNR!=1 { key = $1 "|" $2 "|" $3; if(a[key]) print key "|" a[key]-$4; else a[key]=$4 }' small.csv big.csv
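If a timing of 0 can occur in the 4th column, the if(a[key]) test would treat it as unset; here is a sketch along the same lines (not part of the original answer) that stores keys from small.csv only and tests membership with in:
awk -F'|' '
  NR==FNR { if (FNR > 1) a[$1 FS $2 FS $3] = $4; next }   # small.csv: remember TOTAL TIMING per STATE|STAGE|SUBCAT_ID
  FNR > 1 && ($1 FS $2 FS $3) in a {                      # big.csv: only keys also present in small.csv
      k = $1 FS $2 FS $3
      print k FS a[k] - $4                                # difference small minus big
  }
' small.csv big.csv
For the sample files this prints SUCCEEDED|fe|L3-002559|10.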

search 2 fields in a file in another huge file, passing 2nd file only once

file1 has 100,000 lines. Each line has 2 fields such as:
test 12345678
test2 43213423
Another file has millions of lines. Here is an example of how the above file entries look in file2:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'
I would like a way to grep these 2 fields from file1 so that I can find any line in file2 that contains both. The gotcha is that I would like to make only a single pass through the 2nd file for all 100,000 entries, as looping a grep is very slow: it could loop 100,000 x 10,000,000 times.
Is that at all possible?
You can do this in awk:
awk -F"['[:blank:]]+" 'NR == FNR { a[$1,$2]; next } $4 SUBSEP $5 in a' file1 file2
First set the field separator so that the quotes around the fields in the second file are consumed.
The first block applies to the first file and sets keys in the array a. The comma in the array index translates to the control character SUBSEP in the key.
Lines are printed in the second file when the third and fourth fields (with the SUBSEP in between) match one of the keys. Due to the ' at the start of the line, the first field $1 is actually an empty string, so the fields you want are $4 and $5.
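A minimal illustration of the SUBSEP mechanics, detached from the actual files:
awk 'BEGIN { a["test","12345678"]; if (("test" SUBSEP "12345678") in a) print "match" }'
match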
If your fields are always quoted in the second file, then you can do this instead:
awk -v q="'" 'NR == FNR { a[q $1 q,q $2 q]; next } $3 SUBSEP $4 in a' file file2
This inserts the quotes into the array a, so the fields in the second file match without having to consume the quotes.
fgrep and sed method:
sed "s/\b/'/g;s/\b/**/g" file1 | fgrep -f - file2
Modify a stream from file1 with sed to match the format of the second file (i.e. surround the fields with single quotes and asterisks), and send the stream to standard output. The fgrep -f - reads that stream as a list of fixed strings (not regexps) and finds every matching line in file2.
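For the two sample lines in file1, the sed stage (which relies on \b word-boundary support, e.g. GNU sed) emits:
'**test**' '**12345678**'
'**test2**' '**43213423**'
and fgrep then searches file2 for those fixed strings.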
Output:
'99' 'databases' '**test**' '**12345678**'
'1002' 'exchange' '**test2**' '**43213423**'

awk combine 2 commands for csv file formatting

I have a CSV file which has 4 columns. I want to first:
print the first 10 items of each column
only print the items in the third column
My method is to pipe the first awk command into another, but I didn't get exactly what I wanted:
awk 'NR < 10' my_file.csv | awk '{ print $3 }'
The only missing thing was the -F.
awk -F "," 'NR < 10' my_file.csv | awk -F "," '{ print $3 }'
You don't need to run awk twice.
awk -F, 'NR<=10{print $3}'
This prints the third field for every line whose record number (line) is less than or equal to 10.
Note that < is different from <=. The former matches records one through nine, the latter matches records one through ten. If you need ten records, use the latter.
Note that this will walk through your entire file, so if you want to optimize your performance:
awk -F, '{print $3} NR==10{exit}'
This will print the third column; then, once the record number reaches 10, it will exit. This does not step through your entire file.
Note also that awk's "CSV" matching is very simple; awk does not understand quoted fields, so the record:
red,"orange,yellow",green
has four fields, two of which have double quotes in them. YMMV depending on your input.
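For example, a quick check of how awk splits that record:
echo 'red,"orange,yellow",green' | awk -F, '{print NF; print $2}'
4
"orange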
