count the max number of _ and add additional semi-colon if some are missing - bash

I have several files with fields like below
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am
Aim is to have that
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxxdeme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
I'm counting the max of underscore, after the 7th field, for any line. I then change it to semi-colon and add additional semi-colon depending of the maximum underscore count found in all the lines.
I thought about using awk for that but I will only change ,with the command line below , every thing after the first field. My aim is also to add additional semi-colon
awk 'BEGIN{FS=OFS=";"} {for (i=7;i<=NF;i++) gsub(/_/,";", $i) } 1' file
Thanks.

Awk way
awk -F';' -vOFS=';' '{y=0;for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
x=x<y?y:x;NF=NF+(x-y)}NR!=FNR' file{,}
Output
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
Explanation
awk -F';' -vOFS=';'
This sets the Field Separator and Output Field separator to ;.
y=0;
Initialised y as 0 on each line.
for(i=8;i<=NF;i++)y+=gsub(/_/,";",$i)
For each field from field 8 to the Number of Fields on the line(NF).Substitute _ with a ;.Increment y by the number of substitutions.
x=x<y?y:x
Check if x is less than y, if it is set x to yelse leave the same.
NF=NF+(x-y)
Set the number of field to the current number of fields + the difference between x and y.
NR!=FNR
This means that if the Total record number is not equal to the Files record number then print.Effectively means print anything that isn't the first file.
file{,}
Expands to file file so the file is read twice.
Resources
https://www.gnu.org/software/gawk/manual/html_node/String-Functions.html

Related

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.
Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.
You could use awk with multiple delimeters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file
Assuming that AF is always of the form 0.NN you can simply match values where the tens place is 1-9, e.g.:
grep ';AF=0.[1-9][0-9];' your_file.csv
You could add a + after the second character group to support additional digits (i.e. 0.NNNNN) but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.
$ awk -F= '$5>0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.
I would use awk. Since awk supports alphanumerical comparisons you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields by ;. $(NF-1) address the second last field in the line. (NF is the number of fields)

Match a single column entry in one file to a column entry in a second file that consists of a list

I need to match a single column entry in one file to a column entry in a second file that consists of a list (in shell). The awk command I've used only matches to the first word of the list, and doesn't scan through the entire list in the column field.
File 1 looks like this:
chr1:725751 LOC100288069
rs3131980 LOC100288069
rs28830877 LINC01128
rs28873693 LINC01128
rs34221207 ATP4A
File 2 looks like this:
Annotation Total Genes With Ann Your Genes With Ann) Your Genes No Ann) Genome With Ann) Genome No Ann) ln
1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A ATP5A1 ATP5E ATP5F1 ATP5G1 ATP5G2 ATP5G3 ATP5J ATP5O ATP6V0A1 ATP6V0A4 ATP6V0D2 ATP6V1A ATP6V1C1 ATP6V1C2 ATP6V1D ATP6V1E1 ATP6V1E2 ATP6V1G3 ATP6V1H COX10 COX17 COX4I1 COX4I2 COX5A COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7C COX8A NDUFA5 NDUFA9 NDUFB3 NDUFB4 NDUFB5 NDUFB6 NDUFS1 NDUFS3 NDUFS4 NDUFS5 NDUFS6 NDUFS8 NDUFV1 NDUFV3 PP PPA2 SDHA SDHD TCIRG1 UQCRC2 UQCRFS1 UQCRH
Expected output:
rs34221207 ATP4A hsa00190
(please excuse the formatting - all the columns are tab-delimited until the column of gene names, $14, called Genome...)
My command is this:
awk 'NR==FNR{a[$14]=$3; next}a[$2]{print $0 "\t" a[$2]}' file2 file 1
All help will be much appreciated!
You need to process files in the other order, and loop over your list:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
Explanation:
NR is a global "record number" counter that awk increments for each line read from each file. FNR is a per-file "record number" that awk resets to 1 on the first line of each file. So the NR==FNR condition is true for lines in the first file and false for lines in subsequent files. It is an awk idiom for picking out just the first file info. In this case, a[$2]=$1 stores the first field text keyed by the second field text. The next tells awk to stop short on the current line and to read and continue processing normally the next line. A next at the end of the first action clause like this is functionally like an ELSE condition on the remaining code if awk had such a syntax (which it doesn't): NR==FNR{a[$2]=$1} ELSE {for.... More clear and only slightly less time-efficient would have been to write instead NR==FNR{a[$2]=$1}NR!=FNR{for....
Now to the second action clause. No condition preceding it means awk will do it for every line that is not short-circuited by the preceding next, that is, all lines in files other than the first -- file2 only in this case. Your file2 has a list of potential keys starting in field #15 and extending to the last field. The awk built-in variable for the last field number is NF (number of fields). The for loop is pretty self-explanatory then, looping over just those field numbers. For each of those numbers i we want to know if the text in that field $i is a known key from the first file -- a[$i] is set, that is, evaluates to a non-empty (non-false) string. If so, then we've got our file1 first field in a[$i], our matching file1 second field in $i, and our file2 field of interest in $3 (the text of the current file2 3rd field). Print them tab-separated. The next here is an efficiency-only measure that stops all processing on the file2 record once we've found a match. If your file2 key list might contain duplicates and you want duplicate output lines if there is a match on such a duplicate, then you must remove that last next.
Actually now that I look again, you probably do want to find any multiple matches even on non-duplicates, so I have removed the 2nd next from the code.

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
if myfile(myline,2) ~= myfile(myline+1,2)
mynewfile(mynewline,:) = myfile(myline,:);
mynewline = mynewline+1;
myline = myline+1;
else
myline = myline+1;
end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for each line. BEGIN condition is executed before the first line. Each line is split to fields, which are accessible with $_number_. The full line is in $0.
Here I compare the second field to the previous value, if it does not match I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, careful with the float comparisons - something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure if awk converts to number for equality testing, if not you're safe with the string comparisons.
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves previous line and second field, comparing in next loop with current line values. The && field check is useful to avoid a blank line at the beginning of file, when $2 != field would match because variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

How to combine two lines that share the same keyword?

lets say I have a file looking somewhat like this:
X NeedThis1 KEYWORD
.
.
NeedThis2 X KEYWORD
And I need to combine the two lines into one like this:
NeedThis2 NeedThis1 KEYWORD
It needs to be done for every line in that file that contains the same KEYWORD but it can't combine two lines that look like this (two X's at the first|second position)
X NeedThis1 KEYWORD
X NeedThis2 KEYWORD
I am considering myself bash-noob so any advice if it can be done with something like awk or sed would be appreciated.
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}
' file
Try this:
awk '
$1=="X" {key = $NF; value = $2; next}
$2=="X" && $NF==key {print value, $1, key}' file
Explanation:
When a line where first field is X, store the last field as key and second field as value.
Look for the next line where second field is X and last field matches the key stored from pervious action.
When found, print the value of last matched line along with first field of the current line and the key.
This will most definitely break if your data does not match the sample you have shown (if it has more spaces or fields in between), so feel free to adjust as per your needs.
I won't give you the full answer, but if you have some way to identify "KEYWORD" (not in your problem statement), then use a BASH associative array:
declare -A keys
while IFS= read -u3 -r line
do
set -- $line
eval keyword=\$$#
keys[$keyword]+=${line%$keyword}
done
you'll certainly have to do some more fiddling, but your problem statement is incomplete and some of the work needs to be an exercise for the reader.

How to use unix grep and the output together?

I am new to unix commands. I have a file named server.txt, which has 100 fields, the first line of the file is header.
I want to take a look at the fields at 99 and 100 only.
Field 99 is just some numbers, field 100 is a String.
The delimiter of each field which is a space.
My goal is to extract every tokens in the string(field100) by grep and regex,
and then output with the field99 with every token extracted from the String,
and skip the first 1000 lines of my records
----server.txt--
... ... ,field99,field100
... ... 5,"hi are"
... ... 3,"how is"
-----output.txt
header1,header2
5,hi
5,are
3,how
3,is
so i just have some idea, but i dont know how to combine all the scripts
Here is some of my thought:
sed 1000d server.txt cut -f99,100 -d' ' >output.txt
grep | /[A-Za-z]+/|
Sounds more like a job for awk.
awk -F, 'NR <= 1000 { next; }
{ gsub(/^\"|\"$/, "", $100); split($100, a, / /);
for (v=1; v<=length(a); ++v) print $99, a[v]; }' server.txt >output.txt
The general form of an awk program is a sequence of condition { action } expressions. The first line has the condition NR <= 1000 where NR is the current line number. If the condition is true, the next action skips to the next input line. Otherwise, we fall through to the next expression, which does not have a condition; so, it's uncoditional, for all input lines which reach here. It first cleans out the double quotes around the 100th field value, and then splits it on spaces into the array a. The for loop then loops over this array, printing the 99th field value and the vth element of the array, starting with v=1 and up through the end of the array.
The input file format is sort of cumbersome. The gsub and split stuff could be avoided with a slightly more sane input format. If you are new to awk, you should probably go look for a tutorial.
If you only want to learn one scripting language, I would suggest Perl or Python over awk, but it depends on your plans and orientation.

Resources