Search a CSV file for a value in the first column, if found shift the value of second column one row down - bash

I have CSV files that look like this:
786,1702
787,1722
-,1724
788,1769
789,1766
I would like a bash command that searches the first column for the - and, if found, shifts the values in the second column down by one row. The - recurs several times in the first column, and the shift would need to start from the top to preserve the order of the second column.
The second column would be left blank on the - rows.
Desired output:
786,1702
787,1722
-,
788,1724
789,1769
790,1766
So far I have: awk -F ',' '$1 ~ /^-$/' filename.csv to find the hyphens, but shifting the 2nd column down is tricky...

This assumes that the left column keeps counting up with incremental IDs, so the right column can keep shifting down until the stack is empty.
awk 'BEGIN{start=0;FS=","}$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1","stack[i]}}' filename.csv
# or
<filename.csv awk -F, -v start=0 '$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1","stack[i]}}'
Or, explained:
I am using a shifted stack here to avoid rewriting indexes: start points to the first useful element of the stack and stacklen to one past the last. This avoids the costly operation of shifting all array elements whenever we want to remove the first one.
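The same head/tail bookkeeping can be shown in isolation (a generic sketch of mine, not tied to the CSV problem):
awk 'BEGIN {
    start = 0; stacklen = 0
    stack[stacklen++] = "a"                      # push
    stack[stacklen++] = "b"                      # push
    print stack[start]; delete stack[start++]    # pop from the front: prints "a"
    print stacklen - start                       # elements still queued: prints 1
}'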
# chmod +x shift_when_dash
./shift_when_dash filename.csv
with shift_when_dash being an executable file containing:
#!/usr/bin/awk -f
BEGIN {                      # Everything in this block is executed once before opening the file
    start = 0                # Needed because we are using it in a scalar context before initialization
    FS = ","                 # Input field separator is a comma
}
$1 == "-" {                  # We match the special case where the first column is a simple dash
    stack[stacklen++] = $2   # We store the second column on top of our stack
    print $1 ","             # We print the dash without a second column, as asked by the OP
    next                     # We stop processing the current record and go on to the next record
}
stacklen - start {           # In case we still have something in our stack
    stack[stacklen++] = $2   # We store the current 2nd column on the stack
    print $1 "," stack[start]   # We print the current ID with the first stacked element
    delete stack[start++]    # Free up some memory and increment our pointer
    next
}
1                            # We print the line as-is, without any modification.
                             # This applies to lines which were not skipped by the
                             # 'next' statements above, so in our case all lines before
                             # the first dash is encountered.
END {
    for (i = start; i < stacklen; i++) {      # For every element remaining in the stack after the last line
        print $1 - start + i + 1 "," stack[i] # We print a new incremental ID with the stack element
    }
}
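Running the script on the sample data from the question should reproduce the desired output:
$ ./shift_when_dash filename.csv
786,1702
787,1722
-,
788,1724
789,1769
790,1766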
next is an awk statement similar to continue in other languages, with the difference that it skips to the next input line instead of the next loop iteration. It is useful for emulating a switch/case construct.
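A tiny illustration (input made up here) of how next hands control back to the input loop, so later blocks never see the same line:
$ printf 'foo\nbar\n' | awk '/foo/{print "matched"; next} {print "other: " $0}'
matched
other: bar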

Related

Command output with empty values to csv

> lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME
NAME      LABEL FSTYPE MOUNTPOINT      SIZE TYPE
nvme0n1                              894.3G disk
nvme0n1p1              [SWAP]            4G part
nvme0n1p2                                1G part
nvme0n1p3 root         /home/cg/root 889.3G part
I need the output of this command in csv format, but all the methods I've tried so far don't handle the empty values correctly, thus generating bad rows like these I got with sed:
> lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME | sed -E 's/ +/,/g'
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,894.3G,disk
nvme0n1p1,[SWAP],4G,part
nvme0n1p2,1G,part
nvme0n1p3,root,/home/cg/root,889.3G,part
Any idea how to add the extra commas for the empty fields?
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,,,,894.3G,disk
Make sure that the fields that are possibly empty are at the end of the line, and then re-arrange them into the required sequence.
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,LABEL -x NAME | awk 'BEGIN{OFS=";"} { print $1, $6, $4, $5, $2, $3 }'
Just:
lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME -r | tr ' ' ','
Not really bash, but a quick and dirty Perl would be something like:
my $state=0;
my @input=<>;                          # slurp all input lines
my $maxlength=0;
# find the length of the longest line
for my $line (0 .. $#input) {
    my $curlength = length($input[$line]);
    if ($curlength > $maxlength) { $maxlength = $curlength; }
}
# pad every line with spaces so they all have at least the same width
my $fill = ' ' x $maxlength;
for my $line (0 .. $#input) {
    chomp $input[$line];
    $input[$line] = "$input[$line] $fill";
}
# every character column that is blank on all lines becomes a separator
for (my $pos = 0; $pos < $maxlength; $pos++) {
    my $spacecol = 1;
    for my $line (0 .. $#input) {
        if (substr($input[$line], $pos, 1) ne ' ') {
            $spacecol = 0;
        }
    }
    if ($spacecol == 1) {
        for my $line (0 .. $#input) {
            substr($input[$line], $pos, 1) = ';';
        }
    }
}
for my $line (0 .. $#input) {
    print "$input[$line]\n";
}
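One way to run it, assuming the script above is saved as, say, cols_to_csv.pl (the name is just a placeholder):
lsblk -o NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE -x NAME | perl cols_to_csv.pl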
Assumptions:
output format is fixed-width
header record does not contain any blank fields
no fields contain white space (ie, only white space occurs between fields)
Design overview:
parse the header to get an initial index for each field; if all columns were left-justified this would be all we need, but with right-justified columns (eg, SIZE) we also need to look for right-justified values that are longer than the associated header field (ie, the value starts at a lower index than the associated header)
for non-header rows we loop through our set of potential fields, using substr()/match() to find the non-space fields in the line and ...
if said field starts and ends before the next field's index then add the field's value to our output variable but ...
if said field starts before next field's index but ends after next field's index then we're looking at a right-justified value of the next field which happens to have an earlier index than the associated header's index; in this case update the index for the next field and add a blank value (for the current field) to our output variable
if said field starts after the index of the next field then the current field is empty; again, add the empty/blank value to our output variable
once we've completed processing a line of input print the output to stdout
One awk idea:
awk '
BEGIN { OFS="," }

# use header record to determine initial set of indexes
FNR==1 { maxNF=NF
         header=$0
         out=sep=""
         for (i=1;i<=maxNF;i++) {
             match(header,/[^[:space:]]+/)                       # find first non-space string
             ndx[i]=ndx[i-1] + prevlen + RSTART - (i==1 ? 0 : 1) # make note of index
             out=out sep substr(header,RSTART,RLENGTH)           # add value to our output variable
             sep=OFS
             prevlen=RLENGTH                                     # need for next pass through loop
             header=substr(header,RSTART+RLENGTH)                # strip off matched string and repeat loop
         }
         print out                                               # print header to stdout
         ndx[1]=1                                                # in case 1st field is right-justified, override index and set to 1
         next
       }

# for rest of records need to determine which fields are empty and/or which fields need the associated index updated
       { out=sep=""
         for (i=1;i<maxNF;i++) {                                 # loop through all but last field
             restofline=substr($0,ndx[i])                        # work with current field thru to end of line
             if ( match(restofline,/[^[:space:]]+/) )            # if we find a non-space match ...
                 if ( ndx[i]-1+RSTART < ndx[i+1] )               # if match starts before index of next field and ...
                     if ( ndx[i]-1+RSTART+RLENGTH < ndx[i+1] )   # ends before index of next field then ...
                         out=out sep substr(restofline,RSTART,RLENGTH)   # append value to our output variable
                     else {                                      # else if match finished beyond index of next field then ...
                         out=out sep ""                          # this field is empty and ...
                         diff=ndx[i+1]-(ndx[i]+RSTART-1)         # figure the difference and ...
                         ndx[i+1]-=diff                          # update the index for the next field
                     }
                 else                                            # current field is empty
                     out=out sep ""
             sep=OFS
         }
         field=substr($0,ndx[maxNF])                             # process last field
         gsub(/[[:space:]]/,"",field)                            # remove all remaining spaces
         print out, field                                        # print new line to stdout
       }
' lsblk.out
This generates:
NAME,LABEL,FSTYPE,MOUNTPOINT,SIZE,TYPE
nvme0n1,,,,894.3G,disk
nvme0n1p1,,,[SWAP],4G,part
nvme0n1p2,,,,1G,part
nvme0n1p3,root,,/home/cg/root,889.3G,part

Print all lines between line containing a string and first blank line, starting with the line containing that string

I've tried awk:
awk -v RS="zuzu_mumu" '{print RS $0}' input_file > output_file
The resulting file is identical to input_file, except that the first line of the file is now zuzu_mumu.
How can my command be corrected?
After solving this, I found the same string/pattern in another arrangement, so I need to save all those matching records too, in an output file, following this rule:
if the pattern matches on a line, look back at the previous lines and print the first line that follows an empty line, then also print the matching line and an empty line.
record 1
record 2
This is record 3 first line
info 1
info 2
This is one matched zuzu_mumu line
info 3
info 4
info 5
record 4
record 5
...
This is record n-1 first line
info a
This is one matched zuzu_mumu line
info b
info c
record n
...
I should obtain:
This is record 3 first line
This is one matched zuzu_mumu line
This is record n-1 first line
This is one matched zuzu_mumu line
Print all lines between line containing a string and first blank line,
starting with the line containing that string
I would use GNU AWK for this task. Let file.txt content be
Able
Baker
Charlie
Dog
Easy
Fox
then
awk 'index($0,"aker"){p=1}p{if(/^$/){exit};print}' file.txt
output
Baker
Charlie
Explanation: the index string function returns either the position of aker inside the whole line ($0) or 0, and we use that as the condition, so it effectively asks: is aker inside this line? Note that using index rather than a regular expression means we do not have to worry about characters with special meaning, for example the dot (.). If the condition holds, set p to 1. While p is set, if the line is empty (it matches start of line immediately followed by end of line) terminate processing (exit); otherwise print the whole line as-is.
(tested in gawk 4.2.1)
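To illustrate the point about special characters (example of mine, not from the original answer): a literal dot is found as-is by index, while in a regular expression it matches any character:
$ printf 'axc\na.c\n' | awk 'index($0, "a.c")'
a.c
$ printf 'axc\na.c\n' | awk '/a.c/'
axc
a.c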
If you don't want to match the same line again, you can record all lines in an array and print the valid lines in the END block.
awk '
f && /zuzu_mumu/ {               # If already found and found again
    delete ary; entries=0; next  # Delete the array, reset the entry count and go to the next record
}
f || /zuzu_mumu/ {               # If already found, or the line matches the word of interest
    if (/^[[:blank:]]*$/) {exit} # If the line contains only spaces, exit
    f=1                          # Mark as found
    ary[++entries]=$0            # Add the current line to the array and increment the entry count
}
END {
    for (j=1; j<=entries; j++)   # Loop over and print the stored lines
        print ary[j]
}
' file

In bash how to transform multimap<K,V> to a map of <K, {V1,V2}>

I am processing output from a file in bash and need to group values by their keys.
For example, I have the following data
13,47099
13,54024
13,1
13,39956
13,0
17,126223
17,52782
17,4
17,62617
17,0
23,1022724
23,79958
23,80590
23,230
23,1
23,118224
23,0
23,1049
42,72470
42,80185
42,2
42,89199
42,0
54,70344
54,72824
54,1
54,62969
54,1
in a file and group all values from a particular key into a single line as in
13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
There are about 10000 entries in my input file. How do I transform this data in shell?
awk to the rescue!
assuming keys are contiguous...
$ awk -F, 'p!=$1 {if(a) print a; a=p=$1}
{a=a FS $2}
END {print a}' file
13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
Here is a breakdown of what @karakfa's code is doing, for us awk beginners. I've written this based on a toy dataset file:
1,X
1,Y
3,Z
p!=$1: check if the pattern p!=$1 is true
checks whether variable p differs from the first field of the current (first) line of file (1 in this case)
since p is undefined at this point it cannot be equal to 1, so p!=$1 is true and we continue with this line of code
if(a) print a: check if variable a exists and print a if it does exist
since a is undefined at this point the print a command is not executed
a=p=$1: set variables a and p equal to the value of the first field of the current (first) line (1 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (first) line separated by the field separator (1,X in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (second) line of file and restart the awk code on that line
p!=$1: check if the pattern p!=$1 is true
since p is 1 and the first field of the current (second) line is 1, p!=$1 is false and we skip the rest of this line of code
a=a FS $2: set a equal to the value of a and the value of the second field of the current (second) line separated by the field separator (1,X,Y in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (third) line of file and restart the awk code
p!=$1: check if the pattern p!=$1 is true
since p is 1 and $1 of the third line is 3, p!=$1 is true and we continue with this line of code
if(a) print a: check if variable a exists and print a if it does exist
since a is 1,X,Y at this point, 1,X,Y is printed to the output
a=p=$1: set variables a and p equal to the value of the first field of the current (third) line (3 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (third) line separated by the field separator (3,Z in this case)
END {print a}: since we have reached the end of file, execute this code
print a: print the last group a (3,Z in this case)
The resulting output is
1,X,Y
3,Z
Please let me know if there are any errors in this description.
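To see the breakdown end to end, the toy dataset can be fed straight into the one-liner:
$ printf '1,X\n1,Y\n3,Z\n' | awk -F, 'p!=$1 {if(a) print a; a=p=$1} {a=a FS $2} END {print a}'
1,X,Y
3,Z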
Slight tweak to @karakfa's answer. If you want the separator between the key and the values to be different than the separator between the values, you can use this code:
awk -F, 'p==$1 {a=a "; " $2} p!=$1 {if(a) print a; a=$0; p=$1} END {print a}'
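With the sample input from the question, this variant should produce lines of the form:
13,47099; 54024; 1; 39956; 0
17,126223; 52782; 4; 62617; 0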

Match a single column entry in one file to a column entry in a second file that consists of a list

I need to match a single column entry in one file to a column entry in a second file that consists of a list (in shell). The awk command I've used only matches to the first word of the list, and doesn't scan through the entire list in the column field.
File 1 looks like this:
chr1:725751 LOC100288069
rs3131980 LOC100288069
rs28830877 LINC01128
rs28873693 LINC01128
rs34221207 ATP4A
File 2 looks like this:
Annotation Total Genes With Ann Your Genes With Ann) Your Genes No Ann) Genome With Ann) Genome No Ann) ln
1 path hsa00190 Oxidative phosphorylation 55 55 1861 75 1139 5.9 9.64 0 0 ATP12A ATP4A ATP5A1 ATP5E ATP5F1 ATP5G1 ATP5G2 ATP5G3 ATP5J ATP5O ATP6V0A1 ATP6V0A4 ATP6V0D2 ATP6V1A ATP6V1C1 ATP6V1C2 ATP6V1D ATP6V1E1 ATP6V1E2 ATP6V1G3 ATP6V1H COX10 COX17 COX4I1 COX4I2 COX5A COX6B1 COX6C COX7A1 COX7A2 COX7A2L COX7C COX8A NDUFA5 NDUFA9 NDUFB3 NDUFB4 NDUFB5 NDUFB6 NDUFS1 NDUFS3 NDUFS4 NDUFS5 NDUFS6 NDUFS8 NDUFV1 NDUFV3 PP PPA2 SDHA SDHD TCIRG1 UQCRC2 UQCRFS1 UQCRH
Expected output:
rs34221207 ATP4A hsa00190
(please excuse the formatting - all the columns are tab-delimited until the column of gene names, $14, called Genome...)
My command is this:
awk 'NR==FNR{a[$14]=$3; next}a[$2]{print $0 "\t" a[$2]}' file2 file1
All help will be much appreciated!
You need to process files in the other order, and loop over your list:
awk 'NR==FNR{a[$2]=$1; next} {for(i=15;i<=NF;++i)if(a[$i]){print a[$i] "\t" $i "\t" $3}}' file1 file2
Explanation:
NR is a global "record number" counter that awk increments for each line read from each file. FNR is a per-file "record number" that awk resets to 1 on the first line of each file. So the NR==FNR condition is true for lines in the first file and false for lines in subsequent files. It is an awk idiom for picking out just the first file's info. In this case, a[$2]=$1 stores the first field's text keyed by the second field's text. The next tells awk to stop processing the current line and to continue with the next input line. A next at the end of the first action clause like this is functionally like an ELSE condition on the remaining code, if awk had such a syntax (which it doesn't): NR==FNR{a[$2]=$1} ELSE {for.... Clearer, and only slightly less time-efficient, would have been to write instead NR==FNR{a[$2]=$1}NR!=FNR{for....
Now to the second action clause. No condition preceding it means awk will do it for every line that is not short-circuited by the preceding next, that is, all lines in files other than the first -- file2 only in this case. Your file2 has a list of potential keys starting in field #15 and extending to the last field. The awk built-in variable for the last field number is NF (number of fields). The for loop is pretty self-explanatory then, looping over just those field numbers. For each of those numbers i we want to know if the text in that field $i is a known key from the first file -- a[$i] is set, that is, evaluates to a non-empty (non-false) string. If so, then we've got our file1 first field in a[$i], our matching file1 second field in $i, and our file2 field of interest in $3 (the text of the current file2 3rd field). Print them tab-separated. The next here is an efficiency-only measure that stops all processing on the file2 record once we've found a match. If your file2 key list might contain duplicates and you want duplicate output lines if there is a match on such a duplicate, then you must remove that last next.
Actually now that I look again, you probably do want to find any multiple matches even on non-duplicates, so I have removed the 2nd next from the code.
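As an aside, the NR==FNR idiom described above can be shown on its own (file names here are just placeholders):
# print the lines of data.txt whose first field appears as a key in keys.txt
awk 'NR==FNR {seen[$1]; next} $1 in seen' keys.txt data.txt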

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline+1;
        myline = myline+1;
    else
        myline = myline+1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for every line. The BEGIN condition is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on. The full line is in $0.
Here I compare the second field to the previous value; if it does not match, I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want to do it right, be careful with the float comparisons: use something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure whether awk converts to numbers for equality testing; if not, you are safe with the string comparisons.
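A sketch of that tolerance-based variant (the eps value 1e-9 is an arbitrary choice, and the NR > 1 guard avoids printing an empty first line):
<data awk 'function abs(x) { return x < 0 ? -x : x }
           NR > 1 && abs($2 - prev) > 1e-9 { print line }
           { line = $0; prev = $2 }'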
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and its second field, and compares them with the current line's values on the next iteration. The && field check is useful to avoid a blank line at the beginning of the output, when $2 != field would match because the variable is still empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
