Edit fields in csv files using bash

I have a bunch of csv files that need "cleaning".
Specifically, there is a column that contains timestamp values, however some lines have a value of '1' instead.
What I wish to do is replace those 1's with the last valid (timestamp) value, i.e. replace the value on the i-th line with that of line i-1.
Here is a sample of the file:
URL192.168.2.2,420042,20/07/2015 09:40:00,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:00,3232236038,3232236034
URL192.168.2.2,420042, 1,168430081,168430109
URL192.168.2.2,420042,20/07/2015 09:40:01,3232236038,3232236034
So in this example, the 1 must be replaced with 20/07/2015 09:40:00. I tried it using awk but couldn't nail it.

Assuming no commas in the other fields, an awk program like this should work:
BEGIN { FS = OFS = "," }
$3!=1 { prev = $3 }
$3==1 { $3 = prev }
{ print }
Warning: this is untested code.
The first line sets the field separator to a comma, for both input and output. The second line saves the timestamp of every row that has a timestamp in the third field. The third line writes the most recently saved timestamp to every row that doesn't have a timestamp in the third field. And the fourth line writes every input line, whether modified or not, to the output.
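To apply it in place across a batch of files, here is a minimal sketch (assuming the program above is saved as fix.awk; the *.csv glob is a placeholder, adjust it to match your files):
# Run the cleaner over every csv; write to a temp file, then replace the original.
for f in *.csv; do
    awk -f fix.awk "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done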
Let me know how you get on.

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file, suppose xyz.dat, which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
M.dat contains columns 2|4|6, like below, after running some logic on it -
b11|d11|f11
b22|d22|f22
b33|d33|f33
O.dat contains all the columns except 2|4|6, like below, without any change -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge both M and O file to create back the original format xyz.dat file.
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note the column positions can change for another file. I will be given the column positions (in the above example, 2,4,6), so I need either a generic command to run in a loop to merge the new M and O files, or one command to which I can pass the column positions so it copies the columns from M.dat and pastes them into O.dat.
I tried paste, sed and cut, but was not able to come up with a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting engine (Python, Awk, Perl, or even bash). Tools like paste, sed and cut do not have enough flexibility for this task (join comes close, but requires extra work).
Consider the following awk-based script:
awk -F'|' -v OFS='|' '
{
    # Read the matching line from o.dat and split it into tokens in "a"
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n], or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, the goal appears to be: given two input files, and a list of columns where data should be merged from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# Or, if the file is marked executable (chmod +x mergeCols):
./mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns will carry the data from a.dat.
Implementation, using awk (create a file mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS = OFS = "|"
}
NR==1 {
    # Build the column map from the comma-separated COLS list
    nc = split(COLS, c, ",")
    for (i=1; i<=nc; i++) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from the merged-columns file, split into tokens in "a"
    getline s < M
    n = split(s, a)
    # Merge columns using the pre-set "cmap"
    k = 0
    for (i=1; i<=NF+nc; i++) {
        # Pick a column: from M if this position is mapped, otherwise the next input column
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i < NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
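With the question's sample m.dat and o.dat (and column list 2,4,6), the script should reproduce the original layout:
$ ./mergeCols COLS=2,4,6 M=m.dat o.dat
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3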

Sort numerically in a string of text

I tried some sort examples but can't find the way to solve this. I think I should find the right separator and then sort numerically, but it doesn't work as I'd like.
This is my file:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg3_bla_reg_26_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
And this is my desire result:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
$ sort -t_ -k5,5 -k8,8n file
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
That may or may not produce the output you expect, because -k5,5 compares the regN field lexically; it can misorder lines if that field contains 2-digit numbers.
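For example, a plain lexical sort places reg10 before reg2:
$ printf '%s\n' reg10 reg2 | sort
reg10
reg2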
Using awk
$awk -F"_" 'function print_array(arr,max){ for(i=1; i<=max; i++) if(a[i]){print a[i], a[i]="";} } key==$5{a[$8]=$0; key=$5; max=$8>max?$8:max} key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} END{print_array(a,max)}' file
Output:
abc_bla_bla_bla_reg0_bla_reg_1_0
abc_bla_bla_bla_reg0_bla_reg_2_0
abc_bla_bla_bla_reg0_bla_reg_5_0
abc_bla_bla_bla_reg0_bla_reg_10_0
abc_bla_bla_bla_reg0_bla_reg_15_0
abc_bla_bla_bla_reg2_bla_reg_7_0
abc_bla_bla_bla_reg2_bla_reg_9_0
abc_bla_bla_bla_reg2_bla_reg_15_0
abc_bla_bla_bla_reg3_bla_reg_3_0
abc_bla_bla_bla_reg3_bla_reg_5_0
abc_bla_bla_bla_reg3_bla_reg_26_0
Explanation:
awk -F"_" '
function print_array(arr,max)  # Prints the hashed array from i=1 to the max index stored
{
    for(i=1; i<=max; i++)
        if(a[i])
            {print a[i]; a[i]=""}
}
key==$5{a[$8]=$0; max=$8>max?$8:max}
key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8}
END{print_array(a,max)}' file
key==$5{a[$8]=$0; max=$8>max?$8:max} : key holds the 5th field, e.g. reg0 on line one. Initially key is empty, so the first line satisfies the condition below, key!=$5. If the 5th field matches the key set on the previous line, the record is pushed into the array, indexed by the value of field 8, which is the field you want to sort your results on. This works irrespective of the number of digits in $8.
key!=$5{print_array(a,max); key=$5; a[$8]=$0; max=$8} : If the key doesn't match the 5th field, a new record set has started; before proceeding, print the array stored for the previous record set.
END{print_array(a,max)} : Prints the last record set.
sort -V file
-V, --version-sort
natural sort of (version) numbers within text
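Note that -V is a GNU extension and may not be available in every sort implementation; on the sample file above it produces the desired ordering directly.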

Awk script - loop through row values of two columns in a csv file [duplicate]

This question already has an answer here:
Shell script - loop through values in multiple columns of a csv file
(1 answer)
Closed 5 years ago.
I am working with a huge CSV file (filtest.csv) that contains two columns. In column 1, I want to read the current row and compare it with the value of the previous row. If it is greater than or equal, continue comparing; if the value of the current cell is smaller than that of the previous row, I want to jump to the second column and take the value in the current row of the second column. Then I want to divide the larger (previous) value from column 1 by that column-2 value. To clarify with the example below: the first smaller value in column 1 is 327 (because 327 is smaller than the previous value 340), so we take 500 (the corresponding cell value in column 2) and divide 340 by 500, getting 0.68. My bash script should exit right after we print the value to the console.
338,800
338,550
339,670
340,600
327,500
301,430
299,350
284,339
284,338
283,335
283,330
283,310
282,310
282,300
282,300
283,290
In the following script, I tried to do the division operation on the same row of the two columns, and it works fine:
awk -F, '$1<p && $2!=0{
val=$1/$2
if(val>=0.85 && val<=0.9)
{
print "value is:" $1/p
print "A"
}
else if(val==0.8)
{
print "B"
}
else if(val>=0.5 && val <=0.7)
{
print "C"
}
else if(val==0.5)
{
print "E"
}
else
{
print "D"
}
exit
}
{
p=$1
}' filetest.csv
But how can we loop through the values in the two columns and apply control statements to two different rows of the two columns, as I mentioned earlier?
From the first description:
awk -F, '$1<prev{print prev/$2;exit}{prev=$1}' <input.txt
At the end of each line, the 1st column is stored in prev.
Then, when the value of the 1st column is less than prev, it prints the ratio and exits.
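With the sample rows above saved as filetest.csv, the expected output is:
$ awk -F, '$1<prev{print prev/$2; exit}{prev=$1}' filetest.csv
0.68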

Finding and saving the last occurrence of a string using awk

I need to find the last occurrence of a string in a plain text file (no delimiters or columns) and save its line number and the entire line in variables for later use in my script.
Then I need to check if there is an occurrence of a second string after the line we just found.
I'm unsure of how to do this; I'm a scrub at bash. I'm not sure how to save the results of awk in a variable, and I'm not sure of the logic I'd need to find the last occurrence of a string. Any advice/guidance would be amazing.
# Remember last line on which we saw "string_to_match", and the line itself
/string_to_match/ { last1 = NR; line=$0 }
# Remember last line on which we saw "second_string"
/second_string/ { last2 = NR }
# At the end of the file, if last2 was after last1, print it.
END { if (last2 > last1) print last2 }
Basically, just process each line in turn, and every time you find the first string, update the last1 and line variables.
Similarly, every time you see the second string, update the last2 variable.
When you reach the end of the file, last1 will be the last line on which you saw the first string. At that point you can check whether the second string was seen after that point. You can also do whatever processing you need using last1 and line.
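To get those results back into shell variables, here is a minimal sketch (string_to_match and file.txt are placeholders):
# Capture awk's END output; read puts the first word (the line number)
# in last1 and the rest of the line in line.
read -r last1 line <<< "$(awk '/string_to_match/ { n = NR; l = $0 } END { print n, l }' file.txt)"
echo "last match on line $last1: $line"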

Change date and data cells in .csv file progressively

I have a file that I'm trying to get ready for my boss in time for his manager's meeting tomorrow morning at 8:00AM -8GMT. I want to retroactively change the dates in non-consecutive rows in this .csv file (truncated):
,,,,,
,,,,,sideshow
,,,
date_bob,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bob_available,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383,531383
bob_used,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312,448312
,,,
date_mel,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
mel_available,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537,343537
mel_used,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159,636159
,,,
date_sideshow-ws2,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
sideshow-ws2_available,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239,936239
sideshow-ws2_used,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441,43441
,,,
,,,,,simpsons
,,,
date_bart,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
bart_available,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559,62559
bart_used,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117,1135117
,,,
date_homer,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
homer_available,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799,17799
homer_used,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877,1179877
,,,
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
lisa_available,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899,3899
lisa_used,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777,1193777
In other words a row that now reads:
date_lisa,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14,09-17-14
would desirably read:
date_lisa,09-04-14,09-05-14,09-06-14,09-07-14,09-08-14,09-09-14,09-10-14,09-11-14,09-12-14,09-13-14,09-14-14,09-15-14,09-16-14,09-17-14
I'd like to make the daily available numbers smaller at the beginning and then progressively bigger day by day. This means the used rows will have to be proportionately smaller at the beginning and then grow progressively, in lock step with the available rows.
Not by a large amount; don't make it look obvious, just a few GB here and there. I plan to make pivot tables and graphs out of this, so it has to vary a little. BTW the numbers are all in MB, as I generated them using df -m.
Thanks in advance if anyone can help me.
The following awk does what you need:
awk -F, -v OFS=, '
/^date/ {
    split($2, date, /-/)
    for (i=2; i<=NF; i++) {
        # Count back from the original date so the last field keeps it
        $i = date[1] "-" sprintf("%02d", date[2] - NF + i) "-" date[3]
    }
}
/available|used/ {
    for (i=2; i<=NF; i++) {
        # Scale each field by i/NF so values grow toward the original last value
        $i = int(($i * i) / NF)
    }
}1' csv
Set the input and output field separator to ,.
On lines that start with date, we split the second field on - to get the date parts.
We then iterate from the second field to the end of the line, setting each field to a newly calculated date that counts backwards from the original using the total number of fields, so the last field keeps the original date.
All other lines remain as-is and get printed along with the modified lines.
This has the caveat of not rolling over month boundaries correctly.
For the data fields we iterate from the second field to the end of the line and scale each value so it grows progressively, ending at the original value in the last field.
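For example, with bob_available = 531383 and NF = 15, field 2 becomes int(531383*2/15) = 70851 and field 15 remains 531383*15/15 = 531383, so the series ramps up to the original value.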
