How to strip date in csv output using shell script? - bash

I have a few csv extracts that I am trying to fix up the date on, they are as follows:
"Time Stamp","DBUID"
2016-11-25T08:28:33.000-8:00,"5tSSMImFjIkT0FpiO16LuA"
The first column is always the "Time Stamp", I would like to convert this so it only keeps the date "2016-11-25" and drops the "T08:28:33.000-8:00".
The end result would be..
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
There are plenty of files with different dates.
Is there a way to do this in ksh? Some kind of for each loop to loop through all the files and replace the long time-stamp and leave just the date?

Use sed:
$ sed '2,$s/T[^,]*//' file
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
How it works:
2,$        # Skip the header (first line); removing this range would make the
           # replacement on the first line as well.
s/T[^,]*// # Replace everything between T (inclusive) and , (exclusive)
# `[^,]*' Matches everything but `,' zero or more times
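The question also asks for a loop over all the files in ksh; a minimal sketch of that, assuming GNU sed's -i option is available (without it, write to a temp file and move it back):

```shell
# Apply the same substitution to every CSV in the current directory.
# Assumes GNU sed (-i edits in place); with a POSIX-only sed, use a
# temp file instead: sed '2,$s/T[^,]*//' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
for f in *.csv; do
  sed -i '2,$s/T[^,]*//' "$f"
done
```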

Here's one solution using a standard aix utility,
awk -F, -v OFS=, 'NR>1{sub(/T.*$/,"",$1)}1' file > file.cln && mv file.cln file
output
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
(but I no longer have access to an aix environment, so only tested with my local awk).
NR>1 skips the header line, and the sub() is limited to only the first field (up to the first comma). The trailing 1 char is awk shorthand for {print $0}.
If your data layout changes and you get extra commas in your data, this may require fixing.
IHTH

Using sed:
sed -i "s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*,/\1-\2-\3,/" file.csv
Output:
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
-i edit files inplace
s substitute

This is a perfect job for awk, but unlike the previous answer, I recommend using the substring function.
awk -F, 'NR > 1{$1 = substr($1,1,10)} {print $0}' file.txt
Explanation
-F,: The -F flag sets a field separator, in this case a comma
NR > 1: Ignore the first row
$1: Refers to the first field
$1 = substr($1,1,10): Sets the first field to the first 10 characters of the field. In the example, this is the date portion
print $0: This will print the entire row

Related

Align numbers using only sed

I need to align decimal numbers with the "," symbol using only the sed command. The "," should go in the 5th position. For example:
183,7
2346,7
7,999
Should turn into:
 183,7
2346,7
   7,999
The maximum amount of numbers before the comma is 4. I have tried using this to remove spaces:
sed 's/ //g' input.txt > nospaces.txt
And then I thought about adding spaces depending on the number of digits before the comma, but I don't know how to do this using only sed.
Any help would be appreciated.
Assuming that there is only one number on each line, that there are at most four digits before the comma, and that there is always a comma:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/ \1/;s/.*\(.....,.*\)/\1/;'
The first s gets rid of everything other than the (first) number on the line, and puts four spaces before it. The second one deletes everything before the fifth character prior to the comma, leaving just enough spaces to right-justify the number.
The second s command might mangle input lines which didn't match the first s command. If it is possible that the input contains such lines, you can add a conditional branch to avoid executing the second substitution if the first one failed. With Gnu sed, this is trivial:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/ \1/;T;s/.*\(.....,.*\)/\1/;'
T jumps to the end of the commands if the previous s failed. Posix standard sed only has a conditional branch on success, so you need to use this circuitous construction:
sed 's/[^0-9,]*\([0-9]\+,[0-9]*\).*/ \1/;ta;b;:a;s/.*\(.....,.*\)/\1/;'
where ta (conditional branch to a on success) is used to skip over a b (unconditional branch to end). :a is the label referred to by the t command.
If you change your mind, here is an awk solution:
$ awk -F, 'NF{printf "%5d,%-d\n", $1,$2} !NF' file
  183,7
 2346,7
    7,999
set the delimiter to comma and handle both parts as separate fields
Try with this:
gawk -F, '{ if($0=="") print ; else printf "%5d,%-d\n", $1, $2 }' input.txt
If you are using GNU sed, you could do as below
sed -r 's/([0-9]+),([0-9]+)/printf "%5s,%d" \1 \2/e' input.txt

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files such that certain information is written into every row for all of a file's entries. For instance this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (°C)",20.04
"Low temperature limit (°C)",-0.02
"Date - Time","Temperature (°C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
Is converted into this format
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(°C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(°C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(°C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use a sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. Any lines that say "serial no." and "location" are important and require information. The other lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove any line with fewer than 3 fields in your final file:
awk 'NF>=3' file

Replace string after first semicolon while retaining the string after that

I have a result file, values separated by ; as below:
137;AJP14028.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846.1_VP35;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
and I want to change the second value (AJP14028.1_VP35) to only AJP14028, without the ".1_VP35" at the back. So the result will be:
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Any idea on how to do this? I am trying to solve this using either sed or awk but I am not really familiar with them yet.
With that input, and focusing on the second field, you can use awk:
$ awk 'BEGIN{FS=OFS=";"} {split($2, arr, /\.1/); $2=arr[1]} 1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Explanation:
BEGIN{FS=OFS=";"} sets FS and OFS to ";". This splits the input on the ; character and sets the output field separator to that same character.
split($2, arr, /\.1/) splits the second field on the pattern of a literal .1 and places the result in an array.
$2=arr[1] is an awk idiom that resets the second field, $2, to the trimmed value. A side effect is that the total record, $0, is rebuilt using the output field separator, OFS.
1 at the end is another awkism -- print the current record.
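If the suffix after the accession varies from row to row, a more general variant (a sketch, not from the original answers) is to delete everything from the first dot onward in the second field:

```shell
# Trim field 2 at the first dot; FS/OFS keep the ";" delimiters intact.
awk 'BEGIN{FS=OFS=";"} {sub(/\..*/, "", $2)} 1' file
```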
If you just have the fixed string .1_VP35 to remove (and you do not care if it is field specific) you can just used sed:
sed 's/\.1_VP35//' file
awk '{sub(/\.1_VP35/,"")}1' file
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
sed -r 's/(^[^.]*)(.[^;]*)(.*)/\1\3/g' inputfile
137;AJP14028;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14037;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14352;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
137;AJP14846;HLA-A*02:01;MVAKYDFLV;0.79200;0.35000;0.87783;0.99826;0.30;<-E
Here, back-referencing is used to divide the input line into three groups, separated by (). Later they are referred to as \1 and so on.
The first group matches from the start of the line up to the first dot.
The second group matches everything from the first dot up to the first semicolon.
The third group matches everything that follows.
This might work for you (GNU sed):
sed 's/\(;[^.]*\)[^;]*/\1/' file
Make a back reference of the first ; and everything after it that is not a ., then remove everything from there up to the next ;.

I need to be able to print the largest record value from txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1 or first field) in order for me to increment this unique record and add more records to the file. I seem to not be able to figure this one out.
First, I sort the file by the first field in descending order and then, perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -r file|cut -d"," -f1|head -1)
You need spaces, and quotes
awk -F, 'NR==1{print $1}'
The command is awk, you need a space after it so bash parses your command line properly, otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is what you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first field (comma-separated) of the line.
A better way to accomplish this would be to use awk to find the largest record for you rather than sorting it first, this gives you a solution that is linear in the number of records - you just want the max, no need to do extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
For this and other awk "one liners" look here.
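Since the stated goal is to increment the largest ID and append new records, the max can feed straight into an append step. A sketch, where the new record's remaining fields are placeholders taken from the sample file:

```shell
# Find the current maximum ID, then append a record with max+1.
max=$(awk -F, '$1 > m { m = $1 } END { print m + 0 }' student_records.txt)
next=$((max + 1))
echo "$next,fName lName,Grade,email" >> student_records.txt
```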

bash delete line condition

I couldn't find a solution to conditionally delete a line in a file using bash. The file contains year dates within strings and the corresponding line should be deleted only if the year is lower than a reference value.
The file looks like the following:
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_196001-196912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_197001-197912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_198001-198912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_199001-199912.nc' 'MD5'
'zg_Amon_MPI-ESM-LR_historical_r1i1p1_200001-200512.nc' 'MD5'
I want to get the year 1969 from line 1 and compare it to a reference (let's say 1980) and delete the whole line if the year is lower than the reference. This means in this case the code should remove the first two lines of the file.
I tried with sed and grep, but couldn't get it working.
Thanks in advance for any ideas.
You can use awk:
awk -F- '$4 > 198000 {print}' filename
This will output all the lines where the second date is later than 31/12/1979. This will not edit the file in-place, you would have to save the output to another file then move that in place of the original:
awk -F- '$4 > 198000 {print}' filename > tmp && mv tmp filename
Using sed (will edit in-place):
sed -i '/.*19[0-7][0-9]..\.nc/d' filename
This requires a little more thought, in that you will need to construct a regex to match any values which you don't want to be displayed.
Perhaps something like this:
awk -F- '{ if (substr($4,1,4) >= 1980) print }' input.txt
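To make the cutoff year a shell variable rather than a hard-coded value, it can be passed in with -v; a sketch assuming the same filename layout as above (the end year sits in the first four characters of the fourth dash-separated field):

```shell
ref=1980
# Keep only lines whose end year is >= ref; +0 forces a numeric
# comparison, since the field continues with ".nc' 'MD5'".
awk -F- -v ref="$ref" 'substr($4, 1, 4) + 0 >= ref' file > tmp && mv tmp file
```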
