awk delimiter count didn't go as expected - bash

I was counting the number of '|#~' delimiters in my client data; I have to do this because sometimes I receive fewer or more delimiters than expected. I normally use this syntax to find the number of delimiters per row:
awk -F "|#~" '{print NF-1}' myDATA
It usually works, but today it returned a count of only 2 where I expected 6. When I checked the data manually I could see 6 delimiters, so I copied the row and pasted it into Notepad++. Surprisingly, not all of the line was copied, only part of it, and that part contains only 2 delimiters, just as the script reported. What makes this happen?
What I see and I want to copy : 0123|#~123123|#~21321303|#~00000009213123|#~ 002133123123.|#~ 000000000.|#~CITY
Paste result : 0123|#~123123|#~21321303
missing paste : |#~00000009213123|#~ 002133123123.|#~ 000000000.|#~CITY
It seems there is something between the 3rd delimiter and the last character of the 3rd field, because I had to copy the row in two pieces to paste it into this site. That is consistent with the awk result of only 2 |#~ delimiters, but the real count is 6, not 2.

As your hexdump revealed, there are null bytes in your text file.
GNU Awk 4.1.4 and 5.1.0 seem to treat these as the end of the file. Example:
$ awk '{print NF}' <<< $'a b c\nx y'
3
2
$ awk '{print NF}' <<< $'a\0 b c\nx y'
1
I haven't found a way to change this behavior in man awk. However, you probably don't want the null bytes in your file to begin with, so you can simply delete them before running awk. To delete all null bytes from a file, use this command:
tr -d \\0 < /path/to/broken/input/file > /path/to/fixed/output/file
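If you prefer not to write a cleaned copy of a large file first, the cleanup and the delimiter count can also be chained in a single pipeline. A minimal sketch, assuming the file is still called myDATA and the delimiter is '|#~' (the | goes in a bracket expression because a multi-character field separator is treated as a regular expression, where a bare | means alternation):
# strip NUL bytes on the fly, then count '|#~' delimiters per row
tr -d '\0' < myDATA | awk -F '[|]#~' '{print NF - 1}'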

Related

How to make a table using bash shell?

I have multiple text files, each containing a single column of data. I want to combine them into one text file shaped like a table, not one long column.
I tried 'paste' and 'column', but they did not produce the shape I wanted.
When I used paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts with three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance it seems fine, but when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original data, especially toward the end.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look wrong.
How can I make a well-structured table in a text file with the bash shell?
Or is bash not suitable for this kind of work?
If so, I will try it with Python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple string concatenation can take the records from as many files as you have and join them into a single output file for further processing. You simply give awk all the input files to read, concatenate each record into an array indexed by FNR (the record number within the current file), and then use the END rule to print the combined records from all files.
For example, given 3 data files, data1.txt through data3.txt, each with an integer on each row:
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1} END {for (i=1; i<=FNR; i++) print a[i]}' data1.txt data2.txt data3.txt
(using a tab, "\t", as the separator between the columns of the output file -- you can change it to suit your needs; the END loop uses a numeric index because "for (i in a)" does not guarantee the rows come out in order)
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or using a "," as the separator, you would receive:
1,4,7
2,5,8
3,6,9
If your data file has more fields than a single integer and you want to compile all fields in each file, you can assign $0 to the array instead of the first field $1.
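In that case only the appended value changes; a sketch of the whole-record variant (still presuming an equal number of records in each input file):
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $0 : $0} END {for (i=1; i<=FNR; i++) print a[i]}' data1.txt data2.txt data3.txt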
Spaced and formatted in multi-line format (for easier reading), the same awk script would be
awk '
{
    a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
}
END {
    for (i = 1; i <= FNR; i++)
        print a[i]
}
' data1.txt data2.txt data3.txt
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.

How to use awk to read part of a line, including the spaces?

I want to extract a value using awk's substr, counting characters (including the spaces) rather than relying on any separator.
For example, below is the input, and I want to extract the "29611", including the space before it:
201903011232101029 2961104E3021 223 0 12113 5 15 8288 298233 0 45 0 39 4
I used this method, but it used space as a separator:
more abbas.dat | awk '{print substr($1,1,16),substr($1,17,25)}'
Expected output should be :
201903011232101029 2961
But it prints only
201903011232101029
My question is: how can I print using substr in a way that counts the spaces as well?
I know I can use this command to get the desired output, but it does not help with my actual objective:
more abbas.dat | awk '{print substr($1,1,16),substr($2,1,5)}'
1st solution: With your shown samples, please try the following awk code. Written and tested in GNU awk, it uses awk's match function to get the required output.
To print the 1st field, followed by the varying spaces, followed by 5 digits from the 2nd field, use:
awk 'match($0,/^[0-9]+[[:space:]]+[0-9]{5}/){print substr($0,RSTART,RLENGTH)}' Input_file
OR, to print 16 characters from the 1st field and 5 from the 2nd field, including the varying-length run of spaces between the 1st and 2nd fields:
awk 'match($0,/^([0-9]{16})[^[:space:]]+([[:space:]]+)([0-9]{5})/,arr){print arr[1] arr[2] arr[3]}' Input_file
2nd solution: Using GNU grep, please try the following, which allows the first 5 needed characters of your 2nd column to be anything (e.g. digits, letters, etc.).
grep -oP '^\S+\s+.{5}' Input_file
OR, to match only 5 digits in the 2nd field, make a minor change to the above grep:
grep -oP '^\S+\s+\d{5}' Input_file
If there is always exactly one space, you can use the following command, which prints the first field plus the first 5 characters of the second field.
N.B. It's not clear in the question whether you want 4 or 5 characters, but that can be adjusted easily.
more abbas.dat | awk '{print $1" "substr($2,1,5) }'
I think the simplest way is to set the field separator (-F) to a character that never occurs in your data, so that $1 becomes the entire line and substr counts the spaces too:
awk -Fs '{print substr($1,1,18), substr($1,20,5)}' abbas.dat
$ awk '{print substr($0,1,24)}' file
201903011232101029 29611
If that's not all you need then edit your question to clarify your requirements.
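For what it's worth, if the width of the first field can vary from line to line, a length-independent variant is possible with index. A sketch, assuming the first 5 characters of the 2nd field are wanted and that the text of the 2nd field does not also occur earlier in the line:
awk '{print substr($0, 1, index($0, $2) + 4)}' abbas.dat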

Transforming a field in a csv file and resaving to another file with bash [duplicate]

This question already has answers here:
How can I change a certain field of a file into upper-case using awk?
I apologize in advance if this seems like a simple question. However, I am a beginner in bash commands and scripting, so I hope you all understand why I am not able to solve this on my own.
What I want to achieve is to change the values in one field of a csv file to uppercase, and then resave the csv file with the transformed field and all the other fields included, each retaining their index.
For instance, I have this csv:
1,Jun 4 2021,car,4856
2,Jul 31 2021,car,4154
3,Aug 14 2021,bus,4070
4,Aug 2 2021,car,4095
I want to transform the third field that holds the vehicle type into uppercase - CAR, BUS, etc. and then resave the csv file with the transformed field.
I have tried using the 'tr' command thus:
cut -d"," -f4 data.csv | tr '[:lower:]' '[:upper:]'
This takes the field and does the transformation. But how do I paste and replace the column in the csv file?
It did not work because the field argument cannot be passed into the tr command.
With GNU awk:
awk -i inplace 'BEGIN{FS=","; OFS=","} {$3=toupper($3)} {print}' file
Output to file:
1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095
See: How can I change a certain field of a file into upper-case using awk?, Save modifications in place with awk and 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
A gnu sed solution:
sed -i -E 's/^(([^,]+,){2})([^,]+)/\1\U\3/' file.csv
cat file.csv
1,Jun 4 2021,CAR,4856
2,Jul 31 2021,CAR,4154
3,Aug 14 2021,BUS,4070
4,Aug 2 2021,CAR,4095
Explanation:
^: Start
(([^,]+,){2}): Match first 2 fields and capture them in group #1
([^,]+): Match 3rd field and capture it in group #3
\1: Put capture value of group #1 back in replacement
\U\3: Put uppercase capture value of group #3 back in replacement
Or a gnu-awk solution:
awk -i inplace 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv
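Both of the above rely on GNU extensions (awk's -i inplace and sed's \U). If those are not available, a portable sketch is to write to a temporary file and then move it over the original (the temporary file name here is arbitrary):
awk 'BEGIN {FS=OFS=","} {$3 = toupper($3)} 1' file.csv > file.csv.tmp && mv file.csv.tmp file.csv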
Using cut and tr, you need to add paste to the mix.
SEP=","
IN="data.csv"
paste -d"$SEP" \
  <( <"$IN" cut -d"$SEP" -f1,2 ) \
  <( <"$IN" cut -d"$SEP" -f3 | tr '[:lower:]' '[:upper:]' ) \
  <( <"$IN" cut -d"$SEP" -f4 )
I did factor out the repeating things - separator and input file - into variables SEP and IN respectively.
How it all works:
get the untransformed columns before #3
get col #3 and transform it with tr
get the remaining columns
paste it all together, line by line
the need for intermediate files is avoided by using process substitution
Downsides:
the data seems to be read 3 times, but disk cache will help a lot
the data is parsed 3 times, for sure (by cut)
but unless your input is a few gigabytes, this does not matter

Counting the number of delimiters in a very large file (~50 GB) with shell scripts

I have a file 'test.txt' which contains over 2,000,000,000 records.
Each record is on a separate line and has multiple fields separated by a delimiter | .
Each row should have an equal number of fields, but the problem is that some rows can have fewer or more delimiters than expected.
Can someone please suggest the most efficient way in Unix, for a file this large, to identify such rows? (For example, counting the | characters in each row and flagging an error if the count is too low or too high.)
I tried
awk -F '|' 'NF != 35 {print NR, $0} ' test.txt
but while pressing Enter I was getting numbers one at a time: 1, then 2 after the second Enter, then 3 after the third Enter.
This should do the trick:
awk 'BEGIN { FS="|";}{ if (NF != 36) print $0}' yourFile.txt
Explanation :
BEGIN is used to do pre-processing in awk scripts before the main pattern matching is done. Here I set the field separator to | instead of the default whitespace.
NF is an internal awk variable holding the number of fields present in the current record. You wanted to check whether a row contains more or fewer than 35 delimiters.
That is equivalent to checking whether there are more or fewer than 36 fields in a given row.
see this link for a good introduction to awk scripting
This doesn't answer your question, but awk shouldn't behave differently depending on file size, and the command you posted shouldn't prompt you to press Enter. Are you sure that there isn't just some (console) buffering going on, and that the command would run to completion all the same without any input?
You could try this, which would feed awk's STDIN as many newlines as it wants to read:
yes '' | awk -F '|' 'NF != 35 {print NR, $0} ' test.txt
As for efficiency: provided the command works correctly, there really isn't any way to perform the desired operation more efficiently than by looking at every single line (runtime O(n), where n is the number of lines).
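For a file this size it can also help to record where the bad rows are instead of only printing them, so they can be fixed or skipped later. A sketch along those lines, assuming 35 '|' delimiters (36 fields) per good row and a hypothetical output file bad_rows.txt:
# report the line number and delimiter count of every malformed row
awk -F'|' 'NF != 36 {print NR": "(NF-1)" delimiters"}' test.txt > bad_rows.txt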

awk output is acting weird

cat TEXT | awk -v var=$i -v varB=$j '$1~var , $1~varB {print $1}' > PROBLEM HERE
I am passing two variables from an array to parse a very large text file by range. And it works, kind of.
if I use ">" the output to the file will ONLY be the last three lines as verified by cat and a text editor.
if I use ">>" the output to the file will include one complete read of TEXT and then it will divide the second read into the ranges I want.
if I let the output go through to the shell I get the same problem as above.
Question:
It appears awk is reading every line and printing it. Then it goes back and selects the ranges from the TEXT file. It does not do this if I use constants in the range pattern search.
I understand awk must read all lines to find the ranges I request.
why is it printing the entire document?
How can I get it to ONLY print the ranges selected?
This is the last hurdle in a big project and I am beating my head against the table.
Thanks!
Give this a try; you didn't assign varB the right way:
yours: awk -v var="$i" -varB="$j" ...
mine : awk -v var="$i" -v varB="$j" ...
^^
Aside from the typo, you can't use variables inside //; instead you have to use a regular ~ match. Also, quote your shell variables (not strictly needed here, but to set a good example). For example:
seq 1 10 | awk -v b="3" -v e="5" '$0 ~ b, $0 ~ e'
should print 3..5 as expected
It sounds like this is what you want:
awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
e.g.
$ cat file
1
2
foo
3
4
bar
5
foo
6
bar
7
$ awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
foo
3
4
bar
foo
6
bar
but without sample input and expected output it's just a guess and this would not address the SHELL behavior you are seeing wrt use of > vs >>.
Here's what happened: I used an array to feed my variables, and I set the loop counter to what I thought was the total length of the array. When the final iteration was reached, a null value was passed to awk for the variable, which caused it to print EVERYTHING. Once the counter matched the actual number of array elements, the printing oddity ended.
As far as > vs >> goes, I don't know. The problem did stop, but I wasn't as careful about documenting it. I think what happened is that I used $1 in the print command to save time, and with each line printed at the end it erased the whole file and left only the last three identical matches. Something to ponder. Thanks Ed for the honest work, and no thank you to the robo-responses.
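The empty-variable explanation is easy to verify: matching with ~ against an empty string succeeds on every line, so an unset or empty awk variable used as a match pattern effectively selects everything. A quick sketch with hypothetical data from seq:
seq 1 5 | awk -v var="" '$0 ~ var'   # prints all five lines; every string matches the empty regex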
