Awk incorrect tab/space splitting - bash

I have some lines like the following saved into a txt file:
Mike Tyson 1 2 3 4 5
Alì 1 2 3 4 5
The fields are separated by tabs, but the first field can contain two words separated only by a space.
How can I get awk to interpret this correctly? I want only the tab-separated values, like this:
$a=mike tyson
$b=1
$c=2
etc etc....
Right now I'm using a while loop to read each line, terminated by
done < <(awk 'NR>0' file.txt)
but this command sees the value "mike tyson" as two different fields.

Try changing it to:
done < <(awk -F"\t" 'NR>0' file.txt)
By default, awk treats any whitespace (blanks and tabs) as a field separator.
Setting the separator to tab only prevents it from splitting fields on spaces.
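For example, a minimal sketch of the whole loop (assuming the file is named file.txt; note that the IFS=$'\t' on read is what actually makes bash split on tabs only):
while IFS=$'\t' read -r a b; do
    echo "a=$a"    # "Mike Tyson" stays one field
    echo "b=$b"    # b receives the rest of the line
done < <(awk -F"\t" 'NR>0' file.txt)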

The problem is not with awk: you are splitting the columns in bash.
Maybe you're looking for something like this:
IFS=$'\t'
while read -r a b; do
    echo "a=$a"
    echo "b=$b"
done < file.txt
In your sample code awk doesn't seem to play any role btw.
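If you want each value in its own variable, reading into an array is another option (a sketch against the sample data above; note that with read a b, everything after the first tab ends up in b):
while IFS=$'\t' read -r -a f; do
    echo "name=${f[0]}"     # e.g. "Mike Tyson"
    echo "first=${f[1]}"    # e.g. "1"
done < file.txt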

Related

How to make a table using bash shell?

I have multiple text files, each containing a single column. I want to combine them into one text file laid out as a table, not as one long column.
I tried paste and column, but they did not produce the shape that I wanted.
When I used paste with two text files, it made a nice table.
paste height_1.txt height_2.txt > test.txt
The trouble starts with three or more text files.
paste height_1.txt height_2.txt height_3.txt > test.txt
At a glance, it seems fine. But when I plot each column of test.txt in gnuplot (p "test.txt"), I get an unexpected graph that differs from the original data, especially in its last part.
The shape of the table is ruined in a strange way in test.txt, which makes the graph look weird.
How can I make a well-structured table in a text file with the bash shell?
Or is the bash shell not suited to this task?
If so, I will try it with Python.
Height files are extracted from other *.csv files using awk.
Thank you so much for reading this question.
awk with simple concatenation can take the records from as many files as you have and join them into a single output file for further processing. You simply pass the multiple input files to awk, concatenate each record using FNR (the record number within the current file) as an index, and then use the END rule to print the combined records from all files.
For example, given 3 data files, e.g. data1.txt - data3.txt each with an integer in each row, e.g.
$ cat data1.txt
1
2
3
$ cat data2.txt
4
5
6
(7-9 in data3.txt, and presuming you have an equal number of records in each input file)
You could do:
awk '{a[FNR]=(FNR in a) ? a[FNR] "\t" $1 : $1} END {for (i=1; i<=FNR; i++) print a[i]}' data1.txt data2.txt data3.txt
(using a tab, "\t", as the separator between columns of the output file -- you can change it to suit your needs; the counting loop in END keeps the rows in input order, since awk's "for (i in a)" iteration order is unspecified)
The result of the command above would be:
1 4 7
2 5 8
3 6 9
(note: this is what you would get with paste data1.txt data2.txt data3.txt, but presuming you have input that is giving paste problems, awk may be a bit more flexible)
Or, using "," as the separator, you would get:
1,4,7
2,5,8
3,6,9
If your data files have more fields than a single integer and you want to keep all fields from each file, you can assign $0 to the array instead of the first field $1 (see the sketch after the multi-line version below).
Spaced and formatted in multi-line format (for easier reading), the same awk script would be
awk '
{
    a[FNR] = (FNR in a) ? a[FNR] "\t" $1 : $1
}
END {
    for (i = 1; i <= FNR; i++)
        print a[i]
}
' data1.txt data2.txt data3.txt
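For completeness, a sketch of the $0 variant mentioned above (assuming, as before, that all input files have the same number of records):
awk '
{
    # store the whole record rather than just the first field
    a[FNR] = (FNR in a) ? a[FNR] "\t" $0 : $0
}
END {
    # in END, FNR holds the record count of the last file read
    for (i = 1; i <= FNR; i++)
        print a[i]
}
' data1.txt data2.txt data3.txt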
Look things over and let me know if I misunderstood your question, or if you have further questions about this approach.

awk delimiter counting didn't go as expected

I was counting the number of occurrences of the delimiter '|#~' in my client's data; I have to do this because sometimes I receive fewer or more delimiters than expected. I use this syntax to find the number of delimiters per row:
awk -F "|#~" '{print NF-1}' myDATA
It usually works, but today it somehow returned a count of only 2, whereas I expected 6. After checking the data manually, I can see 6 delimiters there. I then tried to copy the row and paste it into Notepad++; surprisingly, not all of the line was copied, only part of it, and that part contains only 2 delimiters, just as the script reported. What makes this happen?
What I see and I want to copy : 0123|#~123123|#~21321303|#~00000009213123|#~ 002133123123.|#~ 000000000.|#~CITY
Paste result : 0123|#~123123|#~21321303
missing paste : |#~00000009213123|#~ 002133123123.|#~ 000000000.|#~CITY
It seems there is something between the 3rd delimiter and the last character of the 3rd field, because I had to copy the line into this site in 2 pieces. That matches the awk result of only 2 |#~ delimiters, but of course there are 6, not 2.
As your hexdump revealed, there are null bytes in your text file.
GNU Awk 4.1.4 and 5.1.0 seem to treat these as the end of the file. Example:
$ awk '{print NF}' <<< $'a b c\nx y'
3
2
$ awk '{print NF}' <<< $'a\0 b c\nx y'
1
In man awk I haven't found a way to change this behavior. However, you probably don't want the null bytes in your file to begin with, so you can simply delete them before applying awk. To delete all null bytes from a file, use this command:
tr -d \\0 < /path/to/broken/input/file > /path/to/fixed/output/file
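You can also combine the cleanup and the count in a single pipeline (a sketch, assuming your data file is named myDATA as above; the pipe is escaped because a multi-character FS is interpreted as a regular expression):
tr -d '\0' < myDATA | awk -F '\\|#~' '{print NF-1}'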

Create final column containing row numbers in text file

I am new to using the Mac terminal. I need to add a tab-delimited column to a text file with 3 existing columns. The columns look pretty much like this:
org1 1-20 1-40
org2 3-35 6-68
org3 16-38 40-16
etc.
I need them to look like this:
org1 1-20 1-40 1
org2 3-35 6-68 2
org3 16-38 40-16 3
etc.
My apologies if this question has been covered. Answers to similar questions are sometimes exceedingly esoteric and are not easily translatable to this specific situation.
In awk: print the record, then the required tab and row number after it:
$ awk '{print $0 "\t" NR }' foo
org1 1-20 1-40 1
org2 3-35 6-68 2
org3 16-38 40-16 3
If you want to add the line numbers to the last column:
perl -i -npe 's/$/"\t$."/e' file
where
-i replaces the file in-place (remove it if you want to print the result to standard output);
-n causes Perl to apply the expression to each line of the file, just like sed;
-p prints each line after the expression has been applied (it implies -n);
-e accepts a Perl expression;
s/.../.../e substitutes the first part with the second (delimited by slashes), and the e flag causes Perl to evaluate the replacement as a Perl expression;
$ is the end-of-line anchor;
the $. variable holds the current line number.
In other words, the command replaces the end of each line ($) with a tab followed by the line number ($.).
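To preview the result without modifying the file, you can drop -i and let the output go to standard output (a usage sketch against the sample data above):
$ perl -pe 's/$/"\t$."/e' file
org1 1-20 1-40 1
org2 3-35 6-68 2
org3 16-38 40-16 3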
You can paste the file next to the same file with line numbers prepended (nl), and all the other columns removed (cut -f 1):
$ paste infile <(nl infile | cut -f 1)
org1 1-20 1-40 1
org2 3-35 6-68 2
org3 16-38 40-16 3
The <(...) construct is called process substitution and basically allows you to treat the output of a command like a file.
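For example, the command above is roughly equivalent to this two-step version with a temporary file (a generic sketch):
# with a temporary file
nl infile | cut -f 1 > /tmp/linenumbers
paste infile /tmp/linenumbers
# with process substitution, no temporary file is needed
paste infile <(nl infile | cut -f 1)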

Replacing a column in CSV file with another in bash

I have a csv file with a number of columns. I am trying to replace the second column with the second to last column from the same file.
For example, if I have a file, sample.csv
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3,4,5,6
a,e,c,d,e,f
g,k,i,j,k,l
Can anyone help me with this task? Note that I will be discarding the last two columns afterwards with cut, so I am also open to splitting the csv file first and replacing the column in one csv file with a column from another, whichever is easier to implement. Thanks in advance for any help.
How about this simpler awk:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1)} 1' sample.csv
(the trailing 1 is an always-true pattern whose default action is to print the record)
EDIT: Noticed that you also want to discard the last 2 columns. Use this awk one-liner:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-2} 1' sample.csv
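Run against sample.csv above, the second one-liner would print (note that shrinking NF to drop fields works in GNU awk but is not guaranteed by POSIX):
1,5,3,4
a,e,c,d
g,k,i,j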
In bash
while IFS=, read -r -a arr; do
    arr[1]="${arr[4]}"
    printf -v output "%s," "${arr[@]}"
    printf "%s\n" "${output%,}"
done < sample.csv
Pure bash solution, using IFS in a funny way:
# Set globally the IFS, you'll see it's funny
IFS=,
while read -ra a; do
    a[1]=${a[@]: -2:1}
    echo "${a[*]}"
done < file.csv
Setting the IFS variable globally is used twice: once in the read statement, so that each field is split on a comma, and once in the line echo "${a[*]}", where "${a[*]}" expands to the fields of the array a separated by IFS... which is a comma!
Another special thing: you mentioned the second-to-last field, and that's exactly what ${a[@]: -2:1} expands to (mind the space between : and -2), so you don't have to count your number of fields.
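A quick illustration of that expansion in an interactive shell (the space before -2 keeps it from being parsed as the :- default-value operator):
$ a=(1 2 3 4 5 6)
$ echo "${a[@]: -2:1}"
5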
Caveat: CSV files really need a dedicated CSV parser, which is difficult to implement. This answer (and, I guess, all the other answers that don't use a genuine CSV parser) may break if a field contains a comma, e.g.,
1,2,3,4,"a field, with a comma",5
If you want to discard the last two columns, don't use cut, but this instead:
IFS=,
while read -ra a; do
    ((${#a[@]}>=2)) || continue # skip lines with fewer than two fields
    a[1]=${a[@]: -2:1}
    echo "${a[*]::${#a[@]}-2}"
done < file.csv
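Applied to the sample data from the question (the snippet above reads file.csv; adjust the name to match), this would print the same 4-column result as the awk one-liner earlier:
1,5,3,4
a,e,c,d
g,k,i,j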

Removing newlines between tokens

I have a file that contains some information spanning multiple lines. In order for certain other bash scripts of mine to work properly, I need this information to all be on a single line. However, I obviously don't want to remove all newlines in the file.
What I want to do is replace newlines, but only between each pair of STARTINGTOKEN and ENDINGTOKEN, where the two tokens are always on different lines (and never get jumbled together; it's impossible, for instance, to have two STARTINGTOKENs in a row before an ENDINGTOKEN).
I found that I can remove newlines with
tr "\n" " "
and I also found that I can match patterns over multiple lines with
sed -e '/STARTINGTOKEN/,/ENDINGTOKEN/!d'
However, I can't figure out how to combine these operations while leaving the remainder of the file untouched.
Any suggestions?
Are you looking for this?
awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
example:
kent$ cat file
foo
bar
STARTINGTOKEN xx
1
2
ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm
5
6
7
nnn ENDINGTOKEN
8
9
kent$ awk '/STARTINGTOKEN/{f=1} /ENDINGTOKEN/{f=0} {if(f)printf "%s",$0;else print}' file
foo
bar
STARTINGTOKEN xx12ENDINGTOKEN yy
3
4
STARTINGTOKEN mmm567nnn ENDINGTOKEN
8
9
This seems to work:
sed -ne '/STARTINGTOKEN/{ :next ; /ENDINGTOKEN/!{N;b next;}; s/\n//g;p;}' "yourfile"
Once it finds the starting token it loops, picking up lines until it finds the ending token; then it removes all the embedded newlines, prints the result, and repeats.
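Note that because of -n, lines outside the token pairs are not printed; applied to the sample file from the first answer, this would output only the two joined spans:
STARTINGTOKEN xx12ENDINGTOKEN yy
STARTINGTOKEN mmm567nnn ENDINGTOKEN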
Using awk:
awk '$0 ~ /STARTINGTOKEN/ || l {l=sprintf("%s%s", l, $0)}
/ENDINGTOKEN/{print l; l=""}' input.file
Like the sed -n version above, this prints only the joined token spans; lines outside the pairs are dropped.
This might work for you (GNU sed):
sed '/STARTINGTOKEN/!b;:a;$bb;N;/ENDINGTOKEN/!ba;:b;s/\n//g' file
or:
sed -r '/(START|END)TOKEN/,//{/STARTINGTOKEN/{h;d};H;/ENDINGTOKEN/{x;s/\n//gp};d}' file
