Is there a way to treat a single column of integers as an array in order to extract certain digits? - bash

I am trying to treat a series of integers as an array in order to extract the "columns" of interest.
My data after extracting a column of integers looks something like:
01010101010
10101010101
00100111100
10111100000
01011000100
If I'm only interested in the 1st, 4th, and 11th integers, I'd like the output to look like this:
010
101
000
110
010
This problem is hard to describe in words, so I'm sorry for the lack of clarity. I've tried a number of suggestions, but many things such as awk's substr() can't skip over positions (such as picking out only the 1st, 4th, and 11th positions here).

You can use the cut command:
cut -c 1,4,11 file
-c selects only characters.
or using (gnu) awk:
awk '{print $1 $4 $11}' FS= file
FS is the field separator, which is set to the empty string so that every single character becomes its own field.
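For the sample data above, either command should produce exactly the requested output, e.g.:
$ cut -c 1,4,11 file
010
101
000
110
010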

With GNU awk which can use empty string as field separator, you could do:
awk -F '' '{print $1, $4, $11}' OFS='' infile

Could you please try the following awk too.
awk '{print substr($0,1,1) substr($0,4,1) substr($0,11,1)}' Input_file
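If the list of positions changes often, a slightly more general variant of the same substr() idea (just a sketch, not from the original answers; the pos variable name is only for illustration) would be to pass the positions in and loop over them:
awk -v pos='1,4,11' '{
  n = split(pos, p, ",")            # p[1..n] holds the 1-based character positions
  out = ""
  for (i = 1; i <= n; i++)
    out = out substr($0, p[i], 1)   # pick one character per requested position
  print out
}' Input_file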

Related

awk: remove multiple tabs between each fields and output a line where each field is separated by a single tab

I have a file whose 11th line should in theory have 1011 columns, yet it looks like there is more than one tab between some of its fields. More specifically,
If I use
awk '{print NF}' file
then I can see that the 11th line has the same number of fields as all the rest (except for the first ten lines, which have a different format. That's expected).
But if I use
awk 'BEGIN{FS="\t"} {print NF}' file
I can see that the 11th line has 2001 fields. Based on that, I suspect some of its fields are separated by more than one whitespace character.
I'd like to have each field separated by 1 tab only, so I tried
awk 'BEGIN{OFS="\t"} {print}' file > file.modified
However, this doesn't solve the problem as
awk 'BEGIN{FS="\t"} {print NF}' file.modified
still indicates that the 11th line has 2001 fields.
Can anyone point out a way to achieve my goal? Thanks a lot! I have put the first 100 lines of my file at the following Google Drive link:
https://drive.google.com/file/d/1qOjzjUnJKJpc4VpDxwKPBcqMS7MUuyKy/view?usp=sharing
To squeeze multiple tabs to one tab, you could use tr:
tr -s '\t' <file >file.modified
This might help with GNU awk:
awk 'BEGIN{FS="\t+"; OFS="\t"} {$1=$1; print}' file
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
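With either approach, you can re-run the check from the question to confirm the fix; assuming the duplicated tabs were the only problem, the 11th line should now report 1011 fields instead of 2001:
awk 'BEGIN{FS="\t"} NR==11 {print NF}' file.modified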

checking that the rows in a file have the same number of columns

I have a number of tsv files, and I want to check that each file is correctly formatted. Primarily, I want to check that each row has the right number of columns. Is there a way to do this? I'd love a command line solution if there is one.
Adding this here because these answers were all close but didn't quite work for me; in my case I needed to specify the field separator for awk.
The following should return a single line containing the number of columns (if every row has the same number of columns).
$ awk -F'\t' '{print NF}' test.tsv | sort -nu
8
-F is used to specify the field separator for awk
NF is the number of fields
-nu orders the field count for each row numerically and returns only the unique ones
If you get more than one row returned, then there are some rows of your .tsv with more columns than others.
To check that the .tsv is correctly formatted, with each row having the same number of fields, the following should return 1 (as commented by kmace on the accepted answer); however, I needed to add the -F'\t':
$ awk -F'\t' '{print NF}' test.tsv | sort -nu | wc -l
awk '{print NF}' test | sort -nu | head -n 1
This gives you the lowest number of columns in the file on any given row.
awk '{print NF}' test | sort -nu | tail -n 1
This gives you the highest number of columns in the file on any given row.
The two results should be the same if every row has the same number of columns.
Note: this gives me an error on OS X, but not on Debian... maybe use gawk.
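If you prefer a single pass rather than two pipelines, a small sketch that tracks both extremes inside awk itself (add -F'\t' if you need an explicit tab separator, as noted earlier) could look like:
awk 'NR==1 || NF<min {min=NF} NF>max {max=NF} END {print min, max, (min==max ? "consistent" : "inconsistent")}' test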
(I'm assuming that by "tsv", you mean a file whose columns are separated with tab characters.)
You can do this simply with awk, as long as the file doesn't have quoted fields containing tab characters.
If you know how many columns you expect, the following will work:
awk -F '\t' -v NCOLS=42 'NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}'
(Of course, you need to change the 42 to the correct value.)
You could also automatically pick up the number of columns from the first line:
awk -F '\t' 'NR==1{NCOLS=NF};NF!=NCOLS{printf "Wrong number of columns at line %d\n", NR}'
That will work (with a lot of noise) if the first line has the wrong number of columns, but it would fail to detect a file where all the lines have the same wrong number of columns. So you're probably better off with the first version, which forces you to specify the column count.
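A slightly chattier variant of that first-line version (same idea, just a sketch) also reports the expected count, which makes the noise easier to interpret:
awk -F '\t' 'NR==1{NCOLS=NF} NF!=NCOLS{printf "Line %d has %d columns, expected %d\n", NR, NF, NCOLS}'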
Just cleaning up @snd's answer above:
number_uniq_row_lengths=$(awk '{print NF}' "$pclFile" | sort -nu | wc -l)
if [ "$number_uniq_row_lengths" -eq 1 ] 2>/dev/null; then
    echo "$pclFile is clean"
fi
awk is a good candidate for this. If your columns are separated by tabs (I guess that is what tsv means) and if you know how many of them you should have, say 17, you can try:
awk -F'\t' 'NF != 17 {print}' file.tsv
This will print all lines in file.tsv that do not have exactly 17 tab-separated columns. If my guess is incorrect, please edit your question and add the missing information (column separators, number of columns...). Note that the tsv (and csv) format is trickier than it seems: fields can contain the field separator, records can span several lines, and so on. If that is your case, do not try to reinvent the wheel; use an existing tsv parser.
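Since the question mentions a number of tsv files, the same check can be wrapped in a small loop (a sketch; the *.tsv glob and the expected count of 17 are just the example values from above):
for f in *.tsv; do
    awk -F'\t' -v file="$f" 'NF != 17 {printf "%s: line %d has %d columns\n", file, NR, NF}' "$f"
done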

Remove spaces from a single column using bash

I was provided with a CSV file that, in a single column, uses spaces as a thousands separator (e.g. 11 000 instead of 11,000 or 11000). The other columns contain meaningful spaces, so I need to fix only this one column.
My data:
Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11 000.90
I need to get:
Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11000.90
I have been trying awk, sed, and cut, but I can't get it to work.
dirty and quick:
awk -F, -v OFS="," '{gsub(/ /,"",$NF)}1'
example:
kent$ echo "Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11 000.90"|awk -F, -v OFS="," '{gsub(/ /,"",$NF)}1'
Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11000.90
One possibility might be:
sed 's/\([0-9]\) \([0-9]\)/\1\2/'
This looks for two digits on either side of a blank and keeps just the two digits. For the data shown, it would work fine. You can add a trailing g if you might have to deal with 11 234 567.89.
If other columns might contain spaces between digits, or the column to fix is not the first such column, you can use a similar regex in awk with gsub() on the relevant field.
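For the sample data, where the amount is the third comma-separated field, that could look like the sketch below (file.csv stands in for your input; awk's gsub() has no backreferences, so here the spaces are simply stripped from that one field):
awk -F, -v OFS=, '{gsub(/ /, "", $3)} 1' file.csv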
just with bash
$ echo "Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11 000.90" |
while IFS=, read -r date source amount; do
echo "$date,$source,${amount// /}"
done
Date,Source,Amount
1/1/2013,Ben's Chili Bowl,11000.90

Unix cut: Print same Field twice

Say I have file - a.csv
ram,33,professional,doc
shaym,23,salaried,eng
Now I need this output (please don't ask me why):
ram,doc,doc,
shayam,eng,eng,
I am using the cut command:
cut -d',' -f1,4,4 a.csv
But the output remains
ram,doc
shyam,eng
It seems cut can only print a field once. I need to print the same field twice or n times.
Why do I need this? (Optional to read)
Ah. It's a long story. I have a file like this
#,#,-,-
#,#,#,#,#,#,#,-
#,#,#,-
I have to convert this to
#,#,-,-,-,-,-
#,#,#,#,#,#,#,-
#,#,#,-,-,-,-
Here each '#' and '-' refers to different numerical data. Thanks.
You can't print the same field twice. cut prints a selection of fields (or characters or bytes) in order. See Combining 2 different cut outputs in a single command? and Reorder fields/characters with cut command for some very similar requests.
The right tool to use here is awk, if your CSV doesn't have quotes around fields.
awk -F , -v OFS=, '{print $1, $4, $4}'
If you don't want to use awk (why? what strange system has cut and sed but no awk?), you can use sed (still assuming that your CSV doesn't have quotes around fields). Match the first four comma-separated fields and select the ones you want in the order you want.
sed -e 's/^\([^,]*\),\([^,]*\),\([^,]*\),\([^,]*\)/\1,\4,\4/'
$ sed 's/,.*,/,/; s/\(,.*\)/\1\1,/' a.csv
ram,doc,doc,
shaym,eng,eng,
What this does:
Replace everything between the first and last comma with just a comma
Repeat the last ",something" part and tack on a comma. Voilà!
Assumptions made:
You want the first field, then twice the last field
No escaped commas within the first and last fields
Why do you need exactly this output? :-)
using perl:
perl -F, -ane 'chomp($F[3]);$a=$F[0].",".$F[3].",".$F[3];print $a."\n"' your_file
using sed:
sed 's/\([^,]*\),.*,\(.*\)/\1,\2,\2/g' your_file
As others have noted, cut doesn't support field repetition.
You can combine cut and sed, for example if the repeated element is at the end:
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/&&,/'
Output:
ram,doc,doc,
shaym,eng,eng,
Edit
To make the repetition variable, you could do something like this (assuming you have coreutils available):
n=10
rep=$(seq $n | sed 's:.*:\&:' | tr -d '\n')
< a.csv cut -d, -f1,4 | sed 's/,[^,]*$/'"$rep"',/'
Output:
ram,doc,doc,doc,doc,doc,doc,doc,doc,doc,doc,
shaym,eng,eng,eng,eng,eng,eng,eng,eng,eng,eng,
I had the same problem, but instead of spelling out all the columns in awk, I just used this (to duplicate the 2nd column):
awk -v OFS='\t' '$2=$2"\t"$2' # for tab-delimited files
For CSVs you can just use
awk -F , -v OFS=, '$2=$2","$2'

Bash script to extract entries from log file based on dates specified in another file?

I've got a pretty big comma-delimited CSV log file (>50000 rows, let's call it file1.csv) that looks something like this:
field1,field2,MM-DD-YY HH:MM:SS,field4,field5...
...
field1,field2,07-29-10 08:04:22.7,field4,field5...
field1,field2,07-29-10 08:04:24.7,field4,field5...
field1,field2,07-29-10 08:04:26.7,field4,field5...
field1,field2,07-29-10 08:04:28.7,field4,field5...
field1,field2,07-29-10 08:04:30.7,field4,field5...
...
As you can see, there is a field in the middle that is a time stamp.
I also have a file (let's call it file2.csv) that has a short list of times:
timestamp,YYYY,MM,DD,HH,MM,SS
20100729180031,2010,07,29,18,00,31
20100729180039,2010,07,29,18,00,39
20100729180048,2010,07,29,18,00,48
20100729180056,2010,07,29,18,00,56
20100729180106,2010,07,29,18,01,06
20100729180115,2010,07,29,18,01,15
What I would like to do is to extract only the lines in file1.csv that have times specified in file2.csv.
How do I do this with a bash script? Since file1.csv is quite large, efficiency would also be a concern. I've done very simple bash scripts before, but really don't know how to deal with this. Perhaps some implementation of awk? Or is there another way?
P.S. Complication 1: I manually spot checked some of the entries in both files to make sure they would match, and they do. There just needs to be a way to remove (or ignore) the extra ".7" at the end of the seconds ("SS") field in file1.csv.
P.P.S. Complication 2: It turns out the entries in file1.csv are all separated by about two seconds. Sometimes the time stamps in file2.csv fall right in between two of the entries in file1.csv! Is there a way to find the closest match in this case?
Building on John's answer, you could sort and join the files, printing just the columns you want (or all columns, if that's the case). Take a look below (note that I'm assuming a traditional UNIX such as Solaris, so nawk could be faster than awk, and gawk, which would make this even easier, may not be available):
# John's nice code
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
# Sorting times.list file to prepare for the join
sort times.list -o times.list
# Sorting file1.csv
sort -t, -k3,3 file1.csv -o file1.csv
# Finally joining files and printing the rows that match the times
join -t, -1 3 -2 1 -o 1.1 1.2 1.3 1.4 1.5......1.50 file1.csv times.list
One nice property of this method is that it can be adapted to several different situations, such as a different column order, or cases where the key columns are not concatenated. That would be very hard to do with grep (with or without regexps).
If you have GNU awk (gawk), you can use this technique.
In order to match the nearest times, one approach would be to have awk print two lines for each line in file2.csv, then use that with grep -f as in John Kugelman's answer. The second line will have one second added to it.
awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv
This illustrates a couple of different techniques.
NR>1 skips record number one, i.e. the header (matching on the header text would actually be more robust)
instead of dealing with each field individually, $1 is emptied and strftime() creates the output in the desired format
mktime() converts a string in the format "yyyy mm dd hh mm ss" (the -F, plus the assignment to $1 turn the commas into spaces when $0 is rebuilt) into seconds since the epoch, and we add 1 to it for the second line
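To also cover Complication 2, where a timestamp in file2.csv can fall between two entries of file1.csv, the same idea can be widened to a small window around each time, for example one second on either side (a sketch building on the command above, not tested against the real data):
awk -F, 'NR>1 {$1=""
    for (off = -1; off <= 1; off++)
        print strftime("%m-%d-%y %H:%M:%S", mktime($0) + off)}' file2.csv > times.list
grep -f times.list file1.csv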
One approach is to use awk to convert the timestamps in file2.csv to file1.csv's format, then use grep -f to search through file1.csv. This should be quite fast as it will only make one pass through file1.csv.
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
grep -f times.list file1.csv
You could combine this all into one line if you wish:
grep -f <(awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv) file1.csv
