Bash shell: how to cut the first column out of a file

so I have a file named 'file' that contains these characters
a 1 z
b 2 y
c 3 x
How can I cut the first column and put it in its own file?
I know how to do the rest using the space as a delimiter like this:
cut -f1 -d ' ' file > filecolumn1
but I'm not sure how to cut just the first column since there isn't any character in the front that I can use as a delimiter.

The delimiter doesn't have to be before the column, it's between the columns. So use the same delimiter, and specify field 1.
cut -f1 -d ' ' file > filecolumn1
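A quick way to check this is to recreate the three-line sample and run the same command. The sketch below assumes a scratch directory created with mktemp; the names workdir and first_column are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory so the example does not touch real files.
workdir=$(mktemp -d)
printf '%s\n' 'a 1 z' 'b 2 y' 'c 3 x' > "$workdir/file"

# Field 1 is everything before the first space on each line.
cut -d ' ' -f1 "$workdir/file" > "$workdir/filecolumn1"

first_column=$(cat "$workdir/filecolumn1")
printf '%s\n' "$first_column"
rm -rf "$workdir"
```

This should print a, b and c on separate lines.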

Barmar's got a good option. Another option is awk:
awk '{print $1}' file > output.txt
If your data uses a different delimiter, you can pass it with the -F switch. For example, if your data was like this:
a,1,2
b,2,3
c,3,4
you can use awk's -F switch in this manner:
awk -F',' '{print $1}' file > output.txt
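One practical difference from cut, sketched below on made-up input: with no -F, awk treats any run of spaces or tabs as a single separator, so uneven spacing does not shift the fields. The variable names spaced and comma are only for illustration.

```shell
#!/usr/bin/env bash
# Default splitting: any run of spaces or tabs separates fields.
spaced=$(printf 'a   1 z\nb 2   y\n' | awk '{print $2}')

# Explicit delimiter: -F',' splits on each comma.
comma=$(printf 'a,1,2\nb,2,3\nc,3,4\n' | awk -F',' '{print $1}')

printf '%s\n' "$spaced" "$comma"
```

The first command should print 1 and 2 despite the extra spaces; the second should print a, b and c.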

Related

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get, as a bash variable, the list of users in my CSV file. The problem is that the number of users varies and can be anywhere from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How can I remove all the "," characters from the end of my result? Sometimes there is just one, but sometimes the result can be user1,,,,
Can I do this in a better way? The users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
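The pipeline above collects users from every row. If only one record is wanted, as in the question, a row filter can be added in front. This is a minimal sketch, assuming the sample CSV from the question and assuming that grep on the quoted string "record2_data2" is an acceptable way to pick the row; the names csv, users and list_of_users are illustrative, and readarray needs bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical scratch copy of the sample CSV from the question.
csv=$(mktemp)
cat > "$csv" <<'EOF'
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
EOF

# Pick the row, keep fields 4 onward, strip quotes, one user per line.
readarray -t users < <(grep '"record2_data2"' "$csv" | cut -d, -f4- | tr -d '"' | tr ',' '\n')

# Join the array with commas; IFS is changed only inside the subshell.
list_of_users=$(IFS=,; printf '%s' "${users[*]}")
echo "$list_of_users"
rm -f "$csv"
```

With the sample data this should print user1,user2,user3,user4, and the individual names remain available as "${users[0]}", "${users[1]}" and so on.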
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The first expression substitutes every " with the empty string, i.e. deletes it. The second expression applies only to lines containing record2_data: it substitutes (s) everything up to and including the third , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
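The same substitution style can also answer the literal question of stripping trailing commas, if the original awk command is kept. A small sketch, using the two example results quoted in the question as input; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# ,*$ matches zero or more commas anchored at the end of the line.
trimmed_many=$(printf '%s\n' 'user1,,,,' | sed 's/,*$//')
trimmed_two=$(printf '%s\n' 'user2,user3,user4,,' | sed 's/,*$//')

printf '%s\n' "$trimmed_many" "$trimmed_two"
```

This should print user1 and then user2,user3,user4. Commas between user names are left alone because the pattern is anchored to the end of the line.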
awk -F',' '
/record2_data2/{
for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
gsub(/"|,$/,"",o);
print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any double quoted fields.

Comparing 2 files with a for loop in bash

I am trying to compare the values in 2 files. For each row in Summits3.txt I want to define the value in Column 1 as "Chr" and then find the rows in generef.txt which have my value for "Chr" in column 2.
Then I would like to output some info about that row from generef.txt to out.txt and then repeat until the end.
I am using the following script:
#!/bin/bash
IFS=$'\n'
for i in $(cat Summits3.txt)
do
Chr=$(echo "$i" | awk '{print $1}')
awk -v var="$Chr" '{
if ($2==""'${Chr}'"")
print $2, $3
}' generef.txt > out.txt
done
It "works", but it's only comparing values from the last line of Summits3.txt. It seems like it is not looping through the awk bit.
Anyway please help if you can!
I think you might be looking for something like this:
awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' Summits3.txt generef.txt > out.txt
Basically, you read column one from the first file into an array (the array index is your Chr and the value is empty); then, for the second file, you print only the rows whose second column is in the index set of the array. FNR is the row number within the file currently being processed, and NR is the total number of rows processed so far, so FNR == NR holds only while the first file is being read. This is a general look-up command I use for pulling out genes or variants from one file that are present in the other.
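To see the two-file idiom at work, here is a small sketch on invented data. The question does not show the real files, so the column layout below (chromosome in column 1 of Summits3.txt and in column 2 of generef.txt) and every value are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sample files; the real layouts are not shown in the question.
workdir=$(mktemp -d)
printf '%s\n' 'chr1 100' 'chr3 300' > "$workdir/Summits3.txt"
printf '%s\n' 'geneA chr1 500' 'geneB chr2 600' 'geneC chr3 700' 'geneD chr10 800' \
  > "$workdir/generef.txt"

# Pass 1 (FNR == NR): remember column 1 of Summits3.txt as array keys.
# Pass 2: print columns 2 and 3 of generef.txt rows whose column 2 is a key.
matches=$(awk 'FNR == NR {a[$1]; next} $2 in a {print $2, $3}' \
  "$workdir/Summits3.txt" "$workdir/generef.txt")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

With this data the output should be chr1 500 and chr3 700. The chr10 row is not printed, because the array lookup compares whole fields.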
In your code above it should be appending to out.txt: >> out.txt. But you have to make sure to re-set out.txt at each run.
Besides running external programs inside a loop (which is expensive), the first thing we see is that you redirect your output to a file from inside the loop. The output file is recreated on each iteration, so please change the redirection to append (>>) or, better, move it outside the loop.
When you want to use a loop, try this
while read -r Chr other; do
cut -d" " -f2,3 generef.txt | grep -E "^${Chr} "
done < Summits3.txt > out.txt
When you want to avoid the loop (needed for large input files), awk or a combination of commands can be used.
A first attempt with grep -f can fail, because the patterns may match only part of a field:
grep -f <(cut -d" " -f1 Summits3.txt) <(cut -d" " -f2,3 generef.txt)
You only want matches on the complete Chr field, so anchor each pattern at the start of the line and end it with a space (I assume that is the field separator):
grep -f <(cut -d" " -f1 Summits3.txt| sed 's/.*/^& /') <(cut -d" " -f2,3 generef.txt)
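The difference between the two grep commands shows up as soon as one chromosome name is a prefix of another. The sketch below uses invented data (the real files are not shown in the question) and assumes space-separated columns; loose and strict are illustrative names.

```shell
#!/usr/bin/env bash
# Hypothetical sample files with chr1 and chr10 to expose partial matches.
workdir=$(mktemp -d)
printf '%s\n' 'chr1 100' 'chr3 300' > "$workdir/Summits3.txt"
printf '%s\n' 'geneA chr1 500' 'geneB chr2 600' 'geneC chr3 700' 'geneD chr10 800' \
  > "$workdir/generef.txt"

# Unanchored patterns: "chr1" also matches the line starting with "chr10".
loose=$(grep -f <(cut -d" " -f1 "$workdir/Summits3.txt") \
  <(cut -d" " -f2,3 "$workdir/generef.txt"))

# Anchored patterns ("^chr1 "): only whole-field matches survive.
strict=$(grep -f <(cut -d" " -f1 "$workdir/Summits3.txt" | sed 's/.*/^& /') \
  <(cut -d" " -f2,3 "$workdir/generef.txt"))

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$workdir"
```

Here the loose version should return three lines, including chr10 800, while the strict version returns only chr1 500 and chr3 700.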

How to print the csv file excluding first column till end using awk

I have a csv file with dynamic columns.
I've tried to use awk -F , 'NF>1' resul1.txt but it still prints all columns.
Since the columns are dynamic, it's quite difficult to print them with print $1 through to the end.
Try this awk command:
awk -F, '{$1=""}1' input.txt | awk -vOFS=, '{$1=$1}1' > output.txt
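The first awk empties the 1st field and prints the line again, now joined with spaces.
The second awk re-splits on whitespace, which drops the empty field, and rejoins the fields with commas. Note that this breaks if any field contains a space.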
Try the substr function:
substr(string, start [, length ])
Return a length-character-long substring of string, starting at character number start. The first character of a string is character number one. For example, substr("washington", 5, 3) returns "ing".
awk -F, '{print substr($0,length($1)+1+length(FS))}' file
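To make the offset arithmetic concrete: the start position is the length of the first field, plus the length of the separator, plus one. A short sketch on made-up rows with different column counts (the variable name rest is illustrative):

```shell
#!/usr/bin/env bash
# Row 1: length($1)=3, so the substring starts at character 5.
# Row 2: length($1)=4, so the substring starts at character 6.
rest=$(printf '%s\n' 'id1,a,b,c' 'id22,x' |
  awk -F, '{print substr($0, length($1) + length(FS) + 1)}')

printf '%s\n' "$rest"
```

This should print a,b,c and then x, regardless of how many columns each row has.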
You can use cut:
cut -d',' -f2- yourfile.csv > output.csv
Explanation:
-d - setting delimiter to ,
-f - fields to print
2- - from field 2 to the end of the line
With awk:
awk -F, '{sub(/[^,]+,/,"",$0);}1' OFS=, yourfile.csv > output.csv
With sed:
sed -i.bak 's/^[^,]\+,//g' yourfile.csv
-i.bak - edit the file in place, keeping a backup copy as yourfile.csv.bak
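One difference between the cut and awk commands in this answer appears when the first field is empty. The sketch below uses a single made-up row and a simplified form of the awk command (same sub() call, without the OFS assignment); with_cut and with_awk are illustrative names.

```shell
#!/usr/bin/env bash
# A row whose first field is empty.
sample=',b,c'

# cut counts the empty first field and keeps everything from field 2 on.
with_cut=$(printf '%s\n' "$sample" | cut -d',' -f2-)

# [^,]+ needs at least one non-comma character, so the regex skips the
# leading comma and removes "b," instead.
with_awk=$(printf '%s\n' "$sample" | awk '{sub(/[^,]+,/, ""); print}')

printf 'cut: %s\nawk: %s\n' "$with_cut" "$with_awk"
```

On this input cut should give b,c while the awk substitution gives ,c, so cut is the safer choice if the first column can be blank.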

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
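Since the question has no sample data, here is a sketch of both commands on an invented tab-separated file. The header names and values are assumptions made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical tab-separated input: one header row and three data rows.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' name size score alpha 10 8 beta 20 5 gamma 30 8 \
  > "$workdir/inputfile"

# First column of the rows whose third column equals 8.
matches=$(awk -F'\t' '$3==8 {print $1}' "$workdir/inputfile")

# Same filter, but the header line is printed in full first.
with_header=$(awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' "$workdir/inputfile")

printf '%s\n---\n%s\n' "$matches" "$with_header"
rm -rf "$workdir"
```

The first result should be alpha and gamma; the second should be the same two names preceded by the complete header line.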
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done < x
The input is read into the variable y, then split into an array. IFS (the internal field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third field of the array is then compared to '8'. If they are equal, the first field of the array is printed. Remember that array indexes start counting at zero.
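Because the default IFS also splits on spaces, a first column such as "new york" would be cut in two by the loop above. A variant that splits on tabs only is sketched below; the file contents and the names col1, col2 and col3 are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical tab-separated file whose first column may contain spaces.
workdir=$(mktemp -d)
printf '%s\t%s\t%s\n' 'new york' 1 8 paris 2 3 rio 3 8 > "$workdir/file1"

# IFS is set to a tab for the read command only, so spaces stay inside fields.
first_fields=$(
  while IFS=$'\t' read -r col1 col2 col3; do
    if [ "$col3" = "8" ]; then
      printf '%s\n' "$col1"
    fi
  done < "$workdir/file1"
)

printf '%s\n' "$first_fields"
rm -rf "$workdir"
```

This should print new york and rio. One caveat: tab counts as whitespace for IFS, so consecutive tabs are treated as a single separator and empty fields are not preserved; awk -F'\t' does not have that limitation.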

split a file into segments?

I have a file containing text data separated by semicolons (";"). I want to separate the data, in other words split it wherever ; occurs, and write the result to an output file. Is there any way to do this with a bash script?
You most likely want awk with the FS (field separator variable) set to ';'.
Awk is the tool of choice for column-based data (some prefer Perl, but not me).
echo '1;2;3;4;5
6;7;8;9;10' | awk -F\; '{print $3" "$5}'
outputs:
3 5
8 10
If you just want to turn semicolons into newlines:
echo '1;2;3;4;5
6;7;8;9;10' | sed 's/;/\n/g'
outputs the numbers 1 through 10 on separate lines.
Obviously those commands are just using my test data. If you want to use them on your own file, use something like:
sed 's/;/\n/g' <input_file >output_file
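#!/bin/bash
# Reads standard input and prints each ';'-separated item on its own line.
# The '|| [ -n "$ITEM" ]' part keeps a final item that has no trailing ';'.
while read -r -d ';' ITEM || [ -n "$ITEM" ]; do
echo "$ITEM"
done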
Try:
cat original_file.txt | cut -d";" -f1 > new_file.txt
This will split each line in fields delimited by ";" and select the first field (-f1).
You can access other fields with -f1, -f2, ... or multiple fields with -f1-2, -f2-.
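The field-list forms mentioned above can be compared side by side on a single made-up line. The variable names are only for illustration.

```shell
#!/usr/bin/env bash
line='1;2;3;4;5'

first=$(printf '%s\n' "$line" | cut -d';' -f1)     # a single field
pair=$(printf '%s\n' "$line" | cut -d';' -f1-2)    # a closed range
rest=$(printf '%s\n' "$line" | cut -d';' -f2-)     # field 2 to the end
picked=$(printf '%s\n' "$line" | cut -d';' -f1,3)  # a comma-separated list

printf '%s\n' "$first" "$pair" "$rest" "$picked"
```

This should print 1, then 1;2, then 2;3;4;5, then 1;3. Selected fields keep the input delimiter between them.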
You can translate a character to another character by the 'tr' command.
cat input.txt | tr ';' '\n' > output.txt
Here \n is a newline; if you want a tab instead, replace it with \t.
