Getting last X fields from a specific line in a CSV file using bash

I'm trying to get, as a bash variable, the list of users that are in my CSV file. The problem is that the number of users is random and can be from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users=$(cat file.csv | grep "record2_data2" | <something>)
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user1,user2,user3,user4,
Question:
How do I remove all the trailing "," from my result? Sometimes there is just one, but sometimes the result can be user1,,,,
Can I do it in a better way? The users always start after the 3rd column in my file.

This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X fields from a CSV file using bash, so I don't know if it's what you really want or not.

Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
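For example, with the sample file from the question, the pipeline alone produces the following (the IFS join then turns this back into a CSV string):
$ cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u
user1
user2
user3
user4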

GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing; the expressions work as follows: the 1st globally substitutes " with the empty string, i.e. deletes them; the 2nd, for the line containing record2_data, substitutes (s) everything up to and including the 3rd , with the empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
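To sketch the intermediate step on the matching line: after the first expression strips the quotes, the pattern space holds
record2_data1,record2_data2,record2_data3,user1,user2,user3,user4
and the second expression then deletes everything up to and including the 3rd comma before printing, leaving the output shown above.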

awk -F',' '
/record2_data2/{
  for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i)   # append fields 4..NF, each followed by a comma
  gsub(/"|,$/,"",o)                            # strip the quotes and the trailing comma
  print o
}' file.csv
user1,user2,user3,user4

This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove any remaining double-quoted fields (the empty pattern in s///g reuses the previous regex).
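With the sample file from the question:
$ sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
user1,user2,user3,user4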

Related

How to use the cut command's -f flag in reverse

This is a text file called a.txt
ok.google.com
abc.google.com
I want to select every subdomain separately
cat a.txt | cut -d "." -f1 (it select ok From left side)
cat a.txt | cut -d "." -f2 (it select google from left side)
Is there any way, so I can get result from right side
cat a.txt | cut (so it can select com From right side)
There could be a few ways to do this; the one I can think of right now is a rev + cut + rev solution. It reverses the input with the rev command, then sets the field separator to . and prints the fields as counted from the left (which are actually counted from the right, because of rev), then passes the output to rev again to restore the original order.
rev Input_file | cut -d'.' -f 1 | rev
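With the sample a.txt this prints the rightmost component of each line:
$ rev a.txt | cut -d'.' -f1 | rev
com
com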
You can use awk to print the last field:
awk -F. '{print $NF}' a.txt
-F. sets the field separator to "."
$NF is the last field
And you can give your file directly as an argument, so you can avoid the famous "Useless use of cat"
For other fields, counting from the last, you can use expressions as suggested in the comment by @sundeep or as described in the user's guide under
4.3 Nonconstant Field Numbers. For example, to get the domain before the TLD, you can subtract 1 from the Number of Fields NF:
awk -F. '{ print $(NF-1) }' a.txt
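With the sample a.txt above this prints:
$ awk -F. '{ print $(NF-1) }' a.txt
google
google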
You might use sed with a quantifier for the grouped value repeated till the end of the string.
( Start group
\.[^[:space:].]+ Match 1 dot and 1+ occurrences of any char except a space or dot
){1} Close the group followed by a quantifier
$ End of string
Example
sed -E 's/(\.[^[:space:].]+){1}$//' file
Output
ok.google
abc.google
If the quantifier is {2} the output will be
ok
abc
Depending on what you want to do after getting the values then you could use bash for splitting your domain into an array of its components:
#!/bin/bash
IFS=. read -ra comps <<< "ok.google.com"
echo "${comps[-2]}"
# or for bash < 4.2
echo "${comps[${#comps[#]}-2]}"
google
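A minimal sketch, assuming bash 4.2 or later (matching the note above) and the a.txt from the question, applying the same splitting to every line of the file:
#!/bin/bash
while IFS=. read -ra comps; do
  echo "${comps[-1]}"   # the rightmost component, e.g. com
done < a.txt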

How to place comma-separated values onto new lines and remove everything before and including the colon

I have the below command output on a Linux system, which fetches all the account names, comma separated. I want the commas replaced so that each individual account name is placed on its own line.
$ getent group pi_infra
pi_infra:*:5899:pxf59093,pxv07744,pxa02374,pxa07513,pxa08599,pxa11102,pxa30995,pxa34158,pxf07822,pxf29346,pxf30902,pxf31604,pxf31606,pxf31953,pxf34985,pxf41740,pxf41778,pxf43236,pxf43917,pxf45518,pxf46461,pxf49051,pxf58440,pxf58523,pxf58621,pxf60794,pxf60938,pxf61299,pxf63061,pxp08000,pxp25916,pxp42841,pxp68003,pxp69833,pxp87972
$ cat pi_in| sed 's/,/\n/g'
$ cat pi_in| tr ',' '\n'
Result from the above:
pi_infra:*:5899:pxf59093
pxv07744
pxa02374
pxa07513
pxa08599
pxa11102
pxa30995
pxa34158
pxf07822
pxf29346
pxf30902
pxf31604
pxf31606
pxf31953
pxf34985
pxf41740
pxf41778
pxf43236
pxf43917
pxf45518
pxf46461
pxf49051
pxf58440
pxf58523
pxf58621
pxf60794
pxf60938
pxf61299
pxf63061
pxp08000
pxp25916
pxp42841
pxp68003
pxp69833
pxp87972
As I want to remove all the stuff before the : and only want the IDs printed, I've chosen to use the below:
$ cat pi_in| cut -d":" -f4 | tr ',' '\n'
pxf59093
pxv07744
pxa02374
pxa07513
pxa08599
pxa11102
pxa30995
pxa34158
pxf07822
pxf29346
pxf30902
pxf31604
pxf31606
pxf31953
pxf34985
pxf41740
pxf41778
pxf43236
pxf43917
pxf45518
pxf46461
pxf49051
pxf58440
pxf58523
pxf58621
pxf60794
pxf60938
pxf61299
pxf63061
pxp08000
pxp25916
pxp42841
pxp68003
pxp69833
pxp87972
The above works fine, but I'm looking to integrate this all into one command rather than using cut and tr as two distinct steps.
Preferably awk or sed would be appropriate.
Thanks.
In awk, could you please try the following:
awk -F':' '{gsub(",",ORS,$4);print $4}' Input_file
It splits each line on :, replaces every , in the 4th field with the output record separator (a newline by default), and prints that field.
2nd solution:
awk '{sub(/.*:/,"");gsub(/,/,ORS)} 1' Input_file
$ sed 's/.*://; y/,/\n/' file
pxf59093
pxv07744
pxa02374
...
s/.*:// removes everything preceding the last colon, and the colon itself, and y/,/\n/ does what tr does in your approach.
This might work for you (GNU sed):
sed 'y/,/\n/;/:/!P;D' file
Translate ,'s to newlines and don't print any line with a : in it.
N.B. The solution by @oguz ismail is more efficient and faster (with regard to a sed solution).

Bash: check for words in the first file not contained in the second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space separated substring on a newline.
tr -d '[:punct:]' to remove punctuation
sort and uniq to make a sorted file to use with comm, which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions; however, I wasn't able to figure out what I was doing wrong. Most answers to those questions work with two files that are already sorted and stripped of newlines, spaces, and punctuation, while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in $(cat file1 | tr -d '[:punct:]'); do grep -wqi "$A" file2 || echo "$A"; done
flags used for grep: q for quiet (we don't need the output), w for word match, i for case insensitive
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
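If each missing word should be printed only once (an assumption; the question doesn't specify), a small variation could track already-printed words in a second array:
awk -F"[^A-Za-z]+" '
NR==FNR { a[tolower($0)]; next }             # load the word list
{
  for(i=1;i<=NF;i++) {                       # loop all fields
    w = tolower($i)
    if (w != "" && !(w in a) && !(w in seen)) { seen[w]; print $i }
  }
}' another_file txt_file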
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
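With the two sample files from the question this prints the desired words:
is
a
file
several
lines
of
(this and containing are filtered out, and -w still excludes text. despite its trailing period, since the period is not a word character.)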

Remove space between 2 columns and insert commas

I am using:
cut -f1-2 input.txt|sed 1d
The data is coming out like this:
/mnt/Hector/Data/benign/binary/benign-pete/ fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/ fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/ fe7d03512e0731e40be628524efbf317
I am trying to get it to output without the space, with a comma inserted between the file path and the MD5 checksum, so Excel can separate the columns properly:
/mnt/Hector/Data/benign/binary/benign-pete/,fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/,fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/,fe7d03512e0731e40be628524efbf317
I didn't see your input.txt, but try this line, which does the job in one shot:
awk -v OFS="," 'NR>1{print $1,$2}' input.txt
This can do it:
$ tr -s " " < your_file | sed 's/ /,/g'
/mnt/Hector/Data/benign/binary/benign-pete/,fd0977d5855d1295bd57383b17981a09
/mnt/Hector/Data/benign/binary/benign-pete/,fd34c32786aadab513f506c30c2cba33
/mnt/Hector/Data/benign/binary/benign-pete/,fe7d03512e0731e40be628524efbf317
tr -s " " < your_file removes extra spaces. sed 's/ /,/g' replaces spaces with commas.

Split a file into segments?

I have a file containing text data separated by semicolons (";"). I want to separate the data, in other words split it where ; occurs, and write the result to an output file. Is there any way to do this with a bash script?
You most likely want awk with the FS (field separator variable) set to ';'.
Awk is the tool of choice for column-based data (some prefer Perl, but not me).
echo '1;2;3;4;5
6;7;8;9;10' | awk -F\; '{print $3" "$5}'
outputs:
3 5
8 10
If you just want to turn semicolons into newlines:
echo '1;2;3;4;5
6;7;8;9;10' | sed 's/;/\n/g'
outputs the numbers 1 through 10 on separate lines.
Obviously those commands are just using my test data. If you want to use them on your own file, use something like:
sed 's/;/\n/g' <input_file >output_file
#!/bin/bash
# read ';'-delimited items from standard input, one per iteration
while read -d ';' ITEM; do
  echo "$ITEM"
done
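A sketch of usage, assuming the loop is saved as split.sh (a hypothetical name); note that read returns non-zero at EOF, so text after the last ; is dropped unless the input ends with a delimiter:
$ printf '1;2;3;' | ./split.sh
1
2
3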
Try:
cat original_file.txt | cut -d";" -f1 > new_file.txt
This will split each line into fields delimited by ";" and select the first field (-f1).
You can access other fields with -f1, -f2, ... or multiple fields with -f1-2, -f2-.
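For example, to keep everything from the second field onward (the field numbers here are just an illustration):
cut -d";" -f2- original_file.txt > new_file.txt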
You can translate one character to another with the tr command.
cat input.txt | tr ';' '\n' > output.txt
where \n is a newline; if you want a tab instead, replace it with \t.
