pipe result of cut to next argument and concat to string - bash

I have something like:
cut -d ' ' -f2 | xargs cat *VAR_HERE.ss
where I want to use the result of cut as a variable, concatenated between the * and the ., so that cat will use that name to output the appropriate file. For example, if the result of cut is:
$ cut -d ' ' -f2
001
I essentially want the second command to do:
$ cat *001.txt
I have seen xargs being used, but I am struggling to find out how to reference the output of cut explicitly, rather than assuming the second command takes the output exactly as it is. Obviously, here I want to concatenate the output of cut into a string.
thanks

You can do:
cut -d ' ' -f2 | xargs -I {} bash -c 'cat *{}.txt'
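The bash -c wrapper is what makes the glob work: xargs itself does no filename expansion, so without the inner shell, cat would receive the literal string *001.txt. If the values coming out of cut could ever contain shell metacharacters, a slightly safer variant (same idea, just a sketch) passes the value as a positional parameter instead of splicing it into the command string:
cut -d ' ' -f2 | xargs -I {} bash -c 'cat *"$1".txt' _ {}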

Related

GNU parallel with custom script doing string comparison

The following script.sh compares part of a string (coming from stdin by cat-ing a CSV file) to a defined string and reports the differences in a certain format:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
while read line; do
    line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
    output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
    echo "$(echo ${line:0:35}, $output)"
done < "${1:-/dev/stdin}"
It is intended to be executed on a number of rows from a very large file in the format
XYZ,ABMDEFG
and it works well when I use it in a pipe:
cat large_file | ./find_something.sh
However, when I try to use it with parallel, I get this error:
$ cat large_file | parallel ./find_something.sh
./find_something.sh: line 9: XYZ, ABMDEFG : No such file or directory
What is causing this? Is parallel supposed to work for something like this, if I want to redirect the output to a single file afterwards?
Less important side note: I'm rather proud of my string comparison method, but if someone has a faster way to get from comparing ABCDEFG and XYZ,ABMDEFG to obtain XYZ,C3M I'd be happy to hear that, too.
Edit:
I should have said, I also want to preserve the order of each line in the output, corresponding to the input. Is that possible using parallel?
Your script accepts its input from a file (defaulting to stdin), whereas parallel will pass input as arguments, not via stdin. In that sense, parallel is closer to xargs.
Presumably, you want each of the lines in large_file to be processed as a unit, possibly in parallel.
That means you need your script to only process one such line at a time, and let parallel call your script many times, once for each line.
So your script should look like this:
#!/usr/bin/env bash
reference="ABCDEFG"
ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
line="$1"
line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
echo "$(echo ${line:0:35}, $output)"
Then you can redirect to a file as follows:
cat large_file | parallel ./find_something.sh > output_file
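If you also want to preserve the order of the input lines, as mentioned in your edit, add parallel's -k (--keep-order) flag:
cat large_file | parallel -k ./find_something.sh > output_file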
-k keeps the order. An alternative is to keep your while loop, wrap it in a function, and let parallel feed it blocks of input:
#!/usr/bin/env bash
doit() {
    reference="ABCDEFG"
    ref_transp=$(echo "$reference" | sed -e 's/\(.\)/\1\n/g')
    while read line; do
        line_transp=$(echo "$line" | cut -d',' -f2 | sed -e 's/\(.\)/\1\n/g')
        output=$(paste -d ' ' <(echo "$ref_transp") <(echo "$line_transp") | grep -vnP '([A-Z]) \1' | sed -E 's/([0-9][0-9]*):([A-Z]) ([A-Z]*)/\2\1\3/' | grep '^[A-Z][0-9][0-9]*[A-Z*]$')
        echo "$(echo ${line:0:35}, $output)"
    done
}
export -f doit
cat large_file | parallel --pipe -k doit
# or
parallel --pipepart -a large_file --block -10 -k doit
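--pipe has a single process read stdin and hand blocks of lines to the jobs, whereas --pipepart lets each job read its own chunk directly from the file given with -a, which is usually much faster on large files; in both cases -k keeps the output in input order.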

Get second part of output separated by two spaces

I have this script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 32
It outputs this:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2-2.txt
ffc752244b634abb4ed68d280dc74ec3152c4826 ./dups/subdups/dup2.txt
Now I only want to save the last part (the path) in an array.
When I add this after the sort
| awk -F " " '{ print $1 }'
I get this as output:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change the $1 to $2, I get nothing, but I want to get the path of the file.
How should I do this?
EDIT:
This script
#!/bin/bash
path=$1
find "$path" -type f -exec sha1sum {} \; | awk '{ print $1 }' | sort | uniq -D -w 32
Outputs this
parallels@mbp:~/bin$ duper ./dups
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16
ffc752244b634abb4ed68d280dc74ec3152c4826
ffc752244b634abb4ed68d280dc74ec3152c4826
When I change it to $2 it outputs this
parallels@mbp:~/bin$ duper ./dups
parallels@mbp:~/bin$
Expected Output
./dups/dup1-1.txt
./dups/dup1.txt
./dups/subdups/dup2-2.txt
./dups/subdups/dup2.txt
There are some files in the directory that are not duplicates of each other, such as nodup1.txt and nodup2.txt. That's why they don't show up.
Change your find command to this:
find "$path" -type f -exec sha1sum {} \; | uniq -D -w 41 | awk '{print $2}' | sort
I moved the uniq as the first filter and it is taking into consideration just the first 41 characters, aiming to match just the sha1sum hash.
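If you then want to collect the paths into a bash array, as the question mentions, here is a sketch; a while/read loop copes with paths containing spaces, which awk '{print $2}' would truncate:
#!/bin/bash
path=$1
dup_paths=()
# each input line looks like "<sha1-hash>  <path>"; keep only the path
while read -r sum file; do
    dup_paths+=("$file")
done < <(find "$path" -type f -exec sha1sum {} \; | sort | uniq -D -w 41)
printf '%s\n' "${dup_paths[@]}"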
You can achieve the same result by piping to tr and then cut:
echo '3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt' |\
tr -s ' ' | cut -d ' ' -f 2
Outputs:
./dups/dup1-1.txt
-s ' ' on tr is to squeeze spaces
-d ' ' -f 2 on cut is to output the second field delimited by spaces
I like to use cut for stuff like this. With this input:
3c8b9f4b983afa9f644d26e2b34fa3e03a2bef16 ./dups/dup1-1.txt
I'd do cut -d ' ' -f 2 which should return:
./dups/dup1-1.txt
I haven't tested it for your case, though.
EDIT: Gonzalo Matheu's answer is better, as it makes sure to remove any extra spaces between the fields before doing the cut.

How to calculate the total size of all files from ls -l by using cut, eval, head, tr, tail, ls and echo

ls -l | tr -s " " | cut -d " " -f5
I tried above code and got following output.
158416
757249
574994
144436
520739
210444
398630
1219080
256965
684782
393445
157957
273642
178980
339245
How do I add these numbers? I'm stuck here. Please, no use of awk, Perl, shell scripting, etc.
It's easiest to use du. Something like:
du -h -a -c | tail -n1
will give you the sum total. You can also use the -d argument to specify how deep the traversal should go, like:
du -d 1 -h -a -c | tail -n1
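Note that du reports disk usage (allocated blocks) rather than the byte sizes shown by ls -l, so the totals can differ. With GNU du you can ask for apparent sizes instead, for example:
du -a -b -c | tail -n1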
You will have to clarify what you mean by "no use of shell scripting" for anyone to come up with a more meaningful answer.
You can try it this way, but note that $((...)) is shell scripting:
eval echo $(( $(ls -l | tr -s ' ' | cut -d ' ' -f5 | tr '\n' '+') 0 ))
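To see what that builds: the tr turns the newline-separated sizes into one long addition, and the trailing 0 closes off the dangling +. With made-up numbers:
printf '100\n200\n300\n' | tr '\n' '+'
# prints: 100+200+300+
echo $(( 100+200+300+ 0 ))
# prints: 600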
Don't parse the output of ls. It's not meant for parsing. Use du as in Martin Gergov's answer, or du and find, or just du.
But if adding the numbers is the sole focus (even if the input is iffy), here's the laziest possible method (first install num-utils):
ls -l | tr -s " " | cut -d " " -f5 | numsum
And there are other ways; see also: How can I quickly sum all numbers in a file?

shell cut command to remove characters

The grep and cut in the current code below output 51315123&category_id, and I need to remove &category_id. Can cut be used to do that?
... | tr '?' '\n' | grep "^recipient_dvd_id=" | cut -d '=' -f 2 >> dvdIDs.txt
Yes, I would think so
... | cut -d '&' -f 1
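For example, with the value from your question:
echo '51315123&category_id' | cut -d '&' -f 1
# prints: 51315123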
If you're open to using AWK:
... | tr '?' '\n' | awk -F'[=&]' '/^recipient_dvd_id=/ {print $2}' >> dvdIDs.txt
AWK handles the regex and the field splitting, in this case using both '=' and '&' as field separators and printing the second column. Otherwise, you would need a second cut statement.
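For instance, with a made-up input line in the same shape (the URL here is only illustrative):
echo 'http://example.com/rent?recipient_dvd_id=51315123&category_id=7' | tr '?' '\n' | awk -F'[=&]' '/^recipient_dvd_id=/ {print $2}'
# prints: 51315123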

sort | uniq | xargs grep ... where lines contain spaces

I have a comma-delimited file, "myfile.csv", where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).
I'm using a bash shell via Cygwin on Windows XP.
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
sort -k5,5 will do the sort on fields and avoid the cut;
uniq -f 4 will ignore the first 4 fields for the uniq;
Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
but uniq (and sort -k) expect whitespace-delimited fields rather than CSV, so use tr ',' '\t' to fix that.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq.
So:
tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
to get zero-separated rows. Then sort, uniq, and xargs have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
Edit: the position of tr in the pipe was wrong.
You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5,A," "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
Set the field separator to ',' (CSV).
Split the fifth field on the space and stick the result in A.
Concatenate the line number to the list of what we have already stored for that date.
Print out the line numbers for each date.
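To run it, save the program above to a file (the name dupdates.awk is arbitrary) and point awk at your CSV; each output line is a date followed by the line numbers of the rows carrying it:
awk -f dupdates.awk myfile.csv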
Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
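A sketch of that last idea, using mapfile to fill the array rather than juggling IFS (which sidesteps the unquoted-expansion globbing pitfall):
#!/bin/bash
# collect the duplicated timestamps, then grep for each one as a fixed string
mapfile -t dups < <(cut -d, -f5 myfile.csv | sort | uniq -d)
for stamp in "${dups[@]}"; do
    grep -F -- "$stamp" myfile.csv
done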
