Can Bash I/O redirection work within a Ruby script?

In Bash, I want to compare the fields of 2 different CSVs (field 2 of file1 and field 3 of file2):
diff <(cut -d, -f2 file1) <(cut -d, -f3 file2)
I tried to implement this more generally in Ruby:
def identical_files?(file1, field1, file2, field2)
  %x{diff <(cut -d, -f#{field1} #{file1}) <(cut -d, -f#{field2} #{file2})}.blank?
end
Printing the output of the %x{} block, I see sh: Syntax error: "(" unexpected. Does I/O redirection not work when running shell commands within Ruby? Is this because it's only supported by bash but not sh?

It doesn’t work because, as the error you’re getting indicates, Ruby shells out to sh, not Bash. And, of course, sh does not support that syntax.
You can instead call Bash explicitly:
`bash -c 'cat <(echo foo)'` #=> "foo"

Is this because it's only supported by bash but not sh?
Yes.
Process substitution is not supported by sh, even when sh is actually bash invoked in its sh-compatibility mode.
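The difference is easy to demonstrate from a terminal (a minimal sketch; it assumes /bin/sh is a strict POSIX shell such as dash, and the exact error text varies by system):
sh -c 'cat <(echo foo)'    # e.g.: sh: Syntax error: "(" unexpected
bash -c 'cat <(echo foo)'  # foo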

Don't try to use something as simple as cut to process fields in a CSV file. CSV files can have embedded commas inside fields, which will fool cut, causing your code to do the wrong thing.
Instead, use something designed specifically to process CSV files, such as Ruby's CSV class. Something like this untested code will get you started:
require 'csv'

csv_file1 = CSV.open('file1')
csv_file2 = CSV.open('file2')

until (csv_file1.eof? || csv_file2.eof?) do
  row1 = csv_file1.shift
  row2 = csv_file2.shift

  # do something to diff the fields
  puts "#{ csv_file1.lineno }: #{ row1[1] } == #{ row2[2] } --> #{ row1[1] == row2[2] }"
end

[
  [csv_file1, 'file1'],
  [csv_file2, 'file2']
].each do |f, fn|
  puts "Hit EOF for #{ fn }" if f.eof?
  f.close
end

Related

Test if a value is in a csv file in bash

I have a 3-4M line CSV file (my_csv.csv) with two columns, like this:
col1,col2
val11,val12
val21,val22
val31,val32
...
The CSV contains only two columns, with one comma per line. The col1 and col2 values are only strings (nothing else). The output shown above is the result of the command head my_csv.csv.
I would like to check whether a string test_str is among the col2 values. What I mean is: if test_str = val12, I would like the test to return True, because val12 is located in column 2 (as shown in the example).
But if test_str = val1244, I want the code to return False.
In Python it would be something like:
import pandas as pd

df = pd.read_csv('my_csv.csv')
test_str = 'val42'
if test_str in df['col2'].to_list():
    # Expected to return True
    # Do the job
    pass
But I have no clue how to do it in bash.
(I know that df['col2'].to_list() is not a good idea, but I didn't want to use a built-in pandas function, so that the code is easier to understand.)
awk is the most suitable of the standard shell utilities for handling CSV data:
awk -F, -v val='val22' '$2 == val {print "found a match:", $0}' file
found a match: val21,val22
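If you need a true/false result for an if statement rather than printed output, awk's exit status can carry it. Here is a sketch (it exits 0 exactly when a match is found, and the early exit stops it from scanning the rest of a 3-4M line file):
if awk -F, -v val="$test_str" '$2 == val { found=1; exit } END { exit !found }' my_csv.csv; then
    echo "found"    # Do the job
fi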
An equivalent bash loop would be like this:
while IFS=',' read -ra arr; do
    if [[ ${arr[1]} == 'val22' ]]; then
        echo "found a match: ${arr[@]}"
    fi
done < file
But do keep in mind that a bash while read loop is extremely slow compared to dedicated text tools; see: Bash while read loop extremely slow compared to cat, why?
Parsing CSV is difficult... unless your fields contain no commas, newlines, and so on. Also, don't do this in pure bash; on a large file it will be extremely slow. Do it with utilities like awk or grep, which are equally available from dash, zsh, or another shell. So, if you have a very simple CSV format, you can use, e.g., grep:
if grep -q ',val42$' my_csv.csv; then
    <do that>
fi
We can also put the string to search for in a variable, but remember that some characters have a special meaning in regular expressions and must be escaped. For example, if there are no special characters in the string to search for:
test_str="val42"
if grep -q ",$test_str$" my_csv.csv; then
    <do that>
fi
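If the string may contain regex metacharacters, one way to stay safe is a common sed idiom that makes every character literal, by wrapping each one in a bracket expression and backslash-escaping ^ (a sketch; esc is a helper variable introduced here):
esc=$(printf '%s\n' "$test_str" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
if grep -q ",${esc}\$" my_csv.csv; then
    <do that>
fi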
3-4M rows is a small file for awk. You might as well just do:
{m,g}awk 'END { exit !index($_, "," (__) "\n") }' RS='^$' FS='^$' __="${test_str}" my_csv.csv
(RS='^$' makes awk read the whole file as a single record, so index() simply looks for ,<test_str> followed by a newline anywhere in it; the exit status is 0 exactly when it is found, so the command can be used directly in an if.)

convert a file's content using a shell script

Hello everyone, I'm a beginner in shell scripting. On a daily basis I need to convert a file's data to another format; I usually do it manually with a text editor, but I often make mistakes, so I decided to write a simple script that can do the work for me.
The file's content looks like this:
/release201209
a1,a2,"a3",a4,a5
b1,b2,"b3",b4,b5
c1,c2,"c3",c4,c5
to this:
a2>a3
b2>b3
c2>c3
The script should ignore the first line and print the second and third values separated by '>'
I'm halfway there, and here is my code:
#!/bin/bash
# while loop
i=1
while IFS=\" read t1 t2 t3
do
    test $i -eq 1 && ((i=i+1)) && continue
    echo $t1|cut -d\, -f2 | { tr -d '\n'; echo \>$t2; }
done < $1
The problem in my code is that the last line isn't printed unless the file ends with an empty line (\n).
Also, I want the output to be printed into a new CSV file (I tried to redirect standard output to my new file, but only the last echo is printed there).
Can someone please help me out? Thanks in advance.
Rather than treating the double quotes as a field separator, it seems cleaner to just delete them (assuming that is valid), e.g.:
$ < input tr -d '"' | awk 'NR>1{print $2,$3}' FS=, OFS=\>
a2>a3
b2>b3
c2>c3
If you cannot just strip the quotes as in your sample input, and those quotes are escaping commas, you could hack together a solution, but you would be better off using a proper CSV parsing tool (e.g. Perl's Text::CSV).
Here's a simple pipeline that will do the trick:
sed '1d' data.txt | cut -d, -f2-3 | tr -d '"' | tr ',' '>'
Here, we're just removing the first line (as desired), selecting fields 2 & 3 (based on a comma field separator), removing the double quotes and mapping the remaining , to >.
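To get the result into a new CSV file (the second part of the question), redirect the whole pipeline once at the end; new.csv is just an assumed output name. The original symptom, where only the last echo ended up in the file, is what happens when > is applied inside a loop: the file is truncated on every iteration. Redirecting once, outside, or appending with >>, avoids that:
sed '1d' data.txt | cut -d, -f2-3 | tr -d '"' | tr ',' '>' > new.csv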
Use this Perl one-liner:
perl -F',' -lane 'next if $. == 1; print join ">", map { tr/"//d; $_ } @F[1,2]' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F',' : Split into @F on comma, rather than on whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
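For reference, run against the sample input above, the one-liner produces the desired output (a quick check; in_file is assumed to hold the sample data):
$ perl -F',' -lane 'next if $. == 1; print join ">", map { tr/"//d; $_ } @F[1,2]' in_file
a2>a3
b2>b3
c2>c3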

zsh sed expanding a variable with special characters and keeping them

I'm trying to store a string in a variable, then expand that variable in a sed command.
Several of the values I'm going to put in the variable before calling the command will have parentheses (with and without slashes before the left parentheses, but never before the right), new lines and other special characters. Also, the string will have double quotes around it in the file that's being searched, and I'd like to use those to limit only to the string I'm querying.
The command needs to be able to match those special characters in the file. I'm using zsh / macOS, although if the command were compatible with bash 4.2 that'd be a nice bonus. Echoing to xargs is fine too. Also, if awk would be better for this, I have no requirement to use sed.
Something like...
sed 's/"\"$(echo -E - ${val})\""/"${key}.localized"/g' "${files}"
Given that $val is the variable I described above, $key has no spaces (but underscores) & $files is an array of file paths (preferably compatible with spaces, but not required).
Example Input values for $val...
... "something \(customStringConvertible) here" ...
... "something (notVar) here" ...
... "something %# here" ...
... "something # 100% here" ...
... "something for $100.00" ...
Example Output:
... "some_key".localized ...
I was using the sed command to replace the examples above. The text I'm overwriting it with is pretty straight forward.
The key problem I'm having is getting the command to match with the special characters instead of expanding them and then trying to match.
Thanks in advance for any assistance.
awk is better since it provides functions that work with literal strings:
$ val='something \(customStringConvertible) here' awk 'index($0,ENVIRON["val"])' file
... "something \(customStringConvertible) here" ...
$ val='something for $100.00' awk 'index($0,ENVIRON["val"])' file
... "something for $100.00" ...
The above was run on this input file:
$ cat file
... "something \(customStringConvertible) here" ...
... "something (notVar) here" ...
... "something %# here" ...
... "something # 100% here" ...
... "something for $100.00" ...
With sed you'd have to follow the instructions at Is it possible to escape regex metacharacters reliably with sed to try to fake sed out.
It's not clear what your real goal is so edit your question to provide concise, testable sample input and expected output if you need more help. Having said that, it looks like you're doing a substitution so maybe this is what you want:
$ old='"something for $100.00"' new='here & there' awk '
s=index($0,ENVIRON["old"]) { print substr($0,1,s-1) ENVIRON["new"] substr($0,s+length(ENVIRON["old"])) }
' file
... here & there ...
or if you prefer:
$ old='"something for $100.00"' new='here & there' awk '
BEGIN { old=ENVIRON["old"]; new=ENVIRON["new"]; lgth=length(old) }
s=index($0,old) { print substr($0,1,s-1) new substr($0,s+lgth) }
' file
or:
awk '
BEGIN { old=ARGV[1]; new=ARGV[2]; ARGV[1]=ARGV[2]=""; lgth=length(old) }
s=index($0,old) { print substr($0,1,s-1) new substr($0,s+lgth) }
' '"something for $100.00"' 'here & there' file
... here & there ...
See How do I use shell variables in an awk script? for info on how I'm using ENVIRON[] vs ARGV[] above.

how ruby if column less than 4 print column 3?

I'm trying to use this code, but it does not work:
ruby -a -F';' -ne if $F[2]<4 'puts $F[3]' ppp.txt
This is my file:
mmm;2;nsfnjd
sadjjasjnsd;6;gdhjsd
gsduhdssdj;3;gsdhjhjsd
What am I doing wrong? Please help me.
First of all, instead of treating Ruby like some kind of fancy Perl and writing scripts like that, let's expand it into the Ruby code equivalent for clarity:
$; = ';'
while gets
  $F = $_.split
  if $F[2] < 4
    puts $F[3]
  end
end
Your original code doesn't work; it can't possibly work, because it's not valid Ruby code, and further, you're not properly quoting it to pass it through the -e evaluation term. Trying to run it I get:
-bash: 4: No such file or directory
You're also presuming the array is 1-indexed, but it's not. It's 0-indexed. Additionally Ruby treats integer values as completely different from strings, never equivalent, not auto-converted. As such you need to call .to_i to convert.
Here's a re-written program that does the job:
File.open(ARGV[0]) do |fi|
  fi.readlines.each do |line|
    parts = line.chomp.split(';')
    if parts[1].to_i < 4
      puts parts[2]
    end
  end
end
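A quick run against the sample file (assuming the program is saved as, say, filter.rb):
$ ruby filter.rb ppp.txt
nsfnjd
gsdhjhjsd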
I solved it with this:
ruby -a -F';' -ne 'if $F[1] < "4"; puts $F[2]; end' ppp.txt
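Note that $F[1] < "4" is a string comparison, which happens to work here because the field is a single digit; a numeric variant of the same one-liner (a sketch) is safer if values like 10 or 25 can appear:
ruby -a -F';' -ne 'puts $F[2] if $F[1].to_i < 4' ppp.txt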

bash script to modify and extract information

I am creating a bash script to modify and summarize information with grep and sed. But it gets stuck.
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
#Extract lines starting with ">#HWI"
ONLY=`grep -v ^\>#HWI`
#replaces A and G with R in lines
ONLYR=`sed -e s/A/R/g -e s/G/R/g $ONLY`
grep R $ONLYR | wc -l
The correct way to write a shell script to do what you seem to be trying to do is:
awk '
!/^>#HWI/ {
    gsub(/[AG]/,"R")
    if (/R/) {
        ++cnt
    }
}
END { print cnt+0 }
' "$@"
Just put that in the file myscript.sh and execute it as you do today.
To be clear - the bulk of the above code is an awk script, the shell script part is the first and last lines where the shell just calls awk and passes it the input file names.
If you WANT to have intermediate variables then you can create/print them with:
awk '
!/^>#HWI/ {
    only = $0
    onlyR = only
    gsub(/[AG]/,"R",onlyR)
    print "only:", only
    print "onlyR:", onlyR
    if (/R/) {
        ++cnt
    }
}
END { print cnt+0 }
' "$@"
The above will work robustly, portably, and efficiently on all UNIX systems.
First of all, and as @fedorqui commented, you're not providing grep with a source of input against which it will perform line matching.
Second, there are some problems in your script, which will result in unwanted behavior in the future, when you decide to manipulate some data:
Store matching lines in an array, or in a file from which you'll later read values. The variable ONLY is not the right data structure for the task.
By convention, environment variables (PATH, EDITOR, SHELL, ...) and internal shell variables (BASH_VERSION, RANDOM, ...) are fully capitalized. All other variable names should be lowercase. Since variable names are case-sensitive, this convention avoids accidentally overriding environment and internal variables.
Here's a better version of your script, considering these points, but with an open question regarding what you were trying to do in the last line, grep R $ONLYR | wc -l:
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
input_file=$1

# Read lines not matching the provided regex from $input_file
mapfile -t only < <(grep -v '^\>#HWI' "$input_file")

# Replace A and G with R in the lines
for ((i=0; i<${#only[@]}; i++)); do
    only[i]="${only[i]//[AG]/R}"
done

# DEBUG
printf '%s\n' "Here are the lines, after replace:"
printf '%s\n' "${only[@]}"

# I'm not sure what you were trying to do here. Am I guessing right that you wanted
# to count the number of R's in ALL lines?
# grep R $ONLYR | wc -l
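If the goal was to count the lines that contain an R after the replacement (still an assumption; it mirrors what cnt does in the awk answer above), the count can be taken straight from the array:
count=0
for line in "${only[@]}"; do
    [[ $line == *R* ]] && ((++count))
done
printf '%s\n' "Lines containing R: $count"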
