How do I handle newlines in shuf, calc etc? - bash

I have written a bash script:
for f in *.csv; do shuf -n 1000 "$f" > ./1000/"${f%.csv}_1000.csv" ; done
which, for each .csv file in a directory, randomly writes 1000 of its lines to a new file with the suffix '_1000' in the subdirectory 1000/, i.e.
afolder/cat.csv
afolder/dog.csv
becomes:
afolder/1000/cat_1000.csv
afolder/1000/dog_1000.csv
Each record is a tweet. This works fine except when records contain embedded newline characters. For example, one of my tweet records has a text field with newlines:
Hope Abbo gets his Sen in #bcafc trenches with McCall & Black..
More Warriors The Better
#ShoulderToShoulder
This is handled correctly in LibreOffice Calc: the three lines are kept together in one record. When I look at the output of shuf, however, it has picked just one of the three text lines instead of keeping them together.
Is there any way of telling shuf to keep them together?
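shuf itself only works on whole physical lines, so one workaround (a sketch, not from the original post) is to glue each multi-line CSV record onto a single line before shuffling and restore the newlines afterwards. This assumes the fields containing newlines are double-quoted, as in standard CSV, and that the byte \001 never occurs in the data:
for f in *.csv; do
  awk '{
    buf = (buf == "" ? $0 : buf "\001" $0)   # glue physical lines of one record together
    if (gsub(/"/, "&", buf) % 2 == 0) {      # an even number of quotes = record is complete
      print buf
      buf = ""
    }
  }' "$f" \
  | shuf -n 1000 \
  | tr '\001' '\n' > ./1000/"${f%.csv}_1000.csv"
done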

Related

Making a list from data in a few variable bash script [duplicate]

I want to merge two lists with the delimiter "-".
The first list has 2 words:
$ cat first
one
who
The second list has 10000 words:
$ cat second
languages
more
simple
advanced
home
expert
......
......
test
nope
I want the two lists merged like this:
$ cat merge-list
one-languages
one-more
....
....
who-more
....
who-test
who-nope
....
Paste should do the trick.
paste is a Unix command line utility which is used to join files horizontally (parallel merging) by outputting lines consisting of the sequentially corresponding lines of each file specified, separated by tabs, to the standard output.
Example
paste -d - file1 file2
EDIT:
I just saw that your two files have different lengths. Unfortunately, paste does not help with that kind of problem, but you could of course use something like this:
for i in `cat file1`; do
    for j in `cat file2`; do
        echo "$i-$j"
    done
done
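A slightly more robust variant of the same idea (a sketch, assuming one word per line in each file) uses while read instead of for over cat, which avoids word splitting and globbing on the file contents:
while IFS= read -r i; do
    while IFS= read -r j; do
        printf '%s-%s\n' "$i" "$j"   # print every pairing, joined with "-"
    done < second
done < first > merge-list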

Bash script which adds space inside long words in Pages file

I like to convert documents to EPUB format because it is easier for me to read. However, if I do this for, say, some code documentation, some really long lines of code are not readable in the EPUB because they trail off-screen. I would like to automatically insert spaces into any words in a text file (specifically, a Pages document) over a certain length, so that they are reduced to, say, 10-character words at most. Then I will convert that Pages document to an EPUB.
How can I write a bash script which goes through a Pages document and inserts spaces into any word longer than, perhaps, 10 characters?
sed is your friend:
$ cat input.txt
a file with a
verylongwordinit to test with.
$ sed 's/[^[:space:]]\{10\}/& /g' input.txt
a file with a
verylongwo rdinit to test with.
For every sequence of 10 non-whitespace characters in each line, a space is added after it (the & in the replacement text is replaced with the matched text).
If you want to change the file inline instead of making a copy, ed comes into play:
ed input.txt <<'EOF'
s/[^[:space:]]\{10\}/& /g
w
EOF
(Or: some versions of sed take an -i switch for in-place editing.)
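For example, with GNU sed the in-place edit is the first line below; BSD/macOS sed requires an explicit (possibly empty) backup suffix after -i:
sed -i 's/[^[:space:]]\{10\}/& /g' input.txt       # GNU sed: edit in place, no backup
sed -i '' 's/[^[:space:]]\{10\}/& /g' input.txt    # BSD/macOS sed: empty backup suffix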

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern (the entire line) followed by the names of all the text files in which a match was found, together with their entire matching lines (not just the first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take a file of patterns, but the problem is that it searches for every pattern in the pattern file in a given text file before moving on to the next file, which makes the output above harder to produce. So I thought it would be better to loop through each line of the pattern file, print the line, and then search for it in the many files, checking whether the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
    echo $line >> output.txt
    zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and for the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
A few lines from the pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
A few lines from one of the searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
Both are in reality much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
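Spelled out with comments, the same one-liner reads patternfile first, remembers each line's first two columns, and then prints every line of anyfile whose first two columns were seen:
awk '
    NR == FNR    { a[$1,$2]; next }   # first file (patternfile): remember its first two columns
    ($1,$2) in a                      # later file: print lines whose first two columns were seen
' patternfile anyfile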
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
If anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
    echo "$column1 $column2 $rest_of_the_line"
    zgrep -w -l "^$column1\s\+$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read is able to parse a line into multiple variables passed as parameters, the last of which gets the rest of the line. It splits fields on the characters of $IFS, the Internal Field Separator (by default tabs, spaces and newlines; this can be overridden for the read command alone by using while IFS='...' read ...).
Using -r avoids unwanted escapes and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while loop is redirected, I also put the redirection on the while rather than on each individual command.
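If the matching lines themselves are wanted in the output (as in the example at the top of the question) rather than just the file names, a variation without -l could look like the sketch below; this assumes your zgrep forwards its options to grep and, given several files, labels each match with the file it came from:
while read -r column1 column2 rest_of_the_line
do
    echo "$column1 $column2 $rest_of_the_line"
    # no -l: print each matching line, prefixed with the name of the file it was found in
    zgrep -w "^$column1\s\+$column2" many_files/*txt
done < pattern_file.txt >> output.txt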

Combine .csv files on Mac OSX terminal does not use a new line in between

I have multiple csv files that I wish to merge into one.
a.csv
Field1,Field2,Field3
1,2,3
4,5,6
b.csv
Field4,Field5,Field6
7,8,9
10,11,12
When I run the following command on Mac OSX Terminal
cat *.csv >merged.csv
The files get concatenated as follows -
Field1,Field2,Field3
1,2,3
4,5,6Field4,Field5,Field6
7,8,9
10,11,12
However, I would like the second file's contents to start on a separate line:
Field1,Field2,Field3
1,2,3
4,5,6
Field4,Field5,Field6
7,8,9
10,11,12
How can this be done best?
cat *.csv + new line >merged.csv
The problem is that your first file (and probably the rest as well) doesn't have a newline at the end of the last line. In unix-style text files, every line is supposed to have a newline terminator at the end. Result: when you catenate the files together, there's no terminator at the end of the "4,5,6" line, so "Field4,Field5,Field6" gets treated as part of the same line.
Fortunately, there's a pretty simple solution: use something that processes (and appends) files line-by-line rather than just blindly sticking them together. Here's an example using awk:
awk '{print $0}' *.csv
BTW, I wouldn't recommend using the format somecmd *.csv >merged.csv, because merged.csv can wind up being both an input and output, leading to weird results. Whether this happens (and whether it matters) is complicated, but it's best to just avoid the issue by using a more specific wildcard pattern, putting the input and output in different directories, or something like that.
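For example, sending the merged output outside the directory covered by the *.csv wildcard avoids the problem entirely (the ../merged.csv path here is just an illustration):
awk '{print $0}' *.csv > ../merged.csv   # output lives outside the directory being globbed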

Reading the contents of a text file and assign to a variable using bash script

I want to read the contents of a text file, check for the filenames with the extension .txt, and find and merge those .txt files. Is there a way I could do this using bash?
For example, if the text file contains,
file1.txt, file2.txt
I want to read the strings with the .txt extension and find and merge those files, which are in another location.
I tried the following:
txt_file="/tmp/Muzi/tomerge.txt"
while read -r line; do
    echo $line
done <"$txt_file"
But this just prints out the complete text file, and I am completely new to using bash.
There is a good deal of assuming involved, but... if I understood your question, you have a tomerge.txt in which filenames ending in .txt appear, one per line. If that is the case (and the filenames do not contain spaces) you can:
cat $(grep '[.]txt$' tomerge.txt)
It's not bash-only (it uses cat); it concatenates the files named on all the lines ending in .txt that grep collected from tomerge.txt.
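If the listed filenames might contain spaces, a variant that reads the list line by line works too (a sketch: it still assumes one filename per line, and merged.txt is just an assumed output name):
while IFS= read -r name; do
    case $name in
        *.txt) cat -- "$name" ;;   # concatenate only the lines that name a .txt file
    esac
done < /tmp/Muzi/tomerge.txt > merged.txt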
