I have a .csv file of about 5,400 character strings that appear, among many other strings, in a large .txt file of a huge corpus. I need to count the occurrences of each of the 5,400 strings in the .txt corpus file. I'm using the shell (on a MacBook Pro) and I don't know how to write a for loop that takes its input from one file and then works on another file. The input_file.csv looks like this:
A_back
A_bill
A_boy
A_businessman
A_caress
A_chat
A_con
A_concur
A_cool
A_cousin
A_discredit
A_doctor
A_drone_AP_on
A_fellow
A_flatter
A_friend
A_gay
A_giddy
A_guilty
A_harangue
A_ignore
A_indulge
A_interested
A_kind
A_laugh
A_laugh_AP_at
...
The corpus_file.txt I'm searching through is a cleaned and lemmatized corpus with one sentence per line; this is 4 lines of the text:
A_recently N_pennsylvania N_state_N_university V_launch a N_program that V_pay A_black N_student AP_for V_improve their N_grade a N_c AP_to N_c A_average V_bring 550 and N_anything A_high V_bring 1,100
A_here V_be the N_sort AP_of A_guilty N_kindness that V_kill
what N_kind AP_of N_self_N_respect V_be a A_black N_student V_go AP_to V_have AP_as PR_he or PR_she V_reach AP_out AP_to V_take 550 AP_for N_c N_work A_when A_many A_white N_student V_would V_be V_embarrass AP_by A_so A_average a N_performance
A_white N_student V_would V_be V_embarrass AP_by A_so A_average a N_performance
I am looking to count exactly how many times each of the strings in input_file.csv appears in corpus_file.txt. I can do one at a time with the following code:
grep -c A_guilty corpus_file.txt
And in a few seconds I get a count of how many times A_guilty appears in corpus_file.txt (it appears once in the bit of the corpus I have put here). However, I don't want to do that 5,400 times, so I'm trying to put it into a loop that will output each character string and its count.
I have tried to run the code below:
for input_file.csv in directory/path/folder/ do grep -c corpus_file.txt done
But it doesn't work. input_file.csv and corpus_file.txt are both in the same folder, so they share the same directory path.
I'm hoping to end up with a list of the 5,400 character strings and the number of times each string appears in the large corpus_file.txt file. Something like this:
term - count
A_back - 2093
A_bill - 873
A_boy - 1877
A_businessman - 148
A_caress - 97
A_chat - 208
A_con - 633
This might be all you need:
$ cat words
sweet_talk
white_man
hispanic_american
$ cat corpus
foo
sweet_talk
bar
hispanic_american
sweet_talk
$ grep -Fowf words corpus | sort | uniq -c
1 hispanic_american
2 sweet_talk
If not, then edit your question to clarify your requirements and provide truly representative sample input/output.
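If you want the output in the exact "term - count" layout from the question, including terms that never occur at all, a loop over the .csv is one option. This is only a minimal sketch, assuming input_file.csv has one term per line and no header:

# One grep pass per term, so slower than grep -Fowf above, but it prints
# every term even when its count is 0. As in the question, grep -c counts
# matching lines, not multiple occurrences on the same line.
while IFS= read -r term; do
    [ -z "$term" ] && continue                      # skip blank lines
    count=$(grep -cFw -- "$term" corpus_file.txt)
    printf '%s - %s\n' "$term" "$count"
done < input_file.csv > term_counts.txt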
I'm new to bash shell scripting.
How can I compare 8 output files (extension-less, containing only binary values) that all hold the same number of values, each 0 or 1?
To clarify things, this is what I've done so far.
for d in */; do
find . -name base -execdir sh -c 'cat {} >> out' \;
done
I've found all the files located in sub-folders and concatenated each set of binary files into an out file.
Now I have 8 out files (one per parent folder) that I need to compare.
I've tried both "diff" and "cmp" - but they both work only with 2 files.
In the end, I need to check whether there is any difference between these 8 binary files and export the results in HEX format. For example: an out file that is all '1's maps to F, and one that is all '0's maps to 0; hence a final result of FFFF 0000 would mean the first 4 files are all '1' and the last 4 files are all '0'.
What is the best option to do so? - Hope that I've managed to clarify my case.
Thanks a lot for the help.
Let me assume:
We have 8 (presumably binary) files, say: dir1/out.txt, dir2/out.txt, ..
dir8/out.txt.
We want to compare these files and identify which files are identical
and which are not.
Then how about these steps:
Generate hash values of the files with e.g. sha256sum.
Compare the hash values and divide the files into groups based on them.
I have created 8 test files; of those, dir1/out.txt, dir2/out.txt and dir4/out.txt
are identical, dir3/out.txt and dir7/out.txt are identical, and the others
differ.
Then the hash values will look like:
sha256sum dir*/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir1/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir2/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir3/out.txt
298497ad818c3d927498537ed5ab4f9ae663747b6d00ec9a5d0bd9e30a6b714b dir4/out.txt
f45151f5253c62de69c95935f083b5649876fdb661412d4f32065a7b018bf68b dir5/out.txt
bdc26931acfb734b142a8d675f205becf27560dc461f501822de13274fe6fc8a dir6/out.txt
e962879ef251f2117460cf0d5ce714e36a9ab79f2548c48e2121b4e573cf179b dir7/out.txt
11a77c3d96c06974b53d7f40a577e6813739eb5c811b2a86f59038ea90add772 dir8/out.txt
To summarize the result, let me replace the hash values with a group id,
assigning the same number to identical files in order of occurrence.
Here's the script:
sha256sum dir*/out.txt | awk '{if (!gid[$1]) gid[$1] = ++n; print $2 " " gid[$1]}'
The output:
dir1/out.txt 1
dir2/out.txt 1
dir3/out.txt 2
dir4/out.txt 1
dir5/out.txt 3
dir6/out.txt 4
dir7/out.txt 2
dir8/out.txt 5
where the second field shows the group id to indicate which files are identical.
Note that the group id does not represent the content of each file (as in:
an out.txt file of all '1's = F and of all '0's = 0),
because I have no idea what the files look like. If the OP can provide
example files, I could be of more help.
BTW, I'm still in doubt whether the files are binary in the ordinary sense, because
the OP mentions that "it's simply a file that contains 0 or 1 in its
value when I open it". It sounds to me like the files are composed of
ASCII "0"s and "1"s. My script above should work for both binary files
and text files anyway.
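If a pairwise report is preferred over hashing, cmp -s can be looped over every pair of files. This is only a sketch, assuming the same dir*/out.txt layout as above:

# Compare every pair of out.txt files once and report the pairs that
# differ; cmp -s prints nothing and only sets its exit status.
files=(dir*/out.txt)
for ((i = 0; i < ${#files[@]}; i++)); do
    for ((j = i + 1; j < ${#files[@]}; j++)); do
        cmp -s "${files[i]}" "${files[j]}" || echo "differ: ${files[i]} ${files[j]}"
    done
done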
[Update]
According to the OP's information, here's a solution for the specific case:
#!/bin/bash
for f in dir*/out.txt; do
if [[ $(uniq "$f" | wc -l) = 1 ]]; then
echo -n "$(head -1 "$f" | tr 1 F)"
else
echo -n "-"
fi
done
echo
It digests the contents of each file to a single character: 0 if the file is all 0's, F if it is all 1's, or - for a mixture (a possible error case).
For instance, if dir{1..4}/out.txt are all 0's, dir5/out.txt is a mixture, and dir{6..8}/out.txt are all 1's, then the output will look like:
0000-FFF
I hope it will meet the OP's requirements.
If you are looking for records that are unique across your list of files (note that uniq -u only drops adjacent duplicates, so the input has to be sorted first):
cat $path/$files | sort | uniq -u > /tmp/output.txt
grep -f /tmp/output.txt $path/$files
I got these words
Frank_Sinatra
Dean_Martin
Ray_Charles
I want to generate 4 characters which will always match with those words and never change.
e.g.:
frk ) Frank_Sinatra
dnm ) Dean_Martin
Ray ) Ray_Charles
and it should always produce the same 4 characters when I run it again (not random)
note:
Something like this:
String 32-bit checksum 8-bit checksum
ABC 326 0x146 70 0x46
ACB 410 0x19A 154 0x9A
BAC 350 0x15E 94 0x5E
BCA 450 0x1C2 194 0xC2
CAB 399 0x18F 143 0x8F
CBA 256 0x100 0 0x00
http://www.flounder.com/checksum.htm
Look at this command --->
echo -n Frank_Sinatra | md5sum
d0f7287be11d7bbfe53809088ea3b009 -
but instead of that long string, I wanted just 4 unique characters.
I did it like this:
echo -n "Frank_Sinatra" | md5sum > foo ; sed -i 's/./&\n#/4' foo
grep -v "#" foo > bar
I'm not going to write the entire program for you, but I can share an algorithm that can accomplish this. I can't guarantee that it is the most optimized algorithm.
Problem
Generate a 3-letter identifier for each line in a text file that is unique, such that grep will only match with the intended line.
Assumption
There exists a 3-letter identifier for each line such that grep will only match that line.
Algorithm
For every line in the text file:
Grab a permutation of the line and run grep on the file using that permutation.
If grep returns more than one line, get a new permutation of the line and go back to the previous step.
If grep returns only one line and that line matches our current line, we have found a proper identifier. Store this identifier.
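Here is a rough bash sketch of that idea, with one simplification that is my own assumption rather than part of the algorithm above: instead of arbitrary permutations it tries 3-character substrings of each word, left to right, and keeps the first one that grep finds on exactly one line of the list. The file name words.txt is hypothetical:

# For each word, try its 3-character substrings until one of them matches
# exactly one line of the word list; print "identifier ) word".
while IFS= read -r word; do
    found=""
    for ((i = 0; i + 3 <= ${#word}; i++)); do
        candidate=${word:i:3}
        if [ "$(grep -cF -- "$candidate" words.txt)" -eq 1 ]; then
            found=$candidate
            break
        fi
    done
    printf '%s ) %s\n' "${found:-???}" "$word"
done < words.txt

Because the substrings are tried in a fixed order, the result is deterministic: the same word always gets the same identifier as long as the list does not change.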
I have a question about randomly selecting a read from sampled paired-end fastq files. I read some topics on this matter but none could solve my problem, which is:
I got two fastq files R1.fastq and R2.fastq. What I want to achieve is to randomly sample those files and from each sampled pair of reads I want to randomly select only one read.
What I did so far is...
I sampled my files using seqtk:
seqtk sample -s100 R1.fastq 10000 > R1_sample.fastq
seqtk sample -s100 R2.fastq 10000 > R2_sample.fastq
then I sorted each file by sequence ID like this:
paste - - - - < R1_sample.fastq | sort -k1 -t " " | tr "\t" "\n" > R1_sample_sorted.fastq
I did the same with R2_sample.fastq. Then I merged both sorted files so that R1 would be in one column and R2 in the second column:
pr -mts R1_sample_sorted.fastq R2_sample_sorted.fastq > merged.fastq
the file looks like this:
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TGATGTTTGGATGTAAAGTGAAATATTAGTTGGCG AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAA
+ +
BBBFFFFFFFFFFFIFFIFFIIIIFIIIFIIFIII B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
CCTCCTAGGCGACCCAGACAATTATACCCTAGCCA TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGG
+ +
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:1000 #D3YGT8Q1:297:C7T4RACXX:3:1101:1000
TTCTATTTATTACCTCAGAAGTTTTTTTCTTCGCA GTAAAAGGCTCAGAAAAATCCTGCGAAGAAAAAAA
+ +
BBBFFFFFFFFFFIIIIIIIIFIIFIIIFIIIIII BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIII
And now I want to randomly select only one read from each pair. My initial idea was to use shuf to get a random number from range 1-2:
shuf -i1-2 -n1
and then somehow select the read corresponding to the number I got from shuf. For example, in the first iteration I got 1, so I pick the read from column 1; in the second iteration I got 2, so from the next pair of reads I pick the read in the second column, etc.
I got stuck here. So my question is, is there a neat way to do this? Maybe with awk or some other method? Any help will be much appreciated.
Comment on Ashafix's answer:
Thanks for your response and sorry for the huge delay!
I've tested your solutions and they both seem to have flaws.
For the first script I constructed test fastq files R1 and R2, each containing 6 reads. After running the script I expected it to output 6 reads as well (24 lines) in the correct order (ID, seq, desc, qual), but as a set of reads randomly selected from the R1 or R2 file. What I got from the script is:
#D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
#D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 1:N:0:ATGCTCGTTCTCTCGT
GTAAGGTTAGGAGGGTGTTAATTATTAAAATTAAGGCGAAGTTTATTACTCTTTTTTGAATGTTG
+
BBBFFFFFFFFFFIIBFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFF
You can see that the output is not correct. The second read is missing three lines, and there should be six reads in total, not three. In addition, each time I run the script it outputs a different number of reads.
For the second script I input a merged fastq file as described above. The output looks similar to the first script's output:
#D3YGT8Q1:297:C7T4RACXX:3:1101:10002:27381 2:N:0:ATGCTCGTTCTCTCGT
AGCTTTCCTCACTATCTGCTTCATCCGCCAACTAATATTTCACTTTACATCCAAACATCAAGATC
+
B0<FFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIFIFIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:50631 2:N:0:ATGCTCGTTCTCTCGT
TGTTTAAGGGGTTGGCTAGGGTATAATTGTCTGGGTCGCCTAGGAGGAGATCGGAAGAGCGTCGT
+
BBBFFFFFFFFFFIIIIIIIIBFFIIIIIIIIIIIFFFIIIIIIFIIIIIFIIIFFFFFFFFFFF
#D3YGT8Q1:297:C7T4RACXX:3:1101:10004:88140 1:N:0:ATGCTCGTTCTCTCGT
ACTGTAACTTAAAAATGATCAAATTATGTTTCCCATGCATCAGGTGCAATGAGAAGCTCTTCATC
+
BBBFFFFFFFFFFIIIIIIIIIIFIIIIIIFIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIII
#D3YGT8Q1:297:C7T4RACXX:3:1101:10007:32152 2:N:0:ATGCTCGTTCTCTCGT
CTAGTTTTGACAACATTCAAAAAAGAGTAATAAACTTCGCCTTAATTTTAATAATTAACACCCTC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFIII
but this time I always get five reads, so one is still missing. Also, the second and third read headers are the same, which should not happen.
You could try the following script (it also works as a one-liner). First it gets all the headers from your first fastq file, then it randomly picks a fastq file and returns 4 lines from it.
Please note: This only works if both files have identical headers at identical positions.
#!/bin/bash
headers=$(grep '#' R1_sample.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var "R$r"_sample.fastq | grep -m 1 -A 4 $line
var=$((var+4))
done
Alternatively, you could expand your merge-and-pick-a-column approach. cut is used to keep one randomly chosen column from the merged output.
#!/bin/bash
headers=$(grep '#' merged.fastq)
var=1
for line in $headers ; do
r=$(shuf -i1-2 -n1)
tail -n +$var merged.fastq | grep -m 1 -A 4 $line | cut -d$'\t' -f$r
var=$((var+4))
done
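As another variation on the merge-and-pick-a-column idea (a sketch of my own, untested on your data): since pr -mts joins the two reads with a tab, a single awk pass can choose a column once per 4-line record, which avoids rescanning the file with tail and grep:

# Assumes every line of merged.fastq holds the R1 field, a tab, then the
# R2 field, and that each record is exactly 4 lines long.
awk -F'\t' '
    BEGIN { srand() }                              # seed so each run differs
    FNR % 4 == 1 { col = (rand() < 0.5) ? 1 : 2 }  # pick R1 or R2 once per record
    { print $col }
' merged.fastq > single_reads.fastq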
Cut a file into several files according to numbers in a list:
$ wc -l all.txt
8500 all.txt
$ wc -l STS.*.txt
2000 STS.input.answers-forums.txt
1500 STS.input.answers-students.txt
2000 STS.input.belief.txt
1500 STS.input.headlines.txt
1500 STS.input.images.txt
How do I split my all.txt according to the number of lines in each STS.*.txt and save the pieces to the respective STS.output.*.txt files?
I've been doing it manually as such:
$ sed '1,2000!d' all.txt > STS.output.answers-forums.txt
$ sed '2001,3500!d' all.txt > STS.output.answers-students.txt
$ sed '3501,5500!d' all.txt > STS.output.belief.txt
$ sed '5501,7000!d' all.txt > STS.output.headlines.txt
$ sed '7001,8500!d' all.txt > STS.output.images.txt
The all.txt input would look something like this:
$ head all.txt
2.3059
2.2371
2.1277
2.1261
2.0576
2.0141
2.0206
2.0397
1.9467
1.8518
Or sometimes all.txt looks like this:
$ head all.txt
2.3059 92.123
2.2371 1.123
2.1277 0.12452
2.1261123 213
2.0576 100
2.0141 0
2.02062 1
2.03972 34.123
1.9467 9.23
1.8518 9123.1
As for the STS.*.txt, they are just plain text lines, e.g.:
$ head STS.output.answers-forums.txt
The problem likely will mean corrective changes before the shuttle fleet starts flying again. He said the problem needs to be corrected before the space shuttle fleet is cleared to fly again.
The technology-laced Nasdaq Composite Index .IXIC inched down 1 point, or 0.11 percent, to 1,650. The broad Standard & Poor's 500 Index .SPX inched up 3 points, or 0.32 percent, to 970.
"It's a huge black eye," said publisher Arthur Ochs Sulzberger Jr., whose family has controlled the paper since 1896. "It's a huge black eye," Arthur Sulzberger, the newspaper's publisher, said of the scandal.
I wish you'd posted sample input for splitting an input file of, say, 10 lines into output files of, say, 2, 3, and 5 lines instead of 8500 lines, as that would have given us something to test a solution against. Oh well, this might work but is untested of course:
awk '
ARGIND < (ARGC-1) { outfile[NR] = gensub(/input/,"output","",FILENAME); next }
{ print > outfile[FNR] }
' STS.input.* all.txt
The above uses GNU awk for ARGIND and gensub().
It just creates an array that maps each line number across all "input" files to the name of the "output" file that that same line number of "all.txt" should be written to.
Any time you write a loop in shell just to manipulate text you have the wrong approach. The guys who created shell also created awk for shell to call to manipulate text so just do that.
I would suggest writing a loop:
total=0
for file in answers-forums answers-students belief headlines images; do
lines=$(wc -l < "STS.input.$file.txt")
sed "$(( total + 1 )),$(( total + lines ))!d" all.txt > "STS.output.$file.txt"
(( total += lines ))
done
total keeps track of how many lines have been read so far. The sed command extracts the lines from total + 1 to total + lines, writing them to the corresponding output file.
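With either approach, a quick sanity check (my suggestion, assuming the file names above) is to compare the line counts of the inputs and outputs:

# Each STS.output.*.txt should have the same line count as the matching
# STS.input.*.txt, and both totals should be 8500.
wc -l STS.input.*.txt
wc -l STS.output.*.txt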
I was wondering if it is possible to split a file into equal parts (edit: all equal except for the last), without breaking lines? Using the split command in Unix, lines may be broken in half. Is there a way to, say, split a file into 5 equal parts, but have each part consist only of whole lines (it's no problem if one of the files is a little larger or smaller)? I know I could just calculate the number of lines, but I have to do this for a lot of files in a bash script. Many thanks!
If you mean an equal number of lines, split has an option for this:
split --lines=75
If you need to know what that 75 should really be for N equal parts, it's:
lines_per_part = int((total_lines + N - 1) / N)
where total_lines can be obtained with wc -l.
See the following script for an example:
#!/usr/bin/bash
# Configuration stuff
fspec=qq.c
num_files=6
# Work out lines per file.
total_lines=$(wc -l <${fspec})
((lines_per_file = (total_lines + num_files - 1) / num_files))
# Split the actual file, maintaining lines.
split --lines=${lines_per_file} ${fspec} xyzzy.
# Debug information
echo "Total lines = ${total_lines}"
echo "Lines per file = ${lines_per_file}"
wc -l xyzzy.*
This outputs:
Total lines = 70
Lines per file = 12
12 xyzzy.aa
12 xyzzy.ab
12 xyzzy.ac
12 xyzzy.ad
12 xyzzy.ae
10 xyzzy.af
70 total
More recent versions of split allow you to specify a number of CHUNKS with the -n/--number option. You can therefore use something like:
split --number=l/6 ${fspec} xyzzy.
(that's ell-slash-six, meaning lines, not one-slash-six).
That will give you roughly equal files in terms of size, with no mid-line splits.
I mention that last point because it doesn't give you roughly the same number of lines in each file, but rather roughly the same number of characters.
So, if you have one 20-character line and 19 1-character lines (twenty lines in total) and split to five files, you most likely won't get four lines in every file.
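A quick way to see that effect (a made-up demo, not part of the original answer): build a 20-line file in which one line is much longer than the rest, split it into 5 line-preserving chunks, and count the lines per chunk:

# One 21-byte line followed by 19 two-byte lines.
{ printf 'aaaaaaaaaaaaaaaaaaaa\n'; yes b | head -19; } > demo.txt
split --number=l/5 demo.txt part.
wc -l part.*    # the line counts come out noticeably uneven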
The script isn't even necessary; split(1) supports the wanted feature out of the box:
split -l 75 auth.log auth.log.
The above command splits the file into chunks of 75 lines apiece, and outputs files of the form auth.log.aa, auth.log.ab, ...
wc -l on the original file and output gives:
321 auth.log
75 auth.log.aa
75 auth.log.ab
75 auth.log.ac
75 auth.log.ad
21 auth.log.ae
642 total
A simple solution for a simple question:
split -n l/5 your_file.txt
no need for scripting here.
From the man file, CHUNKS may be:
l/N split into N files without splitting lines
Update
Not all Unix distributions include this flag. For example, it will not work on OS X. To use it there, you can consider replacing the Mac OS X utilities with the GNU core utilities.
split was updated in coreutils release 8.8 (announced 22 Dec 2010) with the --number option to generate a specific number of files. The option --number=l/n generates n files without splitting lines.
coreutils manual
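On macOS, one way to pick up the GNU version is through Homebrew (this assumes Homebrew is already installed); its coreutils formula installs the GNU tools with a g prefix by default:

brew install coreutils
gsplit -n l/5 your_file.txt    # GNU split, installed as gsplit by Homebrew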
I made a bash script that, given a number of parts as input, splits a file.
#!/bin/sh
parts_total="$2";
input="$1";
parts=$((parts_total))
for i in $(seq 0 $((parts_total-2))); do
lines=$(wc -l "$input" | cut -f 1 -d" ")
#n is rounded, 1.3 to 2, 1.6 to 2, 1 to 1
n=$(awk -v lines=$lines -v parts=$parts 'BEGIN {
n = lines/parts;
rounded = sprintf("%.0f", n);
if(n>rounded){
print rounded + 1;
}else{
print rounded;
}
}');
head -$n "$input" > split${i}
tail -$((lines-n)) "$input" > .tmp${i}
input=".tmp${i}"
parts=$((parts-1));
done
mv .tmp$((parts_total-2)) split$((parts_total-1))
rm .tmp*
I used the head and tail commands, storing intermediate pieces in temporary files, to split the file.
#10 means 10 parts
sh mysplitXparts.sh input_file 10
or with awk, where 0.1 is 10% => 10 parts, or 0.334 is 3 parts
awk -v size=$(wc -l < input) -v perc=0.1 '{
nfile = int(NR/(size*perc));
if(nfile >= 1/perc){
nfile--;
}
print > "split_"nfile
}' input
var dict = File.ReadLines("test.txt")
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Select(line => line.Split(new char[] { '=' }, 2, StringSplitOptions.None))
    .ToDictionary(parts => parts[0], parts => parts[1]);
or
line="to=xxx#gmail.com=yyy#yahoo.co.in";
string[] tokens = line.Split(new char[] { '=' }, 2, 0);
Result:
tokens[0] = "to"
tokens[1] = "xxx#gmail.com=yyy#yahoo.co.in"