I'm just trying to count the occurrences of a word without writing an iteration file by file. I don't mind which kind of file it is. The closest I got is:
COUNT=$(grep -r -n -i "theWordImSearchingFor" .)
echo $COUNT
I thought about splitting that by spaces, but the problem is that the output contains not just the filename and the line number but also the content (and that may have tons of spaces). E.g. I got:
./doc1.txt:29: This is the content containing theWordImSearchingFor but also other stuff
./doc1.txt:43: This is another line containing theWordImSearchingFor
./dir123/doc2.txt:339: .This is another...file...theWordImSearchingFor....
Any idea on how to keep it simple? TIA
To count the occurrences of a specific word, you can keep the same general approach but simplify it. There are many ways to do this; two of them are considerably simpler than the command you posted.
The two simpler versions:
1st way
2nd way
Both should work, unless there is a problem with the package installation.
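As an illustration only (not necessarily either of the two ways meant above), here is a minimal sketch that counts every match rather than every matching line. It assumes a grep that supports -r and -o (GNU and BSD grep do; neither flag is in POSIX), and countOccurrences is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count all occurrences of a word below a directory.
# $1 = word to search for, $2 = directory (defaults to the current one)
countOccurrences() {
  # -r: recurse, -i: ignore case, -o: print each match on its own line,
  # so the number of output lines equals the number of matches.
  grep -r -o -i -- "$1" "${2:-.}" | wc -l
}

# Usage (assumed word): countOccurrences "theWordImSearchingFor" .
```

Note that grep -c would count matching lines per file, so a line containing the word twice would only be counted once.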
I have two huge comma-delimited files.
The 1st file has 280 million lines and the following columns
first name, last name, city, state, ID, email*, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
The 2nd file has 420 million lines and the following columns
first name, last name, city, state, email
John,Smith,LA,CA,johnsmith#hotmail.com
Bob,Marble,SF,CA,bobmarble#gmail.com
Tai,Nguyen,SD,CA,tainguyen#gmail.com
* a lot of these fields are empty
I want to merge all the lines from both files whose first 4 columns match, then fill in the missing emails in the first file with the emails from the second file. If the email in the first file is not blank, it should not be changed. The process should be case-insensitive. If several lines share the same 4 values, just ignore those lines and work on the unique ones only.
The result should have the following columns and look like this
first name, last name, city, state, ID, email, phone
John,Smith,LA,CA,123123123123,johnsmith#yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,bobmarble#gmail.com,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen#gmail.com,12041284
Only lines where all 4 columns match should be printed, not lines where just 1, 2 or 3 of them match. My boss insists on using a Bash shell script for this, and I am a newbie in Bash. Please give me a clear explanation, as I am very new to this.
I did some reading and understand that awk requires storing information in memory. However, I can split the big files into small files and use awk on those. I copied some code from online and adapted it to my needs, but whenever it fills in a blank email, it also changes the field delimiter of that line from comma to space. I want to stop that but don't know how. Please help me solve this problem. All advice and answers are highly appreciated.
awk -F "," 'NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt
The awk approach you showed is not suited for files that big. It stores parts of the files in memory. With the same approach you would need to store either ... or ...
280 million entries of the form first name, last name, city, state → ID, phone
420 million entries of the form first name, last name, city, state → email
Assume we go with the first option and each entry takes up only 50 bytes of memory. To store all 280 million entries we need 280M·50B = 14'000 MB = 14 GB. This is the absolute minimum of memory you need to run the awk command. In reality it would be even more due to implementation details of associative arrays.
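For inputs that do fit in memory (for example the smaller split files mentioned in the question), the comma-to-space change has a simple cause: when a field is assigned, awk rebuilds the line using its output field separator OFS, which defaults to a single space. A minimal sketch, assuming the column layouts from the question; fillEmails is a hypothetical name, and the handling of duplicate keys asked for in the question is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: fill empty emails without changing the delimiter.
# $1 = file with emails (5 columns), $2 = file to fill (7 columns)
fillEmails() {
  awk 'BEGIN { FS = OFS = "," }   # OFS="," keeps commas when a line is rebuilt
       # First file: remember the email under a lowercased 4-column key.
       NR == FNR { email[tolower($1 FS $2 FS $3 FS $4)] = $5; next }
       # Second file: only fill the email (column 6) when it is empty.
       $6 == "" { $6 = email[tolower($1 FS $2 FS $3 FS $4)] }
       { print }' "$1" "$2"
}

# Usage (file names from the question): fillEmails file2.txt file1.txt > file3.txt
```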
What you can do instead
Use the classical approach to the problem:
sort both files
join the files by their first four columns*
cut the desired columns from the joined result**
* needs some pre- and post-processing as join can only join one column.
** Since we have to re-arrange the email column cut is not sufficient. We can use awk instead.
#! /bin/bash
prefixWithKey() {
sed -E 's/([^,]*,){4}/\L&\E\t&/' "$1"
}
sortByKeyInPlace() {
sort -t $'\t' -k1,1 -o "$1" "$1"
}
joinByKey() {
join -t $'\t' "$1" "$2"
}
cutColumns() {
awk 'BEGIN{FS="\t|,\t*"; OFS=","} {print $5,$6,$7,$8,$9,$16,$11}'
}
file1="your 1st input file.csv"
file2="your 2nd input file.csv"
for i in "$file1" "$file2"; do
prefixWithKey "$i" > "$i.tmp"
sortByKeyInPlace "$i.tmp"
done
joinByKey "$file1.tmp" "$file2.tmp" | cutColumns > result.csv
rm "$file1.tmp" "$file2.tmp"
This script assumes that the input files have no headers and contain no tabs. We always take the email field from the 2nd file, no matter whether the email field of the 1st file was defined or not.
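If the email of the 1st file should be kept whenever it is present, as the question asks, the last step could be adjusted along these lines. This is a sketch under the same assumptions as the script above (same field numbering after the join); cutColumnsKeepExisting is a hypothetical name for a variant of cutColumns.

```shell
#!/bin/sh
# Hypothetical variant of cutColumns: prefer the 1st file's email ($10)
# and fall back to the 2nd file's email ($16) only when $10 is empty.
cutColumnsKeepExisting() {
  awk 'BEGIN { FS = "\t|,\t*"; OFS = "," }
       {
         email = ($10 != "") ? $10 : $16
         print $5, $6, $7, $8, $9, email, $11
       }'
}

# Usage: joinByKey "$file1.tmp" "$file2.tmp" | cutColumnsKeepExisting > result.csv
```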
I barely tested this script because you didn't provide any example input. If you encounter some errors and share a short input leading to that error I would be happy to fix the script (if it needs fixing).
In theory the script could be written without temporary files. I intentionally used temporary files because of the input size. Programs like sort may run faster on files.
This script could be sped up, for instance by
Executing both calls to prefixWithKey in parallel.
Adding LC_ALL=C in front of commands like sort.
Adding options to sort, for instance -S 70%.
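A rough sketch of how these speed-ups might look. It assumes GNU sort (for -S) and files that already carry the tab-separated key prefix; sortBothInParallel is a hypothetical name, and the buffer size of 35% per sort is an assumption chosen because two sorts run at the same time.

```shell
#!/bin/sh
TAB=$(printf '\t')

# LC_ALL=C gives byte-wise comparison; -S enlarges the sort buffer (GNU sort).
sortByKeyInPlace() {
  LC_ALL=C sort -t "$TAB" -k1,1 -S 35% -o "$1" "$1"
}

# join expects the order produced by sort, so it gets the same locale.
joinByKey() {
  LC_ALL=C join -t "$TAB" "$1" "$2"
}

# Sort both prefixed files concurrently and wait for both to finish.
sortBothInParallel() {
  sortByKeyInPlace "$1" &
  sortByKeyInPlace "$2" &
  wait
}
```

If sort runs under LC_ALL=C but join does not, join may complain that its input is not sorted, which is why both functions set the locale.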
Further Alternatives
For files that big it could be faster to load them into a database and process them there. There is even the tool q for doing things like this in a single command, but in my experience it is very slow.
I want to put sequential numbers at defined positions in a text file.
My idea is to mark the positions with a character pattern in the file, count up a variable, and insert its values into the file using sed. I tried this:
for number in 1 2 3 4 ; do
sed -ibak "s/var/$number" file.txt > file2.txt
done
(the arguments 1 2 3 ... are not the best solution, but I think, it should work)
With this code and tiny variations of it, I get different results, but no success.
I can cut/paste the pattern in the text, but it is always the last argument inserted (="3"). Why doesn't sed take the iterated variable? (which is counted up, I tested it with echo).
The first iteration replaces var by 1, the next iteration replaces exactly the same var by 2, etc. - because you operate on the same input every time, and the pattern isn't dynamic.
It's not clear what you want to achieve, so it's hard to provide a working solution.
It might be easier to reach for Perl:
perl -pe 's/picvar/"pic" . ++$i/e'
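The same idea can be sketched in awk, for readers who prefer to stay with shell tools. This assumes the placeholder is a non-empty literal string (not a regex) and that each occurrence should get the next number; numberOccurrences is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: replace each occurrence of the literal placeholder $1
# on stdin with a running counter (1, 2, 3, ...) that continues across lines.
numberOccurrences() {
  awk -v pat="$1" '{
    out = ""
    # index() finds the literal placeholder; consume the line piece by piece.
    while ((p = index($0, pat)) > 0) {
      out = out substr($0, 1, p - 1) (++n)
      $0 = substr($0, p + length(pat))
    }
    print out $0
  }'
}

# Usage (file names from the question): numberOccurrences var < file.txt > file2.txt
```

Unlike the sed loop, the input is read only once and the counter advances with every replacement.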
I have a question about a small piece of code using the awk command. I have not found an answer/solution anywhere.
I am trying to parse an output file and extract all data between the 1st expression (including) ATOMIC and 2nd expression (excluding) Bond. This data is to be sent to a new file $1_geom. So far I have the following:
`awk '/ATOMIC/{flag=1;next}/Bond lengths in Bohr/{flag=0}flag' $1` >> $1_geom
This script will extract the correct data for me, but there are 2 problems:
The line ATOMIC is not extracted with the data
The data is extracted and appended to a single line. I want the data to retain the formatting from the parsed file (5 columns, variable number of lines). Please see the attachment for a visual. Is there a different way to append data (other than >>) so that I can keep the formatting?
Any help is appreciated, thank you.
The next is causing the first match to be skipped; take it out if you don't want that.
The backticks by themselves are a shell syntax error (unless your Awk script happens to produce valid shell commands). I'm guessing you have a useless echo or something like that in your actual script which disarms the error, but instead produces the symptoms you describe.
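Putting both points together, the command might look like the sketch below. It is written as a POSIX sh function for illustration (the original is part of a csh script), and extractGeometry is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: append everything from the ATOMIC line up to, but not
# including, the "Bond lengths in Bohr" line of file $1 to the file ${1}_geom.
extractGeometry() {
  awk '/ATOMIC/ { flag = 1 }                 # no "next": the ATOMIC line is kept
       /Bond lengths in Bohr/ { flag = 0 }   # stop before this line
       flag' "$1" >> "${1}_geom"             # no backticks, no echo
}
```

Because awk writes straight to the file, the line breaks and columns of the input are preserved.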
This was part of a csh script and I did have an "echo" in front of this line. Removing the "echo" makes it work perfectly and addresses both of my questions.
OK, first post.
I have an assignment to decrypt cryptograms by hand, but I also wanted to automate at least a few parts of the process. I browsed around and found some sed and awk one-liners that do some of the things I wanted, but not everything I want or need.
There are some websites that sort of do what I want, but I really want to do it in bash, just because I want to understand it better :)
The script would take a filename as a parameter and output another file such as solution$1 when done.
if [ -e "$PWD/$1" ]; then
echo "$1 exists"
else
echo "$1 does not exist"
fi
That would be the start of the script, checking whether the file given as a parameter exists.
Then I found this one-liner:
sed -e "s/./\0\n/g" $1 | while read c;do echo -n "$c" ; done
This works fine, but I need the number of occurrences of each letter, and I really don't see how to do that.
Here is more or less what I'm trying to achieve, for counting unique letter occurrences and such: http://25yearsofprogramming.com/fun/ciphers.htm
I then need to put all letters in lowercase.
After this I see the script doing these things:
- A subscript that scans a dictionary file for words of a certain pattern and size; the bigger the words, the better.
For example, let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern shared by those two words, so that "apparel" gets listed from a dictionary file because "appa" and "zxxz" follow the same pattern and "zxxzgvk" has the same length as "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is it just far-fetched?
- Another subscript that takes the letters found in the previous output word and swaps those letters in the cryptogram.
The swapped letters will be in uppercase to differentiate them over time.
I'll then have to figure out how to proceed, maybe rescanning the newly found words to see whether they are found partly or fully in a dictionary file, then swapping more letters or not.
Has anyone seen this problem in the past and tried to solve it with word patterns as I described, or is this just too complex?
Should I log any of the swaps?
Maybe just scan through all the encrypted words and swap as I go along, then do another sweep with the constraint from the first sweep of not changing uppercase letters (actually, using them as more precise patterns!).
Has anyone written a similar script or program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight into how you thought out your code.
I will happily include the cryptograms I have decoded and the one I have yet to decode :)
Again, the focus of my assignment is not to write this script but just to solve the cryptograms. But writing scripts, or at least trying to see how I would write this one, helps me understand a little more how to think in terms of code. Feel free to point me in the right direction!
The cryptogram itself is based on simple alphabetic substitution.
I have put the code-to-be in a pastebin here :) http://pastebin.com/UEQDsbPk
In pseudocode, the way I see it is:
call program with an input filename in param and optionally a second filename(dictionary)
verify the input file exists and isn't empty
read the file's content and echo it on screen
transform to lowercase
scan through the text and count the amount of each letter to do a frequency analysis
ask the user what language the text is supposed to be in (English by default)
use the response to specify which letter frequencies to use as a baseline
swap letters corresponding to the frequency analysis in uppercase..
print the changed document on screen
ask the user to swap letters in the crypted text
if user had given a dictionary file as the second argument
then scan the cipher for words and find the bigger words
find words with a similar pattern (some repeating letters) in the dictionary file
list on screen the results if any
offer to swap the letters corresponding in the cipher
print modified cipher on screen
ask again to swap letters or find more similar words
That is more or less the way I see the script structured.
Do you see anything that I should add? Did I miss something?
I hope this revised version is more clear for everyone!
TL;DR, to be frank. To the only question I found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist you, if you don't find the answers to those smaller questions first.
If you can put it in pseudocode, it would be easier. There are all kinds of text-manipulation tools in Unix. Which ones to use depends on how big your texts are. I believe they are not that big, or you would have used a compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the actual strings for each case and pass them to it (that holds true for Caesar-like ciphers).
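A small sketch of how tr might cover two steps from the question, lowercasing the ciphertext and swapping guessed letters in as uppercase. The helper names toLower and swapLetters are hypothetical, and the letter guesses in the usage line are only an example.

```shell
#!/bin/sh
# Hypothetical helper: lowercase everything read from stdin.
toLower() {
  tr 'A-Z' 'a-z'
}

# Hypothetical helper: replace the cipher letters in $1 with the guessed plain
# letters in $2 (same length, same order), written in uppercase so that
# already-solved letters stand out.
swapLetters() {
  tr "$1" "$(printf '%s' "$2" | tr 'a-z' 'A-Z')"
}

# Usage: guess that z stands for a and x stands for p
# printf 'ZXXZGVK\n' | toLower | swapLetters zx ap
```

Since the ciphertext is lowercase and solved letters are uppercase, a later swapLetters call with lowercase cipher letters leaves earlier substitutions untouched.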
grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
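For the word-pattern question ("apparel" versus "zxxzgvk"), one possible approach is to avoid regexes altogether: reduce every word to the order in which its distinct letters first appear, and compare those signatures. A minimal sketch, assuming a dictionary file with one word per line; findCandidates is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: list dictionary words ($2, one per line) whose letter
# pattern matches that of the cipher word $1.
findCandidates() {
  awk -v cipher="$1" '
    # Signature of a word: each letter replaced by the rank of its first
    # appearance, e.g. both "apparel" and "zxxzgvk" give .1.2.2.1.3.4.5
    function pattern(w,   i, c, n, seen, out) {
      w = tolower(w); n = 0; out = ""
      for (i = 1; i <= length(w); i++) {
        c = substr(w, i, 1)
        if (!(c in seen)) seen[c] = ++n
        out = out "." seen[c]
      }
      return out
    }
    BEGIN { want = pattern(cipher) }
    pattern($0) == want   # equal signatures imply equal length as well
  ' "$2"
}

# Usage (assumed dictionary path): findCandidates zxxzgvk /usr/share/dict/words
```

Matching signatures only show that a word is a possible solution under simple substitution; letters already solved elsewhere in the cryptogram would still have to be checked separately.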