Sorting a text file & removing duplicates

I have a large text file with 4-digit codes and some information about them in every row. It looks something like this:
3456 information
1234 info
2222 Some other info
I need to sort this file, so the codes are in ascending order in the file. Also, some codes appear more than once, so I need to remove duplicates. Can I do this with perl, awk or some other scripting language?
Thanks in advance,
-skazhy

sort happybirthday.txt | uniq
From IBM. It's the first Google result for "unix remove duplicate lines".
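If the duplicates are whole identical lines (as in your sample), sort alone can also deduplicate in one step:
sort -u happybirthday.txt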

You can create a hash, then read the file in line by line and, for each line:
split at the first space
check whether val(0), the number you just split off, is already in the hash
if not, insert val(1), the rest of the line, into the hash with key val(0)
continue
Then print the (sorted) hash to the file.
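A minimal sketch of the same idea in awk, whose associative arrays play the role of the hash (assuming the input is in a file called codes.txt and plain numeric sorting of the code is what you want):
awk '!seen[$1]++' codes.txt | sort -n
The awk part keeps only the first line seen for each code; sort -n then puts the surviving lines in ascending numeric order of the code.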

Related

Mining a text file for specific keywords

I have a 15 MB .txt file. I want to specify four to five keywords, see how many times each appears in the file, and then plot the counts in a histogram-like manner. Please suggest a way to do this.
In bash, if you are looking for occurrences of wordOne, wordTwo, or wordThree, you can do something like this:
cat myFile | egrep -o "wordOne|wordTwo|wordThree" | sort | uniq -c
This means: read myFile, look for occurrences of wordOne, wordTwo, and wordThree and output only those occurrences, sort the lines of the output, then remove duplicates while counting the number of occurrences.
Note here that this will match wordOneOther as well because we are not ensuring word boundaries. That can be done in several ways but depends on what you need and what your file looks like.
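For example, assuming GNU grep, word boundaries can be enforced with \b:
egrep -o '\b(wordOne|wordTwo|wordThree)\b' myFile | sort | uniq -c
This no longer counts wordOneOther, because each keyword now has to start and end at a word boundary.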
Plotting is a whole other question. You could either use something like gnuplot or you can paste the output to excel or matlab or something...
Install nltk; it is very useful for text analysis. Once you have installed it, the following code will help you get the count of a specific word:
import nltk
# word_tokenize needs the 'punkt' tokenizer data; run nltk.download('punkt') once if it is missing.
with open('C:\\demo.txt', 'r') as f:
    # The following line will create a list of words from the text.
    words = nltk.word_tokenize(f.read())
    # print(words)
# The following line will create a word-frequency distribution, so you can look up any word to get its frequency.
fdist = nltk.FreqDist(words)
print(fdist['This'])
Once you have this data ready and stored in a format of your choice, you can use matplotlib or seaborn to create a simple histogram.

Using AWK to preserve lines based on a single line field being repeated/duplicate in a CSV file

Would someone help me write a script in Bash to keep only the unique lines, based solely on identifying duplicate values in a single field (the first field)?
If I have data like this:
123456,23423,Smith,John,Jacob,Main St.,,Houston,78003
654321,54524,Smith,Jenny,,Main St.,,Houston,78003
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
123456,324324,Bryant,Kobe,,Special St.,,New York,2311
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
654321,43234,Smith,Jimbo,,Main St.,,Houston,78003
And I'd like to keep only the lines which do not have matching first fields
(the result would be a file with the contents below, based on the above sample):
332423,9023432,Gonzales,Michael,,Everyman,,Dallas,73423
234324,232411,Willis,Bruce,,Sunset Blvd,,Hollywood,90210
438329,34233,Moore,Mike,,Whatever,,Detroit,92343
What would the bash/awk approach be? Thanks in advance.
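One possible approach (a sketch, not a tested script; file.csv is a placeholder name) is a two-pass awk: the first pass counts how often each first field occurs, and the second pass prints only the lines whose first field occurs exactly once:
awk -F',' 'NR==FNR { count[$1]++; next } count[$1] == 1' file.csv file.csv
The same file is given twice: during the first read NR==FNR is true and awk only builds the counts; during the second read it prints the lines whose first field was seen exactly once.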

Replace specific commas in a csv file

I have a file like this:
gene_id,transcript_id(s),length,effective_length,expected_count,TPM,FPKM,id
ENSG00000000003.14,ENST00000373020.8,ENST00000494424.1,ENST00000496771.5,ENST00000612152.4,ENST00000614008.4,2.23231E3,2.05961E3,2493,2.112E1,1.788E1,00065a62-5e18-4223-a884-12fca053a109
ENSG00000001084.10,ENST00000229416.10,ENST00000504353.1,ENST00000504525.1,ENST00000505197.1,ENST00000505294.5,ENST00000509541.5,ENST00000510837.5,ENST00000513939.5,ENST00000514004.5,ENST00000514373.2,ENST00000514933.1,ENST00000515580.1,ENST00000616923.4,3.09456E3,2.92186E3,3111,1.858E1,1.573E1,00065a62-5e18-4223-a884-12fca053a109
The problem is that the file should have been tab-delimited rather than comma-delimited, because the values starting with ENST (i.e. transcript_id(s)) are all grouped in one column.
The number of ENST IDs is different in each line.
Each ENST ID has the same pattern: it starts with ENST, followed by 11 digits, followed by a period and then 1-3 digits: ^ENST[0-9]{11}[.][0-9]{1,3}.
I want to convert all the commas between ENST IDs to a : or any other character so I can read this as a csv file. Any help would be much appreciated. Thanks!
I imagine something as simple as
sed 's|,ENST|:ENST|g;s|:|,|' < /path/to/your/file
should work. No reason to over-complicate.
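If you want the substitution to be stricter, a variant that uses the full ID pattern from the question (a sketch, assuming a sed with -E support such as GNU or BSD sed) would be:
sed -E 's|,(ENST[0-9]{11}\.[0-9]{1,3})|:\1|g; s|:|,|' < /path/to/your/file
As with the simpler command, the second substitution turns the first colon (the one after the gene ID) back into a comma.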

How to extract lines containing unique text in a column

I have a text file similar to
"3"|"0001"
"1"|"0003"
"1"|"0001"
"2"|"0001"
"1"|"0002"
i.e. a pipe-delimited text file containing quoted strings.
What I need to do is:
First, extract the first line which contains each value in the first column, producing
"3"|"0001"
"1"|"0003"
"2"|"0001"
Then, sort by the values in the first column, producing
"1"|"0003"
"2"|"0001"
"3"|"0001"
Performing the sort is easy - sort -k 1,1 -t \| - but I'm stuck on extracting the first line in the file which contains each value in the first column. I thought of using uniq, but it doesn't do what I want, and its "column-handling" abilities are limited to ignoring the first 'x' columns of space-or-tab delimited text.
Using the Posix shell (/usr/bin/sh) under HP-UX.
I'm kind of drawing a blank here. Any suggestions welcomed.
you can do:
awk -F'|' '!a[$1]++' file|sort...
The awk part removes the duplicated lines, leaving only the first occurrence of each.
I don't have an HP-UX box, so I cannot do a real test, but I think it should work.
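Putting that together with the sort from the question (file is a placeholder for the input file name):
awk -F'|' '!a[$1]++' file | sort -t \| -k 1,1
The awk keeps the first line seen for each value of the first column, and sort then orders the remaining lines by that column.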

Sorting lines with vim by lines chunk

Can I sort lines in vim depending on part of a line and not the complete line?
e.g.
My Name is Deus Deceit
I would like to sort based on the column where the name starts plus 6 columns; for example, sort by columns 19-25 and vim will only check those characters for sorting.
If it can be done without a plugin, that would be great. Thank you.
Check out :help :sort. The command takes an optional {pattern} whose matched text is skipped (i.e. sorting happens on the text after the match).
For example, to sort by column 19+ (see :help /\%c and the related regexp atoms):
:sort /.*\%19c/
