Transform data with a repeating attribute in each row to ARFF format

I have a dataset as a text file and the data format is as follows:
ID: 1
Name: a
ID: 2
Name: b
ID: 3
Name: c
I want to convert this data to ARFF format, as follows:
ID Name
1 a
2 b
3 c
Which tools should I use? It is a large dataset of 1 GB with many rows. I got this dataset from snap.stanford.edu to practice large-data handling.

How about using the programming language of your choice?
The input format is text, and the output format (ARFF) is also effectively text.
Why not write a program to convert between them?

You can get the desired result with simple command-line tools. If you have the data in one file called x.txt, use:
grep ID: x.txt | sed 's/^[^ ]\+ //' > a.txt
grep Name: x.txt | sed 's/^[^ ]\+ //' > b.txt
to get the data in two different files named a.txt and b.txt.
The files will have:
$ cat a.txt
1
2
3
$ cat b.txt
a
b
c
Then join the files with the paste command:
$ paste a.txt b.txt
1 a
2 b
3 c
This solution is very efficient, even if the files are quite large, as you said.
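If you want valid ARFF (with its @relation/@attribute header) in a single pass, here is a minimal awk sketch; it assumes the input is strictly alternating ID:/Name: pairs, and the relation name and attribute types are placeholders to adjust:
awk 'BEGIN {
    print "@relation snap"           # relation name is an assumption
    print "@attribute ID numeric"
    print "@attribute Name string"
    print "@data"
}
/^ID:/   { id = $2 }
/^Name:/ { print id "," $2 }' x.txt > data.arff
This streams through the 1 GB file once and writes the ARFF file directly, with no intermediate files.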

Related

Is it possible to work with 'for loop grep' commands?

I have lots of files, one directory per year, and each file contains long sentences, for example:
home/2001/2001ab.txt
the AAAS kill every one not but me and you and etc
the A1CF maybe color of full fill zombie
home/2002/2002ab.txt
we maybe know some how what
home/2003/2003ab.txt
Mr, Miss boston, whatever
aaas will will will long long
and in the home directory I have home/reference.txt (a list of words):
A1BG
A1CF
A2M
AAAS
I'd like to count how many of the words in reference.txt appear in every single year file.
This is my code, which I run in every year directory
(home/2001/, home/2002/, home/2003/):
# awk: write the lines matching a word to a file named after it
function search () {
    awk -v pattern="$1" '$0 ~ pattern {print}' *.txt > "$1"
}
# loop over the word list in reference.txt
for i in $(cat reference.txt)
do
    search "$i"
done
# word count
wc -l * > line-count.txt
This is my result:
home/2001/A1BG
$ cat A1BG
0
home/2001/A1CF
$ cat A1CF
1
home/2001/A2M
$ cat A2M
0
home/2001/AAAS
$ cat AAAS
1
home/2001/line-count.txt
$ cat line-count.txt
2001ab.txt 2
A1BG 0
A1CF 1
A2M 0
AAAS 1
The resulting line-count.txt file has all the information I want,
but I have to repeat this work manually:
cd into a directory, run my code, then cd into the next directory.
I have around 500 directories and files, so it is not easy.
The second problem is the wasteful bunch of files:
the script creates lots of files and takes too much time.
Because of this, at first I'd like to use a grep command,
but I don't know how to use a list of words from a file instead of a single word;
that is why I use awk.
How can I do this more simply?
at first I'd like to use a grep command, but I don't know how to use a
list of words from a file instead of a single word
You might use the --file=FILE option for that purpose; the selected file should hold one pattern per line.
How can I do this more simply?
You might use the --count option to avoid the need for wc -l. Consider the following simple example. Let file.txt content be
123
456
789
and file1.txt content be
abc123
def456
and file2.txt content be
ghi789
xyz000
and file3.txt content be
xyz000
xyz000
then
grep --count --file=file.txt file1.txt file2.txt file3.txt
gives the output
file1.txt:2
file2.txt:1
file3.txt:0
Observe that no files are created and that a file without matches still appears in the output. Disclaimer: this solution assumes file.txt does not contain characters with special meaning for GNU grep; if that does not hold, do not use this solution (or add the --fixed-strings option so the patterns are treated literally).
(tested in GNU grep 3.4)
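To run this over all 500 year directories without cd-ing into each one, a minimal sketch, assuming the home/YEAR/*.txt layout shown above:
for d in home/*/
do
    # one pass per year directory; the output name deliberately avoids
    # the *.txt glob so re-runs do not count the result file itself
    grep --count --file=home/reference.txt "$d"*.txt > "$d"line-count.out
done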

Delete header/column from .txt file with bash

I'm automating a workflow with a bash script on Mac OS X. In this workflow, I'd like to add a command that deletes a header from my tab-delimited table (.txt) file. It looks as follows:
header1 header2 header3
a 1
b 2
c 3
d 4
e 5
f 6
As you can see, the third column, named header3, is empty.
I've noted this post or this one but I don't understand the arguments.
Could you suggest a line of code that automatically deletes the third column, or (even better) deletes the column whose header is 'header3'?
awk is designed to work with whitespace-separated text columns:
awk '{print $1 "\t" $2}' input.txt > output.txt
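Since the file is tab-delimited, cut is an equally simple alternative (cut splits on tabs by default):
cut -f1,2 input.txt > output.txt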
I found the answer here in Table 2C.
sed 's/header3//g' input.txt > output.txt
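Note that the sed line only blanks the header text. If you would rather drop the whole column by its header name, here is a minimal awk sketch, assuming tab delimiters and unique header names; the name to drop is passed in as a variable:
awk -F'\t' -v OFS='\t' -v name=header3 '
NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) drop = i }
{
    out = ""
    for (i = 1; i <= NF; i++)
        if (i != drop) out = (out == "" ? $i : out OFS $i)
    print out
}' input.txt > output.txt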

Append data to the end of a specific line in text file

I admit to being a novice at bash scripting, but I can't quite figure out how to accomplish a key step in a script, and I couldn't find what I was looking for in other threads.
I am trying to extract some specific data (numerical values) from multiple .xml files and add those to a space or tab delimited text file. The files will be generated over time so I need a way to append a new dataset to the pre-existing text file.
For instance, I would like to extract values for 3 different categories, 1 per row or column, and the value for each category from multiple xml files. Basically, I want to build a continuous graph of the data from each of 3 categories over time.
I have the following code which will successfully extract the 3 numbers from the xml file and trim the unnecessary text:
#!/bin/sh
grep "<observation name=\"meanGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"meanBrightGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanBrightGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"std\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"std\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
This gives the output:
1.12
0.33
134.1
I would like to then read in another xml file to get:
1.12 1.45
0.33 0.54
134.1 144.1
I would be grateful for any help with doing this! Thanks in advance.
Erik
It's much safer to use proper XML handling tools. For example, in xsh, you can write something like
$f1 := open /Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml ;
$f2 := open /path/to/the/second/file.xml ;
echo ($f1 | $f2)//observation[@name="meanGhost"] ;
echo ($f1 | $f2)//observation[@name="meanBrightGhost"] ;
echo ($f1 | $f2)//observation[@name="std"] ;
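If xsh is not available, here is a rough sketch of the same idea with xmllint (shipped with libxml2 on Mac OS X); the path of the second file is a placeholder, and paste appends the new values as an extra column:
#!/bin/sh
xml="/path/to/the/second/file.xml"   # placeholder for the next input file
out="$HOME/Desktop/testxml.txt"
col=$(mktemp)
for name in meanGhost meanBrightGhost std
do
    # string(...) extracts the text content of the matching element
    printf '%s\n' "$(xmllint --xpath "string(//observation[@name=\"$name\"])" "$xml")"
done > "$col"
# append the new values as an extra column next to the existing ones
paste "$out" "$col" > "$out.tmp" && mv "$out.tmp" "$out"
rm -f "$col"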

unix delete rows from multiple files using input from another file

I have multiple (1086) files (.dat) and in each file I have 5 columns and 6384 lines.
I have a single file named "info.txt" which contains 2 columns and 6883 lines. The first column gives the line numbers (to delete in the .dat files) and the second column gives a number.
1 600
2 100
3 210
4 1200
etc...
I need to read in info.txt and find every line number whose value in the second column is less than 300 (so 2 and 3 in the above example). Then I need to feed these line numbers into sed, awk, or grep and delete those lines from each .dat file. (So I will delete the 2nd and 3rd row of the .dat files in the above example.)
A more general form of the question would be (I suppose):
how to read numbers from a file and use them as the numbers of the rows to be deleted from multiple files.
I am using bash but ksh help is also fine.
sed -i "$(awk '$2 < 300 { print $1 "d" }' info.txt)" *.dat
The awk script creates a simple sed script to delete the selected lines; that sed script is then run on all the *.dat files.
(If your sed lacks the -i option, you will need to write to a temporary file in a loop. On OSX and some *BSD you need -i "" with an empty argument.)
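With the sample info.txt above, the embedded awk command generates the two-line sed script that does the deleting:
$ awk '$2 < 300 { print $1 "d" }' info.txt
2d
3d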
This might work for you (GNU sed):
sed -rn 's/^(\S+)\s*([1-9]|[1-9][0-9]|[12][0-9][0-9])$/\1d/p' info.txt |
sed -i -f - *.dat
This builds a script of the lines to delete from the info.txt file and then applies it to the .dat files.
N.B. the regexp matches numbers ranging from 1 to 299, as the OP requested.
# create action list (read info.txt via redirection, not a pipe,
# so ActionReq is not lost in a subshell)
while read LineRef Index
do
    if [ "${Index}" -lt 300 ]
    then
        ActionReq="${ActionReq};${LineRef} b
"
    fi
done < info.txt
# apply action on files
for EachFile in *.dat
do
    sed -i -n -e "${ActionReq}
p" "${EachFile}"
done
(Not tested; no Linux here.) sed is limited for your requirement of testing the second value against 300; awk is more efficient for that operation.
I use sed in the second loop to avoid reading and rewriting each file once per line to delete. The second loop could probably be avoided by giving sed the whole list of files at once instead of one file at a time.
This should create new files named after the old ones with a _new.dat suffix, but I haven't tested it:
awk 'FNR==NR { if ($2<300) a[$1]; next }
     !(FNR in a) { print > (FILENAME "_new.dat") }' info.txt *.dat

how to join files in unix without sorting

I am trying to join 2 csv files based on a key in unix.
My files are really huge, 5 GB each, and sorting them is taking too long.
I want to repeat this procedure for 50 such joins.
Can someone tell me how to join quickly, without sorting?
Unfortunately there is no way around the sorting. But please take a look at some utility scripts I have written here: https://github.com/stefan-schroedl/tabulator. You can use them if you keep the header of column names as the first line in each file. There is a script 'tbljoin' that will take care of the sorting and column counting for you. For example, say you have
Employee.csv:
employee_id|employee_name|department_id
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Dept.csv:
department_id|department_name
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
Then the command tbljoin Employee.csv Dept.csv produces
employee_id|employee_name|department_id|department_name
20|Peter|2|Manufacturing
21|David|3|Engineering
1|Monica|4|Marketing
12|Louis|5|Sales
13|Barbara|6|Information technology
tabulator contains many other useful features, e.g., for simple rearranging of columns.
Here is an example with two files whose data is delimited by a pipe.
Employee.csv holds employee_id, employee_name, and department_id, delimited by pipes.
Employee.csv
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Department file with department_id and its name, delimited by a pipe.
Dept.csv
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
command (Employee_sort.csv is Employee.csv sorted on its third field, since join needs sorted input):
join -t '|' -1 3 -2 1 Employee_sort.csv Dept.csv
-t '|' indicates the files are delimited by a pipe
-1 3 selects the third column of file 1, i.e. department_id from Employee_sort.csv
-2 1 selects the first column of file 2, i.e. department_id from Dept.csv
Using the above command, we get the following output:
2|20|Peter|Manufacturing
3|21|David|Engineering
4|1|Monica|Marketing
5|12|Louis|Sales
6|13|Barbara|Information technology
If you want everything from file 2 plus the corresponding entries in file 1,
you can also use the -a and -v options.
Try the following commands:
join -t '|' -1 3 -2 1 -v2 Employee_sort.csv Dept.csv
join -t '|' -1 3 -2 1 -a2 Employee_sort.csv Dept.csv
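With the sample data above, the -v2 variant prints just the departments that have no matching employee:
1|HR
7|Security
while -a2 prints those two lines in addition to the five joined lines.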
I think that you could avoid using join (and thus sorting your files), but this is not a quick solution.
In both files, replace all pipes with spaces and then squeeze double spaces to single ones:
sed -i 's/|/ /g; s/  / /g' Employee.csv Dept.csv
run these code lines as a bash script:
cat Employee.csv | while read a b c
do
    cat Dept.csv | while read d e
    do
        if [ "$c" -eq "$d" ] ; then
            echo -e "$a\t$b\t$c\t$e"
        fi
    done
done
Note that this nested loop takes a long time: it rescans Dept.csv once for every line of Employee.csv.
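A faster way to avoid the sort entirely is a hash join in awk: load the smaller file into an array and stream the large one past it. A minimal sketch, assuming Dept.csv is the smaller side and fits in memory:
awk -F'|' -v OFS='|' '
    NR == FNR  { dept[$1] = $2; next }    # first file: remember names by id
    $3 in dept { print $0, dept[$3] }     # second file: append matching name
' Dept.csv Employee.csv
Each file is read exactly once, so this scales to the 5 GB case as long as one side of each join fits in memory.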
