Another question about Hadoop: is it possible to reduce a list to a map? I mean, I have a list like this after the map():
KEY:   VALUE:
aaa    word
       string
       word
       text
       string
       word
Is it possible to reduce the list to the following structure?
KEY:   VALUE:
aaa    word, 3
       string, 2
       text, 1
thanks
manuel
What I would do is the following: since you are trying to implement the typical word count, but on a list that is associated with a key, I would extend that word count example by having the mappers output (key, value) pairs such as:
aaa-word,1
aaa-string,1
aaa-word,1
aaa-text,1
aaa-string,1
aaa-word,1
I.e. I would add the aaa information to all the output pairs. Then the reducer behaves as usual: it receives the lists of values whose keys are the same, splits the common key into aaa and the word, and returns the length of the list, concatenated to the word.
(aaa-word,1),(aaa-word,1),(aaa-word,1)-->(aaa,word-3)
(aaa-string,1),(aaa-string,1)-->(aaa,string-2)
(aaa-text,1)-->(aaa,text-1)
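To make this concrete, here is a rough Java sketch of such a mapper and reducer, assuming the org.apache.hadoop.mapreduce API and an input format that hands the mapper the key (aaa) and one word per call as Text pairs; the class names are mine:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CompositeKeyWordCount {

    // Emits composite keys such as "aaa-word", each with a count of 1.
    public static class CompositeMapper extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text compositeKey = new Text();

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            compositeKey.set(key.toString() + "-" + value.toString());
            context.write(compositeKey, ONE);
        }
    }

    // Receives all the 1s of one composite key, splits the key back into
    // aaa and the word, and emits (aaa, word-count).
    public static class SplitReducer extends Reducer<Text, IntWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable one : values) {
                count += one.get();
            }
            String[] parts = key.toString().split("-", 2);
            context.write(new Text(parts[0]), new Text(parts[1] + "-" + count));
        }
    }
}

If your real keys or words can themselves contain "-", pick a separator that cannot occur in the data.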
I am trying to build a string of values to be inserted into an SQL IN list. For example -
SELECT * FROM TABLE WHERE field IN ('AAA', 'BBB', 'CCC', 'DDD')
The list that I want needs to be constructed from values within a single column of my dataset, but I'm struggling to find a way to concatenate those values.
My first thought was to use CASESTOVARS to put each of the values into columns prior to concat. This is simple but the number of cases is variable.
Is there a way to concat all fields without specifying each one?
Or is there a better way to go about this?
Unfortunately Python is not an option for me in this instance.
A simple sample dataset would be -
CasestoConcat
AAA
BBB
CCC
DDD
You can use the lag function for this.
First creating a bit of sample data to demonstrate on:
data list free/grp (F1) txt (a5).
begin data
1 "aaa" 1 "bb" 1 "cccc" 2 "d" 2 "ee" 2 "fff" 2 "ggggg" 3 "hh" 3 "iii"
end data.
Now the following code makes sure that rows that belong together are consecutive. You can also sort by any other relevant variable to keep the combined text in a specific order.
sort cases by grp.
string merged (A1000).
compute merged=txt.
if $casenum>1 and grp=lag(grp) merged=concat(rtrim(lag(merged)), " ", rtrim(merged)).
exe.
At this point if you want to just keep the line that has all the concatenated texts, you can use this:
add files /file=* /by grp /last=lst.
select if lst=1.
exe.
I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON.
More to the point, your input should probably be processed record by record, and my guess is that a two-column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like mapping objects in JSON format, something like this should do, if you don't mind using Python (which comes with JSON support):
import json

def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input as string s and parse it as JSON into dictionary d. Then we return a formatted string with the double-quoted id and hwVersion keys, each followed by a colon and the double-quoted value of the corresponding key from the previously obtained dict.
We can try this with these test input strings and print the results:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
If you really wanted to use awk, you could, but it's not the most robust or suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known, and assuming it can be in any order, we use one regex to extract id and another to get hwVersion, then print them in the given format. If the values could be something other than the decimal digits in your example, the [0-9]+ bit would need to reflect that.
And for the fun of it (this one preserves the order of the entries in the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".
I have a text file with two columns. The values in the first column ("key") are all different; the values in the second column - these strings have a length between 10 and approximately 200 - have some duplicates. The number of duplicates varies: some strings - especially the longer ones - don't have any duplicate, while others might have 20 duplicate occurrences.
key1 valueX
key2 valueY
key3 valueX
key4 valueZ
I would like to represent this data as a hash. Because of the large number of keys and the existence of duplicate values, I am wondering whether some method of sharing common strings would be helpful.
The data in the file is more or less constant, i.e. I can put effort (in time or space) into preprocessing it in a suitable way, as long as it is accessed efficiently once it has entered my application.
I will now outline an algorithm which I believe would solve the problem. My question is whether the algorithm is sound and whether it could be improved. Also, I would like to know whether using freeze on the strings would provide an additional optimization.
In a separate preprocessing step, I find out which string values are indeed duplicates, and I annotate the data accordingly (i.e. create a third column in the file), so that every occurrence of a repeated string except the first carries a pointer to the first occurrence:
key1 valueX
key2 valueY
key3 valueX key1
key4 valueZ
When my application reads the data into memory (line by line), I use this annotation to create a pointer to the original string instead of allocating a new one:
if columns.size == 3
  # Subsequent occurrence: point to the string already stored under the
  # referenced key instead of allocating a new one.
  myHash[columns[0]] = myHash[columns[2]]
else
  # First occurrence of the string.
  myHash[columns[0]] = columns[1]
end
Will this achieve my goal? Can it be done any better?
One way you could do this is using symbols.
["a", "b", "c", "a", "d", "c"].each do |c|
puts c.intern.object_id
end
417768 #a
313128 #b
312328 #c
417768 #a
433128 #d
312328 #c
Note how "a" and "c" each got the same object_id both times.
You can turn a string into a symbol with the intern method. If you intern an equal string you should get the same symbol out, like a flyweight pattern.
If you save the symbol in your hash, you'll store each string only a single time. When it's time to use the symbol, just call .to_s on it and you'll get the string back. (I'm not sure how to_s works internally; it may do creation work on each call.) Another idea would be to cache the strings yourself, i.e. have an integer-to-string cache hash and just put the integer keys in your data structures. When you need the string, you can look it up.
Given a string value of arbitrary length, you're supposed to determine the frequency of words which are anagrams of each other.
public static Map<String, Integer> generateAnagramFrequency(String str)
{ ... }
For example: if the string is "find art in a rat for cart and dna trac"
your output should be a map:
find -> 1
art -> 2
in -> 1
a -> 1
cart -> 2
and -> 2
The keys should be the first occurrence of the word, and the number is the number of anagrams of that word, including itself.
The solution I came up with so far is to sort all the words and compare each character from both strings till the end of either string, which would be O(n log n) for the sorting. I am looking for some other efficient method which doesn't change the two strings being compared. Thanks.
I've written a JavaScript implementation of n-gram creation (word analysis) at Extract keyphrases from text (1-4 word ngrams).
This function can easily be altered to analyse the frequency of anagrams: replace s = text[i]; with s = text[i].sort(), so that the order of characters doesn't matter any more.
Create a "signature" for each word by sorting its letters alphabetically. Sort the words by their signatures. Run through the sorted list in order; if the signature is the same as the previous signature, you have an anagram.
I have a large set of words (about 10,000) and I need to find if any of those words appear in a given block of text.
Is there a faster algorithm than doing a simple text search for each of the words in the block of text?
Input the 10,000 words into a hashtable, then check each of the words in the block of text to see if it has an entry in the hashtable.
Whether it's faster I don't know; it's just another method (it would depend on how many words you are searching for).
A simple Perl example:
my $word_block = "the guy went afk after being popped by a brownrabbit";
my %hash = ();
my @words = split /\s/, $word_block;
while (<DATA>) { chomp; $hash{$_} = 1; }
foreach my $word (@words)
{
    print "found word: $word\n" if exists $hash{$word};
}
__DATA__
afk
lol
brownrabbit
popped
garbage
trash
sitdown
Try out the Aho-Corasick algorithm:
http://en.wikipedia.org/wiki/Aho-Corasick_algorithm
Build up a trie of your words, and then use that to find which words are in the text.
The answer heavily depends on the actual requirements.
How large is the word list?
How large is the text block?
How many text blocks must be processed?
How often must each text block be processed?
Do the text blocks or the word list change? If so, how frequently?
Assuming the text blocks are relatively small compared to the word list, and that each text block is processed only once, I suggest putting the words from the word list into a hash table. Then you can perform a hash lookup for each word in the text block and find out whether the word list contains it.
If you have to process the text blocks multiple times, I suggest inverting the text blocks: for each word, create a list of all the text blocks that contain it (a sketch follows below).
In still other situations it might be helpful to generate a bit vector for each text block, with one bit per word indicating whether the word is contained in that text block.
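A rough Java sketch of the inverted-index idea (all names here are mine): each text block is tokenized once, and afterwards finding the blocks that contain a word is a single map lookup.

import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class InvertedIndex {
    // word -> ids of the text blocks that contain it
    private final Map<String, Set<Integer>> index = new HashMap<String, Set<Integer>>();

    // Tokenize a block once and record its id under every word it contains.
    void addBlock(int blockId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            Set<Integer> blocks = index.get(word);
            if (blocks == null) {
                blocks = new HashSet<Integer>();
                index.put(word, blocks);
            }
            blocks.add(blockId);
        }
    }

    // All blocks containing the given word; built once, queried many times.
    Set<Integer> blocksContaining(String word) {
        Set<Integer> blocks = index.get(word.toLowerCase());
        return blocks == null ? Collections.<Integer>emptySet() : blocks;
    }
}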
You can build a graph used as a state machine: when you process the i-th character of your input word, Ci, you try to go to the i-th level of your graph by checking whether the previous node, linked to Ci-1, has a child node linked to Ci.
For example, if you have the following words in your corpus
("art", "are", "be", "bee")
you will have the following nodes in your graph
n11 = 'a'
n21 = 'r'
n11.sons = (n21)
n31 = 'e'
n32= 't'
n21.sons = (n31, n32)
n41 = 'art' (here we have a leaf in our graph, and the word built from all the upper nodes is associated with this node)
n31.sons = (n41)
n42 = 'are' (here again we have a word)
n32.sons = (n42)
n12 = 'b'
n22 = 'e'
n12.sons = (n22)
n33 = 'e'
n34 = 'be' (word)
n22.sons = (n33,n34)
n43 = 'bee' (word)
n33.sons = (n43)
During processing, if you reach a leaf while handling the last character of your input word, and only in this case, your input word is in the corpus.
This method is more complicated to implement than a single Dictionary or Hashtable, but it is much better optimized in terms of memory use; see the sketch below.
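A minimal Java sketch of that graph (class and member names are mine; each node's sons map plays the role of the links described above):

import java.util.HashMap;
import java.util.Map;

class TrieNode {
    final Map<Character, TrieNode> sons = new HashMap<Character, TrieNode>();
    boolean isWord; // true when the path from the root spells a corpus word
}

class Trie {
    private final TrieNode root = new TrieNode();

    void add(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            TrieNode child = node.sons.get(c);
            if (child == null) {
                child = new TrieNode();
                node.sons.put(c, child);
            }
            node = child;
        }
        node.isWord = true; // mark the node that completes a word
    }

    boolean contains(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.sons.get(c);
            if (node == null) return false; // no transition for this character
        }
        return node.isWord; // must end exactly on a word node
    }
}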
The Boyer-Moore string search algorithm should work. Depending on the size/number of words in the block of text, you might want to use it the other way around, taking the block's words as the keys and searching the word list (are there more words in the list than in the block?). Also, you probably want to remove any duplicates from both lists.