LC-3: Analyzing a given piece of text

How do I store a given piece of text in memory? For example:
“I had a dream last night that a man had me over for dinner. I said that I did not like what we had, but he did not mind.”
I want to store each word in a separate memory location.
Eventually I want to output the number of characters in each word.

In LC3 (and in general) you will want to store each letter at a separate memory location.
Your LC3 program needs to create a label at a location in memory with enough space to store all of these characters. Call this your FIRST_CHARACTER label.
If you need to find each "word" individually after you read in an arbitrary sentence, you may also need to store each word's offset (its distance in characters from the start of FIRST_CHARACTER). You would store one offset value in memory for each word you read, starting at some other label in memory called OFFSET_COUNTS. So the first word would always start at location FIRST_CHARACTER; the second word would start at location FIRST_CHARACTER + (the value found at OFFSET_COUNTS + 1), and so on.
There are other approaches, but if this is what you need to do, you will need to have some way of finding each word in memory after the fact.
Another approach could be to just search through the entire string stored at FIRST_CHARACTER, counting SPACES until you get to the start of the word you are looking for.
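For concreteness, here is a minimal sketch of that layout in Python rather than LC-3 assembly (the variable names `first_character` and `offset_counts` just mirror the labels above): one character per slot, a table recording where each word starts, and each word's length derived from consecutive offsets.

```python
# One character per "memory location", plus a table of word-start offsets.
text = "I had a dream last night"
first_character = list(text)        # stands in for memory at FIRST_CHARACTER
offset_counts = [0]                 # the first word always starts at offset 0

for i, ch in enumerate(first_character):
    if ch == " ":
        offset_counts.append(i + 1) # the next word starts right after a space

# Word k runs from offset_counts[k] up to the character before the next word
# starts; its length is the difference between starts, minus the space.
next_starts = offset_counts[1:] + [len(first_character) + 1]
for start, nxt in zip(offset_counts, next_starts):
    print(text[start:nxt - 1], nxt - 1 - start)
```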
Have a look at the lc3tutor.org "get a line" code sample for a more concise example of how to store a string in memory, if you need that.
Good luck!
Jeff

Related

`FastText.wv.save_word2vec_format()` creates some entries with two words on one line

FastText.wv.save_word2vec_format() creates some entries with two words on one line. This is a problem because it breaks the KeyedVectors.load_word2vec_format() function, which expects one word followed by x floats, where x is the number of dimensions in the vector. I haven't been able to prove that this is a bug, so has anyone had this problem?
My solution has been to prune the resulting file by removing lines with two space-separated words. In the large data sets that I used, there were between three and ten occurrences per data set. I also double checked that no words in the vocabulary and no words in the data set contained a space.
In every occurrence, the two component words also had their own single-word entries. Is this perhaps intentional for particularly frequently co-occurring pairs?
Is this expected behavior? If so, why? And why is there no accounting for this in the loading function?
In general, if save_word2vec_format() seems to have prefixed a particular line of floats with a string that includes more whitespace than just an ending space, then it is almost certain that the matching key in the model includes such whitespace.
In particular, consider a written file with the usual special header-line (containing the counts of further vectors & dimensionality), which on its 10th line (counting from 1 as if using cat -n FILENAME) shows this symptom. Since the header occupies line 1, line 10 holds the 9th vector, so in such a case I'd expect ft_model.wv.index_to_key[8] to reveal a key with matching internal whitespace.
Are you sure this isn't the case for your occurrences? (How had you checked "that no words in the vocabulary and no words in the data set contained a space"? Might you have missed some more exotic whitespace character that somehow became a plain space while saving?)
If the actual key contains whitespace, the issue is that something in the prior preprocessing/training left such internal-whitespace tokens in the model, and solving that may be the best way to resolve the issue.
If not, there's a deeper problem, and it'd be helpful to figure out a minimal way to trigger it. Examining the ft_model.wv.index_to_key values for the keys all around the problem key might help narrow things down.
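As a quick sanity check, here's a minimal sketch (assuming gensim 4.x; the model path is hypothetical) that scans every stored key for any whitespace character, not just plain spaces:

```python
import re
from gensim.models import FastText

ft_model = FastText.load("my_model.model")  # hypothetical path

for i, key in enumerate(ft_model.wv.index_to_key):
    if re.search(r"\s", key):               # any whitespace, incl. tab/nbsp
        print(i, repr(key))                 # repr() makes the character visible
```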

What is the best data structure for inserting and searching for particular words in a text file, line by line?

I am struggling with my homework: I have to design a data structure (or structures) suitable for a specific scenario. I have a text file to load line by line. After that I have to:
Print line numbers on which a given word exists.
Print the total number of times a given word occurs (on a specific line)
Print a whole line of words.
I must not use a programming language; the answer must be only a description of the approach.
I was thinking about a linked list of arrays, where each node is a line containing an array of words. That does not need much space, but in the worst case the search operation will be O(n*n).
I also have tries in mind; however, the number of lines (and of words per line) is not fixed, only bounded by a 4-byte integer, so a trie could use a lot of space.
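Though the assignment wants a description rather than code, the list-of-lines idea from the question is easy to picture as a sketch (Python used only for illustration; the file name is hypothetical). Note how query 1 scans every word of every line, which is the O(n*n) worst case mentioned above:

```python
# Each line becomes an array of words, kept in line order.
lines = [line.split() for line in open("input.txt")]  # hypothetical file

word = "cat"
# 1. Line numbers on which the word exists (scans every word of every line).
print([i + 1 for i, words in enumerate(lines) if word in words])
# 2. Total number of times the word occurs on a specific line (here, line 5).
print(lines[4].count(word))
# 3. Print the whole line of words.
print(" ".join(lines[4]))
```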

Search a string as you type each character

I have contacts stored in my mobile. Let's say my contacts are:
Ram
Hello
Hi
Feat
Eat
At
When I type the letter 'A', I should get all the matching contacts, say "Ram, Feat, Eat, At".
Now I type one more letter, 'T', so my total string is "AT". My program should reuse the results of the previous search for "A" and now return "Feat, Eat, At".
Design and develop a program for this.
(This is an interview question from Samsung mobile development.)
I tried solving it with a trie data structure, but could not find a good way to reuse the results for the already-searched string. I also tried a solution with a dictionary data structure; it has the same disadvantage as the trie.
The question is: how do I search the contacts for each letter typed, reusing the search results of the earlier searched string? What data structure and algorithm should be used to solve this problem efficiently?
I am not asking for a program, so the programming language is immaterial to me.
A state machine appears to be a good solution. Does anyone have a suggestion?
The solution should be fast enough for a million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a string.contains check on everything. So when the user types "A", you search the entire list:
    for each contact in contacts
        if contact.Name.Contains("A")
            Add contact to results
Then the user types "T", and you sequentially search the previous returned results:
    for each contact in results
        if contact.Name.Contains("AT")
            Add contact to new search results
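A minimal sketch of that incremental filtering in Python (the contact names come from the question; the helper name `refine` is mine):

```python
contacts = ["Ram", "Hello", "Hi", "Feat", "Eat", "At"]

def refine(candidates, query):
    """Filter the previous result set with the new, longer query."""
    q = query.lower()
    return [name for name in candidates if q in name.lower()]

results = refine(contacts, "A")   # ['Ram', 'Feat', 'Eat', 'At']
results = refine(results, "AT")   # ['Feat', 'Eat', 'At']
print(results)
```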
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start, although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action, and you can almost certainly do that before the user presses the second key, especially if you have a background thread doing the search while the foreground thread handles input and writes the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, at
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are only 676 possible two-letter bigrams (26 × 26), plus of course the 26 single letters, so at most your dictionary is going to have 702 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first anyway. This is going to cost you an extra 2 million references to the names in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to look up and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
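A minimal sketch of such a bigram index in Python (the function name `build_index` is mine):

```python
from collections import defaultdict

def build_index(names):
    """Map every single letter and every bigram to the names containing it."""
    index = defaultdict(set)
    for name in names:
        low = name.lower()
        for ch in set(low):                    # single-letter entries
            index[ch].add(name)
        for i in range(len(low) - 1):          # bigram entries
            index[low[i:i + 2]].add(name)
    return index

index = build_index(["Ram", "Hello", "Hi", "Feat", "Eat", "At"])
print(sorted(index["at"]))   # ['At', 'Eat', 'Feat']
```

Removing a name means deleting it from each of these sets, exactly as described above.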
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.
What I'm thinking of doing is keeping track of the string matched so far. Suppose in the first step we identify the strings that have "A" in them, and we keep track of the positions of "A". Then in the next step we only iterate over these strings, and instead of searching them in full we only check whether "T" occurs as the character following each "A" position we recorded in the previous step, and so on.
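A minimal sketch of that position-tracking idea in Python (the helper names are mine):

```python
def first_pass(contacts, ch):
    """Find contacts containing ch, remembering every match position."""
    hits = []
    for name in contacts:
        positions = [i for i, c in enumerate(name.lower()) if c == ch]
        if positions:
            hits.append((name, positions))
    return hits

def next_pass(hits, next_ch, depth):
    """Keep only matches where next_ch appears at position start + depth."""
    out = []
    for name, starts in hits:
        low = name.lower()
        kept = [s for s in starts
                if s + depth < len(low) and low[s + depth] == next_ch]
        if kept:
            out.append((name, kept))
    return out

hits = first_pass(["Ram", "Hello", "Hi", "Feat", "Eat", "At"], "a")
hits = next_pass(hits, "t", 1)         # query is now "at"
print([name for name, _ in hits])      # ['Feat', 'Eat', 'At']
```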

English word capture

I have a big text file which contains many English words; however, it contains German and French words as well. I need to capture all the English words in it.
My plan was: first, read the whole file from disk and convert it into an array; second, match all the words against a Unix English word dictionary like the one here. Yet that is not a good solution because of the size of each file; done that way, the complexity will be high, and I don't want that.
Do you have any idea how I can do it with Ruby in a simple way?
The first thing you could do is put the English dictionary into a set (instead of an array). That way each lookup is O(1) and the overall complexity becomes O(N) instead of O(N×M).
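A minimal sketch of that idea (shown in Python for concreteness; in Ruby you would `require 'set'` and use Set the same way; the file paths are hypothetical):

```python
# Build the dictionary set once: O(M). Each lookup is then O(1) on average.
with open("/usr/share/dict/words") as f:
    english = {w.strip().lower() for w in f}

with open("big_text.txt") as f:            # the mixed-language input
    words = f.read().split()

english_words = [w for w in words if w.lower() in english]
```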

How to Find Exact Row in Log File

Suppose you have a big log file, billions of lines long. The file has some columns, like IP addresses: xxx.xxx.xxx.xxx.
How can I quickly find exactly one line, say the one containing 123.123.123.123?
A naive line-by-line search seems too slow.
If you don't have any other information to go on (such as a date range, assuming the file is sorted), then a line-by-line search is your best option. That doesn't mean you need to read the file in line by line, though, and it might be more efficient to search backwards if you know the entry is recent.
The general approach (for searching backwards) is this:
Declare a buffer. You will read chunks of the file at a time into this buffer as fast as possible (preferably by using low-level operating system calls that can read directly without any buffering/caching).
So you seek to the end of your file minus the size of your buffer and read that many bytes.
Now you search forwards through your buffer for the first newline character. Remember that offset for later, as it represents a partial line. Starting at the next line, you search forward to the end of the buffer looking for your string. If it has to be in a certain column but other columns could contain that value, then you need to do some parsing.
Now you continue to search backwards through your file. You seek to the last position you read from minus the chunk size plus the offset that you found when you searched for a newline character. Now, you read again. If you like you can move that partial line to the end of the buffer and read fewer bytes but it's not going to make a huge difference if your chunks are large enough.
And you continue until you reach the beginning of the file. There is of course a special case when the number of bytes to read is less than the chunk size (namely, you don't ignore the first line). I assume that you won't reach the beginning of the file because it seems clear that you don't want to search the entire thing.
So that's the approach when you have no idea where the value is. If you do have some idea on ordering, then of course you probably want to do a binary search. In that case you can use smaller chunk sizes (enough to at least catch a full line).
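A minimal Python sketch of that backwards, chunked scan (the chunk size, file name, and search string are illustrative; a real version would use lower-level unbuffered reads as suggested above):

```python
import os

CHUNK = 1 << 20  # 1 MiB per read

def search_backwards(path, needle):
    """Yield matching lines, newest first, scanning the file from the end."""
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)
        tail = b""                      # partial line from the previous chunk
        while pos > 0:
            start = max(0, pos - CHUNK)
            f.seek(start)
            buf = f.read(pos - start) + tail
            lines = buf.split(b"\n")
            # The first piece may be a partial line; carry it into the next
            # (earlier) chunk, unless we have reached the start of the file.
            tail = lines.pop(0) if start > 0 else b""
            for line in reversed(lines):
                if needle in line:
                    yield line.decode(errors="replace")
            pos = start

for line in search_backwards("big.log", b"123.123.123.123"):
    print(line)
    break   # stop at the most recent matching line
```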
You really need to search for some regularity in the file and exploit that. Barring that, if you have multiple processors you could split the file into sections and search them in parallel, assuming I/O would not then be the bottleneck.
