Vertex AI Matching Engine machine types - google-cloud-vertex-ai

I want to try the Vertex AI vector matching service (Matching Engine), but when I deploy it always puts e2-standard-16 machines on my index. I want to try something smaller, but I don't see an option to specify the machine type during deployment.
Here is my deployment command:
gcloud ai index-endpoints deploy-index 2056746450917785600 \
--deployed-index-id=postanndeploy \
--display-name=smallindexdeploy \
--index=5486800517113839616 \
--min-replica-count=1 \
--max-replica-count=2 \
--project=myproject \
--region=us-central1
After the deployment finishes, when I look at it I see an e2-standard-16 machine, which is way bigger than I need for my POC. Is there a way I can run this on a smaller machine and not worry about costs?
Manish

The machine type for a Matching Engine deployment depends on the shard size you set while creating the index. If nothing is specified, the default is SHARD_SIZE_MEDIUM.
SHARD_SIZE_MEDIUM uses e2-standard-16 machines by default. You can try SHARD_SIZE_SMALL for smaller use cases; it uses e2-standard-2 instances instead.
In both cases the number of instances spun up depends on your overall index size, and the Matching Engine monitoring dashboard will show how many instances were created for your specific index.
Google's documentation has more detail on the relationship between index size and the number of instances.
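For reference, the shard size is chosen when the index is created (it is part of the index metadata), not at deploy time. Below is a minimal sketch of what that could look like; the bucket path, dimensions and display name are placeholders, other metadata fields are omitted, and the exact schema should be checked against the current Matching Engine docs.
# create a metadata file that requests small shards (placeholder values throughout)
cat > index_metadata.json <<'EOF'
{
  "contentsDeltaUri": "gs://my-bucket/embeddings/",
  "config": {
    "dimensions": 128,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "shardSize": "SHARD_SIZE_SMALL"
  }
}
EOF
# create the index from that metadata
gcloud ai indexes create \
  --metadata-file=index_metadata.json \
  --display-name=small-poc-index \
  --project=myproject \
  --region=us-central1
As far as I know the shard size of an existing index cannot be changed, so for your POC you would re-create the index with SHARD_SIZE_SMALL and then run your deploy-index command against the new index.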

Related

SphinxSearch - Different Nodes using shared data

We are in the process of building a SphinxSearch cluster using Amazon EC2 instances. We ran a sample test with several instances using the same shared file system (Elastic File System). Our idea is that a cluster might have more than 10 nodes, but a single instance can index the documents, keep the index on Elastic File System, and share it with the other nodes for reading.
Our test worked fine, but are there any technical problems with this approach (locking issues, etc.)?
Can someone please advise?
Thanks in Advance
If you're ok with having N copies of the index you can do as follows:
build an index in one place in a temp folder
rename the files so they include .new.
distribute the index to all the other places using rsync or whatever you like. Some even do broadcasting with UFTP
rotate the indexes in all the places at once by sending HUP to the searchd processes, or better by running RELOAD INDEX (http://docs.manticoresearch.com/latest/html/sphinxql_reference/reload_index_syntax.html); it normally takes only a few ms, so the new index effectively replaces the previous one simultaneously on all the nodes
Previously (and perhaps still in Sphinx) there was an issue with rotating an index (either by --rotate or RELOAD) while it was processing a long query: the rotate simply had to wait. This was fixed in Manticore Search recently.
This is a tried-and-true solution that people have used in production for years. If you really want to share the same files among multiple searchd instances, you can symlink all the files except .spl, but then, to rotate the index on the instances that use the links (not the actual files), you will need to restart those searchd instances. That doesn't look good in general, but in some special cases it may still be a good solution.
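A rough shell sketch of that flow, assuming three nodes, an index called myindex, and an indexer config that writes the index files into /tmp/index_build; hostnames, paths and the SphinxQL port are placeholders, and RELOAD INDEX is the Manticore syntax linked above:
# 1. build the index into a temp folder on the build box
indexer --config /etc/sphinx/sphinx.conf myindex
# 2. rename the files so they include .new. (myindex.spd -> myindex.new.spd, ...)
cd /tmp/index_build
for f in myindex.*; do mv "$f" "${f/myindex./myindex.new.}"; done
# 3. distribute the .new. files to every node
for host in node1 node2 node3; do
  rsync -a /tmp/index_build/ "$host":/var/lib/sphinx/data/
done
# 4. rotate everywhere at (almost) the same moment
for host in node1 node2 node3; do
  mysql -h "$host" -P 9306 -e "RELOAD INDEX myindex"
done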

Elasticsearch bulk update is extremely slow

I am indexing a large amount of daily data, ~160GB per index, into Elasticsearch. I am facing a case where I need to update almost all the docs in the indices with a small amount of data (~16GB), which is of the format
id1,data1
id1,data2
id2,data1
id2,data2
id2,data3
.
.
.
My update operations start at 16,000 lines per second, but within 5 minutes they drop to 1,000 lines per second and don't go back up. The update process for this 16GB of data currently takes longer than my entire indexing of the 160GB.
My conf file for the update operation currently looks as follows
output {
  elasticsearch {
    action => "update"
    doc_as_upsert => true
    hosts => ["host1","host2","host3","host4"]
    index => "logstash-2017-08-1"
    document_id => "%{uniqueid}"
    document_type => "daily"
    retry_on_conflict => 2
    flush_size => 1000
  }
}
The optimizations I have done to speed up indexing in my cluster based on the suggestions here https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html are
Setting "indices.store.throttle.type" : "none"
Index "refresh_interval" : "-1"
I am running my cluster on 4 d2.8xlarge EC2 instances and have allocated 30GB of heap to each node.
While the update is happening, barely any CPU is used and the load is very low as well.
Despite everything, the update is extremely slow. Is there something very obvious that I am missing that is causing this issue? Looking at the threadpool data, I find that the number of threads working on bulk operations is constantly high.
Any help on this issue would be really helpful
Thanks in advance
There are a couple of rule-outs to try here.
Memory Pressure
With 244GB of RAM, this is not terribly likely, but you can still check it out. Find the jstat command in the JDK for your platform, though there are visual tools for some of them. You want to check both your Logstash JVM and the ElasticSearch JVMs.
jstat -gcutil -h7 {PID of JVM} 2s
This will give you a readout of the various memory pools, garbage-collection counts, and GC timings for that JVM as it works. It will update every 2 seconds and print headers every 7 lines. Spending excessive time in full GCs (the FGC/FGCT columns) is a sign that your heap is under-allocated.
I/O Pressure
The d2.8xlarge is a dense-storage instance, and may not be great for a highly random, small-block workload. If you're on a Unix platform, top will tell you how much time you're spending in IOWAIT state. If it's high, your storage isn't up to the workload you're sending it.
If that's the case, you may want to consider provisioned-IOPS EBS volumes rather than the instance-local storage. Or, if your data will fit, consider an instance in the i3 family of high-I/O instances instead.
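If you want a quick way to check this while the updates are running, iostat from the sysstat package gives a rolling view; device names will differ on your instances:
iostat -x 5
# watch %iowait on the CPU line and await / %util per device -- sustained high values
# mean the disks can't keep up with the small random writes the updates generate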
Logstash version
You don't say what version of Logstash you're using. Being StackOverflow, you're likely to be using 5.2. If that's the case, this isn't a rule-out.
But, if you're using something in the 2.x series, you may want to set the -w flag to 1 at first, and work your way up. Yes, that's single-threading this. But the ElasticSearch output has some concurrency issues in the 2.x series that are largely fixed in the 5.x series.
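For example (the config file name here is just a placeholder for your update pipeline):
bin/logstash -f update-pipeline.conf -w 1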
With Elasticsearch 6.0 we had exactly the same issue of slow updates on AWS, and the culprit was slow I/O. The same data upserted completely fine on a local test stack, but once in the cloud on an EC2 stack everything died after an initial burst of speedy inserts lasting only a few minutes.
The local test stack was a low-spec server in terms of memory and CPU, but it had SSDs.
The AWS stack used EBS volumes of the default gp2 type with 300 IOPS.
Converting the volumes to type io1 with 3000 IOPS solved the issue and everything got back on track.
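For anyone wanting to reproduce that change, here is a sketch with the AWS CLI; the volume ID is a placeholder and the modification can take a while to complete:
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type io1 --iops 3000
# track progress of the modification
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0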
I am using the Amazon AWS Elasticsearch Service, version 6.0. I need heavy writes/inserts from a series of JSON files into Elasticsearch, around 10 billion items. The elasticsearch-py bulk write speed was really slow most of the time, with only occasional bursts of high-speed writes. I tried all kinds of methods, such as splitting the JSON files into smaller pieces, reading them with multiple processes, and parallel_bulk inserts into Elasticsearch; nothing worked. Finally, after I upgraded to an io1 EBS volume, everything went smoothly with 10,000 write IOPS.

Neo4j in Docker - Max Heap Size Causes Hard crash 137

I'm trying to spin up a Neo4j 3.1 instance in a Docker container (through Docker-Compose), running on OSX (El Capitan). All is well, unless I try to increase the max-heap space available to Neo above the default of 512MB.
According to the docs, this can be achieved by adding the environment variable NEO4J_dbms_memory_heap_maxSize, which then causes the server wrapper script to update the neo4j.conf file accordingly. I've checked and it is being updated as one would expect.
The problem is, when I run docker-compose up to spin up the container, the Neo4j instance crashes out with a 137 status code. A little research tells me this is a linux hard-crash, based on heap-size maximum limits.
$ docker-compose up
Starting elasticsearch
Recreating neo4j31
Attaching to elasticsearch, neo4j31
neo4j31 | Starting Neo4j.
neo4j31 exited with code 137
My questions:
Is this due to a Docker or an OSX limitation?
Is there a way I can modify these limits? If I drop the requested limit to 1GB, it will spin up, but still crashes once I run my heavy query (which is what caused the need for increased Heap space anyway).
The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through doing, say, 500 nodes at a time, using only Cypher (I'd rather avoid writing a script if I can, it feels a little dirty for this)?
My docker-compose.yml is as follows:
---
version: '2'
services:
  # ---<SNIP>
  neo4j:
    image: neo4j:3.1
    container_name: neo4j31
    volumes:
      - ./docker/neo4j/conf:/var/lib/neo4j/conf
      - ./docker/neo4j/mnt:/var/lib/neo4j/import
      - ./docker/neo4j/plugins:/plugins
      - ./docker/neo4j/data:/data
      - ./docker/neo4j/logs:/var/lib/neo4j/logs
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_dbms_memory_heap_maxSize=4G
  # ---<SNIP>
Is this due to a Docker or an OSX limitation?
No. Increase the amount of RAM available to Docker to resolve this issue.
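For example, you can check how much memory the Docker VM currently has (on Docker for Mac it is capped well below a 4G heap by default) and then raise it under Docker's Preferences before retrying; the grep pattern is just a convenience:
docker info | grep -i 'total memory'   # confirm the VM has less RAM than the requested heap
docker-compose up                      # retry after raising the limit in Docker's Preferences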
Is there a way I can modify these limits? If I drop the requested limit to 1GB, it will spin up, but still crashes once I run my heavy query (which is what caused the need for increased heap space anyway).
The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through doing, say, 500 nodes at a time, using only Cypher (I'd rather avoid writing a script if I can, it feels a little dirty for this)?
N/A. This is a Neo4j-specific question; it might be better to separate it from the Docker questions listed above.
3. The query that I'm running is a large-scale update across a lot of nodes (>150k) containing full-text attributes, so that they can be synchronised to ElasticSearch using the plug-in. Is there a way I can get Neo to step through doing, say, 500 nodes at a time, using only Cypher (I'd rather avoid writing a script if I can, it feels a little dirty for this)?
You can do this with the help of the APOC plugin for Neo4j, more specifically apoc.periodic.iterate or apoc.periodic.commit.
If you use apoc.periodic.commit, your first MATCH should be specific: as in the example, mark which nodes you have already synced, because otherwise it can sometimes loop forever:
call apoc.periodic.commit("
match (user:User) WHERE user.synced = false
with user limit {limit}
MERGE (city:City {name:user.city})
MERGE (user)-[:LIVES_IN]->(city)
SET user.synced =true
RETURN count(*)
",{limit:10000})
If you use apoc.periodic.iterate you can run it in parallel mode:
CALL apoc.periodic.iterate(
"MATCH (o:Order) WHERE o.date > '2016-10-13' RETURN o",
"with {o} as o MATCH (o)-[:HAS_ITEM]->(i) WITH o, sum(i.value) as value
CALL apoc.es.post(host-or-port,index-or-null,type-or-null,
query-or-null,payload-or-null) yield value return *", {batchSize:100, parallel:true})
Note that there is no need for a second MATCH clause, and apoc.es.post is an APOC procedure that can send POST requests to Elasticsearch.
See the APOC documentation for more info.

Does hadoop use folders and subfolders

I have started learning Hadoop and have just completed setting up a single node as demonstrated in the Hadoop 1.2.1 documentation.
Now I was wondering:
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Is it possible to add new nodes to the single-node setup if, say, somebody were to use it in a production environment? Or, simply, can a single node be converted to a cluster without loss of data by adding more nodes and editing the configuration?
This one I can google, but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
When files are stored in this type of FS, should I use a hierarchical mode of storage - like folders and sub-folders as I do in Windows - or are files just written in as long as they have a unique name?
Yes, use the directories to your advantage. Generally, when you run jobs in Hadoop, if you pass along a path to a directory, it will process all files in that directory. So.. you really have to use them anyway.
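For example (paths and the job jar are just placeholders), the normal HDFS shell commands work with directories, and a job can be pointed at a whole directory:
hadoop fs -mkdir /data/logs/2014/01      # on Hadoop 2+ use -mkdir -p
hadoop fs -put access.log /data/logs/2014/01/
hadoop fs -ls /data/logs/2014/01
# run a job over everything in that directory
hadoop jar my-job.jar MyJob /data/logs/2014/01 /output/logs-2014-01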
Is it possible to add new nodes to the single-node setup if, say, somebody were to use it in a production environment? Or, simply, can a single node be converted to a cluster without loss of data by adding more nodes and editing the configuration?
You can add/remove nodes as you please (unless by single-node, you mean pseudo-distributed... that's different)
This one I can google but what the hell! I am asking anyway, sue me. What is the maximum number of files I can store in HDFS?
Lots
To expand on climbage's answer:
The maximum number of files is a function of the amount of memory available to your NameNode server. There is some loose guidance that each metadata entry in the NameNode requires somewhere between 150 and 200 bytes of memory (it varies by version).
From this you'll need to extrapolate to the number of files, plus the number of blocks you have for each file (which can vary depending on file and block size), and then you can estimate, for a given memory allocation (2G / 4G / 20G etc.), how many metadata entries (and therefore files) you can store.
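As a back-of-the-envelope example using the ~150-byte figure above (and lumping file and block entries together), a 4 GB NameNode heap gives on the order of 28 million metadata entries:
echo $(( 4 * 1024 * 1024 * 1024 / 150 ))   # => 28633115 entries at ~150 bytes each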

Architecture - How to efficiently crawl the web with 10,000 machine?

Let’s pretend I have a network of 10,000 machines. I want to use all those machines to crawl the web as fast as possible. All pages should be downloaded only once. In addition there must be no single point of failure and we must minimize the number of communication required between machines. How would you accomplish this?
Is there anything more efficient than using consistent hashing to distribute the load across all machines and minimize communication between them?
Use a distributed MapReduce system like Hadoop to divide up the workspace.
If you want to be clever, or are doing this in an academic context, then try nonlinear dimension reduction.
The simplest implementation would probably be to use a hashing function on the namespace key, e.g. the domain name or URL. Use a Chord ring to assign each machine a subset of the hash values to process.
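A minimal sketch of the hashing idea, using a plain modulo assignment rather than a full Chord ring; the URL and the machine count are placeholders, and md5sum is only used as a cheap, stable hash:
url="http://example.com/some/page"
hash=$(printf '%s' "$url" | md5sum | cut -c1-8)   # first 32 bits of the digest
machine=$(( 0x$hash % 10000 ))                    # which of the 10,000 machines owns this URL
echo "$url -> machine $machine"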
One idea would be to use work queues (directories or a DB), assuming you will be working out storage such that it meets your criteria for redundancy.
\retrieve
\retrieve\server1
\retrieve\server...
\retrieve\server10000
\in-process
\complete
1.) All seed pages are hashed and placed in the retrieve queue, using the hash as the file root.
2.) Before putting a page in the queue, check the complete and in-process queues to make sure you don't re-queue it.
3.) Each server retrieves a random batch of (1-N) files from the retrieve queue and attempts to move them to its private queue (see the shell sketch at the end of this answer).
4.) Files that fail the rename are assumed to have been “claimed” by another process.
5.) Files that can be moved are processed; put a marker in the in-process directory to prevent re-queuing.
6.) Download the file and place it into the \complete queue.
7.) Clean the file out of the in-process and server directories.
8.) Every 1,000 runs, check the oldest 10 in-process files by trying to move them from their server queues back into the general retrieve queue. This will help if a server hangs, and should also load-balance slow servers.
For the retrieve, in-process and complete queues: most file systems hate millions of files in one directory, so divide storage into segments based on the characters of the hash; \abc\def\123\ would be the directory for file abcdef123FFFFFF… if you were scaling to billions of downloads.
If you were using MongoDB instead of a regular file store, many of these problems would be avoided, and you could benefit from its sharding, etc.
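A bare-bones shell sketch of the claim-by-rename step (3 and 4 above); the queue root and server name are placeholders, and mv is only an atomic claim if all the queues live on the same filesystem:
QUEUE=/crawl/retrieve
MINE=/crawl/retrieve/server1
mkdir -p "$MINE"
# grab up to 10 random unclaimed files and try to move them into our private queue
for f in $(ls "$QUEUE" | grep -v '^server' | shuf -n 10); do
  if mv "$QUEUE/$f" "$MINE/$f" 2>/dev/null; then
    echo "claimed $f"        # we own it: mark it in-process, download, then move to \complete
  else
    echo "lost race for $f"  # another server claimed it first
  fi
done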
