About replicating Meilisearch #3494
Thanks for the write-up kero! Cc @gmourier
I am sharing this here for anyone stumbling upon this issue. We have a dedicated product discussion to let us know what you need.
Here is some news about what we plan to do in the coming months. We worked with @dureuill to create a fake Meilisearch program that behaves like Meilisearch, i.e., with a task queue and everything. This little experiment runs on top of datafuselabs/openraft. We followed the guide and wanted to make it work with LMDB and its commit and transactional systems. We use this fake Meilisearch experiment to meet the Cloud team requirements and adapt the solution into something useful that can be easily deployed and maintained. Once everyone is happy with it, we will transpose the changes onto Meilisearch itself. It's easier to experiment on a small codebase first than on one as big as Meilisearch's.
More news 🎉 We discussed the current implementation, and @irevoire and @ManyTheFish thought about not sending the whole update files through the Raft log, as they would be too big. When a new update arrives, the leader stores the file in a shared folder and only sends the task info through the Raft log. The followers then store the raw tasks with the same date-times as the leader (so that task filters are consistent across the cluster). The mDNS communication system is relatively easy to implement; however, @irevoire suggested using the Quickwit/ChitChat solution. We must commit at the right time and ensure that errors while writing on disk are well handled.
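The idea above (keep the heavy update files out of the Raft log, replicate only small task entries, and reuse the leader's timestamps on followers) can be sketched with a toy in-memory model. All names (`TaskEntry`, `Leader`, `Follower`, the shared-folder dict) are hypothetical, not Meilisearch's actual types:

```python
from dataclasses import dataclass

@dataclass
class TaskEntry:
    task_id: int
    kind: str
    enqueued_at: str        # leader-assigned date-time, reused verbatim by followers
    update_file_ref: str    # pointer to the file in the shared folder, not the bytes

class Leader:
    def __init__(self):
        self.tasks = {}            # task_id -> TaskEntry (the replicated task queue)
        self.next_id = 0
        self.shared_folder = {}    # stands in for the folder holding raw update files

    def receive_update(self, kind, payload, now):
        ref = f"update-{self.next_id}"
        self.shared_folder[ref] = payload             # big payload stays out of the log
        entry = TaskEntry(self.next_id, kind, now, ref)
        self.next_id += 1
        self.tasks[entry.task_id] = entry
        return entry                                  # only this small entry enters the Raft log

class Follower:
    def __init__(self):
        self.tasks = {}
        self.files = {}            # local copies of update files, fetched out of band

    def apply(self, entry, shared_folder):
        # Store the task with the leader's date-time so task filters agree cluster-wide.
        self.tasks[entry.task_id] = entry
        self.files[entry.update_file_ref] = shared_folder[entry.update_file_ref]

leader = Leader()
follower = Follower()
entry = leader.receive_update("documentAdditionOrUpdate", b"...big ndjson...",
                              "2023-03-01T10:00:00Z")
follower.apply(entry, leader.shared_folder)
assert follower.tasks[0].enqueued_at == leader.tasks[0].enqueued_at
```

The point of the split is that the consensus log only carries a few hundred bytes per task, while the potentially huge payload travels through whatever file-distribution channel the cluster uses.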
Meeting notes and potential next steps.
Main TODO:
Hi there, I have a little bit of experience using Openraft. I had a similar idea of using something else to distribute the snapshot (datafuselabs/openraft#507); in that issue I talked with the author of Openraft, so you might want to take a look. Also, I'm free to help if needed; I know a bit about how Meilisearch deals with tasks and the task queue.
Hello, coming back to the replication solution: Raft doesn't seem to be the best approach we've found so far. To recall our must-haves in terms of replication:
With the current Meilisearch architecture, Raft doesn't play well with:
A potential alternative would be to rely on Zookeeper to handle the leader election and the order consistency of the tasks to process, and on S3 to store and propagate the updates and snapshots. Note that S3 would be necessary with a Raft architecture too. This solution seems the easiest to implement, looks promising for future HA on WRITE operations, and seems to allow having bigger machines specialized for indexing and smaller machines for search. Before starting to implement a POC, we should verify the following needs:
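To make the division of labor concrete, here is a toy in-memory model of that Zookeeper + S3 split: a stand-in for Zookeeper's sequential znodes provides the single agreed-upon task order, a stand-in for S3 holds the heavy payloads, and nodes catch up by replaying the ordered log. All class names are hypothetical sketches, not real client APIs:

```python
import itertools

class FakeZookeeper:
    """Stands in for ZooKeeper sequential znodes: one agreed-upon task order."""
    def __init__(self):
        self._seq = itertools.count()
        self.log = []                      # ordered list of (seq, s3_key)

    def register_task(self, s3_key):
        seq = next(self._seq)
        self.log.append((seq, s3_key))
        return seq

class FakeS3:
    """Stands in for S3: stores and propagates the actual update payloads."""
    def __init__(self):
        self.objects = {}
    def put(self, key, body): self.objects[key] = body
    def get(self, key): return self.objects[key]

class MeiliNode:
    def __init__(self, zk, s3):
        self.zk, self.s3 = zk, s3
        self.applied = []                  # seq numbers applied so far, in order

    def submit(self, body):
        key = f"update/{len(self.s3.objects)}"
        self.s3.put(key, body)             # heavy payload goes to S3 first
        return self.zk.register_task(key)  # only its key enters the ordered log

    def catch_up(self):
        for seq, key in self.zk.log[len(self.applied):]:
            _payload = self.s3.get(key)    # a real node would index the payload here
            self.applied.append(seq)

zk, s3 = FakeZookeeper(), FakeS3()
writer, reader = MeiliNode(zk, s3), MeiliNode(zk, s3)
writer.submit(b"doc batch 1")
writer.submit(b"doc batch 2")
reader.catch_up()
assert reader.applied == [0, 1]            # every node sees the same task order
```

This also illustrates why indexing and search machines could be sized differently: the ordering service and object store do the coordination, so a search-only node just replays the log at its own pace.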
I would strongly vote for etcd over zookeeper if you're going to run a lockserver. A good majority of installations are going to happen on kubernetes, which runs on etcd, ensuring long term support. Zookeeper has stagnated for years.
Hello @derekperkins, |
zookeeper was definitely the original choice, but many popular datastores are eliminating it. etcd, mainly because it is the primary datastore for kubernetes, is easier to run in a cloud-native way and is more likely to meet the needs of meilisearch for the next decade.
@ManyTheFish @derekperkins there is a paper discussing the use of etcd, as used by kubernetes, to build a native consensus algorithm in kubernetes: paper
FWIW, I'm a maintainer of Vitess, distributed MySQL. We initially had an interface to enable any lockserver, and we had implementations for etcd, zookeeper, consul, and k8s. k8s and consul have already been deprecated, and we're really only keeping zookeeper around because of existing users. If we were starting over today, we would most likely require etcd. Starting with Zookeeper in a new project would be a mistake IMO.
Hey, I would love to hear more about what is wrong with Zookeeper; I'm struggling to understand the concrete issues with it. Also, I've heard a bunch of times that basic raft doesn't play well with geo-replication. Since etcd relies on raft, do you have any info to share on the subject?
There's nothing concretely wrong with Zookeeper. IIRC, we have encountered a few types of queries that were less performant in zookeeper, but nothing major.
This is my core concern. Zookeeper was already stagnating while it was required for many popular projects, and all of these migrations are just going to accelerate that. I'm not pitching for Meilisearch to write their own custom implementation, just to use etcd, which, by virtue of being the only datastore for kubernetes, is more likely to be around longer than zookeeper. To be fair, etcd is currently looking for some maintenance help, but again, the kubernetes connection makes it impossible to abandon.
I guess that's subjective; I've had the opposite experience, though it admittedly has been a few years since I tried zookeeper. Bitnami maintains charts for both: etcd & zookeeper. I have experience with the etcd chart and it's really easy to deploy. Given the success I've had with it and some trust in bitnami, I'd guess that zookeeper is similarly easy to stand up. Zookeeper runs on Java and takes more resources than etcd, which is written in Go. Zookeeper was written pre-cloud, where etcd was built in the cloud-native era, which is very hand-wavy and not objective, but plays a role in my opinion. Planetscale runs on Vitess, using etcd. This is a quote from someone on their team about how little it takes for them to manage it at scale.
I'm not the expert in this sphere, but raft has been adopted by most of the NewSQL competitors, by the two projects I mentioned above that are migrating away from zookeeper (Clickhouse and Kafka), and by your closest competitor, typesense:
I don't think the choice of consensus algorithm is a reason to choose one over the other.
At least from what is visible on this thread, the discussion didn't move to zookeeper until 2 days ago. If zookeeper were already built out and in place, I wouldn't suggest a rip-and-replace to anyone. My comments are based on the assumption that this is a pretty greenfield project. Again, I'm not an expert in the weeds of zookeeper vs etcd, so I don't claim to 100% know the right choice for Meilisearch. I'm making some intuitive projections based on what I'm seeing in the distributed database world. As a ~2-year satisfied user of Meilisearch continuing to expand our usage, I want to see what's best for the project. If I have a specific bias, it's mostly based on operating etcd for several years in conjunction with our Vitess installation, all running on top of etcd in Kubernetes. TL;DR: Zookeeper is probably fine, but choosing it now goes against trends in other distributed database systems.
Hey, thanks a lot for your comment, it helps a lot!
The question was more « Is it ok to run instances all over the world in different DCs? », because a lot of people complain that raft is way too verbose and costs a lot of money. For example, afaik typesense isn't geo-replicated in their cloud. So, if you're using
To keep it short, let's say that the HA work is a joint effort with the meilisearch cloud team, and thus we can't swap our solution every week or no one is going to make progress. But ideally, we would really like to take the time to try out etcd. You also said you dropped support for consul; I see that it has been proposed to us as an alternative to zookeeper. Do you have any info to share about this one as well? 👀
My answer is the same as before - almost every new database system is using raft for running clusters around the world, so it's hard for me to believe that there's a fundamental problem with the algorithm. Vague references to "lots of people complain" don't help identify any specific issues you're facing.
The way that Vitess uses etcd / zookeeper is fundamentally different from what you're describing here. They aren't part of the query-serving path; they are only used for specific low-bandwidth operations like failovers. Actual data replication is handled by MySQL.
I only know what's happening on this thread. My only goal was to add some more context so that zookeeper was chosen for its merits and not because it used to be the only choice. If I were in your shoes, I would reach out to the people most actively committing in these other db systems and ask their opinion, what tradeoffs they've made, and how they would approach it for a db without existing replication like meilisearch. I'm sure you could find someone willing to get on a quick call to answer some of your questions, I just don't have the topical expertise to get any deeper.
Best of luck, I'm excited to continue watching Meilisearch evolve!
Generally speaking, the kind of feedback I heard was along these lines:
The team has worked on this, but the work is not over.
Are you talking about replicating all the search index data (write/update/delete) via Raft? My understanding was that Raft was only used to coordinate replica set health. But any synchronisation of actual DB content (search index data in this case) would be handled outside Raft (usually as a stream of commands from the master node to slave nodes). |
Is anyone currently working on this? |
Hello @andtii! To give you more insight: since we observed that replication is more of an enterprise need, we want to focus on making Meilisearch replicate on our Cloud offer first. However, an open-source replication solution will come once we are sure we can make it available on the Cloud.
@curquiza Thanks for the input. At the moment we are running tests with the latest Meilisearch and multi-instance pods against a shared NFS disk in Azure Kubernetes. The problem we can see is that when you create a new index on one pod, the other pods will not get the new index. Is there a way to run Meilisearch like this? How are you running it in the cloud at the moment to support scale-out?
Hey @andtii,
No, sorry, you absolutely cannot do that:
As curqui stated, we're currently focused on polishing our enterprise solution first, which I don't know if I can share details about. But basically, it comes down to having a big queue that replicates all the updates to as many Meilisearch instances as you want. If you want, you can also try out the two previous open solutions I made, but keep in mind that I only tested them manually; there is a good chance they're hiding some bugs, and as far as I know, no one else has ever tried them. Here's the PR: #3971 To make it work, you must deploy a Zookeeper + an S3 server, which all Meilisearch instances will connect to.
Any progress in this area? We looked into Typesense a bit since it supports scale-out using Raft, but it's nowhere near the performance of Meilisearch from our perspective. Still, the lack of replication is a showstopper for us since our infra needs to scale.
Yes, for us too: I'm loving the Meilisearch technology, but without replication and a way to ensure high availability, we currently cannot use this in production for any of our projects.
Thank you both for using and appreciating Meilisearch. I would like to know: what does "high availability" mean for you? It can be very different depending on the user. You should know that on our Cloud offer, we ensure a minimal, strong SLA, and you can ask for even more if you fall into the Enterprise plan. In this case, our team will deploy what is necessary to ensure the minimal uptime, and so the high availability, you expect.
This morning I had a meeting with @dureuill, and we discussed the different solutions to replicate Meilisearch we could implement and the pros and cons of each.
First, we must define what we want to achieve. Meilisearch is very fast to boot: it doesn't need to process anything before being able to serve requests, not even after a crash. The reason is that it uses LMDB rather than RocksDB or SQLite, which are WAL-based embedded databases. The high-availability feature of the Cloud depends heavily on this property, but it doesn't apply in situations where an entire cluster is down.
These different solutions can fix the weaknesses of the current design. They are more or less complex to implement.
So that you know, no single solution described below manages task cancellation straightforwardly; therefore, task cancellation will be disabled in the prototypes.
A Single Writer Broadcasts its Task Queue to multiple Readers
This solution is by far the easiest to implement. The principle is that we broadcast the tasks received by the single writer to the other readers. There are different quirks to think about, but the current Meilisearch codebase remains unchanged.
A single Meilisearch server would be allowed to receive user write requests. Every time it successfully writes a task to its task queue, it broadcasts it to all the previously registered Meilisearch Readers. The Meilisearch Readers also store this task, but only if its task id directly follows the previous one. If not, the Reader asks the Meilisearch Writer for the set of missing tasks. The Meilisearch Writer will either generate a dump or send the list of tasks, depending on whether the task content files are still available (we delete the task content files when a task is finished).
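The broadcast-plus-gap-detection logic above can be sketched in a few lines. This is a toy in-memory model under the assumption that task ids are dense and monotonically increasing; `Writer`, `Reader`, and `tasks_since` are hypothetical names, and the dump fallback is reduced to re-sending tasks:

```python
class Writer:
    def __init__(self):
        self.queue = []                       # the authoritative task queue
        self.readers = []

    def register(self, reader):
        self.readers.append(reader)

    def enqueue(self, task):
        task_id = len(self.queue)             # dense, monotonically increasing ids
        self.queue.append(task)
        for r in self.readers:                # broadcast after a successful local write
            r.receive(task_id, task, self)
        return task_id

    def tasks_since(self, next_id):
        # In real Meilisearch this would be a dump when content files are gone.
        return list(enumerate(self.queue))[next_id:]

class Reader:
    def __init__(self):
        self.queue = []

    def receive(self, task_id, task, writer):
        if task_id == len(self.queue):        # the id follows the previous one: accept
            self.queue.append(task)
        else:                                 # gap detected: ask for the missing tasks
            for _missing_id, missing in writer.tasks_since(len(self.queue)):
                self.queue.append(missing)

writer = Writer()
late = Reader()
writer.enqueue("task 0")                      # sent before `late` registered
writer.register(late)
writer.enqueue("task 1")                      # id 1 doesn't follow late's empty queue
assert late.queue == ["task 0", "task 1"]     # the gap triggered a catch-up
```

The sequential-id check is what makes the scheme safe against lost or reordered broadcasts: a reader never applies a task out of order, it falls back to a catch-up request instead.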
Pros:
Cons:
The Task Queue is Replicated using a Raft or Paxos Consensus Protocol
This solution looks to be the ideal one. Multiple Meilisearch servers can synchronize the task queue together, sending tasks to each other and committing changes at the same time. Unfortunately, it is not the easiest to build, as consensus protocols are hard to implement right.
Here is a list of available Rust replication libraries: tikv/raft-rs, datafuselabs/openraft, benschulz/paxakos. And here is a list of some C/C++ libraries we can probably wrap: baidu/braft, canonical/raft, willemt/raft, logcabin/logcabin.
Pros:
Cons:
The Task Queue is a Message Broker
A message broker is a (most of the time, distributed) event store to which we can send events that every listener will read.
In this solution, all Meilisearch servers listen to the same broker and subscribe simultaneously. When a write request is sent to any Meilisearch server, that server stores it in the broker, and all the cluster members receive it and start processing it.
When a Meilisearch server is lagging behind, it must start from a dump or process the tasks in order from the start, which can take a lot of time. I didn't take the time to think more about it, but there will surely need to be communication between two Meilisearch servers to request a dump at some point, which can complicate the solution.
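The broker fan-out described above can be modeled with a toy in-memory event store: every subscriber reads every event in the same order, and any server can accept the write. `Broker` and `MeiliServer` are hypothetical names, and real brokers (Kafka, NATS, etc.) would replace the in-process loop:

```python
class Broker:
    """A minimal in-memory event store: every subscriber reads every event, in order."""
    def __init__(self):
        self.events = []
        self.subscribers = []

    def subscribe(self, node):
        self.subscribers.append(node)
        node.offset = len(self.events)     # a late joiner starts from "now";
                                           # catching up from 0 is the dump problem

    def publish(self, event):
        self.events.append(event)
        for node in self.subscribers:
            node.poll(self)

class MeiliServer:
    def __init__(self, name):
        self.name = name
        self.offset = 0
        self.processed = []

    def submit(self, broker, task):
        broker.publish(task)               # any server can accept a write request

    def poll(self, broker):
        while self.offset < len(broker.events):
            self.processed.append(broker.events[self.offset])
            self.offset += 1

broker = Broker()
a, b = MeiliServer("a"), MeiliServer("b")
broker.subscribe(a)
broker.subscribe(b)
a.submit(broker, "create index")           # submitted on a, processed by both
assert a.processed == b.processed == ["create index"]
```

The `offset` assignment in `subscribe` is exactly where the dump question surfaces: a node joining an old cluster either replays the whole event history or bootstraps from a dump before resuming at the current offset.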
Pros:
Cons: