
About replicating Meilisearch #3494

Open
Kerollmops opened this issue Feb 14, 2023 · 26 comments · May be fixed by #3593
Labels: prototype available, tech discussion


@Kerollmops
Member

Kerollmops commented Feb 14, 2023

This morning I had a meeting with @dureuill, and we discussed the different solutions we could implement to replicate Meilisearch, and the pros and cons of each.

First, we must define what we want to achieve. Meilisearch is very fast to boot: it doesn't need to process anything before being able to serve requests, not even after a crash. The reason is that it uses LMDB rather than RocksDB or SQLite, which are WAL-based embedded databases. The high-availability feature of the Cloud depends heavily on this property, but it doesn't help in situations where an entire cluster is down.

The different solutions below can fix the weaknesses of the current design; they are more or less complex to implement.

Note that none of the solutions described below handles task cancellation in a straightforward way, so task cancellation will be disabled in the prototypes.

A Single Writer Broadcasts its Task Queue to multiple Readers

This solution is the easiest to implement by far. The principle is that we broadcast the tasks received by the single writer to the other readers. There are different quirks to think about, but the current Meilisearch codebase remains unchanged.

A single Meilisearch server, the Writer, would be allowed to receive user write requests. Every time it successfully writes a task to its task queue, it broadcasts it to all the previously registered Meilisearch Readers. Each Reader also stores the task, but only if its task id directly follows the highest one the Reader already has. If not, the Reader asks the Writer for the set of missing tasks. The Writer will either generate a dump or send the list of tasks, depending on whether the task content files are still available (we delete the content files when a task is finished).
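To make the gap-handling concrete, here is a minimal Rust sketch of the Reader side; the `Task`, `TaskStore`, and `WriterClient` types are hypothetical and do not come from the Meilisearch codebase:

```rust
/// A task as broadcast by the Writer; `uid` is assumed to be a
/// monotonically increasing identifier, like in the real task queue.
struct Task {
    uid: u64,
    payload: Vec<u8>,
}

/// Hypothetical local task storage on a Reader.
trait TaskStore {
    fn highest_uid(&self) -> Option<u64>;
    fn append(&mut self, task: Task);
}

/// Hypothetical RPC handle to the Writer.
trait WriterClient {
    /// Asks the Writer for every missing task in `[from, up_to)`.
    /// The real Writer may answer with a dump instead when the task
    /// content files are already deleted.
    fn fetch_missing(&self, from: u64, up_to: u64) -> Vec<Task>;
}

/// What a Reader does when it receives a broadcast task.
fn on_broadcast(store: &mut impl TaskStore, writer: &impl WriterClient, task: Task) {
    let expected = store.highest_uid().map_or(0, |uid| uid + 1);
    if task.uid < expected {
        return; // duplicate broadcast, already stored
    }
    if task.uid > expected {
        // A gap: backfill the missing tasks from the Writer first.
        for missing in writer.fetch_missing(expected, task.uid) {
            store.append(missing);
        }
    }
    store.append(task);
}
```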

Pros:

  • Seems easy enough to create a prototype.
  • Elegant solution where we keep the engine as is.
  • Highly available on the read side.

Cons:

  • Highly available on the read side only.

The Task Queue is Replicated using a Raft or Paxos Consensus Protocol

This solution looks like the ideal one: multiple Meilisearch servers synchronize the task queue together, sending tasks to each other and committing changes at the same time. Unfortunately, it is not the easiest to build, as consensus protocols are hard to implement correctly.

Here is a list of available Rust replication libraries: tikv/raft-rs, datafuselabs/openraft, benschulz/paxakos. And here is a list of some C/C++ libraries we can probably wrap: baidu/braft, canonical/raft, willemt/raft, logcabin/logcabin.
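As a rough illustration of how the task queue could sit behind such a library, here is a sketch of the state-machine side only; it is deliberately not tied to the API of openraft, raft-rs, or any other crate listed above, and every name in it is made up:

```rust
/// Commands that would travel through the replicated log.
enum Command {
    RegisterTask { uid: u64, payload: Vec<u8> },
    CancelTasks { uids: Vec<u64> }, // cancellation stays disabled in the prototypes
}

/// The deterministic state machine that every member applies
/// committed log entries to, in the same order.
struct TaskQueueStateMachine {
    last_applied_log_index: Option<u64>,
    tasks: Vec<(u64, Vec<u8>)>,
}

impl TaskQueueStateMachine {
    /// A consensus library only hands over entries once they are
    /// committed by a quorum, so every member converges on the same
    /// task queue.
    fn apply(&mut self, log_index: u64, command: Command) {
        match command {
            Command::RegisterTask { uid, payload } => self.tasks.push((uid, payload)),
            Command::CancelTasks { .. } => { /* intentionally not handled yet */ }
        }
        self.last_applied_log_index = Some(log_index);
    }
}
```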

Pros:

  • Highly available on reads and writes.
  • The replication library manages the replication.

Cons:

  • Consensus protocols are hard to implement correctly, making this the most complex solution to build.

The Task Queue is a Message Broker

A message broker is, most of the time, a distributed event store to which we can send events that every listener will read.

In this solution, all Meilisearch servers listen to the same broker and subscribe to the same stream. When a write request is sent to any Meilisearch server, this server stores it in the broker, and all the cluster members receive it and start processing it.

When a Meilisearch server is lagging behind or joins late, it must start from a dump or process the tasks in order from the start, which can take a lot of time. I didn't take the time to think more about that, but there will surely need to be communication between two Meilisearch servers to ask for a dump at some point, which can make the solution more complex.
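For illustration, a sketch of the consuming side under these assumptions, with a hypothetical `BrokerSubscription` trait standing in for a real broker client:

```rust
/// Hypothetical subscription to the shared broker stream; a real
/// deployment would use a Kafka/NATS/... client instead.
trait BrokerSubscription {
    /// Blocks until the next event is available and returns its
    /// position (offset) in the log plus its payload.
    fn next_event(&mut self) -> (u64, Vec<u8>);
}

/// Every Meilisearch server runs the same loop: whichever server
/// originally published the write, all members consume the events
/// in the same order and process them locally.
fn consume_forever(mut sub: impl BrokerSubscription, mut process: impl FnMut(u64, Vec<u8>)) -> ! {
    loop {
        let (offset, payload) = sub.next_event();
        process(offset, payload);
    }
}
```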

Pros:

  • The broker manages replication and high availability.
  • Highly available on reads and writes.

Cons:

  • Depending on the broker, the user must set up and monitor another program.
  • We must maintain another task queue, the one that lives in the broker.
@dureuill
Contributor

Thanks for the write-up kero!

Cc @gmourier

@gmourier
Member

I am sharing this here for anyone stumbling upon this issue: we have a dedicated product discussion where you can let us know what you need.

@curquiza curquiza removed the thought label Mar 23, 2023
@curquiza curquiza linked a pull request Mar 30, 2023 that will close this issue
@curquiza curquiza added the prototype available You can test this feature using the available prototype label Mar 30, 2023
@Kerollmops
Member Author

Here are some news about what we plan to do in the coming months.

We worked with @dureuill to create a fake Meilisearch program that behaves like Meilisearch, i.e., with a task queue and everything. This little experiment will run on top of datafuselabs/openraft. We followed the guide and wanted to make it work with LMDB and its transactional commit system.

We use this fake Meilisearch experiment to meet the Cloud team requirements and adapt the solution to something useful that can be easily deployed and maintained. Once everyone is happy with it, we will transpose the changes onto Meilisearch itself. It's easier to experiment on a small codebase first than on one as big as Meilisearch's.

@Kerollmops
Member Author

More news 🎉

We discussed the current implementation, and @irevoire and @ManyTheFish thought about not sending the whole update files through the Raft log, as they would be too big. They think we can make the data.ms/update_files folder an S3fs-backed directory shared by all members.

When a new update arrives, the leader stores the file in this folder and sends the task info through the Raft log. The followers then store the raw tasks with the same date-times as the leader (so that task filters stay consistent across the cluster); we can use the register_raw_task function for this. Other members of the cluster will be able to read the file from there. We must delete the file only once the associated Raft log entry has been wiped out.
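A rough sketch of this flow, with invented types; `register_raw_task` is the function mentioned above, but the signature used here is an assumption:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Task metadata sent through the Raft log; the heavy update file
/// itself only lives in the shared (S3fs-backed) update_files folder.
struct RawTask {
    uid: u64,
    enqueued_at: String, // the leader's timestamp, reused verbatim by followers
    content_file: PathBuf,
}

/// Leader side: write the heavy file into the shared folder, then
/// send only the small metadata through the log.
fn leader_enqueue(shared_dir: &Path, uid: u64, enqueued_at: String, body: &[u8]) -> std::io::Result<RawTask> {
    let content_file = shared_dir.join(format!("{uid}.update"));
    fs::write(&content_file, body)?;
    Ok(RawTask { uid, enqueued_at, content_file })
}

/// Follower side: store the raw task with the leader's timestamps so
/// task filters stay consistent across the cluster.
fn follower_apply(task: RawTask, register_raw_task: impl Fn(u64, &str, &Path)) {
    register_raw_task(task.uid, &task.enqueued_at, &task.content_file);
}
```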

The mDNS communication system is relatively easy to implement. However, @irevoire suggested using the Quickwit/ChitChat solution.

We must commit at the right time and ensure that errors while writing on disk are well handled.

@ManyTheFish
Member

ManyTheFish commented Jul 26, 2023

Meeting notes and potential next steps; a rough sketch of the cluster messages implied by these steps follows after the lists.

  1. create a new task
    • leader creates a task
    • leader writes heavy files in a synchronized file system
    • leader sends the task skeleton
  2. receive and commit task
    • followers receive the task and commit independently
  3. create a batch
    • leader computes the task ids of the next batch to process
    • leader sends the batch's task ids to everybody
  4. receive and process batch
    • followers receive the task ids and create the batch
    • followers process the batch
    • synchronize commit? (-> load a snapshot on failure)
    • document addition that needs an index creation?
    • behavior when every node fails?
  5. snapshots of failures
    • write snapshot files in a synchronized file system
    • send a snapshot with a path
    • how to auto-clean old snapshots?
  6. other

Main TODO:

  • Index and process batch
  • Ensure that the snapshot process works
    • when a machine goes down
    • when a new machine is added to the cluster
  • Ask for help with the consistency of the cluster at the end of an indexing process (quorum check?)
  • auto-discovery (mdns)
  • Ability to exclude a node when it can't resynchronize with the cluster
  • if a node responds "BAD REQUEST" it should be excluded from the cluster
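For illustration, here is a sketch of the cluster messages that could carry steps 1, 3, and 5 of the notes above; all names are hypothetical and none of them exist in the Meilisearch codebase:

```rust
/// Hypothetical messages exchanged between the leader and followers.
enum ClusterMessage {
    /// Step 1: the task skeleton broadcast by the leader; the heavy
    /// file lives on the synchronized file system.
    RegisterTask { uid: u64, enqueued_at: String, content_file: Option<String> },
    /// Step 3: the task uids of the next batch, computed by the leader.
    StartBatch { batch_uid: u64, task_uids: Vec<u64> },
    /// Step 5: point followers at a snapshot on the synchronized file
    /// system instead of shipping its content.
    LoadSnapshot { path: String },
}
```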

@Gab-Menezes

Gab-Menezes commented Jul 31, 2023

Hi there, I have a little bit of experience using Openraft. I had a similar idea of using something else to distribute the snapshot (datafuselabs/openraft#507); in that issue I talked with the author of Openraft, so take a look if you want.

Also, I'm free to help if needed; I know a little bit about how Meilisearch deals with tasks and the task queue.

@ManyTheFish
Member

Hello, coming back to the replication solution: Raft doesn't seem to be the best approach we have found so far. As a reminder, our must-haves in terms of replication are:

  • Meilisearch should be highly available on READ
  • Meilisearch should not have any single point of failure
  • Meilisearch should be geo-replicated

With the current Meilisearch architecture, Raft doesn't play well with:

  • geo-replication, which is a must-have
  • the two-phase commit needed to ensure indexing consistency
  • and it brings a higher complexity for few benefits

A potential alternative would be to rely on ZooKeeper to handle the leader election and the ordering of the tasks to process, and on S3 to store and propagate the updates and snapshots. Note that S3 is necessary with a Raft architecture too. This solution seems the easiest to implement, seems promising for future HA on WRITE operations, and seems to allow having bigger machines specialized for indexing and smaller machines for search.
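For illustration, here is a minimal sketch of the classic ZooKeeper leader-election pattern this relies on, written against a hypothetical client trait rather than the API of any real ZooKeeper crate:

```rust
/// A hypothetical, minimal view of a ZooKeeper client; these method
/// names are illustrative, not the API of any particular crate.
trait Zk {
    /// Creates an ephemeral sequential znode under `prefix` and
    /// returns the created node's name (e.g. "candidate_0000000042").
    fn create_ephemeral_sequential(&self, prefix: &str, data: &[u8]) -> String;
    /// Lists the child node names of `path`.
    fn children(&self, path: &str) -> Vec<String>;
}

/// Each Meilisearch instance registers itself as an election candidate.
fn join_election(zk: &impl Zk, instance_id: &str) -> String {
    zk.create_ephemeral_sequential("/election/candidate_", instance_id.as_bytes())
}

/// Classic ZooKeeper leader election: the candidate with the lowest
/// sequence number is the leader, i.e. the only instance allowed to
/// decide the task order and to push updates/snapshots to S3.
fn am_i_leader(zk: &impl Zk, my_node_name: &str) -> bool {
    let mut candidates = zk.children("/election");
    candidates.sort();
    candidates.first().map(String::as_str) == Some(my_node_name)
}
```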

Before starting to implement a POC, we should ensure the following needs:

  • ensure the Meilisearch cloud infrastructure can handle a zookeeper (seems possible)
  • ensure zookeeper and S3 are well geo-replicated and consistent
  • prepare an architectural plan for the final solution
  • provide a time estimation for each step of the solution

@derekperkins

I would strongly vote for etcd over zookeeper if you're going to run a lockserver. A good majority of installations are going to happen on kubernetes, which runs on etcd, ensuring long term support. Zookeeper has stagnated for years.

@ManyTheFish
Member

ManyTheFish commented Aug 1, 2023

Hello @derekperkins,
thank you for your suggestion. We will investigate the etcd solution too and see what is more suitable for us. We keep experimenting with Zookeeper because it is one of the most battle-tested solutions in terms of replication, despite the stagnation.
See you!

@derekperkins

Zookeeper was definitely the original choice, but many popular datastores are eliminating it. etcd, mainly because it is the primary datastore for Kubernetes, is easier to run in a cloud-native way and is more likely to meet the needs of Meilisearch for the next decade.

@ManyTheFish ManyTheFish mentioned this issue Aug 2, 2023
@Gab-Menezes

@ManyTheFish @derekperkins There is a paper discussing the use of etcd (as used by Kubernetes) to build a native consensus algorithm in Kubernetes: paper

@derekperkins

FWIW, I'm a maintainer of Vitess, distributed MySQL. We initially had an interface to enable any lockserver, and we had implementations for etcd, zookeeper, consul, and k8s. k8s and consul have already been deprecated, and we're really only keeping zookeeper around because of existing users. If we were starting over today, we would most likely require etcd. Starting with Zookeeper in a new project would be a mistake IMO.

@irevoire
Member

irevoire commented Aug 3, 2023

Hey, I would love to hear more about what is wrong with Zookeeper; I’m struggling to understand the concrete issues with it?
The main resource I’ve read was the Jepsen analysis, which was really positive (https://aphyr.com/posts/291-call-me-maybe-zookeeper), and most people seem to be leaving Zookeeper for a custom solution crafted strictly for their use case (which we don’t have the time to develop right now).

Also, I looked a bit at etcd; I think it fulfills our requirements, but it seems harder to use and deploy. I’m not especially against it, I would just like to understand better what’s the point in moving to etcd.

Also, I’ve heard a bunch of times that basic raft doesn’t play well with geo-replication. Since etcd relies on raft, do you have info to share on the subject?

@derekperkins

derekperkins commented Aug 3, 2023

Hey, I would love to hear more about what is wrong with Zookeeper; I’m struggling to understand the concrete issues with it?

There's nothing concretely wrong with Zookeeper. IIRC, we have encountered a few types of queries that were less performant in zookeeper, but nothing major.

most people seem to be leaving Zookeeper for a custom solution

This is my core concern. Zookeeper was already stagnating while it was required for many popular projects, and all of these migrations are just going to accelerate that. I'm not pitching for Meilisearch to write their own custom implementation, just to use etcd, which, by virtue of being the only datastore for Kubernetes, is more likely to be around longer than Zookeeper. To be fair, etcd is currently looking for some maintenance help, but again, the Kubernetes connection makes it impossible to abandon.

I looked a bit at etcd I think it fulfills our requirement, but it seems harder to use and deploy

I guess that's subjective, I've had the opposite experience, though it admittedly has been a few years since I tried zookeeper. Bitnami maintains charts for both: etcd & zookeeper. I have experience with the etcd chart and it's really easy to deploy. Given the success I've had with it and some trust in bitnami, I'd guess that zookeeper is similarly easy to stand up.

Zookeeper runs on Java and takes more resources to run than etcd, which is written in Go. Zookeeper was written pre-cloud, whereas etcd was built in a cloud-native era; that is very hand-wavy and not objective, but it plays a role in my opinion.

[screenshot of the quote referenced below]

Planetscale runs on Vitess, using etcd. This is a quote from someone on their team about how little it takes for them to manage it at scale.

I’ve heard a bunch of times that basic raft doesn’t play well with geo-replication

I'm not an expert in this sphere, but Raft has been adopted by most of the NewSQL competitors, by the two projects I mentioned above that are migrating away from Zookeeper (Clickhouse and Kafka), and by your closest competitor, Typesense.

I don't think the choice of consensus algorithm is a reason to choose one over the other.

I would just like to understand better what’s the point in moving to etcd

At least from what is visible on this thread, the discussion didn't move to zookeeper until 2 days ago. If zookeeper was already built out and in place, I wouldn't suggest to anyone to do a rip and replace. My comments are based on the assumption that this is a pretty green field project.

Again, I'm not an expert in the weeds of zookeeper vs etcd, so I don't claim to 100% know the right choice for Meilisearch. I'm making some intuitive projections based on what I'm seeing in the distributed database world. As a ~2 year satisfied user of Meilisearch continuing to expand our usage, I want to see what's best for the project. If I have a specific bias, it's mostly based on operating etcd for several years in conjunction with our Vitess installation, all running on top of etcd in Kubernetes.

TL;DR: Zookeeper is probably fine, but choosing it now goes against trends in other distributed database systems

@irevoire
Member

irevoire commented Aug 7, 2023

Hey, thanks a lot for your comment; that helps a lot!

I'm not the expert in this sphere, but raft has been adopted by most of the NewSQL competitors, the two I mentioned above migrating away from zookeeper, Clickhouse and Kafka, and your closest competitor typesense:

The question was more « Is it ok to run instances all over the world in different dc »? Because a lot of people complain that raft is way too verbose and costs a lot of money. For example, afaik typesense isn’t geo-replicated in their cloud.
Kafka is, but they built a custom raft specifically for Kafka.

So, if you’re using etcd a lot I was just wondering how it is deployed and if it’s a good idea to have it synchronize for example one instance in Paris and one in Tokyo.

At least from what is visible on this thread, the discussion didn't move to zookeeper until 2 days ago. If zookeeper was already built out and in place, I wouldn't suggest to anyone to do a rip and replace. My comments are based on the assumption that this is a pretty green field project.

To keep it short, let’s say that the HA is a common effort with the meilisearch cloud team and thus we can’t swap our solution every week or no one is going to progress. But ideally, we would really like to take the time to try out etcd and probably other stuff at some point.


You also said you dropped support for Consul. I see that it has been proposed to us as an alternative to Zookeeper; do you have any info to share about this one as well? 👀

@derekperkins

derekperkins commented Aug 7, 2023

The question was more « Is it ok to run instances all over the world in different dc »? Because a lot of people complain that raft is way too verbose and costs a lot of money

My answer is the same as before - almost every new database system is using raft for running clusters around the world, so it's hard for me to believe that there's a fundamental problem with the algorithm. Vague references to "lots of people complain" don't help identify any specific issues you're facing.

if you’re using etcd a lot I was just wondering how it is deployed and if it’s a good idea to have it synchronize for example one instance in Paris and one in Tokyo.

The way that Vitess uses etcd / zookeeper is fundamentally different than what you're describing here. They aren't a part of the query serving path, they are only used for specific low-bandwidth operations like failovers. Actual data replication is handled by MySQL.

To keep it short, let’s say that the HA is a common effort with the meilisearch cloud team and thus we can’t swap our solution every week or no one is going to progress

I only know what's happening on this thread. My only goal was to add some more context so that zookeeper was chosen for its merits and not because it used to be the only choice.

If I were in your shoes, I would reach out to the people most actively committing in these other db systems and ask their opinion, what tradeoffs they've made, and how they would approach it for a db without existing replication like meilisearch. I'm sure you could find someone willing to get on a quick call to answer some of your questions, I just don't have the topical expertise to get any deeper.

Best of luck, I'm excited to continue watching Meilisearch evolve!

@irevoire
Member

irevoire commented Aug 8, 2023

Vague references to "lots of people complain" don't help identify any specific issues you're facing.

Generally speaking, the kind of feedback I heard was like:

  • Raft sends way too many logs everywhere, which costs a lot of money when going from one DC to another, especially when considering the kind of data we need to send.
  • This brings us to « multi-raft » implementations, which can synchronize multiple Raft clusters with fewer messages between clusters; in Rust, for example, openraft, which is way easier to use than raft-rs, doesn't support it afaik.
  • Once you have a multi-raft cluster, people complain that it's still super expensive because you need a Raft cluster of 3 instances in each location instead of a single machine. For example, if you just want to geo-replicate one instance in Tokyo, Paris, and SF, instead of deploying only 3 instances (and accepting degraded performance when one goes down), you must deploy 3 instances in each place, for a total of 9 instances.

@curquiza
Member

The team has worked on this, but the work is not over.
Moving this to the next milestone (v1.5.0), since we will continue working on it.

@curquiza curquiza removed this from the v1.4.0 milestone Sep 11, 2023
@curquiza curquiza added this to the v1.5.0 milestone Sep 11, 2023
@sandstrom
Contributor

  • Raft sends way too many logs everywhere, which costs a lot of money when going from one dc to another, especially when considering the kind of data we need to send

Are you talking about replicating all the search index data (write/update/delete) via Raft?

My understanding was that Raft was only used to coordinate replica set health. But any synchronisation of actual DB content (search index data in this case) would be handled outside Raft (usually as a stream of commands from the master node to slave nodes).

@curquiza curquiza modified the milestones: v1.5.0, v1.6.0 Oct 23, 2023
@curquiza curquiza removed this from the v1.6.0 milestone Dec 12, 2023
@andtii

andtii commented Mar 21, 2024

Is anyone currently working on this?

@curquiza
Member

Hello @andtii
This is work the team did over the past months (implementation design and proof of concept), but we had to delay the final implementation to work on other priorities. We will probably pick it up again during the year.

To give you more insight: since we observed that replication is more of an enterprise need, we want to focus on making Meilisearch replicate on our Cloud offer first. However, an open-source replication solution will come once we are sure we can make it available on the Cloud.

@andtii

andtii commented Mar 22, 2024

@curquiza Thanks for the input. At the moment we are running tests with the latest Meilisearch and multi-instance pods against a shared NFS disk in Azure Kubernetes. The problem we see is that when you create a new index on one pod, the other pods do not get the new index. Is there a way to run Meilisearch like this? How are you running it in the cloud at the moment to support scale-out?

@irevoire
Member

Hey @andtii,

Is there a way to run meilisearch like this?

No, sorry, you absolutely cannot do that:

  1. Meilisearch is not designed to be run multiple times against the same database.
  2. Meilisearch is not designed to work on shared NFS disks; this will cause database corruption.

As curqui stated, we're currently focused on polishing our enterprise solution first, about which I don't know if I can share details. But basically, it comes down to having a big queue that replicates all the updates to as many Meilisearch instances as you want.
You must take care that the queue always sends tasks in order.


If you want, you can also try out the two previous open solutions I made before. But keep in mind that I only did manual tests on them. There is a good chance they're hiding some bugs, and as far as I know, no one else has ever tried them.
We don't consider them production-ready, so run them at your own risk.
But if you do, giving us feedback and bug reports would really help us design a better open-source solution in the future.

Here’s the PR: #3971

To make it work, you must deploy a ZooKeeper instance and an S3 server, which all Meilisearch instances will connect to.

@andtii

andtii commented Apr 19, 2024

Any progress in this area? We looked into Typesense a bit since it supports scale-out using Raft, but it's nowhere near the performance of Meilisearch from our perspective. Still, this is a showstopper for us, since our infra needs to scale.

@dedene

dedene commented Apr 19, 2024

Yes, for us too: I'm loving the Meilisearch technology, but without replication and a way to ensure high availability, we currently cannot use it in production for any of our projects.

@curquiza
Member

Hello @andtii and @dedene

Thank you both for using and appreciating Meilisearch.
Unfortunately, so far we don't have the resources to work on a pure replication system for the open-source product, sorry.

I would like to know: what does "high availability" mean for you? It can be very different depending on the user.

You should know that on our Cloud offer we guarantee a strong minimal SLA, and you can even ask for more if you fall into the Enterprise plan. In this case, our team will deploy what is necessary to ensure the minimal uptime, and thus the high availability, you expect.
