
Vearch cluster status abnormal #747

Open
gy7779 opened this issue Jan 22, 2024 · 15 comments

gy7779 commented Jan 22, 2024

Vearch version: v3.4.2
Deployed with docker using the following image:
[image]
master * 1
router * 1
ps * 2
The config file is as follows:

[global]
  name = "vearch-test"
  data = ["/home/vearch/vearch-test_ps_65.218-13181/db","/home/vearch/vearch-test_ps_65.218-13181/db1"]
  log = "/home/vearch/vearch-test_ps_65.218-13181/logs"
  level = "info"
  signkey = "secret"
  skip_auth = true

[[masters]]
  name = "vearch-test-master"
  address = "192.168.65.218"
  api_port = 13817
  etcd_port = 13818
  etcd_peer_port = 13819
  etcd_client_port = 13820
  pprof_port = 13821

[ps]
  rpc_port = 13181
  raft_heartbeat_port = 13182
  raft_replicate_port = 13183
  heartbeat-interval = 200 # ms
  raft_retain_logs = 11000
  raft_replica_concurrency = 1
  raft_snap_concurrency = 1
  engine_dwpt_num = 8
  pprof_port = 13184
  private = false

After the cluster started, querying the cluster status (/_cluster/stats) returned:

[
  {
    "status": 550,
    "ip": "192.168.65.218:13081",
    "labels": null,
    "err": "runtime error: invalid memory address or nil pointer dereference"
  },
  {
    "status": 550,
    "ip": "192.168.65.218:13181",
    "labels": null,
    "err": "runtime error: invalid memory address or nil pointer dereference"
  }
]

Server list (/list/server):
The endpoint returns 500.
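
For reference, a minimal sketch of how these two endpoints can be queried with curl (assuming requests go to the master's HTTP API on api_port 13817 from the config above; adjust host and port to your deployment):

# Query overall cluster stats via the master's HTTP API
curl -s http://192.168.65.218:13817/_cluster/stats

# List the registered servers; this is the call that returns 500 here
curl -s http://192.168.65.218:13817/list/server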

The docker logs output is as follows:

2024/01/22 08:53:44 [Recovery] 2024/01/22 - 08:53:44 panic recovered:
runtime error: invalid memory address or nil pointer dereference
/env/app/go/src/runtime/panic.go:260 (0x4542b5)
/env/app/go/src/runtime/signal_unix.go:835 (0x454285)
/env/app/go/src/github.com/vearch/vearch/util/server/rpc/rpc_client.go:66 (0x12a44db)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:216 (0x12e21c7)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:191 (0x12e1ce9)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:166 (0x12e2f36)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:155 (0x12f2304)
/env/app/go/src/github.com/vearch/vearch/master/cluster_api.go:447 (0x12f22f8)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6ef41)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 (0xf6ef2c)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6e046)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:240 (0xf6e029)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6d0d0)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 (0xf6cd38)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 (0xf6c87c)
/env/app/go/src/net/http/server.go:2947 (0x70174b)
/env/app/go/src/net/http/server.go:1991 (0x6fc966)
/env/app/go/src/runtime/asm_amd64.s:1594 (0x472200)

Container logs:

{"level":"warn","ts":"2024-01-22T08:48:26.748714Z","caller":"etcdserver/server.go:343","msg":"exceeded recommended request limit","max-request-bytes":33554432,"max-request-size":"34 MB","recommended-request-bytes":10485760,"recommended-request-size":"10 MB"}
{"level":"warn","ts":"2024-01-22T08:48:26.753986Z","caller":"auth/store.go:1234","msg":"simple token is not cryptographically signed"}
INFO 2024-01-22 08:48:26,745 config.go:420 master's name:[vearch-test-master] master's domain:[192.168.65.218] and local master's ip:[192.168.65.218]
INFO 2024-01-22 08:48:26,746 config.go:427 found local master successfully :master's name:[vearch-test-master] master's ip:[192.168.65.218] and local master's name:[$MASTER_NAME]
INFO 2024-01-22 08:48:26,746 config.go:261 etcd init cluster state: [new]
INFO 2024-01-22 08:48:26,747 signals.go:49 Wait Signals...
INFO 2024-01-22 08:48:28,055 server.go:83 Server is ready!
INFO 2024-01-22 08:48:28,055 config.go:113 vearch etcd address is 192.168.65.218:13820
INFO 2024-01-22 08:48:28,057 monitor_service.go:71 skip register monitoring
INFO 2024-01-22 08:48:30,160 config.go:113 vearch etcd address is 192.168.65.218:13820
INFO 2024-01-22 08:48:30,161 meta.go:54 Server create meta to file is: /home/fsp/vearch/vearch-gy-test_ps_65.218-13081/db/server_meta.txt
INFO 2024-01-22 08:48:30,162 signals.go:49 Wait Signals...
INFO 2024-01-22 08:48:30,180 server.go:203 to register master, nodeId:[1], times : 0
INFO 2024-01-22 08:48:30,189 server.go:217 register master ok, nodeId:[1]
INFO 2024-01-22 08:48:30,191 server.go:147 vearch server successful startup...
INFO 2024-01-22 08:48:31,492 config.go:113 vearch etcd address is 192.168.65.218:13820
INFO 2024-01-22 08:48:31,493 signals.go:49 Wait Signals...
INFO 2024-01-22 08:48:31,493 meta.go:54 Server create meta to file is: /home/fsp/vearch/vearch-gy-test_ps_65.218-13181/db/server_meta.txt
INFO 2024-01-22 08:48:31,500 server.go:203 to register master, nodeId:[2], times : 0
INFO 2024-01-22 08:48:31,505 server.go:217 register master ok, nodeId:[2]
INFO 2024-01-22 08:48:31,506 server.go:147 vearch server successful startup...
INFO 2024-01-22 08:48:27,872 config.go:113 vearch etcd address is 192.168.65.218:13820
INFO 2024-01-22 08:48:27,884 master_cache.go:421 to start cache job begin
INFO 2024-01-22 08:48:28,063 master_cache.go:572 cache inited ok use time 179.376538ms
INFO 2024-01-22 08:48:28,064 signals.go:49 Wait Signals...

I performed no operations after starting the cluster; simply querying the cluster status produces the exception above. I would like to know whether this is a problem with my configuration or with the image.

zcdb (Member) commented Jan 22, 2024

Is this your complete config file? I don't see the router configuration.

gy7779 (Author) commented Jan 22, 2024

Is this your complete config file? I don't see the router configuration.

The router section was left out above:

[router]
  # port for server
  port = 13001
  pprof_port = 13002
  plugin_path = "plugin"

With this added, the config is complete.

zcdb (Member) commented Jan 22, 2024

Did you configure the router when you started it?

gy7779 (Author) commented Jan 22, 2024

Did you configure the router when you started it?

Yes. The master, router, and ps all start up normally.

zcdb (Member) commented Jan 22, 2024

How did you start the master, router, and ps?

gy7779 (Author) commented Jan 22, 2024

docker run -d --name vearch-gy-test_master_65.218-13817 --net=host \
  -v $DEPLOYBINDIR/conf/server_config.toml:/vearch/config.toml \
  -v /home/fsp/vearch/vearch-gy-test_master_65.218-13817/db:/home/fsp/vearch/vearch-gy-test_master_65.218-13817/db \
  -v /home/fsp/vearch/vearch-gy-test_master_65.218-13817/db1:/home/fsp/vearch/vearch-gy-test_master_65.218-13817/db1 \
  -v /home/fsp/vearch/vearch-gy-test_master_65.218-13817/logs:/home/fsp/vearch/vearch-gy-test_master_65.218-13817/logs \
  d000ea0175ea master

docker run -d --name vearch-gy-test_router_65.218-13001 --net=host \
  --cpuset-mems="1" --cpuset-cpus="1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31" \
  -v $DEPLOYBINDIR/conf/server_config.toml:/vearch/config.toml \
  -v /home/fsp/vearch/vearch-gy-test_router_65.218-13001/db:/home/fsp/vearch/vearch-gy-test_router_65.218-13001/db \
  -v /home/fsp/vearch/vearch-gy-test_router_65.218-13001/db1:/home/fsp/vearch/vearch-gy-test_router_65.218-13001/db1 \
  -v /home/fsp/vearch/vearch-gy-test_router_65.218-13001/logs:/home/fsp/vearch/vearch-gy-test_router_65.218-13001/logs \
  d000ea0175ea router

docker run -d --name vearch-gy-test_ps_65.218-13081 --net=host \
  -m 26000M --memory-swap 0M \
  -v $DEPLOYBINDIR/conf/server_config.toml:/vearch/config.toml \
  -v /home/fsp/vearch/vearch-gy-test_ps_65.218-13081/db:/home/fsp/vearch/vearch-gy-test_ps_65.218-13081/db \
  -v /home/fsp/vearch/vearch-gy-test_ps_65.218-13081/db1:/home/fsp/vearch/vearch-gy-test_ps_65.218-13081/db1 \
  -v /home/fsp/vearch/vearch-gy-test_ps_65.218-13081/logs:/home/fsp/vearch/vearch-gy-test_ps_65.218-13081/logs \
  d000ea0175ea ps
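
As a quick sanity check that all three roles actually came up (a minimal sketch; the name filter assumes the container naming scheme above):

# List the vearch containers and their status
docker ps --filter "name=vearch-gy-test" --format "{{.Names}}\t{{.Status}}"

# Tail each container's log for startup errors
for c in $(docker ps --filter "name=vearch-gy-test" --format "{{.Names}}"); do
  echo "== $c =="; docker logs --tail 20 "$c"
done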

zcdb (Member) commented Jan 22, 2024

I started 1 master, 1 router, and 1 ps with your configuration on my side and had no problems. My machines are limited right now. How did you start the second ps: with exactly the same parameters, and on a new machine?

gy7779 (Author) commented Jan 22, 2024

Everything runs on one machine, and the commands are identical. https://hub.docker.com/r/vearch/vearch/tags (f212feadc153) — did you also test with this latest image? Version 3.2.7 worked fine for me before; only this latest image throws the exception.

zcdb (Member) commented Jan 22, 2024

Yes, that is exactly the image I used.

gy7779 (Author) commented Jan 23, 2024

Yes, that is exactly the image I used.

I tested the 1 master / 1 router / 1 ps setup with this image as well, and hit the same exception.
I also moved to a new machine for testing; with the same configuration, the same exception still occurred.
Below is the only exception log I was able to collect:

2024/01/23 01:51:43 [Recovery] 2024/01/23 - 01:51:43 panic recovered:
runtime error: invalid memory address or nil pointer dereference
/env/app/go/src/runtime/panic.go:260 (0x4542b5)
/env/app/go/src/runtime/signal_unix.go:835 (0x454285)
/env/app/go/src/github.com/vearch/vearch/util/server/rpc/rpc_client.go:66 (0x12a44db)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:216 (0x12e21c7)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:191 (0x12e1ce9)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:166 (0x12e2f36)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:155 (0x12f2304)
/env/app/go/src/github.com/vearch/vearch/master/cluster_api.go:447 (0x12f22f8)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6ef41)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 (0xf6ef2c)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6e046)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:240 (0xf6e029)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6d0d0)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 (0xf6cd38)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 (0xf6c87c)
/env/app/go/src/net/http/server.go:2947 (0x70174b)
/env/app/go/src/net/http/server.go:1991 (0x6fc966)
/env/app/go/src/runtime/asm_amd64.s:1594 (0x472200)

On a completely fresh cluster, calling /list/server produces the exception above.

Both of my test machines have the avx and avx512 instruction sets; the CPU model is Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz. Both machines run the 3.2.7 image normally; only the latest image has the problem.
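
For completeness, here is how the instruction-set claim can be verified on the host (a minimal sketch using standard Linux tooling, nothing Vearch-specific):

# List the AVX/AVX-512 feature flags the CPU reports
grep -o 'avx[0-9a-z_]*' /proc/cpuinfo | sort -u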

zcdb (Member) commented Jan 23, 2024

I also tried starting two ps instances on my side, and it worked as well. Please paste the logs of the master, router, and ps containers.

gy7779 (Author) commented Jan 23, 2024

docker logs:
master

2024/01/23 01:49:01 startup.go:65: [ERROR] start server by version:[v3.4.2] commitID:[74199eebaac04fd23e5818e8209a6f109062628e]
2024/01/23 01:49:01 startup.go:74: [ERROR] The Config File Is: /vearch/config.toml
2024/01/23 01:49:01 startup.go:82: [ERROR] The cluster prefix is: /vearch-gy-test
[GIN] 2024/01/23 - 01:49:05 | 200 |    3.079169ms |  192.168.65.210 | POST     "/register?clusterName=vearch-gy-test&nodeID=1"
[GIN] 2024/01/23 - 01:49:13 | 200 |   24.138447ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:49:13 | 404 |       1.089µs |  192.168.16.125 | GET      "/favicon.ico"
[GIN] 2024/01/23 - 01:49:16 | 200 |   21.437802ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:49:16 | 200 |   21.769217ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:50:19 | 200 |   21.147779ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:50:20 | 200 |   21.643313ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:50:26 | 200 |   20.976514ms |  192.168.16.125 | GET      "/_cluster/stats"
[GIN] 2024/01/23 - 01:51:43 | 500 |    4.881923ms |  192.168.16.125 | GET      "/list/server"


2024/01/23 01:51:43 [Recovery] 2024/01/23 - 01:51:43 panic recovered:
runtime error: invalid memory address or nil pointer dereference
/env/app/go/src/runtime/panic.go:260 (0x4542b5)
/env/app/go/src/runtime/signal_unix.go:835 (0x454285)
/env/app/go/src/github.com/vearch/vearch/util/server/rpc/rpc_client.go:66 (0x12a44db)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:216 (0x12e21c7)
/env/app/go/src/github.com/vearch/vearch/client/ps.go:191 (0x12e1ce9)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:166 (0x12e2f36)
/env/app/go/src/github.com/vearch/vearch/client/ps_admin_service.go:155 (0x12f2304)
/env/app/go/src/github.com/vearch/vearch/master/cluster_api.go:447 (0x12f22f8)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6ef41)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/recovery.go:102 (0xf6ef2c)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6e046)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/logger.go:240 (0xf6e029)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/context.go:174 (0xf6d0d0)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:620 (0xf6cd38)
/root/go/pkg/mod/github.com/gin-gonic/gin@v1.9.1/gin.go:576 (0xf6c87c)
/env/app/go/src/net/http/server.go:2947 (0x70174b)
/env/app/go/src/net/http/server.go:1991 (0x6fc966)
/env/app/go/src/runtime/asm_amd64.s:1594 (0x472200)

router

2024/01/23 01:49:02 startup.go:65: [ERROR] start server by version:[v3.4.2] commitID:[74199eebaac04fd23e5818e8209a6f109062628e]
2024/01/23 01:49:02 startup.go:74: [ERROR] The Config File Is: /vearch/config.toml
2024/01/23 01:49:02 startup.go:82: [ERROR] The cluster prefix is: /vearch-gy-test

ps

2024/01/23 01:49:05 startup.go:65: [ERROR] start server by version:[v3.4.2] commitID:[74199eebaac04fd23e5818e8209a6f109062628e]
2024/01/23 01:49:05 startup.go:74: [ERROR] The Config File Is: /vearch/config.toml
2024/01/23 01:49:05 startup.go:82: [ERROR] The cluster prefix is: /vearch-gy-test
2024/01/23 01:49:05 server.go:184: INFO : server pid:1

Container logs:
vearch-log.tar.gz

zcdb (Member) commented Jan 23, 2024

The logs all look normal. Please share your environment details: the OS version and the detailed cpuinfo.

gy7779 (Author) commented Jan 24, 2024

The logs all look normal. Please share your environment details: the OS version and the detailed cpuinfo.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz
Stepping: 7
CPU MHz: 2100.000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 11264K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities

zcdb (Member) commented Jan 25, 2024

I found a machine that supports avx512 and tried there as well, but I still cannot reproduce the situation you describe.
