
Internal Server Error and EIO on FUSE mount for remote EC shard read failure #5465

eliphatfs opened this issue Apr 3, 2024 · 14 comments

@eliphatfs

Describe the bug
When an EC shard is corrupted on disk, the volume server does not try to recover the data from the other shards; instead it returns a 500 Internal Server Error.

Apr 03 16:13:02 aries-b02 seaweedfs-volume-0[501258]: I0403 16:13:02.645677 store_ec.go:288 read remote ec shard 1201.5 from 10.8.150.91:8080
Apr 03 16:13:02 aries-b02 seaweedfs-volume-0[501258]: I0403 16:13:02.647847 volume_server_handlers_read.go:160 read /1201,0a4a115622d1530d isNormalVolume false error: readbytes: entry not found: offset 27663713352 found id 34313936356436363739353636653636 size -2007146114, expected size 168275

The problem reproduces 100% of the time with this file id; the affected shard data is available at https://s3-haosu.nrp-nautilus.io/ruoxi-bucket/1201.tar.
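
For context, the failing read in the log above is for shard 1201.5 hosted on a remote volume server. The shard-to-server layout can be listed from weed shell with something like the following (piping commands into weed shell and the -master address are assumptions; an interactive session works as well):

# list volumes and EC shards per volume server; look for which node holds 1201.5
echo "volume.list" | sudo weed shell -master=10.8.149.13:9333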

System Setup

  • /usr/local/bin/weed server -volume=0 -filer -dir=/weedfs
  • /usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333 on 8 machines different from the master
  • These commands run as systemd services, set up exactly as described in the wiki page (a minimal unit sketch follows the config listings below).
  • OS version: Ubuntu 22.04
  • output of weed version: version 8000GB 3.63 54d7748 linux amd64
  • if using filer, show the content of filer.toml: no filer.toml
  • security:
[access]
ui = false

[grpc]
ca = "/etc/ariesdockerd/certs/Aries_SeaweedFS_CA.crt"

[grpc.volume]
cert = "/etc/ariesdockerd/certs/volume01.crt"
key  = "/etc/ariesdockerd/certs/volume01.key"

[grpc.master]
cert = "/etc/ariesdockerd/certs/master01.crt"
key  = "/etc/ariesdockerd/certs/master01.key"

[grpc.filer]
cert = "/etc/ariesdockerd/certs/filer01.crt"
key  = "/etc/ariesdockerd/certs/filer01.key"

[grpc.client]
cert = "/etc/ariesdockerd/certs/client01.crt"
key  = "/etc/ariesdockerd/certs/client01.key"
  • master:
# Put this file to one of the location, with descending priority
#    ./master.toml
#    $HOME/.seaweedfs/master.toml
#    /etc/seaweedfs/master.toml
# this file is read by master

[master.maintenance]
# periodically run these scripts are the same as running them from 'weed shell'
scripts = """
  lock
  ec.encode -fullPercent=95 -quietFor=1h
  ec.balance -force
  volume.deleteEmpty -quietFor=24h -force
  volume.balance -force
  unlock
"""
sleep_minutes = 17          # sleep minutes between each script execution


[master.sequencer]
type = "raft"     # Choose [raft|snowflake] type for storing the file id sequence
# when sequencer.type = snowflake, the snowflake id must be different from other masters
sequencer_snowflake_id = 0     # any number between 1~1023


# configurations for tiered cloud storage
# old volumes are transparently moved to cloud for cost efficiency
[storage.backend]
[storage.backend.s3.default]
enabled = false
aws_access_key_id = ""         # if empty, loads from the shared credentials file (~/.aws/credentials).
aws_secret_access_key = ""     # if empty, loads from the shared credentials file (~/.aws/credentials).
region = "us-east-2"
bucket = "your_bucket_name"    # an existing bucket
endpoint = ""
storage_class = "STANDARD_IA"

# create this number of logical volumes if no more writable volumes
# count_x means how many copies of data.
# e.g.:
#   000 has only one copy, copy_1
#   010 and 001 has two copies, copy_2
#   011 has only 3 copies, copy_3
[master.volume_growth]
copy_1 = 7                # create 1 x 7 = 7 actual volumes
copy_2 = 6                # create 2 x 6 = 12 actual volumes
copy_3 = 3                # create 3 x 3 = 9 actual volumes
copy_other = 1            # create n x 1 = n actual volumes

# configuration flags for replication
[master.replication]
# any replication counts should be considered minimums. If you specify 010 and
# have 3 different racks, that's still considered writable. Writes will still
# try to replicate to all available volumes. You should only use this option
# if you are doing your own replication or periodic sync of volumes.
treat_replication_as_minimums = false

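For reference, the volume servers run under systemd units along the following lines (a minimal sketch only; the actual units follow the wiki page, and the unit file name and limits here are illustrative):

# /etc/systemd/system/seaweedfs-volume.service (illustrative sketch)
[Unit]
Description=SeaweedFS volume server
After=network.target

[Service]
ExecStart=/usr/local/bin/weed volume -max=400 -dir=/weedfs -mserver=10.8.149.13:9333
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
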
Expected behavior
When reading an EC shard fails, the volume server should try to reconstruct the data from the remaining shards instead of returning a 500.

@eliphatfs
Author

I loaded the original volume from a snapshot taken before EC encoding and it worked, so the problem should lie in the EC encode or shard distribution step.
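
(For illustration, a minimal sketch of that restore, assuming the snapshot holds the volume's 1201.dat/1201.idx files; the snapshot path is hypothetical, the service name and data dir are as in this setup:)

# stop the volume server, put the pre-EC volume files back, restart, then read the file id directly
sudo systemctl stop seaweedfs-volume-0
sudo cp /snapshots/1201.dat /snapshots/1201.idx /weedfs/
sudo systemctl start seaweedfs-volume-0
curl -v http://10.8.150.91:8080/1201,0a4a115622d1530d --output /dev/null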

@eliphatfs
Author

I manually rebuilt the shard but the error remains the same. It seems there is an error in the EC encoding process: either on-disk consistency is broken or there is a bug in the algorithm.
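
(For reference, the manual rebuild was done from weed shell along these lines; treat the exact invocation as a sketch:)

sudo weed shell -master=10.8.149.13:9333
> lock
> ec.rebuild -force
> unlock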

@chrislusf
Collaborator

Could you please make a copy of both the original and the ec volumes? And if the volume is ECed again, does it have the same problem?

@eliphatfs
Author

Thank you, I will check today.

@eliphatfs
Author

Yes. It reproduces.

ruoxi@aries-05:~$ curl http://10.8.149.9:8080/1201,0a4b19575d6a9583 -v
*   Trying 10.8.149.9:8080...
* Connected to 10.8.149.9 (10.8.149.9) port 8080 (#0)
> GET /1201,0a4b19575d6a9583 HTTP/1.1
> Host: 10.8.149.9:8080
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Disposition: inline; filename="3.h5"
< Content-Length: 123495
< Content-Type: application/x-hdf
< Etag: "664dce57"
< Last-Modified: Tue, 12 Mar 2024 09:20:26 GMT
< Server: SeaweedFS Volume 8000GB 3.63
< Date: Thu, 04 Apr 2024 22:44:19 GMT
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failure writing output to destination
* Closing connection 0
ruoxi@aries-05:~$ sudo weed shell
[sudo] password for ruoxi:
I0404 22:44:29.931814 masterclient.go:210 master localhost:9333 redirected to leader 10.8.149.9:9333
.master: localhost:9333
> lock
> ec.encode -h
Usage of ec.encode:
  -collection string
        the collection name
  -force
        force the encoding even if the cluster has less than recommended 4 nodes
  -fullPercent float
        the volume reaches the percentage of max volume size (default 95)
  -parallelCopy
        copy shards in parallel (default true)
  -quietFor duration
        select volumes without no writes for this period (default 1h0m0s)
  -volumeId int
        the volume id
> ec.encode -force
collect volumes quiet for: 3600 seconds and 95.0% full
ec encode volumes: [1201]
markVolumeReadonly 1201 on 10.8.149.9:8080 ...
generateEcShards  1201 on 10.8.149.9:8080 ...
parallelCopyEcShardsFromSource 1201 10.8.149.9:8080
allocate 1201.[0 1 2 3 4 5 6 7 8 9 10 11 12 13] 10.8.149.9:8080 => 10.8.149.9:8080
mount 1201.[0 1 2 3 4 5 6 7 8 9 10 11 12 13] on 10.8.149.9:8080
unmount 1201.[] from 10.8.149.9:8080
delete 1201.[] from 10.8.149.9:8080
delete volume 1201 from 10.8.149.9:8080
> unlock
> q
unknown command: q
> ruoxi@aries-05:~$ curl http://10.8.149.9:8080/1201,0a4b19575d6a9583 -v
*   Trying 10.8.149.9:8080...
* Connected to 10.8.149.9 (10.8.149.9) port 8080 (#0)
> GET /1201,0a4b19575d6a9583 HTTP/1.1
> Host: 10.8.149.9:8080
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Server: SeaweedFS Volume 8000GB 3.63
< Date: Thu, 04 Apr 2024 22:46:10 GMT
< Content-Length: 0
<
* Connection #0 to host 10.8.149.9 left intact
ruoxi@aries-05:~$

@eliphatfs
Author

Interestingly, the file becomes readable again after decoding the EC volume back to a normal volume.

ruoxi@aries-05:~$ sudo weed shell
I0404 22:47:08.184649 masterclient.go:210 master localhost:9333 redirected to leader 10.8.149.9:9333
.master: localhost:9333
> ec.decode -h
Usage of ec.decode:
  -collection string
        the collection name
  -volumeId int
        the volume id
> ec.decode -volumeId=1201
error: need to run "lock" first to continue
> lock
> ec.decode -volumeId=1201
ec volume 1201 shard locations: map[10.8.149.9:8080:16383]
collectEcShards: ec volume 1201 collect shards to 10.8.149.9:8080 from: map[10.8.149.9:8080:16383]
generateNormalVolume from ec volume 1201 on 10.8.149.9:8080
unmount ec volume 1201 on 10.8.149.9:8080 has shards: [0 1 2 3 4 5 6 7 8 9 10 11 12 13]
unmount 1201.[0 1 2 3 4 5 6 7 8 9 10 11 12 13] from 10.8.149.9:8080
delete ec volume 1201 on 10.8.149.9:8080 has shards: [0 1 2 3 4 5 6 7 8 9 10 11 12 13]
delete 1201.[0 1 2 3 4 5 6 7 8 9 10 11 12 13] from 10.8.149.9:8080
> unlock
> ruoxi@aries-05:~$ curl http://10.8.149.9:8080/1201,0a4b19575d6a9583 -v
*   Trying 10.8.149.9:8080...
* Connected to 10.8.149.9 (10.8.149.9) port 8080 (#0)
> GET /1201,0a4b19575d6a9583 HTTP/1.1
> Host: 10.8.149.9:8080
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Content-Disposition: inline; filename="3.h5"
< Content-Length: 123495
< Content-Type: application/x-hdf
< Etag: "664dce57"
< Last-Modified: Tue, 12 Mar 2024 09:20:26 GMT
< Server: SeaweedFS Volume 8000GB 3.63
< Date: Thu, 04 Apr 2024 22:47:58 GMT
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failure writing output to destination
* Closing connection 0

@eliphatfs
Author

I temporarily created a public port share at http://184.105.6.184:8000/1201-full.tar

@eliphatfs
Author

eliphatfs commented Apr 4, 2024

The behavior is still the same after regenerating the index with weed fix (run before ec.encode).
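
(For reference, the index regeneration was roughly as follows; the exact flag/argument form is whatever weed fix -h shows, so treat this as a sketch:)

# stop the volume server so the index file is not in use, regenerate 1201.idx from 1201.dat, restart
sudo systemctl stop seaweedfs-volume-0
sudo weed fix -volumeId=1201 /weedfs
sudo systemctl start seaweedfs-volume-0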

@chrislusf
Collaborator

could not access http://184.105.6.184:8000/1201-full.tar

@eliphatfs
Author

I just tried the link again and I can access it?

@eliphatfs
Author

Let me try Google Drive then.

@eliphatfs
Author

It is tricky to set up Google Drive credentials on the remote server, so I made an S3 share instead.

endpoint: https://s3-haosu.nrp-nautilus.io
bucket: seaweed

Could you please try loading the files with something like rclone? (A minimal rclone sketch follows the file list below.)

Alternatively, the objects are publicly readable via URLs like https://s3-haosu.nrp-nautilus.io/seaweed/bug5465/1201-ec/1201.ecx. The files under https://s3-haosu.nrp-nautilus.io/seaweed are:

bug5465/1201-ec/1201.ec00
bug5465/1201-ec/1201.ec01
bug5465/1201-ec/1201.ec02
bug5465/1201-ec/1201.ec03
bug5465/1201-ec/1201.ec04
bug5465/1201-ec/1201.ec05
bug5465/1201-ec/1201.ec06
bug5465/1201-ec/1201.ec07
bug5465/1201-ec/1201.ec08
bug5465/1201-ec/1201.ec09
bug5465/1201-ec/1201.ec10
bug5465/1201-ec/1201.ec11
bug5465/1201-ec/1201.ec12
bug5465/1201-ec/1201.ec13
bug5465/1201-ec/1201.ecj
bug5465/1201-ec/1201.ecx
bug5465/1201-ec/vol_dir.uuid
bug5465/1201-vol/1201.dat
bug5465/1201-vol/1201.idx
bug5465/1201-vol/1201.vif
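
As mentioned above, a minimal rclone sketch for pulling the files (the remote name is arbitrary, and anonymous access with empty keys is an assumption, since the bucket is public):

# ~/.config/rclone/rclone.conf
[nautilus]
type = s3
provider = Other
endpoint = https://s3-haosu.nrp-nautilus.io

# copy everything under bug5465/ locally
rclone copy nautilus:seaweed/bug5465 ./bug5465 -P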

@chrislusf
Collaborator

I have downloaded the files. I need some time to debug.

@eliphatfs
Author

Bump, how is this going?
