Vasiliy Tolstov
@vtolstov
As I see it now, qemu sends a write request with len 4 and removes the oid from the vdi data index. I don't understand where the sheep process determines the oid's object file and sends fallocate to it. (In my code I try to determine it inside the write-obj handler: read the oid name at the offset and send fallocate to it.) But I don't know whether this is the right solution.
Vasiliy Tolstov
@vtolstov
anybody alive?
Hitoshi Mitake
@mitake
@vtolstov for the first question, I think we cannot avoid the data copy. It has a cost, but there are no good alternatives AFAIK
for the question about rename, the combination of clone/delete can be an alternative
Hitoshi Mitake
@mitake
for the question about discard, please see here https://github.com/sheepdog/sheepdog/blob/master/sheep/gateway.c#L801-L825 and here https://github.com/sheepdog/sheepdog/blob/master/sheep/gateway.c#L831-L851. Sheep determines removed objects based on the content of inode objects
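Roughly, the mechanism those lines implement can be sketched in C as below. This is a simplified illustration, not sheepdog's actual code: the inode layout, MAX_DATA_OBJS, the oid encoding, and remove_object are all assumptions made for this sketch. On an inode overwrite, entries that went from this vdi's id to zero mark data objects to discard:

#include <stdint.h>
#include <stdio.h>

#define MAX_DATA_OBJS (1u << 20)  /* assumed cap on data objects per vdi */

/* simplified inode: data_vdi_id[i] names the vid owning data object i (0 = absent) */
struct sd_inode {
  uint32_t vdi_id;
  uint32_t data_vdi_id[MAX_DATA_OBJS];
};

/* assumed oid layout: owning vid in the upper 32 bits, object index below */
static uint64_t vid_to_data_oid(uint32_t vid, uint32_t idx)
{
  return ((uint64_t)vid << 32) | idx;
}

static void remove_object(uint64_t oid)  /* stand-in for the real removal path */
{
  printf("removing object %016llx\n", (unsigned long long)oid);
}

/* on an inode write, discard data objects whose entries were cleared */
static void remove_discarded_objects(const struct sd_inode *old_inode,
                                     const struct sd_inode *new_inode)
{
  for (uint32_t i = 0; i < MAX_DATA_OBJS; i++)
    if (old_inode->data_vdi_id[i] == old_inode->vdi_id &&
        new_inode->data_vdi_id[i] == 0)
      remove_object(vid_to_data_oid(old_inode->vdi_id, i));
}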
Hitoshi Mitake
@mitake
@ShienXie sorry for keeping you waiting, below is my answer to your questions
  1. How does sheep handle the situation when one recovery process is still running and another recovery is triggered?
    The case is handled here: https://github.com/sheepdog/sheepdog/blob/master/sheep/recovery.c#L1361-L1373
    The global variable current_rinfo stores the ongoing recovery work. The next one is stored in next_rinfo, and it waits for the completion of the ongoing recovery (see the sketch after this list).

  2. How does sheep handle the situation when one recovery process is still running and a hardware or network failure occurs?
    Every hardware failure (a node crash or a network partition) increases the epoch number and must be handled by the recovery process. If auto recovery is enabled, it is handled as described in the first answer. If manual recovery is selected, a user-triggered recovery will create the recovery work.

  3. Is the total number of nested recoveries triggered at the same time limited?
    The total number is 1. Sheep allows only one pending recovery work, as handled here:
    https://github.com/sheepdog/sheepdog/blob/master/sheep/recovery.c#L1364-L1366
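
A minimal C sketch of that pattern, assuming invented names for everything except current_rinfo and next_rinfo (the worker function and struct layout here are not the real recovery.c code):

#include <stdint.h>
#include <stdlib.h>

struct recovery_info {
  uint32_t epoch;
  /* ... per-epoch recovery state ... */
};

static struct recovery_info *current_rinfo;  /* the recovery running right now */
static struct recovery_info *next_rinfo;     /* at most one pending recovery */

extern void run_recovery_work(struct recovery_info *rinfo);  /* stand-in worker */

/* called whenever a new epoch needs recovery */
static void start_recovery(struct recovery_info *rinfo)
{
  if (current_rinfo) {
    /* a recovery is already running: queue this one; a newer request
     * simply replaces the older pending one, so at most one waits */
    free(next_rinfo);
    next_rinfo = rinfo;
    return;
  }
  current_rinfo = rinfo;
  run_recovery_work(rinfo);
}

/* called when the running recovery completes */
static void finish_recovery(void)
{
  free(current_rinfo);
  current_rinfo = NULL;
  if (next_rinfo) {
    struct recovery_info *pending = next_rinfo;
    next_rinfo = NULL;
    start_recovery(pending);
  }
}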

Vasiliy Tolstov
@vtolstov
@mitake thanks!
Vasiliy Tolstov
@vtolstov
@mitake I have a new batch of questions. Why is the oid calculation done on the qemu side for read/write requests? Why doesn't the qemu sheepdog driver just send the vid, len, and buffer on read/write? Since qemu connects to only one sheep process at a time, requests are processed sequentially.
Shien Xie
@ShienXie
@mitake Also thanks!
Hitoshi Mitake
@mitake
@vtolstov this is because we cannot assume that an object written/read by qemu doesn't share a vid with its inode. Such a situation is common because of snapshot and clone
Hitoshi Mitake
@mitake
@vtolstov sorry, I mean "we cannot assume that an object written/read by qemu shares a vid with its inode."
Vasiliy Tolstov
@vtolstov
@mitake but why isn't this detection possible on the sheepdog daemon side?
Hitoshi Mitake
@mitake
@vtolstov a single object can belong to multiple inodes, so the relation isn't obvious from the sheep side
Vasiliy Tolstov
@vtolstov
@mitake how does qemu know about it?
if qemu sends a request with the base vid and the oid that needs to be read/written, why can't sheep detect that this oid belongs to other vids?
Hitoshi Mitake
@mitake
@vtolstov technically yes. But the operation is costly (sheep needs to read the inode object), and qemu needs to update the inode object in the case of COW; if that were performed in sheep, qemu would need to refresh its inode. So doing it on the sheep side isn't reasonable.
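To make the discussion concrete, here is a sketch of the client-side oid calculation (assuming the usual sheepdog-style encoding of the owning vid in the high bits and the object index in the low bits; the struct is a simplified stand-in for the real inode, not qemu's actual code):

#include <stdint.h>

#define SD_DATA_OBJ_SIZE (4u << 20)  /* sheepdog's 4MB data objects */
#define MAX_DATA_OBJS    (1u << 20)

/* simplified client-side copy of the inode */
struct sd_inode {
  uint32_t vdi_id;
  uint32_t data_vdi_id[MAX_DATA_OBJS];  /* per object: the vid that owns it */
};

/* assumed oid layout: owning vid in the upper 32 bits, object index below */
static uint64_t vid_to_data_oid(uint32_t vid, uint32_t idx)
{
  return ((uint64_t)vid << 32) | idx;
}

/* map a guest offset to the oid to access: data_vdi_id[idx] may name an
 * ancestor vdi after snapshot/clone, which is why "vid + offset" alone,
 * as seen by sheep, is not enough to derive the oid */
static uint64_t offset_to_oid(const struct sd_inode *inode, uint64_t offset)
{
  uint32_t idx = offset / SD_DATA_OBJ_SIZE;
  return vid_to_data_oid(inode->data_vdi_id[idx], idx);
}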
Vasiliy Tolstov
@vtolstov
@mitake thanks for the answers. I've been reading about partition rings and hash rings; why does sheepdog use a hash ring and not a partition ring?
Hitoshi Mitake
@mitake
what is the partition ring?
Vasiliy Tolstov
@vtolstov
@mitake ?
Vasiliy Tolstov
@vtolstov
@mitake also, I read the discard links that you provided, and I don't understand why, in the case of discard, qemu sends a WRITE request rather than a DISCARD request?
zeventh
@zeventh
is it by any chance possible to add a node just for quorum that does not store any data? We would like to set up a two-node cluster that holds VMs and a third node only for quorum/witness.
Shien Xie
@ShienXie
@mitake I have a question about sheepdog's tgt scsi driver. I found that no matter how many or how fast requests are sent to tgt from the initiator, at most 32 requests are forwarded to sheep concurrently. Why is that?
Vasiliy Tolstov
@vtolstov
@ShienXie hi! I'm trying to add hash sum support to sheepdog, so that every object has a hash sum in its inode.
But I have questions and don't know where I can get quick answers to them.
Vasiliy Tolstov
@vtolstov
@mitake is putting the hash sum on the inode object suitable?
When writing data to some object we need to recalculate its sum and write it to the inode offset for that block.
Big disadvantage: if the block size is 16MB, we need to recalculate the whole 16MB, not only the, say, 1MB that was written.
Second question: in EC mode, what do we checksum? Data and parity? Or something else? And where do we store this hash?
Hitoshi Mitake
@mitake
@ShienXie you mean tgt's sheepdog driver limits concurrently forwarded requests to 32? I don't see that. Could you show how you found it?
@vtolstov what is the use case of a hash sum in the inode?
also, sheep requires a write request for discarding objects because a discard can be expressed as an update of the inode object
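A hypothetical client-side sketch of that idea (not qemu's actual sheepdog driver; write_object and vid_to_vdi_oid are stand-ins): the discard becomes a plain write of zeroed entries over the relevant slice of the inode object, which sheep then interprets as in the gateway.c links above.

#include <stddef.h>
#include <stdint.h>

#define MAX_DATA_OBJS (1u << 20)

struct sd_inode {
  uint32_t vdi_id;
  uint32_t data_vdi_id[MAX_DATA_OBJS];
};

/* stand-ins for the real client plumbing */
extern uint64_t vid_to_vdi_oid(uint32_t vid);
extern void write_object(uint64_t oid, const void *buf, size_t len, size_t offset);

/* discard data objects [first, last): clear their inode entries and send an
 * ordinary WRITE of just that slice of the inode object */
static void discard_range(struct sd_inode *inode, uint32_t first, uint32_t last)
{
  for (uint32_t i = first; i < last; i++)
    inode->data_vdi_id[i] = 0;

  size_t off = offsetof(struct sd_inode, data_vdi_id) + first * sizeof(uint32_t);
  size_t len = (last - first) * sizeof(uint32_t);
  write_object(vid_to_vdi_oid(inode->vdi_id), &inode->data_vdi_id[first], len, off);
}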
Vasiliy Tolstov
@vtolstov

@mitake I think that updating the hash sum of a data object doesn't need an additional seek to write the new hash, because the inode object has a good chance of being on a different drive or node.
But now I think that recalculating the hash of the full data object on every write is expensive; maybe add a header before the data in the data object?
for example

struct data_obj_header {
  uint8_t  version;
  uint64_t data_offset;  /* where the payload starts; could be a smaller type */
  uint64_t hashes[];     /* one hash per fixed-size block, or something like this */
};

in this case we can divide the data object into fixed-size blocks, for example 1MB; then writing some data at a specific offset only needs to recalculate part of the data....
But this doesn't help to get more speed in the case of check or recovery, when we need to know the full hash of the object to check/recover it.

Shien Xie
@ShienXie
@mitake I already have the answer. It is the iscsi initiator which imposes a limit of 32 by default, not tgt's sheepdog driver. Thanks.
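For reference, in open-iscsi that per-LUN limit typically comes from iscsid.conf, where 32 is the shipped default (parameter names as in open-iscsi; check your distribution's copy):

# /etc/iscsi/iscsid.conf (open-iscsi defaults)
node.session.queue_depth = 32   # outstanding commands per LUN
node.session.cmds_max = 128     # outstanding commands per session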
Vasiliy Tolstov
@vtolstov
@mitake I found a solution: something like adler32.
Compute a checksum per chunk of the data object (for example, with a 1MB chunk size a 4MB object has 4 sums) and a sum of all the sums.
In this case, if we update the file partially, we only need to recompute the checksums for the touched blocks and update the overall sum.
In the case of recovery, first we can check the sum of checksums; if it differs, fetch the whole file or parts of it (if we want to minimize network overhead)
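A small C sketch of that scheme (the names and the checksum function are placeholders, not sheepdog code): one checksum per fixed-size chunk plus a combined sum over the chunk checksums, so a partial write recomputes only the touched chunks and the cheap outer sum.

#include <stdint.h>
#include <stddef.h>

#define OBJ_SIZE   (4u << 20)              /* 4MB object */
#define CHUNK_SIZE (1u << 20)              /* 1MB chunks -> 4 sums per object */
#define NR_CHUNKS  (OBJ_SIZE / CHUNK_SIZE)

struct obj_sums {
  uint32_t chunk[NR_CHUNKS];  /* one checksum per chunk */
  uint32_t total;             /* checksum over the chunk checksums */
};

extern uint32_t checksum(const void *buf, size_t len);  /* e.g. adler32 or crc32 */

/* after a write to [offset, offset+len), refresh only the touched chunks */
static void update_sums(struct obj_sums *s, const uint8_t *obj,
                        uint64_t offset, uint64_t len)
{
  uint32_t first = offset / CHUNK_SIZE;
  uint32_t last = (offset + len - 1) / CHUNK_SIZE;

  for (uint32_t i = first; i <= last; i++)
    s->chunk[i] = checksum(obj + (uint64_t)i * CHUNK_SIZE, CHUNK_SIZE);

  /* recovery compares this single value first; only on a mismatch does it
   * need the per-chunk sums to locate which 1MB pieces to re-fetch */
  s->total = checksum(s->chunk, sizeof(s->chunk));
}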
Vasiliy Tolstov
@vtolstov
@mitake can you explain to me how sheepdog works in EC mode when a client partially overwrites some data in an oid?
Does sheepdog need to re-read all the block data (4MB for example), update it in memory, and write it again? Or something else?
Hitoshi Mitake
@mitake
@vtolstov in such a case, you need to run dog vdi check manually
Vasiliy Tolstov
@vtolstov
@mitake I mean internally: how does sheepdog work in EC mode when a client wants to write some data? Can you explain?
Hitoshi Mitake
@mitake
it isn't supported. Users must do the checking after a crash during write operations
Vasiliy Tolstov
@vtolstov
@mitake sorry for my English. I'm interested in the technical aspects of how sheepdog works when a vdi has an EC mode like 3:2.
I've read about Reed-Solomon codes and understand that for each piece of original data we need to compute parity and write it to disk together with the data.
For new data that's all fine, but if a VM inside qemu tries to write over existing data, as I understand it sheepdog needs to read the whole oid back from disk, modify part of it in memory, recompute the parity, and write it to disk again. Or am I missing something?
Hitoshi Mitake
@mitake
@vtolstov sorry for my late reply. You mean what happens when an object on disk is updated partially?
Vasiliy Tolstov
@vtolstov
@mitake yes. I tried to check the code but can't understand what the sheepdog server does when clients partially update an oid.
Hitoshi Mitake
@mitake
@vtolstov currently the behaviour is undefined. We need to restore a healthy state from the most recent snapshot
garnser
@garnser
hi all, I'm currently running some experiments with sheepdog. I'm trying to create a logical data center out of two physical locations. However, I'm at a loss to figure out whether it's possible to ensure that a full replica is present in each location at any given time.
i.e. if DC1 crashes I want to be sure that DC2 has sufficient data to continue operations
Vasiliy Tolstov
@vtolstov
@mitake hi! I'm analyzing erasure coding and I think:
1) EC mode is suitable only for object storage, because in the case of block storage we need to re-read part of the oid to recalculate the parity for that oid.
2) copies mode is faster, because we only need to write the data at a specific offset on all nodes.
Where am I wrong?
Hitoshi Mitake
@mitake
@vtolstov I think you are correct. The problems are shared with RAID5. Anyway, the EC mode of sheepdog isn't mature, so I suggest just using the simple replication mode.
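For concreteness, the RAID5-style read-modify-write that makes partial overwrites expensive looks like this (a generic illustration, not sheepdog's EC code):

#include <stdint.h>
#include <stddef.h>

/* RAID5-style delta update: new_parity = old_parity ^ old_data ^ new_data.
 * Reed-Solomon parity strips update with the same shape, with the XOR
 * replaced by a GF(2^8) multiply-accumulate of (old_data ^ new_data).
 * The old data must first be read back from disk: that extra read is
 * exactly the partial-overwrite cost discussed above. */
static void update_parity(uint8_t *parity, const uint8_t *old_data,
                          const uint8_t *new_data, size_t len)
{
  for (size_t i = 0; i < len; i++)
    parity[i] ^= old_data[i] ^ new_data[i];
}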
sanwuqi
@sanwuqi
There's no chat here