Frank
@fwessels
:thumbsup:
Phil B
@pb30
s3git doesn't store file names or paths, right?
Frank
@fwessels
That is correct, not by itself
A layer on top of s3git (using the s3git-go lib) could take care of this, either by storing it in s3git as eg some sort of JSON object or by storing it in some KV database like DynamoDB or Redis etc.
We are also thinking of tagging-like behaviour, i.e. tagging one or more objects and subsequently being able to either query for the objects with a tag (or set of tags) or pin all objects with a tag (making sure they are available locally, even when offline)
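A minimal sketch of the "JSON object" variant of such a layer, using only the Go standard library. The Entry and Manifest types and their methods are hypothetical names made up for illustration; nothing here is part of s3git or s3git-go, and the hash strings are placeholders. The encoded bytes could then be added to the repo like any other blob.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Entry is a hypothetical record linking a content hash to a file name.
type Entry struct {
	Hash string   `json:"hash"`
	Name string   `json:"name"`
	Tags []string `json:"tags,omitempty"`
}

// Manifest is a hypothetical name index kept outside of s3git itself.
type Manifest struct {
	Entries []Entry `json:"entries"`
}

// Encode serializes the manifest to JSON.
func (m Manifest) Encode() ([]byte, error) {
	return json.Marshal(m)
}

// DecodeManifest parses a manifest previously produced by Encode.
func DecodeManifest(data []byte) (Manifest, error) {
	var m Manifest
	err := json.Unmarshal(data, &m)
	return m, err
}

// Lookup returns the hash stored for a file name, if any.
func (m Manifest) Lookup(name string) (string, bool) {
	for _, e := range m.Entries {
		if e.Name == name {
			return e.Hash, true
		}
	}
	return "", false
}

func main() {
	m := Manifest{Entries: []Entry{
		{Hash: "placeholder-hash-1", Name: "photos/cat.jpg"},
	}}
	data, err := m.Encode()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```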
Frank
@fwessels
With very large repositories “traditional” filenames however tend to become harder to deal with, i.e. how are you going to name 100 million objects as in the https://github.com/s3git/s3git#clone-the-yfcc100m-dataset example?
(of course you can name them file1 through to file100000000 but how much information does that still convey? :smile: )
With these kinds of repositories you ideally want really good search technology as a layer on top of s3git so that you can query the repo (and ideally have dependable automatic tagging of objects out of taxonomies or something)
A. Elleuch
@vadmeste
@fwessels, we can say that the internal state of s3git will be safe and not corrupted if https://github.com/s3git/s3git-go/tree/master/internal is stable, right?
Frank
@fwessels
That is correct, I expect most changes still in the core sub-module. This is where the meta-objects live, like for commits & trees. All these objects are versioned so that it should be possible to increase the version of an object if breaking changes are introduced while maintaining backwards compatibility
Frank
@fwessels
The cas part that is responsible for the blobs themselves using the BLAKE2 tree hashing is stable, meaning that you can always get your data back out that you put in
And if you push in ‘hydrated’ mode the files are reconstructed in the cloud storage, see https://github.com/s3git/s3git/blob/master/BLAKE2.md#hydrated
Frank
@fwessels
@vadmeste any particular reason for asking?
BTW, also please check out the Ruby gem at https://github.com/s3git/s3git-rb including examples for nice and easy integration for the DevOps community
A. Elleuch
@vadmeste
@fwessels, not really.. I am slowly digging into the code and found some small bugs, like s3git status in an uninitialized repo, or s3git log, which freezes when the user has not made any commits yet. So I thought it would be good to have a simple API that is small and solid, and it seems you already did that ^^
Frank
@fwessels
Great, please submit any bugs that you find, it would be great to fix them
A. Elleuch
@vadmeste
sure
Frank
@fwessels
I also need to expand on the testing side of s3git-go, but if this is tested well, then both s3git and s3git-rb should be much less error prone
It should actually not be that hard to also create a wrapper for e.g. Python or so, the Ruby gem basically interfaces to a libs3git.h/.so which is built like this: go build -buildmode=c-shared -o libs3git.so libs3git.go
Really nice feature introduced with Go 1.5
A. Elleuch
@vadmeste
yes.. not difficult with java too
Frank
@fwessels
Right, well, basically just about any language… :smile:
Let me know if you have any particular questions about the code, be happy to answer
I also want to write an ‘Architecture/Design’ kind of document but haven’t gotten around to doing that
A. Elleuch
@vadmeste
yes, however, with java you have to remember that it has a garbage collector that can free objects at any moment, so objects shared between java and c/c++ will need a proper way to deal with that
sure, I will let you know
This is basically the interface to and from Go, which is pretty straightforward, except for the adding and getting of blobs which uses a workaround through a file at the moment (need to change to a proper stream here)
A. Elleuch
@vadmeste
yes, I see
Phil B
@pb30
are there plans to make chunk size configurable, or is there an optimization reason it's 5mb? wondering if it could be much lower
Frank
@fwessels
Yes, there are plans for this and the reason that it is 5mb by default is that it is in sync with S3 for multi-part uploads
It could even be set as low as 640 kb (or lower still) so that DynamoDB could be used as back end in the cloud
Any particular reason for asking?
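To get a feel for what a lower chunk size means, here is a small sketch of the arithmetic in Go. The chunkCount helper is a hypothetical name for illustration (not an s3git function), and it assumes that 5mb means 5 × 1024 × 1024 bytes and 640 kb means 640 × 1024 bytes.

```go
package main

import "fmt"

// chunkCount returns how many chunks are needed to cover size bytes
// when splitting into pieces of at most chunkSize bytes (ceiling division).
func chunkCount(size, chunkSize int64) int64 {
	if size <= 0 || chunkSize <= 0 {
		return 0
	}
	return (size + chunkSize - 1) / chunkSize
}

func main() {
	const fileSize = 100 * 1024 * 1024 // an example 100 MB file

	// Assumed byte values for the two chunk sizes mentioned in the chat.
	fmt.Println(chunkCount(fileSize, 5*1024*1024)) // 5 MB chunks
	fmt.Println(chunkCount(fileSize, 640*1024))    // 640 KB chunks
}
```

Under these assumptions the same 100 MB file goes from 20 chunks to 160 chunks, so a smaller chunk size means more objects in the back end.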
David Gamba
@DavidGamba
Hi, is there a way to get filenames or some other metadata for my files in s3git? Once I clone I can't tell apart my binary files
Frank
@fwessels
Not yet, at the moment you would have to do that yourself
Also a layer on top of s3git (using the s3git-go lib) could take care of this, e.g. by storing this information in DynamoDB or another KV store
We are also thinking of tagging-like behaviour, i.e. tagging one or more objects and subsequently being able to query for the objects with a tag
(and you could tag an object with a filename)
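The tagging idea could look roughly like the in-memory sketch below, with a filename used as just another tag. TagIndex and its methods are hypothetical names for illustration only; this is not an existing s3git feature or API, and the hashes are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// TagIndex is a hypothetical in-memory index: tag -> set of object hashes.
type TagIndex map[string]map[string]struct{}

// Tag attaches one or more tags to the object with the given hash.
func (t TagIndex) Tag(hash string, tags ...string) {
	for _, tag := range tags {
		if t[tag] == nil {
			t[tag] = make(map[string]struct{})
		}
		t[tag][hash] = struct{}{}
	}
}

// Query returns the sorted hashes of all objects that carry every given tag.
func (t TagIndex) Query(tags ...string) []string {
	if len(tags) == 0 {
		return nil
	}
	var out []string
	for hash := range t[tags[0]] {
		all := true
		for _, tag := range tags[1:] {
			if _, ok := t[tag][hash]; !ok {
				all = false
				break
			}
		}
		if all {
			out = append(out, hash)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	idx := TagIndex{}
	// Placeholder hashes; the filename is stored as a tag like any other.
	idx.Tag("hash-a", "name:report.pdf", "2016")
	idx.Tag("hash-b", "name:photo.jpg", "2016")

	fmt.Println(idx.Query("name:report.pdf"))
	fmt.Println(idx.Query("2016"))
}
```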
David Gamba
@DavidGamba
Thanks @fwessels. For smaller repos maintaining this information in a file or sqlite db might be easier to use as you wouldn't have to use anything other than S3. WDYT?
Frank
@fwessels
Correct, that would be great. In fact if you use the s3git-go library then that should be fairly straightforward (and a nice addition)
Frank
@fwessels
@pb30 if you pull the latest commit and build from source, you can now define the chunksize
Create a new repo using s3git init, and then edit the file .s3git.config
In there you will find the s3gitChunkSize property which by default is still 5MB
However you can change it to eg. 1048576 for 1MB chunks
Note that changing the chunksize has consequences for the hashes as they are computed
Thus for identical content, if you add them while using different chunksizes you will get different hashes
(and the content will be stored twice, so you will not lose any content, but it is best to change it right after initialization and leave it at that)
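Why the hash depends on the chunk size can be shown with a simplified stand-in. The sketch below does not use s3git's BLAKE2 tree hashing; it swaps in SHA-256 from the Go standard library, hashes each chunk, and then hashes the concatenated chunk digests. The chunkedHash function is a hypothetical name for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkedHash hashes every chunk separately and then hashes the
// concatenation of the chunk digests. This is a simplified stand-in,
// NOT the BLAKE2 tree hashing that s3git uses.
func chunkedHash(data []byte, chunkSize int) string {
	if chunkSize <= 0 {
		panic("chunkSize must be positive")
	}
	root := sha256.New()
	for start := 0; start < len(data); start += chunkSize {
		end := start + chunkSize
		if end > len(data) {
			end = len(data)
		}
		leaf := sha256.Sum256(data[start:end])
		root.Write(leaf[:])
	}
	return hex.EncodeToString(root.Sum(nil))
}

func main() {
	data := make([]byte, 3<<20) // 3 MB of identical content

	// Same content, different chunk sizes: the resulting hashes differ.
	fmt.Println(chunkedHash(data, 1<<20))
	fmt.Println(chunkedHash(data, 2<<20))
}
```

The chunk boundaries decide which leaf digests exist, so identical content split differently ends up under a different top-level hash.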
Frank
@fwessels
(note also that at the moment there is not yet an s3git config command or similar to nicely set it from the command line)
Frank
@fwessels
Just added a new repo with support for Python: https://github.com/s3git/s3git-py