Prince Gupta
@guptaprince

File: html book 20161029.zip

Precomp(schnaader/precomp-cpp@9ddc049): -intense0 -cn-
Time: 33 minute(s), 15 second(s)
Size: 162,947,142 bytes

PZLIB V3 (Hotfix): -m2 -x -s -b128m -t1
Time: 2.8745 minutes
Size: 170,417,418 bytes

Reflate(v1l1): c
Time: 15.781 seconds
Size: 170,417,418 bytes

Don't you think Precomp is way too slow?
Is it because of the very small zlib streams inside, or something else?

Christian Schneider
@schnaader
Intense mode is very slow because of the temporary files that are created even if they aren't used; the next version will fix that. After that, it should be comparable to the pzlib result. Note that reflate will still be faster because it only recompresses once, something that will be addressed by difflate.
Abhishek Sharma
@RamiroCruzo
Yahallo sir, long time no see. Is difflate finally coming?
Prince Gupta
@guptaprince
What should be the blueprint for removing the temporary files?
Prince Gupta
@guptaprince
I'm thinking of taking the lazy approach: fmemopen on Linux, and on Windows CreateFileMapping followed by _fdopen to get a FILE*.
For buffers, I think half of Precomp would have to be rewritten.
Christian Schneider
@schnaader
Look at the paq variants with zlib routines; they are what I'll use as blueprints. Basically, there is no need to keep the whole file in memory, only 64 KB portions, since deflate uses 32 KB windows and recompression needs another 32 KB as "lookback".
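As a rough illustration of the 64 KB idea (not code from Precomp or paq8px), a sliding window can be sketched as a ring buffer that is addressed by absolute stream position. The class name `SlidingWindow` and its methods are hypothetical, and the 64 KB size is simply the 32 KB deflate window plus the 32 KB lookback mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sliding window: keeps only the most recent 64 KB of a stream.
class SlidingWindow {
public:
    // Assumed size: 32 KB deflate window + 32 KB lookback.
    static constexpr std::size_t kSize = 64 * 1024;

    // Append one byte, overwriting the oldest byte once the window is full.
    void put(std::uint8_t b) {
        buf_[static_cast<std::size_t>(total_ % kSize)] = b;
        ++total_;
    }

    // Total number of bytes seen so far (the current stream position).
    std::uint64_t position() const { return total_; }

    // True if the byte at absolute stream offset `pos` is still in the window.
    bool contains(std::uint64_t pos) const {
        return pos < total_ && total_ - pos <= kSize;
    }

    // Read the byte at absolute stream offset `pos`; it must still be buffered.
    std::uint8_t at(std::uint64_t pos) const {
        assert(contains(pos));
        return buf_[static_cast<std::size_t>(pos % kSize)];
    }

private:
    std::array<std::uint8_t, kSize> buf_{};
    std::uint64_t total_ = 0;
};
```

Addressing by absolute position means callers can keep using stream offsets as if they were seeking, as long as they never reach back further than the window.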
Christian Schneider
@schnaader
E.g. have a look at paq8px_v75.cpp
Christian Schneider
@schnaader
The memory mapping way is possible for all the other streams but also has drawbacks - with lazy approaches the stream has to fit into memory, which was the limit that led me to use temporary files originally. Also, as you said, there is no portable mmap, so it's an ifdef and testing hell. But it's a way to go, e.g. I did something similar with packJPG/MP3 (everything in memory for streams up to 64 MB, temporary files otherwise)
Things will get more complicated with stdin/stdout support which basically means "can't use fseek, have to use workarounds"
E.g. Issue #55
Abhishek Sharma
@RamiroCruzo
Instead of creating a file mapping, we can go with good old circular buffers: load a big, affordable chunk into memory. That will still allow us to seek, and it needs only one read and one write.
jagannatharjun
@jagannatharjun
I was working on AntiZ and noticed that it uses diff files. Does Precomp use them too?
Christian Schneider
@schnaader
Looking at the source, I guess you're talking about handling of partial matches and everything around the recomp-tresh/diffbytes variables.
Yes, Precomp has that, too. It calls these bytes "penalty bytes" and tolerates some of them in recompressed streams, e.g. see the function compare_file_mem_penalty.
Christian Schneider
@schnaader
I think both solutions aren't the best. I originally implemented them in Precomp because of PDF files, which often have one differing byte in streams (Precomp stores them together with 4 byte offsets, so the byte count is multiplied by 5 and it shows "Penalty bytes were used: 5 bytes"). See FlashMX.pdf (the ZIP is called pdf-test.zip) from this site for example.
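To make the "offset plus byte" bookkeeping concrete, here is a minimal sketch of the idea. It is not Precomp's compare_file_mem_penalty; the names `PenaltyByte`, `collect_penalty_bytes`, `penalty_storage_size` and `apply_penalty_bytes` are made up for illustration, and working on whole in-memory vectors is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One tolerated difference between the original and the recompressed stream.
struct PenaltyByte {
    std::uint32_t offset;   // position of the differing byte (4 bytes when stored)
    std::uint8_t original;  // byte value to restore at that position (1 byte)
};

// Returns true and fills `out` if both streams have the same length and differ
// in at most `max_penalties` positions; otherwise returns false with `out` empty.
bool collect_penalty_bytes(const std::vector<std::uint8_t>& original,
                           const std::vector<std::uint8_t>& recompressed,
                           std::size_t max_penalties,
                           std::vector<PenaltyByte>& out) {
    out.clear();
    if (original.size() != recompressed.size()) return false;
    for (std::size_t i = 0; i < original.size(); ++i) {
        if (original[i] == recompressed[i]) continue;
        if (out.size() == max_penalties) {  // too many differences: give up
            out.clear();
            return false;
        }
        out.push_back({static_cast<std::uint32_t>(i), original[i]});
    }
    return true;
}

// Storage cost as described above: 4 offset bytes + 1 value byte per difference.
std::size_t penalty_storage_size(const std::vector<PenaltyByte>& penalties) {
    return penalties.size() * 5;
}

// Patch a recompressed stream so that it matches the original again.
void apply_penalty_bytes(std::vector<std::uint8_t>& recompressed,
                         const std::vector<PenaltyByte>& penalties) {
    for (const PenaltyByte& p : penalties) {
        assert(p.offset < recompressed.size());
        recompressed[p.offset] = p.original;
    }
}
```

With a single differing byte, `penalty_storage_size` yields 5, which matches the "5 bytes" in the message quoted above.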
Christian Schneider
@schnaader
The best way to handle these differences would not be on byte level, but on bit level, as Deflate uses bitstreams. To make things worse, Deflate packs those bitstreams into bytes in reverse, e.g. the bitstream "ABCDEFGH IJKLMNOP" (16 bits) is encoded as "HGFEDCBA PONMLKJI", so even if e.g. only the part "HIJKLM" differed, an approach that doesn't consider this would get two separated diffs.
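The reversed packing can be reproduced with a few lines. This sketch follows the bit order described in RFC 1951 (the first bit of the stream goes into the least significant bit of a byte); the function names `pack_lsb_first` and `layout_msb_first` are hypothetical helpers, not part of any existing tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Pack bits given in stream order ('0'/'1' characters) into bytes the way
// Deflate does: the first stream bit becomes the least significant bit.
std::vector<std::uint8_t> pack_lsb_first(const std::string& bits) {
    std::vector<std::uint8_t> out((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1') {
            out[i / 8] = static_cast<std::uint8_t>(out[i / 8] | (1u << (i % 8)));
        }
    }
    return out;
}

// Show where each stream position (one label character per bit) ends up when
// the packed bytes are printed most significant bit first, as a binary dump
// would show them.
std::string layout_msb_first(const std::string& labels) {
    std::string out;
    for (std::size_t start = 0; start < labels.size(); start += 8) {
        std::size_t end = start + 8 < labels.size() ? start + 8 : labels.size();
        if (!out.empty()) out += ' ';
        for (std::size_t i = end; i > start; --i) out += labels[i - 1];
    }
    return out;
}
```

Calling `layout_msb_first("ABCDEFGHIJKLMNOP")` gives "HGFEDCBA PONMLKJI": the run H..M, contiguous in the bitstream, lands at the start of the first byte and the end of the second, which is why a byte-level diff sees two separate regions.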
Christian Schneider
@schnaader
If you want to test something in this direction, have a look at Matt Mahoney's Silesia page. The file "silesia.zip" from there has a ZIP stream for each of the 12 files. I didn't test it with AntiZ yet, but Precomp processes 8 of the streams and most of them are "heavy" partial matches, like 1.4 MB of 6.0 MB match.
Using "precomp -cn" on silesia.zip gives 88 MB instead of 67 MB which doesn't look bad at first, but considering that uncompressing every stream completely leads to 211 MB, it is.
Ah, I have tested it with AntiZ before, see this post at encode.ru
jagannatharjun
@jagannatharjun
Does Precomp store a full-file hash of the original file?
jagannatharjun
@jagannatharjun
One more thing: it looks like AntiZ uses a diff for every file rather than finding optimal zlib params. Isn't that what we need for headerless (i.e. raw deflate) streams?
jagannatharjun
@jagannatharjun
Even with no temporary files, Precomp is twice as fast as AntiZ :sparkles:
Christian Schneider
@schnaader
Precomp doesn't store a hash if -cn is used, see schnaader/precomp-cpp#66
AntiZ has a slightly different approach to the params: it uses some heuristics to make guesses about them. Also, IIRC, it doesn't remember parameters that were used before, something that speeds up Precomp for most files
jagannatharjun
@jagannatharjun
AntiZ tries to extract information from the headers. Doesn't Precomp?
Christian Schneider
@schnaader
Precomp does it too, but the headers are too small and unreliable. See RFC 1950: https://www.ietf.org/rfc/rfc1950.txt
There are 2 bits called "compression level" (FLEVEL) that indicate that fastest/fast/default/max compression was used, but with this information, you can't really decide the real compression level setting (0-9). At least the window size in there is reliable.
On the other hand, in other formats like ZIP, there simply is no ZLIB header, so you don't know anything.
These are pure deflate streams, https://www.ietf.org/rfc/rfc1951.txt
Christian Schneider
@schnaader
Precomp uses just the window size and ignores FLEVEL because it could be anything and it's not needed for decompression. AntiZ uses the information, but I think this is bad if FLEVEL is not correct.
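For reference, the two header bytes from RFC 1950 (CMF and FLG) can be decoded as in the sketch below. The struct and function names are hypothetical; the field layout (CM, CINFO, FCHECK, FDICT, FLEVEL) is taken from the RFC, and the sketch makes no claim about how Precomp or AntiZ implement it.

```cpp
#include <cassert>
#include <cstdint>

// Decoded fields of a 2-byte zlib header (RFC 1950).
struct ZlibHeader {
    bool valid;            // CM == 8, CINFO <= 7 and the FCHECK test passes
    unsigned window_size;  // LZ77 window in bytes: 2^(CINFO + 8)
    unsigned flevel;       // 0 = fastest, 1 = fast, 2 = default, 3 = maximum
    bool has_dict;         // FDICT: a preset dictionary ID follows the header
};

ZlibHeader parse_zlib_header(std::uint8_t cmf, std::uint8_t flg) {
    ZlibHeader h{};
    const unsigned cm = cmf & 0x0Fu;  // compression method, 8 = deflate
    const unsigned cinfo = cmf >> 4;  // log2(window size) - 8
    // CMF*256 + FLG must be a multiple of 31 for a well-formed header.
    h.valid = (cm == 8) && (cinfo <= 7) && ((cmf * 256u + flg) % 31u == 0);
    h.window_size = 1u << (cinfo + 8);
    h.flevel = flg >> 6;              // only a 2-bit hint, not the 0-9 level
    h.has_dict = (flg & 0x20u) != 0;
    return h;
}
```

For the common header bytes 0x78 0x9C this reports a 32 KB window and FLEVEL 2 ("default"), which still does not say which of the 0-9 level settings produced the stream.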
jagannatharjun
@jagannatharjun
Setting the window size to -15 and initializing compression byte by byte helps me detect them and gives me an almost correct inflation size
Test
But for the first file the ratio should be around 316 (by pzlib); I assume this is because of the diff
jagannatharjun
@jagannatharjun
How about the abstraction in hxim/paq8px@b1f1f50, recently introduced in paq8px_v125?
Christian Schneider
@schnaader
Yeah, I've seen that one. The latest Precomp version (0.4.6) is quite similar (everything in memory until 64 MB exceeded), but still writes temporary files when recursion is used - so it could be useful there.
The culprits in Precomp are things like the Multi PNGs and various parsings that jump between file positions. Also, recursion gets a filename instead of an abstract file/memory pointer. Changing it is possible, but quite some work, so I delayed it for now.
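The "in memory until a limit is exceeded, temporary file afterwards" behaviour can be sketched as a small class that hides the storage behind one interface. This is only an illustration under assumptions: the class name `SpillBuffer`, its methods and the use of std::tmpfile are choices made for this sketch, not the paq8px or Precomp implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical hybrid buffer: stays in memory up to `limit` bytes, then moves
// everything written so far into a temporary file and continues there.
class SpillBuffer {
public:
    explicit SpillBuffer(std::size_t limit) : limit_(limit) {}
    ~SpillBuffer() { if (file_) std::fclose(file_); }
    SpillBuffer(const SpillBuffer&) = delete;
    SpillBuffer& operator=(const SpillBuffer&) = delete;

    // Append `n` bytes; returns false on I/O failure.
    bool write(const std::uint8_t* data, std::size_t n) {
        if (!file_ && mem_.size() + n <= limit_) {
            mem_.insert(mem_.end(), data, data + n);
            size_ += n;
            return true;
        }
        if (!file_) {  // limit exceeded: spill the in-memory part to disk
            std::FILE* f = std::tmpfile();
            if (!f) return false;
            if (!mem_.empty() &&
                std::fwrite(mem_.data(), 1, mem_.size(), f) != mem_.size()) {
                std::fclose(f);
                return false;
            }
            file_ = f;
            mem_.clear();
        }
        if (std::fseek(file_, 0, SEEK_END) != 0) return false;
        if (std::fwrite(data, 1, n, file_) != n) return false;
        size_ += n;
        return true;
    }

    // Copy `n` bytes starting at offset `pos` into `out`.
    bool read(std::size_t pos, std::uint8_t* out, std::size_t n) {
        if (pos > size_ || n > size_ - pos) return false;
        if (!file_) {
            if (n > 0) std::memcpy(out, mem_.data() + pos, n);
            return true;
        }
        if (std::fseek(file_, static_cast<long>(pos), SEEK_SET) != 0) return false;
        return std::fread(out, 1, n, file_) == n;
    }

    std::size_t size() const { return size_; }
    bool in_memory() const { return file_ == nullptr; }

private:
    std::size_t limit_;
    std::size_t size_ = 0;
    std::vector<std::uint8_t> mem_;
    std::FILE* file_ = nullptr;
};
```

Code that receives a `SpillBuffer&` instead of a filename does not need to know which backing store is active, which is the kind of change recursion would need.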
jagannatharjun
@jagannatharjun
Something like this can give a window of 64 MB or so
Christian Schneider
@schnaader
Things I'd like to do before that are refactorings (at the moment much code is duplicated) and unit tests (so it can be automatically determined that such big changes didn't break anything)
Even small changes can lead to subtle bugs, like schnaader/precomp-cpp#76 shows
jagannatharjun
@jagannatharjun
such abstraction will help in refactoring
ZlibWrapper
jagannatharjun
@jagannatharjun
and unit tests
jagannatharjun
@jagannatharjun
"fout_fput*": what's wrong with casting the data directly to char?
Christian Schneider
@schnaader
Endianness mostly, though I guess a change in endianness would break something else anyway..
Christian Schneider
@schnaader
Since the software and the pcf files are platform independent, I wanted to make sure that 0x1234 is always written as 0x12, 0x34 instead of the usual 0x34, 0x12 and this won't change with platforms/compilers - also don't like the 3412 format because it isn't readable in hex editors :smile:
But there might be some clever way to both cast and ensure endianness.
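A minimal sketch of the byte-by-byte approach, assuming the goal is simply "most significant byte first on every platform". The helper names below are hypothetical and are not Precomp's fout_fput* functions; they work on a byte array so that the result does not depend on the host's endianness.

```cpp
#include <cassert>
#include <cstdint>

// Store a 16-bit value most significant byte first (0x1234 -> 0x12, 0x34).
void store_u16_be(std::uint16_t v, unsigned char* out) {
    assert(out != nullptr);
    out[0] = static_cast<unsigned char>(v >> 8);
    out[1] = static_cast<unsigned char>(v & 0xFF);
}

// Read back a 16-bit value stored by store_u16_be.
std::uint16_t load_u16_be(const unsigned char* in) {
    assert(in != nullptr);
    return static_cast<std::uint16_t>((in[0] << 8) | in[1]);
}

// Store a 32-bit value most significant byte first.
void store_u32_be(std::uint32_t v, unsigned char* out) {
    assert(out != nullptr);
    out[0] = static_cast<unsigned char>(v >> 24);
    out[1] = static_cast<unsigned char>((v >> 16) & 0xFF);
    out[2] = static_cast<unsigned char>((v >> 8) & 0xFF);
    out[3] = static_cast<unsigned char>(v & 0xFF);
}

// Read back a 32-bit value stored by store_u32_be.
std::uint32_t load_u32_be(const unsigned char* in) {
    assert(in != nullptr);
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
           static_cast<std::uint32_t>(in[3]);
}
```

Because the shifts operate on values rather than on memory, the same bytes come out on little-endian and big-endian hosts, whereas casting a pointer to the integer to char* would emit 0x34, 0x12 on a little-endian machine.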
jagannatharjun
@jagannatharjun
C++20 will have compile-time endianness detection
I think we can still do it with constexpr