Christian Schneider
@schnaader
Yes, Precomp has that, too. It calls these bytes "penalty bytes" and tolerates some of them in recompressed streams; see, for example, the function compare_file_mem_penalty.
Christian Schneider
@schnaader
I think neither solution is ideal. I originally implemented them in Precomp because of PDF files, which often have a single differing byte in a stream. Precomp stores each such byte together with a 4-byte offset, so the byte count is multiplied by 5 and it shows "Penalty bytes were used: 5 bytes". See FlashMX.pdf (the ZIP is called pdf-test.zip) from this site for example.
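To make the "offset plus byte" bookkeeping concrete, here is a minimal Python sketch. It is not Precomp's compare_file_mem_penalty; the function names and the `max_penalty_bytes` limit are hypothetical, and the only detail taken from the message above is that each differing byte is stored with a 4-byte offset (5 bytes per difference).

```python
import struct


def collect_penalty_bytes(original, recompressed, max_penalty_bytes=16):
    """Return [(offset, original_byte), ...], or None if the streams differ too much.

    max_penalty_bytes is an illustrative limit, not a value taken from Precomp.
    """
    if len(original) != len(recompressed):
        return None
    penalties = []
    for offset, (want, got) in enumerate(zip(original, recompressed)):
        if want != got:
            penalties.append((offset, want))
            if len(penalties) > max_penalty_bytes:
                return None
    return penalties


def serialize_penalties(penalties):
    """Store each difference as a 4-byte big-endian offset followed by the byte itself."""
    return b"".join(struct.pack(">IB", offset, byte) for offset, byte in penalties)


def apply_penalties(recompressed, penalties):
    """Patch the recompressed stream so it matches the original again."""
    patched = bytearray(recompressed)
    for offset, byte in penalties:
        patched[offset] = byte
    return bytes(patched)
```

A single differing byte therefore costs 5 stored bytes, which matches the "Penalty bytes were used: 5 bytes" message quoted above.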
Christian Schneider
@schnaader
The best way to handle these differences would be on the bit level rather than the byte level, since Deflate uses bitstreams. To make things worse, Deflate packs those bitstreams into bytes in reverse, e.g. the bitstream "ABCDEFGH IJKLMNOP" (16 bits) is encoded as "HGFEDCBA PONMLKJI", so even if only the part "HIJKLM" differed, an approach that doesn't consider this would get two separate diffs.
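The bit-order effect can be checked with a small Python sketch. It only models the layout described above (stream bits fill each byte from the least significant bit, and bytes are printed most significant bit first); the helper names are made up for illustration.

```python
def packed_layout(labels):
    """Show how stream bits, given in stream order, look when each byte is printed MSB-first."""
    if len(labels) % 8 != 0:
        raise ValueError("need a whole number of bytes")
    groups = [labels[i:i + 8][::-1] for i in range(0, len(labels), 8)]
    return " ".join(groups)


def display_runs(first, last):
    """Contiguous ranges of printed bit positions touched by stream bits first..last."""
    # Stream bit i lands in byte i // 8, at printed position 7 - (i % 8) within that byte.
    positions = sorted((i // 8) * 8 + 7 - (i % 8) for i in range(first, last + 1))
    runs = []
    for p in positions:
        if runs and p == runs[-1][1] + 1:
            runs[-1][1] = p
        else:
            runs.append([p, p])
    return [tuple(r) for r in runs]
```

Here `packed_layout("ABCDEFGHIJKLMNOP")` gives `"HGFEDCBA PONMLKJI"`, and `display_runs(7, 12)` (the stream bits H through M) gives `[(0, 0), (11, 15)]`: one contiguous change in the bitstream shows up as two separate ranges in the printed bytes.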
Christian Schneider
@schnaader
If you want to test something in this direction, have a look at Matt Mahoney's Silesia page. The file "silesia.zip" from there has a ZIP stream for each of the 12 files. I didn't test it with AntiZ yet, but Precomp processes 8 of the streams and most of them are "heavy" partial matches, like 1.4 MB of 6.0 MB match.
Using "precomp -cn" on silesia.zip gives 88 MB instead of 67 MB which doesn't look bad at first, but considering that uncompressing every stream completely leads to 211 MB, it is.
Ah, I have tested it with AntiZ before, see this post at encode.ru
jagannatharjun
@jagannatharjun
Does Precomp store a full-file hash of the original file?
jagannatharjun
@jagannatharjun
One more thing: it looks like AntiZ uses a diff for every file rather than finding the optimal zlib parameters. Isn't that what we need for headerless (raw deflate) streams?
jagannatharjun
@jagannatharjun
Even with no temporary files, Precomp is twice as fast as AntiZ :sparkles:
Christian Schneider
@schnaader
Precomp doesn't store a hash if -cn is used, see schnaader/precomp-cpp#66
AntiZ has a slightly different approach to the params: it uses some heuristics to make guesses about them. Also, IIRC, it doesn't remember parameters that were used before, which speeds up Precomp for most files.
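As a rough illustration of "finding the params and remembering them", here is a Python sketch using the standard zlib module. It is an assumption-laden toy, not the search Precomp or AntiZ actually performs: it tries every (level, memLevel) pair on a raw deflate stream and lets the caller pass previously successful pairs (`remembered`, a hypothetical parameter) to be tried first.

```python
import zlib


def recompress(data, level, mem_level):
    """Compress data as a raw deflate stream (window bits -15, no zlib header)."""
    c = zlib.compressobj(level, zlib.DEFLATED, -15, mem_level)
    return c.compress(data) + c.flush()


def find_params(raw_stream, remembered=()):
    """Return a (level, mem_level) pair that reproduces raw_stream exactly, or None."""
    data = zlib.decompressobj(-15).decompress(raw_stream)
    # Pairs that worked for earlier streams go first; they often match again.
    candidates = list(remembered) + [
        (level, mem_level) for level in range(1, 10) for mem_level in range(1, 10)
    ]
    for level, mem_level in candidates:
        if recompress(data, level, mem_level) == raw_stream:
            return level, mem_level
    return None
```

Several pairs can produce identical output, so the pair returned is only one that reproduces the stream, not necessarily the one originally used. That is enough for bit-exact recompression.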
jagannatharjun
@jagannatharjun
AntiZ tries to extract information from the headers. Doesn't Precomp?
Christian Schneider
@schnaader
Precomp does it too, but the headers are too small and unreliable. See RFC 1950: https://www.ietf.org/rfc/rfc1950.txt
There are 2 bits called "compression level" (FLEVEL) that indicate whether fastest/fast/default/max compression was used, but this information isn't enough to determine the actual compression level setting (0-9). At least the window size in there is reliable.
On the other hand, in other formats like ZIP, there simply is no ZLIB header, so you don't know anything.
These are pure deflate streams, https://www.ietf.org/rfc/rfc1951.txt
Christian Schneider
@schnaader
Precomp uses just the window size and ignores FLEVEL because it could be anything and it's not needed for decompression. AntiZ uses the information, but I think this is bad if FLEVEL is not correct.
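For reference, the two header bytes described in RFC 1950 can be decoded as in the Python sketch below. The field layout follows the RFC; the function name and the returned dictionary are just illustrative choices.

```python
import zlib

# Meaning of the 2-bit FLEVEL field according to RFC 1950.
FLEVEL_NAMES = ["fastest", "fast", "default", "maximum"]


def parse_zlib_header(stream):
    """Decode the CMF and FLG bytes at the start of a zlib stream."""
    if len(stream) < 2:
        raise ValueError("a zlib header is 2 bytes long")
    cmf, flg = stream[0], stream[1]
    return {
        "cm": cmf & 0x0F,                         # compression method, 8 = deflate
        "window_size": 1 << ((cmf >> 4) + 8),     # CINFO is log2(window size) - 8
        "fdict": bool(flg & 0x20),                # preset dictionary flag
        "flevel": flg >> 6,                       # 0..3, only a coarse hint
        "check_ok": (cmf * 256 + flg) % 31 == 0,  # FCHECK makes this a multiple of 31
    }


# Two different level settings that end up in the same FLEVEL bucket.
header_7 = parse_zlib_header(zlib.compress(b"example" * 50, 7))
header_9 = parse_zlib_header(zlib.compress(b"example" * 50, 9))
```

With a stock zlib build I would expect `header_7` and `header_9` to carry the same FLEVEL, which is why the field cannot recover the exact 0-9 setting.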
jagannatharjun
@jagannatharjun
Setting the window size to -15 and trying decompression byte by byte helps me detect them and gives me an almost correct inflated size
Test
But for the first file the ratio should be around 316 (by pzlib); I assume this is because of the diff
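A minimal Python sketch of that detection idea, assuming the standard zlib module: negative window bits (-15) select raw deflate without a zlib header, and feeding one byte at a time shows exactly where the stream ends. The function name is hypothetical and this is not the code from the test linked above.

```python
import zlib


def try_raw_inflate(blob, offset):
    """Try to inflate a raw deflate stream starting at blob[offset].

    Returns (compressed_length, inflated_length) if a complete stream is
    found, or None if the data is invalid or ends before the stream does.
    """
    d = zlib.decompressobj(-15)  # -15: raw deflate, 32 KB window, no header
    inflated = 0
    try:
        for i in range(offset, len(blob)):
            inflated += len(d.decompress(blob[i:i + 1]))
            if d.eof:
                return i + 1 - offset, inflated
    except zlib.error:
        return None
    return None
```

Random data can occasionally pass as a short valid stream, so a real detector would still need to filter out false positives, for instance by requiring a minimum inflated size.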
jagannatharjun
@jagannatharjun
How about the abstraction from hxim/paq8px@b1f1f50, recently introduced in paq8px_v125?
Christian Schneider
@schnaader
Yeah, I've seen that one. The latest Precomp version (0.4.6) is quite similar (everything in memory until 64 MB exceeded), but still writes temporary files when recursion is used - so it could be useful there.
The culprits in Precomp are things like the Multi PNGs and various parsings that jump between file positions. Also, recursion gets a filename instead of an abstract file/memory pointer. Changing it is possible, but quite some work, so I delayed it for now.
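The "in memory until a size limit, then a temporary file" behaviour has a ready-made analogue in Python's standard library, shown below purely as an illustration of the idea. It is neither the paq8px abstraction nor Precomp's code, and the helper names are made up.

```python
import tempfile


def open_scratch(max_in_memory=64 * 1024 * 1024):
    """File-like object that stays in memory until it grows past max_in_memory bytes."""
    return tempfile.SpooledTemporaryFile(max_size=max_in_memory)


def roundtrip(data, max_in_memory):
    """Write data, seek back and read it again; callers never see where it is stored."""
    with open_scratch(max_in_memory) as scratch:
        scratch.write(data)
        scratch.seek(0)
        return scratch.read()
```

Code that takes such an object instead of a filename works the same whether the data ended up in memory or on disk, which is the property that would matter for recursion.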
jagannatharjun
@jagannatharjun
Something like this can give a window of 64 MB or so
Christian Schneider
@schnaader
Things I'd like to do before that are refactorings (at the moment a lot of code is duplicated) and unit tests (so it can be automatically verified that such big changes didn't break anything)
Even small changes can lead to subtle bugs, like schnaader/precomp-cpp#76 shows
jagannatharjun
@jagannatharjun
Such an abstraction will help with refactoring
ZlibWrapper
jagannatharjun
@jagannatharjun
and unit tests
jagannatharjun
@jagannatharjun
In "fout_fput*", what's wrong with casting the data directly to char?
Christian Schneider
@schnaader
Endianness mostly, though I guess a change in endianness would break something else anyway..
Christian Schneider
@schnaader
Since the software and the pcf files are platform independent, I wanted to make sure that 0x1234 is always written as 0x12, 0x34 instead of the usual 0x34, 0x12 and this won't change with platforms/compilers - also don't like the 3412 format because it isn't readable in hex editors :smile:
But there might be some clever way to both cast and ensure endianness.
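The byte-order guarantee can be sketched in Python as follows. The helper names are hypothetical (loosely modelled on the "fout_fput*" idea mentioned above); the point is that explicit shifts produce 0x12, 0x34 on every platform, whereas a raw cast depends on the host's endianness.

```python
import struct


def put_u16_be(value):
    """Encode a 16-bit value with the high byte first, independent of the host CPU."""
    return bytes([(value >> 8) & 0xFF, value & 0xFF])


def put_u32_be(value):
    """Encode a 32-bit value with the most significant byte first."""
    return bytes((value >> shift) & 0xFF for shift in (24, 16, 8, 0))


def get_u32_be(buf):
    """Decode 4 big-endian bytes back into an integer."""
    return (buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | buf[3]


# The standard library offers the same thing with an explicit byte-order marker.
same_as_struct = put_u16_be(0x1234) == struct.pack(">H", 0x1234)
```

Here `put_u16_be(0x1234)` yields the bytes 0x12, 0x34, which is also the order that reads naturally in a hex editor.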
jagannatharjun
@jagannatharjun
C++20 will have compile-time endianness detection
I think we can still do it with constexpr
Christian Schneider
@schnaader
I'm open to any code quality suggestions like this. The code as it is uses very few modern concepts; I'm pretty sure parts of it are still from the original code from 2006.
Too many features, not enough refactoring, not enough time, the usual..
jagannatharjun
@jagannatharjun
Will you accept an external library for the CLI?
Maybe deprecate precomf?
Christian Schneider
@schnaader
This would help very much, yes. As long as it keeps the parameter syntax, of course. Would also solve schnaader/precomp-cpp#9 , I guess
jagannatharjun
@jagannatharjun
I can, but can we use this? Then we don't have to worry about such things
Boost also has its own CLI interface, so maybe start using Boost; you never know what you need next
Christian Schneider
@schnaader
Will look into both, thanks. Boost also has endian buffers (http://www.boost.org/doc/libs/1_62_0/libs/endian/doc/buffers.html ), so yeah, like you said, never know what's needed next
jagannatharjun
@jagannatharjun
I can start working on it if you like
jagannatharjun
@jagannatharjun
Can I add ZlibWrapper to Precomp?
Christian Schneider
@schnaader
Sure, feel free to fork and send pull requests :+1: Thanks in advance!
pasha-zzz
@pasha-zzz
Sometimes streams are encoded with base64 (maybe MIME or UUE)... What about recompression of such streams?
Christian Schneider
@schnaader
Base64 wrapped by MIME should already work; others most certainly won't. The problem is that most text could be base64, so there are many potential false positives. I'll try a potential solution I had in mind that is similar to intense/brute mode: it will always decode base64, false positives included, but will only keep the result if something else (PNG/GIF/...) is found in recursion
Also see schnaader/precomp-cpp#43
pasha-zzz
@pasha-zzz
Maybe add an option to test for b64? My file contains "base64:77u/PCFET0NUWVB....", "base64:iVBORw0KGgoAAAANSU..." and so on. About text files: text without spaces? A rare case. Simply add a minimal b64 length for detection, I think...
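Both suggestions (a minimum run length, and keeping a decode only if something recognizable is inside) can be combined in a short Python sketch. Everything here is an assumption for illustration: the function name, the `min_len` default and the small signature table are made up, and this is not how Precomp detects base64.

```python
import base64
import re

# A few well-known file signatures; a real tool would recurse into the data instead.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def find_base64_payloads(text, min_len=32):
    """Return (offset, kind, decoded_size) for base64 runs that decode to a known format."""
    pattern = re.compile(r"[A-Za-z0-9+/]{%d,}={0,2}" % min_len)
    hits = []
    for match in pattern.finditer(text):
        run = match.group(0)
        run = run[: len(run) - len(run) % 4]  # base64 decodes in groups of 4 characters
        try:
            decoded = base64.b64decode(run, validate=True)
        except ValueError:
            continue
        for magic, kind in MAGIC.items():
            if decoded.startswith(magic):
                hits.append((match.start(), kind, len(decoded)))
                break
    return hits
```

Ordinary prose rarely contains 32 or more base64 characters without a space, so the length threshold removes most false positives, and the signature check discards long runs that decode to nothing recognizable.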