Sean
@smcallis_gitlab
not my call
Petr Kobalicek
@kobalicek
yeah understand :)
Sean
@smcallis_gitlab
yeah intel 13 can't even handle holding a std::atomic by value because it's not copyable
and it doesn't handle the copy elision properly in all cases
Petr Kobalicek
@kobalicek
I had most problems with constexpr on MSVC
maybe Intel would be equally bad :)
but should be solved atm
I mean, why do people even use the Intel compiler? I check godbolt from time to time to make a comparison and it doesn't really output better code than clang, usually at the same level as MSVC
Sean
@smcallis_gitlab
we use it mostly because it used to be better than gcc and it comes with mkl/svml
Petr Kobalicek
@kobalicek
yeah used to be is the correct term :)
Sean
@smcallis_gitlab
I agree totally
once gcc 5 came around its vectorizer got really good
Petr Kobalicek
@kobalicek
Even MSVC used to be better, now look at it :)
Sean
@smcallis_gitlab
MKL's FFT beats FFTW I'm sad to admit
Petr Kobalicek
@kobalicek
yeah I like the autovectorization support in general. In Blend2D I basically rely on it as it's better to just code C++ than to try mixing C++ and intrinsics - that usually confuses recent compilers more than just using C++
Sean
@smcallis_gitlab
I've used it before to backport assembly to older compilers, because I'm a bad person
ref, this fast atan approximation I wrote: https://pastebin.com/YbSsEDUh
let recent GCC vectorize it and ported the assembly back =D
Petr Kobalicek
@kobalicek
yeah nice, thinking how much this would be similar to the approximation used by Blend2D
that is used to render conical gradients
Sean
@smcallis_gitlab
Probably pretty close, I think I measured an absolute error of < 2e-4 with those coefficients which is < .1 degree
good enough for my uses
Sam Molloy
@cowtung
@kobalicek Thank you for the code, I'll check back in a few weeks or so to see if things are working as I need them to.
Petr Kobalicek
@kobalicek
Yeah sure, the premultiply/unpremultiply conversion would improve, but the pixel results you were getting won't as they were already correct
Petr Kobalicek
@kobalicek

So guys, over the weekend I was improving the pixel converter, adding some optimizations to already supported conversions, and writing more tests to make sure that they produce the expected result.

One thing that doesn't give a "100%" expected result is unpremultiply. I have written a tool to compare "ALL" the possibilities and I have discovered that with the table that I use there are 15000 errors in total. The errors are small: a difference of 1 between the table result and the expected result, which is obtained through floating point calculations and then properly rounded.

So... I started checking how to improve this and I have found a way: to have a mul table and also an add table, so the final unpremultiply per component would look like this:

uint32_t unpremultiply(uint32_t c, uint32_t a) {
  return (c * mulTable[a] + addTable[a]) >> Precision;
}
The old one looked like this:
uint32_t unpremultiply(uint32_t c, uint32_t a) {
  return (c * rcpTable[a]) >> 16;
}
So there is one more addition per component, which is totally okay. With the new table I have 100% correct unpremultiply with precision 13 and higher - 100% matching the floating point impl.
So I will use the new approach instead of the old one. As a bonus, the new approach can be easily SIMDified at the baseline (SSE2) optimization level: since the precision is 13 bits I can use the pmaddwd instruction to do the multiplication instead of relying on a 32-bit multiply, which is only supported from SSE4.1 and has higher latency than all the others
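A minimal sketch of how such mul/add tables could be derived and checked. This is not necessarily how Blend2D builds them. All names here (`UnpremultiplyTables`, `buildTables`, `referenceUnpremultiply`, `countMismatches`) are hypothetical. The sketch assumes premultiplied input, so only `c <= a` is considered, and it uses exact integer round-half-up as the reference in place of the floating point implementation mentioned above. For each alpha it tries two candidate multipliers and intersects the ranges of add values that give the reference result for every `c`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical container for the two tables (not Blend2D API).
struct UnpremultiplyTables {
  std::vector<uint32_t> mul;  // multiplier, indexed by alpha
  std::vector<uint32_t> add;  // addend, indexed by alpha
  uint32_t precision;         // shift applied after multiply-add
  bool exact;                 // true if every alpha got an exact (mul, add) pair
};

// Reference: round(c * 255 / a), half rounded up, in exact integer arithmetic.
static uint32_t referenceUnpremultiply(uint32_t c, uint32_t a) {
  return (c * 255u * 2u + a) / (2u * a);
}

// Searches, per alpha, for (mul, add) such that
// (c * mul + add) >> precision equals the reference for all c in [0, a].
static UnpremultiplyTables buildTables(uint32_t precision) {
  assert(precision >= 1 && precision <= 16);  // keeps the math within 32 bits
  UnpremultiplyTables t{std::vector<uint32_t>(256, 0u),
                        std::vector<uint32_t>(256, 0u), precision, true};
  const int64_t one = int64_t(1) << precision;

  for (uint32_t a = 1; a <= 255; a++) {
    const int64_t base = (255 * one) / a;  // floor of the scaled reciprocal
    bool found = false;
    for (int64_t m = base; m <= base + 1 && !found; m++) {
      int64_t lo = 0, hi = one;  // valid addends form the interval [lo, hi)
      for (uint32_t c = 0; c <= a; c++) {
        const int64_t r = referenceUnpremultiply(c, a);
        const int64_t cLo = r * one - int64_t(c) * m;
        const int64_t cHi = (r + 1) * one - int64_t(c) * m;
        if (cLo > lo) lo = cLo;
        if (cHi < hi) hi = cHi;
      }
      if (lo < hi) {
        t.mul[a] = static_cast<uint32_t>(m);
        t.add[a] = static_cast<uint32_t>(lo);
        found = true;
      }
    }
    if (!found) t.exact = false;  // entries stay 0 for this alpha
  }
  return t;  // alpha 0 keeps mul = add = 0, so the result is 0
}

// Same shape as the new per-component formula quoted above.
static uint32_t unpremultiply(const UnpremultiplyTables& t, uint32_t c, uint32_t a) {
  return (c * t.mul[a] + t.add[a]) >> t.precision;
}

// Brute-force comparison over every (c, a) pair with c <= a.
static uint32_t countMismatches(const UnpremultiplyTables& t) {
  uint32_t n = 0;
  for (uint32_t a = 1; a <= 255; a++)
    for (uint32_t c = 0; c <= a; c++)
      if (unpremultiply(t, c, a) != referenceUnpremultiply(c, a)) n++;
  return n;
}
```

Calling `buildTables(13)` and then `countMismatches` repeats the "compare all possibilities" check for a given precision. Whether this particular search succeeds for every alpha at a given precision is reported by the `exact` flag rather than assumed.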
Sean
@smcallis_gitlab
this is to convert between pre-multiplied alpha and not?
Petr Kobalicek
@kobalicek
yeah, premultiply / unpremultiply
Blend2D at the moment only supports premultiplied pixels in BLImage, but to make interop better, and to support reading/writing arbitrary pixels there is BLPixelConverter, which can be configured to do a lot of possible conversions
Sean
@smcallis_gitlab
I'm convinced 32-bit ARGB pre-multiplied is the One True format
Petr Kobalicek
@kobalicek
yeah of course :)
but... if you read BMP, for example
that is RGB24
Sean
@smcallis_gitlab
AGG handled anything and it made it a lot more complex for marginal benefit I think
sure
do it at the edges
Petr Kobalicek
@kobalicek
the converter can use SIMD to convert such pixels into 32-bit
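A scalar sketch of the kind of widening conversion being discussed, for illustration only. `convertRgb24ToArgb32` is a hypothetical helper, not part of BLPixelConverter, and the byte order R, G, B in memory is an assumption (BMP files store B, G, R, so the indices would be swapped there). Because the alpha is forced to 0xFF, the output is valid both as premultiplied and as non-premultiplied ARGB32:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Blend2D API): expands packed 24-bit RGB pixels
// into 32-bit 0xAARRGGBB values with a fully opaque alpha channel.
static void convertRgb24ToArgb32(uint32_t* dst, const uint8_t* src, size_t pixelCount) {
  for (size_t i = 0; i < pixelCount; i++) {
    const uint32_t r = src[i * 3 + 0];  // assumed byte order: R, G, B
    const uint32_t g = src[i * 3 + 1];
    const uint32_t b = src[i * 3 + 2];
    dst[i] = 0xFF000000u | (r << 16) | (g << 8) | b;
  }
}
```

A SIMD version would process several pixels per iteration, with a byte shuffle doing the 3-byte to 4-byte expansion.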
Sean
@smcallis_gitlab
yeah that's nice
Petr Kobalicek
@kobalicek
there are a lot of nice optimizations here
with PSHUFB you can do miracles on x86 :)
Sean
@smcallis_gitlab
I got spoiled by my Sandy Bridge because it had two shuffle units
then Intel got cheap and took one out >=(
Petr Kobalicek
@kobalicek
I'm trying not to use PSHUFB in generic code, because this instruction is also very slow on Atoms, so only when it's really needed
such as conversion :)
Sean
@smcallis_gitlab
it's too bad because it's basically a requirement if you're working on complex data, which I do a fair bit
so you always have to deinterleave/interleave the real and imaginary parts
Petr Kobalicek
@kobalicek
ARM has a very similar instruction that does 64 index lookup
which is pretty neat as well