Dieter Weber
@uellue
In general, your notebook is now a nice test bed to try out things like that. :+1:
Jaeweon Shin
@jae.shin_gitlab
Ah you are totally right. I will fix that right away
Meanwhile, I was trying to add more cases, like scaling a small subset of components to see how small they have to be before PCA ignores them, etc. Do you have any other test cases in mind that I could try out?
Dieter Weber
@uellue

You could push the limits with scaling a bit more. Right now you have a linspace between 0.05 and 1, which is not even two orders of magnitude. Have a look at https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html and really, really push it. The interesting question is when it starts showing different results than processing the whole data with an established algorithm.

In general, the "quality measure" is difference between established PCA algorithm and our distributed version. As soon as they diverge, we've reached the limit.

Jaeweon Shin
@jae.shin_gitlab
:+1:
Dieter Weber
@uellue
Also, your norm_diff measures the "global" error. If we, for example, mess with only one frame, that gets buried. You could add a second measure: the maximum difference, and the coordinate of that value (np.argmax()), between the reconstruction from the established PCA and the reconstruction from our PCA.
One can look at both the absolute and the relative difference.
In general, I'd say it is looking very good. Now we can see where the breaking point really is. :-)
It starts to look like a small paper when we are done with this.
Another question is if we actually recover the same components in the same order
Jaeweon Shin
@jae.shin_gitlab

which is the maximum difference and the coordinate of that value (np.argmax()) between PCA and reconstruction with established method, and our PCA and reconstruction

Sorry could you please elaborate a bit more on this? I'm not really understanding what you meant by this.

Dieter Weber
@uellue
You subtract the two reconstructions -- benchmark algorithm and ours -- and see what the maximum difference is, and where it occurred.
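For reference, a minimal, self-contained sketch of that measure; recon_ref and recon_ours are hypothetical stand-ins for the two reconstructions:

```python
import numpy as np

# Placeholder reconstructions -- in practice these come from the benchmark PCA
# and from our distributed PCA (names and shapes are assumptions).
rng = np.random.default_rng(0)
recon_ref = rng.normal(size=(64, 32, 32))    # e.g. (frames, y, x)
recon_ours = recon_ref + 1e-6 * rng.normal(size=recon_ref.shape)

diff = np.abs(recon_ref - recon_ours)
coord = np.unravel_index(np.argmax(diff), diff.shape)
print("max abs difference:", diff[coord], "at coordinate", coord)

# Relative difference, guarding against division by zero
rel = diff / np.maximum(np.abs(recon_ref), 1e-12)
rel_coord = np.unravel_index(np.argmax(rel), rel.shape)
print("max rel difference:", rel[rel_coord], "at coordinate", rel_coord)
```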
Jaeweon Shin
@jae.shin_gitlab

if we actually recover the same components in the same order

I think to do this, one would have to vary how strongly each component is reflected in the data, because the only conditions I have at the moment are that the components are orthonormal and the loadings are random.

what is the maximum difference, and where that occurred

Okay, I understand now. np.argmax alone would give a single feature of a single frame, which may not be very informative, so I will select the ~100 coordinates at which the difference is largest and check whether the frame we perturb is the one that leads to these differences.
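One way to pick the ~100 largest-difference coordinates (a sketch; diff is a hypothetical array of absolute differences with the first axis indexing frames):

```python
import numpy as np

# Hypothetical absolute difference between the two reconstructions,
# with the first axis indexing frames.
rng = np.random.default_rng(0)
diff = np.abs(rng.normal(size=(64, 32, 32)))

k = 100
top_k = np.argpartition(diff.ravel(), -k)[-k:]   # indices of the k largest differences
coords = np.unravel_index(top_k, diff.shape)     # tuple of coordinate arrays
print("frames containing the largest differences:", np.unique(coords[0]))
```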

Dieter Weber
@uellue

I think to do this, one would have to differentiate the magnitude of how much components get reflected in the data

Yes, that's true. :-) I have a feeling that our bounding box method definitely works well as long as the loadings have a Gaussian distribution. Perhaps you can try out some strongly non-Gaussian distributions where one could suspect that looking at the bounding box could give a different order of the components between the benchmark algorithm and ours.

so I will select ~100 coordinates at which the difference is maximal

Perfect!

as long as the loadings have a Gaussian distribution

Or any other distribution where the points on the bounding box have a similar distribution to the points closer to the center that we cut away.
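A few ways to generate such non-Gaussian loadings (a sketch; shapes and the particular distributions are just assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_frames, n_components = 1024, 4

# Gaussian loadings: the case where the bounding box is expected to work well
loadings_gauss = rng.normal(size=(n_frames, n_components))

# Strongly non-Gaussian alternatives to try:
loadings_heavy = rng.standard_cauchy(size=(n_frames, n_components))      # heavy tails
loadings_bimodal = (rng.choice([-1.0, 1.0], size=(n_frames, n_components))
                    + 0.05 * rng.normal(size=(n_frames, n_components)))  # two clusters
```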

Jaeweon Shin
@jae.shin_gitlab
Okay. Will try those out
Dieter Weber
@uellue

You could push the limits with scaling a bit more.

Ah, a point here is if we can "fish out" a really, really weak component just above the noise floor. Perhaps you could think of a way to measure how well we recovered that weak component? Looking at the global error will not work because the component is so weak that we would not notice if we missed its contribution in the global result.

All in all, I'd say that the current method looks pretty good. :-) All this is now about finding and understanding any limits and differences to doing a PCA on the full data with established methods.
And as a side note, these can make nice unit tests. In case anyone modifies the PCA code, we make sure that the result is not only almost correct in benign cases, but in all tricky corner cases.
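One possible measure for the weak-component question above, sketched under the assumption that the ground-truth weak component is known from the data generation (all names and toy values are hypothetical):

```python
import numpy as np

def recovery_score(true_comp, estimated_comps):
    """Best absolute cosine similarity between the known weak component and
    any estimated component (1.0 = perfectly recovered, sign-invariant)."""
    true_comp = true_comp / np.linalg.norm(true_comp)
    est = estimated_comps / np.linalg.norm(estimated_comps, axis=1, keepdims=True)
    return np.max(np.abs(est @ true_comp))

# Toy stand-ins for the ground-truth weak component and the components
# returned by the benchmark PCA and by our PCA.
rng = np.random.default_rng(0)
weak_component = rng.normal(size=256)
components_ref = rng.normal(size=(8, 256))
components_ours = rng.normal(size=(8, 256))
print("benchmark PCA:", recovery_score(weak_component, components_ref))
print("our PCA:      ", recovery_score(weak_component, components_ours))
```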
Jaeweon Shin
@jae.shin_gitlab

you could think of a way to measure how well we recovered that weak component

Hmm, that'd be more involved than the case of a component contained in a single frame, because in the latter case we can select the coordinates with the maximal differences and see whether they correspond to that frame, as you said.

Dieter Weber
@uellue
Yes, that is right. Still, it is important to check if we got all weak components right.
Jaeweon Shin
@jae.shin_gitlab
right
Dieter Weber
@uellue
I have a gut feeling that we need to make sure that limiting intermediate data to the bounding box will not accidentally drop or bury weak components somehow.
Jaeweon Shin
@jae.shin_gitlab
Wouldn't that depend more on how the loading is defined, as you previously mentioned? If the loading follows a more 'centered' distribution, then the bounding box is likely to work well, etc.?
Dieter Weber
@uellue
Could be!
I don't understand it well enough to make that call without testing it.
Jaeweon Shin
@jae.shin_gitlab
Yeah, I will first test by changing the loading distribution.
Dieter Weber
@uellue
The approach I'd take is to try a lot of things, see if/where it is different to the benchmark algorithm, and then look for explanations. :-D
If there are no differences, we can work on explaining why there aren't any.
Jaeweon Shin
@jae.shin_gitlab
okay
Dieter Weber
@uellue
(That is perhaps different from the way the "scientific method" is taught. I believe the theory says you first have hypotheses and then prove them with experiments or proofs. In reality, you build your hypotheses by playing around with experiments based on pure guesswork and looking for something interesting. If you find something, THEN you work on explaining what you found.)
That is just more efficient if your experiments are fast
Jaeweon Shin
@jae.shin_gitlab
I agree with you that building hypotheses and then testing them is more of a traditional approach, and lots of people in science seem to take an explore-data -> build-theory approach nowadays, especially with the introduction of big data.
Dieter Weber
@uellue
Yes :-)
Jaeweon Shin
@jae.shin_gitlab
@uellue So I've tried most of the methods we talked about. Unfortunately, I could not yet find a case where the benchmark differed and our PCA failed. I've tried 1) non-centered loadings using heavy-tailed distributions, 2) including frames that are completely random and changing the proportion of such frames (neither our PCA nor standard PCA performed well as soon as a random-noise frame was included), 3) comparing maximum norm differences (when adding a component to a single frame; this didn't catch that single frame), 4) different ways of scaling, etc. Do you have any more cases that I could potentially try out?
Dieter Weber
@uellue
Perhaps you could make an overview for yourself and think of more ways our method could behave differently and what might be situations in which it could struggle. So far we have really only scratched the surface and tested a tiny part of the parameter space for data that is a linear combination of components. You could think of more ways to generate test data than your current method, including hand-crafted examples. Furthermore you could get an overview of all the ways how results can be different, even if they lead to the same reconstruction. Again, we have only explored a small part of that. More than enough to do, and you can get creative here!
I.e., behave differently from the reference PCA.
Jaeweon Shin
@jae.shin_gitlab

@uellue I tried a few different test cases but couldn't quite differentiate the standard PCA from the PCA we developed. I'm mostly focusing on the hyperbox property of the developed PCA, as that is where the most noticeable conceptual difference lies. I've checked gamma and exponential distributions as examples of heavy-tailed distributions, because I thought not using a centered distribution might disturb the hyperbox. I also tried a bimodal distribution, but it seems the choice of distribution for the loadings doesn't really affect the performance. I looked at how the data gets transformed by comparing the projected data from the standard PCA and the new PCA, and the difference was almost zero (I had to take absolute values to remove the signs, since the signs are nondeterministic), suggesting that both do very similar work. Do you have any ideas that I can pursue here?

Meanwhile, as an update, I was working on implementing NMF and was thinking about using the hyperbox again here. Since PCA and NMF have the same formulation (argmin ||X - AH||) with some differences in the constraints, I think the hyperbox idea could work again here. One problem is that NMF doesn't in general have incremental methods, so we have a problem at the merging stage. There is, however, a way of computing NMF separately for some subset of the columns of the matrix and then combining the results at the end. I didn't think it was going to work, as we don't have access to a subset of the columns of the data but only to a subset of the rows (observations), but I think transposing the matrix might do the trick, since if X (data) = A (score) * H (dictionary), then X^T = (H^T)*(A^T). I'm mostly debugging at the moment because of a problem with dimensions, but I will let you know if I make any progress on this.
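As a sanity check of the transpose idea, here is a small standalone sketch using scikit-learn's NMF on random data; it only illustrates the shape bookkeeping of X^T = H^T * A^T, not the actual merging scheme:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data matrix X of shape (n_frames, n_features)
rng = np.random.default_rng(0)
X = rng.random((100, 64))

model = NMF(n_components=5, init='nndsvda', max_iter=500)

# Factorize the transpose: X.T ~ W_t @ H_t, so that
# H (dictionary) ~ W_t.T and A (score) ~ H_t.T
W_t = model.fit_transform(X.T)   # shape (n_features, n_components)
H_t = model.components_          # shape (n_components, n_frames)

A = H_t.T                        # scores, (n_frames, n_components)
H = W_t.T                        # dictionary, (n_components, n_features)
print("relative error:", np.linalg.norm(X - A @ H) / np.linalg.norm(X))
```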

Dieter Weber
@uellue

@jae.shin_gitlab OK, here are a few points that could be pursued:

  • PCA returns the components in the order of their standard deviation, right? Did you check whether the reference PCA and the hyperbox PCA return the components in the same order? Can we generate data in such a way that the hyperbox gives a different order of the components than the reference PCA? As far as I can see, you've only chosen distributions where larger extrema of the hyperbox surface are linked to larger standard deviation. In particular, you seem to have chosen the same distribution for all dimensions, which is not always the case in real input data. We could, for example, design data with very low standard deviation in one dimension (all points near the average) but one or two large outliers that define the box in that dimension, and another dimension where all points sit at the extrema, which are only slightly smaller than the outliers. If the number of samples is large enough, the first dimension should have a much smaller standard deviation than the second one, while looking only at the hyperbox will place the first one first (see the sketch after this list). Perhaps you can play a bit further with the way the hyperbox samples data in order to produce an example where the order of the components is very different, for example by including points that are extrema in ALL dimensions, but having a very large number of samples below these extrema with very different distributions. That way we could eliminate a large fraction of the input data from the merging PCA and instead inject a different distribution there that is not representative of the data. This approach could generate an example where the full PCA behaves differently from the hyperbox PCA.
  • You've always generated the data from an orthonormal base. What if you generate it from at least some base vectors that are almost collinear? Do reference PCA and hyperbox PCA show the same performance in distinguishing these components somehow, in particular in the presence of added noise? Can you somehow make the hyperbox method "trip" and lump components together that the full PCA can separate, for example using the methods described above?
  • As far as I could see, the loadings have been scattered all over the place. What happens if they are somehow segregated between partitions?
  • What is the influence of the number of partitions? You've always used four, as far as I can see.
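A tiny sketch of loadings constructed along the lines of the first point above (numbers are arbitrary; these loadings would still need to be combined with component vectors to form actual test data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Dimension 0: nearly all points at the mean, two large outliers define the box
dim0 = np.zeros(n)
dim0[:2] = [-10.0, 10.0]

# Dimension 1: all points sit at the extrema, slightly smaller than the outliers
dim1 = rng.choice([-9.0, 9.0], size=n)

loadings = np.stack([dim0, dim1], axis=1)
print("std per dimension:       ", loadings.std(axis=0))                          # dim0 << dim1
print("box extent per dimension:", loadings.max(axis=0) - loadings.min(axis=0))   # dim0 > dim1
```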

That is what I meant by just scratching the surface of what input data can look like! After trying these things in hand-crafted examples, you could also develop a fuzzing algorithm that compares the PCAs for a larger number of combinations of all these factors to perhaps find some where there are differences.

After getting a handle on correctness, another topic that we haven't covered yet is efficiency. Under which conditions does the parallelized PCA version bring advantages over the full PCA? We already had one case where it doesn't make sense: If the data actually has the full dimensionality, the hyperbox version is less efficient since you'd have to do the full PCA at the merging stage anyway. How can you design an optimal calculation scheme for a given PCA task? What are the relevant factors here?

Jaeweon Shin
@jae.shin_gitlab

Thanks for the input! I will look into each of the points that you mentioned. As for:

Did you check if reference PCA and hyperbox PCA return components in the same order?

I did. They do indeed return the same component vectors, and after adjusting for the sign ambiguity, the norm difference was very small, so I took it that they essentially identify the same component vectors.
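For reference, the sign adjustment can also be done per component instead of taking absolute values (a sketch with hypothetical component matrices):

```python
import numpy as np

# Hypothetical reference components and the same components with flipped signs
rng = np.random.default_rng(0)
comps_ref = rng.normal(size=(4, 256))
comps_ours = comps_ref * np.array([[1.0], [-1.0], [1.0], [-1.0]])

# Flip each of our components to match the sign of the reference component
signs = np.sign(np.sum(comps_ref * comps_ours, axis=1, keepdims=True))
comps_ours_aligned = comps_ours * signs
print(np.linalg.norm(comps_ref - comps_ours_aligned))   # ~0 after alignment
```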

Jaeweon Shin
@jae.shin_gitlab
@uellue I think I've tested most of the cases you mentioned. When I checked the performance, it was slower than standard PCA, but that is due to the nature of incremental PCA. After all, any method with a partial-batch aspect will naturally be slower than a full-batch method, but since one of our purposes in using an incremental algorithm is to fit large data, incremental PCA outperforms standard PCA in that respect. I'm currently writing documentation and will be finishing it soon.
Dieter Weber
@uellue
Hi @jae.shin_gitlab , thanks for the info and for your work! I'll have a look at it on Monday since I am traveling atm.
Dieter Weber
@uellue

Hi @jae.shin_gitlab , now I have taken a look at it. The implementation looks good to me, good work! For me the question is still whether the hyperbox method works under all circumstances and what its limits are. Did you cover the issues that I described on August 6th? Very importantly, it is still unclear whether the method is mathematically valid under all circumstances, whether it is really identical or just gives an approximation, how good the approximation is, etc.

Furthermore, one could mention that sampling methods other than the hyperbox are conceivable.

Given the fact that GSoC is almost over, how should we proceed? We could merge the code as-is as a prototype, i.e. move everything under the prototypes folder, or you could add a clear description of the state of testing and validation, clearly marking it as experimental and describing in detail what the open questions are. If you are interested in turning this into a scientific publication, the open points from my comment above should all be covered first.

See also my comments on the pull request!
Jaeweon Shin
@jae.shin_gitlab
@uellue Sorry for the late reply. I tried to cover most of the issues that you previously described, but could not find a case where the hyperbox method broke down. I guess this is slightly worse than finding a case where the hyperbox method doesn't work, because we don't really know where it can potentially fail... Clearly, it is slightly worse than running standard PCA on the full-batch data, but so far, in all the experiments I did, the errors were negligible. As for how to proceed from here, I think a better way is to, as you mentioned, add more to the documentation describing what has been checked and what remains to be done in future checks.
Dieter Weber
@uellue
Ok, if you already wrote code for all these cases, what about including it in unit tests as well as describing it in the documentation? With that we can make a nice publication out of it, I think.
Jaeweon Shin
@jae.shin_gitlab
You meant including the tests in the test files, right? I'll work on it.
Dieter Weber
@uellue
@jae.shin_gitlab Yes, the files in the tests/ subfolder.
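If it helps as a starting point, a rough sketch of what such a test in tests/ could look like; scikit-learn's IncrementalPCA stands in here for the project's hyperbox PCA, and all names and tolerances are assumptions:

```python
import numpy as np
import pytest
from sklearn.decomposition import PCA, IncrementalPCA


@pytest.mark.parametrize("n_components", [2, 4])
def test_reconstruction_matches_reference(n_components):
    # Corner case from the discussion: orthonormal components, heavy-tailed loadings
    rng = np.random.default_rng(0)
    components, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    loadings = rng.standard_cauchy(size=(512, n_components))
    data = loadings @ components[:n_components]

    ref = PCA(n_components=n_components).fit(data)
    ours = IncrementalPCA(n_components=n_components, batch_size=128).fit(data)

    # Compare reconstructions -- this is insensitive to sign and ordering
    recon_ref = ref.inverse_transform(ref.transform(data))
    recon_ours = ours.inverse_transform(ours.transform(data))
    assert np.allclose(recon_ref, recon_ours, atol=1e-6)
```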