sk1p on gh-pages
Update documentation (compare)
sk1p on master
Test cases for ROI and postproc… fix: aux params didn't behave c… Call copy_for_partition() after… and 1 more (compare)
sk1p on master
Adapt documentation for #426 (compare)
You could push the limits with scaling a bit more. Right now you have a linspace between 0.05 and 1, which is not even two orders of magnitude. Have a look at https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html and really, really push it. The interesting question is when it starts showing different results than processing the whole data with an established algorithm.
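For example, something along these lines would cover several orders of magnitude instead of the current linspace (just a sketch, the names and ranges are placeholders):

```python
import numpy as np

# instead of np.linspace(0.05, 1, num=20), sweep several orders of magnitude
scales = np.logspace(-4, 0, num=20)  # 1e-4 ... 1.0, logarithmically spaced

for scale in scales:
    # scale the data (or the component contribution) by `scale`, run both the
    # reference PCA and the distributed version, and record where they diverge
    ...
```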
In general, the "quality measure" is the difference between the established PCA algorithm and our distributed version. As soon as they diverge, we've reached the limit.
which is the maximum difference, and the coordinate of that value (np.argmax()), between the PCA and reconstruction done with the established method and our PCA and reconstruction
Sorry, could you please elaborate a bit more on this? I'm not sure I understand what you mean.
if we actually recover the same components in the same order
I think to do this, one would have to vary the magnitude of how much each component gets reflected in the data, because the only conditions I have at the moment are that the components are orthonormal and the loadings are random
what is the maximum difference, and where that occurred
Okay, I now understand. np.argmax would give a single feature of a single frame, which may not be very informative, so I will select ~100 coordinates at which the difference is maximal to see if the frame that we perturb is the one that leads to these differences
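Something like this should do it (just a sketch; `rec_reference` and `rec_hyperbox` are placeholder names for the two reconstructions, shape (n_frames, n_features)):

```python
import numpy as np

diff = np.abs(rec_reference - rec_hyperbox)

# indices of the ~100 largest differences, mapped back to (frame, feature)
k = 100
flat_idx = np.argpartition(diff.ravel(), -k)[-k:]
frames, features = np.unravel_index(flat_idx, diff.shape)
print(np.unique(frames))  # is the perturbed frame the one that dominates?
```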
I think to do this, one would have to vary the magnitude of how much each component gets reflected in the data
Yes, that's true. :-) I have a feeling that our bounding box method works well as long as the loadings have a Gaussian distribution. Perhaps you can try out some strongly non-Gaussian distributions where one could suspect that looking at the bounding box could give a different order of the components between the benchmark algorithm and ours.
so I will select ~100 coordinates at which the difference is maximal
as long as the loadings have a Gaussian distribution
Or any other distribution where the points on the bounding box have a similar distribution to the points further towards the center that we cut away.
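For instance, test data with different loading distributions could be generated roughly like this (just a sketch, not tied to your actual test code; the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
n_frames, n_components, n_features = 1000, 3, 256

# some strongly non-Gaussian loading distributions to try
loadings = {
    'gaussian': rng.normal(size=(n_frames, n_components)),
    'exponential': rng.exponential(size=(n_frames, n_components)),
    'bimodal': np.where(
        rng.random((n_frames, n_components)) < 0.5,
        rng.normal(-5, 1, (n_frames, n_components)),
        rng.normal(5, 1, (n_frames, n_components)),
    ),
}

# orthonormal components from the QR decomposition of a random matrix
components, _ = np.linalg.qr(rng.normal(size=(n_features, n_components)))
components = components.T  # (n_components, n_features)

data = {name: a @ components for name, a in loadings.items()}
```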
You could push the limits with scaling a bit more.
Ah, a point here is whether we can "fish out" a really, really weak component just above the noise floor. Perhaps you could think of a way to measure how well we recovered that weak component? Looking at the global error will not work because the component is so weak that we would not notice if we missed its contribution in the global result.
you could think of a way to measure how well we recovered that weak component
hmm, that'd be more involved than the case of a component contained in a frame, because in the latter case we could select coordinates with maximal differences and see if they correspond to that frame, as you said.
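One measure that might work (just a sketch; `weak_component` would be the ground-truth vector we inject and `components_` the recovered components):

```python
import numpy as np

def weak_component_recovery(weak_component, components_):
    # best absolute cosine similarity between the injected weak component
    # and any recovered component (1.0 = recovered, ~0 = missed)
    w = weak_component / np.linalg.norm(weak_component)
    c = components_ / np.linalg.norm(components_, axis=1, keepdims=True)
    return np.max(np.abs(c @ w))
```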
@uellue I tried a few different test cases but couldn't really differentiate the standard PCA from the PCA we developed. I'm mostly focusing on the hyperbox property of our PCA, as that is where the most noticeable conceptual difference lies. I checked against gamma and exponential distributions as examples of heavy-tailed distributions, because I thought that not using a centered distribution might disturb the hyperbox approach. I also tried a bimodal distribution, but it seems like the choice of distribution for the loadings doesn't really affect the performance. I looked at how the data gets transformed by comparing the projected data from the standard PCA and the new PCA, and the difference was almost zero (I had to take absolute values to remove the signs since the signs are nondeterministic), suggesting that both do very similar work. Do you have any ideas that I can pursue here?
Meanwhile, as an update, I was working on implementing NMF and was thinking about using the hyperbox again here. Since PCA and NMF have the same formulation (argmin ||X - AH||) with some differences in the constraints, I think the hyperbox idea could work again here. One problem is that NMF doesn't in general have incremental methods, so we have a problem at the merging stage. There is, however, a way of computing NMF separately for subsets of the columns of the matrix and then combining them at the end. I didn't think it was going to work, as we don't have access to a subset of the columns of the data but only a subset of the rows (observations), but I think transposing the matrix might do the trick, since if X (data) = A (scores) * H (dictionary), then X^T = H^T * A^T. I'm mostly debugging at the moment because of problems with the dimensions, but I will let you know if I make any progress on this.
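To illustrate the transposition idea (just a sketch with sklearn on random data, not the actual LiberTEM code):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 50))  # rows = observations (frames), columns = features

# NMF on the transposed data: X.T ~ W_t @ H_t, so X ~ H_t.T @ W_t.T,
# i.e. the roles of scores and dictionary swap under transposition
model = NMF(n_components=5, init='nndsvd', max_iter=500)
W_t = model.fit_transform(X.T)    # (n_features, n_components)
H_t = model.components_           # (n_components, n_frames)

A = H_t.T                         # scores, (n_frames, n_components)
H = W_t.T                         # dictionary, (n_components, n_features)
print(np.linalg.norm(X - A @ H))  # reconstruction error
```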
@jae.shin_gitlab OK, here are a few points that could be pursued:
That is what I meant by just scratching the surface of what input data can look like! After trying these things in hand-crafted examples, you could also develop a fuzzing algorithm that compares the PCAs for a larger number of combinations of all these factors to perhaps find some where there are differences.
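A fuzzer for that could look roughly like this (just a sketch; the factors and ranges are made up and would have to match your actual test setup):

```python
import numpy as np

rng = np.random.default_rng()

def random_case(rng):
    # draw one random combination of factors
    return {
        'n_frames': int(rng.integers(100, 5000)),
        'n_components': int(rng.integers(1, 10)),
        'loading_dist': rng.choice(['gaussian', 'exponential', 'bimodal']),
        'weakest_scale': 10 ** rng.uniform(-4, 0),
    }

for _ in range(1000):
    case = random_case(rng)
    # build synthetic data for this case, run the reference PCA and the
    # hyperbox PCA, and record cases where they diverge beyond a tolerance
    ...
```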
After getting a handle on correctness, another topic that we haven't covered yet is efficiency. Under which conditions does the parallelized PCA version bring advantages over the full PCA? We already had one case where it doesn't make sense: If the data actually has the full dimensionality, the hyperbox version is less efficient since you'd have to do the full PCA at the merging stage anyway. How can you design an optimal calculation scheme for a given PCA task? What are the relevant factors here?
Thanks for the input! I will look into each of the points that you mentioned. As for:
Did you check if reference PCA and hyperbox PCA return components in the same order?
I did. They do indeed return the same component vectors, and after adjusting for the sign effect the norm difference was very small, so I took it that they essentially identify the same components.
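For reference, a sign adjustment along these lines (just a sketch, not the exact code I used):

```python
import numpy as np

def sign_aligned_diff(ref_components, new_components):
    # flip each recovered component so it points the same way as the
    # reference one, then return the per-component norm difference
    signs = np.sign(np.sum(ref_components * new_components, axis=1))
    aligned = new_components * signs[:, np.newaxis]
    return np.linalg.norm(ref_components - aligned, axis=1)
```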
Hi @jae.shin_gitlab, now I have taken a look at it. The implementation looks good to me, good work! For me the question is still whether the hyperbox method works under all circumstances and what its limits are. Did you cover the issues that I described on August 6th? Very importantly, it is still unclear if the method is mathematically valid under all circumstances, if it is really identical or just giving an approximation, how good the approximation is, etc.
Furthermore, one could mention that sampling methods other than the hyperbox are conceivable.
Given the fact that GSoC is almost over, how should we proceed? We could merge the code as-is as a prototype, i.e. move everything under the prototypes folder, or you could add a clear description of the state of testing and validation, clearly marking it as experimental and describing in detail what the open questions are. If you are interested in turning this into a scientific publication, the open points from my comment above should all be covered first.