@complyue pretty interesting case study!
If the initial goal is to obtain a qualitative understanding of the data, do those qualitative properties materially change when you work with downsampled versions? I'm probably not close enough to your use case, but for me, insight-oriented analyses of large datasets usually hit diminishing returns well before a full dataset scan, because the inferences (qualitative description or explicit parameter estimation) converge well before that. Capability-oriented ML models like language and vision are a different story, of course...
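To illustrate the diminishing-returns point, here's a toy sketch on purely synthetic data (all numbers made up): summary statistics stabilize long before a full scan.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "large dataset": 10M draws from a skewed distribution
full = rng.lognormal(mean=1.0, sigma=0.5, size=10_000_000)

# Estimates computed on growing subsamples converge well before n reaches N
for n in (1_000, 10_000, 100_000, 1_000_000, full.size):
    sample = rng.choice(full, size=n, replace=False)
    print(f"n={n:>10,}  mean={sample.mean():.4f}  p95={np.percentile(sample, 95):.4f}")
```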
On the topic of visualization, I've always thought there should be a way to automatically do dimensionality reduction (like UMAP) for EDA on any arbitrary structured data, by declaratively specifying a set of columns that can be heterogeneous in nature. Numerical columns should be automatically normalized, categorical variables should be automatically run through categorical embeddings, etc. I haven't seen anyone outside commercial vendors tackle this, although I've always felt it should be doable in a pretty general way with a bit of effort.
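Something along these lines already seems buildable from off-the-shelf pieces. A rough sketch, assuming sklearn and umap-learn, with hypothetical column names and one-hot encoding as a crude stand-in for learned categorical embeddings:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import umap  # pip install umap-learn

# The declarative part: just say which columns are which.
numeric_cols = ["age", "income"]          # hypothetical
categorical_cols = ["country", "device"]  # hypothetical

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),         # auto-normalize numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"),  # stand-in for embeddings
     categorical_cols),
])

eda_map = make_pipeline(preprocess, umap.UMAP(n_components=2))

# df = pandas.read_csv(...)            # any table with the declared columns
# coords = eda_map.fit_transform(df)   # 2-D coordinates, ready to scatter-plot
```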
I'm probably not close enough to your use case to comment very usefully (also not sure what this "narrative" approach you mention refers to). Thanks for sharing, though.
@YPares yeah it's an interesting development, though I haven't had time to try it myself.
Given the amount of resources and adoption Jupyter has, it has always felt far more stagnant as a technology than it should be. Maybe it's time for new players like this and Observable to come in.
Does it solve the reproducibility and in-memory state spaghetti of Jupyter notebooks? If it does, I'd be especially interested.
@YPares Cool! Will read it.
By the way, did you see streamly? https://github.com/composewell/streamly
Plus a few posts on how to make streaming benchmarks: https://github.com/composewell/streaming-benchmarks
I've not taken a deep dive into Observable. Mike Bostock, the creator of D3, is one of the people behind it. It seems to solve the out-of-control notebook state problem much better than Jupyter does, by taking advantage of JavaScript's asynchronous nature (cells are promises, and re-defining a variable is disallowed).
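Very roughly, the model seems to be something like this toy sketch (Python asyncio standing in for JS promises; all names made up, not Observable's actual runtime):

```python
import asyncio

class Notebook:
    """Toy model: each cell is a single-assignment promise that awaits the
    cells it references, so evaluation order falls out of the dataflow."""
    def __init__(self):
        self._defs = {}   # name -> (async fn, dependency names)
        self._tasks = {}  # name -> Task, i.e. the cell's "promise"

    def cell(self, name, fn, deps=()):
        if name in self._defs:
            raise ValueError(f"cell {name!r} already defined")  # no re-definition
        self._defs[name] = (fn, deps)

    async def value(self, name):
        if name not in self._tasks:  # start each cell at most once
            fn, deps = self._defs[name]

            async def run():
                args = [await self.value(d) for d in deps]  # await upstream cells
                return await fn(*args)

            self._tasks[name] = asyncio.create_task(run())
        return await self._tasks[name]

async def total(xs):
    return sum(xs)

nb = Notebook()
nb.cell("data", lambda: asyncio.sleep(0.1, result=[1, 2, 3]))  # e.g. an async fetch
nb.cell("total", total, deps=("data",))
print(asyncio.run(nb.value("total")))  # 6
```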
This stuff is far beyond Jupyter IMO:
https://observablehq.com/collection/@observablehq/explorables
I like this one, which covers a technical topic that even many researchers are not well versed in (but is super important in ML):
https://observablehq.com/@tophtucker/theres-plenty-of-room-in-the-corners?collection=@observablehq/explorables
On the other hand, it's all JavaScript, which would require a cultural sea change for it to become mainstream in the data science / machine learning world. I wouldn't put anything past JavaScript, but a lot of shifts would have to align.
Iodide is based on WebAssembly.
Nextjournal has some functional programming ideas behind it:
> The good news is that I believe that it is possible to fix computational reproducibility by applying a principle of functional programming at the systems level: immutability (which is what we're doing at Nextjournal). I will show how applying immutability to code, data and the computational environment gives us a much better chance of keeping things runnable in a follow-up post.
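If I read that right, the principle itself fits in a toy sketch (all names made up; certainly not Nextjournal's actual implementation): key every result by a content hash of the code, inputs, and environment, and never mutate the store.

```python
import hashlib, inspect, json

_store = {}  # stands in for an immutable, content-addressed result store

def run_immutable(fn, inputs, environment):
    """Results are keyed by *everything* that could change them."""
    key = hashlib.sha256(json.dumps({
        "code": inspect.getsource(fn),  # the code,
        "inputs": inputs,               # the data,
        "env": environment,             # and the environment (e.g. pinned versions)
    }, sort_keys=True).encode()).hexdigest()
    if key not in _store:               # recompute only for never-seen keys
        _store[key] = fn(**inputs)
    return key, _store[key]

def double(x):
    return 2 * x

key, result = run_immutable(double, {"x": 21}, {"python": "3.11"})
print(key[:12], result)  # same key -> same cached result, hence re-runnable
```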
You can `catch` an AFException, but I think the "pure" return types in the main API are misleading.
Looked at Falcon; about the data size it handles:
> 10M flights in the browser and ~180M flights or ~1.7B stars when connected to OmniSciDB (formerly known as MapD)
Makes me remember that some months ago I had to increase Chrome's heap size to around 12GB to visualize one of my datasets (BokehJS frontend, golang backend), since even the 64-bit version of Chrome has a default heap size limit of around 3.5GB:
```js
performance.memory.jsHeapSizeLimit / 1024 / 1024 / 1024
// => 3.501772880554199
```
Seems they haven't hit this limit, while Bokeh had already exceeded it.