@complyue Vega is supported in Python in the form of the altair bindings: https://altair-viz.github.io/. I use it all the time when working with Python!
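(For the Haskell side, here's a minimal sketch of the same kind of declarative Vega-Lite spec using the hvega package; the scatter data is made up purely for illustration.)

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Minimal hvega sketch: build a declarative Vega-Lite scatter plot and
-- write it out as a standalone HTML page (assumes the hvega package).
import Graphics.Vega.VegaLite

scatter :: VegaLite
scatter =
  let dat = dataFromColumns []
              . dataColumn "x" (Numbers [1, 2, 3, 4])
              . dataColumn "y" (Numbers [2, 4, 1, 3])
      enc = encoding
              . position X [ PName "x", PmType Quantitative ]
              . position Y [ PName "y", PmType Quantitative ]
  in toVegaLite [ dat [], mark Point [], enc [] ]

main :: IO ()
main = toHtmlFile "scatter.html" scatter
```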
Once you get to a million datapoints, I tend to lean towards bespoke apps that either serve data on demand or expose data at the right level of granularity (Google Maps style). By the time you're dealing with > 30k datapoints, you're either thinking of the data as a density or inspecting points in a local region of the data space. But I do get that there's something nice about a framework that takes care of this for you without building from scratch.
I'm a big fan of crossfiltering/linking concepts as well. There's probably room in the DS ecosystem for an R Shiny killer with crossfiltering as a basis.
@austinvhuang when choosing a vis kit very early on (years ago), I intentionally avoided declarative plotting tools, e.g. echarts, plotly etc. I decided that later interaction with the original data source was important and that incremental updates to the chart would always be needed; at the time I had in mind stock k-charts updating in realtime, and bokeh fit that idea pretty well. But today I'd say that's not that important.
wrt data size, the problem for me is that my team is not particularly strong at data modeling; they need to see something before they can capture anything meaningful from the data and start an informed analysis. I developed a home-brew array database that mmaps an entire dataset (from 20GB to 300GB) into each computing node's memory address space. With each node typically having 144GB of RAM, it's trivial for a backend process to fully scan the dataset via plain memory reads. Repeated scans are perfectly cached by the OS's kernel page cache and shared by all processes on that node, so only the first scan on a node needs to consume bandwidth to the storage server to fill the cache. So throughput over massive data is really cheap in my environment.
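(A minimal sketch of that mmap-then-scan pattern in Haskell, assuming the `mmap` and `bytestring` packages; the file layout, a flat array of native-endian Doubles, is an invented assumption for illustration, since a real array DB would carry its own metadata.)

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Unsafe as BSU
import           Foreign.Ptr            (Ptr, castPtr)
import           Foreign.Storable       (peekElemOff)
import           System.IO.MMap         (mmapFileByteString)

-- | Map the whole dataset file into this process's address space and sum it.
-- Repeated scans hit the OS page cache, as described above.
sumDataset :: FilePath -> IO Double
sumDataset path = do
  bs <- mmapFileByteString path Nothing      -- Nothing = map the entire file
  BSU.unsafeUseAsCStringLen bs $ \(p, len) -> do
    let dp = castPtr p :: Ptr Double
        n  = len `div` 8                     -- number of Doubles in the file
        go !acc i
          | i >= n    = pure acc
          | otherwise = do
              x <- peekElemOff dp i
              go (acc + x) (i + 1)
    go 0 0

main :: IO ()
main = sumDataset "dataset.f64" >>= print
```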
The problem at hand for me now is efficiency in data analysis. I've identified it as my analysts' limited ability to describe the data well enough with what they've got. I'm investigating some sort of narrative method for data description: leading them to start by saying what they'd like to see, then what's needed in order to see that, and so on, hopefully landing at the data we actually have. I started haskelling for exactly this purpose, to find a proper DSL to establish that communication.
So far the DSL is not ideal: a free monad seems to be an unacceptable performance killer, so we have to stay with mtl style, simple transformers, or even a vanilla monad. And I've actually found that my direction points to massively concurrent event simulation to achieve the narrative-style data description; though Haskell seems pretty good at handling concurrency and parallelism, I've found no reference implementation of my idea.
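(As a toy illustration of the mtl-style encoding mentioned above, a final-tagless class instead of a free monad; the operation names wantToSee/needs/haveData are invented for the sketch, not the actual DSL.)

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
-- Toy mtl-style ("final tagless") encoding of a narrative data-description
-- DSL. GHC can specialize through the class dictionary, which is why this
-- style tends to perform better than a free-monad interpreter.
import Control.Monad.State

class Monad m => MonadNarrative m where
  wantToSee :: String -> m ()     -- "what I'd like to see"
  needs     :: String -> m ()     -- "what's needed in order to see that"
  haveData  :: String -> m Bool   -- "do we actually have this data?"

-- One cheap interpreter: just collect the narrative as a list of steps.
newtype Collect a = Collect (State [String] a)
  deriving (Functor, Applicative, Monad)

instance MonadNarrative Collect where
  wantToSee g = Collect (modify (("see: "  ++ g) :))
  needs d     = Collect (modify (("need: " ++ d) :))
  haveData d  = Collect (modify (("have? " ++ d) :)) >> pure True

runCollect :: Collect a -> [String]
runCollect (Collect m) = reverse (execState m [])

example :: MonadNarrative m => m ()
example = do
  wantToSee "daily volatility per instrument"
  needs "tick-level trade prices"
  ok <- haveData "raw tick archive"
  if ok then pure () else needs "a feed from the exchange"

main :: IO ()
main = mapM_ putStrLn (runCollect example)
```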
@complyue pretty interesting case study!
If the initial goal is to obtain a qualitative understanding of the data, do those qualitative properties materially change when you work with downsampled versions? I'm probably not close enough to your use case, but for me, insight-oriented analyses of large datasets usually hit diminishing returns well before full dataset scans, because the inferences (qualitative descriptions or explicit parameter estimates) converge well before that. Capability-oriented ML models like language and vision are a different story, of course...
On the topic of visualization, I've always thought there should be a way to automatically do dimensionality reduction (like UMAP) for EDA on any arbitrary structured data by declaratively specifying a set of columns that could be heterogeneous in nature. Columns of numerical data should be automatically normalized, categorical variables should automatically be run through categorical embeddings, etc. I haven't seen anyone outside commercial vendors tackle this, although I've always felt it should be doable in a pretty general way with a bit of effort.
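(A hypothetical Haskell sketch of what such a declarative column spec could look like; none of this exists as a library, the "embedding" is a trivial stand-in, and the UMAP projection step is left out.)

```haskell
-- Hypothetical sketch of a declarative column spec for automatic EDA:
-- numeric columns get z-score normalized, categorical ones get a stand-in
-- embedding, and the resulting feature columns would then be handed to a
-- UMAP/t-SNE implementation for the 2D projection.
import           Data.List       (nub)
import qualified Data.Map.Strict as M

data ColumnSpec
  = Numeric     String   -- column name; normalize automatically
  | Categorical String   -- column name; embed automatically
  deriving Show

-- | Z-score normalization for a numeric column.
normalize :: [Double] -> [Double]
normalize xs = map (\x -> (x - mu) / sigma) xs
  where
    n     = fromIntegral (length xs)
    mu    = sum xs / n
    sigma = sqrt (sum [(x - mu) ^ (2 :: Int) | x <- xs] / n) + 1e-9

-- | Stand-in for a learned categorical embedding: one index per level.
embedCategorical :: [String] -> [Double]
embedCategorical vs = map (levels M.!) vs
  where levels = M.fromList (zip (nub vs) [0 ..])

-- | Turn a heterogeneous table (raw strings per column) into numeric
-- feature columns, driven only by the declarative specs.
featurize :: [(ColumnSpec, [String])] -> [[Double]]
featurize = map go
  where
    go (Numeric _,     raw) = normalize (map read raw)
    go (Categorical _, raw) = embedCategorical raw

main :: IO ()
main = mapM_ print $ featurize
  [ (Numeric "age",       ["23", "35", "31"])
  , (Categorical "color", ["red", "blue", "red"]) ]
```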
I'm probably not close enough to your use case to comment much usefully (also not sure what this "narrative" approach you mention refers to). Thanks for sharing though.
@YPares yeah it's an interesting development, though I haven't had time to try it myself.
Given the amount of resources and adoption jupyter has, it's always felt way more stagnant than it should be as a technology. Maybe it's time for new players like this and Observable to come in.
Does it solve the reproducibility + in-memory state spaghetti of jupyter notebooks? If it does, I'd be especially interested.
@YPares Cool! Will read it.
By the way, did you see streamly? https://github.com/composewell/streamly
Plus a few posts on how to make streaming benchmarks: https://github.com/composewell/streaming-benchmarks
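(For anyone who hasn't tried it, a tiny streamly sketch; the API has moved around between releases, this assumes the Streamly.Prelude module from the 0.7/0.8 series.)

```haskell
-- Tiny streamly example: pure combinators and effectful steps compose on
-- the same stream type, and `drain` runs the stream for its effects.
import qualified Streamly.Prelude as S

main :: IO ()
main =
  S.drain
    $ S.mapM print            -- effectful step per element
    $ S.filter even           -- ordinary pure combinator
    $ S.fromList [1 .. 10 :: Int]
```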
I've not taken a deep dive into Observable. The creator of D3, Mike Bostock, is one of the people behind it. It seems to solve the out-of-control notebook state problem much better than jupyter does, taking advantage of javascript's asynchronous aspects (cells are promises, variable re-definition is disallowed).
This stuff is far beyond jupyter IMO:
https://observablehq.com/collection/@observablehq/explorables
I like this one, which covers a technical topic that even many researchers are not well versed in (but it's super important in ML):
https://observablehq.com/@tophtucker/theres-plenty-of-room-in-the-corners?collection=@observablehq/explorables
On the other hand, it's all javascript, which would require a cultural sea change for it to take over as mainstream in the data science / machine learning world. I wouldn't put anything past javascript, but a lot of shifts would have to align.
iodide is based on webassembly
nextjournal has some functional programming ideas behind it:
The good news is that I believe that it is possible to fix computational reproducibility by applying a principle of functional programming at the systems level: immutability (which is what we're doing at Nextjournal). I will show how applying immutability to code, data and the computational environment gives us a much better chance of keeping things runnable in a follow-up post.