Activity
  • May 20 05:04  ocramz on gh-pages: Add `sampling` (compare)
  • May 19 09:03  ocramz on gh-pages: Add kdt, Supervised Learning se… (compare)
  • Apr 14 01:32  tonyday567 removed as member
  • Jan 30 07:37  ocramz on gh-pages: Add arrayfire (compare)
  • Jan 02 12:51  ocramz on gh-pages: add inliterate (compare)
  • Jan 02 12:43  ocramz on gh-pages: update hvega entry (compare)
  • Jul 01 2019 09:43  dmvianna added as member
  • Jun 15 2019 04:55  ocramz on gh-pages: Add pcg-random (compare)
  • Jun 14 2019 16:08  ocramz labeled #42
  • Jun 14 2019 16:08  ocramz opened #42
  • Jun 06 2019 18:21  ocramz on gh-pages: Fix graphite link Merge pull request #41 from alx… (compare)
  • Jun 06 2019 18:21  ocramz closed #41
  • Jun 06 2019 17:32  alx741 opened #41
Compl Yue
@complyue
@ocramz thanks! Does stack work for installing stock GHC on your system? If so, I think it may be distributing the bindist in a compression format other than xz, and maybe I can re-pack the bindist to make it work.
Marco Z
@ocramz
yep I routinely use stack :)
Compl Yue
@complyue
okay, found
Compl Yue
@complyue
@ocramz I've updated hadui-demo to use bz2, please pull and try the build again.
I'm away from my Mac; the bindist was packed on Linux, so I'm not quite sure it will work out, but I think it very probably will.
Austin Huang
@austinvhuang
@complyue any reason for bokeh vs vega?
Compl Yue
@complyue
@austinvhuang thanks for pointing that out! We used to be Python-centric, so we ignored Vega in the first place; you just reminded me that we've drifted away from the Python ecosystem, so Vega is an option now :)
Compl Yue
@complyue
@austinvhuang after a quick refresh, I think we'll stay with Bokeh because of its acceptable lag when visualizing data points on the order of millions, thanks to its design of rendering with WebGL by default; see https://www.anaconda.com/python-data-visualization-2018-why-so-many-libraries/ and check out the 'Data Size' section there. Bokeh has long been battle-tested with us in this regard.
Another killer feature of Bokeh for us is this: https://docs.bokeh.org/en/latest/docs/user_guide/interaction/linking.html#userguide-interaction-linking We usually have a few, sometimes up to 30, figures shown, with their x axes or both x+y axes linked for zoom/pan/selection. I haven't tried hard enough with other frameworks to implement this effect, but Bokeh just works.
Doug Burke
@DougBurke
@complyue Vega can do linked views for pan, zoom, and selection - e.g. see http://hackage.haskell.org/package/hvega-0.4.1.1/docs/Graphics-Vega-Tutorials-VegaLite.html#g:29 - but I have not tried it out on very large datasets (my guess is that it isn't optimised for this use case).
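
For reference, a minimal hvega sketch (against the hvega-0.4 API linked above) of a scatter plot whose scales are bound to an interval selection, which is what gives the pan/zoom behaviour; the data and column names here are made up for illustration, and linking several views as in the tutorial section builds on the same selection machinery:

    {-# LANGUAGE OverloadedStrings #-}
    import Graphics.Vega.VegaLite

    -- scatter plot with an interval selection bound to the scales,
    -- so dragging pans and the mouse wheel zooms
    pannableScatter :: VegaLite
    pannableScatter =
      let dat = dataFromColumns []
                  . dataColumn "x" (Numbers [1, 2, 3, 4, 5])
                  . dataColumn "y" (Numbers [2, 4, 3, 5, 4])
          enc = encoding
                  . position X [ PName "x", PmType Quantitative ]
                  . position Y [ PName "y", PmType Quantitative ]
          sel = selection
                  . select "view" Interval [ BindScales ]
      in toVegaLite [ dat [], mark Circle [], enc [], sel [] ]
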
Compl Yue
@complyue
@DougBurke yeah, this feature seems on par. You even made it work with IHaskell 👍, I wish I had dug harder in Stackage/Hackage ;-)
Um, D3-based visualizations all hit a data-size bottleneck at lower volumes than WebGL-based ones. I hit that wall 2~3 years ago and, because of Python, have been stuck with Bokeh all along.
Isaac Shapira
@fresheyeball_gitlab
Howdy!
I am here to leeeeaaaarn!
Yves Parès
@YPares
Hi!
We are here to teeeeaaaaaaach!
(within the limits of the reasonable)
Isaac Shapira
@fresheyeball_gitlab
@YPares many sauces of awesome
Austin Huang
@austinvhuang

@complyue Vega is supported in Python in the form of the Altair bindings https://altair-viz.github.io/ - I use it all the time when working with Python!

Once you get to a million datapoints, I tend to lean towards bespoke apps that either serve data on demand or expose data at the right level of granularity (Google Maps style). By the time one is dealing with >30k datapoints, you're either thinking of the data in the form of a density, or inspecting points in a local region of the data space. But I do get that there's something nice about a framework that takes care of this for you without building from scratch.
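
To make the "think of the data as a density" point concrete, a throwaway Haskell sketch (the function name and the crude fixed-width binning scheme are mine, purely for illustration) that collapses a large series into per-bin counts so a plot only has to draw one mark per bin:

    import qualified Data.Map.Strict as M

    -- crude fixed-width binning; assumes a non-empty series whose
    -- min and max differ, with no NaN or outlier handling
    toDensity :: Int -> [Double] -> M.Map Int Int
    toDensity nBins xs = M.fromListWith (+) [ (bin x, 1) | x <- xs ]
      where
        lo    = minimum xs
        hi    = maximum xs
        width = (hi - lo) / fromIntegral nBins
        bin x = min (nBins - 1) (floor ((x - lo) / width))
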

I'm a big fan of crossfiltering/linking concepts as well. There's probably room in the DS ecosystem for an R Shiny killer with crossfiltering as a basis.

Austin Huang
@austinvhuang
welcome @fresheyeball_gitlab !
Compl Yue
@complyue

@austinvhuang at the very early stage of choosing a vis kit (years ago), I intentionally avoided declarative plotting tools, i.e. ECharts, Plotly, etc. I decided that later interaction with the original data source was important and that incremental updates to the chart would always be needed; at the time I had in mind stock k-line charts being updated in real time, and Bokeh fit that idea pretty well. But today I'd say that's not so important.

Wrt data size as a problem for me: my team is not particularly strong at data modeling; they need to see something before they can capture something meaningful from the data, and only then start an informed analysis. I developed a home-brew array database that mmaps an entire dataset (sized from 20GB ~ 300GB) into each computing node's memory address space; with each node having a typical 144GB of RAM, it's trivial for a backend process to fully scan the dataset by means of plain memory reads. Repeated scans are perfectly cached by the OS's kernel page cache and shared by all processes on that node, so only the first scan on a node needs to consume bandwidth to the storage server to fill that cache. So throughput over massive data is really cheap in my environment.
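
Not his actual database of course, but a minimal sketch of the mmap-plus-full-scan idea using the mmap package; the file layout (a flat vector of machine-native Doubles) and the function name are assumptions for illustration:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Unsafe as BS
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (peekElemOff)
    import System.IO.MMap (mmapFileByteString)

    -- map the whole dataset file into this process's address space and
    -- scan it with plain memory reads; after the first pass the OS page
    -- cache serves repeated scans, shared by every process on the node
    sumColumn :: FilePath -> IO Double
    sumColumn path = do
      bytes <- mmapFileByteString path Nothing
      BS.unsafeUseAsCStringLen bytes $ \(ptr, len) -> do
        let n = len `div` 8               -- number of Doubles in the file
            go !acc i
              | i >= n    = pure acc
              | otherwise = do
                  x <- peekElemOff (castPtr ptr) i :: IO Double
                  go (acc + x) (i + 1)
        go 0 0
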

At hand now is the problem of efficiency in data analysis. I've identified it as my analysts' limited capability to describe the data well enough with what they've got. I'm investigating some sort of narrative method for data description, leading them to start by telling what they'd like to see, then what's needed in order to see that, and so on, hopefully finally landing on the data we actually have. I started Haskelling for exactly this purpose: finding a proper DSL to establish that communication.

So far the DSL is not as ideal as I'd like: free monads seem to be an unacceptable performance killer, so we have to stay with mtl style, simple transformers, or even a vanilla monad. And I've actually found that my direction points to massively concurrent event simulation to achieve the narrative style of data description; though Haskell seems pretty good at handling concurrency and parallelism, I've found no reference implementation of my idea.
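
To make the encoding trade-off above concrete, a tiny sketch of the same two (made-up) operations in both styles: the free-monad version (via the free package) builds a data structure that an interpreter has to walk, which is where the overhead he mentions tends to come from, while the mtl-style class dispatches directly:

    {-# LANGUAGE DeriveFunctor #-}
    {-# LANGUAGE FlexibleInstances #-}
    import Control.Monad.Free (Free, liftF)        -- "free" package
    import Control.Monad.State (StateT, modify)    -- mtl style

    -- free-monad encoding: programs are values, interpreted later
    data DescribeF k
      = Want String k
      | Require String k
      deriving Functor

    type DescribeFree = Free DescribeF

    wantF :: String -> DescribeFree ()
    wantF s = liftF (Want s ())

    -- mtl-style encoding: a class of operations, no intermediate structure
    class Monad m => MonadDescribe m where
      want    :: String -> m ()
      require :: String -> m ()

    instance Monad m => MonadDescribe (StateT [String] m) where
      want s    = modify (("want: "    ++ s) :)
      require s = modify (("require: " ++ s) :)
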

Compl Yue
@complyue
I'd characterize visualization in my scenario as less about showcasing hypotheses and more about blind data exploration, where the more data we see in the first place, the more meaningful the clues we can extract, given the fixed brain power/capability we have in the team.
Austin Huang
@austinvhuang

@complyue pretty interesting case study!

If the initial goal is to obtain a qualitative understanding of the data, do those qualitative properties materially change working with downsampled versions? I'm probably not close enough to your use case, usually for me insight-oriented analyses of large datasets hit diminishing returns well before full dataset scans because the inferences (qualitative description or explicit parameter estimation) converge well before that. Capability-oriented ML models like language and vision are a different story of course...

On the topic of visualization, I've always thought there should be a way to automatically do dimensionality reduction (like UMAP) for EDA on any arbitrary structured data by declaratively specifying a set of columns that could be heterogeneous in nature. Columns of numerical data should be automatically normalized, categorical variables should automatically be run through categorical embeddings, etc. I haven't seen anyone outside commercial vendors tackle this, though I've always felt it should be doable in a pretty general way with a bit of effort.

I'm probably not close enough to your use case to comment much usefully (also not sure what this "narrative" approach you mention refers to). Thanks for sharing though.

Austin Huang
@austinvhuang

@YPares yeah it's an interesting development, though I haven't had time to try it myself.

Given the amount of resources and adoption Jupyter has, it's always felt way more stagnant than it should as a technology. Maybe it's time for new players like this and Observable to come in.

Does it solve the reproducibility + in-memory state spaghetti of Jupyter notebooks? If it does, I would be especially interested.

Compl Yue
@complyue
@austinvhuang Thanks for sharing your insights, very informative for me. I can't put my thoughts on solid theoretical ground, but I feel a sort of down-sampling is indeed part of the approach we're taking: we use an event system to adapt to the varying significance each model (with its parameters) contributes to the overall expectation along the timeline, and present to a human only the captured summary of sufficient importance, for estimation and possible analysis. Or simply put, the vast majority of the data, even the parts that are individually meaningful, render each other mutually into plain noise. Finding the boundaries of the meaningful parts of the data is the hardest work, I presume, and tailoring by importance should be effective in attacking this. Ideally only 1~2% of the dataset should be left for human analysis; we definitely need vis tools beyond that point, but we still need vis tools to get there.
Yves Parès
@YPares
Hello guys, just to say that the blog post presenting porcupine is up https://www.tweag.io/posts/2019-10-30-porcupine.html cc @mgajda @mmesch
@austinvhuang I don't know about Observable. I should check them out
I'm not sure Netflix has solved reproducibility of notebooks but they tout they've made a big step towards it anyway :)
Michał J. Gajda
@mgajda

@YPares Cool! Will read it.

By the way, did you see streamly? https://github.com/composewell/streamly (a tiny taste is sketched after this message)
There are also a few posts on how to make streaming benchmarks: https://github.com/composewell/streaming-benchmarks

I wonder if we can make an improved Porcupine+Streamly pipeline platform that can replace Shake as well?
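
A tiny taste of the streamly API mentioned above, assuming streamly ~0.8 where Streamly.Prelude exposes fromList / mapM / sum (later releases reorganise the modules):

    import qualified Streamly.Prelude as S

    -- stream ten numbers, log each one as it flows past, and fold to a sum
    main :: IO ()
    main = do
      total <- S.sum
             $ S.mapM (\x -> putStrLn ("seen " ++ show x) >> pure x)
             $ S.fromList [1 .. 10 :: Int]
      print total
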
Yves Parès
@YPares
I invited you to https://gitter.im/tweag/porcupine to continue the discussion
Austin Huang
@austinvhuang

I've not taken a deep dive into Observable. The creator of D3, Mike Bostock, is one of the people behind it. It seems to solve the out-of-control notebook state problems way better than Jupyter does, taking advantage of JavaScript's asynchronous aspects (cells = promises, variable re-definition disallowed).

This stuff is far beyond jupyter IMO:

https://observablehq.com/collection/@observablehq/explorables

I like this one, which covers a technical topic even many researchers are not well versed in (but which is super important in ML):
https://observablehq.com/@tophtucker/theres-plenty-of-room-in-the-corners?collection=@observablehq/explorables

On the other hand, it's all JavaScript, which would require a cultural sea change for it to take over as mainstream in the data science / machine learning world. I wouldn't put anything beyond JavaScript, but a lot of shifts would have to align.

Compl Yue
@complyue
wow, I'm astonished by Observable notebooks 😲. I regret not discovering them earlier; lucky to see them mentioned here, thanks!!
They achieved all this with bare-metal JS, which makes me feel type safety is less of a concern, given how well they did in all these aspects.
Compl Yue
@complyue
also, this is the first time I've come across the 'ball' concept, many thanks for sharing! @austinvhuang
Yves Parès
@YPares
Well all JS might even be an advantage for us :) ==> compile to JS
and this one: https://nextjournal.com/
both interesting approaches

Iodide is based on WebAssembly.
Nextjournal has some functional programming ideas behind it:

The good news is that I believe that it is possible to fix computational reproducibility by applying a principle of functional programming at the systems level: immutability (which is what we're doing at Nextjournal). I will show how applying immutability to code, data and the computational environment gives us a much better chance of keeping things runnable in a follow-up post.

Yves Parès
@YPares
"In Nextjournal, you can use multiple programming language runtimes together in a single notebook. Values can be exchanged between runtimes using files."
Files?? C'mon...
^^
show some love
Compl Yue
@complyue
I've not been using arrayfire, but I love the idea and appreciate their effort to make it viable.
Compl Yue
@complyue
For Haskell, I'd think a GHC backend spitting out CUDA C would be even sexier, but I see no sign of it coming into being.
Bogdan Penkovsky
@masterdezign
Marco Z
@ocramz
@complyue that's what accelerate does, more or less
Compl Yue
@complyue
@ocramz thanks! this place is full of surprises. the only accelerate I knew before now is https://developer.apple.com/documentation/accelerate , now I know acceleratehs, too cool :D
Marco Z
@ocramz
@complyue :) yeah, sorry for not elaborating earlier, I was on my mobile. http://hackage.haskell.org/package/accelerate does quite a few things and I'm not too qualified to describe them in detail, but one of them used to be rewriting high-level Haskell array programs in terms of predefined CUDA C splices, which are then executed on the GPU. Nowadays, IIUC, this is subsumed by compiling to LLVM intermediate representation: https://hackage.haskell.org/package/accelerate-llvm-ptx
there are some actual experts lurking around here, such as the main author @tmcdonell and one of the core llvm-hs contributors, @cocreature
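
For anyone curious, the canonical dot-product example from the accelerate docs gives a feel for it; this sketch assumes accelerate plus accelerate-llvm-ptx (which needs a CUDA-capable GPU at runtime; accelerate-llvm-native exposes the same run for the CPU):

    import qualified Data.Array.Accelerate          as A
    import           Data.Array.Accelerate.LLVM.PTX (run)

    -- the body of dotp is an embedded array program; `run` compiles it
    -- through LLVM to PTX and executes it on the GPU
    dotp :: A.Vector Float -> A.Vector Float -> A.Scalar Float
    dotp xs ys = run $ A.fold (+) 0 (A.zipWith (*) (A.use xs) (A.use ys))

    main :: IO ()
    main = do
      let xs = A.fromList (A.Z A.:. 5) [1 .. 5]    :: A.Vector Float
          ys = A.fromList (A.Z A.:. 5) [2, 2, 2, 2, 2]
      print (dotp xs ys)
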