Stefan Dresselhaus
@Drezil
ah, ok :)
Compl Yue
@complyue
btw, I haven't been Haskelling for long; don't many people compile GHC for themselves? when I think of it, what comes to mind is TensorFlow: there you almost always compile it yourself, as the stock release is way too conservative about hardware requirements, so you have to compile with SSE4.2, AVX, etc. enabled to not waste the capabilities of a decent CPU
Stefan Dresselhaus
@Drezil
not really... with TensorFlow you don't care about SSE4 & AVX if you use a graphics card anyway...
most of it is: pip install tensorflow & use it.
at least in my experience...
it might be different if you have a big, dedicated team at a big corporation, but most of my experience is in science (students to post-docs who are happy when it just works(tm)) and in small businesses where you have just 1-2 people doing ML
Compl Yue
@complyue
okay, my team is fewer than 10 people, but we have to spin up dozens of rack servers for constant crunching, so that's not typical :)
Guillaume Desforges
@GuillaumeDesforges
Hey everybody!
I was thinking about writing some sort of cached pipeline for my machine learning experimentation workflow. For the mechanism I have in mind, I need to compare the running function plus its inputs against previously run functions and inputs, and I was thinking of comparing hashes.
Would it be possible to get the hash of a piece of source code (at compile time, I guess)?
Yves Parès
@YPares
@GuillaumeDesforges Hi! Hashing the source code would be difficult to do without some Template Haskell trickery. http://hackage.haskell.org/package/funflow does what you want, and it just requires you to update a salt whenever you modify your code and want to invalidate your cache
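A minimal sketch of that salt-plus-input-hash idea (the `CacheKey`, `toyHash`, and `mkKey` names are all hypothetical, and the hash is a toy djb2 stand-in; funflow's actual API differs):

```haskell
-- A toy content hash (djb2); a real pipeline would use a cryptographic hash.
toyHash :: String -> Int
toyHash = foldl (\h c -> h * 33 + fromEnum c) 5381

-- Hypothetical cache key: a manually bumped salt plus a hash of the inputs.
-- Bumping the salt whenever the function's source changes invalidates stale
-- cache entries without having to hash the source code itself.
data CacheKey = CacheKey Int Int
  deriving (Eq, Show)

mkKey :: Show a => Int -> a -> CacheKey
mkKey salt input = CacheKey salt (toyHash (show input))
```

Same salt and same inputs yield the same key, so the cached result can be reused; bumping the salt changes every key and forces recomputation.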
Tony Day
@tonyday567
@complyue Is hadui like a ghci replacement, similar to intero, say, with interactive custom commands, or is it more fundamentally a different way to use GHC? Or something else?
Compl Yue
@complyue
@tonyday567 my own use case is to have vscode+hie open for developing all the code of the stack project, with hadui-dev run as the IDE's build task (it keeps running all the time, pointing back at the code when runtime errors occur), then a browser window open on the hadui page to enter interactive code, mostly parameter scripts to trigger plotting in the web browser.
Compl Yue
@complyue
hadui-dev will print the source location to the IDE console when a runtime error occurs, so you just click the link to navigate to the source in vscode. i'm just now working on the error-formatting code, and have found it non-trivial to extract useful info from runtime errors with GHC :/
Compl Yue
@complyue
hadui-dap is planned to support breakpoints, single stepping, variable inspection, etc. in vscode for debugging hadui projects, though there's no tight schedule for that.
Compl Yue
@complyue
Surprise! it may be painless for you to try out hadui right now. I was wrong to dismiss stack's custom GHC instance feature as immature; that was because I'd built GHC with its new Hadrian build system. I'm amazed to find that stack actually works pretty well with a bindist of GHC built with good old make.
so you are 3 commands away from a running hadui-demo on your machine, given you are on a decent macOS or Linux. please let me know whether it succeeded or not if you ever try it. https://github.com/complyue/hadui-demo
Marco Z
@ocramz
@complyue trying the demo (OS X Mavericks), but my system doesn't have xz, it seems
this project is a very cool idea; my only concern (pretty much like @tonyday567's) is with the custom GHC build. Congratulations on getting it all to work together, though!
Compl Yue
@complyue
@ocramz thanks! is stack able to install stock GHC on your system? if so, maybe it distributes in a compression format other than xz, and I can re-pack the bindist to make it work.
Marco Z
@ocramz
yep I routinely use stack :)
Compl Yue
@complyue
okay, found
Compl Yue
@complyue
@ocramz I've updated hadui-demo to use bz2, please pull and try building again
I'm away from my Mac; it was packed on Linux, so I'm not quite sure it will work out, but I think it very probably will.
Austin Huang
@austinvhuang
@complyue any reason for bokeh vs vega?
Compl Yue
@complyue
@austinvhuang thanks for pointing that out! we used to be Python-centric, so I ignored vega in the first place; you just reminded me that we've drifted off the Python ecosystem, so vega is an option now :)
Compl Yue
@complyue
@austinvhuang a quick refresh: I think we'll stay with bokeh because of its acceptable lag in visualizing data points on the order of millions, due to its design of rendering with WebGL by default; https://www.anaconda.com/python-data-visualization-2018-why-so-many-libraries/ check out the 'Data Size' section there. bokeh has long been battle-tested with us in this regard.
another killer feature of bokeh for us is this: https://docs.bokeh.org/en/latest/docs/user_guide/interaction/linking.html#userguide-interaction-linking we usually have a few, sometimes up to 30, figures shown, with their x axes or both x+y linked for zoom/pan/selection. I haven't tried hard enough with other frameworks to implement this effect, but bokeh just works.
Doug Burke
@DougBurke
@complyue Vega can do linked views for pan, zoom, and selection - e.g. see http://hackage.haskell.org/package/hvega-0.4.1.1/docs/Graphics-Vega-Tutorials-VegaLite.html#g:29 - but I have not tried it out on very large datasets (my guess is that it isn't optimised for this use case).
Compl Yue
@complyue
@DougBurke yeah, this feature seems on par. you even made it work with IHaskell 👍, I wish I had dug harder through stackage/hackage ;-)
um, d3-based visualizations all hit a data-size bottleneck at lower volumes than WebGL-based ones; I hit that wall 2~3 years ago, and because of Python, have been stuck with bokeh all along.
Isaac Shapira
@fresheyeball_gitlab
Howdy!
I am here to leeeeaaaarn!
Yves Parès
@YPares
Hi!
We are here to teeeeaaaaaaach!
(within the limits of the reasonable)
Isaac Shapira
@fresheyeball_gitlab
@YPares many sauces of awesome
Austin Huang
@austinvhuang

@complyue vega is supported in Python in the form of the altair bindings https://altair-viz.github.io/ - I use it all the time when working with Python!

Once you get to a million datapoints, I tend to lean towards bespoke apps that either serve data on demand or expose data at the right level of granularity (google maps style). By the time one is dealing with > 30k datapoints, you're either thinking of the data in the form of a density, or inspecting points in a local region of the data space. But I do get that there's something nice about a framework that takes care of this for you without building from scratch.

I'm a big fan of crossfiltering/linking concepts as well. There's probably room in the DS ecosystem for an rshiny killer with crossfiltering as a basis.

Austin Huang
@austinvhuang
welcome @fresheyeball_gitlab !
Compl Yue
@complyue

@austinvhuang at the very early stage of choosing a vis kit (years ago), I intentionally avoided declarative plotting tools, i.e. echarts, plotly, etc. I decided that later interaction with the original data source was important, and that incremental updates to the chart would always be on the way; at that time I was thinking of the implementation of stock k-charts updated in realtime, and bokeh fits this idea pretty well. but today I'd say that's not so important.

wrt data size as the problem for me: my team is not particularly strong at data modeling; they need to see something before capturing something meaningful from the data, then start informed analysis. I developed a home-brew array database that mmaps an entire dataset (sized from 20GB ~ 300GB) into each computing node's memory address space. with each node having a typical 144GB of RAM, it's trivial for a backend process to fully scan the dataset by means of memory reads. repeated scans are perfectly cached by the OS's kernel page cache and shared by all processes on that node, so only the 1st scan on a node needs to consume bandwidth to the storage server to fill its kernel page cache. so throughput of massive data is really cheap in my env.

at hand now is the problem of efficiency in data analysis to solve. I've identified it as my analysts' limited capability to describe the data well enough with what they've got. I'm investigating some sort of narrative method of data description, leading them to start by telling what they'd like to see, then, in order to see that, what's needed, and so on, hopefully finally landing in what data we actually have. I started Haskelling for exactly this purpose, to find a proper DSL to establish the communication.

so far the DSL is not ideal, as free monads seem to be an unacceptable performance killer; we have to stay with mtl style, simple transformers, or even a vanilla monad. and I've actually found my direction points to massively concurrent event simulation to achieve the narrative-style data description; though Haskell seems pretty good at handling concurrency and parallelism, I've found no reference implementation of my idea.
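For contrast, the plain-transformers style being described can be sketched like this, with a tiny event-simulation step (the `Sim`, `emitEvent`, and `runScenario` names are made up, and the world state is reduced to an `Int` counter):

```haskell
import Control.Monad.Trans.State (State, execState, modify)

-- Direct transformers/mtl style: no free-monad syntax tree to build and
-- then interpret. Each event is an ordinary State action, so the whole
-- chain compiles down to plain state threading.
type Sim = State Int

emitEvent :: Int -> Sim ()
emitEvent delta = modify (+ delta)

runScenario :: Sim ()
runScenario = mapM_ emitEvent [1, 2, 3]
```

A free-monad encoding of the same three events would allocate a constructor tree and walk it in an interpreter; that per-operation overhead is what the direct style avoids.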

Compl Yue
@complyue
I'd characterize visualization in my scenario as less hypothesis showcasing and more blind data exploration, where the more data we see in the first place, the more meaningful the clues that can be extracted, given the fixed brain power/capability we have in the team.
Austin Huang
@austinvhuang

@complyue pretty interesting case study!

If the initial goal is to obtain a qualitative understanding of the data, do those qualitative properties materially change when working with downsampled versions? I'm probably not close enough to your use case; usually for me, insight-oriented analyses of large datasets hit diminishing returns well before full dataset scans, because the inferences (qualitative description or explicit parameter estimation) converge well before that. Capability-oriented ML models like language and vision are a different story, of course...

On the topic of visualization, I've always thought there should be a way to automatically do dimensionality reduction (like UMAP) for EDA on any arbitrary structured data by declaratively specifying a set of columns that could be heterogeneous in nature. Columns of numerical data should be automatically normalized, categorical variables should automatically be run through categorical embeddings, etc. I haven't seen anyone outside commercial vendors tackle this, though I've always felt it should be doable in a pretty general way with a bit of effort.

I'm probably not close enough to your use case to comment much usefully (also not sure what this "narrative" approach you mention refers to). Thanks for sharing though.

Austin Huang
@austinvhuang

@YPares yeah it's an interesting development, though I haven't had time to try it myself.

Given the amount of resources and adoption jupyter has, it's always felt way more stagnant than it should be as a technology. Maybe it's time for new players like this and Observable to come in.

Does it solve the reproducibility + in-memory-state spaghetti of jupyter notebooks? If it does, I'd be especially interested.

Compl Yue
@complyue
@austinvhuang Thanks for sharing your insights, very informative to me. I can't put my thoughts on solid theoretical ground, but I feel some sort of down-sampling fits the approach we're proceeding with: we use an event system to adapt to the varying significance each model w/ params contributes to the overall expectation along the timeline, and present the captured sum to humans only when it has sufficient importance for estimation and possible analysis. or simply put, the vast majority of the data, even with some meaningful parts, would render each other mutually plain noise. finding the boundaries of the meaningful parts of the data is the hardest work, I presume, and tailoring by importance should be effective in attacking this. ideally only 1~2% of the dataset should be left for human analysis; we definitely need vis tools beyond that point, but still need vis tools to arrive there.
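That "tailor by importance" step could be sketched as keeping only the top fraction of events by score (the `topByImportance` name, the flat list representation, and the scores are all hypothetical; a real event system would score incrementally along the timeline):

```haskell
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

-- Keep only the top `frac` (e.g. 0.02 for the 1~2% mentioned above) of
-- events by importance score, discarding the bulk that would otherwise
-- read as mutual noise. Events are (label, importance) pairs here.
topByImportance :: Double -> [(String, Double)] -> [(String, Double)]
topByImportance frac evts = take n (sortBy (comparing (Down . snd)) evts)
  where
    n = max 1 (round (frac * fromIntegral (length evts)))
```

For example, `topByImportance 0.5` over four scored events keeps the two highest-scoring ones, in descending order of importance.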
Yves Parès
@YPares
Hello guys, just to say that the blog post presenting porcupine is up https://www.tweag.io/posts/2019-10-30-porcupine.html cc @mgajda @mmesch
@austinvhuang I don't know about Observable. I should check them out
I'm not sure Netflix has solved reproducibility of notebooks, but they tout that they've made a big step towards it anyway :)
Michał J. Gajda
@mgajda

@YPares Cool! Will read it.

By the way, did you see streamly? https://github.com/composewell/streamly
Along with a few posts on how to make streaming benchmarks: https://github.com/composewell/streaming-benchmarks

I wonder if we can make an improved Porcupine+Streamly pipeline platform that can replace Shake as well?
Yves Parès
@YPares
I invited you to https://gitter.im/tweag/porcupine to continue the discussion
Austin Huang
@austinvhuang

I've not taken a deep dive into Observable. The creator of D3, Mike Bostock, is one of the people behind it. It seems to solve the out-of-control notebook-state problems way better than jupyter does, taking advantage of javascript's asynchronous aspects (cells = promises, disallowing variable re-definition).

This stuff is far beyond jupyter IMO:

https://observablehq.com/collection/@observablehq/explorables

I like this one, which covers a technical topic even many researchers are not well versed in (but which is super important in ML):
https://observablehq.com/@tophtucker/theres-plenty-of-room-in-the-corners?collection=@observablehq/explorables

On the other hand, it's all javascript, which would require a cultural sea change for it to take over as mainstream in the data science / machine learning world. I wouldn't put anything past javascript, but a lot of shifts would have to align.

Compl Yue
@complyue
wow, I'm astonished by Observable notebooks 😲; I regret not discovering it earlier, and am lucky to have seen it here, thanks!!
they've even gotten this far with bare-metal js, which makes me feel that type safety is less of a concern, given all the aspects they did so well.