we had to execute some Spark queries written in C# against a Spark cluster hosted on DataStax Enterprise Cassandra
as @vasily-kirichenko pointed out it's probably less effort to use an off-the-shelf tool like Spark for doing deep learning style stuff
since you can leverage a lot of built-in infrastructure and ecosystem knowledge that way
there might be legitimate reasons for using Akka.NET to do it
(one I can think of is needing to react in real time to things a deep learning model discovers in real time, but that might be a bit contrived)
lots of users do use Akka.NET for real-time machine learning, but they're not using deep multi-layer networks and stuff (edit: as far as I am personally aware)
they're doing simpler problems like real-time classification
Spark parallelizes data and its processing automagically, and gives you fault tolerance, checkpointing, etc. You can implement it yourself, but it's a lot of work. Also, Spark is JVM, which means you have nearly bug-free clients for all data storages, message brokers, Kafka, etc., etc., and lots of libs. .NET is faaaar behind and I doubt it will ever catch up.
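to make the "a lot of work" point concrete, here's a toy sketch (plain Python, hypothetical helper name, not a Spark API) of just the checkpoint/resume bookkeeping you'd have to hand-roll for one sequential job — Spark does this, plus partitioning and re-scheduling failed tasks across a cluster, for you:

```python
import json
import os

def process_with_checkpoints(items, work, checkpoint_path):
    """Toy checkpoint/resume loop. `process_with_checkpoints` is a
    hypothetical helper for illustration, not a Spark API."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # resume from the last checkpoint
    for i, item in enumerate(items):
        key = str(i)
        if key in done:
            continue  # already processed before a crash; skip on resume
        done[key] = work(item)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # checkpoint after every item
    return [done[str(i)] for i in range(len(items))]
```

and that's still single-machine, single-threaded, with no retry logic, no partial-failure handling, and no distribution — each of which Spark layers on top of RDD lineage and checkpointing out of the box.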
my experience working with some of our customers who use Spark
traditionally .NET companies with large amounts of data that is still being processed using really old OLAP systems
is that the ones who already have the personnel or time to invest in learning enough of the JVM ecosystem are usually happy with the results
others decide that it's worth the trouble of rolling something in-house because there are too many unknowns all at once
going down the Spark route
not issuing a right/wrong judgment on either
but saying that the key success factor in adopting Spark has been being able to commit to supporting the JVM platform long-term, even in a majority .NET shop
we used Hive in addition to Akka.NET at my last company and had great success with it, but that's because we'd committed to understanding the JVM ecosystem years earlier when we went all-in on Apache Cassandra for our storage solution
later on we were able to port those Hive jobs to a very early version of Spark
anyway, bit of a tangent - but you should think critically about what's the right tool for the job both today and years from now
I absolutely agree. Our company decided to do all new projects in Scala (and there is no rush, it's a strategic decision) because we are really tired of the state of big data in .NET.
that's why stuff like Mobius exists in the first place
even Microsoft threw in the towel on their own big data solutions
Dryad et al
they decided it was easier to port all of those old C# queries to run on top of the JVM via an adapter layer like Mobius
and leverage the benefits of thousands of man-years worth of work there
I've not seen much activity in the Mobius repo tho.
the Mobius project itself is basically a series of transpilation hacks
I agree with you there
I think with many of these OSS projects Microsoft has released lately
stuff that's not core to their business
or to their customers
and it's still scary to use it in production. One of the main selling points of Spark is that it's used by lots and lots of large companies, so there's a good chance it will work with zero problems for you.
i.e. Mobius being a good example
they let it languish once they get it to a state where it solves MSFT's internal problems
and don't really commit to supporting it
social capital is a huge part of the value of an OSS ecosystem in general
TBH, I like the Java/Scala community a lot more than .NET and MS as a whole.
(so far :) )
I haven't been on the contribution side of any real JVM project much
I haven't either ;)
but the "MS will do everything for us" mindset is unhealthy.
looking at some of the stuff MSFT is doing around .NET Core
i.e. killing off the need for third party libraries for things like dependency injection