PhillHenry
@PhillHenry
In what way is it not a POJO? Is it making network connections etc?
it is not making any network connections
or db connections etc
but the usage of this Class is: create it one time and then use its getPath, findPathBetween etc methods
PhillHenry
@PhillHenry
Ok, let me rephrase. It's just a lightweight Java object, right? So, creating 2-3 million lightweight objects does not sound like it's going to bring the JVM down. Depending on exactly how lightweight it is, you may very well not even see any significant increase in GC activity.
Now, if it's starting Threads, making connections or reading files, all bets are off. But I suspect it is not.
Selcuk Cabuk
@xsenko
hmm maybe you are right, maybe I just need to monitor how it is going for big graphs
PhillHenry
@PhillHenry
Premature optimization etc etc etc :)
Selcuk Cabuk
@xsenko
but I guess even if memory won't be a problem, the creation time of this instance will be a problem, because for a graph with like 300k vertices it takes half, and I'm creating this instance each time I want to find the shortest path between 2 vertices
and with 2-3 million vertices that time is only going to increase :)
PhillHenry
@PhillHenry
Write a small test harness that creates 2-3 million of them. I imagine that it will complete in a matter of seconds (disclaimer: I've not tried it).
And that would be just one JVM. In real life, that 2-3 million will be spread across X nodes in your cluster.
If the test harness takes minutes, then worry about the object allocation. But until then, I wouldn't worry about it at all.
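A rough sketch of such a harness (LightweightPath below is a hypothetical stand-in for the real path-finder class, not anything from the actual codebase):

final case class LightweightPath(id: Int, weight: Double)

object AllocationTest {
  def main(args: Array[String]): Unit = {
    val n = 3000000
    val start = System.nanoTime()
    val objs = new Array[LightweightPath](n)
    var i = 0
    while (i < n) {
      objs(i) = LightweightPath(i, i * 0.5)   // allocate one lightweight object per slot
      i += 1
    }
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"Allocated ${objs.length} objects in $elapsedMs ms")
  }
}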
Curran McKenzie
@Curran_McKenzie_twitter

Hi, I am trying to copy a query into Spark SQL. Does the syntax below make sense? I am just constantly getting an error and can't fix it. I took two data tables from a database, joined them together into one, and created a temp view referred to as temp... so the FROM temp is the temp view that I created

SELECT
case when RESULT_CD is not null then substr(RESULT_CD,1,1)
else
case when RESULT_RSN_CD is not null then RESULT_RSN_CD else null end
end
as LOSSCD, LGL_SYS_ID as LGL_SYS_ID
FROM (
SELECT
LGL_SYS_ID,
MATTER_SYS_ID,
row_number() OVER (PARTITION BY LGL_SYS_ID ORDER BY
cast(substr(MATTER_SYS_ID, instr(MATTER_SYS_ID, ':') + 1) as int) DESC) rank,
FTR_SYS_ID, RESULT_RESLTN_CD, RESULT_RSN_CD
FROM TEMP
WHERE HCO_MATTER.LGL_MATTER_SYS_ID = HCO_FTR.FTR_SYS_ID
ORDER BY LGL_MATTER_HCS_SYS_ID desc)
WHERE rank=1 --
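For comparison, a minimal shape of this kind of query that does parse in Spark SQL (the table and column names below loosely mirror the ones above but are placeholders, not the real schema): compute row_number() in a sub-select, alias the derived table, and filter on the rank outside.

val result = spark.sql("""
  SELECT lgl_sys_id, loss_cd
  FROM (
    SELECT
      lgl_sys_id,
      CASE WHEN result_cd IS NOT NULL THEN substr(result_cd, 1, 1)
           ELSE result_rsn_cd END AS loss_cd,
      row_number() OVER (
        PARTITION BY lgl_sys_id
        ORDER BY cast(substr(matter_sys_id, instr(matter_sys_id, ':') + 1) AS int) DESC
      ) AS rnk
    FROM temp
  ) t
  WHERE rnk = 1
""")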

D Cameron Mauch
@DCameronMauch
Lesson painfully learned: Spark can not read files that start with an underscore. Gives error about not being able to derive schema. Same exact file without leading underscore works fine. Note, subdirectories that start with underscore are okay, just not the file itself.
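A quick way to see this for yourself (the paths below are made up): Hadoop's default input filter treats file names starting with "_" or "." as hidden, so Spark's file listing skips them.

// Hypothetical paths; only the leading underscore in the file name differs.
spark.read.json("/data/in/_events.json")  // no files match -> "unable to infer schema" style error
spark.read.json("/data/in/events.json")   // same content, reads fine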
Graeme Cliffe
@WarSame

Hey everyone, I'm trying something which is not working and was hoping I could get some advice. I have a column whose actual value I want to use to select from another column. I.e. my first column A is a 3-letter month abbreviation like "DEC"; I want to use that to select from the column "DEC_VAL", or if it is "NOV" I want to select from "NOV_VAL".

So far I've got it to the point that I can correctly concat and all of that to get the right column name - but I can't figure out how to select from that column given its name. I'm trying a lot of variants like:

.withColumn(
"othercol",
lit(col("mycol"))
)

but I can't seem to make it figure out how to use the literal value of that column to select the column name. Is this just chasing ghosts or is there some way to do this that I'm just not aware of?
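For what it's worth, one pattern that sidesteps the dynamic-column-name problem is to build a single when/coalesce chain over the known month names; the column names below ("mycol", "<MONTH>_VAL", and the DataFrame df) are assumptions based on the description above.

import org.apache.spark.sql.functions.{coalesce, col, when}

val months = Seq("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                 "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

// For each row, pick the <MONTH>_VAL column whose name matches the value in "mycol".
val picked = coalesce(months.map(m => when(col("mycol") === m, col(s"${m}_VAL"))): _*)

val withOther = df.withColumn("othercol", picked)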

Selcuk Cabuk
@xsenko
quick question: when I work on my local machine with Spark, it creates partitions equal to my cores (let's say 8). Does that mean Spark also creates 8 executors and assigns each partition to one executor?
and does it also mean that each of the 8 executors has 1 core?
Selcuk Cabuk
@xsenko
Ah okay, my mistake. The documentation says that in local mode Spark creates one JVM, which means it will create only one executor.
Then if I have 8 partitions, will Spark in local mode execute one partition at a time? And does it also mean that my single executor has 8 cores?
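A quick check you can run yourself (nothing here beyond a stock SparkSession): local[8] starts one JVM acting as both driver and executor, with 8 task slots, so up to 8 partitions can run at the same time.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[8]")                    // one JVM, 8 task slots
  .appName("local-parallelism-check")
  .getOrCreate()

println(spark.sparkContext.defaultParallelism)       // typically 8 here
println(spark.range(0, 1000).rdd.getNumPartitions)   // usually matches defaultParallelism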
Vivek Mathew
@vivekgmathew
spark DataType cannot be cast to scala.math.BigDecimal exception at org.apache.spark.sql.catalyst.expressions.GeneratedClass. I am not doing any explicit cast to scala.math.BigDecimal, but somehow this exception is thrown.
skestle
@skestle
I'm building spark 3.2 and have found that https://issues.apache.org/jira/browse/SPARK-33888 introduces a bug/regression. Should I comment on the ticket or raise a new bug?
(or rather, the PR associated with the resolved jira ticket)
Eric K Richardson
@ekrich
I would think if the fix is merged you should comment on the issue since it says fixed in 3.2, but I have not contributed to Spark. I think they will tell you what to do.
skestle
@skestle
Yeah, that's what I was thinking - thanks for validating my intuition
Eric K Richardson
@ekrich
:thumbsup:
Does 3.2 have Scala 2.13 support?
srini-daruna
@srini-daruna
Hi, I have an issue with my Spark job that processes heavily nested JSON data. I am running the job on EMR. After the read stage, nothing is happening.
The data size is close to 1TB. The instances used are 10 * r5.16xlarge (64 cores, 512G).
please help
Joaquín Chemile
@jchemile
Hello! I have a problem importing a Scala class (a Map which has some keys) in a Jupyter notebook. Does someone know the proper way to do it? I've created a question on Stack Overflow without success: https://stackoverflow.com/questions/65785506/import-custom-scala-object-in-jupyter-notebook-with-sparkcontext-addfile
Thanks!!
Alessandro Calvio
@xAlessandroC
Hi, I'm trying to submit a Spark job via the REST API interface. I'm able to successfully launch a driver by making a POST request to <URL>:6066/v1/submissions/create, but the problem is that the task isn't actually running: in the Web UI the Worker column is NONE and the Status is SUBMITTED. How can I solve this?
Jelmer Kuperus
@jelmerk
if i have an rdd, is it ok to use that rdd from multiple java threads, each starting a new spark job ?
PhillHenry
@PhillHenry
@srini-daruna There's too little information here to form a conclusion. My advice is to check the threads and see where they're spending all their time.
@jelmerk No, as it's Spark's job to handle all the resources such as threads.
Jelmer Kuperus
@jelmerk
@PhillHenry well I am pretty sure it's possible to spin up jobs in parallel, I've seen that a bunch of times, just not sharing an RDD
PhillHenry
@PhillHenry
It's certainly possible to have Spark running in a multi-tenant environment where the cluster runs jobs for different sessions in parallel. But why do you want to multithread access to a particular RDD? It sounds like you're not using Spark as it's supposed to be used. What are you trying to achieve? There may be better ways to achieve the same aim.
Nick
@GrigorievNick
Hi,
I found a strange issue.
The Spark SQL function input_file_name() breaks column pruning at the physical query plan level.
Does anyone know how to deal with it?
When I use this function in a select or in withColumn, the parquet/orc physical plan uses all columns, meaning it reads every column from the file, not only the columns in the projection.
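A way to reproduce and inspect what's described above (the path and column names are made up): compare the ReadSchema reported by explain() with and without input_file_name().

import org.apache.spark.sql.functions.{col, input_file_name}

val pruned = spark.read.parquet("/data/events").select(col("id"))
pruned.explain()        // ReadSchema should list only `id`

val withFile = spark.read.parquet("/data/events")
  .select(input_file_name().as("source_file"), col("id"))
withFile.explain()      // with the pruning issue, ReadSchema lists every column in the file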
Nick
@GrigorievNick
I use Spark 2.4.7
santanu mohanty
@km_santanu_twitter
How can I find the consumer group id of a Spark Structured Streaming application?
Nick
@GrigorievNick
Spark uses a random one if you do not specify your own. Because Spark does not manage offsets in Kafka, it uses a WAL.
Nick
@GrigorievNick
So by default, it will be spark-kafka-source + some random name based on the Spark run.
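For context, a minimal Kafka source looks like this (servers and topic are placeholders); note that no group id is set anywhere: Spark generates one per run and tracks offsets in its own checkpoint rather than committing them back to Kafka.

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()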
Eric K Richardson
@ekrich
How about 2.3.4? If it isn't Scala 2.12 it stinks. What is that real version 3.0.1 (latest)?
TheLastPOTATO
@AhmetGurbuzz
Does Spark collect_list send data to the driver? spark 2.4.0
Sheldon Liu
@sheldon1iu
When using Spark's overwrite mode to write data to ClickHouse, the value of the createTableOptions parameter is ENGINE=MergeTree(p1,p2,p3,8192), where p1 must be a date, but there is no date field when using SQL to create the table. I don't know what to do in Spark.
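For what it's worth, this is roughly where createTableOptions plugs into a Spark JDBC write; the URL, table name and engine string below are placeholders, and Spark simply appends that string to the CREATE TABLE statement it generates.

df.write
  .format("jdbc")
  .option("url", "jdbc:clickhouse://host:8123/db")
  .option("dbtable", "events")
  .option("createTableOptions", "ENGINE = MergeTree() ORDER BY (p1)")  // engine spec is illustrative only
  .mode("overwrite")
  .save()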
Yotam Hochman
@Yotamho
Hi guys,
I am trying to use Spark's Catalyst as a standalone engine to run SQL on in-memory collections of InternalRows (specifically GenericInternalRow).
Because my kind of "query" is always a single, full scan of small files, but very dynamic and written in SQL, it seemed like a nice idea to avoid the overhead of a full Spark cluster.
My biggest concern right now is that I am not able to use an Analyzer to resolve my LogicalPlan references, because I can't create a CatalogManager, which is private[sql].
  1. Any ideas on how to overcome this?
  2. Is there a project that is trying to accomplish the same thing?