Spark Join Hacks With Scala

in #spark3 years ago (edited)

Only because I tend to come back to these over and over again, here's some join hacks with Scala in Spark:

Reference 'Column' Is Ambiguous, Could Be 'Column', 'Column' Error

This will fail because it's in both data frames and this will throw the above error:

myDf.join(myOtherDf, $"column" === $"column", "inner").select("othercolumn")

In Scala, we can solve this by referring only to the column:

myDf.join(myOtherDf, Seq("column"), "inner").select("othercolumn")

Comparing String Date In One Format To Another String Date In Another Format

This will depend on the actual format, so in this example I'm using the format of M/d/y, but this may not be the format you're using. Below this we'll only look at an example where we're just setting them equal to each other, not the full join condition:

to_date($"datefield","M/d/y") = to_date(lit("01/01/1999"),"M/d/y") ///notice the different format in lit

Filter Early In A Join

This is a useful technique to boost the performance of our queries because we'll filter on the join and reduce the amount of data that we get back.

myDf.join(myOtherDf.where($"myId" === "Y"), $"columnId" === $"otherColumnId", "inner").select($"A", $"B")

The key part here is the where clause after the .join(myOtherDf. Ultimately, we want to bring back as little data as possible and this filter assists us with that.

Inverting IsIn In Where Clauses (Is Not In A Where Clause)

We can invert the isin functionality for a where clause. A review of this:

myDf.where($"myColumnStatus".isin("1","2")

What if we wanted everything else but those status'? We could invert the where clause here and get everything but 1 and 2:

myDf.where(!$"myColumnStatus".isin("1","2")

Left Anti Joins and Filters - Important!

You cannot apply filters with columns used in a left anti join on the table that's being anti joined. This is because this is already a filter and these columns are not being returned but being filtered out from the result. The key is to pay attention to the join condition; if the join condition is CreditId for an example, then that means that the table joined against will have CreditIds not included in the result set.

More ETLHelp posts: