Only because I tend to come back to these over and over again, here's some join hacks with Scala in Spark:
Reference 'Column' Is Ambiguous, Could Be 'Column', 'Column' Error
This will fail because it's in both data frames and this will throw the above error:
myDf.join(myOtherDf, $"column" === $"column", "inner").select("othercolumn")
In Scala, we can solve this by referring only to the column:
myDf.join(myOtherDf, Seq("column"), "inner").select("othercolumn")
Comparing String Date In One Format To Another String Date In Another Format
This will depend on the actual format, so in this example I'm using the format of M/d/y
, but this may not be the format you're using. Below this we'll only look at an example where we're just setting them equal to each other, not the full join condition:
to_date($"datefield","M/d/y") = to_date(lit("01/01/1999"),"M/d/y") ///notice the different format in lit
Filter Early In A Join
This is a useful technique to boost the performance of our queries because we'll filter on the join and reduce the amount of data that we get back.
myDf.join(myOtherDf.where($"myId" === "Y"), $"columnId" === $"otherColumnId", "inner").select($"A", $"B")
The key part here is the where clause after the .join(myOtherDf
. Ultimately, we want to bring back as little data as possible and this filter assists us with that.
Inverting IsIn In Where Clauses (Is Not In A Where Clause)
We can invert the isin
functionality for a where clause. A review of this:
myDf.where($"myColumnStatus".isin("1","2")
What if we wanted everything else but those status'? We could invert the where clause here and get everything but 1 and 2:
myDf.where(!$"myColumnStatus".isin("1","2")
Left Anti Joins and Filters - Important!
You cannot apply filters with columns used in a left anti join on the table that's being anti joined. This is because this is already a filter and these columns are not being returned but being filtered out from the result. The key is to pay attention to the join condition; if the join condition is CreditId
for an example, then that means that the table joined against will have CreditId
s not included in the result set.
More ETLHelp posts: