Important aggregations in Spark
December 21, 2016
Three main aggregations
- reduceByKey(): Has a built-in combiner, so it pays off when the data aggregates heavily (many values per key shrink to one). Use it only when the intermediate/combiner aggregation logic is the same as the final/reducer aggregation logic. (All three methods are sketched below.)
- aggregateByKey(): Similar to reduceByKey(), but with a custom internal combiner. Use it when the accumulator must be initialized with a static default (zero) value.
- combineByKey(): Also similar to reduceByKey(), with a custom internal combiner. Use it when the initial value must be derived dynamically, by reading the first input record for a key and applying some logic to it.
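A minimal word-count sketch of reduceByKey() (the app name and sample data are invented for illustration). Note that the same function, _ + _, runs both as the map-side combiner before the shuffle and as the final reducer, which is why reduceByKey() only fits cases where the two logics coincide:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("reduceByKeyDemo").setMaster("local[*]"))

// (word, 1) pairs; reduceByKey applies _ + _ map-side (as the
// combiner, before the shuffle) and again reduce-side.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)  // (a,2), (b,1)
```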
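A sketch of aggregateByKey() computing a per-key average, something reduceByKey() cannot express directly because the accumulator type (sum, count) differs from the input value type Int. The (0, 0) zero value is the static default mentioned above (the subject names and scores are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("aggregateByKeyDemo").setMaster("local[*]"))

val scores = sc.parallelize(Seq(("math", 80), ("math", 90), ("eng", 70)))

// zeroValue (0, 0) seeds a (sum, count) accumulator for every key.
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: combiner logic within a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merges accumulators across partitions
)
val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }
avg.collect().foreach(println)  // (math,85.0), (eng,70.0)
```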
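The same average via combineByKey(): instead of a fixed zero value, createCombiner initializes the accumulator dynamically from the first record seen for each key:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combineByKeyDemo").setMaster("local[*]"))

val scores = sc.parallelize(Seq(("math", 80), ("math", 90), ("eng", 70)))

val sumCount = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: built from the first record of a key
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold in subsequent values
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge across partitions
)
sumCount.mapValues { case (s, c) => s.toDouble / c }
  .collect().foreach(println)  // (math,85.0), (eng,70.0)
```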
Comparison
- aggregateByKey() and reduceByKey() are special cases of combineByKey()
- With aggregateByKey() and combineByKey(), the type of the input values need not match the type of the output
- If we want custom combiner logic, we go for aggregateByKey() or combineByKey(); with reduceByKey(), the combiner logic is always identical to the reducer logic. (A minimal equivalence sketch follows.)
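To make the "special cases" point concrete, here is a sketch expressing reduceByKey(f) as a combineByKey() whose combiner logic equals its reducer logic; this mirrors the idea rather than Spark's actual source, and pairs and f are placeholders invented for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("equivalenceDemo").setMaster("local[*]"))

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
val f: (Int, Int) => Int = _ + _

// These two produce the same result:
val viaReduce  = pairs.reduceByKey(f)
val viaCombine = pairs.combineByKey(
  (v: Int) => v,                    // identity init: no type change, no custom start value
  (acc: Int, v: Int) => f(acc, v),  // combiner logic...
  (a: Int, b: Int) => f(a, b)       // ...identical to the reducer logic
)
```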
Other important aggregations:
- groupByKey(): Used when no combiner is required, which fits datasets that do not reduce much per key. Because it hands over the entire group of values for each key, it allows more flexibility for complex operations than the other aggregations.
- countByKey(): Unlike all the methods above, which are transformations, this is an action: it returns a local Map on the driver. (Both are sketched below.)
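A sketch of both (sample data invented): groupByKey() hands over the full Iterable of values per key, which an order-sensitive statistic like a median needs, while countByKey() is an action that returns a plain Map to the driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("groupCountDemo").setMaster("local[*]"))

val scores = sc.parallelize(Seq(("math", 80), ("math", 90), ("math", 60), ("eng", 70)))

// groupByKey: no map-side combine; the whole group ships to one
// reducer, which is exactly what a per-key median requires.
val medians = scores.groupByKey().mapValues { vs =>
  val sorted = vs.toSeq.sorted
  sorted(sorted.size / 2)
}
medians.collect().foreach(println)  // (math,80), (eng,70)

// countByKey: an action, not a transformation; the result is a
// local Map on the driver, not an RDD.
val counts: scala.collection.Map[String, Long] = scores.countByKey()
println(counts)  // e.g. Map(math -> 3, eng -> 1)
```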