Important aggregations in Spark
December 21, 2016
Three main aggregations
- reduceByKey(): Has a built-in combiner, so it is efficient when the dataset has a high degree of aggregation (many values per key). Use it only when the intermediate (combiner) aggregation logic is the same as the final (reducer) aggregation logic.
- aggregateByKey(): Similar to reduceByKey(), but with a custom combiner. It lets you supply a fixed initial ("zero") value for each key, and the combiner logic can differ from the final merge logic.
- combineByKey(): Also similar to reduceByKey(), with a custom combiner. Here the initial value is computed dynamically: a createCombiner function reads the first record seen for each key and builds the accumulator from it, instead of starting from a fixed constant.
- aggregateByKey() and reduceByKey() are special cases of combineByKey().
- In aggregateByKey() and combineByKey(), the type of the output value need not be the same as the type of the input value; reduceByKey() requires them to match.
- If we want custom logic in the combiner, we go for aggregateByKey() or combineByKey(); with reduceByKey(), the combiner logic is always the same as the reducer logic.
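To make the differences concrete, here is a plain-Python sketch of the semantics (not actual Spark code; "partitions" are emulated with nested lists, and the map-side combiner runs per partition before the final merge, which is how Spark applies these operators):

```python
# Toy dataset of (key, value) pairs split into two "partitions",
# mimicking Spark applying the combiner within each partition first.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

def reduce_by_key(parts, func):
    """Emulates rdd.reduceByKey(func): combiner logic == reducer logic,
    and input/output value types must match."""
    combined = []
    for part in parts:                       # map side: per-partition combine
        acc = {}
        for k, v in part:
            acc[k] = func(acc[k], v) if k in acc else v
        combined.append(acc)
    out = {}
    for acc in combined:                     # reduce side: merge with SAME func
        for k, v in acc.items():
            out[k] = func(out[k], v) if k in out else v
    return out

def aggregate_by_key(parts, zero, seq_op, comb_op):
    """Emulates rdd.aggregateByKey(zero)(seqOp, combOp): a fixed zero
    value, and separate intra-/inter-partition logic."""
    combined = []
    for part in parts:
        acc = {}
        for k, v in part:
            acc[k] = seq_op(acc.get(k, zero), v)
        combined.append(acc)
    out = {}
    for acc in combined:
        for k, v in acc.items():
            out[k] = comb_op(out[k], v) if k in out else v
    return out

def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    """Emulates rdd.combineByKey(...): the accumulator is initialized
    dynamically from the first record seen for each key."""
    combined = []
    for part in parts:
        acc = {}
        for k, v in part:
            if k in acc:
                acc[k] = merge_value(acc[k], v)
            else:
                acc[k] = create_combiner(v)  # dynamic initialization
        combined.append(acc)
    out = {}
    for acc in combined:
        for k, v in acc.items():
            out[k] = merge_combiners(out[k], v) if k in out else v
    return out

# reduceByKey: sum per key (output type == input type)
sums = reduce_by_key(partitions, lambda a, b: a + b)
# aggregateByKey: (sum, count) per key from a fixed zero (0, 0)
sum_count = aggregate_by_key(
    partitions, (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
# combineByKey: same (sum, count), but the accumulator is built
# from the first value itself instead of a constant zero
sum_count2 = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(sums)        # {'a': 8, 'b': 7}
print(sum_count)   # {'a': (8, 3), 'b': (7, 2)}
```

Note how aggregateByKey() and combineByKey() produce a (sum, count) tuple from integer inputs, which reduceByKey() could not do in one pass.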
Other important aggregations:
- groupByKey(): Used when a combiner is not required, i.e. when there is little aggregation to do on the dataset. Because it hands you the full list of values per key, it offers more flexibility for complex operations than the aggregations above, but every value is shuffled across the network, which can be expensive.
- countByKey(): Unlike all the methods above, which are transformations, this is an action: it returns the per-key counts to the driver as a Map rather than producing another RDD.
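A quick plain-Python sketch of what these two return (in actual PySpark this would be rdd.groupByKey() and rdd.countByKey(); here the behavior is emulated with stdlib containers):

```python
from collections import defaultdict, Counter

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# groupByKey: no combiner -- every value is shuffled, and you receive
# the full collection of values per key, allowing arbitrary logic
# (e.g. computing a median) at the cost of moving all the data.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# countByKey: an action -- Spark returns a plain map of key -> count
# to the driver instead of another RDD.
counts = Counter(k for k, _ in pairs)

print(dict(grouped))  # {'a': [1, 3, 4], 'b': [2, 5]}
print(dict(counts))   # {'a': 3, 'b': 2}
```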