hive optimization
3 ways to optimize:
table design (at creation time)
(2 options:
partitioning and bucketing)
both divide the data into small parts.
query structure (efficient queries)
(joins take a lot of time)--query level
(join optimizations)
simplified queries (simple queries).
(windowing functions)
partitioning:
dividing data based on columns
in dir:
user/hive/warehouse/treandytech.db/customers/state=CA
user/hive/warehouse/treandytech.db/customers/state=NY
only the matching dir will be scanned
less data scanned--performance gain
if the query filters on the partition column--then optimization
partitioning--should be done on the columns that the most common queries filter on.
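a minimal sketch of a partitioned table that would give the folder layout above. only the partition column state comes from these notes; the other columns (customer_id, name, city) and the filter value are made up for illustration.

```sql
-- hypothetical columns; only the partition column `state` is from the notes
CREATE TABLE customers (
  customer_id INT,
  name        STRING,
  city        STRING
)
PARTITIONED BY (state STRING);   -- one folder per value, e.g. .../customers/state=CA

-- filter on the partition column: only the state=CA folder should be scanned
SELECT * FROM customers WHERE state = 'CA';

-- filter on a non-partition column: every partition folder has to be scanned
SELECT * FROM customers WHERE city = 'Fresno';
```

note that the partition column is declared only in PARTITIONED BY, not in the normal column list.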
issues:
if the column has lots of distinct values--cardinality very high--then we won't do partitioning
lots of folders will be created.
two types of partitioning:
static:
(we should have an idea of the data and load each partition manually)
dynamic:
(partitions are created automatically.)
(we don't know the data in advance)
static is faster than dynamic
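a rough sketch of the two loading styles, assuming a customers table partitioned by state already exists. the file path, the staging table customers_staging and the non-partition columns are hypothetical.

```sql
-- static: we name the partition ourselves and load the matching file into it
LOAD DATA LOCAL INPATH '/tmp/customers_ca.csv'   -- hypothetical path
INTO TABLE customers PARTITION (state = 'CA');

-- dynamic: hive works out the partition from the data itself
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE customers PARTITION (state)
SELECT customer_id, name, city, state            -- partition column goes last
FROM customers_staging;                          -- hypothetical unpartitioned table
```

in the static case hive just places the file in the named folder, while in the dynamic case it has to read every row to decide the folder, which fits the note that static is faster.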
partitioning--works well with low cardinality
if more distinct values--then go for bucketing
no of partitions = no of distinct values in the partition column
bucketing:
-------------------
we have to define a fixed no of buckets--during table creation.
according to the data--we do trials, then decide the no of buckets.
each partition= folder
each bucket= file
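a minimal sketch of fixing the no of buckets at table creation. the orders table and its id column are from these notes; the other columns and the staging table orders_staging are assumptions for illustration.

```sql
CREATE TABLE orders (
  id       INT,
  customer STRING,                  -- hypothetical column
  amount   DOUBLE                   -- hypothetical column
)
CLUSTERED BY (id) INTO 3 BUCKETS;   -- fixed at creation: 3 files

-- on older hive (before 2.0) this was usually needed first so that
-- inserts respect the bucket count:
--   SET hive.enforce.bucketing = true;

INSERT INTO TABLE orders
SELECT id, customer, amount
FROM orders_staging;                -- hypothetical source table
```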
while querying
select * from orders where id = 4
bucket--mod will work (hash of id % no of buckets)
if no of buckets = 3
then 4 % 3 = 1
it will check only in bucket 1 (bucket numbers start from 0)
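the mod rule can be checked roughly with the built-in hash() and pmod() functions. this is only a sketch: it assumes an orders table with an integer id and 3 buckets, and the hash that hive uses internally for bucketing depends on the hive version, so the real bucket of a row may differ from what this query shows.

```sql
-- approximate bucket number for each row, assuming 3 buckets
SELECT id,
       pmod(hash(id), 3) AS bucket_no
FROM orders;
```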
high cardinality--go for bucketing
by default max no of dynamic partitions--set to 1000 (hive.exec.max.dynamic.partitions)
if it is exceeded then hive gives an error..you can change this no--but you should not
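the limit is, as far as i know, controlled by the dynamic partition settings below. a sketch for checking them in a hive session; the override value is purely hypothetical and, as noted, changing it is usually not a good idea.

```sql
-- SET with only the property name prints its current value
SET hive.exec.max.dynamic.partitions;
SET hive.exec.max.dynamic.partitions.pernode;

-- override for the current session only (hypothetical value)
SET hive.exec.max.dynamic.partitions = 2000;
```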
partitions--vary a lot in size..
but buckets--almost the same size
a bucket is a good sample--if someone asks for a sample of the data
--then give him a bucket.
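a small sketch of taking one bucket as a sample with TABLESAMPLE, assuming an orders table clustered by id into 3 buckets.

```sql
-- read one of the 3 buckets, roughly a third of the rows
-- (in TABLESAMPLE the bucket numbers start from 1)
SELECT *
FROM orders TABLESAMPLE(BUCKET 1 OUT OF 3 ON id);
```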
-------------------------
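we can combine both partitioning and bucketing in a hive table
partitioning then bucketing (buckets inside each partition)--ok
bucketing then partitioning (partitions inside a bucket)--we can't have, since bucket = file and partition = folder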
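a minimal sketch of a table that uses both, assuming state as the low cardinality column and id as the high cardinality one. the table name orders_by_state and the amount column are hypothetical.

```sql
CREATE TABLE orders_by_state (      -- hypothetical table name
  id     INT,
  amount DOUBLE                     -- hypothetical column
)
PARTITIONED BY (state STRING)       -- one folder per state
CLUSTERED BY (id) INTO 3 BUCKETS;   -- 3 files inside each state folder
```

PARTITIONED BY comes before CLUSTERED BY in the create statement, which matches the layout: folders first, bucket files inside them.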
benefits of bucketing
--faster query response
--join optimization
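a rough sketch of the join benefit, assuming both tables are bucketed on the join key. order_items and its order_id and item columns are hypothetical, and whether hive really picks a bucket map join depends on the version and the other settings.

```sql
-- let hive use the bucket layout when joining
SET hive.optimize.bucketmapjoin = true;

-- assumes orders is CLUSTERED BY (id) and the hypothetical order_items
-- is CLUSTERED BY (order_id), with compatible bucket counts
SELECT o.id, i.item
FROM orders o
JOIN order_items i
  ON o.id = i.order_id;
```

the idea is that matching keys land in matching buckets, so each bucket of one table only needs to be joined with the corresponding bucket of the other.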