Hive:
----------------
Partitioning:
To see the list of partitions:
show partitions table_name;
Dynamic partitioning:
By default it is disabled. Enable it with:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Step 1:
create a normal table and load the data.
Step 2:
create a partitioned table, partitioned on the required column(s).
Step 3:
transfer data from the normal table to the partitioned table:
insert into table tablename
partition (state)
select * from order_no_partition;
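The three steps, sketched end-to-end (column names, file path, and the orders_part table name are illustrative; order_no_partition matches the table name used above):

```sql
-- Step 1: normal (non-partitioned) staging table, loaded from a file
CREATE TABLE order_no_partition (order_id INT, amount DOUBLE, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE order_no_partition;

-- Step 2: partitioned table; the partition column is NOT repeated in the column list
CREATE TABLE orders_part (order_id INT, amount DOUBLE)
PARTITIONED BY (state STRING);

-- Step 3: dynamic-partition insert; the partition column must come last in the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE orders_part PARTITION (state)
SELECT order_id, amount, state FROM order_no_partition;
```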
---------------------------
bucketing:
-----------------
Step 1:
create a normal table and load the data.
Step 2:
create a bucketed table, bucketed on the required column(s).
Step 3:
transfer data from the normal table to the bucketed table.
set hive.enforce.bucketing=true;
Number of buckets = number of reducers.
In bucketing we use "clustered by".
To display the records of one bucket:
-------------------------------------
select * from tablename tablesample(bucket 1 out of 4);
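The bucketing steps, sketched (assumes the order_no_partition staging table from the partitioning section; bucket column, bucket count, and the orders_bkt name are illustrative):

```sql
SET hive.enforce.bucketing=true;

-- bucketed table: "clustered by" picks the bucket column
CREATE TABLE orders_bkt (order_id INT, amount DOUBLE, state STRING)
CLUSTERED BY (order_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- transfer from the normal table; one reducer per bucket
INSERT INTO TABLE orders_bkt SELECT * FROM order_no_partition;

-- read only the first of the 4 buckets
SELECT * FROM orders_bkt TABLESAMPLE (BUCKET 1 OUT OF 4 ON order_id);
```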
Partitioning can also be done on multiple columns.
----------------------------------------
3 categories of optimization:
----------------------------------
1. Table structure level optimization
(partitioning and bucketing).
2. Query level optimization
(join optimization).
3. Simplifying query expressions
(windowing functions).
---------------------------------------------
join optimization:
---------------------------
Number of MapReduce jobs = number of distinct join columns.
If we do multiple joins on the same column, only 1 MR job runs.
1. For faster queries, minimize the number of MapReduce jobs.
2. Bucketing and partitioning help optimize joins; tables should be bucketed or partitioned on the join columns.
Internally a join executes as a MapReduce job.
-------------------------------------
map side join---
-----------------------
All the work is done by mappers; there is no reducer.
In a regular (reduce-side) join the mapper emits (key, value) pairs:
key = the join column, value = all the other columns;
the reducer then combines each key with all its columns. That reducer part does not exist in a map-side join.
No shuffle or sort phase.
Mappers run in parallel, so a map-side join is faster.
Conditions:
1 small table and 1 big table.
If both are small, an RDBMS is the better place to do the join.
If there are 3 tables, then 2 small and 1 big.
The smaller table is broadcast to all the nodes.
The bigger table is divided into blocks.
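A map-side join can also be forced with the MAPJOIN hint (sketch; assumes customers is small enough to broadcast, using the orders and customers tables defined further down in these notes):

```sql
-- broadcast customers to every mapper; no reducer, no shuffle/sort
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_fname
FROM orders o
JOIN customers c ON (o.order_customer_id = c.customer_id);
```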
-----------------------------
Types of joins:
Inner join
left outer
right outer
full outer
When the left table is small, an inner join can be treated as a map-side join;
the output comes directly from the mappers.
When the left table is small, we can't use a left outer join as a map-side join:
a mapper sees only its own chunk of the big right table, so for an unmatched left row it can't tell whether the match is missing from the entire dataset or just from its chunk.
Since the entire left table is present on every machine, each mapper can decide whether a right-side row matches or not,
so a right outer join can be treated as a map-side join.
If the left table is small:
----------------------------------
only inner and right outer joins can be map-side joins.
If the right table is small:
---------------------------------
only inner and left outer joins can be map-side joins.
===================================
Bucket map join & SMB(Sort merge bucket join)
-----------------------------------------------
For a plain map-side join, one table must be small enough to fit in memory:
one big table and one small table.
Bucket Map join:
---------------------------------
Can work on 2 big tables.
Conditions:
Both tables should be bucketed on the join columns.
The number of buckets in one table should be an integral multiple of the number of buckets in the other table
(e.g., if t1 has 3 buckets, the other table can have 3, 6, 9, or 12).
--------------------
Map side join:
--------------
Before the MR job, a local task creates a hash table from the small table,
then puts it on HDFS;
from there the hash table is broadcast to all the nodes.
On each node it sits on local disk (distributed cache),
from where it is loaded into memory.
Only after this does the MR job start.
We can't load a huge table into memory this way.
Bucket Map join:
----------------------
Before the MR job, a local task creates a hash table per bucket,
then puts them on HDFS;
from there they are broadcast to all the nodes.
On each node they sit on local disk (distributed cache).
Only the required bucket is loaded into memory at a particular time (1 bucket).
SMB join(Sort Merge Bucket Join):
------------------------------------
Works on 2 big tables.
Conditions:
Both should be bucketed on the join columns.
Both should have the same number of buckets.
Both should be sorted on the join columns.
It is very fast.
No memory constraint.
Mapper 1 works on bucket 1 of tbl1 and bucket 1 of tbl2.
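A sketch of a table definition and the settings for an SMB join (t1/id/val are illustrative names; the set properties are standard Hive settings):

```sql
-- both tables must look like this: bucketed AND sorted on the join
-- column, with the same number of buckets
CREATE TABLE t1 (id INT, val STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

-- enable sort-merge-bucket conversion
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```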
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table orders \
-m 1 \
--warehouse-dir /user/sasmitsb4081/hive
--------------------------
set hive.auto.convert.join;
This shows the property setting; if it is true, the map-side join happens automatically.
By default it is true.
Local task:
gets the small table and builds an in-memory hash table from it.
Any table smaller than 25 MB is considered a small table.
To use hints for a map-side join:
set the auto map join property to false,
so that the hint is not ignored.
Small table as per Hive:
to get the current value, use set:
set hive.mapjoin.smalltable.filesize;
25 MB by default.
To change it:
set hive.mapjoin.smalltable.filesize = 200000;
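The hint route, sketched (hive.ignore.mapjoin.hint is a standard Hive property; in some versions it defaults to true, i.e. hints are ignored unless it is turned off):

```sql
SET hive.auto.convert.join=false;    -- no automatic map-side join
SET hive.ignore.mapjoin.hint=false;  -- honor the /*+ MAPJOIN(...) */ hint
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_city
FROM orders o JOIN customers c ON (o.order_customer_id = c.customer_id);
```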
Bucket map join:
--------------------------
to enable bucketing:
set hive.enforce.bucketing;
Number of buckets = number of reducers.
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table orders \
--warehouse-dir /user/sasmitsb4081/hive
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table customers \
--warehouse-dir /user/sasmitsb4081/hive
use sasmitsb4081;
CREATE external TABLE orders(order_id int,order_date string,order_customer_id int,order_status string)
row format delimited fields terminated by ',' stored as textfile
location '/user/sasmitsb4081/hive/orders';
CREATE external TABLE customers(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string)
row format delimited fields terminated by ',' stored as textfile
location '/user/sasmitsb4081/hive/customers';
bucketed customers:
CREATE external TABLE customers_bucketed(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string)
clustered by(customer_id) into 4 buckets row format delimited fields terminated BY ',';
insert into customers_bucketed select * from customers;
CREATE external TABLE orders_bucketed(order_id int,order_date string,order_customer_id int,order_status string) clustered by(order_customer_id) into 8 buckets row format delimited fields terminated BY ',';
insert into orders_bucketed select * from orders;
hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/customers_bucketed
hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/orders_bucketed
If we don't specify any location, the table's data goes to the default warehouse location:
/user/hive/warehouse/sasmitsb4081.db/
set hive.optimize.bucketmapjoin;
We need to set it to true:
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.join=true;
EXPLAIN EXTENDED SELECT c.customer_id, c.customer_fname, c.customer_lname, o.order_id, o.order_date FROM customers_bucketed c JOIN orders_bucketed o ON (c.customer_id = o.order_customer_id) limit 10;
=============================================
Windowing Functions:
-----------------------------
"unbounded preceding" means the first row of the partition.
By default the frame runs from the first row to the current row:
"rows between unbounded preceding and current row" is the default (when an order by is given).
If we don't give an order by,
the frame is from the first row to the last row.
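An explicit example on the orders table defined earlier in these notes (the frame written out is the same as the default once ORDER BY is present):

```sql
-- running count of orders per customer, oldest first
SELECT order_id, order_customer_id,
       COUNT(*) OVER (PARTITION BY order_customer_id
                      ORDER BY order_date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS running_count
FROM orders;
```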
================================
order by:
--------
1 reducer.
Total ordering: order by uses one reducer and does complete sorting.
sort by:
sort by uses multiple reducers.
It cannot guarantee total ordering (rows are sorted only within each reducer).
set hive.exec.reducers.bytes.per.reducer; (~250 MB)
1 reducer works per 250 MB of data.
The same value may go to multiple reducers.
======================================
distribute by:
------------------
Different values go to different reducers,
but it won't sort.
E.g., even numbers go to reducer 0,
odd numbers to reducer 1.
cluster by:
--------------------------
distribute by + sort by = cluster by
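The four clauses side by side on the orders table from earlier in these notes:

```sql
-- total ordering, forced through a single reducer
SELECT * FROM orders ORDER BY order_id;

-- sorted within each reducer only; no total ordering
SELECT * FROM orders SORT BY order_id;

-- same customer id always lands on the same reducer, but unsorted
SELECT * FROM orders DISTRIBUTE BY order_customer_id;

-- distribute by + sort by on the same column
SELECT * FROM orders CLUSTER BY order_customer_id;
```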