
 hive:

----------------

Partitioning:


to see the list of partitions:


show partitions tablename;


dynamic partitioning:


by default it is disabled:


set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode = nonstrict;


step:1

create normal tbl

and load data


step2

create partition table with partition on columns


step3

transfer from normal to partition table.


insert into table tablename

partition (state)

select * from order_no_partition;

(for dynamic partitioning, the partition column must come last in the select list)
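The three steps above can be sketched end to end as follows (table and column names here are illustrative, not from the original notes):

```sql
-- enable dynamic partitioning (disabled by default)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- step 1: normal (non-partitioned) staging table, then load data
CREATE TABLE order_no_partition (
  order_id INT,
  order_date STRING,
  amount DOUBLE,
  state STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/demo/orders.csv' INTO TABLE order_no_partition;

-- step 2: partitioned table; the partition column goes in PARTITIONED BY,
-- not in the main column list
CREATE TABLE orders_partitioned (
  order_id INT,
  order_date STRING,
  amount DOUBLE
)
PARTITIONED BY (state STRING);

-- step 3: transfer; the dynamic partition column (state) must come last in the SELECT
INSERT INTO TABLE orders_partitioned PARTITION (state)
SELECT order_id, order_date, amount, state FROM order_no_partition;
```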


---------------------------

bucketing:

-----------------


step:1

create normal tbl

and load data


step2

create bucketing table with bucket on columns


step3

transfer from normal to bucket table.


set hive.enforce.bucketing = true;


no of bucket = no of reducer


in bucketing.. we use cluster by
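A minimal sketch of steps 1-3 above (table and column names are illustrative):

```sql
-- enable bucketed inserts so the number of reducers matches the bucket count
SET hive.enforce.bucketing = true;

-- step 2: bucketed table, clustered on the chosen column
CREATE TABLE users_bucketed (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

-- step 3: transfer from the normal table; Hive hashes user_id into 4 buckets
-- (on older Hive versions, add CLUSTER BY user_id to the select instead of
-- relying on hive.enforce.bucketing)
INSERT INTO TABLE users_bucketed
SELECT user_id, name FROM users;
```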


to display record of one bucket


-------------------------------------

select * from tablename tablesample(bucket 1 out of 4);


partitioning:-- can also be done on multiple columns

----------------------------------------


3 categories of optimization:

----------------------------------

1. table structure level optimization.

(partitioning and bucketing)

2. query level optimization.

(join optimization)


3. simplifying query expressions.

(windowing functions).


---------------------------------------------


join optimization:

---------------------------


no of mapreduce jobs = no of distinct join columns


if we are doing multiple joins on the same column then--1 mr job


1. for faster queries minimize the no of map reduce jobs.

2. bucketing and partitioning...help to optimize joins. the tables should be bucketed or partitioned on the join columns.


internally a join is executed as an mr job


-------------------------------------


map side join---

-----------------------

all the work is done by mappers -- no reducer


mapper -- emits (key, value)


key -- is the join column and value -- all other columns.

(the reducer step (combining the key and all columns) -- this part is not there in map side join)


no shuffle or sort--

parallel mappers-- so map side join is faster.


condition:


1 small table and 1 big table.


if both are small--then an rdbms is the best place to do it


if 3 tables--then 2 small and 1 big


the smaller table--will be broadcast-- to all nodes.

the bigger table is divided into blocks


-----------------------------

Types of joins:


Inner join

left outer

right outer

full outer


when left side table is small then inner join can be treated as map-side join


the o/p we get directly from the mappers


when the left table is small--then we can't use left outer as a map side join, because each mapper sees only a chunk of the right table, so it can't know whether a left-side row is unmatched in just its chunk or in the entire dataset


as the entire left table is present on each machine-- we can always say whether a right-side row is matching or not--

so right outer --can be treated as a map side join.


if left side table is small:

----------------------------------

only inner and right outer join are map side join.


if the right table is small:

---------------------------------

only inner and left outer are map side join


===================================


Bucket map join & SMB(Sort merge bucket join)

-----------------------------------------------


one table should be small enough to fit in memory--then we can use map side join


one big and other should be small.


Bucket Map join:

---------------------------------


can work on 2 big tables;


conditions:

both tables should be bucketed on the join columns.

the no of buckets in one table should be an integral multiple of the no of buckets in the other table.


e.g. t1 has 3 buckets -- the other table can have 3, 6, 9, or 12 buckets.
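As a sketch, the two bucketed tables below satisfy the conditions (4 and 8 buckets; names are illustrative), and the join can then proceed bucket by bucket:

```sql
-- both tables bucketed on the join column; 8 is an integral multiple of 4
CREATE TABLE customers_b (customer_id INT, name STRING)
CLUSTERED BY (customer_id) INTO 4 BUCKETS;

CREATE TABLE orders_b (order_id INT, order_customer_id INT)
CLUSTERED BY (order_customer_id) INTO 8 BUCKETS;

-- enable bucket map join conversion
SET hive.optimize.bucketmapjoin = true;

SELECT o.order_id, c.name
FROM customers_b c
JOIN orders_b o ON c.customer_id = o.order_customer_id;
```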


--------------------


Map side join:

--------------

before the mr job--

a local task will create a hash table.

then it will be put on hdfs

then this hash table is broadcast to all the nodes.

in each node it will be on local disk, via the distributed cache.


from local disk it is loaded into memory.


after this the mr job starts.


we can't load a huge table into memory.


Bucket Map join:

----------------------


before the mr job--

a local task will create a hash table (one per bucket of the small table).

then it will be put on hdfs

then these hash tables are broadcast to all the nodes.

in each node they will be on local disk, via the distributed cache.


only the required bucket is loaded into memory at a particular time.(1 bucket)



SMB join(Sort Merge Bucket Join):

------------------------------------


works on 2 big tables:


conditions:


both should be bucketed on the join columns.

both should have the same no of buckets.

both should be sorted on the join columns.


it is very fast.


--no memory constraint


mapper 1 will work on bucket 1 of tbl1 and bucket 1 of tbl2
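A sketch of SMB-ready tables and the settings that enable the conversion (table and column names are illustrative):

```sql
-- both tables: same bucket count, bucketed AND sorted on the join column
CREATE TABLE t1_smb (id INT, v1 STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;

CREATE TABLE t2_smb (id INT, v2 STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS;

-- settings commonly used to enable sort-merge-bucket join conversion
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.auto.convert.sortmerge.join = true;

-- mapper i merges bucket i of t1_smb with bucket i of t2_smb;
-- no hash table is held in memory, so there is no memory constraint
SELECT a.id, a.v1, b.v2
FROM t1_smb a JOIN t2_smb b ON a.id = b.id;
```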



sqoop import \

--connect "jdbc:mysql://10.142.1.2/retail_db" \

--username sqoopuser \

--password NHkkP876rp \

--table orders \

-m 1 \

--warehouse-dir /user/sasmitsb4081/hive


--------------------------


set hive.auto.convert.join;


gives the property setting--if true-- then map side join happens


by default true.


local task:

gets the small table and puts it in memory as a hash table.


any table less than 25 mb is considered a small table.


to use hints for map side join:


if we set the auto map join property to false--

the hint will not be ignored.
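With auto conversion turned off, a MAPJOIN hint can force the small table into memory (table names are illustrative):

```sql
-- turn off automatic conversion so the hint is honored
SET hive.auto.convert.join = false;

-- hint: broadcast the small table (customers) to every mapper
SELECT /*+ MAPJOIN(c) */ c.customer_id, o.order_id
FROM customers c
JOIN orders o ON c.customer_id = o.order_customer_id;
```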


small table as per hive:


to get the value we need to use set:


set hive.mapjoin.smalltable.filesize;

25 mb (25000000 bytes);


to set this:


set hive.mapjoin.smalltable.filesize = 200000;


Bucket map join:

--------------------------


to enable bucketing:


set hive.enforce.bucketing = true;


no of buckets = no of reducers



sqoop import \

--connect "jdbc:mysql://10.142.1.2/retail_db" \

--username sqoopuser \

--password NHkkP876rp \

--table orders \

--warehouse-dir /user/sasmitsb4081/hive



sqoop import \

--connect "jdbc:mysql://10.142.1.2/retail_db" \

--username sqoopuser \

--password NHkkP876rp \

--table customers \

--warehouse-dir /user/sasmitsb4081/hive


use sasmitsb4081;


CREATE external TABLE orders(order_id int,order_date string,order_customer_id int,order_status string) 

row format delimited fields terminated by ',' stored as textfile

location '/user/sasmitsb4081/hive/orders';


CREATE external TABLE customers(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string) 

row format delimited fields terminated by ',' stored as textfile 

location '/user/sasmitsb4081/hive/customers';



bucketed customers:


CREATE external TABLE customers_bucketed(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string) 

clustered by(customer_id) into 4 buckets row format delimited fields terminated BY ',';


insert into customers_bucketed select * from customers;


CREATE external TABLE orders_bucketed(order_id int,order_date string,order_customer_id int,order_status string) clustered by(order_customer_id) into 8 buckets row format delimited fields terminated BY ',';


insert into orders_bucketed select * from orders;


hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/customers_bucketed


hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/orders_bucketed


if we don't specify any location--then--it will be in the default location


/apps/hive/warehouse/sasmitsb4081.db/


set hive.optimize.bucketmapjoin


We need to set it to true.


set hive.optimize.bucketmapjoin = true;

set hive.auto.convert.join=true;



EXPLAIN EXTENDED SELECT c.customer_id, c.customer_fname, c.customer_lname, o.order_id, o.order_date FROM customers_bucketed c JOIN orders_bucketed o ON (c.customer_id = o.order_customer_id) limit 10;


=============================================


Windowing Functions:

-----------------------------


unbounded preceding--- means-- the first row


by default-- it is first row to current row


rows between unbounded preceding and current row-- is the default frame


if we don't give order by--


then the frame is first row till last row
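A sketch of the default frame versus the no-order-by case (table and column names are illustrative):

```sql
SELECT state, order_id, amount,
       -- with ORDER BY, the default frame is unbounded preceding to current row,
       -- so this is a running total within each state
       SUM(amount) OVER (PARTITION BY state ORDER BY order_id) AS running_total,
       -- without ORDER BY, the frame is the whole partition (first row to last row)
       SUM(amount) OVER (PARTITION BY state) AS state_total
FROM orders;
```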


================================


order by:


--------


1 reducer

total ordering


order by -- uses one reducer, and it does complete sorting


sort by:


sort by -- uses multiple reducers


can't guarantee total ordering.



set hive.exec.reducers.bytes.per.reducer; (250 mb)


per 250 mb of data -- 1 reducer works


with sort by, the same value can go to multiple reducers.

======================================

distribute by:

------------------


the same value always goes to the same reducer (different values can go to different reducers)

but it won't sort


e.g. even nos go to reducer 0

odd nos to reducer 1


cluster by:

--------------------------

distribute by + sort by = cluster by
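The four clauses side by side (illustrative table):

```sql
-- ORDER BY: one reducer, totally ordered output
SELECT * FROM orders ORDER BY order_id;

-- SORT BY: sorted within each reducer only; no total ordering
SELECT * FROM orders SORT BY order_id;

-- DISTRIBUTE BY: same order_customer_id lands on the same reducer, but unsorted
SELECT * FROM orders DISTRIBUTE BY order_customer_id;

-- CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column
SELECT * FROM orders CLUSTER BY order_customer_id;
```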




