Hive:
----------------
Partitioning:
To see the list of partitions:
show partitions table_name;
Dynamic partitioning:
By default it is disabled. Enable it with:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Step 1:
create a normal table and load the data.
Step 2:
create a partitioned table, partitioned on the required column(s).
Step 3:
transfer data from the normal table to the partitioned table:
insert into table tablename
partition (state)
select * from order_no_partition;
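The three steps, sketched end-to-end (column names, file path, and the orders_part table name are illustrative; order_no_partition matches the table name used above):

```sql
-- Step 1: normal (non-partitioned) staging table, loaded from a file
CREATE TABLE order_no_partition (order_id INT, amount DOUBLE, state STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE order_no_partition;

-- Step 2: partitioned table; the partition column is NOT repeated in the column list
CREATE TABLE orders_part (order_id INT, amount DOUBLE)
PARTITIONED BY (state STRING);

-- Step 3: dynamic-partition insert; the partition column must come last in the SELECT
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE orders_part PARTITION (state)
SELECT order_id, amount, state FROM order_no_partition;
```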
---------------------------
bucketing:
-----------------
Step 1:
create a normal table and load the data.
Step 2:
create a bucketed table, bucketed on the required column(s).
Step 3:
transfer data from the normal table to the bucketed table.
set hive.enforce.bucketing=true;
Number of buckets = number of reducers.
In bucketing we use "clustered by".
To display the records of one bucket:
-------------------------------------
select * from tablename tablesample(bucket 1 out of 4);
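The bucketing steps, sketched (assumes the order_no_partition staging table from the partitioning section; bucket column, bucket count, and the orders_bkt name are illustrative):

```sql
SET hive.enforce.bucketing=true;

-- bucketed table: "clustered by" picks the bucket column
CREATE TABLE orders_bkt (order_id INT, amount DOUBLE, state STRING)
CLUSTERED BY (order_id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- transfer from the normal table; one reducer per bucket
INSERT INTO TABLE orders_bkt SELECT * FROM order_no_partition;

-- read only the first of the 4 buckets
SELECT * FROM orders_bkt TABLESAMPLE (BUCKET 1 OUT OF 4 ON order_id);
```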
Partitioning can also be done on multiple columns.
----------------------------------------
3 categories of optimization:
----------------------------------
1. Table structure level optimization
(partitioning and bucketing).
2. Query level optimization
(join optimization).
3. Simplifying query expressions
(windowing functions).
---------------------------------------------
join optimization:
---------------------------
Number of MapReduce jobs = number of distinct join columns.
If we do multiple joins on the same column, only 1 MR job runs.
1. For faster queries, minimize the number of MapReduce jobs.
2. Bucketing and partitioning help optimize joins; tables should be bucketed or partitioned on the join columns.
Internally a join executes as a MapReduce job.
-------------------------------------
map side join---
-----------------------
All the work is done by mappers; there is no reducer.
In a regular (reduce-side) join the mapper emits (key, value) pairs:
key = the join column, value = all the other columns;
the reducer then combines each key with all its columns. That reducer part does not exist in a map-side join.
No shuffle or sort phase.
Mappers run in parallel, so a map-side join is faster.
Conditions:
1 small table and 1 big table.
If both are small, an RDBMS is the better place to do the join.
If there are 3 tables, then 2 small and 1 big.
The smaller table is broadcast to all the nodes.
The bigger table is divided into blocks.
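A map-side join can also be forced with the MAPJOIN hint (sketch; assumes customers is small enough to broadcast, using the orders and customers tables defined further down in these notes):

```sql
-- broadcast customers to every mapper; no reducer, no shuffle/sort
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_fname
FROM orders o
JOIN customers c ON (o.order_customer_id = c.customer_id);
```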
-----------------------------
Types of joins:
Inner join
left outer
right outer
full outer
When the left table is small, an inner join can be treated as a map-side join;
the output comes directly from the mappers.
When the left table is small, we can't use a left outer join as a map-side join:
a mapper sees only its own chunk of the big right table, so for an unmatched left row it can't tell whether the match is missing from the entire dataset or just from its chunk.
Since the entire left table is present on every machine, each mapper can decide whether a right-side row matches or not,
so a right outer join can be treated as a map-side join.
If the left table is small:
----------------------------------
only inner and right outer joins can be map-side joins.
If the right table is small:
---------------------------------
only inner and left outer joins can be map-side joins.
===================================
Bucket map join & SMB(Sort merge bucket join)
-----------------------------------------------
For a plain map-side join, one table must be small enough to fit in memory:
one big table and one small table.
Bucket Map join:
---------------------------------
Can work on 2 big tables.
Conditions:
Both tables should be bucketed on the join columns.
The number of buckets in one table should be an integral multiple of the number of buckets in the other table
(e.g., if t1 has 3 buckets, the other table can have 3, 6, 9, or 12).
--------------------
Map side join:
--------------
Before the MR job, a local task creates a hash table from the small table,
then puts it on HDFS;
from there the hash table is broadcast to all the nodes.
On each node it sits on local disk (distributed cache),
from where it is loaded into memory.
Only after this does the MR job start.
We can't load a huge table into memory this way.
Bucket Map join:
----------------------
Before the MR job, a local task creates a hash table per bucket,
then puts them on HDFS;
from there they are broadcast to all the nodes.
On each node they sit on local disk (distributed cache).
Only the required bucket is loaded into memory at a particular time (1 bucket).
SMB join(Sort Merge Bucket Join):
------------------------------------
Works on 2 big tables.
Conditions:
Both should be bucketed on the join columns.
Both should have the same number of buckets.
Both should be sorted on the join columns.
It is very fast.
No memory constraint.
Mapper 1 works on bucket 1 of tbl1 and bucket 1 of tbl2.
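A sketch of a table definition and the settings for an SMB join (t1/id/val are illustrative names; the set properties are standard Hive settings):

```sql
-- both tables must look like this: bucketed AND sorted on the join
-- column, with the same number of buckets
CREATE TABLE t1 (id INT, val STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS;

-- enable sort-merge-bucket conversion
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```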
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table orders \
-m 1 \
--warehouse-dir /user/sasmitsb4081/hive
--------------------------
set hive.auto.convert.join;
This shows the property setting; if it is true, the map-side join happens automatically.
By default it is true.
Local task:
gets the small table and builds an in-memory hash table from it.
Any table smaller than 25 MB is considered a small table.
To use hints for a map-side join:
set the auto map join property to false,
so that the hint is not ignored.
Small table as per Hive:
to get the current value, use set:
set hive.mapjoin.smalltable.filesize;
25 MB by default.
To change it:
set hive.mapjoin.smalltable.filesize = 200000;
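The hint route, sketched (hive.ignore.mapjoin.hint is a standard Hive property; in some versions it defaults to true, i.e. hints are ignored unless it is turned off):

```sql
SET hive.auto.convert.join=false;    -- no automatic map-side join
SET hive.ignore.mapjoin.hint=false;  -- honor the /*+ MAPJOIN(...) */ hint
SELECT /*+ MAPJOIN(c) */ o.order_id, c.customer_city
FROM orders o JOIN customers c ON (o.order_customer_id = c.customer_id);
```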
Bucket map join:
--------------------------
to enable bucketing:
set hive.enforce.bucketing;
Number of buckets = number of reducers.
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table orders \
--warehouse-dir /user/sasmitsb4081/hive
sqoop import \
--connect "jdbc:mysql://10.142.1.2/retail_db" \
--username sqoopuser \
--password NHkkP876rp \
--table customers \
--warehouse-dir /user/sasmitsb4081/hive
use sasmitsb4081;
CREATE external TABLE orders(order_id int,order_date string,order_customer_id int,order_status string)
row format delimited fields terminated by ',' stored as textfile
location '/user/sasmitsb4081/hive/orders';
CREATE external TABLE customers(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string)
row format delimited fields terminated by ',' stored as textfile
location '/user/sasmitsb4081/hive/customers';
bucketed customers:
CREATE external TABLE customers_bucketed(customer_id int,customer_fname string,customer_lname string,customer_email string,customer_password string,customer_street string,customer_city string,customer_state string,customer_zipcode string)
clustered by(customer_id) into 4 buckets row format delimited fields terminated BY ',';
insert into customers_bucketed select * from customers;
CREATE external TABLE orders_bucketed(order_id int,order_date string,order_customer_id int,order_status string) clustered by(order_customer_id) into 8 buckets row format delimited fields terminated BY ',';
insert into orders_bucketed select * from orders;
hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/customers_bucketed
hdfs dfs -ls /apps/hive/warehouse/sasmitsb4081.db/orders_bucketed
If we don't specify any location, the table's data goes to the default warehouse location:
/user/hive/warehouse/sasmitsb4081.db/
set hive.optimize.bucketmapjoin;
We need to set it to true:
set hive.optimize.bucketmapjoin = true;
set hive.auto.convert.join=true;
EXPLAIN EXTENDED SELECT c.customer_id, c.customer_fname, c.customer_lname, o.order_id, o.order_date FROM customers_bucketed c JOIN orders_bucketed o ON (c.customer_id = o.order_customer_id) limit 10;
=============================================
Windowing Functions:
-----------------------------
"unbounded preceding" means the first row of the partition.
By default the frame runs from the first row to the current row:
"rows between unbounded preceding and current row" is the default (when an order by is given).
If we don't give an order by,
the frame is from the first row to the last row.
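An explicit example on the orders table defined earlier in these notes (the frame written out is the same as the default once ORDER BY is present):

```sql
-- running count of orders per customer, oldest first
SELECT order_id, order_customer_id,
       COUNT(*) OVER (PARTITION BY order_customer_id
                      ORDER BY order_date
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS running_count
FROM orders;
```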
================================
order by:
--------
1 reducer.
Total ordering: order by uses one reducer and does complete sorting.
sort by:
sort by uses multiple reducers.
It cannot guarantee total ordering (rows are sorted only within each reducer).
set hive.exec.reducers.bytes.per.reducer; (~250 MB)
1 reducer works per 250 MB of data.
The same value may go to multiple reducers.
======================================
distribute by:
------------------
Different values go to different reducers,
but it won't sort.
E.g., even numbers go to reducer 0,
odd numbers to reducer 1.
cluster by:
--------------------------
distribute by + sort by = cluster by
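The four clauses side by side on the orders table from earlier in these notes:

```sql
-- total ordering, forced through a single reducer
SELECT * FROM orders ORDER BY order_id;

-- sorted within each reducer only; no total ordering
SELECT * FROM orders SORT BY order_id;

-- same customer id always lands on the same reducer, but unsorted
SELECT * FROM orders DISTRIBUTE BY order_customer_id;

-- distribute by + sort by on the same column
SELECT * FROM orders CLUSTER BY order_customer_id;
```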