Compression techniques in Hive:
----------------------------------------
-helps save storage
-helps process data faster
-reduces I/O cost
I/O cost depends on the amount of data read from and written to storage.
compression and decompression have a cost: the time taken to compress and decompress.
but compared to the I/O gain, we can neglect this.
4 compression techniques
------------------------
Snappy
LZO
Gzip
Bzip2
some are optimized for storage,
some for speed
Snappy:
------------------
-fast compression codec
--size does not reduce drastically
--most commonly used in projects
--optimized for speed rather than storage
--by default not splittable (with plain formats like JSON, XML)
--with Avro, Parquet, ORC (container-based formats), splittability is taken care of
LZO
----------
--optimized for speed
--inherently splittable (can be used with text, JSON, XML)
--good choice for text files
--requires a separate install
--Snappy is faster than LZO
Gzip
--------------
--optimized for storage
--compression ratio is roughly 2.5x that of Snappy
--processing speed is slow
--not splittable
--used with container-based files
--a workaround: reduce the block size, so data lands in more blocks and processing is more parallelized
Bzip2
--------------
--optimized for storage, but very slow
--inherently splittable
--used for active archival purposes
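
a quick sketch of enabling these codecs (the table names here are made up for illustration):

-- compress plain-file query output with Snappy
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- for container formats like ORC, set the codec as a table property
create table sales_orc stored as orc
tblproperties("orc.compress"="SNAPPY")
as select * from sales_staging;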
========================================
Vectorization
-------------------------------
a standard query processes one row at a time
with vectorization, rows are processed in batches (blocks) of 1024 rows at a time
to use vectorization, the data should be in ORC format
set hive.vectorized.execution.enabled = true;
in the explain plan (of the select) you will see:
execution mode : vectorized
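
for example (sales_orc is a hypothetical ORC table):

set hive.vectorized.execution.enabled = true;
explain select count(*) from sales_orc;
-- the map stage of the plan should show "Execution mode: vectorized"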
-----------------------------
Changing the hive engine:
--------------------------------
hive supports 3 execution engines:
1. mr (MapReduce) (slow)
2. tez (fast)
3. spark (fastest)
mr is the default
set hive.execution.engine;
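
for example, to check and then switch the engine for the current session:

set hive.execution.engine;       -- prints the current value, e.g. hive.execution.engine=mr
set hive.execution.engine=tez;   -- switch this session to Tez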
==========================
Apache thrift server/service
----------------------------------
the Thrift service allows any client to connect to Hive
when you are sitting outside the edge node and want to connect to Hive,
you can write a Java program / Python program
this is called the Hive server (HiveServer2, based on the Thrift service)
the Thrift server listens on port 10000
the Thrift server's web UI is on port 10002
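
for example, connecting with Beeline (the host name and user below are placeholders):

beeline -u jdbc:hive2://<hiveserver2-host>:10000 -n <username>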
===================================
MSCK repair:
---------------------
on an external table, partition directories were added directly to the HDFS dir,
but the partition metadata is not there in the metastore.
so we can run msck repair on the table, and the metadata will be created.
mostly used for external tables
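
for example (orders_ext is a hypothetical external table partitioned by order_date):

msck repair table orders_ext;
show partitions orders_ext;   -- the newly discovered partitions now appear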
=======================================
enable nodrop feature:
--------------------------------
alter table tablename enable no_drop;
alter table tablename disable no_drop;
offline features on table:
---------------------------------
to restrict a table from being queried:
alter table tablename enable offline;
alter table tablename disable offline;
skipping header rows
-------------------------------
while creating the table we need to use tblproperties:
tblproperties("skip.header.line.count"="3")
immutable feature:
---------------------
immutable means we cannot change it
(we cannot append new rows or modify data once the table already has data)
tblproperties("immutable"="true")
you will still be able to overwrite the data (insert overwrite works)
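
a minimal sketch (audit_log is a hypothetical table):

create table audit_log (msg string)
tblproperties("immutable"="true");
-- the first load into the empty table succeeds;
-- a later insert into fails, but insert overwrite still works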
drop vs truncate vs purge
-----------------------------------
drop on a managed table: both data and metadata are deleted
drop on an external table: only metadata is deleted
truncate: all the data is deleted, but the table schema remains
purge:
if purge is set to true: when you delete the data, it is gone, no recovery (it skips the trash)
if purge is set to false: when you delete the data, it can be recovered (from the trash)
tblproperties("auto.purge"="true")
treating empty string as null:
--------------------------------
tblproperties("serialization.null.format"="")
executing normal linux command from hive:
-----------------------------------------------
!ls -ltr;
setting hive variable:
--------------------------
use the hivevar namespace:
set hivevar:favourite_customer=1111;
select * from orders where customer_id = ${favourite_customer};
(it can also be referenced explicitly as ${hivevar:favourite_customer})
to show column headers in query output: set hive.cli.print.header=true;
cartesian product:
-----------------------
select * from table1, table2;
number of rows = rows(table1) * rows(table2), e.g. 10 * 20 = 200
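the same cartesian product can also be written explicitly:
select * from table1 cross join table2;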
filter operations are evaluated left to right
function resolution: system-defined functions are checked first, then UDFs