compre-hive

compression techniques in Hive:

----------------------------------------

- helps save storage

- helps process data faster

- reduces I/O cost


I/O cost depends on the amount of data in storage.


Compression and decompression have a cost: the time taken to compress and decompress.


But compared to the I/O gain, this cost can be neglected.


4 compression techniques:

------------------------

Snappy

LZO

Gzip

Bzip2


Some are optimized for storage, some for speed.


Snappy:

------------------

- fast compression codec

- size does not reduce drastically

- the codec most commonly used in projects

- optimized for speed rather than storage

- not splittable by default (e.g. with plain JSON, XML files)

- with container-based formats (Avro, Parquet, ORC), splittability is taken care of by the container
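As a sketch, Snappy is usually applied through a container format; the table name and columns below are hypothetical:

```sql
-- hypothetical table; ORC handles splitting, Snappy handles compression
CREATE TABLE orders_orc (
  order_id    INT,
  customer_id INT,
  amount      DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");
```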


LZO

----------

- optimized for speed

- inherently splittable (can be used with text, JSON, XML)

- good choice for text files

- requires a separate install

- Snappy is faster than LZO
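If the LZO codec is installed on the cluster, query output compression can be switched to it at the session level; the codec class below comes from the separate hadoop-lzo install mentioned above:

```sql
-- enable compressed query output and pick the LZO codec
-- (assumes the hadoop-lzo jars and native libraries are installed)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
```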


Gzip

--------------

- optimized for storage

- roughly 2.5x the compression of Snappy

- processing speed is slow

- not splittable

- used with container-based files

- the block size can be reduced, so there are more blocks and hence more parallelism


Bzip2

--------------

- optimized for storage, but very slow

- inherently splittable

- suited for active archival purposes


========================================


Vectorization

-------------------------------


A standard query processes one row at a time.

With vectorization, rows are processed in batches (blocks) of 1024 rows at a time.


To use vectorization, the data should be stored in ORC format.


set hive.vectorized.execution.enabled = true;


In the explain plan of the query, you will see:


Execution mode: vectorized
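A minimal sketch of checking this, assuming a hypothetical ORC table named orders_orc:

```sql
SET hive.vectorized.execution.enabled = true;

-- the plan for a query on an ORC table should show "Execution mode: vectorized"
EXPLAIN
SELECT customer_id, COUNT(*)
FROM orders_orc
GROUP BY customer_id;
```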


-----------------------------


Changing the hive engine:

--------------------------------

Hive supports 3 execution engines:

1. mr (MapReduce) (slow)

2. tez (fast)

3. spark (fastest)


By default, the engine is mr.


set hive.execution.engine;
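The same property both shows and sets the engine:

```sql
SET hive.execution.engine;       -- with no value, prints the current engine
SET hive.execution.engine=tez;   -- switch to Tez (assuming Tez is installed on the cluster)
```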


==========================


Apache thrift server/service

----------------------------------

The Thrift service allows any client to connect to Hive.


When you are sitting outside the edge node and want to connect to Hive,

you can write a Java or Python program.


This is called HiveServer2 (based on the Thrift service).


The Thrift server listens on port 10000.

Its web UI runs on port 10002.
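For example, a Beeline client connects through the Thrift port; the host name and user below are hypothetical:

```shell
# connect to HiveServer2 over Thrift/JDBC on port 10000 (host/user are made up)
beeline -u "jdbc:hive2://edgenode.example.com:10000/default" -n hiveuser
```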


===================================


MSCK repair:

---------------------


On an external table, a partition directory may be added directly to HDFS,

but the corresponding partition metadata is missing from the metastore.


So we can run MSCK REPAIR TABLE on the table, and the missing partition metadata will be created.


Mostly used for external tables.
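A sketch with a hypothetical partitioned external table:

```sql
-- after a new directory like .../orders_ext/order_date=2024-01-01/
-- has been copied into HDFS under the table's location:
MSCK REPAIR TABLE orders_ext;

-- the new partition should now be visible
SHOW PARTITIONS orders_ext;
```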


=======================================


enable nodrop feature:

--------------------------------

ALTER TABLE tablename ENABLE NO_DROP;


ALTER TABLE tablename DISABLE NO_DROP;


offline features on table:

---------------------------------

to restrict a table from being queried.


ALTER TABLE tablename ENABLE OFFLINE;


ALTER TABLE tablename DISABLE OFFLINE;
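A short sketch combining both features on a hypothetical table:

```sql
ALTER TABLE orders ENABLE NO_DROP;   -- DROP TABLE orders now fails
ALTER TABLE orders ENABLE OFFLINE;   -- queries on orders now fail

ALTER TABLE orders DISABLE OFFLINE;  -- queries allowed again
ALTER TABLE orders DISABLE NO_DROP;  -- drop allowed again
```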


skipping header rows

-------------------------------

While creating the table, we need to use TBLPROPERTIES:


tblproperties("skip.header.line.count"="3")
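In context, for a hypothetical CSV-backed table whose files carry 3 header lines:

```sql
CREATE TABLE sales_csv (
  sale_id INT,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count"="3");
```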


immutable feature:

---------------------


Immutable means we cannot change it

(we cannot append new rows or do any other modification).


tblproperties("immutable"="true")


You will still be able to overwrite the data (INSERT OVERWRITE is allowed; appending with INSERT INTO fails once the table has data).
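A sketch with hypothetical table names:

```sql
CREATE TABLE ref_data (id INT, val STRING)
TBLPROPERTIES ("immutable"="true");

INSERT OVERWRITE TABLE ref_data
SELECT id, val FROM staging_ref;          -- allowed

-- INSERT INTO TABLE ref_data SELECT ...  -- fails once ref_data has data
```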


drop vs truncate vs purge

-----------------------------------

DROP on a managed table: both data and metadata are deleted.

DROP on an external table: only metadata is deleted.


TRUNCATE: all the data is deleted, but the table schema remains.


Purge:

If purge is set to true and you delete the data, the data is gone with no recovery (it skips the Trash).


If purge is set to false and you delete the data, the data can be recovered (it goes to the Trash).


tblproperties("auto.purge"="true")
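A sketch on a hypothetical managed table whose data bypasses the Trash:

```sql
ALTER TABLE logs_tmp SET TBLPROPERTIES ("auto.purge"="true");

TRUNCATE TABLE logs_tmp;    -- data removed permanently, schema remains
DROP TABLE logs_tmp PURGE;  -- an explicit PURGE on drop also skips the Trash
```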


treating empty string as null:

--------------------------------

tblproperties("serialization.null.format"="")
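In context, for a hypothetical delimited table where empty strings in the files should be read back as NULL:

```sql
CREATE TABLE customers_txt (
  customer_id INT,
  email       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("serialization.null.format"="");
```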


executing normal linux command from hive:

-----------------------------------------------

!ls -ltr


setting hive variable:

--------------------------

hivevar


set hivevar:favourite_customer=1111;


select * from orders where customer_id = ${hivevar:favourite_customer};


To print column headers in query output: set hive.cli.print.header=true;


cartesian product:

-----------------------


select * from table1,table2


Number of rows: if table1 has 10 rows and table2 has 20, the result has 10 * 20 = 200 rows.
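The explicit form of the same cartesian product:

```sql
-- every row of table1 paired with every row of table2
SELECT * FROM table1 CROSS JOIN table2;
-- with 10 rows in table1 and 20 in table2, this returns 200 rows
```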

 

Filter operations are evaluated left to right.


System-defined functions are resolved first, then UDFs.

