Compression techniques in Hive:
----------------------------------------
-helps save storage
-helps process data faster
-reduces I/O cost
I/O cost depends on the amount of data read from and written to storage.
compression and decompression have a cost: the time taken to compress and decompress.
but compared to the I/O gain, we can neglect this.
4 compression techniques
------------------------
Snappy
LZO
Gzip
Bzip2
some are optimized for storage,
some for speed
Snappy:
------------------
-fast compression codec
--size does not reduce drastically
--most commonly used in projects
--optimized for speed rather than storage
--by default not splittable (with plain formats like JSON, XML)
--with Avro, Parquet, ORC (container-based formats), splittability is taken care of
LZO
----------
--optimized for speed
--inherently splittable (can be used with text, JSON, XML)
--good choice for text files
--requires a separate install
--Snappy is faster than LZO
Gzip
--------------
--optimized for storage
--compression ratio is roughly 2.5x that of Snappy
--processing speed is slow
--not splittable
--used with container-based files
--a workaround: reduce the block size, so data lands in more blocks and processing is more parallelized
Bzip2
--------------
--optimized for storage, but very slow
--inherently splittable
--used for active archival purposes
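
a quick sketch of enabling these codecs (the table names here are made up for illustration):

-- compress plain-file query output with Snappy
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- for container formats like ORC, set the codec as a table property
create table sales_orc stored as orc
tblproperties("orc.compress"="SNAPPY")
as select * from sales_staging;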
========================================
Vectorization
-------------------------------
a standard query processes one row at a time
with vectorization, rows are processed in batches (blocks) of 1024 rows at a time
to use vectorization, the data should be in ORC format
set hive.vectorized.execution.enabled = true;
in the explain plan (of the select) you will see:
execution mode : vectorized
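
for example (sales_orc is a hypothetical ORC table):

set hive.vectorized.execution.enabled = true;
explain select count(*) from sales_orc;
-- the map stage of the plan should show "Execution mode: vectorized"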
-----------------------------
Changing the hive engine:
--------------------------------
hive supports 3 execution engines:
1. mr (MapReduce) (slow)
2. tez (fast)
3. spark (fastest)
mr is the default
set hive.execution.engine;
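
for example, to check and then switch the engine for the current session:

set hive.execution.engine;       -- prints the current value, e.g. hive.execution.engine=mr
set hive.execution.engine=tez;   -- switch this session to Tez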
==========================
Apache thrift server/service
----------------------------------
the Thrift service allows any client to connect to Hive
when you are sitting outside the edge node and want to connect to Hive,
you can write a Java program / Python program
this is called the Hive server (HiveServer2, based on the Thrift service)
the Thrift server listens on port 10000
the Thrift server's web UI is on port 10002
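
for example, connecting with Beeline (the host name and user below are placeholders):

beeline -u jdbc:hive2://<hiveserver2-host>:10000 -n <username>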
===================================
MSCK repair:
---------------------
on an external table, partition directories were added directly to the HDFS dir,
but the partition metadata is not there in the metastore.
so we can run msck repair on the table, and the metadata will be created.
mostly used for external tables
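
for example (orders_ext is a hypothetical external table partitioned by order_date):

msck repair table orders_ext;
show partitions orders_ext;   -- the newly discovered partitions now appear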
=======================================
enable nodrop feature:
--------------------------------
alter table tablename enable no_drop;
alter table tablename disable no_drop;
offline features on table:
---------------------------------
to restrict a table from being queried:
alter table tablename enable offline;
alter table tablename disable offline;
skipping header rows
-------------------------------
while creating the table we need to use tblproperties:
tblproperties("skip.header.line.count"="3")
immutable feature:
---------------------
immutable means we cannot change it
(we cannot append new rows or modify data once the table already has data)
tblproperties("immutable"="true")
you will still be able to overwrite the data (insert overwrite works)
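
a minimal sketch (audit_log is a hypothetical table):

create table audit_log (msg string)
tblproperties("immutable"="true");
-- the first load into the empty table succeeds;
-- a later insert into fails, but insert overwrite still works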
drop vs truncate vs purge
-----------------------------------
drop on a managed table: both data and metadata are deleted
drop on an external table: only metadata is deleted
truncate: all the data is deleted, but the table schema remains
purge:
if purge is set to true: when you delete the data, it is gone, no recovery (it skips the trash)
if purge is set to false: when you delete the data, it can be recovered (from the trash)
tblproperties("auto.purge"="true")
treating empty string as null:
--------------------------------
tblproperties("serialization.null.format"="")
executing normal linux command from hive:
-----------------------------------------------
!ls -ltr;
setting hive variable:
--------------------------
use the hivevar namespace:
set hivevar:favourite_customer=1111;
select * from orders where customer_id = ${favourite_customer};
(it can also be referenced explicitly as ${hivevar:favourite_customer})
to show column headers in query output: set hive.cli.print.header=true;
cartesian product:
-----------------------
select * from table1, table2;
number of rows = rows(table1) * rows(table2), e.g. 10 * 20 = 200
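the same cartesian product can also be written explicitly:
select * from table1 cross join table2;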
filter operations are evaluated left to right
function resolution: system-defined functions are checked first, then UDFs