Posts

Sqoop

sqoop import -- RDBMS to HDFS
sqoop export -- HDFS to RDBMS
Used for data ingestion or migration.
import -- individual tables
export -- target tables must already exist in the RDBMS
sqoop eval -- to run a query on the db

In mysql:
----------
ctrl+l to clear the screen
show databases;
use retail_db;
show tables;

List of databases:
sqoop-list-databases \
--connect "jdbc:mysql://ipaddress:portno" \
--username name \
--password pass

List tables:
sqoop-list-tables \
--connect "jdbc:mysql://ipaddress:portno/retail_db" \
--username name \
--password pass

Eval:
sqoop-eval \
--connect "jdbc:mysql://ipaddress:portno" \
--username name \
--password pass \
--query "select * from retail_db.customers limit 10"

For Oracle: use oracle in place of mysql in the JDBC URL.

To use in CloudxLab:
sqoop-list-databases \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp

sqoop-list-tables \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp

s...
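Since the notes describe import/export but only show list/eval commands, here is a minimal sketch of both, reusing the CloudxLab connection above; the table names and the HDFS target dir are illustrative assumptions.

# import one table from the RDBMS into HDFS (table and dir names are illustrative)
sqoop import \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table customers \
--target-dir /user/sasmitsb4018/customers \
-m 1

# export from HDFS back to the RDBMS; the target table must already exist
sqoop export \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table customers_export \
--export-dir /user/sasmitsb4018/customers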

Hive optimization

3 ways to optimize:
1. Table design (at creation time) -- 2 options: partitioning and bucketing; both divide the data into small parts.
2. Structured queries (efficient queries) -- query level; joins take a lot of time (join optimizations).
3. Simplified queries (simple queries) -- windowing functions.

Partitioning:
dividing data based on columns, into dirs:
user/hive/warehouse/treandytech.db/customers/state=CA
user/hive/warehouse/treandytech.db/customers/state=NY
Only the matching dir is scanned -- less data scanned -- performance gain, so we get the optimization only if queries use the partition columns.
Partitioning should be done on the most common query columns.
Issues: if the column has lots of distinct values (very high cardinality) then we won't do partitioning -- lots of folders would be created.

Two types of partitioning (see the HiveQL sketch below):
static: we should have an idea of the data, and we load each partition manually.
dynamic: partitions are created automatically (when we don't know the data).
Static is faster than dynamic.
Partitioning works well with low cardinality; if there are more distinct values, go for bucketing.
no of part...
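A minimal HiveQL sketch of static vs. dynamic partitioning as described above; the table and column names (customers, customers_staging, state) are illustrative assumptions.

-- static partitioning: we know the data, so we load each partition manually
CREATE TABLE customers (id INT, name STRING)
PARTITIONED BY (state STRING);

INSERT INTO customers PARTITION (state = 'CA')
SELECT id, name FROM customers_staging WHERE state = 'CA';

-- dynamic partitioning: Hive creates the partition dirs automatically
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO customers PARTITION (state)
SELECT id, name, state FROM customers_staging;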

Hive-HBase

One table can be accessed from both Hive and HBase.
HBase -- transactional things (insert/update/delete).
Hive -- analytical things (group by, aggregation).
If you process in Hive, dump the data into HBase and do quick searching there, that is an HBase table managed by Hive.
Hive's default delimiter is Control-A.
There is no concept of a primary key in Hive.
Data is stored in HBase and metadata in Hive.
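A minimal sketch of such a Hive-managed HBase table using Hive's HBaseStorageHandler; the table, column family, and column names are illustrative assumptions.

-- data lives in the HBase table 'customers', metadata lives in Hive
CREATE TABLE hbase_customers (key STRING, name STRING, state STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:state")
TBLPROPERTIES ("hbase.table.name" = "customers");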

Hive compression

Compression techniques in Hive:
----------------------------------------
- help save storage
- help process data faster
- reduce I/O cost (I/O cost depends on storage)
Compression and decompression have a cost -- the time taken to compress and decompress -- but compared to the I/O gain we can neglect it.

4 compression techniques:
------------------------
snappy, LZO, gzip, bzip2
Some are optimized for storage, some for speed.

Snappy:
------------------
- fast compression codec
- size does not reduce drastically
- mostly used in projects
- optimized for speed rather than storage
- by default not splittable (json, xml)
- with container-based formats (avro, parquet, orc) splittability is taken care of

LZO:
----------
- optimized for speed
- inherently splittable (can be used with text, json, xml)
- good choice for text files
- requires a separate install
- snappy is faster than LZO

Gzip:
--------------
- optimized for storage
- roughly 2.5x the compression of snappy
- processing speed is slow
- not splittable
- used with container-based files
- can reduce block size ...
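A minimal HiveQL sketch of how these codecs are typically switched on; the SET properties are standard Hive/MapReduce settings, and the table name is an illustrative assumption.

-- compress intermediate and final job output with snappy
SET hive.exec.compress.intermediate = true;
SET hive.exec.compress.output = true;
SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

-- container-based format (ORC) + snappy, so splittability is taken care of
CREATE TABLE customers_orc (id INT, name STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");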

Cassandra

Cassandra:
----------------
Distributed, column-oriented, highly performant, highly scalable.
- no master nodes -- all nodes are peers
- nodes are arranged in a ring -- decentralized architecture
- for communication the nodes use the gossip protocol
In master-slave there is downtime if the master fails; a decentralized setup is highly available.
- provides tunable consistency: you can set the number of machines that must agree on a value -- the more nodes that must agree, the higher the consistency and the lower the availability.
HBase runs on the same cluster as Hadoop; Cassandra needs a different cluster.

We can set the consistency level (see the CQL sketch below):
ONE -- one node
ALL -- all nodes
QUORUM -- a quorum (majority) of nodes agrees

Cassandra has its own query language, CQL, which is similar to SQL.

Apache Phoenix:
- gives an SQL interface on top of HBase
- works on top of HBase
- you can write plain SQL
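A minimal CQL sketch of the tunable consistency described above, as run from cqlsh; the keyspace/table names and the replication factor are illustrative assumptions.

-- require a majority of replicas to agree on reads/writes in this session
CONSISTENCY QUORUM;

CREATE KEYSPACE shop
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

CREATE TABLE shop.customers (id int PRIMARY KEY, name text, state text);

INSERT INTO shop.customers (id, name, state) VALUES (1, 'Asha', 'CA');
SELECT * FROM shop.customers WHERE id = 1;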

HBase

Requirements for databases:
1. structured manner
2. random access
3. low latency
4. ACID properties: atomicity, consistency, isolation, durability

HBase runs on top of Hadoop -- distributed, scalable, fault tolerant.
----------------------
HBase:
--------------
- structure -- loose
- low latency -- using the row key
- random access -- using the row key
- somewhat ACID
- searching (using row keys) and processing; the main purpose is searching
In an RDBMS, if there is no data the column is null and still takes space; in HBase, if there is no data the column is simply not there. Column-based.
You can perform CRUD operations in HBase: create, read, update, delete.
ACID -- at the single-row level it is fine, but across multiple rows HBase is not ACID compliant.
Epoch time (unix timestamp) -- the number of seconds since 1970.

Row keys:
-----------------------
- unique
- all stored as byte arrays
- kept in sorted ascending order, so lookups use binary search

Column family:
--------------------
- each column family's data is stored separately
- can add new columns on the fly
- columnfamily:columnname, e.g. work:department ...
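A minimal hbase shell sketch of the CRUD operations described above; the table, row key, and column names are illustrative assumptions.

create 'employees', 'work'                             # create table with one column family
put 'employees', 'row1', 'work:department', 'sales'    # create/update a cell
get 'employees', 'row1'                                # random access by row key
scan 'employees'                                       # read all rows (sorted by row key)
delete 'employees', 'row1', 'work:department'          # delete a cell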

HDFS

HDFS commands
-------------------------
hadoop fs or hdfs dfs -- typed alone, shows all commands.
hadoop fs -help ls -- shows all options we can use with ls.
Home directory in HDFS: /user/sasmitsb4018/

hdfs dfs -ls -t -r /
lists files in root, ordered by time in reverse order.
hdfs dfs -ls -S /
sorts on the basis of size.
hdfs dfs -ls -S -h /
-h means sizes in human-readable form.

In HDFS we need to prefix every command with hdfs dfs.
Options use camel case -- the first word is lowercase and from the second word on the first letter is capital, e.g.:
hadoop fs -copyFromLocal
copyFromLocal and put are the same.
-cp copies from an HDFS location to an HDFS location.

hdfs dfs -df -h /user/sasmitsb4081
df -- disk free (free space); -h for human-readable format.
hdfs dfs -du -h /user/sasmitsb4081
du -- disk usage; for each folder under the path, how much space it has taken.
/data -- in CloudxLab all the free data sets are there.

Dynamically set the replication factor for a file:
hadoop fs -Ddfs.replication=5 -put filename /user/sasmitsb4081
-D -- for dynamically changing a config property.
fsck -- stands for filesystem check ...
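A minimal sketch of the fsck command the notes end on; the path is an illustrative assumption.

# report health, blocks, and block locations for files under a path
hdfs fsck /user/sasmitsb4081 -files -blocks -locations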