Posts

Showing posts from November, 2020

Sqoop

Sqoop
import -- RDBMS to HDFS
export -- HDFS to RDBMS
data ingestion or migration
import -- individual tables
export -- the target table must already exist in the RDBMS
Sqoop eval -- to run a query on the database

In MySQL:
----------
Ctrl+L to clear the screen
show databases;
use retail_db;
show tables;

List databases:
sqoop-list-databases \
--connect "jdbc:mysql://ipaddress:portno" \
--username name \
--password pass

List tables:
sqoop-list-tables \
--connect "jdbc:mysql://ipaddress:portno/retail_db" \
--username name \
--password pass

Eval:
sqoop-eval \
--connect "jdbc:mysql://ipaddress:portno" \
--username name \
--password pass \
--query "select * from retail_db.customers limit 10"

For Oracle: use oracle in place of mysql in the connect string.

To use on CloudxLab:
sqoop-list-databases \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp

sqoop-list-tables \
--connect "jdbc:mysql://10.142.1.2/sqoopex" \
--username sqoopuser \
--password NHkkP876rp
s...
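
For reference, a minimal sketch of the same single-table import done from Spark instead of Sqoop, since the later posts here use Spark with Scala (assuming a SparkSession and the MySQL JDBC driver on the classpath; the host, credentials, table and output path below are placeholders):

import org.apache.spark.sql.SparkSession

object JdbcImportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-import-sketch")
      .getOrCreate()

    // Read one RDBMS table over JDBC (what "sqoop import" does for a single table).
    val customers = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://ipaddress:portno/retail_db") // placeholder host/port
      .option("dbtable", "customers")
      .option("user", "name")
      .option("password", "pass")
      .load()

    // Land the rows on HDFS -- the "RDBMS to HDFS" direction.
    customers.write.mode("overwrite").parquet("/user/hypothetical_user/customers_import")

    spark.stop()
  }
}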

Hive optimization

3 ways to optimize:
1. Table design (at creation) -- 2 options: partitioning and bucketing; both divide the data into small parts.
2. Structured queries (efficient queries) -- query-level optimizations, e.g. join optimizations (joins take a lot of time).
3. Simplified queries (simple queries) -- windowing functions.

Partitioning: dividing the data based on columns, into directories:
user/hive/warehouse/treandytech.db/customers/state=CA
user/hive/warehouse/treandytech.db/customers/state=NY
Only the matching directory is scanned -- less data scanned -- performance gain.
The optimization kicks in when queries filter on the partition columns, so partition on the columns used by the most common queries.
Issues: if a column has lots of distinct values (very high cardinality), we won't partition on it -- too many folders would be created.

Two types of partitioning:
static: we should have an idea of the data and load each partition manually.
dynamic: partitions are created automatically (when we don't know the data).
Static is faster than dynamic.
Partitioning works well with low cardinality; if there are more distinct values, go for bucketing.
A DDL sketch for both designs follows below.
no of part...
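
A minimal sketch of the two table designs using Spark SQL's native DDL (an assumption on my part since the notes don't show the DDL; Hive's own CREATE TABLE syntax differs slightly, and the table and column names are made up):

import org.apache.spark.sql.SparkSession

object TableDesignSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-table-design")
      .enableHiveSupport()
      .getOrCreate()

    // Partitioned table: one directory per state, e.g. .../customers_part/state=CA
    spark.sql(
      """CREATE TABLE IF NOT EXISTS customers_part (id INT, name STRING, state STRING)
        |USING parquet
        |PARTITIONED BY (state)""".stripMargin)

    // Bucketed table: a high-cardinality column is hashed into a fixed number of buckets.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS customers_bkt (id INT, name STRING, state STRING)
        |USING parquet
        |CLUSTERED BY (id) INTO 4 BUCKETS""".stripMargin)

    spark.stop()
  }
}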

Hive-HBase

One table -- accessed from both Hive and HBase.
HBase -- transactional things (insert/update/delete).
Hive -- analytical things (group by, aggregation).
If you are processing in Hive, then dump the data into HBase and do quick searching -- an HBase table managed by Hive.
Hive's default delimiter is Control-A.
No concept of a primary key in Hive.
The data is stored in HBase and the metadata in Hive.
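
A sketch of wiring the two together, run over a Hive JDBC connection from Scala (assumes HiveServer2 is reachable and the Hive HBase storage handler jars are installed; the URL, credentials, table and column-family names are made up):

import java.sql.DriverManager

object HiveOnHbaseSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical HiveServer2 endpoint and credentials
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "pass")
    val stmt = conn.createStatement()

    // Hive table backed by an HBase table: metadata lives in Hive, rows live in HBase.
    stmt.execute(
      """CREATE TABLE employees_hbase (key STRING, department STRING)
        |STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        |WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,work:department")
        |TBLPROPERTIES ("hbase.table.name" = "employees")""".stripMargin)

    stmt.close()
    conn.close()
  }
}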

Compression in Hive

Compression techniques in Hive:
----------------------------------------
- help save storage
- help process data faster
- reduce I/O cost (I/O cost depends on storage)
Compression and decompression have a cost -- the time taken to compress and decompress -- but compared to the I/O gain we can neglect it.

4 compression techniques:
------------------------
snappy, lzo, gzip, bzip2
Some are optimized for storage, some for speed.

Snappy:
------------------
- fast compression codec
- size does not reduce drastically
- mostly used in projects
- optimized for speed rather than storage
- by default not splittable (json, xml)
- with avro, parquet, orc (container-based formats) splittability is taken care of

LZO:
----------
- optimized for speed
- inherently splittable (can be used with text, json, xml)
- good choice for text files
- requires a separate install
- snappy is faster than LZO

Gzip:
--------------
- optimized for storage
- ~2.5x snappy
- processing speed is slow
- not splittable
- used with container-based files
- can reduce block size
-...
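
A small sketch of picking a codec when writing container-based files from Spark (assuming the data is already in a DataFrame; the paths and codec choices are illustrative):

import org.apache.spark.sql.SparkSession

object CompressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compression-sketch").getOrCreate()
    val df = spark.read.option("header", "true").csv("/data/hypothetical/customers.csv")

    // Snappy: speed-oriented; splittability is handled by the Parquet container format.
    df.write.mode("overwrite").option("compression", "snappy").parquet("/out/customers_snappy")

    // Gzip: storage-oriented -- noticeably smaller output, but slower to process.
    df.write.mode("overwrite").option("compression", "gzip").parquet("/out/customers_gzip")

    spark.stop()
  }
}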

Cassandra

Cassandra:
----------------
distributed, column oriented
highly performant, highly scalable
-- no master nodes -- all nodes are peers
-- nodes are arranged in a ring -- decentralized architecture
-- for communication it uses the gossip protocol
In master-slave there is downtime if the master fails; a decentralized setup is highly available.
-- provides tunable consistency: you can set how many machines must agree on a value; the more nodes that must agree, the lower the availability and the higher the consistency.
HBase runs on the same cluster as Hadoop; Cassandra needs a different cluster.
We can set the consistency level:
one node
all nodes
quorum (a majority of nodes agree)
-- Cassandra has its own query language, CQL.
CQL is similar to SQL.

Apache Phoenix:
-- gives a SQL interface on top of HBase
-- works on top of HBase -- you can write SQL
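
A minimal sketch of tunable consistency with the DataStax Java driver called from Scala (the contact point, datacenter, keyspace and table are placeholders):

import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.{ConsistencyLevel, CqlSession}
import com.datastax.oss.driver.api.core.cql.SimpleStatement

object TunableConsistencySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical single contact point and datacenter name
    val session = CqlSession.builder()
      .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
      .withLocalDatacenter("datacenter1")
      .build()

    // QUORUM: a majority of replicas must agree before this read succeeds.
    val stmt = SimpleStatement.newInstance("SELECT * FROM shop.customers LIMIT 10")
      .setConsistencyLevel(ConsistencyLevel.QUORUM)

    session.execute(stmt).forEach(row => println(row))
    session.close()
  }
}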

HBase

Requirements for databases:
1. structured manner
2. random access
3. low latency
4. ACID properties: atomicity, consistency, isolation, durability

HBase runs on top of Hadoop -- distributed, scalable, fault tolerant.
----------------------
HBase:
--------------
-- structure -- loose
-- low latency -- using the row key
-- random access -- using the row key
-- somewhat ACID
-- searching (using row keys)
-- processing
The main purpose is searching.
In an RDBMS, if there is no data the column is null and still takes space; in HBase, if there is no data the column simply isn't there -- it is column based.
You can perform CRUD operations in HBase: create, read, update, delete.
ACID -- fine at the single-row level, but across multiple rows HBase is not ACID compliant.
epoch -- time (unix timestamp) -- number of seconds since 1970.

Row keys:
-----------------------
- unique
- all stored as byte arrays
- uses a binary search algorithm
- row keys are kept sorted in ascending order

Column family:
--------------------
Each column family's data is stored separately.
Can add new columns on the fly.
columnfamily:columnname = work:department
...
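
A minimal sketch of row-key based CRUD with the HBase client API from Scala (assumes an 'employees' table with a 'work' column family already exists; all names here are made up):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HbaseCrudSketch {
  def main(args: Array[String]): Unit = {
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("employees"))

    // Create/update: everything goes in as byte arrays, keyed by the row key.
    val put = new Put(Bytes.toBytes("emp1"))
    put.addColumn(Bytes.toBytes("work"), Bytes.toBytes("department"), Bytes.toBytes("IT"))
    table.put(put)

    // Read: random access by row key.
    val result = table.get(new Get(Bytes.toBytes("emp1")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("work"), Bytes.toBytes("department"))))

    // Delete: removes the row for that row key.
    table.delete(new Delete(Bytes.toBytes("emp1")))

    table.close()
    connection.close()
  }
}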

HDFS

HDFS commands
-------------------------
hadoop fs or hdfs dfs -- shows all commands
hadoop fs -help ls -- shows all options we can use with ls
Home directory in HDFS: /user/sasmitsb4018/

hdfs dfs -ls -t -r /
list files in the root, ordered by time, in reverse order

hdfs dfs -ls -S /
sort on the basis of size

hdfs dfs -ls -S -h /
-h means human readable

In HDFS we need to prefix every command with hdfs dfs -
Camel case -- the first word is lowercase and from the second word on the first letter is capitalized, e.g. copyFromLocal.

hadoop fs -copyFromLocal
copyFromLocal and put are the same
-cp copies from one HDFS location to another HDFS location

hdfs dfs -df -h /user/sasmitsb4081
df -- disk free (free space); -h human readable format

hdfs dfs -du -h /user/sasmitsb4081
-du means disk usage: how much space each folder underneath has taken

hdfs dfs /data -- in CloudxLab all the free data sets are there

Dynamically set the replication factor for a file:
hadoop fs -Ddfs.replication=5 -put filename /user/sasmitsb4081
-D -- to dynamically change a setting

fsck -- stands for filesy...
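
The same operations can be done from code; a small sketch with the Hadoop FileSystem API from Scala (assumes the cluster config is on the classpath; the paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsApiSketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration()) // picks up core-site.xml / hdfs-site.xml

    // Equivalent of: hdfs dfs -ls /user/sasmitsb4081
    fs.listStatus(new Path("/user/sasmitsb4081"))
      .foreach(s => println(s"${s.getLen} bytes  ${s.getPath}"))

    // Equivalent of: hadoop fs -copyFromLocal localfile.txt /user/sasmitsb4081
    fs.copyFromLocalFile(new Path("/tmp/localfile.txt"), new Path("/user/sasmitsb4081/localfile.txt"))

    // Change the replication factor of an existing file (cf. -Ddfs.replication at put time)
    fs.setReplication(new Path("/user/sasmitsb4081/localfile.txt"), 5.toShort)

    fs.close()
  }
}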

MapReduce

2 phases, map and reduce; both work on key-value pairs.
k,v --map--> k,v
k,v --reduce--> k,v
The traditional programming model works when data is kept on a single machine, so it won't work in Hadoop.
Record reader -- takes each line as input and converts it to a key-value pair: (line number, value as string).
In the mapper: ignore the key and concentrate on the value.
Mapper -- no transfer of data.
Movement of data from mapper to reducer -- shuffling.
After the mapper, the machine will do shuffle and sort; sorting happens on the reducer machine.
After shuffle and sort, the machine builds (key, {list of values}). Input to the reducer:
(are, {1})
(hello, {1,1,1})
Output from the reducer:
(are, 1)
(hello, 3)
no of blocks = no of mappers
Default reducers = 1; can be increased or decreased.
If there is no aggregation, we can remove the reducer.
Based on the number of reducers, we will have that many partitions.
After the mapper, the output data is partitioned, then shuffled and sorted -- these three are done by the framework.
By default the system provides a hash function to divide the key-value pairs among the reducers.
has...
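
The word-count flow described above, sketched with Spark's Scala API rather than a raw MapReduce job (a stand-in on my part, since the later posts use Spark; the input path is a placeholder):

import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("word-count-sketch").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("/data/hypothetical/words.txt") // one record per line

    val counts = lines
      .flatMap(_.split("\\s+"))   // map side: break each line into words
      .map(word => (word, 1))     // emit (word, 1) pairs
      .reduceByKey(_ + _)         // shuffle by key, then sum the list of 1s per word

    counts.collect().foreach { case (w, n) => println(s"($w, $n)") } // e.g. (hello, 3)
    spark.stop()
  }
}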

Hive

Hive:
----------------
Partitioning:
To see the list of partitions: show partitions tablename;

Dynamic partitioning -- disabled by default:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

Step 1: create a normal table and load the data.
Step 2: create a partitioned table, partitioned on columns.
Step 3: transfer from the normal table to the partitioned table:
insert into table tablename partition (state) select * from order_no_partition;
---------------------------
Bucketing:
-----------------
Step 1: create a normal table and load the data.
Step 2: create a bucketed table, bucketed on columns.
Step 3: transfer from the normal table to the bucketed table.
set hive.enforce.bucketing=true;
no of buckets = no of reducers
In bucketing we use cluster by.
To display the records of one bucket:
-------------------------------------
select * from table tablesample(bucket 1 out of 4);
Partitioning can also be done on multiple columns.
A runnable sketch of the partition load is shown below.
----------------------------------------
3 categories of optimization:
---------------------------...
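
A sketch of the three-step partition load, here run through Spark SQL with Hive support rather than the Hive CLI (an assumption on my part; the statements themselves are plain HiveQL, and the column names for order_no_partition are made up):

import org.apache.spark.sql.SparkSession

object DynamicPartitionLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-partition-load")
      .enableHiveSupport()
      .getOrCreate()

    // Dynamic partitioning is disabled (strict) by default.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Step 2: the partitioned target table (step 1, the plain staging table, is already loaded).
    spark.sql(
      """CREATE TABLE IF NOT EXISTS orders_part (order_id INT, amount DOUBLE)
        |PARTITIONED BY (state STRING) STORED AS PARQUET""".stripMargin)

    // Step 3: transfer; the partition value comes from the last selected column.
    spark.sql(
      """INSERT INTO TABLE orders_part PARTITION (state)
        |SELECT order_id, amount, state FROM order_no_partition""".stripMargin)

    spark.stop()
  }
}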

Scala

Spark with Scala -- best performance, most compatible; Scala code runs directly on the JVM.
In Python an extra layer is created -- a Python process which interacts with the JVM.

Scala:
--------------
val -- like a constant
var -- mutable -- can change the value
Scala infers the type (type inference).
val num = 5  -- by default it will take Int.
Double -- can take more precision; Float -- 4 bytes (takes less precision).
val piSinglePrecision: Float = 3.1415f     (camelCase format)
For a big range, Long:
val e: Long = 12131324242434243l
val smallNumber: Byte = 127   (range -128 to 127)
===========================================
s interpolation
  val name = "Sasmit"                             //> name  : String = Sasmit
  println(s"Hello $name")                         //> Hello Sasmit
f interpolation
---------------------
  val pi = 3.142        ...
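
Since the preview cuts off at f interpolation, a tiny self-contained sketch of both interpolators (the values are just examples):

object InterpolationSketch extends App {
  val name = "Sasmit"
  val pi = 3.142

  println(s"Hello $name")            // s interpolation: substitute values directly
  println(f"pi is roughly $pi%1.2f") // f interpolation: printf-style formatting, prints "pi is roughly 3.14"
}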

Spark on YARN

Spark on YARN
--------------
How to execute Spark programs:
-- interactive mode -- spark-shell / pyspark / notebook -- good for analyzing

Submitting a job:
----------------
spark-submit -- production-ready code.

How Spark executes our programs on a cluster:
---------------------------
Follows a master-slave architecture.
-- each application has a single driver -- the master process
-- multiple executors (CPU and memory, JVM processes)

Driver:
--------------
analyzes the work, divides it, distributes the tasks, schedules and monitors.

Executor:
----------
executes the code locally on its JVM.
Each application has a distinct driver and a bunch of executors.
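
A minimal skeleton of a production-style Spark application in Scala (the job name and input path are made up):

import org.apache.spark.sql.SparkSession

// The driver runs main(); the executors run the distributed work it schedules.
object MyJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("my-job")          // the master is usually left to spark-submit (--master yarn)
      .getOrCreate()

    val rdd = spark.sparkContext.textFile("/data/hypothetical/input.txt")
    println(rdd.count())          // action: the driver schedules tasks on the executors

    spark.stop()
  }
}

Packaged as a jar, it would be launched with something like spark-submit --master yarn --class MyJob my-job.jar (the class and jar names here are made up).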

Scala

In Java:
--------------
public static void main (String args[]) { }

In Scala:
def main (args: Array[String]) = { }
or extends App
App -- a helper trait -- holds the main method.

Default args, named args, variable-length args
-----------------------------
Null
-----------
is a type in Scala.
-- there is one instance of Null and that is null.
-- restrict the use of null -- not preferred -- it leads to null pointer exceptions.

Nil
------
the empty list

Nothing:
--------------
a type in Scala
-- no instance of Nothing.
In a function, when you have an error or exception, the return type is Nothing.

Option:
------
When a function doesn't have anything valid to return, null is not preferred (due to null pointer exceptions) -- it returns Some or None instead.

Unit
-------------------
the type of a method which does not return any value.
-- it's like void in Java.
Nothing -- there was an error and nothing was returned.
Unit -- a side effect.

How to deal with nulls in Scala code:
-- we should not use null.
--...
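
A small sketch of Option in practice (the lookup map and names are made up):

object OptionSketch extends App {
  val departments = Map("sasmit" -> "engineering")

  // Returns Some(value) when present, None when not -- no nulls involved.
  def findDepartment(name: String): Option[String] = departments.get(name)

  println(findDepartment("sasmit").getOrElse("unknown")) // engineering
  println(findDepartment("nobody").getOrElse("unknown")) // unknown

  // Pattern matching on Some/None
  findDepartment("sasmit") match {
    case Some(dept) => println(s"found $dept")
    case None       => println("not found")
  }
}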

Spark 2

Whenever an RDD contains tuples of 2 elements, it's called a pair RDD.
Last line: val result = sortedTotal.collect
Here result is a local variable on the local machine, not an RDD; an RDD is distributed on the cluster.
Unix timestamp -- number of seconds since 1st Jan 1970.
reduceByKey((x, y) => x + y) -- here x and y work on two rows.
Instead of using map, where we emit (x, 1), and doing reduceByKey later:
map + reduceByKey -- the result is an RDD -- it's a transformation.
countByValue -- an action -- the result is a local variable.
If that is the final operation, you can use countByValue; the result will be a local variable and parallelism won't happen if you do any further transformation.
Use map + reduceByKey if you want to do more transformations after this.
reduceByKey always works on the values -- we don't have to worry about the keys.
mapValues -- works on values only (when the key is not changing and we will work on the values only).
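
A small sketch contrasting countByValue with map + reduceByKey, plus mapValues (the sample data is made up):

import org.apache.spark.sql.SparkSession

object PairRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pair-rdd-sketch").getOrCreate()
    val sc = spark.sparkContext

    val states = sc.parallelize(Seq("CA", "NY", "CA", "CA", "NY"))

    // Action: returns a local Map on the driver -- fine when this is the final step.
    val localCounts: scala.collection.Map[String, Long] = states.countByValue()

    // Transformation: the result stays a distributed pair RDD, so more steps can follow.
    val countsRdd = states.map(s => (s, 1)).reduceByKey(_ + _)

    // mapValues: keys untouched, only the values change.
    val doubled = countsRdd.mapValues(_ * 2)

    println(localCounts)
    doubled.collect().foreach(println)
    spark.stop()
  }
}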

Spark

Apache Spark
-------------
A general-purpose, in-memory compute engine.
Spark -- a replacement for MR -- a plug-and-play compute engine.
Needs two things to work with:
-- storage -- local storage, S3, HDFS
-- resource management -- YARN, Mesos, Kubernetes

In memory:
------------------
Only 2 I/O operations required in Spark (ideally).
10 to 100 times faster than MR.
-------------------------
General purpose:
--------------------
-- all things: cleaning, querying, machine learning, data ingestion
MR -- high latency -- more disk reads and writes.
Spark -- low latency.
--------------------------------
RDD (resilient distributed dataset):
-----------
the basic unit -- holds the data in Spark.
Directed acyclic graph (DAG).
Two kinds of operations:
-----------------------------
Transformations are lazy; actions are not.
collect -- an action.
When Spark code executes, the transformations are called and a DAG is created.
In Spark there is 1 driver node and all the other worker nodes.
RDD -- no of blocks = no of partitions, in memory, distributed.
-- distributed
-- in memory
resilient -- if we loo...
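
A tiny sketch of the lazy-evaluation point (the file path is a placeholder):

import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lazy-eval-sketch").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("/data/hypothetical/orders.txt")   // transformation: nothing is read yet
    val errors = lines.filter(_.contains("ERROR"))             // transformation: only added to the DAG

    // Action: the DAG is executed now and the partitions are processed on the executors.
    println(errors.count())

    spark.stop()
  }
}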