Big Data
Question 1 18 Marks
A. GFS/HDFS uses blocks of size 128MB by default. What is the main reason for choosing such a large block size, given the use cases these systems were designed for? [3 points] Briefly explain your answer. [5 points]
a. To minimise disk seek overhead
b. To minimise network bandwidth usage
c. To increase read/write parallelism
d. To reduce the storage load on DataNodes
B. The default replica placement policy for HDFS places the first copy of a block on the local DataNode (DN1), the second copy on a DataNode (DN2) located on the same rack as DN1, the third copy on a DataNode (DN3) on a rack other than the one where DN1 and DN2 are located, and all subsequent copies on random DataNodes around the cluster. This design strikes a trade-off between/among which of the following? [3 points] Briefly explain your answer. [7 points]
a. Disk/memory storage load on the NameNode, reliability, and network congestion
b. Reliability and read/write network bandwidth consumption
c. Remote read/write latency, NameNode storage load, and DataNode processing load
d. Ease of choosing replication targets and DataNode storage load balancing
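The placement rule described in part B can be sketched in a few lines of Python. This is only an illustration of the policy as worded in the question, not HDFS code: the function name `place_replicas`, the `racks` topology, and the seeded random generator are all assumptions made for the example, and the sketch assumes the cluster has enough DataNodes on each rack to satisfy every step.

```python
import random


def place_replicas(writer, racks, replication, rng):
    """Pick DataNodes for one block, following the policy as worded in the question.

    writer:      name of the DataNode the client runs on (DN1)
    racks:       dict mapping rack name to a list of DataNode names (hypothetical topology)
    replication: number of copies to place
    rng:         random.Random instance, passed in so runs are reproducible
    """
    rack_of = {dn: rack for rack, dns in racks.items() for dn in dns}
    local_rack = rack_of[writer]

    chosen = [writer]                                  # 1st copy: the local DataNode
    if replication >= 2:                               # 2nd copy: same rack as DN1
        same_rack = [dn for dn in racks[local_rack] if dn not in chosen]
        chosen.append(rng.choice(same_rack))
    if replication >= 3:                               # 3rd copy: a DataNode on another rack
        other_racks = [dn for dn, rack in rack_of.items() if rack != local_rack]
        chosen.append(rng.choice(other_racks))
    while len(chosen) < replication:                   # further copies: random unused DataNodes
        remaining = [dn for dn in rack_of if dn not in chosen]
        chosen.append(rng.choice(remaining))
    return chosen


racks = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"], "rack3": ["dn6"]}
replicas = place_replicas("dn1", racks, replication=4, rng=random.Random(0))
print(replicas)
```

The first three entries of the returned list correspond to DN1, DN2 and DN3 in the question; any further entries are the randomly placed copies.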
Question 2 21 Marks
A. Assuming only one job executes on your cluster at any given time, what is the optimal number of reducers (with regard to minimising the overall job time) to be used for a WordCount-type job over a large dataset? Assume that the number of unique words is several orders of magnitude higher than the number of nodes in your cluster. [3 points] Briefly explain your answer. [5 points]
a. 1
b. As many as there are keys in the final output
c. As many as there are computers in the cluster
d. A small multiple of the number of CPU cores across all cluster nodes
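As background for part A: Hadoop's default partitioner sends each intermediate key to the reducer numbered hash(key) modulo the number of reducers. The sketch below mimics that rule in Python so the spread of keys for a given reducer count can be inspected. `zlib.crc32` stands in for Java's `hashCode()`, and the function names and the synthetic word list are assumptions made for illustration only.

```python
import zlib
from collections import Counter


def reducer_for(key, num_reducers):
    """Return the index of the reducer that receives this key."""
    # crc32 is a stand-in hash: unlike Python's built-in hash() for strings,
    # it gives the same value on every run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def keys_per_reducer(keys, num_reducers):
    """Count how many distinct keys each reducer would be assigned."""
    return Counter(reducer_for(key, num_reducers) for key in keys)


# Synthetic vocabulary: far more unique words than reducers, as in the question.
words = [f"word{i}" for i in range(10_000)]
load = keys_per_reducer(words, num_reducers=8)
for reducer, count in sorted(load.items()):
    print(f"reducer {reducer}: {count} keys")
```

Changing `num_reducers` shows how the same set of keys is divided among more or fewer reduce tasks.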
B. Which is the main disadvantage of performing a reduce-side join? [3 points] Briefly discuss your answer. [3 points]
a. Inability to support joining more than two datasets/tables
b. Inability to support composite keys
c. Larger I/O costs because of shuffle and sort
d. The results may occasionally be wrong, depending on the cardinalities of the joined datasets/tables
C. Given the following Spark program, identify at which step (i.e., line of code) the input RDD is actually computed. [2 marks] Briefly explain your answer. [2 marks]
JavaRDD<String> inputRDD = sc.textFile("sample.csv");
JavaRDD<String> fooRDD = inputRDD.filter(line -> line.contains("FOO"));
JavaRDD<String[]> colsRDD = fooRDD.map(line -> line.split(","));
long count = fooRDD.count();
D. You are given an input RDD that contains the edges of a large graph, with each edge being a pair: <source node id> <destination node id>. You are asked to design a job that will produce an output containing pairs in the format: <node X> <list of nodes that have out-edges leading to node X>. Which of the following will produce the correct result? [3 points]
a. input.groupByKey()
b. input.countByKey()
c. input.map(x => (x._2, x._1)).groupByKey()
d. input.map(x => (x._2, x._1)).countByKey()
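To make the input and output formats in part D concrete, here is a small worked example in plain Python (no Spark involved). The helper name `in_neighbours` and the four sample edges are assumptions made for illustration; the function simply builds the required output by hand and is not one of the options above.

```python
from collections import defaultdict


def in_neighbours(edges):
    """Map each node X to the list of nodes that have an out-edge leading to X."""
    result = defaultdict(list)
    for source, destination in edges:
        result[destination].append(source)
    return dict(result)


# Each edge is a pair (source node id, destination node id).
edges = [(1, 2), (1, 3), (2, 3), (4, 3)]
print(in_neighbours(edges))  # {2: [1], 3: [1, 2, 4]}
```

Node 1 and node 4 do not appear in the output because no edge in the sample leads to them.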
Question 3 21 Marks
A. Consider the CAP Theorem. What kind of system would you choose to implement a high-volume online chat application, and what kind would you choose to store small but highly important (e.g., banking/financial) data? [3 points] Briefly explain your answer. [5 points]
a. Chat: CA, Banking: AP
b. Chat: CA, Banking: CA
c. Chat: CP, Banking: AP
d. Chat: CP, Banking: CA
B. In systems following the BigTable design, when a tablet/region grows bigger in size than the predefined maximum region size, it:
a. Gets compressed using a predefined compression codec
b. Spills into adjacent machines
c. Is discarded and an appropriate error is returned to the user
d. Is split into smaller regions
C. The tuple which uniquely identifies a cell in on-disk BigTable/HBase storage is:
a. {rowkey, column qualifier, timestamp}
b. {rowkey, column family, timestamp}
c. {table, rowkey, column family, column qualifier}
d. {column family, column qualifier, timestamp}
D. Assuming no cluster node faults and a client external to the cluster, which of the following Cassandra consistency levels would accomplish read-your-own-write consistency (i.e., a client sends an update for a key and immediately afterwards a read for the same key, and the result should be the same as the one written in the first operation) with minimum write latency for a replication factor of 5? [3 points] Briefly explain your answer. [4 points]
a. Write: ANY / Read: ONE
b. Write: ONE / Read: ONE
c. Write: QUORUM / Read: QUORUM
d. Write: ALL / Read: ONE
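For reference when working through part D, the number of replicas that must acknowledge an operation at each consistency level can be tabulated as below. This is a minimal sketch in Python; `replicas_required` is a hypothetical helper, not a Cassandra API. ANY applies to writes only and can be satisfied by a hint stored on the coordinator, so it guarantees zero replica acknowledgements.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must acknowledge before the operation returns."""
    if level == "ANY":      # writes only: a stored hint is sufficient
        return 0
    if level == "ONE":
        return 1
    if level == "QUORUM":   # a strict majority of the replicas
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {level}")


RF = 5  # replication factor given in the question
for level in ("ANY", "ONE", "QUORUM", "ALL"):
    print(f"{level:<6} -> {replicas_required(level, RF)} of {RF} replicas")
```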