Apache Hadoop is a framework for storing and processing Big Data; its storage layer is a distributed file system. Big Data refers to data that does not fit well into a relational database, either because it is unstructured (it has no rows or columns) or because of its size (no single machine can process it easily). Hadoop works by distributing Big Data over many machines. Hadoop is closely related to, and a less expensive alternative to, modern data warehouses. Businesses use data warehouses to scan through enormous amounts of data and apply statistics to predict or review customer behavior and other business indicators.
HDFS is the Hadoop Distributed File System. MapReduce plays a role analogous to that of SQL for traditional relational databases, but it works quite differently; for SQL-like queries, Hadoop users typically turn to Apache Hive and its query language, HiveQL. MapReduce runs jobs across unstructured data (i.e., there are no columns and rows) in parallel across hundreds or thousands of servers. Google introduced MapReduce, and Hadoop grew out of open-source work that Yahoo developed heavily. The “map” part of “MapReduce” refers to the map step, which turns each input record into key-value pairs, much like the entries of a hashmap. The “reduce” part means that MapReduce collapses data with the same key into one key-value pair.
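The map and reduce steps described above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many servers.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single key-value pair."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each input line is mapped independently of the others, which is what allows the work to be spread across servers; only the grouping step needs to bring pairs with the same key together.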
As you can imagine, backing up the data in a Hadoop file system could, at first glance, be highly complex. Yet the system has built-in redundancy that reduces or eliminates the need for any kind of offline backup. Each node in HDFS reports its status back to Hadoop. If a node does not report back for some period of time, it is assumed to be down. Hadoop then assigns another node to take its place. The data that was on the failed node can be recovered from its replicas on other nodes.
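The status-reporting scheme described above can be sketched as a timeout check. The following is an illustrative model only, not Hadoop's implementation: the class HeartbeatMonitor, the function under_replicated, the timeout value, and the block and node names are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Track the last time each node reported in (hypothetical model)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_seen = {}

    def heartbeat(self, node_id):
        """Record that a node has just reported its status."""
        self.last_seen[node_id] = self.clock()

    def dead_nodes(self):
        """Return nodes that have been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


def under_replicated(block_locations, dead, target_replicas):
    """Map each block with too few live copies to the nodes that still hold it."""
    dead = set(dead)
    result = {}
    for block, nodes in block_locations.items():
        live = set(nodes) - dead
        if len(live) < target_replicas:
            result[block] = sorted(live)
    return result
```

In this model, a block listed by under_replicated would be copied from one of its surviving holders to a healthy node, which is how redundancy is restored without an offline backup.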
Since Hadoop is a distributed file system, it can be backed up like any other file system. There is no need to reformat it. It can be restored, as is regularly done with LDAP databases and occasionally done with Oracle databases.
Hadoop stores the information about where to find other nodes in the NameNode. This is called “metadata.” You can build redundancy into the NameNode using disk mirroring or RAID technology to make multiple copies of the metadata.
HDFS includes a program that scans disks and reports (but does not repair) corrupt data blocks on individual disks. The NameNode is responsible for the repair. Metadata can be recovered using the “recover” command. Because recovery wipes out the NameNode's existing state, it should only be used when the metadata cannot be restored from any of its replicas.
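The scan-and-report behavior described above can be pictured as a checksum comparison. This is a rough sketch under stated assumptions rather than HDFS code: zlib.crc32 stands in for whatever checksum the file system actually uses, and the names checksum, scan_blocks, and the block IDs are hypothetical.

```python
import zlib


def checksum(data: bytes) -> int:
    """Compute a CRC-32 checksum (a stand-in for the real block checksum)."""
    return zlib.crc32(data)


def scan_blocks(blocks, expected):
    """Report IDs of blocks whose bytes no longer match the recorded checksum.

    The scan only reports; it does not modify or repair any block.
    """
    return sorted(
        block_id for block_id, data in blocks.items()
        if checksum(data) != expected.get(block_id)
    )


if __name__ == "__main__":
    blocks = {"blk_1": b"hello", "blk_2": b"world"}
    expected = {block_id: checksum(data) for block_id, data in blocks.items()}
    blocks["blk_2"] = b"w0rld"  # simulate on-disk corruption
    print(scan_blocks(blocks, expected))
```

Keeping detection separate from repair matches the division of labor in the text: the scan produces a list of bad blocks, and a separate component decides how to restore them from healthy copies.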
There is no need to back up a Hadoop Distributed File System as one might do with a traditional file system. Instead, one can use disk replication and configure the system to maintain multiple copies of the NameNode metadata.