You can think of “Big Data” as exactly that: data that is big. It is so big that it does not fit well into a typical relational database such as Oracle, because trying to work with it exceeds the CPU, memory, and disk-controller capacity of the database server. Examples of big data include social networks, astronomical models, simulations of chemical reactions, weather prediction, and business intelligence, which is the practice of looking for trends in enormous amounts of data.
Because Big Data is big, you need someplace to store it that is larger than one machine. Big Data cannot be indexed or queried with the SQL language, which is used to find, update, add, or delete data in a relational database. It is something else altogether.
To find data in a Big Data store, you run a MapReduce job that searches across the different datasets. MapReduce jobs run in batch, so the answer does not come back quickly the way it does from a relational database. For that reason, Big Data does not replace a relational database; it complements one.
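To make the MapReduce model concrete, here is a minimal sketch in plain Python. It is a toy that runs in one process, not an actual Hadoop job; the word-count task and the function names are illustrative only:

```python
from collections import defaultdict

# Toy illustration of the MapReduce model: count words across "documents".
# A real Hadoop job distributes the map and reduce tasks across the
# nodes of the cluster; this sketch just shows the three phases.

def map_phase(document):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map phase and the reduce phase.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine the values for one key -- here, sum the counts.
    return (key, sum(values))

documents = ["big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
results = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

A real cluster runs many map tasks in parallel on the nodes that hold the data, shuffles the intermediate pairs across the network, and then runs the reduce tasks, which is why results arrive in batch rather than instantly.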
Unlike relational databases, where Oracle is a virtual monopoly (having bought up nearly everyone else except Microsoft), the Big Data market has many vendors, including Amazon, Cloudera, 10Gen, and others.
Apache Hadoop is open-source software that lets architects build what are called “Hadoop clusters.” These replace expensive high-end servers, such as Oracle servers, with a network of commodity, low-cost machines (think of a stack of PCs). Hadoop stores different parts of the dataset across this network of computers, each of which is called a “node.” A cluster can contain thousands of these machines.
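The storage idea can be sketched in a few lines of Python. This is a toy model of how a distributed file system like Hadoop's HDFS spreads a file across nodes; the block size, replication factor, and node names here are made up for illustration and are not Hadoop's actual defaults or placement policy:

```python
# Toy sketch of distributed block storage: split a file into fixed-size
# blocks, then place each block on several nodes (replication) so that
# losing one node loses no data. Numbers are illustrative; HDFS uses
# much larger blocks (128 MB by default) and rack-aware placement.

BLOCK_SIZE = 8          # bytes per block (tiny, for demonstration)
REPLICATION = 2         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data):
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for n, block in enumerate(blocks):
        # Simple round-robin placement across the cluster's nodes.
        replicas = [NODES[(n + r) % len(NODES)] for r in range(REPLICATION)]
        placement[f"block{n}"] = {"data": block, "nodes": replicas}
    return placement

layout = place_blocks(b"a very large file that will not fit on one disk")
for name, info in layout.items():
    print(name, info["nodes"])
```

Because every block lives on more than one node, the cluster can keep serving the file when a commodity machine fails, which is what makes cheap hardware workable.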
Since Hadoop lets you store Big Data on a network of commodity servers, the next logical step is virtualization. “Virtualization” means taking one large computer and dividing it into many virtual machines, each of which gets its own allotment of memory and storage (though the underlying physical storage is usually shared among the virtual machines). The idea is to curb what is called “server proliferation,” meaning you do not have to buy so many separate machines. That drives down administrative and procurement costs as well as the electricity bill.
The goal of the open-source Project Serengeti is to make it easy to install a Hadoop cluster onto one virtualized machine or several of them. VMware, which made virtualization popular (although mainframes have run that way for years), is the sponsor of Project Serengeti; cynics would say the project is just another way for VMware to sell more of its software. “Open-source” means the source code does not belong to one entity and different vendors can contribute to the product, so VMware cannot drive the entire show.
Regarding actual disk storage, the Serengeti configuration supports both local disks and SANs (storage area networks). Serengeti is so new that it is not yet widespread among users, but the project has the backing of the major Hadoop vendors and is growing in popularity as companies look to it to simplify administration and lower costs while still delivering the required performance and storage.