Data Conveyor Belts in the Cloud

How we mechanize and automate! Ever since the Industrial Revolution and the first machine-driven textile production, we’ve been looking for ways to save labor in all the things we do. Add Adam Smith’s division of labor and his celebrated pin factory into the mix, and the result is today’s large-scale, highly automated production facilities churning out automobiles, washing machines, computers, and anything else for which there is a big enough market. Information technology has jumped on the bandwagon too. Automation software now handles all sorts of data-processing tasks, and data warehouses are increasingly commonplace. So, how about a data factory to go with all of that?
Big Data Processing Needs Proper Automation
Factories make economic sense above a certain level of production, where the initial investment is recouped through lower operating costs afterwards. Likewise, large quantities of raw material need automated processing if valuable end products are to be produced in reasonable time and at reasonable cost. Data is such a raw material, and big data is by definition a large quantity of it, to be processed in some automated way. Just like physical production, big data processing also happens in stages.
Introducing the Virtual IT Factory
Once you have a physical factory in place, you are obliged to amortize your investment by continually processing and producing. In the real world, it’s not possible to set up and tear down production lines at the drop of a hat, or at least not without severe financial impact. In the virtual world of IT, however, it’s a very different story. In fact, you can go even further. Thanks to the cloud, you can rent the virtual factory basics as and when you need them, and plug them together the way you want, when you want. Try doing that with an automobile factory, with its engine plants and robot-controlled paint shops.
Routing Production in the Cloud Data Factory
There are five main steps in the processing of big data to produce meaningful, useful conclusions. Major cloud vendors are now offering modular approaches to help you do all of them without having to build any of the data factory yourself. Whether the vendor concerned talks about conveyor belts or pipelines, the data factory principles are the same:

  1. Bring your big data together from your chosen sources
  2. ‘Clean’ it all to make a trustworthy version (remove ‘noise’ and inconsistencies)
  3. Organize it for loading into the next stage of the process
  4. Analyze it
  5. Generate reports that human beings can understand and use
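
The five steps above can be sketched as a chain of small functions, each one feeding the next like stations on a conveyor belt. This is only a minimal illustration in plain Python, not any vendor's actual service or API: the sample records, the field names (`region`, `sales`) and every function name are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records from two made-up sources (illustration only).
SOURCE_A = [{"region": "north", "sales": "120"}, {"region": "south", "sales": "80"}]
SOURCE_B = [{"region": "North ", "sales": "100"}, {"region": "south", "sales": None}]


def ingest(*sources):
    """Step 1: bring the records together from the chosen sources."""
    return [record for source in sources for record in source]


def clean(records):
    """Step 2: drop incomplete records and normalize inconsistent values."""
    cleaned = []
    for record in records:
        if record.get("sales") is None:
            continue  # 'noise': a record with no usable value
        cleaned.append({
            "region": record["region"].strip().lower(),
            "sales": float(record["sales"]),
        })
    return cleaned


def organize(records):
    """Step 3: group the values by region, ready for the analysis stage."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["region"]].append(record["sales"])
    return dict(grouped)


def analyze(grouped):
    """Step 4: compute simple summary statistics per region."""
    return {
        region: {"total": sum(values), "average": mean(values)}
        for region, values in grouped.items()
    }


def report(stats):
    """Step 5: render the results as lines a human being can read."""
    return [
        f"{region}: total={s['total']:.0f}, average={s['average']:.1f}"
        for region, s in sorted(stats.items())
    ]


def run_pipeline(*sources):
    """Run the data through every stage in order, like a conveyor belt."""
    data = ingest(*sources)
    for stage in (clean, organize, analyze, report):
        data = stage(data)
    return data


if __name__ == "__main__":
    for line in run_pipeline(SOURCE_A, SOURCE_B):
        print(line)
```

Because each stage only depends on the output of the one before it, any stage can be swapped out or moved elsewhere without touching the others, which is the property the modular cloud offerings rely on.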

If the first three stages remind you of the ‘Extract, Transform, Load’ (ETL) manipulations of private data warehouse applications, that’s because they are basically the same thing. The differences are that the cloud data factory only costs you when you use it, and that it scales up or down with the volume of big data you send through it. In addition, you can route your conveyor belts and pipelines as you want from your cloud data factory management interface.
Your Data Factory or Mine?
Not everyone will shift their big data willy-nilly into the cloud. Sunk costs in existing solutions, lack of familiarity, or concerns about confidentiality can all act as brakes. But cloud vendors have a solution for that as well: the data factory model can extend to hybrid cloud working too. Your data conveyor belts can run locally in your private cloud for the early stages and continue up to the public cloud for the later stages, or vice versa. Alternatively, your public cloud option can handle overflows and/or non-confidential data for all five steps above. That’s a giant leap away from manufacturing washing machines and light years away from Adam Smith’s original pin factory.
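
To make the hybrid routing idea concrete, here is a small sketch of how a routing plan for the five stages might be expressed. It is a hypothetical policy written in plain Python, not a real vendor interface: the stage names, the `plan_route` function and its `private_until` parameter are all assumptions chosen for illustration.

```python
# Hypothetical stage names matching the five steps described earlier.
STAGES = ["ingest", "clean", "organize", "analyze", "report"]


def plan_route(confidential, private_until="organize"):
    """Return a mapping of stage name to 'private' or 'public'.

    Assumed policy: non-confidential data runs entirely in the public
    cloud; confidential data stays in the private cloud up to and
    including the `private_until` stage, then continues in the public cloud.
    """
    if not confidential:
        return {stage: "public" for stage in STAGES}
    cutoff = STAGES.index(private_until)
    return {
        stage: "private" if position <= cutoff else "public"
        for position, stage in enumerate(STAGES)
    }


if __name__ == "__main__":
    print(plan_route(confidential=True))
    print(plan_route(confidential=False))
```

The reverse arrangement (public cloud first, private cloud later) or overflow handling would need a different policy function; this sketch covers only the two cases shown.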