When to use Hadoop and when not to

Where Hadoop fits best

Hadoop has earned a reputation as the must-have big data analytics engine. So much is said about Hadoop that organisations assume it can do pretty much anything. In reality, Hadoop is a complex piece of technology that is still relatively raw and needs a great deal of care and handling before it does anything worthwhile and valuable. The fact remains that the open source distributed processing framework is not the right answer to every big data problem.

Hadoop has ample power for processing large amounts of unstructured and semi-structured data. That includes running end-of-day reports to review daily transactions, or scanning historical data dating back several months or years. But when it comes to real-time analytics, Hadoop is not known for its speed in dealing with smaller data sets: there is a trade-off, and in order to make connections between data points the technology sacrifices speed. As a result, Hadoop has limited value in online environments where fast performance is crucial. Conversely, it is best suited to processing vast stores of accumulated data.

Typically, Hadoop is used on large-scale projects that require clusters of servers running the Hadoop Distributed File System (HDFS) and employees with specialised programming and data management skills. Specialised development skills are needed because Hadoop uses the MapReduce programming framework, which only a limited number of developers are familiar with. As a result, an implementation can become expensive, even though the cost per unit of data may be lower than with a traditional relational database management system (RDBMS) such as Oracle. Once a project starts adding up all the costs involved, Hadoop is not as cheap as it seems.

Hadoop should therefore be viewed as part of a larger picture. It complements the data warehouse by managing the flow of unstructured and semi-structured data, including web server and other system logs, text data and social network feeds. In such cases, the combination of Hadoop, MapReduce and a NoSQL database can complete the data warehouse picture, creating a platform that puts each processing workload on the system best able to handle it. Offloading this processing from conventional systems can improve the overall performance of a data warehouse environment. In addition, Hadoop, data warehouse and data integration vendors have released software connectors that make it easier to transfer data between Hadoop and data warehouse systems.
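To give a flavour of the MapReduce skills involved, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. It is an illustration only: the class name WordCount, the job name and the input/output paths are placeholders, and a production job would need proper error handling, input validation and tuning.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: runs in parallel across HDFS blocks,
        // emitting (word, 1) for every token it sees.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: receives all counts for a given word and sums them.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, a job like this would normally be submitted to the cluster with the standard hadoop jar command, passing the input and output HDFS paths as arguments. Even for something this trivial, the developer has to think in terms of keys, values, splits and shuffles rather than a simple SQL query, which is exactly why the skills are scarce and implementations can become expensive.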

Organisations need to evaluate carefully when to use Hadoop and when to look elsewhere. As a result, architecture is becoming increasingly important. We thread everything together with a first-class architecture approach.

If you need further information or initial consultancy, please email mustafa@imexservices.co.uk for details.
