
subject: Don't Archive! Keep Your Data Online With Hadoop


Don't Archive! Keep Your Data Online With Hadoop

Read the full article here: http://bit.ly/RcjuMK

The excitement around Hadoop has reached frothy proportions in recent months. Everyone I speak to is asking about use cases for Hadoop. I came across an interesting one this week, so I decided to share it here on the Caserta Concepts blog.

Many companies need to retain certain types of data for long periods of time: seven years, 15 years, and sometimes even longer. The typical approach is to partition aging data into month-long segments and then store those partitions in archives. One example is call detail records, which carriers are required to store: they're voluminous but must often be retained for extended periods for compliance and other purposes.

Most of the time, these records aren't needed and they remain happily untouched in the archive. From time to time, however, it becomes necessary to serve up anywhere from six to 18 months or more of these records for a particular customer from a period several years in the past. This type of request is most often driven by an investigation, subpoena, or other legal inquiry.

Complying with such a request can be a messy job for carriers. First, they must locate and load from the archive an entire partition for a particular month. Typically, more than 99% of the data in each of these huge partitions is not relevant to the records in question. They must then extract the relevant records, store them in a staging area, close the partition, and move on to the next one. If 18 months of records are needed, this process is repeated 18 times. Worse still, if the records are regionalized and the customer is present in, say, three regions, the process could repeat as many as 54 times. If the process is manual (and it often is, at least partially), fulfilling this request could take days or even weeks.

Enter Hadoop. Instead of partitioning off aging data into a traditional archive solution, it can instead (or in addition) be stored as flat files in an online Hadoop Distributed File System (HDFS). When it comes time to extract customer-specific data that's spread across a great number of large files, Hadoop doesn't bat an eye. Its core Map/Reduce functionality automatically divides the task across multiple nodes and provides consolidated results in seconds or minutes.
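To make that concrete, here is a minimal sketch of what such an extraction job might look like: a map-only Hadoop MapReduce job that scans call detail record files in HDFS and keeps only the lines belonging to one customer. The class name (CdrCustomerFilter), the configuration key (cdr.customer.id), and the assumption that records are comma-separated with the customer ID in the first field are all hypothetical; a real deployment would match its own record layout.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CdrCustomerFilter {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    private String customerId;

    @Override
    protected void setup(Context context) {
      // The target customer ID is passed in through the job configuration.
      customerId = context.getConfiguration().get("cdr.customer.id");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hypothetical layout: comma-separated fields, customer ID first.
      String[] fields = value.toString().split(",");
      if (fields.length > 0 && fields[0].equals(customerId)) {
        context.write(NullWritable.get(), value);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <customer id> <input path> <output path>
    Configuration conf = new Configuration();
    conf.set("cdr.customer.id", args[0]);

    Job job = Job.getInstance(conf, "cdr-customer-filter");
    job.setJarByClass(CdrCustomerFilter.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0); // map-only: we only filter, no aggregation needed
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[1]));
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Pointed at several years' worth of monthly record files at once, a job like this lets the cluster do the scanning in parallel, rather than restoring and combing through archive partitions one month at a time.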

And because Hadoop is designed to provide fault tolerance using COTS (commercial off-the-shelf) servers, the cost of a Hadoop solution can be trivial compared with that of a traditional archival approach.

Have you had experience with this use case or a similar one? If so, please tell us about it in the comments section.

by: Caserta Concepts



