DEM2: Considerations When Designing for a Data Ecosystem


Building a data platform can be overwhelming given the many tech stack options. When building out a data pipeline, the key questions to ask from a business standpoint are, first and foremost, the purpose, which could range from data mining, alerting and monitoring to serving out insights, followed by who will access the data and the expectations around its availability. It also depends on whether the analysis is historical, real-time or predictive in nature. The answers to these questions form the cornerstone for decisions around the design patterns. Relating this to a real use case that I worked on, the platform had to be designed to support access patterns from multiple groups of users.

First, there were the data analysts who had to perform descriptive analytics tasks such as querying and aggregating the data for counts and averages within specific date ranges, with the lowest level of granularity being hourly. Next, there were the data scientists who had to experiment with discovery and exploratory data analysis before building out a dataset for their machine learning models and tuning their hyperparameters. From the workshop sessions, my team and I initially concluded that a batch design pattern with a one-hour lag and high availability was acceptable for an initial roll-out. Needless to say, democratising the data for future use cases was also one of the objectives.
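To illustrate the analysts' access pattern, the sketch below shows an hourly aggregation expressed in Spark SQL through PySpark. The table name, column names and date range are hypothetical placeholders rather than the actual schema from the project.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a descriptive-analytics query: counts and averages
# rolled up to hourly granularity. "events" and its columns are
# hypothetical placeholders for the real curated dataset.
spark = SparkSession.builder.appName("hourly-aggregates").getOrCreate()

hourly = spark.sql("""
    SELECT date_trunc('hour', event_time) AS event_hour,
           COUNT(*)                       AS event_count,
           AVG(response_ms)               AS avg_response_ms
    FROM events
    WHERE event_time BETWEEN '2019-01-01' AND '2019-01-31'
    GROUP BY date_trunc('hour', event_time)
    ORDER BY event_hour
""")

hourly.show(24)
```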

At a high level, the data pipeline architecture had to perform these key operations sequentially:

1) Sourcing from disparate systems in various formats.

2) Cleansing and transformation, such as joining with other datasets and augmentation, as sketched below.
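
As a rough illustration of step 2, the snippet below joins a raw events dataset with a reference dataset and derives an additional column. The DataFrames, column names and paths are hypothetical, intended only to convey the shape of the transformation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical cleansing/enrichment step: drop malformed rows, join the
# raw events with a device reference table, and add a derived column.
spark = SparkSession.builder.appName("cleanse-and-enrich").getOrCreate()

raw = spark.read.parquet("/data/raw/events")        # placeholder path
devices = spark.read.parquet("/data/ref/devices")   # placeholder path

enriched = (raw
            .dropna(subset=["device_id", "event_time"])
            .join(devices, on="device_id", how="left")
            .withColumn("event_hour",
                        F.date_trunc("hour", F.col("event_time"))))

enriched.write.mode("overwrite").parquet("/data/curated/events_enriched")
```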

Our choice of technology for the data acquisition and transmission layer was Sqoop, for import and export between HDFS/HBase and the MySQL relational database. HBase is a column-oriented NoSQL database offering a highly reliable key/value distributed storage system. Spark was the default choice for the big data analysis engine: Spark RDDs and Spark SQL were used for batch data processing, while Spark Streaming was used for stream data processing. Other competing options included Flink and Storm as real-time compute engines. Resource scheduling and management was handled by YARN, whose core components include the global ResourceManager, responsible for resource management and allocation across the entire system.
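
To tie these pieces together, the sketch below shows how a batch Spark job might be configured to run under YARN, which handles the resource negotiation described above. The application name, queue and executor settings are illustrative assumptions rather than the project's actual configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a batch job targeting the YARN resource manager.
# Queue name and executor sizing are illustrative placeholders; in
# practice these are usually supplied via spark-submit or spark-defaults.
spark = (SparkSession.builder
         .appName("hourly-batch-pipeline")
         .master("yarn")
         .config("spark.yarn.queue", "analytics")   # hypothetical queue
         .config("spark.executor.instances", "4")
         .config("spark.executor.memory", "4g")
         .getOrCreate())

# Read data previously landed in HDFS by Sqoop, then aggregate hourly.
events = spark.read.parquet("hdfs:///data/curated/events_enriched")
events.groupBy("event_hour").count().show()
```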