Secure Data Analytics Pipeline Setup

Key Challenges
Unosecur needed to process continuously generated CloudTrail logs without duplication, handle large data volumes, and transform the data for insights in a cost-efficient way. Data integrity, encryption, access control, and monitoring were also key concerns. Dashboard updates every 4 hours and cost optimization were critical requirements.
Key Results
The AWS EMR-based pipeline processed data efficiently, meeting the 4-hour update requirement. Spot instances reduced costs, while AWS Glue ensured seamless schema management. Redshift dashboards provided near-real-time insights, and CloudWatch alerts helped minimize downtime. Security measures such as end-to-end encryption, data masking, automated security scanning, and audit logging improved compliance and overall data protection.
Overview
Unosecur's mission is to simplify cloud security by addressing challenges associated with managing excessive permissions and identity risks in cloud environments. To provide businesses with actionable insights, Unosecur required a solution for processing and analyzing large volumes of log data while maintaining security and compliance standards.
The data, primarily in the form of AWS CloudTrail logs, was continuously generated and stored in an S3 bucket. The goal was to build an automated, scalable and secure pipeline to process these logs, transform the data, and make it accessible for analysis and visualization on a dashboard updated every 4 hours.
Challenges
Continuous Data Ingestion & Integrity:
CloudTrail logs were generated continuously and stored in S3, requiring a system capable of processing incremental data without duplication while maintaining data integrity. Ensuring that the logs were not tampered with or altered was critical. A lack of encryption and access controls could expose sensitive logs to unauthorized access or modification.
Data Access, Transformation & Masking:
Role-based authentication was used for updating AWS services through the pipeline. The log data needed to be filtered, transformed, and structured before loading into a data warehouse. However, CloudTrail logs often contained sensitive information such as IAM users, IP addresses, account details, and other metadata that could pose security risks. Without proper masking and encryption, personally identifiable information (PII) could be exposed during processing and storage.
Cost Optimization & Secure Storage:
The solution needed to minimize infrastructure costs while maintaining performance. At the same time, cost-effective storage still had to adhere to strict encryption policies and compliance standards. Sensitive data at rest had to be encrypted and locked against deletion.
Real-Time Updates & Secure Audit Logging:
Insights from the data had to be available within 4-hour intervals, necessitating frequent data processing cycles. While real-time monitoring existed, the logs also had to be audited and any changes tracked.
“The solution utilized Amazon EMR, the AWS Glue Data Catalog, and Glue jobs for a scalable, cost-effective data analytics pipeline.”
Architecture:

Solution
Data Ingestion and Processing:
A Glue job and a Spark application were developed to process the CloudTrail logs together with real-time data from an application deployed on an EC2 instance, which fetched information about the AWS account and stored it in an S3 bucket. The Spark application performed transformations such as filtering logs, restructuring data, and extracting relevant fields. S3 server-side encryption (SSE-KMS) and Object Lock were enabled to ensure data integrity and availability.
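The filtering and field extraction described above can be illustrated with a minimal plain-Python sketch. It is not the Spark application itself. The CloudTrail field names (Records, eventTime, eventSource, eventName, awsRegion, sourceIPAddress, userIdentity) are standard, but the set of event sources kept and the output column names are assumptions made for illustration.

```python
import json

# Hypothetical filter: the source does not say which events were kept.
EVENT_SOURCES_OF_INTEREST = {"iam.amazonaws.com", "sts.amazonaws.com"}


def flatten_record(record):
    """Extract a flat row of selected fields from one CloudTrail record."""
    identity = record.get("userIdentity", {})
    return {
        "event_time": record.get("eventTime"),
        "event_source": record.get("eventSource"),
        "event_name": record.get("eventName"),
        "aws_region": record.get("awsRegion"),
        "source_ip": record.get("sourceIPAddress"),
        "principal_arn": identity.get("arn"),
        "account_id": identity.get("accountId"),
    }


def transform_log_file(raw_json):
    """Parse one CloudTrail log file body and return filtered, flattened rows."""
    records = json.loads(raw_json).get("Records", [])
    return [
        flatten_record(r)
        for r in records
        if r.get("eventSource") in EVENT_SOURCES_OF_INTEREST
    ]
```

In a Spark job the same logic would typically be expressed as a filter followed by a column selection on a DataFrame; the function form here only shows the shape of the transformation.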
Persistent EMR Clusters:
An EMR cluster was deployed in persistent mode. The cluster ran the Spark streaming application, which read data from S3 as it arrived, applied the transformations, and wrote the results to another S3 layer.
Instance Configuration:
The EMR cluster used memory-optimized EC2 instances for efficient processing:
1 Master Node: r5.xlarge
2 Core Nodes: r5.xlarge
2 Task Nodes: r5.2xlarge (configured as spot instances, with fallback to on-demand instances when needed).
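One way to express this node layout is as an EMR instance fleet configuration. The sketch below only builds the configuration as a Python structure. It assumes instance fleets were used (the source does not say whether fleets or instance groups were chosen), and the fleet names and the 10-minute Spot timeout are hypothetical values.

```python
def build_instance_fleets(spot_timeout_minutes=10):
    """Return an InstanceFleets list shaped for boto3's EMR run_job_flow call."""
    return [
        {
            "Name": "master",  # hypothetical fleet name
            "InstanceFleetType": "MASTER",
            "TargetOnDemandCapacity": 1,
            "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"}],
        },
        {
            "Name": "core",  # hypothetical fleet name
            "InstanceFleetType": "CORE",
            "TargetOnDemandCapacity": 2,
            "InstanceTypeConfigs": [{"InstanceType": "r5.xlarge"}],
        },
        {
            "Name": "task",  # hypothetical fleet name
            "InstanceFleetType": "TASK",
            "TargetSpotCapacity": 2,
            "InstanceTypeConfigs": [
                {"InstanceType": "r5.2xlarge", "WeightedCapacity": 1}
            ],
            "LaunchSpecifications": {
                "SpotSpecification": {
                    # Assumed timeout; if Spot capacity is not fulfilled in
                    # time, EMR provisions On-Demand instances instead.
                    "TimeoutDurationMinutes": spot_timeout_minutes,
                    "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                }
            },
        },
    ]
```

The returned list would be passed as Instances={"InstanceFleets": build_instance_fleets()} to the EMR run_job_flow call. Note that the Spot provisioning timeout applies when the fleet is first provisioned, so handling Spot interruptions later in the cluster's life may need additional configuration.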
Metadata Management:
AWS Glue Data Catalog was used to manage unstructured data from CloudTrail logs. Glue crawlers were configured to update schemas dynamically as new fields were introduced.
Batch Job Configuration:
A Glue job read data from MongoDB daily, calculated the metrics required for analytics, and wrote them to the S3 bucket.
Incremental Data Loading:
Checkpointing was implemented using Spark to ensure that only new logs were processed during each cycle. The processed data was loaded into a Redshift data warehouse.
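The idea behind checkpoint-based incremental loading can be shown with a small plain-Python sketch. This is not Spark's checkpoint mechanism (Spark tracks progress itself through a checkpoint location); it only illustrates the principle of recording what has been processed so that each cycle handles new log files only. The file layout and function names are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of object keys already processed (empty on first run)."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return set(json.load(fh))


def save_checkpoint(path, processed_keys):
    """Write the processed keys to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(sorted(processed_keys), fh)
    os.replace(tmp_path, path)


def run_cycle(available_keys, checkpoint_path, process):
    """Process only keys not seen in earlier cycles, then update the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    new_keys = [k for k in available_keys if k not in done]
    for key in new_keys:
        process(key)
    save_checkpoint(checkpoint_path, done | set(new_keys))
    return new_keys
```

Because the checkpoint is written only after all new keys are processed, a cycle that fails midway is retried in full on the next run, so the processing step should be safe to repeat.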
Data Visualization & Secure Access:
Dashboards powered by Redshift allowed MFA-authenticated users to access insights in near real time. The data warehouse enabled efficient querying for analytics, with its data encrypted using KMS.
Monitoring and Alerting:
CloudWatch alerts were set up to monitor the EMR cluster and notify the team of any issues. AWS Config was used to detect and alert on configuration changes. AWS Inspector was integrated for continuous vulnerability scanning of the EC2 instances.
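As an illustration of the kind of alert described above, the sketch below builds the parameters for a CloudWatch alarm on an EMR cluster metric. The choice of the AppsFailed metric, the threshold, the alarm name, and the notification target are assumptions; the source does not state which metrics were monitored.

```python
def build_emr_alarm(cluster_id, sns_topic_arn):
    """Return keyword arguments shaped for boto3's CloudWatch put_metric_alarm."""
    return {
        "AlarmName": f"emr-{cluster_id}-apps-failed",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "AppsFailed",  # assumed metric of interest
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Maximum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. an SNS topic that notifies the team
        "TreatMissingData": "notBreaching",
    }
```

With boto3 installed and credentials configured, the dictionary could be passed as cloudwatch.put_metric_alarm(**build_emr_alarm(cluster_id, topic_arn)), where cluster_id and topic_arn are placeholders for real identifiers.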
Business Outcome
Efficient Data Processing:
The Spark application on EMR allowed the high-performance transformation of CloudTrail logs, ensuring the data was processed within the required 4-hour intervals while maintaining encryption and integrity.
Cost Optimization:
The use of spot instances for the EMR task nodes significantly reduced operational costs. On-demand instances served as a reliable fallback when spot capacity was unavailable, balancing performance and cost.
Improved and Secure Data Management:
AWS Glue Data Catalog streamlined schema management, adapting to changes in the log data structure without requiring manual intervention. The metadata was consistently updated, and sensitive data was masked before ingestion into Redshift to support compliance requirements.
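The source does not describe the masking technique used. One common approach is keyed pseudonymization, sketched below under that assumption: each sensitive value is replaced with an HMAC-SHA256 token, so equal inputs still map to equal tokens and remain joinable in analytics, while the original value is not stored. The field names and the key handling are hypothetical.

```python
import hashlib
import hmac

# Hypothetical column names; the source does not list which fields were masked.
SENSITIVE_FIELDS = ("source_ip", "principal_arn", "account_id")


def pseudonymize(value, secret_key):
    """Return a keyed SHA-256 digest so equal inputs map to equal tokens."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_row(row, secret_key, fields=SENSITIVE_FIELDS):
    """Return a copy of row with the sensitive fields replaced by tokens."""
    masked = dict(row)
    for field in fields:
        if masked.get(field) is not None:
            masked[field] = pseudonymize(str(masked[field]), secret_key)
    return masked
```

In practice the secret key would come from a managed secret store rather than from code, and whether tokenization, hashing, or redaction is appropriate depends on the compliance requirements in question.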
Real-Time Insights:
Dashboards powered by Redshift provided stakeholders with actionable insights, enabling timely decision-making. The pipeline’s efficient design ensured minimal delay between log generation and data availability while enforcing authentication and access controls.
Enhanced Monitoring:
CloudWatch alerts, CloudTrail logs, Config and AWS Inspector scans ensured that any security issues with the EMR cluster or data pipeline were promptly detected and addressed. The monitoring setup minimized downtime and maintained operational reliability.
Scalable and Future-Ready Architecture:
The EMR-based solution was designed to easily handle growing data volumes, ensuring long-term scalability and reliability for Unosecur’s evolving requirements.

