Big Data Engineer
Top 5% PySpark/Apache Spark contributor on StackOverflow
I have been a professional Big Data Engineer since 2016, and I am always eager to learn new technologies.
I have experience handling the complete lifecycle of data engineering projects — whether data migration, exploratory data analysis, or building a data lake on the cloud, I have delivered end to end with full dedication.
May 2019 - Present
Quantiphi is a category-defining Applied AI and Machine Learning software and services company focused on helping organizations translate the big promise of Big Data and Machine Learning technologies into quantifiable business impact.
I have been working as a Big Data Engineer, fulfilling clients' needs for handling large, complex datasets and creating data lakes on the cloud. My responsibilities here include:
- Design and implementation of a data lake on AWS using S3, Redshift, Glue, and DMS.
- Implementation of a document-style NoSQL data hub on AWS using DynamoDB, Glue, Step Functions, Athena, S3, EC2, and Infoworks.
- Creation of ETL pipelines with Apache Spark on Glue and EMR.
- Implementation of DynamoDB Streams with AWS Lambda to load Elasticsearch indexes for faster search results.
- Optimization of Apache Spark jobs by tuning joins and reducing data shuffle over the network.
- Optimization of Redshift cluster performance by choosing optimal distribution and sort keys and tuning SQL queries.
- Optimization of DynamoDB queries by implementing appropriate GSIs and choosing the best partition and sort keys.
- Implementation of a Flask-based API to fetch data from DynamoDB and serve customers, deployed on AWS Elastic Container Service (ECS) using Docker images.
- Creation of Docker images to enable the Spark history server, Glue local development endpoints, and data lineage for Glue jobs, and deployment of these images on ECS.
Tech Stack:
AWS (Redshift, Glue, S3, Lambda, Redshift Spectrum), Apache Spark, Python, Pandas, Boto3, PyArrow
Dec 2016 - May 2019
My role at TCS included working with the third-largest banking client, understanding their data needs, and developing architecture to streamline their processes in a big data ecosystem. The majority of my responsibilities included:
- Design and implementation of a Python-based framework to import data from various relational databases (Teradata, Oracle, SQL Server, DB2) into Hadoop using Sqoop, Hive, and Oozie.
- Implementation of a Spark JDBC-based data ingestion framework for the Informix database.
- Development of a Spark-based application to parse complex XML data received as files.
- Development of Spark-based reconciliation processes to maintain data integrity in Hadoop.
- Worked extensively with Spark DataFrames and pandas DataFrames.
- Worked in a Unix environment and created various Unix shell scripts as per requirements.
- Implementation of a Sqoop export framework for exporting data from Hadoop to Teradata.
- Orchestration of the data-flow pipeline from Dynatrace to HDFS using Oozie workflows.
- Handling of the Protobuf message format.
- End-to-end setup of a Flume agent for fetching live transactions.
- Used Spark 2.1.0 on Cloudera CDH 5.13 to perform analytics on data in Hive.
Environment:
Hadoop 2, Spark 2.1.0, Sqoop, Hive 1.4.x, Impala, Oozie, Cloudera CDH 5.13.5, Flume 1.6.0, Python 2.7, Java 1.7, Maven, Git, Jenkins
2012 - 2016
Maharana Pratap College of Technology
CGPA: 8.1
Developed a food-ordering website using Java Spring and Hibernate with an MVC architecture, and created a project for controlling streetlights remotely.
2000 - 2012
Nehru Higher Secondary School
Percentage: 84.4%
Nehru Higher Secondary School
Percentage: 82.3%