Resume
Contact Info:
E-mail: justinrmiller@gmail.com
GitHub: @justinrmiller
Experience
Principal Software Engineer (ZEFR, Los Angeles, CA)
May 2019 - Present
Introduced Ray to the organization and built example jobs and tooling around cluster creation, model training and embedding inference using Ray on Vertex AI
Trained XLM-RoBERTa models with data pseudo-labeled by ChatGPT 3.5
Built a data pipeline to supply image embeddings (LAION L-14) to a vector database (Qdrant) and built a Streamlit application to provide search capabilities via text or image upload
Worked with Data Science and Machine Learning teams to establish guidance around data retention for media, resulting in savings of ~$20k/month
Built a multi-page Streamlit app called DS Labs to migrate the data science staff from deploying applications manually to EC2 instances to a fully Dockerized application built with GitHub Actions and deployed via ArgoCD. This has dramatically sped up time to deployment of new ideas and tools by the data science team while minimizing EC2 costs.
Built and continue to improve the Video Retrieval Service, a Python service designed to retrieve videos from various platforms and store trimmed and full versions of them in S3. Videos are served from S3 via pre-signed URLs. This work expanded to include other services whose development I oversaw, including the Frame Retrieval Service and Image Retrieval Service, built using similar tech stacks.
Led the effort to build a front-end UI for embedding Looker dashboards. The reporting team can now add dashboards without involvement from front-end engineers, freeing those engineers to focus on more complex tasks.
Mentor a number of entry and mid-level software engineers
Built and continue to improve the Video Query Service, a Python service designed to abstract Elasticsearch details from services that need YouTube and Facebook video information
Built Appen Data Processor, an Airflow DAG that retrieves data from Appen (an ML data provider) and transforms and stores the data in S3 and Snowflake
Evaluated and demonstrated a potential 40x improvement in cost/performance in moving batch jobs from Airflow/Elasticsearch to Spark
Led efforts to modernize Flask and Airflow templates to newer versions of Python and upgraded existing projects (first from Python 3.6 to 3.8, then from 3.8 to 3.11)
Transitioned a number of services from ECS to EKS
Improved performance of PostgreSQL databases (RDS) and executed a migration from 9.x to 11.x
Built a new impression availability service (sourcing data from Elasticsearch, now OpenSearch) with caching and better visibility (tracing, monitoring, etc.)
Made general performance improvements to the Campaign Manager service (responsible for launching Google Ads Video Campaigns) and introduced multithreading and caching
Senior Platform Engineer - Data (GoSpotCheck, Denver, CO)
July 2018 - May 2019
Hecka - Built and maintained a Spark job to automatically transform data from half a dozen Postgres databases on an hourly/daily/weekly cadence and upload the results to S3 and Snowflake.
Ultraviolet - Incrementally uploaded data to Snowflake using custom SQL for transformations. Overwrite functionality was transferred to Hecka once that project was online. Updated the project to the latest Spark version and continued to maintain it. Built Go tooling to ease the addition of tables via code generation.
Evaluated Fivetran, Stitch and a variety of other ETL-as-a-service offerings to determine whether they would be a good fit for our data pipeline needs.
Wrote GoSpotKafka, a tool that deserializes Protocol Buffers (Protobufs) read from Kafka and displays them on screen as JSON.
Built Spark Structured Streaming examples to produce/consume Protobufs to/from Kafka for two of our highest volume tables.
Demoed Go Cloud Functions while they were in alpha to the Go guild and provided a number of examples.
Senior Platform Engineer/Threat Engineering Manager (ProtectWise, Denver, CO)
August 2015 - July 2018
Threat Engine - Took over and expanded a Kafka stream processor written in Scala that consumes observations and netflows and produces events and observations after applying rules defined using either a custom DSL or classes containing the logic. Used Monoids and Semigroups to ensure out-of-order messages wouldn’t impact detection results.
Odin - Built a data warehouse on Spark (2.x), Kafka, Parquet and Amazon EMR. The system processes all of our netflow, observation and event messages and stores the data in S3 as Parquet with Snappy compression. Currently storing over 12 billion netflows per day. Deployed in production using a combination of shell scripts and AWS EMR (Spark 2.x, EMR 5.x). Data is retrieved via Spark Shell using Spark SQL. As files finish processing, their metadata is sent to SQS for consumption by downstream reporting and data science processes.
Canary - Built a netflow, observation, and event emitter consisting of two parts, a loader and a publisher. The loader scans through cages and, at regular intervals defined in the cages, writes netflows, observations and/or events into Redis; publishers then read from Redis and publish the messages onto Kafka. This allows for easy prototyping of messages for consumption by the UI. UI developers can start their development much sooner in the process while the platform figures out how to actually produce the messages.
Arbiter - A customer/sensor policy management service written in Scala using Finagle and Thrift. Kicked this project off and continue to work with another developer on maintaining it in production. The project seeks to unify customer/sensor policy information in the platform.
Senior Software Engineer (CJ Affiliate, Santa Monica, CA)
December 2014 - August 2015
Refactored and extended a search API service for Elasticsearch written in Scala.
Built and styled React components (Table, Money, etc.) and introduced off-the-shelf UI components such as amCharts.
Improved and maintained web services, working on both the front-end (React) and the back-end (Java/Hibernate/Spring MVC).
Extensive experience with Test-Driven Development (TDD), pair programming and continuous integration.
Built a monitoring solution for builds using a Raspberry Pi to provide immediate feedback on the effects of commits on the build and the status of various teams’ branches.
Assisted in the migration away from Perforce to Git by writing documentation for developers.
Senior Software Engineer (eHarmony, Santa Monica, CA)
July 2013 - November 2014
Lead back-end developer on Elevated, a job searching platform, which consists of a service-oriented architecture written in Java with Dropwizard, Kafka, Elasticsearch and memcached as key technologies.
Improved and maintained Scorer Service, a Java-based service that computed scores based on predefined models.
Built a next-generation Scorer Service that could take arbitrary Protocol Buffer or JSON input and produce scores. By passing the data into the service as opposed to looking the data up from inside the service, the number of requests per second a server could handle rose dramatically.
Software Developer (SteelHouse, Culver City, CA)
January 2012 - July 2013
Designed, implemented, and maintained the back-end for a social network (Honeycomb) as a service-oriented architecture using Scala, Finagle and Cassie.
Designed and implemented a load-balanced CDN origin with image processing capabilities (image data written from three data centers around the globe). Built on Java, Jetty, Cassandra and Hector.
Implemented new features for an ad server based on Java and Jetty.
Software Developer (NISC, Lake Saint Louis, MO)
May 2009 - December 2011
Supported and developed accounting and utility billing software written in Java and designed around a three-tier architecture with an Oracle database.
Mentored new hires on Ant, Subversion, IDEA and the general structure of the Java code.
Skills
- APIs and Libraries: FastAPI, Ray, Apache Spark, Spring
- Application Servers/Platforms: Dropwizard, Grizzly/Jersey, Finagle, Spring Boot
- Build Systems: Cargo, Make, Maven, sbt
- CI/CD: GitHub Actions, Jenkins
- Cloud Platforms: AWS, DigitalOcean, GCP
- Databases/NoSQL Datastores: Cassandra, Elasticsearch/OpenSearch, PostgreSQL, Qdrant, Redis
- Front-end: React
- IDEs: IntelliJ IDEA, PyCharm, GoLand, Visual Studio Code
- Machine Learning: Diffusers, Ollama, PyTorch, Transformers, Stable Diffusion
- Operating Systems: Linux, Mac OS X, Windows
- Programming Languages: Go, Python, Rust, Scala
- Serialization: Avro, Protocol Buffers, Thrift (Scrooge)
- Source Control and Documentation: Git, SVN
Education
Missouri University of Science and Technology, Rolla, MO
- MS CS (August 2006 – May 2009)
Missouri University of Science and Technology, Rolla, MO
- BS CS (August 2002 – May 2006)