Development

Building an Event-Driven, Fault-Tolerant Data Pipeline with AWS Lambda, Alluxio, and Spark

In our platform we often have to fetch data from many locations (e.g. S3, SFTP, APIs) and in many formats (CSV, TSV, JSON, XML), because our client and publisher catalog is incredibly diverse and each partner provides data in its own unique way. As we have grown, we’ve amassed a long list of microservices, processes, and configuration that handle these different data sources and files. The biggest issue we’ve run into is that the various portions of the data pipeline do not interact as well as we would like, so when something fails it can be difficult to track down where in the process the error occurred. We have begun to feel some strain from this, so we’re abstracting and centralizing as much as we can.

Read More
How To Add Basic Hotel Booking to Chat

In the past few years, we’ve seen an explosion of chat bots across multiple industries. We are often asked: what can a chat bot do, and how would it benefit our product? In our experience, chat bots need to be tailored specifically to what a client wants; otherwise, the bot ends up feeling very generic (much like calling into an automated call center). So how can we make a bot succeed in an area crowded with thousands of existing bots?

Read More
Running Alluxio with Docker and S3 on DCOS/Mesos/Marathon

At Koddi we’re always looking for ways to increase the speed and stability of our platform. One of our latest projects is speeding up our daily ingestion of data.

All of our data is initially stored in flat files on S3 before being loaded into our database. We’re currently in the process of integrating Apache Spark into our load process to drastically increase the speed of our loads. One problem we ran into is that S3 doesn’t behave like a normal file system in terms of read and write speeds. This is where Alluxio comes in. Alluxio is a “memory speed virtual distributed storage system” which lies between frameworks (such as Spark, MapReduce, Flink, etc.) and a storage system (Amazon S3, Google Cloud Storage, HDFS, Ceph, etc.). This allows for dramatically faster data access, with some users seeing a 30x increase in data throughput. For a more in-depth overview of Alluxio, see their documentation.
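To make the swap concrete, here is a small sketch of rewriting an S3 path into its Alluxio equivalent so a Spark job reads through the cache instead of hitting S3 directly. The hostname, bucket, and mount point are hypothetical (the port is Alluxio’s default master RPC port), and this assumes the bucket has been mounted at `/s3` in the Alluxio namespace:

```python
# Sketch: pointing a Spark load at Alluxio instead of S3 directly.
# Host, bucket, and mount point are placeholder values.

def to_alluxio_uri(s3_uri, alluxio_host="alluxio-master", port=19998, mount="/s3"):
    """Rewrite an s3:// URI to the matching alluxio:// URI, assuming
    the bucket is mounted at `mount` in the Alluxio namespace."""
    assert s3_uri.startswith("s3://")
    bucket, _, key = s3_uri[len("s3://"):].partition("/")
    return f"alluxio://{alluxio_host}:{port}{mount}/{key}"

# A load that used to read S3 directly...
#   df = spark.read.csv("s3://our-bucket/daily/clients.csv")
# ...instead reads through the Alluxio namespace:
#   df = spark.read.csv(to_alluxio_uri("s3://our-bucket/daily/clients.csv"))
```

Because Alluxio keeps hot data in memory across jobs, repeated reads of the same daily files avoid the S3 round trip entirely.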

Read More
How Engineers Can Help Drive Innovation

Every engineer out there is looking to build something amazing, just as every visionary likes to see their ideas come to life. Unfortunately, innovation can be lost in the day-to-day: technicalities, competing priorities, and requirements documents have all created pitfalls for many promising projects. It doesn’t have to be that way if you are aware of where those pitfalls pop up and take a little autonomy in bridging the gaps.

Here are a few things that we do to keep our engineering team connected to and at the forefront of innovation.

Read More
Schema.org 3.1: Hotels and Hospitality Just Got A Lot More Structured

Hospitality brands gained some new ways to share information with search engines in this week’s Schema.org release, allowing hoteliers to specify everything from what kinds of rooms are on offer to whether they’re pet friendly. These enhancements to the markup standards, which Google uses to enrich a site’s search results, set the stage for travel shoppers to more fully assess what chains, single-location B&Bs, OTAs, and even peer-to-peer networks like Airbnb have to offer directly on search results pages – and maybe even someday book from there as well.
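As a rough illustration of the kind of markup the release enables, the snippet below builds a minimal JSON-LD payload for a hotel. The hotel name and amenity values are hypothetical; `petsAllowed` and `amenityFeature` are among the lodging-related properties covered by the 3.1 work:

```python
import json

# A minimal sketch of Schema.org hotel markup as JSON-LD.
# Name, times, and amenity values are hypothetical.
hotel = {
    "@context": "http://schema.org",
    "@type": "Hotel",
    "name": "Example Inn",
    "petsAllowed": True,
    "checkinTime": "15:00",
    "amenityFeature": [{
        "@type": "LocationFeatureSpecification",
        "name": "Free WiFi",
        "value": True,
    }],
}

# The serialized JSON would be embedded in the page inside a
# <script type="application/ld+json"> tag.
print(json.dumps(hotel, indent=2))
```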

Read More
Designing event based applications with incron

Recently we wrote an article on leveraging AWS Lambda to create event-based applications using S3. But what happens when you don’t have access to S3? What if you are using FTP or shared drives? Luckily, there are still solutions! One way to accomplish this on Linux is with incron. […]
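For a flavor of what this looks like, a single incrontab entry (the watched directory and script path here are hypothetical) pairs a path with an inotify event mask and a command, where `$@` expands to the watched path and `$#` to the name of the file that triggered the event:

```
# Hypothetical incrontab entry (edit with `incrontab -e`).
# IN_CLOSE_WRITE fires when a file is fully written, not merely created:
/data/ftp/incoming IN_CLOSE_WRITE /usr/local/bin/process_file.sh $@/$#
```

Using `IN_CLOSE_WRITE` rather than `IN_CREATE` sidesteps the classic partial-file problem: the handler only runs once the writer has closed the file.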

Read More
Reactive Applications with AWS Lambda

Sometimes you may find yourself needing a cron script to clean a file, or you may need to watch a directory of images and create preview thumbnails when they arrive on the server. Processes like these suffer from the same limitation: they require you to poll a script until you get a “successful” result.

This is problematic because it forces the developer to write redundancy checks in the code instead of focusing on the core problem. Moreover, file-watching utilities generally notify you when a file is created, not when it has finished writing. All of these problems must be accounted for, resulting in more complexity, overhead, and development time.

This is where event-driven programming can greatly reduce your development overhead. Part of that is maintaining a centralized data lake for all of your raw files; data lakes generally provide an event API for easy management of, and access to, the files within. In our case, Amazon S3 is the data lake of choice, and thanks to AWS Lambda we can hook into the S3 event API with minimal effort for simple use cases like cleaning files.
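A minimal sketch of such a handler follows; the bucket is assumed to have a notification configured to invoke this function on object creation, and the cleaning step itself is a placeholder:

```python
# Sketch of an S3-triggered Lambda handler. S3 invokes it once per
# notification, only after each object is fully written -- no polling,
# no partial-file checks. The actual cleaning logic is a placeholder.
import urllib.parse

def handler(event, context):
    cleaned = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # e.g. fetch the object with boto3, clean it, write it back
        cleaned.append(f"{bucket}/{key}")
    return cleaned
```

The function is invoked with the S3 event payload, so the code that matters is just the transformation itself.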

Read More
Optimizing your Docker workflow

We create a lot of single-responsibility services: fetching mail, downloading groups of files, cleaning data, importing data, and many others. Each one requires a new server that needs to be monitored and maintained, so we use Docker containers to normalize our process and work efficiently. From testing to staging to production, Docker containers provide a simple way to create disposable server images.

The primary drawback is that most Docker images lack proper setup or are not designed for your network or architecture. Below is a list of recommendations that will make creating Docker containers a less time-consuming process.
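As a baseline for a single-responsibility service, a hypothetical image might look like the following; the base image, dependency, and script name are all illustrative, not our actual setup:

```dockerfile
# Hypothetical single-responsibility worker image.
FROM python:3-slim
# Install only what this one service needs -- smaller images rebuild faster
RUN pip install --no-cache-dir boto3
COPY fetch_mail.py /app/fetch_mail.py
WORKDIR /app
# One container, one job
CMD ["python", "fetch_mail.py"]
```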

Read More
Moving One Billion Rows in MySQL (Amazon RDS)

You may remember from our November 2014 article about our switch to Redshift that Koddi uses Amazon Web Services (AWS) to power our platform. While we have moved some of our data to Redshift, we still have quite a bit in MySQL (RDS), and at the beginning of this year we needed to move our main database from one AWS account to another. The normal process for creating a copy of a database in RDS is to take a snapshot and spin up a new database from it. However, Amazon doesn’t allow you to share snapshots between accounts. This posed the question: how do we efficiently migrate over a billion rows of data?
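The full post covers our answer; purely as a generic point of comparison (and not necessarily the approach we took), a consistent dump can be streamed directly between instances, with hostnames and the database name as placeholders:

```shell
# Generic dump-and-stream between RDS instances; hosts and names are
# placeholders. --single-transaction gives a consistent InnoDB snapshot
# without locking; --quick streams rows instead of buffering them.
mysqldump -h source.rds.amazonaws.com -u admin -p \
    --single-transaction --quick --compress bigdb \
  | mysql -h target.rds.amazonaws.com -u admin -p bigdb
```

At a billion rows, transfer time and network cost dominate, which is exactly why the approach deserves the longer write-up.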

Read More
Tools and Methods for Multiple Weekly Deployments

Building an advanced bidding and reporting platform doesn’t just happen overnight. Our development team is constantly working on updating and improving the platform to give our users the best possible experience. You’re unlikely to notice, but Koddi releases updates to the application two to four times each week. Everything from […]

Read More