
How Daily.Dev Built a Low-Budget Serverless Scraping Pipeline for Online Articles

26 Feb 2021 9:43am, by
Ido Shamun
Ido Shamun is a co-founder and CTO at daily.dev.

At daily.dev, we scrape 50,000 articles every month from 500 blogs and magazines, at a cost of only $50 a month for us. In this article, I would like to share the decisions that made our articles pipeline cost-friendly, elastic, and ops-free.

Let’s start by explaining the steps an article goes through before it is published on our feed. The first step is to check the feed (mainly RSS) for new articles. Once a new article is detected, we scrape its metadata, such as the cover image, title, author, and publication date. The next step is to apply an NLP algorithm to extract keywords from the content of the article. This is followed by processing the cover image and hosting it on Cloudinary. Finally, the new article is added to the database.
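The steps above can be pictured as a chain of small functions, each one adding fields to the same article record. The following is only a minimal sketch: the `Article` fields, the function names, and the placeholder bodies are assumptions made for illustration, not our production code, and the keyword step is a trivial stand-in for a real NLP algorithm.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Article:
    # Hypothetical record shape; the real pipeline carries more metadata.
    url: str
    title: Optional[str] = None
    keywords: tuple = ()
    image_url: Optional[str] = None


def scrape_metadata(article: Article) -> Article:
    # Placeholder: a real step would fetch the page and parse its metadata.
    return replace(article, title="Example title")


def extract_keywords(article: Article) -> Article:
    # Stand-in for the NLP step: lowercase words taken from the title.
    words = (article.title or "").lower().split()
    return replace(article, keywords=tuple(words))


def host_image(article: Article) -> Article:
    # Placeholder: a real step would upload the cover image to an image host.
    return replace(article, image_url="https://images.example.com/cover.jpg")


STAGES = [scrape_metadata, extract_keywords, host_image]


def run_pipeline(url: str) -> Article:
    # Each stage receives the record produced by the previous stage.
    article = Article(url=url)
    for stage in STAGES:
        article = stage(article)
    return article
```

Calling `run_pipeline("https://blog.example.com/post")` returns a fully enriched record that is ready to be added to the database.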

The load pattern of this pipeline is spikes of messages with long idle times in between. This is because we check the blog feeds once every few minutes. Another important factor is that we don’t optimize for processing time. As long as it takes seconds, or even a minute or two, to process every article, that’s fine in terms of the product specifications. Note, however, that long processing times might incur a higher cost.

Armed with the specifications and context of how it should behave, we can proceed to find the right infrastructure for it.

Cloud Functions to the Rescue

I’m a big fan of the Google Cloud Platform, so I started by looking at the managed solutions GCP offers.

Why managed? Because we are a small team and can’t afford to manage infrastructure, even though it means vendor lock-in. Considering all the available products, I found that Cloud Functions (similar to Amazon Web Services’ Lambda service) is the best solution for our architecture. It can scale down to zero, which is important for the cost, and it also supports massive scale, way beyond our requirements.

Every step in our pipeline can be deployed as a separate function, which gives us the flexibility to choose the right programming language, and runtime environment. For example, I’m more familiar with Cloudinary’s JavaScript SDK, so it makes sense to use JavaScript for the image processing function. Python is a great choice for NLP, which is also a step in our pipeline.

Another important cost factor is that we can set the hardware requirements per function. Cloud Functions supports HTTP and Pub/Sub triggers, and it has a very generous free tier. But it does come with some compromises that we have to consider. Cloud Functions is a proprietary solution of Google Cloud. The tools for local development are very basic, so you have to hack your way around both development and testing. And compared to Docker-based solutions, Cloud Functions has limited runtime support.
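To make the Pub/Sub trigger more concrete, here is a minimal sketch of a Python function in the `(event, context)` shape used by Pub/Sub-triggered background functions, where the message body arrives base64-encoded in `event["data"]`. The function name, the payload fields, and the `make_event` helper are assumptions for illustration. The helper also shows one way to work around the limited local tooling: build the event by hand and call the function directly.

```python
import base64
import json


def extract_keywords(event, context):
    """Entry point shaped like a Pub/Sub-triggered background function."""
    # Pub/Sub delivers the message body base64-encoded under "data".
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    # Stand-in for real NLP: unique lowercase words from the title.
    words = payload.get("title", "").lower().split()
    payload["keywords"] = sorted(set(words))
    # The platform ignores the return value of a background function;
    # returning the payload here only makes local testing easier.
    return payload


def make_event(payload):
    # Hypothetical local helper that mimics the event a Pub/Sub trigger sends.
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {"data": data}
```

Locally, `extract_keywords(make_event({"title": "Serverless Scraping"}), None)` exercises the same code path that the trigger would.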

Given the simple nature of our steps and the fact that they are so independent of each other, I think that the pros outweigh the cons, so Cloud Functions it is. For subscribing to the RSS feeds specifically, we use Superfeedr. It is a managed service that triggers a webhook when a feed changes. It is pretty expensive in my opinion (10 feeds cost $1 per month), and it could definitely be a better product, but it does its job.

It was the right solution for getting started fast because it reduced the amount of development required and the ongoing operations. A few years later, we’re now considering building our own solution for subscribing to RSS feeds, but that’s a story for a different time.

Orchestration vs. Choreography

When dealing with distributed workflows like in our use case, there is always the question of service orchestration versus service choreography. Orchestration means that there is a dedicated service supervising the whole process from A to Z. The supervisor calls each service in the right order and provides the right arguments. It also has to deal with errors and unexpected events.

Service choreography means that every service invokes the next service in the process, either synchronously (over HTTP, for example) or asynchronously (through a message queue of sorts). When following the choreography pattern, some of the workflow logic has to be implemented inside each service, and each service has to be aware of the next service in line.

* Credit to StackOverflow for the images

Our workflow is very straightforward, with no conditions and no complex execution graph. Each service enriches the data of its predecessor. I didn’t want to manage state for the execution of every post or introduce a single point of failure, so I decided to follow the choreography pattern.

The services use Google Pub/Sub to asynchronously communicate. I would like to highlight that with the latest release of Google Workflows, a managed supervisor for service orchestration, I might rethink my decision. With Workflows, I can get all the benefits of service orchestration without the need to develop or maintain the orchestrator itself.
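The choreography pattern described above can be sketched with a tiny in-memory broker. Each handler knows only the topic it listens to and the topic it publishes to, and no supervisor tracks the overall state. This is a sketch under stated assumptions: `InMemoryBroker` is a stand-in for a message queue and is not the Pub/Sub client API, and the topic names and handler bodies are hypothetical.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Tiny stand-in for a message queue; not the Pub/Sub client API."""

    def __init__(self):
        self.subscribers = defaultdict(list)
        self.pending = deque()

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        self.pending.append((topic, message))

    def run(self):
        # Deliver queued messages until none are left.
        while self.pending:
            topic, message = self.pending.popleft()
            for handler in self.subscribers[topic]:
                handler(message)


def build_pipeline(broker, published):
    # Each service enriches the message and hands it to the next topic.
    def scrape(msg):
        broker.publish("post-scraped", {**msg, "title": "Example title"})

    def keywords(msg):
        words = msg["title"].lower().split()
        broker.publish("post-keywords", {**msg, "keywords": words})

    def image(msg):
        broker.publish("post-processed", {**msg, "image": "hosted"})

    broker.subscribe("post-detected", scrape)
    broker.subscribe("post-scraped", keywords)
    broker.subscribe("post-keywords", image)
    # The last subscriber plays the role of the API that stores the post.
    broker.subscribe("post-processed", published.append)
```

Publishing a single message to the hypothetical `post-detected` topic and calling `run()` carries it through every step. Adding or reordering a step only means changing which topics a handler subscribes and publishes to.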

Above you can see the existing architecture of the pipeline. Every box is a cloud function except for the API, which is our server. It subscribes to the event that signals a post has been processed. Upon this event, the API adds a new entity to the database, making it available to all users. Superfeedr triggers the webhook with an HTTP request, and all the other steps communicate through Google Pub/Sub messages.

Monitoring and Error Handling

We can’t introduce a new architecture without considering monitoring and error handling. I won’t go into detail; I’ll just cover the important takeaways.

First, we need to monitor our message queues. My alarm fires when one unacknowledged message stays in the queue for five minutes. The latency of the articles pipeline is usually very low, so a message doesn’t stay in the queue for long. If one does, we need to know, and we need to know fast.
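In practice this alarm is configured in the monitoring service rather than written by hand, but the condition itself is simple enough to sketch. The function below assumes a hypothetical list of `(timestamp_seconds, waiting_count)` samples and reports whether at least one message has been waiting continuously for the whole window.

```python
def should_alert(samples, threshold=1, window_seconds=300):
    """Return True if the queue stayed at or above threshold for the window.

    samples: list of (timestamp_seconds, waiting_count), sorted by time.
    """
    breach_start = None
    for timestamp, count in samples:
        if count >= threshold:
            if breach_start is None:
                # First sample of a new breach.
                breach_start = timestamp
            if timestamp - breach_start >= window_seconds:
                return True
        else:
            # The queue drained, so the breach window starts over.
            breach_start = None
    return False
```

A queue that drains even once inside the window resets the timer, which keeps the short spikes that are normal for this pipeline from paging anyone.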

The second aspect is the application errors that could occur during runtime. I use Google Error Reporting which notifies me in real-time of any new unexpected error in the service. Of course, there are many alternatives to Error Reporting, but I find it easier as it integrates perfectly with the rest of the cloud services.

Lastly, we need to think about the retry strategy in case of an error. Luckily, the message queue offers some strategies out of the box, including a dead letter queue, so we can later inspect the messages that the system couldn’t process.
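The idea behind a dead letter queue can be illustrated in a few lines. The message queue handles this for us, so the sketch below is not how the managed service is implemented. The `deliver` function and its parameters are hypothetical and only show the behavior: retry a handler a limited number of times, then park the message for later inspection.

```python
def deliver(message, handler, max_attempts, dead_letters):
    """Try the handler up to max_attempts times, then dead-letter the message.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return attempt
        except Exception:
            # A real queue would typically wait (back off) before redelivery.
            continue
    dead_letters.append(message)
    return None
```

Messages that end up in `dead_letters` are never lost. They can be inspected and replayed once the underlying bug is fixed.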

Cost Analysis

The cost analysis for this architecture is very simple. Cloud Functions includes 2 million free invocations per month. The Google Pub/Sub free tier limit is 10GB per month. This means that at 50,000 articles per month, everything falls under the GCP free tier, which is incredible!

For this implementation and scale, we pay $0 for infrastructure, which includes Cloud Functions, Pub/Sub, monitoring, and error reporting. Superfeedr is the only cost for this architecture: $1 per 10 feeds per month. With 500 feeds, we pay $50 in total. Cloudinary is excluded because I count it as part of our API architecture. And anyway, the real cost of Cloudinary is the bandwidth, not the storage, which is not relevant for this case.
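The arithmetic behind these numbers can be written down as a quick sanity check. The feed count, article count, Superfeedr price, and free invocation quota come from this article. The number of function invocations per article is an assumption (roughly one per pipeline step).

```python
FEEDS = 500
ARTICLES_PER_MONTH = 50_000
SUPERFEEDR_USD_PER_10_FEEDS = 1
FREE_INVOCATIONS_PER_MONTH = 2_000_000
FUNCTIONS_PER_ARTICLE = 4  # Assumption: about one invocation per step.


def superfeedr_monthly_cost(feeds=FEEDS):
    # Superfeedr is billed per block of 10 feeds.
    return feeds // 10 * SUPERFEEDR_USD_PER_10_FEEDS


def monthly_invocations(articles=ARTICLES_PER_MONTH):
    return articles * FUNCTIONS_PER_ARTICLE


def within_free_tier(articles=ARTICLES_PER_MONTH):
    return monthly_invocations(articles) <= FREE_INVOCATIONS_PER_MONTH
```

Under that assumption the pipeline makes about 200,000 invocations a month, a tenth of the free quota, so the Superfeedr subscription remains the only bill.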

Conclusion

In this post, we introduced a new serverless pipeline for scraping articles. We’ve covered the pros and cons of Cloud Functions and why it’s a cost-effective solution. We then compared the different techniques to orchestrate our workflow, followed by the necessary aspects of monitoring. Lastly, we did a bit of cost analysis to understand how much it costs.

The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Docker.
