Stackoverflow Heroes — Chapter 2: Your data is mine

Pang Bian
Jul 23, 2020
Stackoverflow Heroes — Data Fetching

In the previous post, we talked about the system design for Stackoverflow Heroes — an app for querying Stackoverflow data as a graph. We defined three Important Topics of the app: Data Fetch, Data Insert, and Query API. Today, we will be covering the practical implementation of data fetching.

You still remember that we are exploring the Serverless world here, right? Naturally, the implementation is forced to be Serverless. Frankly speaking, we don’t really need all of those services, technologies, and APIs. But hey, we are exploring!

We have technology

The stack we will be using for Stackoverflow Heroes:

  • Serverless stack: AWS
  • Serverless deployment: Terraform
  • CI/CD/VCS: GitHub
  • Language (used in Lambdas): Go

That’s pretty much it; we don’t need anything else. Mostly because AWS, Terraform, and GitHub are available to anyone — free tiers, free accounts, free CI/CD in GitHub Actions, etc. Why Go… well, no objective reason. Any of the Lambda runtimes would work well enough for our app. I will be doing Go in this article just because I needed to choose something.

So, what are we doing, again?

We will be covering the Data Fetch part of the app. The StackExchange API, Lambda, S3, and simple CI via GitHub are on the agenda. Let’s start by figuring out how we get the questions, answers, and people from Stack Overflow.

Extract data

A literal unicorn! Thanks, Mr. Gessner and Mr. Spolsky!

Stack Overflow is famous not only for its creator’s fondness for unicorns in the code but also for its public REST API, which is extremely easy to use. Go to the Stack Exchange API docs and pick any endpoint. You will see that you can run the API directly from the docs. This is awesome! I wish more APIs were documented that way.

But how is that possible, and where is the authentication? The trick here is that you can query the Stack Overflow API anonymously. The anonymous mode is heavily restricted in the number of requests and the amount of data you can fetch. This is good for experimenting and checking out what is possible and what is not, but for any serious business-like application you have to go through the OAuth2 process and even register your app. Let’s do it together.

Similar modes (anonymous but limited, and authenticated) are common; GitHub, for example, uses them as well. Think about the APIs in your own projects — perhaps an anonymous limited mode could help your API users get to their first successful interaction.

So, to get an access token we need to make a Stack App first. Go to https://stackapps.com/ and register a new application (you need to be logged in to Stack Overflow for that). When you create the application you have to specify the OAuth Domain and the Application Website. Since we are not planning to have a real website, let’s just use github.com as the domain and a link to the GitHub repository as the website. I used https://github.com/Otanikotani/stackoverflow-heroes, but you will probably want to fork that repo and use the fork as the link. Once this is done you will get:

  • Client Id
  • Client Secret
  • Key

We will need all of them later for getting an access token and using it.

First, open the link (change the query parameters first) “https://stackoverflow.com/oauth/?client_id=11111&scope=private_info+no_expiry&returnUrl=https://stackexchange.com/oauth/login_success” in your web browser. The client id is the one you still remember by heart from the app we have just registered. The scope is set to “private_info+no_expiry” so that we can use the resulting access token for as long as we want and get some extra fields from the Stack Overflow API. The return URL must be set exactly as in the link.

The opened page asks for your approval and, once you approve, redirects you to another URL that looks like https://stackexchange.com/oauth/login_success?code=dddskyAxrsI90c(fc8qw)). Keep the code value from that URL.

Okay, so now we can finally open the terminal and do what my wife calls “doing java” (because anything, be it a terminal, an IDE, or a technical document, is “doing java” to her).

I will be using HTTPie here as I really like its simplicity over standard curl.

The command:

http --form post 'https://stackoverflow.com/oauth/access_token/json' client_id='<your-app-client-id>' client_secret='<your-app-client-secret>' code='<the-code-we-got-from-the-page>' redirect_uri='https://stackexchange.com/oauth/login_success'

If all goes well you will receive an access token in the response’s body. Grats!
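
The body is plain JSON; with the no_expiry scope it should look roughly like this (the token value here is made up):

{
  "access_token": "AbCdEfGh0123456789abc"
}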

You can play a little bit with the access token in the Stack Exchange API docs and see how the quota limit changes. Don’t take too long though; it is time to write some Go and get the precious data we want!

Stackoverflow Heroes: first Lambda

The reference repository: https://github.com/Otanikotani/stackoverflow-heroes/tree/2e7a82077557996d5d580cf02ab84753810241df.

If you are not completely new to AWS Lambda, S3, and Go, you can safely skip this chapter.

Check it out and let’s see what we got here:

Repository tree (Octotree plugin)

We’ve got a couple of Go files, a simple GitHub Actions workflow YAML configuration, and a cherry on top — a Makefile! Man, had I completely forgotten the days of using make… It still works though; it is dead simple, and it covers our needs.

CI

A couple of words on GitHub Actions. If you are not familiar with them, I strongly recommend taking a look. They are free to use (with limits, of course, but you are unlikely to run out of the quota), available for both public and private repositories, close to your code, and, again, dead simple. Let’s check what we’ve got in just-tests.yml:

This configuration triggers a job on any push to the master branch. The job checks out the code, lints it, runs the tests, and then builds it. See logs here.
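
The embedded file is not reproduced in this post, so here is a minimal sketch of what such a workflow could look like (the action versions and the exact lint command are my assumptions; the repository has the real file):

name: Just tests
on:
  push:
    branches:
      - master
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-go@v2
        with:
          go-version: 1.13
      - name: Lint
        run: go vet ./...
      - name: Test
        run: go test ./...
      - name: Build
        run: go build ./...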

Shameless plug: if you use JetBrains products (IntelliJ IDEA, for example), check out the plugin for monitoring GitHub Actions workflows directly in the IDE: https://plugins.jetbrains.com/plugin/13793-github-actions

For now, that’s all we want from CI — to run some simple checks on our repo upon pushes. We will do the complete Continuous Delivery pipeline later.

Okay, this is all well and good: we can build, test, and lint the code. But what is the code?

The main.go file starts the lambda request handling, so the main function of your Go lambdas is always going to look like this:
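
The embedded snippet is not reproduced here, so below is a minimal sketch based on the aws-lambda-go library; the handler signature is just one of several that lambda.Start accepts.

package main

import (
	"context"

	"github.com/aws/aws-lambda-go/lambda"
)

// main only hands control over to the Lambda runtime, which then
// calls handleRequest on every invocation.
func main() {
	lambda.Start(handleRequest)
}

// handleRequest does the actual work; what goes inside is described below.
func handleRequest(ctx context.Context) error {
	// fetch from the Stack Exchange API, transform, upload to S3
	return nil
}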

The handleRequest function gets the access token, the key (the Key of the Stack App we registered before), the S3 region, and the S3 bucket configuration parameters from environment variables. This is the preferred way to do it in Lambdas. Now, some experts will tell you that it is wrong to keep sensitive information such as access tokens in Lambda environment variables, but for our little app it is a risk we are willing to take. Other solutions would be to use AWS Secrets Manager or to encrypt the environment variables.
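
A sketch of that configuration step; the variable names match the ones we will set in the AWS console later in this article, AWS_REGION is set by the Lambda runtime itself, and the config type is purely illustrative:

// Lives in the same main.go; needs "os" added to the import block.
type config struct {
	AccessToken string // STACK_EXCHANGE_ACCESS_TOKEN
	Key         string // STACK_EXCHANGE_KEY
	Region      string // AWS_REGION, provided by the Lambda runtime
	Bucket      string // BUCKET
}

func configFromEnv() config {
	return config{
		AccessToken: os.Getenv("STACK_EXCHANGE_ACCESS_TOKEN"),
		Key:         os.Getenv("STACK_EXCHANGE_KEY"),
		Region:      os.Getenv("AWS_REGION"),
		Bucket:      os.Getenv("BUCKET"),
	}
}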

Then the handleRequest function fetches the questions, answers, and people from the StackExchange API, converts them into a bunch of vertex and edge records, and writes them as .csv files into AWS S3. There are three vertex .csv files: answers, questions, and people. There is one edge .csv file: edges. Edges define relationships between answers, questions, and people. Well, not right now, but they will once uploaded to AWS Neptune.
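
I won’t walk through the real code, but to give a rough idea of the shape of that flow, here is a simplified sketch (types, fields, and function names are illustrative, not the exact ones from the repository):

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// questionsPage mirrors a slice of the Stack Exchange response wrapper.
type questionsPage struct {
	Items []struct {
		QuestionID int    `json:"question_id"`
		Title      string `json:"title"`
	} `json:"items"`
	HasMore bool `json:"has_more"`
}

// fetchQuestions gets one page of "serverless"-tagged questions.
func fetchQuestions(key, token string, page int) (*questionsPage, error) {
	url := fmt.Sprintf(
		"https://api.stackexchange.com/2.2/questions?site=stackoverflow&tagged=serverless&page=%d&key=%s&access_token=%s",
		page, key, token)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var qp questionsPage
	if err := json.NewDecoder(resp.Body).Decode(&qp); err != nil {
		return nil, err
	}
	return &qp, nil
}

// uploadCSV writes an already-assembled CSV payload to the bucket.
func uploadCSV(bucket, name string, data []byte) error {
	sess := session.Must(session.NewSession())
	_, err := s3manager.NewUploader(sess).Upload(&s3manager.UploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(name),
		Body:   bytes.NewReader(data),
	})
	return err
}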

Since it is Go and the code is pretty simple, I am not going to cover here how exactly the data is being fetched or how it is transformed to CSV and uploaded to AWS S3. If the code is not clear by itself — please throw something at me next time you see me. Before that, some less obvious facts for you though:

  1. The data we fetch is quite limited. First, we only fetch questions tagged “serverless”. Second, a filter is set up that tells the Stack Exchange API which fields we are interested in seeing in the response. Even so, there is a concern that we could run out of memory (especially in the Lambda environment!) because we basically keep all the data in memory. The way to address that would be to process each retrieved page of questions separately, keeping in memory only the references between people and the questions/answers they made.
  2. The headers of the question, answer, and people CSV files look weird with those “~” symbols. Not much we can do here though — the format is required by AWS Neptune; see the example right after this list.
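
For reference, the “~”-prefixed columns come from Neptune’s bulk-load CSV format; the property columns below are just an illustration, but the system columns are what Neptune expects. A vertex file (people, say) starts with a header like:

~id,~label,name:String,reputation:Int

while the edge file’s header names the connected vertices:

~id,~from,~to,~label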

This is going to be enough for our first step. Let’s deploy the lambda, create the S3 bucket, run it, and see the CSV files in S3.

λ → 🗑

First, let’s build the lambda — do make zip (you’ll need Go 1.13 for that).
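
In case you are curious, the zip target boils down to cross-compiling for Linux (the Lambda execution environment) and zipping the binary, roughly like this (a sketch; the repository’s Makefile is the source of truth, and recipe lines must be indented with tabs):

build:
	GOOS=linux GOARCH=amd64 go build -o main .

zip: build
	zip function.zip main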

Then log in to your AWS account (if you don’t have one — seriously, you should; free compute power for everyone), go to the Lambda service, and create a new function. You can go with the defaults there, just remember to give it an epic name.

Once created, upload the zip as the Function code there (click the Actions button on the right):

Uploading the zip as the Lambda function code

And set up the environment variables. We need to set BUCKET, STACK_EXCHANGE_ACCESS_TOKEN, and STACK_EXCHANGE_KEY. The last two we have already: your access token and the key of your Stack App. The bucket we don’t have yet, so let’s make one. Go to the S3 service and create any bucket you want; the defaults will work just great. Then, once we have the bucket name, we are all set to go on and run the function!

Does not look secure to me!

We forgot to grant the lambda access to S3! By default, lambdas have almost no permissions to do anything with other AWS services. Let’s enable S3 operations.

In your lambda, go to the Permissions tab and open the Role link. Then, open the execution role and modify it so it looks like this:
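
The policy from the post’s screenshot is not reproduced here, but something along these lines grants the Lambda full access to one bucket (swap in your own bucket name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    }
  ]
}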

Note the placeholders! Replace with your resource names.

This will allow the Lambda to perform any kind of operation on the bucket.

Now we are ready to test! Go back to the lambda, hit the Test button there, and generate a test event to trigger the lambda. The event content doesn’t matter. Enjoy the sweet, green “Execution result: succeeded”, check the logs, check the S3 bucket. I hope everything is in place.

What is next

We have covered a very simple lambda function that invokes an API and saves the results to S3. That’s enough for building some simple, cute functions for your own needs. But we are not going to stop here! First, the manual lambda deployment was absolutely shameful; we are not supposed to be doing that. Secondly, we did not do the CloudWatch part to trigger the lambda — we only tested it manually (ew). Let’s address these things in the next article, see you there!
