Python Engineer (Web Scraping)

About NewsCatcher

NewsCatcher is building the largest & most accurate database of news articles published online. Our solution crawls millions of web pages daily to detect over 1,000,000 news articles each day.

We deal with unstructured text data parsed from HTML pages.

You can read more about how NewsCatcher works via this link.

Today, we're a team of two technical co-founders. We're looking for our first hire: an experienced engineer who can deliver end-to-end data solutions.

As our first engineering hire, you will have a key influence on our new infrastructure. You'll actively participate in the technical design process, bringing your expertise and analysis to help make data-driven decisions.

We're fast-growing, so if you can see yourself running your own team in a year, we're a great place for you.

Our tech stack:

  • AWS, GCP, OCI
  • Python (including asynchronous workloads)
  • Docker & k8s
  • Pub/Sub, SQS, Kafka
  • Elasticsearch
  • MongoDB/DynamoDB
  • PostgreSQL

Your skills

  • Strong programming skills in Python
  • Experience in web scraping, web crawling
  • Python packages: Scrapy, requests, bs4, re, Selenium, Flask
  • Knowledge of the Docker & Kubernetes ecosystems
  • Experience with cloud platforms such as GCP/AWS/OCI
  • Experience working with event-driven/streaming architectures
  • Monitoring (Elasticsearch/Datadog)
  • GitHub for code versioning
  • Optional:
      • JavaScript
      • Kubernetes
      • Kafka

Example tasks

  1. Improve existing web article extraction algorithms

    We have our own generic extraction algorithm. You will be asked to improve it so it can extract data from more news sources. The most valuable fields we extract are the published date, title, and content.
  2. Write extraction algorithms for specific news websites

    Some websites have a unique page structure, but we still need data from them. If our generic extraction does not process those websites properly, you will be asked to come up with your own method.
  3. Create an algorithm that can distinguish article pages from other pages

    While crawling a website, we go through all the pages it contains, from the "About Us" page to "Our sponsors". Your goal will be to identify the pages that are news articles.
  4. Improve news URL discovery algorithms

    If we get a URL of a news source that is not yet in our database, meaning we have never extracted anything from it, we have algorithms that search for every way to find its news article URLs. You will work on improving them.
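
To give a feel for tasks 1 and 3, here is a minimal sketch in Python. It is not NewsCatcher's extraction algorithm. It assumes the page exposes Open Graph metadata (`og:title`, `og:type`, `article:published_time`), which many news sites do but not all. The names `MetaTagParser`, `extract_article_fields`, and the sample pages are hypothetical and exist only for illustration. Extracting the article content itself is out of scope here.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> key/content pairs and the <title> text (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Open Graph uses "property"; other metadata often uses "name".
            key = attrs.get("property") or attrs.get("name")
            content = attrs.get("content")
            if key and content:
                # Keep the first value seen for each key.
                self.meta.setdefault(key.lower(), content.strip())
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_article_fields(html):
    """Return title, published date, and a rough 'is this an article?' flag."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta
    # Prefer og:title; fall back to the <title> element.
    title = meta.get("og:title") or "".join(parser.title_parts).strip() or None
    return {
        "title": title,
        "published_date": meta.get("article:published_time"),
        # Assumption: an article either declares og:type "article"
        # or carries a published timestamp.
        "looks_like_article": meta.get("og:type") == "article"
        or "article:published_time" in meta,
    }


# Hypothetical sample pages, made up for this example.
SAMPLE_ARTICLE = """<html><head>
<title>Example headline | Example News</title>
<meta property="og:type" content="article">
<meta property="og:title" content="Example headline">
<meta property="article:published_time" content="2021-03-01T08:30:00Z">
</head><body><p>Body text.</p></body></html>"""

SAMPLE_ABOUT = "<html><head><title>About Us</title></head><body></body></html>"

if __name__ == "__main__":
    print(extract_article_fields(SAMPLE_ARTICLE))
    print(extract_article_fields(SAMPLE_ABOUT))
```

A metadata-only approach like this breaks down on sites that omit or misuse these tags, which is where the site-specific extractors from task 2 would come in.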

Reasons to work for us

  • You'll be our first hire, and we are considering offering you shares
  • Get a 360-degree view of the whole architecture
  • We expect to learn from you; your vote will mean a lot to us
  • You’ll never get bored
  • You’ll be surrounded by hardworking, talented, and ambitious people
  • Focus more on quality than quantity

About NewsCatcher

NewsCatcher is a news data-as-a-service startup. We collect & index over 1M news articles published online every single day.

We help data scientists, developers, analysts, and market researchers get instant access to news articles published online.

Our mission is to become the industry-standard news data provider.

We have a "devs for devs" mindset, so we publish open-source code (over 3k stars on GitHub), maintain a free News API service, and write content that helps developers better understand how to work with news data.

If you're a data scientist, developer, data engineer, or anyone who wants to understand web scraping, web crawling, NLP, or data aggregation, we're a must-follow page!

Before you apply, please check if any restrictions apply in terms of time zone or country.

This job has a geo-restriction in place: Asia, EU, SA only.

Category: Software Development
Job type: Full-time
Hiring from: Asia, EU, SA only
Date posted: 1 week ago

Please mention that you come from Remotive when applying for this job.
