Web Scraper for Korean-Involved Articles

Background & Purpose:

When an incident occurs, we often look for Korean names as an indicator of possible Korean affiliation.
However, identifying these names manually is labor-intensive and time-consuming, demands meticulous attention, and relies on intuition.
Machine learning can help overcome this challenge by making the process more efficient and scalable.

Steps for this project include:

1. Scraping News Articles: We collect news articles that match a user-supplied keyword.

2. Creating a Database: Once the news articles are scraped, we store them as a collection of CSV files.

3. Extracting Human Names: We use natural language processing (NLP) to identify and extract the human names mentioned in the text, relying on the spaCy library for named entity recognition (NER).

4. Binary Classification for Korean Names: After extracting human names, we determine whether each article includes Korean names, which may indicate affiliation with Korean individuals. A pre-trained binary classification model labels each article according to whether it includes Korean names.
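
As a rough illustration of how steps 2 through 4 might fit together, the sketch below stores articles as CSV, reads them back, and flags the articles whose extracted names are classified as Korean. It is a minimal sketch, not the project's actual code: the column names (title, url, text), the function names, and both toy stand-ins are assumptions made for illustration. In the real pipeline, the name extractor would presumably wrap spaCy NER and the classifier would be the pre-trained model; here both are passed in as plain callables so the wiring can be shown with the standard library only.

```python
import csv
import io
import re

# Hypothetical column layout; the project's actual CSV schema is not specified.
FIELDS = ["title", "url", "text"]


def save_articles(articles, fileobj):
    """Step 2: write scraped articles (a list of dicts) as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(articles)


def load_articles(fileobj):
    """Read the stored articles back as a list of dicts."""
    return list(csv.DictReader(fileobj))


def flag_articles(articles, extract_names, is_korean_name):
    """Steps 3 and 4: extract names, then mark articles that contain
    at least one name the classifier accepts as Korean."""
    flagged = []
    for article in articles:
        names = extract_names(article["text"])
        korean = [name for name in names if is_korean_name(name)]
        flagged.append(
            {**article, "korean_names": korean, "has_korean_name": bool(korean)}
        )
    return flagged


# --- Toy stand-ins (assumptions, NOT the project's components) -------------
# With spaCy installed, the extractor could instead look like:
#   nlp = spacy.load("en_core_web_sm")
#   names = [ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"]

def toy_extract_names(text):
    """Naive placeholder: treat two consecutive capitalized words as a name."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)


TOY_SURNAMES = {"Kim", "Park", "Choi"}  # illustrative only, far from complete


def toy_is_korean_name(name):
    """Naive placeholder for the pre-trained binary classifier."""
    return any(part in TOY_SURNAMES for part in name.split())


if __name__ == "__main__":
    buffer = io.StringIO(newline="")  # stands in for a CSV file on disk
    save_articles(
        [
            {"title": "A", "url": "", "text": "Police said Kim Minjun was questioned."},
            {"title": "B", "url": "", "text": "The mayor John Smith spoke today."},
        ],
        buffer,
    )
    buffer.seek(0)
    for row in flag_articles(load_articles(buffer), toy_extract_names, toy_is_korean_name):
        print(row["title"], row["has_korean_name"], row["korean_names"])
```

Passing the extractor and classifier in as arguments keeps the storage and labeling logic independent of any particular NER library or model, so either component can be swapped without touching the rest.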

April 1st – June 30th, 2023