Section topics#
Gather Wikidata items from Wikipedia blue links.
Run it!#
You need access to one of the Wikimedia Foundation's analytics clients, AKA a stat box. Then:
me@my_box:~$ ssh stat1008.eqiad.wmnet # Or pick another one
me@stat1008:~$ export http_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ export https_proxy=http://webproxy.eqiad.wmnet:8080
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/structured-data/section-topics.git st
me@stat1008:~$ cd st
me@stat1008:~/st$ conda-analytics-clone MY_ENV
me@stat1008:~/st$ source conda-analytics-activate MY_ENV
(MY_ENV) me@stat1008:~/st$ conda env update -n MY_ENV -f conda-environment.yaml
(MY_ENV) me@stat1008:~/st$ python section_topics/pipeline.py MY_WEEKLY_SNAPSHOT
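MY_WEEKLY_SNAPSHOT is a snapshot date in YYYY-MM-DD form. For instance, a hypothetical run against the 2023-05-01 snapshot (assuming that snapshot is available in the Data Lake) would be:
(MY_ENV) me@stat1008:~/st$ python section_topics/pipeline.py 2023-05-01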
Get --help#
(MY_ENV) me@stat1008:~/st$ python section_topics/pipeline.py --help
usage: pipeline.py [-h] [-w /hdfs_path/to/dir/] [-i /path/to/file.txt]
[-p hdfs_path/to/parquet] [-s /path/to/file.json] [-l N]
[-t hdfs_path/to/parquet] [-q /path/to/file1.txt ...] [--handle-media]
[-m /path/to/file.txt] [--keep-lists-and-tables]
YYYY-MM-DD
Gather section topics from Wikitext
positional arguments:
YYYY-MM-DD snapshot date
options:
-h, --help show this help message and exit
-w /hdfs_path/to/dir/, --work-dir /hdfs_path/to/dir/
Absolute HDFS path to the working directory. Default:
"section_topics" in the current user home
-i /path/to/file.txt, --input-wikis /path/to/file.txt
plain text file of wikis to process, one per line. Default: all
Wikipedias, see "data/wikipedias.txt"
-p hdfs_path/to/parquet, --page-filter hdfs_path/to/parquet
HDFS path to parquet of (wiki, page revision ID) rows to exclude,
as output by "scripts/check_bad_parsing.py". Must be relative to
the working directory. Default: badly parsed ptwiki articles, see
"2022-10_ptwiki_bad" in the working directory
-s /path/to/file.json, --section-title-filter /path/to/file.json
JSON file of `{ wiki: [list of section titles to exclude] }`.
Default: see "data/section_titles_denylist.json"
-l N, --length-filter N
exclude sections whose content length is less than the given
number of characters. Default: 500
-t hdfs_path/to/parquet, --table-filter hdfs_path/to/parquet
HDFS path to parquet with a dataframe to exclude, as output by
"scripts/detect_html_tables.py". Must be relative to the working
directory. The dataframe must include ('wiki_db', 'page_id',
'section_title') columns. Default: ar, bn, cs, es, id, pt, ru
sections with tables, see "20230301_target_wikis_tables" in the
working directory
-q /path/to/file1.txt ..., --qid-filter /path/to/file1.txt ...
plain text file(s) of Wikidata IDs to exclude, one per line.
Default: see "data/qids_for_all_points_in_time.txt" and
"data/qids_for_media_outlets.txt"
--handle-media separate media links and dump them to "media_links". WARNING: the
pipeline execution time will increase to roughly 20 minutes
-m /path/to/file.txt, --media-prefixes /path/to/file.txt
plain text file with media prefixes, one per line. Default: all
Wikipedia ones, see "data/media_prefixes.txt". Ignored if "--
handle-media" is not passed
--keep-lists-and-tables
don't skip sections with at least one standard wikitext list or
table
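As a sketch of how the filters compose (the file name, wikis, section titles, and date below are illustrative, not the shipped defaults), you can write your own section-title denylist in the documented { wiki: [list of section titles to exclude] } shape and pass it along with a stricter length filter:
(MY_ENV) me@stat1008:~/st$ cat > my_denylist.json <<'EOF'
{
  "enwiki": ["References", "External links"],
  "ptwiki": ["Referências", "Ligações externas"]
}
EOF
(MY_ENV) me@stat1008:~/st$ python section_topics/pipeline.py -s my_denylist.json -l 1000 2023-05-01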
Get your hands dirty#
Install the development environment:
me@stat1008:~/st$ conda-analytics-clone MY_DEV_ENV
me@stat1008:~/st$ source conda-analytics-activate MY_DEV_ENV
(MY_DEV_ENV) me@stat1008:~/st$ conda env update -n MY_DEV_ENV -f dev-conda-environment.yaml
Test#
(MY_DEV_ENV) me@stat1008:~/st$ python -m pytest tests/
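pytest's usual selection flags work too; for example, to run only the tests whose names match a keyword, verbosely (the keyword is illustrative):
(MY_DEV_ENV) me@stat1008:~/st$ python -m pytest tests/ -k "normalize" -v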
Lint#
(MY_DEV_ENV) me@stat1008:~/st$ pre-commit install
At every git commit, pre-commit will run the checks and autofix or tell you what to fix.
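You can also run the same checks on demand, without committing:
(MY_DEV_ENV) me@stat1008:~/st$ pre-commit run --all-files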
Docs#
(MY_DEV_ENV) me@stat1008:~/st$ sphinx-build docs/ docs/_build/
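The HTML ends up in docs/_build/. One way to preview it from your local box (assuming port 8000 is free on both ends) is to serve the directory on the stat box and tunnel the port, much like the Airflow tunnel further below:
(MY_DEV_ENV) me@stat1008:~/st$ python -m http.server --directory docs/_build/ 8000
me@my_box:~$ ssh -N stat1008.eqiad.wmnet -L 8000:localhost:8000
Then browse to http://localhost:8000/.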
Trigger an Airflow test run#
Follow this walkthrough to simulate a production execution of the pipeline on your stat box. Inspired by this snippet.
Build your artifact#
1. Pick a branch you want to test from the drop-down menu
2. Click on the pipeline status button (it should be a green tick)
3. Click on the play button next to publish_conda_env and wait until it's done
4. On the left sidebar, go to Packages and registries > Package Registry
5. Click on the first item in the list, then copy its Asset URL. It should be something like https://gitlab.wikimedia.org/repos/structured-data/section-topics/-/package_files/1321/download
Get your artifact ready#
me@stat1008:~$ mkdir artifacts
me@stat1008:~$ cd artifacts
me@stat1008:~/artifacts$ wget -O MY_ARTIFACT MY_COPIED_ASSET_URL
me@stat1008:~/artifacts$ hdfs dfs -mkdir artifacts
me@stat1008:~/artifacts$ hdfs dfs -copyFromLocal MY_ARTIFACT artifacts
me@stat1008:~/artifacts$ hdfs dfs -chmod -R o+rx artifacts
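To double-check that the artifact landed where the DAG will look for it (the full HDFS path should match the hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT value used in the Airflow variable below):
me@stat1008:~/artifacts$ hdfs dfs -ls artifacts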
Spin up an Airflow instance#
On your stat box:
me@stat1008:~$ git clone https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags.git
me@stat1008:~$ cd airflow-dags
me@stat1008:~/airflow-dags$ sudo -u analytics-privatedata rm -fr /tmp/MY_AIRFLOW_HOME # If you've previously run the next command
me@stat1008:~/airflow-dags$ sudo -u analytics-privatedata ./run_dev_instance.sh -m /tmp/MY_AIRFLOW_HOME -p MY_PORT platform_eng
On your local box:
me@my_box:~$ ssh -t -N stat1008.eqiad.wmnet -L MY_PORT:stat1008.eqiad.wmnet:MY_PORT
Trigger the DAG run#
1. Go to http://localhost:MY_PORT/ in your browser
2. On the top bar, go to Admin > Variables
3. Click on the middle button (Edit record) next to the platform_eng/dags/section_topics_dag.py Key
4. Update it to { "conda_env" : "hdfs://analytics-hadoop/user/ME/artifacts/MY_ARTIFACT" }
5. Add any other relevant DAG properties
6. Click on the Save button
7. On the top bar, go to DAGs and click on the section_topics slider. This should trigger an automatic DAG run
8. Click on section_topics
You're all set!
Release#
1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button and select trigger_release
3. If the job went fine, you'll find a new artifact in the Package Registry
We follow Data Engineering's workflow_utils:
- the main branch is on a .dev release
- releases are made by removing the .dev suffix and committing a tag
Deploy#
1. On the left sidebar, go to CI/CD > Pipelines
2. Click on the play button and select bump_on_airflow_dags. This will create a merge request at airflow-dags
3. Double-check it and merge
4. Deploy the DAGs:
me@my_box:~$ ssh deployment.eqiad.wmnet
me@deploy1002:~$ cd /srv/deployment/airflow-dags/platform_eng/
me@deploy1002:/srv/deployment/airflow-dags/platform_eng$ git pull
me@deploy1002:/srv/deployment/airflow-dags/platform_eng$ scap deploy
See the docs for more details.
API documentation#
- The data pipeline
  - SECTION_LEVEL
  - SECTION_ZERO_TITLE
  - STRIP_CHARS
  - SUBSTITUTE_PATTERN
  - get_monthly_snapshot()
  - load_pages()
  - load_qids()
  - load_redirects()
  - apply_filter()
  - look_up_qids()
  - wikitext_headings_to_anchors()
  - normalize_heading()
  - normalize_heading_column()
  - normalize_denylist()
  - parse_excluding()
  - extract_sections()
  - normalize_wikilinks()
  - handle_media()
  - clean_up_links()
  - resolve_redirects()
  - gather_section_topics()
  - compute_relevance()
  - compose_output()
  - parse()
- Queries to the Data Lake