I make the data and code from most of my projects publicly available.
Here I collect a few of them, including bespoke self-teaching resources.
Retrieving and Generating Data using LLMs
Python code and slides for accessing LLMs through an API. Visit the GitHub repository.
This open‑source notebook collection and slides demonstrate two complementary LLM paradigms, retrieval and generation, for turning raw text into structured, research‑ready data.
Retrieval notebooks show how to mine large document corpora to extract causal edges, stance labels, demographic attributes and other key fields (e.g., the pipeline powering www.causal.claims).
Generation notebooks start from minimal seed prompts and leverage the model’s prior to build production networks, innovation profiles and context‑aware keyword dictionaries (see aipnet.io and www.academicexpression.online).
Across both strands you will find hands‑on modules for prompt engineering, JSON‑schema enforcement, cost‑efficient batch calling, embedding‑based code mapping (HS6 / JEL) and validation routines such as modal voting and cosine sanity checks. By the end, users can scale or adapt each workflow—whether analysing messy policy PDFs or constructing supply‑chain graphs—while keeping costs predictable and outputs auditable.
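To make the validation routines mentioned above concrete, here is a minimal sketch of modal voting and a cosine sanity check in plain Python. It is not taken from the notebooks: the function names, the `min_share` and `threshold` defaults, and the example labels are all illustrative assumptions. The idea is that the same prompt is run several times and the most frequent label is kept only if it wins a large enough share of the runs, while two embedding vectors are compared to flag mappings that look implausible.

```python
from collections import Counter
import math


def modal_vote(labels, min_share=0.5):
    """Return (winning_label, share) across repeated LLM runs.

    The winning label is None when the most common label falls below
    min_share (a hypothetical cutoff chosen for illustration).
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    share = count / len(labels)
    return (label if share >= min_share else None), share


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def passes_cosine_check(u, v, threshold=0.8):
    """Flag whether two embeddings are close enough (threshold is assumed)."""
    return cosine_similarity(u, v) >= threshold


# Example: five hypothetical stance labels returned for the same document.
runs = ["pro", "pro", "anti", "pro", "neutral"]
label, share = modal_vote(runs)
print(label, share)
```

Items whose vote share is low, or whose embedding fails the similarity check, can then be set aside for manual review rather than silently accepted.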
Causal Claims in Economics. Joint with Thiemo Fetzer.
Data on the knowledge graph of 45,000 economics papers.
Check our dedicated website: www.causal.claims for data, tools and the Causal Claims LLM.
AI-Generated Production Networks: Measurement and Applications to Global Trade. Joint with Thiemo Fetzer, Peter John Lambert, Bennet Feld.
Product-level input-output data and data on product importance.
Check our dedicated website: aipnet.io for data and tools.
Political Expression of Academics on Twitter. Joint with Thiemo Fetzer.
Nature Human Behaviour.
Data on the political stance of 300,000 academics from 2016 to 2022.
Check our dedicated website: www.academicexpression.online