Turning a Public Data Source into a 29 Euro Monthly Feed
Selling data as a subscription is the most boring AI side hustle I have run, and it is the one that prints the cleanest margin. The product is not a flashy chatbot, it is a spreadsheet that refreshes on a schedule and lands in the buyer's inbox or API. The crawler is small, the legal filter is strict, and the customer list is short and loyal. I built mine in four weekends, and it has been running mostly untouched for five months. Here is how I pick the niche, how I keep it legal, and how I land the first paying customers.
The niche test, before any code
A sellable data feed has three properties: the source is public, the data changes often enough to justify a subscription, and the buyers are already paying someone for an uglier version of it. I rule out anything that fails even one of these. I shortlist candidates by posting a two-line pitch in industry forums and small subreddits, the same filter I use for my AI chatbot subscription. If I do not get at least four serious replies in 72 hours, the niche dies on the spot. Building a feed nobody asked for is the failure mode I refuse to repeat.
The legal checklist I will not cross
This is the step that kills most projects, and rightly so. Before I write a single line of crawler code, I answer five questions about the source. Is the data behind a login? Do the terms of service forbid automated access? Does the robots.txt exclude the paths I want? Does my crawl rate exceed what a polite user agent would do? Does the output expose personal data in a way the original publisher does not? If any answer is uncomfortable, I move on to the next candidate. I keep a written record per source, dated, so that if a dispute ever surfaces I can demonstrate the check. Legal hygiene here overlaps with the paper trail I keep on domain flips; boring documentation saves you later.
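The robots.txt question is the only one on that list a script can answer for you. A minimal sketch using only the standard library; the user agent name and the example rules are illustrative, and in practice you would pass in the robots.txt body you fetched from the source:

```python
from urllib.robotparser import RobotFileParser

def path_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Answer the robots.txt line of the checklist against a robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, path)
```

Run it once per source before any crawl code exists, and keep the result in the dated written record alongside the other four answers.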
The stack, kept deliberately small
One small VPS at 6 euros per month, Python with requests and selectolax for parsing, SQLite for state, a rotating residential proxy pool at 15 euros monthly, a single cron job every 60 minutes. No Kafka, no queues, no microservices. Total infrastructure cost is 21 euros per month. AI enters in exactly two places: first, drafting the extraction selectors from a sample page, which saves maybe three hours; second, normalising messy free text fields like company names and locations. The crawler itself is 380 lines of deterministic Python. Boring code, easy to debug at 11 p.m. when something breaks.
- Fetch list page, diff against last snapshot, extract new item URLs.
- Fetch each new item, parse fields, validate types.
- Normalise free text with a cheap LLM call, cap cost at 0.002 euros per row.
- Deduplicate against SQLite by hash of the normalised payload.
- Append to daily CSV, push to subscribers via email plus a tiny JSON endpoint.
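The last two bullets, deduplication and the CSV append, can be sketched with only the standard library. The table layout and function names here are mine, not lifted from the real crawler, and the fetch and LLM steps are left out because they depend on the source and model:

```python
import csv
import hashlib
import sqlite3

def payload_hash(row: dict) -> str:
    """Stable hash of the normalised payload, used as the dedup key."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def init_state(path: str = ":memory:") -> sqlite3.Connection:
    """SQLite holds the only state the pipeline needs: which payloads it has seen."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS seen (hash TEXT PRIMARY KEY)")
    return db

def append_new_rows(db, rows, out_csv, fieldnames):
    """Deduplicate rows by payload hash, append the survivors to the daily CSV."""
    written = 0
    with open(out_csv, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if f.tell() == 0:  # fresh file: emit the header once
            writer.writeheader()
        for row in rows:
            cur = db.execute(
                "INSERT OR IGNORE INTO seen (hash) VALUES (?)",
                (payload_hash(row),),
            )
            if cur.rowcount:  # 1 if inserted, 0 if the hash was already there
                writer.writerow(row)
                written += 1
    db.commit()
    return written
```

The `INSERT OR IGNORE` against a primary key makes re-runs idempotent, which is what lets a dumb hourly cron job survive duplicate list pages without any queue in front of it.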
Pricing, packaging, and the first 14 buyers
I priced at 29 euros monthly from day one, no free trial. A free trial would have brought window shoppers, and this product survives on buyers who already know the pain. The first three subscribers came from the same forum where I tested the pitch. The next eight came from a single cold email batch I sent to 40 small agencies that were clearly doing the same work manually. Reply rate was 28 percent, and most of those replies eventually converted. The remaining three found me through search after two short write ups I posted. Revenue at month five: 406 euros monthly, costs 21 infrastructure plus roughly 85 euros of token and proxy usage in the heaviest month, net margin around 300 euros. That is a 74 percent margin on a product that takes me roughly 90 minutes a week of maintenance. The unit economics are closer to my paid newsletter than to most AI wrapper plays.
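The month-five numbers above reduce to arithmetic worth writing down once, if only to sanity-check your own niche before building anything:

```python
subscribers, price = 14, 29.0
mrr = subscribers * price              # 406 euros of monthly recurring revenue
infrastructure = 21.0                  # VPS plus proxy pool
variable = 85.0                        # tokens and proxy usage, heaviest month
net = mrr - infrastructure - variable  # 300 euros
margin = net / mrr                     # roughly 0.74
```

Note that the variable cost is the heaviest month, not the average, so the real margin is slightly better than the headline figure.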
Maintenance, the part nobody shows in a sales page
Every Tuesday evening I open the dashboard, skim the error log, and patch whatever changed. In five months I have touched the selectors four times. Two were minor layout tweaks on the source, one was a silent 403 that needed a new user agent rotation, one was a breaking schema change that cost me a full evening. I budget 90 minutes per week and use roughly 60 on average. Without that fixed slot the feed would rot inside a quarter, which is the failure mode I warn about on the AI Side Hustles hub.
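The 403 fix mentioned above boils down to walking a list of user agent strings until one gets through. A minimal sketch; the header strings and the injected `fetch` callable are illustrative (in practice `fetch` would be a thin wrapper over `requests.get`), kept injectable so the rotation logic is testable without the network:

```python
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) feed-bot/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) feed-bot/1.0",
]

def fetch_with_rotation(url, fetch, agents=USER_AGENTS):
    """Try each user agent in turn; fail only when every one of them gets a 403."""
    for agent in agents:
        status, body = fetch(url, agent)
        if status != 403:
            return status, body
    raise RuntimeError(f"all user agents blocked for {url}")
```

When every agent is blocked the function raises instead of returning empty data, which is exactly what you want a Tuesday-evening error log skim to surface.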
Sell the schema, not the scrape
Buyers do not care that you scrape. They care that your output is a clean, stable schema they can drop into their own tools without renaming columns. I publish the schema on the sales page with an example CSV and a tiny JSON Schema file. That single detail closed at least three of my first ten subscribers who had been burned by a previous feed that changed columns without warning. A one-page schema document is worth more than any landing copy I could write.
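The JSON Schema file for a feed like this can be a single small document; the field names below are illustrative, not my actual schema:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "feed-row",
  "type": "object",
  "required": ["id", "company", "location", "first_seen"],
  "properties": {
    "id": {"type": "string"},
    "company": {"type": "string"},
    "location": {"type": "string"},
    "first_seen": {"type": "string", "format": "date-time"}
  },
  "additionalProperties": false
}
```

The `additionalProperties: false` line is the promise the burned subscribers were looking for: no column appears or disappears without a version bump.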
Do not scrape anything that requires authentication, even if it looks easy. The one time I bent this rule, on a source that had a very thin login wall, I lost a week arguing with the publisher and had to refund two subscribers. A narrow public source with modest volume beats a rich gated source with legal risk every single time. If the only way to build the feed is behind a login, build a different feed.
Frequently asked
Is scraping public data legal?
Public data with no login and no clickwrap terms that forbid automated access is the safe zone in most jurisdictions. Scraping behind a login, bypassing rate limits, or ignoring a robots.txt exclusion moves you out of that zone fast. I keep a one-page legal checklist per source and refuse any feed that cannot pass every line of it.
How much can you really charge for a niche feed?
My feed sits at 29 euros monthly with 14 paying subscribers, 406 euros of recurring revenue. A general news feed would not survive at that price; a narrow industry feed with 60 minute freshness and a clean schema does. Buyers are almost always small agencies or one-person research desks who value the time saved, not the raw data.
Can AI replace the scraping pipeline?
AI writes the extraction prompts and normalises messy fields, but the plumbing (scheduling, retries, deduplication, change detection) is still boring code. I tried running the whole pipeline through an LLM agent and it worked for a weekend, then quietly drifted. Use AI for the fuzzy parts, keep the deterministic parts deterministic.
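The retry part of that boring plumbing is representative of how small the deterministic code stays. A minimal sketch; the attempt count and delays are illustrative, and `sleep` is injectable so the logic can be exercised without waiting:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Ten lines like these do not drift over a weekend, which is the whole argument for keeping them out of the agent's hands.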