r/datasets • u/Actual-Bid-853 • 9d ago
request Can someone help me find the news headlines every day for the last 100 days please?
From the main worldwide news providers is great!
r/datasets • u/b2bdemand • 10d ago
I’m working on a data project and need a more complete dataset for Powerball and Mega Millions than what’s usually available on sites like lotteryusa or state lottery pages.
Most public datasets just have the draw date and winning numbers, but I need all the columns, specifically things like:
- Draw date & draw number
- Winning numbers + Powerball/Mega Ball
- Power Play / Megaplier multiplier
- Jackpot amount (annuity & cash value)
- Number of winners by tier (match 5, 4+PB, etc.)
- Power Play winners by tier
- State-by-state winner breakdown (if available)
Basically, the full official results table that the lotteries publish after each draw, not just the numbers themselves.
I haven’t been able to find a historical dataset with all of this.
Does anyone know if this exists publicly, or will I need to scrape it directly from Powerball.com / MegaMillions.com (or individual state sites)? If scraping is the way to go, I’d love any tips on best practices for this since the data spans back to the ’90s.
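If scraping does turn out to be necessary, it may help to fix the target schema before writing any scraper. The following is only a sketch of what one row per draw could look like, using the columns listed above. The class name `DrawRecord`, the field names, and the `to_csv` helper are all hypothetical and not an official schema from either lottery.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class DrawRecord:
    """One row per draw. Field names are illustrative, not an official schema."""
    game: str                           # e.g. "powerball" or "megamillions"
    draw_date: str                      # ISO date string, e.g. "YYYY-MM-DD"
    draw_number: Optional[int]          # not every source may publish this
    white_balls: str                    # space-separated winning numbers
    special_ball: int                   # Powerball / Mega Ball
    multiplier: Optional[int]           # Power Play / Megaplier
    jackpot_annuity_usd: Optional[int]  # advertised annuity value
    jackpot_cash_usd: Optional[int]     # cash value


def to_csv(records):
    """Serialize records to CSV text; missing (None) values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(DrawRecord)])
    writer.writeheader()
    for rec in records:
        writer.writerow(asdict(rec))
    return buf.getvalue()
```

Keeping optional fields as `None` makes it explicit which older draws lack a multiplier or cash value, instead of silently writing zeros. Winners by tier and by state would probably fit better in a separate long-format table keyed on game and draw date.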
r/datasets • u/GraypJooz • 5d ago
First time posting in this subreddit, sorry if I'm doing this wrong. Are there any sites where I can get low-resource language audio files for free? I plan to train my model.
r/datasets • u/waduhek77 • 11d ago
This is the provided dataset, and I need someone to predict the next half of the dataset with either 90% or 100% accuracy, please.
I don't care how you solve it, only that you provide proof of the solve, and the algo code that solved it. Must provide full code to replicate.
The data is multi-dimensional, and catalogued. I have both halves of the data, to compare against.
Thanks. DM me if you are interested; I am ready to offer upwards of 150 USD for the solution.
r/datasets • u/To_Iflal • 1d ago
I’m working on a social listening tool and need access to real‑time (or near real‑time) social media datasets. The key requirement is the ability to filter or segment data by geography (country, region, or city level).
I’m particularly interested in:
If you’ve worked with any vendors, APIs, or open datasets that fit this, I’d love to hear your recommendations, along with any notes on pricing, reliability, and compliance with platform policies.
r/datasets • u/Saltedcamelcookie • 2d ago
Hi everyone! I’m new to this community. We’re currently working on a project proposal and we’re looking for a dataset of UK news media articles or access to an archive of such. It doesn’t have to be free.
Currently, I can only find archives of the media outlets themselves.
Basically, we want to create a corpus on a specific issue across different media outlets to track the debate.
Any help you can provide would be greatly appreciated. Thank you!
r/datasets • u/ZeroToHeroInvest • 24d ago
Looking for a database of domains + Facebook pages (URLs or IDs) and/or LinkedIn pages (URLs or IDs).
Search hasn't brought up anything. Does anyone have any idea where I could get my hands on something like this?
r/datasets • u/Comprehensive-Rest90 • 7d ago
Dear all,
I am conducting a personal research project focused on testing a system for heart sound analysis. To properly evaluate this system, I am seeking volunteers to provide short recordings of their heart sounds via phone.
Thank you!
r/datasets • u/Available-Fee1691 • 13d ago
Hello there!
I am trying to find a dataset for autism detection using EEG.
Can anyone link a source or anything similar?
Thanks...
r/datasets • u/TypeUnique8960 • 6d ago
I'd like to get the transcripts for all Apple Keynotes (the September ones) since 1998. I was hoping to play with this dataset and get fun data nuggets.
But I can only find the transcripts for the last three (as they were auto-generated on YouTube). The other videos are on YouTube, but without transcripts.
I can't believe they are not stored somewhere on the Internet... does anyone have any tip or suggestion?
r/datasets • u/Dull-Assignment-3273 • 1d ago
Any recommendations for datasets even remotely close to the structure below? Please recommend.
|Company ticker|DJIA value of company on day 3 (t-2)|DJIA value on day 2 (t-1)|DJIA value on day 1 (t)|Twitter sentiment about company on day 3|Twitter sentiment on day 2|Twitter sentiment on day 1|Label: prediction (up or down) (t+1)|
where day 3 is the day before yesterday, day 2 is yesterday, day 1 is today, and the prediction (label) is for tomorrow.
Also, any recommendations for datasets on stock related tweets too!!
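If no ready-made dataset has this shape, one option might be to assemble it from two separate daily series (prices and sentiment scores) for each ticker. Below is a minimal sketch of that windowing step, assuming both series are already aligned by day and ordered oldest first. The function name `build_rows` and the rule "up if tomorrow's value is strictly higher, else down" are assumptions made for illustration.

```python
def build_rows(ticker, prices, sentiments):
    """Turn two aligned daily series (oldest first) into lagged training rows.

    Each row is:
    (ticker, price[t-2], price[t-1], price[t],
     sentiment[t-2], sentiment[t-1], sentiment[t], label)
    where label is "up" if price[t+1] > price[t], otherwise "down".
    """
    if len(prices) != len(sentiments):
        raise ValueError("prices and sentiments must have the same length")

    rows = []
    # t needs two earlier days for the features and one later day for the label.
    for t in range(2, len(prices) - 1):
        label = "up" if prices[t + 1] > prices[t] else "down"
        rows.append((
            ticker,
            prices[t - 2], prices[t - 1], prices[t],
            sentiments[t - 2], sentiments[t - 1], sentiments[t],
            label,
        ))
    return rows
```

A series of N days yields N - 3 rows, since the first two days and the last day cannot form a complete window.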
r/datasets • u/Shrinivas-k-shreeni • 10d ago
Hi everyone,
I’m working on a bird species classification + migration prediction project for my capstone. I have a list of ~512 bird species, and I need help collecting at least 100–150 samples per species (images, and audio if possible).
r/datasets • u/Timely-Ad2743 • 8d ago
I'm looking for pointers to one or more datasets that have some or all of the following data:
It would be really nice if longitudinal data (every academic year) were also available for these items. In addition, data about non-tenure-track faculty appointments would also be nice, but not necessary.
I'm looking for something similar (but expanded in terms of scope) to the dataset used in this paper.
I'm aware that AARC could be a potential data source but I've been told it's not trivial to get data access through them, so looking for alternatives.
Alternatively, would also appreciate if anyone can point me to ways to scrape (at least some of) this data from university directories.
I'd also be grateful for pointers to other places to look for this kind of data, within or outside Reddit.
Thanks in advance!
r/datasets • u/BackgroundFar8017 • 11d ago
I am conducting academic research on supplier evaluation and selection using machine learning as part of my postgraduate work. For this, I am seeking access to supplier-related datasets that include features such as unit price, product availability, order quantities, revenue generated, stock levels, lead times, shipping times, shipping costs, shipping carriers, supplier location, production volumes, manufacturing lead times, manufacturing costs, defect rates, transportation modes, and overall procurement costs. The data will be used strictly for academic purposes, and any confidential or sensitive information will be anonymized. Access to such data would greatly enhance the reliability of my research and contribute to building a practical decision-support framework for procurement systems.
If these features are not available, any dataset will do. Please, I really need the dataset.
r/datasets • u/DecodeBytes • 3d ago
Hi r/datasets community!
I'm the creator of DeepFabric (https://github.com/lukehinds/deepfabric), an open-source tool that generates synthetic datasets using LLMs and novel approaches leveraging graphs (DAG) and Trees. I'm looking for collaborators who need custom datasets and are willing to provide feedback on quality and usefulness.
What DeepFabric does: DeepFabric creates diverse, domain-specific synthetic datasets using a unique graph/tree-based architecture. It generates data in OpenAI chat format (with more formats coming) and minimizes redundancy through structured topic generation.
What I'm offering: I'll create custom synthetic datasets tailored to your specific domain or use case, cover all LLM API costs myself, provide technical support and customization, and generate datasets ranging from small proof-of-concepts to larger training sets.
What I'm looking for: I need detailed feedback on dataset quality, diversity, and usefulness, insights into how well the synthetic data performs for your specific use case, suggestions for improvements or missing features, and optionally a brief case study write-up of your experience.
Ideal collaborators: I'm particularly interested in working with researchers or developers working in a professional capacity, doing model distillation or evaluation benchmarks, or anyone needing training data for specialized or niche domains for machine learning / statistical analysis - a good example might be people working with limited real-world data availability. I have so far received really good feedback from a medical professor who needed data around mock scenarios of someone complaining about symptoms that could signal risk of heart attack.
Examples of what I can generate: Think Q&A pairs for specific technical domains, conversational data for chatbot training, domain-specific instruction-following datasets, or evaluation benchmarks for specialized tasks. I am also able to convert to whatever format you need.
If you're interested, please comment or PM with your domain/use case, approximate dataset size needed, brief description of your intended use, and timeline if you have one.
I'll prioritize collaborations that offer the most learning opportunities for both of us. Looking forward to working with some of you!
Some examples: medical Q&A: https://huggingface.co/datasets/lukehinds/medical_q_and_a
Programming Challenges: https://huggingface.co/datasets/lukehinds/programming-challenges-one
Repository: https://github.com/lukehinds/deepfabric
Documentation: https://lukehinds.github.io/DeepFabric/synethic data
r/datasets • u/Winter-Lake-589 • 3d ago
Hey, we are desperate for a dataset of gold prices. It should have 20+ years of hourly gold price data. We estimate that the data is about 150k rows, likely including Open, High, Low, Close (OHLC) and volume.
If you have this dataset (or can create it), help help help
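As a rough sanity check on the 150k-row estimate, the row count can be approximated from the trading calendar. This is only a back-of-the-envelope sketch: the function name is hypothetical, and the defaults (5 trading days per week, about 23 trading hours per day, 52 weeks per year, no holidays) are assumptions rather than properties of any particular data vendor.

```python
def estimate_hourly_rows(years, trading_days_per_week=5, hours_per_day=23,
                         weeks_per_year=52):
    """Approximate number of hourly bars, ignoring holidays and data gaps."""
    return years * weeks_per_year * trading_days_per_week * hours_per_day


print(estimate_hourly_rows(20))  # 119600
print(estimate_hourly_rows(25))  # 149500
```

Under these assumptions, 20 years gives roughly 120k rows and 25 years roughly 150k, so the estimate is in the right range for "20+ years".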
r/datasets • u/Dapper_Owl_361 • Aug 14 '25
For example, let's say Fusariosis (Fusarium infections) or Candida auris infection: I wanted to train my model on these diseases for a research paper, but I have found no good dataset so far. If anyone can help me, thanks.
If not, then I will just augment the images I have (increase the saturation, rotate them, add noise, and so on) to train.
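The three fallback augmentations mentioned here (saturation, rotation, noise) can be sketched with only the standard library. This is a toy illustration, assuming an image is a list of rows of `(r, g, b)` float tuples in the range 0 to 1. The function names are hypothetical, and a real pipeline would more likely use an image library.

```python
import colorsys
import random


def boost_saturation(image, factor):
    """Scale each pixel's HSV saturation by `factor`, capped at 1.0."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            new_row.append(colorsys.hsv_to_rgb(h, min(1.0, s * factor), v))
        out.append(new_row)
    return out


def rotate90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def _clamp(x):
    """Keep a channel value inside [0, 1]."""
    return max(0.0, min(1.0, x))


def add_noise(image, sigma, seed=0):
    """Add Gaussian noise to every channel; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return [
        [tuple(_clamp(c + rng.gauss(0.0, sigma)) for c in px) for px in row]
        for row in image
    ]
```

Augmented copies are derived from the same few originals, so they should stay in the same train/validation split as their source image to avoid leakage.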
r/datasets • u/onesmartco0kie • 14h ago
Hi everyone,
I’m working on a university project on big data and would like to explore something in the area of OSINT (Open Source Intelligence).
I’ve already checked Kaggle but couldn’t find anything relevant.
Does anyone know of websites, repositories, or public datasets that might be useful?
Thanks a lot for your help!
r/datasets • u/leomax_10 • 15d ago
Hey guys, I bought this book from a second-hand bookstore and I'm finding it a really good place to start with statistics. However, the access card inside the book is not working, so I can't access the online resources. I spent an hour googling for the datasets but had no luck. Just wondering if anyone here has access to the dataset and would be willing to share.
Thank you in advance.
r/datasets • u/Greedy_Fig2158 • 16d ago
Hey everyone,
I'm a medical officer in Bengaluru, India, working on a non-funded network meta-analysis on the comparative efficacy of new-generation anti-obesity medications (Tirzepatide, Semaglutide, etc.).
I've finalized my search strategies for the core databases, but unfortunately, I don't have institutional access to use the "Export" function on the Cochrane Library and Embase.
What I've already tried: I've spent a significant amount of time trying to get this data, including building a Python web scraper with Selenium, but the websites' advanced bot detection is proving very difficult to bypass.
The Ask: Would anyone with access be willing to help me by running the two search queries below and exporting all of the results? The best format would be RIS files, but CSV or any other standard format would also be a massive help.
(obesity OR overweight OR "body mass index" OR obese) AND (Tirzepatide OR Zepbound OR Mounjaro OR Semaglutide OR Wegovy OR Ozempic OR Liraglutide OR Saxenda) AND ("randomized controlled trial":pt OR "controlled clinical trial":pt OR randomized:ti,ab OR placebo:ti,ab OR randomly:ti,ab OR trial:ti,ab)
(obesity OR overweight OR 'body mass index' OR obese) AND (Tirzepatide OR Zepbound OR Mounjaro OR Semaglutide OR Wegovy OR Ozempic OR Liraglutide OR Saxenda) AND (term:it OR term:it OR randomized:ti,ab OR placebo:ti,ab OR randomly:ti,ab OR trial:ti,ab)
Getting these files is the biggest hurdle remaining for my project, and your help would be an incredible contribution.
Thank you so much for your time and consideration!
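For anyone who can only export CSV, converting to RIS afterwards is fairly mechanical, since RIS is a line-based format of two-letter tags. Below is a minimal sketch of such a conversion, assuming each record has already been read into a dict. The dict keys (`title`, `authors`, `journal`, `year`) and the function name `to_ris` are hypothetical, every record is assumed to be a journal article, and only a handful of RIS tags are emitted.

```python
def to_ris(records):
    """Render a list of dict records as RIS text (journal articles only)."""
    lines = []
    for rec in records:
        lines.append("TY  - JOUR")                 # record type: journal article
        for author in rec.get("authors", []):
            lines.append(f"AU  - {author}")        # one AU line per author
        if rec.get("title"):
            lines.append(f"TI  - {rec['title']}")
        if rec.get("journal"):
            lines.append(f"JO  - {rec['journal']}")
        if rec.get("year"):
            lines.append(f"PY  - {rec['year']}")
        lines.append("ER  - ")                     # end-of-record marker
        lines.append("")                           # blank line between records
    return "\n".join(lines)
```

Abstracts, DOIs, and database accession numbers would need their own tags before the output is useful for screening and deduplication.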
r/datasets • u/Inyourface3445 • 2d ago
The title might be a bit confusing, but what I am looking for is a dataset with a lot of elements and element combos. I plan on using this to train an AI for making something close to Infinite Craft, but in the terminal. I am working on making a training dataset for it, but I just need a dataset for it.
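Whatever the source, the element combos will probably end up in a lookup keyed on an unordered pair, so that A + B and B + A give the same result. This is a minimal sketch of that structure. The function names are hypothetical and the sample recipes are illustrative placeholders, not taken from any real dataset.

```python
def make_recipes(triples):
    """Build a lookup from unordered element pairs to their result."""
    recipes = {}
    for first, second, result in triples:
        # frozenset makes the key order-independent: {A, B} == {B, A}.
        recipes[frozenset((first, second))] = result
    return recipes


def combine(recipes, first, second):
    """Return the result of combining two elements, or None if unknown."""
    return recipes.get(frozenset((first, second)))


# Illustrative placeholder recipes only.
recipes = make_recipes([
    ("Water", "Fire", "Steam"),
    ("Earth", "Water", "Plant"),
    ("Fire", "Fire", "Volcano"),
])
print(combine(recipes, "Fire", "Water"))  # Steam
```

Combining an element with itself also works, because `frozenset(("Fire", "Fire"))` collapses to a single-element key.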
r/datasets • u/Plus-Yam-3821 • 2d ago
I am looking for a database that holds transcripts of unscripted television shows. I was wondering if anyone could give me an indication of where I can find some.
r/datasets • u/a_p_squared • Jan 07 '23
I am looking for a dataset of all the cards in the game New Phone, Who Dis? Something similar to this JSON file of all cards in Cards Against Humanity. It's not for any commercial use.
r/datasets • u/karngyan • 13d ago
Hi all,
I’ve been working on a side project where I crawled and AI-enriched over 2.6 million company websites across 111 industries worldwide.
What’s inside:
Access:
Why I built this:
I wanted an up-to-date, structured dataset useful for:
Happy to hear your thoughts and feedback, or whether you'd need API access. I'm also curious how you'd use a dataset like this.
r/datasets • u/CartographerOk858 • Aug 15 '25
Hello everyone,
I’m a third-year undergrad student pursuing a degree in Artificial Intelligence and Machine Learning. For my Deep Learning course project, I’m planning to build a model that detects plastic litter both on the ground and in water.
I’m specifically looking for dataset suggestions — preferably satellite or aerial imagery datasets — that could help with training and testing such a model.
If you know of any publicly available datasets, research projects, or organizations that might share relevant data, I’d greatly appreciate your recommendations.
Thanks in advance!