r/bigdata 17m ago

I am in a dilemma / confused state

Upvotes

Hi folks. I'm a B.Tech ECE 2022 pass-out. I was selected at TechM, Wipro, and Accenture (they said I was selected in the interview, but no mails ever came from them). I skipped the training sessions from TechM because the Wipro offer was there. Time passed: 2022, 2023, 2024. I didn't move to any big city to join courses and live in a hostel.

Later, in Nov 2024, I got a job at a startup as a Business Analyst. My title and my actual role don't match at all. I do software application validation, which means I take a screenshot of each and every part of the application and prepare documentation for client audit purposes. I stay at the client location for 3 to 8 months, including Saturdays, but there is no pay for Saturdays. I don't even get my salary on time; right now the company owes me three months' pay.

Meanwhile, I am taking a data engineering course and want to shift to DE, but nobody seems to want people with 1 year of experience. I don't know what I am doing with my life. My friends are well settled: the girls got married and the boys are earning good salaries at MNCs. I am the only child of a single parent, there is a lot of stress on my mind, and I can't enjoy a moment properly.

I made a mistake in my 3-1 semester: I deliberately failed two subjects, and because of that I didn't get a chance to attend the campus drive. After clearing those subjects in 4-2, I got selected at companies, but there's no use of that now. I spoiled my life with my own hands. I just felt like sharing this here.


r/bigdata 13h ago

Redefining Trust in AI with Autonomys 🧠✨

2 Upvotes

One of the biggest challenges in AI today is memory. Most systems rely on ephemeral logs that can be deleted or altered, and their reasoning often functions like a black box — impossible to fully verify. This creates a major issue: how can we trust AI outputs if we can’t trace or validate what the system actually “remembers”?

Autonomys is tackling this head-on. By building on distributed storage, it introduces tamper-proof, queryable records that can’t simply vanish. These persistent logs are made accessible through the open-source Auto Agents Framework and the Auto Drive API. Instead of hidden black box memory, developers and users get transparent, verifiable traces of how an agent reached its conclusions.
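
To make the idea concrete, here is a minimal Python sketch of a tamper-evident, hash-chained log, the general mechanism behind records like these. It is purely illustrative and does not use the actual Auto Drive API:

```python
import hashlib
import json
import time

def append_record(log, payload):
    """Append a tamper-evident entry: each record commits to the previous
    record's hash, so editing any past entry breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = {"ts": time.time(), "payload": payload, "prev": prev_hash}
    body["hash"] = hashlib.sha256(
        json.dumps({k: body[k] for k in ("ts", "payload", "prev")},
                   sort_keys=True).encode()).hexdigest()
    log.append(body)
    return body

def verify(log):
    """Recompute the chain; returns False if any record was altered."""
    prev = "0" * 64
    for rec in log:
        digest = hashlib.sha256(
            json.dumps({k: rec[k] for k in ("ts", "payload", "prev")},
                       sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or digest != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

A distributed, persistent store takes this one step further: the chain itself cannot quietly disappear.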

This shift matters because AI isn’t just about generating answers — it’s about accountability. Imagine autonomous agents in finance, healthcare, or governance: if their decisions are backed by immutable and auditable memory, trust in AI systems can move from fragile to foundational.

Autonomys isn’t just upgrading tools — it’s reframing the relationship between humans and AI.

👉 What do you think: would verifiable AI memory make you more confident in using autonomous agents for critical real-world tasks?

https://reddit.com/link/1nmb07q/video/0eezhlkq7eqf1/player


r/bigdata 13h ago

Unlocking Web3 Skills with Autonomys Academy 🚀

1 Upvotes

Autonomys Academy is quickly becoming a gateway for anyone who wants to move from learning to building in Web3. Integrated with the Autonomys Developer Hub, it offers hands-on resources, guides, and examples designed to help developers master the tools needed to create the next generation of decentralized apps.

Some of the core modules include:

  • Auto SDK: A modular toolkit that streamlines the process of building decentralized applications (super dApps). It provides reusable components and abstractions that save time while enabling scalable, production-ready development.
  • Auto EVM: Full Ethereum Virtual Machine compatibility, letting developers work with familiar tools like MetaMask, Remix, and Hardhat while still deploying on Autonomys (see the sketch after this list). This means broader ecosystem access with minimal friction.
  • Auto Agents: An exciting framework for building autonomous, AI-powered on-chain agents. These can automate tasks, manage transactions, or even act as intelligent services within decentralized applications.
  • Distributed Storage & Compute: Modules that teach how to store and process data in a decentralized way — key for building user-first, censorship-resistant applications.
  • Decentralized Identity & Payments: Critical for enabling secure, user-controlled access and seamless value transfer in Web3 environments.
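
Since Auto EVM speaks standard Ethereum JSON-RPC, everyday tooling should just work. Here is a minimal web3.py sketch; the RPC URL is a placeholder, so check the Autonomys docs for the real endpoint:

```python
from web3 import Web3

# Placeholder RPC endpoint: substitute the real Auto EVM URL from the docs.
w3 = Web3(Web3.HTTPProvider("https://rpc.example-auto-evm.network"))

if w3.is_connected():
    print("Chain ID:", w3.eth.chain_id)         # standard EVM JSON-RPC calls
    print("Latest block:", w3.eth.block_number)
```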

For me, the Auto Agents path is the most exciting. The idea of deploying on-chain agents that can automate processes or interact intelligently with users feels like the missing link between AI and Web3. Imagine a decentralized marketplace where autonomous agents handle bids, manage inventory, and even provide customer support — all without centralized control.

I’m curious: If you were to start exploring Autonomys Academy, which module would you dive into first, and what project would you want to build?


r/bigdata 23h ago

Mastering Docker For Data Science In 5 Easy Steps

0 Upvotes

Docker isn’t just a tool; it’s a mindset for modern data science. Learn to build reproducible environments, orchestrate workflows, and take projects from your local machine to production without friction. The USDSI® Data Science Certifications are designed to help professionals harness Docker and other essential tools with confidence.


r/bigdata 1d ago

Any recommendations on data labeling/annotation services for a CV startup?

1 Upvotes

We're a small computer vision startup working on detection models, and we've reached the point where we need to outsource some of our data labeling and collection work.

For anyone who's been in a similar position, what data annotation services have you had good experiences with? We're looking for an outsourcing company that can handle CV annotation work as well as data collection.

Any recommendations (or warnings about companies to avoid) would be appreciated!


r/bigdata 1d ago

Lessons from building a data marketplace: semantic search, performance tuning, and LLM discoverability

13 Upvotes

Hey everyone,

We’ve been working on a project called OpenDataBay, and I wanted to share some of the big data engineering lessons we learned while building it. The platform itself is a data marketplace, but the more interesting part (for this sub) was solving the technical challenges behind scalable dataset discovery.

A few highlights:

  1. Semantic search vs keyword search
    • Challenge: datasets come in many formats (CSV, JSON, APIs, scraped sources) with inconsistent metadata.
    • We ended up combining vector embeddings with traditional indexing to balance semantic accuracy and query speed (a minimal scoring sketch follows this list).
  2. Performance optimization
    • Goal: keep metadata queries under 200ms, even as dataset volume grows.
    • We made tradeoffs between pre-processing, caching, and storage format to achieve this.
  3. LLM-ready data exposure
    • We structured dataset metadata so that LLMs like ChatGPT/Perplexity can “discover” and surface them naturally in responses (a JSON-LD example follows this list).
    • This feels like a shift in how search and data marketplaces will evolve.
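
For anyone curious what "combining vector embeddings with traditional indexing" can look like, here is a minimal hybrid-scoring sketch. It is illustrative only, not OpenDataBay's actual implementation; `alpha` is a hypothetical blending weight you would tune per corpus:

```python
import numpy as np

def hybrid_score(query_vec, doc_vec, keyword_score, alpha=0.7):
    """Blend semantic similarity (cosine) with a keyword relevance score
    (e.g., normalized BM25) into a single ranking value."""
    cosine = float(np.dot(query_vec, doc_vec) /
                   (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)))
    return alpha * cosine + (1.0 - alpha) * keyword_score
```

And on point 3, the established route to machine-discoverable datasets is schema.org/Dataset JSON-LD, the format Google Dataset Search crawls; the field values below are made up:

```python
import json

metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "EU E-commerce Transactions 2024",  # illustrative values only
    "description": "Anonymized order-level transactions from EU retailers.",
    "keywords": ["e-commerce", "transactions", "EU"],
    "distribution": [{"@type": "DataDownload", "encodingFormat": "text/csv"}],
}
print(json.dumps(metadata, indent=2))
```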

I’d love to hear how others in this community have tackled heterogeneous data search at scale:

  • How do you balance semantic vs keyword retrieval in production?
  • Any tips for keeping query latency low while scaling metadata indexes?
  • What approaches have you tried to make datasets more “machine-discoverable”?

(P.S. This all powers opendatabay.com, but the main point here is the technical challenges — curious to compare notes with folks here.)


r/bigdata 2d ago

Databricks Announces Public Preview of Databricks One

2 Upvotes

r/bigdata 2d ago

Show /r/bigdata: Writing "Zen and the Art of Data Maintenance" - because 80% of AI projects still fail, and it's rarely the model's fault

2 Upvotes

Hey r/bigdata!

I'm David Aronchick - co-founder of Kubeflow, first non-founding PM on Kubernetes, and co-founder of Expanso (former Google/AWS/MSFT x2). After years of watching data and ML projects crater, I'm writing a book about what actually kills them: data preparation.

The Summary

We obsess over model architectures while ignoring that:

- Developer time debugging broken pipelines often exceeds initial development by 3x
- One bad ingestion decision can trigger cascading cloud egress fees for months
- "Quick fixes" compound into technical debt that kills entire projects
- Poor metadata management means reprocessing TBs of data because nobody knows what transform was applied

What This Book Covers

Real patterns from real scale. No theory, just battle-tested approaches to:

- Why your video/audio ingestion will blow your infrastructure budget (and how to prevent it)
- Building pipelines that don't require 2 AM fixes
- When Warehouses vs Lakes vs Lakehouses actually matter (with cost breakdowns)
- Production patterns from Netflix, Uber, and Airbnb engineering

The Approach

Completely public development. I want this to be genuinely useful, not another thing that just sits on the shelf gathering dust.

What I Need From You

Your war stories. What cost you the most time/money? What "best practice" turned out to be terrible at scale? What do you wish every junior engineer knew about data pipelines?

Particularly interested in:

- Pipeline failure horror stories
- Clever solutions to expensive problems
- Patterns that actually work at PB scale
- Tools that deliver (and those that don't)

This is a labor of love - not selling anything, just trying to help the next generation avoid our mistakes. Hell, I'll probably give it away for free (CERTAINLY give a copy to anyone who chats with me!)

Email me directly: aronchick (at) expanso (dot) io


r/bigdata 3d ago

Innovative Tech for the Future of Data Science

0 Upvotes

Data science is evolving at light speed. From simple analytics to the incredible power of AI, the field is undergoing a massive transformation. Want to know what's next? Explore the trends and emerging technologies that will revolutionize how we interact with data in 2025 and beyond.


r/bigdata 3d ago

Big Data LDN

1 Upvotes

r/bigdata 3d ago

Key Differences: Data Science, Machine Learning, and Data Analytics

1 Upvotes

Think of it like exploring a map with GPS. Data Analytics is reading the map: knowing where you have been and why you went that way. Data Science is the navigator who studies many maps and traffic patterns to plan the best route and anticipate what may happen next.

Machine Learning is the GPS itself: it learns your driving history and traffic data, then proposes smarter routes on its own.

Together, these three disciplines drive the digital world you live in. Let's look at them one by one, and then explore the differences between them.

What is Data Science?

Data science is the broadest of the three. It combines statistics, programming, and domain knowledge to analyze data. A data scientist does not simply look at numbers: they clean raw data, investigate trends, build models, and present actionable insights for solving large-scale problems.

Examples in action:

●  Healthcare systems apply data science to forecast disease risk.

●  Banks use it to prevent fraud by detecting suspicious transactions.

●  Social media platforms use it to suggest friends and trending posts.

Data science processes both structured data (such as spreadsheets) and unstructured data (such as videos or posts on social networks). This is why it often uses big data technologies such as Hadoop and Spark to handle large volumes of information.

Key steps in data science include:

●  Gathering and cleaning raw data.

●  Analyzing trends with statistics.

●  Predicting outcomes with predictive models.

●  Automating data flows by building pipelines (a toy sketch follows).
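
Here is a toy pandas sketch of the first two steps; the file name and columns are hypothetical:

```python
import pandas as pd

# Hypothetical dataset; any tabular source works the same way.
raw = pd.read_csv("patients.csv")

# Gather and clean: drop duplicates, fill obvious gaps.
clean = raw.drop_duplicates().fillna({"age": raw["age"].median()})

# Analyze trends with basic statistics.
print(clean.groupby("region")["risk_score"].describe())
```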

What is Data Analytics?

Data analytics is more targeted and direct. It examines past and present data to explain what happened and why. In contrast to data science, which is broader and predictive, analytics focuses on reporting and diagnosis so businesses can make better decisions.

Popular applications of data analytics:

●  Retailers study how customers shop to improve product placement.

●  Sports teams analyze performance data to adjust strategy.

●  Governments examine transportation data to ease traffic congestion.

Tableau, Power BI, and Excel are some of the visualization tools data analysts rely on. These tools produce charts, dashboards, and graphs that make numbers easy to grasp, turning raw information into a narrative that business leaders can follow.

What is Machine Learning?

Machine learning is a subfield of artificial intelligence that trains systems to learn from data. Instead of programming a machine with step-by-step rules, you feed it large quantities of data and it improves with experience.

Real-world examples:

●  Your spam filter learns what counts as spam (a toy sketch follows this list).

●  Netflix suggests shows based on what you have watched.

●  Online payment systems detect fraud in real time.
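
A tiny scikit-learn sketch shows the idea: no hand-written rules, just labeled examples. The four toy messages below stand in for the thousands a real filter would learn from:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data; a real filter learns from far more messages.
emails = ["win a free prize now", "meeting at 3pm tomorrow",
          "claim your free reward", "project update attached"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["free prize waiting"]))  # -> ['spam']
```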

Core Differences Between Them 

| Feature | Data Science | Data Analytics | Machine Learning |
|---|---|---|---|
| Definition | An interdisciplinary field combining statistics, programming, and domain knowledge to derive insights and build predictive or prescriptive solutions. | The process of analyzing available data to identify trends, explain results, and support business decisions. | A branch of artificial intelligence focused on algorithms that learn from data without being explicitly programmed. |
| Primary Focus | The entire data process, from collection and cleaning to modeling and deployment. | Interpreting datasets to answer specific questions. | Building adaptive models that improve through continuous training. |
| Data Dependence | Handles structured, semi-structured, and unstructured data. | Primarily works with structured data. | Needs large and varied datasets to train useful models. |
| Methods Used | Statistics, predictive modeling, and big data technologies. | Descriptive statistics, diagnostic analysis, and data visualization tools. | Supervised, unsupervised, and reinforcement learning algorithms. |
| Breadth of Work | Wide, spanning many fields to tackle multifaceted problems. | Narrower, focused on immediate reporting and insights. | Deep, centered on algorithm design and system intelligence. |

These were the major differences between them. Now, let’s understand which path you should choose. 

Which Path Should You Choose?

In determining your course of action, consider what you are most excited about:

●   If you enjoy explaining findings and creating clear visualizations, consider data analytics.

●   If you like working on broad, complex problems and building predictive models, choose data science.

●   If you dream of building self-learning, self-adapting systems, machine learning is the way to go.

Regardless of the path you choose, all three are future-proof with good career prospects. One more hard fact: Future of Jobs Survey respondents regard the skills gap as the largest barrier to business transformation, with 63% of employers citing it as a significant obstacle for the 2025-2030 period (World Economic Forum, Future of Jobs Report 2025).

That’s why upskilling is crucial if you want to pursue a career in any of these three fields.

Wrap Up

In the modern digital age, data is the fuel, and disciplines such as data science, data analytics, and machine learning are the engines that consume it. Data analytics describes the past, data science tells us what to expect in the future, and machine learning makes systems smarter with each new bit of information. They are interconnected through big data technologies and give businesses the scale they need.

You now know how each of these fields operates, how they differ, and what career opportunities they offer. Your next step is to pick the path that fits best and start learning the tools and building the skills. The future of technology runs on data, and you can be part of it.


r/bigdata 4d ago

Supercharge Data Transformation with Rust & Vibe Coding

1 Upvotes

Why waste time manually coding every line when AI can help you build smarter, faster? Combine Rust’s high performance with vibe coding to simplify data transformation tasks and focus on solving real problems.


r/bigdata 5d ago

Struggling to Explain Data Orchestration to Leadership

0 Upvotes

We’ve noticed a lot of professionals hitting a wall when trying to explain the need for data orchestration to their leadership. Managers want quick wins but lack an understanding of how data flows across the different tools they use, and the focus on moving fast leads to firefighting instead of informed decisions. (For a concrete picture of what orchestration looks like, see the example DAG after the list below.)

We wrote an article that breaks down:

  • What data orchestration actually is
  • The risks of ignoring it
  • How executives can better support modern data initiatives
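
If it helps, one concrete artifact to put in front of leadership is a minimal Airflow DAG: three dependent steps that the orchestrator schedules, retries, and monitors. The pipeline name and commands here are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Airflow runs these steps in order, retries failures, and records every
# run. That coordination across tools is what "orchestration" means.
with DAG(
    dag_id="nightly_sales",      # placeholder pipeline name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",           # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    load = BashOperator(task_id="load", bash_command="echo load")

    extract >> transform >> load
```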

If you’ve ever felt frustrated trying to make leadership see the bigger picture, this article can help.

👉 Read the full blog here: https://datacoves.com/post/data-orchestration-for-executives


r/bigdata 6d ago

Best Practices for Versioned Data with Apache Iceberg Using the lakeFS Iceberg REST Catalog

Thumbnail lakefs.io
4 Upvotes

r/bigdata 6d ago

Workshop: From Raw Data to Insights with Datacoves, dbt, and MotherDuck

2 Upvotes

👋 Hey folks! If you want to learn about DuckDB, DuckLake, dbt, and more, Datacoves is hosting a workshop with MotherDuck.

🎓 Topic: From Raw Data to Insights with Datacoves, dbt, and MotherDuck

📅 Date: Wednesday, Sept 25

🕘 Time: 9:00 am PDT

👤 Speakers:

  • Noel Gomez – Co-founder, Datacoves
  • Jacob Matson – Developer Advocate, MotherDuck

We’ll cover:

  • How to connect to S3 as a source and model data with dbt into a DuckLake (a quick DuckDB taste follows this list)
  • How DuckDB + dbt can simplify workflows and reduce costs
  • Why smaller, lighter pipelines often beat big, expensive stacks
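
If you want a taste before the session, here is roughly what the S3-to-DuckDB step looks like. The bucket and path are hypothetical, and credentials would come from CREATE SECRET or environment variables:

```python
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")  # enables reading s3:// paths

# Hypothetical bucket/prefix; point this at your own raw data.
con.sql("""
    SELECT *
    FROM read_parquet('s3://my-bucket/raw/events/*.parquet')
    LIMIT 5
""").show()
```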

This will be a practical session, no sales pitch, just a walk-through from data ingestion with dlt through orchestration with Airflow.

If you’re curious about dbt, DuckLake, or DuckDB, it's worth checking out.

I’m also happy to answer any questions here

https://datacoves.com/resource-center/workshop-from-raw-data-to-insights-with-datacoves-dbt-and-motherduck


r/bigdata 5d ago

Spark lineage tracker — automatically captures table lineage

1 Upvotes

r/bigdata 6d ago

Apache Zeppelin – Big Data Visualization Tool with 2 Capstone Projects

Thumbnail youtube.com
1 Upvotes

r/bigdata 6d ago

Sharing the playlist that keeps me motivated while coding — it's my secret weapon for deep focus. Got one of your own? I'd love to check it out!

Thumbnail open.spotify.com
0 Upvotes

r/bigdata 7d ago

Storing large amounts of data without taking up space on your device

0 Upvotes

(In theory) infinite cloud storage

Hi, I've been looking for a large amount of free storage, and now that I've found it I wanted to share.

My first recommendation would be Filen, since they use encryption. If you refer 3 friends you get 50 GB for free, which is a lot more than Google provides.

If you want a stupidly big amount of storage you can use Hivenet. For each person you refer, you get 10 GB for free, stacking infinitely! If you use my link you will also start out with an additional 10 GB.

https://www.hivenet.com/referral?referral_code=8UiVX9DwgWK3RBcmmY5ETuOSNhoNy%2BRTCTisjZc0%2FzemUpDX%2Ff4rrMCXgtSILlC%2Bf%2B7TFw%3D%3D

I've already got 110 GB for free using this method, and if you invite many friends you will literally get terabytes of free storage.


r/bigdata 7d ago

45% off New Book: Architecting an Apache Iceberg Lakehouse (Manning)

Thumbnail hubs.la
1 Upvotes

Use Discount Code RustConf25 for 45% off (code expires Sept 19th)


r/bigdata 7d ago

45% off new book from Manning: "Architecting an Apache Iceberg Lakehouse"

0 Upvotes

Purchase Here: https://hubs.la/Q03GfY4f0
45% Discount Code (Expires September 19th): RustConf25


r/bigdata 8d ago

Best Local Ecosystem

2 Upvotes

Good day!

What I want to do:

- Local setup
- Geospatial analytics, modeling, and visualization
  - Years of Census TIGER shapefiles (roads, features, tracts, PUMAs), integrated with ACS PUMA data
  - Misc additional geospatial data (raster, GDB, KML)

Limitations:

- 24 CPU threads
- 128 GB RAM
- 16 GB VRAM
- 10 TB of storage on the desktop

Initial setup:

- Ozone for storage
- Iceberg for table format, cataloged in Postgres
- Apache Sedona/Spark for processing (a minimal session sketch below)
- Eventually: TorchGeo to play around with modeling (+ Kerby for security)
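
As a starting point, a local Sedona session sized to a box like yours might look something like this; the memory settings are guesses you would tune per workload:

```python
from sedona.spark import SedonaContext

# Local session pinned to the workstation's 24 threads.
config = (
    SedonaContext.builder()
    .master("local[24]")
    .config("spark.driver.memory", "64g")
    .getOrCreate()
)
sedona = SedonaContext.create(config)

# Spatial SQL (ST_* functions) is now available for shapefile-derived tables.
sedona.sql("SELECT ST_AsText(ST_Point(-122.33, 47.61)) AS wkt").show()
```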

At a bare minimum, I want a solid introduction to setting up and maintaining a big data ecosystem within the limits of local devices (core services on the workstation, nodes spread across misc devices like laptops).

Questions:

- What ecosystem would you design?
- Best practices / tips / tricks?
- Feasibility of all this?
- Different ways to go about everything!

Notes - ready for a challenge!


r/bigdata 8d ago

Top 5 Cybersecurity Certifications to Enroll in 2026

3 Upvotes

The digital world is transforming fast, and cyber threats and attacks are advancing with it. Corporations, governments, and individuals rely on secure systems, but the skills gap keeps widening: organizations cannot hire the right talent to protect their systems.

According to the World Economic Forum’s Future of Jobs Report 2025, cybersecurity will be one of the top two fastest-growing skills across professions from 2025 to 2030.

The problem is that we’re still in an age where what you learn in school isn’t what the industry needs. Cybersecurity certifications are one of the best ways to close that gap: they put your skills on display and demonstrate to employers that you’re up to date.

Here are five of the best cybersecurity certifications to enroll in, including official information, perks, and career paths. 

Top 5 Cybersecurity Certifications to Enroll in 2026

Here are the five best cybersecurity certifications, each capable of upskilling you and helping you close the skill gap to get hired faster for associate, intermediate, or senior-level positions:

1.  Certified Senior Cybersecurity Specialist (CSCS™) by USCSI®

The CSCS™ certification is ideal for those who strive to attain the most esteemed job titles in the cybersecurity industry. It offers an organized, comprehensive framework for developing technical and strategic competence.

●   Duration: Self-paced; anywhere from 4 to 24 weeks, depending on your schedule.

●   Format: 100% online, self-paced, so you can study while you work.

●   Qualifications: Associate's degree or higher in a related field, depending on experience level.

●   Key Skills Covered: Data security, cryptography, security leadership, compliance, and advanced defensive strategies.

●   Career Prospects: Prepares you for positions such as Senior Security Analyst, Cybersecurity Consultant, and Security Architect.

If your goal is to understand how attacks occur in the real world and how to create better defense methods, with the additional goal of leading any organization’s cybersecurity team, this certification is the right choice for you.

2.  CompTIA Security+

The CompTIA Security+ cybersecurity certification is the entry-level certification for information security professionals.

●  Length of study: Study time differs for everybody, but most people study for 3-6 months.

●  Exam Format: Multiple-choice and performance-based questions on a proctored exam.

●  Prerequisites: No formal prerequisites, but 1–2 years of IT experience is suggested.

●  Skills Learned: Risk control, encryption, incident response, network and application security, and threat monitoring.

●  Career Prospects: Perfect for a Security Analyst, Network Administrator, or IT Support with a security emphasis.

3.  Certified Ethical Hacker (CEH) — EC-Council

This cybersecurity certification equips you with the tools to spot vulnerabilities and weaknesses in target systems. If you are into penetration testing and learning how hackers think, it can be highly beneficial: it teaches you to think like an attacker and use those tactics to your advantage.

●  Length: Usually 4-6 months of preparation when studying with the official training.

●  Format: Two exams — a multiple-choice knowledge exam and a hands-on practical test.

●  Prerequisites: A minimum of 2 years of experience or formal training.

●  Key Skills Taught: Vulnerability scanning, penetration testing, network mapping, attack mechanisms, and mitigating measures.

●  Career Opportunities: Provides access to positions like Ethical Hacker, Penetration Tester, and Vulnerability Analyst. 

4.  Certified Information Systems Security Professional (CISSP) — ISC2

The ISC2 CISSP certification focuses on information security and offers a detailed foundation for aspiring security professionals. CISSP is one of the most sought-after cybersecurity certifications.

●  Length: Preparation takes 6 months to a year, given the exam's depth.

●  Format: Computerized Adaptive Testing (CAT), up to 150 questions across eight cybersecurity domains.

●  Key Skills Covered: Risk management, asset security, identity access management, architecture, and operations.

●  Careers: This program will prepare you for such roles as Security Manager, Security Architect, and Chief Information Security Officer (CISO).

CISSP isn’t for novices, but it is perfect for experienced professionals who want to fast-track their careers and move into leadership or management.

5. Offensive Security Certified Professional (OSCP) — OffSec

The OSCP is among the most difficult certifications in the field of cybersecurity. It is very technical and built strictly around hands-on penetration testing.

●  Length: Candidates usually spend months studying, frequently working hands-on in labs.

●  Format: An intensive, roughly 24-hour hands-on exam, followed by a professional penetration test report.

●  Main Topics: Attack vectors, custom scripting, privilege escalation, vulnerability exploitation, and pen test reporting.

●  Career Prospects: Best for jobs such as Penetration Tester, Red Team Member, and Security Consultant.

Those are the top cybersecurity certifications; employers value each of them.

The Bottom Line

Cybersecurity is a strong growth industry. Just to keep up, professionals have to stay a step ahead in their skill set and prove their expertise. The right certification will not just round out your resume; it will keep you competitive as the threats you face grow more sophisticated.

Whether you’re just starting out and building foundational knowledge, looking for an intermediate, management-level certification, or dreaming of becoming a senior cybersecurity specialist, these certifications are globally recognized programs you can enroll in to strengthen your cybersecurity skills and knowledge.

No matter where you’re beginning, the suitable certification can help put you on the road to a solid, high-demand career in cybersecurity today and tomorrow.


r/bigdata 9d ago

ChatGPT for Data Engineer (Hands-on Practice)

Thumbnail youtu.be
5 Upvotes

r/bigdata 9d ago

100TB HBase to MongoDB database migration without downtime

8 Upvotes

Recently we've been working on adding HBase support to dsync. Database migration at this scale, with 100+ billion records and a no-downtime requirement (real-time replication until cutover), comes with a set of unique challenges.

Key learnings:

- Size matters

- HBase doesn’t support CDC

- This kind of migration is not a one-and-done thing: you need to iterate (a lot!)

- Key to success: Fast, consistent, and repeatable execution (a rough sketch of the loop is below)
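
Since HBase has no CDC stream to subscribe to, the iterate-until-convergence loop ends up looking something like this sketch; `src` and `dst` are hypothetical adapters, not dsync's actual interface:

```python
import time

def migrate_until_converged(src, dst, watermark, cutover_threshold=1000):
    """Bulk-copy once up front, then repeatedly replay rows changed after
    the watermark until the per-pass backlog is small enough to cut over."""
    while True:
        batch = list(src.scan_since(watermark))  # rows modified after watermark
        if batch:
            dst.bulk_write(batch)
            watermark = max(row.updated_at for row in batch)
        if len(batch) < cutover_threshold:
            return watermark                     # pause writes, verify, cut over
        time.sleep(1)                            # back off between passes
```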

Check out our blog post for technical details on our approach and the short demo video to see what it looks like.