Increased Data Quality with "Hive ASR Dictionary"

in HiveDevs, 29 days ago

In the process of transcribing a significant number of Hive-related videos and producing summaries based on their transcripts, I've come across many funny AI interpretations of what's being talked about in all these different livestreams and podcast episodes. For instance "Cal" instead of "Khal", "John Go" instead of "Jongo", and "Sticker Boys" instead of "StickUpBoys". Or how about my favorite: "Community Soak and Talk" (😂) instead of "Community Token Talk".

What is ASR?

But before I continue, just to make sure we're on the same page: ASR stands for "Automatic Speech Recognition", which is the kind of software I've been using for the past month to transcribe an enormous number of videos (we're talking 5k+ videos and recordings). That has taken a huge amount of computing power, primarily from my poor GPU. Anyway, the ASR model is obviously not trained on Hive-specific projects and usernames, so it produces a bunch of imprecise transcriptions.

The Solution

So I simply turned to my go-to LLM and asked it to write me a script that would recursively scan through all files, folders and subfolders within my current directory and replace words based on my pre-defined corrections. Then I started building my database of common Hive-term misspellings and their correct counterparts, such as the ones quoted at the top of this post.

After adding a hundred or so terms, I ran the program recursively on the approximately 30 million (!) words generated over the course of the last month.
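For readers curious what such a script might look like, here is a minimal sketch in Python. It is not the code from the repository: the `CORRECTIONS` dictionary, the function names, and the choice to only touch `.txt` and `.md` files are all assumptions made for illustration. The example terms are the ones quoted at the top of this post.

```python
import re
from pathlib import Path

# Hypothetical corrections table (ASR output -> correct Hive term).
# The real database file in the repository may use a different format.
CORRECTIONS = {
    "Cal": "Khal",
    "John Go": "Jongo",
    "Sticker Boys": "StickUpBoys",
    "Community Soak and Talk": "Community Token Talk",
}


def build_pattern(corrections):
    """Compile one regex that matches any misspelled term as a whole word."""
    # Longest terms first, so multi-word phrases win over shorter overlaps.
    terms = sorted(corrections, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b")


def fix_text(text, corrections, pattern=None):
    """Return (corrected_text, number_of_replacements)."""
    pattern = pattern or build_pattern(corrections)
    return pattern.subn(lambda match: corrections[match.group(0)], text)


def fix_tree(root, corrections, suffixes=(".txt", ".md")):
    """Recursively correct every matching file under root, in place.

    Returns the total number of replacements made.
    """
    pattern = build_pattern(corrections)
    total = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        original = path.read_text(encoding="utf-8")
        fixed, count = fix_text(original, corrections, pattern)
        if count:  # only rewrite files that actually changed
            path.write_text(fixed, encoding="utf-8")
            total += count
    return total
```

Calling `fix_tree(".", CORRECTIONS)` would process the current directory and everything below it. The whole-word match (`\b`) is a deliberate choice in this sketch: it keeps a short entry like "Cal" from also rewriting words such as "California". Since the files are overwritten in place, it is worth testing on a copy first.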

https://inleo.io/threads/view/mightpossibly/re-leothreads-33apr98ea

And the result?

It replaced a whopping 26k+ words in a single go. 26 thousand Hive terms! Let that sink in. Not to mention the 5+ terms it replaced in the second go, after I'd added 40 or so new terms. And the cool part is, every time I find a new misspelling, I can replace it in all the transcripts AND summaries at once, not just the single occurrence that I happen to stumble across.

Published as an open source application on GitHub

I decided to publish the whole thing on GitHub for the community to use, and I'll be updating the database file regularly as I come across new terms. Even if I'm the only one who currently uses it, I'm certain it will come in handy in a broader sense at some point. Here's the direct link to the GitHub repository: https://github.com/mp-hive/Hive-ASR-Dictionary

The Bigger Picture

If you've been following my blog lately, you know that the main point of all this is increasing the accessibility of audio-based Hive content and adding that data to the Hive Database in as meaningful a way as possible (if you're interested in learning more about the background of this project, you can read more here and here).

Replacing AI misinterpretations with high-precision "translations" to the proper Hive projects, Hive names and Hive tokens dramatically increases the quality of said content. That makes it possible to re-use it in other contexts, like creating datasets for training LLMs, generating summaries, etc.

I also want to underline once again that I'm not a coder, and that this post is also intended to serve as inspiration for LLM use-cases, in addition to the concrete project/program described in this post.


If you found this interesting, feel free to leave a comment, upvote or reblog.

Thank you for reading!


What is Hive?

To learn more about Hive, this article is a good place to start: What is Hive?. If you don't already own a Hive account, go here to get one.



Posted Using InLeo Alpha


That is so cool, using AI for stuff like that. That's what we need: use AI to help us out and be an assistant in our work. Great job!

@tokenizedsociety you are there! lol

Yea haha I'm honored

Thanks! Yeah, the possibilities are suddenly endless, even for us non-coders. It's really just about breaking down the application you want into really simple steps and then building your prompt from that.

Yesterday I made another application that produces a CSV file containing all the tax-related transactions I made on Hive in 2023, ready to be imported directly into the tax software I use. Such a time-saver! I'll be doing a post about that too (including the source code) in the near future.

So good to see projects taking shape!
Awesome! Good luck!

