[STEEMSQL] A public SQL database with all blockchain data – NEW Language detection

in #steemit7 years ago (edited)

image.png

SteemSQL database has been updated with a new set of features, including language detection.

It's has been a long time since I was able to release these new features. The main reasons are:

  1. The amount of data to manipulate is quite huge. SteemSQL database is now several gigabytes. I had to reprocess all existing posts and comments.
  2. Running service could not be interrupted as many processes regularly connect to SteemSQL to retrieve data.

Anyway, everything is now ready for you. Let's have a look at what's new.

1. New database schema

image.png

A new Comments table has been added

This table is populated with data that were previously stored in TxComments.

A new record is created in TxComments each time a post or comment is created or edited. The correlated data have to be updated each time a transaction related to a post is generated. This resulted in the field contents in TxComments being redundant and hard to manage.

The new design is much simpler for data retrieval.

2. Post indicator

A post boolean field has been added to the Comments table structure.

This field indicates whether the record is a post or a comment. It's more required to check the title field content and record filtering is now blazing fast.

3. Language detection

Language detection has been added to the Database Injector process.

Each time a post or comment is created, the body content is analyzed to determine which language(s) it contains.
The result is stored in JSON format in the field body_language in the table Comments.
As a post or comment can contain multiple languages, the result is an array.

Each object contains the following values:

  • Language code
  • Confidence score
  • Is reliable - true/false

If the language cannot be determined (ex: post containing pictures only), the array will be left empty

The result is something like this:

image.png

A post with several languages will have his body_language field set to something like

[{"language":"es","isReliable":true,"confidence":8.22},
 {"language":"en","isReliable":false,"confidence":5.12}]


The confidence value is related to how much text the post contains. The more text analyzed, the better the language analysis, the higher the confidence value. Confidence is not a ratio and can be higher than 100.

If the post contains words in different languages, isReliable will be set to true to identify the most probable language, even if its confidence value is lower.
If there is only one language and isReliable is set to false, this indicates confidence is too low.

Be aware that language detector works using probabilities and sometimes it is not accurate with very short texts. The same happens when different languages used in the post have similar words.

The language detector can also be tricked when the content of a post contains lots of "technical noise" like pictures, source code, edit tags, ...

I hope this will help non-English communities to better identify posts in their respective language.

Support

If you need help, have any comments or requests, please use SteemSQL channel on steemit.chat.

Thanks for reading!


I hope you enjoyed reading this post. As mentioned in my witness thread, I'm committed to helping non-English communities to develop.

Support me and my work as a witness by voting for me here!


You like this post, do not forget to upvote or follow me or resteem

Sort:  

I'm a Filipino and I think if I will be posting using my native language, only too few would understand it then since there only few of us here. In that case, majority of the steemians will not understand my post and probably will not vote for it. Not to mention the bots.

That was the purpose of my post about steemit multilingualism https://steemit.com/language/@arcange/how-multilingual-is-steemit.
Some non-english communities are working better than other.
But yes, English rules :/

Looking forward.

Pretty neat. What are you using for language detection? I was thinking about using somthing like that in a future project, but I mainly work with Node.js as of lately. I like how it has the confidence thing :)

I am a big fan of you :)

He he, love is in the air =)
Thanks!

pixresteemer_incognito_angel_mini.png
Bang, I did it again... I just rehived your post!
Week 55 of my contest just started...you can now check the winners of the previous week!
2

Wow, just saw this. Huge effort. I bow to you sir.
I knew there were reasons I voted you as witness :)

Thank you for your unfailing support @ksteem, much appreciated!