I.T. Spices The LINUX Way

Python In The Shell: The STEEMIT Ecosystem – Post #115

SCRAPING ALL BLOGS USING PYTHON – THE POST TITLE

Please refer to Post #110 for the complete Python script and the intro to this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way

In this post we will discuss how the script obtains the TITLE of a blog post using web scraping.

Lines 73 to 87 are the Python code directly involved in acquiring the blog’s TITLE:

73
74          #POST TITLE
75          ttt = soup.find('h1', {'class':'entry-title'})
76          if ttt != []:
77              try:
78                  title = ttt.text
79                  print('\nTITLE:')
80                  flogs.write('\nTITLE:')
81                  print('   ' + str(title))
82                  flogs.write('\n   ' + str(title))
83              except:
84                  print('\nTITLE:')
85                  flogs.write('\nTITLE:')
86                  print('   No TITLE found.......')
87                  flogs.write('\n   No TITLE found.......')



Line 75 uses the BeautifulSoup module (through the soup object) to find the first h1 element whose class is entry-title. BeautifulSoup parses the raw HTML into objects, which are much easier for the programmer to manipulate than the HTML text itself.
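To see what line 75 returns, here is a minimal sketch. The HTML fragment and the names found and missing are made up for illustration and are not part of the original script; it also assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded blog page.
html = """
<html><body>
  <h1 class="entry-title">My First Post</h1>
  <p>Body text.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None when nothing matches.
found = soup.find("h1", {"class": "entry-title"})
missing = soup.find("h1", {"class": "no-such-class"})

print(found)    # the whole h1 element, tags included
print(missing)  # None
```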

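Line 76 is an IF statement intended as a safety check, so that further processing happens only if a TITLE was found. Guarding the processing this way is meant to minimize errors. Note, however, that find() returns None (not an empty list) when nothing matches, so the comparison against [] is always true; in practice it is the TRY statement on the next line that catches a missing TITLE.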
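The two kinds of check can be compared in plain Python. This is only a sketch: ttt is set to None by hand here, to mimic what find() gives back when no element matches.

```python
ttt = None  # what find() returns when no element matches

# Comparing against an empty list does not detect the missing element.
check_list = (ttt != [])

# Comparing against None does.
check_none = (ttt is not None)

print(check_list)  # True
print(check_none)  # False
```

If the IF on line 76 were ever changed to the None check, the "No TITLE found" message would need to move to an else branch, since the EXCEPT branch would no longer be reached for a missing TITLE.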

Line 77 is a TRY statement which further fine-tunes the IF statement. A TRY means that if the code inside it raises an error, the script jumps to the matching EXCEPT block and then moves on to the next steps, with no need to exit prematurely.

Line 78 extracts just the text portion of the h1 element, which is the blog TITLE. This is why the technique is called scraping: in this example alone we can see that the HTML tags were “scraped” away, leaving only the text of the blog TITLE.
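The difference between the whole element and its text can be sketched as follows. The markup is hypothetical, and the get_text() call is an optional extra that the original script does not use; it is shown only because real titles often carry stray whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with nested tags and extra whitespace.
html = '<h1 class="entry-title">  Scraping <em>All</em> Blogs </h1>'
soup = BeautifulSoup(html, "html.parser")
ttt = soup.find("h1", {"class": "entry-title"})

raw_html = str(ttt)                      # full element, tags included
title = ttt.text                         # text only, whitespace preserved
clean = ttt.get_text(" ", strip=True)    # text pieces stripped, joined by a space

print(raw_html)
print(repr(title))
print(repr(clean))
```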

Lines 79 to 82 print the result to the screen and write it to the log file.
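Since every message goes to both the screen and the log, the paired print and write calls could be folded into one helper. The function name report is hypothetical and not in the original script, and io.StringIO stands in for the real flogs file so the sketch runs anywhere.

```python
import io


def report(message, logfile):
    """Print message to the screen and write it to logfile on a new line."""
    print(message)
    logfile.write("\n" + message)


# io.StringIO is an in-memory stand-in for the real log file.
flogs = io.StringIO()
report("TITLE:", flogs)
report("   My First Post", flogs)
```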

Line 83, an EXCEPT statement, takes effect only if the TRY block raised an error. Here that most likely means no blog TITLE was found, so lines 84 to 87 print a corresponding message and write it to the log file.
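The TRY/EXCEPT pair can be tried out without any web page at all. In this sketch the function title_or_fallback and the FakeTag class are made up for illustration; it also narrows the bare except of the original to AttributeError, which is the error Python raises when .text is requested from None.

```python
def title_or_fallback(tag):
    """Return the tag's text, or a fallback message when it has no .text."""
    try:
        return str(tag.text)
    except AttributeError:
        # None has no .text attribute, so a missing title lands here.
        return "No TITLE found......."


class FakeTag:
    """Stand-in for a BeautifulSoup element, for illustration only."""
    text = "My First Post"


print(title_or_fallback(FakeTag()))
print(title_or_fallback(None))
```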


“Only God Deserves A Title From His All Equal Subjects…….”