Extracting text between html tags (removing html tags)


In ORACLE 10G, we can simply use REGEXP_REPLACE to extract all data between HTML TAGS

Examples
--------------------------------------------------------------------------------------------------------------
WITH T AS
(
SELECT
    '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P.</P><H1>THIS IS H1.</H1></BODY></HTML>' STR
FROM    
    DUAL
)
SELECT REGEXP_REPLACE(STR,'<.*?>') FROM T


WITH T AS
(
SELECT
    '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P.</P><H1>THIS IS H1.</H1></BODY></HTML>' STR
FROM    
    DUAL
)
SELECT REGEXP_REPLACE(STR, '<[^>]+>') FROM T



ORACLE 11G has provided us a simple way to extract text between a specified HTML tag.

Example
--------------------------------------------------------------------------------------------------------------
WITH T AS
(
SELECT
    '<HTML><HEAD>THIS IS HEAD.</HEAD><BODY>THIS IS BODY.<P>THIS IS P.</P><H1>THIS IS H1.</H1></BODY></HTML>' STR
FROM    
    DUAL
)
SELECT XMLTYPE(STR).EXTRACT('/HTML/BODY/P/text()').GETSTRINGVAL() FROM T



Related Links:


Oracle: Extract Initials of Names
http://nimishgarg.blogspot.com/2011/01/oracle-extract-initials-of-name.html

Oracle: Extract Numbers from String (Ex: Pin from Address)
http://nimishgarg.blogspot.com/2010/04/oracle-sql-extract-numbers-from-string.html


17 comments:

  1. The 11G example appears to work in 10g (10.2.0.5).

    ReplyDelete
  2. Great blog.The code is good and easy to understandable.It help to extract between HTML tags.Thanks for wonderful posting.

    ReplyDelete
  3. Hi there ,

    Its pretty nice blog in there . However I wanted to ask if i can use the same on a column inside a table say Projects_Projects so that the whole column is stripped off html tags without updating the data in the table .The main purpose being the data needs to be displayed in the report generated from the database

    ReplyDelete
  4. how abaut if i want select between ( and ) tags. i try replace < with ( and > with ) it did not work

    ReplyDelete
    Replies
    1. i did not get your issue. Please provide input string and output data

      Delete
  5. Hi Nimish, I have similar issue, I'd like to replace pointy brackets with square brackets.

    FROM: text text text text text
    TO: text text text [BACKSPACE] text text

    Thanks,
    Tom

    ReplyDelete
    Replies
    1. use translate
      translate('<Nimish>Hello<Garg>','<>','[]')

      Delete
  6. thanks, this is much appreciated!

    ReplyDelete
  7. Hi..can you help me with same query in SQL.

    ReplyDelete
  8. Hi Nimish,

    can you pls help with a query to extract following tags th attributes or table border=2 from the DB Column

    ReplyDelete
  9. nice solution. thx!

    ReplyDelete
  10. bookmarked!!, I like your web site!

    ReplyDelete
  11. Good article, please check my blog for more fusion hcm topics in detail
    https://www.oraclefusionhcm.com

    ReplyDelete
  12. I really loved reading your blog. It was very well authored and easy to undertand. Unlike additional blogs I have read which are really not tht good. I also found your posts very interesting. In fact after reading, I had to go show it to my friend and he ejoyed it as well! html email link

    ReplyDelete