3

I have a list of tweet strings such as the one below (the real data set is much larger). I want to extract every user name that follows @ and every URL in the tweet text, for example galaxy5univ and the URLs.

   tweet_text = ['@galaxy5univ I like you',
    "RT @BestOfGalaxies: Let's sit under the stars ...",
    '@jonghyun__bot .........((thanks)',
    'RT @yosizo: thanks.ddddd <https://yahoo.com>',
    'RT @LDH_3_yui: #fam, ccccc https://msn.news.com']

My code:

import re
pu = re.compile(r'http\S+')   # URL: "http" followed by non-whitespace characters
pn = re.compile(r'@(\S+)')    # name: non-whitespace characters after "@"
for row in tweet_text:
   text = pu.findall(row)
   name = pn.findall(row)
   print("url: ", text)
   print("name: ", name)

After testing this code on a large amount of Twitter data, I found that both patterns, for the URL and for the name, give wrong results (they are right for only some of the tweets). Are there any documents or links about extracting names and URLs from tweet text in a large data set?

If you have any advice about extracting names and URLs from Twitter data, please tell me. Thanks!

  • 1
    pn = re.compile(r'@([a-zA-Z0-9_]+)')
    – mic4ael
    Jun 14, 2016 at 8:58
  • Thanks for your comment. There is a large number of names in the Twitter data, and sometimes a name includes special characters such as # % ^, not just a-zA-Z0-9_. How can I handle that case?
    – tktktk0711
    Jun 14, 2016 at 8:59
  • 1
    just add them to the list of characters inside the square brackets, but remember that some of the characters need to be properly escaped
    – mic4ael
    Jun 14, 2016 at 9:00
  • Thanks for your comments, but then I would have to add every possible character inside the square brackets. What if I do not know which characters follow @? I hope there is an effective way to handle it (dropping the ":" at the end of the name).
    – tktktk0711
    Jun 14, 2016 at 9:09
  • You mean get all non-whitespace chars after @ but not :? You can use r'@([^\s:]+)'
    Jun 14, 2016 at 9:13

2 Answers

5

Note that your pn = re.compile(r'@(\S+)') regex will capture any 1+ non-whitespace characters after @.

To avoid matching :, convert the shorthand \S class to its negated character class equivalent, [^\s], and add : to it:

pn = re.compile(r'@([^\s:]+)')

Now, it will stop capturing non-whitespace symbols before the first :. See the regex demo.

If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'@(\S+):').
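To make the difference between the two patterns concrete, here is a small sketch. The sample strings and variable names are made up for illustration and are not taken from real Twitter data.

```python
import re

stop_at_first_colon = re.compile(r'@([^\s:]+)')
up_to_last_colon = re.compile(r'@(\S+):')

sample = "RT @a:b: hello"  # made-up text with a colon inside the "name"

first = stop_at_first_colon.findall(sample)
last = up_to_last_colon.findall(sample)
print(first)  # ['a']
print(last)   # ['a:b']

# The second pattern requires a colon, so a plain mention is not matched.
no_colon = up_to_last_colon.findall("@galaxy5univ I like you")
print(no_colon)  # []
```

Note that the second pattern silently skips mentions that are not followed by a colon, so it probably only suits text where every mention has the "RT @name:" shape.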

As for a URL matching regex, there are many on the Web, just choose the one that works best for you.

Here is some example code:

import re
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd <https://yahoo.com>\nRT @LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
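For a list of tweets, as in the question, the same idea can be applied row by row. The following is only a sketch: the `extract` helper and the simplified URL pattern are assumptions made for illustration, and the URL pattern is deliberately looser than a full URL grammar (it just stops at whitespace or an angle bracket).

```python
import re

MENTION = re.compile(r'@([^\s:]+)')
# Simplified URL pattern (an assumption, not a full URL grammar):
# http or https scheme, then anything up to whitespace or an angle bracket.
URL = re.compile(r'https?://[^\s<>]+')

def extract(tweet):
    """Return (mentions, urls) found in one tweet string."""
    return MENTION.findall(tweet), URL.findall(tweet)

tweets = [
    "RT @yosizo: thanks.ddddd <https://yahoo.com>",
    "RT @LDH_3_yui: #fam, ccccc https://msn.news.com",
]
results = [extract(t) for t in tweets]
for names, urls in results:
    print("name:", names, "url:", urls)
```

Compiling the patterns once, outside the loop, keeps the per-tweet work small when the list is large.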
  • I have now found that both of my patterns, for the URL and for the name, are wrong. Do you have any documents or links about extracting names and URLs from tweet text?
    – tktktk0711
    Jun 14, 2016 at 10:23
  • What is wrong with @([^\s:]+)? A regex for URLs can be found anywhere. Here is a good resource. And here is an SO thread on matching URLs in a larger text. See this IDEONE demo.
    Jun 14, 2016 at 10:25
  • Thanks for your enthusiasm. Here are some example names: @t:* d-8:. As you can see, names on Twitter come in many different forms.
    – tktktk0711
    Jun 14, 2016 at 10:35
  • 1
    Excuse me, I have never seen user names with spaces. That means you need @(.*):, right? If not, please explain the pattern these user names fall into. If there is no pattern, it is not possible to match them. Also, here is a link to a mentions regex used in a Twitter JS library (the pattern is compatible with Python).
    Jun 14, 2016 at 10:36
  • Many thanks, @Wiktor Stribiżew, for your help. I will read the document you mentioned. You are a kind guy.
    – tktktk0711
    Jun 14, 2016 at 10:42
1

If the usernames don't contain special characters, you can use:

@([\w]+)

See Live demo
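One caveat with a bare @(\w+) is that it also fires on the domain part of an e-mail address. A hedged sketch of one way around that is below: the lookbehind is an addition not present in the pattern above, and the sample text is made up. In Python 3, \w matches Unicode word characters by default for str patterns, so re.ASCII is used here to limit it to [a-zA-Z0-9_].

```python
import re

# Naive pattern: also fires on the domain part of an e-mail address.
naive = re.compile(r'@(\w+)')
# The lookbehind rejects an '@' that directly follows a word character;
# re.ASCII limits \w to [a-zA-Z0-9_].
strict = re.compile(r'(?<!\w)@(\w+)', re.ASCII)

text = "mail bob@example.com or ping @alice_1"  # made-up sample
naive_hits = naive.findall(text)
strict_hits = strict.findall(text)
print(naive_hits)   # ['example', 'alice_1']
print(strict_hits)  # ['alice_1']
```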

  • Thanks for your comments. I found that both of my patterns, for extracting the name after @ and the URL in tweet text, are wrong. Names and URLs come in many different forms. If you have any documents or links about this, please tell me!
    – tktktk0711
    Jun 14, 2016 at 10:40
