How to retrieve Data

Compasdot · Post by **Compasdot** » 19 Aug 2017 01:04

Can anyone please show me how to retrieve a specific piece of data from an HTML web page? I searched for something close to my case, but all I found is about XML and JSON.
Is there a trick to convert HTML to XML, or something similar? How do I use the evaluateXPathAsString function? Please help!

Desmanto · Post by **Desmanto** » 19 Aug 2017 06:32

Use the action HTTP request and input the URL. The result is in the variable {response}. Parse the response.

HTML is similar to well-formatted XML, but more flexible. There is no need to convert it; you can probably use XPath to parse it as well, or at least regex. You need to show us the URL you want to parse first. If you can't, find a link that shows a similar result.

Take a look at the xml example here. http://automagic4android.com/forum/view ... f=5&t=6866

Compasdot · Post by **Compasdot** » 19 Aug 2017 13:54

Hi Desmanto, I want to retrieve some values from this link: https://www.wunderground.com/weather/dz/biskra

This is exactly the data that I need.
Thanks in advance, Desmanto.

Desmanto · Post by **Desmanto** » 19 Aug 2017 18:23

Basically you want to get the city, temperature and weather forecast, right?

That HTML cannot be parsed using XPath, but it can still be done using regex. There is an alternative XML version referenced in the HTML:

<link rel="canonical" href="https://www.wunderground.com/weather/dz/biskra" />
<link rel="alternate" type="application/rss+xml" title="Biskra, Algeria RSS" href="https://rss.wunderground.com/auto/rss_f ... /60525.xml" />
<link rel="apple-touch-icon" href="//icons.wxug.com/favicon.png"/>

But I don't know whether that link will change, so I will stick with your original link.

You have the HTML already. I assume you know how to use the action HTTP request on that link and save the result in {response}. Now we are going to process the response to extract those 3 values.
Add a new action: Script and put this in it:

Code:

match = findAll(response, '(?:<meta property="og:title" content=\\")(.*)(?:\\" \\/\>)');
content = substring(match[0],35,length(match[0])-4);
content = split(content," \\| ");
city = content[0];
temp = replace(content[1],"&deg;"," \u00b0C");
forecast = content[2];

The first line matches the line that contains the data.
{content} strips out the tag, leaving exactly the 3 values we need.
We split it on the | symbol to get 3 elements.
Then we assign each element to the corresponding variable. For the temperature, we replace the &deg; entity with a space, the Unicode escape \u00b0 (the degree sign) and C, so it becomes °C.

So the result will be stored in {city}, {temp} and {forecast}. Use them as you want.
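For anyone who wants to try the same extraction outside Automagic, here is a minimal Python sketch of the steps above. It is only an illustration, not Automagic code: the names `parse_og_title` and `SAMPLE` are made up for this example, the sample line is the one quoted in this thread (the live page may look different), and the Celsius unit is simply carried over from the script above.

```python
import html
import re

# Sample line as quoted in this thread; the live page may differ.
SAMPLE = '<meta property="og:title" content="Biskra, Algeria | 41&deg; | Clear" />'

# One capturing group around the value of the content attribute.
OG_TITLE = re.compile(r'<meta property="og:title" content="(.*?)" />')


def parse_og_title(page):
    """Return (city, temp, forecast) from the og:title meta tag, or None."""
    found = OG_TITLE.search(page)
    if found is None:
        return None
    # Split "city | temp | forecast" and trim the spaces around each part.
    parts = [part.strip() for part in found.group(1).split("|")]
    if len(parts) != 3:
        return None
    city, temp, forecast = parts
    # Decode &deg; to the degree sign, then append the unit (assumed Celsius).
    temp = html.unescape(temp).replace("\u00b0", " \u00b0C")
    return city, temp, forecast


print(parse_og_title(SAMPLE))
```

Unlike the script above, this version reads the value straight from the capturing group instead of cutting the tag off with substring().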

================
@Martin: I am a little confused by Automagic's regex parsing, and I have always wanted to ask about this. When I use a regex with a (?:abc) pattern, it is supposed to match 'abc' but not capture anything inside the brackets.
Example:

If I use this in a regex tester
(?:<meta property="og:title" content=\\")(.*)(?:\\" \\/\>)
it shows me that it captures only the (.*), like this

<meta property="og:image" content="https://icons.wxug.com/i/c/k2/clear.png" />
<meta property="og:title" content="Biskra, Algeria | 41° | Clear" />
<meta name="apple-itunes-app" content="app-id=486154808, affiliate-data=at=1010lrYB&ct=website_wu" />

Biskra, Algeria | 41° | Clear is underlined, so I know I am using the correct regex. Testing at RegExr also shows the capture $1 properly; <meta .... was matched but not captured.
But when using the function findAll(), it matches the whole line. The result for the match above is a list with a single element containing the whole line.
match[0] = <meta property="og:title" content="Biskra, Algeria | 41° | Clear" />

Even if I use grouping like this (without the non-capturing ?:)
(<meta property="og:title" content=\\")(.*)(\\" \\/\>)
I expect it to have 3 capturing groups.
match[0] = <meta property="og:title" content="
match[1] = Biskra, Algeria | 41° | Clear
match[2] = " />
But it turns out to be the same: only a single match[0] containing the whole matching line. There is no match[1] or match[2].

Is it supposed to be like this in Automagic, or is it a kind of bug? It is similar to the XPath implementation, where it stops at the first match.
Sorry if I am hijacking the thread, but since regex is used here, I think it is better to ask here directly.
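To make the expectation concrete, here is a small Python sketch of the difference between the whole match and the capturing groups for the two patterns above. Python's `re` module is used only as a stand-in; Automagic's regex engine and functions may differ in details, and the variable names are made up for this example.

```python
import re

LINE = '<meta property="og:title" content="Biskra, Algeria | 41° | Clear" />'

# Same idea as the two patterns discussed above, without the extra escaping.
NON_CAPTURING = r'(?:<meta property="og:title" content=")(.*)(?:" />)'
CAPTURING = r'(<meta property="og:title" content=")(.*)(" />)'

m1 = re.search(NON_CAPTURING, LINE)
m2 = re.search(CAPTURING, LINE)

whole_match = m1.group(0)    # the entire matched text, tag included
only_group = m1.groups()     # just the one regular capturing group
three_groups = m2.groups()   # opening tag, content, closing part

print(whole_match)
print(only_group)
print(three_groups)
```

In both cases group 0 is the whole line; the non-capturing groups only change which numbered groups exist, not what the pattern as a whole matches.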

Post by **Martin** » 19 Aug 2017 19:57

Hi Desmanto

findAll in Automagic returns all items that match the entire regex even if there are some non-capturing groups within the regex. Would you expect to get only the part within the first regular capturing group in the findAll? Maybe I could add another function that returns a list containing a list with all matched groups. Would this be more helpful?

Maybe I'm also misunderstanding your request. When I change the escaping to this:
(?:<meta property="og:title" content=")(.*)(?:" \/>)
and test input to this:
<meta property="og:title" content="Biskra, Algeria | 41° | Clear" />
then the matches function in the regex tester of Automagic returns both groups[0] with the entire input and groups[1] with the content of the first regular capturing group (Biskra...Clear).
findAll still returns the entire input, since it returns the content that the entire pattern matches.

Regards,
Martin

Desmanto · Post by **Desmanto** » 20 Aug 2017 05:31

I tested it on the dev version right after I posted, but I was already very sleepy.

It still gives the same result. I also tested that single line, and yes, it shows only the capture group.

From what I tested, matches() must match the whole input to work. So if the text is multiline, we have to specify the "\n" and the syntax of the following lines as well. In other words, matches() works best when there is only a single line, and only matches() supports group capturing, while findAll() always returns the whole text matched by the regex (usually a single line), ignoring any capturing or non-capturing groups (as you have stated).

Which means that to do exactly what I want above, I have to capture the whole line first using findAll(). The result is a single line, so it can now be processed by matches(). Then I use matches() with the same regex again and access the only capturing group in capture[1] (since capture[0] is the whole match). Capturing twice.

Code:

capture = newList();
match = findAll(response, '(?:<meta property="og:title" content=\\")(.*)(?:\\" \\/\>)');
matches(match[0], '(?:<meta property="og:title" content=\\")(.*)(?:\\" \\/\>)', capture);
content = split(capture[1]," \\| ");
city = content[0];
temp = replace(content[1],"&deg;"," \u00b0C");
forecast = content[2];
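The two-pass idea can be sketched in Python as follows. This is only an analogy under assumptions: `fullmatch` plays the role of matches() (the pattern must cover the entire input) and `search` plays the role of picking the first findAll() result. The two-line `PAGE` is a shortened copy of the lines quoted earlier in this thread.

```python
import re

# Two of the meta lines quoted earlier in the thread.
PAGE = "\n".join([
    '<meta property="og:image" content="https://icons.wxug.com/i/c/k2/clear.png" />',
    '<meta property="og:title" content="Biskra, Algeria | 41° | Clear" />',
])
PATTERN = re.compile(r'(?:<meta property="og:title" content=")(.*)(?:" />)')

# A full match against the multiline page fails: the pattern does not
# describe the og:image line or the line break.
whole_page = PATTERN.fullmatch(PAGE)

# First pass: find the single matching line inside the page.
line = PATTERN.search(PAGE).group(0)

# Second pass: the same pattern now covers the entire input, so the
# full match succeeds and the capturing group is available.
second_pass = PATTERN.fullmatch(line)
content = second_pass.group(1)

print(whole_page)
print(content)
```

In Python the second pass is not really needed, because `search` already exposes the groups; it is kept here only to mirror the findAll() + matches() combination.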

I only really started to learn regex when I first used Automagic, so I expected it to work just like regexr.com, since the regex documentation mentions non-capturing groups (which currently only work in matches()). There are many cases where we want to match the whole line (to make sure it matches the correct line) but capture only a certain portion of the match (such as in this case). The same case, replicated at regexr.com:

Attachment: 2 non 1 capturing group.png

So I actually expect a single function using that single regex to capture the
Biskra, Algeria | 41° | Clear
It matches the <meta ..., but doesn't capture it in the result. But since findAll() ignores capturing groups, yes, we need a new function for regex parsing to behave this way.

Here is another case. Sometimes we want to match and immediately split the whole match into groups. In this case, I remove the ?:, so I will have 1 match but 3 capturing groups. The new function would then return a list with 3 elements, one per group.

Attachment: 3 capturing group.png

However, I don't know yet how multiline capturing works. If there are matches on multiple lines, using the logic above, there would be 6 elements. However, in regexr.com, $1 shows all elements from capture group 1 in all lines.

Post by **Martin** » 21 Aug 2017 19:52

You are right, for the matches function the regular expression has to match the entire input to return the groups, otherwise it does not match and returns false to indicate this.

I will extend the function findAll with an additional boolean parameter returnGroups that will return a list of the matched groups.

An input like this:
1a2b3c
5x7y8z

and a regex like this:
(\d).(\d).(\d).

...will return a list that contains the following lists:
1a2b3c, 1, 2, 3
5x7y8z, 5, 7, 8

or in case of some non-capturing groups:
(?:\d).(\d).(\d).

...will return a list that contains the following lists (item 0 always contains the match of the entire expression):
1a2b3c, 2, 3
5x7y8z, 7, 8

I think this will help to get most of the things done. You could also use the current version of findAll, loop over the resulting lines and then use matches to find the groups in each line but I think the new findAll will perform better and it will simplify the scripts a bit. You still have to loop over the result and concatenate the output of all $1 groups if you'd want to create the same output like regexr where the output is built by looping over the matches automatically.
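The described list-of-lists result can be sketched in Python like this. The helper name `find_all_groups` is hypothetical and only mimics the behaviour described above; it is not Automagic's actual function or signature.

```python
import re


def find_all_groups(text, pattern):
    """Return one list per match: item 0 is the whole match, then each capturing group."""
    return [[m.group(0), *m.groups()] for m in re.finditer(pattern, text)]


TEXT = "1a2b3c\n5x7y8z"

all_capturing = find_all_groups(TEXT, r"(\d).(\d).(\d).")
some_non_capturing = find_all_groups(TEXT, r"(?:\d).(\d).(\d).")

# Collecting group 1 of every match, similar to what regexr shows for $1.
first_groups = [row[1] for row in all_capturing]

print(all_capturing)
print(some_non_capturing)
print(first_groups)
```

With a result shaped like this, a single loop over the outer list is enough to collect any group from every match.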

Regards,
Martin

Desmanto · Post by **Desmanto** » 22 Aug 2017 14:36

Wow, that's nice. Exactly what I want.
Will this find its way into 1.34 or the next EAP version? I can't wait to test it out.
My other script can be simplified as well. It will make future text/html/xml parsing easier to work on (there are 2 awaiting).

Item 0 for the whole match is OK; it is useful for back reference. As long as the result stores all capturing groups, then we can loop over it only once to collect them.
If not, using the current functions matches + findAll, we need to loop twice.

But most of the time, what we need is just a single-line capturing group, as in the case in this thread.

Thanks for the response,
Desmanto

Post by **Martin** » 23 Aug 2017 19:40

Sure, this new function will be in the next build of version 1.34 since such scripting functions are quite simple to add and usually don't have unexpected side effects.

Regards,
Martin

Automagic Forum
